Run a Model (CLI)
The base command is:
vaq run <model_id>
Docker-first example with the published image:
docker run --rm \
--gpus all \
-e VAQ_HF_CACHE_HOST_PATH=/absolute/path/to/huggingface/cache \
-v /absolute/path/to/huggingface/cache:/root/.cache/huggingface \
ghcr.io/xschahl/vaquila:latest \
run Qwen/Qwen3-0.6B --gpu 0 --port 8000
Minimal examples
GPU launch:
vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000
CPU-only launch:
vaq run openai-community/gpt2 --device cpu --port 8000
Manual mode without estimation or auto-tuning:
vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000 --gpu-utilization 0.72 --cpu-utilization 0.60
Main arguments
- `model_id`: Hugging Face model id to launch, for example `Qwen/Qwen3-0.6B`.
- `--port` or `-p`: host port exposed by the vLLM API. The launch is blocked if the port is already in use.
- `--device gpu|cpu`: selects the compute backend. Use `gpu` for NVIDIA acceleration, `cpu` for CPU-only mode.
- `--gpu`: NVIDIA GPU index used in GPU mode.
Runtime sizing arguments
- `--max-num-seqs`: maximum number of parallel requests handled by the runtime.
- `--max-model-len`: context length in tokens for each request.
- `--buffer-gb`: safety VRAM buffer, in gigabytes, reserved for the OS and other processes.
- `--startup-timeout`: timeout in seconds while vAquila waits for vLLM readiness.
These values directly affect memory usage. Larger values generally require more VRAM or RAM.
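For example, a launch sized for eight concurrent requests with a 4K-token context might look like the following. The values are illustrative starting points, not tuned recommendations:
vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000 --max-num-seqs 8 --max-model-len 4096 --buffer-gb 2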
Manual override arguments
- `--gpu-utilization`: manual GPU memory utilization ratio in `(0, 1]`.
- `--cpu-utilization`: manual CPU limit ratio in `(0, 1]`.
When one of these manual overrides is set, automatic estimation and optimization are bypassed for that launch.
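Setting a single override is enough to trigger manual mode. For example, to pin only the GPU memory budget and leave everything else at its defaults:
vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000 --gpu-utilization 0.85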
Model behavior arguments
- `--quantization`: runtime quantization strategy such as `auto`, `none`, `fp8`, `awq`, or `gptq`.
- `--kv-cache-dtype`: KV cache dtype such as `auto`, `bfloat16`, or `fp8`.
- `--tool-call-parser`: vLLM tool call parser.
- `--reasoning-parser`: vLLM reasoning parser.
- `--enable-thinking` / `--disable-thinking`: enables or disables thinking mode.
- `--allow-long-context-override` / `--no-allow-long-context-override`: allows a context length above the model limit when supported. This is a risky, advanced override.
- `--trust-remote-code` / `--no-trust-remote-code`: allows model-specific custom code from the Hugging Face repository. Keep this disabled unless you trust the model source.
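As an illustration, the launch below combines several of these options. All values come from the lists above, but whether they are appropriate depends on the model and hardware:
vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000 --quantization auto --kv-cache-dtype fp8 --enable-thinking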
How to choose values
- Start with the defaults if you do not already know the model requirements.
- Increase `--max-num-seqs` only when you need more parallel requests.
- Increase `--max-model-len` only when you need more context.
- Use manual utilization ratios only when you want to force a specific runtime budget.
- Prefer `--device cpu` only when a GPU is unavailable or not desired.
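Once the launch reports ready, you can sanity-check the server. vLLM exposes an OpenAI-compatible API, so a request like the one below should list the loaded model, assuming port 8000 is reachable from where you run it:
curl http://localhost:8000/v1/models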