Run a Model (CLI)

The base command is:

vaq run <model_id>

Docker-first example with the published image:

docker run --rm \
  --gpus all \
  -e VAQ_HF_CACHE_HOST_PATH=/absolute/path/to/huggingface/cache \
  -v /absolute/path/to/huggingface/cache:/root/.cache/huggingface \
  ghcr.io/xschahl/vaquila:latest \
  run Qwen/Qwen3-0.6B --gpu 0 --port 8000

Minimal examples

GPU launch:

vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000

CPU-only launch:

vaq run openai-community/gpt2 --device cpu --port 8000

Manual mode without estimation or auto-tuning:

vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000 --gpu-utilization 0.72 --cpu-utilization 0.60

Main arguments

  • model_id: Hugging Face model id to launch, for example Qwen/Qwen3-0.6B.
  • --port or -p: host port exposed by the vLLM API. The launch is blocked if the port is already in use.
  • --device gpu|cpu: selects the compute backend. Use gpu for NVIDIA acceleration, cpu for CPU-only mode.
  • --gpu: NVIDIA GPU index used in GPU mode.
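
For example, an illustrative launch that sets the backend explicitly, selects the second GPU, and uses the short port flag (the GPU index and port here are placeholders, not recommendations):

vaq run Qwen/Qwen3-0.6B --device gpu --gpu 1 -p 8001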

Runtime sizing arguments

  • --max-num-seqs: maximum number of parallel requests handled by the runtime.
  • --max-model-len: context length in tokens for each request.
  • --buffer-gb: safety VRAM buffer reserved for the OS and other processes.
  • --startup-timeout: maximum time in seconds that vAquila waits for vLLM to become ready.

These values directly affect memory usage. Larger values generally require more VRAM or RAM.
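
As an illustration, a launch that limits concurrency and context while reserving headroom (the numbers are placeholders; size them to your model and hardware):

vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000 --max-num-seqs 8 --max-model-len 8192 --buffer-gb 2 --startup-timeout 600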

Manual override arguments

  • --gpu-utilization: manual GPU memory utilization ratio in (0, 1].
  • --cpu-utilization: manual CPU limit ratio in (0, 1].

When one of these manual overrides is set, automatic estimation and optimization are bypassed for that launch.
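
For instance, forcing a fixed GPU memory budget for a single launch (0.85 is an arbitrary example value; estimation and auto-tuning are skipped):

vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000 --gpu-utilization 0.85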

Model behavior arguments

  • --quantization: runtime quantization strategy such as auto, none, fp8, awq, or gptq.
  • --kv-cache-dtype: KV cache dtype such as auto, bfloat16, or fp8.
  • --tool-call-parser: vLLM tool call parser.
  • --reasoning-parser: vLLM reasoning parser.
  • --enable-thinking or --disable-thinking: enables or disables thinking mode.
  • --allow-long-context-override or --no-allow-long-context-override: allows a context length above the model's limit when supported. This is a risky, advanced override.
  • --trust-remote-code or --no-trust-remote-code: allows model-specific custom code from the Hugging Face repository. Keep this disabled unless you trust the model source.
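
An illustrative combination of these flags, assuming the model supports thinking mode and that fp8 quantization and an fp8 KV cache are compatible with your model and hardware:

vaq run Qwen/Qwen3-0.6B --gpu 0 --port 8000 --quantization fp8 --kv-cache-dtype fp8 --enable-thinking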

How to choose values

  • Start with the defaults if you do not already know the model requirements.
  • Increase --max-num-seqs only when you need more parallel requests.
  • Increase --max-model-len only when you need more context.
  • Use manual utilization ratios only when you want to force a specific runtime budget.
  • Prefer --device cpu only when a GPU is unavailable or not desired.