spark-control/image/models.yaml
Grant 8ac455f5f5 v0.8.0:3 - add --max-num-batched-tokens=16384 to vision models (gemma4, qwen3-vl)
After the recent eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets:

  ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is
  larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.

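A single image input can take up to 2496 tokens, which exceeds vLLM's default --max-num-batched-tokens of 2048. With chunked multimodal input disabled, one item has to fit whole inside a single batch, so engine init fails. Same class of bug as the Qwen3.6 Mamba block-size assertion we fixed in 0.6.0:1, now surfacing on different models.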
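The arithmetic behind the error can be checked in a few lines. This is only an illustrative sketch: the three numbers come from the error message and the fix in this commit, while `fits_in_one_batch` is a hypothetical helper and not vLLM's own validation code.

```python
DEFAULT_BUDGET = 2048   # vLLM default quoted in the error message
MM_ITEM_TOKENS = 2496   # max_tokens_per_mm_item from the error message
NEW_BUDGET = 16384      # value baked into the catalog by this commit


def fits_in_one_batch(item_tokens: int, batch_budget: int) -> bool:
    """Hypothetical helper: with chunking disabled, one item must fit whole."""
    return item_tokens <= batch_budget


shortfall = MM_ITEM_TOKENS - DEFAULT_BUDGET   # tokens missing under the default
whole_items = NEW_BUDGET // MM_ITEM_TOKENS    # full-size images per batch after the fix

print(fits_in_one_batch(MM_ITEM_TOKENS, DEFAULT_BUDGET))  # False
print(fits_in_one_batch(MM_ITEM_TOKENS, NEW_BUDGET))      # True
print(shortfall, whole_items)                             # 448 6
```

The default misses by 448 tokens, while 16384 leaves room for six full-size image items in one batch.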

Fix: bake --max-num-batched-tokens=16384 into every multimodal model entry. Now applied to:
  - qwen36 (already had it for the Mamba constraint; works for multimodal too since Qwen3.6 has vision)
  - gemma4 (crashed today on engine init)
  - qwen3-vl (would crash with the same error if anyone tried it)

The pre-flight Test button validates argparse, but the 2048 < 2496 check runs at engine init, so Test does not catch it; only an actual model load does. v0.7's Test catches the *syntax* of a bad argument but not its *semantics*, so runtime errors like this still surface only on a real swap. This is a known limitation, documented in the v0.7 release notes.
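One way to narrow that gap would be a catalog-level lint that runs before any swap. The sketch below is an assumption, not part of spark-control: `batched_token_budget` and `find_budget_problems` are hypothetical names, the catalog is passed in as a plain dict shaped like models.yaml, and the 2496 threshold is simply the value from the error above (other models may need a different one). Note that the catalog tags qwen36 only as `reasoning`, so a capability-driven check like this one would skip it.

```python
FLAG = "--max-num-batched-tokens"
DEFAULT_BUDGET = 2048   # vLLM default quoted in the error message
MM_ITEM_TOKENS = 2496   # per-image token count from the error message


def batched_token_budget(vllm_args: list) -> int:
    """Return the budget set via --max-num-batched-tokens=N, else the default."""
    for arg in vllm_args:
        name, sep, value = arg.partition("=")
        if name == FLAG and sep:
            return int(value)
    return DEFAULT_BUDGET


def find_budget_problems(models: dict, item_tokens: int = MM_ITEM_TOKENS) -> list:
    """List vision-capable entries whose budget cannot hold one image."""
    problems = []
    for name, entry in models.items():
        if "vision" not in entry.get("capabilities", []):
            continue  # text-only entries are not subject to this check
        budget = batched_token_budget(entry.get("vllm_args", []))
        if budget < item_tokens:
            problems.append(f"{name}: budget {budget} < {item_tokens} tokens per image")
    return problems


# Abbreviated entries: gemma4 as it looked before this commit, qwen3-vl after it.
models = {
    "gemma4": {
        "capabilities": ["vision", "reasoning", "tools"],
        "vllm_args": ["--max-model-len=32768"],
    },
    "qwen3-vl": {
        "capabilities": ["vision", "multilingual"],
        "vllm_args": ["--max-model-len=32768", "--max-num-batched-tokens=16384"],
    },
}
print(find_budget_problems(models))
```

A check like this only covers the one rule it encodes; other engine-init failures would still surface only on a real load.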
2026-05-12 14:47:32 -05:00

110 lines
3.5 KiB
YAML

# spark-control model catalog
#
# Edit this file (or override at runtime via the StartOS "Edit Model Catalog"
# action) to add or change available models.
#
# Each model entry produces this command on Spark 1:
#   cd ~/spark-vllm-docker
#   ./launch-cluster.sh [--solo] -d exec vllm serve <repo> \
#     --port=<defaults.port> --host=<defaults.host> <vllm_args...>
defaults:
  port: 8888
  host: 0.0.0.0

models:
  qwen3-vl:
    display_name: "Qwen3-VL 235B (vision)"
    description: >-
      Qwen's flagship multimodal model. 235B total parameters with ~22B
      active per token (Mixture-of-Experts). Handles text, images, and
      many languages. The most capable model in this catalog, and also the
      slowest to load because it splits across both Sparks.
    repo: RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4
    size_gb: 135
    mode: cluster
    capabilities: [vision, multilingual]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.7
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768
      - --max-num-batched-tokens=16384

  gemma4:
    display_name: "Gemma 4 31B"
    description: >-
      Google's mid-size reasoning model. 31B dense parameters with built-in
      thinking mode and function-calling. Strong on math, logic, and
      structured outputs; also supports vision input. Runs solo on one Spark.
    repo: RedHatAI/gemma-4-31B-it-NVFP4
    size_gb: 23
    mode: solo
    capabilities: [vision, reasoning, tools]
    expected_ready_seconds: 240
    vllm_args:
      - --gpu-memory-utilization=0.8
      - --max-model-len=32768
      - --max-num-batched-tokens=16384
      - --reasoning-parser=gemma4
      - --tool-call-parser=gemma4
      - --enable-auto-tool-choice
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8

  qwen36:
    display_name: "Qwen3.6 35B-A3B (daily driver)"
    description: >-
      Qwen's latest fast Mixture-of-Experts model: 35B total parameters but
      only ~3B active per token, making inference quick. Long 64K-token
      context window. A good default for everyday chat and longer documents.
    repo: RedHatAI/Qwen3.6-35B-A3B-NVFP4
    size_gb: 20
    mode: solo
    capabilities: [reasoning]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.85
      - --max-model-len=65536
      - --max-num-batched-tokens=16384
      - --reasoning-parser=qwen3
      - --moe_backend=flashinfer_cutlass
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8

  qwen3-235b-fp8:
    display_name: "Qwen3 235B-A22B FP8 (legacy)"
    description: >-
      Earlier generation of the Qwen 235B family in native FP8 precision.
      Runs across both Sparks. Mostly superseded by Qwen3-VL above; keep
      around for text-only baseline comparisons.
    repo: Qwen/Qwen3-235B-A22B-FP8
    size_gb: 220
    mode: cluster
    capabilities: []
    expected_ready_seconds: 360
    vllm_args:
      - --gpu-memory-utilization=0.7
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768

  qwen25-72b:
    display_name: "Qwen2.5 72B (legacy)"
    description: >-
      Last-generation 72B dense model. Cluster mode required due to size.
      Kept for compatibility and baseline comparison against newer Qwens.
    repo: Qwen/Qwen2.5-72B-Instruct
    size_gb: 145
    mode: cluster
    capabilities: []
    expected_ready_seconds: 360
    vllm_args:
      - --gpu-memory-utilization=0.7
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768
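
The header comment of the catalog describes how each entry turns into a launch command. A rough sketch of that mapping follows, under stated assumptions: `build_launch_command` is a hypothetical function (spark-control's real code is not shown here), and passing `--solo` exactly when `mode` is `solo` is inferred from the `[--solo]` placeholder and the `mode` field rather than confirmed by the source.

```python
import shlex


def build_launch_command(entry: dict, defaults: dict) -> list:
    """Assemble the argv described in the catalog header for one model entry."""
    cmd = ["./launch-cluster.sh"]
    if entry.get("mode") == "solo":  # assumption: mode "solo" maps to --solo
        cmd.append("--solo")
    cmd += ["-d", "exec", "vllm", "serve", entry["repo"]]
    cmd += [f"--port={defaults['port']}", f"--host={defaults['host']}"]
    cmd += entry.get("vllm_args", [])  # per-model flags are appended verbatim
    return cmd


defaults = {"port": 8888, "host": "0.0.0.0"}
gemma4 = {  # abbreviated copy of the gemma4 entry
    "repo": "RedHatAI/gemma-4-31B-it-NVFP4",
    "mode": "solo",
    "vllm_args": ["--gpu-memory-utilization=0.8", "--max-num-batched-tokens=16384"],
}

# shlex.join renders the argv as a single shell-safe line for logging.
print(shlex.join(build_launch_command(gemma4, defaults)))
```

Building the command as a list and joining it only for display keeps the per-model flags from being re-split or mis-quoted by a shell.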