v0.8.0:3 - add --max-num-batched-tokens=16384 to vision models (gemma4, qwen3-vl)

After the recent eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets:

    ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.

Each image input produces 2496 tokens, but vLLM's default --max-num-batched-tokens of 2048 falls just short of that. Same class of bug as the Qwen3.6 Mamba block-size assertion we fixed in 0.6.0:1, surfacing on different models.

Fix: bake --max-num-batched-tokens=16384 into every multimodal model entry. Now applied to:

- qwen36 (already had it for the Mamba constraint; works for multimodal too since Qwen3.6 has vision)
- gemma4 (crashed today on engine init)
- qwen3-vl (would crash with the same error if anyone tried it)

The pre-flight Test button validates argparse, but the 2048 < 2496 check happens at runtime engine init, so Test does not catch it; only an actual load does. v0.7's Test catches the *syntax* of a bug like this but not the *semantics*, so runtime errors of this kind still surface only on a real swap. This is a known limitation documented in the v0.7 release notes.
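The budget comparison behind that ValueError can be sketched as a small standalone check. This is only an illustration of the arithmetic, not vLLM's code or the Test button's implementation: the function names are hypothetical, the 2048 default is the value quoted in the error message, and the per-item token count (2496 here) has to be supplied by the caller because vLLM only computes it at engine init.

```python
# Hypothetical helper: mirrors the "larger than" comparison from the error text.
# The default below is the value reported in the ValueError, taken as an assumption.
DEFAULT_MAX_NUM_BATCHED_TOKENS = 2048


def batched_token_budget(vllm_args):
    """Return the --max-num-batched-tokens value from a vllm_args list,
    falling back to the assumed default when the flag is absent."""
    budget = DEFAULT_MAX_NUM_BATCHED_TOKENS
    for arg in vllm_args:
        if arg.startswith("--max-num-batched-tokens="):
            budget = int(arg.split("=", 1)[1])
    return budget


def check_mm_budget(vllm_args, max_tokens_per_mm_item):
    """Return an error string if one multimodal item would not fit in a
    single batch, or None if the budget is large enough."""
    budget = batched_token_budget(vllm_args)
    if max_tokens_per_mm_item > budget:
        return (
            f"max_tokens_per_mm_item ({max_tokens_per_mm_item}) is larger "
            f"than max_num_batched_tokens ({budget})"
        )
    return None


# gemma4 args before and after this commit (subset of the entry in the diff).
before = ["--gpu-memory-utilization=0.8", "--max-model-len=32768"]
after = before + ["--max-num-batched-tokens=16384"]

print(check_mm_budget(before, 2496))  # reports the shortfall
print(check_mm_budget(after, 2496))   # None: 16384 covers one image
```

A check like this only helps if the per-item token count is known ahead of time, which is the reason the argparse-level Test cannot catch the problem on its own.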
2026-05-12 14:47:32 -05:00
parent 000c55febe
commit 8ac455f5f5
3 changed files with 7 additions and 1 deletions
@@ -30,6 +30,7 @@ models:
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768
+      - --max-num-batched-tokens=16384

  gemma4:
    display_name: "Gemma 4 31B"
@@ -45,6 +46,7 @@ models:
    vllm_args:
      - --gpu-memory-utilization=0.8
      - --max-model-len=32768
+      - --max-num-batched-tokens=16384
      - --reasoning-parser=gemma4
      - --tool-call-parser=gemma4
      - --enable-auto-tool-choice