After the recent eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets:
ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is
larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.
Each image input can take up to 2496 tokens, which exceeds vLLM's default --max-num-batched-tokens of 2048 by 448. This is the same class of bug as the Qwen3.6 Mamba block-size assertion we fixed in 0.6.0:1, now surfacing on different models.
Fix: bake --max-num-batched-tokens=16384 into every multimodal model entry. Now applied to:
- qwen36 (already had it for the Mamba constraint; works for multimodal too since Qwen3.6 has vision)
- gemma4 (crashed today on engine init)
- qwen3-vl (would crash with the same error if anyone tried it)
The pre-flight Test button validates argparse only. The 2048 < 2496 check runs at engine init, so Test does not catch it; only an actual load does. In other words, v0.7's Test catches *syntax* errors in the flags but not *semantic* ones, and runtime errors like this still surface only on a real swap. This known limitation is documented in the v0.7 release notes.
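The semantic check that Test misses is a plain integer comparison. A minimal sketch of what such a check could look like, assuming the per-item token count is known ahead of time (2496 here, taken from the error above). `check_mm_budget` is a hypothetical helper name, not part of vLLM or of the existing Test button:

```shell
# Hypothetical pre-flight helper; not part of vLLM or the existing Test button.
# Usage: check_mm_budget <max_num_batched_tokens> <max_tokens_per_mm_item>
# Returns 0 if one multimodal item fits in the batch budget, 1 otherwise.
check_mm_budget() {
    budget=$1
    per_item=$2
    if [ "$per_item" -gt "$budget" ]; then
        # Report how far short the budget is, on stderr.
        echo "max_tokens_per_mm_item ($per_item) exceeds max_num_batched_tokens ($budget) by $((per_item - budget))" >&2
        return 1
    fi
    return 0
}
```

With the values from this entry, `check_mm_budget 2048 2496` fails and `check_mm_budget 16384 2496` passes.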
vLLM fails when launching Qwen3.6-35B-A3B-NVFP4 with:
AssertionError: In Mamba cache align mode, block_size (2096) must be
<= max_num_batched_tokens (2048).
Qwen3.6 uses a Mamba-attention hybrid. The default --max-num-batched-tokens of 2048 is 48 short of the model's required block_size of 2096. The upstream sibling recipe (qwen3.5-35b-a3b-fp8.yaml) sets it to 16384; use the same value.
Earlier qwen36 swaps in this session worked, presumably because vLLM never reached the Mamba-validation code path on those runs (a different attention backend pick, or an auto-retry). Whatever the reason, setting the flag explicitly removes the dependence on which path is taken.
Also documented in known-issues.md.
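One way to make sure the flag is always explicit is to append it whenever a model's argument string does not already set it. A rough sketch, assuming the launch arguments are available as a single space-separated string. `ensure_batched_tokens` is a hypothetical helper, and 16384 is the value from the sibling recipe:

```shell
# Hypothetical helper: make --max-num-batched-tokens explicit in an argument string.
# Usage: ensure_batched_tokens "<existing args>" [value]
# Prints the args unchanged if the flag is already present, else appends it.
ensure_batched_tokens() {
    args=$1
    value=${2:-16384}
    case " $args " in
        *" --max-num-batched-tokens="*|*" --max-num-batched-tokens "*)
            # Already set explicitly: leave the caller's value alone.
            printf '%s\n' "$args"
            ;;
        *)
            # Not set: append, avoiding a leading space when args is empty.
            printf '%s\n' "${args:+$args }--max-num-batched-tokens=$value"
            ;;
    esac
}
```

For example, `ensure_batched_tokens "--moe_backend=flashinfer_cutlass"` prints the original string followed by `--max-num-batched-tokens=16384`, while a string that already carries the flag is returned as-is.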
Volume magpie-model-cache was owned by root, but the container drops to uid 1000. Fix:
docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache
Then run docker restart magpie-tts. After a ~3 GB NGC model download, the service is healthy on :9000.
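To avoid re-running a recursive chown on every restart, the ownership can be checked first. A sketch under stated assumptions: the helper names below are hypothetical, the alpine image's `stat` is assumed to support `-c '%u:%g'`, and only the owner of the volume root is inspected, not every file beneath it:

```shell
# Hypothetical helpers; volume name and uid:gid come from the fix above.

# Pure comparison: succeeds when the observed owner differs from the expected one.
owner_mismatch() {
    [ "$1" != "$2" ]
}

# Requires Docker. Prints the numeric uid:gid that owns the volume root.
volume_owner() {
    docker run --rm -v "$1":/cache alpine stat -c '%u:%g' /cache
}

# Re-run the chown only when the volume root has the wrong owner.
fix_volume_owner() {
    if owner_mismatch "$(volume_owner "$1")" "$2"; then
        docker run --rm -v "$1":/cache alpine chown -R "$2" /cache
    fi
}
```

Usage would be `fix_volume_owner magpie-model-cache 1000:1000`, followed by the restart of magpie-tts.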
Aligned with sibling recipes in eugr/spark-vllm-docker. Applies on next swap to each model.
First real swap gemma4 -> qwen36 succeeded in 5:30 with --moe_backend=flashinfer_cutlass.