Grant 8ac455f5f5 v0.8.0:3 - add --max-num-batched-tokens=16384 to vision models (gemma4, qwen3-vl)

After the recent eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets:

  ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is
  larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.

Each image input can take up to 2496 tokens (max_tokens_per_mm_item), which exceeds vLLM's default --max-num-batched-tokens of 2048. Same class of bug as the Qwen3.6 Mamba block-size assertion we fixed in 0.6.0:1, surfacing on different models.

Fix: bake --max-num-batched-tokens=16384 into every multimodal model entry. Now applied to:
  - qwen36 (already had it for the Mamba constraint; works for multimodal too since Qwen3.6 has vision)
  - gemma4 (crashed today on engine init)
  - qwen3-vl (would crash with the same error if anyone tried it)

The pre-flight Test button validates argparse, but the 2048 < 2496 check happens at runtime during engine init, so Test does not catch it; only an actual load does. v0.7's Test catches the *syntax* of a bad flag set but not its *semantics*, so runtime errors like this still surface only on a real swap. This is a known limitation documented in the v0.7 release notes.

2026-05-12 14:47:32 -05:00

Known issues

magpie-tts crash loop (Spark 2) — RESOLVED 2026-05-12

What Magpie is: NVIDIA's multilingual text-to-speech (TTS) model, served via the NIM (NVIDIA Inference Microservices) framework — a Riva Speech Server container that converts text into spoken audio. It's the counterpart to Parakeet (which is speech-to-text / STT). When working, it exposes /v1/audio/speech on port 9000 and is used by clients like Open WebUI for the "read aloud" feature.

Root cause: The Docker volume magpie-model-cache (mounted at /opt/nim/.cache in the container) was owned by root:root on the host. The container drops to uid 1000 (riva-server) and tries to download model manifests into that cache. The write failed with I/O error Permission denied (os error 13), the container exited with code 1, and Docker restarted it indefinitely (24,000+ restarts before the fix).

Fix:

ssh <spark-user>@<spark-2-host> 'docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache && docker restart magpie-tts'

The trick is the docker run --rm alpine chown step: it runs as root inside a throwaway container, which is enough to chown the contents of the named volume on the host without needing sudo on the host itself. After the chown and restart, magpie downloaded its ~3 GB model from NGC into the cache and came up healthy on :9000.

If it ever recurs (e.g. after wiping the volume), re-run that one command.
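Before re-running the fix, it can help to confirm that ownership is really the problem. The sketch below is an assumption-laden helper, not part of this project: volume_owner and needs_chown are hypothetical names, and the check only looks at the volume root, not at every file beneath it. It would be run on the Spark 2 host (or wrapped in the same ssh call as the fix).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative only.

# Print the numeric owner of a volume's root directory as "uid:gid",
# e.g. "0:0" for root:root. Uses a throwaway alpine container.
volume_owner() {
  docker run --rm -v "$1":/cache alpine stat -c '%u:%g' /cache
}

# Exit 0 when the owner differs from the expected uid:gid
# (default 1000:1000, the uid the container drops to).
needs_chown() {
  local owner="$1"
  local expected="${2:-1000:1000}"
  [ "$owner" != "$expected" ]
}

# Example:
# if needs_chown "$(volume_owner magpie-model-cache)"; then
#   echo "cache is not owned by 1000:1000; re-run the chown fix"
# fi
```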

Qwen3.6-35B-A3B `--moe_backend=flashinfer_cutlass` may fail on launch

This flag is Blackwell-specific. If vLLM in the container reports unrecognized arguments: --moe_backend or similar, edit models.yaml for qwen36 and drop that flag. The swap UI does NOT auto-fallback in v0.1 — failure surfaces in the log stream.
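Dropping the flag is a manual edit to models.yaml. For experimenting on the command line first, a flag list can be filtered with a small helper like the one below. This is only a sketch: drop_flag is a hypothetical name, and it handles the bare and --flag=value forms only, so a value passed as a separate word would be left behind.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the swap UI.

# Usage: drop_flag NAME ARG...
# Prints ARG... on one line with NAME and NAME=value removed.
drop_flag() {
  local name="$1"
  shift
  local out=""
  local tok
  for tok in "$@"; do
    case "$tok" in
      "$name" | "$name"=*) ;;          # skip the unwanted flag
      *) out="${out:+$out }$tok" ;;    # keep everything else
    esac
  done
  printf '%s\n' "$out"
}

# Example:
# drop_flag --moe_backend --max-num-batched-tokens=16384 --moe_backend=flashinfer_cutlass
# prints: --max-num-batched-tokens=16384
```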

Qwen3.6 Mamba block-size assertion (fixed in v0.6.0:1)

Qwen3.6 uses a Mamba-attention hybrid that requires --max-num-batched-tokens >= 2096. vLLM's default is 2048, which trips AssertionError: In Mamba cache align mode, block_size (2096) must be <= max_num_batched_tokens (2048). Fix: bake --max-num-batched-tokens=16384 into the bundled qwen36 entry — matches the upstream qwen3.5-35b-a3b-fp8 recipe.

Multimodal token budget for vision models (fixed in v0.8.0:1)

After the eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets. Vision-capable models like Gemma 4 31B and Qwen3-VL crash at engine init with ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Fix: bake --max-num-batched-tokens=16384 into every model that has the vision capability. Now applied to qwen3-vl, gemma4, and qwen36 (which was already set for the Mamba issue).
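Because the failing comparison only runs at engine init, a flag set can be sanity-checked by hand before attempting a load. The following is a minimal sketch under stated assumptions: batched_tokens and budget_ok are hypothetical helpers, not vLLM or swap UI features, the 2048 fallback is the default quoted in the error message, and the per-item value (2496 here) has to be taken from the error or known in advance.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of vLLM or the swap UI.

# Print the --max-num-batched-tokens value found in a flag string.
# Falls back to 2048, the default quoted in the error message.
batched_tokens() {
  local flags="$1"
  local value=2048
  local prev=""
  local tok
  for tok in $flags; do
    case "$tok" in
      --max-num-batched-tokens=*) value="${tok#*=}" ;;
      *)
        if [ "$prev" = "--max-num-batched-tokens" ]; then
          value="$tok"
        fi
        ;;
    esac
    prev="$tok"
  done
  printf '%s\n' "$value"
}

# Exit 0 when the budget covers one multimodal item, 1 otherwise.
budget_ok() {
  local flags="$1"
  local per_item="$2"
  [ "$(batched_tokens "$flags")" -ge "$per_item" ]
}

# Example: 2496 is the max_tokens_per_mm_item value from the error.
# budget_ok "--max-num-batched-tokens=16384" 2496 && echo "budget ok"
```

The same comparison applies to the Mamba block-size case by passing 2096 as the second argument.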

Two SSH paths to Spark 1 from the laptop

ssh <spark-user>@<spark-1-ip> does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for <spark-1-host>.local. Always use the .local hostname or <spark-2-ip>-style entries that ARE matched.

Older models in `models.yaml`

The qwen3-235b-fp8 and qwen25-72b catalog entries are conservative guesses for vLLM flags: the models are on disk but were never the focus of this project. First launch of either may fail or be suboptimal; once working flags are found, record them here.