spark-control

Author	SHA1	Message	Date
Grant	8ac455f5f5	v0.8.0:3 - add --max-num-batched-tokens=16384 to vision models (gemma4, qwen3-vl) After the recent eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets: ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens. Each image input produces 2496 tokens, but vLLM's default --max-num-batched-tokens of 2048 is just under. Same class of bug as the Qwen3.6 Mamba block-size assertion we fixed in 0.6.0:1, surfacing on different models. Fix: bake --max-num-batched-tokens=16384 into every multimodal model entry. Now applied to: - qwen36 (already had it for the Mamba constraint; works for multimodal too since Qwen3.6 has vision) - gemma4 (crashed today on engine init) - qwen3-vl (would crash with the same error if anyone tried it) The pre-flight Test button validates argparse but the 2048<2496 check happens at runtime engine init, so it's not caught by Test — only by actually trying to load. This is exactly the kind of bug v0.7's Test catches the syntax of but not the semantics; runtime errors like this still surface only on real swap. Known limitation documented in v0.7 release notes.	2026-05-12 14:47:32 -05:00
Grant	5827683a09	v0.6.0:1 - fix Qwen3.6 Mamba block-size assertion at launch vLLM trips on launching Qwen3.6-35B-A3B-NVFP4 with: AssertionError: In Mamba cache align mode, block_size (2096) must be <= max_num_batched_tokens (2048). Qwen3.6 uses a Mamba-attention hybrid. The default --max-num-batched-tokens of 2048 is just under the model's required block_size of 2096. The upstream sibling recipe (qwen3.5-35b-a3b-fp8.yaml) sets it to 16384; use the same value. Earlier qwen36 swaps in this session worked because vLLM hadn't reached the Mamba-validation code path on that prior path (different attention backend pick or auto-retry). Whatever the reason, the explicit flag avoids the dance. Also documented in known-issues.md.	2026-05-12 13:22:24 -05:00
Grant	ed54f85442	known-issues: mark magpie crash loop RESOLVED with chown fix recipe Volume magpie-model-cache was owned by root, container drops to uid 1000. Fix: docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache + docker restart magpie-tts. After ~3 GB NGC model download, healthy on :9000.	2026-05-12 11:12:25 -05:00
Grant	342e150266	Add safe optimization flags to gemma4 + qwen36 (fastsafetensors, prefix-caching, fp8 kv) Aligned with sibling recipes in eugr/spark-vllm-docker. Applies on next swap to each model. First real swap gemma4 -> qwen36 succeeded in 5:30 with --moe_backend=flashinfer_cutlass.	2026-05-12 09:49:08 -05:00
Grant	ae8efa1754	Initial scaffold: image/ FastAPI app, models.yaml, docs - image/ FastAPI app: /api/status, /api/swap, /api/swap/{id}/stream, /api/test-connection - models.yaml: 5-model catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen25-72b) - README, runbook, known-issues - Dry-run swap verified against live Spark 1 (gemma4 currently loaded)	2026-05-12 09:29:13 -05:00

5 Commits