v0.6.0:1 - fix Qwen3.6 Mamba block-size assertion at launch

vLLM fails at launch for Qwen3.6-35B-A3B-NVFP4 with:
  AssertionError: In Mamba cache align mode, block_size (2096) must be
  <= max_num_batched_tokens (2048).

Qwen3.6 uses a Mamba-attention hybrid. The default --max-num-batched-tokens of 2048 is just under the model's required block_size of 2096. The upstream sibling recipe (qwen3.5-35b-a3b-fp8.yaml) sets it to 16384; use the same value.
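
The constraint behind the assertion can be sketched as a small pre-launch check. This is only an illustration, not vLLM code: the helper names `get_flag_int` and `check_mamba_block_size` are hypothetical, the 2048 default is taken from this commit message, and the 2096 block size is taken from the error text above.

```python
# Hypothetical pre-launch check; these helpers are NOT part of vLLM.
# Assumption: an unset --max-num-batched-tokens falls back to 2048,
# as stated in the commit message.

def get_flag_int(vllm_args, name, default):
    """Return the integer value of `--name=value` in vllm_args, or default."""
    prefix = f"--{name}="
    for arg in vllm_args:
        if arg.startswith(prefix):
            return int(arg[len(prefix):])
    return default


def check_mamba_block_size(vllm_args, block_size, default_batched_tokens=2048):
    """Return (ok, budget); ok is True when block_size fits in the token budget."""
    budget = get_flag_int(vllm_args, "max-num-batched-tokens", default_batched_tokens)
    return block_size <= budget, budget


# Recipe before this commit: no explicit flag, so the default applies.
before = ["--gpu-memory-utilization=0.85", "--max-model-len=65536"]
# Recipe after this commit: explicit budget copied from the sibling recipe.
after = before + ["--max-num-batched-tokens=16384"]

print(check_mamba_block_size(before, block_size=2096))  # (False, 2048)
print(check_mamba_block_size(after, block_size=2096))   # (True, 16384)
```

Any budget of at least 2096 would satisfy the check; 16384 is used only to match the sibling recipe.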

Earlier qwen36 swaps in this session worked, presumably because vLLM never reached the Mamba validation code path on those launches (a different attention backend pick or an auto-retry). Whatever the reason, setting the flag explicitly avoids the failure.

Also documented in known-issues.md.
Grant
2026-05-12 13:22:24 -05:00
parent ee8c2406b8
commit 5827683a09
3 changed files with 6 additions and 1 deletions
@@ -66,6 +66,7 @@ models:
 vllm_args:
 - --gpu-memory-utilization=0.85
 - --max-model-len=65536
+- --max-num-batched-tokens=16384
 - --reasoning-parser=qwen3
 - --moe_backend=flashinfer_cutlass
 - --load-format=fastsafetensors