Add safe optimization flags to gemma4 + qwen36 (fastsafetensors, prefix-caching, fp8 kv)
Aligns the flags with the sibling recipes in eugr/spark-vllm-docker. The new flags take effect the next time each model is swapped in. The first real swap, gemma4 -> qwen36, succeeded in 5:30 with --moe_backend=flashinfer_cutlass.
@@ -39,6 +39,9 @@ models:
 - --reasoning-parser=gemma4
 - --tool-call-parser=gemma4
 - --enable-auto-tool-choice
+- --load-format=fastsafetensors
+- --enable-prefix-caching
+- --kv-cache-dtype=fp8

 qwen36:
 display_name: "Qwen3.6 35B-A3B (daily driver)"
@@ -52,6 +55,9 @@ models:
 - --max-model-len=65536
 - --reasoning-parser=qwen3
 - --moe_backend=flashinfer_cutlass
+- --load-format=fastsafetensors
+- --enable-prefix-caching
+- --kv-cache-dtype=fp8

 qwen3-235b-fp8:
 display_name: "Qwen3 235B-A22B FP8 (legacy)"
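
One way to confirm that both entries carry the three new flags before the next swap is a small check like the one below. This is only a sketch: the helper name `missing_flags` and the hard-coded argument lists are illustrative assumptions, not part of the repository. In practice the lists would be read from the models config file, whose exact layout is not shown in this diff.

```python
# Flags added to both models by this commit, copied from the diff.
REQUIRED_FLAGS = (
    "--load-format=fastsafetensors",
    "--enable-prefix-caching",
    "--kv-cache-dtype=fp8",
)


def missing_flags(args, required=REQUIRED_FLAGS):
    """Return the required flags that do not appear in args, in order."""
    present = set(args)
    return [flag for flag in required if flag not in present]


# Hypothetical stand-ins for the two argument lists after this commit.
# Only flags visible in the diff are listed; the real entries have more.
gemma4_args = [
    "--reasoning-parser=gemma4",
    "--tool-call-parser=gemma4",
    "--enable-auto-tool-choice",
    "--load-format=fastsafetensors",
    "--enable-prefix-caching",
    "--kv-cache-dtype=fp8",
]
qwen36_args = [
    "--max-model-len=65536",
    "--reasoning-parser=qwen3",
    "--moe_backend=flashinfer_cutlass",
    "--load-format=fastsafetensors",
    "--enable-prefix-caching",
    "--kv-cache-dtype=fp8",
]

for name, args in (("gemma4", gemma4_args), ("qwen36", qwen36_args)):
    missing = missing_flags(args)
    print(name, "OK" if not missing else f"missing: {missing}")
```

The check compares whole argument strings, so a flag written as two separate tokens (for example `--kv-cache-dtype fp8`) would be reported as missing.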