Add safe optimization flags to gemma4 + qwen36 (fastsafetensors, prefix-caching, fp8 kv)
Aligned with sibling recipes in eugr/spark-vllm-docker. Applies on next swap to each model. First real swap gemma4 -> qwen36 succeeded in 5:30 with --moe_backend=flashinfer_cutlass.
This commit is contained in:
+14
@@ -2,6 +2,20 @@
|
||||
|
||||
Operating notes for running and maintaining the cluster via spark-control.
|
||||
|
||||
## Recent successful swaps
|
||||
|
||||
- **2026-05-12 — gemma4 → qwen36** via `POST /api/swap` from laptop dev server. ~5:30 to "Application startup complete." Inference works (`/v1/chat/completions` returns reasoning content via `reasoning` field). `--moe_backend=flashinfer_cutlass` confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").
|
||||
|
||||
## Optimization flags (added 2026-05-12)
|
||||
|
||||
Aligned `gemma4` and `qwen36` `vllm_args` with the project's sibling recipes (`qwen3.5-35b-a3b-fp8.yaml`, `gemma4-26b-a4b.yaml`):
|
||||
|
||||
- `--load-format=fastsafetensors` — faster cold-start weight load.
|
||||
- `--enable-prefix-caching` — reuse cached prefix tokens (e.g. system prompt) across requests.
|
||||
- `--kv-cache-dtype=fp8` — store KV cache in 8-bit FP; halves memory used per active context.
|
||||
|
||||
These take effect on the **next swap to that model**. If a swap fails after this change with errors mentioning fastsafetensors/prefix-caching/fp8, revert the entry in `models.yaml` and retry.
|
||||
|
||||
## Day-to-day
|
||||
|
||||
- The UI lives at `http://<your-start9>.local:9999` once the StartOS package is installed and configured.
|
||||
|
||||
Reference in New Issue
Block a user