Files
spark-control/runbook.md
T
Grant 342e150266 Add safe optimization flags to gemma4 + qwen36 (fastsafetensors, prefix-caching, fp8 kv)
Aligned with sibling recipes in eugr/spark-vllm-docker. Applies on next swap to each model.
First real swap gemma4 -> qwen36 succeeded in 5:30 with --moe_backend=flashinfer_cutlass.
2026-05-12 09:49:08 -05:00

76 lines
2.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# spark-control runbook
Operating notes for running and maintaining the cluster via spark-control.
## Recent successful swaps
- **2026-05-12 — gemma4 → qwen36** via `POST /api/swap` from laptop dev server. ~5:30 to "Application startup complete." Inference works (`/v1/chat/completions` returns reasoning content via `reasoning` field). `--moe_backend=flashinfer_cutlass` confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").
## Optimization flags (added 2026-05-12)
Aligned `gemma4` and `qwen36` `vllm_args` with the project's sibling recipes (`qwen3.5-35b-a3b-fp8.yaml`, `gemma4-26b-a4b.yaml`):
- `--load-format=fastsafetensors` — faster cold-start weight load.
- `--enable-prefix-caching` — reuse cached prefix tokens (e.g. system prompt) across requests.
- `--kv-cache-dtype=fp8` — store KV cache in 8-bit FP; halves memory used per active context.
These take effect on the **next swap to that model**. If a swap fails after this change with errors mentioning fastsafetensors/prefix-caching/fp8, revert the entry in `models.yaml` and retry.
## Day-to-day
- The UI lives at `http://<your-start9>.local:9999` once the StartOS package is installed and configured.
- Status auto-refreshes every 5 s.
- A swap takes 36 minutes depending on the model. Don't close the tab — but if you do, the swap continues; reopen and you'll re-attach to the log stream.
## Adding a new model
1. Add an entry to `models.yaml` (in the image source) or, post-install, via the "Edit Model Catalog" action in StartOS.
2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host>.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
3. The new model appears in the UI on next refresh.
## Manual swap fallback
If the UI is unavailable and you need to swap by hand:
```bash
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
--port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
--max-model-len 32768 --reasoning-parser gemma4 \
--tool-call-parser gemma4 --enable-auto-tool-choice
docker logs -f vllm_node # wait for "Application startup complete."
```
## Diagnostics
```bash
# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .
# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host>.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
# Tail current model's logs
ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'
# Parakeet
curl -s http://<spark-2-ip>:8000/health
# Magpie (see known-issues.md)
curl -s http://<spark-2-ip>:9000/v1/health/ready
```
## Hard reset
If launch-cluster.sh gets stuck:
```bash
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model
```