spark-control runbook
Operating notes for running and maintaining the cluster via spark-control.
Recent successful swaps
- 2026-05-12: gemma4 → qwen36 via POST /api/swap from laptop dev server. ~5:30 to "Application startup complete." Inference works (/v1/chat/completions returns reasoning content via the reasoning field). --moe_backend=flashinfer_cutlass confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").
Optimization flags (added 2026-05-12)
Aligned gemma4 and qwen36 vllm_args with the project's sibling recipes (qwen3.5-35b-a3b-fp8.yaml, gemma4-26b-a4b.yaml):
- --load-format=fastsafetensors: faster cold-start weight load.
- --enable-prefix-caching: reuse cached prefix tokens (e.g. system prompt) across requests.
- --kv-cache-dtype=fp8: store the KV cache in 8-bit floating point; roughly halves KV-cache memory per active context relative to a 16-bit cache.
These take effect on the next swap to that model. If a swap fails after this change with errors mentioning fastsafetensors/prefix-caching/fp8, revert the entry in models.yaml and retry.
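To judge whether a failed swap is related to the new flags, the container log can be filtered for the keywords named above. This is a minimal sketch: the function name flag_errors is hypothetical, and vllm_node is the container name used elsewhere in this runbook. Matches are only candidates, since fp8 can also appear in ordinary informational lines.

```shell
# Hypothetical helper: keep only log lines that mention one of the new flags.
# Reads log text on stdin, so it works with any log source.
flag_errors() {
  grep -iE 'fastsafetensors|prefix[-_]caching|fp8'
}

# Usage on Spark 1 (docker logs writes to both stdout and stderr):
#   docker logs --tail 500 vllm_node 2>&1 | flag_errors
```

If nothing matches, the failure is probably unrelated to these flags and reverting the models.yaml entry is unlikely to help.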
Day-to-day
- The UI lives at http://<your-start9>.local:9999 once the StartOS package is installed and configured.
- Status auto-refreshes every 5 s.
- A swap takes 3–6 minutes depending on the model. Don't close the tab — but if you do, the swap continues; reopen and you'll re-attach to the log stream.
Adding a new model
- Add an entry to models.yaml (in the image source) or, post-install, via the "Edit Model Catalog" action in StartOS.
- Confirm the weights are on the Spark: ssh <spark-user>@<spark-1-host>.local 'ls ~/.cache/huggingface/hub/'. If not, download with ./hf-download.sh <repo> on Spark 1.
- The new model appears in the UI on next refresh.
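When scanning the hub directory it helps to know which entry to look for. The Hugging Face cache stores each model repo in a directory named models--<org>--<name>. The sketch below derives that name from a repo id; the function name hf_cache_dir is hypothetical.

```shell
# Hypothetical helper: map a Hugging Face repo id to its directory name
# under ~/.cache/huggingface/hub/ by replacing "/" with "--".
hf_cache_dir() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# Usage:
#   hf_cache_dir RedHatAI/gemma-4-31B-it-NVFP4
# then look for that name in the ls output from the step above.
```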
Manual swap fallback
If the UI is unavailable and you need to swap by hand:
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
--port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
--max-model-len 32768 --reasoning-parser gemma4 \
--tool-call-parser gemma4 --enable-auto-tool-choice
docker logs -f vllm_node # wait for "Application startup complete."
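Instead of watching the log by hand, the wait can be scripted as a polling loop with an upper bound. This is a sketch under assumptions: the function name wait_for_line and the try/delay values are illustrative, not part of spark-control. Only the container name and the startup message come from this runbook.

```shell
# Hypothetical helper: run a log-printing command repeatedly until its output
# contains a fixed string, or give up after a number of tries.
#   $1 = string to look for, $2 = max tries, $3 = seconds between tries,
#   remaining args = command that prints the log so far.
wait_for_line() {
  pattern=$1; max_tries=$2; delay=$3; shift 3
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    if "$@" 2>&1 | grep -qF -- "$pattern"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage on Spark 1: poll every 10 s for up to 10 minutes.
#   wait_for_line 'Application startup complete' 60 10 \
#     docker logs --tail 200 vllm_node && echo ready
```

Polling a bounded --tail avoids leaving a docker logs -f process attached after the line has been seen.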
Diagnostics
# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .
# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host>.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
# Tail current model's logs
ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'
# Parakeet
curl -s http://<spark-2-ip>:8000/health
# Magpie (see known-issues.md)
curl -s http://<spark-2-ip>:9000/v1/health/ready
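The three HTTP checks above can be folded into one pass that prints a single OK or FAIL line per service. This is a minimal sketch: check_endpoint and the variables SPARK1_IP and SPARK2_IP are hypothetical names standing in for the placeholders used in this runbook.

```shell
# Hypothetical helper: report whether an HTTP endpoint answers successfully.
#   $1 = label, $2 = URL
# -f makes curl fail on HTTP errors (4xx/5xx); --max-time bounds each probe.
check_endpoint() {
  if curl -sf --max-time 5 -o /dev/null "$2"; then
    echo "OK $1"
  else
    echo "FAIL $1"
  fi
}

# Usage (set SPARK1_IP and SPARK2_IP to the two node addresses first):
#   check_endpoint vllm     "http://$SPARK1_IP:8888/v1/models"
#   check_endpoint parakeet "http://$SPARK2_IP:8000/health"
#   check_endpoint magpie   "http://$SPARK2_IP:9000/v1/health/ready"
```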
Hard reset
If launch-cluster.sh gets stuck:
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
# Caution: this force-removes every container on the host, not only the vLLM ones
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model