Files
spark-control/runbook.md
T
Grant ae8efa1754 Initial scaffold: image/ FastAPI app, models.yaml, docs
- image/ FastAPI app: /api/status, /api/swap, /api/swap/{id}/stream, /api/test-connection
- models.yaml: 5-model catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen25-72b)
- README, runbook, known-issues
- Dry-run swap verified against live Spark 1 (gemma4 currently loaded)
2026-05-12 09:29:13 -05:00

62 lines
1.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# spark-control runbook
Operating notes for running and maintaining the cluster via spark-control.
## Day-to-day
- The UI lives at `http://<your-start9>.local:9999` once the StartOS package is installed and configured.
- Status auto-refreshes every 5 s.
- A swap takes 36 minutes depending on the model. Don't close the tab — but if you do, the swap continues; reopen and you'll re-attach to the log stream.
## Adding a new model
1. Add an entry to `models.yaml` (in the image source) or, post-install, via the "Edit Model Catalog" action in StartOS.
2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host>.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
3. The new model appears in the UI on next refresh.
## Manual swap fallback
If the UI is unavailable and you need to swap by hand:
```bash
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
--port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
--max-model-len 32768 --reasoning-parser gemma4 \
--tool-call-parser gemma4 --enable-auto-tool-choice
docker logs -f vllm_node # wait for "Application startup complete."
```
## Diagnostics
```bash
# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .
# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host>.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
# Tail current model's logs
ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'
# Parakeet
curl -s http://<spark-2-ip>:8000/health
# Magpie (see known-issues.md)
curl -s http://<spark-2-ip>:9000/v1/health/ready
```
## Hard reset
If launch-cluster.sh gets stuck:
```bash
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model
```