# spark-control runbook

Operating notes for running and maintaining the cluster via spark-control.

## Day-to-day

- The UI lives at `http://.local:9999` once the StartOS package is installed and configured.
- Status auto-refreshes every 5 s.
- A swap takes 3–6 minutes depending on the model. Don't close the tab, but if you do, the swap continues; reopen and you'll re-attach to the log stream.

## Adding a new model

1. Add an entry to `models.yaml` (in the image source) or, post-install, via the "Edit Model Catalog" action in StartOS.
2. Confirm the weights are on the Spark: `ssh @.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh ` on Spark 1.
3. The new model appears in the UI on next refresh.

## Manual swap fallback

If the UI is unavailable and you need to swap by hand:

```bash
ssh @.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --reasoning-parser gemma4 \
  --tool-call-parser gemma4 --enable-auto-tool-choice
docker logs -f vllm_node   # wait for "Application startup complete."
```

## Diagnostics

```bash
# Is vLLM serving?
curl -s http://:8888/v1/models | jq .

# Cluster status (containers up?)
ssh @.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'

# Tail current model's logs
ssh @.local 'docker logs --tail 200 -f vllm_node'

# Parakeet
curl -s http://:8000/health

# Magpie (see known-issues.md)
curl -s http://:9000/v1/health/ready
```

## Hard reset

If `launch-cluster.sh` gets stuck:

```bash
ssh @.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model
```
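
After a manual swap or a hard reset, the wait for vLLM to come back can be scripted instead of watching `docker logs`. The sketch below is not part of spark-control: `wait_until` and `vllm_ready` are hypothetical helper names, `SPARK_HOST` is an assumed variable holding the Spark's address, and the 600 s timeout is only a guess that leaves headroom over the 3–6 minute swap time. It polls the same `/v1/models` endpoint used under Diagnostics.

```bash
#!/bin/sh
# Hypothetical helpers (not shipped with spark-control).

# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 once TIMEOUT_SECONDS have elapsed.
wait_until() {
  _wu_timeout="$1"
  _wu_interval="$2"
  shift 2
  _wu_start=$(date +%s)
  while :; do
    # Run the readiness check, discarding its output.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    _wu_now=$(date +%s)
    if [ $(( _wu_now - _wu_start )) -ge "$_wu_timeout" ]; then
      return 1
    fi
    sleep "$_wu_interval"
  done
}

# vllm_ready HOST
# Succeeds once the vLLM endpoint on port 8888 answers with a 2xx status.
vllm_ready() {
  curl -sf --max-time 5 "http://$1:8888/v1/models"
}

# Example (SPARK_HOST is an assumption; set it to your Spark's address):
#   wait_until 600 10 vllm_ready "$SPARK_HOST" && echo "vLLM is serving"
```

Polling the HTTP endpoint rather than grepping the log for "Application startup complete." keeps the check independent of log wording, at the cost of not showing why a start failed; fall back to `docker logs` when the wait times out.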
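
The individual checks under Diagnostics can also be run in one pass. This is a minimal sketch under stated assumptions: `check_service` and `check_all` are hypothetical names, the host argument stands in for your Spark's address, and "healthy" is taken to mean any 2xx HTTP response, which may be looser than what each service actually guarantees. The ports and paths are the ones listed in Diagnostics.

```bash
#!/bin/sh
# Hypothetical helpers (not shipped with spark-control).

# check_service NAME URL
# Prints "NAME OK" or "NAME FAIL"; returns non-zero on failure.
check_service() {
  if curl -sf --max-time 5 -o /dev/null "$2"; then
    printf '%-10s OK\n' "$1"
  else
    printf '%-10s FAIL\n' "$1"
    return 1
  fi
}

# check_all HOST
# Probes vLLM, Parakeet, and Magpie; returns non-zero if any probe fails.
check_all() {
  _ca_rc=0
  check_service vllm     "http://$1:8888/v1/models"       || _ca_rc=1
  check_service parakeet "http://$1:8000/health"          || _ca_rc=1
  check_service magpie   "http://$1:9000/v1/health/ready" || _ca_rc=1
  return "$_ca_rc"
}

# Example (SPARK_HOST is an assumption; set it to your Spark's address):
#   check_all "$SPARK_HOST"
```

A `FAIL` for Magpie is not necessarily new; see known-issues.md before treating it as a regression.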