Keysat 8d839e3714 v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API
- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:57 -05:00


spark-control runbook

Operating notes for running and maintaining the cluster via spark-control.

Prerequisites (per Spark)

spark-control is a controller, not a runtime. Each Spark in your cluster must already have the upstream eugr/spark-vllm-docker project set up:

  1. Clone https://github.com/eugr/spark-vllm-docker to ~/spark-vllm-docker on Spark 1 (the head node).
  2. Build the vLLM container: ./build-and-copy.sh -c (on a cluster) or ./build-and-copy.sh (solo).
  3. Pre-download any models you want in the catalog: ./hf-download.sh <repo> -c --copy-parallel.
  4. Verify that ./launch-cluster.sh status runs without errors and reports a plausible state.
  5. Set up passwordless SSH from your Start9 server's spark-control container to each Spark (use the Show Public Key action — see README.md "Post-install setup").
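Step 5 can be checked non-interactively before the first swap. The following is a minimal sketch, assuming it runs wherever the controller's SSH key is available. The function name check_ssh and the SPARK_USER variable are illustrative and not part of spark-control. BatchMode=yes makes ssh fail rather than prompt for a password, so a success means key-based login works.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm key-based SSH works for each Spark.
# SPARK_USER and the host names passed in are placeholders, not spark-control settings.
SPARK_USER="${SPARK_USER:-spark}"

check_ssh() {
  local host rc=0
  for host in "$@"; do
    # BatchMode=yes disables password prompts, so this only succeeds with a working key.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "${SPARK_USER}@${host}" true 2>/dev/null; then
      echo "${host}: ok"
    else
      echo "${host}: FAILED"
      rc=1
    fi
  done
  return "$rc"
}
```

Example call: check_ssh <spark-1-host>.local <spark-2-host>.local. The function returns non-zero if any host fails, so it can gate a setup script.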

To share this package with someone who has a similar dual-DGX-Spark setup, have them complete the same per-Spark prerequisites, then sideload the .s9pk on their Start9 and run the setup actions.

Recent successful swaps

  • 2026-05-12 — gemma4 → qwen36 via POST /api/swap from laptop dev server. ~5:30 to "Application startup complete." Inference works (/v1/chat/completions returns reasoning content via reasoning field). --moe_backend=flashinfer_cutlass confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").

Optimization flags (added 2026-05-12)

Aligned gemma4 and qwen36 vllm_args with the project's sibling recipes (qwen3.5-35b-a3b-fp8.yaml, gemma4-26b-a4b.yaml):

  • --load-format=fastsafetensors — faster cold-start weight load.
  • --enable-prefix-caching — reuse cached prefix tokens (e.g. system prompt) across requests.
  • --kv-cache-dtype=fp8 — store KV cache in 8-bit FP; halves memory used per active context.

These take effect on the next swap to that model. If a swap fails after this change with errors mentioning fastsafetensors/prefix-caching/fp8, revert the entry in models.yaml and retry.
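To decide quickly whether a failed swap is related to these flags, the container log can be filtered for the three keywords. This is a rough sketch: flag_errors is a hypothetical helper name, and the match is deliberately broad, so it will also surface harmless informational lines that mention fp8 or prefix caching.

```shell
#!/usr/bin/env bash
# Hypothetical helper: filter log lines that mention the newly added flags.
# Reads log text on stdin and prints matching lines (case-insensitive).
flag_errors() {
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -iE 'fastsafetensors|prefix[-_]caching|fp8' || true
}
```

Example call: ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 500 vllm_node 2>&1' | flag_errors. Empty output suggests the failure lies elsewhere.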

Day-to-day

  • The UI lives at http://<your-start9>.local:9999 once the StartOS package is installed and configured.
  • Status auto-refreshes every 5 s.
  • A swap takes 3 to 6 minutes depending on the model. Keep the tab open if you can; if you do close it, the swap continues, and reopening the page re-attaches you to the log stream.

Adding a new model

  1. Add an entry to image/models.yaml. Required fields: display_name, repo, size_gb, mode (solo or cluster), vllm_args. Optional but recommended: description (one paragraph — what the model is, what it's good for, how it differs from others; renders below the meta tags in each card), capabilities (tags like [vision, reasoning, tools]), expected_ready_seconds.
  2. Confirm the weights are on the Spark: ssh <spark-user>@<spark-1-host>.local 'ls ~/.cache/huggingface/hub/'. If not, download with ./hf-download.sh <repo> on Spark 1.
  3. Rebuild + redeploy the package: cd package && make x86 && make install.

If description is omitted, the card simply hides that section — no need to populate it for every model. Keep descriptions generic (not user-specific) so the catalog stays portable.
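The weight check in step 2 can be scripted instead of reading the ls output by eye. The sketch below relies on the Hugging Face hub cache naming scheme, where a repo org/name is stored under a directory called models--org--name. The function names are hypothetical, and an existing directory does not prove the download completed.

```shell
#!/usr/bin/env bash
# Map a Hugging Face repo id to the directory name used by the hub cache,
# e.g. org/name -> models--org--name.
hf_cache_dir() {
  local repo="$1"
  echo "models--${repo//\//--}"
}

# Hypothetical check: succeeds if the repo's cache directory exists on the Spark.
# $1 = ssh target (user@host), $2 = repo id as written in models.yaml.
weights_present() {
  local target="$1" dir
  dir="$(hf_cache_dir "$2")"
  ssh -o BatchMode=yes "$target" "test -d ~/.cache/huggingface/hub/${dir}"
}
```

Example call: weights_present <spark-user>@<spark-1-host>.local RedHatAI/gemma-4-31B-it-NVFP4 || echo "download first". This assumes the default cache location shown in step 2.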

Manual swap fallback

If the UI is unavailable and you need to swap by hand:

ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --reasoning-parser gemma4 \
  --tool-call-parser gemma4 --enable-auto-tool-choice
docker logs -f vllm_node      # wait for "Application startup complete."
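Instead of watching the log for "Application startup complete.", readiness can be polled over HTTP. This is a sketch only: wait_for_vllm is a hypothetical helper, and the 600-second default timeout is an assumption chosen to cover a typical swap with some margin. It treats a successful response from /v1/models as the ready signal.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll the OpenAI-compatible /v1/models endpoint until it answers.
# $1 = base URL (e.g. http://<spark-1-ip>:8888), $2 = timeout in seconds, $3 = poll interval.
wait_for_vllm() {
  local base="$1" timeout="${2:-600}" interval="${3:-5}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -f makes curl return non-zero on HTTP errors; --max-time bounds each attempt.
    if curl -sf --max-time 5 "${base}/v1/models" >/dev/null; then
      echo "ready after ~${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "not ready after ${timeout}s" >&2
  return 1
}
```

Example call: wait_for_vllm http://<spark-1-ip>:8888 600 5. The reported time counts only the sleeps between attempts, so it is approximate.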

Diagnostics

# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .

# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host>.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'

# Tail current model's logs
ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'

# Parakeet
curl -s http://<spark-2-ip>:8000/health

# Kokoro TTS (v0.14.0+)
curl -s http://<spark-2-ip>:8880/health
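The three HTTP checks above can be combined into one pass that reports every service even when an earlier one is down. A minimal sketch, assuming the ports listed above; check_endpoint, check_all, SPARK1_IP and SPARK2_IP are illustrative names that the operator would set, not part of spark-control.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report up/DOWN for one HTTP endpoint.
check_endpoint() {
  local name="$1" url="$2"
  # -f turns HTTP error statuses into a non-zero exit code.
  if curl -sf --max-time 5 "$url" >/dev/null; then
    echo "${name}: up"
  else
    echo "${name}: DOWN"
    return 1
  fi
}

# Run every check even if an earlier one fails; return non-zero if any failed.
# SPARK1_IP and SPARK2_IP are placeholders to be set by the operator.
check_all() {
  local rc=0
  check_endpoint "vLLM"     "http://${SPARK1_IP}:8888/v1/models" || rc=1
  check_endpoint "Parakeet" "http://${SPARK2_IP}:8000/health"    || rc=1
  check_endpoint "Kokoro"   "http://${SPARK2_IP}:8880/health"    || rc=1
  return "$rc"
}
```

Example call: SPARK1_IP=<spark-1-ip> SPARK2_IP=<spark-2-ip> check_all. A DOWN line only means the endpoint did not answer successfully within five seconds; the container logs remain the place to find the cause.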

Hard reset

If launch-cluster.sh gets stuck:

ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
# WARNING: this removes every container on this host, not only the vLLM ones
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model