Add Spark prerequisites section to runbook (spark-vllm-docker is upstream + Spark-side)

Grant
2026-05-12 10:05:17 -05:00
parent 53a0b01d88
commit 34bdbb7aba
+12
@@ -2,6 +2,18 @@
Operating notes for running and maintaining the cluster via spark-control.
## Prerequisites (per Spark)
spark-control is a **controller**, not a runtime. Each Spark in your cluster must already have the upstream `eugr/spark-vllm-docker` project set up:
1. Clone `https://github.com/eugr/spark-vllm-docker` to `~/spark-vllm-docker` on Spark 1 (the head node).
2. Build the vLLM container: `./build-and-copy.sh -c` (on a cluster) or `./build-and-copy.sh` (solo).
3. Pre-download any models you want in the catalog: `./hf-download.sh <repo> -c --copy-parallel`.
4. Verify: `./launch-cluster.sh status` runs without errors and reports a plausible cluster state.
5. Set up passwordless SSH from your Start9 server's spark-control container to each Spark (use the Show Public Key action — see README.md "Post-install setup").
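The steps above can be sketched as a small helper script. This is a minimal sketch, not part of spark-control or spark-vllm-docker: the function names `print_prereq_steps` and `check_passwordless_ssh` and the `cluster`/`solo` mode names are hypothetical. It only prints the commands from steps 1-4 for review, and offers a non-interactive check for step 5 using the standard OpenSSH `BatchMode` option.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands from steps 1-4 so they can be reviewed
# before being run by hand on Spark 1 (the head node). Nothing is executed.

REPO_URL="https://github.com/eugr/spark-vllm-docker"
REPO_DIR="~/spark-vllm-docker"   # quoted on purpose so it prints literally

# print_prereq_steps MODE MODEL_REPO
#   MODE:       "cluster" or "solo" (hypothetical names used by this sketch)
#   MODEL_REPO: the <repo> argument from step 3
print_prereq_steps() {
  local mode="$1" model_repo="$2"
  echo "git clone $REPO_URL $REPO_DIR"
  echo "cd $REPO_DIR"
  if [ "$mode" = "cluster" ]; then
    echo "./build-and-copy.sh -c"
  else
    echo "./build-and-copy.sh"
  fi
  # The runbook gives only this form of the download command.
  echo "./hf-download.sh $model_repo -c --copy-parallel"
  echo "./launch-cluster.sh status"
}

# check_passwordless_ssh HOST
#   Returns 0 if key-based login to HOST works without prompting.
#   BatchMode=yes makes ssh fail instead of asking for a password.
check_passwordless_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

For example, `print_prereq_steps cluster "<repo>"` lists the five commands for a clustered setup, and `check_passwordless_ssh <host>` can be run from the spark-control container once per Spark after the public key has been installed.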
To share this package with someone who has a similar dual-DGX-Spark setup, have them complete the same per-Spark prerequisites, then sideload the `.s9pk` on their Start9 server and run the setup actions.
## Recent successful swaps
- **2026-05-12: gemma4 → qwen36** via `POST /api/swap` from laptop dev server. ~5:30 to "Application startup complete." Inference works (`/v1/chat/completions` returns reasoning content via `reasoning` field). `--moe_backend=flashinfer_cutlass` confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").