Add Spark prerequisites section to runbook (spark-vllm-docker is upstream + Spark-side)

2026-05-12 10:05:17 -05:00
parent 53a0b01d88
commit 34bdbb7aba
1 changed files with 12 additions and 0 deletions
@@ -2,6 +2,18 @@

 Operating notes for running and maintaining the cluster via spark-control.

+## Prerequisites (per Spark)
+
+spark-control is a **controller**, not a runtime. Each Spark in your cluster must already have the upstream `eugr/spark-vllm-docker` project set up:
+
+1. Clone `https://github.com/eugr/spark-vllm-docker` to `~/spark-vllm-docker` on Spark 1 (the head node).
+2. Build the vLLM container: `./build-and-copy.sh -c` (on a cluster) or `./build-and-copy.sh` (solo).
+3. Pre-download any models you want in the catalog: `./hf-download.sh <repo> -c --copy-parallel`.
+4. Verify the setup: `./launch-cluster.sh status` should run without errors and report a cluster state that matches your hardware.
+5. Set up passwordless SSH from your Start9 server's spark-control container to each Spark (use the Show Public Key action — see README.md "Post-install setup").
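
Step 5 can be checked with a small script. The following is a minimal sketch, not part of spark-control or spark-vllm-docker: the host aliases `spark1` and `spark2`, the variable names, and both function names are hypothetical. It relies only on the standard OpenSSH options `BatchMode` and `ConnectTimeout`.

```shell
# Hypothetical host aliases; replace with the names or addresses of your Sparks.
SPARK_HOSTS="${SPARK_HOSTS:-spark1 spark2}"
# The ssh command is overridable so the check can be stubbed in a dry run.
SSH_CMD="${SSH_CMD:-ssh}"

# Try a key-based login to one host and run a no-op command there.
check_ssh() {
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so success implies that key-based login works.
    "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Check every host in SPARK_HOSTS; return non-zero if any login fails.
check_all_sparks() {
    failed=0
    for host in $SPARK_HOSTS; do
        if check_ssh "$host"; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    return "$failed"
}
```

Calling `check_all_sparks` from an environment that holds the controller's private key prints one line per host. A `FAILED` line usually means the public key has not yet been added to that Spark's `~/.ssh/authorized_keys`.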
+
+To share this package with someone who has a similar dual-DGX-Spark setup, have them complete the same per-Spark prerequisites, then sideload the `.s9pk` on their Start9 server and run the setup actions.
+
 ## Recent successful swaps

 - **2026-05-12 — gemma4 → qwen36** via `POST /api/swap` from laptop dev server. ~5:30 to "Application startup complete." Inference works (`/v1/chat/completions` returns reasoning content via `reasoning` field). `--moe_backend=flashinfer_cutlass` confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").