# spark-control runbook

Operating notes for running and maintaining the cluster via spark-control.

## Prerequisites (per Spark)

spark-control is a **controller**, not a runtime. Each Spark in your cluster must already have the upstream `eugr/spark-vllm-docker` project set up:

1. Clone `https://github.com/eugr/spark-vllm-docker` to `~/spark-vllm-docker` on Spark 1 (the head node).
2. Build the vLLM container: `./build-and-copy.sh -c` (on a cluster) or `./build-and-copy.sh` (solo).
3. Pre-download any models you want in the catalog: `./hf-download.sh <repo> -c --copy-parallel`.
4. Verify: `./launch-cluster.sh status` runs cleanly and reports the state you expect.
5. Set up passwordless SSH from your Start9 server's spark-control container to each Spark (use the Show Public Key action; see README.md "Post-install setup").

To share this package with someone who has a similar dual-DGX-Spark setup, have them complete the same per-Spark prerequisites, then sideload the `.s9pk` on their Start9 and run the setup actions.

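Step 5 can be sanity-checked from any shell that holds the same private key. The sketch below is not part of spark-control: `check_ssh` is a hypothetical helper and the targets are placeholders. `BatchMode=yes` makes `ssh` fail rather than prompt for a password, so a success means key-based login works.

```bash
# Hypothetical helper: report whether key-based (passwordless) SSH works
# for each user@host target. Not spark-control's own code.
check_ssh() {
  rc=0
  for target in "$@"; do
    # BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true 2>/dev/null; then
      echo "ok   $target"
    else
      echo "FAIL $target"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: `check_ssh '<spark-user>@<spark-1-host>' '<spark-user>@<spark-2-host>'`. The function returns non-zero if any target fails, so it can gate a script.
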
## Recent successful swaps

- **2026-05-12: gemma4 → qwen36** via `POST /api/swap` from the laptop dev server. About 5 min 30 s to "Application startup complete." Inference works (`/v1/chat/completions` returns reasoning content via the `reasoning` field). `--moe_backend=flashinfer_cutlass` confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").

## Optimization flags (added 2026-05-12)

Aligned `gemma4` and `qwen36` `vllm_args` with the project's sibling recipes (`qwen3.5-35b-a3b-fp8.yaml`, `gemma4-26b-a4b.yaml`):

- `--load-format=fastsafetensors`: faster cold-start weight load.
- `--enable-prefix-caching`: reuse cached prefix tokens (e.g. the system prompt) across requests.
- `--kv-cache-dtype=fp8`: store the KV cache in 8-bit floating point, halving KV-cache memory per active context compared with a 16-bit cache.

These take effect on the **next swap to that model**. If a swap fails after this change with errors mentioning fastsafetensors, prefix caching, or fp8, revert the entry in `models.yaml` and retry.

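When a swap fails after this change, a quick filter over the container log can show whether one of the new flags is implicated. This is a minimal sketch: `flag_related_lines` is a hypothetical helper, and the patterns are guesses at how such errors are worded, not strings taken from a real failure.

```bash
# Hypothetical helper: filter a vLLM log (read from stdin) down to the lines
# that mention one of the three new flags, prefixed with their line numbers.
# The patterns are assumptions, not strings copied from a real vLLM error.
flag_related_lines() {
  grep -Ein 'fastsafetensors|prefix[-_ ]caching|kv[-_ ]cache[-_ ]dtype|fp8'
}
```

Usage: `ssh <spark-user>@<spark-1-host> 'docker logs --tail 500 vllm_node 2>&1' | flag_related_lines`. A healthy startup may also mention these terms, so treat matches as a pointer to the relevant log lines rather than proof of the cause.
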
## Day-to-day

- The UI lives at `http://<your-start9>.local:9999` once the StartOS package is installed and configured.
- Status auto-refreshes every 5 s.
- A swap takes 3–6 minutes depending on the model. Keep the tab open if you can; if you close it, the swap continues, and reopening the UI re-attaches you to the log stream.

## matrix-bridge bot tile (optional)

If you run the matrix-bridge bot container on a Spark, set its SSH user in **Configure Sparks** (e.g. the user that owns `~/matrix-bridge`) and a tile appears under "Always-on services" with status, Update, Restart, Stop/Start, and View logs. Status is docker-state only (no HTTP health), so a `running` badge means the container is up, not necessarily that the bot is connected.

The **Update** button runs `git fetch && git reset --hard origin/<branch> && docker compose up -d --build` as that SSH user. For it to reach your git remote:

1. `~/matrix-bridge` must be a clone of the repo (not loose files). Gitignored secrets (`.env`, etc.) survive a `git reset --hard`.
2. If that user has more than one SSH key, pin the remote's key so git doesn't offer the wrong one first (a common `Permission denied (publickey)` cause). In the user's `~/.ssh/config`:

   ```
   Host <your-git-host>
     Port <port>
     IdentityFile ~/.ssh/id_ed25519
     IdentitiesOnly yes
   ```

3. Spark Control's own package key must be authorized for that SSH user (Show Public Key → add to their `authorized_keys`) unless it's the same user Spark Control already uses for that Spark.

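Before pressing Update, the remote can be checked with a read-only git command run as that SSH user. The sketch below assumes the remote is named `origin` (as in the Update command above); `check_update_remote` is a hypothetical helper, not something spark-control ships.

```bash
# Hypothetical pre-check for the Update button: can this user reach the git
# remote, and does the branch exist there? Read-only: ls-remote never touches
# the working tree.
check_update_remote() {
  repo_dir=$1
  branch=$2
  # --exit-code makes git exit with status 2 when no matching ref is found.
  if git -C "$repo_dir" ls-remote --exit-code origin "refs/heads/$branch" >/dev/null 2>&1; then
    echo "remote reachable, branch '$branch' found"
  else
    echo "cannot reach origin, or branch '$branch' is missing" >&2
    return 1
  fi
}
```

Usage: `check_update_remote ~/matrix-bridge <branch>`. If it fails with a key problem, `GIT_SSH_COMMAND='ssh -v' git -C ~/matrix-bridge ls-remote origin` shows which identities ssh offers.
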
## Configurable topology (v0.24.0+)

For a cluster wired differently from the reference layout, there are three optional knobs in **Configure Sparks** (no fork needed):

- **vLLM container name**: defaults to `vllm_node`. Set it if your swappable vLLM on Spark 1 runs under a different container name; the swap log-tail and the pre-flight validator `docker exec` into it by name.
- **Services to hide**: a comma-separated list of service keys, e.g. `parakeet,kokoro,embeddings,qdrant`. Hidden services show no tile and are never probed (status, deep-health, or connectivity log). Use this when a service you don't run would otherwise be probed at a port something else answers, e.g. a vLLM on port 8000 colliding with Parakeet's default.
- **Monitor a second vLLM**: the swap machinery only drives the Spark 1 vLLM, but you can *monitor* a vLLM on another Spark by adding a custom service of `kind: vllm` to `/data/services-overrides.yaml`:

  ```yaml
  custom:
    - key: vllm-spark2
      kind: vllm
      host: <spark-2-ip>
      user: <ssh-user>
      container: vllm_node
      port: 8000
  ```

  It gets a monitor-only tile (no swap): loaded model (via `/v1/models`), container state, and start/stop/restart. (Spark Control's SSH key must be authorized for that user; see Show Public Key.)

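To confirm by hand what such a tile should show, the container state can be read over SSH. This is a sketch under assumptions: `remote_container_state` is a hypothetical helper, not spark-control's probe, and it only assumes Docker is available to the SSH user on the remote Spark.

```bash
# Hypothetical manual check of the "container state" half of the tile.
# Not spark-control's probe code.
remote_container_state() {
  target=$1                    # <ssh-user>@<spark-2-ip>
  container=${2:-vllm_node}    # default matches the example entry above
  # docker inspect prints e.g. "running" or "exited" for .State.Status.
  # Any SSH or docker failure falls through to "unknown".
  ssh -o BatchMode=yes "$target" \
    "docker inspect -f '{{.State.Status}}' '$container'" 2>/dev/null ||
    echo "unknown"
}
```

Usage: `remote_container_state '<ssh-user>@<spark-2-ip>' vllm_node`. The loaded model can be checked separately with `curl -s http://<spark-2-ip>:8000/v1/models | jq -r '.data[].id'`.
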
## Adding a new model

1. Add an entry to `image/models.yaml`. Required fields: `display_name`, `repo`, `size_gb`, `mode` (`solo` or `cluster`), `vllm_args`. Optional but recommended: `description` (one paragraph covering what the model is, what it's good for, and how it differs from others; renders below the meta tags in each card), `capabilities` (tags like `[vision, reasoning, tools]`), `expected_ready_seconds`.
2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host> 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
3. Rebuild + redeploy the package: `cd package && make x86 && make install`.

If `description` is omitted, the card simply hides that section, so there is no need to populate it for every model. Keep descriptions generic (not user-specific) so the catalog stays portable.

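For step 2, it helps to know which directory name to look for. The Hugging Face hub cache stores each repo under `models--<org>--<name>`; the hypothetical helper below derives that name from a repo ID so the check can target one model instead of listing the whole cache.

```bash
# Map a Hugging Face repo ID (org/name) to its directory name in the hub cache.
# The models--<org>--<name> layout is the huggingface_hub cache convention;
# the helper name itself is made up for this example.
hf_cache_dirname() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}
```

Usage: `ssh <spark-user>@<spark-1-host> "ls -d ~/.cache/huggingface/hub/$(hf_cache_dirname '<repo>')"`. A "No such file or directory" error means the weights still need downloading.
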
### Local / fine-tuned models (v0.23.0+)

For a model that lives as a directory on a Spark (e.g. a LoRA-merged fine-tune) rather than in an HF repo, use the **"+ Add local model"** button under LLM swap (or a `custom:` entry with `local_path` instead of `repo` in the override YAML). The directory must already exist on the Spark; only its parent dir is mounted, so a `--chat-template` must live **inside** `local_path`.

**Load-bearing contract:** on swap, spark-control prefixes the launch with `VLLM_SPARK_EXTRA_DOCKER_ARGS="-v <path>:<path>"` so `launch-cluster.sh` bind-mounts the dir into the vLLM container at the same path. This relies on the upstream `eugr/spark-vllm-docker` `launch-cluster.sh` expanding `$VLLM_SPARK_EXTRA_DOCKER_ARGS` **unquoted** into its `docker run` (verified against the on-Spark script 2026-06-17: line ~11 appends it to `DOCKER_ARGS`, used unquoted in `docker run`). If a future upstream version quotes that variable, local-model mounts would silently fail; re-check this before pulling `launch-cluster.sh` updates.

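The quoting dependency is plain shell word splitting, which the following self-contained demo illustrates. `count_args` stands in for `docker run`, and the path is a made-up example; nothing here is taken from `launch-cluster.sh` itself.

```bash
# Demonstrates why the contract depends on an UNQUOTED expansion.
# count_args is a stand-in for `docker run`; the path is a made-up example.
count_args() { echo "$#"; }

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /models/my-finetune:/models/my-finetune"

# Unquoted: the shell splits on the space, yielding "-v" and the mount spec.
unquoted=$(count_args $VLLM_SPARK_EXTRA_DOCKER_ARGS)

# Quoted: a single argument, so the flag and its value are no longer separate.
quoted=$(count_args "$VLLM_SPARK_EXTRA_DOCKER_ARGS")

echo "unquoted=$unquoted quoted=$quoted"   # prints: unquoted=2 quoted=1
```

The same splitting means a `local_path` containing spaces would be broken into extra arguments. To re-check the upstream script after an update, `grep -n 'VLLM_SPARK_EXTRA_DOCKER_ARGS' ~/spark-vllm-docker/launch-cluster.sh` lists every place the variable is used.
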
## Manual swap fallback

If the UI is unavailable and you need to swap by hand:

```bash
# Log in to Spark 1; the remaining commands run in that session.
ssh <spark-user>@<spark-1-host>
cd ~/spark-vllm-docker

# Stop the current model, then launch the replacement.
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --reasoning-parser gemma4 \
  --tool-call-parser gemma4 --enable-auto-tool-choice

docker logs -f vllm_node   # wait for "Application startup complete."
```

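Instead of watching the log, readiness can be polled over HTTP. This is a minimal sketch: `wait_for_vllm` is a hypothetical helper, its defaults (port 8888, 600 s timeout) are assumptions based on the examples in this runbook, and it treats a successful `/v1/models` response as a proxy for "Application startup complete."

```bash
# Hypothetical readiness poll: returns 0 once /v1/models answers, 1 on timeout.
# Defaults (port 8888, 600 s, 10 s interval) are assumptions, not project values.
wait_for_vllm() {
  host=$1
  port=${2:-8888}
  timeout=${3:-600}
  interval=${4:-10}
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -f: treat HTTP errors as failure; --max-time: don't hang on a dead host.
    if curl -sf --max-time 5 "http://$host:$port/v1/models" >/dev/null; then
      echo "ready after ~${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "not ready after ${timeout}s" >&2
  return 1
}
```

Usage: `wait_for_vllm <spark-1-ip> 8888 600`. The reported time counts only the sleep intervals, so it is approximate.
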
## Sideload (`make install`) can't reach the server

Symptom: `make install` fails with `package.sideload: error sending request for url (https://immense-voyage.local/rpc/v1)`. Cause seen 2026-06-17: `immense-voyage.local` stopped resolving via mDNS from the Mac (`curl https://immense-voyage.local/...` → exit 6, "couldn't resolve host"), even though the server is up: `curl -sk https://<server-ip>/rpc/v1` returns 200.

- **Don't** work around it with `start-cli -H https://<server-ip> package install`: TLS connects but it returns `UNAUTHORIZED`, because start-cli's stored credential is bound to the registered `.local` host, not the IP.
- **Fix:** make the name resolve again, then re-run `make install`:
  - `sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder` (flush mDNS), or
  - `echo "<server-ip> immense-voyage.local" | sudo tee -a /etc/hosts` (deterministic; remove later).

Note that this only blocks installing to *your own* Start9; building and publishing the s9pk to Gitea Releases is unaffected (adopters still pull the latest).

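The triage above hinges on curl's exit code, so a small lookup can speed up the next diagnosis. The codes below are curl's documented exit statuses; `explain_curl_exit` and its wording are hypothetical, written for this example.

```bash
# Hypothetical triage helper: translate a curl exit code into the likely cause.
# The codes are curl's documented exit statuses; the messages are ours.
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    6)  echo "name did not resolve (DNS/mDNS)" ;;
    7)  echo "resolved, but could not connect" ;;
    28) echo "timed out" ;;
    60) echo "TLS certificate not trusted" ;;
    *)  echo "other curl error $1" ;;
  esac
}
```

Usage: `curl -s -o /dev/null --max-time 5 https://immense-voyage.local/rpc/v1; explain_curl_exit $?`. Exit 6 matches the mDNS failure described above; any other result means the name resolved and the problem lies elsewhere.
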
## Diagnostics

```bash
# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .

# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'

# Tail current model's logs
ssh <spark-user>@<spark-1-host> 'docker logs --tail 200 -f vllm_node'

# Parakeet
curl -s http://<spark-2-ip>:8000/health

# Kokoro TTS (v0.14.0+)
curl -s http://<spark-2-ip>:8880/health
```

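The HTTP checks above can be run in one pass. The sketch below is a hypothetical convenience wrapper, not something spark-control provides; it takes `name=url` pairs and reports each endpoint as up or down based only on whether curl gets a successful HTTP response.

```bash
# Hypothetical one-shot sweep over HTTP health URLs.
# Each argument is name=url; prints "<name>: up" or "<name>: down".
# Returns non-zero if any endpoint is down.
sweep_endpoints() {
  failures=0
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    url=${pair#*=}     # text after the first "="
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "$name: up"
    else
      echo "$name: down"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}
```

Usage: `sweep_endpoints vllm=http://<spark-1-ip>:8888/v1/models parakeet=http://<spark-2-ip>:8000/health kokoro=http://<spark-2-ip>:8880/health`.
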
## Hard reset

If `launch-cluster.sh` gets stuck:

```bash
ssh <spark-user>@<spark-1-host>
cd ~/spark-vllm-docker
./launch-cluster.sh stop
# Force-removes EVERY container on this host, running or stopped.
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model
```
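If other containers on the same Spark should survive the reset, a narrower variant removes only containers whose name matches a filter. This is a sketch under the assumption that the stuck container is the one named `vllm_node`; `remove_matching_containers` is a hypothetical helper.

```bash
# Hypothetical narrower reset: remove only containers whose name matches a
# filter, leaving unrelated containers alone. Default filter is vllm_node.
remove_matching_containers() {
  filter=${1:-vllm_node}
  # --filter name= matches substrings, so "vllm_node" also matches "vllm_node_2".
  ids=$(docker ps -aq --filter "name=$filter")
  if [ -z "$ids" ]; then
    echo "no containers match '$filter'"
    return 0
  fi
  # Word splitting of $ids is intentional: one container ID per argument.
  docker rm -f $ids
}
```

Usage, on the Spark: `remove_matching_containers vllm_node`, then relaunch your preferred model as usual.
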