# spark-control runbook

Operating notes for running and maintaining the cluster via spark-control.

## Prerequisites (per Spark)

spark-control is a **controller**, not a runtime. Each Spark in your cluster must already have the upstream `eugr/spark-vllm-docker` project set up:

1. Clone `https://github.com/eugr/spark-vllm-docker` to `~/spark-vllm-docker` on Spark 1 (the head node).
2. Build the vLLM container: `./build-and-copy.sh -c` (on a cluster) or `./build-and-copy.sh` (solo).
3. Pre-download any models you want on the menu: `./hf-download.sh <repo> -c --copy-parallel`.
4. Verify that `./launch-cluster.sh status` runs without errors and reports the container state.
5. Set up passwordless SSH from your Start9 server's spark-control container to each Spark (use the Show Public Key action; see README.md "Post-install setup").

To share this package with someone who has a similar dual-DGX-Spark setup, have them complete the same per-Spark prerequisites, then sideload the `.s9pk` on their Start9 and run the setup actions.

## Recent successful swaps

- **2026-05-12: gemma4 → qwen36** via `POST /api/swap` from the laptop dev server. About 5 min 30 s to "Application startup complete." Inference works (`/v1/chat/completions` returns reasoning content via the `reasoning` field). `--moe_backend=flashinfer_cutlass` confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").

## Optimization flags (added 2026-05-12)

Aligned `gemma4` and `qwen36` `vllm_args` with the project's sibling recipes (`qwen3.5-35b-a3b-fp8.yaml`, `gemma4-26b-a4b.yaml`):

- `--load-format=fastsafetensors`: faster cold-start weight load.
- `--enable-prefix-caching`: reuse cached prefix tokens (e.g. system prompt) across requests.
- `--kv-cache-dtype=fp8`: store the KV cache in 8-bit floating point, halving the memory used per active context relative to a 16-bit cache.

These take effect on the **next swap to that model**. If a swap fails after this change with errors mentioning fastsafetensors/prefix-caching/fp8, revert the entry in `models.yaml` and retry.

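
As a rough illustration of the fp8 saving described above, the KV cache for one sequence can be estimated as 2 × layers × KV heads × head dimension × tokens × bytes per element. The sketch below assumes a plain attention layout and a hypothetical model shape (the numbers are not taken from gemma4 or qwen36); `kv_cache_mib` is an illustrative helper, not part of spark-control.

```bash
#!/usr/bin/env bash
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS BYTES_PER_ELEMENT
# Prints the estimated KV cache size in MiB for one sequence of that length.
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 tokens=$4 bytes=$5
  # Factor of 2: one key tensor and one value tensor per layer.
  echo $(( 2 * layers * kv_heads * head_dim * tokens * bytes / 1024 / 1024 ))
}

# Hypothetical model shape, chosen only to make the arithmetic concrete.
kv_cache_mib 48 8 128 32768 2   # 16-bit KV cache
kv_cache_mib 48 8 128 32768 1   # fp8 KV cache
```

With these assumed numbers the estimate drops from 6144 MiB to 3072 MiB for a 32768-token context, which is the halving the flag description refers to.
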
## Day-to-day

- The UI lives at `http://<your-start9>.local:9999` once the StartOS package is installed and configured.
- Status auto-refreshes every 5 s.
- A swap takes 3–6 minutes depending on the model. Keep the tab open if you can; if you close it, the swap continues, and reopening the page re-attaches you to the log stream.

## matrix-bridge bot tile (optional)

If you run the matrix-bridge bot container on a Spark, set its SSH user in **Configure Sparks** (e.g. the user that owns `~/matrix-bridge`) and a tile appears under "Always-on services" with status, Update, Restart, Stop/Start, and View logs. Status is docker-state only (no HTTP health), so a `running` badge means the container is up, not necessarily that the bot is connected.

The **Update** button runs `git fetch && git reset --hard origin/<branch> && docker compose up -d --build` as that SSH user. For it to reach your git remote:

1. `~/matrix-bridge` must be a clone of the repo (not loose files). Gitignored secrets (`.env`, etc.) survive a `git reset --hard`.
2. If that user has more than one SSH key, pin the remote's key so git doesn't offer the wrong one first (a common `Permission denied (publickey)` cause). In the user's `~/.ssh/config`:

   ```
   Host <your-git-host>
     Port <port>
     IdentityFile ~/.ssh/id_ed25519
     IdentitiesOnly yes
   ```

3. Spark Control's own package key must be authorized for that SSH user (Show Public Key → add to their `authorized_keys`) unless it's the same user Spark Control already uses for that Spark.

## Configurable topology (v0.24.0+)

For a cluster wired differently from the reference layout, there are three optional knobs in **Configure Sparks** (no fork needed):

- **vLLM container name**: defaults to `vllm_node`. Set it if your swappable vLLM on Spark 1 runs under a different container name; the swap log-tail and the pre-flight validator `docker exec` into it by name.
- **Services to hide**: a comma-separated list drawn from `parakeet,kokoro,embeddings,qdrant`. Hidden services show no tile and are never probed (status, deep-health, or connectivity log). Use this when a service you don't run would otherwise be probed at a port something else answers, e.g. a vLLM on port 8000 colliding with Parakeet's default.
- **Monitor a second vLLM**: the swap machinery only drives the Spark 1 vLLM, but you can *monitor* a vLLM on another Spark by adding a custom service of `kind: vllm` to `/data/services-overrides.yaml`:

  ```yaml
  custom:
    - key: vllm-spark2
      kind: vllm
      host: <spark-2-ip>
      user: <ssh-user>
      container: vllm_node
      port: 8000
  ```

  It gets a read-only tile: loaded model (via `/v1/models`), container state, and start/stop/restart. (Spark Control's SSH key must be authorized for that user; see Show Public Key.)

## Adding a new model

The menu is whatever's downloaded on the Sparks, so the normal path is just:
**download it, then set it up once.**

1. **Download** from the dashboard (**+ Download a new model**, paste the HF repo) or on Spark 1 with `./hf-download.sh <repo>`. When it finishes it appears on the menu by itself.
2. **Set it up.** If Spark Control already has a recipe for it (see below), it's ready to switch to. Otherwise it shows a **"needs setup"** card: the first switch reads the model's `config.json`, proposes how to launch it (family/parsers, solo vs cluster, vLLM flags), and you confirm once. The confirmed recipe persists to `/data/models-overrides.yaml` (survives package updates).

### Bundling a launch recipe (optional; skips the setup prompt)

To make a known model launch correctly the instant it's downloaded, add a *recipe* to `image/models.yaml`. These are **not** the menu; they're matched to an on-disk model by `repo`. Required: `display_name`, `repo`, `size_gb`, `mode` (`solo`/`cluster`), `vllm_args`. Optional: `description`, `capabilities` (e.g. `[vision, reasoning, tools]`), `expected_ready_seconds`. Then rebuild + redeploy: `cd package && make x86 && make install`. Keep descriptions generic (not user-specific) so the recipes stay portable.

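
Before rebuilding, the required-field list above can be checked with a quick text scan. The sketch below is only an illustration: `missing_recipe_fields` is a hypothetical helper, the sample entry uses an assumed flat layout with placeholder values (the real structure of `image/models.yaml` may differ), and the check matches `field:` as plain text without parsing YAML.

```bash
#!/usr/bin/env bash
# missing_recipe_fields FILE: prints each required recipe field that FILE lacks.
# FILE is assumed to hold a single recipe entry.
missing_recipe_fields() {
  local field
  for field in display_name repo size_gb mode vllm_args; do
    grep -Eq "^[[:space:]-]*${field}:" "$1" || echo "$field"
  done
}

# Sample entry with an assumed layout and placeholder values; size_gb is
# left out on purpose so the check has something to report.
recipe_file=$(mktemp)
cat > "$recipe_file" <<'EOF'
display_name: Example model
repo: example-org/example-model
mode: solo
vllm_args: --max-model-len 32768
EOF

missing_recipe_fields "$recipe_file"   # prints: size_gb
rm -f "$recipe_file"
```

An empty result means all five required fields are present; it says nothing about whether the values themselves are valid.
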
### Local / fine-tuned models (v0.23.0+)

For a model that lives as a directory on a Spark (e.g. a LoRA-merged fine-tune) instead of an HF repo, use the **"+ Add local model"** button under LLM swap (or a `custom:` entry with `local_path` instead of `repo` in the override YAML). The directory must already exist on the Spark; only its parent dir is mounted, so a `--chat-template` must live **inside** `local_path`.

**Load-bearing contract:** on swap, spark-control prefixes the launch with `VLLM_SPARK_EXTRA_DOCKER_ARGS="-v <path>:<path>"` so `launch-cluster.sh` bind-mounts the dir into the vLLM container at the same path. This relies on the upstream `eugr/spark-vllm-docker` `launch-cluster.sh` expanding `$VLLM_SPARK_EXTRA_DOCKER_ARGS` **unquoted** into its `docker run` (verified against the on-Spark script 2026-06-17: line ~11 appends it to `DOCKER_ARGS`, used unquoted in `docker run`). If a future upstream version quotes that variable, local-model mounts would silently fail; re-check this before pulling `launch-cluster.sh` updates.

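
Why the unquoted expansion matters can be seen with a small standalone demonstration of shell word splitting. The path below is a hypothetical placeholder and `count_args` is an illustrative helper; nothing here calls `launch-cluster.sh` or Docker.

```bash
#!/usr/bin/env bash
# count_args prints how many arguments it received.
count_args() { echo "$#"; }

# Hypothetical path, in the "-v <path>:<path>" form described above.
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /models/my-finetune:/models/my-finetune"

# Unquoted: the shell splits the value into two arguments, "-v" and "<path>:<path>".
unquoted=$(count_args $VLLM_SPARK_EXTRA_DOCKER_ARGS)

# Quoted: a single argument containing a space, not a flag plus its value.
quoted=$(count_args "$VLLM_SPARK_EXTRA_DOCKER_ARGS")

echo "unquoted=$unquoted quoted=$quoted"   # prints: unquoted=2 quoted=1
```

To re-check the upstream script after an update, `grep -n VLLM_SPARK_EXTRA_DOCKER_ARGS ~/spark-vllm-docker/launch-cluster.sh` on Spark 1 shows where the variable is used and whether it has gained quotes.
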
## Manual swap fallback

If the UI is unavailable and you need to swap by hand:

```bash
ssh <spark-user>@<spark-1-host>
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --reasoning-parser gemma4 \
  --tool-call-parser gemma4 --enable-auto-tool-choice
docker logs -f vllm_node  # wait for "Application startup complete."
```

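
Instead of watching the log by eye, the wait can be scripted by polling the API until it answers. This is a minimal sketch: `wait_until` and `vllm_ready` are hypothetical helpers, and it assumes that `/v1/models` on port 8888 (the port used in the manual launch above) only responds once startup has completed.

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Re-runs CMD until it succeeds; returns 1 once TIMEOUT_S seconds have passed.
wait_until() {
  local timeout_s=$1 interval_s=$2
  shift 2
  local deadline=$(( SECONDS + timeout_s ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval_s"
  done
}

# vllm_ready HOST: succeeds once the OpenAI-compatible endpoint answers.
vllm_ready() {
  curl -sf --max-time 5 "http://$1:8888/v1/models" >/dev/null
}

# Usage (replace the placeholder with Spark 1's address):
#   wait_until 600 10 vllm_ready <spark-1-ip> && echo "vLLM is serving"
```

The probe is passed in as a command, so the same loop can wrap any other check, such as the health endpoints listed under Diagnostics.
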
## Sideload (`make install`) can't reach the server

Symptom: `make install` fails with `package.sideload: error sending request for url (https://immense-voyage.local/rpc/v1)`. Cause seen 2026-06-17: `immense-voyage.local` stopped resolving via mDNS from the Mac (`curl https://immense-voyage.local/...` → exit 6, "couldn't resolve host"), even though the server is up: `curl -sk https://<server-ip>/rpc/v1` returns 200.

- **Don't** work around it with `start-cli -H https://<server-ip> package install`: TLS connects but it returns `UNAUTHORIZED`, because start-cli's stored credential is bound to the registered `.local` host, not the IP.
- **Fix:** make the name resolve again, then re-run `make install`:
  - `sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder` (flush mDNS), or
  - `echo "<server-ip> immense-voyage.local" | sudo tee -a /etc/hosts` (deterministic; remove later).

Note that this only blocks installing to *your own* Start9; building and publishing the s9pk to Gitea Releases is unaffected (adopters still pull the latest).

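
The resolution failure described above can be told apart from other connection problems by curl's exit code. The helper below is a hypothetical sketch, not part of the package tooling; it treats only exit code 6 as "name does not resolve" and makes no judgment about timeouts or TLS errors.

```bash
#!/usr/bin/env bash
# name_resolves HOST: fails only when curl exits with code 6
# (could not resolve host). Any other outcome, including a TLS or HTTP
# error, means the name itself resolved.
name_resolves() {
  local rc=0
  curl -sk -o /dev/null --max-time 10 "https://$1/" || rc=$?
  [ "$rc" -ne 6 ]
}

# Usage:
#   name_resolves immense-voyage.local && make install
```

If the check fails, apply one of the two fixes above and run it again before retrying the install.
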
## Diagnostics

```bash
# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .

# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'

# Tail current model's logs
ssh <spark-user>@<spark-1-host> 'docker logs --tail 200 -f vllm_node'

# Parakeet
curl -s http://<spark-2-ip>:8000/health

# Kokoro TTS (v0.14.0+)
curl -s http://<spark-2-ip>:8880/health
```

## Hard reset

If `launch-cluster.sh` gets stuck:

```bash
ssh <spark-user>@<spark-1-host>
cd ~/spark-vllm-docker
./launch-cluster.sh stop
# Force-remove every container on this host, including any non-vLLM
# services running here; they need to be started again afterwards.
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model
```