Keysat df9f244eae v0.26.0:0 - disk-driven model menu (scan sparks; recipes; needs-setup)

The dashboard menu is now the set of models actually downloaded on the
Sparks, not a hard-coded catalog. models.yaml + overrides are reframed as
launch recipes matched to an on-disk model by repo; an on-disk model with
no recipe is flagged needs_setup and its launch settings are inferred from
its config.json for a one-time operator confirmation (discovery.py).

- delete now removes weights AND the menu card (delete_from_disk sweeps all
  hosts; the delete endpoint resolves keys via the live menu)
- new GET /api/models/suggest; /api/models returns the menu + a recipes list
  (download autocomplete); GET /api/models/disk-status removed
- dropped the two legacy Qwen recipes (235B FP8, 2.5 72B)
- tests: +test_discovery.py (cache parsing, infer_recipe, build_menu merge)

2026-06-18 11:09:56 -05:00


spark-control runbook

Operating notes for running and maintaining the cluster via spark-control.

Prerequisites (per Spark)

spark-control is a controller, not a runtime. Each Spark in your cluster must already have the upstream eugr/spark-vllm-docker project set up:

- Clone https://github.com/eugr/spark-vllm-docker to ~/spark-vllm-docker on Spark 1 (the head node).
- Build the vLLM container: ./build-and-copy.sh -c (on a cluster) or ./build-and-copy.sh (solo).
- Pre-download any models you want on the menu: ./hf-download.sh <repo> -c --copy-parallel.
- Verify that ./launch-cluster.sh status returns sensible output.
- Set up passwordless SSH from your Start9 server's spark-control container to each Spark (use the Show Public Key action; see README.md "Post-install setup").
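A minimal sketch for spot-checking the prerequisites above. The helper names check_ssh and show_cluster_status are illustrative and not part of spark-control. The SSH check only proves that the key of whichever account runs it is accepted, so to validate Spark Control's own key it would have to run with that key.

```shell
# Hypothetical helpers; not part of spark-control or spark-vllm-docker.

# Succeeds only if key-based login works without any password prompt.
# $1 = SSH target, e.g. <spark-user>@<spark-1-host>
check_ssh() {
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1; then
    echo "ok   $1"
  else
    echo "FAIL $1"
    return 1
  fi
}

# Prints the head node's cluster status for a human to read.
# $1 = SSH target for Spark 1
show_cluster_status() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" \
    'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
}

# Usage (placeholders as elsewhere in this runbook):
#   check_ssh <spark-user>@<spark-1-host> && show_cluster_status <spark-user>@<spark-1-host>
```

BatchMode=yes makes ssh fail instead of prompting, which is what distinguishes a working key from a password fallback.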

To share this package with someone who has a similar dual-DGX-Spark setup, have them complete the same per-Spark prerequisites, then sideload the .s9pk on their Start9 and run the setup actions.

Recent successful swaps

2026-05-12 — gemma4 → qwen36 via POST /api/swap from laptop dev server. ~5:30 to "Application startup complete." Inference works (/v1/chat/completions returns reasoning content via reasoning field). --moe_backend=flashinfer_cutlass confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").

Optimization flags (added 2026-05-12)

Aligned gemma4 and qwen36 vllm_args with the project's sibling recipes (qwen3.5-35b-a3b-fp8.yaml, gemma4-26b-a4b.yaml):

- --load-format=fastsafetensors: faster cold-start weight load.
- --enable-prefix-caching: reuse cached prefix tokens (e.g. the system prompt) across requests.
- --kv-cache-dtype=fp8: store the KV cache in 8-bit floating point, roughly halving KV-cache memory per active context compared with a 16-bit cache.

These take effect on the next swap to that model. If a swap fails after this change with errors mentioning fastsafetensors/prefix-caching/fp8, revert the entry in models.yaml and retry.
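To decide whether a failed swap is related to these flags, one option is to filter the container log for error lines that mention them. This is a sketch: the helper name flag_related_errors is hypothetical, and the patterns are a guess at how such errors would be worded, so an empty result does not prove the flags are innocent.

```shell
# Hypothetical helper: reads log text on stdin and prints only lines that
# look like errors AND mention one of the three optimization flags.
flag_related_errors() {
  grep -Ei 'error|exception|traceback' |
    grep -Ei 'fastsafetensors|prefix[-_]caching|kv[-_]cache|fp8'
}

# Usage (placeholders as elsewhere in this runbook):
#   ssh <spark-user>@<spark-1-host> 'docker logs --tail 500 vllm_node 2>&1' | flag_related_errors
```

The function exits non-zero when nothing matches, so it can gate a revert step in a script.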

Day-to-day

- The UI lives at http://<your-start9>.local:9999 once the StartOS package is installed and configured.
- Status auto-refreshes every 5 s.
- A swap takes 3–6 minutes depending on the model. Closing the tab does not cancel it: the swap continues, and reopening the UI re-attaches you to the log stream.

matrix-bridge bot tile (optional)

If you run the matrix-bridge bot container on a Spark, set its SSH user in Configure Sparks (e.g. the user that owns ~/matrix-bridge) and a tile appears under "Always-on services" with status, Update, Restart, Stop/Start, and View logs. Status is docker-state only (no HTTP health), so a running badge means the container is up, not necessarily that the bot is connected.

The Update button runs git fetch && git reset --hard origin/<branch> && docker compose up -d --build as that SSH user. For it to reach your git remote:

- ~/matrix-bridge must be a clone of the repo (not loose files). Gitignored secrets (.env, etc.) survive a git reset --hard.
- If that user has more than one SSH key, pin the remote's key so git doesn't offer the wrong one first (a common Permission denied (publickey) cause). In the user's ~/.ssh/config:
```
Host <your-git-host>
    Port <port>
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
```
- Spark Control's own package key must be authorized for that SSH user (Show Public Key → add to their authorized_keys) unless it's the same user Spark Control already uses for that Spark.
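To confirm that the pinned key is what ssh will actually use for the git host, you can ask OpenSSH to print its resolved configuration with ssh -G. A sketch, run as the bot's SSH user on the Spark; the helper name resolved_identity is hypothetical.

```shell
# Hypothetical helper: prints the identity settings ssh resolves for a host.
# $1 = the host name exactly as git will use it (e.g. the Host entry above)
resolved_identity() {
  # `ssh -G` prints the effective config as lowercase "keyword value" lines
  # without opening a connection.
  ssh -G "$1" | awk '$1 == "identityfile" || $1 == "identitiesonly" { print $1, $2 }'
}

# Usage:
#   resolved_identity <your-git-host>
```

If more than one identityfile line appears, or identitiesonly is not yes, the Host block is probably not matching the name in the git remote URL.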

Configurable topology (v0.24.0+)

For a cluster wired differently from the reference layout, there are three optional knobs in Configure Sparks (no fork needed):

- vLLM container name: defaults to vllm_node. Set it if your swappable vLLM on Spark 1 runs under a different container name; the swap log-tail and the pre-flight validator docker exec into it by name.
- Services to hide: a comma-separated list drawn from parakeet, kokoro, embeddings, qdrant. Hidden services show no tile and are never probed (status, deep-health, or connectivity log). Use this when a service you don't run would otherwise be probed at a port something else answers, e.g. a vLLM on port 8000 colliding with Parakeet's default.
- Monitor a second vLLM: the swap machinery only drives the Spark 1 vLLM, but you can monitor a vLLM on another Spark by adding a custom service of kind: vllm to /data/services-overrides.yaml:
```
custom:
  - key: vllm-spark2
    kind: vllm
    host: <spark-2-ip>
    user: <ssh-user>
    container: vllm_node
    port: 8000
```
It gets a read-only tile: loaded model (via /v1/models), container state, and start/stop/restart. (Spark Control's SSH key must be authorized for that user — Show Public Key.)

Adding a new model

The menu is whatever's downloaded on the Sparks, so the normal path is just: download it, then set it up once.

- Download from the dashboard (+ Download a new model, paste the HF repo) or on Spark 1 with ./hf-download.sh <repo>. When the download finishes, the model appears on the menu automatically.
- Set it up. If Spark Control already has a recipe for it (see below), it's ready to switch to. Otherwise it shows a "needs setup" card: the first switch reads the model's config.json, proposes how to launch it (family/parsers, solo vs cluster, vLLM flags), and you confirm once. The confirmed recipe persists to /data/models-overrides.yaml (survives package updates).

Bundling a launch recipe (optional — skips the setup prompt)

To make a known model launch correctly the instant it's downloaded, add a recipe to image/models.yaml. These are not the menu — they're matched to an on-disk model by repo. Required: display_name, repo, size_gb, mode (solo/cluster), vllm_args. Optional: description, capabilities (e.g. [vision, reasoning, tools]), expected_ready_seconds. Then rebuild + redeploy: cd package && make x86 && make install. Keep descriptions generic (not user-specific) so the recipes stay portable.

Local / fine-tuned models (v0.23.0+)

For a model that lives as a directory on a Spark (e.g. a LoRA-merged fine-tune) rather than as an HF repo, use the "+ Add local model" button under LLM swap (or a custom: entry with local_path instead of repo in the override YAML). The directory must already exist on the Spark; only its parent dir is mounted, so a --chat-template must live inside local_path.
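The constraint that a chat template must live inside local_path can be checked before adding the model. A minimal sketch with a hypothetical helper, path_inside; it compares path strings only and does not resolve symlinks or .. components, and the example paths are placeholders.

```shell
# Hypothetical helper: succeeds if $2 is a path strictly inside directory $1.
# Pure string comparison; symlinks and ".." are not resolved.
path_inside() {
  dir=${1%/}                 # tolerate a trailing slash on the directory
  case "$2" in
    "$dir"/?*) return 0 ;;   # must continue with "/" plus at least one char
    *) return 1 ;;
  esac
}

# Usage (example paths are made up):
#   path_inside /models/my-finetune /models/my-finetune/chat_template.jinja \
#     || echo "chat template is outside local_path and will not be visible in the container"
```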

Load-bearing contract: on swap, spark-control prefixes the launch with VLLM_SPARK_EXTRA_DOCKER_ARGS="-v <path>:<path>" so launch-cluster.sh bind-mounts the dir into the vLLM container at the same path. This relies on the upstream eugr/spark-vllm-docker launch-cluster.sh expanding $VLLM_SPARK_EXTRA_DOCKER_ARGS unquoted into its docker run (verified against the on-Spark script 2026-06-17: line ~11 appends it to DOCKER_ARGS, used unquoted in docker run). If a future upstream version quotes that variable, local-model mounts would silently fail — re-check this before pulling launch-cluster.sh updates.
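Why the unquoted expansion matters can be shown with plain shell word splitting: unquoted, the value becomes two arguments (-v and the mount spec); quoted, it stays one argument that docker run would not read as a mount. The sketch below also includes a small helper for re-checking the upstream script. count_args and show_docker_args_usage are hypothetical names, and the example path is made up.

```shell
# Hypothetical demo helper: prints how many arguments it received.
count_args() { echo "$#"; }

EXTRA_ARGS='-v /models/my-finetune:/models/my-finetune'   # example path only

count_args $EXTRA_ARGS     # unquoted: split on the space, prints 2
count_args "$EXTRA_ARGS"   # quoted: one single argument, prints 1

# Hypothetical helper: list every line of a script that touches DOCKER_ARGS
# (this also matches VLLM_SPARK_EXTRA_DOCKER_ARGS), with line numbers, so
# you can see by eye whether the expansions are still unquoted.
# $1 = path to launch-cluster.sh
show_docker_args_usage() {
  grep -n 'DOCKER_ARGS' "$1"
}

# Usage:
#   show_docker_args_usage ~/spark-vllm-docker/launch-cluster.sh
```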

Manual swap fallback

If the UI is unavailable and you need to swap by hand:

```
ssh <spark-user>@<spark-1-host>
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --reasoning-parser gemma4 \
  --tool-call-parser gemma4 --enable-auto-tool-choice
docker logs -f vllm_node      # wait for "Application startup complete."
```
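Instead of watching the log by eye, the wait can be scripted. A minimal sketch, assuming the readiness line contains the text quoted above; wait_for_ready is a hypothetical helper name.

```shell
# Hypothetical helper: reads log lines on stdin and returns 0 as soon as the
# readiness marker appears, or 1 if the input ends without it.
wait_for_ready() {
  grep -q -F 'Application startup complete'
}

# Usage (placeholders as elsewhere in this runbook):
#   ssh <spark-user>@<spark-1-host> 'docker logs -f vllm_node 2>&1' | wait_for_ready \
#     && curl -s http://<spark-1-ip>:8888/v1/models
```

grep -q stops reading at the first match, but the upstream docker logs -f only notices the closed pipe on its next write, so the pipeline may linger briefly after the marker is seen.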

Sideload (`make install`) can't reach the server

Symptom: make install fails with package.sideload: error sending request for url (https://immense-voyage.local/rpc/v1). Cause seen 2026-06-17: immense-voyage.local stopped resolving via mDNS from the Mac (curl https://immense-voyage.local/... → exit 6, "couldn't resolve host"), even though the server is up — curl -sk https://<server-ip>/rpc/v1 returns 200.

- Don't work around it with start-cli -H https://<server-ip> package install: TLS connects but it returns UNAUTHORIZED, because start-cli's stored credential is bound to the registered .local host, not the IP.
- Fix: make the name resolve again, then re-run make install:
  - sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder (flush mDNS), or
  - echo "<server-ip> immense-voyage.local" | sudo tee -a /etc/hosts (deterministic; remove later).

Note this only blocks installing to your own Start9 — building and publishing the s9pk to Gitea Releases is unaffected (adopters still pull the latest).
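The symptom above can be told apart from a genuinely down server by comparing curl's exit codes for the name and for the IP; exit code 6 is curl's "couldn't resolve host". A sketch with a hypothetical helper, diagnose_sideload; the messages it prints are illustrative.

```shell
# Hypothetical helper: distinguishes "name does not resolve" from "server down".
# $1 = server name (e.g. immense-voyage.local), $2 = server IP
diagnose_sideload() {
  curl -sk -o /dev/null --max-time 5 "https://$1/rpc/v1"
  rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "name resolves and the server answers"
  elif [ "$rc" -eq 6 ]; then
    # Exit 6: the name could not be resolved. Is the server up by IP?
    if curl -sk -o /dev/null --max-time 5 "https://$2/rpc/v1"; then
      echo "name does not resolve but the server answers by IP: fix name resolution"
    else
      echo "name does not resolve and the server does not answer by IP"
    fi
  else
    echo "curl failed with exit code $rc"
  fi
}

# Usage:
#   diagnose_sideload immense-voyage.local <server-ip>
```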

Diagnostics

```
# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .

# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'

# Tail current model's logs
ssh <spark-user>@<spark-1-host> 'docker logs --tail 200 -f vllm_node'

# Parakeet
curl -s http://<spark-2-ip>:8000/health

# Kokoro TTS (v0.14.0+)
curl -s http://<spark-2-ip>:8880/health
```
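The HTTP checks above can be rolled into one pass that prints an up/DOWN line per service. A sketch using the same ports and paths as the commands above; probe and check_all are hypothetical helper names.

```shell
# Hypothetical helper: prints "<label> up" if the URL answers with an HTTP
# success status within 5 seconds, "<label> DOWN" otherwise.
# $1 = label, $2 = URL
probe() {
  if curl -sf -o /dev/null --max-time 5 "$2"; then
    printf '%-10s up\n' "$1"
  else
    printf '%-10s DOWN\n' "$1"
  fi
}

# $1 = Spark 1 IP, $2 = Spark 2 IP
check_all() {
  probe vllm     "http://$1:8888/v1/models"
  probe parakeet "http://$2:8000/health"
  probe kokoro   "http://$2:8880/health"
}

# Usage:
#   check_all <spark-1-ip> <spark-2-ip>
```

The -f flag makes curl return non-zero on HTTP error statuses, so a service that answers with a 5xx is reported as DOWN rather than up.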

Hard reset

If launch-cluster.sh gets stuck:

```
ssh <spark-user>@<spark-1-host>
cd ~/spark-vllm-docker
./launch-cluster.sh stop
# Force-removes EVERY container on this host (running or stopped), not just vLLM
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model
```
