4 Commits

Author SHA1 Message Date
Keysat df9f244eae v0.26.0:0 - disk-driven model menu (scan sparks; recipes; needs-setup)
The dashboard menu is now the set of models actually downloaded on the
Sparks, not a hard-coded catalog. models.yaml + overrides are reframed as
launch recipes matched to an on-disk model by repo; an on-disk model with
no recipe is flagged needs_setup and its launch settings are inferred from
its config.json for a one-time operator confirmation (discovery.py).

- delete now removes weights AND the menu card (delete_from_disk sweeps all
  hosts; the delete endpoint resolves keys via the live menu)
- new GET /api/models/suggest; /api/models returns the menu + a recipes list
  (download autocomplete); GET /api/models/disk-status removed
- dropped the two legacy Qwen recipes (235B FP8, 2.5 72B)
- tests: +test_discovery.py (cache parsing, infer_recipe, build_menu merge)
2026-06-18 11:09:56 -05:00
Keysat c0b35184ba docs: trim Current state to live status — coordination epic shipped 2026-06-18 08:09:59 -05:00
Keysat 7ecd77f1e5 docs: defer raw-docker swap generalization — multi-node rationale recorded 2026-06-18 07:58:25 -05:00
Keysat 6bcda6e348 docs: v0.25.0:0 installed live — update Current state 2026-06-18 07:11:33 -05:00
15 changed files with 813 additions and 241 deletions
+8 -4
@@ -55,12 +55,16 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou
## Current state
- **Live service runs v0.22.0:0** (installed and serving). **v0.25.0:0 is the latest in tree — coordination layer (swap lock + webhook + schedule registry); built/typechecked clean, NOT yet committed/tagged/installed (this session's work).** It stacks on three releases also staged-but-not-live: v0.24.0:0 (configurable topology — committed `26070eb`, tagged, pushed to `gitea/master`), v0.23.0:0 (local/fine-tuned models — committed/tagged/Gitea-published). **Close-out backlog for all of these: (a) commit/tag/push v0.25.0:0; (b) `make release` to publish s9pk assets to Gitea Releases (needs `GITEA_URL` + write `GITEA_TOKEN`, neither in env); (c) the live install.** Installs blocked on the same mDNS issue (P3 line below). Working features: swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel; per-Spark SSH-key copy + WireGuard `VPN <ip>` hardware-card badge; configurable vLLM port (blank ⇒ 8888); **configurable topology** (vLLM container name, hide-services list, second-Spark vLLM monitor — v0.24.0:0); local/fine-tuned models (v0.23.0:0); **coordination layer** (v0.25.0:0 — GPU swap reservation lock with `423`-enforced manual-swap pause + human Release override, swap_complete/swap_failed webhook, read-only schedule registry; API in `docs/COORDINATION.md`). Everything from v0.23 onward lands live once the installs go through. Spark 2 audio stack healthy. Security hardening (v0.19.0:0 — shellsafe SSH-injection guard, Qdrant path-injection, same-origin CSRF guard) shipped and stable; evidence in `EVALUATION.md`. - **Built, install pending: v0.26.0:0 — disk-driven model menu.** The dashboard now lists what's *actually downloaded* on the Sparks instead of a hard-coded catalog. 
`models.yaml` + overrides are reframed as **launch recipes** matched to an on-disk model by `repo` (no longer "the menu"); `image/app/discovery.py` does the merge: `build_menu` scans both Sparks (`disk.list_cached_models`, one `du` per host) and merges the result with the recipes; an on-disk model with no recipe is `needs_setup` and `infer_recipe` reads its `config.json` to prefill a one-time setup form (operator confirms; saved to `/data` overrides). Delete now removes weights **and** the card (`delete_from_disk` sweeps all hosts; the delete endpoint resolves keys via the live menu so discovered models are deletable). New `GET /api/models/suggest`; `/api/models` returns the menu + a `recipes` list (download-box autocomplete); `GET /api/models/disk-status` removed (folded into `/api/models`). Dropped the two legacy Qwen recipes (235B FP8, 2.5 72B). Build/typecheck clean; **install (live-service restart) needs go/no-go.** Why a recipe layer survives a "menu = disk" redesign: a folder can't tell you parsers / solo-vs-cluster / MoE backend (Gemma MoE needs `marlin` on GB10) — disk drives *presence*, recipes drive *launch*.
- **matrix-bridge bot tile (done, v0.21.0:1, verified live):** `bot`-kind service tile — status badge from docker-state only (no HTTP port), plus **Update** / Restart / Stop/Start / **View logs**. Code: `app/matrix_bridge.py` + `/api/matrix-bridge/{update,logs}` (update streams; 25-min cap; fail-loud). Driven directly as `modelo` on Spark 2 (**no `sudo -iu`** — spark2 has no passwordless sudo). User is a blank-default Configure-Sparks field (`matrix_bridge_user`); blank → tile hidden (portable). Host reuses `spark2_host` (`192.168.1.87` = the bot's box `spark-32d0`); container/dir/branch are env-overridable defaults. **Load-bearing ops dep:** Update's `git fetch` runs as `modelo`, which needs `modelo`'s `~/.ssh/config` pinning the Gitea deploy key with `IdentitiesOnly yes` — else the wrong key is offered and Gitea denies (publickey). Optional next, only if the bot dev asks: Docker `HEALTHCHECK` for running-but-disconnected detection (spec §Note). - **Live: v0.25.0:0** (installed 2026-06-18). The OpenClaw/Johnny-5 coexistence epic is fully shipped & live: configurable `VLLM_PORT` (v0.22, blank ⇒ 8888), local/fine-tuned models (v0.23), configurable topology (v0.24 — `VLLM_CONTAINER`, `DISABLED_SERVICES` hide-list, second-Spark `kind: vllm` monitor), coordination layer (v0.25 — swap reservation lock with `423`-enforced manual-swap pause + `?force=true` Release override, `swap_complete`/`swap_failed` webhook, read-only schedule registry; consumer API in `docs/COORDINATION.md`).
- **Tests:** offline pytest harness in `image/tests/`, run with `cd image && .venv/bin/python -m pytest` (124 passing). Covers `build_launch_command` (incl. the shell-injection round-trip + local-model bind-mount), the transcript↔diarizer label-merge, the `shellsafe` validators, `matrix_bridge.build_update_command` (+ phase detection), the configurable-topology layer (`test_topology.py`), and the coordination layer (`test_coordination.py`: swap-lock lifecycle/expiry/token-auth, schedule-registry CRUD, webhook payload + HMAC signature — `now` is injected into the lock so expiry is tested without sleeping). Mock-heavy swap/proxy/endpoint tests deliberately skipped (low ROI). Redaction + live-audio suites remain standalone scripts.
- **Other live features:** swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel; per-Spark SSH-key copy + WireGuard `VPN <ip>` hardware badge. Security hardening (v0.19 — shellsafe SSH-injection guard, Qdrant path-injection, same-origin CSRF guard) stable (`EVALUATION.md`). Spark 2 audio/embeddings stack healthy.
- **matrix-bridge bot tile (v0.21.0:1, live):** `bot`-kind tile (docker-state badge; Update/Restart/Stop-Start/View-logs) for the Matrix bot on Spark 2, driven as `modelo` (no `sudo -iu`; blank `matrix_bridge_user` ⇒ tile hidden; host reuses `spark2_host`). Code: `app/matrix_bridge.py` + `/api/matrix-bridge/{update,logs}`. **Load-bearing:** Update's `git fetch` runs as `modelo` and needs `modelo`'s `~/.ssh/config` pinning the Gitea deploy key with `IdentitiesOnly yes` (else publickey denial). Optional next only if the bot dev asks: Docker `HEALTHCHECK`.
- **Tests:** offline pytest harness in `image/tests/`, run with `cd image && .venv/bin/python -m pytest` (137 passing). Covers `build_launch_command` (incl. the shell-injection round-trip + local-model bind-mount), the transcript↔diarizer label-merge, the `shellsafe` validators, `matrix_bridge.build_update_command` (+ phase detection), the configurable-topology layer (`test_topology.py`), the coordination layer (`test_coordination.py`: swap-lock lifecycle/expiry/token-auth, schedule-registry CRUD, webhook payload + HMAC signature — `now` is injected into the lock so expiry is tested without sleeping), and the disk-driven menu (`test_discovery.py`: cache-dirname↔repo parsing, the cache-listing parser incl. incomplete-download filtering, and `infer_recipe` family/mode mapping — Qwen3-MoE→flashinfer_cutlass, Gemma-MoE→marlin, vision caps, solo-vs-cluster by size/host-count). The `build_menu` merge + `/api/models/suggest` are exercised by hand against the live cluster (mock-heavy unit tests there would test the mocks). Redaction + live-audio suites remain standalone scripts.
- **Signal Engine "flakiness":** diagnosed as *not* a server bug — transient 14s unresponsiveness while the single GPU is busy. Client-side remedy (in-flight cap 2 / ceiling 3 / retry-on-timeout+503) drafted and **forwarded to that dev (owner confirmed 2026-06-15)**. Awaiting whether they want the measured concurrency knee.
- **Stance (decided, not built):** no public interface / no API-token auth — LAN + WireGuard/Tailscale split-tunnel only; the CSRF guard covers the browser-driven vector.
- **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers; matrix-bridge badge won't visibly flip on a fast `docker restart` (status re-checked only after the command returns).
- **Infra gotcha (safety):** passwordless sudo is NOT configured on spark2 — design unprivileged probes for any Spark feature (the badge uses `ip`, not `sudo wg show`). spark2 sits on the `starttunnel` WireGuard subnet (`10.59.211.6/24`, survives reboot). Owner declined SSH-key rotation after the 2026-06-12 history scrub (only the key *name* leaked) — don't re-flag.
- **Hosting:** self-hosted Gitea — remote `gitea`, branch `master`, over SSH; push after committing. (Wart: commit `8d839e3` is mislabeled `v0.13.0:4` but contains through v0.18.0:0.)
- **Next — committed 2026-06-17: OpenClaw/Johnny-5 coexistence epic (full plan + design stance in `ROADMAP.md` → "Cluster coordination").** Stance: Spark Control = control plane / GPU arbiter, **not** a job runner; business cron jobs live in separate services that *call* its swap API (swaps are already API-driven via `POST /api/swap`). Sequence: (1) **configurable `VLLM_PORT`** — SHIPPED **v0.22.0:0** (Configure-Sparks field, blank ⇒ 8888; + `_env_int` hardening in `config.py` so a blank/bad port no longer crashes startup, killing a P3 tech-debt item). Committed `136a471`, pushed, tagged `v0.22.0`, rebuilt clean, installed, and **published to the self-hosted Gitea Releases** 2026-06-17 (`make release``scripts/gitea-release.sh`, takes `GITEA_URL` + a write token). **Distribution model (decided 2026-06-17):** Gitea Releases + a read-only token the adopter's agent uses to pull the latest s9pk (`GET /api/v1/repos/grant/spark-control/releases/latest` → download the `.s9pk` asset → sideload). Note: Gitea returns `browser_download_url` on its `.local` ROOT_URL, which won't resolve off-LAN — a remote adopter pulls via whatever address reaches the Gitea (the WireGuard IP). (2) **local-path/fine-tuned models** — DONE in tree, staged as **v0.23.0:0** (`ModelDef.local_path` + exactly-one-source validator; swap bind-mounts the dir at the same container path via the launch script's `VLLM_SPARK_EXTRA_DOCKER_ARGS` hook, **no `launch-cluster.sh` change**; "+ Add local model" UI form + `local` badge; `validate_local_path`; disk-delete refused for local; 94 tests pass. Reviewer-agent pass done, findings addressed (path validation + chat-template-location guard folded into the `ModelDef` validator so YAML/override entries are checked too; `_merge_overrides` skips a bad entry instead of failing the whole catalog; `VLLM_SPARK_EXTRA_DOCKER_ARGS` contract documented in `runbook.md`). 
**Committed `e783653`, tagged `v0.23.0`, built clean, published to Gitea Releases — but `make install` to the live Start9 FAILED: `immense-voyage.local` wasn't resolving via mDNS from the Mac (server up at `192.168.1.72`; `start-cli -H <ip>` reaches it but returns UNAUTHORIZED, auth bound to the registered `.local` host). FINISH-HERE: flush mDNS (`sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder`) or add a hosts entry, then re-run `cd package && make install`** (details in runbook → "Sideload can't reach the server"). (3) **configurable topology** — DONE in tree, staged as **v0.24.0:0** (built clean, not yet committed/installed). Three optional Configure-Sparks knobs: vLLM container name (`VLLM_CONTAINER`, blank ⇒ `vllm_node`, threaded into the swap log-tail + validator exec via `quote_arg`); "services to hide" (`DISABLED_SERVICES` comma list → `Settings.disabled_services` frozenset, skipped by `services_from_settings`, the `check_*` probes, deep-health `run_all`, and connectivity logging — kills the Parakeet-on-8000 collision); second-Spark vLLM monitor via a `kind: vllm` custom service in `services-overrides.yaml` (`probe_vllm_endpoint` shared with `check_vllm`). `/api/endpoints` gained a `disabled` flag; the health-dot hides when disabled. 102 tests pass (+8 in `test_topology.py`). Swap mechanism deliberately NOT generalized to raw `docker run` (that's coordination, item 4). Install pending — same mDNS situation as v0.23.0. (4) **coordination layer** — DONE in tree, staged as **v0.25.0:0** (brought forward 2026-06-17 on request rather than waiting for our own automation). 
`image/app/coordination.py` + `docs/COORDINATION.md`: swap reservation lock (`GET/POST/DELETE /api/swap/lock`, secret token, `423`-enforced in `post_swap`, TTL-bounded in-memory, `?force=true` human override, dashboard banner + swap-button pause), swap webhook (`swap_complete`/`swap_failed` fired outside the swap lock from `SwapManager._run`, optional HMAC `X-Spark-Signature`, Configure-Sparks URL+secret), schedule registry (`GET/POST/DELETE /api/schedule`, read-only "Scheduled jobs" panel). +20 tests (`test_coordination.py`). Built/typechecked clean; commit + install pending. Still-open older threads: audio concurrency sweep (only if the Signal Engine dev wants the knee; needs a quiet window); optional matrix-bridge Docker `HEALTHCHECK` if the bot dev asks; Parakeet long-audio guard deferred (rationale in ROADMAP). - **Design stance (decided):** Spark Control = control plane / GPU arbiter, **not** a job runner; recurring business jobs live in separate services that *call* the swap API (`POST /api/swap`). Full epic history (v0.22→v0.25) is in git log + `ROADMAP.md` → "Cluster coordination".
- **Usage note (2026-06-18):** owner's daily driver is the solo **Qwen3.6 35B**; the 235B `cluster` models are dormant. Keeping `launch-cluster.sh` (the `eugr/spark-vllm-docker` community standard, mirrors NVIDIA's `dgx-spark-playbooks` Ray+RoCE design) is still correct even single-node — it supplies the maintained, hardware-tuned vLLM images; raw docker would mean DIY image upkeep for no gain. Spark 2 stays the speech/embeddings box regardless.
- **Next steps (all low-priority / externally gated; P2/P3 tech-debt backlog in `ROADMAP.md`):** (1) raw-`docker run` swap generalization — **DEFERRED** (rationale in ROADMAP; revisit only if an adopter wants Spark Control to *drive*, not just monitor, raw-docker swaps — cleanest fix is the adopter adopting `launch-cluster.sh`). (2) audio concurrency knee — only if the Signal Engine dev wants it (needs a quiet window). (3) matrix-bridge Docker `HEALTHCHECK` — only if the bot dev asks. (4) Parakeet long-audio guard — deferred (rationale in ROADMAP).
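The "disk drives presence, recipes drive launch" split described in the v0.26.0:0 bullet can be sketched roughly as below. This is an illustration only, not the code in `image/app/discovery.py`: the function name `merge_menu`, the dict shapes, and the sample repo names are assumptions invented for the example.

```python
from typing import Any


def merge_menu(on_disk: list[str], recipes: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Build one menu card per on-disk repo (hypothetical shapes).

    on_disk: repo ids found in the model cache on the hosts.
    recipes: recipe key -> recipe dict carrying at least a "repo" field.
    """
    # Index recipes by the repo they launch so matching is a dict lookup.
    by_repo = {recipe["repo"]: (key, recipe) for key, recipe in recipes.items()}
    menu: list[dict[str, Any]] = []
    for repo in sorted(on_disk):
        if repo in by_repo:
            key, recipe = by_repo[repo]
            menu.append({"key": key, "repo": repo, "needs_setup": False, "recipe": recipe})
        else:
            # Present on disk, but nothing says how to launch it yet.
            menu.append({"key": None, "repo": repo, "needs_setup": True, "recipe": None})
    return menu


# Sample data (made up): one recipe matches a downloaded model, one does not.
recipes = {
    "known": {"repo": "example/known-model", "mode": "solo"},
    "absent": {"repo": "example/not-downloaded", "mode": "cluster"},
}
menu = merge_menu(["example/new-model", "example/known-model"], recipes)
```

In this sketch a recipe whose model is not on disk produces no card, which matches the idea that recipes alone never put anything on the menu.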
+1 -1
@@ -92,7 +92,7 @@ Now that hosts are configured, Show Public Key will give you the paste-ready ins
From the Spark Control service page, click the Web UI button. You should see:
- A **top status bar** with the currently loaded LLM (or "no model loaded" if Spark 1's vLLM container is fresh).
- An **LLM tab** with cards for each model in the bundled catalog. Models you've downloaded show "on disk" badges; others show "not downloaded".
- An **LLM tab** whose cards are the models actually downloaded on your Sparks (the dashboard scans them on load). A model Spark Control doesn't yet know how to launch shows a "needs setup" card; the first switch reads its files, proposes settings, and asks you to confirm once. Use **+ Download a new model** to fetch one — it appears here when it finishes.
- An **Audio / Speech tab** with health status and Install / Start / Stop / Restart buttons for Parakeet and Kokoro.
If the dashboard loads and both Spark hardware cards show CPU/RAM/GPU stats, **you're in**.
+2 -2
@@ -112,14 +112,14 @@ Fields: `service` (required), `ok` (required), `source` (optional, free-form), `
## Status
**v0.2.3 / s9pk version 0.13.0:4** — installed and verified on a Start9 server. Five bundled LLMs in the catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen2.5-72b), plus any custom models added through the UI.
**s9pk version 0.26.0:0** — installed and verified on a Start9 server. The LLM menu is whatever's downloaded on the Sparks (scanned live, not hard-coded); bundled *launch recipes* (qwen3-vl, gemma4, gemma4-26b, qwen36) tell it how to launch known models, and anything else gets a "needs setup" card that infers + saves its settings on first use.
### What v0.2 added on top of v0.1
- **Service discovery API** (`/api/endpoints`) for other LAN services
- **Kokoro-82M TTS** replaces Magpie/Riva NIM as the default TTS backend (v0.14.0). Magpie's decoder had a ~30-50% truncation rate on multi-sentence inputs and ate 49 GB of GPU memory; Kokoro is 24/24 reliable at every input length tested, uses 1.3 GB GPU, and renders in ~1s. See HANDOFF.md and the release notes for the migration story.
- **Always-on services panel** with Start/Stop/Restart for Parakeet + Kokoro, plus per-service host configuration in Configure Sparks (so they can live on Spark 1, Spark 2, or anywhere)
- **Model download** from the dashboard — paste an HF repo, pick solo or cluster, watch percent progress with bytes/rate/ETA. After completion, an "Add to catalog" dialog appears pre-filled.
- **Model download** from the dashboard — paste an HF repo (with autocomplete for known models), pick solo or cluster, watch percent progress with bytes/rate/ETA. After completion the model appears on the menu automatically; if it's unrecognized, a pre-filled "set up this model" dialog offers to configure it.
- **spark-vllm-docker update check** — banner shows "N commits behind upstream"; Apply Update runs `git pull && ./build-and-copy.sh -c` over SSH with a streamed log
- **Per-model Advanced settings** — knobs for max context, GPU memory %, and three optimization toggles (fastsafetensors, prefix caching, FP8 KV cache). Persisted to `/data/models-overrides.yaml` so they survive package updates. Bundled and custom models alike.
- **Diarization with speaker fingerprints** via Sortformer + TitaNet, exposed at `/api/audio/diarize-chunk` for chunked workflows
+13 -1
@@ -16,7 +16,19 @@ Sequenced:
- **Swap lock** with holder + TTL (`POST` / `GET` / `DELETE /api/swap/lock`). Acquire returns a secret token; the swap endpoint refuses any real swap (`423`) that doesn't present it in `X-Swap-Lock-Token`, so the dashboard's manual swap is paused while a scheduler holds it (with a `?force=true` human override). In-memory + TTL-bounded → resets to unlocked on restart; re-acquire with the token extends. Enforced in `post_swap`, not advisory.
- **Swap-event webhook** (`swap_complete` / `swap_failed`) to a configurable URL (Configure-Sparks field), fired from `SwapManager._run` *outside* the swap lock; optional shared secret ⇒ `X-Spark-Signature` HMAC. Fire-and-forget (5 s, no retries); dry runs don't fire.
- **Schedule visibility** — `GET/POST/DELETE /api/schedule`; read-only "Scheduled jobs" dashboard panel, registered by external schedulers. Spark Control stores and displays, never executes.
- Still NOT generalized: the swap *mechanism* to raw `docker run` (that's the adopter's own crons' job). Tests: `image/tests/test_coordination.py` (22 cases — lock lifecycle/expiry/token, the single-read swap gate, schedule CRUD + id validation, webhook payload+signature). Known limit: lock + schedules are in-memory (a restart frees the lock and empties the registry until schedulers re-register) — persist to `/data` only if that bites.
- Tests: `image/tests/test_coordination.py` (22 cases — lock lifecycle/expiry/token, the single-read swap gate, schedule CRUD + id validation, webhook payload+signature). Known limit: lock + schedules are in-memory (a restart frees the lock and empties the registry until schedulers re-register) — persist to `/data` only if that bites.
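The swap-lock behaviour listed above (secret token on acquire, TTL expiry, re-acquire with the token extends, a human force override) can be illustrated with a small in-memory sketch. It is not the code in `image/app/coordination.py`: the class name `SwapLock`, its method names, and the boolean return of `swap_allowed` are assumptions for the example. A refusal here corresponds to the `423` response the list describes. The clock is injected so expiry can be exercised without sleeping.

```python
import secrets
import time
from typing import Callable, Optional


class SwapLock:
    """In-memory, TTL-bounded reservation lock (illustrative sketch)."""

    def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
        self._now = now  # injected clock; tests pass a fake one
        self._holder: Optional[str] = None
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def _active(self) -> bool:
        return self._token is not None and self._now() < self._expires_at

    def _matches(self, token: Optional[str]) -> bool:
        if token is None or self._token is None:
            return False
        # Constant-time comparison so the token can't be guessed by timing.
        return secrets.compare_digest(token.encode(), self._token.encode())

    def acquire(self, holder: str, ttl: float, token: Optional[str] = None) -> Optional[str]:
        """Return the secret token, or None if another holder has the lock."""
        if self._active():
            if not self._matches(token):
                return None
            # Same holder presenting its token: fall through and extend.
        else:
            self._token = secrets.token_urlsafe(16)
        self._holder = holder
        self._expires_at = self._now() + ttl
        return self._token

    def swap_allowed(self, token: Optional[str] = None, force: bool = False) -> bool:
        """False is the case the HTTP layer would turn into a 423."""
        return force or not self._active() or self._matches(token)


# Usage with a fake clock (a one-element list we can advance by hand).
clock = [0.0]
lock = SwapLock(now=lambda: clock[0])
token = lock.acquire("nightly-batch", ttl=60)
```

Because the state lives only in the object, a process restart starts from an unlocked state, which is the known limit the Tests bullet mentions.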
### Generalizing the swap mechanism to raw `docker run` — DEFERRED (decided 2026-06-18, research-backed; was item 4's last open thread)
Our swap drives `~/spark-vllm-docker/launch-cluster.sh` over SSH on Spark 1 (`./launch-cluster.sh stop`, then `[VLLM_SPARK_EXTRA_DOCKER_ARGS=…] ./launch-cluster.sh [--solo ]-d exec vllm serve <model> <args>`, then `docker logs -f` until the ready marker). The OpenClaw adopter launches vLLM with a plain `docker run` instead, so the swap button can't drive his cluster — only monitor it. The portability fix would be a configurable "swap backend": keep `launch-cluster.sh` as the default and add a "bring your own command" mode (operator-authored stop/launch templates in `services-overrides.yaml` with quoted `{model}`/`{container}`/`{port}`/`{extra_args}` substitution; ready-detection unchanged; the vLLM-argparse pre-flight disabled for that backend).
**Why deferred, not built:**
- **Raw docker is not an upgrade for *us* — for half our catalog it's impossible.** `launch-cluster.sh` is the `eugr/spark-vllm-docker` community project (de-facto DGX Spark standard; mirrors NVIDIA's own `dgx-spark-playbooks` Ray+RDMA architecture). Its headline job is **multi-node** serving: our 235B `cluster` models (Qwen3-VL 235B, Qwen3 235B) exceed one Spark's 128 GB and *must* shard across both Sparks via Ray over the 200 Gbps ConnectX/RoCE link — plumbing (NCCL/MTU/per-node env) that a single-node `docker run` cannot do. So we keep the helper script; switching our own cluster to raw docker is off the table.
- **The feature is therefore portability-only** (for differently-wired adopters), and the one known adopter doesn't need it — he swaps via his own crons and uses Spark Control to watch.
- **Untestable on our hardware** — our cluster uses the helper script, so we can't validate a real raw-docker swap without risking the live vLLM.
- The one real standing risk is eugr's single-maintainer status; fallback is community forks or migrating to NVIDIA's official `dgx-spark-playbooks` launcher (same design). No reason to switch now.
**Revisit only if** an adopter explicitly wants Spark Control to *drive* (not just monitor) swaps on a raw-`docker run` cluster. At that point, get their actual working `docker run` command and build the command-template backend to it.
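For orientation, the quoted `{model}`/`{container}`/`{port}`/`{extra_args}` substitution mentioned for the deferred "bring your own command" backend could look roughly like the sketch below. Nothing here exists in the tree: `render_command`, the template string, and the image name are hypothetical, and the standard library's `shlex.quote` stands in for the project's own quoting helper.

```python
import shlex


def render_command(
    template: str,
    *,
    model: str,
    container: str,
    port: int,
    extra_args: list[str],
) -> str:
    """Fill an operator-authored template, shell-quoting every value."""
    return template.format(
        model=shlex.quote(model),
        container=shlex.quote(container),
        port=int(port),  # coerced to int, so no quoting is needed
        # A list, so each argument is quoted on its own.
        extra_args=" ".join(shlex.quote(arg) for arg in extra_args),
    )


# Hypothetical launch template; a real one would come from the adopter.
template = (
    "docker run -d --name {container} -p {port}:8000 "
    "example/vllm-image vllm serve {model} {extra_args}"
)
cmd = render_command(
    template,
    model="org/model; rm -rf /",  # hostile value to show the quoting
    container="vllm_node",
    port=8888,
    extra_args=["--max-model-len", "32768"],
)
```

The hostile model name ends up as a single quoted argument rather than a second shell command. One caveat of using `str.format` this way is that a template containing literal braces would need them doubled.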
## Near term
- parakeet-asr long-audio memory guard — **deferred 2026-06-15, low priority.** A duration cap on `/v1/audio/diarize`: Sortformer runs the whole file in one pass (`diarizer.py:128-135`) over Spark 2's *shared* 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. **Precautionary — no observed incident**, and the production consumer (Recap Relay) already chunks via `/diarize-chunk` (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full `/diarize`. When picked up: add a configurable `MAX_DIARIZE_SECONDS` guard in `diarizer.py` right after `duration` is computed (~line 130) → raise → HTTP 413 in `main.py` (mirrors the existing `MAX_UPLOAD_MB` 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
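A minimal sketch of the deferred duration guard, assuming the shape described above (compute `duration`, raise, map to HTTP 413). The exception class, the helper names, the sample default, and the "0 disables the cap" convention are all assumptions for illustration; the real `diarizer.py` and `main.py` are not shown in this diff.

```python
class AudioTooLong(Exception):
    """Raised when a single-pass diarization request exceeds the cap."""


# Hypothetical default; the note only says the cap should be configurable.
MAX_DIARIZE_SECONDS = 1800.0


def check_duration(
    num_samples: int,
    sample_rate: int,
    max_seconds: float = MAX_DIARIZE_SECONDS,
) -> float:
    """Return the clip duration in seconds, or raise if it exceeds the cap."""
    duration = num_samples / sample_rate
    # Assumption: a cap of 0 (or less) means "no limit".
    if max_seconds > 0 and duration > max_seconds:
        raise AudioTooLong(
            f"audio is {duration:.0f}s, limit is {max_seconds:.0f}s; "
            "send it in chunks instead"
        )
    return duration


def status_for(exc: Exception) -> int:
    """Map the guard's exception to the HTTP status the note calls for."""
    return 413 if isinstance(exc, AudioTooLong) else 500


# One minute of 16 kHz audio passes a two-minute cap.
one_minute = check_duration(16000 * 60, 16000, max_seconds=120)
```

Keeping the check a pure function of sample count and rate means it can be unit-tested offline without loading the diarizer model.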
+2 -1
@@ -39,6 +39,7 @@ Two kinds, both run with the `image/.venv` interpreter (system python3 has no de
## Layout
- `image/app/server.py` — FastAPI entry; routers live in sibling modules (`audio_proxy.py`, `llm_proxy.py`, `embeddings_proxy.py`, `redaction_gateway.py`, `swap.py`, `health.py`, `deep_health.py`, `connectivity.py`, …).
- `image/app/discovery.py` — the disk-driven model menu. `/api/models` lists what's actually downloaded on the Sparks (via `disk.list_cached_models`); `models.yaml`/overrides are *launch recipes* matched by repo, not the menu. An on-disk model with no recipe is `needs_setup`; `infer_recipe` reads its `config.json` to prefill a setup form the operator confirms once.
- `image/app/static/` — the dashboard UI.
- `image/models.yaml`: vLLM model catalog bundled into the image.
- `image/models.yaml`: bundled vLLM **launch recipes** (how to launch a known model), NOT the dashboard menu — the menu is the on-disk scan.
- `image/spark_embed/` — Dockerfile + app for the embeddings container; built ON a Spark (ARM64, NGC PyTorch base — see the audio/cluster rule for NGC torch-pinning caveats).
+209
@@ -0,0 +1,209 @@
"""Disk-driven model menu + launch-recipe inference.

The dashboard's model list is whatever is actually downloaded on the Sparks
(see `disk.list_cached_models`), NOT a hard-coded catalog. The bundled/overridden
catalog entries are *launch recipes*: matched to an on-disk model by repo, they
say HOW to launch it. A completed model on disk with no matching recipe shows up
as `needs_setup` — the first switch reads its `config.json`, proposes a recipe
(`infer_recipe`) the operator confirms once, and that confirmed recipe is saved
to /data so it's a normal card from then on.

Why a recipe layer at all, if the menu is the disk? Because a folder on disk
doesn't say how to launch it: the per-family parsers (`--reasoning-parser`,
`--tool-call-parser`), the MoE backend (some Gemma MoE checkpoints need
`marlin` on GB10), and solo-vs-cluster topology can't be read off a directory.
We infer a best guess from the model's own config + size, but the operator
confirms it — a wrong guess is cheap, a wrong launch is not.
"""
from __future__ import annotations

import asyncio
import re

from .config import Settings
from .disk import list_cached_models, probe_disk
from .overrides import extract_knobs_from_args

# A model whose weights exceed this can't fit one Spark's 128 GB beside a KV
# cache, so it must shard across both via Ray. A heuristic prefill only — the
# operator confirms mode in the setup form, so the exact cutoff isn't critical.
SINGLE_SPARK_BYTES = 115 * 1000 ** 3

# Generic knob defaults applied to every inferred recipe (the operator can tweak
# these in the setup form). Family-specific flags (parsers, MoE backend) are
# layered on separately by `_detect_family`.
_COMMON_KNOBS = {
    "max_model_len": 32768,
    "gpu_memory_utilization": 0.85,
    "fastsafetensors": True,
    "prefix_caching": True,
    "kv_cache_dtype": "fp8",
}


def repo_to_key(repo: str) -> str:
    """Stable, URL-safe menu key for a discovered model with no recipe key yet.

    'RedHatAI/Qwen3.6-35B-A3B-NVFP4' -> 'redhatai-qwen3-6-35b-a3b-nvfp4'. The same
    slug is used by the menu, the setup form, and `_identify_current_model`, so a
    loaded-but-unconfigured model still highlights as active."""
    return re.sub(r"[^a-z0-9_-]+", "-", repo.lower()).strip("-")


def _detect_family(config: dict) -> tuple[str, list[str], list[str]]:
    """Return (family_label, vllm_flags, capabilities) inferred from config.json.

    Only family-specific, non-knob flags (parsers, MoE backend) go in vllm_flags;
    generic knob defaults are handled by the caller. Best-effort and operator-
    confirmed, so a wrong guess is cheap."""
    arch = " ".join(config.get("architectures") or [])
    mtype = str(config.get("model_type") or "")
    s = (arch + " " + mtype).lower()
    is_moe = (
        "moe" in s
        or any(config.get(k) for k in ("num_experts", "n_routed_experts", "num_local_experts"))
    )
    is_vision = (
        "conditionalgeneration" in s
        or "vision" in s
        or "vlforcausallm" in s
        or "vision_config" in config
        or "image_token_index" in config
    )

    flags: list[str] = []
    caps: list[str] = []
    label = "Generic"
    if mtype.startswith("qwen3") or "qwen3" in s:
        label = "Qwen3 (MoE)" if is_moe else "Qwen3"
        flags.append("--reasoning-parser=qwen3")
        caps.append("reasoning")
        if is_moe:
            flags.append("--moe_backend=flashinfer_cutlass")
    elif "gemma" in s:
        label = "Gemma (MoE)" if is_moe else "Gemma"
        flags += ["--reasoning-parser=gemma4", "--tool-call-parser=gemma4", "--enable-auto-tool-choice"]
        caps += ["reasoning", "tools"]
        if is_moe:
            # The fast flashinfer/CUTLASS FP4 path errors on GB10 for Gemma MoE;
            # marlin is the working fallback (see the Gemma 26B trial notes).
            flags.append("--moe_backend=marlin")
    if is_vision and "vision" not in caps:
        caps.append("vision")
    return label, flags, caps


def _infer_mode(total_bytes: int, on_host_count: int) -> str:
    """Solo unless the weights are present on both Sparks or too big for one."""
    if on_host_count >= 2 or total_bytes > SINGLE_SPARK_BYTES:
        return "cluster"
    return "solo"


def infer_recipe(repo: str, config: dict, total_bytes: int, on_host_count: int) -> dict:
    """Propose a launch recipe for a discovered model — prefills the setup form."""
    label, flags, caps = _detect_family(config or {})
    mode = _infer_mode(total_bytes, on_host_count)
    vllm_args = list(flags)
    vllm_args.append("--max-num-batched-tokens=16384")
    knobs = dict(_COMMON_KNOBS)
    if mode == "cluster":
        # Large models shard across both Sparks via Ray; leave more headroom.
        vllm_args += ["-tp=2", "--distributed-executor-backend=ray"]
        knobs["gpu_memory_utilization"] = 0.7
    return {
        "key": repo_to_key(repo),
        "repo": repo,
        "display_name": repo.split("/")[-1],
        "mode": mode,
        "capabilities": caps,
        "vllm_args": vllm_args,
        "knobs": knobs,
        "family": label,
    }


def _menu_entry_from_recipe(m, *, on_disk: bool, total_bytes: int, per_host: list[dict]) -> dict:
    d = m.model_dump()
    d["effective_knobs"] = {**extract_knobs_from_args(m.vllm_args), **(m.knobs or {})}
    d["needs_setup"] = False
    d["on_disk"] = on_disk
    d["total_bytes"] = total_bytes
    d["per_host"] = per_host
    return d


async def build_menu(settings: Settings, catalog) -> dict[str, dict]:
    """The disk-driven model menu: every completed model on the Sparks, annotated
    with its launch recipe (matched by repo) or flagged `needs_setup` if none.

    Two SSH scans total (one per Spark), run in parallel — much cheaper than the
    old per-recipe disk probe. A host that errors is skipped, not fatal."""
    hosts = [(settings.spark1_host, settings.spark1_user)]
    if settings.spark2_host:
        hosts.append((settings.spark2_host, settings.spark2_user))
    scans = await asyncio.gather(
        *(list_cached_models(h, u, settings) for h, u in hosts),
        return_exceptions=True,
    )

    by_repo: dict[str, dict] = {}
    for (h, _u), res in zip(hosts, scans):
        if isinstance(res, Exception):
            continue
        for repo, size, complete in res:
            e = by_repo.setdefault(repo, {"total_bytes": 0, "per_host": [], "complete": False})
            e["total_bytes"] += size
            e["per_host"].append({"host": h, "size_bytes": size})
            e["complete"] = e["complete"] or complete

    recipe_by_repo = {m.repo: (k, m) for k, m in catalog.models.items() if m.repo}
    menu: dict[str, dict] = {}
    for repo, info in by_repo.items():
        # Skip half-fetched / corrupt caches (no finished snapshot) — they'd show
        # as broken cards. In-flight downloads surface in the download panel.
        if not info["complete"]:
            continue
        if repo in recipe_by_repo:
            key, m = recipe_by_repo[repo]
            menu[key] = _menu_entry_from_recipe(
                m, on_disk=True, total_bytes=info["total_bytes"], per_host=info["per_host"]
            )
        else:
            key = repo_to_key(repo)
            menu[key] = {
                "display_name": repo.split("/")[-1],
                "repo": repo,
                "local_path": None,
                "size_gb": round(info["total_bytes"] / 1e9, 1),
                "mode": _infer_mode(info["total_bytes"], len(info["per_host"])),
                "capabilities": [],
                "expected_ready_seconds": 300,
                "vllm_args": [],
                "description": None,
                "knobs": None,
                "custom": False,
                "needs_setup": True,
                "effective_knobs": {},
                "on_disk": True,
                "total_bytes": info["total_bytes"],
                "per_host": info["per_host"],
            }

    # Local/fine-tuned recipes live as a directory, not an HF cache entry — probe
    # each by path and include it if present. Their keys are unique catalog keys
    # (and local models carry repo="" per ModelDef), so they never collide with a
    # discovered repo's slug or an HF recipe key above.
    for key, m in catalog.models.items():
        if not m.local_path:
            continue
        st = await probe_disk(m.repo, m.mode, settings, local_path=m.local_path)
        if not st.on_disk:
            continue
        menu[key] = _menu_entry_from_recipe(
            m,
            on_disk=True,
            total_bytes=st.total_bytes,
            per_host=[{"host": r.host, "size_bytes": r.size_bytes} for r in st.per_host if r.on_disk],
        )
    return menu
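The merge step in `build_menu` can be tried without SSH. Below is a minimal synchronous sketch, not the shipped code: `merge_scans` and `split_menu` are hypothetical names, recipes are reduced to a plain `repo -> key` dict, and the host names, repos, and sizes are invented literals standing in for `list_cached_models` output.

```python
import re


def repo_to_key(repo: str) -> str:
    # Same slug rule as discovery.repo_to_key in the diff above.
    return re.sub(r"[^a-z0-9_-]+", "-", repo.lower()).strip("-")


def merge_scans(scans: dict) -> dict:
    """Fold per-host (repo, size_bytes, complete) rows into one entry per repo."""
    by_repo: dict = {}
    for host, rows in scans.items():
        for repo, size, complete in rows:
            e = by_repo.setdefault(repo, {"total_bytes": 0, "per_host": [], "complete": False})
            e["total_bytes"] += size
            e["per_host"].append({"host": host, "size_bytes": size})
            e["complete"] = e["complete"] or complete
    return by_repo


def split_menu(by_repo: dict, recipe_keys: dict) -> dict:
    """Return {menu_key: needs_setup} for every completed repo."""
    menu = {}
    for repo, info in by_repo.items():
        if not info["complete"]:
            continue  # half-fetched caches never become cards
        if repo in recipe_keys:
            menu[recipe_keys[repo]] = False  # known recipe: a normal card
        else:
            menu[repo_to_key(repo)] = True   # on disk, no recipe: needs setup
    return menu


# Invented scan results for two hosts.
scans = {
    "spark1": [("org/model-a", 10, True), ("org/partial", 3, False)],
    "spark2": [("org/model-a", 10, True), ("org/model-b", 7, True)],
}
by_repo = merge_scans(scans)
menu = split_menu(by_repo, {"org/model-a": "model-a"})
print(menu)
```

Note that `total_bytes` is summed across hosts, so a model cached on both Sparks reports twice its single-copy size.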
+89 -3
@@ -10,6 +10,7 @@ model or one tied to an in-flight swap/download.
""" """
from __future__ import annotations from __future__ import annotations
import asyncio import asyncio
import json
import re import re
from dataclasses import dataclass from dataclasses import dataclass
from typing import Optional from typing import Optional
@@ -36,6 +37,87 @@ def repo_to_cache_dirname(repo: str) -> str:
return dn return dn


def cache_dirname_to_repo(dirname: str) -> Optional[str]:
    """Inverse of `repo_to_cache_dirname`: 'models--org--name' -> 'org/name'.

    A repo has exactly one '/', so the org is the first '--'-segment and the name
    is everything after (names may themselves contain single dashes). Returns
    None for anything that isn't a model cache dir."""
    if not dirname.startswith("models--"):
        return None
    parts = dirname[len("models--"):].split("--")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return None
    return f"{parts[0]}/{'--'.join(parts[1:])}"


def parse_cache_listing(out: str) -> list[tuple[str, int, bool]]:
    """Parse the 'size|complete|dirname' lines from `list_cached_models`'s scan.

    Returns [(repo, size_bytes, complete), ...], skipping non-model lines. Pure
    function so the parsing is unit-testable without SSH."""
    items: list[tuple[str, int, bool]] = []
    for line in out.splitlines():
        line = line.strip()
        if line.count("|") < 2:
            continue
        size_s, complete_s, dirname = line.split("|", 2)
        repo = cache_dirname_to_repo(dirname.strip())
        if not repo:
            continue
        try:
            size = int(size_s)
        except ValueError:
            size = 0
        items.append((repo, size, complete_s.strip() == "1"))
    return items


async def list_cached_models(host: str, user: str, settings: Settings) -> list[tuple[str, int, bool]]:
    """Enumerate every Hugging Face model cached on a host: (repo, size_bytes, complete).

    'complete' = the cache has at least one snapshot carrying a config.json (a
    finished download, not a half-fetched/corrupt dir). One SSH round-trip; the
    glob's no-match case is handled by the `[ -d ]` guard."""
    if not host or not user:
        return []
    cmd = (
        'HUB="$HOME/.cache/huggingface/hub"; '
        'for d in "$HUB"/models--*; do '
        '[ -d "$d" ] || continue; '
        'n=$(basename "$d"); '
        'sz=$(du -sb "$d" 2>/dev/null | cut -f1); sz=${sz:-0}; '
        'if ls "$d"/snapshots/*/config.json >/dev/null 2>&1; then c=1; else c=0; fi; '
        'echo "${sz}|${c}|${n}"; '
        'done'
    )
    rc, out, err = await ssh_run(host, user, cmd, settings, timeout=30.0)
    if rc != 0:
        return []
    return parse_cache_listing(out)


async def read_model_config(host: str, user: str, repo: str, settings: Settings) -> Optional[dict]:
    """Read a cached model's config.json (first snapshot) for launch inference.

    Returns the parsed dict, or None if absent/unreadable. The dirname is
    whitelisted (repo_to_cache_dirname) so it's safe to embed unquoted."""
    if not host or not user:
        return None
    dn = repo_to_cache_dirname(repo)
    cmd = (
        f'D=$(ls -d "$HOME/.cache/huggingface/hub/{dn}/snapshots/"*/ 2>/dev/null | head -1); '
        f'[ -n "$D" ] && cat "${{D}}config.json" 2>/dev/null'
    )
    rc, out, err = await ssh_run(host, user, cmd, settings, timeout=20.0)
    if rc != 0 or not out.strip():
        return None
    try:
        return json.loads(out)
    except (ValueError, TypeError):
        return None
@dataclass @dataclass
class HostDiskResult: class HostDiskResult:
host: str host: str
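To illustrate the `size|complete|dirname` line format, here is a small standalone run of the two pure helpers on a made-up listing. The functions are restated from the diff so the block runs on its own; the sample sizes and directory names are invented.

```python
from typing import Optional


def cache_dirname_to_repo(dirname: str) -> Optional[str]:
    # 'models--org--name' -> 'org/name'; anything else -> None.
    if not dirname.startswith("models--"):
        return None
    parts = dirname[len("models--"):].split("--")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return None
    return f"{parts[0]}/{'--'.join(parts[1:])}"


def parse_cache_listing(out: str) -> list:
    items = []
    for line in out.splitlines():
        line = line.strip()
        if line.count("|") < 2:
            continue  # not a scan line at all
        size_s, complete_s, dirname = line.split("|", 2)
        repo = cache_dirname_to_repo(dirname.strip())
        if not repo:
            continue  # e.g. a datasets-- cache dir
        try:
            size = int(size_s)
        except ValueError:
            size = 0  # unreadable size degrades to 0, the row is kept
        items.append((repo, size, complete_s.strip() == "1"))
    return items


# Invented listing: one finished model, one with a bad size field,
# one non-model cache dir, and one junk line.
listing = "\n".join([
    "1024|1|models--org--name-with-dashes",
    "abc|0|models--org--half--fetched",
    "512|1|datasets--org--not-a-model",
    "garbage line",
])
rows = parse_cache_listing(listing)
print(rows)
```

Because the remaining segments are rejoined with `--`, a name that itself contains a double dash survives the round trip, as the second row shows.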
@@ -159,10 +241,14 @@ async def delete_host(host: str, user: str, repo: str, settings: Settings) -> Ho
return HostDiskResult(host=host, on_disk=False, size_bytes=freed) return HostDiskResult(host=host, on_disk=False, size_bytes=freed)
async def delete_from_disk(repo: str, mode: str, settings: Settings) -> DiskStatus: async def delete_from_disk(repo: str, settings: Settings) -> DiskStatus:
"""rm -rf the model's cache dir on the relevant Sparks. Idempotent.""" """rm -rf the model's cache dir on ALL configured Sparks. Idempotent.
We sweep both Sparks regardless of the model's declared mode: a 'remove from
disk & menu' must leave nothing behind, and rm of an absent dir reports 0
bytes freed (FREED 0), so an extra host is harmless."""
hosts: list[tuple[str, str]] = [(settings.spark1_host, settings.spark1_user)] hosts: list[tuple[str, str]] = [(settings.spark1_host, settings.spark1_user)]
if mode == "cluster" and settings.spark2_host: if settings.spark2_host:
hosts.append((settings.spark2_host, settings.spark2_user)) hosts.append((settings.spark2_host, settings.spark2_user))
results = await asyncio.gather(*(delete_host(h, u, repo, settings) for h, u in hosts)) results = await asyncio.gather(*(delete_host(h, u, repo, settings) for h, u in hosts))
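The "sweep every host, absent dir frees 0 bytes" behaviour can be sketched with an in-memory stand-in. Everything here is hypothetical: `FAKE_DISKS`, `fake_delete_host`, and `sweep` are illustration-only names, and no SSH or `rm -rf` is involved.

```python
import asyncio

# Hypothetical stand-in for the two Sparks' caches: host -> {repo: size_bytes}.
FAKE_DISKS = {"spark1": {"org/model": 40}, "spark2": {}}


async def fake_delete_host(host: str, repo: str) -> int:
    """Remove the repo from one host; return bytes freed (0 if already absent)."""
    return FAKE_DISKS[host].pop(repo, 0)


async def sweep(repo: str) -> int:
    # Every configured host is swept, regardless of where the model was expected.
    freed = await asyncio.gather(*(fake_delete_host(h, repo) for h in FAKE_DISKS))
    return sum(freed)


first = asyncio.run(sweep("org/model"))   # frees the one real copy
second = asyncio.run(sweep("org/model"))  # repeat call: nothing left to free
print(first, second)
```

The second call returning 0 is the idempotence the docstring describes: sweeping an extra host, or sweeping twice, is harmless.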
+77 -56
@@ -15,7 +15,8 @@ from .coordination import LockHeld, ScheduleRegistry, SwapLockManager, WebhookNo
from .custom_services import add_custom_service, delete_custom_service from .custom_services import add_custom_service, delete_custom_service
from .audio_proxy import build_router as build_audio_router from .audio_proxy import build_router as build_audio_router
from .deep_health import DeepHealth from .deep_health import DeepHealth
from .disk import delete_from_disk, probe_disk from .discovery import build_menu, infer_recipe, repo_to_key
from .disk import delete_from_disk, probe_host, read_model_config
from .download import DownloadManager from .download import DownloadManager
from .llm_proxy import build_router as build_llm_router from .llm_proxy import build_router as build_llm_router
from .embeddings_proxy import build_router as build_embeddings_router from .embeddings_proxy import build_router as build_embeddings_router
@@ -25,7 +26,7 @@ from .health import check_kokoro, check_parakeet, check_vllm, check_embeddings,
from .matrix_bridge import MatrixBridgeManager from .matrix_bridge import MatrixBridgeManager
from .models import ModelDef, load_catalog from .models import ModelDef, load_catalog
from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager
from .overrides import add_custom, delete_custom, extract_knobs_from_args, load_overrides, set_knobs from .overrides import add_custom, delete_custom, load_overrides, set_knobs
from .services import docker_state, run_action, services_from_settings from .services import docker_state, run_action, services_from_settings
from .shellsafe import validate_container, validate_image, validate_repo from .shellsafe import validate_container, validate_image, validate_repo
from .speech_models import SpeechModelsManager from .speech_models import SpeechModelsManager
@@ -161,20 +162,65 @@ def _reload_catalog() -> None:
swap_manager.reload_catalog(catalog) swap_manager.reload_catalog(catalog)
def _recipe_summaries() -> list[dict]:
"""Known launch recipes (bundled + saved), for the download panel's autocomplete.
These are NOT the menu — the menu is what's on disk. This is just the set of
repos Spark Control already knows how to launch, so the download box can
suggest them by name without putting phantom cards on the dashboard."""
out = []
for m in catalog.models.values():
if m.repo:
out.append({"repo": m.repo, "display_name": m.display_name, "mode": m.mode})
return out
@app.get("/api/models") @app.get("/api/models")
async def get_models() -> dict: async def get_models() -> dict:
out_models: dict[str, dict] = {} """The model menu = what's actually downloaded on the Sparks (one scan per
for key, m in catalog.models.items(): Spark), each annotated with its launch recipe or flagged `needs_setup`.
d = m.model_dump()
# Always include effective knobs for the UI (defaults from base args + any overrides) Does SSH, so it's the slower of the model endpoints; the front-end calls it on
d["effective_knobs"] = {**extract_knobs_from_args(m.vllm_args), **(m.knobs or {})} load, after a swap/download/delete, and on a slow timer — not every poll."""
out_models[key] = d if not settings.configured:
return {"configured": False, "defaults": catalog.defaults.model_dump(), "models": {}, "recipes": []}
menu = await build_menu(settings, catalog)
return { return {
"configured": True,
"defaults": catalog.defaults.model_dump(), "defaults": catalog.defaults.model_dump(),
"models": out_models, "models": menu,
"recipes": _recipe_summaries(),
} }
@app.get("/api/models/suggest")
async def suggest_model(repo: str = Query(...)) -> dict:
"""Read a downloaded model's config.json + size and propose a launch recipe.
Prefills the 'set up this model' form for an on-disk model that has no recipe
yet. The operator confirms/edits, then POSTs it to /api/models to save."""
if not settings.configured:
raise HTTPException(503, "spark1 not configured")
try:
validate_repo(repo)
except ValueError as e:
raise HTTPException(400, str(e))
hosts = [(settings.spark1_host, settings.spark1_user)]
if settings.spark2_host:
hosts.append((settings.spark2_host, settings.spark2_user))
# Config from whichever Spark has it; size summed across the Sparks that do.
sizes = await asyncio.gather(*(probe_host(h, u, repo, settings) for h, u in hosts))
total = sum(r.size_bytes for r in sizes if r.on_disk)
on_hosts = sum(1 for r in sizes if r.on_disk)
config = None
for (h, u), r in zip(hosts, sizes):
if r.on_disk:
config = await read_model_config(h, u, repo, settings)
if config is not None:
break
return infer_recipe(repo, config or {}, total, on_hosts)
class KnobsBody(BaseModel): class KnobsBody(BaseModel):
knobs: dict knobs: dict
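The `total` and `on_hosts` values computed in `suggest_model` feed the solo/cluster heuristic. A minimal sketch of that decision, with the cutoff and rule restated from `discovery.py` and invented model sizes (`infer_mode` here is a standalone copy, not an import):

```python
SINGLE_SPARK_BYTES = 115 * 1000 ** 3  # same cutoff as discovery.py
GB = 1000 ** 3


def infer_mode(total_bytes: int, on_host_count: int) -> str:
    """Cluster if the weights sit on both Sparks or exceed the single-Spark cutoff."""
    if on_host_count >= 2 or total_bytes > SINGLE_SPARK_BYTES:
        return "cluster"
    return "solo"


# Sizes are summed across hosts, so a 35 GB model on both Sparks arrives as 70 GB.
cases = {
    "35 GB on one Spark": infer_mode(35 * GB, 1),
    "35 GB on both Sparks": infer_mode(70 * GB, 2),
    "130 GB on one Spark": infer_mode(130 * GB, 1),
    "exactly at the cutoff": infer_mode(SINGLE_SPARK_BYTES, 1),
}
for name, mode in cases.items():
    print(f"{name}: {mode}")
```

The comparison is strict, so a model exactly at the cutoff still prefills as solo; the operator can override either way in the setup form.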
@@ -238,71 +284,43 @@ async def del_model(key: str) -> dict:
return {"ok": True, "key": key} return {"ok": True, "key": key}
@app.get("/api/models/disk-status")
async def get_models_disk_status() -> dict:
"""Probe each catalog model's HF cache on the appropriate Spark(s) in parallel.
Result is keyed by model key: {on_disk, total_bytes, per_host:[{host,on_disk,size_bytes,error?}]}.
Designed to be called once on dashboard load; takes ~13s depending on Spark count.
"""
if not settings.configured:
return {"configured": False, "models": {}}
keys = list(catalog.models.keys())
statuses = await asyncio.gather(*(
probe_disk(
catalog.models[k].repo,
catalog.models[k].mode,
settings,
local_path=catalog.models[k].local_path,
)
for k in keys
), return_exceptions=True)
out: dict[str, dict] = {}
for k, s in zip(keys, statuses):
if isinstance(s, Exception):
out[k] = {"on_disk": False, "total_bytes": 0, "per_host": [], "error": str(s)}
continue
out[k] = {
"on_disk": s.on_disk,
"total_bytes": s.total_bytes,
"per_host": [
{"host": r.host, "on_disk": r.on_disk, "size_bytes": r.size_bytes, **({"error": r.error} if r.error else {})}
for r in s.per_host
],
}
return {"configured": True, "models": out}
@app.delete("/api/models/{key}/disk") @app.delete("/api/models/{key}/disk")
async def del_model_disk(key: str) -> dict: async def del_model_disk(key: str) -> dict:
"""Delete a model's weights from the Spark filesystem(s). The catalog entry stays. """Remove a model's weights from the Sparks — and thus from the menu, since the
menu IS the disk. Resolves the key against the live menu, so a discovered
model (no saved recipe) is deletable too.
Safety rails: Safety rails:
- Refuses a local/fine-tuned directory (hand-placed, not re-downloadable).
- Refuses if the model is currently loaded on vLLM. - Refuses if the model is currently loaded on vLLM.
- Refuses if a swap or download is in flight. - Refuses if a swap or this model's own download is in flight.
- Idempotent: if the cache dir is already gone on a host, that host reports 0 bytes freed. - Idempotent across both Sparks: an already-absent cache dir frees 0 bytes.
""" """
if key not in catalog.models: if not settings.configured:
raise HTTPException(503, "spark1 not configured")
menu = await build_menu(settings, catalog)
entry = menu.get(key)
if entry is None:
raise HTTPException(404, f"unknown model: {key}") raise HTTPException(404, f"unknown model: {key}")
m = catalog.models[key]
# Never rm a local fine-tune directory from the dashboard — it's irreplaceable # Never rm a local fine-tune directory from the dashboard — it's irreplaceable
# training output the user placed by hand, not a re-downloadable HF cache. # training output the user placed by hand, not a re-downloadable HF cache.
if m.local_path: if entry.get("local_path"):
raise HTTPException( raise HTTPException(
400, 400,
"this is a local model; its directory must be managed on the Spark, not deleted from here", "this is a local model; its directory must be managed on the Spark, not deleted from here",
) )
repo = entry["repo"]
# Refuse if currently loaded # Refuse if currently loaded
try: try:
vllm = await check_vllm(settings) vllm = await check_vllm(settings)
except Exception: except Exception:
vllm = {} vllm = {}
if vllm.get("ok") and vllm.get("current_model") == m.repo: if vllm.get("ok") and vllm.get("current_model") == repo:
raise HTTPException( raise HTTPException(
409, 409,
f"'{m.display_name}' is the currently loaded model. Switch to a different model first, then try again." f"'{entry['display_name']}' is the currently loaded model. Switch to a different model first, then try again."
) )
# Refuse if a swap is in flight # Refuse if a swap is in flight
@@ -312,10 +330,10 @@ async def del_model_disk(key: str) -> dict:
# Refuse if a download is in flight for this same repo (a different model's download is fine) # Refuse if a download is in flight for this same repo (a different model's download is fine)
if download_manager.current_job_id: if download_manager.current_job_id:
job = download_manager.get(download_manager.current_job_id) job = download_manager.get(download_manager.current_job_id)
if job and job.repo == m.repo: if job and job.repo == repo:
raise HTTPException(409, "this model is currently downloading; cancel or wait for it to finish") raise HTTPException(409, "this model is currently downloading; cancel or wait for it to finish")
status = await delete_from_disk(m.repo, m.mode, settings) status = await delete_from_disk(repo, settings)
# Audit log # Audit log
record_report( record_report(
f"disk:{key}", f"disk:{key}",
@@ -326,7 +344,7 @@ async def del_model_disk(key: str) -> dict:
return { return {
"ok": True, "ok": True,
"key": key, "key": key,
"repo": m.repo, "repo": repo,
"bytes_freed": status.total_bytes, "bytes_freed": status.total_bytes,
"per_host": [ "per_host": [
{"host": r.host, "size_bytes": r.size_bytes, **({"error": r.error} if r.error else {})} {"host": r.host, "size_bytes": r.size_bytes, **({"error": r.error} if r.error else {})}
@@ -881,10 +899,13 @@ async def get_status() -> dict:
def _identify_current_model(repo: str | None) -> str | None: def _identify_current_model(repo: str | None) -> str | None:
if not repo: if not repo:
return None return None
# A recipe-backed model keys by its recipe key; a discovered model (loaded but
# not yet set up) keys by the same slug build_menu uses, so it still
# highlights as the active card.
for key, m in catalog.models.items(): for key, m in catalog.models.items():
if m.repo == repo: if m.repo == repo:
return key return key
return None return repo_to_key(repo)
class SwapRequest(BaseModel): class SwapRequest(BaseModel):
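The key-resolution change in `_identify_current_model` can be sketched in isolation. `RECIPES` is a hypothetical one-entry recipe table (recipe key to repo) and `identify_current_model` is a standalone restatement; only the slug rule is taken from the diff.

```python
import re
from typing import Optional

# Hypothetical recipe table: recipe key -> repo.
RECIPES = {"qwen-fast": "org/Qwen-Fast"}


def repo_to_key(repo: str) -> str:
    # Same slug rule as discovery.repo_to_key.
    return re.sub(r"[^a-z0-9_-]+", "-", repo.lower()).strip("-")


def identify_current_model(repo: Optional[str]) -> Optional[str]:
    if not repo:
        return None
    for key, recipe_repo in RECIPES.items():
        if recipe_repo == repo:
            return key  # recipe-backed model: use its recipe key
    # Discovered model with no recipe: fall back to the menu's slug so the
    # card built by the disk scan still matches the loaded model.
    return repo_to_key(repo)


known = identify_current_model("org/Qwen-Fast")
discovered = identify_current_model("RedHatAI/Qwen3.6-35B-A3B-NVFP4")
nothing = identify_current_model(None)
print(known, discovered, nothing)
```

Before this change the fallback returned `None`, so a loaded model without a saved recipe could not be highlighted as the active card.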
+158 -121
@@ -19,8 +19,8 @@ const state = {
configured: true, configured: true,
timer_handle: null, timer_handle: null,
deep_health: {}, deep_health: {},
disk_status: {}, // keyed by model key: { on_disk, total_bytes, per_host } models_loaded: false, // true once the first disk scan (/api/models) returns
disk_status_loaded: false, recipes: [], // known launch recipes (for the download autocomplete)
lock: { held: false }, // GPU swap reservation (coordination layer) lock: { held: false }, // GPU swap reservation (coordination layer)
schedules: [], // schedules external automation has registered schedules: [], // schedules external automation has registered
}; };
@@ -65,67 +65,69 @@ function renderCards() {
const lockTip = locked const lockTip = locked
? `Reserved by ${state.lock.holder || 'automation'}${state.lock.expires_at ? ' until ' + fmtClock(state.lock.expires_at) : ''}` ? `Reserved by ${state.lock.holder || 'automation'}${state.lock.expires_at ? ' until ' + fmtClock(state.lock.expires_at) : ''}`
: ''; : '';
for (const key of Object.keys(state.models)) { const keys = Object.keys(state.models);
if (keys.length === 0) {
// The menu is the disk: nothing downloaded (or the scan hasn't returned yet).
root.innerHTML = state.models_loaded
? `<div class="empty-menu muted">No models downloaded on the Sparks yet. Use <strong>+ Download a new model</strong> above to fetch one — it'll appear here when it's done.</div>`
: `<div class="empty-menu muted">Scanning the Sparks for downloaded models…</div>`;
return;
}
for (const key of keys) {
const m = state.models[key]; const m = state.models[key];
const isActive = key === state.current_model_key; const isActive = key === state.current_model_key;
const card = document.createElement('div'); const card = document.createElement('div');
card.className = 'card' + (isActive ? ' active' : ''); card.className = 'card' + (isActive ? ' active' : '') + (m.needs_setup ? ' needs-setup' : '');
const desc = m.description const desc = m.description
? `<div class="desc">${escapeHtml(m.description)}</div>` ? `<div class="desc">${escapeHtml(m.description)}</div>`
: ''; : '';
const customPill = m.custom ? `<span class="tag custom-pill">custom</span>` : ''; const customPill = m.custom ? `<span class="tag custom-pill">custom</span>` : '';
const localPill = m.local_path ? `<span class="tag local-pill" title="Served from a directory on the Spark, not Hugging Face">local</span>` : ''; const localPill = m.local_path ? `<span class="tag local-pill" title="Served from a directory on the Spark, not Hugging Face">local</span>` : '';
// Disk-presence pill + trash button. Until /api/models/disk-status comes back, // Every card on the menu is on disk by definition — show its real size.
// we don't know — render a neutral placeholder. const gb = (m.total_bytes || 0) / 1e9;
const disk = state.disk_status[key]; const diskPill = gb > 0
let diskPill = ''; ? `<span class="tag on-disk" title="Weights present on the Spark(s)">on disk · ${gb.toFixed(1)} GB</span>`
if (state.disk_status_loaded) { : '';
if (disk && disk.on_disk) { const setupPill = m.needs_setup
const gb = (disk.total_bytes / 1e9); ? `<span class="tag setup-pill" title="On disk, but Spark Control hasn't been told how to launch it">needs setup</span>`
diskPill = `<span class="tag on-disk" title="Weights present on disk">on disk · ${gb.toFixed(1)} GB</span>`; : '';
} else { // Trash = remove weights from disk AND from the menu. Disabled if active / mid-swap.
diskPill = `<span class="tag not-on-disk" title="Weights not downloaded">not downloaded</span>`;
}
}
// Trash button — hidden if not on disk; disabled (with tooltip) if currently loaded.
// Never offered for local models: their directory is hand-placed training output, // Never offered for local models: their directory is hand-placed training output,
// not a re-downloadable HF cache (the server refuses the delete too). // not a re-downloadable HF cache (the server refuses the delete too).
let trashBtn = ''; let trashBtn = '';
if (state.disk_status_loaded && disk && disk.on_disk && !m.local_path) { if (!m.local_path) {
const disabled = isActive || isSwapping; const disabled = isActive || isSwapping;
const tip = isActive const tip = isActive
? 'Currently loaded — switch to another model first' ? 'Currently loaded — switch to another model first'
: isSwapping : isSwapping
? 'A swap is in progress' ? 'A swap is in progress'
: 'Delete weights from disk'; : 'Remove weights from disk & menu';
trashBtn = `<button class="icon-btn danger" data-disk-del-key="${key}" title="${escapeHtml(tip)}" aria-label="Delete from disk" ${disabled ? 'disabled' : ''}>${trashIcon}</button>`; trashBtn = `<button class="icon-btn danger" data-disk-del-key="${key}" title="${escapeHtml(tip)}" aria-label="Remove from disk and menu" ${disabled ? 'disabled' : ''}>${trashIcon}</button>`;
} }
// Primary card action: "Switch to this" (green) when on disk; "Download" (blue) when not. // Primary action: "Current" / "Switch to this", or "Set up & switch" for a
// Before disk-status loads we render the swap button as a sensible default. // model on disk that has no launch recipe yet.
const isOnDisk = !state.disk_status_loaded || (disk && disk.on_disk); const swapBlocked = isSwapping || locked;
const dlInFlight = !!(typeof dlState !== 'undefined' && dlState && dlState.job_id); const lockTipAttr = locked ? ` title="${escapeHtml(lockTip)}"` : '';
let primaryBtn = ''; let primaryBtn = '';
if (isActive) { if (isActive) {
primaryBtn = `<button class="btn" disabled>Current</button>`; primaryBtn = `<button class="btn" disabled>Current</button>`;
} else if (isOnDisk) { } else if (m.needs_setup) {
const swapBlocked = isSwapping || locked; primaryBtn = `<button class="btn primary" data-setup-key="${key}"${lockTipAttr} ${swapBlocked ? 'disabled' : ''}>Set up &amp; switch</button>`;
const tip = locked ? ` title="${escapeHtml(lockTip)}"` : '';
primaryBtn = `<button class="btn primary" data-swap-key="${key}"${tip} ${swapBlocked ? 'disabled' : ''}>Switch to this</button>`;
} else if (m.local_path) {
// A local model can't be "downloaded" — its directory has to exist on the Spark.
primaryBtn = `<button class="btn" disabled title="Directory not found on the Spark — create it there, then refresh">Not found on Spark</button>`;
} else { } else {
const tip = dlInFlight ? 'A download is already in progress' : 'Download weights to the Spark(s)'; primaryBtn = `<button class="btn primary" data-swap-key="${key}"${lockTipAttr} ${swapBlocked ? 'disabled' : ''}>Switch to this</button>`;
primaryBtn = `<button class="btn info" data-download-key="${key}" title="${escapeHtml(tip)}" ${dlInFlight ? 'disabled' : ''}>Download</button>`;
} }
// The Test/Advanced controls need a saved recipe; hide them until setup is done.
const recipeActions = m.needs_setup ? '' : `
<button class="btn test-btn" data-test-key="${key}" title="Pre-flight check the launch command without starting the engine">Test</button>
<button class="btn adv-btn" data-adv-key="${key}" title="Advanced settings">Advanced</button>`;
card.innerHTML = ` card.innerHTML = `
<div class="name">${escapeHtml(m.display_name)}</div> <div class="name">${escapeHtml(m.display_name)}</div>
<div class="meta"> <div class="meta">
<span class="tag mode-${m.mode}">${m.mode}</span> <span class="tag mode-${m.mode}">${m.mode}</span>
<span class="tag">${m.size_gb} GB</span> ${diskPill}
${setupPill}
${customPill} ${customPill}
${localPill} ${localPill}
${diskPill}
${(m.capabilities || []).map(c => `<span class="tag cap">${escapeHtml(c)}</span>`).join('')} ${(m.capabilities || []).map(c => `<span class="tag cap">${escapeHtml(c)}</span>`).join('')}
</div> </div>
${desc} ${desc}
@@ -136,9 +138,7 @@ function renderCards() {
</div> </div>
<div class="spacer"></div> <div class="spacer"></div>
<div class="card-actions"> <div class="card-actions">
${primaryBtn} ${primaryBtn}${recipeActions}
<button class="btn test-btn" data-test-key="${key}" title="Pre-flight check the launch command without starting the engine">Test</button>
<button class="btn adv-btn" data-adv-key="${key}" title="Advanced settings">Advanced</button>
${trashBtn} ${trashBtn}
</div> </div>
<div class="test-result hidden" data-test-result-for="${key}"></div> <div class="test-result hidden" data-test-result-for="${key}"></div>
@@ -148,8 +148,8 @@ function renderCards() {
for (const btn of root.querySelectorAll('[data-swap-key]')) { for (const btn of root.querySelectorAll('[data-swap-key]')) {
btn.addEventListener('click', () => triggerSwap(btn.dataset.swapKey)); btn.addEventListener('click', () => triggerSwap(btn.dataset.swapKey));
} }
for (const btn of root.querySelectorAll('[data-download-key]')) { for (const btn of root.querySelectorAll('[data-setup-key]')) {
btn.addEventListener('click', () => triggerDownloadForKey(btn.dataset.downloadKey)); btn.addEventListener('click', () => openSetupForKey(btn.dataset.setupKey));
} }
for (const btn of root.querySelectorAll('[data-adv-key]')) { for (const btn of root.querySelectorAll('[data-adv-key]')) {
btn.addEventListener('click', () => openAdvanced(btn.dataset.advKey)); btn.addEventListener('click', () => openAdvanced(btn.dataset.advKey));
@@ -1170,24 +1170,44 @@ async function pollStatus() {
} }
} }
let menuLoadInFlight = false;
async function loadModels() { async function loadModels() {
const data = await fetchJSON('/api/models'); // The menu is whatever's downloaded on the Sparks — /api/models does the scan
state.defaults = data.defaults || {}; // (SSH), so this is the slower model call. Best-effort: a transient failure
state.models = data.models || {}; // leaves the previous menu in place rather than blanking the dashboard.
// Guard against overlap: init() fires this un-awaited and pollStatus()'s
// empty-menu fallback may call it again before the scan returns.
if (menuLoadInFlight) return;
menuLoadInFlight = true;
try {
const data = await fetchJSON('/api/models');
state.defaults = data.defaults || {};
state.models = data.models || {};
state.recipes = data.recipes || [];
state.models_loaded = true;
populateDownloadSuggestions();
renderCards();
} catch (e) {
console.warn('model menu load failed:', e.message);
} finally {
menuLoadInFlight = false;
}
} }
async function loadDiskStatus() { // Populate the download box's autocomplete with known recipes not currently on
// Probes each catalog model's HF cache over SSH; takes a beat. Best-effort. // disk — so common/bundled models stay discoverable without phantom menu cards.
try { function populateDownloadSuggestions() {
const r = await fetchJSON('/api/models/disk-status'); const dl = el('#dl-suggestions');
if (r && r.models) { if (!dl) return;
state.disk_status = r.models; const onDiskRepos = new Set(Object.values(state.models).map(m => m.repo).filter(Boolean));
state.disk_status_loaded = true; dl.innerHTML = '';
renderCards(); for (const r of state.recipes || []) {
} if (onDiskRepos.has(r.repo)) continue;
} catch (e) { const opt = document.createElement('option');
// Silent — pills just won't render. Don't block dashboard. opt.value = r.repo;
console.warn('disk-status probe failed:', e.message); opt.label = `${r.display_name} (${r.mode})`;
dl.appendChild(opt);
} }
} }
@@ -1201,14 +1221,12 @@ function fmtBytesShort(n) {
function openDiskDeleteDialog(key) { function openDiskDeleteDialog(key) {
const m = state.models[key]; const m = state.models[key];
const disk = state.disk_status[key]; if (!m || !m.on_disk) return;
if (!m || !disk || !disk.on_disk) return;
const dlg = el('#disk-delete-dialog'); const dlg = el('#disk-delete-dialog');
el('#dd-summary').innerHTML = `Free <strong>${fmtBytesShort(disk.total_bytes)}</strong> by removing <strong>${escapeHtml(m.display_name)}</strong> (<code>${escapeHtml(m.repo)}</code>) from disk.`; el('#dd-summary').innerHTML = `Free <strong>${fmtBytesShort(m.total_bytes)}</strong> by removing <strong>${escapeHtml(m.display_name)}</strong> (<code>${escapeHtml(m.repo)}</code>) from the Sparks. This also takes it off the menu.`;
const hostsEl = el('#dd-hosts'); const hostsEl = el('#dd-hosts');
hostsEl.innerHTML = ''; hostsEl.innerHTML = '';
for (const h of (disk.per_host || [])) { for (const h of (m.per_host || [])) {
if (!h.on_disk) continue;
const li = document.createElement('li'); const li = document.createElement('li');
li.innerHTML = `<code>${escapeHtml(h.host)}</code> — ${fmtBytesShort(h.size_bytes)}`; li.innerHTML = `<code>${escapeHtml(h.host)}</code> — ${fmtBytesShort(h.size_bytes)}`;
hostsEl.appendChild(li); hostsEl.appendChild(li);
@@ -1227,20 +1245,19 @@ function openDiskDeleteDialog(key) {
try { try {
const r = await fetchJSON(`/api/models/${encodeURIComponent(key)}/disk`, { method: 'DELETE' }); const r = await fetchJSON(`/api/models/${encodeURIComponent(key)}/disk`, { method: 'DELETE' });
dlg.close(); dlg.close();
// Optimistically clear local disk state for this key, then refresh. // Optimistically drop the card, then re-scan the menu (it's gone from disk).
delete state.disk_status[key]; delete state.models[key];
renderCards(); renderCards();
// Eagerly re-probe so size is accurate (and shows "not downloaded" pill). await loadModels();
loadDiskStatus();
const freed = r && typeof r.bytes_freed === 'number' ? fmtBytesShort(r.bytes_freed) : ''; const freed = r && typeof r.bytes_freed === 'number' ? fmtBytesShort(r.bytes_freed) : '';
console.log(`Deleted ${m.display_name} from disk${freed ? ` — freed ${freed}` : ''}.`); console.log(`Removed ${m.display_name} from disk${freed ? ` — freed ${freed}` : ''}.`);
} catch (e) { } catch (e) {
errEl.textContent = e.message || 'Delete failed'; errEl.textContent = e.message || 'Delete failed';
errEl.classList.remove('hidden'); errEl.classList.remove('hidden');
} finally { } finally {
confirm.disabled = false; confirm.disabled = false;
cancel.disabled = false; cancel.disabled = false;
confirm.textContent = 'Delete from disk'; confirm.textContent = 'Remove from disk & menu';
} }
}; };
cancel.onclick = onCancel; cancel.onclick = onCancel;
@@ -1341,38 +1358,6 @@ async function releaseLock() {
pollCoordination(); pollCoordination();
} }
async function triggerDownloadForKey(modelKey) {
const m = state.models[modelKey];
if (!m) return;
if (dlState.job_id) {
alert('A download is already in progress; wait for it to finish.');
return;
}
// Pick the download target from the model's mode:
// solo -> spark1 only
// cluster -> both Sparks (fetch on Spark 1, rsync to Spark 2 in parallel)
const dlMode = m.mode === 'cluster' ? 'cluster' : 'spark1';
const sizeNote = m.size_gb ? ` (~${m.size_gb} GB)` : '';
const target = m.mode === 'cluster' ? 'both Sparks' : 'Spark 1';
if (!confirm(`Download "${m.display_name}"${sizeNote} to ${target}? Large models can take a while; you can watch progress in the download panel.`)) {
return;
}
dlState.last_repo = m.repo;
dlState.last_mode = dlMode;
try {
const r = await fetchJSON('/api/download', {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({ repo: m.repo, mode: dlMode }),
});
// Open the download panel + attach to progress stream
openDownloadForm();
attachToDownload(r.job_id);
} catch (e) {
alert('Failed to start download: ' + e.message);
}
}
async function attachToSwap(jobId, needsBackfill) { async function attachToSwap(jobId, needsBackfill) {
if (state.swap_eventsource) { if (state.swap_eventsource) {
state.swap_eventsource.close(); state.swap_eventsource.close();
@@ -1603,12 +1588,14 @@ function handleDownloadDone(d) {
el('#dl-title').textContent = 'Done'; el('#dl-title').textContent = 'Done';
el('#dl-phase').textContent = 'Done ✓'; el('#dl-phase').textContent = 'Done ✓';
el('#dl-progress-fill').style.width = '100%'; el('#dl-progress-fill').style.width = '100%';
// Offer to add to catalog // The new model now appears on the menu (the menu is the disk). If it matched
// a known recipe it's ready to switch to; if not, offer to set it up.
const repo = dlState.last_repo; const repo = dlState.last_repo;
const mode = dlState.last_mode; loadModels().then(() => {
if (repo) { if (!repo) return;
setTimeout(() => openCatalogDialog(repo, mode), 600); const entry = Object.values(state.models).find(m => m.repo === repo);
} if (entry && entry.needs_setup) setTimeout(() => openSetupDialog(repo, { thenSwap: false }), 600);
});
} }
dlState.job_id = null; dlState.job_id = null;
} }
@@ -1721,21 +1708,67 @@ function openAdvanced(key) {
dlg.showModal(); dlg.showModal();
} }
function openCatalogDialog(repo, mode) { // Context carried from openSetupDialog -> the submit handler: the inferred
// launch flags (parsers/MoE backend) and whether to swap right after saving.
let setupCtx = { key: '', repo: '', vllm_args: [], thenSwap: false };
// "Set up & switch" on a needs-setup card.
async function openSetupForKey(key) {
const m = state.models[key];
if (!m) return;
if (state.lock && state.lock.held) {
const until = state.lock.expires_at ? ' until ' + fmtClock(state.lock.expires_at) : '';
alert(`The GPU swap path is reserved by ${state.lock.holder || 'automation'}${until}. Use "Release" on the reservation banner to override.`);
return;
}
await openSetupDialog(m.repo, { thenSwap: true });
}
// Open the "set up this model" dialog, prefilled from inference (config.json +
// size). The operator confirms once; on save the recipe persists and (if
// thenSwap) we switch to it.
async function openSetupDialog(repo, opts = {}) {
const dlg = el('#catalog-dialog'); const dlg = el('#catalog-dialog');
const key = repo.split('/').pop().toLowerCase().replace(/[^a-z0-9_-]/g, '-'); let sug = null;
el('#cd-key').value = key; try {
el('#cd-name').value = repo.split('/').pop(); sug = await fetchJSON(`/api/models/suggest?repo=${encodeURIComponent(repo)}`);
} catch (e) {
console.warn('recipe suggestion failed:', e.message);
}
const fallbackKey = repo.toLowerCase().replace(/[^a-z0-9_-]+/g, '-').replace(/^-+|-+$/g, '');
setupCtx = {
key: (sug && sug.key) || fallbackKey,
repo,
vllm_args: (sug && sug.vllm_args) || [],
thenSwap: !!opts.thenSwap,
};
el('#cd-key').value = setupCtx.key;
el('#cd-name').value = (sug && sug.display_name) || repo.split('/').pop();
el('#cd-repo').value = repo; el('#cd-repo').value = repo;
el('#cd-size').value = ''; el('#cd-size').value = '';
el('#cd-mode').value = mode || 'solo'; el('#cd-mode').value = (sug && sug.mode) || 'solo';
el('#cd-desc').value = ''; el('#cd-desc').value = '';
el('#cd-mml').value = 32768; const knobs = (sug && sug.knobs) || {};
el('#cd-gmu').value = 0.85; el('#cd-mml').value = knobs.max_model_len || 32768;
el('#cd-gmu-out').value = '0.85'; el('#cd-gmu').value = knobs.gpu_memory_utilization || 0.85;
el('#cd-fst').checked = true; el('#cd-gmu-out').value = parseFloat(el('#cd-gmu').value).toFixed(2);
el('#cd-pcache').checked = true; el('#cd-fst').checked = knobs.fastsafetensors !== false;
el('#cd-fp8').checked = true; el('#cd-pcache').checked = knobs.prefix_caching !== false;
el('#cd-fp8').checked = (knobs.kv_cache_dtype || 'fp8') === 'fp8';
const det = el('#cd-detected');
if (det) {
if (sug) {
const caps = (sug.capabilities || []).join(', ');
const flags = setupCtx.vllm_args.length ? `: <code>${escapeHtml(setupCtx.vllm_args.join(' '))}</code>` : '';
det.innerHTML = `Detected <strong>${escapeHtml(sug.family || 'Generic')}</strong>${caps ? ` · ${escapeHtml(caps)}` : ''}. Launch flags set automatically${flags}.`;
} else {
det.textContent = "Couldn't auto-detect this model's settings — pick mode and knobs manually.";
}
det.classList.remove('hidden');
}
const submit = el('#cd-submit');
if (submit) submit.textContent = setupCtx.thenSwap ? 'Save & switch' : 'Save settings';
dlg.showModal(); dlg.showModal();
} }
@@ -1745,13 +1778,15 @@ function setupCatalogDialog() {
el('#catalog-form').addEventListener('submit', async (e) => { el('#catalog-form').addEventListener('submit', async (e) => {
e.preventDefault(); e.preventDefault();
const body = { const body = {
key: el('#cd-key').value.trim(), key: el('#cd-key').value.trim() || setupCtx.key,
display_name: el('#cd-name').value.trim(), display_name: el('#cd-name').value.trim(),
repo: el('#cd-repo').value.trim(), repo: el('#cd-repo').value.trim(),
size_gb: parseFloat(el('#cd-size').value) || 0, size_gb: parseFloat(el('#cd-size').value) || 0,
mode: el('#cd-mode').value, mode: el('#cd-mode').value,
description: el('#cd-desc').value.trim() || null, description: el('#cd-desc').value.trim() || null,
vllm_args: [], // The inferred family flags (parsers / MoE backend); knob-controlled flags
// are layered on by the server from `knobs`, so no duplication.
vllm_args: setupCtx.vllm_args || [],
knobs: { knobs: {
max_model_len: parseInt(el('#cd-mml').value, 10) || 32768, max_model_len: parseInt(el('#cd-mml').value, 10) || 32768,
gpu_memory_utilization: parseFloat(el('#cd-gmu').value), gpu_memory_utilization: parseFloat(el('#cd-gmu').value),
@@ -1769,8 +1804,9 @@ function setupCatalogDialog() {
el('#catalog-dialog').close(); el('#catalog-dialog').close();
closeDownloadPanel(); closeDownloadPanel();
await loadModels(); await loadModels();
if (setupCtx.thenSwap) triggerSwap(body.key);
pollStatus(); pollStatus();
} catch (e) { alert('Add to catalog failed: ' + e.message); } } catch (e) { alert('Saving the model setup failed: ' + e.message); }
}); });
} }
@@ -2212,21 +2248,22 @@ async function init() {
} catch {} } catch {}
setupDashboardTabs(); setupDashboardTabs();
setupEndpointCollapse(); setupEndpointCollapse();
await loadModels(); // Fire the (SSH-backed) menu scan without awaiting — it self-renders a
// "Scanning…" state and fills in when it returns, so a slow/unreachable
// cluster never blocks first paint. pollStatus() below paints the rest.
loadModels();
await pollStatus(); await pollStatus();
await renderServices(); await renderServices();
pollCoordination(); pollCoordination();
pollHardware(); pollHardware();
pollUpdates(); pollUpdates();
// Disk-status probe runs after first paint — slow over SSH and not blocking.
loadDiskStatus();
// Speech-model patches panel — slow over SSH, runs after first paint. // Speech-model patches panel — slow over SSH, runs after first paint.
renderSpeechModels(); renderSpeechModels();
setInterval(pollStatus, 5000); setInterval(pollStatus, 5000);
setInterval(pollCoordination, 5000); // swap lock + schedule registry setInterval(pollCoordination, 5000); // swap lock + schedule registry
setInterval(pollHardware, 8000); // every 8s setInterval(pollHardware, 8000); // every 8s
setInterval(pollUpdates, 300000); // every 5 min setInterval(pollUpdates, 300000); // every 5 min
setInterval(loadDiskStatus, 60000); // every 60s — disk state changes rarely setInterval(loadModels, 60000); // every 60s — re-scan the Sparks for added/removed models
setInterval(renderSpeechModels, 120000); // every 2 min — patches change rarely setInterval(renderSpeechModels, 120000); // every 2 min — patches change rarely
} }
+11 -9
@@ -241,9 +241,10 @@
<dialog id="catalog-dialog" class="modal"> <dialog id="catalog-dialog" class="modal">
<form method="dialog" class="modal-form" id="catalog-form"> <form method="dialog" class="modal-form" id="catalog-form">
<h3>Add downloaded model to catalog</h3> <h3>Set up this model</h3>
<p class="muted small">It will appear as a new card you can swap to. Knob values become its default launch flags — you can tweak later via the model's "Advanced" panel.</p> <p class="muted small">This model is downloaded, but Spark Control needs to know how to launch it. We've guessed from the model's own files — confirm or adjust, and it's saved so you're never asked again.</p>
<label class="modal-row"><span>Key (URL-safe id)</span><input type="text" id="cd-key" required pattern="[a-zA-Z0-9_-]+"></label> <p id="cd-detected" class="muted small cd-detected hidden"></p>
<label class="modal-row"><span>Key (URL-safe id)</span><input type="text" id="cd-key" required pattern="[a-zA-Z0-9_-]+" readonly></label>
<label class="modal-row"><span>Display name</span><input type="text" id="cd-name" required></label> <label class="modal-row"><span>Display name</span><input type="text" id="cd-name" required></label>
<label class="modal-row"><span>Repo (read-only)</span><input type="text" id="cd-repo" readonly></label> <label class="modal-row"><span>Repo (read-only)</span><input type="text" id="cd-repo" readonly></label>
<label class="modal-row"><span>Size (GB)</span><input type="number" id="cd-size" step="0.1" min="0"></label> <label class="modal-row"><span>Size (GB)</span><input type="number" id="cd-size" step="0.1" min="0"></label>
@@ -264,7 +265,7 @@
</fieldset> </fieldset>
<div class="modal-actions"> <div class="modal-actions">
<button type="button" id="cd-cancel" class="btn">Cancel</button> <button type="button" id="cd-cancel" class="btn">Cancel</button>
<button type="submit" class="btn primary">Add to catalog</button> <button type="submit" id="cd-submit" class="btn primary">Save settings</button>
</div> </div>
</form> </form>
</dialog> </dialog>
@@ -302,14 +303,14 @@
<dialog id="disk-delete-dialog" class="modal"> <dialog id="disk-delete-dialog" class="modal">
<form method="dialog" class="modal-form"> <form method="dialog" class="modal-form">
<h3>Delete model weights from disk?</h3> <h3>Remove this model from the Sparks?</h3>
<p id="dd-summary" class="muted small"></p> <p id="dd-summary" class="muted small"></p>
<ul class="muted small dd-hosts" id="dd-hosts"></ul> <ul class="muted small dd-hosts" id="dd-hosts"></ul>
<p class="muted small">This is reversible — you can re-download from the catalog at any time. The catalog entry stays intact.</p> <p class="muted small">This deletes the weights and removes the card from the menu. You can always download it again later (re-downloading restores its saved settings).</p>
<p id="dd-error" class="muted small dd-error hidden"></p> <p id="dd-error" class="muted small dd-error hidden"></p>
<div class="modal-actions"> <div class="modal-actions">
<button type="button" id="dd-cancel" class="btn">Cancel</button> <button type="button" id="dd-cancel" class="btn">Cancel</button>
<button type="button" id="dd-confirm" class="btn danger">Delete from disk</button> <button type="button" id="dd-confirm" class="btn danger">Remove from disk &amp; menu</button>
</div> </div>
</form> </form>
</dialog> </dialog>
@@ -354,11 +355,12 @@
<div class="download-form" id="download-form"> <div class="download-form" id="download-form">
<label class="dl-row"> <label class="dl-row">
<span class="dl-label">HuggingFace repo</span> <span class="dl-label">HuggingFace repo</span>
<input type="text" id="dl-repo" placeholder="e.g. RedHatAI/Qwen3.6-35B-A3B-NVFP4" autocomplete="off"> <input type="text" id="dl-repo" placeholder="e.g. RedHatAI/Qwen3.6-35B-A3B-NVFP4" autocomplete="off" list="dl-suggestions">
<datalist id="dl-suggestions"></datalist>
<a id="dl-hf-link" class="dl-hf-link hidden" href="#" target="_blank" rel="noopener" title="Open on Hugging Face"></a> <a id="dl-hf-link" class="dl-hf-link hidden" href="#" target="_blank" rel="noopener" title="Open on Hugging Face"></a>
</label> </label>
<div class="dl-help muted small"> <div class="dl-help muted small">
<a href="https://huggingface.co/models?other=vllm" target="_blank" rel="noopener">Browse vLLM-compatible models</a> Type any repo, or pick a known one from the list. <a href="https://huggingface.co/models?other=vllm" target="_blank" rel="noopener">Browse vLLM-compatible models</a>
· NVFP4-quantized models (e.g. <code>RedHatAI/...</code>) are best for Blackwell hardware · NVFP4-quantized models (e.g. <code>RedHatAI/...</code>) are best for Blackwell hardware
</div> </div>
<div class="dl-row"> <div class="dl-row">
+6
@@ -778,6 +778,12 @@ main {
.card .local-pill { color: var(--warn); border-color: rgba(245, 158, 11, 0.4); } .card .local-pill { color: var(--warn); border-color: rgba(245, 158, 11, 0.4); }
.tag.on-disk { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); } .tag.on-disk { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.tag.not-on-disk { color: var(--muted); border-color: var(--border); opacity: 0.7; } .tag.not-on-disk { color: var(--muted); border-color: var(--border); opacity: 0.7; }
.tag.setup-pill { color: var(--warn); border-color: rgba(245, 158, 11, 0.4); }
.card.needs-setup { border-style: dashed; }
.card-actions .btn[data-setup-key] { flex: 1; }
.empty-menu { grid-column: 1 / -1; padding: 28px 16px; text-align: center; border: 1px dashed var(--border); border-radius: 10px; }
.cd-detected { padding: 8px 10px; border: 1px solid var(--border); border-radius: 8px; background: rgba(255,255,255,0.02); }
.cd-detected code { word-break: break-all; }
.card-actions .icon-btn.danger { color: var(--error); border-color: rgba(239, 68, 68, 0.3); margin-left: auto; } .card-actions .icon-btn.danger { color: var(--error); border-color: rgba(239, 68, 68, 0.3); margin-left: auto; }
.card-actions .icon-btn.danger:hover:not(:disabled) { background: rgba(239, 68, 68, 0.08); border-color: var(--error); color: var(--error); } .card-actions .icon-btn.danger:hover:not(:disabled) { background: rgba(239, 68, 68, 0.08); border-color: var(--error); color: var(--error); }
.card-actions .icon-btn.danger:disabled { opacity: 0.35; cursor: not-allowed; } .card-actions .icon-btn.danger:disabled { opacity: 0.35; cursor: not-allowed; }
+37 -37
@@ -1,9 +1,14 @@
# spark-control model catalog # spark-control launch recipes
# #
# Edit this file (or override at runtime via the StartOS "Edit Model Catalog" # These are NOT the dashboard menu. The menu is whatever is actually downloaded
# action) to add or change available models. # on the Sparks — Spark Control scans the Hugging Face cache on each load and
# shows what it finds. These entries are launch *recipes*: matched to an on-disk
# model by `repo`, they say HOW to launch it. A downloaded model with no recipe
# here shows up as "needs setup", and the dashboard infers + saves one on first
# use (from the model's own config.json). Add a recipe to make a known model
# launch correctly the moment it's downloaded, with no setup prompt.
# #
# Each model entry produces this command on Spark 1: # Each recipe produces this command on Spark 1:
# cd ~/spark-vllm-docker # cd ~/spark-vllm-docker
# ./launch-cluster.sh [--solo] -d exec vllm serve <repo> \ # ./launch-cluster.sh [--solo] -d exec vllm serve <repo> \
# --port=<defaults.port> --host=<defaults.host> <vllm_args...> # --port=<defaults.port> --host=<defaults.host> <vllm_args...>
@@ -54,6 +59,34 @@ models:
- --enable-prefix-caching - --enable-prefix-caching
- --kv-cache-dtype=fp8 - --kv-cache-dtype=fp8
gemma4-26b:
display_name: "Gemma 4 26B-A4B (vision, light)"
description: >-
Lighter, faster sibling of the Gemma 4 31B above: a Mixture-of-Experts
model with 26B total parameters but only ~4B active per token, so it
generates quickly. Takes images as well as text (good for tasks like
reading a business card into structured text). Reasoning is a bit
shallower than the dense 31B. Runs solo on one Spark.
repo: nvidia/Gemma-4-26B-A4B-NVFP4
size_gb: 17
mode: solo
capabilities: [vision, reasoning, tools]
expected_ready_seconds: 240
vllm_args:
- --gpu-memory-utilization=0.8
- --max-model-len=32768
- --max-num-batched-tokens=16384
- --reasoning-parser=gemma4
- --tool-call-parser=gemma4
- --enable-auto-tool-choice
# MoE backend: research found this model's expert layers fall back to
# 'marlin' on GB10 (the fast flashinfer_cutlass path errors on sm_121).
# If a swap fails to start, this flag is the first thing to flip.
- --moe_backend=marlin
- --load-format=fastsafetensors
- --enable-prefix-caching
- --kv-cache-dtype=fp8
qwen36: qwen36:
display_name: "Qwen3.6 35B-A3B (daily driver)" display_name: "Qwen3.6 35B-A3B (daily driver)"
description: >- description: >-
@@ -74,36 +107,3 @@ models:
- --load-format=fastsafetensors - --load-format=fastsafetensors
- --enable-prefix-caching - --enable-prefix-caching
- --kv-cache-dtype=fp8 - --kv-cache-dtype=fp8
qwen3-235b-fp8:
display_name: "Qwen3 235B-A22B FP8 (legacy)"
description: >-
Earlier generation of the Qwen 235B family in native FP8 precision.
Runs across both Sparks. Mostly superseded by Qwen3-VL above; keep
around for text-only baseline comparisons.
repo: Qwen/Qwen3-235B-A22B-FP8
size_gb: 220
mode: cluster
capabilities: []
expected_ready_seconds: 360
vllm_args:
- --gpu-memory-utilization=0.7
- -tp=2
- --distributed-executor-backend=ray
- --max-model-len=32768
qwen25-72b:
display_name: "Qwen2.5 72B (legacy)"
description: >-
Last-generation 72B dense model. Cluster mode required due to size.
Kept for compatibility and baseline comparison against newer Qwens.
repo: Qwen/Qwen2.5-72B-Instruct
size_gb: 145
mode: cluster
capabilities: []
expected_ready_seconds: 360
vllm_args:
- --gpu-memory-utilization=0.7
- -tp=2
- --distributed-executor-backend=ray
- --max-model-len=32768
+190
@@ -0,0 +1,190 @@
"""Disk-driven menu helpers: cache-dir parsing + launch-recipe inference.
All offline — pure functions over a fake cache listing and fake config.json
dicts. The SSH scan, the menu merge, and the suggest endpoint that wire these
together are exercised by hand against the live cluster (mock-heavy unit tests of
those would test the mocks).
"""
import asyncio
from app import discovery
from app.config import Settings
from app.disk import DiskStatus, cache_dirname_to_repo, parse_cache_listing
from app.discovery import repo_to_key, infer_recipe, _detect_family
from app.models import load_catalog
# ---- cache dirname <-> repo ----
def test_cache_dirname_to_repo_roundtrip():
assert cache_dirname_to_repo("models--RedHatAI--Qwen3.6-35B-A3B-NVFP4") == "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
def test_cache_dirname_name_with_double_dash():
# The org is the first segment; everything after is the name (single '/').
assert cache_dirname_to_repo("models--org--weird--name") == "org/weird--name"
def test_cache_dirname_rejects_non_model_dirs():
assert cache_dirname_to_repo("datasets--foo--bar") is None
assert cache_dirname_to_repo("models--onlyorg") is None
assert cache_dirname_to_repo("random") is None
# ---- parse_cache_listing ----
def test_parse_cache_listing_complete_and_incomplete():
out = (
"20000000000|1|models--RedHatAI--Qwen3.6-35B-A3B-NVFP4\n"
"5000000000|0|models--some--half-downloaded\n"
"\n"
"garbage line with no pipes\n"
"123|1|not-a-model-dir\n"
)
items = parse_cache_listing(out)
assert items == [
("RedHatAI/Qwen3.6-35B-A3B-NVFP4", 20000000000, True),
("some/half-downloaded", 5000000000, False),
]
def test_parse_cache_listing_bad_size_defaults_zero():
items = parse_cache_listing("notanumber|1|models--a--b")
assert items == [("a/b", 0, True)]
# ---- repo_to_key ----
def test_repo_to_key_is_url_safe_and_stable():
assert repo_to_key("RedHatAI/Qwen3.6-35B-A3B-NVFP4") == "redhatai-qwen3-6-35b-a3b-nvfp4"
# Deterministic: the same repo always maps to the same key, so it is a stable id.
assert repo_to_key("nvidia/Gemma-4-26B-A4B-NVFP4") == "nvidia-gemma-4-26b-a4b-nvfp4"
# ---- family detection ----
def test_detect_qwen3_moe():
cfg = {"architectures": ["Qwen3MoeForCausalLM"], "model_type": "qwen3_moe", "num_experts": 128}
label, flags, caps = _detect_family(cfg)
assert "--reasoning-parser=qwen3" in flags
assert "--moe_backend=flashinfer_cutlass" in flags
assert "reasoning" in caps
assert "MoE" in label
def test_detect_gemma_moe_uses_marlin():
cfg = {"architectures": ["Gemma4MoeForConditionalGeneration"], "model_type": "gemma4_moe", "num_local_experts": 8}
label, flags, caps = _detect_family(cfg)
assert "--reasoning-parser=gemma4" in flags
assert "--tool-call-parser=gemma4" in flags
assert "--moe_backend=marlin" in flags # NOT flashinfer_cutlass — GB10 footgun
assert "vision" in caps # ConditionalGeneration => multimodal
assert "tools" in caps
def test_detect_generic_has_no_family_flags():
label, flags, caps = _detect_family({"architectures": ["LlamaForCausalLM"], "model_type": "llama"})
assert flags == []
assert label == "Generic"
def test_detect_vision_from_config_keys():
_, _, caps = _detect_family({"model_type": "qwen3", "vision_config": {"x": 1}})
assert "vision" in caps
# ---- infer_recipe (the prefill the setup form receives) ----
def test_infer_recipe_solo_small_model():
cfg = {"architectures": ["Qwen3ForCausalLM"], "model_type": "qwen3"}
rec = infer_recipe("RedHatAI/Qwen3.6-35B-A3B-NVFP4", cfg, total_bytes=20_000_000_000, on_host_count=1)
assert rec["mode"] == "solo"
assert rec["key"] == "redhatai-qwen3-6-35b-a3b-nvfp4"
assert rec["repo"] == "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
assert "--reasoning-parser=qwen3" in rec["vllm_args"]
assert "-tp=2" not in rec["vllm_args"]
assert rec["knobs"]["kv_cache_dtype"] == "fp8"
def test_infer_recipe_cluster_when_on_both_hosts():
rec = infer_recipe("org/big", {}, total_bytes=10_000_000_000, on_host_count=2)
assert rec["mode"] == "cluster"
assert "-tp=2" in rec["vllm_args"]
assert "--distributed-executor-backend=ray" in rec["vllm_args"]
assert rec["knobs"]["gpu_memory_utilization"] == 0.7
def test_infer_recipe_cluster_when_too_big_for_one_spark():
rec = infer_recipe("org/huge", {}, total_bytes=200_000_000_000, on_host_count=1)
assert rec["mode"] == "cluster"
# ---- build_menu merge (disk scan + recipes) ----
def _both_spark_settings(monkeypatch) -> Settings:
for k in ("SPARK1_HOST", "SPARK1_USER", "SPARK2_HOST", "SPARK2_USER"):
monkeypatch.delenv(k, raising=False)
monkeypatch.setenv("SPARK1_HOST", "1.1.1.1")
monkeypatch.setenv("SPARK1_USER", "u")
monkeypatch.setenv("SPARK2_HOST", "2.2.2.2")
monkeypatch.setenv("SPARK2_USER", "u")
return Settings.from_env()
def test_build_menu_merges_recipe_discovered_and_hides_incomplete(monkeypatch):
cat = load_catalog("models.yaml") # bundled recipes incl. qwen36 + gemma4
settings = _both_spark_settings(monkeypatch)
async def fake_list(host, user, s):
if host == "1.1.1.1":
return [
("RedHatAI/Qwen3.6-35B-A3B-NVFP4", 20_000_000_000, True), # recipe match
("someorg/mystery-7B", 7_000_000_000, True), # needs setup
("broken/half", 1_000_000_000, False), # incomplete -> hidden
]
return [] # spark2 empty
async def fake_probe(repo, mode, s, *, local_path=None):
return DiskStatus(repo=local_path or repo, on_disk=False, total_bytes=0, per_host=[])
monkeypatch.setattr(discovery, "list_cached_models", fake_list)
monkeypatch.setattr(discovery, "probe_disk", fake_probe)
menu = asyncio.run(discovery.build_menu(settings, cat))
# Recipe-matched: keyed by recipe key, ready (not needs_setup), real size.
assert "qwen36" in menu
assert menu["qwen36"]["needs_setup"] is False
assert menu["qwen36"]["total_bytes"] == 20_000_000_000
# Discovered-without-recipe: slug key, needs_setup.
slug = repo_to_key("someorg/mystery-7B")
assert menu[slug]["needs_setup"] is True
# Incomplete download is filtered out entirely.
assert all("half" not in k for k in menu)
# A recipe with nothing on disk (e.g. gemma4) must NOT appear — the menu is the disk.
assert "gemma4" not in menu
def test_build_menu_sums_cluster_model_across_both_sparks(monkeypatch):
cat = load_catalog("models.yaml")
settings = _both_spark_settings(monkeypatch)
async def fake_list(host, user, s):
# Same repo present on BOTH Sparks — one card, sizes summed (not two cards).
return [("org/sharded-235B", 70_000_000_000, True)]
async def fake_probe(repo, mode, s, *, local_path=None):
return DiskStatus(repo=repo, on_disk=False, total_bytes=0, per_host=[])
monkeypatch.setattr(discovery, "list_cached_models", fake_list)
monkeypatch.setattr(discovery, "probe_disk", fake_probe)
menu = asyncio.run(discovery.build_menu(settings, cat))
key = repo_to_key("org/sharded-235B")
assert list(menu) == [key] # exactly one card
assert menu[key]["total_bytes"] == 140_000_000_000 # summed across both hosts
assert len(menu[key]["per_host"]) == 2
assert menu[key]["mode"] == "cluster" # present on 2 hosts -> cluster
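The cache-parsing and key tests above pin the expected behaviour down fairly tightly. As a reading aid, here is a minimal sketch of helpers that would satisfy those assertions. It is not the code in `app/disk.py` or `app/discovery.py` (neither is shown in this diff). The `size|complete|dirname` line layout is taken from the test fixture, and the slug regex mirrors the dashboard's JavaScript fallback key. Everything else is an assumption.

```python
import re
from typing import Optional

_PREFIX = "models--"


def cache_dirname_to_repo(dirname: str) -> Optional[str]:
    """Map a Hugging Face cache dir name to 'org/name', or None if it is not a model dir."""
    if not dirname.startswith(_PREFIX):
        return None
    # The org is the first segment; everything after the next '--' is the name.
    org, sep, name = dirname[len(_PREFIX):].partition("--")
    if not sep or not org or not name:
        return None
    return f"{org}/{name}"


def parse_cache_listing(out: str) -> list[tuple[str, int, bool]]:
    """Parse 'size|complete|dirname' lines into (repo, size_bytes, complete) tuples."""
    items = []
    for line in out.splitlines():
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # blank or malformed line
        size_s, complete_s, dirname = parts
        repo = cache_dirname_to_repo(dirname)
        if repo is None:
            continue  # not a models--org--name directory
        try:
            size = int(size_s)
        except ValueError:
            size = 0  # unparseable size falls back to zero
        items.append((repo, size, complete_s == "1"))
    return items


def repo_to_key(repo: str) -> str:
    """Lower-case the repo and collapse each run of other characters into one '-'."""
    return re.sub(r"[^a-z0-9_-]+", "-", repo.lower()).strip("-")
```

Using `str.partition` rather than `split("--")` is what keeps a name such as `weird--name` intact: only the first separator after the prefix is treated as the org boundary.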
+2 -2
@@ -1,10 +1,10 @@
import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk' import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
export const v0_1_0 = VersionInfo.of({ export const v0_1_0 = VersionInfo.of({
version: '0.25.0:0', version: '0.26.0:0',
releaseNotes: { releaseNotes: {
en_US: en_US:
"v0.25.0:0 — cluster coordination layer (GPU arbiter). For clusters where automation, not just this dashboard, swaps models. Three additions: (1) Swap reservation lock — an external scheduler can reserve the GPU swap path (POST /api/swap/lock) and gets a secret token; while held, any swap without the token is refused (423), so the dashboard's manual swap is paused and shows who holds the GPU and until when (with a human Release override). The lock is TTL-bounded and self-frees. (2) Swap webhook — set a URL (and optional signing secret) in Configure Sparks; Spark Control POSTs a swap_complete / swap_failed event after each swap so downstream consumers re-point their model config. (3) Schedule registry — your automation can register its cron jobs (POST /api/schedule) for a read-only \"Scheduled jobs\" panel on the dashboard; Spark Control only displays them, it never runs them. New API: /api/swap/lock (GET/POST/DELETE), /api/schedule (GET/POST/DELETE). See docs/COORDINATION.md. Spark Control remains a control plane, not a job runner — business pipelines stay in their own services and call the swap API.", "v0.26.0:0 — the model menu is now what's actually on your Sparks. The dashboard scans both Sparks for downloaded models and shows exactly those — no more hard-coded list. (1) Delete means delete: removing a model frees its weights AND takes the card off the menu (re-download later to bring it back, with its saved settings). (2) Download a new model and it appears on the menu by itself when it finishes. (3) Models Spark Control doesn't recognize show a \"needs setup\" card — the first time you switch to one, it reads the model's own files, guesses how to launch it (which family, solo vs both Sparks, the right vLLM flags), and asks you to confirm once; after that it's a normal card. (4) The download box now autocompletes known-good models. (5) Each install shows its own Sparks' models, so a shared copy no longer displays someone else's list. 
Removed the two legacy Qwen entries (235B FP8, 2.5 72B) — they'll still appear if you actually have them downloaded. No consumer-API changes; the /v1 proxy and swap API are unchanged.",
}, },
migrations: { migrations: {
up: async ({ effects }) => {}, up: async ({ effects }) => {},
+8 -4
@@ -74,11 +74,15 @@ For a cluster wired differently from the reference layout, three optional knobs
## Adding a new model ## Adding a new model
1. Add an entry to `image/models.yaml`. Required fields: `display_name`, `repo`, `size_gb`, `mode` (`solo` or `cluster`), `vllm_args`. Optional but recommended: `description` (one paragraph — what the model is, what it's good for, how it differs from others; renders below the meta tags in each card), `capabilities` (tags like `[vision, reasoning, tools]`), `expected_ready_seconds`. The menu is whatever's downloaded on the Sparks, so the normal path is just:
2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host> 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1. **download it, then set it up once.**
3. Rebuild + redeploy the package: `cd package && make x86 && make install`.
If `description` is omitted, the card simply hides that section — no need to populate it for every model. Keep descriptions generic (not user-specific) so the catalog stays portable. 1. **Download** from the dashboard (**+ Download a new model**, paste the HF repo) or on Spark 1 with `./hf-download.sh <repo>`. When it finishes it appears on the menu by itself.
2. **Set it up.** If Spark Control already has a recipe for it (see below), it's ready to switch to. Otherwise it shows a **"needs setup"** card: the first switch reads the model's `config.json`, proposes how to launch it (family/parsers, solo vs cluster, vLLM flags), and you confirm once. The confirmed recipe persists to `/data/models-overrides.yaml` (survives package updates).
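The solo-vs-cluster part of that proposal can be pictured roughly as follows. This is a sketch inferred from the `infer_recipe` tests in this commit, not the real `discovery.py` logic. The names `pick_mode` and `mode_flags` and the `SOLO_LIMIT_BYTES` cutoff are hypothetical; the tests only show that a 20 GB model on one Spark is solo and a 200 GB model is cluster.

```python
# Hypothetical cutoff; the real threshold is not visible in this diff.
SOLO_LIMIT_BYTES = 100_000_000_000


def pick_mode(total_bytes: int, on_host_count: int) -> str:
    """Return 'cluster' when the weights sit on both Sparks or look too big for one."""
    if on_host_count >= 2 or total_bytes > SOLO_LIMIT_BYTES:
        return "cluster"
    return "solo"


def mode_flags(mode: str) -> list[str]:
    """Cluster mode adds the two flags the tests check for; solo adds none."""
    if mode == "cluster":
        return ["-tp=2", "--distributed-executor-backend=ray"]
    return []
```

Whatever the real cutoff is, the operator still confirms the proposed mode once in the setup dialog, so a wrong guess is correctable before anything launches.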
### Bundling a launch recipe (optional — skips the setup prompt)
To make a known model launch correctly the instant it's downloaded, add a *recipe* to `image/models.yaml`. These are **not** the menu — they're matched to an on-disk model by `repo`. Required: `display_name`, `repo`, `size_gb`, `mode` (`solo`/`cluster`), `vllm_args`. Optional: `description`, `capabilities` (e.g. `[vision, reasoning, tools]`), `expected_ready_seconds`. Then rebuild + redeploy: `cd package && make x86 && make install`. Keep descriptions generic (not user-specific) so the recipes stay portable.
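To make "matched to an on-disk model by `repo`" concrete, here is a small sketch of how a disk scan and the recipe list could be merged into the menu. It follows what the `build_menu` tests in this commit assert (incomplete downloads hidden, recipe-less models flagged `needs_setup`, sizes summed across Sparks, recipes with nothing on disk omitted). The function `merge_menu` and its argument shapes are illustrative assumptions, not the actual `discovery.build_menu` signature.

```python
import re


def slug(repo: str) -> str:
    """URL-safe fallback key for a model that has no recipe."""
    return re.sub(r"[^a-z0-9_-]+", "-", repo.lower()).strip("-")


def merge_menu(
    listings: dict[str, list[tuple[str, int, bool]]],
    recipes: dict[str, dict],
) -> dict[str, dict]:
    """Build menu cards from per-host (repo, size_bytes, complete) scan results."""
    key_by_repo = {r["repo"]: key for key, r in recipes.items()}
    menu: dict[str, dict] = {}
    for host, items in listings.items():
        for repo, size, complete in items:
            if not complete:
                continue  # half-downloaded models never reach the menu
            key = key_by_repo.get(repo) or slug(repo)
            card = menu.setdefault(key, {
                "repo": repo,
                "needs_setup": repo not in key_by_repo,
                "total_bytes": 0,
                "per_host": [],
            })
            # Same repo on both Sparks yields one card with summed size.
            card["total_bytes"] += size
            card["per_host"].append({"host": host, "size_bytes": size})
    return menu
```

Because the loop is driven by the scan results rather than by the recipes, a recipe with no weights on disk produces no card, which is the property the commit message summarises as "the menu is the disk".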
### Local / fine-tuned models (v0.23.0+) ### Local / fine-tuned models (v0.23.0+)