v0.25.0:0 - cluster coordination layer (swap lock + webhook + schedule registry)

GPU-arbiter safety layer for when automation, not just the dashboard, swaps models:
- swap reservation lock (POST/GET/DELETE /api/swap/lock); 423-enforced in post_swap via a single-read gate, TTL-bounded, secret-token auth, human force-release override + dashboard banner
- swap webhook (swap_complete/swap_failed) fired outside the swap lock, optional HMAC signature, configurable URL+secret
- read-only schedule registry (GET/POST/DELETE /api/schedule) + dashboard panel

New module image/app/coordination.py; docs/COORDINATION.md for consumers; 22 offline tests in test_coordination.py.
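
A receiver-side sketch of the optional HMAC check on the swap webhook. Assumptions not confirmed by this commit message: the signature is a hex-encoded HMAC-SHA256 over the raw request body, the payload fields shown are illustrative, and the helper names `sign_payload` / `verify_signature` are hypothetical. The authoritative format is whatever docs/COORDINATION.md specifies.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme, not the confirmed one)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, received: str) -> bool:
    """Compare in constant time so timing does not leak the expected digest."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received)


# Illustrative event body; the real payload fields may differ.
body = json.dumps({"event": "swap_complete", "model": "example-model"}).encode("utf-8")
signature = sign_payload("shared-secret", body)
```

A receiver would verify against the raw bytes it received, before parsing the JSON, since re-serializing can change the byte sequence and invalidate the digest.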
docs: v0.24.0:0 committed/tagged/pushed — Gitea release asset + live install still pending
2026-06-18 07:07:08 -05:00 · 2026-06-17 23:11:14 -05:00 · 2026-06-17 23:03:33 -05:00 · 2026-06-17 22:36:41 -05:00 · 2026-06-17 22:27:41 -05:00 · 2026-06-17 21:29:27 -05:00
44 changed files with 2862 additions and 114 deletions
@@ -11,5 +11,11 @@ node_modules/
 dist/
 build/
 .DS_Store
+
+# Claude Code — deny by default, allow-list shared wiring (see standards/portability.md)
 .claude/*
 !.claude/rules/
+!.claude/agents/
+!.claude/commands/
+!.claude/skills/
+!.claude/settings.json
@@ -6,6 +6,9 @@ Browser-based StartOS 0.4 package controlling a dual NVIDIA DGX Spark AI cluster

 Subsystem guidance lives in `docs/guides/` and loads when matching files are touched (Claude Code lazy-loads via `.claude/rules/` symlinks; other agents read the guides directly): `startos-package.md` (build/versioning, `package/**`), `fastapi-image.md` (dev server/env/layout, `image/**`), `redaction.md` (vendoring + test gates), `audio-speech.md` (parakeet patches, cluster-container footguns, audio testing). **Read `docs/guides/audio-speech.md` before touching the Sparks' containers over SSH** — ops sessions don't trip the path scoping.

+> **Inbox check:** At session start, if `~/Projects/standards/INBOX.md` exists, scan it for
+> items tagged `(spark-control)` and surface them before proposing next steps; triage with `/triage`.
+
 ## Stack

 - Two halves, always coordinated:
@@ -20,6 +23,7 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou
 ```bash
 (cd package && make x86)                                  # build the s9pk; make install sideloads (restarts live service — ask first)
 (cd image && uvicorn app.server:app --port 9999)          # local dev — needs env vars, see fastapi-image rule
+(cd image && .venv/bin/python -m pytest)                          # offline unit suite (launch-cmd injection, label-merge)
 (cd image && .venv/bin/python -m app.redaction.test_gateway)      # offline redaction suite 1
 (cd image && .venv/bin/python app/redaction/test_scrub_leak.py)   # offline redaction suite 2
 ./scripts/test-audio-with-speakers.sh <audio-file>        # e2e audio — hits the LIVE cluster
@@ -29,7 +33,7 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou

 - `image/app/` — FastAPI app (`server.py` entry, routers in sibling modules, `static/` dashboard UI).
 - `package/startos/` — StartOS manifest, interfaces, actions, version + release notes.
- `docs/` — `AUDIO_API.md`, `EMBEDDINGS.md`, `REDACTION_GATEWAY.md` (consumer-facing API refs; update with API changes).
+- `docs/` — `AUDIO_API.md`, `EMBEDDINGS.md`, `REDACTION_GATEWAY.md`, `COORDINATION.md` (consumer-facing API refs; update with API changes).
 - `README.md` (overview), `HANDOFF.md` (fresh-user install guide), `runbook.md` (ops notes), `known-issues.md`, `ROADMAP.md` (longer-term backlog — items move into "Current state" below when picked up).

 ## Conventions
@@ -51,37 +55,12 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou

 ## Current state

- **Working (v0.18.0:0, installed and serving):** swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel mode. Spark 2 audio stack is healthy (11k+ requests/12h, all 200).
- **In progress — Signal Engine "flakiness":** diagnosed, not a server bug — transient 1–4s unresponsiveness while the single GPU is continuously busy. Remedy is client-side; a drafted message (in-flight cap 2, hard ceiling 3 global across audio endpoints, retry-with-backoff on timeout/503) is with the owner to forward to that dev.
- **Decided, not implemented:** remote access stays WireGuard/Tailscale split-tunnel — no public interface, so no API auth built; an empirical concurrency sweep is offered but needs the owner's explicit OK in a quiet window. **Revisit (full-eval 2026-06-12):** the "LAN-only, so no auth" call is now load-bearing against RCE — unquoted user input reaches the SSH shell on several endpoints, so the network boundary is the *only* thing preventing cluster takeover. Quoting the injection sinks (work queue) is needed regardless of the auth decision; a defense-in-depth auth/CSRF gate is the follow-on.
- **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; the connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers.
- **Portability:** working tree scrubbed 2026-06-12 — all owner-specific IPs/hostnames/usernames/names replaced with placeholders in tracked files; `claude-code-starter-prompt.md` deleted (old build-time prompt). Real cluster values live only in StartOS install config, shell env vars, and the gitignored `settings.local.json`. **Caveat (full-eval 2026-06-12): git *history* was not rewritten** — the old IPs/hosts/user `modelo`/key name are still recoverable pre-`50c67cd`. The scrub is working-tree-only; treat the repo as private until history is rewritten (see work queue below).
- **Repo wart:** commit `367d986` is labeled `v0.13.0:4` but actually contains everything through v0.18.0:0 — per-version commits for v0.14–v0.18 are missing. Keep commit messages accurate going forward.
- **Hosting:** repo pushes to the owner's self-hosted Gitea — remote `gitea`, branch `master`, over SSH (host alias + key live in the local `~/.ssh/config`; no owner-specific details belong in the repo). Push there after committing.
- **Next (pre-eval backlog):** (1) owner forwards the concurrency note to the Signal Engine dev; (2) run the concurrency sweep if the dev wants the measured knee; (3) add the `--memory` cap to parakeet-asr via the Reapply-patches action; (4) pick the next item from ROADMAP.md.
-
-### Full-eval triage (2026-06-12)
-
-Source: `EVALUATION.md` at repo root (full evidence, file:line pointers, scorecard). Findings triaged below; do these before the pre-eval backlog above where they overlap.
-
-**Work queue — P0/P1, fix before sharing the package wider:**
-1. ~~**[P0] Shell-quote/validate every user value crossing into SSH**~~ — **DONE (code, 2026-06-12; not yet shipped).** New `image/app/shellsafe.py` (`validate_repo`/`validate_image`/`validate_container` whitelists + `quote_arg`/`quote_args`). Boundary validation added to `POST /api/models` (repo) and `POST /api/nim/install` (image+container); `shlex.quote` applied at every SSH sink — `models.build_launch_command` (repo+args, covers `vllm_args`+knobs), `download._do` (repo), `nim._do` (image/container/volume/port/env), `services.docker_state`+`run_action` (container). Verified: injection survives only as a single quoted token, vLLM preflight `shlex.split` round-trip intact, both redaction suites still pass. Side-benefit: NGC key now `shlex.quote`'d in `nim._do` (was single-quoted) — closes the quote-breakout half of the P2 NGC-key item; the process-list-exposure half remains. **Ship step pending:** version bump + release notes + rebuilt s9pk.
-2. **[P0] Decide the git-history question** — owner IPs/hosts/user `modelo`/key name persist pre-`50c67cd` despite the working-tree scrub. Either rewrite history (`git-filter-repo`) + rotate the `id_ed25519_shared` key, or keep the repo private-forever. Blocks any public/shared publish. **(Open — git-ops decision, not code.)**
-3. ~~**[P1] Defense-in-depth gate on mutating endpoints**~~ — **DONE (code, 2026-06-12; not yet shipped).** `csrf_guard` HTTP middleware in `server.py` rejects state-changing requests whose `Origin`/`Referer` hostname ≠ the served host. Scoped to control endpoints; the programmatic API surface is exempt (`/v1/*`, `/scrub`, `/rehydrate`, `/api/search`, `/api/audio/`, `/api/health-event`) so downstream consumers are unaffected. No app-layer token auth (deliberate — would break consumers + the non-technical owner). Verified via TestClient: cross-origin control POST→403, same-origin/no-Origin→pass, exempt prefixes always pass, GET never blocked. **Verify on-box:** confirm the StartOS reverse proxy passes `Host`/`Origin` so the dashboard isn't false-positive-blocked.
-4. ~~**[P1] Validate the Qdrant `collection`**~~ — **DONE (code, 2026-06-12; not yet shipped).** `_safe_collection` whitelist (`[A-Za-z0-9._-]`, rejects `..`) + URL-encoded path segment in `embeddings_proxy.py`. The raw `filter` is left as a passthrough (Qdrant parses it; pydantic enforces `dict`) — locking it to an allowlist would break hybrid-search consumers; the path segment was the real injection vector.
-
-**Shipping (all of #1/#3/#4 batched):** version bumped `0.18.0:1`→`0.19.0:0` with release notes (`versions/v0_1_0.ts`). Rebuild `make x86`; `make install` (live-service restart) needs explicit go-ahead. Not committed yet.
-
-**Known debt — P2, track but not blocking:**
- Test coverage is redaction-only; swap state machine, proxies, SSH wrapper, and the package have zero automated tests. Live-cluster paths (swap exec, audio, embeddings/search) couldn't be exercised at all — biggest blind spot.
- Loose dependency floors permit vulnerable `python-multipart`/`starlette` (DoS CVEs) on rebuild; no lockfile; no upload size caps (`pyproject.toml:6-13`).
- StartOS registry blockers (only if pursuing the registry): source not public + `packageRepo`/`upstreamRepo` are `example.com` placeholders (`manifest/index.ts:12-13`).
- Opaque HTTP 500 on `POST /api/models` / `PUT /knobs` when `MODELS_OVERRIDES` unset in dev (write to read-only `/data`) — catch the `OSError`.
- NGC API key inlined single-quoted into a remote shell command (`nim.py:147`) — pass via stdin/env.
- Global mutable `catalog` reassigned via `global`, shared across async requests with no snapshot (`server.py:107`) — latent race as concurrency grows.
- Container runs uvicorn as **root** bound to `0.0.0.0:9999` (no `USER` in Dockerfile) — amplifies any RCE blast radius.
-
-**Parked — P3+, do in bulk when next touching docs/packaging:**
- README Status block stale (`v0.2.3 / 0.13.0:4` → v0.18.0:1, undercounts features); deprecated `@app.on_event` + hardcoded `app.version="0.1.0"`; `NimInstallBody.register` shadows `BaseModel` (rename → `register_service`); httpx class names leak into TTS/speech-models error text; one unescaped `innerHTML` sink (`app.js:177`) + `task_id` reflected in scrub JSON.
- Packaging cosmetics: `marketingUrl` placeholder; broken `instructions.md` source link; per-service SSH users (`parakeet_user` etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); `Makefile` builds only x86 though manifest declares `aarch64`; release notes describe the scrub, not capabilities.
- Hardening misc: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text echoed to clients.
+- **Live service runs v0.22.0:0** (installed and serving). **v0.25.0:0 is the latest in tree — coordination layer (swap lock + webhook + schedule registry); built/typechecked clean, NOT yet committed/tagged/installed (this session's work).** It stacks on three releases also staged-but-not-live: v0.24.0:0 (configurable topology — committed `26070eb`, tagged, pushed to `gitea/master`), v0.23.0:0 (local/fine-tuned models — committed/tagged/Gitea-published). **Close-out backlog for all of these: (a) commit/tag/push v0.25.0:0; (b) `make release` to publish s9pk assets to Gitea Releases (needs `GITEA_URL` + write `GITEA_TOKEN`, neither in env); (c) the live install.** Installs blocked on the same mDNS issue (P3 line below). Working features: swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel; per-Spark SSH-key copy + WireGuard `VPN <ip>` hardware-card badge; configurable vLLM port (blank ⇒ 8888); **configurable topology** (vLLM container name, hide-services list, second-Spark vLLM monitor — v0.24.0:0); local/fine-tuned models (v0.23.0:0); **coordination layer** (v0.25.0:0 — GPU swap reservation lock with `423`-enforced manual-swap pause + human Release override, swap_complete/swap_failed webhook, read-only schedule registry; API in `docs/COORDINATION.md`). Everything from v0.23 onward lands live once the installs go through. Spark 2 audio stack healthy. Security hardening (v0.19.0:0 — shellsafe SSH-injection guard, Qdrant path-injection, same-origin CSRF guard) shipped and stable; evidence in `EVALUATION.md`.
+- **matrix-bridge bot tile (done, v0.21.0:1, verified live):** `bot`-kind service tile — status badge from docker-state only (no HTTP port), plus **Update** / Restart / Stop/Start / **View logs**. Code: `app/matrix_bridge.py` + `/api/matrix-bridge/{update,logs}` (update streams; 25-min cap; fail-loud). Driven directly as `modelo` on Spark 2 (**no `sudo -iu`** — spark2 has no passwordless sudo). User is a blank-default Configure-Sparks field (`matrix_bridge_user`); blank → tile hidden (portable). Host reuses `spark2_host` (`192.168.1.87` = the bot's box `spark-32d0`); container/dir/branch are env-overridable defaults. **Load-bearing ops dep:** Update's `git fetch` runs as `modelo`, which needs `modelo`'s `~/.ssh/config` pinning the Gitea deploy key with `IdentitiesOnly yes` — else the wrong key is offered and Gitea denies (publickey). Optional next, only if the bot dev asks: Docker `HEALTHCHECK` for running-but-disconnected detection (spec §Note).
+- **Tests:** offline pytest harness in `image/tests/` — `cd image && .venv/bin/python -m pytest` (124 passing). Covers `build_launch_command` (incl. the shell-injection round-trip + local-model bind-mount), the transcript↔diarizer label-merge, the `shellsafe` validators, `matrix_bridge.build_update_command` (+ phase detection), the configurable-topology layer (`test_topology.py`), and the coordination layer (`test_coordination.py`: swap-lock lifecycle/expiry/token-auth, schedule-registry CRUD, webhook payload + HMAC signature — `now` is injected into the lock so expiry is tested without sleeping). Mock-heavy swap/proxy/endpoint tests deliberately skipped (low ROI). Redaction + live-audio suites remain standalone scripts.
+- **Signal Engine "flakiness":** diagnosed as *not* a server bug — transient 1–4s unresponsiveness while the single GPU is busy. Client-side remedy (in-flight cap 2 / ceiling 3 / retry-on-timeout+503) drafted and **forwarded to that dev (owner confirmed 2026-06-15)**. Awaiting whether they want the measured concurrency knee.
+- **Stance (decided, not built):** no public interface / no API-token auth — LAN + WireGuard/Tailscale split-tunnel only; the CSRF guard covers the browser-driven vector.
+- **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers; matrix-bridge badge won't visibly flip on a fast `docker restart` (status re-checked only after the command returns).
+- **Infra gotcha (safety):** passwordless sudo is NOT configured on spark2 — design unprivileged probes for any Spark feature (the badge uses `ip`, not `sudo wg show`). spark2 sits on the `starttunnel` WireGuard subnet (`10.59.211.6/24`, survives reboot). Owner declined SSH-key rotation after the 2026-06-12 history scrub (only the key *name* leaked) — don't re-flag.
+- **Hosting:** self-hosted Gitea — remote `gitea`, branch `master`, over SSH; push after committing. (Wart: commit `8d839e3` is mislabeled `v0.13.0:4` but contains through v0.18.0:0.)
+- **Next — committed 2026-06-17: OpenClaw/Johnny-5 coexistence epic (full plan + design stance in `ROADMAP.md` → "Cluster coordination").** Stance: Spark Control = control plane / GPU arbiter, **not** a job runner; business cron jobs live in separate services that *call* its swap API (swaps are already API-driven via `POST /api/swap`). Sequence: (1) **configurable `VLLM_PORT`** — SHIPPED **v0.22.0:0** (Configure-Sparks field, blank ⇒ 8888; + `_env_int` hardening in `config.py` so a blank/bad port no longer crashes startup, killing a P3 tech-debt item). Committed `136a471`, pushed, tagged `v0.22.0`, rebuilt clean, installed, and **published to the self-hosted Gitea Releases** 2026-06-17 (`make release` → `scripts/gitea-release.sh`, takes `GITEA_URL` + a write token). **Distribution model (decided 2026-06-17):** Gitea Releases + a read-only token the adopter's agent uses to pull the latest s9pk (`GET /api/v1/repos/grant/spark-control/releases/latest` → download the `.s9pk` asset → sideload). Note: Gitea returns `browser_download_url` on its `.local` ROOT_URL, which won't resolve off-LAN — a remote adopter pulls via whatever address reaches the Gitea (the WireGuard IP). (2) **local-path/fine-tuned models** — DONE in tree, staged as **v0.23.0:0** (`ModelDef.local_path` + exactly-one-source validator; swap bind-mounts the dir at the same container path via the launch script's `VLLM_SPARK_EXTRA_DOCKER_ARGS` hook, **no `launch-cluster.sh` change**; "+ Add local model" UI form + `local` badge; `validate_local_path`; disk-delete refused for local; 94 tests pass. Reviewer-agent pass done, findings addressed (path validation + chat-template-location guard folded into the `ModelDef` validator so YAML/override entries are checked too; `_merge_overrides` skips a bad entry instead of failing the whole catalog; `VLLM_SPARK_EXTRA_DOCKER_ARGS` contract documented in `runbook.md`). 
**Committed `e783653`, tagged `v0.23.0`, built clean, published to Gitea Releases — but `make install` to the live Start9 FAILED: `immense-voyage.local` wasn't resolving via mDNS from the Mac (server up at `192.168.1.72`; `start-cli -H <ip>` reaches it but returns UNAUTHORIZED, auth bound to the registered `.local` host). FINISH-HERE: flush mDNS (`sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder`) or add a hosts entry, then re-run `cd package && make install`** (details in runbook → "Sideload can't reach the server"). (3) **configurable topology** — DONE in tree, staged as **v0.24.0:0** (built clean, not yet committed/installed). Three optional Configure-Sparks knobs: vLLM container name (`VLLM_CONTAINER`, blank ⇒ `vllm_node`, threaded into the swap log-tail + validator exec via `quote_arg`); "services to hide" (`DISABLED_SERVICES` comma list → `Settings.disabled_services` frozenset, skipped by `services_from_settings`, the `check_*` probes, deep-health `run_all`, and connectivity logging — kills the Parakeet-on-8000 collision); second-Spark vLLM monitor via a `kind: vllm` custom service in `services-overrides.yaml` (`probe_vllm_endpoint` shared with `check_vllm`). `/api/endpoints` gained a `disabled` flag; the health-dot hides when disabled. 102 tests pass (+8 in `test_topology.py`). Swap mechanism deliberately NOT generalized to raw `docker run` (that's coordination, item 4). Install pending — same mDNS situation as v0.23.0. (4) **coordination layer** — DONE in tree, staged as **v0.25.0:0** (brought forward 2026-06-17 on request rather than waiting for our own automation). 
`image/app/coordination.py` + `docs/COORDINATION.md`: swap reservation lock (`GET/POST/DELETE /api/swap/lock`, secret token, `423`-enforced in `post_swap`, TTL-bounded in-memory, `?force=true` human override, dashboard banner + swap-button pause), swap webhook (`swap_complete`/`swap_failed` fired outside the swap lock from `SwapManager._run`, optional HMAC `X-Spark-Signature`, Configure-Sparks URL+secret), schedule registry (`GET/POST/DELETE /api/schedule`, read-only "Scheduled jobs" panel). +22 tests (`test_coordination.py`). Built/typechecked clean; commit + install pending. Still-open older threads: audio concurrency sweep (only if the Signal Engine dev wants the knee; needs a quiet window); optional matrix-bridge Docker `HEALTHCHECK` if the bot dev asks; Parakeet long-audio guard deferred (rationale in ROADMAP).
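
The release-pull step in the distribution model above (fetch the latest release, pick the `.s9pk` asset, work around the unresolvable `.local` download URL) might look roughly like this on the adopter's side. The JSON fields `assets[].name` and `assets[].browser_download_url` follow Gitea's release API; the function name `pick_s9pk_url`, the `reachable_base` parameter, and the sample hostnames and paths are assumptions for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def pick_s9pk_url(release: dict, reachable_base: str) -> str:
    """Return the .s9pk asset URL, re-pointed at a base the adopter can reach."""
    for asset in release.get("assets", []):
        if asset["name"].endswith(".s9pk"):
            original = urlsplit(asset["browser_download_url"])
            base = urlsplit(reachable_base)
            # Keep path and query; swap only the scheme and host:port.
            return urlunsplit((base.scheme, base.netloc, original.path, original.query, ""))
    raise LookupError("release has no .s9pk asset")


# Sample of the JSON a releases/latest call could return (values are made up).
release = {
    "tag_name": "v0.22.0",
    "assets": [
        {
            "name": "notes.txt",
            "browser_download_url": "http://gitea.local:3000/o/r/releases/download/v0.22.0/notes.txt",
        },
        {
            "name": "spark-control.s9pk",
            "browser_download_url": "http://gitea.local:3000/o/r/releases/download/v0.22.0/spark-control.s9pk",
        },
    ],
}
url = pick_s9pk_url(release, "http://10.0.0.5:3000")
```

Rewriting only the scheme and authority keeps the asset path Gitea generated, so the same helper works whether the adopter reaches Gitea by LAN address or by WireGuard IP.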
@@ -12,13 +12,13 @@ This is a capable, well-documented single-operator control plane: a ~960-line Fa

 - **Command injection → cluster RCE** is reported by *both* the evaluator (P1) and the security-auditor (P0) at the same sinks (`models.py:80`, `swap.py:101`, `download.py:129`, `nim.py:145-166`, `services.py:144`). The evaluator demonstrated `build_launch_command` producing a live `;`-separated command from a hostile `repo`. Merged as **one P0** — the auditor's adversarial evidence (browser/CSRF reachability over plaintext HTTP, no auth) escalates the evaluator's network-gated P1.
 - **No auth on state-mutating endpoints** is the shared root enabler: the evaluator filed it P2 (documented/intentional), the auditor filed the **CSRF** angle P1 (a malicious page in the operator's browser can `fetch()` the mutating routes and chain into the P0 injections). Merged into one P1, noting the auditor's CSRF evidence escalates the evaluator's original P2.
- **Owner data exposure**: the evaluator flagged real IPs/username in the (gitignored, untracked) `.claude/settings.local.json`; the auditor independently found the same class of data — IPs, hostnames, user `modelo`, key name — persisting in **git history** despite the v0.18.0:1 working-tree scrub. These are the same concern at two locations; the git-history copy is the P0.
+- **Owner data exposure**: the evaluator flagged real IPs/username in the (gitignored, untracked) `.claude/settings.local.json`; the auditor independently found the same class of data — IPs, hostnames, user `<spark-user>`, key name — persisting in **git history** despite the v0.18.0:1 working-tree scrub. These are the same concern at two locations; the git-history copy is the P0.
 - **Front-end output hygiene**: the evaluator flagged `current_model` rendered via `innerHTML` without `escapeHtml` (`app.js:177`, P3); the exerciser noted `task_id` echoed verbatim in scrub JSON. The auditor read the UI as broadly `escapeHtml`-clean — see Disagreements.

 ## Priority queue

 - [P0] Command injection via unquoted user input (`repo`, `vllm_args`, NIM `image`/`container`/`port`, custom-service `container`) interpolated into SSH shell commands → arbitrary RCE as the SSH user on the Sparks — `models.py:80`, `swap.py:101`, `download.py:129`, `nim.py:145-166`, `services.py:144`; demonstrated via `build_launch_command` — evaluator + security-auditor
- [P0] Owner infra topology (IPs `192.168.1.103/.87`, QSFP `192.168.100.10/11`, hosts `spark-27ea`/`spark-32d0`, user `modelo`, key `id_ed25519_shared`) persists in git history pre-`50c67cd` despite the working-tree scrub → target list for the unauthenticated endpoints — security-auditor
+- [P0] Owner infra topology (IPs `<spark-1-ip>`/`<spark-2-ip>`, QSFP `<spark-1-qsfp-ip>`/`<spark-2-qsfp-ip>`, hosts `<spark-1-host>`/`<spark-2-host>`, user `<spark-user>`, key `<ssh-key>`) persisted in git history despite the working-tree scrub → target list for the unauthenticated endpoints — security-auditor [RESOLVED 2026-06-12: history rewritten with git filter-repo; 0 hits across all refs]
 - [P1] No auth + no CSRF protection on state-changing endpoints (plaintext `http`, `interfaces.ts:8`) → any LAN peer, or a malicious page in the operator's browser, can drive swap/install/stop/delete and chain into the P0 injections — security-auditor (CSRF P1) + evaluator (auth P2, escalated)
 - [P1] SSRF / Qdrant path injection: caller `collection` interpolated into the Qdrant URL with no validation and raw `filter` forwarded verbatim — `embeddings_proxy.py:237,175,204` — security-auditor
 - [P2] Test coverage is redaction-only; the swap state machine, proxies, SSH wrapper, and the StartOS package have zero automated tests — evaluator
@@ -62,7 +62,7 @@ No lens score was overturned by cross-agent evidence; Security stays at 2 with t
 ## Suggested order of work

 1. **Close the injection sinks** — `shlex.quote` or strict-regex-validate every user-controlled value crossing into SSH (`repo`, `vllm_args`, NIM `image`/`container`/`port`, custom-service names); the safe pattern already exists in `disk.py:_SAFE_DIRNAME`. Cheap, local, independent of the auth decision. (P0)
-2. **Decide the git-history question** before any wider sharing — rewrite history (`git-filter-repo`) and rotate the named `id_ed25519_shared` key, or commit to keeping the repo private-forever. (P0)
+2. **Decide the git-history question** before any wider sharing — rewrite history (`git-filter-repo`) and rotate the named `<ssh-key>` key, or commit to keeping the repo private-forever. (P0)
 3. **Add a defense-in-depth gate** on mutating endpoints — an `Origin`/referer check or a shared-token header in middleware — so a misconfigured StartOS exposure isn't instant RCE; leave read-only probes open. (P1)
 4. **Harden the remaining inputs** — validate the Qdrant `collection`, pin dependency floors + commit a lockfile, add upload size caps, drop the root container `USER`. (P1–P2)
 5. **Add a minimal pytest harness** for `build_launch_command` (incl. injection cases), the swap state transitions, and `_merge_words_with_speakers` — the untested core. (P2)
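
Step 1 combines two defenses: validate at the boundary, then quote at the sink. A minimal sketch of both, using only the standard library. The regex, the `build_remote_command` helper, and the sample command are assumptions for illustration; they are not the project's actual validators or launch command.

```python
import re
import shlex

# Hypothetical whitelist for "org/name" style repo ids.
_REPO_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def validate_repo(repo: str) -> str:
    """Reject anything that is not a plain org/name id."""
    if not _REPO_RE.fullmatch(repo):
        raise ValueError(f"invalid repo: {repo!r}")
    return repo


def build_remote_command(program: str, args: list[str]) -> str:
    """Quote every token so the remote shell sees each one as a single literal word."""
    return " ".join(shlex.quote(tok) for tok in [program, *args])


hostile = "org/model; rm -rf /"
cmd = build_remote_command("launch", ["--model", hostile])
```

Splitting `cmd` with `shlex.split` gives back the original three tokens, so the hostile value survives only as one quoted argument. That round-trip is also a convenient shape for the injection cases in step 5.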
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2026 Grant
+Copyright (c) 2026 Alice

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -2,8 +2,24 @@

 Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.

+## Cluster coordination — OpenClaw coexistence (committed 2026-06-17, from Johnny 5 report 2026-06-16)
+
+Driven by the one other Spark Control adopter (a colleague running OpenClaw + cron jobs against his own dual Sparks; report at the date above). His cluster is configured differently from ours (vLLM on **both** Sparks, port 8000, raw `docker run`, container `vllm-gemma4`) and an automated cron physically swaps models — so his notes are partly *portability gaps* (the package hard-codes our layout) and partly *coordination gaps* (his dashboard and his crons fight over the GPU).
+
+**Design stance (decided):** Spark Control is the **control plane / GPU arbiter, not a job runner.** Recurring business pipelines (his "Daily Vol" generator; our own future scheduled jobs) live in *separate* application services that *call* Spark Control's swap API. The dividing line is what a scheduled job *does*: control-plane actions (swap a model, warm it, restart a service, run a health sweep) are in scope for an in-package scheduler; business logic (scrape / summarize / build / deploy) stays in the app layer. Swaps are already API-driven (`POST /api/swap` → `GET /api/swap/{id}` / `…/stream`, `POST /api/swap/{key}/validate`) and non-browser clients pass the CSRF guard, so an external scheduler can drive swaps **today** — the items below add the *safety* layer, not the capability.
+
+Sequenced:
+1. **Configurable `VLLM_PORT`** — DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000.
+2. **Local-path / fine-tuned model support** — DONE, v0.23.0:0. Catalog/`ModelDef` gained `local_path` (exactly one of `repo`/`local_path`); swap bind-mounts the dir into the vLLM container at the same path via the launch script's `VLLM_SPARK_EXTRA_DOCKER_ARGS` hook (no `launch-cluster.sh` change); "+ Add local model" form + `local` badge; disk-delete refused for local models; `validate_local_path` boundary check. His merged `ten31-v2` was the motivating case.
+3. **Configurable topology** — DONE, v0.24.0:0. Three optional Configure-Sparks knobs: vLLM container name (`VLLM_CONTAINER`, blank ⇒ `vllm_node`; threaded through the swap log-tail + pre-flight validator via `quote_arg`); "services to hide" (`DISABLED_SERVICES`, comma list — hidden services show no tile and are skipped by status/deep-health/connectivity probes, killing the Parakeet-on-8000 collision); and a second-Spark vLLM monitor via a `kind: vllm` custom service in `services-overrides.yaml` (read-only tile probed through the shared `probe_vllm_endpoint`). `/api/endpoints` gained a `disabled` flag. Covers report P4/P5/#6. (Generalizing the *swap* mechanism to the adopter's raw `docker run` was deliberately left out — that's coordination, item 4; he swaps via his own crons and uses Spark Control to monitor.)
+4. **Coordination layer** — DONE in tree, staged as **v0.25.0:0** (built/typechecked clean; install pending). All three primitives shipped; `image/app/coordination.py` + `docs/COORDINATION.md`. Brought forward 2026-06-17 on request rather than waiting for our own automation.
+   - **Swap lock** with holder + TTL (`POST` / `GET` / `DELETE /api/swap/lock`). Acquire returns a secret token; the swap endpoint refuses any real swap (`423`) that doesn't present it in `X-Swap-Lock-Token`, so the dashboard's manual swap is paused while a scheduler holds it (with a `?force=true` human override). In-memory + TTL-bounded → resets to unlocked on restart; re-acquire with the token extends. Enforced in `post_swap`, not advisory.
+   - **Swap-event webhook** (`swap_complete` / `swap_failed`) to a configurable URL (Configure-Sparks field), fired from `SwapManager._run` *outside* the swap lock; optional shared secret ⇒ `X-Spark-Signature` HMAC. Fire-and-forget (5 s, no retries); dry runs don't fire.
+   - **Schedule visibility** — `GET/POST/DELETE /api/schedule`; read-only "Scheduled jobs" dashboard panel, registered by external schedulers. Spark Control stores and displays, never executes.
+   - Still NOT generalized: the swap *mechanism* to raw `docker run` (that's the adopter's own crons' job). Tests: `image/tests/test_coordination.py` (22 cases — lock lifecycle/expiry/token, the single-read swap gate, schedule CRUD + id validation, webhook payload+signature). Known limit: lock + schedules are in-memory (a restart frees the lock and empties the registry until schedulers re-register) — persist to `/data` only if that bites.
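
The swap-lock semantics described in item 4 (secret token, TTL expiry, re-acquire extends, forced release, injected clock for tests) can be sketched as below. This is an illustration under assumptions, not the contents of `coordination.py`: the class name `SwapLock`, its method names, and the monotonic-clock default are all hypothetical.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class _Held:
    holder: str
    token: str
    expires_at: float


class SwapLock:
    """In-memory, TTL-bounded reservation; state is lost on restart."""

    def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
        self._now = now  # injected so expiry can be tested without sleeping
        self._held: Optional[_Held] = None

    def _current(self) -> Optional[_Held]:
        # An expired lock is treated as released.
        if self._held is not None and self._held.expires_at <= self._now():
            self._held = None
        return self._held

    def acquire(self, holder: str, ttl: float, token: Optional[str] = None) -> Optional[str]:
        """Return the token on success, or None if another holder has the lock."""
        held = self._current()
        if held is not None:
            if token is None or not secrets.compare_digest(held.token, token):
                return None
            held.expires_at = self._now() + ttl  # re-acquire with the token extends
            return held.token
        new = _Held(holder, secrets.token_urlsafe(32), self._now() + ttl)
        self._held = new
        return new.token

    def allows_swap(self, token: Optional[str]) -> bool:
        """Read the lock state once; a False result maps to HTTP 423 in the caller."""
        held = self._current()
        if held is None:
            return True
        return token is not None and secrets.compare_digest(held.token, token)

    def release(self, token: Optional[str] = None, force: bool = False) -> bool:
        """Release with the matching token, or unconditionally when force is set."""
        held = self._current()
        if held is None:
            return True
        if force or (token is not None and secrets.compare_digest(held.token, token)):
            self._held = None
            return True
        return False


# Fake clock: a one-element list the test can advance.
clock = [0.0]
lock = SwapLock(now=lambda: clock[0])
token = lock.acquire("nightly-cron", ttl=60)
```

Reading the state once in `allows_swap` avoids a check-then-act gap between "is it locked" and "does the token match", which is the point of a single-read gate.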
+
 ## Near term
- parakeet-asr `--memory` cap, shipped via the Reapply-patches action (guards against swap-thrash on very long audio).
+- parakeet-asr long-audio memory guard — **deferred 2026-06-15, low priority.** A duration cap on `/v1/audio/diarize`: Sortformer runs the whole file in one pass (`diarizer.py:128-135`) over Spark 2's *shared* 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. **Precautionary — no observed incident**, and the production consumer (Recap Relay) already chunks via `/diarize-chunk` (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full `/diarize`. When picked up: add a configurable `MAX_DIARIZE_SECONDS` guard in `diarizer.py` right after `duration` is computed (~line 130) → raise → HTTP 413 in `main.py` (mirrors the existing `MAX_UPLOAD_MB` 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
 - Controlled concurrency sweep of the audio endpoints in a quiet window — replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
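
For the deferred long-audio guard, the duration check itself is small. A sketch under assumptions: the exception name `AudioTooLong`, both helper names, and the "blank or invalid means no cap" default are hypothetical, and the mapping of the exception to HTTP 413 would live in the server's request handler rather than here.

```python
import os


class AudioTooLong(Exception):
    """Raised when a file exceeds the configured diarization cap."""


def max_diarize_seconds(env=os.environ) -> float:
    """Read the cap; blank, non-numeric, or non-positive values mean no cap (0.0)."""
    raw = env.get("MAX_DIARIZE_SECONDS", "").strip()
    try:
        value = float(raw)
    except ValueError:
        return 0.0
    return value if value > 0 else 0.0


def check_duration(duration_s: float, cap_s: float) -> None:
    """Raise before the single-pass model runs; the caller turns this into a 413."""
    if cap_s > 0 and duration_s > cap_s:
        raise AudioTooLong(f"audio is {duration_s:.0f}s; limit is {cap_s:.0f}s")
```

Treating a missing setting as "no cap" keeps current behavior unchanged until an operator opts in, which suits a precautionary guard with no observed incident behind it.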

 ## Audio quality
@@ -22,3 +38,21 @@ Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE
 - Per-model configurable vLLM flags editable from the UI (today: edit `models.yaml` and rebuild).
 - Spark host update actions (OS/driver) from the UI.
 - Open WebUI link-out integration; richer per-service detail views.
+
+## Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)
+
+P0/P1 security findings are all fixed in v0.19.0:0. Remaining, none blocking:
+
+**P2 — track:**
+- No automated tests beyond the two redaction suites — swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for `build_launch_command` (incl. injection cases), swap transitions, and `_merge_words_with_speakers` is the highest-value start.
+- Loose dependency floors permit vulnerable `python-multipart`/`starlette` (DoS CVEs) on rebuild; no lockfile; no upload size caps (`pyproject.toml`).
+- Opaque HTTP 500 on `POST /api/models` / `PUT /knobs` when `MODELS_OVERRIDES` unset in dev (write to read-only `/data`) — catch the `OSError`.
+- NGC API key still appears on the remote process command line (`nim.py`) — the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
+- Global mutable `catalog` reassigned via `global`, shared across async requests with no snapshot (`server.py`) — latent race as concurrency grows.
+- Container runs uvicorn as **root** bound to `0.0.0.0:9999` (no `USER` in Dockerfile) — amplifies any RCE blast radius.
+
+**P3 — bulk-fix when next touching docs/packaging:**
+- README Status block stale (`v0.2.3 / 0.13.0:4` → now v0.19.0:0); deprecated `@app.on_event` + hardcoded `app.version="0.1.0"`; `NimInstallBody.register` shadows `BaseModel` (rename → `register_service`); httpx class names leak into TTS/speech-models error text; one unescaped `innerHTML` sink (`app.js`) + `task_id` reflected in scrub JSON.
+- Packaging: `marketingUrl`/`packageRepo`/`upstreamRepo` are `example.com` placeholders; broken `instructions.md` source link; per-service SSH users (`parakeet_user` etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); `Makefile` builds only x86 though the manifest declares `aarch64`.
+- Hardening misc: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text echoed to clients.
+- StartOS registry (only if ever pursuing it): source must be public + real repo URLs.
@@ -9,7 +9,7 @@ from the live deployment.
 ## 1. Connection / auth

 - **Base URL:** `https://<spark-control-host>` (the operator's Start9 LAN address,
-  e.g. `https://192.168.1.72:62419`). A `.local` form also exists (survives IP
+  e.g. `https://<spark-control-host>:62419`). A `.local` form also exists (survives IP
  changes); the operator can provide it.
 - **TLS:** Start9's self-signed Root CA. On the LAN, set `verify=False` /
  `rejectUnauthorized:false` (curl `-k`), or install the Start9 Root CA into your
@@ -0,0 +1,157 @@
+# Cluster coordination through Spark Control (v0.25.0)
+
+Spark Control is the **GPU arbiter, not a job runner.** Your recurring pipelines
+(model-warming crons, "daily X" generators, batch jobs) live in your own
+services and *drive Spark Control's swap API*. This page documents the safety
+layer around that: a **swap reservation lock**, a **swap-event webhook**, and a
+**read-only schedule registry**.
+
+If only the dashboard ever swaps models, you don't need any of this — it's for
+when something automated also swaps.
+
+All endpoints are on the Spark Control host (same LAN/VPN URL as the LLM, audio,
+and embeddings proxies). There is no API-token auth by design (LAN + split-tunnel
+VPN only); a non-browser client passes the same-origin guard automatically.
+
+---
+
+## 1. Swap reservation lock
+
+A short, TTL-bounded reservation of the swap path. While a lock is held, **any
+real swap that doesn't present the holder's token is refused with `423 Locked`**
+— including the dashboard's manual swap. The holder *name* is descriptive; the
+returned **token** is the secret that authorises swaps and the release.
+
+The lock is in-memory: it resets to *unlocked* if Spark Control restarts (the
+safe-for-availability default), and the swap engine's own in-progress guard
+still prevents two swaps running at once.
+
+### `POST /api/swap/lock` — acquire (or extend)
+
+```json
+// request
+{ "holder": "openclaw-daily-vol", "ttl_seconds": 900, "note": "daily vol run" }
+
+// 200 response
+{
+  "held": true,
+  "holder": "openclaw-daily-vol",
+  "acquired_at": "2026-06-17T12:00:00+00:00",
+  "expires_at":  "2026-06-17T12:15:00+00:00",
+  "seconds_remaining": 900,
+  "note": "daily vol run",
+  "token": "a1b2c3…"        // SECRET — store it; needed to swap and to release
+}
+```
+
+- `ttl_seconds` is optional (default 900) and clamped to `[1, 86400]`.
+- **`409`** if the lock is already held and the request doesn't present its
+  token, even when the `holder` name matches (body includes the current `lock`
+  state). To **extend** your own lock, POST again with the same `holder`
+  **and** your `token`; the token is preserved and the window slides forward.
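The defaulting and clamping rule for `ttl_seconds` can be sketched as follows. The constants mirror the documented default (900) and bounds `[1, 86400]`, and the fallback for a non-integer value follows the manager code in this change. The helper name `effective_ttl` is hypothetical and exists only to illustrate the arithmetic.

```python
from typing import Optional

LOCK_TTL_MIN = 1
LOCK_TTL_MAX = 86_400      # 24h
LOCK_TTL_DEFAULT = 900     # 15 min


def effective_ttl(ttl_seconds: Optional[int]) -> int:
    """Hypothetical helper: the TTL a lock request would actually receive."""
    ttl = LOCK_TTL_DEFAULT if ttl_seconds is None else ttl_seconds
    try:
        ttl = int(ttl)
    except (TypeError, ValueError):
        # A malformed value falls back to the default instead of failing.
        ttl = LOCK_TTL_DEFAULT
    # Clamp into [LOCK_TTL_MIN, LOCK_TTL_MAX].
    return max(LOCK_TTL_MIN, min(LOCK_TTL_MAX, ttl))


print(effective_ttl(None), effective_ttl(0), effective_ttl(10**6))  # 900 1 86400
```

A client that needs the GPU for longer than 24 hours therefore has to extend the lock before it expires; it cannot request one oversized window.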
+
+### `GET /api/swap/lock` — status (no token)
+
+```json
+{ "held": true, "holder": "openclaw-daily-vol", "expires_at": "…", "seconds_remaining": 612, "note": "…" }
+// or
+{ "held": false }
+```
+
+### `DELETE /api/swap/lock` — release
+
+Send your token in the `X-Swap-Lock-Token` header (or `?token=`):
+
+```
+DELETE /api/swap/lock
+X-Swap-Lock-Token: a1b2c3…
+```
+
+- **`403`** if the token doesn't match. The dashboard's human override is
+  `DELETE /api/swap/lock?force=true` (no token).
+
+### Swapping while you hold the lock
+
+Pass the token on the swap call; the dashboard (no token) is then blocked:
+
+```
+POST /api/swap
+X-Swap-Lock-Token: a1b2c3…
+{ "model_key": "gemma-3-27b" }
+```
+
+Recommended scheduler flow: **acquire → swap (with token) → poll `/api/swap/{id}`
+→ release**. Always release in a `finally`; if you crash, the TTL frees it.
+
+> `POST /api/swap/{key}/validate` (pre-flight) and dry-run swaps are **not**
+> blocked by the lock — they don't touch the cluster.
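The recommended scheduler flow (acquire, swap with the token, poll, release) might look like the sketch below. It is a minimal illustration, not the project's client: `swap_with_lock` is a hypothetical name, `client` is assumed to be an `httpx.Client`-shaped object created with the Spark Control base URL, and the `job_id` / `state` response fields and the `ready` / `failed` terminal states are assumptions borrowed from the webhook payload rather than a documented response schema.

```python
import time
from typing import Any


def swap_with_lock(client: Any, model_key: str, holder: str,
                   ttl_seconds: int = 900, poll_interval: float = 5.0) -> str:
    """Acquire the lock, swap with the token, poll to a terminal state, release.

    `client` is anything httpx.Client-shaped: post/get/delete returning a
    response object with `.status_code` and `.json()`.
    """
    resp = client.post("/api/swap/lock",
                       json={"holder": holder, "ttl_seconds": ttl_seconds})
    if resp.status_code == 409:
        raise RuntimeError("swap lock is held by someone else")
    token = resp.json()["token"]
    headers = {"X-Swap-Lock-Token": token}
    try:
        started = client.post("/api/swap", json={"model_key": model_key},
                              headers=headers)
        if started.status_code >= 400:
            raise RuntimeError(f"swap refused: HTTP {started.status_code}")
        job_id = started.json()["job_id"]      # ASSUMED response field
        while True:
            # ASSUMED: the poll response carries a `state` field.
            state = client.get(f"/api/swap/{job_id}").json()["state"]
            if state in ("ready", "failed"):   # ASSUMED terminal states
                return state
            time.sleep(poll_interval)
    finally:
        # Always release; if this process dies first, the TTL frees the lock.
        client.delete("/api/swap/lock", headers=headers)
```

Keeping the release in `finally` means a failed or refused swap still frees the GPU for the dashboard straight away instead of waiting out the TTL.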
+
+---
+
+## 2. Swap-event webhook
+
+Configure a URL in **Configure Sparks → "Swap webhook URL"**. After every real
+swap, Spark Control POSTs:
+
+```json
+{
+  "event": "swap_complete",   // or "swap_failed"
+  "job_id": "1a2b3c4d",
+  "model_key": "gemma-3-27b",
+  "state": "ready",           // or "failed"
+  "returncode": 0,
+  "started_at": "2026-06-17T12:00:00+00:00",
+  "finished_at": "2026-06-17T12:03:11+00:00",
+  "dry_run": false
+}
+```
+
+Headers: `X-Spark-Event: swap_complete`. If you set a **webhook secret**, the
+body is signed: `X-Spark-Signature: sha256=<hmac>` (HMAC-SHA256 of the raw body
+with the shared secret). Verify it like:
+
+```python
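+import hmac, hashlib
+expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
+# Compare in constant time; a missing header counts as a failed check.
+if not hmac.compare_digest(expected, request.headers.get("X-Spark-Signature", "")):
+    raise PermissionError("bad webhook signature")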
+```
+
+Delivery is best-effort and fire-and-forget (5 s timeout, no retries) — a
+webhook failure never affects the swap itself. Dry runs don't fire.
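One pitfall with "HMAC of the raw body" is verifying against re-serialised JSON instead of the bytes that arrived. The sketch below simulates a delivery to show the difference. The secret is a placeholder, and `signature_ok` is a hypothetical helper name, not part of Spark Control.

```python
import hashlib
import hmac
import json


def signature_ok(secret: str, raw_body: bytes, header_value: str) -> bool:
    """Constant-time check of an `X-Spark-Signature` header value."""
    expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_value or "")


# Simulate a delivery: the sender signs the exact bytes it puts on the wire.
secret = "example-shared-secret"  # placeholder value
raw = json.dumps({"event": "swap_complete", "job_id": "1a2b3c4d"}).encode()
header = "sha256=" + hmac.new(secret.encode(), raw, hashlib.sha256).hexdigest()

# Re-serialising the parsed JSON changes the whitespace, and with it the digest.
reencoded = json.dumps(json.loads(raw), separators=(",", ":")).encode()

print(signature_ok(secret, raw, header))        # True
print(signature_ok(secret, reencoded, header))  # False
```

In a web framework this means reading the request body as bytes before any JSON parsing and passing those bytes to the check.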
+
+---
+
+## 3. Schedule registry (read-only display)
+
+So the dashboard can show *what's scheduled to touch the GPU and when*, your
+schedulers register their jobs here. **Spark Control only displays these — it
+never executes them.**
+
+### `POST /api/schedule` — register / update
+
+```json
+// request (pass a stable `id` to update in place on re-register)
+{ "id": "daily-vol", "name": "Daily Vol", "owner": "openclaw",
+  "cron": "0 6 * * *", "next_run": "2026-06-18T06:00:00Z",
+  "description": "Swaps to the big model, generates the vol report" }
+
+// response: the stored entry (generates an id if you omit one)
+```
+
+`name` is required; `id` (if given) must be 1 to 64 characters drawn from `[A-Za-z0-9_.-]`.
+
+### `GET /api/schedule` — list
+
+```json
+{ "schedules": [ { "id": "daily-vol", "name": "Daily Vol", "owner": "openclaw",
+  "cron": "0 6 * * *", "next_run": "…", "description": "…",
+  "registered_at": "…", "updated_at": "…" } ] }
+```
+
+### `DELETE /api/schedule/{id}` — deregister
+
+```json
+{ "deleted": true }
+```
+
+The registry is in-memory — re-register your schedules on your own startup so
+they survive a Spark Control restart.
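Re-registration on startup could be sketched as below, assuming an `httpx.Client`-shaped `client` pointed at the Spark Control base URL. `register_schedules` and the `SCHEDULES` list are hypothetical names; the entry fields are the ones documented above.

```python
from typing import Any, Iterable

# Stable ids make each POST an in-place update rather than a new entry.
SCHEDULES = [
    {"id": "daily-vol", "name": "Daily Vol", "owner": "openclaw",
     "cron": "0 6 * * *", "description": "Swaps to the big model"},
]


def register_schedules(client: Any, schedules: Iterable[dict]) -> list[str]:
    """POST each schedule and return the ids the registry stored."""
    ids = []
    for entry in schedules:
        resp = client.post("/api/schedule", json=entry)
        ids.append(resp.json()["id"])  # the response is the stored entry
    return ids
```

Calling this from the scheduler's own startup hook, and again whenever `next_run` changes, keeps the dashboard panel current without Spark Control having to persist anything.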
@@ -24,12 +24,17 @@ Other env vars: `BIND_PORT`, `MODELS_YAML`, `SSH_DIR`, `SSH_KNOWN_HOSTS`, `MODEL

 ## Tests

-No pytest harness — each suite is a standalone script run with the `image/.venv` interpreter (system python3 has no deps). See the redaction and audio rules for the suites themselves.
+Two kinds, both run with the `image/.venv` interpreter (system python3 has no deps):
+
+- **pytest unit suite** — offline, pure functions, no cluster. `.venv/bin/python -m pytest` from `image/`. Lives in `image/tests/`; currently covers `build_launch_command` (incl. the shell-injection / `shlex` round-trip invariant) and the transcript↔diarizer label-merge (`_merge_words_with_speakers`). Install the test dep once with `pip install -e '.[dev]'`. Add new pure-function coverage here.
+- **Standalone scripts** — the redaction suites and the live-cluster audio e2e are run directly (not via pytest). See the redaction and audio rules.

 ## Conventions

 - Pydantic request models go at **module scope**, never inside a `build_router()` body (FastAPI silently 422s otherwise).
 - New external-facing endpoints get documented in `docs/` (`AUDIO_API.md`, `EMBEDDINGS.md`, `REDACTION_GATEWAY.md`) and noted in release notes.
+- **SSH-input safety:** any user-supplied value that reaches an SSH command on the Sparks MUST go through `app/shellsafe.py` — validate against a whitelist at the API boundary, then `quote_arg`/`quote_args` (`shlex.quote`) at the sink. Never raw f-string a user value into a command string. Existing sinks: `models.build_launch_command`, `download`, `nim`, `services`; `disk.py` keeps its own `_SAFE_DIRNAME` because it needs `$HOME` to expand server-side. The vLLM pre-flight (`validate.py`) relies on `shlex.split` cleanly reversing this quoting — preserve that invariant.
+- **CSRF / same-origin:** state-mutating *control* endpoints are guarded by the `csrf_guard` middleware in `server.py` (rejects requests whose `Origin`/`Referer` host ≠ the served host). A new endpoint meant to be called **cross-origin by downstream apps** (a proxy/data endpoint) must be added to `_CSRF_EXEMPT_PREFIXES`, or browser POSTs from those apps will 403. No app-layer token auth by design (LAN/VPN-only; would break consumers).
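The two-layer SSH-input convention (whitelist at the boundary, `shlex.quote` at the sink, `shlex.split` as the inverse) can be illustrated with the standard library alone. The regex and the helper names here are hypothetical stand-ins; the real rules live in `app/shellsafe.py`, which this sketch does not reproduce.

```python
import re
import shlex

# Hypothetical whitelist for a container name (NOT the real shellsafe rule).
_CONTAINER_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]{0,63}")


def validate_container(value: str) -> str:
    """Layer 1: reject anything outside the whitelist at the API boundary."""
    if not _CONTAINER_RE.fullmatch(value or ""):
        raise ValueError(f"invalid container name: {value!r}")
    return value


def build_logs_command(container: str, tail: int = 50) -> str:
    """Layer 2: quote at the sink even though the value was validated."""
    name = validate_container(container)
    return f"docker logs --tail {int(tail)} {shlex.quote(name)}"


cmd = build_logs_command("vllm_node")
# shlex.split reverses shlex.quote, so the argv is recoverable from the string.
print(shlex.split(cmd))  # ['docker', 'logs', '--tail', '50', 'vllm_node']
```

The round trip is the invariant the pre-flight validator depends on: whatever was quoted into the command string comes back as exactly one argument when split.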

 ## Layout

@@ -25,6 +25,22 @@ npm run prettier   # prettier --write startos (no semicolons, single quotes, tra
 - Version format is `X.Y.Z:N` (`:N` = revision). Bump in `package/startos/versions/v0_1_0.ts`; **replace** the release notes — never leave old notes behind under an extra key (any unknown key fails `tsc`).
 - New external-facing endpoints get noted in release notes for downstream app developers (Recap Relay, Ten31 Transcripts, CRM, Signal Engine consume these APIs).

+## Releasing to Gitea
+
+The s9pk is distributed via Gitea **Releases** (the binary is gitignored — never commit it). Adopters pull the latest asset with a read-only token. Per-version ritual:
+
+```bash
+# 1. bump version in startos/versions/v0_1_0.ts (+ replace release notes), then:
+cd package && make x86                       # build
+# 2. commit + push the source change
+git tag vX.Y.Z && git push gitea vX.Y.Z      # tag — plain vX.Y.Z, NO ':' (git refs forbid it)
+make install                                 # optional: sideload to your own server (restarts it — go/no-go)
+# 3. publish the s9pk as a release asset (needs a write-scoped token):
+GITEA_URL=https://<gitea-host> GITEA_TOKEN=<write-token> make release
+```
+
+`make release` → `scripts/gitea-release.sh`: creates/reuses the release for the tag and uploads (replacing) the s9pk asset; idempotent, fails loud on real HTTP errors. `GITEA_INSECURE=1` skips TLS verify for a self-signed LAN cert. Hand adopters a **read-only** token (repository: Read), ideally on a dedicated reader account; their agent then `GET`s `/api/v1/repos/<owner>/spark-control/releases/latest` and downloads the `.s9pk` asset. Note Gitea returns `browser_download_url` on its configured ROOT_URL (may be a `.local` name) — an off-LAN adopter pulls via whatever address actually reaches the Gitea.
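On the adopter side, picking the asset out of the `releases/latest` response and re-rooting its URL on a reachable address might look like this. It is a sketch under assumptions: `pick_s9pk_url` is a hypothetical helper, the release JSON is assumed to carry an `assets` list with `name` and `browser_download_url` fields, and the reachable address is assumed to serve Gitea under the same path as its ROOT_URL.

```python
from urllib.parse import urlsplit, urlunsplit


def pick_s9pk_url(release: dict, reachable_base: str) -> str:
    """Find the .s9pk asset and re-root its URL on an address we can reach."""
    for asset in release.get("assets", []):
        if asset.get("name", "").endswith(".s9pk"):
            url = urlsplit(asset["browser_download_url"])
            base = urlsplit(reachable_base)
            # Keep the path and query, swap in the reachable scheme and host.
            return urlunsplit((base.scheme, base.netloc, url.path, url.query, ""))
    raise LookupError("release has no .s9pk asset")
```

The download request itself still needs the read-only token, sent the same way as on the `releases/latest` call.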
+
 ## Layout

 - `package/startos/` — manifest, interfaces, actions (`configureSparks`, `showPublicKey`), `versions/v0_1_0.ts` (current version string + release notes).
@@ -41,7 +41,7 @@ from .config import Settings
 logger = logging.getLogger("spark-control.audio")


-# Kokoro default voice. The four curated voices below were Grant-tested for
+# Kokoro default voice. The four curated voices below were Alice-tested for
 # narration/recap-style content; bm_george is the default. Clients can pass
 # any of Kokoro's 67 voices in the `voice` field — see /v1/models.
 DEFAULT_VOICE = "bm_george"
@@ -100,7 +100,7 @@ def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
                "kind": "stt",
            },
        ]
-        # Curated first — these are the four Grant chose for narration/recap.
+        # Curated first — these are the four Alice chose for narration/recap.
        seen = set()
        for v in CURATED_VOICES:
            data.append({
@@ -1,13 +1,54 @@
 from __future__ import annotations
+import logging
 import os
 from dataclasses import dataclass
 from pathlib import Path

+from .shellsafe import validate_container
+
+log = logging.getLogger(__name__)
+

 def _env(name: str, default: str = "") -> str:
    return os.environ.get(name, default)


+def _env_container(name: str, default: str) -> str:
+    """Resolve a container-name env var, validating it at the config boundary.
+
+    The value flows into `docker logs`/`docker exec` over SSH, so it's quoted at
+    the sink — but per the repo's two-layer convention it's also whitelist-checked
+    here. A malformed optional value falls back to `default` rather than crashing
+    daemon startup (mirrors `_env_int` for VLLM_PORT)."""
+    val = os.environ.get(name, "") or default
+    try:
+        return validate_container(val)
+    except ValueError:
+        log.warning("ignoring invalid %s=%r; using %r", name, val, default)
+        return default
+
+
+def _env_set(name: str) -> frozenset[str]:
+    """Parse a comma-separated env var into a lowercased frozenset of keys.
+
+    Used by DISABLED_SERVICES so an adopter whose cluster doesn't run a given
+    support service can switch its tile + probes off entirely (rather than have
+    the probe hit whatever else listens on that port — e.g. a vLLM sharing
+    Parakeet's default 8000)."""
+    raw = os.environ.get(name, "")
+    return frozenset(part.strip().lower() for part in raw.split(",") if part.strip())
+
+
+def _env_int(name: str, default: int) -> int:
+    """Parse an int env var, falling back to `default` when unset, blank, or
+    malformed. The StartOS Configure panel passes optional numeric fields as an
+    empty string when left blank, so a bare int("") would crash daemon startup."""
+    try:
+        return int(os.environ.get(name, "") or default)
+    except (TypeError, ValueError):
+        return default
+
+
 def _resolve_models_yaml() -> str:
    if env := os.environ.get("MODELS_YAML"):
        return env
@@ -42,12 +83,19 @@ class Settings:
    qdrant_user: str
    qdrant_container: str
    qdrant_collection: str
+    matrix_bridge_host: str
+    matrix_bridge_user: str
+    matrix_bridge_container: str
+    matrix_bridge_dir: str
+    matrix_bridge_branch: str
    redaction_map_db: str
    redaction_map_ttl: int
    ssh_key_path: str
    ssh_known_hosts: str
    models_yaml: str
    vllm_port: int
+    vllm_container: str
+    disabled_services: frozenset[str]
    parakeet_port: int
    kokoro_port: int
    embed_port: int
@@ -55,6 +103,8 @@ class Settings:
    bind_port: int
    open_webui_url: str
    ngc_api_key: str
+    swap_webhook_url: str
+    swap_webhook_secret: str

    @classmethod
    def from_env(cls) -> "Settings":
@@ -81,20 +131,47 @@ class Settings:
            qdrant_user=_env("QDRANT_USER") or spark2_user,
            qdrant_container=_env("QDRANT_CONTAINER") or "qdrant",
            qdrant_collection=_env("QDRANT_COLLECTION", ""),
+            # matrix-bridge bot container, driven as its own SSH user (the owner
+            # of the ~/matrix-bridge git clone) so git/docker run unprivileged.
+            # The user is BLANK by default and set via the "Configure Sparks"
+            # action; leaving it blank reports the service as unconfigured, which
+            # hides the tile. That keeps the shared package portable — a
+            # deployment without the bot never shows a stray tile or a hardcoded
+            # username. Host defaults to Spark 2 (same box); container/dir/branch
+            # are sensible defaults. All are env-overridable.
+            matrix_bridge_host=_env("MATRIX_BRIDGE_HOST") or spark2_host,
+            matrix_bridge_user=_env("MATRIX_BRIDGE_USER"),
+            matrix_bridge_container=_env("MATRIX_BRIDGE_CONTAINER") or "matrix-bridge",
+            matrix_bridge_dir=_env("MATRIX_BRIDGE_DIR") or "~/matrix-bridge",
+            matrix_bridge_branch=_env("MATRIX_BRIDGE_BRANCH") or "master",
            # Redaction gateway pseudonym-map store (server-held de-anon key).
            redaction_map_db=_env("REDACTION_MAP_DB", "/data/redaction_maps.db"),
-            redaction_map_ttl=int(_env("REDACTION_MAP_TTL", "7200")),
+            redaction_map_ttl=_env_int("REDACTION_MAP_TTL", 7200),
            ssh_key_path=_env("SSH_KEY_PATH"),
            ssh_known_hosts=_env("SSH_KNOWN_HOSTS"),
            models_yaml=_resolve_models_yaml(),
-            vllm_port=int(_env("VLLM_PORT", "8888")),
-            parakeet_port=int(_env("PARAKEET_PORT", "8000")),
-            kokoro_port=int(_env("KOKORO_PORT", "8880")),
-            embed_port=int(_env("EMBED_PORT", "8088")),
-            qdrant_port=int(_env("QDRANT_PORT", "6333")),
-            bind_port=int(_env("BIND_PORT", "9999")),
+            vllm_port=_env_int("VLLM_PORT", 8888),
+            # Container name for the swappable vLLM on Spark 1. Defaults to the
+            # bundled launch-cluster.sh container; override if you named yours
+            # something else (the swap log-tail and pre-flight validator exec
+            # into it by name).
+            vllm_container=_env_container("VLLM_CONTAINER", "vllm_node"),
+            # Built-in support-service keys (parakeet, kokoro, embeddings,
+            # qdrant) the deployment doesn't run — hidden from the dashboard and
+            # never probed.
+            disabled_services=_env_set("DISABLED_SERVICES"),
+            parakeet_port=_env_int("PARAKEET_PORT", 8000),
+            kokoro_port=_env_int("KOKORO_PORT", 8880),
+            embed_port=_env_int("EMBED_PORT", 8088),
+            qdrant_port=_env_int("QDRANT_PORT", 6333),
+            bind_port=_env_int("BIND_PORT", 9999),
            open_webui_url=_env("OPEN_WEBUI_URL", ""),
            ngc_api_key=_env("NGC_API_KEY", ""),
+            # Coordination layer: fire a swap-lifecycle webhook to this URL so
+            # downstream consumers re-point their model config on a swap. Blank
+            # ⇒ disabled. The optional secret HMAC-signs the body (X-Spark-Signature).
+            swap_webhook_url=_env("SWAP_WEBHOOK_URL", ""),
+            swap_webhook_secret=_env("SWAP_WEBHOOK_SECRET", ""),
        )

    @property
@@ -0,0 +1,342 @@
+"""Cluster-coordination layer: the GPU swap lock, swap-event webhook, and the
+read-only schedule registry.
+
+Spark Control is the **control plane / GPU arbiter, not a job runner.** Recurring
+business pipelines live in separate services that *call* the swap API. These
+three primitives add the *safety* layer around that:
+
+- **Swap lock** — a TTL-bounded reservation of the swap path. An external
+  scheduler acquires it before swapping; while held by someone else the
+  dashboard's manual swap is refused (enforced in the swap endpoint, not
+  advisory). Holder name is descriptive; the returned token is the secret that
+  authorises a swap or a release.
+- **Webhook** — fires `swap_complete` / `swap_failed` to a configurable URL so
+  downstream consumers re-point their provider config when the running model
+  changes. Optionally HMAC-signed.
+- **Schedule registry** — a read-only view the dashboard surfaces, *registered
+  by* external schedulers. Spark Control stores what it's told; it does not own
+  or execute any schedule.
+
+All state is in-memory (mirroring the swap/download/NIM job managers). On a
+restart the lock resets to *unlocked* — the available-by-default failure mode;
+the swap manager's own in-progress guard still prevents two swaps at once —
+and schedulers re-register their schedules.
+"""
+from __future__ import annotations
+import hashlib
+import hmac
+import json
+import logging
+import re
+import uuid
+from dataclasses import dataclass
+from datetime import datetime, timedelta, timezone
+from typing import Optional
+
+import httpx
+
+log = logging.getLogger(__name__)
+
+# A lock reserves the GPU for a window; clamp the TTL so a buggy client can
+# neither pin the cluster forever nor take a zero-length (useless) lock.
+LOCK_TTL_MIN = 1
+LOCK_TTL_MAX = 86_400      # 24h
+LOCK_TTL_DEFAULT = 900     # 15 min
+
+# Schedule ids are reflected to the dashboard and used as a URL path segment on
+# delete, so a caller-supplied id is whitelist-checked. Generated ids are hex.
+_SCHEDULE_ID_RE = re.compile(r"^[A-Za-z0-9_.-]{1,64}$")
+
+
+def valid_schedule_id(value: str) -> bool:
+    """Whitelist check for a caller-supplied schedule id (register and delete)."""
+    return bool(_SCHEDULE_ID_RE.match(value or ""))
+
+
+def _now() -> datetime:
+    return datetime.now(timezone.utc)
+
+
+def _iso(dt: datetime) -> str:
+    return dt.isoformat()
+
+
+# ---------------------------------------------------------------- swap lock ----
+
+class LockHeld(Exception):
+    """The lock is held by a different holder. Carries the public lock state so
+    the endpoint can return holder + expiry in the 409 body."""
+
+    def __init__(self, state: dict) -> None:
+        self.state = state
+        super().__init__("swap lock is held by another holder")
+
+
+@dataclass
+class LockState:
+    holder: str
+    token: str
+    acquired_at: datetime
+    expires_at: datetime
+    note: str = ""
+
+    def public(self, now: datetime) -> dict:
+        """Token-free view safe to expose on GET / in error bodies."""
+        return {
+            "held": True,
+            "holder": self.holder,
+            "acquired_at": _iso(self.acquired_at),
+            "expires_at": _iso(self.expires_at),
+            "seconds_remaining": max(0, int((self.expires_at - now).total_seconds())),
+            "note": self.note,
+        }
+
+
+class SwapLockManager:
+    """In-memory, TTL-bounded reservation of the GPU swap path.
+
+    `now` is injectable on every method purely so the expiry logic is testable
+    without sleeping; production calls omit it and get wall-clock UTC.
+    """
+
+    def __init__(self) -> None:
+        self._lock: Optional[LockState] = None
+
+    def _active(self, now: Optional[datetime] = None) -> Optional[LockState]:
+        """The current lock if one is held and unexpired; lazily clears an
+        expired lock so it never lingers."""
+        now = now or _now()
+        if self._lock is not None and self._lock.expires_at <= now:
+            self._lock = None
+        return self._lock
+
+    def status(self, now: Optional[datetime] = None) -> dict:
+        now = now or _now()
+        active = self._active(now)
+        return active.public(now) if active else {"held": False}
+
+    def acquire(
+        self,
+        holder: str,
+        ttl_seconds: Optional[int] = None,
+        note: str = "",
+        token: Optional[str] = None,
+        *,
+        now: Optional[datetime] = None,
+    ) -> LockState:
+        """Acquire a free lock (new token), or extend one already held by
+        presenting its token. A request without the token is refused even if the
+        holder name matches — the name is descriptive, the token is the secret.
+        """
+        now = now or _now()
+        holder = (holder or "").strip()
+        if not holder:
+            raise ValueError("holder is required")
+        ttl = ttl_seconds if ttl_seconds is not None else LOCK_TTL_DEFAULT
+        try:
+            ttl = int(ttl)
+        except (TypeError, ValueError):
+            ttl = LOCK_TTL_DEFAULT
+        ttl = max(LOCK_TTL_MIN, min(LOCK_TTL_MAX, ttl))
+
+        active = self._active(now)
+        if active is not None:
+            # Held — only the token-holder may extend/re-acquire.
+            if not (token and hmac.compare_digest(active.token, token)):
+                raise LockHeld(active.public(now))
+            self._lock = LockState(
+                holder=holder or active.holder,
+                token=active.token,
+                acquired_at=active.acquired_at,
+                expires_at=now + timedelta(seconds=ttl),
+                note=note or active.note,
+            )
+            return self._lock
+
+        self._lock = LockState(
+            holder=holder,
+            token=uuid.uuid4().hex,
+            acquired_at=now,
+            expires_at=now + timedelta(seconds=ttl),
+            note=note,
+        )
+        return self._lock
+
+    def verify(self, token: Optional[str], now: Optional[datetime] = None) -> bool:
+        """True iff `token` matches the currently-active lock."""
+        active = self._active(now)
+        return bool(active and token and hmac.compare_digest(active.token, token))
+
+    def is_blocked_by(self, token: Optional[str], now: Optional[datetime] = None) -> Optional[dict]:
+        """Single-read swap gate. Returns the public lock state if an active
+        lock blocks a swap carrying this token, else None. Does exactly one
+        `_active()` read so the decision can't straddle a TTL expiry the way a
+        separate status()+verify() pair could (which, at the expiry tick, would
+        spuriously refuse a swap that should now be allowed)."""
+        now = now or _now()
+        active = self._active(now)
+        if active is None:
+            return None
+        if token and hmac.compare_digest(active.token, token):
+            return None
+        return active.public(now)
+
+    def release(
+        self,
+        token: Optional[str] = None,
+        *,
+        force: bool = False,
+        now: Optional[datetime] = None,
+    ) -> bool:
+        """Release the lock. Returns False if nothing was held. Requires the
+        matching token unless `force` (the human override from the dashboard)."""
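+        # Pin `now` once so the expiry check and the token check see the same instant.
+        now = now or _now()
+        active = self._active(now)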
+        if active is None:
+            return False
+        if not force and not self.verify(token, now):
+            raise PermissionError("token does not hold the lock")
+        self._lock = None
+        return True
+
+
+# ----------------------------------------------------------------- webhook ----
+
+def build_webhook_payload(
+    *,
+    event: str,
+    job_id: str,
+    model_key: str,
+    state: str,
+    returncode: Optional[int],
+    started_at: Optional[str],
+    finished_at: Optional[str],
+    dry_run: bool,
+) -> dict:
+    return {
+        "event": event,            # swap_complete | swap_failed
+        "job_id": job_id,
+        "model_key": model_key,
+        "state": state,
+        "returncode": returncode,
+        "started_at": started_at,
+        "finished_at": finished_at,
+        "dry_run": dry_run,
+    }
+
+
+def sign_payload(secret: str, body: bytes) -> str:
+    """`X-Spark-Signature` value: sha256 HMAC of the exact JSON body the
+    consumer receives, so they can recompute and trust it."""
+    return "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
+
+
+class WebhookNotifier:
+    """Fire-and-forget POST of swap-lifecycle events. A webhook failure is
+    logged and swallowed — it must never affect the swap outcome."""
+
+    def __init__(self, url: str, secret: str = "", timeout: float = 5.0) -> None:
+        self.url = (url or "").strip()
+        self.secret = secret or ""
+        self.timeout = timeout
+
+    @property
+    def enabled(self) -> bool:
+        return bool(self.url)
+
+    async def fire(self, event: str, payload: dict) -> None:
+        if not self.enabled:
+            return
+        body = json.dumps(payload).encode()
+        headers = {
+            "content-type": "application/json",
+            "user-agent": "spark-control-webhook",
+            "x-spark-event": event,
+        }
+        if self.secret:
+            headers["x-spark-signature"] = sign_payload(self.secret, body)
+        try:
+            async with httpx.AsyncClient(timeout=self.timeout) as client:
+                await client.post(self.url, content=body, headers=headers)
+        except Exception as e:  # noqa: BLE001 — best-effort, never propagate
+            log.warning("swap webhook to %s failed: %s", self.url, e)
+
+
+# -------------------------------------------------------- schedule registry ----
+
+@dataclass
+class ScheduleEntry:
+    id: str
+    name: str
+    owner: str = ""
+    cron: str = ""
+    next_run: str = ""
+    description: str = ""
+    registered_at: str = ""
+    updated_at: str = ""
+
+    def public(self) -> dict:
+        return {
+            "id": self.id,
+            "name": self.name,
+            "owner": self.owner,
+            "cron": self.cron,
+            "next_run": self.next_run,
+            "description": self.description,
+            "registered_at": self.registered_at,
+            "updated_at": self.updated_at,
+        }
+
+
+class ScheduleRegistry:
+    """What external schedulers tell us about their cron jobs. Read-only from the
+    dashboard's side; Spark Control never executes any of it."""
+
+    def __init__(self) -> None:
+        self._items: dict[str, ScheduleEntry] = {}
+
+    def list(self) -> list[dict]:
+        return [e.public() for e in self._items.values()]
+
+    def register(
+        self,
+        *,
+        name: str,
+        id: Optional[str] = None,
+        owner: str = "",
+        cron: str = "",
+        next_run: str = "",
+        description: str = "",
+    ) -> ScheduleEntry:
+        name = (name or "").strip()
+        if not name:
+            raise ValueError("name is required")
+        if id is not None:
+            id = id.strip()
+            if id and not valid_schedule_id(id):
+                raise ValueError("id must match [A-Za-z0-9_.-] (max 64 chars)")
+        ts = _iso(_now())
+        existing = self._items.get(id) if id else None
+        if existing is not None:
+            existing.name = name
+            existing.owner = owner.strip()
+            existing.cron = cron
+            existing.next_run = next_run
+            existing.description = description
+            existing.updated_at = ts
+            return existing
+        sid = id or uuid.uuid4().hex[:8]
+        entry = ScheduleEntry(
+            id=sid,
+            name=name,
+            owner=owner.strip(),
+            cron=cron,
+            next_run=next_run,
+            description=description,
+            registered_at=ts,
+            updated_at=ts,
+        )
+        self._items[sid] = entry
+        return entry
+
+    def delete(self, schedule_id: str) -> bool:
+        return self._items.pop(schedule_id, None) is not None
@@ -10,6 +10,17 @@ Format:
        port: 8001
        health_path: /health
        image: nvcr.io/nim/nvidia/riva-multilingual:latest
+
+A `kind: vllm` entry monitors an additional vLLM on another Spark. The swap
+machinery never touches it (it only drives the primary Spark 1 vLLM). The entry
+gets a health tile probed via /v1/models, plus container state and start/stop/restart:
+    custom:
+      - key: vllm-spark2
+        kind: vllm
+        host: <spark-2-ip>
+        user: <ssh-user>
+        container: vllm_node
+        port: 8000
 """
 from __future__ import annotations
 import os
@@ -377,6 +377,10 @@ class DeepHealth:
    async def run_all(self) -> dict[str, ProbeResult]:
        results = {}
        for name in self.PROBES:
+            # Don't deep-probe a service the deployment switched off — its port
+            # may be answered by something else (e.g. a vLLM on Parakeet's 8000).
+            if name in self.settings.disabled_services:
+                continue
            results[name] = await self.run_one(name)
        return results

@@ -15,6 +15,7 @@ from dataclasses import dataclass
 from typing import Optional

 from .config import Settings
+from .shellsafe import quote_arg
 from .ssh import ssh_run


@@ -76,16 +77,52 @@ async def probe_host(host: str, user: str, repo: str, settings: Settings) -> Hos
    return HostDiskResult(host=host, on_disk=True, size_bytes=size)


-async def probe_disk(repo: str, mode: str, settings: Settings) -> DiskStatus:
-    """Probe one model across the relevant Sparks based on its mode (solo|cluster)."""
+async def probe_local_host(host: str, user: str, path: str, settings: Settings) -> HostDiskResult:
+    """Return whether a local model directory exists on this host and its size.
+
+    For locally fine-tuned models (a Spark directory, not an HF cache entry). The
+    path is whitelisted at the API boundary (shellsafe.validate_local_path); we
+    also shlex-quote it here as defense in depth.
+    """
+    if not host or not user:
+        return HostDiskResult(host=host or "?", on_disk=False, error="host not configured")
+    qp = quote_arg(path)
+    cmd = f"if [ -d {qp} ]; then du -sb {qp} 2>/dev/null | cut -f1; else echo MISSING; fi"
+    rc, out, err = await ssh_run(host, user, cmd, settings, timeout=20.0)
+    if rc != 0:
+        return HostDiskResult(host=host, on_disk=False, error=(err or out).strip() or f"rc={rc}")
+    raw = out.strip()
+    if raw == "MISSING" or raw == "":
+        return HostDiskResult(host=host, on_disk=False)
+    try:
+        size = int(raw.splitlines()[-1])
+    except ValueError:
+        return HostDiskResult(host=host, on_disk=False, error=f"unparsable du output: {raw!r}")
+    return HostDiskResult(host=host, on_disk=True, size_bytes=size)
+
+
+async def probe_disk(
+    repo: str, mode: str, settings: Settings, *, local_path: str | None = None
+) -> DiskStatus:
+    """Probe one model across the relevant Sparks based on its mode (solo|cluster).
+
+    A local model (local_path set) is probed by directory; otherwise by HF cache.
+    """
    hosts: list[tuple[str, str]] = [(settings.spark1_host, settings.spark1_user)]
    if mode == "cluster" and settings.spark2_host:
        hosts.append((settings.spark2_host, settings.spark2_user))

+    if local_path:
+        results = await asyncio.gather(
+            *(probe_local_host(h, u, local_path, settings) for h, u in hosts)
+        )
+        key = local_path
+    else:
        results = await asyncio.gather(*(probe_host(h, u, repo, settings) for h, u in hosts))
+        key = repo
    on_disk = any(r.on_disk for r in results)
    total = sum(r.size_bytes for r in results)
-    return DiskStatus(repo=repo, on_disk=on_disk, total_bytes=total, per_host=list(results))
+    return DiskStatus(repo=key, on_disk=on_disk, total_bytes=total, per_host=list(results))


 async def delete_host(host: str, user: str, repo: str, settings: Settings) -> HostDiskResult:
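Review note: the remote command in `probe_local_host` and the handling of its output can be exercised offline, without SSH. The sketch below assumes `quote_arg` is a thin wrapper around `shlex.quote` (the docstring suggests so, but `shellsafe.py` is not part of this hunk). `build_probe_command` and `parse_probe_output` are hypothetical helper names, and the tuple return is a simplification of `HostDiskResult`.

```python
import shlex
from typing import Optional, Tuple


def build_probe_command(path: str) -> str:
    """Build the remote one-liner: print the size in bytes, or MISSING."""
    qp = shlex.quote(path)  # assumption: this is what quote_arg does
    return f"if [ -d {qp} ]; then du -sb {qp} 2>/dev/null | cut -f1; else echo MISSING; fi"


def parse_probe_output(out: str) -> Tuple[bool, Optional[int], Optional[str]]:
    """Return (on_disk, size_bytes, error) for the command's stdout."""
    raw = out.strip()
    if raw in ("", "MISSING"):
        return (False, None, None)
    try:
        # du may be preceded by shell banner noise, so only the last line counts.
        return (True, int(raw.splitlines()[-1]), None)
    except ValueError:
        return (False, None, f"unparsable du output: {raw!r}")
```

Quoting keeps a path containing spaces as a single shell word, which `shlex.split(build_probe_command(...))` can confirm in a unit test.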
@@ -26,6 +26,9 @@ echo GPU=$(nvidia-smi --query-gpu=name,utilization.gpu,temperature.gpu,power.dra
 echo GPU_MEM_USED_MIB=$(nvidia-smi --query-compute-apps=used_gpu_memory --format=csv,noheader,nounits 2>/dev/null | awk '{s+=$1} END {print s+0}')
 DEFIF=$(ip route show default 2>/dev/null | awk '{print $5; exit}')
 echo MAC=$(cat /sys/class/net/$DEFIF/address 2>/dev/null)
+WGIF=$(ip -o link show type wireguard 2>/dev/null | awk -F': ' 'NR==1 {print $2}')
+echo WG_IFACE=$WGIF
+echo WG_ADDR=$(ip -o -4 addr show "$WGIF" 2>/dev/null | awk 'NR==1 {print $4}')
 """.strip()


@@ -84,6 +87,11 @@ def _parse(out: str) -> dict:
    # MAC address on the default-route interface (for Wake-on-LAN)
    if info.get("mac"):
        parsed["mac"] = info["mac"].lower()
+    # WireGuard tunnel membership: name + address of the first wg interface, if
+    # any. Read-only and unprivileged (`ip` needs no root), so it never depends
+    # on sudo and never breaks the probe — absence just yields no badge.
+    parsed["wg_iface"] = info.get("wg_iface") or None
+    parsed["wg_addr"] = info.get("wg_addr") or None
    return parsed


@@ -6,17 +6,28 @@ from .config import Settings
 _TIMEOUT = 3.0


-async def check_vllm(settings: Settings) -> dict:
-    base_url = (
-        f"http://{settings.spark1_host}:{settings.vllm_port}/v1"
-        if settings.spark1_host
-        else None
-    )
-    if not settings.spark1_host:
-        return {"ok": False, "error": "spark1 not configured", "base_url": base_url}
+def _disabled(settings: Settings, key: str) -> dict | None:
+    """A clean 'disabled' verdict if `key` is in DISABLED_SERVICES, else None.
+
+    Lets an adopter who doesn't run a given support service switch its probe off
+    entirely — so the probe never hits whatever else listens on that port, and
+    the connectivity log doesn't record it as perpetually down."""
+    if key in settings.disabled_services:
+        return {"ok": False, "disabled": True, "error": "disabled", "base_url": None}
+    return None
+
+
+async def probe_vllm_endpoint(host: str, port: int) -> dict:
+    """Probe any OpenAI-compatible vLLM at host:port via /v1/models.
+
+    Shared by the primary (Spark 1) health check and any extra vLLM registered
+    as a custom service (kind: vllm) to monitor a second Spark."""
+    base_url = f"http://{host}:{port}/v1" if host else None
+    if not host:
+        return {"ok": False, "error": "vllm host not configured", "base_url": base_url}
    try:
        async with httpx.AsyncClient(timeout=_TIMEOUT) as c:
-            r = await c.get(f"http://{settings.spark1_host}:{settings.vllm_port}/v1/models")
+            r = await c.get(f"http://{host}:{port}/v1/models")
            r.raise_for_status()
            ids = [m["id"] for m in r.json().get("data", [])]
            return {
@@ -29,7 +40,15 @@ async def check_vllm(settings: Settings) -> dict:
        return {"ok": False, "error": str(e), "base_url": base_url}


+async def check_vllm(settings: Settings) -> dict:
+    if not settings.spark1_host:
+        return {"ok": False, "error": "spark1 not configured", "base_url": None}
+    return await probe_vllm_endpoint(settings.spark1_host, settings.vllm_port)
+
+
 async def check_parakeet(settings: Settings) -> dict:
+    if d := _disabled(settings, "parakeet"):
+        return d
    base_url = (
        f"http://{settings.parakeet_host}:{settings.parakeet_port}"
        if settings.parakeet_host
@@ -47,6 +66,8 @@ async def check_parakeet(settings: Settings) -> dict:


 async def check_kokoro(settings: Settings) -> dict:
+    if d := _disabled(settings, "kokoro"):
+        return d
    base_url = (
        f"http://{settings.kokoro_host}:{settings.kokoro_port}"
        if settings.kokoro_host
@@ -68,6 +89,8 @@ async def check_kokoro(settings: Settings) -> dict:


 async def check_embeddings(settings: Settings) -> dict:
+    if d := _disabled(settings, "embeddings"):
+        return d
    base_url = (
        f"http://{settings.embed_host}:{settings.embed_port}"
        if settings.embed_host
@@ -89,6 +112,8 @@ async def check_embeddings(settings: Settings) -> dict:


 async def check_qdrant(settings: Settings) -> dict:
+    if d := _disabled(settings, "qdrant"):
+        return d
    base_url = (
        f"http://{settings.qdrant_host}:{settings.qdrant_port}"
        if settings.qdrant_host
@@ -0,0 +1,186 @@
+"""Update + logs for the matrix-bridge bot container on the Spark.
+
+matrix-bridge is a single Docker container managed by docker compose out of a
+git clone at `~matrix_bridge_user/matrix-bridge`. Status (the badge) and
+start/stop/restart ride the generic service machinery in `services.py`
+(`docker_state` / `run_action`). The two things that don't fit that mould live
+here:
+
+  - **Update** — `git fetch && git reset --hard origin/<branch> && docker
+    compose up -d --build`. Long-running (docker build), so it streams like the
+    vLLM `UpdateManager`: fire-and-forget job, SSE stream, fail-loud rc.
+  - **Logs** — a one-shot `docker logs --tail N` for diagnosing a red badge.
+
+We connect **directly as the configured user** (`modelo` — the repo owner), so
+git never trips its dubious-ownership guard and docker runs via the user's
+docker-group membership. We deliberately do NOT `sudo -iu modelo`: this Spark
+has no passwordless sudo, so a sudo wrap would hang in SSH BatchMode.
+"""
+from __future__ import annotations
+import asyncio
+import time
+import uuid
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+from typing import Optional
+
+from .config import Settings
+from .shellsafe import quote_arg
+from .ssh import ssh_run, ssh_stream, StreamHandle
+
+# Hard ceiling on a single update. A first build after a base-image bump is
+# slow (minutes); the cache makes later ones quick. 25 min is generous headroom
+# without letting a genuinely wedged build spin forever.
+_UPDATE_TIMEOUT_S = 1500
+
+
+def build_update_command(directory: str, branch: str) -> str:
+    """The update one-liner, run from the bot's git clone as its owner.
+
+    `directory` and `branch` come from operator config (not request input), so
+    they're interpolated directly — same trust model as the Spark hostnames in
+    `health`/`updates`. `directory` may be `~/...`, which must stay unquoted so
+    the remote login shell expands it; quoting would defeat that.
+    """
+    return (
+        f"cd {directory} && "
+        f"git fetch origin && "
+        f"git reset --hard origin/{branch} && "
+        f"docker compose up -d --build"
+    )
+
+
+def _phase_for(line: str) -> Optional[str]:
+    """Map a streamed output line to a human-readable phase, or None to keep
+    the current phase. Kept loose — compose/buildkit output varies by version."""
+    low = line.lower()
+    if "git reset" in low or "head is now at" in low:
+        return "Resetting to the latest release…"
+    if "docker compose" in low or "buildkit" in low or low.startswith("step ") or "=> " in line or "building " in low:
+        return "Building the bot image…"
+    if "recreate" in low or "starting" in low or "started" in low or "container matrix-bridge" in low:
+        return "Recreating the container…"
+    if "already up to date" in low:
+        return "No new code; rebuilding…"
+    return None
+
+
+@dataclass
+class UpdateJob:
+    id: str
+    started_at: str
+    state: str = "starting"
+    lines: list[str] = field(default_factory=list)
+    returncode: Optional[int] = None
+    finished_at: Optional[str] = None
+    phase: str = "Starting…"
+
+    def append(self, line: str) -> None:
+        self.lines.append(line)
+        if len(self.lines) > 1000:
+            del self.lines[: len(self.lines) - 1000]
+
+
+class MatrixBridgeManager:
+    def __init__(self, settings: Settings) -> None:
+        self.settings = settings
+        self.lock = asyncio.Lock()
+        self.jobs: dict[str, UpdateJob] = {}
+        self.current_job_id: Optional[str] = None
+
+    def _configured(self) -> bool:
+        s = self.settings
+        return bool(s.matrix_bridge_host and s.matrix_bridge_user)
+
+    def get(self, job_id: str) -> UpdateJob | None:
+        return self.jobs.get(job_id)
+
+    async def fetch_logs(self, tail: int = 100) -> dict:
+        """One-shot `docker logs --tail N <container>` (stderr merged in)."""
+        s = self.settings
+        if not self._configured():
+            return {"ok": False, "error": "matrix-bridge host not configured"}
+        tail = max(1, min(int(tail), 1000))
+        # tail is already int-clamped, but quote at the sink anyway so the
+        # shellsafe convention (no raw interpolation into an SSH command) holds
+        # regardless of caller.
+        cmd = f"docker logs --tail {quote_arg(str(tail))} {quote_arg(s.matrix_bridge_container)} 2>&1"
+        rc, out, err = await ssh_run(
+            s.matrix_bridge_host, s.matrix_bridge_user, cmd, s, timeout=20
+        )
+        return {
+            "ok": rc == 0,
+            "rc": rc,
+            "container": s.matrix_bridge_container,
+            "output": (out or err).strip(),
+        }
+
+    async def trigger_update(self) -> UpdateJob:
+        if not self._configured():
+            raise RuntimeError("matrix-bridge host not configured")
+        if self.lock.locked() or self.current_job_id is not None:
+            raise RuntimeError("An update is already in progress")
+        job = UpdateJob(
+            id=uuid.uuid4().hex[:8],
+            started_at=datetime.now(timezone.utc).isoformat(),
+        )
+        self.jobs[job.id] = job
+        self.current_job_id = job.id
+        asyncio.create_task(self._run(job))
+        return job
+
+    async def _run(self, job: UpdateJob) -> None:
+        async with self.lock:
+            try:
+                await self._do(job)
+                if job.state != "failed":
+                    job.state = "done"
+                    job.returncode = 0
+                    job.phase = "Done"
+            except asyncio.TimeoutError:
+                job.append(f"[error] update timed out after {_UPDATE_TIMEOUT_S}s")
+                job.state = "failed"
+                job.returncode = 124
+                job.phase = "Timed out"
+            except Exception as e:
+                job.append(f"[error] {type(e).__name__}: {e}")
+                job.state = "failed"
+                if job.returncode is None:
+                    job.returncode = 1
+            finally:
+                job.finished_at = datetime.now(timezone.utc).isoformat()
+                if self.current_job_id == job.id:
+                    self.current_job_id = None
+
+    async def _do(self, job: UpdateJob) -> None:
+        s = self.settings
+        cmd = build_update_command(s.matrix_bridge_dir, s.matrix_bridge_branch)
+        job.append(f"$ {cmd}")
+        job.state = "running"
+        job.phase = "Fetching latest code…"
+
+        handle = StreamHandle()
+        gen = ssh_stream(s.matrix_bridge_host, s.matrix_bridge_user, cmd, s, handle=handle)
+        deadline = time.monotonic() + _UPDATE_TIMEOUT_S
+        try:
+            while True:
+                remaining = deadline - time.monotonic()
+                if remaining <= 0:
+                    raise asyncio.TimeoutError
+                try:
+                    line = await asyncio.wait_for(gen.__anext__(), timeout=remaining)
+                except StopAsyncIteration:
+                    break
+                job.append(line)
+                phase = _phase_for(line)
+                if phase:
+                    job.phase = phase
+        finally:
+            # Closing the generator terminates the underlying ssh process and
+            # populates handle.returncode via ssh_stream's finally block.
+            await gen.aclose()
+
+        rc = handle.returncode or 0
+        if rc != 0:
+            job.state = "failed"
+            job.returncode = rc
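Review note: `_do` enforces one overall deadline across the whole stream rather than a per-line timeout, by recomputing the remaining budget before each `__anext__()`. The self-contained sketch below shows the same pattern against a fake line source. `slow_lines` and `collect_with_deadline` are hypothetical names; the real `ssh_stream` and `StreamHandle` are not reproduced here.

```python
import asyncio
import time
from typing import AsyncIterator, List


async def slow_lines(n: int, delay: float) -> AsyncIterator[str]:
    """Fake stream: yields n lines, sleeping `delay` seconds before each."""
    for i in range(n):
        await asyncio.sleep(delay)
        yield f"line {i}"


async def collect_with_deadline(gen: AsyncIterator[str], budget_s: float) -> List[str]:
    """Drain `gen`, raising asyncio.TimeoutError once the total budget is spent."""
    deadline = time.monotonic() + budget_s
    lines: List[str] = []
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise asyncio.TimeoutError
            try:
                # Each wait gets only what is left of the overall budget.
                line = await asyncio.wait_for(gen.__anext__(), timeout=remaining)
            except StopAsyncIteration:
                break
            lines.append(line)
    finally:
        # Always close the generator so its cleanup code runs.
        await gen.aclose()
    return lines
```

A per-line timeout would let a build that emits one line every few minutes run indefinitely; the shared deadline bounds the total wall-clock time regardless of how chatty the stream is.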
@@ -1,15 +1,33 @@
 from __future__ import annotations
+import logging
 from typing import Literal, Optional
 import yaml
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, Field, model_validator

 from .overrides import apply_knobs_to_args, load_overrides
-from .shellsafe import quote_arg, quote_args
+from .shellsafe import quote_arg, quote_args, validate_local_path
+
+log = logging.getLogger(__name__)
+
+
+def _chat_template_path(vllm_args: list[str]) -> str | None:
+    """Extract the path from a `--chat-template=<path>` arg, if present."""
+    for a in vllm_args:
+        if a.startswith("--chat-template="):
+            return a.split("=", 1)[1]
+    return None
+
+
+def _is_within(path: str, base: str) -> bool:
+    """True if `path` is `base` itself or lives inside it (lexical check)."""
+    base = base.rstrip("/")
+    return path == base or path.startswith(base + "/")


 class ModelDef(BaseModel):
    display_name: str
-    repo: str
+    repo: str = ""                   # HF 'org/name'; empty for a local model
+    local_path: str | None = None    # absolute dir on the Spark; set => local model
    size_gb: float
    mode: Literal["solo", "cluster"]
    capabilities: list[str] = Field(default_factory=list)
@@ -19,6 +37,38 @@ class ModelDef(BaseModel):
    knobs: dict | None = None       # user-customized; merged at launch time
    custom: bool = False             # True if this came from /data overrides

+    @model_validator(mode="after")
+    def _validate_source(self) -> "ModelDef":
+        if bool(self.repo) == bool(self.local_path):
+            raise ValueError(
+                f"model {self.display_name!r} must set exactly one of 'repo' (HF) "
+                f"or 'local_path' (Spark directory)"
+            )
+        if self.local_path:
+            # Single place that enforces the path whitelist, so YAML/override
+            # entries get the same boundary check as the API. The quote_arg sink
+            # is still defense-in-depth.
+            validate_local_path(self.local_path)
+            # Only local_path is bind-mounted into the vLLM container, so any
+            # --chat-template path must live inside it or vLLM can't find it.
+            tmpl = _chat_template_path(self.vllm_args)
+            if tmpl is not None and not _is_within(tmpl, self.local_path):
+                raise ValueError(
+                    f"--chat-template path {tmpl!r} must be inside the model "
+                    f"directory {self.local_path!r} (only that directory is mounted "
+                    f"into the container)"
+                )
+        return self
+
+    @property
+    def is_local(self) -> bool:
+        return bool(self.local_path)
+
+    @property
+    def source(self) -> str:
+        """What `vllm serve` is pointed at: the local dir if set, else the HF repo."""
+        return self.local_path if self.local_path else self.repo
+

 class Defaults(BaseModel):
    port: int = 8888
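Review note: `_is_within` is documented as a lexical check, so it compares strings and does not collapse `..` segments. Whether that matters depends on what `validate_local_path` and vLLM accept, which is not visible in this hunk. The sketch below contrasts the check as written with a hypothetical stricter variant (`is_within_normalized`, not part of this change) that normalises both paths first.

```python
import posixpath


def is_within_lexical(path: str, base: str) -> bool:
    """Same logic as _is_within: plain string-prefix comparison."""
    base = base.rstrip("/")
    return path == base or path.startswith(base + "/")


def is_within_normalized(path: str, base: str) -> bool:
    """Hypothetical variant: collapse '.' and '..' before comparing."""
    return is_within_lexical(posixpath.normpath(path), posixpath.normpath(base))


base = "/home/you/models/ft"
inside = "/home/you/models/ft/chat_template.jinja"
sibling = "/home/you/models/ft2/chat_template.jinja"
escaping = "/home/you/models/ft/../other/chat_template.jinja"
```

With these inputs, both functions accept `inside` and reject `sibling`, but only the normalised variant rejects `escaping`. Neither follows symlinks; that would need a check on the Spark itself.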
@@ -47,7 +97,8 @@ def _merge_overrides(catalog: Catalog) -> Catalog:
            continue
        defaults_dump = {
            "display_name": entry.get("display_name", key),
-            "repo": entry["repo"],
+            "repo": entry.get("repo", ""),
+            "local_path": entry.get("local_path"),
            "size_gb": float(entry.get("size_gb", 0)),
            "mode": entry.get("mode", "solo"),
            "capabilities": entry.get("capabilities") or [],
@@ -57,7 +108,12 @@ def _merge_overrides(catalog: Catalog) -> Catalog:
            "knobs": entry.get("knobs"),
            "custom": True,
        }
+        # A single malformed override entry (bad path, missing source, etc.) must
+        # not take down the whole catalog — skip it and keep the rest loadable.
+        try:
            new_models[key] = ModelDef.model_validate(defaults_dump)
+        except Exception as e:
+            log.warning("skipping invalid custom model %r: %s", key, e)

    return Catalog(defaults=catalog.defaults, models=new_models)

@@ -78,7 +134,21 @@ def build_launch_command(key: str, model: ModelDef, defaults: Defaults) -> str:
    solo = "--solo " if model.mode == "solo" else ""
    base_args = apply_knobs_to_args(list(model.vllm_args), model.knobs)
    args = [f"--port={defaults.port}", f"--host={defaults.host}", *base_args]
-    # repo + args are user-controlled (custom models, knobs); shlex.quote each so
-    # they cannot break out of the SSH shell command. shlex.split (used by the
+    # source + args are user-controlled (custom models, knobs); shlex.quote each
+    # so they cannot break out of the SSH shell command. shlex.split (used by the
    # vLLM pre-flight validator) cleanly reverses this quoting.
-    return f"./launch-cluster.sh {solo}-d exec vllm serve {quote_arg(model.repo)} {quote_args(args)}"
+    prefix = ""
+    if model.local_path:
+        # A local model's directory isn't in the HF cache the launch script
+        # already mounts, so bind-mount it at the SAME path inside the vllm
+        # container via the script's VLLM_SPARK_EXTRA_DOCKER_ARGS hook. Same
+        # path inside and out means `vllm serve <dir>` and any
+        # `--chat-template=<dir>/...` arg both resolve. No launch-cluster.sh
+        # change needed. (The env assignment sits before the script, so the
+        # validator's `serve`-keyed shlex round-trip is unaffected.)
+        mount = quote_arg(f"-v {model.local_path}:{model.local_path}")
+        prefix = f"VLLM_SPARK_EXTRA_DOCKER_ARGS={mount} "
+    return (
+        f"{prefix}./launch-cluster.sh {solo}-d exec vllm serve "
+        f"{quote_arg(model.source)} {quote_args(args)}"
+    )
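Review note: for a local model the command gains an environment-variable prefix carrying the bind mount, and the comment claims the `serve`-keyed `shlex` round-trip is unaffected. The sketch below checks that claim in isolation. It assumes `quote_arg` is `shlex.quote` and `quote_args` joins individually quoted args with spaces; neither helper is shown in this diff, and `sketch_launch_command` is a hypothetical name.

```python
import shlex
from typing import List, Optional


def sketch_launch_command(
    source: str,
    args: List[str],
    local_path: Optional[str] = None,
    solo: bool = True,
) -> str:
    """Simplified stand-in for build_launch_command (no knobs, no defaults)."""
    prefix = ""
    if local_path:
        # One quoted word, so the whole "-v src:dst" pair lands in the env var.
        mount = shlex.quote(f"-v {local_path}:{local_path}")
        prefix = f"VLLM_SPARK_EXTRA_DOCKER_ARGS={mount} "
    solo_flag = "--solo " if solo else ""
    quoted_args = " ".join(shlex.quote(a) for a in args)
    return (
        f"{prefix}./launch-cluster.sh {solo_flag}-d exec vllm serve "
        f"{shlex.quote(source)} {quoted_args}"
    )


model_dir = "/home/you/models/my-finetune"
cmd = sketch_launch_command(
    model_dir,
    ["--port=8888", f"--chat-template={model_dir}/chat_template.jinja"],
    local_path=model_dir,
)
tokens = shlex.split(cmd)
```

After splitting, the env assignment is a single leading token and the word following `serve` is still the model source, so a validator that keys on `serve` sees the same layout as for an HF repo.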
@@ -14,7 +14,7 @@ Shape:
    custom:
      - key: my-new-model
        display_name: My New Model (from download)
-        repo: my-org/my-model
+        repo: my-org/my-model        # an HF repo; OR set local_path instead (exactly one)
        size_gb: 20
        mode: solo
        description: null
@@ -25,6 +25,12 @@ Shape:
          fastsafetensors: true
          prefix_caching: true
          kv_cache_dtype: fp8
+      - key: my-finetune                       # a local/fine-tuned model (a directory on the Spark)
+        display_name: My Fine-tune
+        local_path: /home/you/models/my-finetune
+        size_gb: 59
+        mode: solo
+        vllm_args: [--chat-template=/home/you/models/my-finetune/chat_template.jinja]
 """
 from __future__ import annotations
 import os
@@ -3,14 +3,15 @@ import asyncio
 import json
 from pathlib import Path

-from fastapi import FastAPI, HTTPException
+from fastapi import FastAPI, HTTPException, Query, Request
 from fastapi.responses import FileResponse, JSONResponse, StreamingResponse
 from fastapi.staticfiles import StaticFiles
-from pydantic import BaseModel
+from pydantic import BaseModel, ValidationError
 from typing import Literal

 from .config import Settings
 from .connectivity import get_mac, record_report, record_state, summary as connectivity_summary
+from .coordination import LockHeld, ScheduleRegistry, SwapLockManager, WebhookNotifier, valid_schedule_id
 from .custom_services import add_custom_service, delete_custom_service
 from .audio_proxy import build_router as build_audio_router
 from .deep_health import DeepHealth
@@ -20,8 +21,9 @@ from .llm_proxy import build_router as build_llm_router
 from .embeddings_proxy import build_router as build_embeddings_router
 from .redaction_gateway import build_router as build_redaction_router, MapStore
 from .hardware import HardwareProbe
-from .health import check_kokoro, check_parakeet, check_vllm, check_embeddings, check_qdrant
-from .models import load_catalog
+from .health import check_kokoro, check_parakeet, check_vllm, check_embeddings, check_qdrant, probe_vllm_endpoint
+from .matrix_bridge import MatrixBridgeManager
+from .models import ModelDef, load_catalog
 from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager
 from .overrides import add_custom, delete_custom, extract_knobs_from_args, load_overrides, set_knobs
 from .services import docker_state, run_action, services_from_settings
@@ -36,13 +38,19 @@ from .wol import send_local_broadcast, send_via_peer

 settings = Settings.from_env()
 catalog = load_catalog(settings.models_yaml)
-swap_manager = SwapManager(settings, catalog)
+# Coordination layer (GPU arbiter): swap-lifecycle webhook, the swap reservation
+# lock, and the read-only schedule registry. See coordination.py.
+swap_webhook = WebhookNotifier(settings.swap_webhook_url, settings.swap_webhook_secret)
+swap_lock = SwapLockManager()
+schedule_registry = ScheduleRegistry()
+swap_manager = SwapManager(settings, catalog, notifier=swap_webhook)
 download_manager = DownloadManager(settings)
 update_manager = UpdateManager(settings)
 hardware_probe = HardwareProbe(settings)
 nim_manager = NimManager(settings)
 deep_health = DeepHealth(settings)
 speech_models = SpeechModelsManager(settings)
+matrix_bridge = MatrixBridgeManager(settings)

 app = FastAPI(title="spark-control", version="0.1.0")

@@ -65,6 +73,10 @@ _CSRF_EXEMPT_PREFIXES = (
    "/api/audio/",        # diarize-chunk / label-merge / transcribe-with-speakers
    "/api/health-event",  # health reports posted by consumer apps
 )
+# Note: the coordination endpoints (/api/swap/lock, /api/schedule) are
+# intentionally NOT exempt. External schedulers are non-browser clients (no
+# Origin header) so they pass the guard already — same as /api/swap — while a
+# malicious page can't drive them from the operator's browser. Don't add them.


@app.middleware("http")
@@ -181,7 +193,8 @@ async def put_model_knobs(key: str, body: KnobsBody) -> dict:
 class CustomModelBody(BaseModel):
    key: str
    display_name: str
-    repo: str
+    repo: str = ""
+    local_path: str | None = None
    size_gb: float = 0
    mode: Literal["solo", "cluster"] = "solo"
    description: str | None = None
@@ -194,8 +207,17 @@ class CustomModelBody(BaseModel):
 async def post_model(body: CustomModelBody) -> dict:
    if not body.key or not body.key.replace("-", "").replace("_", "").isalnum():
        raise HTTPException(400, "key must be alphanumeric/-/_ only")
+    # Validate the full entry BEFORE persisting (exactly-one source, local-path
+    # whitelist, chat-template location). Doing it via ModelDef means the API and
+    # the YAML-override path share one set of rules, and a bad entry can't be
+    # written to /data and then break catalog load.
    try:
-        validate_repo(body.repo)
+        ModelDef.model_validate(body.model_dump())
+        if body.repo:
+            validate_repo(body.repo)  # HF charset (the model only validates local paths)
+    except ValidationError as e:
+        msg = e.errors()[0]["msg"] if e.errors() else str(e)
+        raise HTTPException(400, msg.removeprefix("Value error, "))
    except ValueError as e:
        raise HTTPException(400, str(e))
    if body.key in catalog.models and not catalog.models[body.key].custom:
@@ -227,7 +249,13 @@ async def get_models_disk_status() -> dict:
        return {"configured": False, "models": {}}
    keys = list(catalog.models.keys())
    statuses = await asyncio.gather(*(
-        probe_disk(catalog.models[k].repo, catalog.models[k].mode, settings) for k in keys
+        probe_disk(
+            catalog.models[k].repo,
+            catalog.models[k].mode,
+            settings,
+            local_path=catalog.models[k].local_path,
+        )
+        for k in keys
    ), return_exceptions=True)
    out: dict[str, dict] = {}
    for k, s in zip(keys, statuses):
@@ -258,6 +286,14 @@ async def del_model_disk(key: str) -> dict:
        raise HTTPException(404, f"unknown model: {key}")
    m = catalog.models[key]

+    # Never rm a local fine-tune directory from the dashboard — it's irreplaceable
+    # training output the user placed by hand, not a re-downloadable HF cache.
+    if m.local_path:
+        raise HTTPException(
+            400,
+            "this is a local model; its directory must be managed on the Spark, not deleted from here",
+        )
+
    # Refuse if currently loaded
    try:
        vllm = await check_vllm(settings)
@@ -401,6 +437,53 @@ async def wake_spark(name: str) -> dict:
    return {"ok": True, "spark": name, "mac": mac, "delivered_via": delivered_via}


+@app.post("/api/spark/{name}/ssh-key")
+async def spark_ssh_key(name: str) -> dict:
+    """Ensure the named Spark has an ed25519 keypair and return its PUBLIC key.
+
+    This is the Spark's *outbound* identity — the key it uses to log in to other
+    machines (e.g. the operator's Mac). It is the opposite direction from, and
+    distinct from, the package's own key shown by the StartOS "Show Public Key"
+    action (which grants this dashboard SSH access to the Sparks).
+
+    Non-destructive: generates the key only if absent, never overwrites an
+    existing one (which may already be an identity the Spark uses elsewhere).
+    Public keys are not secret, so returning it is safe. No request-supplied
+    value reaches the command — `name` is constrained to a fixed set and
+    host/user come from operator config — so there is nothing to shell-quote.
+    """
+    if name not in ("spark1", "spark2"):
+        raise HTTPException(404, f"unknown spark: {name}")
+    host = settings.spark1_host if name == "spark1" else settings.spark2_host
+    user = settings.spark1_user if name == "spark1" else settings.spark2_user
+    if not host or not user:
+        raise HTTPException(400, f"{name} is not configured")
+    # Empty passphrase so the key is usable unattended; comment carries the
+    # remote hostname so it's identifiable in an authorized_keys file later.
+    cmd = (
+        "set -e; "
+        "mkdir -p ~/.ssh && chmod 700 ~/.ssh; "
+        "if [ ! -f ~/.ssh/id_ed25519 ]; then "
+        'ssh-keygen -t ed25519 -N "" -C "spark-control@$(hostname)" -f ~/.ssh/id_ed25519 >/dev/null 2>&1; '
+        "echo CREATED=1; else echo CREATED=0; fi; "
+        "[ -f ~/.ssh/id_ed25519.pub ] || ssh-keygen -y -f ~/.ssh/id_ed25519 > ~/.ssh/id_ed25519.pub; "
+        "echo PUBKEY=$(cat ~/.ssh/id_ed25519.pub)"
+    )
+    rc, out, err = await ssh_run(host, user, cmd, settings, timeout=15)
+    if rc != 0:
+        raise HTTPException(502, f"couldn't read/create the SSH key on {name}: {err.strip() or out.strip() or f'rc={rc}'}")
+    created = False
+    pubkey = ""
+    for line in out.splitlines():
+        if line.startswith("CREATED="):
+            created = line.strip() == "CREATED=1"
+        elif line.startswith("PUBKEY="):
+            pubkey = line[len("PUBKEY="):].strip()
+    if not pubkey:
+        raise HTTPException(502, f"no public key returned from {name}")
+    return {"ok": True, "spark": name, "host": host, "user": user, "pubkey": pubkey, "created": created}
+
+
@app.get("/api/services")
 async def get_services() -> dict:
    """Lifecycle state of always-on support services (Parakeet, Kokoro, …).
@@ -427,6 +510,15 @@ async def get_services() -> dict:
            http = await check_embeddings(settings)
        elif name == "qdrant":
            http = await check_qdrant(settings)
+        elif svc.kind == "vllm":
+            # An extra vLLM monitored on another Spark (registered as a custom
+            # service). Probe its own host/port, not the primary Spark 1 one.
+            http = await probe_vllm_endpoint(svc.host, svc.port)
+        elif svc.kind == "bot":
+            # No HTTP health endpoint (host networking, no port) — judged purely
+            # by docker state. http_ready stays None so the badge isn't pinned
+            # to a "Starting…" verdict that can never clear.
+            http = {"ok": None, "base_url": None}
        else:
            # Custom services expose a /health endpoint by convention.
            http = await check_kokoro(settings) if svc.kind == "tts" else {"ok": None, "base_url": svc.host and f"http://{svc.host}:{svc.port}"}
@@ -437,11 +529,13 @@ async def get_services() -> dict:
            "container": svc.container,
            "kind": svc.kind,
            "base_url": http.get("base_url"),
-            "http_ready": bool(http.get("ok")),
+            # None (not False) for services with no HTTP surface (the bot), so
+            # the UI judges them by docker state alone instead of "Starting…".
+            "http_ready": None if svc.kind == "bot" else bool(http.get("ok")),
            # Prefer the check fn's own top-level model key (embeddings reports
            # it there); fall back to a model field inside detail for services
            # whose /health embeds it (parakeet).
-            "model": http.get("model") or ((http.get("detail") or {}).get("model") if isinstance(http.get("detail"), dict) else None),
+            "model": http.get("model") or http.get("current_model") or ((http.get("detail") or {}).get("model") if isinstance(http.get("detail"), dict) else None),
            "docker_state": docker.get("state"),
            "restart_count": docker.get("restart_count"),
            "started_at": docker.get("started_at"),
@@ -453,7 +547,10 @@ async def get_services() -> dict:
    results = await asyncio.gather(*[one(n) for n in services.keys()])
    for name, info in results:
        out[name] = info
-        # Feed http reachability into the connectivity log (transition-only)
+        # Feed http reachability into the connectivity log (transition-only).
+        # Skip services with no HTTP surface (http_ready is None) — they'd
+        # otherwise register as perpetually "down".
+        if info.get("http_ready") is not None:
            record_state(name, bool(info.get("http_ready")))
    return out

@@ -559,7 +656,7 @@ async def stream_nim_install(job_id: str):
 @app.delete("/api/services/{name}")
 async def del_service(name: str) -> dict:
    # Only allow deleting custom services (not the bundled built-in keys)
-    if name in ("parakeet", "kokoro", "embeddings", "qdrant"):
+    if name in ("parakeet", "kokoro", "embeddings", "qdrant", "matrix-bridge"):
        raise HTTPException(400, "built-in service; cannot delete (use Configure Sparks to point at a different host)")
    delete_custom_service(name)
    return {"ok": True, "name": name}
@@ -578,6 +675,81 @@ async def service_action(name: str, action: str) -> dict:
    return {"name": name, "action": action, **result}


+# ---- matrix-bridge bot: update (git pull + rebuild) + logs ----
+# Status badge + start/stop/restart ride the generic /api/services machinery
+# above (the bot is a registered ServiceDef). Only the long-running Update and
+# the logs view need bespoke endpoints.
+
+def _serialize_mb_update(job) -> dict:
+    return {
+        "id": job.id,
+        "state": job.state,
+        "phase": job.phase,
+        "started_at": job.started_at,
+        "finished_at": job.finished_at,
+        "returncode": job.returncode,
+        "lines": job.lines,
+    }
+
+
+@app.post("/api/matrix-bridge/update")
+async def post_matrix_bridge_update() -> dict:
+    """Pull latest code, rebuild, and recreate the bot container. Long-running
+    (docker build) — returns a job id to stream."""
+    try:
+        job = await matrix_bridge.trigger_update()
+    except RuntimeError as e:
+        raise HTTPException(409 if "in progress" in str(e) else 503, str(e))
+    return {"job_id": job.id, "state": job.state}
+
+
+@app.get("/api/matrix-bridge/update/{job_id}")
+async def get_matrix_bridge_update(job_id: str) -> dict:
+    job = matrix_bridge.get(job_id)
+    if job is None:
+        raise HTTPException(404, "no such job")
+    return _serialize_mb_update(job)
+
+
+@app.get("/api/matrix-bridge/update/{job_id}/stream")
+async def stream_matrix_bridge_update(job_id: str, request: Request):
+    job = matrix_bridge.get(job_id)
+    if job is None:
+        raise HTTPException(404, "no such job")
+
+    async def gen():
+        sent = 0
+        last_phase = None
+        while True:
+            # An update can run for minutes; bail promptly if the client is gone
+            # rather than spinning the poll loop until the job's 25-min ceiling.
+            if await request.is_disconnected():
+                return
+            n = len(job.lines)
+            if n > sent:
+                for line in job.lines[sent:n]:
+                    yield f"data: {json.dumps({'line': line})}\n\n"
+                sent = n
+            if job.phase != last_phase:
+                yield f"event: phase\ndata: {json.dumps({'state': job.state, 'phase': job.phase})}\n\n"
+                last_phase = job.phase
+            if job.returncode is not None and sent >= len(job.lines):
+                yield f"event: done\ndata: {json.dumps({'state': job.state, 'returncode': job.returncode})}\n\n"
+                return
+            await asyncio.sleep(0.5)
+
+    return StreamingResponse(gen(), media_type="text/event-stream")
+
+
+@app.get("/api/matrix-bridge/logs")
+async def get_matrix_bridge_logs(tail: int = Query(100, ge=1, le=1000)) -> dict:
+    """Last N lines of `docker logs` for the bot container (stderr merged)."""
+    result = await matrix_bridge.fetch_logs(tail=tail)
+    if not result.get("ok"):
+        raise HTTPException(502, result.get("output") or result.get("error") or "could not read logs")
+    return result
+
+
 # ---- Speech model patch management ----

 @app.get("/api/speech-models")
@@ -641,17 +813,20 @@ async def get_endpoints() -> dict:
            "base_url": vllm.get("base_url"),
            "model": vllm.get("current_model"),
            "openai_compat": True,
+            "disabled": bool(vllm.get("disabled")),
        },
        "parakeet": {
            "ready": bool(parakeet.get("ok")),
            "base_url": parakeet.get("base_url"),
            "kind": "stt",
            "model": (parakeet.get("detail") or {}).get("model") if isinstance(parakeet.get("detail"), dict) else None,
+            "disabled": bool(parakeet.get("disabled")),
        },
        "kokoro": {
            "ready": bool(kokoro.get("ok")),
            "base_url": kokoro.get("base_url"),
            "kind": "tts",
+            "disabled": bool(kokoro.get("disabled")),
        },
        "embeddings": {
            "ready": bool(embeddings.get("ok")),
@@ -660,12 +835,14 @@ async def get_endpoints() -> dict:
            "model": embeddings.get("model"),
            # The proxied OpenAI-compatible endpoints live on Spark Control itself.
            "openai_endpoints": ["/v1/embeddings", "/v1/rerank", "/api/search"],
+            "disabled": bool(embeddings.get("disabled")),
        },
        "qdrant": {
            "ready": bool(qdrant.get("ok")),
            "base_url": qdrant.get("base_url"),
            "kind": "vectordb",
            "collection": settings.qdrant_collection or None,
+            "disabled": bool(qdrant.get("disabled")),
        },
    }

@@ -679,12 +856,15 @@ async def get_status() -> dict:
        check_embeddings(settings),
        check_qdrant(settings),
    )
-    # Feed health into the connectivity log (deduped — only logs on transition)
-    record_state("vllm", bool(vllm.get("ok")))
-    record_state("parakeet", bool(parakeet.get("ok")))
-    record_state("kokoro", bool(kokoro.get("ok")))
-    record_state("embeddings", bool(embeddings.get("ok")))
-    record_state("qdrant", bool(qdrant.get("ok")))
+    # Feed health into the connectivity log (deduped — only logs on transition).
+    # Skip services switched off via DISABLED_SERVICES — they'd otherwise log as
+    # perpetually down.
+    for _name, _r in (
+        ("vllm", vllm), ("parakeet", parakeet), ("kokoro", kokoro),
+        ("embeddings", embeddings), ("qdrant", qdrant),
+    ):
+        if not _r.get("disabled"):
+            record_state(_name, bool(_r.get("ok")))
    current_key = _identify_current_model(vllm.get("current_model"))
    return {
        "configured": settings.configured,
@@ -722,9 +902,21 @@ async def validate_swap(key: str) -> dict:


 @app.post("/api/swap")
-async def post_swap(req: SwapRequest) -> dict:
+async def post_swap(req: SwapRequest, request: Request) -> dict:
    if not settings.configured and not req.dry_run:
        raise HTTPException(503, "spark1 not configured")
+    # Enforce the swap reservation lock (the GPU arbiter). A held lock blocks any
+    # real swap that doesn't present the holder's token in X-Swap-Lock-Token — so
+    # an external scheduler that holds the lock can swap, but the dashboard (no
+    # token) is refused while someone else holds it. Dry runs don't touch the
+    # cluster, so they're exempt.
+    if not req.dry_run:
+        blocked = swap_lock.is_blocked_by(request.headers.get("x-swap-lock-token"))
+        if blocked is not None:
+            raise HTTPException(status_code=423, detail={
+                "error": "the GPU swap path is reserved by another holder",
+                "lock": blocked,
+            })
    try:
        job = await swap_manager.trigger(req.model_key, dry_run=req.dry_run)
    except KeyError:
@@ -779,6 +971,89 @@ async def stream_swap(job_id: str):
    return StreamingResponse(gen(), media_type="text/event-stream")


+# ---- Coordination layer: swap lock + schedule registry ----
+# Endpoints are control-surface, not browser-exempt: an external scheduler is a
+# non-browser client (no Origin header) so it passes the CSRF guard already, the
+# same way it calls /api/swap today; the dashboard is same-origin.
+
+class LockAcquireRequest(BaseModel):
+    holder: str
+    ttl_seconds: int | None = None
+    note: str = ""
+    token: str | None = None   # present only to extend an existing hold
+
+
+@app.post("/api/swap/lock")
+async def acquire_swap_lock(req: LockAcquireRequest) -> dict:
+    """Reserve the GPU swap path. Returns a secret token used to swap (header
+    X-Swap-Lock-Token) and to release. 409 if held by another holder."""
+    try:
+        lock = swap_lock.acquire(req.holder, req.ttl_seconds, req.note, token=req.token)
+    except ValueError as e:
+        raise HTTPException(422, str(e))
+    except LockHeld as e:
+        raise HTTPException(status_code=409, detail={
+            "error": "swap lock is held by another holder",
+            "lock": e.state,
+        })
+    return {**swap_lock.status(), "token": lock.token}
+
+
+@app.get("/api/swap/lock")
+async def get_swap_lock() -> dict:
+    """Public, token-free view of the reservation: held? who? until when?"""
+    return swap_lock.status()
+
+
+@app.delete("/api/swap/lock")
+async def release_swap_lock(request: Request, force: bool = Query(False)) -> dict:
+    """Release the reservation. Needs the matching X-Swap-Lock-Token unless
+    ?force=true (the human override from the dashboard)."""
+    token = request.headers.get("x-swap-lock-token") or request.query_params.get("token")
+    try:
+        released = swap_lock.release(token, force=force)
+    except PermissionError as e:
+        raise HTTPException(403, str(e))
+    return {"released": released, **swap_lock.status()}
+
+
+class ScheduleRequest(BaseModel):
+    name: str
+    id: str | None = None
+    owner: str = ""
+    cron: str = ""
+    next_run: str = ""
+    description: str = ""
+
+
+@app.get("/api/schedule")
+async def list_schedules() -> dict:
+    return {"schedules": schedule_registry.list()}
+
+
+@app.post("/api/schedule")
+async def register_schedule(req: ScheduleRequest) -> dict:
+    """Register (or update, by id) a schedule an external scheduler owns. Spark
+    Control only stores it for the dashboard — it never executes it."""
+    try:
+        entry = schedule_registry.register(
+            name=req.name, id=req.id, owner=req.owner,
+            cron=req.cron, next_run=req.next_run, description=req.description,
+        )
+    except ValueError as e:
+        raise HTTPException(422, str(e))
+    return entry.public()
+
+
+@app.delete("/api/schedule/{schedule_id}")
+async def delete_schedule(schedule_id: str) -> dict:
+    # Whitelist the path segment at the boundary (repo convention), even though
+    # it's only ever a dict key — keeps it from being reflected or logged raw.
+    if not valid_schedule_id(schedule_id):
+        raise HTTPException(422, "invalid schedule id")
+    return {"deleted": schedule_registry.delete(schedule_id)}
+
+
 class DownloadRequest(BaseModel):
    repo: str
    mode: Literal["spark1", "spark2", "cluster"] = "spark1"
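For consumers of the coordination endpoints in the hunk above, the flow appears to be: acquire the lock, swap while presenting the token, then release. The following is a minimal client-side sketch using only the standard library, not the project's own client. The base URL and the helper names are hypothetical; the body fields mirror `LockAcquireRequest` and the `model_key`/`dry_run` fields that `post_swap` reads.

```python
import json
import urllib.request

# Hypothetical address of the Spark Control instance; adjust for your deployment.
BASE_URL = "http://spark-control.example"


def _json_request(method, path, body=None, token=None):
    """Build (but do not send) a request against the coordination endpoints."""
    headers = {"Content-Type": "application/json"}
    if token:
        # The holder's secret token travels in this header for swap and release.
        headers["X-Swap-Lock-Token"] = token
    data = json.dumps(body).encode() if body is not None else None
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def acquire_request(holder, ttl_seconds=None, note=""):
    # Field names mirror LockAcquireRequest in the diff above.
    return _json_request("POST", "/api/swap/lock",
                         {"holder": holder, "ttl_seconds": ttl_seconds, "note": note})


def swap_request(model_key, token):
    # Field names mirror how post_swap reads SwapRequest (model_key, dry_run).
    return _json_request("POST", "/api/swap",
                         {"model_key": model_key, "dry_run": False}, token=token)


def release_request(token):
    return _json_request("DELETE", "/api/swap/lock", token=token)


def send(req):
    """Send a prepared request and decode the JSON reply (HTTPError on 4xx/5xx)."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

A scheduler would send `acquire_request(...)`, keep the `token` from the reply, pass it to `swap_request`, and finish with `release_request`. Per the handlers above, acquiring a lock that another holder owns returns 409, and a real swap without the holder's token returns 423; `urlopen` surfaces both as `urllib.error.HTTPError`.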
@@ -5,6 +5,7 @@ machinery. We just run `docker start|stop|restart <container>` via SSH on the
 appropriate host.
 """
 from __future__ import annotations
+import logging
 import time
 from dataclasses import dataclass
 from typing import Literal, Optional
@@ -13,6 +14,8 @@ from .config import Settings
 from .shellsafe import quote_arg
 from .ssh import ssh_run

+log = logging.getLogger(__name__)
+

 # Cache the "unreachable" verdict per (host, user) for a short period so that a
 # repeated docker_state call doesn't re-pay the 6 s SSH connect timeout each time.
@@ -89,10 +92,27 @@ def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
            container=s.qdrant_container,
            port=s.qdrant_port,
        ),
+        # matrix-bridge Matrix bot. No HTTP port to probe (host networking, no
+        # health endpoint) — judged purely by docker state. Driven as its own
+        # SSH user (modelo, the repo owner) so git/docker run unprivileged.
+        "matrix-bridge": ServiceDef(
+            name="matrix-bridge",
+            kind="bot",
+            host=s.matrix_bridge_host,
+            user=s.matrix_bridge_user,
+            container=s.matrix_bridge_container,
+            port=0,
+        ),
    }
    for entry in load_custom_services():
        key = entry.get("key")
-        if not key or key in out:
+        if not key:
+            continue
+        if key in out:
+            # A custom entry can't shadow a built-in (parakeet/kokoro/…); warn so
+            # an adopter who picked a colliding key for, say, a second vLLM sees
+            # why no tile appeared instead of a silent no-op.
+            log.warning("custom service %r collides with a built-in name; ignoring", key)
            continue
        out[key] = ServiceDef(
            name=key,
@@ -102,7 +122,9 @@ def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
            container=entry.get("container", key),
            port=int(entry.get("port", 0)),
        )
-    return out
+    # Drop services the deployment has switched off (DISABLED_SERVICES) so they
+    # show no tile and are never probed/auto-restarted.
+    return {k: v for k, v in out.items() if k not in s.disabled_services}


 async def docker_state(settings: Settings, svc: ServiceDef) -> dict:
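The merge order in `services_from_settings` (built-ins first, then custom entries that may not shadow them, then the `DISABLED_SERVICES` filter last) can be illustrated with a simplified stand-in. This sketch assumes plain dicts in place of `ServiceDef` and `Settings`; `merge_services` is a hypothetical name, not a function in the repo.

```python
import logging

log = logging.getLogger(__name__)


def merge_services(builtins, custom_entries, disabled):
    """Simplified stand-in for services_from_settings using plain dicts."""
    out = dict(builtins)
    for entry in custom_entries:
        key = entry.get("key")
        if not key:
            continue  # malformed entry, nothing to register
        if key in out:
            # A custom entry cannot shadow an existing key; warn instead of
            # skipping silently so the operator can see why no tile appeared.
            log.warning("custom service %r collides with a built-in name; ignoring", key)
            continue
        out[key] = entry
    # Disabled services are dropped last, so they get no tile and are never probed.
    return {k: v for k, v in out.items() if k not in disabled}


if __name__ == "__main__":
    builtins = {"parakeet": {"kind": "stt"}, "qdrant": {"kind": "vectordb"}}
    custom = [{"key": "vllm-2", "kind": "vllm"}, {"key": "parakeet", "kind": "stt"}]
    print(sorted(merge_services(builtins, custom, disabled={"qdrant"})))
```

One consequence of filtering last, as the hunk is written: a custom entry cannot reuse the key of a built-in even when that built-in is disabled, because the collision check runs before the disabled filter.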
@@ -28,6 +28,12 @@ _IMAGE_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._:/@-]*$")
 # Docker container / volume name (Docker's own rule).
 _CONTAINER_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*$")

+# Absolute filesystem path to a local model directory on a Spark. Conservative
+# charset (letters, digits, and safe path punctuation) with a required leading
+# '/', so it carries no shell metacharacters and no whitespace. Traversal ('.'
+# and '..' segments) is rejected separately in validate_local_path.
+_LOCAL_PATH_RE = re.compile(r"^/[A-Za-z0-9._+/-]+$")
+

 def validate_repo(repo: str) -> str:
    """Return `repo` if it is a well-formed 'org/name'; else raise ValueError."""
@@ -50,6 +56,25 @@ def validate_container(name: str) -> str:
    return name


+def validate_local_path(path: str) -> str:
+    """Return `path` if it is a safe absolute model directory path; else ValueError.
+
+    For locally fine-tuned models served by directory (not an HF repo). Requires
+    an absolute path, a metacharacter-free charset, and no '.'/'..' segments so a
+    caller cannot traverse out of an intended models directory. The `quote_arg`
+    sink still quotes it as defense in depth; this is the boundary check.
+    """
+    p = path or ""
+    if len(p) > 512 or not _LOCAL_PATH_RE.fullmatch(p):
+        raise ValueError(
+            f"invalid local model path (expected an absolute path, no spaces or "
+            f"shell metacharacters): {path!r}"
+        )
+    if any(seg in (".", "..") for seg in p.split("/")):
+        raise ValueError(f"local model path must not contain '.' or '..' segments: {path!r}")
+    return p
+
+
 def quote_arg(value: object) -> str:
    """shlex.quote a single token for safe embedding in a shell command string."""
    return shlex.quote(str(value))
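To make the boundary check concrete, here is a standalone copy of the regex and validator from the hunk above, with the error messages shortened, plus a small helper for trying inputs. The sample paths and the `is_accepted` helper are illustrative only and do not appear in the repo.

```python
import re

# Same pattern as the diff: absolute path, conservative charset, no whitespace.
_LOCAL_PATH_RE = re.compile(r"^/[A-Za-z0-9._+/-]+$")


def validate_local_path(path: str) -> str:
    """Return `path` if it is a safe absolute model directory path; else ValueError."""
    p = path or ""
    if len(p) > 512 or not _LOCAL_PATH_RE.fullmatch(p):
        raise ValueError(f"invalid local model path: {path!r}")
    # The charset allows '.' and '/', so traversal segments need their own check.
    if any(seg in (".", "..") for seg in p.split("/")):
        raise ValueError(f"local model path must not contain '.' or '..' segments: {path!r}")
    return p


def is_accepted(path: str) -> bool:
    """Hypothetical helper: True if validate_local_path accepts the input."""
    try:
        validate_local_path(path)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    for sample in (
        "/models/llama-ft/v1.2",   # absolute, safe charset
        "models/relative",         # no leading '/'
        "/models/../etc",          # traversal segment
        "/models/my model",        # whitespace
        "/models/$(id)",           # shell metacharacters
    ):
        print(f"{sample!r}: {'accepted' if is_accepted(sample) else 'rejected'}")
```

Only the first sample passes. The traversal case matches the regex and is caught by the segment check, which is why the two checks are kept separate.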
@@ -13,6 +13,7 @@ const state = {
  swap_progress: 0,        // 0–1
  services: {},
  service_action_in_flight: null,  // e.g. "parakeet:restart"
+  mb_update_in_flight: false,      // matrix-bridge update job running
  hardware: {},
  config: {},
  configured: true,
@@ -20,11 +21,19 @@ const state = {
  deep_health: {},
  disk_status: {},         // keyed by model key: { on_disk, total_bytes, per_host }
  disk_status_loaded: false,
+  lock: { held: false },   // GPU swap reservation (coordination layer)
+  schedules: [],           // schedules external automation has registered
 };

 const el = (sel) => document.querySelector(sel);
 const $$ = (sel) => document.querySelectorAll(sel);

+// ISO timestamp -> local clock string (e.g. "2:45:10 PM"); '' if unparseable.
+function fmtClock(iso) {
+  const t = Date.parse(iso);
+  return isNaN(t) ? '' : new Date(t).toLocaleTimeString();
+}
+
 function escapeHtml(s) {
  if (s == null) return '';
  return String(s)
@@ -50,6 +59,12 @@ function renderCards() {
  const root = el('#cards');
  root.innerHTML = '';
  const isSwapping = !!state.swap_job_id;
+  // GPU reserved by external automation — manual swaps are refused server-side
+  // (423); reflect that in the buttons so the click never bounces.
+  const locked = !!(state.lock && state.lock.held);
+  const lockTip = locked
+    ? `Reserved by ${state.lock.holder || 'automation'}${state.lock.expires_at ? ' until ' + fmtClock(state.lock.expires_at) : ''}`
+    : '';
  for (const key of Object.keys(state.models)) {
    const m = state.models[key];
    const isActive = key === state.current_model_key;
@@ -59,6 +74,7 @@ function renderCards() {
      ? `<div class="desc">${escapeHtml(m.description)}</div>`
      : '';
    const customPill = m.custom ? `<span class="tag custom-pill">custom</span>` : '';
+    const localPill = m.local_path ? `<span class="tag local-pill" title="Served from a directory on the Spark, not Hugging Face">local</span>` : '';
    // Disk-presence pill + trash button. Until /api/models/disk-status comes back,
    // we don't know — render a neutral placeholder.
    const disk = state.disk_status[key];
@@ -72,8 +88,10 @@ function renderCards() {
      }
    }
    // Trash button — hidden if not on disk; disabled (with tooltip) if currently loaded.
+    // Never offered for local models: their directory is hand-placed training output,
+    // not a re-downloadable HF cache (the server refuses the delete too).
    let trashBtn = '';
-    if (state.disk_status_loaded && disk && disk.on_disk) {
+    if (state.disk_status_loaded && disk && disk.on_disk && !m.local_path) {
      const disabled = isActive || isSwapping;
      const tip = isActive
        ? 'Currently loaded — switch to another model first'
@@ -90,7 +108,12 @@ function renderCards() {
    if (isActive) {
      primaryBtn = `<button class="btn" disabled>Current</button>`;
    } else if (isOnDisk) {
-      primaryBtn = `<button class="btn primary" data-swap-key="${key}" ${isSwapping ? 'disabled' : ''}>Switch to this</button>`;
+      const swapBlocked = isSwapping || locked;
+      const tip = locked ? ` title="${escapeHtml(lockTip)}"` : '';
+      primaryBtn = `<button class="btn primary" data-swap-key="${key}"${tip} ${swapBlocked ? 'disabled' : ''}>Switch to this</button>`;
+    } else if (m.local_path) {
+      // A local model can't be "downloaded" — its directory has to exist on the Spark.
+      primaryBtn = `<button class="btn" disabled title="Directory not found on the Spark — create it there, then refresh">Not found on Spark</button>`;
    } else {
      const tip = dlInFlight ? 'A download is already in progress' : 'Download weights to the Spark(s)';
      primaryBtn = `<button class="btn info" data-download-key="${key}" title="${escapeHtml(tip)}" ${dlInFlight ? 'disabled' : ''}>Download</button>`;
@@ -101,12 +124,15 @@ function renderCards() {
        <span class="tag mode-${m.mode}">${m.mode}</span>
        <span class="tag">${m.size_gb} GB</span>
        ${customPill}
+        ${localPill}
        ${diskPill}
        ${(m.capabilities || []).map(c => `<span class="tag cap">${escapeHtml(c)}</span>`).join('')}
      </div>
      ${desc}
      <div class="muted small repo">
-        <a href="https://huggingface.co/${encodeURIComponent(m.repo)}" target="_blank" rel="noopener" title="View on Hugging Face">${escapeHtml(m.repo)} <span class="hf-icon">↗</span></a>
+        ${m.local_path
+          ? `<span class="local-path" title="Local model directory on the Spark">${escapeHtml(m.local_path)}</span>`
+          : `<a href="https://huggingface.co/${encodeURIComponent(m.repo)}" target="_blank" rel="noopener" title="View on Hugging Face">${escapeHtml(m.repo)} <span class="hf-icon">↗</span></a>`}
      </div>
      <div class="spacer"></div>
      <div class="card-actions">
@@ -305,6 +331,32 @@ async function wakeSpark(name) {
  }
 }

+// Generate-if-missing + copy this Spark's OUTBOUND ssh public key (the key the
+// Spark uses to log in to other machines, e.g. the Mac). Distinct from the
+// package's own key in the StartOS "Show Public Key" action.
+async function copySparkSshKey(name, btn) {
+  if (btn) btn.disabled = true;
+  try {
+    const r = await fetchJSON(`/api/spark/${name}/ssh-key`, { method: 'POST' });
+    // Best-effort clipboard copy; on plain-HTTP this no-ops, but the dialog
+    // below always shows the key for manual selection.
+    await copyText(r.pubkey, btn);
+    const label = r.host ? `${name} (${r.host})` : name;
+    el('#sshkey-title').textContent = `${name} — SSH public key`;
+    el('#sshkey-intro').textContent = r.created
+      ? `Generated a new SSH key on ${label} and copied it to your clipboard. This is the key ${name} uses to log in to OTHER machines.`
+      : `${label} already had an SSH key; copied its public key to your clipboard. This is the key ${name} uses to log in to OTHER machines.`;
+    el('#sshkey-value').textContent = r.pubkey;
+    el('#sshkey-install').textContent =
+      `mkdir -p ~/.ssh && echo '${r.pubkey}' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys`;
+    el('#sshkey-dialog').showModal();
+  } catch (e) {
+    alert(`Couldn't get the SSH key for ${name}: ${e.message}`);
+  } finally {
+    if (btn) btn.disabled = false;
+  }
+}
+
 function renderHardware() {
  const panel = el('#hardware-panel');
  const grid = el('#hardware-grid');
@@ -358,11 +410,21 @@ function renderHardware() {
    if (s.gpu_temp_c != null) gpuExtras.push(`${s.gpu_temp_c}°C`);
    if (s.gpu_power_w != null) gpuExtras.push(`${s.gpu_power_w.toFixed(0)}W`);
    const gpuExtrasStr = gpuExtras.length ? ` · ${gpuExtras.join(' · ')}` : '';
+    // Read-only WireGuard badge: shown only when the Spark has a wg interface up.
+    // "VPN <ip>" means it's a peer on that tunnel (reachable off-LAN when the
+    // tunnel is up); it reflects interface presence, not live peer reachability.
+    const wgIp = s.wg_addr ? String(s.wg_addr).split('/')[0] : '';
+    const wgBadge = s.wg_iface
+      ? ` · <span class="wg-badge" title="On WireGuard tunnel '${escapeHtml(s.wg_iface)}'${wgIp ? ' as ' + escapeHtml(wgIp) : ''} — reachable off-LAN while the tunnel is up">VPN${wgIp ? ' ' + escapeHtml(wgIp) : ''}</span>`
+      : '';
    card.className = 'hw-card';
    card.innerHTML = `
      <div class="head">
        <span class="name">${escapeHtml(s.hostname || key)}</span>
-        <span class="meta">${escapeHtml(key)} · ${escapeHtml(s.gpu_name || '')} · ${escapeHtml(s.uptime || '')}</span>
+        <span class="meta">${escapeHtml(key)} · ${escapeHtml(s.gpu_name || '')} · ${escapeHtml(s.uptime || '')}${wgBadge}</span>
+        <button class="icon-btn ssh-key-btn" data-ssh-key="${escapeHtml(key)}" title="Copy this Spark's SSH public key (creates one if it doesn't have one) — e.g. to let it log in to your Mac" aria-label="Copy SSH public key">
+          <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2"/><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"/></svg>
+        </button>
      </div>
      <div class="hw-metric">
        <span class="label">CPU</span>
@@ -402,8 +464,13 @@ function classifyService(s) {
  if (s.docker_state === 'missing') return 'missing';
  if (s.docker_state === 'restarting') return 'unhealthy';
  if (s.docker_state === 'exited') return 'unhealthy';
-  if (s.docker_state === 'running' && !s.http_ready) return 'starting';
-  if (s.docker_state === 'running' && s.http_ready) return 'running';
+  if (s.docker_state === 'running') {
+    // http_ready === false means an HTTP probe is expected but failing → still
+    // warming up. null means the service has no HTTP surface (e.g. the bot), so
+    // a running container is simply healthy.
+    if (s.http_ready === false) return 'starting';
+    return 'running';
+  }
  return s.docker_state || 'unknown';
 }

@@ -435,6 +502,11 @@ async function renderServices() {
  grid.innerHTML = '';
  for (const [name, s] of entries) {
    const cls = classifyService(s);
+    const isBot = s.kind === 'bot';
+    // The bot tile is opt-in: it only belongs to deployments that actually run
+    // matrix-bridge. When the container is absent (missing) or the host isn't
+    // configured, hide the tile entirely rather than show a stray red card.
+    if (isBot && (cls === 'missing' || cls === 'unconfigured')) continue;
    const card = document.createElement('div');
    card.className = `service-card ${cls}`;
    const inFlight = state.service_action_in_flight && state.service_action_in_flight.startsWith(name + ':');
@@ -447,7 +519,7 @@ async function renderServices() {
      return false;
    };
    const copyIcon = `<svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2"/><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"/></svg>`;
-    const hostStr = s.host ? `${s.host}:${s.port}` : '';
+    const hostStr = s.host ? (s.port ? `${s.host}:${s.port}` : s.host) : '';
    const hostRow = s.host
      ? `<div class="row"><span class="k">Host</span><span class="v copyable" data-copy-self title="Click to copy">${escapeHtml(hostStr)}</span><button class="icon-btn" data-copy-text="${escapeHtml(hostStr)}" title="Copy host" aria-label="Copy">${copyIcon}</button></div>`
      : `<div class="row"><span class="k">Host</span><span class="v muted-v">not configured</span></div>`;
@@ -501,9 +573,11 @@ async function renderServices() {
      ${restartsRow}
      ${deepRow}
      <div class="service-actions">
+        ${isBot ? `<button class="btn primary" data-mb-update title="Pull latest code, rebuild, and recreate the bot" ${inFlight || state.mb_update_in_flight ? 'disabled' : ''}>Update</button>` : ''}
        <button class="btn" data-svc-action="${name}:start" ${disable('start') ? 'disabled' : ''}>Start</button>
        <button class="btn" data-svc-action="${name}:restart" ${disable('restart') ? 'disabled' : ''}>Restart</button>
        <button class="btn danger" data-svc-action="${name}:stop" ${disable('stop') ? 'disabled' : ''}>Stop</button>
+        ${isBot ? `<button class="btn" data-mb-logs title="Show the last 100 log lines">View logs</button>` : ''}
      </div>
    `;
    grid.appendChild(card);
@@ -511,6 +585,10 @@ async function renderServices() {
  for (const btn of grid.querySelectorAll('.btn[data-svc-action]')) {
    btn.addEventListener('click', () => onServiceAction(btn.dataset.svcAction));
  }
+  const mbUpdateBtn = grid.querySelector('[data-mb-update]');
+  if (mbUpdateBtn) mbUpdateBtn.addEventListener('click', onMatrixBridgeUpdate);
+  const mbLogsBtn = grid.querySelector('[data-mb-logs]');
+  if (mbLogsBtn) mbLogsBtn.addEventListener('click', openMatrixBridgeLogs);
  for (const btn of grid.querySelectorAll('[data-dh-run]')) {
    btn.addEventListener('click', () => onDeepHealthRun(btn.dataset.dhRun, btn));
  }
@@ -689,6 +767,118 @@ async function onServiceAction(key) {
  }
 }

+// ===================== matrix-bridge bot (update + logs) =====================
+
+const mbState = { job_id: null, eventsource: null, timer: null, started_at: null };
+
+function mbTimerStart(at) {
+  mbState.started_at = at;
+  if (mbState.timer) clearInterval(mbState.timer);
+  const tick = () => {
+    if (!mbState.started_at) return;
+    const sec = Math.max(0, Math.floor((Date.now() - mbState.started_at) / 1000));
+    el('#mb-update-elapsed').textContent = `${Math.floor(sec / 60)}:${(sec % 60).toString().padStart(2, '0')}`;
+  };
+  tick();
+  mbState.timer = setInterval(tick, 500);
+}
+
+async function onMatrixBridgeUpdate() {
+  if (state.mb_update_in_flight) return;
+  if (!confirm('Update the matrix-bridge bot?\n\nThis pulls the latest code, rebuilds the container image, and recreates the container. The first build after a base-image change can take several minutes. The bot is briefly offline while it restarts.')) return;
+  state.mb_update_in_flight = true;
+  renderServices();
+  try {
+    const r = await fetchJSON('/api/matrix-bridge/update', { method: 'POST' });
+    attachMbUpdateProgress(r.job_id);
+  } catch (e) {
+    state.mb_update_in_flight = false;
+    renderServices();
+    alert('Update failed to start: ' + e.message);
+  }
+}
+
+async function attachMbUpdateProgress(jobId) {
+  mbState.job_id = jobId;
+  el('#mb-update-log').textContent = '';
+  el('#mb-update-title').textContent = 'Updating matrix-bridge…';
+  el('#mb-update-phase').textContent = 'Starting…';
+  el('#mb-update-dialog').showModal();
+  try {
+    const snap = await fetchJSON(`/api/matrix-bridge/update/${jobId}`);
+    mbTimerStart(Date.parse(snap.started_at));
+    el('#mb-update-phase').textContent = snap.phase || 'Working…';
+    el('#mb-update-log').textContent = (snap.lines || []).join('\n');
+    if (snap.returncode !== null) { onMbUpdateDone(snap); return; }
+  } catch { mbTimerStart(Date.now()); }
+  const es = new EventSource(`/api/matrix-bridge/update/${jobId}/stream`);
+  mbState.eventsource = es;
+  es.onmessage = ev => {
+    try {
+      const d = JSON.parse(ev.data);
+      if (d.line !== undefined) {
+        const log = el('#mb-update-log');
+        log.textContent += d.line + '\n';
+        log.scrollTop = log.scrollHeight;
+      }
+    } catch {}
+  };
+  es.addEventListener('phase', ev => {
+    try { el('#mb-update-phase').textContent = JSON.parse(ev.data).phase; } catch {}
+  });
+  es.addEventListener('done', ev => {
+    let d = {}; try { d = JSON.parse(ev.data); } catch {}
+    onMbUpdateDone(d);
+  });
+  es.onerror = () => {
+    // Don't leave the Update button wedged-disabled on a dropped stream. The
+    // job keeps running server-side; re-clicking Update returns a clean 409.
+    es.close();
+    mbState.eventsource = null;
+    state.mb_update_in_flight = false;
+    el('#mb-update-phase').textContent = 'Lost connection to the update stream — reopen or check logs.';
+    renderServices();
+  };
+}
+
+function onMbUpdateDone(d) {
+  if (mbState.eventsource) { mbState.eventsource.close(); mbState.eventsource = null; }
+  if (mbState.timer) { clearInterval(mbState.timer); mbState.timer = null; }
+  state.mb_update_in_flight = false;
+  if (d.state === 'failed') {
+    el('#mb-update-title').textContent = `Update failed (rc=${d.returncode})`;
+    el('#mb-update-phase').textContent = 'Failed — see the log above.';
+  } else {
+    el('#mb-update-title').textContent = 'Update complete';
+    el('#mb-update-phase').textContent = 'Done ✓';
+  }
+  // Refresh the tile's badge.
+  (async () => { try { state.services = await fetchJSON('/api/services'); } catch {} renderServices(); })();
+}
+
+async function openMatrixBridgeLogs() {
+  const pre = el('#mb-logs-pre');
+  el('#mb-logs-title').textContent = 'matrix-bridge logs';
+  pre.textContent = 'Loading…';
+  el('#mb-logs-dialog').showModal();
+  await loadMatrixBridgeLogs();
+}
+
+async function loadMatrixBridgeLogs() {
+  const pre = el('#mb-logs-pre');
+  const btn = el('#mb-logs-refresh');
+  if (btn) btn.disabled = true;
+  try {
+    const r = await fetchJSON('/api/matrix-bridge/logs?tail=100');
+    pre.textContent = r.output || '(no output)';
+    pre.scrollTop = pre.scrollHeight;
+  } catch (e) {
+    pre.textContent = 'Could not read logs: ' + e.message;
+  } finally {
+    if (btn) btn.disabled = false;
+  }
+}
+
 function renderEndpoint(status) {
  const v = status.vllm || {};
  const panel = el('#endpoint-panel');
@@ -758,6 +948,10 @@ function renderHealth(status) {
  function setDot(id, ok, payload) {
    const item = el(id);
    if (!item) return;
+    // A service switched off via DISABLED_SERVICES isn't part of this
+    // deployment — hide its indicator entirely rather than show it as down.
+    if (payload && payload.disabled) { item.classList.add('hidden'); return; }
+    item.classList.remove('hidden');
    const dot = item.querySelector('.dot');
    dot.classList.remove('ok', 'bad', 'warn');
    if (ok === true) dot.classList.add('ok');
@@ -1056,6 +1250,11 @@ function openDiskDeleteDialog(key) {

 async function triggerSwap(modelKey) {
  if (state.swap_job_id) return;
+  if (state.lock && state.lock.held) {
+    const until = state.lock.expires_at ? ' until ' + fmtClock(state.lock.expires_at) : '';
+    alert(`The GPU swap path is reserved by ${state.lock.holder || 'automation'}${until}. Use "Release" on the reservation banner to override.`);
+    return;
+  }
  try {
    const r = await fetchJSON('/api/swap', {
      method: 'POST',
@@ -1064,9 +1263,83 @@ async function triggerSwap(modelKey) {
    });
    attachToSwap(r.job_id, /*needsBackfill=*/false);
  } catch (e) {
+    // 423 Locked: a reservation was acquired between our last poll and this click.
+    if (e.message && e.message.startsWith('423')) {
+      alert('The GPU swap path was just reserved by automation. Refreshing…');
+      pollCoordination();
+    } else {
      alert('Failed to start swap: ' + e.message);
    }
  }
+}
+
+// ---- coordination layer: swap lock + schedule registry ----
+
+async function pollCoordination() {
+  try {
+    state.lock = await fetchJSON('/api/swap/lock');
+  } catch { state.lock = { held: false }; }
+  try {
+    const r = await fetchJSON('/api/schedule');
+    state.schedules = r.schedules || [];
+  } catch { state.schedules = []; }
+  renderLockBanner();
+  renderSchedules();
+  renderCards();  // reflect lock state on the swap buttons
+}
+
+function renderLockBanner() {
+  const banner = el('#lock-banner');
+  if (!banner) return;
+  const lock = state.lock;
+  if (lock && lock.held) {
+    const until = lock.expires_at ? ` until ${fmtClock(lock.expires_at)}` : '';
+    const note = lock.note ? ` — ${escapeHtml(lock.note)}` : '';
+    el('#lock-text').innerHTML =
+      `GPU swap path reserved by <strong>${escapeHtml(lock.holder || 'automation')}</strong>${until}${note}. Manual swaps are paused.`;
+    banner.classList.remove('hidden');
+  } else {
+    banner.classList.add('hidden');
+  }
+}
+
+function renderSchedules() {
+  const panel = el('#schedule-panel');
+  const list = el('#schedule-list');
+  if (!panel || !list) return;
+  const items = state.schedules || [];
+  if (!items.length) {
+    panel.classList.add('hidden');
+    list.innerHTML = '';
+    return;
+  }
+  list.innerHTML = items.map((s) => {
+    const meta = [
+      s.cron ? `<code>${escapeHtml(s.cron)}</code>` : '',
+      s.next_run ? `next: ${escapeHtml(s.next_run)}` : '',
+      s.owner ? `by ${escapeHtml(s.owner)}` : '',
+    ].filter(Boolean).join(' · ');
+    const desc = s.description ? `<div class="desc">${escapeHtml(s.description)}</div>` : '';
+    return `<div class="schedule-item">
+      <div class="name">${escapeHtml(s.name)}</div>
+      <div class="muted small">${meta}</div>
+      ${desc}
+    </div>`;
+  }).join('');
+  panel.classList.remove('hidden');
+}
+
+async function releaseLock() {
+  const lock = state.lock || {};
+  const who = lock.holder || 'automation';
+  if (!confirm(`Force-release the GPU reservation held by ${who}? Any job relying on it may then collide with a manual swap.`)) return;
+  try {
+    await fetchJSON('/api/swap/lock?force=true', { method: 'DELETE' });
+  } catch (e) {
+    alert('Failed to release: ' + e.message);
+  }
+  pollCoordination();
+}

 async function triggerDownloadForKey(modelKey) {
  const m = state.models[modelKey];
@@ -1506,6 +1779,60 @@ function setupAdvancedDialog() {
  el('#adv-gmu').addEventListener('input', (e) => { el('#adv-gmu-out').value = parseFloat(e.target.value).toFixed(2); });
 }

+function openLocalModelDialog() {
+  const dlg = el('#local-model-dialog');
+  el('#lm-key').value = '';
+  el('#lm-name').value = '';
+  el('#lm-path').value = '';
+  el('#lm-chat').value = '';
+  el('#lm-size').value = '';
+  el('#lm-mode').value = 'solo';
+  el('#lm-desc').value = '';
+  el('#lm-mml').value = 32768;
+  el('#lm-gmu').value = 0.85;
+  el('#lm-gmu-out').value = '0.85';
+  el('#lm-fst').checked = true;
+  el('#lm-pcache').checked = true;
+  el('#lm-fp8').checked = true;
+  dlg.showModal();
+}
+
+function setupLocalModelDialog() {
+  el('#lm-cancel').addEventListener('click', () => el('#local-model-dialog').close());
+  el('#lm-gmu').addEventListener('input', (e) => { el('#lm-gmu-out').value = parseFloat(e.target.value).toFixed(2); });
+  el('#local-model-form').addEventListener('submit', async (e) => {
+    e.preventDefault();
+    const chat = el('#lm-chat').value.trim();
+    const body = {
+      key: el('#lm-key').value.trim(),
+      display_name: el('#lm-name').value.trim(),
+      local_path: el('#lm-path').value.trim(),
+      size_gb: parseFloat(el('#lm-size').value) || 0,
+      mode: el('#lm-mode').value,
+      description: el('#lm-desc').value.trim() || null,
+      // A fine-tune's chat template (if any) rides along as a launch flag.
+      vllm_args: chat ? [`--chat-template=${chat}`] : [],
+      knobs: {
+        max_model_len: parseInt(el('#lm-mml').value, 10) || 32768,
+        gpu_memory_utilization: parseFloat(el('#lm-gmu').value),
+        fastsafetensors: el('#lm-fst').checked,
+        prefix_caching: el('#lm-pcache').checked,
+        kv_cache_dtype: el('#lm-fp8').checked ? 'fp8' : 'auto',
+      },
+    };
+    try {
+      await fetchJSON('/api/models', {
+        method: 'POST',
+        headers: { 'content-type': 'application/json' },
+        body: JSON.stringify(body),
+      });
+      el('#local-model-dialog').close();
+      await loadModels();
+      pollStatus();
+    } catch (e) { alert('Add local model failed: ' + e.message); }
+  });
+}
+
 // ===================== NIM installer =====================

 const nimState = {
@@ -1847,15 +2174,33 @@ async function init() {
  el('#nim-cancel').addEventListener('click', () => el('#nim-dialog').close());
  el('#nim-form').addEventListener('submit', submitNim);
  el('#nim-prog-close').addEventListener('click', () => el('#nim-progress-dialog').close());
+  el('#mb-update-close').addEventListener('click', () => el('#mb-update-dialog').close());
+  // Dismissing the modal (Close or Esc) stops streaming; the job keeps running
+  // server-side, and re-clicking Update returns a 409 while it is still in progress.
+  el('#mb-update-dialog').addEventListener('close', () => {
+    if (mbState.eventsource) { mbState.eventsource.close(); mbState.eventsource = null; }
+    if (mbState.timer) { clearInterval(mbState.timer); mbState.timer = null; }
+    state.mb_update_in_flight = false;
+    renderServices();
+  });
+  el('#mb-logs-close').addEventListener('click', () => el('#mb-logs-dialog').close());
+  el('#mb-logs-refresh').addEventListener('click', loadMatrixBridgeLogs);
  el('#open-connectivity').addEventListener('click', openConnectivityDialog);
  el('#connectivity-close').addEventListener('click', () => el('#connectivity-dialog').close());
-  // Wake-on-LAN buttons live on unreachable hardware cards; delegate.
+  // Hardware-card buttons (Wake-on-LAN on unreachable cards; SSH-key copy on
+  // reachable ones) are rendered dynamically, so delegate from the grid.
  el('#hardware-grid').addEventListener('click', (e) => {
-    const btn = e.target.closest('[data-wake]');
-    if (btn) wakeSpark(btn.dataset.wake);
+    const wbtn = e.target.closest('[data-wake]');
+    if (wbtn) { wakeSpark(wbtn.dataset.wake); return; }
+    const kbtn = e.target.closest('[data-ssh-key]');
+    if (kbtn) { copySparkSshKey(kbtn.dataset.sshKey, kbtn); return; }
  });
+  el('#sshkey-close').addEventListener('click', () => el('#sshkey-dialog').close());
+  el('#open-local').addEventListener('click', openLocalModelDialog);
+  el('#lock-release').addEventListener('click', releaseLock);
  setupCatalogDialog();
  setupAdvancedDialog();
+  setupLocalModelDialog();
  // Open WebUI link from /api/config
  try {
    state.config = await fetchJSON('/api/config');
@@ -1870,6 +2215,7 @@ async function init() {
  await loadModels();
  await pollStatus();
  await renderServices();
+  pollCoordination();
  pollHardware();
  pollUpdates();
  // Disk-status probe runs after first paint — slow over SSH and not blocking.
@@ -1877,6 +2223,7 @@ async function init() {
  // Speech-model patches panel — slow over SSH, runs after first paint.
  renderSpeechModels();
  setInterval(pollStatus, 5000);
+  setInterval(pollCoordination, 5000); // swap lock + schedule registry
  setInterval(pollHardware, 8000);    // every 8s
  setInterval(pollUpdates, 300000);  // every 5 min
  setInterval(loadDiskStatus, 60000); // every 60s — disk state changes rarely
@@ -96,6 +96,13 @@
      </details>
    </section>

+    <section id="lock-banner" class="banner lock-banner hidden">
+      <span class="lock-icon" aria-hidden="true">🔒</span>
+      <span id="lock-text">GPU swap path reserved</span>
+      <span class="spacer"></span>
+      <button id="lock-release" class="btn small-btn">Release</button>
+    </section>
+
    <nav id="dashboard-tabs" class="dashboard-tabs hidden" role="tablist">
      <button type="button" class="dashboard-tab" data-tab="llm" role="tab" aria-selected="true">LLM</button>
      <button type="button" class="dashboard-tab" data-tab="audio" role="tab" aria-selected="false">Audio / Speech</button>
@@ -164,6 +171,37 @@
          </div>
        </form>
      </dialog>
+
+      <dialog id="mb-update-dialog" class="modal">
+        <form method="dialog" class="modal-form">
+          <h3 id="mb-update-title">Updating matrix-bridge…</h3>
+          <div class="phase-row">
+            <div class="phase" id="mb-update-phase">Starting…</div>
+            <span class="spacer"></span>
+            <span class="timer" id="mb-update-elapsed">0:00</span>
+          </div>
+          <details open>
+            <summary class="muted small">Log</summary>
+            <pre id="mb-update-log" class="log"></pre>
+          </details>
+          <div class="modal-actions">
+            <button type="button" id="mb-update-close" class="btn">Close</button>
+          </div>
+        </form>
+      </dialog>
+
+      <dialog id="mb-logs-dialog" class="modal">
+        <form method="dialog" class="modal-form">
+          <h3 id="mb-logs-title">matrix-bridge logs</h3>
+          <p class="muted small">Last 100 lines from <code>docker logs</code> on the Spark.</p>
+          <pre id="mb-logs-pre" class="log"></pre>
+          <div class="modal-actions">
+            <button type="button" id="mb-logs-refresh" class="btn">Refresh</button>
+            <span class="spacer"></span>
+            <button type="button" id="mb-logs-close" class="btn">Close</button>
+          </div>
+        </form>
+      </dialog>
    </section>

    <section id="speech-models-panel" class="speech-models hidden">
@@ -198,6 +236,7 @@
      <div class="section-header">
        <h2 class="section-title">LLM swap</h2>
        <button id="open-download" class="btn small-btn">+ Download a new model</button>
+        <button id="open-local" class="btn small-btn">+ Add local model</button>
      </div>

      <dialog id="catalog-dialog" class="modal">
@@ -230,6 +269,37 @@
        </form>
      </dialog>

+      <dialog id="local-model-dialog" class="modal">
+        <form method="dialog" class="modal-form" id="local-model-form">
+          <h3>Add a local / fine-tuned model</h3>
+          <p class="muted small">For a model that lives as a directory on a Spark (e.g. a fine-tune), not a Hugging Face repo. The directory is bind-mounted into the vLLM container at the same path when you swap to it. It must already exist on the Spark.</p>
+          <label class="modal-row"><span>Key (URL-safe id)</span><input type="text" id="lm-key" required pattern="[a-zA-Z0-9_-]+"></label>
+          <label class="modal-row"><span>Display name</span><input type="text" id="lm-name" required></label>
+          <label class="modal-row"><span>Model directory (absolute path on the Spark)</span><input type="text" id="lm-path" required placeholder="e.g. /home/you/models/my-finetune"></label>
+          <label class="modal-row"><span>Chat template path (optional)</span><input type="text" id="lm-chat" placeholder="e.g. /home/you/models/my-finetune/chat_template.jinja"></label>
+          <label class="modal-row"><span>Size (GB)</span><input type="number" id="lm-size" step="0.1" min="0"></label>
+          <label class="modal-row"><span>Mode</span>
+            <select id="lm-mode">
+              <option value="solo">solo (Spark 1 only)</option>
+              <option value="cluster">cluster (both Sparks via Ray)</option>
+            </select>
+          </label>
+          <label class="modal-row"><span>Description (optional)</span><textarea id="lm-desc" rows="3"></textarea></label>
+          <fieldset class="modal-fieldset">
+            <legend>Default launch knobs</legend>
+            <label class="modal-row"><span>Max context (tokens)</span><input type="number" id="lm-mml" step="1024" min="1024" value="32768"></label>
+            <label class="modal-row"><span>GPU memory %</span><input type="range" id="lm-gmu" min="0.5" max="0.95" step="0.01" value="0.85"> <output id="lm-gmu-out">0.85</output></label>
+            <label class="modal-row inline"><input type="checkbox" id="lm-fst" checked> Fast safetensors loading</label>
+            <label class="modal-row inline"><input type="checkbox" id="lm-pcache" checked> Prefix caching</label>
+            <label class="modal-row inline"><input type="checkbox" id="lm-fp8" checked> FP8 KV cache</label>
+          </fieldset>
+          <div class="modal-actions">
+            <button type="button" id="lm-cancel" class="btn">Cancel</button>
+            <button type="submit" class="btn primary">Add local model</button>
+          </div>
+        </form>
+      </dialog>
+
      <dialog id="disk-delete-dialog" class="modal">
        <form method="dialog" class="modal-form">
          <h3>Delete model weights from disk?</h3>
@@ -244,6 +314,24 @@
        </form>
      </dialog>

+      <dialog id="sshkey-dialog" class="modal">
+        <form method="dialog" class="modal-form">
+          <h3 id="sshkey-title">SSH public key</h3>
+          <p id="sshkey-intro" class="muted small"></p>
+          <div class="sshkey-row">
+            <pre id="sshkey-value" class="snippet copyable" data-copy-self title="Click to copy"></pre>
+            <button type="button" class="icon-btn" data-copy="#sshkey-value" title="Copy public key" aria-label="Copy public key">
+              <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2"/><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"/></svg>
+            </button>
+          </div>
+          <p class="muted small">To let this Spark log in to another machine (e.g. your Mac), run this in a terminal <em>on that machine</em>:</p>
+          <pre id="sshkey-install" class="snippet copyable" data-copy-self title="Click to copy"></pre>
+          <div class="modal-actions">
+            <button type="button" id="sshkey-close" class="btn">Close</button>
+          </div>
+        </form>
+      </dialog>
+
      <dialog id="advanced-dialog" class="modal">
        <form method="dialog" class="modal-form" id="advanced-form">
          <h3 id="adv-title">Advanced settings</h3>
@@ -313,6 +401,14 @@
      <section id="cards" class="cards"></section>
    </section>

+    <section id="schedule-panel" class="schedule-panel hidden">
+      <div class="section-header">
+        <h2 class="section-title">Scheduled jobs</h2>
+      </div>
+      <p class="muted small">Registered by your own automation. Spark Control only displays these — it doesn't run them.</p>
+      <div id="schedule-list" class="schedule-list"></div>
+    </section>
+
    <section id="update-banner" class="update-banner hidden">
      <div class="ub-context muted small">
        Updates to <strong><a href="https://github.com/eugr/spark-vllm-docker" target="_blank" rel="noopener">eugr/spark-vllm-docker</a></strong>
@@ -74,6 +74,42 @@ main {
 }
 .banner em { font-style: normal; background: rgba(245, 158, 11, 0.15); padding: 2px 6px; border-radius: 4px; }

+/* GPU swap reservation (coordination layer) — informational, not a warning. */
+.lock-banner {
+  display: flex;
+  align-items: center;
+  gap: 10px;
+  border-color: var(--info);
+  color: var(--info);
+}
+.lock-banner .lock-icon { font-size: 16px; }
+.lock-banner strong { color: var(--text); }
+.lock-banner .spacer { flex: 1; }
+
+/* Scheduled-jobs panel — read-only view of what external automation registered. */
+.schedule-panel { margin-top: 8px; }
+.schedule-list {
+  display: grid;
+  grid-template-columns: repeat(auto-fill, minmax(240px, 1fr));
+  gap: 12px;
+  margin-top: 8px;
+}
+.schedule-item {
+  background: var(--surface);
+  border: 1px solid var(--border);
+  border-radius: var(--radius);
+  padding: 12px 14px;
+}
+.schedule-item .name { font-weight: 600; margin-bottom: 4px; }
+.schedule-item code {
+  background: var(--surface-2);
+  border: 1px solid var(--border);
+  border-radius: 4px;
+  padding: 1px 5px;
+  font-size: 12px;
+}
+.schedule-item .desc { margin-top: 6px; color: var(--muted); font-size: 13px; }
+
 /* ===== Endpoint panel ===== */

 .endpoint-panel {
@@ -374,6 +410,12 @@ main {
 }
 .hw-card .head .name { font-weight: 600; font-size: 15px; }
 .hw-card .head .meta { color: var(--muted); font-size: 12px; margin-left: auto; }
+/* WireGuard "VPN <ip>" badge in the meta line — accent (green) = on a tunnel. */
+.hw-card .head .meta .wg-badge { color: var(--accent); font-weight: 600; cursor: help; }
+/* Copy-this-Spark's-ssh-key button pins to the top-right corner; meta keeps
+   its margin-left:auto so name/meta/button read left→right→corner. */
+.hw-card .head .ssh-key-btn { align-self: flex-start; padding: 3px 6px; }
+.hw-card .head .ssh-key-btn svg { width: 13px; height: 13px; }
 .hw-card.unreachable { border-color: rgba(239, 68, 68, 0.4); }
 .hw-card.unreachable .name { color: var(--error); }
 .hw-card.unreachable ol { color: var(--muted); }
@@ -387,6 +429,10 @@ main {
 }
 .hw-card .wol-row .btn { padding: 5px 10px; font-size: 12px; }
 .hw-card .mac-display { font-family: ui-monospace, SFMono-Regular, Menlo, monospace; }
+/* SSH-key dialog: key line beside its copy button; long key wraps rather than scrolls. */
+.sshkey-row { display: flex; align-items: flex-start; gap: 8px; }
+.sshkey-row .snippet { flex: 1; margin: 0; white-space: pre-wrap; word-break: break-all; }
+#sshkey-install { white-space: pre-wrap; word-break: break-all; }

 .connectivity-content {
  max-height: 360px;
@@ -516,10 +562,12 @@ main {
 #dl-log-details { margin-top: 12px; }
 #dl-log-details summary { cursor: pointer; padding: 4px 0; }

-/* ===== NIM install dialog ===== */
+/* ===== NIM install + matrix-bridge dialogs ===== */

 .modal#nim-dialog,
-.modal#nim-progress-dialog { max-width: 640px; }
+.modal#nim-progress-dialog,
+.modal#mb-update-dialog,
+.modal#mb-logs-dialog { max-width: 640px; }
 .nim-grid {
  display: grid;
  gap: 8px;
@@ -682,6 +730,7 @@ main {
 .card .repo a { color: inherit; text-decoration: none; }
 .card .repo a:hover { color: var(--info); text-decoration: underline; }
 .card .repo .hf-icon { font-size: 13px; opacity: 0.7; }
+.card .repo .local-path { font-family: var(--mono, ui-monospace, monospace); opacity: 0.85; }
 .tag {
  background: var(--surface-2);
  border: 1px solid var(--border);
@@ -726,6 +775,7 @@ main {
 .card .adv-btn,
 .card .test-btn { padding: 8px 12px; font-size: 12px; }
 .card .custom-pill { color: var(--info); border-color: rgba(96, 165, 250, 0.4); }
+.card .local-pill { color: var(--warn); border-color: rgba(245, 158, 11, 0.4); }
 .tag.on-disk { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
 .tag.not-on-disk { color: var(--muted); border-color: var(--border); opacity: 0.7; }
 .card-actions .icon-btn.danger { color: var(--error); border-color: rgba(239, 68, 68, 0.3); margin-left: auto; }
@@ -6,7 +6,9 @@ from datetime import datetime, timezone
 from typing import Optional

 from .config import Settings
+from .coordination import WebhookNotifier, build_webhook_payload
 from .models import Catalog, build_launch_command
+from .shellsafe import quote_arg
 from .ssh import ssh_run, ssh_stream, StreamHandle


@@ -32,9 +34,15 @@ class SwapJob:


 class SwapManager:
-    def __init__(self, settings: Settings, catalog: Catalog) -> None:
+    def __init__(
+        self,
+        settings: Settings,
+        catalog: Catalog,
+        notifier: Optional[WebhookNotifier] = None,
+    ) -> None:
        self.settings = settings
        self.catalog = catalog
+        self.notifier = notifier
        self.lock = asyncio.Lock()
        self.jobs: dict[str, SwapJob] = {}
        self.current_job_id: Optional[str] = None
@@ -77,6 +85,21 @@ class SwapManager:
                job.finished_at = datetime.now(timezone.utc).isoformat()
                if self.current_job_id == job.id:
                    self.current_job_id = None
+        # Outside the swap lock (so a webhook POST can't stall a queued swap) and
+        # only for real swaps — a dry run never changes the running model. A
+        # webhook failure is logged inside fire(), never raised.
+        if self.notifier is not None and self.notifier.enabled and not job.dry_run:
+            event = "swap_complete" if job.state == "ready" else "swap_failed"
+            await self.notifier.fire(event, build_webhook_payload(
+                event=event,
+                job_id=job.id,
+                model_key=job.model_key,
+                state=job.state,
+                returncode=job.returncode,
+                started_at=job.started_at,
+                finished_at=job.finished_at,
+                dry_run=job.dry_run,
+            ))

    async def _do(self, job: SwapJob) -> None:
        model = self.catalog.models[job.model_key]
@@ -112,7 +135,7 @@ class SwapManager:

        # Step 3: tail logs until the ready marker (or timeout)
        job.state = "tailing"
-        tail_cmd = "docker logs -f --tail 50 vllm_node"
+        tail_cmd = f"docker logs -f --tail 50 {quote_arg(s.vllm_container)}"
        job.append(f"$ {tail_cmd}")
        timeout = max(model.expected_ready_seconds * 2, 600)
        handle = StreamHandle()
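
A consumer of the `swap_complete` / `swap_failed` webhook will probably want to check the optional signature before acting on an event. The sketch below is a minimal receiver-side check, not the project's implementation. It assumes the signature is `sha256=` followed by the hex HMAC-SHA256 of the raw request body, keyed with the shared secret; the tests in this commit confirm only the `sha256=` prefix and that the value is deterministic. `expected_signature` and `verify_signature` are hypothetical names, and the header that carries the signature is not shown in this diff.

```python
import hashlib
import hmac


def expected_signature(secret: str, body: bytes) -> str:
    """Assumed scheme: 'sha256=' + hex HMAC-SHA256 over the raw body bytes."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(secret: str, body: bytes, received: str) -> bool:
    """Compare in constant time so response timing does not leak the digest."""
    want = expected_signature(secret, body).encode("utf-8")
    got = (received or "").encode("utf-8")
    return hmac.compare_digest(want, got)
```

Verify against the exact bytes received, before any JSON parsing or re-serialisation, since a re-encoded body can differ from the one that was signed.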
@@ -22,6 +22,7 @@ from typing import Any

 from .config import Settings
 from .models import Catalog, build_launch_command
+from .shellsafe import quote_arg
 from .ssh import ssh_run


@@ -114,7 +115,7 @@ async def validate_launch(key: str, catalog: Catalog, settings: Settings) -> dic
    # Pipe the JSON args list to a here-doc Python invocation. The validator
    # reads from stdin to avoid shell-escaping the args themselves.
    cmd = (
-        f"echo '{payload}' | docker exec -i vllm_node python3 -c "
+        f"echo '{payload}' | docker exec -i {quote_arg(settings.vllm_container)} python3 -c "
        + shlex.quote(_VALIDATOR_SCRIPT)
    )
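
Both hunks above replace the hard-coded `vllm_node` with a quoted, configurable container name. The sketch below shows why the quoting matters once the name comes from settings. It assumes `quote_arg` behaves like the standard library's `shlex.quote`; the `shellsafe` module itself is not part of this excerpt, and `tail_command` is a hypothetical helper.

```python
import shlex


def tail_command(container: str, lines: int = 50) -> str:
    """Build the log-tail command with the container name shell-quoted."""
    # int() rejects non-numeric input; shlex.quote() wraps anything that
    # contains shell metacharacters in single quotes.
    return f"docker logs -f --tail {int(lines)} {shlex.quote(container)}"


print(tail_command("vllm_node"))
print(tail_command("x; rm -rf /"))
```

A plain name such as `vllm_node` passes through unchanged, while a value containing `;` or spaces arrives at the remote shell as a single quoted argument.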

@@ -12,6 +12,12 @@ dependencies = [
    "python-multipart>=0.0.9",
 ]

+[project.optional-dependencies]
+dev = ["pytest>=8"]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+
 [build-system]
 requires = ["setuptools>=68"]
 build-backend = "setuptools.build_meta"
@@ -0,0 +1,17 @@
+"""Shared pytest setup.
+
+These suites are pure/offline — they exercise pure functions and never touch the
+Sparks, /data, or the network. We still pin the env vars the app modules expect
+(documented in docs/guides/fastapi-image.md) to tmp paths so importing them can
+never write to the container-only /data path.
+"""
+import os
+import sys
+from pathlib import Path
+
+# Let `import app...` resolve whether or not the package is pip-installed.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+os.environ.setdefault("REDACTION_MAP_DB", "/tmp/spark_control_test_maps.db")
+os.environ.setdefault("CONNECTIVITY_LOG", "/tmp/spark_control_test_connectivity.json")
+os.environ.setdefault("MODELS_OVERRIDES", "/tmp/spark_control_test_overrides.yaml")
@@ -0,0 +1,201 @@
+"""Coordination layer: swap lock lifecycle/expiry, schedule registry CRUD, and
+the webhook payload+signature. All offline — the lock takes an injectable `now`
+so expiry is tested without sleeping, and the webhook is exercised only on the
+disabled (no-network) path plus its pure payload/signature helpers.
+"""
+import asyncio
+from datetime import datetime, timedelta, timezone
+
+import pytest
+
+from app.coordination import (
+    LOCK_TTL_MAX,
+    LOCK_TTL_MIN,
+    LockHeld,
+    ScheduleRegistry,
+    SwapLockManager,
+    WebhookNotifier,
+    build_webhook_payload,
+    sign_payload,
+    valid_schedule_id,
+)
+
+T0 = datetime(2026, 6, 17, 12, 0, 0, tzinfo=timezone.utc)
+
+
+# ----------------------------------------------------------------- swap lock ----
+
+def test_acquire_free_lock_returns_token_and_status_held():
+    mgr = SwapLockManager()
+    lock = mgr.acquire("openclaw", ttl_seconds=60, note="daily vol", now=T0)
+    assert lock.token
+    st = mgr.status(now=T0)
+    assert st["held"] is True
+    assert st["holder"] == "openclaw"
+    assert st["note"] == "daily vol"
+    assert st["seconds_remaining"] == 60
+    assert "token" not in st  # public view never leaks the token
+
+
+def test_acquire_requires_holder():
+    with pytest.raises(ValueError):
+        SwapLockManager().acquire("  ", now=T0)
+
+
+def test_acquire_held_by_other_raises_lockheld_with_state():
+    mgr = SwapLockManager()
+    mgr.acquire("openclaw", ttl_seconds=60, now=T0)
+    with pytest.raises(LockHeld) as ei:
+        mgr.acquire("johnny5", ttl_seconds=60, now=T0)
+    assert ei.value.state["holder"] == "openclaw"
+
+
+def test_reacquire_with_token_extends_and_keeps_token():
+    mgr = SwapLockManager()
+    first = mgr.acquire("openclaw", ttl_seconds=60, now=T0)
+    later = T0 + timedelta(seconds=30)
+    second = mgr.acquire("openclaw", ttl_seconds=60, token=first.token, now=later)
+    assert second.token == first.token
+    # window extended from the later moment, not the original
+    assert mgr.status(now=later)["seconds_remaining"] == 60
+    assert second.acquired_at == first.acquired_at  # acquired_at preserved
+
+
+def test_reacquire_without_token_is_refused_even_for_same_holder_name():
+    # Holder name is descriptive, not a secret — matching it must not grant access.
+    mgr = SwapLockManager()
+    mgr.acquire("openclaw", ttl_seconds=60, now=T0)
+    with pytest.raises(LockHeld):
+        mgr.acquire("openclaw", ttl_seconds=60, now=T0)
+
+
+def test_ttl_is_clamped():
+    mgr = SwapLockManager()
+    mgr.acquire("a", ttl_seconds=0, now=T0)
+    assert mgr.status(now=T0)["seconds_remaining"] == LOCK_TTL_MIN
+    mgr2 = SwapLockManager()
+    mgr2.acquire("b", ttl_seconds=10**9, now=T0)
+    assert mgr2.status(now=T0)["seconds_remaining"] == LOCK_TTL_MAX
+
+
+def test_lock_expires_and_clears_lazily():
+    mgr = SwapLockManager()
+    tok = mgr.acquire("openclaw", ttl_seconds=10, now=T0).token
+    after = T0 + timedelta(seconds=11)
+    assert mgr.status(now=after) == {"held": False}
+    assert mgr.verify(tok, now=after) is False
+    # an expired lock is free to re-take by anyone
+    mgr.acquire("johnny5", ttl_seconds=10, now=after)
+    assert mgr.status(now=after)["holder"] == "johnny5"
+
+
+def test_verify_matches_only_active_token():
+    mgr = SwapLockManager()
+    tok = mgr.acquire("openclaw", ttl_seconds=60, now=T0).token
+    assert mgr.verify(tok, now=T0) is True
+    assert mgr.verify("nope", now=T0) is False
+    assert mgr.verify(None, now=T0) is False
+
+
+def test_release_requires_token_then_frees():
+    mgr = SwapLockManager()
+    tok = mgr.acquire("openclaw", ttl_seconds=60, now=T0).token
+    with pytest.raises(PermissionError):
+        mgr.release("wrong", now=T0)
+    assert mgr.release(tok, now=T0) is True
+    assert mgr.status(now=T0) == {"held": False}
+
+
+def test_force_release_skips_token_and_release_of_free_lock_is_false():
+    mgr = SwapLockManager()
+    mgr.acquire("openclaw", ttl_seconds=60, now=T0)
+    assert mgr.release(force=True, now=T0) is True
+    assert mgr.release(force=True, now=T0) is False  # nothing held now
+
+
+def test_is_blocked_by_is_the_swap_gate():
+    # Mirrors the single-read decision the /api/swap endpoint makes.
+    mgr = SwapLockManager()
+    assert mgr.is_blocked_by(None, now=T0) is None          # free lock blocks nobody
+    tok = mgr.acquire("openclaw", ttl_seconds=10, now=T0).token
+    blocked = mgr.is_blocked_by(None, now=T0)               # no token -> blocked
+    assert blocked is not None and blocked["holder"] == "openclaw"
+    assert mgr.is_blocked_by("wrong", now=T0) is not None   # wrong token -> blocked
+    assert mgr.is_blocked_by(tok, now=T0) is None           # holder's token -> allowed
+    # At/after expiry the gate is open even without a token (the case a separate
+    # status()+verify() pair would get wrong).
+    assert mgr.is_blocked_by(None, now=T0 + timedelta(seconds=11)) is None
+
+
+# ------------------------------------------------------------------- webhook ----
+
+def test_build_webhook_payload_shape():
+    p = build_webhook_payload(
+        event="swap_complete", job_id="abc123", model_key="gemma",
+        state="ready", returncode=0, started_at="t0", finished_at="t1",
+        dry_run=False,
+    )
+    assert p == {
+        "event": "swap_complete", "job_id": "abc123", "model_key": "gemma",
+        "state": "ready", "returncode": 0, "started_at": "t0",
+        "finished_at": "t1", "dry_run": False,
+    }
+
+
+def test_sign_payload_is_deterministic_and_prefixed():
+    body = b'{"event":"swap_complete"}'
+    sig = sign_payload("s3cr3t", body)
+    assert sig.startswith("sha256=")
+    assert sig == sign_payload("s3cr3t", body)
+    assert sig != sign_payload("other", body)
+
+
+def test_disabled_webhook_fire_is_noop():
+    n = WebhookNotifier("", "")
+    assert n.enabled is False
+    # Must not attempt any network call or raise when no URL is configured.
+    assert asyncio.run(n.fire("swap_complete", {"x": 1})) is None
+
+
+# --------------------------------------------------------- schedule registry ----
+
+def test_register_and_list_schedule():
+    reg = ScheduleRegistry()
+    e = reg.register(name="Daily Vol", owner="openclaw", cron="0 6 * * *")
+    assert e.id and e.registered_at and e.updated_at
+    listed = reg.list()
+    assert len(listed) == 1 and listed[0]["name"] == "Daily Vol"
+
+
+def test_register_with_id_updates_in_place():
+    reg = ScheduleRegistry()
+    reg.register(name="Daily Vol", id="dv", owner="openclaw", cron="0 6 * * *")
+    reg.register(name="Daily Vol v2", id="dv", owner="openclaw", cron="0 7 * * *")
+    listed = reg.list()
+    assert len(listed) == 1
+    assert listed[0]["name"] == "Daily Vol v2" and listed[0]["cron"] == "0 7 * * *"
+
+
+def test_register_requires_name_and_validates_id():
+    reg = ScheduleRegistry()
+    with pytest.raises(ValueError):
+        reg.register(name="  ")
+    with pytest.raises(ValueError):
+        reg.register(name="ok", id="bad id; rm -rf")
+
+
+def test_delete_schedule():
+    reg = ScheduleRegistry()
+    reg.register(name="Daily Vol", id="dv")
+    assert reg.delete("dv") is True
+    assert reg.delete("dv") is False
+    assert reg.list() == []
+
+
+def test_valid_schedule_id():
+    assert valid_schedule_id("daily-vol")
+    assert valid_schedule_id("a.b_c-1")
+    assert not valid_schedule_id("")
+    assert not valid_schedule_id("../etc")
+    assert not valid_schedule_id("has space")
+    assert not valid_schedule_id("x" * 65)
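For consumers of the swap webhook, the signature covered by `test_sign_payload_is_deterministic_and_prefixed` could be checked on the receiving side roughly as follows. This is a minimal sketch, not the code in `coordination.py`: it assumes the header value is the hex HMAC-SHA256 of the raw request body with a `sha256=` prefix, and the names `expected_signature` and `verify_signature` are hypothetical.

```python
import hashlib
import hmac


def expected_signature(secret: str, body: bytes) -> str:
    # Assumption: hex-encoded HMAC-SHA256 over the raw body, "sha256=" prefix.
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify_signature(secret: str, body: bytes, header_value: str) -> bool:
    # Compare as bytes with compare_digest so the check runs in constant time
    # and a non-ASCII header value cannot raise.
    expected = expected_signature(secret, body).encode("utf-8")
    received = (header_value or "").encode("utf-8")
    return hmac.compare_digest(expected, received)
```

A receiver would call `verify_signature` on the raw bytes it read from the request, before parsing the JSON, since re-serialising the body can change the bytes that were signed.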
@@ -0,0 +1,69 @@
+"""_merge_words_with_speakers + _assign_speaker_to_word: the transcript/diarizer
+merge that turns Parakeet words + Sortformer turns into speaker-labelled blocks.
+Pure functions, no cluster — this is the core of transcribe-with-speakers.
+"""
+from app.audio_proxy import _assign_speaker_to_word, _merge_words_with_speakers
+
+
+def _w(start, end, text):
+    return {"start": start, "end": end, "text": text}
+
+
+def _t(start, end, speaker):
+    return {"start_s": start, "end_s": end, "speaker": speaker}
+
+
+# ---- _assign_speaker_to_word ----
+
+def test_assign_by_midpoint_containment():
+    turns = [_t(0.0, 2.0, "Speaker_0"), _t(2.0, 4.0, "Speaker_1")]
+    assert _assign_speaker_to_word(2.4, 2.8, turns) == "Speaker_1"
+
+
+def test_assign_falls_back_to_max_overlap_when_midpoint_outside():
+    # midpoint 5.0 is in no turn; the word span overlaps Speaker_0 (0.9 s) but not Speaker_1.
+    turns = [_t(0.0, 4.9, "Speaker_0"), _t(6.0, 8.0, "Speaker_1")]
+    assert _assign_speaker_to_word(4.0, 6.0, turns) == "Speaker_0"
+
+
+def test_assign_unknown_when_no_overlap():
+    turns = [_t(0.0, 1.0, "Speaker_0")]
+    assert _assign_speaker_to_word(10.0, 11.0, turns) == "Speaker_unknown"
+
+
+# ---- _merge_words_with_speakers ----
+
+def test_empty_words_returns_empty():
+    assert _merge_words_with_speakers([], [_t(0, 1, "Speaker_0")]) == []
+
+
+def test_consecutive_same_speaker_words_join_into_one_block():
+    words = [_w(0.0, 0.5, "good"), _w(0.5, 1.0, "morning")]
+    turns = [_t(0.0, 2.0, "Speaker_0")]
+    blocks = _merge_words_with_speakers(words, turns)
+    assert blocks == [
+        {"start_ms": 0, "end_ms": 1000, "speaker": "Speaker_0", "text": "good morning"}
+    ]
+
+
+def test_speaker_change_splits_blocks():
+    words = [_w(0.0, 1.0, "hi"), _w(2.1, 3.0, "hello")]
+    turns = [_t(0.0, 2.0, "Speaker_0"), _t(2.0, 4.0, "Speaker_1")]
+    blocks = _merge_words_with_speakers(words, turns)
+    assert [b["speaker"] for b in blocks] == ["Speaker_0", "Speaker_1"]
+    assert [b["text"] for b in blocks] == ["hi", "hello"]
+
+
+def test_long_silence_breaks_block_for_same_speaker():
+    # >1.5s gap between two words of the same speaker forces a new block.
+    words = [_w(0.0, 0.5, "one"), _w(3.0, 3.5, "two")]
+    turns = [_t(0.0, 4.0, "Speaker_0")]
+    blocks = _merge_words_with_speakers(words, turns)
+    assert len(blocks) == 2
+    assert [b["text"] for b in blocks] == ["one", "two"]
+
+
+def test_punctuation_token_joins_without_leading_space():
+    words = [_w(0.0, 0.5, "hello"), _w(0.5, 0.7, ".")]
+    turns = [_t(0.0, 2.0, "Speaker_0")]
+    assert _merge_words_with_speakers(words, turns)[0]["text"] == "hello."
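The assignment rule these tests pin down (midpoint containment first, then largest overlap, then `Speaker_unknown`) can be sketched as below. This is a hypothetical re-implementation for illustration, not the body of `_assign_speaker_to_word`; it assumes turns are half-open intervals `[start_s, end_s)` and uses the invented name `assign_speaker`.

```python
def assign_speaker(start: float, end: float, turns: list[dict]) -> str:
    # 1) Prefer the turn that contains the word's midpoint.
    mid = (start + end) / 2.0
    for t in turns:
        if t["start_s"] <= mid < t["end_s"]:
            return t["speaker"]
    # 2) Otherwise pick the turn with the largest positive overlap.
    best, best_overlap = "Speaker_unknown", 0.0
    for t in turns:
        overlap = min(end, t["end_s"]) - max(start, t["start_s"])
        if overlap > best_overlap:
            best, best_overlap = t["speaker"], overlap
    # 3) No overlap at all leaves the "Speaker_unknown" default in place.
    return best
```

Checking the midpoint first keeps a word that straddles a turn boundary with the speaker who holds most of it, while the overlap fallback covers words that land in a gap between turns.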
@@ -0,0 +1,148 @@
+"""build_launch_command: argument assembly + the shell-injection invariant.
+
+The security-critical property is that every user-controllable value (repo,
+vllm_args, knobs) is shlex-quoted at the sink, so `shlex.split` cleanly reverses
+the command back into the exact token list. The vLLM pre-flight validator
+(validate.py) depends on this round-trip — these tests lock it in.
+"""
+import shlex
+
+import pytest
+from pydantic import ValidationError
+
+from app.models import Defaults, ModelDef, build_launch_command
+
+DEFAULTS = Defaults(port=8888, host="0.0.0.0")
+
+
+def _model(**kw) -> ModelDef:
+    base = dict(display_name="X", repo="org/name", size_gb=1.0, mode="solo")
+    base.update(kw)
+    return ModelDef(**base)
+
+
+def test_solo_model_emits_solo_flag_and_ordered_args():
+    cmd = build_launch_command("k", _model(vllm_args=["--max-model-len=1000"]), DEFAULTS)
+    assert cmd == (
+        "./launch-cluster.sh --solo -d exec vllm serve org/name "
+        "--port=8888 --host=0.0.0.0 --max-model-len=1000"
+    )
+
+
+def test_cluster_model_omits_solo_flag():
+    cmd = build_launch_command("k", _model(mode="cluster", vllm_args=["-tp=2"]), DEFAULTS)
+    assert " --solo " not in cmd
+    assert cmd.startswith("./launch-cluster.sh -d exec vllm serve org/name")
+
+
+def test_knob_overrides_matching_bundled_flag():
+    # bundled arg sets max-model-len; the knob must win (single occurrence).
+    m = _model(vllm_args=["--max-model-len=1000"], knobs={"max_model_len": 65536})
+    cmd = build_launch_command("k", m, DEFAULTS)
+    assert "--max-model-len=65536" in cmd
+    assert "--max-model-len=1000" not in cmd
+
+
+def test_repo_with_shell_metacharacters_is_quoted_not_executed():
+    # build_launch_command quotes even a hostile repo (validate_repo guards the
+    # API boundary; this proves the sink itself is safe in depth).
+    evil = "org/name; rm -rf ~ #"
+    cmd = build_launch_command("k", _model(repo=evil), DEFAULTS)
+    # The raw metacharacters must not appear unquoted...
+    assert "; rm -rf" not in cmd.replace(shlex.quote(evil), "")
+    # ...and shlex.split must recover the repo as one literal token.
+    tokens = shlex.split(cmd)
+    assert evil in tokens
+
+
+def test_command_string_round_trips_through_shlex_split():
+    # The invariant validate.py relies on: every arg survives quote -> split intact.
+    args = ["--max-model-len=32768", "--load-format=fastsafetensors", "--note=a b c"]
+    cmd = build_launch_command("k", _model(vllm_args=args), DEFAULTS)
+    tokens = shlex.split(cmd)
+    for a in args:
+        assert a in tokens
+
+
+def test_injection_via_vllm_arg_stays_literal():
+    payload = "--foo=$(touch /tmp/pwned)"
+    cmd = build_launch_command("k", _model(vllm_args=[payload]), DEFAULTS)
+    assert payload in shlex.split(cmd)  # preserved as one inert token
+
+
+# ---- local / fine-tuned models (served by directory, not HF repo) ----
+
+def test_local_model_bind_mounts_dir_and_serves_the_path():
+    m = _model(repo="", local_path="/home/u/models/ft-v2", vllm_args=["--max-model-len=2048"])
+    cmd = build_launch_command("k", m, DEFAULTS)
+    tokens = shlex.split(cmd)
+    # The launch script's hook bind-mounts the host dir at the SAME container path.
+    assert tokens[0] == (
+        "VLLM_SPARK_EXTRA_DOCKER_ARGS=-v /home/u/models/ft-v2:/home/u/models/ft-v2"
+    )
+    # vLLM is pointed at the directory, not an HF repo id.
+    i = tokens.index("serve")
+    assert tokens[i + 1] == "/home/u/models/ft-v2"
+    assert "--max-model-len=2048" in tokens
+
+
+def test_local_model_chat_template_arg_survives_round_trip():
+    m = _model(
+        repo="",
+        local_path="/m/ft",
+        vllm_args=["--chat-template=/m/ft/chat_template.jinja"],
+    )
+    cmd = build_launch_command("k", m, DEFAULTS)
+    assert "--chat-template=/m/ft/chat_template.jinja" in shlex.split(cmd)
+
+
+def test_local_path_with_metacharacters_is_quoted_not_executed():
+    # The validator rejects a hostile path at the boundary; bypass it with
+    # model_construct to prove the quote_arg sink is safe in depth even if a bad
+    # value somehow reaches build_launch_command.
+    evil = "/m/ft; rm -rf ~"
+    m = ModelDef.model_construct(
+        display_name="X", repo="", local_path=evil, size_gb=1.0, mode="solo",
+        vllm_args=[], knobs=None, custom=False, capabilities=[],
+        expected_ready_seconds=300, description=None,
+    )
+    cmd = build_launch_command("k", m, DEFAULTS)
+    tokens = shlex.split(cmd)
+    i = tokens.index("serve")
+    assert tokens[i + 1] == evil  # recovered as one literal token, not executed
+    assert tokens[0] == f"VLLM_SPARK_EXTRA_DOCKER_ARGS=-v {evil}:{evil}"
+
+
+def test_model_requires_exactly_one_source():
+    with pytest.raises(ValidationError):
+        ModelDef(display_name="x", size_gb=1, mode="solo")  # neither repo nor local_path
+    with pytest.raises(ValidationError):
+        ModelDef(display_name="x", repo="o/n", local_path="/p", size_gb=1, mode="solo")  # both
+
+
+def test_local_model_rejects_chat_template_outside_dir():
+    # Only local_path is mounted into the container, so a chat-template elsewhere
+    # would silently 404 inside vLLM — reject it up front.
+    with pytest.raises(ValidationError):
+        ModelDef(
+            display_name="x", repo="", local_path="/m/ft", size_gb=1, mode="solo",
+            vllm_args=["--chat-template=/other/dir/t.jinja"],
+        )
+
+
+def test_invalid_local_path_rejected_by_model():
+    with pytest.raises(ValidationError):
+        ModelDef(display_name="x", repo="", local_path="/m/../etc", size_gb=1, mode="solo")
+
+
+def test_merge_overrides_loads_local_and_skips_invalid(monkeypatch):
+    # YAML/override-added local models get the same validation as the API; a single
+    # bad entry is skipped (logged) rather than breaking the whole catalog load.
+    from app import models as M
+    monkeypatch.setattr(M, "load_overrides", lambda: {"knobs": {}, "custom": [
+        {"key": "good", "display_name": "G", "local_path": "/home/u/m", "size_gb": 1, "mode": "solo"},
+        {"key": "bad", "display_name": "B", "local_path": "/home/u/../etc", "size_gb": 1, "mode": "solo"},
+    ]})
+    cat = M._merge_overrides(M.Catalog(models={}))
+    assert cat.models["good"].is_local and cat.models["good"].source == "/home/u/m"
+    assert "bad" not in cat.models  # traversal path skipped, not catalog-fatal
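The round-trip invariant these tests lock in can be shown in isolation with the standard library alone. The sketch below is illustrative only: `join_quoted` is a hypothetical helper, not `build_launch_command` or the project's `quote_arg`.

```python
import shlex


def join_quoted(tokens: list[str]) -> str:
    # Quote every token at the sink; shlex.quote leaves already-safe tokens unchanged.
    return " ".join(shlex.quote(t) for t in tokens)


tokens = [
    "vllm", "serve",
    "org/name; rm -rf ~ #",        # hostile repo
    "--note=a b c",                # embedded spaces
    "--foo=$(touch /tmp/pwned)",   # command substitution attempt
]
cmd = join_quoted(tokens)

# shlex.split reverses the quoting, so each value comes back as one literal token.
assert shlex.split(cmd) == tokens
```

Because the quoting is applied per token at the point where the string is built, a validator that later calls `shlex.split` sees exactly the arguments that were passed in, whatever characters they contain.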
@@ -0,0 +1,47 @@
+"""build_update_command: the matrix-bridge update one-liner.
+
+Pure string assembly, no cluster. Locks in the contract from
+docs/spark-control-integration.md (matrix-bridge repo): fetch, hard-reset to the
+release branch, then rebuild/recreate via docker compose — chained with `&&` so
+any failure (e.g. Gitea unreachable) aborts before the build and surfaces a
+non-zero exit. The clone dir must stay unquoted so a `~` expands server-side.
+"""
+from app.matrix_bridge import build_update_command, _phase_for
+
+
+def test_command_is_the_contract_chain():
+    cmd = build_update_command("~/matrix-bridge", "master")
+    assert cmd == (
+        "cd ~/matrix-bridge && "
+        "git fetch origin && "
+        "git reset --hard origin/master && "
+        "docker compose up -d --build"
+    )
+
+
+def test_fail_loud_chaining():
+    # Every step is &&-chained: a failed fetch never reaches the build.
+    cmd = build_update_command("~/matrix-bridge", "master")
+    assert "; " not in cmd
+    assert cmd.count(" && ") == 3
+    assert cmd.index("git fetch") < cmd.index("git reset") < cmd.index("docker compose")
+
+
+def test_tilde_dir_left_unquoted_for_server_side_expansion():
+    cmd = build_update_command("~/matrix-bridge", "master")
+    assert "cd ~/matrix-bridge &&" in cmd
+    assert "'~" not in cmd  # quoting would defeat the home-dir expansion
+
+
+def test_absolute_dir_and_custom_branch():
+    cmd = build_update_command("/home/modelo/matrix-bridge", "phase-1")
+    assert cmd.startswith("cd /home/modelo/matrix-bridge && ")
+    assert "git reset --hard origin/phase-1 &&" in cmd
+
+
+def test_phase_detection_maps_known_lines():
+    assert _phase_for("HEAD is now at 1a2b3c4 some commit") == "Resetting to the latest release…"
+    assert _phase_for("#5 building image") == "Building the bot image…"
+    assert _phase_for("Container matrix-bridge  Recreate") == "Recreating the container…"
+    assert _phase_for("Already up to date.") == "No new code; rebuilding…"
+    assert _phase_for("some unremarkable line") is None
@@ -0,0 +1,127 @@
+"""shellsafe validators: the API-boundary whitelist behind the v0.19.0 SSH
+command-injection hardening. The quoting *sink* is covered in
+test_launch_command.py; this locks in the *boundary* — that hostile input is
+rejected early, and that a valid value passes through unchanged so callers can
+use `validate_x(v)` inline.
+"""
+import pytest
+
+from app.shellsafe import (
+    validate_container,
+    validate_image,
+    validate_local_path,
+    validate_repo,
+)
+
+# Shell metacharacters that must never survive any validator — these are the
+# actual injection vectors. (Path traversal like "../" is NOT in scope here:
+# validate_image legitimately permits "/" and "." for real image refs such as
+# nvcr.io/nim/...; the defense for images is "no shell metacharacters" + the
+# quote_arg sink, not path-shape. Slash-rejection is tested directly for repo
+# and container, where "/" is disallowed.)
+HOSTILE = [
+    "; rm -rf /",
+    " a b",
+    "$(touch pwned)",
+    "`id`",
+    "x|cat",
+    "x&y",
+    "x>out",
+    "x\nrm",
+]
+
+
+# ---- validate_repo: HF 'org/name', exactly one slash ----
+
+@pytest.mark.parametrize("repo", [
+    "RedHatAI/Qwen3.6-35B-A3B-NVFP4",  # the live production model
+    "org/name",
+    "a.b_c-d/x.y_z-1",
+])
+def test_repo_valid_passes_through_unchanged(repo):
+    assert validate_repo(repo) == repo
+
+
+@pytest.mark.parametrize("repo", [
+    "",
+    "noslash",
+    "a/b/c",        # two slashes
+    "/name",        # empty org
+    "org/",         # empty name
+] + [f"org/name{h}" for h in HOSTILE])
+def test_repo_rejects_malformed_and_hostile(repo):
+    with pytest.raises(ValueError):
+        validate_repo(repo)
+
+
+# ---- validate_image: registry/path:tag@digest ----
+
+@pytest.mark.parametrize("image", [
+    "nvcr.io/nim/nvidia/parakeet-1_1b-ctc-en-us:latest",
+    "ubuntu",
+    "img@sha256:deadbeefcafe",
+    "a.b/c:1.2_3-4",
+])
+def test_image_valid_passes_through_unchanged(image):
+    assert validate_image(image) == image
+
+
+@pytest.mark.parametrize("image", [
+    "",
+    "-leading",     # must start alphanumeric
+    ".leading",
+    "/leading",
+    ":leading",
+    "a" * 513,      # over the 512 cap
+] + [f"img{h}" for h in HOSTILE])
+def test_image_rejects_malformed_and_hostile(image):
+    with pytest.raises(ValueError):
+        validate_image(image)
+
+
+# ---- validate_container: Docker name rule, no slash ----
+
+@pytest.mark.parametrize("name", [
+    "parakeet-asr",
+    "a",
+    "vol_1.2-3",
+])
+def test_container_valid_passes_through_unchanged(name):
+    assert validate_container(name) == name
+
+
+@pytest.mark.parametrize("name", [
+    "",
+    "_leading",     # underscore is not a valid first char
+    "-leading",
+    ".leading",
+    "has/slash",    # slash not allowed in a container name
+    "a" * 129,      # over the 128 cap
+] + [f"name{h}" for h in HOSTILE])
+def test_container_rejects_malformed_and_hostile(name):
+    with pytest.raises(ValueError):
+        validate_container(name)
+
+
+# ---- validate_local_path: absolute model dir, no traversal/metacharacters ----
+
+@pytest.mark.parametrize("path", [
+    "/home/modelo/models/gemma-4-31B-ten31-v2",
+    "/data/models/ft.v2_1",
+    "/srv/m/a-b/c",
+])
+def test_local_path_valid_passes_through_unchanged(path):
+    assert validate_local_path(path) == path
+
+
+@pytest.mark.parametrize("path", [
+    "",
+    "relative/path",            # must be absolute
+    "~/models/x",               # no ~ expansion
+    "/models/../etc/shadow",    # '..' traversal
+    "/models/./x",              # '.' segment
+    "/a" * 300,                 # over the 512 cap (600 chars)
+] + [f"/models/x{h}" for h in HOSTILE])
+def test_local_path_rejects_relative_traversal_and_hostile(path):
+    with pytest.raises(ValueError):
+        validate_local_path(path)
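A whitelist validator with the pass-through behaviour tested above might look like the sketch below. The pattern and the name `validate_repo_sketch` are assumptions made for illustration; the real rules in `app/shellsafe.py` may be stricter (length caps, first-character rules).

```python
import re

# Hypothetical pattern: [A-Za-z0-9._-] on each side of exactly one slash.
_REPO_RE = re.compile(r"[A-Za-z0-9._-]+/[A-Za-z0-9._-]+")


def validate_repo_sketch(value: str) -> str:
    # fullmatch rather than match(...$): "$" also matches just before a
    # trailing newline, which would let "org/name\n" through.
    if not _REPO_RE.fullmatch(value):
        raise ValueError(f"invalid repo: {value!r}")
    # Return the value unchanged so callers can validate inline.
    return value
```

Listing the allowed characters, instead of trying to enumerate dangerous ones, means every entry in `HOSTILE` is rejected without being named in the validator.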
@@ -0,0 +1,120 @@
+"""Configurable topology: DISABLED_SERVICES, vLLM container override, and the
+extra-vLLM probe. All offline — the disabled checks short-circuit before any
+network call, and the probes are exercised only on the not-configured path.
+"""
+import asyncio
+
+from app.config import Settings
+from app.health import (
+    check_embeddings,
+    check_kokoro,
+    check_parakeet,
+    check_qdrant,
+    check_vllm,
+    probe_vllm_endpoint,
+)
+from app.services import services_from_settings
+
+
+def _settings(monkeypatch, **env) -> Settings:
+    # Pin the topology env vars under test; default the rest to blank so a stray
+    # value in the real environment can't leak into the assertion.
+    keys = [
+        "SPARK1_HOST", "SPARK1_USER", "SPARK2_HOST", "SPARK2_USER",
+        "DISABLED_SERVICES", "VLLM_CONTAINER",
+    ]
+    for k in keys:
+        monkeypatch.delenv(k, raising=False)
+    for k, v in env.items():
+        monkeypatch.setenv(k, v)
+    return Settings.from_env()
+
+
+# ---- DISABLED_SERVICES parsing ----
+
+def test_disabled_services_parsed_lowercased_and_trimmed(monkeypatch):
+    s = _settings(monkeypatch, DISABLED_SERVICES="parakeet, Kokoro ,,")
+    assert s.disabled_services == frozenset({"parakeet", "kokoro"})
+
+
+def test_disabled_services_blank_is_empty(monkeypatch):
+    assert _settings(monkeypatch).disabled_services == frozenset()
+
+
+# ---- vLLM container override ----
+
+def test_vllm_container_defaults_to_vllm_node(monkeypatch):
+    assert _settings(monkeypatch).vllm_container == "vllm_node"
+
+
+def test_vllm_container_override(monkeypatch):
+    assert _settings(monkeypatch, VLLM_CONTAINER="vllm-gemma4").vllm_container == "vllm-gemma4"
+
+
+def test_vllm_container_invalid_falls_back(monkeypatch):
+    # A malformed value (space / shell metachar) is rejected at the boundary and
+    # falls back to the default rather than crashing startup or reaching a sink.
+    assert _settings(monkeypatch, VLLM_CONTAINER="bad name; rm -rf").vllm_container == "vllm_node"
+
+
+# ---- services map honors the disable list ----
+
+def test_services_from_settings_drops_disabled(monkeypatch):
+    s = _settings(
+        monkeypatch,
+        SPARK1_HOST="10.0.0.1", SPARK1_USER="u",
+        SPARK2_HOST="10.0.0.2", SPARK2_USER="u",
+        DISABLED_SERVICES="parakeet,qdrant",
+    )
+    svcs = services_from_settings(s)
+    assert "parakeet" not in svcs and "qdrant" not in svcs
+    assert "kokoro" in svcs and "embeddings" in svcs
+
+
+def test_custom_vllm_service_registered(monkeypatch):
+    from app import custom_services
+    monkeypatch.setattr(custom_services, "load_custom_services", lambda: [
+        {"key": "vllm-spark2", "kind": "vllm", "host": "10.0.0.2",
+         "user": "u", "container": "vllm_node", "port": 8000},
+    ])
+    s = _settings(monkeypatch, SPARK1_HOST="10.0.0.1", SPARK1_USER="u",
+                  SPARK2_HOST="10.0.0.2", SPARK2_USER="u")
+    svc = services_from_settings(s)["vllm-spark2"]
+    assert svc.kind == "vllm" and svc.port == 8000 and svc.container == "vllm_node"
+
+
+def test_custom_service_colliding_with_builtin_is_ignored(monkeypatch):
+    # A custom entry can't shadow a built-in key — the built-in wins.
+    from app import custom_services
+    monkeypatch.setattr(custom_services, "load_custom_services", lambda: [
+        {"key": "parakeet", "kind": "vllm", "host": "10.0.0.9", "user": "u", "port": 8000},
+    ])
+    s = _settings(monkeypatch, SPARK1_HOST="10.0.0.1", SPARK1_USER="u",
+                  SPARK2_HOST="10.0.0.2", SPARK2_USER="u")
+    assert services_from_settings(s)["parakeet"].kind == "stt"
+
+
+# ---- disabled health checks short-circuit (no network) ----
+
+def test_disabled_check_returns_disabled_verdict(monkeypatch):
+    s = _settings(
+        monkeypatch,
+        SPARK2_HOST="10.0.0.2", SPARK2_USER="u",  # host set, but disable wins
+        DISABLED_SERVICES="parakeet,kokoro,embeddings,qdrant",
+    )
+    for check in (check_parakeet, check_kokoro, check_embeddings, check_qdrant):
+        r = asyncio.run(check(s))
+        assert r == {"ok": False, "disabled": True, "error": "disabled", "base_url": None}
+
+
+# ---- vLLM probe: not-configured path is pure ----
+
+def test_probe_vllm_endpoint_unconfigured(monkeypatch):
+    r = asyncio.run(probe_vllm_endpoint("", 8000))
+    assert r["ok"] is False and "not configured" in r["error"]
+
+
+def test_check_vllm_unconfigured_without_spark1(monkeypatch):
+    s = _settings(monkeypatch)  # no SPARK1_HOST
+    r = asyncio.run(check_vllm(s))
+    assert r["ok"] is False and "spark1 not configured" in r["error"]
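The parsing behaviour asserted for `DISABLED_SERVICES` (comma-split, trimmed, lowercased, empties dropped) fits in one expression. The function below is a hypothetical stand-in, not the code in `Settings.from_env`.

```python
def parse_disabled_services(raw: str) -> frozenset[str]:
    # Split on commas, trim whitespace, lowercase, and drop empty entries.
    return frozenset(p.strip().lower() for p in raw.split(",") if p.strip())
```

Returning a `frozenset` makes membership checks cheap and keeps the parsed value immutable once settings are loaded.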
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2026 Grant
+Copyright (c) 2026 Alice

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -1,3 +1,14 @@
 ARCHES := x86
 # overrides to s9pk.mk must precede the include statement
 include s9pk.mk
+
+# Publish the built s9pk to Gitea Releases (adopters pull it with a read-only
+# token instead of being hand-sent the package). Needs GITEA_URL + GITEA_TOKEN;
+# the vX.Y.Z git tag must already be pushed. See ../scripts/gitea-release.sh.
+RELEASE_VERSION := $(shell sed -n "s/.*version: '\([^']*\)'.*/\1/p" startos/versions/v0_1_0.ts)
+
+.PHONY: release
+release:
+	@test -f "$(PACKAGE_ID)_x86_64.s9pk" || { echo "Build first: make x86"; exit 1; }
+	GITEA_URL="$(GITEA_URL)" GITEA_TOKEN="$(GITEA_TOKEN)" \
+	  ../scripts/gitea-release.sh "$(RELEASE_VERSION)" "$(PACKAGE_ID)_x86_64.s9pk"
@@ -40,6 +40,33 @@ const inputSpec = InputSpec.of({
    placeholder: 'your SSH username',
    masked: false,
  }),
+  vllm_port: Value.text({
+    name: 'vLLM port (optional)',
+    description:
+      "The port your vLLM server listens on, on Spark 1 — used by the health check and the chat proxy. Leave blank to use 8888, which is what the bundled launch-cluster.sh wrapper uses. Set this to 8000 (vLLM's own default) or another port if your vLLM listens elsewhere.",
+    required: false,
+    default: null,
+    placeholder: 'leave blank for 8888',
+    masked: false,
+  }),
+  vllm_container: Value.text({
+    name: 'vLLM container name (optional)',
+    description:
+      'Docker container name for the swappable vLLM on Spark 1. Defaults to "vllm_node" (what the bundled launch-cluster.sh creates). Change this only if you run your vLLM under a different container name — the model-swap log view and the pre-flight validator exec into it by name.',
+    required: false,
+    default: null,
+    placeholder: 'leave blank for vllm_node',
+    masked: false,
+  }),
+  disabled_services: Value.text({
+    name: 'Services to hide (optional)',
+    description:
+      "Comma-separated list of built-in services your cluster doesn't run, so Spark Control hides their tiles and stops probing them. Valid names: parakeet, kokoro, embeddings, qdrant. Example: if you only run vLLM, set this to 'parakeet,kokoro,embeddings,qdrant'. Leave blank to monitor all of them. (Useful when, say, your vLLM shares port 8000 with Parakeet's default — hide Parakeet so its probe doesn't hit vLLM.)",
+    required: false,
+    default: null,
+    placeholder: 'e.g. parakeet,kokoro',
+    masked: false,
+  }),
  parakeet_host: Value.text({
    name: 'Parakeet host (optional)',
    description:
@@ -119,6 +146,15 @@ const inputSpec = InputSpec.of({
    placeholder: 'e.g. crm_chunks',
    masked: false,
  }),
+  matrix_bridge_user: Value.text({
+    name: 'matrix-bridge bot SSH user (optional)',
+    description:
+      "If you run the matrix-bridge Matrix bot on Spark 2, enter the SSH user that owns its ~/matrix-bridge folder (e.g. 'modelo'). Spark Control then shows a tile to update, restart, and view logs for the bot. Leave blank if you don't run the bot — the tile stays hidden. Note: this package's SSH public key must be authorized for that user (Show Public Key action) unless it's the same as your Spark 2 user.",
+    required: false,
+    default: null,
+    placeholder: 'e.g. modelo',
+    masked: false,
+  }),
  open_webui_url: Value.text({
    name: 'Open WebUI URL (optional)',
    description:
@@ -137,6 +173,24 @@ const inputSpec = InputSpec.of({
    placeholder: 'starts with "nvapi-..."',
    masked: true,
  }),
+  swap_webhook_url: Value.text({
+    name: 'Swap webhook URL (optional)',
+    description:
+      'If you run automation that needs to know when the loaded model changes, paste a URL here. Spark Control POSTs a small JSON event (swap_complete / swap_failed) to it after every model swap, so the consumer can re-point its config to the new model. Leave blank to disable. Only needed if something other than this dashboard cares about swaps.',
+    required: false,
+    default: null,
+    placeholder: 'e.g. https://my-service.local/spark-swap',
+    masked: false,
+  }),
+  swap_webhook_secret: Value.text({
+    name: 'Swap webhook secret (optional)',
+    description:
+      'Optional shared secret. If set, each webhook is signed with an "X-Spark-Signature: sha256=…" header (HMAC of the body) so the receiver can verify it really came from Spark Control. Leave blank to send the webhook unsigned.',
+    required: false,
+    default: null,
+    placeholder: 'a random string the receiver also knows',
+    masked: true,
+  }),
 })

 export const configureSparks = sdk.Action.withInput(
@@ -7,6 +7,13 @@ export const sparkConfigSchema = z.object({
  spark1_user: z.string().catch(''),
  spark2_host: z.string().catch(''),
  spark2_user: z.string().catch(''),
+  // Optional vLLM port override (Spark 1). Blank => 8888 (launch-cluster.sh default).
+  vllm_port: z.string().catch(''),
+  // Optional vLLM container-name override (Spark 1). Blank => "vllm_node".
+  vllm_container: z.string().catch(''),
+  // Optional comma-separated list of built-in services to switch off
+  // (parakeet, kokoro, embeddings, qdrant). Blank => all enabled.
+  disabled_services: z.string().catch(''),
  // Optional per-service overrides. Blank => use spark2_host / spark2_user.
  parakeet_host: z.string().catch(''),
  parakeet_user: z.string().catch(''),
@@ -22,10 +29,17 @@ export const sparkConfigSchema = z.object({
  qdrant_user: z.string().catch(''),
  qdrant_container: z.string().catch(''),
  qdrant_collection: z.string().catch(''),
+  // Optional matrix-bridge bot. Blank => no tile. Host reuses Spark 2.
+  matrix_bridge_user: z.string().catch(''),
  // Optional Open WebUI deep-link
  open_webui_url: z.string().catch(''),
  // Optional NGC API key for pulling NIM containers from nvcr.io/nim/...
  ngc_api_key: z.string().catch(''),
+  // Optional coordination webhook: POSTed on swap_complete/swap_failed so
+  // downstream consumers re-point their model config. Blank => disabled.
+  swap_webhook_url: z.string().catch(''),
+  // Optional shared secret; if set, the webhook body is HMAC-signed.
+  swap_webhook_secret: z.string().catch(''),
 })

 export type SparkConfig = z.infer<typeof sparkConfigSchema>
@@ -13,6 +13,9 @@ export const main = sdk.setupMain(async ({ effects }) => {
    spark1_user: '',
    spark2_host: '',
    spark2_user: '',
+    vllm_port: '',
+    vllm_container: '',
+    disabled_services: '',
    parakeet_host: '',
    parakeet_user: '',
    parakeet_container: '',
@@ -26,8 +29,11 @@ export const main = sdk.setupMain(async ({ effects }) => {
    qdrant_user: '',
    qdrant_container: '',
    qdrant_collection: '',
+    matrix_bridge_user: '',
    open_webui_url: '',
    ngc_api_key: '',
+    swap_webhook_url: '',
+    swap_webhook_secret: '',
  }

  return sdk.Daemons.of(effects).addDaemon('primary', {
@@ -49,6 +55,9 @@ export const main = sdk.setupMain(async ({ effects }) => {
        SPARK1_USER: cfg.spark1_user,
        SPARK2_HOST: cfg.spark2_host,
        SPARK2_USER: cfg.spark2_user,
+        VLLM_PORT: cfg.vllm_port,
+        VLLM_CONTAINER: cfg.vllm_container,
+        DISABLED_SERVICES: cfg.disabled_services,
        PARAKEET_HOST: cfg.parakeet_host,
        PARAKEET_USER: cfg.parakeet_user,
        PARAKEET_CONTAINER: cfg.parakeet_container,
@@ -62,11 +71,14 @@ export const main = sdk.setupMain(async ({ effects }) => {
        QDRANT_USER: cfg.qdrant_user,
        QDRANT_CONTAINER: cfg.qdrant_container,
        QDRANT_COLLECTION: cfg.qdrant_collection,
+        MATRIX_BRIDGE_USER: cfg.matrix_bridge_user,
        MODELS_OVERRIDES: '/data/models-overrides.yaml',
        SERVICES_OVERRIDES: '/data/services-overrides.yaml',
        CONNECTIVITY_LOG: '/data/connectivity.json',
        OPEN_WEBUI_URL: cfg.open_webui_url,
        NGC_API_KEY: cfg.ngc_api_key,
+        SWAP_WEBHOOK_URL: cfg.swap_webhook_url,
+        SWAP_WEBHOOK_SECRET: cfg.swap_webhook_secret,
        BIND_PORT: String(uiPort),
      },
    },
@@ -1,10 +1,10 @@
 import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'

 export const v0_1_0 = VersionInfo.of({
-  version: '0.19.0:0',
+  version: '0.25.0:0',
  releaseNotes: {
    en_US:
-      'v0.19.0:0 — security hardening of the cluster-control surface (no change to the proxy/data APIs your other apps use). (1) Every user-supplied value that reaches an SSH command on the Sparks — model repo, vLLM args/knobs, NIM image/container, service names — is now strictly validated and shell-quoted, closing a command-injection path. (2) The Qdrant collection name in /api/search is validated so it can no longer be used to reach other collections. (3) State-changing dashboard endpoints (model swap, NIM install, service start/stop, disk delete, etc.) now require a same-origin request, blocking cross-site (CSRF) attacks from a malicious page open in your browser. The OpenAI-compatible proxies (/v1/*), the redaction gateway (/scrub, /rehydrate), /api/search, /api/audio/*, and /api/health-event are exempt, so Recap Relay, the CRM, Open WebUI and other consumers are unaffected.',
+      "v0.25.0:0 — cluster coordination layer (GPU arbiter). For clusters where automation, not just this dashboard, swaps models. Three additions: (1) Swap reservation lock — an external scheduler can reserve the GPU swap path (POST /api/swap/lock) and gets a secret token; while held, any swap without the token is refused (423), so the dashboard's manual swap is paused and shows who holds the GPU and until when (with a human Release override). The lock is TTL-bounded and self-frees. (2) Swap webhook — set a URL (and optional signing secret) in Configure Sparks; Spark Control POSTs a swap_complete / swap_failed event after each swap so downstream consumers re-point their model config. (3) Schedule registry — your automation can register its cron jobs (POST /api/schedule) for a read-only \"Scheduled jobs\" panel on the dashboard; Spark Control only displays them, it never runs them. New API: /api/swap/lock (GET/POST/DELETE), /api/schedule (GET/POST/DELETE). See docs/COORDINATION.md. Spark Control remains a control plane, not a job runner — business pipelines stay in their own services and call the swap API.",
  },
  migrations: {
    up: async ({ effects }) => {},
@@ -34,6 +34,44 @@ These take effect on the **next swap to that model**. If a swap fails after this
 - Status auto-refreshes every 5 s.
 - A swap takes 3–6 minutes depending on the model. Don't close the tab — but if you do, the swap continues; reopen and you'll re-attach to the log stream.

+## matrix-bridge bot tile (optional)
+
+If you run the matrix-bridge bot container on Spark 2, set its SSH user in **Configure Sparks** (e.g. the user that owns `~/matrix-bridge`) and a tile appears under "Always-on services" with status, Update, Restart, Stop/Start, and View logs. Status is docker-state only (no HTTP health), so a `running` badge means the container is up, not necessarily that the bot is connected.
+
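+The **Update** button runs `cd ~/matrix-bridge && git fetch origin && git reset --hard origin/<branch> && docker compose up -d --build` as that SSH user. Every step is chained with `&&`, so a failed fetch aborts before the build. For it to reach your git remote: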
+
+1. `~/matrix-bridge` must be a clone of the repo (not loose files). Gitignored secrets (`.env`, etc.) survive a `git reset --hard`.
+2. If that user has more than one SSH key, pin the remote's key so git doesn't offer the wrong one first (a common `Permission denied (publickey)` cause). In the user's `~/.ssh/config`:
+
+   ```
+   Host <your-git-host>
+       Port <port>
+       IdentityFile ~/.ssh/id_ed25519
+       IdentitiesOnly yes
+   ```
+
+3. Spark Control's own package key must be authorized for that SSH user (Show Public Key → add to their `authorized_keys`) unless it's the same user Spark Control already uses for that Spark.
+
+## Configurable topology (v0.24.0+)
+
+For a cluster wired differently from the reference layout, there are three optional knobs in **Configure Sparks** (no fork needed):
+
+- **vLLM container name** — defaults to `vllm_node`. Set it if your swappable vLLM on Spark 1 runs under a different container name; the swap log-tail and the pre-flight validator `docker exec` into it by name.
+- **Services to hide** — a comma-separated list of service keys (`parakeet,kokoro,embeddings,qdrant`). Hidden services show no tile and are never probed (status, deep-health, or connectivity log). Use this when a service you don't run would otherwise be probed at a port that something else answers on, for example a vLLM on port 8000 colliding with Parakeet's default.
+- **Monitor a second vLLM** — the swap machinery only drives the Spark 1 vLLM, but you can *monitor* a vLLM on another Spark by adding a custom service of `kind: vllm` to `/data/services-overrides.yaml`:
+
+  ```yaml
+  custom:
+    - key: vllm-spark2
+      kind: vllm
+      host: <spark-2-ip>
+      user: <ssh-user>
+      container: vllm_node
+      port: 8000
+  ```
+
+  It gets a read-only tile: loaded model (via `/v1/models`), container state, and start/stop/restart. (Spark Control's SSH key must be authorized for that user — Show Public Key.)
+
 ## Adding a new model

 1. Add an entry to `image/models.yaml`. Required fields: `display_name`, `repo`, `size_gb`, `mode` (`solo` or `cluster`), `vllm_args`. Optional but recommended: `description` (one paragraph — what the model is, what it's good for, how it differs from others; renders below the meta tags in each card), `capabilities` (tags like `[vision, reasoning, tools]`), `expected_ready_seconds`.
@@ -42,6 +80,12 @@ These take effect on the **next swap to that model**. If a swap fails after this

 If `description` is omitted, the card simply hides that section — no need to populate it for every model. Keep descriptions generic (not user-specific) so the catalog stays portable.

+### Local / fine-tuned models (v0.23.0+)
+
+For a model that lives as a directory on a Spark (e.g. a LoRA-merged fine-tune) rather than in an HF repo, use the **"+ Add local model"** button under LLM swap, or add a `custom:` entry to the override YAML with `local_path` in place of `repo`. The directory must already exist on the Spark; only its parent dir is mounted, so a `--chat-template` must live **inside** `local_path`.
+
+**Load-bearing contract:** on swap, spark-control prefixes the launch with `VLLM_SPARK_EXTRA_DOCKER_ARGS="-v <path>:<path>"` so `launch-cluster.sh` bind-mounts the dir into the vLLM container at the same path. This relies on the upstream `eugr/spark-vllm-docker` `launch-cluster.sh` expanding `$VLLM_SPARK_EXTRA_DOCKER_ARGS` **unquoted** into its `docker run` (verified against the on-Spark script 2026-06-17: line ~11 appends it to `DOCKER_ARGS`, used unquoted in `docker run`). If a future upstream version quotes that variable, local-model mounts would silently fail — re-check this before pulling launch-cluster.sh updates.
+
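A minimal sketch of why the unquoted expansion matters, assuming a POSIX shell with the default `IFS`. `count_args` and the `/models/my-finetune` path are hypothetical stand-ins for illustration. This is not the upstream `launch-cluster.sh`, only the word-splitting behavior the contract relies on:

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

# Same shape as the value spark-control sets; the path is an illustrative assumption.
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /models/my-finetune:/models/my-finetune"

# Unquoted: the shell splits on whitespace, giving "-v" and "<path>:<path>" separately.
unquoted=$(count_args $VLLM_SPARK_EXTRA_DOCKER_ARGS)

# Quoted: the whole string arrives as one argument, flag and value fused together.
quoted=$(count_args "$VLLM_SPARK_EXTRA_DOCKER_ARGS")

echo "unquoted=$unquoted quoted=$quoted"   # prints: unquoted=2 quoted=1
```

The same splitting applies to the path itself: a `local_path` containing whitespace would be broken apart in the unquoted form, so such paths are best avoided.
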
 ## Manual swap fallback

 If the UI is unavailable and you need to swap by hand:
@@ -57,6 +101,17 @@ cd ~/spark-vllm-docker
 docker logs -f vllm_node      # wait for "Application startup complete."
 ```

+## Sideload (`make install`) can't reach the server
+
+Symptom: `make install` fails with `package.sideload: error sending request for url (https://immense-voyage.local/rpc/v1)`. Cause seen 2026-06-17: `immense-voyage.local` stopped resolving via mDNS from the Mac (`curl https://immense-voyage.local/...` → exit 6, "couldn't resolve host"), even though the server is up — `curl -sk https://<server-ip>/rpc/v1` returns 200.
+
+- **Don't** work around it with `start-cli -H https://<server-ip> package install`: TLS connects but it returns `UNAUTHORIZED`, because start-cli's stored credential is bound to the registered `.local` host, not the IP.
+- **Fix:** make the name resolve again, then re-run `make install`:
+  - `sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder` (flush mDNS), or
+  - `echo "<server-ip> immense-voyage.local" | sudo tee -a /etc/hosts` (deterministic; remove later).
+
+Note this only blocks installing to *your own* Start9 — building and publishing the s9pk to Gitea Releases is unaffected (adopters still pull the latest).
+
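The `/etc/hosts` workaround appends a new line every time it is run. Below is a minimal sketch of an idempotent variant, assuming a POSIX shell and `awk`. `add_host_entry` is a hypothetical helper name, not something shipped in this repo, and `<server-ip>` remains a placeholder to fill in:

```shell
# add_host_entry IP NAME [FILE]: hypothetical helper that appends "IP NAME" only
# if NAME is not already listed as a hostname in FILE (default /etc/hosts).
add_host_entry() {
  ahe_ip="$1"; ahe_name="$2"; ahe_file="${3:-/etc/hosts}"
  # Strip comments, then look for NAME in any hostname column (fields 2..NF).
  if awk -v n="$ahe_name" '
       { sub(/#.*/, ""); for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
       END { exit !found }' "$ahe_file"; then
    echo "already present: $ahe_name"
    return 0
  fi
  if [ -w "$ahe_file" ]; then
    printf '%s %s\n' "$ahe_ip" "$ahe_name" >> "$ahe_file"
  else
    # /etc/hosts is normally root-owned, so fall back to sudo tee.
    printf '%s %s\n' "$ahe_ip" "$ahe_name" | sudo tee -a "$ahe_file" > /dev/null
  fi
}
```

Call it as `add_host_entry <server-ip> immense-voyage.local`, re-run `make install`, and remove the line once mDNS resolves again.
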
 ## Diagnostics

 ```bash
@@ -0,0 +1,65 @@
+#!/usr/bin/env bash
+# Publish a built Spark Control s9pk to Gitea Releases, so adopters can pull the
+# latest package with a read-only token instead of being hand-sent the file.
+#
+#   GITEA_URL=https://gitea.example:3000 GITEA_TOKEN=<write-token> \
+#     scripts/gitea-release.sh 0.22.0:0 package/spark-control_x86_64.s9pk
+#
+# The git tag (vX.Y.Z, derived from the version) must already exist and be pushed
+# (`git tag v0.22.0 && git push gitea v0.22.0`). Re-running is idempotent: it
+# reuses an existing release for the tag and replaces a same-named asset.
+# Set GITEA_INSECURE=1 to skip TLS verification (self-signed cert on a LAN box).
+set -euo pipefail
+
+VERSION="${1:-}"; S9PK="${2:-}"
+[ -n "$VERSION" ] && [ -n "$S9PK" ] || {
+  echo "usage: GITEA_URL=.. GITEA_TOKEN=.. $0 <version e.g. 0.22.0:0> <s9pk path>" >&2; exit 2; }
+: "${GITEA_URL:?set GITEA_URL to your Gitea base URL, e.g. https://gitea.lan:3000}"
+: "${GITEA_TOKEN:?set GITEA_TOKEN to a token with repository read+write access}"
+[ -f "$S9PK" ] || { echo "s9pk not found: $S9PK" >&2; exit 1; }
+
+TAG="v${VERSION%%:*}"                      # 0.22.0:0 -> v0.22.0
+ASSET="$(basename "$S9PK")"
+SLUG="$(git remote get-url gitea | sed -E 's#\.git$##; s#.*[:/]([^/:]+/[^/]+)$#\1#')"  # grant/spark-control; also works when the remote URL has no .git suffix
+API="${GITEA_URL%/}/api/v1/repos/${SLUG}"
+CURL=(curl -sS)                            # no -f: we inspect HTTP codes ourselves
+[ "${GITEA_INSECURE:-}" = "1" ] && CURL+=(-k)
+
+echo "repo ${SLUG} | tag ${TAG} | asset ${ASSET} | ${GITEA_URL}"
+
+# api METHOD URL [extra curl args...] -> sets globals HTTP_CODE and BODY
+api() {
+  local method="$1" url="$2"; shift 2
+  local out
+  out="$("${CURL[@]}" -X "$method" -H "Authorization: token ${GITEA_TOKEN}" "$@" \
+        -w $'\n%{http_code}' "$url")"
+  HTTP_CODE="${out##*$'\n'}"
+  BODY="${out%$'\n'*}"
+}
+
+# Reuse an existing release for this tag, otherwise create one.
+api GET "$API/releases/tags/$TAG"
+if [ "$HTTP_CODE" = 200 ]; then
+  id="$(printf '%s' "$BODY" | jq -r '.id')"
+elif [ "$HTTP_CODE" = 404 ]; then
+  api POST "$API/releases" -H 'Content-Type: application/json' \
+    --data "$(jq -n --arg t "$TAG" --arg n "$VERSION" \
+      '{tag_name:$t, name:$n, body:("Spark Control "+$n+". See AGENTS.md / release notes.")}')"
+  [ "$HTTP_CODE" = 201 ] || { echo "create release failed (HTTP $HTTP_CODE): $BODY" >&2; exit 1; }
+  id="$(printf '%s' "$BODY" | jq -r '.id')"
+else
+  echo "release lookup failed (HTTP $HTTP_CODE) — check GITEA_URL and the token's scope: $BODY" >&2
+  exit 1
+fi
+[ -n "$id" ] && [ "$id" != null ] || { echo "could not parse release id: $BODY" >&2; exit 1; }
+
+# Replace a same-named asset so re-runs don't 409.
+api GET "$API/releases/$id/assets"
+old="$(printf '%s' "$BODY" | jq -r --arg n "$ASSET" '.[]? | select(.name==$n) | .id')"
+[ -n "$old" ] && { api DELETE "$API/releases/$id/assets/$old"; }
+
+api POST "$API/releases/$id/assets?name=$ASSET" \
+  -F "attachment=@${S9PK};type=application/octet-stream"
+[ "$HTTP_CODE" = 201 ] || { echo "asset upload failed (HTTP $HTTP_CODE): $BODY" >&2; exit 1; }
+
+echo "published: ${GITEA_URL%/}/${SLUG}/releases/tag/${TAG}"