docs: record git-history scrub; fix stale SHAs and IP-fragment remnants

History was rewritten with git filter-repo to purge owner-specific values (IPs, hostnames, SSH username, key name, personal names) from all commits, tags, and messages, including three LAN IPs and one Start9 address the v0.18.0:1 working-tree scrub had missed (one still live in HEAD at docs/AUDIO_API.md). Verified 0 hits across all refs.

- AGENTS.md: Portability + Repo-wart + work-queue #2 + shipping note updated; commit-SHA references repointed to post-rewrite SHAs (367d986->8d839e3).
- EVALUATION.md: P0 owner-data finding marked resolved; cleaned shorthand IP-octet fragments (/.87, /11) left by the placeholder substitution.
2026-06-12 16:53:57 -05:00
25 changed files with 59 additions and 1095 deletions
@@ -11,11 +11,5 @@ node_modules/
 dist/
 build/
 .DS_Store
 # Claude Code — deny by default, allow-list shared wiring (see standards/portability.md)
 .claude/*
 !.claude/rules/
 !.claude/agents/
 !.claude/commands/
 !.claude/skills/
 !.claude/settings.json
@@ -6,9 +6,6 @@ Browser-based StartOS 0.4 package controlling a dual NVIDIA DGX Spark AI cluster
 Subsystem guidance lives in `docs/guides/` and loads when matching files are touched (Claude Code lazy-loads via `.claude/rules/` symlinks; other agents read the guides directly): `startos-package.md` (build/versioning, `package/**`), `fastapi-image.md` (dev server/env/layout, `image/**`), `redaction.md` (vendoring + test gates), `audio-speech.md` (parakeet patches, cluster-container footguns, audio testing). **Read `docs/guides/audio-speech.md` before touching the Sparks' containers over SSH** — ops sessions don't trip the path scoping.
 > **Inbox check:** At session start, if `~/Projects/standards/INBOX.md` exists, scan it for
 > items tagged `(spark-control)` and surface them before proposing next steps; triage with `/triage`.
 ## Stack
 - Two halves, always coordinated:
@@ -23,7 +20,6 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou
 ```bash
 (cd package && make x86)                                  # build the s9pk; make install sideloads (restarts live service — ask first)
 (cd image && uvicorn app.server:app --port 9999)          # local dev — needs env vars, see fastapi-image rule
 (cd image && .venv/bin/python -m pytest)                          # offline unit suite (launch-cmd injection, label-merge)
 (cd image && .venv/bin/python -m app.redaction.test_gateway)      # offline redaction suite 1
 (cd image && .venv/bin/python app/redaction/test_scrub_leak.py)   # offline redaction suite 2
 ./scripts/test-audio-with-speakers.sh <audio-file>        # e2e audio — hits the LIVE cluster
@@ -55,12 +51,37 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou
 ## Current state
- **Working (v0.21.0:1, installed and serving):** swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel; per-Spark SSH-key copy + WireGuard `VPN <ip>` hardware-card badge. Spark 2 audio stack healthy. Security hardening (v0.19.0:0 — shellsafe SSH-injection guard, Qdrant path-injection, same-origin CSRF guard) shipped and stable; evidence in `EVALUATION.md`.
+- **Working (v0.18.0:0, installed and serving):** swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel mode. Spark 2 audio stack is healthy (11k+ requests/12h, all 200).
- **matrix-bridge bot tile (done, v0.21.0:1, verified live):** `bot`-kind service tile — status badge from docker-state only (no HTTP port), plus **Update** / Restart / Stop/Start / **View logs**. Code: `app/matrix_bridge.py` + `/api/matrix-bridge/{update,logs}` (update streams; 25-min cap; fail-loud). Driven directly as `modelo` on Spark 2 (**no `sudo -iu`** — spark2 has no passwordless sudo). User is a blank-default Configure-Sparks field (`matrix_bridge_user`); blank → tile hidden (portable). Host reuses `spark2_host` (`192.168.1.87` = the bot's box `spark-32d0`); container/dir/branch are env-overridable defaults. **Load-bearing ops dep:** Update's `git fetch` runs as `modelo`, which needs `modelo`'s `~/.ssh/config` pinning the Gitea deploy key with `IdentitiesOnly yes` — else the wrong key is offered and Gitea denies (publickey). Optional next, only if the bot dev asks: Docker `HEALTHCHECK` for running-but-disconnected detection (spec §Note).
+- **In progress — Signal Engine "flakiness":** diagnosed, not a server bug — transient 1–4s unresponsiveness while the single GPU is continuously busy. Remedy is client-side; a drafted message (in-flight cap 2, hard ceiling 3 global across audio endpoints, retry-with-backoff on timeout/503) is with the owner to forward to that dev.
- **Tests:** offline pytest harness in `image/tests/` — `cd image && .venv/bin/python -m pytest` (70 passing). Covers `build_launch_command` (incl. the shell-injection round-trip), the transcript↔diarizer label-merge, the `shellsafe` validators, and `matrix_bridge.build_update_command` (+ phase detection). Mock-heavy swap/proxy tests deliberately skipped (low ROI). Redaction + live-audio suites remain standalone scripts.
+- **Decided, not implemented:** remote access stays WireGuard/Tailscale split-tunnel — no public interface, so no API auth built; an empirical concurrency sweep is offered but needs the owner's explicit OK in a quiet window. **Revisit (full-eval 2026-06-12):** the "LAN-only, so no auth" call is now load-bearing against RCE — unquoted user input reaches the SSH shell on several endpoints, so the network boundary is the *only* thing preventing cluster takeover. Quoting the injection sinks (work queue) is needed regardless of the auth decision; a defense-in-depth auth/CSRF gate is the follow-on.
- **Signal Engine "flakiness":** diagnosed as *not* a server bug — transient 1–4s unresponsiveness while the single GPU is busy. Client-side remedy (in-flight cap 2 / ceiling 3 / retry-on-timeout+503) drafted and **forwarded to that dev (owner confirmed 2026-06-15)**. Awaiting whether they want the measured concurrency knee.
+- **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; the connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers.
- **Stance (decided, not built):** no public interface / no API-token auth — LAN + WireGuard/Tailscale split-tunnel only; the CSRF guard covers the browser-driven vector.
+- **Portability:** working tree + **full git history** scrubbed 2026-06-12 — all owner-specific IPs/hostnames/usernames/names replaced with placeholders; `claude-code-starter-prompt.md` deleted (old build-time prompt). Real cluster values live only in StartOS install config, shell env vars, and the gitignored `settings.local.json`. History was rewritten with `git filter-repo` (every commit SHA changed; force-pushed to `gitea`); a token sweep across all refs now returns 0 for every owner-specific value, including three LAN IPs + one Start9 address the original working-tree scrub had missed. Backup bundle of the pre-rewrite history is at `../spark-control-prehistory-rewrite.bundle`. **Still owner-side:** the `<ssh-key>` SSH key was leaked by *name* only (not material) — rotate it if it's still authorized on the Sparks.
- **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers; matrix-bridge badge won't visibly flip on a fast `docker restart` (status re-checked only after the command returns).
+- **Repo wart:** commit `8d839e3` (was `367d986` before the 2026-06-12 history rewrite) is labeled `v0.13.0:4` but actually contains everything through v0.18.0:0 — per-version commits for v0.14–v0.18 are missing. Keep commit messages accurate going forward.
- **Infra gotcha (safety):** passwordless sudo is NOT configured on spark2 — design unprivileged probes for any Spark feature (the badge uses `ip`, not `sudo wg show`). spark2 sits on the `starttunnel` WireGuard subnet (`10.59.211.6/24`, survives reboot). Owner declined SSH-key rotation after the 2026-06-12 history scrub (only the key *name* leaked) — don't re-flag.
+- **Hosting:** repo pushes to the owner's self-hosted Gitea — remote `gitea`, branch `master`, over SSH (host alias + key live in the local `~/.ssh/config`; no owner-specific details belong in the repo). Push there after committing.
- **Hosting:** self-hosted Gitea — remote `gitea`, branch `master`, over SSH; push after committing. (Wart: commit `8d839e3` is mislabeled `v0.13.0:4` but contains through v0.18.0:0.)
+- **Next (pre-eval backlog):** (1) owner forwards the concurrency note to the Signal Engine dev; (2) run the concurrency sweep if the dev wants the measured knee; (3) add the `--memory` cap to parakeet-asr via the Reapply-patches action; (4) pick the next item from ROADMAP.md.
- **Next — committed 2026-06-17: OpenClaw/Johnny-5 coexistence epic (full plan + design stance in `ROADMAP.md` → "Cluster coordination").** Stance: Spark Control = control plane / GPU arbiter, **not** a job runner; business cron jobs live in separate services that *call* its swap API (swaps are already API-driven via `POST /api/swap`). Sequence: (1) **configurable `VLLM_PORT`** — DONE in tree, staged as **v0.22.0:0** (Configure-Sparks field, blank ⇒ 8888; + `_env_int` hardening in `config.py` so a blank/bad port no longer crashes startup, killing a P3 tech-debt item). **Not yet built/installed/committed — awaiting go/no-go.** (2) local-path/fine-tuned models (in ROADMAP under Dashboard). (3) configurable topology (service→Spark→port map + container names). (4) coordination layer (swap lock + swap webhook + schedule visibility) — only when our own automation lands. Still-open older threads: audio concurrency sweep (only if the Signal Engine dev wants the knee; needs a quiet window); optional matrix-bridge Docker `HEALTHCHECK` if the bot dev asks; Parakeet long-audio guard deferred (rationale in ROADMAP).
+
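The client-side remedy noted for the Signal Engine in the state list above (in-flight cap of 2, retry with backoff on timeout/503) might look roughly like the sketch below. It is illustrative only: `CappedClient` and the `send` callable are hypothetical names, the hard ceiling of 3 across endpoints is not modeled, and the delays are placeholder values rather than anything measured.

```python
import asyncio
import random
from typing import Awaitable, Callable

IN_FLIGHT_CAP = 2            # cap taken from the note above
RETRYABLE_STATUS = {503}     # timeouts are handled separately below


class CappedClient:
    """Hypothetical wrapper: bounds concurrent requests and retries transient failures."""

    def __init__(self, send: Callable[[], Awaitable[int]], cap: int = IN_FLIGHT_CAP,
                 max_attempts: int = 4, base_delay: float = 0.5) -> None:
        self._send = send                      # assumed: performs one request, returns the HTTP status
        self._sem = asyncio.Semaphore(cap)
        self._max_attempts = max_attempts
        self._base_delay = base_delay

    async def request(self) -> int:
        """Return the final status; re-raise the timeout if the last attempt times out."""
        for attempt in range(self._max_attempts):
            last = attempt == self._max_attempts - 1
            try:
                async with self._sem:          # hold a slot only while the request is in flight
                    status = await self._send()
            except (asyncio.TimeoutError, TimeoutError):
                if last:
                    raise
            else:
                if status not in RETRYABLE_STATUS or last:
                    return status
            # Exponential backoff with jitter, taken outside the semaphore so a
            # sleeping retry does not occupy an in-flight slot.
            jitter = 0.5 + random.random() / 2
            await asyncio.sleep(self._base_delay * (2 ** attempt) * jitter)
        raise RuntimeError("max_attempts must be at least 1")
```

Sleeping outside the semaphore is a deliberate choice here: a request waiting to retry should not count against the cap while the GPU is busy serving others.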
 ### Full-eval triage (2026-06-12)
 Source: `EVALUATION.md` at repo root (full evidence, file:line pointers, scorecard). Findings triaged below; do these before the pre-eval backlog above where they overlap.
 **Work queue — P0/P1, fix before sharing the package wider:**
 1. ~~**[P0] Shell-quote/validate every user value crossing into SSH**~~ — **DONE (code, 2026-06-12; not yet shipped).** New `image/app/shellsafe.py` (`validate_repo`/`validate_image`/`validate_container` whitelists + `quote_arg`/`quote_args`). Boundary validation added to `POST /api/models` (repo) and `POST /api/nim/install` (image+container); `shlex.quote` applied at every SSH sink — `models.build_launch_command` (repo+args, covers `vllm_args`+knobs), `download._do` (repo), `nim._do` (image/container/volume/port/env), `services.docker_state`+`run_action` (container). Verified: injection survives only as a single quoted token, vLLM preflight `shlex.split` round-trip intact, both redaction suites still pass. Side-benefit: NGC key now `shlex.quote`'d in `nim._do` (was single-quoted) — closes the quote-breakout half of the P2 NGC-key item; the process-list-exposure half remains. **Ship step pending:** version bump + release notes + rebuilt s9pk.
 2. ~~**[P0] Decide the git-history question**~~ — **DONE (2026-06-12).** Chose to rewrite. `git filter-repo --replace-text --replace-message` mapped every owner-specific value to its canonical placeholder across all commits + tags + messages (incl. three LAN IPs and one Start9 address the working-tree scrub missed, one of which — `192.168.1.72` in `docs/AUDIO_API.md` — was still live in HEAD). Verified 0 hits across all refs; force-pushed to `gitea`. Pre-rewrite backup bundle kept at `../spark-control-prehistory-rewrite.bundle`. **Owner follow-up:** rotate the `<ssh-key>` SSH key if still authorized on the Sparks (only its name leaked, not the key).
 3. ~~**[P1] Defense-in-depth gate on mutating endpoints**~~ — **DONE (code, 2026-06-12; not yet shipped).** `csrf_guard` HTTP middleware in `server.py` rejects state-changing requests whose `Origin`/`Referer` hostname ≠ the served host. Scoped to control endpoints; the programmatic API surface is exempt (`/v1/*`, `/scrub`, `/rehydrate`, `/api/search`, `/api/audio/`, `/api/health-event`) so downstream consumers are unaffected. No app-layer token auth (deliberate — would break consumers + the non-technical owner). Verified via TestClient: cross-origin control POST→403, same-origin/no-Origin→pass, exempt prefixes always pass, GET never blocked. **Verify on-box:** confirm the StartOS reverse proxy passes `Host`/`Origin` so the dashboard isn't false-positive-blocked.
 4. ~~**[P1] Validate the Qdrant `collection`**~~ — **DONE (code, 2026-06-12; not yet shipped).** `_safe_collection` whitelist (`[A-Za-z0-9._-]`, rejects `..`) + URL-encoded path segment in `embeddings_proxy.py`. The raw `filter` is left as a passthrough (Qdrant parses it; pydantic enforces `dict`) — locking it to an allowlist would break hybrid-search consumers; the path segment was the real injection vector.
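As an illustration of item 4, a whitelist check followed by URL-encoding of the path segment could be sketched as follows. This is not the project's `_safe_collection`; `safe_collection` and `points_search_url` are hypothetical names, and rejecting a bare `.` is an extra precaution not mentioned in the notes above.

```python
import re
from urllib.parse import quote

# Character whitelist quoted in the finding: letters, digits, dot, underscore, hyphen.
_COLLECTION_RE = re.compile(r"[A-Za-z0-9._-]+")


def safe_collection(name: str) -> str:
    """Return `name` as a URL-safe path segment, or raise ValueError."""
    if not _COLLECTION_RE.fullmatch(name) or ".." in name or name == ".":
        raise ValueError(f"invalid collection name: {name!r}")
    # Encode even after validation so the segment stays safe if the whitelist ever widens.
    return quote(name, safe="")


def points_search_url(base: str, collection: str) -> str:
    """Build the Qdrant points-search URL for a validated collection."""
    return f"{base.rstrip('/')}/collections/{safe_collection(collection)}/points/search"
```

Validating first and encoding second keeps the two concerns separate: the whitelist decides what is allowed, the encoding guarantees the value cannot change the shape of the URL.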
 **Shipping (all of #1/#3/#4 batched):** SHIPPED 2026-06-12 — version `0.18.0:1`→`0.19.0:0`, release notes in `versions/v0_1_0.ts`, s9pk rebuilt (`make x86`) and sideloaded to the live Start9 server (`make install`). Committed as `1c4e861` (its SHA after the same-day history rewrite) and force-pushed to `gitea`. **On-box check still pending:** click a control action (swap / service stop) to confirm the new CSRF guard doesn't false-positive-block the dashboard behind the StartOS proxy.
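For the pending on-box check, it may help to see the same-origin decision from item 3 written as a pure function. The sketch below is an approximation built from the description in item 3, not the shipped `csrf_guard`; `allow_request` is a hypothetical name, and treating `HEAD`/`OPTIONS` like `GET` is an assumption.

```python
from typing import Optional
from urllib.parse import urlsplit

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}   # only GET is confirmed by the notes above
EXEMPT_PREFIXES = ("/v1/", "/scrub", "/rehydrate", "/api/search",
                   "/api/audio/", "/api/health-event")


def allow_request(method: str, path: str, host_header: str,
                  origin: Optional[str] = None, referer: Optional[str] = None) -> bool:
    """Return True when the request should pass the same-origin gate."""
    if method.upper() in SAFE_METHODS:
        return True
    if path.startswith(EXEMPT_PREFIXES):     # programmatic API surface is exempt
        return True
    source = origin or referer
    if not source:                           # non-browser clients send neither header
        return True
    served = urlsplit(f"//{host_header}").hostname
    return urlsplit(source).hostname == served
```

If the StartOS reverse proxy rewrites `Host` without rewriting `Origin`, the final comparison is the line that would produce the false positive the on-box check is meant to rule out.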
 **Known debt — P2, track but not blocking:**
 - Test coverage is redaction-only; swap state machine, proxies, SSH wrapper, and the package have zero automated tests. Live-cluster paths (swap exec, audio, embeddings/search) couldn't be exercised at all — biggest blind spot.
 - Loose dependency floors permit vulnerable `python-multipart`/`starlette` (DoS CVEs) on rebuild; no lockfile; no upload size caps (`pyproject.toml:6-13`).
 - StartOS registry blockers (only if pursuing the registry): source not public + `packageRepo`/`upstreamRepo` are `example.com` placeholders (`manifest/index.ts:12-13`).
 - Opaque HTTP 500 on `POST /api/models` / `PUT /knobs` when `MODELS_OVERRIDES` unset in dev (write to read-only `/data`) — catch the `OSError`.
 - NGC API key inlined single-quoted into a remote shell command (`nim.py:147`) — pass via stdin/env.
 - Global mutable `catalog` reassigned via `global`, shared across async requests with no snapshot (`server.py:107`) — latent race as concurrency grows.
 - Container runs uvicorn as **root** bound to `0.0.0.0:9999` (no `USER` in Dockerfile) — amplifies any RCE blast radius.
 **Parked — P3+, do in bulk when next touching docs/packaging:**
 - README Status block stale (`v0.2.3 / 0.13.0:4` → v0.18.0:1, undercounts features); deprecated `@app.on_event` + hardcoded `app.version="0.1.0"`; `NimInstallBody.register` shadows `BaseModel` (rename → `register_service`); httpx class names leak into TTS/speech-models error text; one unescaped `innerHTML` sink (`app.js:177`) + `task_id` reflected in scrub JSON.
 - Packaging cosmetics: `marketingUrl` placeholder; broken `instructions.md` source link; per-service SSH users (`parakeet_user` etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); `Makefile` builds only x86 though manifest declares `aarch64`; release notes describe the scrub, not capabilities.
 - Hardening misc: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text echoed to clients.
@@ -2,23 +2,8 @@
 Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.
 ## Cluster coordination — OpenClaw coexistence (committed 2026-06-17, from Johnny 5 report 2026-06-16)
 Driven by the one other Spark Control adopter (a colleague running OpenClaw + cron jobs against his own dual Sparks; report at the date above). His cluster is configured differently from ours (vLLM on **both** Sparks, port 8000, raw `docker run`, container `vllm-gemma4`) and an automated cron physically swaps models — so his notes are partly *portability gaps* (the package hard-codes our layout) and partly *coordination gaps* (his dashboard and his crons fight over the GPU).
 **Design stance (decided):** Spark Control is the **control plane / GPU arbiter, not a job runner.** Recurring business pipelines (his "Daily Vol" generator; our own future scheduled jobs) live in *separate* application services that *call* Spark Control's swap API. The dividing line is what a scheduled job *does*: control-plane actions (swap a model, warm it, restart a service, run a health sweep) are in scope for an in-package scheduler; business logic (scrape / summarize / build / deploy) stays in the app layer. Swaps are already API-driven (`POST /api/swap` → `GET /api/swap/{id}` / `…/stream`, `POST /api/swap/{key}/validate`) and non-browser clients pass the CSRF guard, so an external scheduler can drive swaps **today** — the items below add the *safety* layer, not the capability.
 Sequenced:
 1. **Configurable `VLLM_PORT`** — DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000.
 2. **Local-path / fine-tuned model support** — see the dedicated item under "## Dashboard" below. Independently wanted; his merged `ten31-v2` (a directory, not an HF repo) is the motivating case.
 3. **Configurable topology** — make the service→Spark→port map and container names configurable so the package stops assuming our exact layout. Lets an adopter monitor vLLM on *both* Sparks, use a different container name, and stop the Parakeet probe from hitting a vLLM that shares its port — without forking. (Covers report P4 multi-Spark vLLM, P5 container name, and the Parakeet-port collision #6.)
 4. **Coordination layer** — build when our own automation actually lands (zero value until something other than the dashboard swaps models):
   - **Swap lock** with holder + TTL (`POST` / `GET` / `DELETE /api/swap/lock`). An external scheduler acquires it before swapping; the dashboard then refuses manual swaps and shows who holds the GPU and until when. Enforced by the swap path, not advisory.
   - **Swap-event webhook** (`swap_complete` / `swap_failed`) to a configurable URL, so downstream consumers update their provider config when the running model changes.
   - **Schedule visibility** — read-only view the dashboard surfaces, *registered by* external schedulers (Spark Control does not own the schedule).
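One possible shape for the swap lock in item 4, as a minimal in-memory sketch. Nothing here is built yet, so every name (`SwapLock`, `SwapLockRegistry`, `may_swap`) is hypothetical, and a single-process registry is assumed; persistence and the HTTP routes are left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SwapLock:
    holder: str
    expires_at: float          # in the registry clock's time base


class SwapLockRegistry:
    """Single-holder lock with a TTL; expired locks are dropped lazily on access."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lock: Optional[SwapLock] = None

    def current(self) -> Optional[SwapLock]:
        if self._lock is not None and self._lock.expires_at <= self._clock():
            self._lock = None
        return self._lock

    def acquire(self, holder: str, ttl_s: float) -> bool:
        """Take or renew the lock; fail if another holder still owns it."""
        held = self.current()
        if held is not None and held.holder != holder:
            return False
        self._lock = SwapLock(holder, self._clock() + ttl_s)
        return True

    def release(self, holder: str) -> bool:
        held = self.current()
        if held is None or held.holder != holder:
            return False
        self._lock = None
        return True

    def may_swap(self, requester: str) -> bool:
        """What the swap path would check before starting a swap."""
        held = self.current()
        return held is None or held.holder == requester
```

Putting the `may_swap` check inside the swap path itself, rather than in the dashboard, is what would make the lock enforced instead of advisory. The TTL means a crashed scheduler cannot hold the GPU indefinitely.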
 ## Near term
- parakeet-asr long-audio memory guard — **deferred 2026-06-15, low priority.** A duration cap on `/v1/audio/diarize`: Sortformer runs the whole file in one pass (`diarizer.py:128-135`) over Spark 2's *shared* 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. **Precautionary — no observed incident**, and the production consumer (Recap Relay) already chunks via `/diarize-chunk` (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full `/diarize`. When picked up: add a configurable `MAX_DIARIZE_SECONDS` guard in `diarizer.py` right after `duration` is computed (~line 130) → raise → HTTP 413 in `main.py` (mirrors the existing `MAX_UPLOAD_MB` 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
+- parakeet-asr `--memory` cap, shipped via the Reapply-patches action (guards against swap-thrash on very long audio).
 - Controlled concurrency sweep of the audio endpoints in a quiet window — replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
 ## Audio quality
@@ -34,25 +19,6 @@ Sequenced:
 - Second audio worker / queueing layer; revisit which services share Spark 2.
 ## Dashboard
 - Support local-path / fine-tuned models in the swap catalog. Today the catalog is static (`models.yaml` + custom overrides) and the "Add custom model" path (`POST /api/models`) only accepts an HF `org/name` repo (`shellsafe._HF_REPO_RE`), so a model that exists only as a directory on a Spark (the usual fine-tuning output) can't be registered or swapped. Needs: (a) a "local model" add form/field taking a Spark-side directory path, with its own safe validation instead of the `org/name` regex (path whitelist + `shlex.quote`, no traversal); (b) `models.build_launch_command` / `launch-cluster.sh` able to `vllm serve <path>`; (c) `disk.py` size-probe handling a path instead of deriving the HF cache dir from a repo id. Raised 2026-06-15 — a colleague's locally fine-tuned model doesn't appear because nothing scans the machine; the list is a curated catalog, not a discovery probe.
 - Per-model configurable vLLM flags editable from the UI (today: edit `models.yaml` and rebuild).
 - Spark host update actions (OS/driver) from the UI.
 - Open WebUI link-out integration; richer per-service detail views.
 ## Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)
 P0/P1 security findings are all fixed in v0.19.0:0. Remaining, none blocking:
 **P2 — track:**
 - No automated tests beyond the two redaction suites — swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for `build_launch_command` (incl. injection cases), swap transitions, and `_merge_words_with_speakers` is the highest-value start.
 - Loose dependency floors permit vulnerable `python-multipart`/`starlette` (DoS CVEs) on rebuild; no lockfile; no upload size caps (`pyproject.toml`).
 - Opaque HTTP 500 on `POST /api/models` / `PUT /knobs` when `MODELS_OVERRIDES` unset in dev (write to read-only `/data`) — catch the `OSError`.
 - NGC API key still appears on the remote process command line (`nim.py`) — the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
 - Global mutable `catalog` reassigned via `global`, shared across async requests with no snapshot (`server.py`) — latent race as concurrency grows.
 - Container runs uvicorn as **root** bound to `0.0.0.0:9999` (no `USER` in Dockerfile) — amplifies any RCE blast radius.
 **P3 — bulk-fix when next touching docs/packaging:**
 - README Status block stale (`v0.2.3 / 0.13.0:4` → now v0.19.0:0); deprecated `@app.on_event` + hardcoded `app.version="0.1.0"`; `NimInstallBody.register` shadows `BaseModel` (rename → `register_service`); httpx class names leak into TTS/speech-models error text; one unescaped `innerHTML` sink (`app.js`) + `task_id` reflected in scrub JSON.
 - Packaging: `marketingUrl`/`packageRepo`/`upstreamRepo` are `example.com` placeholders; broken `instructions.md` source link; per-service SSH users (`parakeet_user` etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); `Makefile` builds only x86 though the manifest declares `aarch64`.
 - Hardening misc: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text echoed to clients.
 - StartOS registry (only if ever pursuing it): source must be public + real repo URLs.
@@ -24,17 +24,12 @@ Other env vars: `BIND_PORT`, `MODELS_YAML`, `SSH_DIR`, `SSH_KNOWN_HOSTS`, `MODEL
 ## Tests
-Two kinds, both run with the `image/.venv` interpreter (system python3 has no deps):
+No pytest harness — each suite is a standalone script run with the `image/.venv` interpreter (system python3 has no deps). See the redaction and audio rules for the suites themselves.
 - **pytest unit suite** — offline, pure functions, no cluster. `.venv/bin/python -m pytest` from `image/`. Lives in `image/tests/`; currently covers `build_launch_command` (incl. the shell-injection / `shlex` round-trip invariant) and the transcript↔diarizer label-merge (`_merge_words_with_speakers`). Install the test dep once with `pip install -e '.[dev]'`. Add new pure-function coverage here.
 - **Standalone scripts** — the redaction suites and the live-cluster audio e2e are run directly (not via pytest). See the redaction and audio rules.
 ## Conventions
 - Pydantic request models go at **module scope**, never inside a `build_router()` body (FastAPI silently 422s otherwise).
 - New external-facing endpoints get documented in `docs/` (`AUDIO_API.md`, `EMBEDDINGS.md`, `REDACTION_GATEWAY.md`) and noted in release notes.
 - **SSH-input safety:** any user-supplied value that reaches an SSH command on the Sparks MUST go through `app/shellsafe.py` — validate against a whitelist at the API boundary, then `quote_arg`/`quote_args` (`shlex.quote`) at the sink. Never raw f-string a user value into a command string. Existing sinks: `models.build_launch_command`, `download`, `nim`, `services`; `disk.py` keeps its own `_SAFE_DIRNAME` because it needs `$HOME` to expand server-side. The vLLM pre-flight (`validate.py`) relies on `shlex.split` cleanly reversing this quoting — preserve that invariant.
 - **CSRF / same-origin:** state-mutating *control* endpoints are guarded by the `csrf_guard` middleware in `server.py` (rejects requests whose `Origin`/`Referer` host ≠ the served host). A new endpoint meant to be called **cross-origin by downstream apps** (a proxy/data endpoint) must be added to `_CSRF_EXEMPT_PREFIXES`, or browser POSTs from those apps will 403. No app-layer token auth by design (LAN/VPN-only; would break consumers).
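The SSH-input safety convention above (whitelist at the boundary, `shlex.quote` at the sink, `shlex.split` as the inverse) can be sketched as follows. This is a stand-in, not `app/shellsafe.py`: `check_repo`, `build_serve_command`, and the regex are illustrative assumptions.

```python
import re
import shlex
from typing import List

# Illustrative `org/name` pattern; the real whitelist lives in app/shellsafe.py.
_REPO_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def check_repo(repo: str) -> str:
    """Boundary validation: accept only an `org/name` shaped value."""
    if not _REPO_RE.fullmatch(repo):
        raise ValueError(f"invalid repo: {repo!r}")
    return repo


def build_serve_command(repo: str, extra_args: List[str]) -> str:
    """Sink quoting: every token is quoted, so shlex.split() recovers it exactly."""
    parts = ["vllm", "serve", check_repo(repo), *extra_args]
    return " ".join(shlex.quote(p) for p in parts)
```

Because each token is quoted individually, a hostile argument such as `; rm -rf /` reaches the remote shell as one literal token, and a pre-flight that runs `shlex.split` on the command gets the original argument list back.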
 ## Layout
@@ -8,16 +8,6 @@ def _env(name: str, default: str = "") -> str:
    return os.environ.get(name, default)
 def _env_int(name: str, default: int) -> int:
    """Parse an int env var, falling back to `default` when unset, blank, or
    malformed. The StartOS Configure panel passes optional numeric fields as an
    empty string when left blank, so a bare int("") would crash daemon startup."""
    try:
        return int(os.environ.get(name, "") or default)
    except (TypeError, ValueError):
        return default
 def _resolve_models_yaml() -> str:
    if env := os.environ.get("MODELS_YAML"):
        return env
@@ -52,11 +42,6 @@ class Settings:
    qdrant_user: str
    qdrant_container: str
    qdrant_collection: str
    matrix_bridge_host: str
    matrix_bridge_user: str
    matrix_bridge_container: str
    matrix_bridge_dir: str
    matrix_bridge_branch: str
    redaction_map_db: str
    redaction_map_ttl: int
    ssh_key_path: str
@@ -96,31 +81,18 @@ class Settings:
            qdrant_user=_env("QDRANT_USER") or spark2_user,
            qdrant_container=_env("QDRANT_CONTAINER") or "qdrant",
            qdrant_collection=_env("QDRANT_COLLECTION", ""),
            # matrix-bridge bot container, driven as its own SSH user (the owner
            # of the ~/matrix-bridge git clone) so git/docker run unprivileged.
            # The user is BLANK by default and set via the "Configure Sparks"
            # action; leaving it blank reports the service as unconfigured, which
            # hides the tile. That keeps the shared package portable — a
            # deployment without the bot never shows a stray tile or a hardcoded
            # username. Host defaults to Spark 2 (same box); container/dir/branch
            # are sensible defaults. All are env-overridable.
            matrix_bridge_host=_env("MATRIX_BRIDGE_HOST") or spark2_host,
            matrix_bridge_user=_env("MATRIX_BRIDGE_USER"),
            matrix_bridge_container=_env("MATRIX_BRIDGE_CONTAINER") or "matrix-bridge",
            matrix_bridge_dir=_env("MATRIX_BRIDGE_DIR") or "~/matrix-bridge",
            matrix_bridge_branch=_env("MATRIX_BRIDGE_BRANCH") or "master",
            # Redaction gateway pseudonym-map store (server-held de-anon key).
            redaction_map_db=_env("REDACTION_MAP_DB", "/data/redaction_maps.db"),
-            redaction_map_ttl=_env_int("REDACTION_MAP_TTL", 7200),
+            redaction_map_ttl=int(_env("REDACTION_MAP_TTL", "7200")),
            ssh_key_path=_env("SSH_KEY_PATH"),
            ssh_known_hosts=_env("SSH_KNOWN_HOSTS"),
            models_yaml=_resolve_models_yaml(),
-            vllm_port=_env_int("VLLM_PORT", 8888),
+            vllm_port=int(_env("VLLM_PORT", "8888")),
-            parakeet_port=_env_int("PARAKEET_PORT", 8000),
+            parakeet_port=int(_env("PARAKEET_PORT", "8000")),
-            kokoro_port=_env_int("KOKORO_PORT", 8880),
+            kokoro_port=int(_env("KOKORO_PORT", "8880")),
-            embed_port=_env_int("EMBED_PORT", 8088),
+            embed_port=int(_env("EMBED_PORT", "8088")),
-            qdrant_port=_env_int("QDRANT_PORT", 6333),
+            qdrant_port=int(_env("QDRANT_PORT", "6333")),
-            bind_port=_env_int("BIND_PORT", 9999),
+            bind_port=int(_env("BIND_PORT", "9999")),
            open_webui_url=_env("OPEN_WEBUI_URL", ""),
            ngc_api_key=_env("NGC_API_KEY", ""),
        )
@@ -26,9 +26,6 @@ echo GPU=$(nvidia-smi --query-gpu=name,utilization.gpu,temperature.gpu,power.dra
 echo GPU_MEM_USED_MIB=$(nvidia-smi --query-compute-apps=used_gpu_memory --format=csv,noheader,nounits 2>/dev/null | awk '{s+=$1} END {print s+0}')
 DEFIF=$(ip route show default 2>/dev/null | awk '{print $5; exit}')
 echo MAC=$(cat /sys/class/net/$DEFIF/address 2>/dev/null)
 WGIF=$(ip -o link show type wireguard 2>/dev/null | awk -F': ' 'NR==1 {print $2}')
 echo WG_IFACE=$WGIF
 echo WG_ADDR=$(ip -o -4 addr show "$WGIF" 2>/dev/null | awk 'NR==1 {print $4}')
 """.strip()
@@ -87,11 +84,6 @@ def _parse(out: str) -> dict:
    # MAC address on the default-route interface (for Wake-on-LAN)
    if info.get("mac"):
        parsed["mac"] = info["mac"].lower()
    # WireGuard tunnel membership: name + address of the first wg interface, if
    # any. Read-only and unprivileged (`ip` needs no root), so it never depends
    # on sudo and never breaks the probe — absence just yields no badge.
    parsed["wg_iface"] = info.get("wg_iface") or None
    parsed["wg_addr"] = info.get("wg_addr") or None
    return parsed
@@ -1,186 +0,0 @@
 """Update + logs for the matrix-bridge bot container on the Spark.
 matrix-bridge is a single Docker container managed by docker compose out of a
 git clone at `~matrix_bridge_user/matrix-bridge`. Status (the badge) and
 start/stop/restart ride the generic service machinery in `services.py`
 (`docker_state` / `run_action`). The two things that don't fit that mould live
 here:
  - **Update** — `git fetch && git reset --hard origin/<branch> && docker
    compose up -d --build`. Long-running (docker build), so it streams like the
    vLLM `UpdateManager`: fire-and-forget job, SSE stream, fail-loud rc.
  - **Logs** — a one-shot `docker logs --tail N` for diagnosing a red badge.
 We connect **directly as the configured user** (`modelo` — the repo owner), so
 git never trips its dubious-ownership guard and docker runs via the user's
 docker-group membership. We deliberately do NOT `sudo -iu modelo`: this Spark
 has no passwordless sudo, so a sudo wrap would hang in SSH BatchMode.
 """
 from __future__ import annotations
 import asyncio
 import time
 import uuid
 from dataclasses import dataclass, field
 from datetime import datetime, timezone
 from typing import Optional
 from .config import Settings
 from .shellsafe import quote_arg
 from .ssh import ssh_run, ssh_stream, StreamHandle
 # Hard ceiling on a single update. A first build after a base-image bump is
 # slow (minutes); the cache makes later ones quick. 25 min is generous headroom
 # without letting a genuinely wedged build spin forever.
 _UPDATE_TIMEOUT_S = 1500
 def build_update_command(directory: str, branch: str) -> str:
    """The update one-liner, run from the bot's git clone as its owner.
    `directory` and `branch` come from operator config (not request input), so
    they're interpolated directly — same trust model as the Spark hostnames in
    `health`/`updates`. `directory` may be `~/...`, which must stay unquoted so
    the remote login shell expands it; quoting would defeat that.
    """
    return (
        f"cd {directory} && "
        f"git fetch origin && "
        f"git reset --hard origin/{branch} && "
        f"docker compose up -d --build"
    )
 def _phase_for(line: str) -> Optional[str]:
    """Map a streamed output line to a human-readable phase, or None to keep
    the current phase. Kept loose — compose/buildkit output varies by version."""
    low = line.lower()
    if "git reset" in low or "head is now at" in low:
        return "Resetting to the latest release…"
    if "docker compose" in low or "buildkit" in low or low.startswith("step ") or "=> " in line or "building " in low:
        return "Building the bot image…"
    if "recreate" in low or "starting" in low or "started" in low or "container matrix-bridge" in low:
        return "Recreating the container…"
    if "already up to date" in low:
        return "No new code; rebuilding…"
    return None
@dataclass
 class UpdateJob:
    id: str
    started_at: str
    state: str = "starting"
    lines: list[str] = field(default_factory=list)
    returncode: Optional[int] = None
    finished_at: Optional[str] = None
    phase: str = "Starting…"
    def append(self, line: str) -> None:
        self.lines.append(line)
        if len(self.lines) > 1000:
            del self.lines[: len(self.lines) - 1000]
 class MatrixBridgeManager:
    def __init__(self, settings: Settings) -> None:
        self.settings = settings
        self.lock = asyncio.Lock()
        self.jobs: dict[str, UpdateJob] = {}
        self.current_job_id: Optional[str] = None
    def _configured(self) -> bool:
        s = self.settings
        return bool(s.matrix_bridge_host and s.matrix_bridge_user)
    def get(self, job_id: str) -> UpdateJob | None:
        return self.jobs.get(job_id)
    async def fetch_logs(self, tail: int = 100) -> dict:
        """One-shot `docker logs --tail N <container>` (stderr merged in)."""
        s = self.settings
        if not self._configured():
            return {"ok": False, "error": "matrix-bridge host not configured"}
        tail = max(1, min(int(tail), 1000))
        # tail is already int-clamped, but quote at the sink anyway so the
        # shellsafe convention (no raw interpolation into an SSH command) holds
        # regardless of caller.
        cmd = f"docker logs --tail {quote_arg(str(tail))} {quote_arg(s.matrix_bridge_container)} 2>&1"
        rc, out, err = await ssh_run(
            s.matrix_bridge_host, s.matrix_bridge_user, cmd, s, timeout=20
        )
        return {
            "ok": rc == 0,
            "rc": rc,
            "container": s.matrix_bridge_container,
            "output": (out or err).strip(),
        }
    async def trigger_update(self) -> UpdateJob:
        if not self._configured():
            raise RuntimeError("matrix-bridge host not configured")
        if self.lock.locked():
            raise RuntimeError("An update is already in progress")
        job = UpdateJob(
            id=uuid.uuid4().hex[:8],
            started_at=datetime.now(timezone.utc).isoformat(),
        )
        self.jobs[job.id] = job
        self.current_job_id = job.id
        asyncio.create_task(self._run(job))
        return job
    async def _run(self, job: UpdateJob) -> None:
        async with self.lock:
            try:
                await self._do(job)
                if job.state != "failed":
                    job.state = "done"
                    job.returncode = 0
                    job.phase = "Done"
            except asyncio.TimeoutError:
                job.append(f"[error] update timed out after {_UPDATE_TIMEOUT_S}s")
                job.state = "failed"
                job.returncode = 124
                job.phase = "Timed out"
            except Exception as e:
                job.append(f"[error] {type(e).__name__}: {e}")
                job.state = "failed"
                if job.returncode is None:
                    job.returncode = 1
            finally:
                job.finished_at = datetime.now(timezone.utc).isoformat()
                if self.current_job_id == job.id:
                    self.current_job_id = None
    async def _do(self, job: UpdateJob) -> None:
        s = self.settings
        cmd = build_update_command(s.matrix_bridge_dir, s.matrix_bridge_branch)
        job.append(f"$ {cmd}")
        job.state = "running"
        job.phase = "Fetching latest code…"
        handle = StreamHandle()
        gen = ssh_stream(s.matrix_bridge_host, s.matrix_bridge_user, cmd, s, handle=handle)
        deadline = time.monotonic() + _UPDATE_TIMEOUT_S
        try:
            while True:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise asyncio.TimeoutError
                try:
                    line = await asyncio.wait_for(gen.__anext__(), timeout=remaining)
                except StopAsyncIteration:
                    break
                job.append(line)
                phase = _phase_for(line)
                if phase:
                    job.phase = phase
        finally:
            # Closing the generator terminates the underlying ssh process and
            # populates handle.returncode via ssh_stream's finally block.
            await gen.aclose()
        rc = handle.returncode or 0
        if rc != 0:
            job.state = "failed"
            job.returncode = rc
@@ -3,7 +3,7 @@ import asyncio
 import json
 from pathlib import Path
-from fastapi import FastAPI, HTTPException, Query, Request
+from fastapi import FastAPI, HTTPException
 from fastapi.responses import FileResponse, JSONResponse, StreamingResponse
 from fastapi.staticfiles import StaticFiles
 from pydantic import BaseModel
@@ -21,7 +21,6 @@ from .embeddings_proxy import build_router as build_embeddings_router
 from .redaction_gateway import build_router as build_redaction_router, MapStore
 from .hardware import HardwareProbe
 from .health import check_kokoro, check_parakeet, check_vllm, check_embeddings, check_qdrant
 from .matrix_bridge import MatrixBridgeManager
 from .models import load_catalog
 from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager
 from .overrides import add_custom, delete_custom, extract_knobs_from_args, load_overrides, set_knobs
@@ -44,7 +43,6 @@ hardware_probe = HardwareProbe(settings)
 nim_manager = NimManager(settings)
 deep_health = DeepHealth(settings)
 speech_models = SpeechModelsManager(settings)
 matrix_bridge = MatrixBridgeManager(settings)
 app = FastAPI(title="spark-control", version="0.1.0")
@@ -403,53 +401,6 @@ async def wake_spark(name: str) -> dict:
    return {"ok": True, "spark": name, "mac": mac, "delivered_via": delivered_via}
@app.post("/api/spark/{name}/ssh-key")
 async def spark_ssh_key(name: str) -> dict:
    """Ensure the named Spark has an ed25519 keypair and return its PUBLIC key.
    This is the Spark's *outbound* identity — the key it uses to log in to other
    machines (e.g. the operator's Mac). It is the opposite direction from, and
    distinct from, the package's own key shown by the StartOS "Show Public Key"
    action (which grants this dashboard SSH access to the Sparks).
    Non-destructive: generates the key only if absent, never overwrites an
    existing one (which may already be an identity the Spark uses elsewhere).
    Public keys are not secret, so returning it is safe. No request-supplied
    value reaches the command — `name` is constrained to a fixed set and
    host/user come from operator config — so there is nothing to shell-quote.
    """
    if name not in ("spark1", "spark2"):
        raise HTTPException(404, f"unknown spark: {name}")
    host = settings.spark1_host if name == "spark1" else settings.spark2_host
    user = settings.spark1_user if name == "spark1" else settings.spark2_user
    if not host or not user:
        raise HTTPException(400, f"{name} is not configured")
    # Empty passphrase so the key is usable unattended; comment carries the
    # remote hostname so it's identifiable in an authorized_keys file later.
    cmd = (
        "set -e; "
        "mkdir -p ~/.ssh && chmod 700 ~/.ssh; "
        "if [ ! -f ~/.ssh/id_ed25519 ]; then "
        'ssh-keygen -t ed25519 -N "" -C "spark-control@$(hostname)" -f ~/.ssh/id_ed25519 >/dev/null 2>&1; '
        "echo CREATED=1; else echo CREATED=0; fi; "
        "[ -f ~/.ssh/id_ed25519.pub ] || ssh-keygen -y -f ~/.ssh/id_ed25519 > ~/.ssh/id_ed25519.pub; "
        "echo PUBKEY=$(cat ~/.ssh/id_ed25519.pub)"
    )
    rc, out, err = await ssh_run(host, user, cmd, settings, timeout=15)
    if rc != 0:
        raise HTTPException(502, f"couldn't read/create the SSH key on {name}: {err.strip() or out.strip() or f'rc={rc}'}")
    created = False
    pubkey = ""
    for line in out.splitlines():
        if line.startswith("CREATED="):
            created = line.strip() == "CREATED=1"
        elif line.startswith("PUBKEY="):
            pubkey = line[len("PUBKEY="):].strip()
    if not pubkey:
        raise HTTPException(502, f"no public key returned from {name}")
    return {"ok": True, "spark": name, "host": host, "user": user, "pubkey": pubkey, "created": created}
@app.get("/api/services")
 async def get_services() -> dict:
    """Lifecycle state of always-on support services (Parakeet, Kokoro, …).
@@ -476,11 +427,6 @@ async def get_services() -> dict:
            http = await check_embeddings(settings)
        elif name == "qdrant":
            http = await check_qdrant(settings)
        elif svc.kind == "bot":
            # No HTTP health endpoint (host networking, no port) — judged purely
            # by docker state. http_ready stays None so the badge isn't pinned
            # to a "Starting…" verdict that can never clear.
            http = {"ok": None, "base_url": None}
        else:
            # Custom services expose a /health endpoint by convention.
            http = await check_kokoro(settings) if svc.kind == "tts" else {"ok": None, "base_url": svc.host and f"http://{svc.host}:{svc.port}"}
@@ -491,9 +437,7 @@ async def get_services() -> dict:
            "container": svc.container,
            "kind": svc.kind,
            "base_url": http.get("base_url"),
-            # None (not False) for services with no HTTP surface (the bot), so
-            # the UI judges them by docker state alone instead of "Starting…".
-            "http_ready": None if svc.kind == "bot" else bool(http.get("ok")),
+            "http_ready": bool(http.get("ok")),
            # Prefer the check fn's own top-level model key (embeddings reports
            # it there); fall back to a model field inside detail for services
            # whose /health embeds it (parakeet).
@@ -509,11 +453,8 @@ async def get_services() -> dict:
    results = await asyncio.gather(*[one(n) for n in services.keys()])
    for name, info in results:
        out[name] = info
-        # Feed http reachability into the connectivity log (transition-only).
-        # Skip services with no HTTP surface (http_ready is None) — they'd
-        # otherwise register as perpetually "down".
-        if info.get("http_ready") is not None:
-            record_state(name, bool(info.get("http_ready")))
+        # Feed http reachability into the connectivity log (transition-only)
+        record_state(name, bool(info.get("http_ready")))
    return out
@@ -618,7 +559,7 @@ async def stream_nim_install(job_id: str):
@app.delete("/api/services/{name}")
 async def del_service(name: str) -> dict:
    # Only allow deleting custom services (not the bundled built-in keys)
-    if name in ("parakeet", "kokoro", "embeddings", "qdrant", "matrix-bridge"):
+    if name in ("parakeet", "kokoro", "embeddings", "qdrant"):
        raise HTTPException(400, "built-in service; cannot delete (use Configure Sparks to point at a different host)")
    delete_custom_service(name)
    return {"ok": True, "name": name}
@@ -637,81 +578,6 @@ async def service_action(name: str, action: str) -> dict:
    return {"name": name, "action": action, **result}
 # ---- matrix-bridge bot: update (git pull + rebuild) + logs ----
 # Status badge + start/stop/restart ride the generic /api/services machinery
 # above (the bot is a registered ServiceDef). Only the long-running Update and
 # the logs view need bespoke endpoints.
 def _serialize_mb_update(job) -> dict:
    return {
        "id": job.id,
        "state": job.state,
        "phase": job.phase,
        "started_at": job.started_at,
        "finished_at": job.finished_at,
        "returncode": job.returncode,
        "lines": job.lines,
    }
@app.post("/api/matrix-bridge/update")
 async def post_matrix_bridge_update() -> dict:
    """Pull latest code, rebuild, and recreate the bot container. Long-running
    (docker build) — returns a job id to stream."""
    try:
        job = await matrix_bridge.trigger_update()
    except RuntimeError as e:
        raise HTTPException(409 if "in progress" in str(e) else 503, str(e))
    return {"job_id": job.id, "state": job.state}
@app.get("/api/matrix-bridge/update/{job_id}")
 async def get_matrix_bridge_update(job_id: str) -> dict:
    job = matrix_bridge.get(job_id)
    if job is None:
        raise HTTPException(404, "no such job")
    return _serialize_mb_update(job)
@app.get("/api/matrix-bridge/update/{job_id}/stream")
 async def stream_matrix_bridge_update(job_id: str, request: Request):
    job = matrix_bridge.get(job_id)
    if job is None:
        raise HTTPException(404, "no such job")
    async def gen():
        sent = 0
        last_phase = None
        while True:
            # An update can run for minutes; bail promptly if the client is gone
            # rather than spinning the poll loop until the job's 25-min ceiling.
            if await request.is_disconnected():
                return
            n = len(job.lines)
            if n > sent:
                for line in job.lines[sent:n]:
                    yield f"data: {json.dumps({'line': line})}\n\n"
                sent = n
            if job.phase != last_phase:
                yield f"event: phase\ndata: {json.dumps({'state': job.state, 'phase': job.phase})}\n\n"
                last_phase = job.phase
            if job.returncode is not None and sent >= len(job.lines):
                yield f"event: done\ndata: {json.dumps({'state': job.state, 'returncode': job.returncode})}\n\n"
                return
            await asyncio.sleep(0.5)
    return StreamingResponse(gen(), media_type="text/event-stream")
@app.get("/api/matrix-bridge/logs")
 async def get_matrix_bridge_logs(tail: int = Query(100, ge=1, le=1000)) -> dict:
    """Last N lines of `docker logs` for the bot container (stderr merged)."""
    result = await matrix_bridge.fetch_logs(tail=tail)
    if not result.get("ok"):
        raise HTTPException(502, result.get("output") or result.get("error") or "could not read logs")
    return result
 # ---- Speech model patch management ----
@app.get("/api/speech-models")
@@ -89,17 +89,6 @@ def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
            container=s.qdrant_container,
            port=s.qdrant_port,
        ),
        # matrix-bridge Matrix bot. No HTTP port to probe (host networking, no
        # health endpoint) — judged purely by docker state. Driven as its own
        # SSH user (modelo, the repo owner) so git/docker run unprivileged.
        "matrix-bridge": ServiceDef(
            name="matrix-bridge",
            kind="bot",
            host=s.matrix_bridge_host,
            user=s.matrix_bridge_user,
            container=s.matrix_bridge_container,
            port=0,
        ),
    }
    for entry in load_custom_services():
        key = entry.get("key")
@@ -13,7 +13,6 @@ const state = {
  swap_progress: 0,        // 0–1
  services: {},
  service_action_in_flight: null,  // e.g. "parakeet:restart"
  mb_update_in_flight: false,      // matrix-bridge update job running
  hardware: {},
  config: {},
  configured: true,
@@ -306,32 +305,6 @@ async function wakeSpark(name) {
  }
 }
 // Generate-if-missing + copy this Spark's OUTBOUND ssh public key (the key the
 // Spark uses to log in to other machines, e.g. the Mac). Distinct from the
 // package's own key in the StartOS "Show Public Key" action.
 async function copySparkSshKey(name, btn) {
  if (btn) btn.disabled = true;
  try {
    const r = await fetchJSON(`/api/spark/${name}/ssh-key`, { method: 'POST' });
    // Best-effort clipboard copy; on plain-HTTP this no-ops, but the dialog
    // below always shows the key for manual selection.
    await copyText(r.pubkey, btn);
    const label = r.host ? `${name} (${r.host})` : name;
    el('#sshkey-title').textContent = `${name} — SSH public key`;
    el('#sshkey-intro').textContent = r.created
      ? `Generated a new SSH key on ${label} and copied it to your clipboard. This is the key ${name} uses to log in to OTHER machines.`
      : `${label} already had an SSH key; copied its public key to your clipboard. This is the key ${name} uses to log in to OTHER machines.`;
    el('#sshkey-value').textContent = r.pubkey;
    el('#sshkey-install').textContent =
      `mkdir -p ~/.ssh && echo '${r.pubkey}' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys`;
    el('#sshkey-dialog').showModal();
  } catch (e) {
    alert(`Couldn't get the SSH key for ${name}: ${e.message}`);
  } finally {
    if (btn) btn.disabled = false;
  }
 }
 function renderHardware() {
  const panel = el('#hardware-panel');
  const grid = el('#hardware-grid');
@@ -385,21 +358,11 @@ function renderHardware() {
    if (s.gpu_temp_c != null) gpuExtras.push(`${s.gpu_temp_c}°C`);
    if (s.gpu_power_w != null) gpuExtras.push(`${s.gpu_power_w.toFixed(0)}W`);
    const gpuExtrasStr = gpuExtras.length ? ` · ${gpuExtras.join(' · ')}` : '';
    // Read-only WireGuard badge: shown only when the Spark has a wg interface up.
    // "VPN <ip>" means it's a peer on that tunnel (reachable off-LAN when the
    // tunnel is up); it reflects interface presence, not live peer reachability.
    const wgIp = s.wg_addr ? String(s.wg_addr).split('/')[0] : '';
    const wgBadge = s.wg_iface
      ? ` · <span class="wg-badge" title="On WireGuard tunnel '${escapeHtml(s.wg_iface)}'${wgIp ? ' as ' + escapeHtml(wgIp) : ''} — reachable off-LAN while the tunnel is up">VPN${wgIp ? ' ' + escapeHtml(wgIp) : ''}</span>`
      : '';
    card.className = 'hw-card';
    card.innerHTML = `
      <div class="head">
        <span class="name">${escapeHtml(s.hostname || key)}</span>
-        <span class="meta">${escapeHtml(key)} · ${escapeHtml(s.gpu_name || '')} · ${escapeHtml(s.uptime || '')}${wgBadge}</span>
+        <span class="meta">${escapeHtml(key)} · ${escapeHtml(s.gpu_name || '')} · ${escapeHtml(s.uptime || '')}</span>
        <button class="icon-btn ssh-key-btn" data-ssh-key="${escapeHtml(key)}" title="Copy this Spark's SSH public key (creates one if it doesn't have one) — e.g. to let it log in to your Mac" aria-label="Copy SSH public key">
          <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2"/><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"/></svg>
        </button>
      </div>
      <div class="hw-metric">
        <span class="label">CPU</span>
@@ -439,13 +402,8 @@ function classifyService(s) {
  if (s.docker_state === 'missing') return 'missing';
  if (s.docker_state === 'restarting') return 'unhealthy';
  if (s.docker_state === 'exited') return 'unhealthy';
-  if (s.docker_state === 'running') {
-    // http_ready === false means an HTTP probe is expected but failing → still
-    // warming up. null means the service has no HTTP surface (e.g. the bot), so
-    // a running container is simply healthy.
-    if (s.http_ready === false) return 'starting';
-    return 'running';
-  }
+  if (s.docker_state === 'running' && !s.http_ready) return 'starting';
+  if (s.docker_state === 'running' && s.http_ready) return 'running';
  return s.docker_state || 'unknown';
 }
@@ -477,11 +435,6 @@ async function renderServices() {
  grid.innerHTML = '';
  for (const [name, s] of entries) {
    const cls = classifyService(s);
    const isBot = s.kind === 'bot';
    // The bot tile is opt-in: it only belongs to deployments that actually run
    // matrix-bridge. When the container is absent (missing) or the host isn't
    // configured, hide the tile entirely rather than show a stray red card.
    if (isBot && (cls === 'missing' || cls === 'unconfigured')) continue;
    const card = document.createElement('div');
    card.className = `service-card ${cls}`;
    const inFlight = state.service_action_in_flight && state.service_action_in_flight.startsWith(name + ':');
@@ -494,7 +447,7 @@ async function renderServices() {
      return false;
    };
    const copyIcon = `<svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2"/><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"/></svg>`;
-    const hostStr = s.host ? (s.port ? `${s.host}:${s.port}` : s.host) : '';
+    const hostStr = s.host ? `${s.host}:${s.port}` : '';
    const hostRow = s.host
      ? `<div class="row"><span class="k">Host</span><span class="v copyable" data-copy-self title="Click to copy">${escapeHtml(hostStr)}</span><button class="icon-btn" data-copy-text="${escapeHtml(hostStr)}" title="Copy host" aria-label="Copy">${copyIcon}</button></div>`
      : `<div class="row"><span class="k">Host</span><span class="v muted-v">not configured</span></div>`;
@@ -548,11 +501,9 @@ async function renderServices() {
      ${restartsRow}
      ${deepRow}
      <div class="service-actions">
        ${isBot ? `<button class="btn primary" data-mb-update title="Pull latest code, rebuild, and recreate the bot" ${inFlight || state.mb_update_in_flight ? 'disabled' : ''}>Update</button>` : ''}
        <button class="btn" data-svc-action="${name}:start" ${disable('start') ? 'disabled' : ''}>Start</button>
        <button class="btn" data-svc-action="${name}:restart" ${disable('restart') ? 'disabled' : ''}>Restart</button>
        <button class="btn danger" data-svc-action="${name}:stop" ${disable('stop') ? 'disabled' : ''}>Stop</button>
        ${isBot ? `<button class="btn" data-mb-logs title="Show the last 100 log lines">View logs</button>` : ''}
      </div>
    `;
    grid.appendChild(card);
@@ -560,10 +511,6 @@ async function renderServices() {
  for (const btn of grid.querySelectorAll('.btn[data-svc-action]')) {
    btn.addEventListener('click', () => onServiceAction(btn.dataset.svcAction));
  }
  const mbUpdateBtn = grid.querySelector('[data-mb-update]');
  if (mbUpdateBtn) mbUpdateBtn.addEventListener('click', onMatrixBridgeUpdate);
  const mbLogsBtn = grid.querySelector('[data-mb-logs]');
  if (mbLogsBtn) mbLogsBtn.addEventListener('click', openMatrixBridgeLogs);
  for (const btn of grid.querySelectorAll('[data-dh-run]')) {
    btn.addEventListener('click', () => onDeepHealthRun(btn.dataset.dhRun, btn));
  }
@@ -742,118 +689,6 @@ async function onServiceAction(key) {
  }
 }
 // ===================== matrix-bridge bot (update + logs) =====================
 const mbState = { job_id: null, eventsource: null, timer: null, started_at: null };
 function mbTimerStart(at) {
  mbState.started_at = at;
  if (mbState.timer) clearInterval(mbState.timer);
  const tick = () => {
    if (!mbState.started_at) return;
    const sec = Math.max(0, Math.floor((Date.now() - mbState.started_at) / 1000));
    el('#mb-update-elapsed').textContent = `${Math.floor(sec / 60)}:${(sec % 60).toString().padStart(2, '0')}`;
  };
  tick();
  mbState.timer = setInterval(tick, 500);
 }
 async function onMatrixBridgeUpdate() {
  if (state.mb_update_in_flight) return;
  if (!confirm('Update the matrix-bridge bot?\n\nThis pulls the latest code, rebuilds the container image, and recreates the container. The first build after a base-image change can take several minutes. The bot is briefly offline while it restarts.')) return;
  state.mb_update_in_flight = true;
  renderServices();
  try {
    const r = await fetchJSON('/api/matrix-bridge/update', { method: 'POST' });
    attachMbUpdateProgress(r.job_id);
  } catch (e) {
    state.mb_update_in_flight = false;
    renderServices();
    alert('Update failed to start: ' + e.message);
  }
 }
 async function attachMbUpdateProgress(jobId) {
  mbState.job_id = jobId;
  el('#mb-update-log').textContent = '';
  el('#mb-update-title').textContent = 'Updating matrix-bridge…';
  el('#mb-update-phase').textContent = 'Starting…';
  el('#mb-update-dialog').showModal();
  try {
    const snap = await fetchJSON(`/api/matrix-bridge/update/${jobId}`);
    mbTimerStart(Date.parse(snap.started_at));
    el('#mb-update-phase').textContent = snap.phase || 'Working…';
    el('#mb-update-log').textContent = (snap.lines || []).join('\n');
    if (snap.returncode !== null) { onMbUpdateDone(snap); return; }
  } catch { mbTimerStart(Date.now()); }
  const es = new EventSource(`/api/matrix-bridge/update/${jobId}/stream`);
  mbState.eventsource = es;
  es.onmessage = ev => {
    try {
      const d = JSON.parse(ev.data);
      if (d.line !== undefined) {
        const log = el('#mb-update-log');
        log.textContent += d.line + '\n';
        log.scrollTop = log.scrollHeight;
      }
    } catch {}
  };
  es.addEventListener('phase', ev => {
    try { el('#mb-update-phase').textContent = JSON.parse(ev.data).phase; } catch {}
  });
  es.addEventListener('done', ev => {
    let d = {}; try { d = JSON.parse(ev.data); } catch {}
    onMbUpdateDone(d);
  });
  es.onerror = () => {
    // Don't leave the Update button wedged-disabled on a dropped stream. The
    // job keeps running server-side; re-clicking Update returns a clean 409.
    es.close();
    mbState.eventsource = null;
    state.mb_update_in_flight = false;
    el('#mb-update-phase').textContent = 'Lost connection to the update stream — reopen or check logs.';
    renderServices();
  };
 }
 function onMbUpdateDone(d) {
  if (mbState.eventsource) { mbState.eventsource.close(); mbState.eventsource = null; }
  if (mbState.timer) { clearInterval(mbState.timer); mbState.timer = null; }
  state.mb_update_in_flight = false;
  if (d.state === 'failed') {
    el('#mb-update-title').textContent = `Update failed (rc=${d.returncode})`;
    el('#mb-update-phase').textContent = 'Failed — see the log above.';
  } else {
    el('#mb-update-title').textContent = 'Update complete';
    el('#mb-update-phase').textContent = 'Done ✓';
  }
  // Refresh the tile's badge.
  (async () => { try { state.services = await fetchJSON('/api/services'); } catch {} renderServices(); })();
 }
 async function openMatrixBridgeLogs() {
  const pre = el('#mb-logs-pre');
  el('#mb-logs-title').textContent = 'matrix-bridge logs';
  pre.textContent = 'Loading…';
  el('#mb-logs-dialog').showModal();
  await loadMatrixBridgeLogs();
 }
 async function loadMatrixBridgeLogs() {
  const pre = el('#mb-logs-pre');
  const btn = el('#mb-logs-refresh');
  if (btn) btn.disabled = true;
  try {
    const r = await fetchJSON('/api/matrix-bridge/logs?tail=100');
    pre.textContent = r.output || '(no output)';
    pre.scrollTop = pre.scrollHeight;
  } catch (e) {
    pre.textContent = 'Could not read logs: ' + e.message;
  } finally {
    if (btn) btn.disabled = false;
  }
 }
 function renderEndpoint(status) {
  const v = status.vllm || {};
  const panel = el('#endpoint-panel');
@@ -2012,28 +1847,13 @@ async function init() {
  el('#nim-cancel').addEventListener('click', () => el('#nim-dialog').close());
  el('#nim-form').addEventListener('submit', submitNim);
  el('#nim-prog-close').addEventListener('click', () => el('#nim-progress-dialog').close());
  el('#mb-update-close').addEventListener('click', () => el('#mb-update-dialog').close());
  // Dismissing the modal (Close or Esc) stops streaming; the job runs on
  // server-side and re-clicking Update returns a 409 if still in progress.
  el('#mb-update-dialog').addEventListener('close', () => {
    if (mbState.eventsource) { mbState.eventsource.close(); mbState.eventsource = null; }
    if (mbState.timer) { clearInterval(mbState.timer); mbState.timer = null; }
    state.mb_update_in_flight = false;
    renderServices();
  });
  el('#mb-logs-close').addEventListener('click', () => el('#mb-logs-dialog').close());
  el('#mb-logs-refresh').addEventListener('click', loadMatrixBridgeLogs);
  el('#open-connectivity').addEventListener('click', openConnectivityDialog);
  el('#connectivity-close').addEventListener('click', () => el('#connectivity-dialog').close());
-  // Hardware-card buttons (Wake-on-LAN on unreachable cards; SSH-key copy on
-  // reachable ones) are rendered dynamically, so delegate from the grid.
+  // Wake-on-LAN buttons live on unreachable hardware cards; delegate.
  el('#hardware-grid').addEventListener('click', (e) => {
-    const wbtn = e.target.closest('[data-wake]');
-    if (wbtn) { wakeSpark(wbtn.dataset.wake); return; }
-    const kbtn = e.target.closest('[data-ssh-key]');
-    if (kbtn) { copySparkSshKey(kbtn.dataset.sshKey, kbtn); return; }
+    const btn = e.target.closest('[data-wake]');
+    if (btn) wakeSpark(btn.dataset.wake);
  });
  el('#sshkey-close').addEventListener('click', () => el('#sshkey-dialog').close());
  setupCatalogDialog();
  setupAdvancedDialog();
  // Open WebUI link from /api/config
@@ -164,37 +164,6 @@
          </div>
        </form>
      </dialog>
      <dialog id="mb-update-dialog" class="modal">
        <form method="dialog" class="modal-form">
          <h3 id="mb-update-title">Updating matrix-bridge…</h3>
          <div class="phase-row">
            <div class="phase" id="mb-update-phase">Starting…</div>
            <span class="spacer"></span>
            <span class="timer" id="mb-update-elapsed">0:00</span>
          </div>
          <details open>
            <summary class="muted small">Log</summary>
            <pre id="mb-update-log" class="log"></pre>
          </details>
          <div class="modal-actions">
            <button type="button" id="mb-update-close" class="btn">Close</button>
          </div>
        </form>
      </dialog>
      <dialog id="mb-logs-dialog" class="modal">
        <form method="dialog" class="modal-form">
          <h3 id="mb-logs-title">matrix-bridge logs</h3>
          <p class="muted small">Last 100 lines from <code>docker logs</code> on the Spark.</p>
          <pre id="mb-logs-pre" class="log"></pre>
          <div class="modal-actions">
            <button type="button" id="mb-logs-refresh" class="btn">Refresh</button>
            <span class="spacer"></span>
            <button type="button" id="mb-logs-close" class="btn">Close</button>
          </div>
        </form>
      </dialog>
    </section>
    <section id="speech-models-panel" class="speech-models hidden">
@@ -275,24 +244,6 @@
        </form>
      </dialog>
      <dialog id="sshkey-dialog" class="modal">
        <form method="dialog" class="modal-form">
          <h3 id="sshkey-title">SSH public key</h3>
          <p id="sshkey-intro" class="muted small"></p>
          <div class="sshkey-row">
            <pre id="sshkey-value" class="snippet copyable" data-copy-self title="Click to copy"></pre>
            <button type="button" class="icon-btn" data-copy="#sshkey-value" title="Copy public key" aria-label="Copy public key">
              <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2"/><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"/></svg>
            </button>
          </div>
          <p class="muted small">To let this Spark log in to another machine (e.g. your Mac), run this in a terminal <em>on that machine</em>:</p>
          <pre id="sshkey-install" class="snippet copyable" data-copy-self title="Click to copy"></pre>
          <div class="modal-actions">
            <button type="button" id="sshkey-close" class="btn">Close</button>
          </div>
        </form>
      </dialog>
      <dialog id="advanced-dialog" class="modal">
        <form method="dialog" class="modal-form" id="advanced-form">
          <h3 id="adv-title">Advanced settings</h3>
@@ -374,12 +374,6 @@ main {
 }
 .hw-card .head .name { font-weight: 600; font-size: 15px; }
 .hw-card .head .meta { color: var(--muted); font-size: 12px; margin-left: auto; }
 /* WireGuard "VPN <ip>" badge in the meta line — accent (green) = on a tunnel. */
 .hw-card .head .meta .wg-badge { color: var(--accent); font-weight: 600; cursor: help; }
 /* Copy-this-Spark's-ssh-key button pins to the top-right corner; meta keeps
   its margin-left:auto so name/meta/button read left→right→corner. */
 .hw-card .head .ssh-key-btn { align-self: flex-start; padding: 3px 6px; }
 .hw-card .head .ssh-key-btn svg { width: 13px; height: 13px; }
 .hw-card.unreachable { border-color: rgba(239, 68, 68, 0.4); }
 .hw-card.unreachable .name { color: var(--error); }
 .hw-card.unreachable ol { color: var(--muted); }
@@ -393,10 +387,6 @@ main {
 }
 .hw-card .wol-row .btn { padding: 5px 10px; font-size: 12px; }
 .hw-card .mac-display { font-family: ui-monospace, SFMono-Regular, Menlo, monospace; }
 /* SSH-key dialog: key line beside its copy button; long key wraps rather than scrolls. */
 .sshkey-row { display: flex; align-items: flex-start; gap: 8px; }
 .sshkey-row .snippet { flex: 1; margin: 0; white-space: pre-wrap; word-break: break-all; }
 #sshkey-install { white-space: pre-wrap; word-break: break-all; }
 .connectivity-content {
  max-height: 360px;
@@ -526,12 +516,10 @@ main {
 #dl-log-details { margin-top: 12px; }
 #dl-log-details summary { cursor: pointer; padding: 4px 0; }
-/* ===== NIM install + matrix-bridge dialogs ===== */
+/* ===== NIM install dialog ===== */
 .modal#nim-dialog,
-.modal#nim-progress-dialog,
+.modal#nim-progress-dialog { max-width: 640px; }
 .modal#mb-update-dialog,
 .modal#mb-logs-dialog { max-width: 640px; }
 .nim-grid {
  display: grid;
  gap: 8px;
@@ -12,12 +12,6 @@ dependencies = [
    "python-multipart>=0.0.9",
 ]
 [project.optional-dependencies]
 dev = ["pytest>=8"]
 [tool.pytest.ini_options]
 testpaths = ["tests"]
 [build-system]
 requires = ["setuptools>=68"]
 build-backend = "setuptools.build_meta"
@@ -1,17 +0,0 @@
 """Shared pytest setup.
 These suites are pure/offline — they exercise pure functions and never touch the
 Sparks, /data, or the network. We still pin the env vars the app modules expect
 (documented in docs/guides/fastapi-image.md) to tmp paths so importing them can
 never write to the container-only /data path.
 """
 import os
 import sys
 from pathlib import Path
 # Let `import app...` resolve whether or not the package is pip-installed.
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 os.environ.setdefault("REDACTION_MAP_DB", "/tmp/spark_control_test_maps.db")
 os.environ.setdefault("CONNECTIVITY_LOG", "/tmp/spark_control_test_connectivity.json")
 os.environ.setdefault("MODELS_OVERRIDES", "/tmp/spark_control_test_overrides.yaml")
@@ -1,69 +0,0 @@
 """_merge_words_with_speakers + _assign_speaker_to_word: the transcript/diarizer
 merge that turns Parakeet words + Sortformer turns into speaker-labelled blocks.
 Pure functions, no cluster — this is the core of transcribe-with-speakers.
 """
 from app.audio_proxy import _assign_speaker_to_word, _merge_words_with_speakers
 def _w(start, end, text):
    return {"start": start, "end": end, "text": text}
 def _t(start, end, speaker):
    return {"start_s": start, "end_s": end, "speaker": speaker}
 # ---- _assign_speaker_to_word ----
 def test_assign_by_midpoint_containment():
    turns = [_t(0.0, 2.0, "Speaker_0"), _t(2.0, 4.0, "Speaker_1")]
    assert _assign_speaker_to_word(2.4, 2.8, turns) == "Speaker_1"
 def test_assign_falls_back_to_max_overlap_when_midpoint_outside():
    # midpoint 5.0 is in no turn; word span overlaps Speaker_0 more than Speaker_1.
    turns = [_t(0.0, 4.9, "Speaker_0"), _t(6.0, 8.0, "Speaker_1")]
    assert _assign_speaker_to_word(4.0, 6.0, turns) == "Speaker_0"
 def test_assign_unknown_when_no_overlap():
    turns = [_t(0.0, 1.0, "Speaker_0")]
    assert _assign_speaker_to_word(10.0, 11.0, turns) == "Speaker_unknown"
 # ---- _merge_words_with_speakers ----
 def test_empty_words_returns_empty():
    assert _merge_words_with_speakers([], [_t(0, 1, "Speaker_0")]) == []
 def test_consecutive_same_speaker_words_join_into_one_block():
    words = [_w(0.0, 0.5, "good"), _w(0.5, 1.0, "morning")]
    turns = [_t(0.0, 2.0, "Speaker_0")]
    blocks = _merge_words_with_speakers(words, turns)
    assert blocks == [
        {"start_ms": 0, "end_ms": 1000, "speaker": "Speaker_0", "text": "good morning"}
    ]
 def test_speaker_change_splits_blocks():
    words = [_w(0.0, 1.0, "hi"), _w(2.1, 3.0, "hello")]
    turns = [_t(0.0, 2.0, "Speaker_0"), _t(2.0, 4.0, "Speaker_1")]
    blocks = _merge_words_with_speakers(words, turns)
    assert [b["speaker"] for b in blocks] == ["Speaker_0", "Speaker_1"]
    assert [b["text"] for b in blocks] == ["hi", "hello"]
 def test_long_silence_breaks_block_for_same_speaker():
    # >1.5s gap between two words of the same speaker forces a new block.
    words = [_w(0.0, 0.5, "one"), _w(3.0, 3.5, "two")]
    turns = [_t(0.0, 4.0, "Speaker_0")]
    blocks = _merge_words_with_speakers(words, turns)
    assert len(blocks) == 2
    assert [b["text"] for b in blocks] == ["one", "two"]
 def test_punctuation_token_joins_without_leading_space():
    words = [_w(0.0, 0.5, "hello"), _w(0.5, 0.7, ".")]
    turns = [_t(0.0, 2.0, "Speaker_0")]
    assert _merge_words_with_speakers(words, turns)[0]["text"] == "hello."
@@ -1,67 +0,0 @@
 """build_launch_command: argument assembly + the shell-injection invariant.
 The security-critical property is that every user-controllable value (repo,
 vllm_args, knobs) is shlex-quoted at the sink, so `shlex.split` cleanly reverses
 the command back into the exact token list. The vLLM pre-flight validator
 (validate.py) depends on this round-trip — these tests lock it in.
 """
 import shlex
 from app.models import Defaults, ModelDef, build_launch_command
 DEFAULTS = Defaults(port=8888, host="0.0.0.0")
 def _model(**kw) -> ModelDef:
    base = dict(display_name="X", repo="org/name", size_gb=1.0, mode="solo")
    base.update(kw)
    return ModelDef(**base)
 def test_solo_model_emits_solo_flag_and_ordered_args():
    cmd = build_launch_command("k", _model(vllm_args=["--max-model-len=1000"]), DEFAULTS)
    assert cmd == (
        "./launch-cluster.sh --solo -d exec vllm serve org/name "
        "--port=8888 --host=0.0.0.0 --max-model-len=1000"
    )
 def test_cluster_model_omits_solo_flag():
    cmd = build_launch_command("k", _model(mode="cluster", vllm_args=["-tp=2"]), DEFAULTS)
    assert " --solo " not in cmd
    assert cmd.startswith("./launch-cluster.sh -d exec vllm serve org/name")
 def test_knob_overrides_matching_bundled_flag():
    # bundled arg sets max-model-len; the knob must win (single occurrence).
    m = _model(vllm_args=["--max-model-len=1000"], knobs={"max_model_len": 65536})
    cmd = build_launch_command("k", m, DEFAULTS)
    assert "--max-model-len=65536" in cmd
    assert "--max-model-len=1000" not in cmd
 def test_repo_with_shell_metacharacters_is_quoted_not_executed():
    # build_launch_command quotes even a hostile repo (validate_repo guards the
    # API boundary; this proves the sink itself is safe in depth).
    evil = "org/name; rm -rf ~ #"
    cmd = build_launch_command("k", _model(repo=evil), DEFAULTS)
    # The raw metacharacters must not appear unquoted...
    assert "; rm -rf" not in cmd.replace(shlex.quote(evil), "")
    # ...and shlex.split must recover the repo as one literal token.
    tokens = shlex.split(cmd)
    assert evil in tokens
 def test_command_string_round_trips_through_shlex_split():
    # The invariant validate.py relies on: every arg survives quote -> split intact.
    args = ["--max-model-len=32768", "--load-format=fastsafetensors", "--note=a b c"]
    cmd = build_launch_command("k", _model(vllm_args=args), DEFAULTS)
    tokens = shlex.split(cmd)
    for a in args:
        assert a in tokens
 def test_injection_via_vllm_arg_stays_literal():
    payload = "--foo=$(touch /tmp/pwned)"
    cmd = build_launch_command("k", _model(vllm_args=[payload]), DEFAULTS)
    assert payload in shlex.split(cmd)  # preserved as one inert token
@@ -1,47 +0,0 @@
 """build_update_command: the matrix-bridge update one-liner.
 Pure string assembly, no cluster. Locks in the contract from
 docs/spark-control-integration.md (matrix-bridge repo): fetch, hard-reset to the
 release branch, then rebuild/recreate via docker compose — chained with `&&` so
 any failure (e.g. Gitea unreachable) aborts before the build and surfaces a
 non-zero exit. The clone dir must stay unquoted so a `~` expands server-side.
 """
 from app.matrix_bridge import build_update_command, _phase_for
 def test_command_is_the_contract_chain():
    cmd = build_update_command("~/matrix-bridge", "master")
    assert cmd == (
        "cd ~/matrix-bridge && "
        "git fetch origin && "
        "git reset --hard origin/master && "
        "docker compose up -d --build"
    )
 def test_fail_loud_chaining():
    # Every step is &&-chained: a failed fetch never reaches the build.
    cmd = build_update_command("~/matrix-bridge", "master")
    assert "; " not in cmd
    assert cmd.count(" && ") == 3
    assert cmd.index("git fetch") < cmd.index("git reset") < cmd.index("docker compose")
 def test_tilde_dir_left_unquoted_for_server_side_expansion():
    cmd = build_update_command("~/matrix-bridge", "master")
    assert "cd ~/matrix-bridge &&" in cmd
    assert "'~" not in cmd  # quoting would defeat the home-dir expansion
 def test_absolute_dir_and_custom_branch():
    cmd = build_update_command("/home/modelo/matrix-bridge", "phase-1")
    assert cmd.startswith("cd /home/modelo/matrix-bridge && ")
    assert "git reset --hard origin/phase-1 &&" in cmd
 def test_phase_detection_maps_known_lines():
    assert _phase_for("HEAD is now at 1a2b3c4 some commit") == "Resetting to the latest release…"
    assert _phase_for("#5 building image") == "Building the bot image…"
    assert _phase_for("Container matrix-bridge  Recreate") == "Recreating the container…"
    assert _phase_for("Already up to date.") == "No new code; rebuilding…"
    assert _phase_for("some unremarkable line") is None
@@ -1,98 +0,0 @@
 """shellsafe validators: the API-boundary whitelist behind the v0.19.0 SSH
 command-injection hardening. The quoting *sink* is covered in
 test_launch_command.py; this locks in the *boundary* — that hostile input is
 rejected early, and that a valid value passes through unchanged so callers can
 use `validate_x(v)` inline.
 """
 import pytest
 from app.shellsafe import validate_container, validate_image, validate_repo
 # Shell metacharacters that must never survive any validator — these are the
 # actual injection vectors. (Path traversal like "../" is NOT in scope here:
 # validate_image legitimately permits "/" and "." for real image refs such as
 # nvcr.io/nim/...; the defense for images is "no shell metacharacters" + the
 # quote_arg sink, not path-shape. Slash-rejection is tested directly for repo
 # and container, where "/" is disallowed.)
 HOSTILE = [
    "; rm -rf /",
    " a b",
    "$(touch pwned)",
    "`id`",
    "x|cat",
    "x&y",
    "x>out",
    "x\nrm",
 ]
 # ---- validate_repo: HF 'org/name', exactly one slash ----
@pytest.mark.parametrize("repo", [
    "RedHatAI/Qwen3.6-35B-A3B-NVFP4",  # the live production model
    "org/name",
    "a.b_c-d/x.y_z-1",
 ])
 def test_repo_valid_passes_through_unchanged(repo):
    assert validate_repo(repo) == repo
@pytest.mark.parametrize("repo", [
    "",
    "noslash",
    "a/b/c",        # two slashes
    "/name",        # empty org
    "org/",         # empty name
 ] + [f"org/name{h}" for h in HOSTILE])
 def test_repo_rejects_malformed_and_hostile(repo):
    with pytest.raises(ValueError):
        validate_repo(repo)
 # ---- validate_image: registry/path:tag@digest ----
@pytest.mark.parametrize("image", [
    "nvcr.io/nim/nvidia/parakeet-1_1b-ctc-en-us:latest",
    "ubuntu",
    "img@sha256:deadbeefcafe",
    "a.b/c:1.2_3-4",
 ])
 def test_image_valid_passes_through_unchanged(image):
    assert validate_image(image) == image
@pytest.mark.parametrize("image", [
    "",
    "-leading",     # must start alphanumeric
    ".leading",
    "/leading",
    ":leading",
    "a" * 513,      # over the 512 cap
 ] + [f"img{h}" for h in HOSTILE])
 def test_image_rejects_malformed_and_hostile(image):
    with pytest.raises(ValueError):
        validate_image(image)
 # ---- validate_container: Docker name rule, no slash ----
@pytest.mark.parametrize("name", [
    "parakeet-asr",
    "a",
    "vol_1.2-3",
 ])
 def test_container_valid_passes_through_unchanged(name):
    assert validate_container(name) == name
@pytest.mark.parametrize("name", [
    "",
    "_leading",     # underscore is not a valid first char
    "-leading",
    ".leading",
    "has/slash",    # slash not allowed in a container name
    "a" * 129,      # over the 128 cap
 ] + [f"name{h}" for h in HOSTILE])
 def test_container_rejects_malformed_and_hostile(name):
    with pytest.raises(ValueError):
        validate_container(name)
@@ -1,14 +1,3 @@
 ARCHES := x86
 # overrides to s9pk.mk must precede the include statement
 include s9pk.mk
 # Publish the built s9pk to Gitea Releases (adopters pull it with a read-only
 # token instead of being hand-sent the package). Needs GITEA_URL + GITEA_TOKEN;
 # the vX.Y.Z git tag must already be pushed. See ../scripts/gitea-release.sh.
 RELEASE_VERSION := $(shell sed -n "s/.*version: '\([^']*\)'.*/\1/p" startos/versions/v0_1_0.ts)
 .PHONY: release
 release:
 	@test -f "$(PACKAGE_ID)_x86_64.s9pk" || { echo "Build first: make x86"; exit 1; }
 	GITEA_URL="$(GITEA_URL)" GITEA_TOKEN="$(GITEA_TOKEN)" \
 	  ../scripts/gitea-release.sh "$(RELEASE_VERSION)" "$(PACKAGE_ID)_x86_64.s9pk"
@@ -40,15 +40,6 @@ const inputSpec = InputSpec.of({
    placeholder: 'your SSH username',
    masked: false,
  }),
  vllm_port: Value.text({
    name: 'vLLM port (optional)',
    description:
      "The port your vLLM server listens on, on Spark 1 — used by the health check and the chat proxy. Leave blank to use 8888, which is what the bundled launch-cluster.sh wrapper uses. Set this to 8000 (vLLM's own default) or another port if your vLLM listens elsewhere.",
    required: false,
    default: null,
    placeholder: 'leave blank for 8888',
    masked: false,
  }),
  parakeet_host: Value.text({
    name: 'Parakeet host (optional)',
    description:
@@ -128,15 +119,6 @@ const inputSpec = InputSpec.of({
    placeholder: 'e.g. crm_chunks',
    masked: false,
  }),
  matrix_bridge_user: Value.text({
    name: 'matrix-bridge bot SSH user (optional)',
    description:
      "If you run the matrix-bridge Matrix bot on Spark 2, enter the SSH user that owns its ~/matrix-bridge folder (e.g. 'modelo'). Spark Control then shows a tile to update, restart, and view logs for the bot. Leave blank if you don't run the bot — the tile stays hidden. Note: this package's SSH public key must be authorized for that user (Show Public Key action) unless it's the same as your Spark 2 user.",
    required: false,
    default: null,
    placeholder: 'e.g. modelo',
    masked: false,
  }),
  open_webui_url: Value.text({
    name: 'Open WebUI URL (optional)',
    description:
@@ -7,8 +7,6 @@ export const sparkConfigSchema = z.object({
  spark1_user: z.string().catch(''),
  spark2_host: z.string().catch(''),
  spark2_user: z.string().catch(''),
  // Optional vLLM port override (Spark 1). Blank => 8888 (launch-cluster.sh default).
  vllm_port: z.string().catch(''),
  // Optional per-service overrides. Blank => use spark2_host / spark2_user.
  parakeet_host: z.string().catch(''),
  parakeet_user: z.string().catch(''),
@@ -24,8 +22,6 @@ export const sparkConfigSchema = z.object({
  qdrant_user: z.string().catch(''),
  qdrant_container: z.string().catch(''),
  qdrant_collection: z.string().catch(''),
  // Optional matrix-bridge bot. Blank => no tile. Host reuses Spark 2.
  matrix_bridge_user: z.string().catch(''),
  // Optional Open WebUI deep-link
  open_webui_url: z.string().catch(''),
  // Optional NGC API key for pulling NIM containers from nvcr.io/nim/...
@@ -13,7 +13,6 @@ export const main = sdk.setupMain(async ({ effects }) => {
    spark1_user: '',
    spark2_host: '',
    spark2_user: '',
    vllm_port: '',
    parakeet_host: '',
    parakeet_user: '',
    parakeet_container: '',
@@ -27,7 +26,6 @@ export const main = sdk.setupMain(async ({ effects }) => {
    qdrant_user: '',
    qdrant_container: '',
    qdrant_collection: '',
    matrix_bridge_user: '',
    open_webui_url: '',
    ngc_api_key: '',
  }
@@ -51,7 +49,6 @@ export const main = sdk.setupMain(async ({ effects }) => {
        SPARK1_USER: cfg.spark1_user,
        SPARK2_HOST: cfg.spark2_host,
        SPARK2_USER: cfg.spark2_user,
        VLLM_PORT: cfg.vllm_port,
        PARAKEET_HOST: cfg.parakeet_host,
        PARAKEET_USER: cfg.parakeet_user,
        PARAKEET_CONTAINER: cfg.parakeet_container,
@@ -65,7 +62,6 @@ export const main = sdk.setupMain(async ({ effects }) => {
        QDRANT_USER: cfg.qdrant_user,
        QDRANT_CONTAINER: cfg.qdrant_container,
        QDRANT_COLLECTION: cfg.qdrant_collection,
        MATRIX_BRIDGE_USER: cfg.matrix_bridge_user,
        MODELS_OVERRIDES: '/data/models-overrides.yaml',
        SERVICES_OVERRIDES: '/data/services-overrides.yaml',
        CONNECTIVITY_LOG: '/data/connectivity.json',
@@ -1,10 +1,10 @@
 import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
 export const v0_1_0 = VersionInfo.of({
-  version: '0.22.0:0',
+  version: '0.19.0:0',
  releaseNotes: {
    en_US:
-      "v0.22.0:0 — configurable vLLM port. The port Spark Control uses to reach vLLM on Spark 1 (the health check and the chat proxy) is now a field in the Configure Sparks action, so you can point it at a vLLM that listens on a non-default port without rebuilding the package. Leave it blank to keep the previous default of 8888 — what the bundled launch-cluster.sh wrapper uses; set it to 8000 (vLLM's own default) or any other port if your vLLM listens elsewhere. Also hardened numeric-setting parsing so a blank or malformed port value falls back to its default instead of crashing daemon startup.",
+      'v0.19.0:0 — security hardening of the cluster-control surface (no change to the proxy/data APIs your other apps use). (1) Every user-supplied value that reaches an SSH command on the Sparks — model repo, vLLM args/knobs, NIM image/container, service names — is now strictly validated and shell-quoted, closing a command-injection path. (2) The Qdrant collection name in /api/search is validated so it can no longer be used to reach other collections. (3) State-changing dashboard endpoints (model swap, NIM install, service start/stop, disk delete, etc.) now require a same-origin request, blocking cross-site (CSRF) attacks from a malicious page open in your browser. The OpenAI-compatible proxies (/v1/*), the redaction gateway (/scrub, /rehydrate), /api/search, /api/audio/*, and /api/health-event are exempt, so Recap Relay, the CRM, Open WebUI and other consumers are unaffected.',
  },
  migrations: {
    up: async ({ effects }) => {},
@@ -34,24 +34,6 @@ These take effect on the **next swap to that model**. If a swap fails after this
 - Status auto-refreshes every 5 s.
 - A swap takes 3–6 minutes depending on the model. Don't close the tab — but if you do, the swap continues; reopen and you'll re-attach to the log stream.
 ## matrix-bridge bot tile (optional)
 If you run the matrix-bridge bot container on a Spark, set its SSH user in **Configure Sparks** (e.g. the user that owns `~/matrix-bridge`) and a tile appears under "Always-on services" with status, Update, Restart, Stop/Start, and View logs. Status is docker-state only (no HTTP health), so a `running` badge means the container is up, not necessarily that the bot is connected.
 The **Update** button runs `git fetch origin && git reset --hard origin/<branch> && docker compose up -d --build` from `~/matrix-bridge` as that SSH user. For it to reach your git remote:
 1. `~/matrix-bridge` must be a clone of the repo (not loose files). Gitignored secrets (`.env`, etc.) survive a `git reset --hard`.
 2. If that user has more than one SSH key, pin the remote's key so git doesn't offer the wrong one first (a common `Permission denied (publickey)` cause). In the user's `~/.ssh/config`:
   ```
   Host <your-git-host>
       Port <port>
       IdentityFile ~/.ssh/id_ed25519
       IdentitiesOnly yes
   ```
 3. Spark Control's own package key must be authorized for that SSH user (Show Public Key → add to their `authorized_keys`) unless it's the same user Spark Control already uses for that Spark.
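
As a rough illustration of why the Update chain stops at the first failure, the sketch below joins the steps with `&&`. The function name `update_command_sketch` is hypothetical and is not the package's own code; it also assumes the clone directory and branch are trusted values, since the directory is deliberately left unquoted so that a leading `~` expands on the Spark.

```python
def update_command_sketch(clone_dir: str, branch: str) -> str:
    """Join the update steps with && so a failed step stops the chain.

    Hypothetical helper for illustration only. Inputs are assumed trusted:
    nothing here is shell-quoted.
    """
    steps = [
        f"cd {clone_dir}",                      # unquoted so ~ expands remotely
        "git fetch origin",                     # fails here if the remote is unreachable
        f"git reset --hard origin/{branch}",    # discard local changes to tracked files
        "docker compose up -d --build",         # rebuild and recreate the container
    ]
    return " && ".join(steps)


print(update_command_sketch("~/matrix-bridge", "master"))
```

Because every step is chained with `&&`, an unreachable remote aborts the run before any rebuild, and the non-zero exit status is what the caller sees.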
 ## Adding a new model
 1. Add an entry to `image/models.yaml`. Required fields: `display_name`, `repo`, `size_gb`, `mode` (`solo` or `cluster`), `vllm_args`. Optional but recommended: `description` (one paragraph — what the model is, what it's good for, how it differs from others; renders below the meta tags in each card), `capabilities` (tags like `[vision, reasoning, tools]`), `expected_ready_seconds`.
@@ -1,45 +0,0 @@
 #!/usr/bin/env bash
 # Publish a built Spark Control s9pk to Gitea Releases, so adopters can pull the
 # latest package with a read-only token instead of being hand-sent the file.
 #
 #   GITEA_URL=https://gitea.example:3000 GITEA_TOKEN=<write-token> \
 #     scripts/gitea-release.sh 0.22.0:0 package/spark-control_x86_64.s9pk
 #
 # The git tag (vX.Y.Z, derived from the version) must already exist and be pushed
 # (`git tag v0.22.0 && git push gitea v0.22.0`). Re-running is idempotent: it
 # reuses an existing release for the tag and replaces a same-named asset.
 set -euo pipefail
 VERSION="${1:-}"; S9PK="${2:-}"
 [ -n "$VERSION" ] && [ -n "$S9PK" ] || {
  echo "usage: GITEA_URL=.. GITEA_TOKEN=.. $0 <version e.g. 0.22.0:0> <s9pk path>" >&2; exit 2; }
 : "${GITEA_URL:?set GITEA_URL to your Gitea base URL, e.g. https://gitea.lan:3000}"
 : "${GITEA_TOKEN:?set GITEA_TOKEN to a token with repository write access}"
 [ -f "$S9PK" ] || { echo "s9pk not found: $S9PK" >&2; exit 1; }
 TAG="v${VERSION%%:*}"                      # 0.22.0:0 -> v0.22.0
 ASSET="$(basename "$S9PK")"
 SLUG="$(git remote get-url gitea | sed -E 's#.*[:/]([^/:]+/[^/]+)\.git$#\1#')"  # grant/spark-control
 API="${GITEA_URL%/}/api/v1/repos/${SLUG}"
 AUTH=(-H "Authorization: token ${GITEA_TOKEN}")
 echo "repo ${SLUG} | tag ${TAG} | asset ${ASSET} | ${GITEA_URL}"
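 # Reuse an existing release for this tag, otherwise create one. The trailing
 # `|| true` matters: with `set -e` and pipefail, curl's non-zero exit on a 404
 # (no release yet) would otherwise abort the script before the create step.
 id="$(curl -fsS "${AUTH[@]}" "$API/releases/tags/$TAG" 2>/dev/null | jq -r '.id // empty' || true)"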
 if [ -z "$id" ]; then
  id="$(curl -fsS -X POST "${AUTH[@]}" -H 'Content-Type: application/json' \
    --data "$(jq -n --arg t "$TAG" --arg n "$VERSION" \
      '{tag_name:$t, name:$n, body:("Spark Control "+$n+". See AGENTS.md / release notes.")}')" \
    "$API/releases" | jq -r '.id')"
 fi
 [ -n "$id" ] && [ "$id" != null ] || { echo "could not obtain release id (check URL/token/tag)" >&2; exit 1; }
 # Replace a same-named asset so re-runs don't 409.
 old="$(curl -fsS "${AUTH[@]}" "$API/releases/$id/assets" | jq -r --arg n "$ASSET" '.[] | select(.name==$n) | .id')"
 [ -n "$old" ] && curl -fsS -X DELETE "${AUTH[@]}" "$API/releases/$id/assets/$old" >/dev/null || true
 curl -fsS -X POST "${AUTH[@]}" -F "attachment=@${S9PK};type=application/octet-stream" \
  "$API/releases/$id/assets?name=$ASSET" >/dev/null
 echo "published: ${GITEA_URL%/}/${SLUG}/releases/tag/${TAG}"