AGENTS.md
This file provides guidance to coding agents (Claude Code and others) when working with code in this repository. (Claude Code reads it via the CLAUDE.md symlink.)
Browser-based StartOS 0.4 package controlling a dual NVIDIA DGX Spark AI cluster: one-click vLLM model swaps, plus health, proxying, and APIs for speech (STT/diarization/TTS), embeddings, and redaction.
Subsystem guidance lives in docs/guides/ and loads when matching files are touched (Claude Code lazy-loads via .claude/rules/ symlinks; other agents read the guides directly): startos-package.md (build/versioning, package/**), fastapi-image.md (dev server/env/layout, image/**), redaction.md (vendoring + test gates), audio-speech.md (parakeet patches, cluster-container footguns, audio testing). Read docs/guides/audio-speech.md before touching the Sparks' containers over SSH — ops sessions don't trip the path scoping.
Inbox check: At session start, if `~/Projects/standards/INBOX.md` exists, scan it for items tagged `(spark-control)` and surface them before proposing next steps; triage with `/triage`.
Stack
- Two halves, always coordinated:
  - `image/`: standalone FastAPI app (Python ≥3.11; UI on port 9999; vanilla HTML/CSS/JS).
  - `package/`: StartOS 0.4 wrapper (TypeScript) that ships the Docker image as an s9pk.
- Build host needs `start-cli`, Node ≥22 + npm, and Docker.
- Cluster runtimes live on the Sparks, not in this repo (`spark-vllm-docker`, the parakeet/kokoro/embeddings containers). This repo is the controller; it reaches them over SSH + HTTP.
- Sparks are ARM64 (GB10 Grace-Blackwell, sm_121, CUDA 13). Services: vLLM `:8888` (Spark 1); `parakeet-asr` `:8000`, Kokoro TTS `:8880`, bge-m3 embeddings + Qdrant (Spark 2). See `docs/` for API contracts.
Commands (headlines — details in the scoped rules)
(cd package && make x86) # build the s9pk; make install sideloads (restarts live service — ask first)
(cd image && uvicorn app.server:app --port 9999) # local dev — needs env vars, see fastapi-image rule
(cd image && .venv/bin/python -m pytest) # offline unit suite (launch-cmd injection, label-merge)
(cd image && .venv/bin/python -m app.redaction.test_gateway) # offline redaction suite 1
(cd image && .venv/bin/python app/redaction/test_scrub_leak.py) # offline redaction suite 2
./scripts/test-audio-with-speakers.sh <audio-file> # e2e audio — hits the LIVE cluster
Layout
- `image/app/`: FastAPI app (`server.py` entry, routers in sibling modules, `static/` dashboard UI).
- `package/startos/`: StartOS manifest, interfaces, actions, version + release notes.
- `docs/`: `AUDIO_API.md`, `EMBEDDINGS.md`, `REDACTION_GATEWAY.md`, `COORDINATION.md` (consumer-facing API refs; update with API changes).
- `README.md` (overview), `HANDOFF.md` (fresh-user install guide), `runbook.md` (ops notes), `known-issues.md`, `ROADMAP.md` (longer-term backlog; items move into "Current state" below when picked up).
Conventions
- Every shipped change = version bump + release notes + rebuilt s9pk (version format `X.Y.Z:N`; details in the startos-package rule).
- Commit messages: `vX.Y.Z:N - short lowercase summary`. Never add a Co-Authored-By / Claude attribution trailer.
- The package owner is non-technical: explain infra effects in plain English and get an explicit go/no-go before mutating the cluster.
- New external-facing endpoints get documented in `docs/` and noted in release notes for downstream app developers (Recap Relay, Ten31 Transcripts, CRM, Signal Engine consume these APIs).
- Doc layout: `AGENTS.md` is the canonical file; `CLAUDE.md` is a symlink to it (don't overwrite it). Subsystem guides are real files in `docs/guides/<topic>.md` (with `paths:` frontmatter); `.claude/rules/<topic>.md` are relative symlinks into them. A new guide = add `docs/guides/<topic>.md`, symlink it from `.claude/rules/`, and add an index line above.
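The commit-message convention above lends itself to a mechanical check. The following is a minimal sketch, not something this repo ships: `check_commit_message` is a hypothetical helper, and reading "short lowercase summary" as "no uppercase ASCII letters" is an assumption. It only catches the Co-Authored-By trailer, not every form of attribution.

```python
import re

# Subject shape taken from the convention: vX.Y.Z:N - short lowercase summary.
# Assumption: "lowercase" is enforced as "contains no uppercase ASCII letters".
_SUBJECT = re.compile(r"^v\d+\.\d+\.\d+:\d+ - [^A-Z\n]+$")
_TRAILER = re.compile(r"^co-authored-by:", re.IGNORECASE | re.MULTILINE)


def check_commit_message(message: str) -> list[str]:
    """Return a list of convention violations; an empty list means the message passes."""
    problems: list[str] = []
    subject = message.split("\n", 1)[0]
    if not _SUBJECT.match(subject):
        problems.append("subject must look like 'vX.Y.Z:N - short lowercase summary'")
    if _TRAILER.search(message):
        problems.append("remove the Co-Authored-By trailer")
    return problems


if __name__ == "__main__":
    print(check_commit_message("v0.25.0:0 - coordination layer"))
    print(check_commit_message("v0.25.0 - Add Lock\n\nCo-Authored-By: Someone <s@example.com>"))
```

A check like this could run as a local commit-msg hook; whether that is worth wiring up is a judgment call for the maintainers.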
Always / Never (cluster-wide)
- Always confirm with the user before swap/stop/restart of anything on the live cluster. Read-only probes and dry-runs are fine without asking.
- Always use the Spark's IP for HTTP probes: `.local` mDNS names can resolve IPv6-first and hang httpx (vLLM and friends bind IPv4 only). Never trust `.local` hostnames inside HTTP client code.
- Always pass `SSH_KEY_PATH` / `-i <key>` explicitly in scripted SSH; non-interactive shells have no ssh-agent identities.
- Never route audio or transcripts to cloud services: speech stays on the LAN. (Scrubbed text via `/scrub` is the only sanctioned path toward frontier models.)
- Never commit owner-specific hostnames, IPs, usernames, or names into package strings, UI text, or docs: this package gets shared; use placeholders. Canonical set: `<spark-1-ip>` / `<spark-2-ip>`, `<spark-1-host>` / `<spark-2-host>`, `<spark-user>`, and generic example names (Alice/Bob).
- Never install `cuda-python` in `parakeet-asr`: it crashes real decode on this GPU/CUDA-13 stack; full story in the audio-speech rule.
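Two of the rules above (IPv4 literals for HTTP probes, an explicit key for scripted SSH) can be enforced at the point where a URL or command is built. This is a minimal sketch using only the standard library; `require_ipv4_probe_url` and `scripted_ssh_argv` are hypothetical names, not functions from `image/app/`, and the addresses are documentation-range examples.

```python
import ipaddress
from urllib.parse import urlsplit


def require_ipv4_probe_url(url: str) -> str:
    """Return url unchanged if its host is an IPv4 literal; raise ValueError otherwise."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host in probe URL: {url!r}")
    try:
        ipaddress.IPv4Address(host)
    except ipaddress.AddressValueError:
        # Rejects .local names, other hostnames, and IPv6 literals alike.
        raise ValueError(f"probe URL must use an IPv4 literal, got host {host!r}") from None
    return url


def scripted_ssh_argv(key_path: str, user: str, host: str, remote_cmd: str) -> list[str]:
    """Build an ssh argv that always names the identity file explicitly.

    remote_cmd is passed through untouched, so it must already be validated/quoted.
    """
    if not key_path:
        raise ValueError("SSH key path is required; scripted shells have no ssh-agent identities")
    return ["ssh", "-i", key_path, f"{user}@{host}", remote_cmd]


if __name__ == "__main__":
    print(require_ipv4_probe_url("http://192.0.2.10:8888/health"))
    print(scripted_ssh_argv("/path/to/key", "spark-user", "192.0.2.10", "docker ps"))
```

Failing loudly at construction time keeps a stray `.local` name from turning into a hung probe later.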
Current state
- Live service runs v0.22.0:0 (installed and serving). v0.25.0:0 is the latest in tree: coordination layer (swap lock + webhook + schedule registry); built/typechecked clean, NOT yet committed/tagged/installed (this session's work). It stacks on two releases also staged-but-not-live: v0.24.0:0 (configurable topology; committed `26070eb`, tagged, pushed to `gitea/master`) and v0.23.0:0 (local/fine-tuned models; committed/tagged/Gitea-published). Close-out backlog for all of these: (a) commit/tag/push v0.25.0:0; (b) `make release` to publish s9pk assets to Gitea Releases (needs `GITEA_URL` + write `GITEA_TOKEN`, neither in env); (c) the live install. Installs blocked on the same mDNS issue (P3 line below). Working features: swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel; per-Spark SSH-key copy + WireGuard `VPN <ip>` hardware-card badge; configurable vLLM port (blank ⇒ 8888); configurable topology (vLLM container name, hide-services list, second-Spark vLLM monitor; v0.24.0:0); local/fine-tuned models (v0.23.0:0); coordination layer (v0.25.0:0: GPU swap reservation lock with `423`-enforced manual-swap pause + human Release override, swap_complete/swap_failed webhook, read-only schedule registry; API in `docs/COORDINATION.md`). Everything from v0.23 onward lands live once the installs go through. Spark 2 audio stack healthy. Security hardening (v0.19.0:0: shellsafe SSH-injection guard, Qdrant path-injection, same-origin CSRF guard) shipped and stable; evidence in `EVALUATION.md`.
- matrix-bridge bot tile (done, v0.21.0:1, verified live): `bot`-kind service tile with a status badge from docker-state only (no HTTP port), plus Update / Restart / Stop/Start / View logs. Code: `app/matrix_bridge.py` + `/api/matrix-bridge/{update,logs}` (update streams; 25-min cap; fail-loud). Driven directly as `modelo` on Spark 2 (no `sudo -iu`; spark2 has no passwordless sudo). User is a blank-default Configure-Sparks field (`matrix_bridge_user`); blank → tile hidden (portable). Host reuses `spark2_host` (`192.168.1.87` = the bot's box `spark-32d0`); container/dir/branch are env-overridable defaults. Load-bearing ops dep: Update's `git fetch` runs as `modelo`, which needs `modelo`'s `~/.ssh/config` pinning the Gitea deploy key with `IdentitiesOnly yes`; otherwise the wrong key is offered and Gitea denies (publickey). Optional next, only if the bot dev asks: Docker `HEALTHCHECK` for running-but-disconnected detection (spec §Note).
- Tests: offline pytest harness in `image/tests/`, run with `cd image && .venv/bin/python -m pytest` (124 passing). Covers `build_launch_command` (incl. the shell-injection round-trip + local-model bind-mount), the transcript↔diarizer label-merge, the `shellsafe` validators, `matrix_bridge.build_update_command` (+ phase detection), the configurable-topology layer (`test_topology.py`), and the coordination layer (`test_coordination.py`: swap-lock lifecycle/expiry/token-auth, schedule-registry CRUD, webhook payload + HMAC signature; `now` is injected into the lock so expiry is tested without sleeping). Mock-heavy swap/proxy/endpoint tests deliberately skipped (low ROI). Redaction + live-audio suites remain standalone scripts.
- Signal Engine "flakiness": diagnosed as not a server bug but transient 1–4s unresponsiveness while the single GPU is busy. Client-side remedy (in-flight cap 2 / ceiling 3 / retry-on-timeout+503) drafted and forwarded to that dev (owner confirmed 2026-06-15). Awaiting word on whether they want the measured concurrency knee.
- Stance (decided, not built): no public interface / no API-token auth — LAN + WireGuard/Tailscale split-tunnel only; the CSRF guard covers the browser-driven vector.
- Known limits: `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers; matrix-bridge badge won't visibly flip on a fast `docker restart` (status re-checked only after the command returns).
- Infra gotcha (safety): passwordless sudo is NOT configured on spark2, so design unprivileged probes for any Spark feature (the badge uses `ip`, not `sudo wg show`). spark2 sits on the `starttunnel` WireGuard subnet (10.59.211.6/24, survives reboot). Owner declined SSH-key rotation after the 2026-06-12 history scrub (only the key name leaked); don't re-flag.
- Hosting: self-hosted Gitea, remote `gitea`, branch `master`, over SSH; push after committing. (Wart: commit `8d839e3` is mislabeled `v0.13.0:4` but contains through v0.18.0:0.)
- Next (committed 2026-06-17): OpenClaw/Johnny-5 coexistence epic (full plan + design stance in `ROADMAP.md` → "Cluster coordination"). Stance: Spark Control = control plane / GPU arbiter, not a job runner; business cron jobs live in separate services that call its swap API (swaps are already API-driven via `POST /api/swap`). Sequence:
  - (1) configurable `VLLM_PORT`: SHIPPED v0.22.0:0 (Configure-Sparks field, blank ⇒ 8888; + `_env_int` hardening in `config.py` so a blank/bad port no longer crashes startup, killing a P3 tech-debt item). Committed `136a471`, pushed, tagged `v0.22.0`, rebuilt clean, installed, and published to the self-hosted Gitea Releases 2026-06-17 (`make release` → `scripts/gitea-release.sh`, takes `GITEA_URL` + a write token). Distribution model (decided 2026-06-17): Gitea Releases + a read-only token the adopter's agent uses to pull the latest s9pk (`GET /api/v1/repos/grant/spark-control/releases/latest` → download the `.s9pk` asset → sideload). Note: Gitea returns `browser_download_url` on its `.local` `ROOT_URL`, which won't resolve off-LAN; a remote adopter pulls via whatever address reaches the Gitea (the WireGuard IP).
  - (2) local-path/fine-tuned models: DONE in tree, staged as v0.23.0:0 (`ModelDef.local_path` + exactly-one-source validator; swap bind-mounts the dir at the same container path via the launch script's `VLLM_SPARK_EXTRA_DOCKER_ARGS` hook, no `launch-cluster.sh` change; "+ Add local model" UI form + `local` badge; `validate_local_path`; disk-delete refused for local; 94 tests pass). Reviewer-agent pass done, findings addressed (path validation + chat-template-location guard folded into the `ModelDef` validator so YAML/override entries are checked too; `_merge_overrides` skips a bad entry instead of failing the whole catalog; `VLLM_SPARK_EXTRA_DOCKER_ARGS` contract documented in `runbook.md`). Committed `e783653`, tagged `v0.23.0`, built clean, published to Gitea Releases, but `make install` to the live Start9 FAILED: `immense-voyage.local` wasn't resolving via mDNS from the Mac (server up at `192.168.1.72`; `start-cli -H <ip>` reaches it but returns UNAUTHORIZED, auth bound to the registered `.local` host). FINISH-HERE: flush mDNS (`sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder`) or add a hosts entry, then re-run `cd package && make install` (details in runbook → "Sideload can't reach the server").
  - (3) configurable topology: DONE in tree, staged as v0.24.0:0 (built clean; committed `26070eb`, tagged, pushed to `gitea/master`; not yet installed). Three optional Configure-Sparks knobs: vLLM container name (`VLLM_CONTAINER`, blank ⇒ `vllm_node`, threaded into the swap log-tail + validator exec via `quote_arg`); "services to hide" (`DISABLED_SERVICES` comma list → `Settings.disabled_services` frozenset, skipped by `services_from_settings`, the `check_*` probes, deep-health `run_all`, and connectivity logging; kills the Parakeet-on-8000 collision); second-Spark vLLM monitor via a `kind: vllm` custom service in `services-overrides.yaml` (`probe_vllm_endpoint` shared with `check_vllm`). `/api/endpoints` gained a `disabled` flag; the health-dot hides when disabled. 102 tests pass (+8 in `test_topology.py`). Swap mechanism deliberately NOT generalized to raw `docker run` (that's coordination, item 4). Install pending; same mDNS situation as v0.23.0.
  - (4) coordination layer: DONE in tree, staged as v0.25.0:0 (brought forward 2026-06-17 on request rather than waiting for our own automation). `image/app/coordination.py` + `docs/COORDINATION.md`: swap reservation lock (`GET/POST/DELETE /api/swap/lock`, secret token, `423`-enforced in `post_swap`, TTL-bounded in-memory, `?force=true` human override, dashboard banner + swap-button pause), swap webhook (`swap_complete`/`swap_failed` fired outside the swap lock from `SwapManager._run`, optional HMAC `X-Spark-Signature`, Configure-Sparks URL+secret), schedule registry (`GET/POST/DELETE /api/schedule`, read-only "Scheduled jobs" panel). +22 tests (`test_coordination.py`), bringing the suite from 102 to 124. Built/typechecked clean; commit + install pending.
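The swap reservation lock in item (4) is described as TTL-bounded, in-memory, token-authenticated, force-releasable, and testable with an injected clock. The sketch below illustrates that shape only; it is not the code in `image/app/coordination.py`. `SwapLock`, `Reservation`, and `swap_blocked_status` are hypothetical names, and issuing a fresh token per reservation is an assumption (the real layer may use a single configured secret). `docs/COORDINATION.md` remains the authority on the actual API.

```python
import hmac
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Reservation:
    holder: str
    token: str
    expires_at: float


class SwapLock:
    """In-memory, TTL-bounded reservation with an injectable clock (illustrative only)."""

    def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
        self._now = now
        self._current: Optional[Reservation] = None

    def current(self) -> Optional[Reservation]:
        """Single read: the live reservation, or None if absent or expired."""
        res = self._current
        if res is not None and self._now() >= res.expires_at:
            self._current = res = None
        return res

    def acquire(self, holder: str, ttl_seconds: float) -> Optional[Reservation]:
        """Reserve the GPU for ttl_seconds; returns None if someone else holds it."""
        if self.current() is not None:
            return None
        res = Reservation(holder, secrets.token_urlsafe(16), self._now() + ttl_seconds)
        self._current = res
        return res

    def release(self, token: str = "", force: bool = False) -> bool:
        """Release with the holder's token, or unconditionally when force is set."""
        res = self.current()
        if res is None:
            return True  # nothing held (or it already expired)
        if force or hmac.compare_digest(token.encode(), res.token.encode()):
            self._current = None
            return True
        return False


def swap_blocked_status(lock: SwapLock) -> Optional[int]:
    """423 (Locked) while a reservation is live, else None (swap may proceed)."""
    return 423 if lock.current() is not None else None


if __name__ == "__main__":
    clock = [0.0]  # fake clock, so expiry needs no sleeping
    lock = SwapLock(now=lambda: clock[0])
    held = lock.acquire("nightly-job", ttl_seconds=60)
    assert held is not None and swap_blocked_status(lock) == 423
    clock[0] += 60  # TTL elapses
    assert swap_blocked_status(lock) is None
```

Reading the lock once per swap request, as `current()` does here, keeps the expiry check and the gate decision on the same snapshot.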
Still-open older threads: audio concurrency sweep (only if the Signal Engine dev wants the knee; needs a quiet window); optional matrix-bridge Docker `HEALTHCHECK` if the bot dev asks; Parakeet long-audio guard deferred (rationale in ROADMAP).
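For consumers of the swap webhook described under the coordination layer, the optional HMAC signature could be produced and checked roughly as follows. This is a sketch under stated assumptions: the event names and the `X-Spark-Signature` header come from the notes above, but the digest (SHA-256, hex-encoded), the payload fields, and the helper names `build_webhook` / `verify_webhook` are illustrative guesses. Check `docs/COORDINATION.md` for the real contract.

```python
import hashlib
import hmac
import json
from typing import Optional


def build_webhook(event: str, model: str, secret: Optional[str]) -> tuple[bytes, dict[str, str]]:
    """Serialize a payload and, if a secret is configured, sign the exact bytes sent."""
    # Assumption: payload shape is illustrative; only the event names come from the notes.
    body = json.dumps({"event": event, "model": model}, sort_keys=True).encode()
    headers = {"Content-Type": "application/json"}
    if secret:
        # Assumption: HMAC-SHA256, hex digest.
        headers["X-Spark-Signature"] = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return body, headers


def verify_webhook(body: bytes, signature: str, secret: str) -> bool:
    """Receiver side: recompute over the raw body and compare in constant time."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature.encode())


if __name__ == "__main__":
    body, headers = build_webhook("swap_complete", "example-model", "shared-secret")
    print(verify_webhook(body, headers["X-Spark-Signature"], "shared-secret"))
```

Receivers should verify against the raw request bytes before parsing JSON, since re-serializing can change whitespace or key order and break the signature.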