spark-control/AGENTS.md
Keysat 7ae6ab3ba8 v0.25.0:0 - cluster coordination layer (swap lock + webhook + schedule registry)
GPU-arbiter safety layer for when automation, not just the dashboard, swaps
models:
- swap reservation lock (POST/GET/DELETE /api/swap/lock); 423-enforced in
  post_swap via a single-read gate, TTL-bounded, secret-token auth, human
  force-release override + dashboard banner
- swap webhook (swap_complete/swap_failed) fired outside the swap lock, optional
  HMAC signature, configurable URL+secret
- read-only schedule registry (GET/POST/DELETE /api/schedule) + dashboard panel

New module image/app/coordination.py; docs/COORDINATION.md for consumers; 22
offline tests in test_coordination.py.
2026-06-18 07:07:08 -05:00

AGENTS.md

This file provides guidance to coding agents (Claude Code and others) when working with code in this repository. (Claude Code reads it via the CLAUDE.md symlink.)

Browser-based StartOS 0.4 package controlling a dual NVIDIA DGX Spark AI cluster: one-click vLLM model swaps, plus health, proxying, and APIs for speech (STT/diarization/TTS), embeddings, and redaction.

Subsystem guidance lives in docs/guides/ and loads when matching files are touched (Claude Code lazy-loads via .claude/rules/ symlinks; other agents read the guides directly): startos-package.md (build/versioning, package/**), fastapi-image.md (dev server/env/layout, image/**), redaction.md (vendoring + test gates), audio-speech.md (parakeet patches, cluster-container footguns, audio testing). Read docs/guides/audio-speech.md before touching the Sparks' containers over SSH — ops sessions don't trip the path scoping.

Inbox check: At session start, if ~/Projects/standards/INBOX.md exists, scan it for items tagged (spark-control) and surface them before proposing next steps; triage with /triage.

Stack

  • Two halves, always coordinated:
    • image/ — standalone FastAPI app (Python ≥3.11; UI on port 9999; vanilla HTML/CSS/JS).
    • package/ — StartOS 0.4 wrapper (TypeScript) that ships the Docker image as an s9pk.
  • Build host needs start-cli, Node ≥22 + npm, and Docker.
  • Cluster runtimes live on the Sparks, not in this repo (spark-vllm-docker, the parakeet/kokoro/embeddings containers). This repo is the controller; it reaches them over SSH + HTTP.
  • Sparks are ARM64 (GB10 Grace-Blackwell, sm_121, CUDA 13). Services: vLLM :8888 (Spark 1); parakeet-asr :8000, Kokoro TTS :8880, bge-m3 embeddings + Qdrant (Spark 2). See docs/ for API contracts.

Commands (headlines — details in the scoped rules)

(cd package && make x86)                                  # build the s9pk; make install sideloads (restarts live service — ask first)
(cd image && uvicorn app.server:app --port 9999)          # local dev — needs env vars, see fastapi-image rule
(cd image && .venv/bin/python -m pytest)                          # offline unit suite (launch-cmd injection, label-merge)
(cd image && .venv/bin/python -m app.redaction.test_gateway)      # offline redaction suite 1
(cd image && .venv/bin/python app/redaction/test_scrub_leak.py)   # offline redaction suite 2
./scripts/test-audio-with-speakers.sh <audio-file>        # e2e audio — hits the LIVE cluster

Layout

  • image/app/ — FastAPI app (server.py entry, routers in sibling modules, static/ dashboard UI).
  • package/startos/ — StartOS manifest, interfaces, actions, version + release notes.
  • docs/AUDIO_API.md, EMBEDDINGS.md, REDACTION_GATEWAY.md, COORDINATION.md (consumer-facing API refs; update with API changes).
  • README.md (overview), HANDOFF.md (fresh-user install guide), runbook.md (ops notes), known-issues.md, ROADMAP.md (longer-term backlog — items move into "Current state" below when picked up).

Conventions

  • Every shipped change = version bump + release notes + rebuilt s9pk (version format X.Y.Z:N; details in the startos-package rule).
  • Commit messages: vX.Y.Z:N - short lowercase summary. Never add a Co-Authored-By / Claude attribution trailer.
  • The package owner is non-technical: explain infra effects in plain English and get an explicit go/no-go before mutating the cluster.
  • New external-facing endpoints get documented in docs/ and noted in release notes for downstream app developers (Recap Relay, Ten31 Transcripts, CRM, Signal Engine consume these APIs).
  • Doc layout: AGENTS.md is the canonical file; CLAUDE.md is a symlink to it (don't overwrite it). Subsystem guides are real files in docs/guides/<topic>.md (with paths: frontmatter); .claude/rules/<topic>.md are relative symlinks into them. A new guide = add docs/guides/<topic>.md, symlink it from .claude/rules/, and add an index line above.
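The commit-message convention above can be checked mechanically. The following is a minimal sketch, not part of the repository: the function name `check_commit_message` and the strict reading of "lowercase" (the whole summary must equal its lowercased form) are assumptions made for illustration.

```python
import re

# Subject shape "vX.Y.Z:N - summary", per the convention stated above.
COMMIT_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+):(\d+) - (\S.*)$")
FORBIDDEN_TRAILER = "co-authored-by:"


def check_commit_message(message: str) -> list[str]:
    """Return a list of problems; an empty list means the message passes.

    Hypothetical helper: the repository does not ship this function.
    """
    problems = []
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    match = COMMIT_RE.match(subject)
    if match is None:
        problems.append("subject must look like 'vX.Y.Z:N - summary'")
    elif match.group(5) != match.group(5).lower():
        # Assumption: "lowercase summary" means no uppercase letters at all.
        problems.append("summary should be lowercase")
    # Scan the body for the attribution trailer the convention forbids.
    if any(line.strip().lower().startswith(FORBIDDEN_TRAILER) for line in lines[1:]):
        problems.append("remove the Co-Authored-By trailer")
    return problems


ok = check_commit_message(
    "v0.25.0:0 - cluster coordination layer (swap lock + webhook + schedule registry)"
)
bad = check_commit_message("v0.25.0:0 - add lock\n\nCo-Authored-By: Someone <s@example.com>")
```

Here `ok` is an empty list and `bad` holds one entry for the trailer. A check like this could run as a commit-msg hook, though whether the project wants one is not stated in this file.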

Always / Never (cluster-wide)

  • Always confirm with the user before swap/stop/restart of anything on the live cluster. Read-only probes and dry-runs are fine without asking.
  • Always use the Spark's IP for HTTP probes — .local mDNS names can resolve IPv6-first and hang httpx (vLLM and friends bind IPv4 only). Never trust .local hostnames inside HTTP client code.
  • Always pass SSH_KEY_PATH / -i <key> explicitly in scripted SSH; non-interactive shells have no ssh-agent identities.
  • Never route audio or transcripts to cloud services — speech stays on the LAN. (Scrubbed text via /scrub is the only sanctioned path toward frontier models.)
  • Never commit owner-specific hostnames, IPs, usernames, or names into package strings, UI text, or docs — this package gets shared; use placeholders. Canonical set: <spark-1-ip> / <spark-2-ip>, <spark-1-host> / <spark-2-host>, <spark-user>, and generic example names (Alice/Bob).
  • Never install cuda-python in parakeet-asr — crashes real decode on this GPU/CUDA-13 stack; full story in the audio-speech rule.

Current state

  • Live service runs v0.22.0:0 (installed and serving). v0.25.0:0 is the latest in tree: the coordination layer (swap lock + webhook + schedule registry); built/typechecked clean, NOT yet committed/tagged/installed (this session's work). It stacks on two releases that are also staged but not live: v0.24.0:0 (configurable topology; committed 26070eb, tagged, pushed to gitea/master) and v0.23.0:0 (local/fine-tuned models; committed/tagged/Gitea-published). Close-out backlog for all of these: (a) commit/tag/push v0.25.0:0; (b) make release to publish s9pk assets to Gitea Releases (needs GITEA_URL + write GITEA_TOKEN, neither in env); (c) the live install. Installs are blocked on the same mDNS issue (see the FINISH-HERE note under "Next" below). Working features: swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); /scrub + /rehydrate; label-merge incl. dual-channel; per-Spark SSH-key copy + WireGuard VPN <ip> hardware-card badge; configurable vLLM port (blank ⇒ 8888); configurable topology (vLLM container name, hide-services list, second-Spark vLLM monitor; v0.24.0:0); local/fine-tuned models (v0.23.0:0); coordination layer (v0.25.0:0: GPU swap reservation lock with 423-enforced manual-swap pause + human Release override, swap_complete/swap_failed webhook, read-only schedule registry; API in docs/COORDINATION.md). Everything from v0.23 onward lands live once the installs go through. Spark 2 audio stack healthy. Security hardening (v0.19.0:0: shellsafe SSH-injection guard, Qdrant path-injection, same-origin CSRF guard) shipped and stable; evidence in EVALUATION.md.
  • matrix-bridge bot tile (done, v0.21.0:1, verified live): bot-kind service tile — status badge from docker-state only (no HTTP port), plus Update / Restart / Stop/Start / View logs. Code: app/matrix_bridge.py + /api/matrix-bridge/{update,logs} (update streams; 25-min cap; fail-loud). Driven directly as modelo on Spark 2 (no sudo -iu — spark2 has no passwordless sudo). User is a blank-default Configure-Sparks field (matrix_bridge_user); blank → tile hidden (portable). Host reuses spark2_host (192.168.1.87 = the bot's box spark-32d0); container/dir/branch are env-overridable defaults. Load-bearing ops dep: Update's git fetch runs as modelo, which needs modelo's ~/.ssh/config pinning the Gitea deploy key with IdentitiesOnly yes — else the wrong key is offered and Gitea denies (publickey). Optional next, only if the bot dev asks: Docker HEALTHCHECK for running-but-disconnected detection (spec §Note).
  • Tests: offline pytest harness in image/tests/; run it with cd image && .venv/bin/python -m pytest (124 passing). Covers build_launch_command (incl. the shell-injection round-trip + local-model bind-mount), the transcript↔diarizer label-merge, the shellsafe validators, matrix_bridge.build_update_command (+ phase detection), the configurable-topology layer (test_topology.py), and the coordination layer (test_coordination.py: swap-lock lifecycle/expiry/token-auth, schedule-registry CRUD, webhook payload + HMAC signature; now is injected into the lock so expiry is tested without sleeping). Mock-heavy swap/proxy/endpoint tests deliberately skipped (low ROI). Redaction + live-audio suites remain standalone scripts.
  • Signal Engine "flakiness": diagnosed as not a server bug — transient 14s unresponsiveness while the single GPU is busy. Client-side remedy (in-flight cap 2 / ceiling 3 / retry-on-timeout+503) drafted and forwarded to that dev (owner confirmed 2026-06-15). Awaiting whether they want the measured concurrency knee.
  • Stance (decided, not built): no public interface / no API-token auth — LAN + WireGuard/Tailscale split-tunnel only; the CSRF guard covers the browser-driven vector.
  • Known limits: /health blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers; matrix-bridge badge won't visibly flip on a fast docker restart (status re-checked only after the command returns).
  • Infra gotcha (safety): passwordless sudo is NOT configured on spark2 — design unprivileged probes for any Spark feature (the badge uses ip, not sudo wg show). spark2 sits on the starttunnel WireGuard subnet (10.59.211.6/24, survives reboot). Owner declined SSH-key rotation after the 2026-06-12 history scrub (only the key name leaked) — don't re-flag.
  • Hosting: self-hosted Gitea — remote gitea, branch master, over SSH; push after committing. (Wart: commit 8d839e3 is mislabeled v0.13.0:4 but contains through v0.18.0:0.)
  • Next (committed 2026-06-17): OpenClaw/Johnny-5 coexistence epic (full plan + design stance in ROADMAP.md → "Cluster coordination"). Stance: Spark Control = control plane / GPU arbiter, not a job runner; business cron jobs live in separate services that call its swap API (swaps are already API-driven via POST /api/swap). Sequence:
    • (1) configurable VLLM_PORT: SHIPPED v0.22.0:0 (Configure-Sparks field, blank ⇒ 8888; + _env_int hardening in config.py so a blank/bad port no longer crashes startup, killing a P3 tech-debt item). Committed 136a471, pushed, tagged v0.22.0, rebuilt clean, installed, and published to the self-hosted Gitea Releases 2026-06-17 (make release → scripts/gitea-release.sh, takes GITEA_URL + a write token). Distribution model (decided 2026-06-17): Gitea Releases + a read-only token the adopter's agent uses to pull the latest s9pk (GET /api/v1/repos/grant/spark-control/releases/latest → download the .s9pk asset → sideload). Note: Gitea returns browser_download_url on its .local ROOT_URL, which won't resolve off-LAN, so a remote adopter pulls via whatever address reaches the Gitea (the WireGuard IP).
    • (2) local-path/fine-tuned models: DONE in tree, staged as v0.23.0:0 (ModelDef.local_path + exactly-one-source validator; swap bind-mounts the dir at the same container path via the launch script's VLLM_SPARK_EXTRA_DOCKER_ARGS hook, no launch-cluster.sh change; "+ Add local model" UI form + local badge; validate_local_path; disk-delete refused for local; 94 tests pass). Reviewer-agent pass done, findings addressed (path validation + chat-template-location guard folded into the ModelDef validator so YAML/override entries are checked too; _merge_overrides skips a bad entry instead of failing the whole catalog; VLLM_SPARK_EXTRA_DOCKER_ARGS contract documented in runbook.md). Committed e783653, tagged v0.23.0, built clean, published to Gitea Releases, but make install to the live Start9 FAILED: immense-voyage.local wasn't resolving via mDNS from the Mac (server up at 192.168.1.72; start-cli -H <ip> reaches it but returns UNAUTHORIZED, auth bound to the registered .local host). FINISH-HERE: flush mDNS (sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder) or add a hosts entry, then re-run cd package && make install (details in runbook → "Sideload can't reach the server").
    • (3) configurable topology: DONE in tree, staged as v0.24.0:0 (built clean; committed 26070eb, tagged, and pushed per the status bullet above; not yet installed). Three optional Configure-Sparks knobs: vLLM container name (VLLM_CONTAINER, blank ⇒ vllm_node, threaded into the swap log-tail + validator exec via quote_arg); "services to hide" (DISABLED_SERVICES comma list → Settings.disabled_services frozenset, skipped by services_from_settings, the check_* probes, deep-health run_all, and connectivity logging; this kills the Parakeet-on-8000 collision); second-Spark vLLM monitor via a kind: vllm custom service in services-overrides.yaml (probe_vllm_endpoint shared with check_vllm). /api/endpoints gained a disabled flag; the health-dot hides when disabled. 102 tests pass (+8 in test_topology.py). Swap mechanism deliberately NOT generalized to raw docker run (that's coordination, item 4). Install pending: same mDNS situation as v0.23.0.
    • (4) coordination layer: DONE in tree, staged as v0.25.0:0 (brought forward 2026-06-17 on request rather than waiting for our own automation). image/app/coordination.py + docs/COORDINATION.md: swap reservation lock (GET/POST/DELETE /api/swap/lock, secret token, 423-enforced in post_swap, TTL-bounded in-memory, ?force=true human override, dashboard banner + swap-button pause), swap webhook (swap_complete/swap_failed fired outside the swap lock from SwapManager._run, optional HMAC X-Spark-Signature, Configure-Sparks URL+secret), schedule registry (GET/POST/DELETE /api/schedule, read-only "Scheduled jobs" panel). +22 tests (test_coordination.py). Built/typechecked clean; commit + install pending.
    • Still-open older threads: audio concurrency sweep (only if the Signal Engine dev wants the knee; needs a quiet window); optional matrix-bridge Docker HEALTHCHECK if the bot dev asks; Parakeet long-audio guard deferred (rationale in ROADMAP).
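The swap reservation lock described above (TTL-bounded, in-memory, secret token, force override, clock injected for tests) can be illustrated with a small sketch. This is not the code in image/app/coordination.py: the class name `SwapLock`, its method names, and the token length are assumptions made for illustration, and the HTTP layer that answers 423 while the lock is held is left out.

```python
import secrets
import time
from typing import Callable, Optional


class SwapLock:
    """In-memory reservation lock with a TTL and an injectable clock (sketch)."""

    def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
        self._now = now  # injected so tests can advance time without sleeping
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def held(self) -> bool:
        # An expired reservation counts as free; no background timer is needed.
        return self._token is not None and self._now() < self._expires_at

    def acquire(self, ttl_seconds: float) -> Optional[str]:
        """Reserve the lock; return a secret token, or None if already held."""
        if self.held():
            return None
        self._token = secrets.token_urlsafe(16)
        self._expires_at = self._now() + ttl_seconds
        return self._token

    def release(self, token: Optional[str] = None, force: bool = False) -> bool:
        """Release with the holder's token, or unconditionally when force=True."""
        if not self.held():
            return False
        if force or (token is not None and secrets.compare_digest(token, self._token)):
            self._token = None
            return True
        return False


# Usage with a fake clock: expiry is exercised by moving the clock, not by waiting.
clock = [0.0]
lock = SwapLock(now=lambda: clock[0])
first = lock.acquire(ttl_seconds=60)
second = lock.acquire(ttl_seconds=60)  # refused while the reservation is live
clock[0] = 61.0                        # step past the TTL
expired = not lock.held()
```

In a design like this, a swap endpoint would consult `held()` once per request and refuse with 423 when it returns true, which matches the single-read gate the commit message describes. The real token handling and endpoint contract are in docs/COORDINATION.md.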