Files
spark-control/AGENTS.md
T
Keysat 56a519ff4f docs: record git-history scrub; fix stale SHAs and IP-fragment remnants
History was rewritten with git filter-repo to purge owner-specific values
(IPs, hostnames, SSH username, key name, personal names) from all commits,
tags, and messages — including three LAN IPs and one Start9 address the
v0.18.0:1 working-tree scrub had missed (one still live in HEAD at
docs/AUDIO_API.md). Verified 0 hits across all refs.

- AGENTS.md: Portability + Repo-wart + work-queue #2 + shipping note updated;
  commit-SHA references repointed to post-rewrite SHAs (367d986->8d839e3).
- EVALUATION.md: P0 owner-data finding marked resolved; cleaned shorthand
  IP-octet fragments (/.87, /11) left by the placeholder substitution.
2026-06-12 16:55:08 -05:00

88 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# AGENTS.md
This file provides guidance to coding agents (Claude Code and others) when working with code in this repository. (Claude Code reads it via the `CLAUDE.md` symlink.)
Browser-based StartOS 0.4 package controlling a dual NVIDIA DGX Spark AI cluster: one-click vLLM model swaps, plus health, proxying, and APIs for speech (STT/diarization/TTS), embeddings, and redaction.
Subsystem guidance lives in `docs/guides/` and loads when matching files are touched (Claude Code lazy-loads via `.claude/rules/` symlinks; other agents read the guides directly): `startos-package.md` (build/versioning, `package/**`), `fastapi-image.md` (dev server/env/layout, `image/**`), `redaction.md` (vendoring + test gates), `audio-speech.md` (parakeet patches, cluster-container footguns, audio testing). **Read `docs/guides/audio-speech.md` before touching the Sparks' containers over SSH** — ops sessions don't trip the path scoping.
## Stack
- Two halves, always coordinated:
- `image/` — standalone FastAPI app (Python ≥3.11; UI on port 9999; vanilla HTML/CSS/JS).
- `package/` — StartOS 0.4 wrapper (TypeScript) that ships the Docker image as an s9pk.
- Build host needs `start-cli`, Node ≥22 + npm, and Docker.
- Cluster runtimes live **on the Sparks, not in this repo** (`spark-vllm-docker`, the parakeet/kokoro/embeddings containers). This repo is the controller; it reaches them over SSH + HTTP.
- Sparks are ARM64 (GB10 Grace-Blackwell, sm_121, CUDA 13). Services: vLLM `:8888` (Spark 1); `parakeet-asr` `:8000`, Kokoro TTS `:8880`, bge-m3 embeddings + Qdrant (Spark 2). See `docs/` for API contracts.
## Commands (headlines — details in the scoped rules)
```bash
(cd package && make x86) # build the s9pk; make install sideloads (restarts live service — ask first)
(cd image && uvicorn app.server:app --port 9999) # local dev — needs env vars, see fastapi-image rule
(cd image && .venv/bin/python -m app.redaction.test_gateway) # offline redaction suite 1
(cd image && .venv/bin/python app/redaction/test_scrub_leak.py) # offline redaction suite 2
./scripts/test-audio-with-speakers.sh <audio-file> # e2e audio — hits the LIVE cluster
```
## Layout
- `image/app/` — FastAPI app (`server.py` entry, routers in sibling modules, `static/` dashboard UI).
- `package/startos/` — StartOS manifest, interfaces, actions, version + release notes.
- `docs/``AUDIO_API.md`, `EMBEDDINGS.md`, `REDACTION_GATEWAY.md` (consumer-facing API refs; update with API changes).
- `README.md` (overview), `HANDOFF.md` (fresh-user install guide), `runbook.md` (ops notes), `known-issues.md`, `ROADMAP.md` (longer-term backlog — items move into "Current state" below when picked up).
## Conventions
- Every shipped change = version bump + release notes + rebuilt s9pk (version format `X.Y.Z:N`; details in the startos-package rule).
- Commit messages: `vX.Y.Z:N - short lowercase summary`. **Never add a Co-Authored-By / Claude attribution trailer.**
- The package owner is non-technical: explain infra effects in plain English and get an explicit go/no-go before mutating the cluster.
- New external-facing endpoints get documented in `docs/` and noted in release notes for downstream app developers (Recap Relay, Ten31 Transcripts, CRM, Signal Engine consume these APIs).
- Doc layout: `AGENTS.md` is the canonical file; `CLAUDE.md` is a symlink to it (don't overwrite it). Subsystem guides are real files in `docs/guides/<topic>.md` (with `paths:` frontmatter); `.claude/rules/<topic>.md` are relative symlinks into them. A new guide = add `docs/guides/<topic>.md`, symlink it from `.claude/rules/`, and add an index line above.
## Always / Never (cluster-wide)
- **Always** confirm with the user before swap/stop/restart of anything on the live cluster. Read-only probes and dry-runs are fine without asking.
- **Always** use the Spark's **IP** for HTTP probes — `.local` mDNS names can resolve IPv6-first and hang httpx (vLLM and friends bind IPv4 only). Never trust `.local` hostnames inside HTTP client code.
- **Always** pass `SSH_KEY_PATH` / `-i <key>` explicitly in scripted SSH; non-interactive shells have no ssh-agent identities.
- **Never** route audio or transcripts to cloud services — speech stays on the LAN. (Scrubbed text via `/scrub` is the only sanctioned path toward frontier models.)
- **Never** commit owner-specific hostnames, IPs, usernames, or names into package strings, UI text, or docs — this package gets shared; use placeholders. Canonical set: `<spark-1-ip>` / `<spark-2-ip>`, `<spark-1-host>` / `<spark-2-host>`, `<spark-user>`, and generic example names (`Alice`/`Bob`).
- **Never** install `cuda-python` in `parakeet-asr` — crashes real decode on this GPU/CUDA-13 stack; full story in the audio-speech rule.
## Current state
- **Working (v0.18.0:0, installed and serving):** swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel mode. Spark 2 audio stack is healthy (11k+ requests/12h, all 200).
- **In progress — Signal Engine "flakiness":** diagnosed, not a server bug — transient 14s unresponsiveness while the single GPU is continuously busy. Remedy is client-side; a drafted message (in-flight cap 2, hard ceiling 3 global across audio endpoints, retry-with-backoff on timeout/503) is with the owner to forward to that dev.
- **Decided, not implemented:** remote access stays WireGuard/Tailscale split-tunnel — no public interface, so no API auth built; an empirical concurrency sweep is offered but needs the owner's explicit OK in a quiet window. **Revisit (full-eval 2026-06-12):** the "LAN-only, so no auth" call is now load-bearing against RCE — unquoted user input reaches the SSH shell on several endpoints, so the network boundary is the *only* thing preventing cluster takeover. Quoting the injection sinks (work queue) is needed regardless of the auth decision; a defense-in-depth auth/CSRF gate is the follow-on.
- **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; the connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers.
- **Portability:** working tree + **full git history** scrubbed 2026-06-12 — all owner-specific IPs/hostnames/usernames/names replaced with placeholders; `claude-code-starter-prompt.md` deleted (old build-time prompt). Real cluster values live only in StartOS install config, shell env vars, and the gitignored `settings.local.json`. History was rewritten with `git filter-repo` (every commit SHA changed; force-pushed to `gitea`); a token sweep across all refs now returns 0 for every owner-specific value, including three LAN IPs + one Start9 address the original working-tree scrub had missed. Backup bundle of the pre-rewrite history is at `../spark-control-prehistory-rewrite.bundle`. **Still owner-side:** the `<ssh-key>` SSH key was leaked by *name* only (not material) — rotate it if it's still authorized on the Sparks.
- **Repo wart:** commit `8d839e3` (was `367d986` before the 2026-06-12 history rewrite) is labeled `v0.13.0:4` but actually contains everything through v0.18.0:0 — per-version commits for v0.14v0.18 are missing. Keep commit messages accurate going forward.
- **Hosting:** repo pushes to the owner's self-hosted Gitea — remote `gitea`, branch `master`, over SSH (host alias + key live in the local `~/.ssh/config`; no owner-specific details belong in the repo). Push there after committing.
- **Next (pre-eval backlog):** (1) owner forwards the concurrency note to the Signal Engine dev; (2) run the concurrency sweep if the dev wants the measured knee; (3) add the `--memory` cap to parakeet-asr via the Reapply-patches action; (4) pick the next item from ROADMAP.md.
### Full-eval triage (2026-06-12)
Source: `EVALUATION.md` at repo root (full evidence, file:line pointers, scorecard). Findings triaged below; do these before the pre-eval backlog above where they overlap.
**Work queue — P0/P1, fix before sharing the package wider:**
1. ~~**[P0] Shell-quote/validate every user value crossing into SSH**~~**DONE (code, 2026-06-12; not yet shipped).** New `image/app/shellsafe.py` (`validate_repo`/`validate_image`/`validate_container` whitelists + `quote_arg`/`quote_args`). Boundary validation added to `POST /api/models` (repo) and `POST /api/nim/install` (image+container); `shlex.quote` applied at every SSH sink — `models.build_launch_command` (repo+args, covers `vllm_args`+knobs), `download._do` (repo), `nim._do` (image/container/volume/port/env), `services.docker_state`+`run_action` (container). Verified: injection survives only as a single quoted token, vLLM preflight `shlex.split` round-trip intact, both redaction suites still pass. Side-benefit: NGC key now `shlex.quote`'d in `nim._do` (was single-quoted) — closes the quote-breakout half of the P2 NGC-key item; the process-list-exposure half remains. **Ship step pending:** version bump + release notes + rebuilt s9pk.
2. ~~**[P0] Decide the git-history question**~~**DONE (2026-06-12).** Chose to rewrite. `git filter-repo --replace-text --replace-message` mapped every owner-specific value to its canonical placeholder across all commits + tags + messages (incl. three LAN IPs and one Start9 address the working-tree scrub missed — the Start9 address was still live in HEAD at `docs/AUDIO_API.md`). Verified 0 hits across all refs; force-pushed to `gitea`. Pre-rewrite backup bundle kept at `../spark-control-prehistory-rewrite.bundle`. **Owner follow-up:** rotate the `<ssh-key>` SSH key if still authorized on the Sparks (only its name leaked, not the key).
3. ~~**[P1] Defense-in-depth gate on mutating endpoints**~~**DONE (code, 2026-06-12; not yet shipped).** `csrf_guard` HTTP middleware in `server.py` rejects state-changing requests whose `Origin`/`Referer` hostname ≠ the served host. Scoped to control endpoints; the programmatic API surface is exempt (`/v1/*`, `/scrub`, `/rehydrate`, `/api/search`, `/api/audio/`, `/api/health-event`) so downstream consumers are unaffected. No app-layer token auth (deliberate — would break consumers + the non-technical owner). Verified via TestClient: cross-origin control POST→403, same-origin/no-Origin→pass, exempt prefixes always pass, GET never blocked. **Verify on-box:** confirm the StartOS reverse proxy passes `Host`/`Origin` so the dashboard isn't false-positive-blocked.
4. ~~**[P1] Validate the Qdrant `collection`**~~**DONE (code, 2026-06-12; not yet shipped).** `_safe_collection` whitelist (`[A-Za-z0-9._-]`, rejects `..`) + URL-encoded path segment in `embeddings_proxy.py`. The raw `filter` is left as a passthrough (Qdrant parses it; pydantic enforces `dict`) — locking it to an allowlist would break hybrid-search consumers; the path segment was the real injection vector.
**Shipping (all of #1/#3/#4 batched):** SHIPPED 2026-06-12 — version `0.18.0:1``0.19.0:0`, release notes in `versions/v0_1_0.ts`, s9pk rebuilt (`make x86`) and sideloaded to the live Start9 server (`make install`). Committed as `1c4e861` (its SHA after the same-day history rewrite) and force-pushed to `gitea`. **On-box check still pending:** click a control action (swap / service stop) to confirm the new CSRF guard doesn't false-positive-block the dashboard behind the StartOS proxy.
**Known debt — P2, track but not blocking:**
- Test coverage is redaction-only; swap state machine, proxies, SSH wrapper, and the package have zero automated tests. Live-cluster paths (swap exec, audio, embeddings/search) couldn't be exercised at all — biggest blind spot.
- Loose dependency floors permit vulnerable `python-multipart`/`starlette` (DoS CVEs) on rebuild; no lockfile; no upload size caps (`pyproject.toml:6-13`).
- StartOS registry blockers (only if pursuing the registry): source not public + `packageRepo`/`upstreamRepo` are `example.com` placeholders (`manifest/index.ts:12-13`).
- Opaque HTTP 500 on `POST /api/models` / `PUT /knobs` when `MODELS_OVERRIDES` unset in dev (write to read-only `/data`) — catch the `OSError`.
- NGC API key inlined single-quoted into a remote shell command (`nim.py:147`) — pass via stdin/env.
- Global mutable `catalog` reassigned via `global`, shared across async requests with no snapshot (`server.py:107`) — latent race as concurrency grows.
- Container runs uvicorn as **root** bound to `0.0.0.0:9999` (no `USER` in Dockerfile) — amplifies any RCE blast radius.
**Parked — P3+, do in bulk when next touching docs/packaging:**
- README Status block stale (`v0.2.3 / 0.13.0:4` → v0.18.0:1, undercounts features); deprecated `@app.on_event` + hardcoded `app.version="0.1.0"`; `NimInstallBody.register` shadows `BaseModel` (rename → `register_service`); httpx class names leak into TTS/speech-models error text; one unescaped `innerHTML` sink (`app.js:177`) + `task_id` reflected in scrub JSON.
- Packaging cosmetics: `marketingUrl` placeholder; broken `instructions.md` source link; per-service SSH users (`parakeet_user` etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); `Makefile` builds only x86 though manifest declares `aarch64`; release notes describe the scrub, not capabilities.
- Hardening misc: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text echoed to clients.