AGENTS.md
This file provides guidance to coding agents (Claude Code and others) when working with code in this repository. (Claude Code reads it via the CLAUDE.md symlink.)
Browser-based StartOS 0.4 package controlling a dual NVIDIA DGX Spark AI cluster: one-click vLLM model swaps, plus health, proxying, and APIs for speech (STT/diarization/TTS), embeddings, and redaction.
Subsystem guidance lives in docs/guides/ and loads when matching files are touched (Claude Code lazy-loads via .claude/rules/ symlinks; other agents read the guides directly): startos-package.md (build/versioning, package/**), fastapi-image.md (dev server/env/layout, image/**), redaction.md (vendoring + test gates), audio-speech.md (parakeet patches, cluster-container footguns, audio testing). Read docs/guides/audio-speech.md before touching the Sparks' containers over SSH — ops sessions don't trip the path scoping.
Inbox check: At session start, if ~/Projects/standards/INBOX.md exists, scan it for items tagged (spark-control) and surface them before proposing next steps; triage with /triage.
Stack
- Two halves, always coordinated:
  - image/: standalone FastAPI app (Python ≥3.11; UI on port 9999; vanilla HTML/CSS/JS).
  - package/: StartOS 0.4 wrapper (TypeScript) that ships the Docker image as an s9pk.
- Build host needs start-cli, Node ≥22 + npm, and Docker.
- Cluster runtimes live on the Sparks, not in this repo (spark-vllm-docker, the parakeet/kokoro/embeddings containers). This repo is the controller; it reaches them over SSH + HTTP.
- Sparks are ARM64 (GB10 Grace-Blackwell, sm_121, CUDA 13). Services: vLLM :8888 (Spark 1); parakeet-asr :8000, Kokoro TTS :8880, bge-m3 embeddings + Qdrant (Spark 2). See docs/ for API contracts.
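A minimal sketch of how the service ports listed above could be resolved from settings, assuming the "blank means default" behaviour that the Current state notes describe for VLLM_PORT. The helper names (resolve_port, service_url) and the DEFAULT_PORTS table are hypothetical illustrations, not the app's actual code; treating 8000 and 8880 as defaults for PARAKEET_PORT and KOKORO_PORT is an assumption based on the ports listed here.

```python
from typing import Mapping

# ASSUMPTION: defaults mirror the ports listed in the Stack section.
DEFAULT_PORTS = {
    "VLLM_PORT": 8888,      # vLLM on Spark 1
    "PARAKEET_PORT": 8000,  # parakeet-asr on Spark 2
    "KOKORO_PORT": 8880,    # Kokoro TTS on Spark 2
}


def resolve_port(env: Mapping[str, str], name: str) -> int:
    """Return the configured port, falling back to the default when blank or unset."""
    raw = (env.get(name) or "").strip()
    if not raw:
        return DEFAULT_PORTS[name]
    port = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"{name} out of range: {port}")
    return port


def service_url(env: Mapping[str, str], host: str, name: str) -> str:
    """Build a base URL for a service; host should be an IP, not a .local name."""
    return f"http://{host}:{resolve_port(env, name)}"


if __name__ == "__main__":
    # An adopter whose vLLM listens on 8000 instead of 8888.
    print(service_url({"VLLM_PORT": "8000"}, "192.0.2.10", "VLLM_PORT"))
```

Keeping the fallback in one place means a blank setting and a missing setting behave identically, which matches the "blank ⇒ 8888" wording in the release notes.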
Commands (headlines — details in the scoped rules)
(cd package && make x86) # build the s9pk; make install sideloads (restarts live service — ask first)
(cd image && uvicorn app.server:app --port 9999) # local dev — needs env vars, see fastapi-image rule
(cd image && .venv/bin/python -m pytest) # offline unit suite (launch-cmd injection, label-merge)
(cd image && .venv/bin/python -m app.redaction.test_gateway) # offline redaction suite 1
(cd image && .venv/bin/python app/redaction/test_scrub_leak.py) # offline redaction suite 2
./scripts/test-audio-with-speakers.sh <audio-file> # e2e audio — hits the LIVE cluster
Layout
- image/app/: FastAPI app (server.py entry, routers in sibling modules, static/ dashboard UI).
- package/startos/: StartOS manifest, interfaces, actions, version + release notes.
- docs/: AUDIO_API.md, EMBEDDINGS.md, REDACTION_GATEWAY.md, COORDINATION.md (consumer-facing API refs; update with API changes).
- README.md (overview), HANDOFF.md (fresh-user install guide), runbook.md (ops notes), known-issues.md, ROADMAP.md (longer-term backlog; items move into "Current state" below when picked up).
Conventions
- Every shipped change = version bump + release notes + rebuilt s9pk (version format X.Y.Z:N; details in the startos-package rule).
- Commit messages: vX.Y.Z:N - short lowercase summary. Never add a Co-Authored-By / Claude attribution trailer.
- The package owner is non-technical: explain infra effects in plain English and get an explicit go/no-go before mutating the cluster.
- New external-facing endpoints get documented in docs/ and noted in release notes for downstream app developers (Recap Relay, Ten31 Transcripts, CRM, Signal Engine consume these APIs).
- Doc layout: AGENTS.md is the canonical file; CLAUDE.md is a symlink to it (don't overwrite it). Subsystem guides are real files in docs/guides/<topic>.md (with paths: frontmatter); .claude/rules/<topic>.md are relative symlinks into them. A new guide = add docs/guides/<topic>.md, symlink it from .claude/rules/, and add an index line above.
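The commit-message convention above can be checked mechanically. The following is a hedged sketch only: check_commit_message and its regular expressions are hypothetical and not part of this repo, and reading "short lowercase summary" as "the summary contains no uppercase letters" is an assumption.

```python
import re

# vX.Y.Z:N - summary, e.g. "v0.27.3:0 - cap image resolution"
SUBJECT_RE = re.compile(r"v\d+\.\d+\.\d+:\d+ - (.+)")
TRAILER_RE = re.compile(r"^co-authored-by:", re.IGNORECASE | re.MULTILINE)


def check_commit_message(message: str) -> list[str]:
    """Return a list of convention violations (empty list means the message passes)."""
    problems = []
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    match = SUBJECT_RE.fullmatch(subject)
    if match is None:
        problems.append("subject must look like 'vX.Y.Z:N - short lowercase summary'")
    elif match.group(1) != match.group(1).lower():
        # ASSUMPTION: "lowercase" means no uppercase letters anywhere in the summary.
        problems.append("summary must be lowercase")
    if TRAILER_RE.search(message):
        problems.append("remove the Co-Authored-By trailer")
    return problems


if __name__ == "__main__":
    print(check_commit_message("v0.27.3:0 - cap image resolution for qwen36"))
    print(check_commit_message("Fix vision\n\nCo-Authored-By: Someone <s@example.com>"))
```

A check like this could run as a commit-msg hook, though whether the repo wants one is left to the owner.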
Always / Never (cluster-wide)
- Always confirm with the user before swap/stop/restart of anything on the live cluster. Read-only probes and dry-runs are fine without asking.
- Always use the Spark's IP for HTTP probes: .local mDNS names can resolve IPv6-first and hang httpx (vLLM and friends bind IPv4 only). Never trust .local hostnames inside HTTP client code.
- Always pass SSH_KEY_PATH / -i <key> explicitly in scripted SSH; non-interactive shells have no ssh-agent identities.
- Never route audio or transcripts to cloud services; speech stays on the LAN. (Scrubbed text via /scrub is the only sanctioned path toward frontier models.)
- Never commit owner-specific hostnames, IPs, usernames, or names into package strings, UI text, or docs; this package gets shared, so use placeholders. Canonical set: <spark-1-ip> / <spark-2-ip>, <spark-1-host> / <spark-2-host>, <spark-user>, and generic example names (Alice/Bob).
- Never install cuda-python in parakeet-asr: it crashes real decode on this GPU/CUDA-13 stack; full story in the audio-speech rule.
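The "IP, not .local" rule for HTTP probes can be enforced before any request is made. This is a minimal sketch using only the standard library; require_ipv4_literal and probe_url are hypothetical helper names, and the /health path is used purely as an example probe target.

```python
import ipaddress


def require_ipv4_literal(host: str) -> str:
    """Return host unchanged if it is an IPv4 literal; raise ValueError otherwise."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames such as "spark-1.local" land here.
        raise ValueError(f"probe host must be an IPv4 literal, got {host!r}") from None
    if addr.version != 4:
        raise ValueError(f"probe host must be IPv4 (services bind IPv4 only), got {host!r}")
    return host


def probe_url(host: str, port: int, path: str = "/health") -> str:
    """Build a probe URL, refusing mDNS names and IPv6 addresses up front."""
    return f"http://{require_ipv4_literal(host)}:{port}{path}"


if __name__ == "__main__":
    print(probe_url("192.0.2.10", 8888))
    try:
        probe_url("spark-1.local", 8888)
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Failing fast on a hostname turns a silent client hang into an immediate, readable error.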
Current state
- Live: v0.27.3:0 — Qwen3.6 vision works end-to-end (incl. full-size phone photos). Installed on
immense-voyage (start-cli confirms 0.27.3:0). Two-part story: (A) the daily driver RedHatAI/Qwen3.6-35B-A3B-NVFP4 is itself a vision model (Qwen3_5MoeForConditionalGeneration, vision_config + model_visual.safetensors on disk); the recipe was mislabelled [reasoning] and is now [vision, reasoning]. A real business card read 7/7 fields perfectly (~97 tok/s, no patches). (B) Oversized-image fix: a 12MP phone photo expands to ~11.8k vision tokens → exceeds vLLM's ~4096-image-token cap → 400 "Failed to apply Qwen3VLProcessor … token count mismatch." Fix = cap resolution server-side via '--mm-processor-kwargs={"max_pixels": 2000000}' in the qwen36 recipe (auto-downscales big images for every /v1 consumer; verified live: the 12MP image went 400→200). Quoting survives the stack because launch-cluster.sh does printf "%q" on the serve args (line 163) and build_launch_command shlex-quotes (round-trip test passes). An in-dashboard "Vision check" button shipped in v0.27.2, then was removed in v0.27.3 at the owner's request (clutter; the vision badge already signals capability, so don't re-add it). The /v1/chat/completions proxy is a dumb passthrough that already forwards image content, so no backend change was needed. 161 pytest green.
- Gemma-4-26B-A4B-NVFP4 eval: RESOLVED as "defer; Qwen covers vision better." Two independent deep-research agents (this session) confirmed: it does NOT run on the stock
eugr/spark-vllm-docker stack (crashes on tie_weights NotImplementedError; the checkpoint declares compressed-tensors in config.json but is modelopt NVFP4). The working path needs the vllm/vllm-openai:gemma4-0505-arm64-cu130 image (lacks Ray → can't go through launch-cluster.sh, needs raw docker run = the deferred raw-docker-swap feature) + a bind-mounted patched gemma4.py (upstream PR #39084 unmerged) + --moe-backend marlin, AND even then vision is degraded by open vLLM bug #40106 (wrong attention on image tokens, which hurts OCR specifically). ~52 tok/s vs Qwen's 97. Net: more duct tape for worse vision than the Qwen model Grant already runs. Revisit when #40106 + #39084 land. The alternatives agent also flagged RedHatAI/Qwen3.5-122B-A10B-NVFP4 as the proven single-Spark reasoning step-up (30–51 tok/s, fits 128 GB, no patches): a future daily-driver upgrade, orthogonal to vision.
- Live: v0.27.1:0 - fix: "Download a new model" button (uvx PATH). Commit
1e1e1cb; installed on immense-voyage (start-cli package list confirms 0.27.1:0); pushed to gitea master; published to Clankistry (~/.spark-control/publish.sh). Root cause: hf-download.sh shells out to uvx, which the uv installer puts in ~/.local/bin; Spark Control's non-interactive SSH session doesn't source the user's profile, so ~/.local/bin is off PATH and the download died with "uvx: command not found" (same class as the matrix-bridge non-interactive-SSH gotcha). Fix: download.build_download_command prepends export PATH="$HOME/.local/bin:$PATH" (server-side $HOME, generic for any adopter); extracted to a pure helper with regression tests (test_download.py: PATH prefix, no-trailing-space, cluster flags, shlex round-trip). 161 pytest green; verified live. Prompted by Grant adding Gemma-4-26B: he downloaded nvidia/Gemma-4-26B-A4B-NVFP4 (recipe gemma4-26b already in catalog) via the now-fixed button (fix confirmed end-to-end) and is swapping to it. Pending: business-card OCR / vision test once it's up.
- Live: v0.27.0:0 - in-app Settings gear + two bug fixes (commit
7e07598; installed on immense-voyage, start-cli package list confirms 0.27.0:0; published to Clankistry; pushed to gitea master). Prompted by the second adopter's v0.25 feedback. (1) StartOS "Configure Sparks" action trimmed to the four required fields; all optional knobs moved to a ⚙ Settings gear in the dashboard, backed by a /data/app_settings.json overlay (app_settings.py) keyed by env-var names, overlaid on os.environ, applied live via in-place Settings.reload() (architecture + the snapshot-holder gotcha are in the fastapi-image guide). Existing installs' values migrate automatically on first boot (seed_from_env). (2) Support-service ports now configurable (PARAKEET_PORT/KOKORO_PORT/EMBED_PORT/QDRANT_PORT; VLLM_PORT surfaced); this fixes the adopter's false "vLLM down" (theirs is on 8000, not launch-cluster.sh's 8888) and Parakeet 404 (remapped off 8000). (3) Bug fix: GET /api/swap/lock 404 (was shadowed by /api/swap/{job_id}; lock routes now register first). Code review caught a real P1 (the WebhookNotifier snapshot, fixed via swap_webhook.update() after reload, regression-tested). 157 pytest + live smoke all green.
- Next on this thread (small, externally gated): (a) adopter reply is drafted (in the session; it corrects the vLLM-port misconception → set 8000 in the gear, confirms the port knobs + swap/lock fix, asks for the disk-scan diagnostic), pending Grant to send + pick the distribution-channel wording. (b) Optional Gitea tag +
make release so the adopter can pull v0.27 from Gitea Releases (NOT done this session; only registry + sideload shipped); do it only if that adopter pulls from Gitea Releases rather than subscribing to Clankistry. (c) Un-diagnosed: adopter's disk-scan shows Gemma "not on disk"; needs them to run ls ~/.cache/huggingface/hub as the SSH user vs disk.py's $HOME/.cache/huggingface/hub assumption (likely a custom HF_HOME / container-volume / different-user cache path → would need a configurable cache path).
- Live: v0.26.0:0 - disk-driven model menu (installed on the server 2026-06-18,
installed-version confirms; also published to the self-hosted StartOS registry). The dashboard lists what's actually downloaded on the Sparks; models.yaml / overrides are launch recipes matched by repo, not the menu; an on-disk model with no recipe shows needs_setup and infers its launch flags from config.json (operator confirms once). Delete removes weights and the card; dropped the two legacy Qwen recipes. Architecture (discovery.py / build_menu / infer_recipe, the recipe-vs-disk split) is in the fastapi-image guide.
- Gemma-4-26B-A4B vision eval: DONE this session (deferred; see the v0.27.3 + Gemma bullets up top). The
gemma4-26b recipe stays in the catalog but is known not to launch on the stock stack; the owner's vision/OCR goal is met by the Qwen3.6 daily driver instead.
- Live: v0.25.0:0 (installed 2026-06-18). The OpenClaw/Johnny-5 coexistence epic is fully shipped & live: configurable
VLLM_PORT (v0.22, blank ⇒ 8888), local/fine-tuned models (v0.23), configurable topology (v0.24: VLLM_CONTAINER, DISABLED_SERVICES hide-list, second-Spark kind: vllm monitor), coordination layer (v0.25: swap reservation lock with 423-enforced manual-swap pause + ?force=true Release override, swap_complete / swap_failed webhook, read-only schedule registry; consumer API in docs/COORDINATION.md).
- Other live features: swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant);
/scrub + /rehydrate; label-merge incl. dual-channel; per-Spark SSH-key copy + WireGuard VPN <ip> hardware badge. Security hardening (v0.19: shellsafe SSH-injection guard, Qdrant path-injection, same-origin CSRF guard) stable (EVALUATION.md). Spark 2 audio/embeddings stack healthy.
- matrix-bridge bot tile (v0.21.0:1, live):
bot-kind tile (docker-state badge; Update/Restart/Stop-Start/View-logs) for the Matrix bot on Spark 2, driven as modelo (no sudo -iu; blank matrix_bridge_user ⇒ tile hidden; host reuses spark2_host). Code: app/matrix_bridge.py + /api/matrix-bridge/{update,logs}. Load-bearing: Update's git fetch runs as modelo and needs modelo's ~/.ssh/config pinning the Gitea deploy key with IdentitiesOnly yes (else publickey denial). Optional next only if the bot dev asks: Docker HEALTHCHECK.
- Tests: offline pytest harness in
image/tests/: cd image && .venv/bin/python -m pytest (157 passing; the in-app settings gear + swap-lock route-order regression + the webhook-repoint live-reload check are in test_app_settings.py, incl. TestClient end-to-end). Covers build_launch_command (incl. the shell-injection round-trip + local-model bind-mount), the transcript↔diarizer label-merge, the shellsafe validators, matrix_bridge.build_update_command (+ phase detection), the configurable-topology layer (test_topology.py), the coordination layer (test_coordination.py: swap-lock lifecycle/expiry/token-auth, schedule-registry CRUD, webhook payload + HMAC signature; the clock value (now) is injected into the lock so expiry is tested without sleeping), and the disk-driven menu (test_discovery.py: cache-dirname↔repo parsing, the cache-listing parser incl. incomplete-download filtering, and infer_recipe family/mode mapping: Qwen3-MoE→flashinfer_cutlass, Gemma-MoE→marlin, vision caps, solo-vs-cluster by size/host-count). The build_menu merge + /api/models/suggest are exercised by hand against the live cluster (mock-heavy unit tests there would test the mocks). Redaction + live-audio suites remain standalone scripts.
- Signal Engine "flakiness": diagnosed as not a server bug; it is transient 1–4s unresponsiveness while the single GPU is busy. Client-side remedy (in-flight cap 2 / ceiling 3 / retry-on-timeout+503) drafted and forwarded to that dev (owner confirmed 2026-06-15). Awaiting whether they want the measured concurrency knee.
- Stance (decided, not built): no public interface / no API-token auth — LAN + WireGuard/Tailscale split-tunnel only; the CSRF guard covers the browser-driven vector.
- Known limits:
/health blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers; matrix-bridge badge won't visibly flip on a fast docker restart (status re-checked only after the command returns).
- Infra gotcha (safety): passwordless sudo is NOT configured on spark2, so design unprivileged probes for any Spark feature (the badge uses
ip, not sudo wg show). spark2 sits on the starttunnel WireGuard subnet (10.59.211.6/24, survives reboot). Owner declined SSH-key rotation after the 2026-06-12 history scrub (only the key name leaked), so don't re-flag.
- Hosting / distribution: source on self-hosted Gitea, remote
gitea, branch master, over SSH; push after committing. (Wart: commit 8d839e3 is mislabeled v0.13.0:4 but contains through v0.18.0:0.) The s9pk ships via Gitea Releases (make release) and a self-hosted StartOS registry. Operator-local publish tooling lives outside the repo; owner-specific addresses + the authenticated-writes-must-be-direct-not-via-the-tunnel gotcha are in session memory.
- Design stance (decided): Spark Control = control plane / GPU arbiter, not a job runner; recurring business jobs live in separate services that call the swap API (
POST /api/swap). Full epic history (v0.22→v0.25) is in git log + ROADMAP.md → "Cluster coordination".
- Usage note (2026-06-18): owner's daily driver is the solo Qwen3.6 35B; the 235B
cluster models are dormant. Keeping launch-cluster.sh (the eugr/spark-vllm-docker community standard, mirrors NVIDIA's dgx-spark-playbooks Ray+RoCE design) is still correct even single-node: it supplies the maintained, hardware-tuned vLLM images; raw docker would mean DIY image upkeep for no gain. Spark 2 stays the speech/embeddings box regardless.
- Next steps (all low-priority / externally gated; P2/P3 tech-debt backlog in
ROADMAP.md): (1) raw docker run swap generalization: DEFERRED (rationale in ROADMAP; revisit only if an adopter wants Spark Control to drive, not just monitor, raw-docker swaps; cleanest fix is the adopter adopting launch-cluster.sh). (2) audio concurrency knee: only if the Signal Engine dev wants it (needs a quiet window). (3) matrix-bridge Docker HEALTHCHECK: only if the bot dev asks. (4) Parakeet long-audio guard: deferred (rationale in ROADMAP).
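For the oversized-image fix recorded in the v0.27.3 entry, the token arithmetic can be sketched roughly as follows. This is an estimate, not vLLM's or the Qwen processor's actual code. It assumes one vision token per 32×32-pixel tile, an assumption chosen because it is consistent with the ~11.8k-token figure quoted for a 12MP photo; the function names are hypothetical, and the real processor also rounds dimensions to its own grid.

```python
import math

PIXELS_PER_TOKEN = 32 * 32  # ASSUMPTION: one vision token per 32x32-pixel tile
IMAGE_TOKEN_CAP = 4096      # approximate per-image cap quoted in the notes
MAX_PIXELS = 2_000_000      # value passed via --mm-processor-kwargs max_pixels


def estimate_vision_tokens(width: int, height: int) -> int:
    """Rough token estimate: total pixels divided by pixels per token."""
    return (width * height) // PIXELS_PER_TOKEN


def downscale_to_budget(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Shrink (width, height) to fit max_pixels, keeping the aspect ratio."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    # Flooring both sides keeps the product at or under the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


if __name__ == "__main__":
    w, h = 4000, 3000  # a 12MP photo
    before = estimate_vision_tokens(w, h)
    small_w, small_h = downscale_to_budget(w, h)
    after = estimate_vision_tokens(small_w, small_h)
    print(f"original: {before} tokens (cap {IMAGE_TOKEN_CAP})")
    print(f"capped:   {small_w}x{small_h} -> {after} tokens")
```

Under these assumptions a 2,000,000-pixel ceiling lands at about half the cap, which leaves headroom for the processor's own rounding.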