spark-control/EVALUATION.md

# Evaluation — spark-control — 2026-06-12

Intent: A browser-based StartOS 0.4 package controlling a dual-DGX-Spark vLLM cluster — one-click model swaps plus health, proxying, and APIs for speech (STT/diarization/TTS), embeddings, and redaction.

Agents run: evaluator, security-auditor, exerciser, start9-spec-checker. Reviewer skipped (working tree clean — no diff to review).

## Verdict

This is a capable, well-documented single-operator control plane: a ~960-line FastAPI app fronting SSH-driven model swaps plus honest HTTP proxies for chat, speech, embeddings, and a genuinely well-engineered fail-closed redaction gateway, wrapped by a thin, spec-conformant StartOS 0.4 package that builds cleanly and passes both offline test suites. The app boots and behaves correctly with the cluster absent, and the packaging is compliant on every structural requirement. The dominant risk, corroborated by two agents at the same code paths, is **unauthenticated remote command execution**: several endpoints interpolate caller-controlled strings (`repo`, `vllm_args`, NIM `image`/`container`, custom-service names) unquoted into shell commands run over SSH on the GPU nodes, and the app has no auth or CSRF protection by design — so the LAN/VPN trust boundary is the only thing between a browser-reachable request and cluster RCE. Owner infra topology (IPs, hostnames, SSH username, key name) was scrubbed from the working tree but still lives in git history, handing an attacker a target list for exactly those endpoints. The package is structurally ready but not safe to share widely until the injection sinks are quoted/validated and the history is dealt with.

## Cross-referenced findings

- **Command injection → cluster RCE** is reported by *both* the evaluator (P1) and the security-auditor (P0) at the same sinks (`models.py:80`, `swap.py:101`, `download.py:129`, `nim.py:145-166`, `services.py:144`). The evaluator demonstrated `build_launch_command` producing a live `;`-separated command from a hostile `repo`. Merged as **one P0** — the auditor's adversarial evidence (browser/CSRF reachability over plaintext HTTP, no auth) escalates the evaluator's network-gated P1.
- **No auth on state-mutating endpoints** is the shared root enabler: the evaluator filed it P2 (documented/intentional), the auditor filed the **CSRF** angle P1 (a malicious page in the operator's browser can `fetch()` the mutating routes and chain into the P0 injections). Merged into one P1, noting the auditor's CSRF evidence escalates the evaluator's original P2.
- **Owner data exposure**: the evaluator flagged real IPs/username in the (gitignored, untracked) `.claude/settings.local.json`; the auditor independently found the same class of data — IPs, hostnames, user `<spark-user>`, key name — persisting in **git history** despite the v0.18.0:1 working-tree scrub. These are the same concern at two locations; the git-history copy is the P0.
- **Front-end output hygiene**: the evaluator flagged `current_model` rendered via `innerHTML` without `escapeHtml` (`app.js:177`, P3); the exerciser noted `task_id` echoed verbatim in scrub JSON. The auditor read the UI as broadly `escapeHtml`-clean — see Disagreements.

## Priority queue

- [P0] Command injection via unquoted user input (`repo`, `vllm_args`, NIM `image`/`container`/`port`, custom-service `container`) interpolated into SSH shell commands → arbitrary RCE as the SSH user on the Sparks — `models.py:80`, `swap.py:101`, `download.py:129`, `nim.py:145-166`, `services.py:144`; demonstrated via `build_launch_command` — evaluator + security-auditor
- [P0] Owner infra topology (IPs `<spark-1-ip>/.87`, QSFP `<spark-1-qsfp-ip>/11`, hosts `<spark-1-host>`/`<spark-2-host>`, user `<spark-user>`, key `<ssh-key>`) persists in git history pre-`50c67cd` despite the working-tree scrub → target list for the unauthenticated endpoints — security-auditor
- [P1] No auth + no CSRF protection on state-changing endpoints (plaintext `http`, `interfaces.ts:8`) → any LAN peer, or a malicious page in the operator's browser, can drive swap/install/stop/delete and chain into the P0 injections — security-auditor (CSRF P1) + evaluator (auth P2, escalated)
- [P1] SSRF / Qdrant path injection: caller `collection` interpolated into the Qdrant URL with no validation and raw `filter` forwarded verbatim — `embeddings_proxy.py:237,175,204` — security-auditor
- [P2] Test coverage is redaction-only; the swap state machine, proxies, SSH wrapper, and the StartOS package have zero automated tests — evaluator
- [P2] Loose dependency floors permit known-vulnerable `python-multipart`/`starlette` (DoS CVE-2024-53981 / CVE-2024-47874) on rebuild; no lockfile; no upload size caps — `pyproject.toml:6-13` — security-auditor
- [P2] Registry-submission blockers: source not public + `packageRepo`/`upstreamRepo` are `https://example.com` placeholders — `manifest/index.ts:12-13` — start9-spec-checker
- [P2] Unhandled `OSError` → opaque HTTP 500 on `POST /api/models` and `PUT /knobs` when `MODELS_OVERRIDES` is unset in dev (write to read-only `/data`) — exerciser
- [P2] NGC API key inlined single-quoted into a remote shell command (`export NGC_API_KEY='...'`) → quote-breakout risk + exposure in target process list — `nim.py:147` — security-auditor
- [P2] Single global mutable `catalog` reassigned via `global`, shared across in-flight async requests with no snapshot → latent race as concurrency grows — `server.py:107` — evaluator
- [P2] Container runs uvicorn as **root** (no `USER` in Dockerfile) bound to `0.0.0.0:9999` → any injection RCE runs the SSH client as root in-container — security-auditor (surprise)
- [P3] README Status block stale ("v0.2.3 / s9pk 0.13.0:4", undercounts features) vs actual v0.18.0:1 — `README.md:115` — evaluator
- [P3] `current_model` rendered via `innerHTML` without `escapeHtml` (`app.js:177`); `task_id` echoed verbatim in scrub JSON — evaluator + exerciser
- [P3] httpx exception class names leak into `/v1/audio/speech` and `/api/speech-models` error responses — exerciser
- [P3] `NimInstallBody.register` shadows `BaseModel` attribute → `UserWarning` on every startup; rename (e.g. `register_service`) — exerciser
- [P3] Deprecated `@app.on_event` startup/shutdown and hardcoded `app.version="0.1.0"` (real version 0.18.0:1) — `server.py:49,55` — evaluator
- [P3] `marketingUrl` is an `example.com` placeholder (set `null` or a real URL) — `manifest/index.ts:14` — start9-spec-checker
- [P3] `instructions.md:35` has a broken/template source link (`github.com/Start9Labs/... (TBD)`) visible to end users — start9-spec-checker
- [P3] Per-service SSH users (`parakeet_user`/`kokoro_user`/`embed_user`/`qdrant_user`) are read by `main.ts` but absent from the Configure-Sparks action inputSpec → silent default-to-empty misconfig — start9-spec-checker
- [P3] `Makefile` builds only `x86` though the manifest declares `aarch64`; release notes describe the portability scrub, not package capabilities — start9-spec-checker
- [P3] Hardening: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text (`r.text[:500]`) echoed to clients — security-auditor

## Scorecard

| Lens | Score /5 | Justification (cross-checked) |
|------|----------|-------------------------------|
| Architecture | 4 | Clean router-per-concern split, all SSH funnelled through one wrapper (`ssh.py:29`), proxies stay intentionally dumb; global mutable `catalog` + deprecated `on_event` are minor seams. |
| Security | 2 | Held at 2: auditor's evidence (P0 git-history leak, P1 CSRF, P1 SSRF, root container) corroborates and escalates the evaluator's injection finding rather than contradicting it. The redaction boundary is the bright spot; the transport around it is not. |
| Performance | 4 | Async throughout, parallel health fans, unreachable-host cache avoids repeated 6s SSH stalls; `_win_rms` per-sample Python loop is the one hot spot (`audio_proxy.py:635`). |
| Testing | 3 | Two thorough offline redaction suites pass (69/69 + leak); everything else — swap, proxies, SSH, package — is untested, and live-cluster paths couldn't be exercised at all. |
| Code quality | 4 | Consistent style, useful "why" comments, typed dataclasses; `server.py` (962 lines) and `audio_proxy.py` (829) are getting long. |
| Documentation | 4 | Excellent AGENTS.md, scoped guides, HANDOFF, dated `known-issues.md`; undercut by the stale README status line. |

No lens score was overturned by cross-agent evidence; Security stays at 2 with the auditor's findings reinforcing it.

## Disagreements & gaps

- **Injection severity**: auditor P0 vs evaluator P1. Resolved to P0 — the disagreement is purely about whether the no-auth/LAN posture demotes it; the auditor's CSRF finding shows it's reachable from a browser, so the network gate is weaker than the evaluator assumed.
- **Front-end XSS**: the evaluator flagged one unescaped `innerHTML` sink (`current_model`) and the exerciser flagged `task_id` reflection, while the auditor judged the UI broadly `escapeHtml`-clean (47 escape calls). Low-stakes (JSON API + mostly-escaped render path) but unresolved.
- **Shared blind spot**: no agent could exercise the live-cluster paths — actual swap execution, audio transcription/diarization/label-merge, embeddings/search with real vectors. These are simultaneously the **largest, most security-relevant, and least-tested** modules (`swap.py`, `audio_proxy.py`, `services.py`), so a regression in launch-command construction or speaker-merge logic would ship silently. The evaluator and exerciser both name this gap.
- **Registry context**: the spec-checker notes there is currently no StartOS 0.4 community registry (alpha only), so its blockers are inferred from the 0.3.5.x submission doc — applicable when 0.4 opens, but the process may change.

## Suggested order of work

1. **Close the injection sinks** — `shlex.quote` or strict-regex-validate every user-controlled value crossing into SSH (`repo`, `vllm_args`, NIM `image`/`container`/`port`, custom-service names); the safe pattern already exists in `disk.py:_SAFE_DIRNAME`. Cheap, local, independent of the auth decision. (P0)
2. **Decide the git-history question** before any wider sharing — rewrite history (`git-filter-repo`) and rotate the named `<ssh-key>` key, or commit to keeping the repo private-forever. (P0)
3. **Add a defense-in-depth gate** on mutating endpoints — an `Origin`/referer check or a shared-token header in middleware — so a misconfigured StartOS exposure isn't instant RCE; leave read-only probes open. (P1)
4. **Harden the remaining inputs** — validate the Qdrant `collection`, pin dependency floors + commit a lockfile, add upload size caps, drop the root container `USER`. (P1–P2)
5. **Add a minimal pytest harness** for `build_launch_command` (incl. injection cases), the swap state transitions, and `_merge_words_with_speakers` — the untested core. (P2)
6. **Fix the doc/packaging drift** — README status block, the `example.com` manifest URLs, the `instructions.md` link, release-note content, and the hardcoded `app.version`. (P2–P3)
7. **If pursuing the registry later** — publish source publicly, build the declared `aarch64` artifact, and run the manual on-box checklist (`start-cli s9pk inspect`, install/uninstall, backup/restore). (P2)