Files
spark-control/EVALUATION.md
T
Keysat 1c4e861783 v0.19.0:0 - harden cluster-control surface: ssh injection, qdrant path, csrf
Triaged from a full independent evaluation (EVALUATION.md). Addresses the
three P0/P1 code findings; the proxy/data APIs that downstream apps consume
are deliberately untouched.

- ssh command injection (P0): new shellsafe.py validates + shlex.quotes every
  user-supplied value crossing into an SSH command on the Sparks (model repo,
  vllm args/knobs, NIM image/container/volume/port/env, service names).
  Boundary validation on POST /api/models and POST /api/nim/install; quoting at
  every sink in models/download/nim/services. NGC key now quoted too.
- qdrant path injection (P1): /api/search validates the collection name against
  a metacharacter-free whitelist and URL-encodes the path segment.
- csrf (P1): csrf_guard middleware enforces same-origin on state-changing
  control endpoints; /v1/*, /scrub, /rehydrate, /api/search, /api/audio/* and
  /api/health-event are exempt so external consumers are unaffected.

Verified: injection survives only as a single quoted token, vLLM preflight
shlex.split round-trip intact, CSRF behaviors covered via TestClient, both
offline redaction suites still pass, tsc clean, s9pk rebuilt.
2026-06-12 16:36:33 -05:00

11 KiB
Raw Blame History

Evaluation — spark-control — 2026-06-12

Intent: A browser-based StartOS 0.4 package controlling a dual-DGX-Spark vLLM cluster — one-click model swaps plus health, proxying, and APIs for speech (STT/diarization/TTS), embeddings, and redaction.

Agents run: evaluator, security-auditor, exerciser, start9-spec-checker. Reviewer skipped (working tree clean — no diff to review).

Verdict

This is a capable, well-documented single-operator control plane: a ~960-line FastAPI app fronting SSH-driven model swaps plus honest HTTP proxies for chat, speech, embeddings, and a genuinely well-engineered fail-closed redaction gateway, wrapped by a thin, spec-conformant StartOS 0.4 package that builds cleanly and passes both offline test suites. The app boots and behaves correctly with the cluster absent, and the packaging is compliant on every structural requirement. The dominant risk, corroborated by two agents at the same code paths, is unauthenticated remote command execution: several endpoints interpolate caller-controlled strings (repo, vllm_args, NIM image/container, custom-service names) unquoted into shell commands run over SSH on the GPU nodes, and the app has no auth or CSRF protection by design — so the LAN/VPN trust boundary is the only thing between a browser-reachable request and cluster RCE. Owner infra topology (IPs, hostnames, SSH username, key name) was scrubbed from the working tree but still lives in git history, handing an attacker a target list for exactly those endpoints. The package is structurally ready but not safe to share widely until the injection sinks are quoted/validated and the history is dealt with.

Cross-referenced findings

  • Command injection → cluster RCE is reported by both the evaluator (P1) and the security-auditor (P0) at the same sinks (models.py:80, swap.py:101, download.py:129, nim.py:145-166, services.py:144). The evaluator demonstrated build_launch_command producing a live ;-separated command from a hostile repo. Merged as one P0 — the auditor's adversarial evidence (browser/CSRF reachability over plaintext HTTP, no auth) escalates the evaluator's network-gated P1.
  • No auth on state-mutating endpoints is the shared root enabler: the evaluator filed it P2 (documented/intentional), the auditor filed the CSRF angle P1 (a malicious page in the operator's browser can fetch() the mutating routes and chain into the P0 injections). Merged into one P1, noting the auditor's CSRF evidence escalates the evaluator's original P2.
  • Owner data exposure: the evaluator flagged real IPs/username in the (gitignored, untracked) .claude/settings.local.json; the auditor independently found the same class of data — IPs, hostnames, user <spark-user>, key name — persisting in git history despite the v0.18.0:1 working-tree scrub. These are the same concern at two locations; the git-history copy is the P0.
  • Front-end output hygiene: the evaluator flagged current_model rendered via innerHTML without escapeHtml (app.js:177, P3); the exerciser noted task_id echoed verbatim in scrub JSON. The auditor read the UI as broadly escapeHtml-clean — see Disagreements.

Priority queue

  • [P0] Command injection via unquoted user input (repo, vllm_args, NIM image/container/port, custom-service container) interpolated into SSH shell commands → arbitrary RCE as the SSH user on the Sparks — models.py:80, swap.py:101, download.py:129, nim.py:145-166, services.py:144; demonstrated via build_launch_command — evaluator + security-auditor
  • [P0] Owner infra topology (IPs <spark-1-ip>/.87, QSFP <spark-1-qsfp-ip>/11, hosts <spark-1-host>/<spark-2-host>, user <spark-user>, key <ssh-key>) persists in git history pre-50c67cd despite the working-tree scrub → target list for the unauthenticated endpoints — security-auditor
  • [P1] No auth + no CSRF protection on state-changing endpoints (plaintext http, interfaces.ts:8) → any LAN peer, or a malicious page in the operator's browser, can drive swap/install/stop/delete and chain into the P0 injections — security-auditor (CSRF P1) + evaluator (auth P2, escalated)
  • [P1] SSRF / Qdrant path injection: caller collection interpolated into the Qdrant URL with no validation and raw filter forwarded verbatim — embeddings_proxy.py:237,175,204 — security-auditor
  • [P2] Test coverage is redaction-only; the swap state machine, proxies, SSH wrapper, and the StartOS package have zero automated tests — evaluator
  • [P2] Loose dependency floors permit known-vulnerable python-multipart/starlette (DoS CVE-2024-53981 / CVE-2024-47874) on rebuild; no lockfile; no upload size caps — pyproject.toml:6-13 — security-auditor
  • [P2] Registry-submission blockers: source not public + packageRepo/upstreamRepo are https://example.com placeholders — manifest/index.ts:12-13 — start9-spec-checker
  • [P2] Unhandled OSError → opaque HTTP 500 on POST /api/models and PUT /knobs when MODELS_OVERRIDES is unset in dev (write to read-only /data) — exerciser
  • [P2] NGC API key inlined single-quoted into a remote shell command (export NGC_API_KEY='...') → quote-breakout risk + exposure in target process list — nim.py:147 — security-auditor
  • [P2] Single global mutable catalog reassigned via global, shared across in-flight async requests with no snapshot → latent race as concurrency grows — server.py:107 — evaluator
  • [P2] Container runs uvicorn as root (no USER in Dockerfile) bound to 0.0.0.0:9999 → any injection RCE runs the SSH client as root in-container — security-auditor (surprise)
  • [P3] README Status block stale ("v0.2.3 / s9pk 0.13.0:4", undercounts features) vs actual v0.18.0:1 — README.md:115 — evaluator
  • [P3] current_model rendered via innerHTML without escapeHtml (app.js:177); task_id echoed verbatim in scrub JSON — evaluator + exerciser
  • [P3] httpx exception class names leak into /v1/audio/speech and /api/speech-models error responses — exerciser
  • [P3] NimInstallBody.register shadows BaseModel attribute → UserWarning on every startup; rename (e.g. register_service) — exerciser
  • [P3] Deprecated @app.on_event startup/shutdown and hardcoded app.version="0.1.0" (real version 0.18.0:1) — server.py:49,55 — evaluator
  • [P3] marketingUrl is an example.com placeholder (set null or a real URL) — manifest/index.ts:14 — start9-spec-checker
  • [P3] instructions.md:35 has a broken/template source link (github.com/Start9Labs/... (TBD)) visible to end users — start9-spec-checker
  • [P3] Per-service SSH users (parakeet_user/kokoro_user/embed_user/qdrant_user) are read by main.ts but absent from the Configure-Sparks action inputSpec → silent default-to-empty misconfig — start9-spec-checker
  • [P3] Makefile builds only x86 though the manifest declares aarch64; release notes describe the portability scrub, not package capabilities — start9-spec-checker
  • [P3] Hardening: no body/upload size limits on /v1/audio/*, /v1/chat/completions, /scrub; int(_env(...)) startup crash on bad VLLM_PORT; upstream error text (r.text[:500]) echoed to clients — security-auditor

Scorecard

Lens Score /5 Justification (cross-checked)
Architecture 4 Clean router-per-concern split, all SSH funnelled through one wrapper (ssh.py:29), proxies stay intentionally dumb; global mutable catalog + deprecated on_event are minor seams.
Security 2 Held at 2: auditor's evidence (P0 git-history leak, P1 CSRF, P1 SSRF, root container) corroborates and escalates the evaluator's injection finding rather than contradicting it. The redaction boundary is the bright spot; the transport around it is not.
Performance 4 Async throughout, parallel health fans, unreachable-host cache avoids repeated 6s SSH stalls; _win_rms per-sample Python loop is the one hot spot (audio_proxy.py:635).
Testing 3 Two thorough offline redaction suites pass (69/69 + leak); everything else — swap, proxies, SSH, package — is untested, and live-cluster paths couldn't be exercised at all.
Code quality 4 Consistent style, useful "why" comments, typed dataclasses; server.py (962 lines) and audio_proxy.py (829) are getting long.
Documentation 4 Excellent AGENTS.md, scoped guides, HANDOFF, dated known-issues.md; undercut by the stale README status line.

No lens score was overturned by cross-agent evidence; Security stays at 2 with the auditor's findings reinforcing it.

Disagreements & gaps

  • Injection severity: auditor P0 vs evaluator P1. Resolved to P0 — the disagreement is purely about whether the no-auth/LAN posture demotes it; the auditor's CSRF finding shows it's reachable from a browser, so the network gate is weaker than the evaluator assumed.
  • Front-end XSS: the evaluator flagged one unescaped innerHTML sink (current_model) and the exerciser flagged task_id reflection, while the auditor judged the UI broadly escapeHtml-clean (47 escape calls). Low-stakes (JSON API + mostly-escaped render path) but unresolved.
  • Shared blind spot: no agent could exercise the live-cluster paths — actual swap execution, audio transcription/diarization/label-merge, embeddings/search with real vectors. These are simultaneously the largest, most security-relevant, and least-tested modules (swap.py, audio_proxy.py, services.py), so a regression in launch-command construction or speaker-merge logic would ship silently. The evaluator and exerciser both name this gap.
  • Registry context: the spec-checker notes there is currently no StartOS 0.4 community registry (alpha only), so its blockers are inferred from the 0.3.5.x submission doc — applicable when 0.4 opens, but the process may change.

Suggested order of work

  1. Close the injection sinksshlex.quote or strict-regex-validate every user-controlled value crossing into SSH (repo, vllm_args, NIM image/container/port, custom-service names); the safe pattern already exists in disk.py:_SAFE_DIRNAME. Cheap, local, independent of the auth decision. (P0)
  2. Decide the git-history question before any wider sharing — rewrite history (git-filter-repo) and rotate the named <ssh-key> key, or commit to keeping the repo private-forever. (P0)
  3. Add a defense-in-depth gate on mutating endpoints — an Origin/referer check or a shared-token header in middleware — so a misconfigured StartOS exposure isn't instant RCE; leave read-only probes open. (P1)
  4. Harden the remaining inputs — validate the Qdrant collection, pin dependency floors + commit a lockfile, add upload size caps, drop the root container USER. (P1P2)
  5. Add a minimal pytest harness for build_launch_command (incl. injection cases), the swap state transitions, and _merge_words_with_speakers — the untested core. (P2)
  6. Fix the doc/packaging drift — README status block, the example.com manifest URLs, the instructions.md link, release-note content, and the hardcoded app.version. (P2P3)
  7. If pursuing the registry later — publish source publicly, build the declared aarch64 artifact, and run the manual on-box checklist (start-cli s9pk inspect, install/uninstall, backup/restore). (P2)