From 1c4e8617839c80640792e84920cd5c5f0ccbbf4f Mon Sep 17 00:00:00 2001 From: Keysat Date: Fri, 12 Jun 2026 16:36:33 -0500 Subject: [PATCH] v0.19.0:0 - harden cluster-control surface: ssh injection, qdrant path, csrf Triaged from a full independent evaluation (EVALUATION.md). Addresses the three P0/P1 code findings; the proxy/data APIs that downstream apps consume are deliberately untouched. - ssh command injection (P0): new shellsafe.py validates + shlex.quotes every user-supplied value crossing into an SSH command on the Sparks (model repo, vllm args/knobs, NIM image/container/volume/port/env, service names). Boundary validation on POST /api/models and POST /api/nim/install; quoting at every sink in models/download/nim/services. NGC key now quoted too. - qdrant path injection (P1): /api/search validates the collection name against a metacharacter-free whitelist and URL-encodes the path segment. - csrf (P1): csrf_guard middleware enforces same-origin on state-changing control endpoints; /v1/*, /scrub, /rehydrate, /api/search, /api/audio/* and /api/health-event are exempt so external consumers are unaffected. Verified: injection survives only as a single quoted token, vLLM preflight shlex.split round-trip intact, CSRF behaviors covered via TestClient, both offline redaction suites still pass, tsc clean, s9pk rebuilt. --- AGENTS.md | 32 ++++++++++++-- EVALUATION.md | 70 ++++++++++++++++++++++++++++++ image/app/download.py | 6 +-- image/app/embeddings_proxy.py | 18 +++++++- image/app/models.py | 6 ++- image/app/nim.py | 35 ++++++++++----- image/app/server.py | 48 ++++++++++++++++++++ image/app/services.py | 5 ++- image/app/shellsafe.py | 60 +++++++++++++++++++++++++ package/startos/versions/v0_1_0.ts | 4 +- 10 files changed, 260 insertions(+), 24 deletions(-) create mode 100644 EVALUATION.md create mode 100644 image/app/shellsafe.py diff --git a/AGENTS.md b/AGENTS.md index 2a5b5cf..4d5a2be 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -53,9 +53,35 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou - **Working (v0.18.0:0, installed and serving):** swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel mode. Spark 2 audio stack is healthy (11k+ requests/12h, all 200). - **In progress — Signal Engine "flakiness":** diagnosed, not a server bug — transient 1–4s unresponsiveness while the single GPU is continuously busy. Remedy is client-side; a drafted message (in-flight cap 2, hard ceiling 3 global across audio endpoints, retry-with-backoff on timeout/503) is with the owner to forward to that dev. -- **Decided, not implemented:** remote access stays WireGuard/Tailscale split-tunnel — no public interface, so no API auth built; an empirical concurrency sweep is offered but needs the owner's explicit OK in a quiet window. +- **Decided, not implemented:** remote access stays WireGuard/Tailscale split-tunnel — no public interface, so no API auth built; an empirical concurrency sweep is offered but needs the owner's explicit OK in a quiet window. **Revisit (full-eval 2026-06-12):** the "LAN-only, so no auth" call is now load-bearing against RCE — unquoted user input reaches the SSH shell on several endpoints, so the network boundary is the *only* thing preventing cluster takeover. Quoting the injection sinks (work queue) is needed regardless of the auth decision; a defense-in-depth auth/CSRF gate is the follow-on. - **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; the connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers. -- **Portability:** audited 2026-06-12, now compliant — all owner-specific IPs/hostnames/usernames/names scrubbed from tracked files to placeholders; `claude-code-starter-prompt.md` deleted (old build-time prompt). Real cluster values live only in StartOS install config, shell env vars, and the gitignored `settings.local.json`. +- **Portability:** working tree scrubbed 2026-06-12 — all owner-specific IPs/hostnames/usernames/names replaced with placeholders in tracked files; `claude-code-starter-prompt.md` deleted (old build-time prompt). Real cluster values live only in StartOS install config, shell env vars, and the gitignored `settings.local.json`. **Caveat (full-eval 2026-06-12): git *history* was not rewritten** — the old IPs/hosts/user ``/key name are still recoverable pre-`50c67cd`. The scrub is working-tree-only; treat the repo as private until history is rewritten (see work queue below). - **Repo wart:** commit `367d986` is labeled `v0.13.0:4` but actually contains everything through v0.18.0:0 — per-version commits for v0.14–v0.18 are missing. Keep commit messages accurate going forward. - **Hosting:** repo pushes to the owner's self-hosted Gitea — remote `gitea`, branch `master`, over SSH (host alias + key live in the local `~/.ssh/config`; no owner-specific details belong in the repo). Push there after committing. -- **Next:** (1) owner forwards the concurrency note to the Signal Engine dev; (2) run the concurrency sweep if the dev wants the measured knee; (3) add the `--memory` cap to parakeet-asr via the Reapply-patches action; (4) pick the next item from ROADMAP.md. +- **Next (pre-eval backlog):** (1) owner forwards the concurrency note to the Signal Engine dev; (2) run the concurrency sweep if the dev wants the measured knee; (3) add the `--memory` cap to parakeet-asr via the Reapply-patches action; (4) pick the next item from ROADMAP.md. + +### Full-eval triage (2026-06-12) + +Source: `EVALUATION.md` at repo root (full evidence, file:line pointers, scorecard). Findings triaged below; do these before the pre-eval backlog above where they overlap. + +**Work queue — P0/P1, fix before sharing the package wider:** +1. ~~**[P0] Shell-quote/validate every user value crossing into SSH**~~ — **DONE (code, 2026-06-12; not yet shipped).** New `image/app/shellsafe.py` (`validate_repo`/`validate_image`/`validate_container` whitelists + `quote_arg`/`quote_args`). Boundary validation added to `POST /api/models` (repo) and `POST /api/nim/install` (image+container); `shlex.quote` applied at every SSH sink — `models.build_launch_command` (repo+args, covers `vllm_args`+knobs), `download._do` (repo), `nim._do` (image/container/volume/port/env), `services.docker_state`+`run_action` (container). Verified: injection survives only as a single quoted token, vLLM preflight `shlex.split` round-trip intact, both redaction suites still pass. Side-benefit: NGC key now `shlex.quote`'d in `nim._do` (was single-quoted) — closes the quote-breakout half of the P2 NGC-key item; the process-list-exposure half remains. **Ship step pending:** version bump + release notes + rebuilt s9pk. +2. **[P0] Decide the git-history question** — owner IPs/hosts/user ``/key name persist pre-`50c67cd` despite the working-tree scrub. Either rewrite history (`git-filter-repo`) + rotate the `` key, or keep the repo private-forever. Blocks any public/shared publish. **(Open — git-ops decision, not code.)** +3. ~~**[P1] Defense-in-depth gate on mutating endpoints**~~ — **DONE (code, 2026-06-12; not yet shipped).** `csrf_guard` HTTP middleware in `server.py` rejects state-changing requests whose `Origin`/`Referer` hostname ≠ the served host. Scoped to control endpoints; the programmatic API surface is exempt (`/v1/*`, `/scrub`, `/rehydrate`, `/api/search`, `/api/audio/`, `/api/health-event`) so downstream consumers are unaffected. No app-layer token auth (deliberate — would break consumers + the non-technical owner). Verified via TestClient: cross-origin control POST→403, same-origin/no-Origin→pass, exempt prefixes always pass, GET never blocked. **Verify on-box:** confirm the StartOS reverse proxy passes `Host`/`Origin` so the dashboard isn't false-positive-blocked. +4. ~~**[P1] Validate the Qdrant `collection`**~~ — **DONE (code, 2026-06-12; not yet shipped).** `_safe_collection` whitelist (`[A-Za-z0-9._-]`, rejects `..`) + URL-encoded path segment in `embeddings_proxy.py`. The raw `filter` is left as a passthrough (Qdrant parses it; pydantic enforces `dict`) — locking it to an allowlist would break hybrid-search consumers; the path segment was the real injection vector. + +**Shipping (all of #1/#3/#4 batched):** version bumped `0.18.0:1`→`0.19.0:0` with release notes (`versions/v0_1_0.ts`). Rebuild `make x86`; `make install` (live-service restart) needs explicit go-ahead. Not committed yet. + +**Known debt — P2, track but not blocking:** +- Test coverage is redaction-only; swap state machine, proxies, SSH wrapper, and the package have zero automated tests. Live-cluster paths (swap exec, audio, embeddings/search) couldn't be exercised at all — biggest blind spot. +- Loose dependency floors permit vulnerable `python-multipart`/`starlette` (DoS CVEs) on rebuild; no lockfile; no upload size caps (`pyproject.toml:6-13`). +- StartOS registry blockers (only if pursuing the registry): source not public + `packageRepo`/`upstreamRepo` are `example.com` placeholders (`manifest/index.ts:12-13`). +- Opaque HTTP 500 on `POST /api/models` / `PUT /knobs` when `MODELS_OVERRIDES` unset in dev (write to read-only `/data`) — catch the `OSError`. +- NGC API key inlined single-quoted into a remote shell command (`nim.py:147`) — pass via stdin/env. +- Global mutable `catalog` reassigned via `global`, shared across async requests with no snapshot (`server.py:107`) — latent race as concurrency grows. +- Container runs uvicorn as **root** bound to `0.0.0.0:9999` (no `USER` in Dockerfile) — amplifies any RCE blast radius. + +**Parked — P3+, do in bulk when next touching docs/packaging:** +- README Status block stale (`v0.2.3 / 0.13.0:4` → v0.18.0:1, undercounts features); deprecated `@app.on_event` + hardcoded `app.version="0.1.0"`; `NimInstallBody.register` shadows `BaseModel` (rename → `register_service`); httpx class names leak into TTS/speech-models error text; one unescaped `innerHTML` sink (`app.js:177`) + `task_id` reflected in scrub JSON. +- Packaging cosmetics: `marketingUrl` placeholder; broken `instructions.md` source link; per-service SSH users (`parakeet_user` etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); `Makefile` builds only x86 though manifest declares `aarch64`; release notes describe the scrub, not capabilities. +- Hardening misc: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text echoed to clients. diff --git a/EVALUATION.md b/EVALUATION.md new file mode 100644 index 0000000..6812e5e --- /dev/null +++ b/EVALUATION.md @@ -0,0 +1,70 @@ +# Evaluation — spark-control — 2026-06-12 + +Intent: A browser-based StartOS 0.4 package controlling a dual-DGX-Spark vLLM cluster — one-click model swaps plus health, proxying, and APIs for speech (STT/diarization/TTS), embeddings, and redaction. + +Agents run: evaluator, security-auditor, exerciser, start9-spec-checker. Reviewer skipped (working tree clean — no diff to review). + +## Verdict + +This is a capable, well-documented single-operator control plane: a ~960-line FastAPI app fronting SSH-driven model swaps plus honest HTTP proxies for chat, speech, embeddings, and a genuinely well-engineered fail-closed redaction gateway, wrapped by a thin, spec-conformant StartOS 0.4 package that builds cleanly and passes both offline test suites. The app boots and behaves correctly with the cluster absent, and the packaging is compliant on every structural requirement. The dominant risk, corroborated by two agents at the same code paths, is **unauthenticated remote command execution**: several endpoints interpolate caller-controlled strings (`repo`, `vllm_args`, NIM `image`/`container`, custom-service names) unquoted into shell commands run over SSH on the GPU nodes, and the app has no auth or CSRF protection by design — so the LAN/VPN trust boundary is the only thing between a browser-reachable request and cluster RCE. Owner infra topology (IPs, hostnames, SSH username, key name) was scrubbed from the working tree but still lives in git history, handing an attacker a target list for exactly those endpoints. The package is structurally ready but not safe to share widely until the injection sinks are quoted/validated and the history is dealt with. + +## Cross-referenced findings + +- **Command injection → cluster RCE** is reported by *both* the evaluator (P1) and the security-auditor (P0) at the same sinks (`models.py:80`, `swap.py:101`, `download.py:129`, `nim.py:145-166`, `services.py:144`). The evaluator demonstrated `build_launch_command` producing a live `;`-separated command from a hostile `repo`. Merged as **one P0** — the auditor's adversarial evidence (browser/CSRF reachability over plaintext HTTP, no auth) escalates the evaluator's network-gated P1. +- **No auth on state-mutating endpoints** is the shared root enabler: the evaluator filed it P2 (documented/intentional), the auditor filed the **CSRF** angle P1 (a malicious page in the operator's browser can `fetch()` the mutating routes and chain into the P0 injections). Merged into one P1, noting the auditor's CSRF evidence escalates the evaluator's original P2. +- **Owner data exposure**: the evaluator flagged real IPs/username in the (gitignored, untracked) `.claude/settings.local.json`; the auditor independently found the same class of data — IPs, hostnames, user ``, key name — persisting in **git history** despite the v0.18.0:1 working-tree scrub. These are the same concern at two locations; the git-history copy is the P0. +- **Front-end output hygiene**: the evaluator flagged `current_model` rendered via `innerHTML` without `escapeHtml` (`app.js:177`, P3); the exerciser noted `task_id` echoed verbatim in scrub JSON. The auditor read the UI as broadly `escapeHtml`-clean — see Disagreements. + +## Priority queue + +- [P0] Command injection via unquoted user input (`repo`, `vllm_args`, NIM `image`/`container`/`port`, custom-service `container`) interpolated into SSH shell commands → arbitrary RCE as the SSH user on the Sparks — `models.py:80`, `swap.py:101`, `download.py:129`, `nim.py:145-166`, `services.py:144`; demonstrated via `build_launch_command` — evaluator + security-auditor +- [P0] Owner infra topology (IPs `/.87`, QSFP `/11`, hosts ``/``, user ``, key ``) persists in git history pre-`50c67cd` despite the working-tree scrub → target list for the unauthenticated endpoints — security-auditor +- [P1] No auth + no CSRF protection on state-changing endpoints (plaintext `http`, `interfaces.ts:8`) → any LAN peer, or a malicious page in the operator's browser, can drive swap/install/stop/delete and chain into the P0 injections — security-auditor (CSRF P1) + evaluator (auth P2, escalated) +- [P1] SSRF / Qdrant path injection: caller `collection` interpolated into the Qdrant URL with no validation and raw `filter` forwarded verbatim — `embeddings_proxy.py:237,175,204` — security-auditor +- [P2] Test coverage is redaction-only; the swap state machine, proxies, SSH wrapper, and the StartOS package have zero automated tests — evaluator +- [P2] Loose dependency floors permit known-vulnerable `python-multipart`/`starlette` (DoS CVE-2024-53981 / CVE-2024-47874) on rebuild; no lockfile; no upload size caps — `pyproject.toml:6-13` — security-auditor +- [P2] Registry-submission blockers: source not public + `packageRepo`/`upstreamRepo` are `https://example.com` placeholders — `manifest/index.ts:12-13` — start9-spec-checker +- [P2] Unhandled `OSError` → opaque HTTP 500 on `POST /api/models` and `PUT /knobs` when `MODELS_OVERRIDES` is unset in dev (write to read-only `/data`) — exerciser +- [P2] NGC API key inlined single-quoted into a remote shell command (`export NGC_API_KEY='...'`) → quote-breakout risk + exposure in target process list — `nim.py:147` — security-auditor +- [P2] Single global mutable `catalog` reassigned via `global`, shared across in-flight async requests with no snapshot → latent race as concurrency grows — `server.py:107` — evaluator +- [P2] Container runs uvicorn as **root** (no `USER` in Dockerfile) bound to `0.0.0.0:9999` → any injection RCE runs the SSH client as root in-container — security-auditor (surprise) +- [P3] README Status block stale ("v0.2.3 / s9pk 0.13.0:4", undercounts features) vs actual v0.18.0:1 — `README.md:115` — evaluator +- [P3] `current_model` rendered via `innerHTML` without `escapeHtml` (`app.js:177`); `task_id` echoed verbatim in scrub JSON — evaluator + exerciser +- [P3] httpx exception class names leak into `/v1/audio/speech` and `/api/speech-models` error responses — exerciser +- [P3] `NimInstallBody.register` shadows `BaseModel` attribute → `UserWarning` on every startup; rename (e.g. `register_service`) — exerciser +- [P3] Deprecated `@app.on_event` startup/shutdown and hardcoded `app.version="0.1.0"` (real version 0.18.0:1) — `server.py:49,55` — evaluator +- [P3] `marketingUrl` is an `example.com` placeholder (set `null` or a real URL) — `manifest/index.ts:14` — start9-spec-checker +- [P3] `instructions.md:35` has a broken/template source link (`github.com/Start9Labs/... (TBD)`) visible to end users — start9-spec-checker +- [P3] Per-service SSH users (`parakeet_user`/`kokoro_user`/`embed_user`/`qdrant_user`) are read by `main.ts` but absent from the Configure-Sparks action inputSpec → silent default-to-empty misconfig — start9-spec-checker +- [P3] `Makefile` builds only `x86` though the manifest declares `aarch64`; release notes describe the portability scrub, not package capabilities — start9-spec-checker +- [P3] Hardening: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text (`r.text[:500]`) echoed to clients — security-auditor + +## Scorecard + +| Lens | Score /5 | Justification (cross-checked) | +|------|----------|-------------------------------| +| Architecture | 4 | Clean router-per-concern split, all SSH funnelled through one wrapper (`ssh.py:29`), proxies stay intentionally dumb; global mutable `catalog` + deprecated `on_event` are minor seams. | +| Security | 2 | Held at 2: auditor's evidence (P0 git-history leak, P1 CSRF, P1 SSRF, root container) corroborates and escalates the evaluator's injection finding rather than contradicting it. The redaction boundary is the bright spot; the transport around it is not. | +| Performance | 4 | Async throughout, parallel health fans, unreachable-host cache avoids repeated 6s SSH stalls; `_win_rms` per-sample Python loop is the one hot spot (`audio_proxy.py:635`). | +| Testing | 3 | Two thorough offline redaction suites pass (69/69 + leak); everything else — swap, proxies, SSH, package — is untested, and live-cluster paths couldn't be exercised at all. | +| Code quality | 4 | Consistent style, useful "why" comments, typed dataclasses; `server.py` (962 lines) and `audio_proxy.py` (829) are getting long. | +| Documentation | 4 | Excellent AGENTS.md, scoped guides, HANDOFF, dated `known-issues.md`; undercut by the stale README status line. | + +No lens score was overturned by cross-agent evidence; Security stays at 2 with the auditor's findings reinforcing it. + +## Disagreements & gaps + +- **Injection severity**: auditor P0 vs evaluator P1. Resolved to P0 — the disagreement is purely about whether the no-auth/LAN posture demotes it; the auditor's CSRF finding shows it's reachable from a browser, so the network gate is weaker than the evaluator assumed. +- **Front-end XSS**: the evaluator flagged one unescaped `innerHTML` sink (`current_model`) and the exerciser flagged `task_id` reflection, while the auditor judged the UI broadly `escapeHtml`-clean (47 escape calls). Low-stakes (JSON API + mostly-escaped render path) but unresolved. +- **Shared blind spot**: no agent could exercise the live-cluster paths — actual swap execution, audio transcription/diarization/label-merge, embeddings/search with real vectors. These are simultaneously the **largest, most security-relevant, and least-tested** modules (`swap.py`, `audio_proxy.py`, `services.py`), so a regression in launch-command construction or speaker-merge logic would ship silently. The evaluator and exerciser both name this gap. +- **Registry context**: the spec-checker notes there is currently no StartOS 0.4 community registry (alpha only), so its blockers are inferred from the 0.3.5.x submission doc — applicable when 0.4 opens, but the process may change. + +## Suggested order of work + +1. **Close the injection sinks** — `shlex.quote` or strict-regex-validate every user-controlled value crossing into SSH (`repo`, `vllm_args`, NIM `image`/`container`/`port`, custom-service names); the safe pattern already exists in `disk.py:_SAFE_DIRNAME`. Cheap, local, independent of the auth decision. (P0) +2. **Decide the git-history question** before any wider sharing — rewrite history (`git-filter-repo`) and rotate the named `` key, or commit to keeping the repo private-forever. (P0) +3. **Add a defense-in-depth gate** on mutating endpoints — an `Origin`/referer check or a shared-token header in middleware — so a misconfigured StartOS exposure isn't instant RCE; leave read-only probes open. (P1) +4. **Harden the remaining inputs** — validate the Qdrant `collection`, pin dependency floors + commit a lockfile, add upload size caps, drop the root container `USER`. (P1–P2) +5. **Add a minimal pytest harness** for `build_launch_command` (incl. injection cases), the swap state transitions, and `_merge_words_with_speakers` — the untested core. (P2) +6. **Fix the doc/packaging drift** — README status block, the `example.com` manifest URLs, the `instructions.md` link, release-note content, and the hardcoded `app.version`. (P2–P3) +7. **If pursuing the registry later** — publish source publicly, build the declared `aarch64` artifact, and run the manual on-box checklist (`start-cli s9pk inspect`, install/uninstall, backup/restore). (P2) diff --git a/image/app/download.py b/image/app/download.py index f2a00e0..8d010f3 100644 --- a/image/app/download.py +++ b/image/app/download.py @@ -16,6 +16,7 @@ from datetime import datetime, timezone from typing import Literal, Optional from .config import Settings +from .shellsafe import quote_arg, validate_repo from .ssh import ssh_stream, StreamHandle @@ -77,8 +78,7 @@ class DownloadManager: return self.jobs.get(job_id) async def trigger(self, repo: str, mode: Mode) -> DownloadJob: - if not repo or "/" not in repo: - raise ValueError("repo must be in 'org/name' form") + validate_repo(repo) # raises ValueError on anything but a clean 'org/name' if self.lock.locked(): raise RuntimeError("A download is already in progress") job = DownloadJob( @@ -126,7 +126,7 @@ class DownloadManager: if not target_host or not target_user: raise RuntimeError(f"{job.mode} host not configured") - cmd = f"cd ~/spark-vllm-docker && ./hf-download.sh {job.repo} {flags}".strip() + cmd = f"cd ~/spark-vllm-docker && ./hf-download.sh {quote_arg(job.repo)} {flags}".strip() job.append(f"$ {cmd}") job.state = "downloading" job.progress.phase = "Connecting to Hugging Face…" diff --git a/image/app/embeddings_proxy.py b/image/app/embeddings_proxy.py index 33acb38..eb3d79f 100644 --- a/image/app/embeddings_proxy.py +++ b/image/app/embeddings_proxy.py @@ -25,8 +25,10 @@ vector is supplied, /api/search degrades cleanly to dense + rerank. """ from __future__ import annotations import logging +import re import time from typing import Any, Optional, Union +from urllib.parse import quote as urlquote import httpx from fastapi import APIRouter, HTTPException @@ -36,6 +38,19 @@ from .config import Settings logger = logging.getLogger("spark-control.embeddings") +# Qdrant collection name: caller-supplied and interpolated into the Qdrant URL +# path. Restrict to a metacharacter-free whitelist so it cannot inject path +# segments ('/', '..'), a query string ('?'), or a fragment ('#') and pivot to +# other collections/endpoints on the internal Qdrant. (Qdrant's own names are +# alphanumerics + dot/dash/underscore.) +_COLLECTION_RE = re.compile(r"^[A-Za-z0-9._-]+$") + + +def _safe_collection(name: str) -> str: + if not name or ".." in name or not _COLLECTION_RE.fullmatch(name): + raise HTTPException(400, f"invalid collection name: {name!r}") + return name + # Embedding/rerank can be slow on a cold model; search is interactive. EMBED_TIMEOUT = 120.0 QDRANT_TIMEOUT = 30.0 @@ -175,6 +190,7 @@ def build_router(settings: Settings) -> APIRouter: collection = body.collection or settings.qdrant_collection if not collection: raise HTTPException(400, "collection is required (no default configured)") + collection = _safe_collection(collection) top_k = max(1, min(body.top_k, 100)) retrieve_n = body.retrieve_n or max(50, top_k * 10) @@ -234,7 +250,7 @@ def build_router(settings: Settings) -> APIRouter: t1 = time.time() qr = await _post( - f"{_qdrant_base()}/collections/{collection}/points/query", + f"{_qdrant_base()}/collections/{urlquote(collection, safe='')}/points/query", query_body, QDRANT_TIMEOUT, "qdrant", ) if qr.status_code == 404: diff --git a/image/app/models.py b/image/app/models.py index 061596b..0d00e3e 100644 --- a/image/app/models.py +++ b/image/app/models.py @@ -4,6 +4,7 @@ import yaml from pydantic import BaseModel, Field from .overrides import apply_knobs_to_args, load_overrides +from .shellsafe import quote_arg, quote_args class ModelDef(BaseModel): @@ -77,4 +78,7 @@ def build_launch_command(key: str, model: ModelDef, defaults: Defaults) -> str: solo = "--solo " if model.mode == "solo" else "" base_args = apply_knobs_to_args(list(model.vllm_args), model.knobs) args = [f"--port={defaults.port}", f"--host={defaults.host}", *base_args] - return f"./launch-cluster.sh {solo}-d exec vllm serve {model.repo} {' '.join(args)}" + # repo + args are user-controlled (custom models, knobs); shlex.quote each so + # they cannot break out of the SSH shell command. shlex.split (used by the + # vLLM pre-flight validator) cleanly reverses this quoting. + return f"./launch-cluster.sh {solo}-d exec vllm serve {quote_arg(model.repo)} {quote_args(args)}" diff --git a/image/app/nim.py b/image/app/nim.py index d8b4019..c7013ff 100644 --- a/image/app/nim.py +++ b/image/app/nim.py @@ -18,6 +18,7 @@ from datetime import datetime, timezone from typing import Optional from .config import Settings +from .shellsafe import quote_arg from .ssh import ssh_stream, StreamHandle @@ -138,30 +139,40 @@ class NimManager: async def _do(self, job: NimInstallJob, extra_env: dict[str, str]) -> None: # Build the bash one-liner. We use docker login non-interactively with the NGC API key. - env_parts = [f'-e NGC_API_KEY=$NGC_API_KEY'] + # The real docker commands use shlex.quote'd values (img/ctr/vol) so nothing + # user-controlled can break out of the SSH shell. The cosmetic `echo` log lines + # embed the *raw* values inside single quotes — safe because image/container are + # validated against a metacharacter-free whitelist at the API boundary, and + # volume/port derive from them. (Embedding shlex.quote output inside another + # quoted echo string would be wrong — it can re-expose $() / $VAR.) + img = quote_arg(job.image) + ctr = quote_arg(job.container) + vol = quote_arg(job.volume) + port = int(job.port) # int can't inject; coerce defensively + env_parts = ['-e NGC_API_KEY=$NGC_API_KEY'] for k, v in extra_env.items(): - env_parts.append(f"-e {k}={v}") + env_parts.append(f"-e {quote_arg(k)}={quote_arg(v)}") env_str = " ".join(env_parts) cmd = ( f"set -e; " - f"export NGC_API_KEY='{self.settings.ngc_api_key}'; " + f"export NGC_API_KEY={quote_arg(self.settings.ngc_api_key or '')}; " f"echo '=== docker login nvcr.io ==='; " f"echo \"$NGC_API_KEY\" | docker login nvcr.io -u '$oauthtoken' --password-stdin; " f"echo '=== docker pull {job.image} (this can be 1-10 GB) ==='; " - f"docker pull {job.image}; " + f"docker pull {img}; " f"echo '=== remove any prior container with the same name ==='; " - f"docker rm -f {job.container} 2>/dev/null || true; " - f"echo '=== docker run -d --gpus all -p {job.port}:{job.port} -v {job.volume}:/opt/nim/.cache {env_str} --name {job.container} --restart unless-stopped {job.image} ==='; " + f"docker rm -f {ctr} 2>/dev/null || true; " + f"echo '=== docker run -d --gpus all -p {job.port}:{job.port} -v {job.volume}:/opt/nim/.cache --name {job.container} --restart unless-stopped {job.image} ==='; " f"docker run -d --gpus all " - f"-p {job.port}:{job.port} " - f"-v {job.volume}:/opt/nim/.cache " + f"-p {port}:{port} " + f"-v {vol}:/opt/nim/.cache " f"{env_str} " - f"--name {job.container} " + f"--name {ctr} " f"--restart unless-stopped " - f"{job.image}; " + f"{img}; " f"echo '=== ensuring cache volume is writable by uid 1000 (riva-server) ==='; " - f"docker run --rm -v {job.volume}:/cache alpine chown -R 1000:1000 /cache && " - f"docker restart {job.container}; " + f"docker run --rm -v {vol}:/cache alpine chown -R 1000:1000 /cache && " + f"docker restart {ctr}; " f"echo '=== install complete; container is starting up and will download its model on first boot ==='" ) job.append(f"$ ") diff --git a/image/app/server.py b/image/app/server.py index 1027a56..c79a5e8 100644 --- a/image/app/server.py +++ b/image/app/server.py @@ -25,6 +25,7 @@ from .models import load_catalog from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager from .overrides import add_custom, delete_custom, extract_knobs_from_args, load_overrides, set_knobs from .services import docker_state, run_action, services_from_settings +from .shellsafe import validate_container, validate_image, validate_repo from .speech_models import SpeechModelsManager from .ssh import ssh_run from .swap import SwapManager @@ -46,6 +47,44 @@ speech_models = SpeechModelsManager(settings) app = FastAPI(title="spark-control", version="0.1.0") +# ---- Same-origin (CSRF) guard on state-mutating control endpoints ---- +# The app ships no API auth by design (LAN/VPN-only, no public interface). That +# makes the realistic remote threat a *browser-driven CSRF*: a malicious page open +# in the operator's browser silently POSTing to the control endpoints (swap, NIM +# install, service stop, disk delete, …) while they're on the trusted network. +# Browsers attach an Origin (and Referer) header to every cross-site state-changing +# request, so we reject mutating requests whose Origin/Referer hostname doesn't +# match the host the dashboard was served from. Programmatic consumers (Recap Relay, +# CRM, Open WebUI, …) hit the proxy/data surface below and send no browser Origin, +# so they're unaffected; the exempt prefixes are the cross-origin-by-design API. +_CSRF_SAFE_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE"} +_CSRF_EXEMPT_PREFIXES = ( + "/v1/", # OpenAI-compatible chat/audio/embeddings/rerank proxies + "/scrub", "/rehydrate", # redaction gateway (used by downstream apps) + "/api/search", # retrieval proxy + "/api/audio/", # diarize-chunk / label-merge / transcribe-with-speakers + "/api/health-event", # health reports posted by consumer apps +) + + +@app.middleware("http") +async def csrf_guard(request, call_next): + if request.method not in _CSRF_SAFE_METHODS and not request.url.path.startswith(_CSRF_EXEMPT_PREFIXES): + origin = request.headers.get("origin") or request.headers.get("referer") + if origin: + from urllib.parse import urlparse + origin_host = urlparse(origin).hostname + req_host = (request.headers.get("host") or "").rsplit(":", 1)[0] + # Only block when we can positively identify a mismatch; absence of a + # header (non-browser client) or an unparseable Host falls through. + if origin_host and req_host and origin_host != req_host: + return JSONResponse( + status_code=403, + content={"detail": "cross-origin request to a control endpoint was blocked"}, + ) + return await call_next(request) + + @app.on_event("startup") async def _start_deep_health() -> None: # Fire-and-forget; the loop catches its own exceptions. @@ -155,6 +194,10 @@ class CustomModelBody(BaseModel): async def post_model(body: CustomModelBody) -> dict: if not body.key or not body.key.replace("-", "").replace("_", "").isalnum(): raise HTTPException(400, "key must be alphanumeric/-/_ only") + try: + validate_repo(body.repo) + except ValueError as e: + raise HTTPException(400, str(e)) if body.key in catalog.models and not catalog.models[body.key].custom: raise HTTPException(409, f"'{body.key}' is a bundled model — pick a different key") add_custom(body.model_dump()) @@ -435,6 +478,11 @@ class NimInstallBody(BaseModel): @app.post("/api/nim/install") async def post_nim_install(body: NimInstallBody) -> dict: + try: + validate_image(body.image) + validate_container(body.container) + except ValueError as e: + raise HTTPException(400, str(e)) target_host = settings.spark1_host if body.host == "spark1" else settings.spark2_host target_user = settings.spark1_user if body.host == "spark1" else settings.spark2_user try: diff --git a/image/app/services.py b/image/app/services.py index 64aa55d..b44308d 100644 --- a/image/app/services.py +++ b/image/app/services.py @@ -10,6 +10,7 @@ from dataclasses import dataclass from typing import Literal, Optional from .config import Settings +from .shellsafe import quote_arg from .ssh import ssh_run @@ -111,7 +112,7 @@ async def docker_state(settings: Settings, svc: ServiceDef) -> dict: if _is_recently_unreachable(svc.host, svc.user): return {"state": "unreachable", "host_unreachable": True, "restart_count": None, "uptime": None} cmd = ( - f"docker inspect {svc.container} " + f"docker inspect {quote_arg(svc.container)} " f"--format '{{{{.State.Status}}}}|{{{{.State.StartedAt}}}}|{{{{.RestartCount}}}}|{{{{.State.ExitCode}}}}|{{{{.State.Error}}}}' " f"2>&1 || echo 'NOT_FOUND'" ) @@ -141,7 +142,7 @@ async def run_action(settings: Settings, svc: ServiceDef, action: ServiceAction) """Run docker start/stop/restart on the target host.""" if not svc.host or not svc.user: return {"ok": False, "error": "service host not configured"} - cmd = f"docker {action} {svc.container}" + cmd = f"docker {action} {quote_arg(svc.container)}" rc, out, err = await ssh_run(svc.host, svc.user, cmd, settings, timeout=30) return { "ok": rc == 0, diff --git a/image/app/shellsafe.py b/image/app/shellsafe.py new file mode 100644 index 0000000..f2f3f6e --- /dev/null +++ b/image/app/shellsafe.py @@ -0,0 +1,60 @@ +"""Validation + safe-quoting for user-supplied values that cross into SSH shell +commands on the Sparks. + +Two layers of defense (same spirit as disk.py's `_SAFE_DIRNAME`): + 1. Validate at the API boundary against a strict whitelist — rejects junk + early with a clear error, and guarantees the value carries no shell + metacharacters (so it is also safe to drop into echo/log lines). + 2. `quote_arg` / `quote_args` at the actual interpolation site — the real + guarantee: even a value that somehow skips validation cannot break out of + the command. + +Rule: anything user-controlled that ends up in an `ssh_run` / `ssh_stream` +command string must go through one of these, never be raw f-string'd. +""" +from __future__ import annotations +import re +import shlex + +# Hugging Face repo 'org/name'. HF identifiers allow letters, digits, dot, dash, +# underscore; exactly one slash separates org from name. +_HF_REPO_RE = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$") + +# Docker/OCI image reference: registry/path/name[:tag][@sha256:digest]. +# Conservative charset covering e.g. nvcr.io/nim/nvidia/parakeet-...:latest and +# @digest pins; excludes every shell metacharacter. +_IMAGE_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._:/@-]*$") + +# Docker container / volume name (Docker's own rule). +_CONTAINER_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*$") + + +def validate_repo(repo: str) -> str: + """Return `repo` if it is a well-formed 'org/name'; else raise ValueError.""" + if not _HF_REPO_RE.fullmatch(repo or ""): + raise ValueError(f"invalid model repo (expected 'org/name'): {repo!r}") + return repo + + +def validate_image(image: str) -> str: + """Return `image` if it is a well-formed container image ref; else ValueError.""" + if not image or len(image) > 512 or not _IMAGE_RE.fullmatch(image): + raise ValueError(f"invalid container image reference: {image!r}") + return image + + +def validate_container(name: str) -> str: + """Return `name` if it is a valid Docker container/volume name; else ValueError.""" + if not name or len(name) > 128 or not _CONTAINER_RE.fullmatch(name): + raise ValueError(f"invalid container name: {name!r}") + return name + + +def quote_arg(value: object) -> str: + """shlex.quote a single token for safe embedding in a shell command string.""" + return shlex.quote(str(value)) + + +def quote_args(values: object) -> str: + """shlex.quote each token and join with spaces.""" + return " ".join(shlex.quote(str(v)) for v in values) # type: ignore[union-attr] diff --git a/package/startos/versions/v0_1_0.ts b/package/startos/versions/v0_1_0.ts index 1668266..5c3bd15 100644 --- a/package/startos/versions/v0_1_0.ts +++ b/package/startos/versions/v0_1_0.ts @@ -1,10 +1,10 @@ import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk' export const v0_1_0 = VersionInfo.of({ - version: '0.18.0:1', + version: '0.19.0:0', releaseNotes: { en_US: - 'v0.18.0:1 — portability cleanup, no runtime behavior change. Scrubbed owner-specific hostnames, IPs, usernames, and personal names out of all shared docs, ops notes, and the offline redaction test, replacing them with neutral placeholders (, , Alice/Bob, etc.) so the package is safe to share with another dual-Spark operator. The bundled install instructions and README no longer suggest a specific SSH username. Removed an obsolete build-time prompt file. Real cluster values now live only in your StartOS install config and local env — nothing identifying is committed.', + 'v0.19.0:0 — security hardening of the cluster-control surface (no change to the proxy/data APIs your other apps use). (1) Every user-supplied value that reaches an SSH command on the Sparks — model repo, vLLM args/knobs, NIM image/container, service names — is now strictly validated and shell-quoted, closing a command-injection path. (2) The Qdrant collection name in /api/search is validated so it can no longer be used to reach other collections. (3) State-changing dashboard endpoints (model swap, NIM install, service start/stop, disk delete, etc.) now require a same-origin request, blocking cross-site (CSRF) attacks from a malicious page open in your browser. The OpenAI-compatible proxies (/v1/*), the redaction gateway (/scrub, /rehydrate), /api/search, /api/audio/*, and /api/health-event are exempt, so Recap Relay, the CRM, Open WebUI and other consumers are unaffected.', }, migrations: { up: async ({ effects }) => {},