v0.19.0:0 - harden cluster-control surface: ssh injection, qdrant path, csrf

Triaged from a full independent evaluation (EVALUATION.md). Addresses the
three P0/P1 code findings; the proxy/data APIs that downstream apps consume
are deliberately untouched.

- ssh command injection (P0): new shellsafe.py validates and shlex.quote()s every
  user-supplied value crossing into an SSH command on the Sparks (model repo,
  vllm args/knobs, NIM image/container/volume/port/env, service names).
  Boundary validation on POST /api/models and POST /api/nim/install; quoting at
  every sink in models/download/nim/services. NGC key now quoted too.
- qdrant path injection (P1): /api/search validates the collection name against
  a metacharacter-free whitelist and URL-encodes the path segment.
- csrf (P1): csrf_guard middleware enforces same-origin on state-changing
  control endpoints; /v1/*, /scrub, /rehydrate, /api/search, /api/audio/* and
  /api/health-event are exempt so external consumers are unaffected.
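
The Qdrant fix in the second bullet can be pictured roughly as follows. This is a minimal sketch, not the code from this commit: the helper name `collection_search_path`, the regex, and the 64-character cap are assumptions made for illustration. Only the validate-then-URL-encode pattern comes from the message above.

```python
import re
from urllib.parse import quote

# Assumed whitelist: letters, digits, underscore, hyphen. The real pattern
# used by /api/search is not shown in this commit view.
_COLLECTION_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def collection_search_path(name: str) -> str:
    """Return the Qdrant search path for a validated collection name."""
    if not _COLLECTION_RE.fullmatch(name):
        raise ValueError(f"invalid collection name: {name!r}")
    # safe="" also encodes "/", so the name can never add a path segment
    # even if the whitelist is loosened later.
    return f"/collections/{quote(name, safe='')}/points/search"
```

Validation alone already rejects `../aliases` or `docs?x=1`; the URL-encoding is a second layer in case the whitelist ever changes.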

Verified: injection payloads survive only as a single quoted token, the vLLM
preflight shlex.split round-trip is intact, CSRF behaviors are covered via
TestClient, both offline redaction suites still pass, tsc is clean, and the
s9pk is rebuilt.
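
The first two checks can be reproduced with the standard library alone. A minimal sketch, assuming `quote_arg` is a thin wrapper around `shlex.quote` (the diff imports it from shellsafe.py but does not show its body) and using a hypothetical `build_remote_command` helper:

```python
import shlex


def quote_arg(value: object) -> str:
    """Assumed behaviour: stringify, then shell-quote as a single token."""
    return shlex.quote(str(value))


def build_remote_command(argv: list[str]) -> str:
    """Hypothetical helper: join already-separated arguments into one shell line."""
    return " ".join(quote_arg(a) for a in argv)


# A hostile "image name" that would run a second command if left unquoted.
hostile = "nim:latest; rm -rf / #"
cmd = build_remote_command(["docker", "pull", hostile])

# Round-trip: splitting the command line gives back exactly three tokens,
# with the payload intact as one argument.
tokens = shlex.split(cmd)
```

If `tokens` ever comes back with more than three entries, a value escaped its quotes somewhere between the API boundary and the SSH sink.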
Keysat
2026-06-12 16:36:33 -05:00
parent 98988057a2
commit 1c4e861783
10 changed files with 260 additions and 24 deletions
+23 -12
@@ -18,6 +18,7 @@ from datetime import datetime, timezone
 from typing import Optional
 from .config import Settings
+from .shellsafe import quote_arg
 from .ssh import ssh_stream, StreamHandle
@@ -138,30 +139,40 @@ class NimManager:
     async def _do(self, job: NimInstallJob, extra_env: dict[str, str]) -> None:
         # Build the bash one-liner. We use docker login non-interactively with the NGC API key.
-        env_parts = [f'-e NGC_API_KEY=$NGC_API_KEY']
+        # The real docker commands use shlex.quote'd values (img/ctr/vol) so nothing
+        # user-controlled can break out of the SSH shell. The cosmetic `echo` log lines
+        # embed the *raw* values inside single quotes — safe because image/container are
+        # validated against a metacharacter-free whitelist at the API boundary, and
+        # volume/port derive from them. (Embedding shlex.quote output inside another
+        # quoted echo string would be wrong — it can re-expose $() / $VAR.)
+        img = quote_arg(job.image)
+        ctr = quote_arg(job.container)
+        vol = quote_arg(job.volume)
+        port = int(job.port)  # int can't inject; coerce defensively
+        env_parts = ['-e NGC_API_KEY=$NGC_API_KEY']
         for k, v in extra_env.items():
-            env_parts.append(f"-e {k}={v}")
+            env_parts.append(f"-e {quote_arg(k)}={quote_arg(v)}")
         env_str = " ".join(env_parts)
         cmd = (
             f"set -e; "
-            f"export NGC_API_KEY='{self.settings.ngc_api_key}'; "
+            f"export NGC_API_KEY={quote_arg(self.settings.ngc_api_key or '')}; "
             f"echo '=== docker login nvcr.io ==='; "
             f"echo \"$NGC_API_KEY\" | docker login nvcr.io -u '$oauthtoken' --password-stdin; "
             f"echo '=== docker pull {job.image} (this can be 1-10 GB) ==='; "
-            f"docker pull {job.image}; "
+            f"docker pull {img}; "
             f"echo '=== remove any prior container with the same name ==='; "
-            f"docker rm -f {job.container} 2>/dev/null || true; "
-            f"echo '=== docker run -d --gpus all -p {job.port}:{job.port} -v {job.volume}:/opt/nim/.cache {env_str} --name {job.container} --restart unless-stopped {job.image} ==='; "
+            f"docker rm -f {ctr} 2>/dev/null || true; "
+            f"echo '=== docker run -d --gpus all -p {job.port}:{job.port} -v {job.volume}:/opt/nim/.cache --name {job.container} --restart unless-stopped {job.image} ==='; "
             f"docker run -d --gpus all "
-            f"-p {job.port}:{job.port} "
-            f"-v {job.volume}:/opt/nim/.cache "
+            f"-p {port}:{port} "
+            f"-v {vol}:/opt/nim/.cache "
             f"{env_str} "
-            f"--name {job.container} "
+            f"--name {ctr} "
             f"--restart unless-stopped "
-            f"{job.image}; "
+            f"{img}; "
             f"echo '=== ensuring cache volume is writable by uid 1000 (riva-server) ==='; "
-            f"docker run --rm -v {job.volume}:/cache alpine chown -R 1000:1000 /cache && "
-            f"docker restart {job.container}; "
+            f"docker run --rm -v {vol}:/cache alpine chown -R 1000:1000 /cache && "
+            f"docker restart {ctr}; "
             f"echo '=== install complete; container is starting up and will download its model on first boot ==='"
         )
         job.append(f"$ <install command for {job.image} on {job.host}>")
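
The code comment in this hunk warns against embedding `shlex.quote` output inside another quoted `echo` string. The sketch below shows why, using a harmless payload and a local `sh` rather than the SSH path. `run_sh` is a hypothetical helper, and the example assumes a POSIX `sh` is on the PATH.

```python
import shlex
import subprocess


def run_sh(command: str) -> str:
    """Hypothetical helper: run a command line in a local POSIX shell."""
    result = subprocess.run(
        ["sh", "-c", command], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


payload = "$(echo INJECTED)"  # harmless stand-in for a hostile value

# Wrong: the quotes added by shlex.quote close the outer single-quoted
# string, so the payload ends up unquoted and the shell expands it.
nested = f"echo 'pull {shlex.quote(payload)}'"

# One safe alternative: quote the whole message exactly once.
whole = f"echo {shlex.quote('pull ' + payload)}"

nested_out = run_sh(nested)  # substitution runs: "pull INJECTED"
whole_out = run_sh(whole)    # payload stays literal text
```

Quoting the whole message once is only one way to avoid the problem. The commit takes a different route for its log lines: it keeps the raw values inside single quotes and relies on the whitelist validation at the API boundary.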