v0.19.0:0 - harden cluster-control surface: ssh injection, qdrant path, csrf
Triaged from a full independent evaluation (EVALUATION.md). Addresses the three P0/P1 code findings; the proxy/data APIs that downstream apps consume are deliberately untouched.

- ssh command injection (P0): new shellsafe.py validates + shlex.quotes every user-supplied value crossing into an SSH command on the Sparks (model repo, vllm args/knobs, NIM image/container/volume/port/env, service names). Boundary validation on POST /api/models and POST /api/nim/install; quoting at every sink in models/download/nim/services. NGC key now quoted too.
- qdrant path injection (P1): /api/search validates the collection name against a metacharacter-free whitelist and URL-encodes the path segment.
- csrf (P1): csrf_guard middleware enforces same-origin on state-changing control endpoints; /v1/*, /scrub, /rehydrate, /api/search, /api/audio/* and /api/health-event are exempt so external consumers are unaffected.

Verified: injection survives only as a single quoted token, vLLM preflight shlex.split round-trip intact, CSRF behaviors covered via TestClient, both offline redaction suites still pass, tsc clean, s9pk rebuilt.
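A minimal sketch of the two injection fixes described above, assuming nothing about the real shellsafe.py beyond the `quote_arg` name that the diff imports. `validate_name`, `collection_path`, and the allowlist pattern are hypothetical and only illustrate "validate at the boundary, quote or encode at the sink"; a real allowlist for image references would also need to admit `/` and `:`.

```python
import re
import shlex
from urllib.parse import quote

# Hypothetical allowlist: starts alphanumeric, then only letters, digits, '.', '_', '-'.
_SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def validate_name(value: str, field: str) -> str:
    """Reject anything outside the allowlist at the API boundary."""
    if not _SAFE_NAME.fullmatch(value):
        raise ValueError(f"{field}: disallowed characters in {value!r}")
    return value


def quote_arg(value: object) -> str:
    """Quote one value so a POSIX shell treats it as a single literal word."""
    return shlex.quote(str(value))


def collection_path(name: str) -> str:
    """Validate a collection name, then percent-encode it as one URL path segment."""
    return "/collections/" + quote(validate_name(name, "collection"), safe="")


# Even a hostile value stays one argument after quoting at the sink.
hostile = "repo; rm -rf ~"
assert shlex.split(f"docker pull {quote_arg(hostile)}") == ["docker", "pull", hostile]
```

Percent-encoding after validation is redundant for names that pass this allowlist; it is kept here as defense in depth in case the allowlist is later relaxed.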
+23
-12
@@ -18,6 +18,7 @@ from datetime import datetime, timezone
 from typing import Optional
 
 from .config import Settings
+from .shellsafe import quote_arg
 from .ssh import ssh_stream, StreamHandle
 
 
@@ -138,30 +139,40 @@ class NimManager:
 
     async def _do(self, job: NimInstallJob, extra_env: dict[str, str]) -> None:
         # Build the bash one-liner. We use docker login non-interactively with the NGC API key.
-        env_parts = [f'-e NGC_API_KEY=$NGC_API_KEY']
+        # The real docker commands use shlex.quote'd values (img/ctr/vol) so nothing
+        # user-controlled can break out of the SSH shell. The cosmetic `echo` log lines
+        # embed the *raw* values inside single quotes — safe because image/container are
+        # validated against a metacharacter-free whitelist at the API boundary, and
+        # volume/port derive from them. (Embedding shlex.quote output inside another
+        # quoted echo string would be wrong — it can re-expose $() / $VAR.)
+        img = quote_arg(job.image)
+        ctr = quote_arg(job.container)
+        vol = quote_arg(job.volume)
+        port = int(job.port)  # int can't inject; coerce defensively
+        env_parts = ['-e NGC_API_KEY=$NGC_API_KEY']
         for k, v in extra_env.items():
-            env_parts.append(f"-e {k}={v}")
+            env_parts.append(f"-e {quote_arg(k)}={quote_arg(v)}")
         env_str = " ".join(env_parts)
         cmd = (
             f"set -e; "
-            f"export NGC_API_KEY='{self.settings.ngc_api_key}'; "
+            f"export NGC_API_KEY={quote_arg(self.settings.ngc_api_key or '')}; "
             f"echo '=== docker login nvcr.io ==='; "
             f"echo \"$NGC_API_KEY\" | docker login nvcr.io -u '$oauthtoken' --password-stdin; "
             f"echo '=== docker pull {job.image} (this can be 1-10 GB) ==='; "
-            f"docker pull {job.image}; "
+            f"docker pull {img}; "
             f"echo '=== remove any prior container with the same name ==='; "
-            f"docker rm -f {job.container} 2>/dev/null || true; "
-            f"echo '=== docker run -d --gpus all -p {job.port}:{job.port} -v {job.volume}:/opt/nim/.cache {env_str} --name {job.container} --restart unless-stopped {job.image} ==='; "
+            f"docker rm -f {ctr} 2>/dev/null || true; "
+            f"echo '=== docker run -d --gpus all -p {job.port}:{job.port} -v {job.volume}:/opt/nim/.cache --name {job.container} --restart unless-stopped {job.image} ==='; "
             f"docker run -d --gpus all "
-            f"-p {job.port}:{job.port} "
-            f"-v {job.volume}:/opt/nim/.cache "
+            f"-p {port}:{port} "
+            f"-v {vol}:/opt/nim/.cache "
             f"{env_str} "
-            f"--name {job.container} "
+            f"--name {ctr} "
             f"--restart unless-stopped "
-            f"{job.image}; "
+            f"{img}; "
             f"echo '=== ensuring cache volume is writable by uid 1000 (riva-server) ==='; "
-            f"docker run --rm -v {job.volume}:/cache alpine chown -R 1000:1000 /cache && "
-            f"docker restart {job.container}; "
+            f"docker run --rm -v {vol}:/cache alpine chown -R 1000:1000 /cache && "
+            f"docker restart {ctr}; "
             f"echo '=== install complete; container is starting up and will download its model on first boot ==='"
         )
         job.append(f"$ <install command for {job.image} on {job.host}>")
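The code comment in this hunk about the `echo` log lines can be checked with a small experiment. This is a sketch, not part of the commit: it assumes a POSIX `sh` on the PATH, uses a harmless `$(echo INJECTED)` payload as a stand-in for a hostile value, and `run_sh` is a hypothetical helper.

```python
import shlex
import subprocess


def run_sh(script: str) -> str:
    """Run a script with the system POSIX shell and return its trimmed stdout."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


payload = "$(echo INJECTED)"  # harmless stand-in for a hostile value

# Correct sink: shlex.quote output used as a bare word stays one literal token.
as_argument = run_sh(f"printf '%s' {shlex.quote(payload)}")

# Wrong: the quotes added by shlex.quote close the surrounding echo quotes,
# so the command substitution ends up outside any quoting and is executed.
nested = run_sh(f"echo '=== {shlex.quote(payload)} ==='")

# Raw value inside single quotes: printed literally. This holds only because
# the value contains no single quote, which is what the whitelist guarantees.
raw = run_sh(f"echo '=== {payload} ==='")

print(as_argument)
print(nested)
print(raw)
```

The raw-in-single-quotes form is safe only for values that cannot contain a single quote, so it depends on the boundary validation; the real docker arguments do not rely on that and are quoted individually.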