docs: handoff — trim Current state, move full-eval debt to ROADMAP, record SSH-input + CSRF conventions

- AGENTS.md: rewrite Current state lean for v0.19.0:0; drop the now-completed
  full-eval triage block (history lives in git log + EVALUATION.md).
- docs/guides/fastapi-image.md: add two durable conventions — user values
  crossing into SSH must go through shellsafe; new endpoints and the
  csrf_guard exempt-prefix rule.
- ROADMAP.md: park the remaining non-blocking P2/P3 tech debt from the eval.
This commit is contained in:
Keysat
2026-06-12 17:10:03 -05:00
parent 56a519ff4f
commit 05d03beeeb
3 changed files with 29 additions and 33 deletions
+18
View File
@@ -22,3 +22,21 @@ Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE
- Per-model configurable vLLM flags editable from the UI (today: edit `models.yaml` and rebuild).
- Spark host update actions (OS/driver) from the UI.
- Open WebUI link-out integration; richer per-service detail views.
## Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)
P0/P1 security findings are all fixed in v0.19.0:0. Remaining, none blocking:
**P2 — track:**
- No automated tests beyond the two redaction suites — swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for `build_launch_command` (incl. injection cases), swap transitions, and `_merge_words_with_speakers` is the highest-value start.
- Loose dependency floors permit vulnerable `python-multipart`/`starlette` (DoS CVEs) on rebuild; no lockfile; no upload size caps (`pyproject.toml`).
- Opaque HTTP 500 on `POST /api/models` / `PUT /knobs` when `MODELS_OVERRIDES` unset in dev (write to read-only `/data`) — catch the `OSError`.
- NGC API key still appears on the remote process command line (`nim.py`) — the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
- Global mutable `catalog` reassigned via `global`, shared across async requests with no snapshot (`server.py`) — latent race as concurrency grows.
- Container runs uvicorn as **root** bound to `0.0.0.0:9999` (no `USER` in Dockerfile) — amplifies any RCE blast radius.
**P3 — bulk-fix when next touching docs/packaging:**
- README Status block stale (`v0.2.3 / 0.13.0:4` → now v0.19.0:0); deprecated `@app.on_event` + hardcoded `app.version="0.1.0"`; `NimInstallBody.register` shadows `BaseModel` (rename → `register_service`); httpx class names leak into TTS/speech-models error text; one unescaped `innerHTML` sink (`app.js`) + `task_id` reflected in scrub JSON.
- Packaging: `marketingUrl`/`packageRepo`/`upstreamRepo` are `example.com` placeholders; broken `instructions.md` source link; per-service SSH users (`parakeet_user` etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); `Makefile` builds only x86 though the manifest declares `aarch64`.
- Hardening misc: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text echoed to clients.
- StartOS registry (only if ever pursuing it): source must be public + real repo URLs.