ROADMAP

Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.

Near term

  • parakeet-asr --memory cap, shipped via the Reapply-patches action (guards against swap-thrash on very long audio).
  • Controlled concurrency sweep of the audio endpoints in a quiet window — replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
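
Once the sweep has produced a throughput figure per concurrency level, picking the knee could look roughly like the sketch below. The function name `find_knee`, the `min_gain` threshold, and the "stop when the relative gain drops below a threshold" rule are all assumptions made for illustration; it is one simple heuristic, not the project's measurement method.

```python
def find_knee(throughput_by_concurrency: dict[int, float], min_gain: float = 0.10) -> int:
    """Return the highest concurrency level that still improved throughput
    by at least `min_gain` (relative) over the previous measured level.

    Hypothetical helper: input maps in-flight request count -> measured
    throughput (any consistent unit, e.g. audio-seconds per wall-second).
    """
    levels = sorted(throughput_by_concurrency)
    if not levels:
        raise ValueError("no measurements supplied")

    knee = levels[0]
    for prev, cur in zip(levels, levels[1:]):
        base = throughput_by_concurrency[prev]
        # Relative improvement from one level to the next; treat a
        # non-positive baseline as "no gain" to avoid dividing by zero.
        gain = (throughput_by_concurrency[cur] - base) / base if base > 0 else 0.0
        if gain < min_gain:
            break  # adding concurrency no longer pays off
        knee = cur
    return knee
```

The returned level would then replace the reasoned cap; the ceiling could stay one step above it if some headroom is still wanted.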

Audio quality

  • Echo cancellation for dual-channel label-merge — removes the mic-bleed limit when the local user isn't wearing headphones.
  • LLM "referee" pass for low-confidence label-merge speaker naming.

Platform hardening

  • Qdrant auth (API key) + scheduled snapshots/backups.
  • Observability: request metrics + GPU-busy tracking, so load questions are answered from data instead of log archaeology.
  • API-key auth on Spark Control — only if public (non-VPN) exposure is ever needed; current stance is LAN + split-tunnel VPN only.

Throughput (only if audio load outgrows one GPU)

  • Second audio worker / queueing layer; revisit which services share Spark 2.

Dashboard

  • Support local-path / fine-tuned models in the swap catalog. Today the catalog is static (models.yaml + custom overrides), and the "Add custom model" path (POST /api/models) only accepts an HF org/name repo (shellsafe._HF_REPO_RE). A model that exists only as a directory on a Spark (the usual fine-tuning output) therefore can't be registered or swapped. Needs: (a) a "local model" add form/field that takes a Spark-side directory path, with its own safe validation instead of the org/name regex (path whitelist + shlex.quote, no traversal); (b) models.build_launch_command / launch-cluster.sh able to run vllm serve <path>; (c) the disk.py size probe handling a path instead of deriving the HF cache dir from a repo id. Raised 2026-06-15: a colleague's locally fine-tuned model doesn't appear because nothing scans the machine; the list is a curated catalog, not a discovery probe.
  • Per-model configurable vLLM flags editable from the UI (today: edit models.yaml and rebuild).
  • Spark host update actions (OS/driver) from the UI.
  • Open WebUI link-out integration; richer per-service detail views.
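
The path validation described in item (a) of the local-model entry might be sketched as follows. The function name `validate_local_model_path` and the roots in `ALLOWED_MODEL_ROOTS` are hypothetical; the real whitelist would come from configuration. Because the path lives on a remote Spark, the check is purely lexical and does not resolve symlinks there.

```python
import posixpath
import shlex

# Hypothetical whitelist of Spark-side directories that may hold models.
ALLOWED_MODEL_ROOTS = ("/models", "/home/spark/finetunes")


def validate_local_model_path(raw: str, allowed_roots=ALLOWED_MODEL_ROOTS) -> str:
    """Return a shell-quoted absolute path, or raise ValueError.

    Rejects relative paths, '..' components, control characters, and
    anything not strictly below one of the whitelisted roots.
    """
    if not raw or any(ch in raw for ch in ("\x00", "\n", "\r")):
        raise ValueError("empty path or control character in path")
    if not raw.startswith("/"):
        raise ValueError("path must be absolute")
    if ".." in raw.split("/"):
        raise ValueError("path traversal is not allowed")

    # Collapse duplicate slashes and '.' components (lexical only).
    normalized = posixpath.normpath(raw)
    for root in allowed_roots:
        if normalized.startswith(posixpath.normpath(root) + "/"):
            # Quote so the value is safe to embed in a remote shell command.
            return shlex.quote(normalized)
    raise ValueError("path is outside the allowed model roots")
```

A caller would pass the returned string straight into the launch command; anything containing spaces or shell metacharacters comes back single-quoted.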

Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)

P0/P1 security findings are all fixed in v0.19.0:0. Remaining items, none blocking:

P2 — track:

  • No automated tests beyond the two redaction suites — swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for build_launch_command (incl. injection cases), swap transitions, and _merge_words_with_speakers is the highest-value start.
  • Loose dependency floors permit vulnerable python-multipart/starlette (DoS CVEs) on rebuild; no lockfile; no upload size caps (pyproject.toml).
  • Opaque HTTP 500 on POST /api/models and PUT /knobs when MODELS_OVERRIDES is unset in dev (the write targets read-only /data); catch the OSError and return a readable error.
  • NGC API key still appears on the remote process command line (nim.py) — the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
  • Global mutable catalog reassigned via global, shared across async requests with no snapshot (server.py) — latent race as concurrency grows.
  • Container runs uvicorn as root bound to 0.0.0.0:9999 (no USER in Dockerfile) — amplifies any RCE blast radius.
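
For the opaque-500 item above, catching the OSError at the write site could look something like this. `save_overrides` and `OverridesNotWritable` are hypothetical names, not taken from the codebase; the request handler would translate the exception into an HTTP response with a readable message.

```python
import json


class OverridesNotWritable(Exception):
    """The overrides file could not be written (hypothetical exception name)."""


def save_overrides(path: str, overrides: dict) -> None:
    """Write the overrides as JSON, converting low-level I/O failures
    into one exception type the API layer can report clearly."""
    try:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(overrides, fh, indent=2)
    except OSError as exc:
        # Covers read-only filesystems, missing directories, permissions.
        raise OverridesNotWritable(
            f"cannot write overrides to {path}: {exc.strerror or exc}; "
            "set MODELS_OVERRIDES to a writable location"
        ) from exc
```

Keeping the conversion in one helper means both endpoints get the same behaviour without duplicating the try/except.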

P3 — bulk-fix when next touching docs/packaging:

  • README Status block stale (v0.2.3 / 0.13.0:4 → now v0.19.0:0); deprecated @app.on_event + hardcoded app.version="0.1.0"; NimInstallBody.register shadows BaseModel (rename → register_service); httpx class names leak into TTS/speech-models error text; one unescaped innerHTML sink (app.js) + task_id reflected in scrub JSON.
  • Packaging: marketingUrl/packageRepo/upstreamRepo are example.com placeholders; broken instructions.md source link; per-service SSH users (parakeet_user etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); Makefile builds only x86 though the manifest declares aarch64.
  • Hardening misc: no body/upload size limits on /v1/audio/*, /v1/chat/completions, /scrub; int(_env(...)) startup crash on bad VLLM_PORT; upstream error text echoed to clients.
  • StartOS registry (only if ever pursuing it): source must be public + real repo URLs.
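
The int(_env(...)) startup crash listed under hardening could be avoided with a tolerant parser along these lines. `env_port` is a hypothetical helper, and falling back to the default with a warning (rather than exiting with a clear message) is an assumption about the desired behaviour.

```python
import logging
import os
from typing import Mapping

log = logging.getLogger(__name__)


def env_port(name: str, default: int, environ: Mapping[str, str] = os.environ) -> int:
    """Read a TCP port from the environment without crashing on bad input.

    Returns `default` when the variable is unset, empty, not an integer,
    or outside the valid port range 1..65535.
    """
    raw = environ.get(name, "").strip()
    if not raw:
        return default
    try:
        port = int(raw)
    except ValueError:
        log.warning("%s=%r is not an integer; using default %d", name, raw, default)
        return default
    if not 1 <= port <= 65535:
        log.warning("%s=%d is out of range; using default %d", name, port, default)
        return default
    return port
```

Usage would be `VLLM_PORT = env_port("VLLM_PORT", 8000)`, where 8000 is only an illustrative default.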