ROADMAP

Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.

Cluster coordination — OpenClaw coexistence (committed 2026-06-17, from Johnny 5 report 2026-06-16)

Driven by the one other Spark Control adopter (a colleague running OpenClaw + cron jobs against his own dual Sparks; report at the date above). His cluster is configured differently from ours (vLLM on both Sparks, port 8000, raw docker run, container vllm-gemma4) and an automated cron physically swaps models — so his notes are partly portability gaps (the package hard-codes our layout) and partly coordination gaps (his dashboard and his crons fight over the GPU).

Design stance (decided): Spark Control is the control plane / GPU arbiter, not a job runner. Recurring business pipelines (his "Daily Vol" generator; our own future scheduled jobs) live in separate application services that call Spark Control's swap API. The dividing line is what a scheduled job does: control-plane actions (swap a model, warm it, restart a service, run a health sweep) are in scope for an in-package scheduler; business logic (scrape / summarize / build / deploy) stays in the app layer. Swaps are already API-driven (POST /api/swap → GET /api/swap/{id} / …/stream, POST /api/swap/{key}/validate) and non-browser clients pass the CSRF guard, so an external scheduler can drive swaps today — the items below add the safety layer, not the capability.
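A minimal sketch of how an external scheduler might drive a swap through the endpoints named above (POST /api/swap, then polling GET /api/swap/{id}). The JSON field names ("model", "id", "state") and the terminal state values are assumptions for illustration; this roadmap does not specify the request or response shapes. The transport is injected so the polling logic can be exercised without a live cluster.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# ASSUMPTION: the JSON shapes used here ("model", "id", "state") are
# illustrative only; the real API fields are not specified in this roadmap.
Send = Callable[[str, str, Optional[dict]], dict]


def http_send(base_url: str) -> Send:
    """Return a sender that performs real HTTP calls against base_url."""
    def send(method: str, path: str, body: Optional[dict]) -> dict:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            base_url + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read().decode())
    return send


def run_swap(send: Send, model_key: str, poll_s: float = 5.0,
             max_polls: int = 120) -> dict:
    """Start a swap, then poll until a terminal state or the poll budget ends."""
    started = send("POST", "/api/swap", {"model": model_key})
    swap_id = started["id"]
    for _ in range(max_polls):
        status = send("GET", f"/api/swap/{swap_id}", None)
        if status.get("state") in ("complete", "failed"):
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"swap {swap_id} did not finish after {max_polls} polls")
```

A scheduler would call run_swap(http_send(base_url), model_key) and branch on the returned status; business logic stays on the caller's side, in line with the design stance.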

Sequenced:

Configurable VLLM_PORT — DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000.
Local-path / fine-tuned model support — DONE, v0.23.0:0. Catalog/ModelDef gained local_path (exactly one of repo/local_path); swap bind-mounts the dir into the vLLM container at the same path via the launch script's VLLM_SPARK_EXTRA_DOCKER_ARGS hook (no launch-cluster.sh change); "+ Add local model" form + local badge; disk-delete refused for local models; validate_local_path boundary check. His merged ten31-v2 was the motivating case.
Configurable topology — DONE, v0.24.0:0. Three optional Configure-Sparks knobs: vLLM container name (VLLM_CONTAINER, blank ⇒ vllm_node; threaded through the swap log-tail + pre-flight validator via quote_arg); "services to hide" (DISABLED_SERVICES, comma list — hidden services show no tile and are skipped by status/deep-health/connectivity probes, killing the Parakeet-on-8000 collision); and a second-Spark vLLM monitor via a kind: vllm custom service in services-overrides.yaml (read-only tile probed through the shared probe_vllm_endpoint). /api/endpoints gained a disabled flag. Covers report P4/P5/#6. (Generalizing the swap mechanism to the adopter's raw docker run was deliberately left out — that's coordination, item 4; he swaps via his own crons and uses Spark Control to monitor.)
Coordination layer — DONE in tree, staged as v0.25.0:0 (built/typechecked clean; install pending). All three primitives shipped; image/app/coordination.py + docs/COORDINATION.md. Brought forward 2026-06-17 on request rather than waiting for our own automation.
- Swap lock with holder + TTL (POST / GET / DELETE /api/swap/lock). Acquire returns a secret token; the swap endpoint refuses any real swap (423) that doesn't present it in X-Swap-Lock-Token, so the dashboard's manual swap is paused while a scheduler holds it (with a ?force=true human override). In-memory + TTL-bounded → resets to unlocked on restart; re-acquire with the token extends. Enforced in post_swap, not advisory.
- Swap-event webhook (swap_complete / swap_failed) to a configurable URL (Configure-Sparks field), fired from SwapManager._run outside the swap lock; optional shared secret ⇒ X-Spark-Signature HMAC. Fire-and-forget (5 s, no retries); dry runs don't fire.
- Schedule visibility: external schedulers register their jobs via GET/POST/DELETE /api/schedule, and a read-only "Scheduled jobs" dashboard panel lists them. Spark Control stores and displays, never executes.
- Tests: image/tests/test_coordination.py (22 cases — lock lifecycle/expiry/token, the single-read swap gate, schedule CRUD + id validation, webhook payload+signature). Known limit: lock + schedules are in-memory (a restart frees the lock and empties the registry until schedulers re-register) — persist to /data only if that bites.
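The swap-lock semantics described above (holder + TTL, secret token, re-acquire extends, force override, expiry frees the lock) can be sketched as follows. This is an illustration under stated assumptions, not the contents of image/app/coordination.py: the class and method names are hypothetical, and the clock is injectable only so expiry can be checked without sleeping.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class _Held:
    holder: str
    token: str
    expires_at: float


def _same(a: str, b: str) -> bool:
    # Constant-time comparison; bytes avoid a TypeError on non-ASCII input.
    return secrets.compare_digest(a.encode(), b.encode())


class SwapLock:
    """Hypothetical in-memory lock: one holder, a secret token, and a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._held: Optional[_Held] = None

    def _current(self) -> Optional[_Held]:
        # An expired lock is treated as released.
        if self._held is not None and self._clock() >= self._held.expires_at:
            self._held = None
        return self._held

    def acquire(self, holder: str, ttl_s: float,
                token: Optional[str] = None) -> Optional[str]:
        """Return the token on success, or None if someone else holds the lock.

        Presenting the current token extends the TTL instead of failing.
        """
        held = self._current()
        if held is not None:
            if token is None or not _same(token, held.token):
                return None
            held.expires_at = self._clock() + ttl_s
            return held.token
        new_token = secrets.token_urlsafe(32)
        self._held = _Held(holder, new_token, self._clock() + ttl_s)
        return new_token

    def release(self, token: str) -> bool:
        held = self._current()
        if held is None or not _same(token, held.token):
            return False
        self._held = None
        return True

    def allows_swap(self, token: Optional[str], force: bool = False) -> bool:
        """Swap gate: a False result corresponds to the 423 response."""
        held = self._current()
        if held is None or force:
            return True
        return token is not None and _same(token, held.token)
```

Because the state lives only in the object, a process restart yields an unlocked SwapLock, which matches the known limit noted in the tests bullet.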

Generalizing the swap mechanism to raw `docker run` — DEFERRED (decided 2026-06-18, research-backed; was item 4's last open thread)

Our swap drives ~/spark-vllm-docker/launch-cluster.sh over SSH on Spark 1 (./launch-cluster.sh stop, then [VLLM_SPARK_EXTRA_DOCKER_ARGS=…] ./launch-cluster.sh [--solo ]-d exec vllm serve <model> <args>, then docker logs -f until the ready marker). The OpenClaw adopter launches vLLM with a plain docker run instead, so the swap button can't drive his cluster — only monitor it. The portability fix would be a configurable "swap backend": keep launch-cluster.sh as the default and add a "bring your own command" mode (operator-authored stop/launch templates in services-overrides.yaml with quoted {model}/{container}/{port}/{extra_args} substitution; ready-detection unchanged; the vLLM-argparse pre-flight disabled for that backend).
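To make the "bring your own command" idea concrete, here is a rough sketch of quoted template substitution. Nothing like this exists in the package yet; render_command and ALLOWED_FIELDS are hypothetical names, and the only placeholders accepted are the four listed above ({model}, {container}, {port}, {extra_args}). Every substituted value is shell-quoted so a hostile model name cannot break out of the command.

```python
import shlex
from string import Formatter

# The four placeholders named in the roadmap; anything else is rejected.
ALLOWED_FIELDS = {"model", "container", "port", "extra_args"}


def render_command(template: str, *, model: str, container: str,
                   port: int, extra_args: str = "") -> str:
    """Fill an operator-authored template, shell-quoting each value."""
    # Reject unknown or positional placeholders up front instead of
    # failing halfway through a swap.
    used = {name for _, name, _, _ in Formatter().parse(template)
            if name is not None}
    unknown = used - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown placeholders: {sorted(unknown)}")

    # extra_args is a flag string: split it, then quote each piece so it
    # expands to several safe arguments rather than one quoted blob.
    extra = " ".join(shlex.quote(arg) for arg in shlex.split(extra_args))
    return template.format(
        model=shlex.quote(model),
        container=shlex.quote(container),
        port=shlex.quote(str(port)),
        extra_args=extra,
    )
```

Rejecting dotted or indexed fields such as {model.attr} falls out of the allow-list check, which also closes the attribute-access trick that str.format otherwise permits.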

Why deferred, not built:

Raw docker is not an upgrade for us — for half our catalog it's impossible. launch-cluster.sh is the eugr/spark-vllm-docker community project (de-facto DGX Spark standard; mirrors NVIDIA's own dgx-spark-playbooks Ray+RDMA architecture). Its headline job is multi-node serving: our 235B cluster models (Qwen3-VL 235B, Qwen3 235B) exceed one Spark's 128 GB and must shard across both Sparks via Ray over the 200 Gbps ConnectX/RoCE link — plumbing (NCCL/MTU/per-node env) that a single-node docker run cannot do. So we keep the helper script; switching our own cluster to raw docker is off the table.
The feature is therefore portability-only (for differently-wired adopters), and the one known adopter doesn't need it — he swaps via his own crons and uses Spark Control to watch.
Untestable on our hardware — our cluster uses the helper script, so we can't validate a real raw-docker swap without risking the live vLLM.
The one real standing risk is eugr's single-maintainer status; fallback is community forks or migrating to NVIDIA's official dgx-spark-playbooks launcher (same design). No reason to switch now.

Revisit only if an adopter explicitly wants Spark Control to drive (not just monitor) swaps on a cluster launched with raw docker run. At that point, get their actual working docker run command and build the command-template backend against it.

Near term

parakeet-asr long-audio memory guard — deferred 2026-06-15, low priority. A duration cap on /v1/audio/diarize: Sortformer runs the whole file in one pass (diarizer.py:128-135) over Spark 2's shared 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. Precautionary — no observed incident, and the production consumer (Recap Relay) already chunks via /diarize-chunk (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full /diarize. When picked up: add a configurable MAX_DIARIZE_SECONDS guard in diarizer.py right after duration is computed (~line 130) → raise → HTTP 413 in main.py (mirrors the existing MAX_UPLOAD_MB 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
Controlled concurrency sweep of the audio endpoints in a quiet window — replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
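For the parakeet-asr long-audio guard above, the check itself is small. The sketch below is framework-free and assumes nothing about diarizer.py beyond what is stated: a duration is computed, then compared against MAX_DIARIZE_SECONDS. The exception name and the 1800-second default are assumptions for illustration; the roadmap names no value. The HTTP layer would translate the exception into a 413, as it already does for MAX_UPLOAD_MB.

```python
import os


class AudioTooLongError(Exception):
    """Raised when an input exceeds the diarization duration cap."""


def max_diarize_seconds(default: float = 1800.0) -> float:
    """Read MAX_DIARIZE_SECONDS, falling back to default on blank/bad input."""
    # ASSUMPTION: the 1800 s default is illustrative, not a decided value.
    raw = os.environ.get("MAX_DIARIZE_SECONDS", "").strip()
    try:
        value = float(raw) if raw else default
    except ValueError:
        return default
    # Non-positive and NaN values are treated as misconfiguration.
    return value if value > 0 else default


def check_duration(duration_s: float, limit_s: float) -> None:
    """Raise before the single-pass diarizer touches an over-long file."""
    if duration_s > limit_s:
        raise AudioTooLongError(
            f"audio is {duration_s:.0f}s; limit is {limit_s:.0f}s, "
            "use the chunked endpoint for long recordings"
        )
```

The chunked /diarize-chunk path would not call this check, since its inputs are already bounded.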

Audio quality

Echo cancellation for dual-channel label-merge — removes the mic-bleed limit when the local user isn't wearing headphones.
LLM "referee" pass for low-confidence label-merge speaker naming.

Platform hardening

Qdrant auth (API key) + scheduled snapshots/backups.
Observability: request metrics + GPU-busy tracking, so load questions are answered from data instead of log archaeology.
API-key auth on Spark Control — only if public (non-VPN) exposure is ever needed; current stance is LAN + split-tunnel VPN only.

Throughput (only if audio load outgrows one GPU)

Second audio worker / queueing layer; revisit which services share Spark 2.

Dashboard

Per-model configurable vLLM flags editable from the UI (today: edit models.yaml and rebuild).
Spark host update actions (OS/driver) from the UI.
Open WebUI link-out integration; richer per-service detail views.

Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)

P0/P1 security findings are all fixed in v0.19.0:0. Remaining, none blocking:

P2 — track:

No automated tests beyond the two redaction suites — swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for build_launch_command (incl. injection cases), swap transitions, and _merge_words_with_speakers is the highest-value start.
Loose dependency floors permit vulnerable python-multipart/starlette (DoS CVEs) on rebuild; no lockfile; no upload size caps (pyproject.toml).
Opaque HTTP 500 on POST /api/models / PUT /knobs when MODELS_OVERRIDES unset in dev (write to read-only /data) — catch the OSError.
NGC API key still appears on the remote process command line (nim.py) — the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
Global mutable catalog reassigned via global, shared across async requests with no snapshot (server.py) — latent race as concurrency grows.
Container runs uvicorn as root bound to 0.0.0.0:9999 (no USER in Dockerfile) — amplifies any RCE blast radius.
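For the opaque HTTP 500 item above (write to a read-only /data when MODELS_OVERRIDES is unset), one possible shape of the fix is to catch the OSError at the write and re-raise something the route can turn into a clear client error. The function and exception names here are hypothetical, and which status code the handler should return is left open.

```python
from pathlib import Path


class OverridesNotWritable(Exception):
    """The overrides file could not be written (read-only or missing path)."""


def write_overrides(path: str, text: str) -> None:
    """Write the overrides file, converting OSError into a descriptive error."""
    try:
        Path(path).write_text(text, encoding="utf-8")
    except OSError as exc:
        # PermissionError, FileNotFoundError, and read-only-filesystem
        # errors are all OSError subclasses, so one handler covers them.
        raise OverridesNotWritable(
            f"cannot write model overrides to {path}: {exc.strerror or exc}"
        ) from exc
```

The handler for POST /api/models and PUT /knobs would then catch OverridesNotWritable and answer with the message instead of a bare 500.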

P3 — bulk-fix when next touching docs/packaging:

README Status block stale (v0.2.3 / 0.13.0:4 → now v0.19.0:0); deprecated @app.on_event + hardcoded app.version="0.1.0"; NimInstallBody.register shadows BaseModel (rename → register_service); httpx class names leak into TTS/speech-models error text; one unescaped innerHTML sink (app.js) + task_id reflected in scrub JSON.
Packaging: marketingUrl/packageRepo/upstreamRepo are example.com placeholders; broken instructions.md source link; per-service SSH users (parakeet_user etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); Makefile builds only x86 though the manifest declares aarch64.
Hardening misc: no body/upload size limits on /v1/audio/*, /v1/chat/completions, /scrub; int(_env(...)) startup crash on bad VLLM_PORT (since addressed by the numeric-setting hardening in v0.22.0:0; see Configurable VLLM_PORT above); upstream error text echoed to clients.
StartOS registry (only if ever pursuing it): source must be public + real repo URLs.
