ROADMAP
Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.
Cluster coordination — OpenClaw coexistence (committed 2026-06-17, from Johnny 5 report 2026-06-16)
Driven by the one other Spark Control adopter (a colleague running OpenClaw + cron jobs against his own dual Sparks; report at the date above). His cluster is configured differently from ours (vLLM on both Sparks, port 8000, raw docker run, container vllm-gemma4) and an automated cron physically swaps models — so his notes are partly portability gaps (the package hard-codes our layout) and partly coordination gaps (his dashboard and his crons fight over the GPU).
Design stance (decided): Spark Control is the control plane / GPU arbiter, not a job runner. Recurring business pipelines (his "Daily Vol" generator; our own future scheduled jobs) live in separate application services that call Spark Control's swap API. The dividing line is what a scheduled job does: control-plane actions (swap a model, warm it, restart a service, run a health sweep) are in scope for an in-package scheduler; business logic (scrape / summarize / build / deploy) stays in the app layer. Swaps are already API-driven (POST /api/swap → GET /api/swap/{id} / …/stream, POST /api/swap/{key}/validate) and non-browser clients pass the CSRF guard, so an external scheduler can drive swaps today — the items below add the safety layer, not the capability.
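As a rough illustration of an external scheduler driving a swap through the endpoints named above, the sketch below posts a swap request and polls its status. The base URL, the request body field (key), the response fields (id, status), and the terminal status values are all assumptions made for illustration; the real shapes live in the API itself and in docs/COORDINATION.md.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional

# Assumption: host name and port are placeholders for your own deployment.
BASE_URL = "http://spark-control.lan:9999"
# Assumption: terminal status names are guessed, not taken from the real API.
TERMINAL_STATES = {"complete", "failed"}


def http_json(method: str, url: str, body: Optional[dict] = None) -> Dict[str, Any]:
    """Send a JSON request and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_swap(
    model_key: str,
    call: Callable[..., Dict[str, Any]] = http_json,
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Start a swap, then poll GET /api/swap/{id} until it reaches a terminal state."""
    started = call("POST", f"{BASE_URL}/api/swap", {"key": model_key})
    swap_id = started["id"]  # assumption: the response carries an "id" field
    for _ in range(max_polls):
        state = call("GET", f"{BASE_URL}/api/swap/{swap_id}")
        if state.get("status") in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"swap {swap_id} did not finish after {max_polls} polls")
```

The transport and sleep functions are injectable so a scheduler can exercise its swap logic against a fake without touching the live cluster.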
Sequenced:
- Configurable VLLM_PORT — DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000.
- Local-path / fine-tuned model support — DONE, v0.23.0:0. Catalog/ModelDef gained local_path (exactly one of repo/local_path); swap bind-mounts the dir into the vLLM container at the same path via the launch script's VLLM_SPARK_EXTRA_DOCKER_ARGS hook (no launch-cluster.sh change); "+ Add local model" form + local badge; disk-delete refused for local models; validate_local_path boundary check. His merged ten31-v2 was the motivating case.
- Configurable topology — DONE, v0.24.0:0. Three optional Configure-Sparks knobs: vLLM container name (VLLM_CONTAINER, blank ⇒ vllm_node; threaded through the swap log-tail + pre-flight validator via quote_arg); "services to hide" (DISABLED_SERVICES, comma list — hidden services show no tile and are skipped by status/deep-health/connectivity probes, killing the Parakeet-on-8000 collision); and a second-Spark vLLM monitor via a kind: vllm custom service in services-overrides.yaml (read-only tile probed through the shared probe_vllm_endpoint). /api/endpoints gained a disabled flag. Covers report P4/P5/#6. (Generalizing the swap mechanism to the adopter's raw docker run was deliberately left out — that's coordination, item 4; he swaps via his own crons and uses Spark Control to monitor.)
- Coordination layer — DONE in tree, staged as v0.25.0:0 (built/typechecked clean; install pending). All three primitives shipped; image/app/coordination.py + docs/COORDINATION.md. Brought forward 2026-06-17 on request rather than waiting for our own automation.
  - Swap lock with holder + TTL (POST/GET/DELETE /api/swap/lock). Acquire returns a secret token; the swap endpoint refuses any real swap (423) that doesn't present it in X-Swap-Lock-Token, so the dashboard's manual swap is paused while a scheduler holds it (with a ?force=true human override). In-memory + TTL-bounded → resets to unlocked on restart; re-acquire with the token extends. Enforced in post_swap, not advisory.
  - Swap-event webhook (swap_complete / swap_failed) to a configurable URL (Configure-Sparks field), fired from SwapManager._run outside the swap lock; optional shared secret ⇒ X-Spark-Signature HMAC. Fire-and-forget (5 s, no retries); dry runs don't fire.
  - Schedule visibility — GET/POST/DELETE /api/schedule; read-only "Scheduled jobs" dashboard panel, registered by external schedulers. Spark Control stores and displays, never executes.
  - Tests: image/tests/test_coordination.py (22 cases — lock lifecycle/expiry/token, the single-read swap gate, schedule CRUD + id validation, webhook payload+signature). Known limit: lock + schedules are in-memory (a restart frees the lock and empties the registry until schedulers re-register) — persist to /data only if that bites.
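The swap lock semantics described above (holder, secret token, TTL, re-acquire extends, 423 unless the token or a force override is presented) can be sketched as a small in-memory class. This is an illustrative model only, not the code in image/app/coordination.py; the class name, method names, and injectable clock are hypothetical.

```python
import hmac
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class _Held:
    holder: str
    token: str
    expires_at: float


def _same_token(a: str, b: str) -> bool:
    # Constant-time comparison; encode first so non-ASCII input cannot raise.
    return hmac.compare_digest(a.encode("utf-8"), b.encode("utf-8"))


class SwapLock:
    """Hypothetical in-memory lock with a holder name, secret token, and TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._held: Optional[_Held] = None

    def _current(self) -> Optional[_Held]:
        # An expired lock counts as absent, so a crashed scheduler cannot wedge swaps.
        if self._held is not None and self._clock() >= self._held.expires_at:
            self._held = None
        return self._held

    def acquire(self, holder: str, ttl_seconds: float,
                token: Optional[str] = None) -> Optional[str]:
        """Return the token on success, or None if someone else holds the lock."""
        held = self._current()
        if held is not None:
            if token is None or not _same_token(token, held.token):
                return None
            held.expires_at = self._clock() + ttl_seconds  # re-acquire extends
            return held.token
        new_token = secrets.token_urlsafe(32)
        self._held = _Held(holder, new_token, self._clock() + ttl_seconds)
        return new_token

    def release(self, token: str) -> bool:
        held = self._current()
        if held is None or not _same_token(token, held.token):
            return False
        self._held = None
        return True

    def swap_status_code(self, token: Optional[str], force: bool = False) -> int:
        """Read the lock once and return 200 (swap may proceed) or 423 (locked)."""
        held = self._current()
        if held is None or force:
            return 200
        if token is not None and _same_token(token, held.token):
            return 200
        return 423
```

Because the state lives only in the object, constructing a fresh SwapLock models the restart behaviour noted under "Known limit": the lock comes back unlocked.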
Generalizing the swap mechanism to raw docker run — DEFERRED (decided 2026-06-18, research-backed; was item 4's last open thread)
Our swap drives ~/spark-vllm-docker/launch-cluster.sh over SSH on Spark 1 (./launch-cluster.sh stop, then [VLLM_SPARK_EXTRA_DOCKER_ARGS=…] ./launch-cluster.sh [--solo ]-d exec vllm serve <model> <args>, then docker logs -f until the ready marker). The OpenClaw adopter launches vLLM with a plain docker run instead, so the swap button can't drive his cluster — only monitor it. The portability fix would be a configurable "swap backend": keep launch-cluster.sh as the default and add a "bring your own command" mode (operator-authored stop/launch templates in services-overrides.yaml with quoted {model}/{container}/{port}/{extra_args} substitution; ready-detection unchanged; the vLLM-argparse pre-flight disabled for that backend).
Why deferred, not built:
- Raw docker is not an upgrade for us — for half our catalog it's impossible. launch-cluster.sh is the eugr/spark-vllm-docker community project (de-facto DGX Spark standard; mirrors NVIDIA's own dgx-spark-playbooks Ray+RDMA architecture). Its headline job is multi-node serving: our 235B cluster models (Qwen3-VL 235B, Qwen3 235B) exceed one Spark's 128 GB and must shard across both Sparks via Ray over the 200 Gbps ConnectX/RoCE link — plumbing (NCCL/MTU/per-node env) that a single-node docker run cannot do. So we keep the helper script; switching our own cluster to raw docker is off the table.
- The feature is therefore portability-only (for differently-wired adopters), and the one known adopter doesn't need it — he swaps via his own crons and uses Spark Control to watch.
- Untestable on our hardware — our cluster uses the helper script, so we can't validate a real raw-docker swap without risking the live vLLM.
- The one real standing risk is eugr's single-maintainer status; fallback is community forks or migrating to NVIDIA's official dgx-spark-playbooks launcher (same design). No reason to switch now.
Revisit only if an adopter explicitly wants Spark Control to drive (not just monitor) swaps on a raw-docker run cluster. At that point, get their actual working docker run command and build the command-template backend to it.
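If the command-template backend is ever built, the quoted {model}/{container}/{port}/{extra_args} substitution described above might look roughly like the following. This is a sketch under assumptions: the function name, the field whitelist check, and the example template (including its image name) are hypothetical, and a real template would come from the adopter's working docker run command.

```python
import shlex
import string
from typing import List, Optional

ALLOWED_FIELDS = {"model", "container", "port", "extra_args"}


def render_command(
    template: str,
    *,
    model: str,
    container: str,
    port: int,
    extra_args: Optional[List[str]] = None,
) -> str:
    """Fill an operator-authored template, shell-quoting every substituted value."""
    # Reject unknown placeholders, attribute/index access, and format specs up front.
    for _, field, spec, conversion in string.Formatter().parse(template):
        if field is None:
            continue
        if field not in ALLOWED_FIELDS or spec or conversion:
            raise ValueError(f"unsupported placeholder: {field!r}")
    values = {
        "model": shlex.quote(model),
        "container": shlex.quote(container),
        "port": str(int(port)),  # int() rejects anything that is not a number
        "extra_args": " ".join(shlex.quote(arg) for arg in (extra_args or [])),
    }
    return template.format(**values)


# Hypothetical template; the image name is a placeholder, not a real recommendation.
EXAMPLE_TEMPLATE = (
    "docker run -d --name {container} -p {port}:{port} "
    "example/vllm-image vllm serve {model} --port {port} {extra_args}"
)
```

Quoting each value separately means a model name containing shell metacharacters stays a single argument instead of becoming extra commands.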
Near term
- parakeet-asr long-audio memory guard — deferred 2026-06-15, low priority. A duration cap on /v1/audio/diarize: Sortformer runs the whole file in one pass (diarizer.py:128-135) over Spark 2's shared 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. Precautionary — no observed incident, and the production consumer (Recap Relay) already chunks via /diarize-chunk (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full /diarize. When picked up: add a configurable MAX_DIARIZE_SECONDS guard in diarizer.py right after duration is computed (~line 130) → raise → HTTP 413 in main.py (mirrors the existing MAX_UPLOAD_MB 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
- Controlled concurrency sweep of the audio endpoints in a quiet window — replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
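The MAX_DIARIZE_SECONDS guard proposed above could take roughly this shape. It is a framework-free sketch: the exception class, function names, the "unset or invalid means no cap" default, and the response body are assumptions, and the real change would hook into diarizer.py and main.py as described.

```python
import os
from typing import Mapping, Tuple


class AudioTooLongError(Exception):
    """Hypothetical error raised when audio exceeds the diarization duration cap."""

    def __init__(self, duration_seconds: float, limit_seconds: float) -> None:
        super().__init__(
            f"audio is {duration_seconds:.0f} s; the limit is {limit_seconds:.0f} s"
        )
        self.duration_seconds = duration_seconds
        self.limit_seconds = limit_seconds


def load_max_diarize_seconds(env: Mapping[str, str] = os.environ) -> float:
    """Read MAX_DIARIZE_SECONDS; blank, invalid, or non-positive means no cap (0.0)."""
    raw = env.get("MAX_DIARIZE_SECONDS", "").strip()
    try:
        value = float(raw)
    except ValueError:
        return 0.0
    return value if value > 0 else 0.0


def enforce_duration_cap(duration_seconds: float, limit_seconds: float) -> None:
    """Call right after the duration is computed, before the single-pass model runs."""
    if limit_seconds > 0 and duration_seconds > limit_seconds:
        raise AudioTooLongError(duration_seconds, limit_seconds)


def to_http_error(exc: AudioTooLongError) -> Tuple[int, dict]:
    """Map the guard failure to a 413 response, mirroring the upload-size limit."""
    return 413, {"detail": str(exc)}
```

Checking the duration before inference is what keeps the guard cheap: the oversized file is rejected before it occupies shared unified memory.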
Audio quality
- Echo cancellation for dual-channel label-merge — removes the mic-bleed limit when the local user isn't wearing headphones.
- LLM "referee" pass for low-confidence label-merge speaker naming.
Platform hardening
- Qdrant auth (API key) + scheduled snapshots/backups.
- Observability: request metrics + GPU-busy tracking, so load questions are answered from data instead of log archaeology.
- API-key auth on Spark Control — only if public (non-VPN) exposure is ever needed; current stance is LAN + split-tunnel VPN only.
Throughput (only if audio load outgrows one GPU)
- Second audio worker / queueing layer; revisit which services share Spark 2.
Dashboard
- Per-model configurable vLLM flags editable from the UI (today: edit models.yaml and rebuild).
- Spark host update actions (OS/driver) from the UI.
- Open WebUI link-out integration; richer per-service detail views.
Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)
P0/P1 security findings are all fixed in v0.19.0:0. Remaining, none blocking:
P2 — track:
- No automated tests beyond the two redaction suites — swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for build_launch_command (incl. injection cases), swap transitions, and _merge_words_with_speakers is the highest-value start.
- Loose dependency floors permit vulnerable python-multipart / starlette (DoS CVEs) on rebuild; no lockfile; no upload size caps (pyproject.toml).
- Opaque HTTP 500 on POST /api/models / PUT /knobs when MODELS_OVERRIDES unset in dev (write to read-only /data) — catch the OSError.
- NGC API key still appears on the remote process command line (nim.py) — the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
- Global mutable catalog reassigned via global, shared across async requests with no snapshot (server.py) — latent race as concurrency grows.
- Container runs uvicorn as root bound to 0.0.0.0:9999 (no USER in Dockerfile) — amplifies any RCE blast radius.
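For the NGC API key item above, the general technique of handing a secret to a child process over stdin rather than argv can be sketched as follows. How nim.py actually invokes the remote command is not shown in this document, so the helper name and its calling convention are hypothetical; the point is only that the secret never appears in the argument list that process listings expose.

```python
import subprocess
from typing import List


def run_with_secret_on_stdin(
    argv: List[str], secret: str, timeout: float = 30.0
) -> subprocess.CompletedProcess:
    """Run argv and deliver the secret as one line on the child's stdin.

    The child must read the secret from stdin itself (for a shell command,
    something like `read -r KEY`), so the secret is never part of argv.
    """
    if any(secret in arg for arg in argv):
        raise ValueError("secret must not be embedded in the command line")
    return subprocess.run(
        argv,
        input=secret + "\n",
        text=True,
        capture_output=True,
        timeout=timeout,
        check=False,
    )
```

The explicit argv check is a cheap regression guard: a future refactor that interpolates the key back into the command fails loudly instead of silently re-exposing it.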
P3 — bulk-fix when next touching docs/packaging:
- README Status block stale (v0.2.3 / 0.13.0:4 → now v0.19.0:0); deprecated @app.on_event + hardcoded app.version="0.1.0"; NimInstallBody.register shadows BaseModel (rename → register_service); httpx class names leak into TTS/speech-models error text; one unescaped innerHTML sink (app.js) + task_id reflected in scrub JSON.
- Packaging: marketingUrl / packageRepo / upstreamRepo are example.com placeholders; broken instructions.md source link; per-service SSH users (parakeet_user etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); Makefile builds only x86 though the manifest declares aarch64.
- Hardening misc: no body/upload size limits on /v1/audio/*, /v1/chat/completions, /scrub; int(_env(...)) startup crash on bad VLLM_PORT; upstream error text echoed to clients.
- StartOS registry (only if ever pursuing it): source must be public + real repo URLs.
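The int(_env(...)) startup crash listed under "Hardening misc" is the kind of failure that fallback parsing avoids. A minimal sketch of such parsing is below, assuming a hypothetical helper name and a port-range check; the 8888 default for VLLM_PORT is the one stated earlier in this roadmap, and the shipped implementation may differ.

```python
import os
from typing import Mapping


def int_setting(
    name: str,
    default: int,
    env: Mapping[str, str] = os.environ,
    minimum: int = 1,
    maximum: int = 65535,
) -> int:
    """Parse an integer setting; blank, malformed, or out-of-range falls back."""
    raw = (env.get(name) or "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    if not minimum <= value <= maximum:
        return default
    return value
```

For example, int_setting("VLLM_PORT", 8888) returns 8888 when the variable is unset or blank, and the configured port when it holds a valid number.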