Make the cluster topology configurable so an adopter wired differently (vLLM on both Sparks, port 8000, different container name, no Parakeet) can monitor without forking. Covers the OpenClaw report P4/P5/#6.
- VLLM_CONTAINER override (default vllm_node), validated at the boundary and quote_arg-quoted into the swap log-tail + pre-flight validator exec.
- DISABLED_SERVICES list: hidden services show no tile and are skipped by status/deep-health/connectivity probes (kills the Parakeet-on-8000 collision).
- kind: vllm custom service monitors a second Spark's vLLM via the shared probe_vllm_endpoint; /api/endpoints gains a disabled flag.
Swap mechanism intentionally not generalized to raw docker run (that's coordination, roadmap item 4).
ROADMAP
Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.
Cluster coordination — OpenClaw coexistence (committed 2026-06-17, from Johnny 5 report 2026-06-16)
Driven by the one other Spark Control adopter (a colleague running OpenClaw + cron jobs against his own dual Sparks; report at the date above). His cluster is configured differently from ours (vLLM on both Sparks, port 8000, raw docker run, container vllm-gemma4) and an automated cron physically swaps models — so his notes are partly portability gaps (the package hard-codes our layout) and partly coordination gaps (his dashboard and his crons fight over the GPU).
Design stance (decided): Spark Control is the control plane / GPU arbiter, not a job runner. Recurring business pipelines (his "Daily Vol" generator; our own future scheduled jobs) live in separate application services that call Spark Control's swap API. The dividing line is what a scheduled job does: control-plane actions (swap a model, warm it, restart a service, run a health sweep) are in scope for an in-package scheduler; business logic (scrape / summarize / build / deploy) stays in the app layer. Swaps are already API-driven (POST /api/swap → GET /api/swap/{id} / …/stream, POST /api/swap/{key}/validate) and non-browser clients pass the CSRF guard, so an external scheduler can drive swaps today — the items below add the safety layer, not the capability.
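A minimal sketch of an external scheduler driving a swap through the API named above. The paths POST /api/swap and GET /api/swap/{id} come from this document; the request field (`model`), the response fields (`id`, `state`), and the terminal state names are assumptions for illustration only, so check them against the real API before relying on this.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Signature of the HTTP helper: (method, url, json_body) -> decoded JSON.
Transport = Callable[[str, str, Optional[dict]], dict]


def http_json(method: str, url: str, body: Optional[dict] = None) -> dict:
    """Small stdlib JSON transport (no auth; assumes LAN/VPN reachability)."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def drive_swap(
    base_url: str,
    model_key: str,
    transport: Transport = http_json,
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Start a swap, then poll its status until it reaches a terminal state."""
    # ASSUMPTION: the body field "model" and the response field "id".
    started = transport("POST", f"{base_url}/api/swap", {"model": model_key})
    swap_id = started["id"]
    for _ in range(max_polls):
        status = transport("GET", f"{base_url}/api/swap/{swap_id}", None)
        # ASSUMPTION: "state" with terminal values "complete" / "failed".
        if status.get("state") in ("complete", "failed"):
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"swap {swap_id} did not finish after {max_polls} polls")
```

The transport and sleep parameters are injectable so a scheduler can be tested without a live cluster.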
Sequenced:
- Configurable VLLM_PORT: DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000.
- Local-path / fine-tuned model support: DONE, v0.23.0:0. Catalog/ModelDef gained local_path (exactly one of repo/local_path); swap bind-mounts the dir into the vLLM container at the same path via the launch script's VLLM_SPARK_EXTRA_DOCKER_ARGS hook (no launch-cluster.sh change); "+ Add local model" form + local badge; disk-delete refused for local models; validate_local_path boundary check. His merged ten31-v2 was the motivating case.
- Configurable topology: DONE, v0.24.0:0. Three optional Configure-Sparks knobs: vLLM container name (VLLM_CONTAINER, blank ⇒ vllm_node; threaded through the swap log-tail + pre-flight validator via quote_arg); "services to hide" (DISABLED_SERVICES, comma list; hidden services show no tile and are skipped by status/deep-health/connectivity probes, killing the Parakeet-on-8000 collision); and a second-Spark vLLM monitor via a kind: vllm custom service in services-overrides.yaml (read-only tile probed through the shared probe_vllm_endpoint). /api/endpoints gained a disabled flag. Covers report P4/P5/#6. (Generalizing the swap mechanism to the adopter's raw docker run was deliberately left out: that's coordination, item 4; he swaps via his own crons and uses Spark Control to monitor.)
- Coordination layer (build when our own automation actually lands; zero value until something other than the dashboard swaps models):
  - Swap lock with holder + TTL (POST/GET/DELETE /api/swap/lock). An external scheduler acquires it before swapping; the dashboard then refuses manual swaps and shows who holds the GPU and until when. Enforced by the swap path, not advisory.
  - Swap-event webhook (swap_complete / swap_failed) to a configurable URL, so downstream consumers update their provider config when the running model changes.
  - Schedule visibility: read-only view the dashboard surfaces, registered by external schedulers (Spark Control does not own the schedule).
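The swap lock in the coordination layer is not built yet. The sketch below shows one way the holder + TTL semantics could work in memory. The class and method names (`SwapLock`, `acquire`, `may_swap`) are hypothetical, and it assumes a single-process server where lock state does not need to survive a restart.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Lock:
    holder: str        # who holds the GPU (e.g. a scheduler name)
    expires_at: float  # clock value after which the lock is void


class SwapLock:
    """Hypothetical in-memory holder + TTL lock, enforced by the swap path."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._lock: Optional[Lock] = None

    def current(self) -> Optional[Lock]:
        """Return the live lock, dropping it first if its TTL has passed."""
        if self._lock is not None and self._lock.expires_at <= self._clock():
            self._lock = None
        return self._lock

    def acquire(self, holder: str, ttl_seconds: float) -> bool:
        """Take or renew the lock; refuse if someone else holds it."""
        cur = self.current()
        if cur is not None and cur.holder != holder:
            return False
        self._lock = Lock(holder, self._clock() + ttl_seconds)
        return True

    def release(self, holder: str) -> bool:
        """Release only if the caller is the current holder."""
        cur = self.current()
        if cur is not None and cur.holder == holder:
            self._lock = None
            return True
        return False

    def may_swap(self, requester: str) -> bool:
        """The swap path calls this before starting any swap."""
        cur = self.current()
        return cur is None or cur.holder == requester
```

Because expiry is checked on every read, a scheduler that crashes while holding the lock blocks the dashboard only until the TTL runs out.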
Near term
- parakeet-asr long-audio memory guard (deferred 2026-06-15, low priority). A duration cap on /v1/audio/diarize: Sortformer runs the whole file in one pass (diarizer.py:128-135) over Spark 2's shared 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. Precautionary: no observed incident, and the production consumer (Recap Relay) already chunks via /diarize-chunk (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full /diarize. When picked up: add a configurable MAX_DIARIZE_SECONDS guard in diarizer.py right after duration is computed (~line 130) → raise → HTTP 413 in main.py (mirrors the existing MAX_UPLOAD_MB 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
- Controlled concurrency sweep of the audio endpoints in a quiet window: replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
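The MAX_DIARIZE_SECONDS guard described above could take roughly this shape. It is framework-free, so the exception and helper names (`AudioTooLong`, `check_diarize_duration`, `to_http_status`) are hypothetical, and the real diarizer.py / main.py wiring is not shown. Treating a non-positive limit as "guard disabled" is also an assumption.

```python
class AudioTooLong(Exception):
    """Hypothetical error that the HTTP layer would translate into a 413."""

    def __init__(self, duration_s: float, limit_s: float):
        super().__init__(
            f"audio is {duration_s:.0f}s; diarization limit is {limit_s:.0f}s"
        )
        self.duration_s = duration_s
        self.limit_s = limit_s


def check_diarize_duration(duration_s: float, limit_s: float) -> None:
    """Raise before the single-pass model runs if the file is too long.

    ASSUMPTION: limit_s <= 0 means the guard is switched off.
    """
    if limit_s > 0 and duration_s > limit_s:
        raise AudioTooLong(duration_s, limit_s)


def to_http_status(exc: Exception) -> int:
    """Map the guard's error to 413, mirroring the upload-size rejection."""
    return 413 if isinstance(exc, AudioTooLong) else 500
```

Calling the check right after the duration is known keeps the rejection cheap: nothing is loaded onto the GPU for a file that will be refused.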
Audio quality
- Echo cancellation for dual-channel label-merge — removes the mic-bleed limit when the local user isn't wearing headphones.
- LLM "referee" pass for low-confidence label-merge speaker naming.
Platform hardening
- Qdrant auth (API key) + scheduled snapshots/backups.
- Observability: request metrics + GPU-busy tracking, so load questions are answered from data instead of log archaeology.
- API-key auth on Spark Control — only if public (non-VPN) exposure is ever needed; current stance is LAN + split-tunnel VPN only.
Throughput (only if audio load outgrows one GPU)
- Second audio worker / queueing layer; revisit which services share Spark 2.
Dashboard
- Per-model configurable vLLM flags editable from the UI (today: edit models.yaml and rebuild).
- Spark host update actions (OS/driver) from the UI.
- Open WebUI link-out integration; richer per-service detail views.
Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)
P0/P1 security findings are all fixed in v0.19.0:0. Remaining, none blocking:
P2 — track:
- No automated tests beyond the two redaction suites: swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for build_launch_command (incl. injection cases), swap transitions, and _merge_words_with_speakers is the highest-value start.
- Loose dependency floors permit vulnerable python-multipart / starlette (DoS CVEs) on rebuild; no lockfile; no upload size caps (pyproject.toml).
- Opaque HTTP 500 on POST /api/models / PUT /knobs when MODELS_OVERRIDES unset in dev (write to read-only /data); catch the OSError.
- NGC API key still appears on the remote process command line (nim.py); the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
- Global mutable catalog reassigned via global, shared across async requests with no snapshot (server.py); latent race as concurrency grows.
- Container runs uvicorn as root bound to 0.0.0.0:9999 (no USER in Dockerfile); amplifies any RCE blast radius.
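For the opaque-500 item, "catch the OSError" might look like the sketch below: wrap the overrides write and raise a domain error the handler can turn into a readable response. `OverridesWriteError` and `save_overrides` are hypothetical names, and the handler's choice of status code is left open.

```python
from pathlib import Path


class OverridesWriteError(Exception):
    """Hypothetical domain error carrying a message safe to show the caller."""


def save_overrides(path: Path, text: str) -> None:
    """Write the overrides file, converting filesystem failures into a clear error.

    OSError covers read-only filesystems, missing directories, and permission
    problems, so one except clause handles the dev-mode failure described above.
    """
    try:
        path.write_text(text, encoding="utf-8")
    except OSError as exc:
        reason = exc.strerror or str(exc)
        raise OverridesWriteError(
            f"cannot write overrides to {path}: {reason}"
        ) from exc
```

The route handler can then catch `OverridesWriteError` and return the message in the response body, so the caller sees the cause rather than a bare 500.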
P3 — bulk-fix when next touching docs/packaging:
- README Status block stale (v0.2.3 / 0.13.0:4 → now v0.19.0:0); deprecated @app.on_event + hardcoded app.version="0.1.0"; NimInstallBody.register shadows BaseModel (rename → register_service); httpx class names leak into TTS/speech-models error text; one unescaped innerHTML sink (app.js) + task_id reflected in scrub JSON.
- Packaging: marketingUrl / packageRepo / upstreamRepo are example.com placeholders; broken instructions.md source link; per-service SSH users (parakeet_user etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); Makefile builds only x86 though the manifest declares aarch64.
- Hardening misc: no body/upload size limits on /v1/audio/*, /v1/chat/completions, /scrub; int(_env(...)) startup crash on bad VLLM_PORT; upstream error text echoed to clients.
- StartOS registry (only if ever pursuing it): source must be public + real repo URLs.
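For the int(_env(...)) startup crash listed under hardening misc, a blank-or-bad-value fallback could be sketched as follows. `int_setting` is a hypothetical helper, not the package's actual parser; the 8888 default matches the "blank ⇒ 8888" behaviour noted for VLLM_PORT, and the 1 to 65535 range check is an added assumption.

```python
import os
from typing import Mapping


def int_setting(
    name: str,
    default: int,
    env: Mapping[str, str] = os.environ,
    lo: int = 1,
    hi: int = 65535,
) -> int:
    """Read an integer setting, falling back to the default instead of raising.

    Blank, non-numeric, and out-of-range values all yield the default, so a
    bad value in the settings form cannot crash startup.
    """
    raw = (env.get(name) or "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if lo <= value <= hi else default


# Example: resolve the vLLM port once at startup.
VLLM_PORT = int_setting("VLLM_PORT", 8888)
```

Logging the rejected raw value at startup would make a silent fallback easier to spot; that is omitted here for brevity.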