GPU-arbiter safety layer for when automation, not just the dashboard, swaps models:
- swap reservation lock (POST/GET/DELETE /api/swap/lock); 423-enforced in post_swap via a single-read gate, TTL-bounded, secret-token auth, human force-release override + dashboard banner
- swap webhook (swap_complete/swap_failed) fired outside the swap lock, optional HMAC signature, configurable URL+secret
- read-only schedule registry (GET/POST/DELETE /api/schedule) + dashboard panel
New module image/app/coordination.py; docs/COORDINATION.md for consumers; 22 offline tests in test_coordination.py.
ROADMAP
Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.
Cluster coordination — OpenClaw coexistence (committed 2026-06-17, from Johnny 5 report 2026-06-16)
Driven by the one other Spark Control adopter (a colleague running OpenClaw + cron jobs against his own dual Sparks; report at the date above). His cluster is configured differently from ours (vLLM on both Sparks, port 8000, raw docker run, container vllm-gemma4) and an automated cron physically swaps models — so his notes are partly portability gaps (the package hard-codes our layout) and partly coordination gaps (his dashboard and his crons fight over the GPU).
Design stance (decided): Spark Control is the control plane / GPU arbiter, not a job runner. Recurring business pipelines (his "Daily Vol" generator; our own future scheduled jobs) live in separate application services that call Spark Control's swap API. The dividing line is what a scheduled job does: control-plane actions (swap a model, warm it, restart a service, run a health sweep) are in scope for an in-package scheduler; business logic (scrape / summarize / build / deploy) stays in the app layer. Swaps are already API-driven (POST /api/swap → GET /api/swap/{id} / …/stream, POST /api/swap/{key}/validate) and non-browser clients pass the CSRF guard, so an external scheduler can drive swaps today — the items below add the safety layer, not the capability.
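As a rough illustration of an external scheduler driving a swap through the endpoints named above, the sketch below starts a swap and polls it to completion. Only the paths POST /api/swap and GET /api/swap/{id} come from this document; the request body (`{"model": ...}`), the response fields (`id`, `status`), the terminal status names, and the injected `http` transport are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Optional

# Assumed transport shape: (method, path, json_body) -> decoded JSON dict.
Transport = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]

# Hypothetical terminal status names; the real API's vocabulary may differ.
TERMINAL_STATES = {"complete", "failed"}


def drive_swap(
    http: Transport,
    model_key: str,
    poll_seconds: float = 2.0,
    max_polls: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Start a swap, then poll its status until it reaches a terminal state."""
    # Request body shape and the "id" response field are assumptions.
    started = http("POST", "/api/swap", {"model": model_key})
    swap_id = started["id"]
    for _ in range(max_polls):
        status = http("GET", f"/api/swap/{swap_id}", None)
        if status.get("status") in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"swap {swap_id} did not finish after {max_polls} polls")
```

Injecting the transport keeps the business pipeline testable without a live cluster, which matches the stance that such pipelines live outside Spark Control.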
Sequenced:
- Configurable VLLM_PORT — DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000.
- Local-path / fine-tuned model support — DONE, v0.23.0:0. Catalog/ModelDef gained local_path (exactly one of repo/local_path); swap bind-mounts the dir into the vLLM container at the same path via the launch script's VLLM_SPARK_EXTRA_DOCKER_ARGS hook (no launch-cluster.sh change); "+ Add local model" form + local badge; disk-delete refused for local models; validate_local_path boundary check. His merged ten31-v2 was the motivating case.
- Configurable topology — DONE, v0.24.0:0. Three optional Configure-Sparks knobs: vLLM container name (VLLM_CONTAINER, blank ⇒ vllm_node; threaded through the swap log-tail + pre-flight validator via quote_arg); "services to hide" (DISABLED_SERVICES, comma list — hidden services show no tile and are skipped by status/deep-health/connectivity probes, killing the Parakeet-on-8000 collision); and a second-Spark vLLM monitor via a kind: vllm custom service in services-overrides.yaml (read-only tile probed through the shared probe_vllm_endpoint). /api/endpoints gained a disabled flag. Covers report P4/P5/#6. (Generalizing the swap mechanism to the adopter's raw docker run was deliberately left out — that's coordination, item 4; he swaps via his own crons and uses Spark Control to monitor.)
- Coordination layer — DONE in tree, staged as v0.25.0:0 (built/typechecked clean; install pending). All three primitives shipped; image/app/coordination.py + docs/COORDINATION.md. Brought forward 2026-06-17 on request rather than waiting for our own automation.
  - Swap lock with holder + TTL (POST/GET/DELETE /api/swap/lock). Acquire returns a secret token; the swap endpoint refuses any real swap (423) that doesn't present it in X-Swap-Lock-Token, so the dashboard's manual swap is paused while a scheduler holds it (with a ?force=true human override). In-memory + TTL-bounded → resets to unlocked on restart; re-acquire with the token extends. Enforced in post_swap, not advisory.
  - Swap-event webhook (swap_complete/swap_failed) to a configurable URL (Configure-Sparks field), fired from SwapManager._run outside the swap lock; optional shared secret ⇒ X-Spark-Signature HMAC. Fire-and-forget (5 s, no retries); dry runs don't fire.
  - Schedule visibility — GET/POST/DELETE /api/schedule; read-only "Scheduled jobs" dashboard panel, registered by external schedulers. Spark Control stores and displays, never executes.
  - Still NOT generalized: the swap mechanism to raw docker run (that's the adopter's own crons' job).
  Tests: image/tests/test_coordination.py (22 cases — lock lifecycle/expiry/token, the single-read swap gate, schedule CRUD + id validation, webhook payload+signature).
  Known limit: lock + schedules are in-memory (a restart frees the lock and empties the registry until schedulers re-register) — persist to /data only if that bites.
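The swap lock described above (holder + TTL, secret token, 423 on a mismatched swap, re-acquire extends, human force-release) can be sketched roughly as follows. This is an illustrative in-memory model, not the code in image/app/coordination.py: the class name `SwapLock`, its method names, and the injected clock are all hypothetical.

```python
import hmac
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class _Held:
    holder: str
    token: str
    expires_at: float


def _same(a: Optional[str], b: str) -> bool:
    # Constant-time comparison; a missing token never matches.
    return a is not None and hmac.compare_digest(a.encode(), b.encode())


class SwapLock:
    """Hypothetical in-memory reservation lock: one holder, token, bounded TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._held: Optional[_Held] = None

    def _current(self) -> Optional[_Held]:
        # Read the state once; an expired lock counts as unlocked.
        held = self._held
        if held is not None and held.expires_at <= self._clock():
            self._held = held = None
        return held

    def acquire(self, holder: str, ttl: float,
                token: Optional[str] = None) -> Optional[str]:
        """Return the token on success, or None if another holder has the lock."""
        held = self._current()
        if held is not None:
            if not _same(token, held.token):
                return None
            held.expires_at = self._clock() + ttl  # re-acquire with token extends
            return held.token
        self._held = _Held(holder, secrets.token_urlsafe(32), self._clock() + ttl)
        return self._held.token

    def gate(self, presented: Optional[str]) -> int:
        """Status for a swap request: 200 to proceed, 423 while locked by another."""
        held = self._current()
        if held is None or _same(presented, held.token):
            return 200
        return 423

    def release(self, token: Optional[str] = None, force: bool = False) -> bool:
        """Release with the token, or unconditionally with force (human override)."""
        held = self._current()
        if held is None:
            return True
        if force or _same(token, held.token):
            self._held = None
            return True
        return False
```

Because the state is a plain attribute, a process restart starts unlocked, which is the same trade-off the known limit above describes.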
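For consumers of the swap-event webhook, a receiver would typically recompute the signature over the raw request body and compare it to the X-Spark-Signature header. The sketch below assumes HMAC-SHA256 with a hex digest and a JSON payload carrying an `event` field; this document only says the header is an HMAC, so the algorithm, encoding, and payload shape are assumptions to check against docs/COORDINATION.md.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign_body(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (algorithm and encoding are assumed)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_webhook(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """True only if the header matches the signature recomputed from the body."""
    expected = sign_body(secret, body)
    presented = signature_header or ""  # a missing header never verifies
    return hmac.compare_digest(expected.encode(), presented.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical payload; the real field names may differ.
    raw = json.dumps({"event": "swap_complete"}).encode("utf-8")
    header = sign_body("shared-secret", raw)
    print(verify_webhook("shared-secret", raw, header))
```

Verification should use the bytes exactly as received; re-serializing the parsed JSON can change whitespace or key order and break the comparison.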
Near term
- parakeet-asr long-audio memory guard — deferred 2026-06-15, low priority. A duration cap on /v1/audio/diarize: Sortformer runs the whole file in one pass (diarizer.py:128-135) over Spark 2's shared 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. Precautionary — no observed incident, and the production consumer (Recap Relay) already chunks via /diarize-chunk (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full /diarize. When picked up: add a configurable MAX_DIARIZE_SECONDS guard in diarizer.py right after duration is computed (~line 130) → raise → HTTP 413 in main.py (mirrors the existing MAX_UPLOAD_MB 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
- Controlled concurrency sweep of the audio endpoints in a quiet window — replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
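The MAX_DIARIZE_SECONDS guard outlined above might look roughly like this. It is a standalone sketch rather than a patch to diarizer.py or main.py: the exception class, the helper names, and the convention that an unset, blank, or non-positive value disables the cap are all assumptions.

```python
import os
from typing import Mapping, Optional


class AudioTooLongError(Exception):
    """Hypothetical error raised when one file exceeds the diarization cap."""

    def __init__(self, duration: float, limit: float) -> None:
        super().__init__(f"audio is {duration:.0f} s; limit is {limit:.0f} s")
        self.duration = duration
        self.limit = limit


def max_diarize_seconds(env: Optional[Mapping[str, str]] = None) -> float:
    """Read MAX_DIARIZE_SECONDS; unset, blank, bad, or <= 0 means no cap (assumed)."""
    raw = (os.environ if env is None else env).get("MAX_DIARIZE_SECONDS", "")
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0
    return value if value > 0 else 0.0


def check_duration(duration: float, limit: float) -> None:
    """Call right after the duration is computed, before the single-pass run."""
    if limit > 0 and duration > limit:
        raise AudioTooLongError(duration, limit)


def status_for(exc: Exception) -> int:
    """Map the guard's error to HTTP 413; anything else stays a 500."""
    return 413 if isinstance(exc, AudioTooLongError) else 500
```

Raising before the model runs keeps the rejection cheap: the file's duration is known, but no unified memory has been committed to the pass yet.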
Audio quality
- Echo cancellation for dual-channel label-merge — removes the mic-bleed limit when the local user isn't wearing headphones.
- LLM "referee" pass for low-confidence label-merge speaker naming.
Platform hardening
- Qdrant auth (API key) + scheduled snapshots/backups.
- Observability: request metrics + GPU-busy tracking, so load questions are answered from data instead of log archaeology.
- API-key auth on Spark Control — only if public (non-VPN) exposure is ever needed; current stance is LAN + split-tunnel VPN only.
Throughput (only if audio load outgrows one GPU)
- Second audio worker / queueing layer; revisit which services share Spark 2.
Dashboard
- Per-model configurable vLLM flags editable from the UI (today: edit models.yaml and rebuild).
- Spark host update actions (OS/driver) from the UI.
- Open WebUI link-out integration; richer per-service detail views.
Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)
P0/P1 security findings are all fixed in v0.19.0:0. Remaining, none blocking:
P2 — track:
- No automated tests beyond the two redaction suites — swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for build_launch_command (incl. injection cases), swap transitions, and _merge_words_with_speakers is the highest-value start.
- Loose dependency floors permit vulnerable python-multipart/starlette (DoS CVEs) on rebuild; no lockfile; no upload size caps (pyproject.toml).
- Opaque HTTP 500 on POST /api/models / PUT /knobs when MODELS_OVERRIDES unset in dev (write to read-only /data) — catch the OSError.
- NGC API key still appears on the remote process command line (nim.py) — the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
- Global mutable catalog reassigned via global, shared across async requests with no snapshot (server.py) — latent race as concurrency grows.
- Container runs uvicorn as root bound to 0.0.0.0:9999 (no USER in Dockerfile) — amplifies any RCE blast radius.
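One common way to close the catalog race noted in the P2 list is to publish the catalog as an immutable snapshot that each request reads once. The sketch below illustrates that pattern only; `CatalogStore` and its methods are hypothetical names and do not describe how server.py is structured.

```python
import threading
from types import MappingProxyType
from typing import Any, Mapping


class CatalogStore:
    """Hypothetical holder that swaps whole read-only snapshots atomically."""

    def __init__(self, initial: Mapping[str, Any]) -> None:
        self._lock = threading.Lock()
        self._current: Mapping[str, Any] = MappingProxyType(dict(initial))

    def snapshot(self) -> Mapping[str, Any]:
        # A request takes this once and uses it for its whole lifetime,
        # so a concurrent reload cannot change the view mid-request.
        return self._current

    def replace(self, new_catalog: Mapping[str, Any]) -> None:
        # Build the new read-only view first, then publish it in one step.
        frozen = MappingProxyType(dict(new_catalog))
        with self._lock:
            self._current = frozen


if __name__ == "__main__":
    store = CatalogStore({"gemma": {"repo": "example/gemma"}})  # made-up entry
    view = store.snapshot()
    store.replace({"other": {"repo": "example/other"}})
    print(sorted(view), sorted(store.snapshot()))
```

The proxy is shallow: nested values would also need to be immutable (or copied) for the snapshot to be fully safe against in-place edits.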
P3 — bulk-fix when next touching docs/packaging:
- README Status block stale (v0.2.3 / 0.13.0:4 → now v0.19.0:0); deprecated @app.on_event + hardcoded app.version="0.1.0"; NimInstallBody.register shadows BaseModel (rename → register_service); httpx class names leak into TTS/speech-models error text; one unescaped innerHTML sink (app.js) + task_id reflected in scrub JSON.
- Packaging: marketingUrl/packageRepo/upstreamRepo are example.com placeholders; broken instructions.md source link; per-service SSH users (parakeet_user etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); Makefile builds only x86 though the manifest declares aarch64.
- Hardening misc: no body/upload size limits on /v1/audio/*, /v1/chat/completions, /scrub; int(_env(...)) startup crash on bad VLLM_PORT; upstream error text echoed to clients.
- StartOS registry (only if ever pursuing it): source must be public + real repo URLs.
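The startup crash on a bad VLLM_PORT listed under hardening comes from calling int() directly on an environment value. A fallback parser along the following lines avoids it; the helper name `int_setting` and the range bounds are hypothetical, while the 8888 default is the blank-value default stated earlier in this roadmap.

```python
import os
from typing import Mapping, Optional


def int_setting(
    name: str,
    default: int,
    env: Optional[Mapping[str, str]] = None,
    lo: int = 1,
    hi: int = 65535,
) -> int:
    """Parse an integer setting; blank, non-numeric, or out-of-range gives default."""
    raw = (os.environ if env is None else env).get(name) or ""
    try:
        value = int(raw.strip())
    except ValueError:
        return default
    return value if lo <= value <= hi else default


if __name__ == "__main__":
    # Blank or unset falls back to 8888 instead of raising at startup.
    print(int_setting("VLLM_PORT", 8888, env={"VLLM_PORT": ""}))
```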