spark-control/ROADMAP.md
Keysat 26070eb191 v0.24.0:0 - configurable cluster topology (vllm container name, hide services, second-vllm monitor)
Make the cluster topology configurable so an adopter wired differently
(vLLM on both Sparks, port 8000, different container name, no Parakeet)
can monitor without forking. Covers the OpenClaw report P4/P5/#6.

- VLLM_CONTAINER override (default vllm_node), validated at the boundary
  and quote_arg-quoted into the swap log-tail + pre-flight validator exec.
- DISABLED_SERVICES list: hidden services show no tile and are skipped by
  status/deep-health/connectivity probes (kills the Parakeet-on-8000
  collision).
- kind: vllm custom service monitors a second Spark's vLLM via the shared
  probe_vllm_endpoint; /api/endpoints gains a disabled flag.

Swap mechanism intentionally not generalized to raw docker run (that's
coordination, roadmap item 4).
2026-06-17 23:03:33 -05:00


ROADMAP

Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.

Cluster coordination — OpenClaw coexistence (committed 2026-06-17, from Johnny 5 report 2026-06-16)

Driven by the one other Spark Control adopter (a colleague running OpenClaw + cron jobs against his own dual Sparks; report at the date above). His cluster is configured differently from ours (vLLM on both Sparks, port 8000, raw docker run, container vllm-gemma4) and an automated cron physically swaps models — so his notes are partly portability gaps (the package hard-codes our layout) and partly coordination gaps (his dashboard and his crons fight over the GPU).

Design stance (decided): Spark Control is the control plane / GPU arbiter, not a job runner. Recurring business pipelines (his "Daily Vol" generator; our own future scheduled jobs) live in separate application services that call Spark Control's swap API. The dividing line is what a scheduled job does: control-plane actions (swap a model, warm it, restart a service, run a health sweep) are in scope for an in-package scheduler; business logic (scrape / summarize / build / deploy) stays in the app layer. Swaps are already API-driven (POST /api/swap, GET /api/swap/{id} / …/stream, POST /api/swap/{key}/validate) and non-browser clients pass the CSRF guard, so an external scheduler can drive swaps today; the items below add the safety layer, not the capability.
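To make the "external scheduler can drive swaps today" point concrete, here is a minimal sketch of a start-then-poll client. Only the two endpoint paths come from this document. The request body ({"key": ...}), the response fields (id, state), the terminal state names, and the injected send transport are all assumptions made for illustration, not the package's actual contract.

```python
import time
from typing import Callable, Optional

# Hypothetical transport: (method, path, json_body) -> decoded JSON response.
# A real scheduler would wrap its HTTP client of choice behind this signature.
Transport = Callable[[str, str, Optional[dict]], dict]

# Assumed terminal state names; the real swap state machine may differ.
TERMINAL_STATES = {"complete", "failed"}


def run_swap(send: Transport, model_key: str,
             poll_seconds: float = 5.0, max_polls: int = 120) -> dict:
    """Start a swap, then poll its status until it reaches a terminal state."""
    # Assumed shapes: {"key": ...} in the request, {"id": ...} in the response.
    started = send("POST", "/api/swap", {"key": model_key})
    swap_id = started["id"]
    for _ in range(max_polls):
        status = send("GET", f"/api/swap/{swap_id}", None)
        if status.get("state") in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"swap {swap_id} did not finish after {max_polls} polls")
```

Injecting the transport keeps the polling logic testable without a live cluster; the business pipeline itself would stay in the calling application, in line with the stance above.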

Sequenced:

  1. Configurable VLLM_PORT — DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000.
  2. Local-path / fine-tuned model support — DONE, v0.23.0:0. Catalog/ModelDef gained local_path (exactly one of repo/local_path); swap bind-mounts the dir into the vLLM container at the same path via the launch script's VLLM_SPARK_EXTRA_DOCKER_ARGS hook (no launch-cluster.sh change); "+ Add local model" form + local badge; disk-delete refused for local models; validate_local_path boundary check. His merged ten31-v2 was the motivating case.
  3. Configurable topology — DONE, v0.24.0:0. Three optional Configure-Sparks knobs: vLLM container name (VLLM_CONTAINER, blank ⇒ vllm_node; threaded through the swap log-tail + pre-flight validator via quote_arg); "services to hide" (DISABLED_SERVICES, comma list — hidden services show no tile and are skipped by status/deep-health/connectivity probes, killing the Parakeet-on-8000 collision); and a second-Spark vLLM monitor via a kind: vllm custom service in services-overrides.yaml (read-only tile probed through the shared probe_vllm_endpoint). /api/endpoints gained a disabled flag. Covers report P4/P5/#6. (Generalizing the swap mechanism to the adopter's raw docker run was deliberately left out — that's coordination, item 4; he swaps via his own crons and uses Spark Control to monitor.)
  4. Coordination layer — build when our own automation actually lands (zero value until something other than the dashboard swaps models):
    • Swap lock with holder + TTL (POST / GET / DELETE /api/swap/lock). An external scheduler acquires it before swapping; the dashboard then refuses manual swaps and shows who holds the GPU and until when. Enforced by the swap path, not advisory.
    • Swap-event webhook (swap_complete / swap_failed) to a configurable URL, so downstream consumers update their provider config when the running model changes.
    • Schedule visibility — read-only view the dashboard surfaces, registered by external schedulers (Spark Control does not own the schedule).
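The swap lock in item 4 is not built yet, so the following is only a sketch of the semantics described there: one holder at a time, a TTL after which the lock lapses, and a check that the swap path itself calls so the lock is enforced rather than advisory. The class and method names (SwapLock, check_swap_allowed, SwapLockHeld) are hypothetical, and the state is in-memory only.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LockInfo:
    holder: str
    expires_at: float  # in the units of the injected clock


class SwapLockHeld(Exception):
    """Raised when a swap is attempted while another holder owns the lock."""


class SwapLock:
    def __init__(self, clock: Callable[[], float] = time.monotonic):
        # Monotonic time is safe for TTL math; a UI would need wall-clock
        # time as well to display "until when".
        self._clock = clock
        self._info: Optional[LockInfo] = None

    def current(self) -> Optional[LockInfo]:
        """Return the live lock, dropping it first if its TTL has passed."""
        if self._info is not None and self._clock() >= self._info.expires_at:
            self._info = None
        return self._info

    def acquire(self, holder: str, ttl_seconds: float) -> LockInfo:
        """Take the lock, or renew it if the same holder already has it."""
        info = self.current()
        if info is not None and info.holder != holder:
            raise SwapLockHeld(f"GPU held by {info.holder}")
        self._info = LockInfo(holder, self._clock() + ttl_seconds)
        return self._info

    def release(self, holder: str) -> bool:
        """Release only if the caller is the current holder."""
        info = self.current()
        if info is None or info.holder != holder:
            return False
        self._info = None
        return True

    def check_swap_allowed(self, requester: str) -> None:
        """Call at the top of the swap path so the lock is actually enforced."""
        info = self.current()
        if info is not None and info.holder != requester:
            raise SwapLockHeld(f"GPU held by {info.holder}")
```

In this sketch POST, GET and DELETE on /api/swap/lock would map onto acquire, current and release; a persistent store would be needed if the lock has to survive a restart.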

Near term

  • parakeet-asr long-audio memory guard — deferred 2026-06-15, low priority. A duration cap on /v1/audio/diarize: Sortformer runs the whole file in one pass (diarizer.py:128-135) over Spark 2's shared 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. Precautionary — no observed incident, and the production consumer (Recap Relay) already chunks via /diarize-chunk (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full /diarize. When picked up: add a configurable MAX_DIARIZE_SECONDS guard in diarizer.py right after duration is computed (~line 130) → raise → HTTP 413 in main.py (mirrors the existing MAX_UPLOAD_MB 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
  • Controlled concurrency sweep of the audio endpoints in a quiet window — replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
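The long-audio guard described in the first item above could look roughly like this. MAX_DIARIZE_SECONDS is the name used in this roadmap; the exception class, the helper names, and the 1800-second default are placeholders chosen for the sketch, and the translation to HTTP 413 is left to the web layer.

```python
import os


class AudioTooLongError(Exception):
    """Hypothetical exception; the HTTP layer would translate it to a 413."""

    def __init__(self, duration: float, limit: float):
        super().__init__(f"audio is {duration:.0f}s, limit is {limit:.0f}s")
        self.duration = duration
        self.limit = limit


def max_diarize_seconds(default: float = 1800.0) -> float:
    """Read MAX_DIARIZE_SECONDS; blank, bad, or non-positive values fall back.

    The default here is a placeholder, not a measured limit.
    """
    raw = os.environ.get("MAX_DIARIZE_SECONDS", "").strip()
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def check_duration(duration: float, limit: float) -> None:
    """Call right after the duration is computed, before the diarizer runs."""
    if duration > limit:
        raise AudioTooLongError(duration, limit)
```

Raising before the single-pass model runs is the point of the guard: the request is rejected while memory use is still small.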

Audio quality

  • Echo cancellation for dual-channel label-merge — removes the mic-bleed limit when the local user isn't wearing headphones.
  • LLM "referee" pass for low-confidence label-merge speaker naming.

Platform hardening

  • Qdrant auth (API key) + scheduled snapshots/backups.
  • Observability: request metrics + GPU-busy tracking, so load questions are answered from data instead of log archaeology.
  • API-key auth on Spark Control — only if public (non-VPN) exposure is ever needed; current stance is LAN + split-tunnel VPN only.

Throughput (only if audio load outgrows one GPU)

  • Second audio worker / queueing layer; revisit which services share Spark 2.

Dashboard

  • Per-model configurable vLLM flags editable from the UI (today: edit models.yaml and rebuild).
  • Spark host update actions (OS/driver) from the UI.
  • Open WebUI link-out integration; richer per-service detail views.

Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)

P0/P1 security findings are all fixed in v0.19.0:0. Remaining, none blocking:

P2 — track:

  • No automated tests beyond the two redaction suites — swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for build_launch_command (incl. injection cases), swap transitions, and _merge_words_with_speakers is the highest-value start.
  • Loose dependency floors permit vulnerable python-multipart/starlette (DoS CVEs) on rebuild; no lockfile; no upload size caps (pyproject.toml).
  • Opaque HTTP 500 on POST /api/models / PUT /knobs when MODELS_OVERRIDES unset in dev (write to read-only /data) — catch the OSError.
  • NGC API key still appears on the remote process command line (nim.py) — the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
  • Global mutable catalog reassigned via global, shared across async requests with no snapshot (server.py) — latent race as concurrency grows.
  • Container runs uvicorn as root bound to 0.0.0.0:9999 (no USER in Dockerfile) — amplifies any RCE blast radius.
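As a starting point for the injection-case tests suggested in the first P2 item, the sketch below pairs a boundary validator with shell quoting and tests both. It does not reproduce the package's own build_launch_command or quote_arg, whose signatures are not shown here: validate_container_name and log_tail_command are hypothetical stand-ins, and shlex.quote from the standard library plays the role of the quoting helper.

```python
import re
import shlex

# Character set Docker accepts for container names.
_CONTAINER_NAME = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_.-]+")


def validate_container_name(name: str, default: str = "vllm_node") -> str:
    """Blank falls back to the default; anything outside the charset is rejected."""
    name = name.strip()
    if not name:
        return default
    if not _CONTAINER_NAME.fullmatch(name):
        raise ValueError(f"invalid container name: {name!r}")
    return name


def log_tail_command(container: str, lines: int = 50) -> str:
    """Hypothetical stand-in for a command string sent to the remote host."""
    return f"docker logs --tail {int(lines)} {shlex.quote(container)}"


def test_injection_attempts_are_rejected():
    for bad in ["vllm; rm -rf /", "$(reboot)", "a b", "-v"]:
        try:
            validate_container_name(bad)
        except ValueError:
            continue
        raise AssertionError(f"accepted unsafe name: {bad!r}")


def test_quoting_is_a_second_line_of_defence():
    # Even if validation were bypassed, the name stays a single shell word.
    assert log_tail_command("vllm; rm -rf /") == "docker logs --tail 50 'vllm; rm -rf /'"
    assert log_tail_command("vllm-gemma4") == "docker logs --tail 50 vllm-gemma4"
```

The test functions use plain assert so pytest can collect them without extra imports.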

P3 — bulk-fix when next touching docs/packaging:

  • README Status block stale (v0.2.3 / 0.13.0:4 → now v0.19.0:0); deprecated @app.on_event + hardcoded app.version="0.1.0"; NimInstallBody.register shadows BaseModel (rename → register_service); httpx class names leak into TTS/speech-models error text; one unescaped innerHTML sink (app.js) + task_id reflected in scrub JSON.
  • Packaging: marketingUrl/packageRepo/upstreamRepo are example.com placeholders; broken instructions.md source link; per-service SSH users (parakeet_user etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); Makefile builds only x86 though the manifest declares aarch64.
  • Hardening misc: no body/upload size limits on /v1/audio/*, /v1/chat/completions, /scrub; int(_env(...)) startup crash on bad VLLM_PORT (since addressed by the v0.22.0:0 numeric-setting hardening, item 1 above); upstream error text echoed to clients.
  • StartOS registry (only if ever pursuing it): source must be public + real repo URLs.
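The "blank or bad value falls back instead of crashing startup" behaviour mentioned for VLLM_PORT can be sketched as below. The 8888 default comes from this roadmap; the function name and the range check are assumptions, not the package's actual helper.

```python
import os
from typing import Optional


def parse_port(raw: Optional[str], default: int = 8888) -> int:
    """Parse a port setting, returning the default for blank or invalid input."""
    text = (raw or "").strip()
    try:
        port = int(text)
    except ValueError:
        # Covers blank strings and non-numeric values alike.
        return default
    # Out-of-range numbers are treated like any other bad value.
    return port if 1 <= port <= 65535 else default


# Usage: a missing or malformed VLLM_PORT no longer stops the process.
vllm_port = parse_port(os.environ.get("VLLM_PORT"))
```

Falling back silently keeps startup alive; logging the rejected value would make a misconfiguration easier to spot.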