26070eb191
Make the cluster topology configurable so an adopter wired differently (vLLM on both Sparks, port 8000, different container name, no Parakeet) can monitor without forking. Covers the OpenClaw report P4/P5/#6. - VLLM_CONTAINER override (default vllm_node), validated at the boundary and quote_arg-quoted into the swap log-tail + pre-flight validator exec. - DISABLED_SERVICES list: hidden services show no tile and are skipped by status/deep-health/connectivity probes (kills the Parakeet-on-8000 collision). - kind: vllm custom service monitors a second Spark's vLLM via the shared probe_vllm_endpoint; /api/endpoints gains a disabled flag. Swap mechanism intentionally not generalized to raw docker run (that's coordination, roadmap item 4).
58 lines
8.0 KiB
Markdown
58 lines
8.0 KiB
Markdown
# ROADMAP
|
|
|
|
Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.
|
|
|
|
## Cluster coordination — OpenClaw coexistence (committed 2026-06-17, from Johnny 5 report 2026-06-16)
|
|
|
|
Driven by the one other Spark Control adopter (a colleague running OpenClaw + cron jobs against his own dual Sparks; report at the date above). His cluster is configured differently from ours (vLLM on **both** Sparks, port 8000, raw `docker run`, container `vllm-gemma4`) and an automated cron physically swaps models — so his notes are partly *portability gaps* (the package hard-codes our layout) and partly *coordination gaps* (his dashboard and his crons fight over the GPU).
|
|
|
|
**Design stance (decided):** Spark Control is the **control plane / GPU arbiter, not a job runner.** Recurring business pipelines (his "Daily Vol" generator; our own future scheduled jobs) live in *separate* application services that *call* Spark Control's swap API. The dividing line is what a scheduled job *does*: control-plane actions (swap a model, warm it, restart a service, run a health sweep) are in scope for an in-package scheduler; business logic (scrape / summarize / build / deploy) stays in the app layer. Swaps are already API-driven (`POST /api/swap` → `GET /api/swap/{id}` / `…/stream`, `POST /api/swap/{key}/validate`) and non-browser clients pass the CSRF guard, so an external scheduler can drive swaps **today** — the items below add the *safety* layer, not the capability.
|
|
|
|
Sequenced:
|
|
1. **Configurable `VLLM_PORT`** — DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000.
|
|
2. **Local-path / fine-tuned model support** — DONE, v0.23.0:0. Catalog/`ModelDef` gained `local_path` (exactly one of `repo`/`local_path`); swap bind-mounts the dir into the vLLM container at the same path via the launch script's `VLLM_SPARK_EXTRA_DOCKER_ARGS` hook (no `launch-cluster.sh` change); "+ Add local model" form + `local` badge; disk-delete refused for local models; `validate_local_path` boundary check. His merged `ten31-v2` was the motivating case.
|
|
3. **Configurable topology** — DONE, v0.24.0:0. Three optional Configure-Sparks knobs: vLLM container name (`VLLM_CONTAINER`, blank ⇒ `vllm_node`; threaded through the swap log-tail + pre-flight validator via `quote_arg`); "services to hide" (`DISABLED_SERVICES`, comma list — hidden services show no tile and are skipped by status/deep-health/connectivity probes, killing the Parakeet-on-8000 collision); and a second-Spark vLLM monitor via a `kind: vllm` custom service in `services-overrides.yaml` (read-only tile probed through the shared `probe_vllm_endpoint`). `/api/endpoints` gained a `disabled` flag. Covers report P4/P5/#6. (Generalizing the *swap* mechanism to the adopter's raw `docker run` was deliberately left out — that's coordination, item 4; he swaps via his own crons and uses Spark Control to monitor.)
|
|
4. **Coordination layer** — build when our own automation actually lands (zero value until something other than the dashboard swaps models):
|
|
- **Swap lock** with holder + TTL (`POST` / `GET` / `DELETE /api/swap/lock`). An external scheduler acquires it before swapping; the dashboard then refuses manual swaps and shows who holds the GPU and until when. Enforced by the swap path, not advisory.
|
|
- **Swap-event webhook** (`swap_complete` / `swap_failed`) to a configurable URL, so downstream consumers update their provider config when the running model changes.
|
|
- **Schedule visibility** — read-only view the dashboard surfaces, *registered by* external schedulers (Spark Control does not own the schedule).
|
|
|
|
## Near term
|
|
- parakeet-asr long-audio memory guard — **deferred 2026-06-15, low priority.** A duration cap on `/v1/audio/diarize`: Sortformer runs the whole file in one pass (`diarizer.py:128-135`) over Spark 2's *shared* 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. **Precautionary — no observed incident**, and the production consumer (Recap Relay) already chunks via `/diarize-chunk` (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full `/diarize`. When picked up: add a configurable `MAX_DIARIZE_SECONDS` guard in `diarizer.py` right after `duration` is computed (~line 130) → raise → HTTP 413 in `main.py` (mirrors the existing `MAX_UPLOAD_MB` 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
|
|
- Controlled concurrency sweep of the audio endpoints in a quiet window — replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
|
|
|
|
## Audio quality
|
|
- Echo cancellation for dual-channel label-merge — removes the mic-bleed limit when the local user isn't wearing headphones.
|
|
- LLM "referee" pass for low-confidence label-merge speaker naming.
|
|
|
|
## Platform hardening
|
|
- Qdrant auth (API key) + scheduled snapshots/backups.
|
|
- Observability: request metrics + GPU-busy tracking, so load questions are answered from data instead of log archaeology.
|
|
- API-key auth on Spark Control — only if public (non-VPN) exposure is ever needed; current stance is LAN + split-tunnel VPN only.
|
|
|
|
## Throughput (only if audio load outgrows one GPU)
|
|
- Second audio worker / queueing layer; revisit which services share Spark 2.
|
|
|
|
## Dashboard
|
|
- Per-model configurable vLLM flags editable from the UI (today: edit `models.yaml` and rebuild).
|
|
- Spark host update actions (OS/driver) from the UI.
|
|
- Open WebUI link-out integration; richer per-service detail views.
|
|
|
|
## Tech debt (from the 2026-06-12 full-eval — see EVALUATION.md)
|
|
|
|
P0/P1 security findings are all fixed in v0.19.0:0. Remaining, none blocking:
|
|
|
|
**P2 — track:**
|
|
- No automated tests beyond the two redaction suites — swap state machine, proxies, SSH wrapper, and the StartOS package are untested; live-cluster paths (swap exec, audio, embeddings/search) are exercised only by hand. Biggest coverage gap; a small pytest harness for `build_launch_command` (incl. injection cases), swap transitions, and `_merge_words_with_speakers` is the highest-value start.
|
|
- Loose dependency floors permit vulnerable `python-multipart`/`starlette` (DoS CVEs) on rebuild; no lockfile; no upload size caps (`pyproject.toml`).
|
|
- Opaque HTTP 500 on `POST /api/models` / `PUT /knobs` when `MODELS_OVERRIDES` unset in dev (write to read-only `/data`) — catch the `OSError`.
|
|
- NGC API key still appears on the remote process command line (`nim.py`) — the quote-breakout risk is fixed; pass via stdin/env to also remove the process-list exposure.
|
|
- Global mutable `catalog` reassigned via `global`, shared across async requests with no snapshot (`server.py`) — latent race as concurrency grows.
|
|
- Container runs uvicorn as **root** bound to `0.0.0.0:9999` (no `USER` in Dockerfile) — amplifies any RCE blast radius.
|
|
|
|
**P3 — bulk-fix when next touching docs/packaging:**
|
|
- README Status block stale (`v0.2.3 / 0.13.0:4` → now v0.19.0:0); deprecated `@app.on_event` + hardcoded `app.version="0.1.0"`; `NimInstallBody.register` shadows `BaseModel` (rename → `register_service`); httpx class names leak into TTS/speech-models error text; one unescaped `innerHTML` sink (`app.js`) + `task_id` reflected in scrub JSON.
|
|
- Packaging: `marketingUrl`/`packageRepo`/`upstreamRepo` are `example.com` placeholders; broken `instructions.md` source link; per-service SSH users (`parakeet_user` etc.) absent from the Configure-Sparks action inputSpec (silent default-empty); `Makefile` builds only x86 though the manifest declares `aarch64`.
|
|
- Hardening misc: no body/upload size limits on `/v1/audio/*`, `/v1/chat/completions`, `/scrub`; `int(_env(...))` startup crash on bad `VLLM_PORT`; upstream error text echoed to clients.
|
|
- StartOS registry (only if ever pursuing it): source must be public + real repo URLs.
|