v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API

- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests) - Add embeddings proxy and spark_embed service (Dockerfile + main.py) - Expand audio_proxy with speaker-aware handling; deep_health/health/server updates - Package: configureSparks action + sparkConfig model updates, manifest/main wiring - Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:21 -05:00
parent 4a75274db3
commit 8d839e3714
37 changed files with 3763 additions and 197 deletions
@@ -2,11 +2,14 @@

 A browser-based control panel for a dual-DGX-Spark vLLM cluster. Designed to run as a [StartOS 0.4](https://docs.start9.com/packaging/0.4.0.x/) package on a Start9 server on the same LAN as the Sparks.

+> **If you've just received this package from someone**, start with [HANDOFF.md](./HANDOFF.md) — it has the prereq checklist and a step-by-step install guide written for a fresh user.
+
 ## What it does

- Shows which LLM is currently loaded on the cluster (`:8888/v1/models`).
+- Shows which LLM is currently loaded on the cluster (`<spark1-host>:8888/v1/models`).
 - Click to swap to a different model — stops the current one, launches the new one, streams logs to the UI until `Application startup complete.` appears.
- Surfaces health for Parakeet (STT, `:8000`) and Magpie (TTS, `:9000`) on Spark 2.
+- Surfaces health for Parakeet (STT, `:8000`) and Kokoro (TTS, `:8880`) on Spark 2.
+- Proxies OpenAI-compatible chat-completions, transcribe, diarize, and TTS through one trusted host so external apps only need to know about Spark Control.

 ## Architecture

@@ -32,16 +35,16 @@ cd image
 python3 -m venv .venv && source .venv/bin/activate
 pip install -e .
 export SPARK1_HOST=<spark-1-ip>
-export SPARK1_USER=<spark-user>
+export SPARK1_USER=<your-ssh-user>
 export SPARK2_HOST=<spark-2-ip>
-export SPARK2_USER=<spark-user>
-export SSH_KEY_PATH="$HOME/Library/Application Support/NVIDIA/Sync/config/nvsync.key"
+export SPARK2_USER=<your-ssh-user>
+export SSH_KEY_PATH=<path-to-your-private-key>
 uvicorn app.server:app --host 0.0.0.0 --port 9999 --reload
 ```

 Open <http://localhost:9999>.

-> **Note:** use the **IP** `<spark-1-ip>` for Spark 1, not `<spark-1-host>.local`. mDNS resolves to IPv6 first and `httpx` hangs on it because vLLM only binds IPv4.
+> **Note:** prefer the **IP** for Spark 1 over a `.local` hostname. mDNS can resolve to IPv6 first, and `httpx` will hang on it because vLLM only binds IPv4.

 ## Build the StartOS package

@@ -49,6 +52,8 @@ Open <http://localhost:9999>.
 cd package
 npm i        # one-time
 make x86     # produces spark-control_x86_64.s9pk (~55 MB)
+# or
+make aarch64 # for ARM-based Start9 servers
 ```

 Requires [`start-cli`](https://docs.start9.com/latest/developer-guide/sdk/installing-the-sdk), Node ≥ 22, Docker. The build runs `tsc` + `ncc` for the TS bundle, then `docker build` on `image/Dockerfile`, then `start-cli s9pk pack` to produce the `.s9pk`.
@@ -57,15 +62,18 @@ To sideload onto your Start9: `make install` (needs `host:` set in `~/.startos/c

 ## Post-install setup (one-time per Start9 install)

-1. Open the Spark Control service → **Actions** → **Show Public Key** → copy the line.
-2. SSH to each Spark and append the line to `~/.ssh/authorized_keys` for the `<spark-user>` user.
-3. **Actions** → **Configure Sparks** → enter `<spark-1-ip>` / `<spark-user>` for Spark 1 and `<spark-2-ip>` / `<spark-user>` for Spark 2.
+1. Open the Spark Control service → **Actions** → **Show Public Key** → copy the produced one-liner.
+2. Run that one-liner from any machine that already has SSH access to your Sparks. It appends the package's pubkey to `~/.ssh/authorized_keys` on each Spark.
+3. **Actions** → **Configure Sparks** → enter your Spark 1 / Spark 2 IPs and the SSH username you use to log into them.
 4. Start the service. Open the Web UI — current model + health should show within ~5 s.

+See [HANDOFF.md](./HANDOFF.md) for a fuller prereq checklist and the hardware-side setup required *before* this package is useful.
+
 ## Repo layout

 - `image/` — Docker image source (FastAPI app + `models.yaml`)
 - `package/` — StartOS 0.4 package source
+- `HANDOFF.md` — prereqs + first-time install guide for a fresh user
 - `runbook.md` — operating notes
 - `known-issues.md` — known quirks and workarounds
 - `LICENSE` — MIT
@@ -76,9 +84,9 @@ Other services on your LAN can hit `GET /api/endpoints` to learn where the curre

 ```json
 {
-  "vllm":    { "ready": true,  "base_url": "http://<spark-1-ip>:8888/v1", "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "openai_compat": true },
-  "parakeet":{ "ready": true,  "base_url": "http://<spark-2-ip>:8000",   "kind": "stt", "model": "nvidia/parakeet-tdt-0.6b-v3" },
-  "magpie":  { "ready": false, "base_url": "http://<spark-2-ip>:9000",   "kind": "tts" }
+  "vllm":    { "ready": true,  "base_url": "http://<spark1-host>:8888/v1", "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "openai_compat": true },
+  "parakeet":{ "ready": true,  "base_url": "http://<spark2-host>:8000",   "kind": "stt", "model": "nvidia/parakeet-tdt-0.6b-v3" },
+  "kokoro":  { "ready": true,  "base_url": "http://<spark2-host>:8880",   "kind": "tts" }
 }
 ```

@@ -86,7 +94,7 @@ Other services on your LAN can hit `GET /api/endpoints` to learn where the curre

 ## Reporting failures from external apps

-Spark Control polls every 5 s, so a brief blip in Parakeet/Magpie/vLLM availability can slip between polls and never make it into the connectivity log. To capture short failures, an external app (e.g. Open WebUI) can POST whenever a call fails (or succeeds):
+Spark Control polls every 5 s, so a brief blip in Parakeet/Kokoro/vLLM availability can slip between polls and never make it into the connectivity log. To capture short failures, an external app (e.g. Open WebUI) can POST whenever a call fails (or succeeds):

 ```bash
 curl -X POST http://<dashboard-url>/api/health-event \
@@ -104,15 +112,17 @@ Fields: `service` (required), `ok` (required), `source` (optional, free-form), `

 ## Status

-**v0.2.3** — installed and verified on a Start9 server. Five bundled LLMs in the catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen2.5-72b), plus any custom models added through the UI.
+**v0.2.3 / s9pk version 0.13.0:4** — installed and verified on a Start9 server. Five bundled LLMs in the catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen2.5-72b), plus any custom models added through the UI.

 ### What v0.2 added on top of v0.1

 - **Service discovery API** (`/api/endpoints`) for other LAN services
- **Magpie crash fix** documented (chown the model-cache volume to uid 1000)
- **Always-on services panel** with Start/Stop/Restart for Parakeet + Magpie, plus per-service host configuration in Configure Sparks (so Parakeet/Magpie can live on Spark 1, Spark 2, or anywhere)
+- **Kokoro-82M TTS** replaces Magpie/Riva NIM as the default TTS backend (v0.14.0). Magpie's decoder had a ~30-50% truncation rate on multi-sentence inputs and ate 49 GB of GPU memory; Kokoro is 24/24 reliable at every input length tested, uses 1.3 GB GPU, and renders in ~1s. See HANDOFF.md and the release notes for the migration story.
+- **Always-on services panel** with Start/Stop/Restart for Parakeet + Kokoro, plus per-service host configuration in Configure Sparks (so they can live on Spark 1, Spark 2, or anywhere)
 - **Model download** from the dashboard — paste an HF repo, pick solo or cluster, watch percent progress with bytes/rate/ETA. After completion, an "Add to catalog" dialog appears pre-filled.
 - **spark-vllm-docker update check** — banner shows "N commits behind upstream"; Apply Update runs `git pull && ./build-and-copy.sh -c` over SSH with a streamed log
 - **Per-model Advanced settings** — knobs for max context, GPU memory %, and three optimization toggles (fastsafetensors, prefix caching, FP8 KV cache). Persisted to `/data/models-overrides.yaml` so they survive package updates. Bundled and custom models alike.
+- **Diarization with speaker fingerprints** via Sortformer + TitaNet, exposed at `/api/audio/diarize-chunk` for chunked workflows
+- **OpenAI chat-completions proxy** (`/v1/chat/completions`, `/v1/completions`) — forwards to the loaded vLLM so external apps need only one trusted host

 v0.3+ roadmap (loose): richer dashboard (SSH/GPU/tokens-per-sec), Open WebUI deep-link integration, optional auth, multi-cluster.