v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API

- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests) - Add embeddings proxy and spark_embed service (Dockerfile + main.py) - Expand audio_proxy with speaker-aware handling; deep_health/health/server updates - Package: configureSparks action + sparkConfig model updates, manifest/main wiring - Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:21 -05:00
parent 4a75274db3
commit 8d839e3714
37 changed files with 3763 additions and 197 deletions
@@ -0,0 +1,168 @@
+# Spark Control — handoff guide
+
+You've received a `spark-control.s9pk` file. This guide gets you from "fresh install" to "working dashboard" in about an hour, most of which is waiting for downloads.
+
+## What this is
+
+Spark Control is a StartOS 0.4 package that runs on your Start9 server and gives you a browser dashboard for a **dual-DGX-Spark vLLM cluster**. From the dashboard you can:
+
+- See which LLM is currently loaded
+- Swap to a different LLM with one click (live log streaming until ready)
+- Download new LLM weights from HuggingFace
+- Install and monitor audio services (Parakeet STT, Kokoro TTS, Sortformer diarization)
+- Expose OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/audio/transcriptions`, `/v1/audio/speech`, etc.) to other apps on your LAN through a single trusted host
+
+It does **not** run any models itself — it's a controller. The actual GPU work happens on your two Sparks. Spark Control SSHes into Spark 1 to invoke `launch-cluster.sh`, and HTTP-polls both Sparks for health.
+
+---
+
+## Prerequisites before installing the s9pk
+
+You need all of the following set up **first**. The s9pk assumes they exist.
+
+### Hardware
+
+- A **Start9 server** running StartOS 0.4.x with sideload-install enabled.
+- **Two NVIDIA DGX Sparks** (or similar boxes with NVIDIA GPUs + Docker). One will be "Spark 1" (head node) and one will be "Spark 2" (worker node + audio services). They must be on the same LAN as the Start9 server.
+
+### Spark 1 (the head node)
+
+- A Linux user account you can SSH into (any username — `ubuntu`, `nvidia`, your own — just be consistent). Note the username; you'll enter it later.
+- **Docker + NVIDIA Container Toolkit** installed and working.
+- **`~/spark-vllm-docker/`** cloned from the community repo:
+
+  ```bash
+  git clone https://github.com/eugr/spark-vllm-docker ~/spark-vllm-docker
+  cd ~/spark-vllm-docker
+  ./build-and-copy.sh -c    # builds the vLLM container image
+  ```
+
+  > **The path matters.** Spark Control hardcodes `~/spark-vllm-docker` as the working directory for cluster commands. If you clone it elsewhere, the dashboard's swap and download actions will silently fail.
+
+- A HuggingFace cache at `~/.cache/huggingface/hub/`. Either pre-download one model now, or use the dashboard's "Download a new model" button after install.
+
+### Spark 2 (the worker node)
+
+- Same Linux user account as Spark 1, with passwordless SSH from Spark 1 working.
+- **Docker + NVIDIA Container Toolkit** installed.
+- That's it — the rest can be installed through the Spark Control dashboard once it's running.
+
+### Optional but recommended
+
+- An **NVIDIA NGC personal API key** if you want to install Parakeet (STT) from `nvcr.io`. Free: <https://ngc.nvidia.com/setup/personal-key>. Starts with `nvapi-...`. (Not needed for Kokoro — it's Apache 2.0 and pulls from a public GitHub Container Registry image with no auth.)
+
+---
+
+## Install steps
+
+### 1. Sideload the s9pk
+
+In your Start9 web UI, go to **Sideload Service** and upload the `spark-control_*.s9pk` file (x86_64 or aarch64 depending on your Start9). Install it.
+
+### 2. Start the service once
+
+The first start generates an ed25519 SSH keypair inside the package volume. Wait until the service shows "Running" status — should take only a few seconds.
+
+### 3. Show the public key and install it on both Sparks
+
+- Open Spark Control → **Actions → Show Public Key**.
+- If you haven't run Configure Sparks yet, you'll just see the raw key. Skip to step 4, then come back here.
+- Once Configure Sparks is filled in, this action produces a **ready-to-paste install command** (a multi-line `ssh ... 'echo ... >> authorized_keys'` block). Copy the entire block.
+- Run it in a terminal on a machine that already has SSH access to your Sparks. You'll be prompted for each Spark's SSH password once. After it completes, the Start9 server can SSH into both Sparks.
+
+### 4. Configure Sparks
+
+- Open Spark Control → **Actions → Configure Sparks**.
+- Fill in:
+  - **Spark 1 hostname or IP** — prefer the **IP** (e.g. `192.168.1.x`) over `.local` hostnames; vLLM only binds IPv4 and mDNS can resolve to IPv6 first.
+  - **Spark 1 SSH user** — whatever username you set up on Spark 1.
+  - **Spark 2 hostname or IP** + **SSH user** — same idea.
+  - Optional Parakeet/Kokoro overrides — leave blank if those services run on Spark 2 (the normal case).
+  - Optional **Open WebUI URL** — paste your Open WebUI LAN URL to get a deep-link button in the dashboard next to the current model.
+  - Optional **NGC API key** — paste it here if you have one.
+
+Save.
+
+### 5. Re-run Show Public Key (if you skipped earlier)
+
+Now that hosts are configured, Show Public Key will give you the paste-ready install command. Run it as described in step 3.
+
+### 6. Open the Web UI
+
+From the Spark Control service page, click the Web UI button. You should see:
+
+- A **top status bar** with the currently loaded LLM (or "no model loaded" if Spark 1's vLLM container is fresh).
+- An **LLM tab** with cards for each model in the bundled catalog. Models you've downloaded show "on disk" badges; others show "not downloaded".
+- An **Audio / Speech tab** with health status and Install / Start / Stop / Restart buttons for Parakeet and Kokoro.
+
+If the dashboard loads and both Spark hardware cards show CPU/RAM/GPU stats, **you're in**.
+
+### 7. Load your first LLM
+
+Click **"Switch to this"** on any model card. The dashboard will:
+
+1. SSH into Spark 1, stop any running vLLM container.
+2. Run `launch-cluster.sh` with the model's bundled flags.
+3. Stream `docker logs -f` back to your browser until `Application startup complete.` appears.
+4. Mark the new model as active.
+
+Typical times: solo-mode models (Qwen3.6, Gemma 4) take ~3–5 min. Cluster-mode models (Qwen3-VL 235B) take ~5–8 min — they have to coordinate across both Sparks via Ray.
+
+### 8. (Optional) install audio services
+
+From the Audio / Speech tab, click **Install Parakeet**. This pulls and starts the parakeet-asr container on Spark 2 with appropriate settings. Takes ~2–3 min for the first install.
+
+For diarization with speaker fingerprints, also click **Reapply patches** — that overlays Sortformer + TitaNet support onto the parakeet container. The patches survive `docker restart` but are wiped by `docker rm`; if you ever recreate the container, re-run Reapply patches.
+
+Kokoro TTS is similar — pull `ghcr.io/remsky/kokoro-fastapi-gpu:latest` on Spark 2 and run with `--gpus all -p 8880:8880`. No NGC key required (Kokoro is Apache 2.0). Boots in ~5 seconds and uses only ~1.3 GB of GPU memory. (A one-click Kokoro install action is planned for a near-future release; for now you can install it manually or Spark Control will pick it up automatically once it's running on port 8880.)
+
+---
+
+## Endpoints exposed to your other apps
+
+Once Spark Control is healthy, your other LAN apps can hit it as a single trusted backend:
+
+| Path | Backend | Notes |
+|---|---|---|
+| `GET /api/endpoints` | (self) | Service discovery — JSON of base_urls + ready flags. Hit this first so you don't have to hardcode Spark IPs in other apps. |
+| `POST /v1/chat/completions` | vLLM on Spark 1 | OpenAI-compatible; supports `stream: true` |
+| `POST /v1/completions` | vLLM on Spark 1 | Legacy OpenAI completions |
+| `POST /v1/audio/transcriptions` | Parakeet on Spark 2 | OpenAI-compatible STT |
+| `POST /v1/audio/speech` | Kokoro on Spark 2 | OpenAI-compatible TTS. Default voice `bm_george`; pass `voice` to pick any of Kokoro's 67 voices. Reliable at any input length (no chunking/retry needed). |
+| `POST /api/audio/diarize-chunk` | Sortformer + TitaNet | Per-chunk diarization with voice fingerprints for cross-chunk re-clustering |
+| `POST /api/audio/transcribe-with-speakers` | Parakeet + Sortformer | One-shot transcribe + diarize, merged |
+
+All of these inherit Spark Control's TLS cert and StartOS access controls. You only need one allowlist entry in downstream apps.
+
+---
+
+## Operational notes
+
+- **vLLM does not auto-load a model after a power loss.** When your Sparks reboot, the dashboard will show "no model loaded" — you click "Switch to this" on whichever LLM you want. Parakeet/Kokoro auto-restart with their containers (Kokoro is `--restart unless-stopped` and Parakeet runs the same way).
+- **Single-slot chunked workflows.** If you're calling `/v1/audio/transcriptions` or `/api/audio/diarize-chunk` in chunked workflows, send chunks **sequentially**, not in parallel. Parallel requests can trigger a known cuFFT race on the Spark 2 GPU that returns a 503 + Retry-After. Spark Control recovers automatically but each retry costs ~60s.
+- **Context window**: the bundled Qwen3.6 entry runs at 64K total tokens (input + output combined). Adjust per-model via the Advanced button on each card.
+- **Update path**: model-catalog overrides and custom services live in `/data/*` inside the volume; they survive s9pk updates.
+
+---
+
+## Resources
+
+- `README.md` — repo overview, build instructions, dev environment
+- `runbook.md` — model-swap recipes and operating notes
+- `known-issues.md` — debugging fixes (Mamba block-size, vision token budget, historical Magpie notes, etc.)
+- Source: `image/` is the FastAPI app; `package/` is the StartOS wrapper. The s9pk build is `cd package && make x86` (or `aarch64`).
+
+---
+
+## If you're an AI agent helping with this install
+
+A few things worth knowing:
+
+- The codebase is **two halves**: `image/` is a standalone FastAPI app you can run with `uvicorn app.server:app` for local dev. `package/` is the StartOS wrapper. Changes to either should be coordinated.
+- **All connection info** comes from environment variables in `image/app/config.py`, populated from `package/startos/fileModels/sparkConfig.yaml.ts` via the Configure Sparks action. No IPs, usernames, or paths are hardcoded in runtime code.
+- The **path `~/spark-vllm-docker`** *is* hardcoded in `swap.py`, `download.py`, `updates.py`, and `models.py`. If the user has cloned the upstream repo elsewhere, either fix the path or symlink it.
+- **Persistent state** lives at `/data/` inside the container: `config.yaml`, `models-overrides.yaml`, `services-overrides.yaml`, `connectivity.json`, `ssh/`. These survive package updates.
+- The dashboard polls every 5 s; check `image/app/health.py` and `image/app/connectivity.py` for the probing logic. External apps can also POST failures to `/api/health-event` to log between-poll blips.
+- Debugging audio issues: SSH into Spark 2 and run `docker logs --tail 100 parakeet-asr`. cuFFT errors usually mean parallel requests; see the operational note above.
+- Debugging LLM swaps: the swap log is streamed in the browser, but the underlying `docker logs -f vllm_node` on Spark 1 is the ground truth.
+- The package supports both `x86_64` and `aarch64` builds. Match your Start9 server architecture.