v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API

- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
This commit is contained in:
Keysat
2026-06-11 17:45:21 -05:00
parent 4a75274db3
commit 8d839e3714
37 changed files with 3763 additions and 197 deletions
+168
View File
@@ -0,0 +1,168 @@
# Spark Control — handoff guide
You've received a `spark-control.s9pk` file. This guide gets you from "fresh install" to "working dashboard" in about an hour, most of which is waiting for downloads.
## What this is
Spark Control is a StartOS 0.4 package that runs on your Start9 server and gives you a browser dashboard for a **dual-DGX-Spark vLLM cluster**. From the dashboard you can:
- See which LLM is currently loaded
- Swap to a different LLM with one click (live log streaming until ready)
- Download new LLM weights from HuggingFace
- Install and monitor audio services (Parakeet STT, Kokoro TTS, Sortformer diarization)
- Expose OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/audio/transcriptions`, `/v1/audio/speech`, etc.) to other apps on your LAN through a single trusted host
It does **not** run any models itself — it's a controller. The actual GPU work happens on your two Sparks. Spark Control SSHes into Spark 1 to invoke `launch-cluster.sh`, and HTTP-polls both Sparks for health.
---
## Prerequisites before installing the s9pk
You need all of the following set up **first**. The s9pk assumes they exist.
### Hardware
- A **Start9 server** running StartOS 0.4.x with sideload-install enabled.
- **Two NVIDIA DGX Sparks** (or similar boxes with NVIDIA GPUs + Docker). One will be "Spark 1" (head node) and one will be "Spark 2" (worker node + audio services). They must be on the same LAN as the Start9 server.
### Spark 1 (the head node)
- A Linux user account you can SSH into (any username — `ubuntu`, `nvidia`, your own — just be consistent). Note the username; you'll enter it later.
- **Docker + NVIDIA Container Toolkit** installed and working.
- **`~/spark-vllm-docker/`** cloned from the community repo:
```bash
git clone https://github.com/eugr/spark-vllm-docker ~/spark-vllm-docker
cd ~/spark-vllm-docker
./build-and-copy.sh -c # builds the vLLM container image
```
> **The path matters.** Spark Control hardcodes `~/spark-vllm-docker` as the working directory for cluster commands. If you clone it elsewhere, the dashboard's swap and download actions will silently fail.
- A HuggingFace cache at `~/.cache/huggingface/hub/`. Either pre-download one model now, or use the dashboard's "Download a new model" button after install.
### Spark 2 (the worker node)
- Same Linux user account as Spark 1, with passwordless SSH from Spark 1 working.
- **Docker + NVIDIA Container Toolkit** installed.
- That's it — the rest can be installed through the Spark Control dashboard once it's running.
### Optional but recommended
- An **NVIDIA NGC personal API key** if you want to install Parakeet (STT) from `nvcr.io`. Free: <https://ngc.nvidia.com/setup/personal-key>. Starts with `nvapi-...`. (Not needed for Kokoro — it's Apache 2.0 and pulls from a public GitHub Container Registry image with no auth.)
---
## Install steps
### 1. Sideload the s9pk
In your Start9 web UI, go to **Sideload Service** and upload the `spark-control_*.s9pk` file (x86_64 or aarch64 depending on your Start9). Install it.
### 2. Start the service once
The first start generates an ed25519 SSH keypair inside the package volume. Wait until the service shows "Running" status — should take only a few seconds.
### 3. Show the public key and install it on both Sparks
- Open Spark Control → **Actions → Show Public Key**.
- If you haven't run Configure Sparks yet, you'll just see the raw key. Skip to step 4, then come back here.
- Once Configure Sparks is filled in, this action produces a **ready-to-paste install command** (a multi-line `ssh ... 'echo ... >> authorized_keys'` block). Copy the entire block.
- Run it in a terminal on a machine that already has SSH access to your Sparks. You'll be prompted for each Spark's SSH password once. After it completes, the Start9 server can SSH into both Sparks.
### 4. Configure Sparks
- Open Spark Control → **Actions → Configure Sparks**.
- Fill in:
- **Spark 1 hostname or IP** — prefer the **IP** (e.g. `192.168.1.x`) over `.local` hostnames; vLLM only binds IPv4 and mDNS can resolve to IPv6 first.
- **Spark 1 SSH user** — whatever username you set up on Spark 1.
- **Spark 2 hostname or IP** + **SSH user** — same idea.
- Optional Parakeet/Kokoro overrides — leave blank if those services run on Spark 2 (the normal case).
- Optional **Open WebUI URL** — paste your Open WebUI LAN URL to get a deep-link button in the dashboard next to the current model.
- Optional **NGC API key** — paste it here if you have one.
Save.
### 5. Re-run Show Public Key (if you skipped earlier)
Now that hosts are configured, Show Public Key will give you the paste-ready install command. Run it as described in step 3.
### 6. Open the Web UI
From the Spark Control service page, click the Web UI button. You should see:
- A **top status bar** with the currently loaded LLM (or "no model loaded" if Spark 1's vLLM container is fresh).
- An **LLM tab** with cards for each model in the bundled catalog. Models you've downloaded show "on disk" badges; others show "not downloaded".
- An **Audio / Speech tab** with health status and Install / Start / Stop / Restart buttons for Parakeet and Kokoro.
If the dashboard loads and both Spark hardware cards show CPU/RAM/GPU stats, **you're in**.
### 7. Load your first LLM
Click **"Switch to this"** on any model card. The dashboard will:
1. SSH into Spark 1, stop any running vLLM container.
2. Run `launch-cluster.sh` with the model's bundled flags.
3. Stream `docker logs -f` back to your browser until `Application startup complete.` appears.
4. Mark the new model as active.
Typical times: solo-mode models (Qwen3.6, Gemma 4) take ~35 min. Cluster-mode models (Qwen3-VL 235B) take ~58 min — they have to coordinate across both Sparks via Ray.
### 8. (Optional) install audio services
From the Audio / Speech tab, click **Install Parakeet**. This pulls and starts the parakeet-asr container on Spark 2 with appropriate settings. Takes ~23 min for the first install.
For diarization with speaker fingerprints, also click **Reapply patches** — that overlays Sortformer + TitaNet support onto the parakeet container. The patches survive `docker restart` but are wiped by `docker rm`; if you ever recreate the container, re-run Reapply patches.
Kokoro TTS is similar — pull `ghcr.io/remsky/kokoro-fastapi-gpu:latest` on Spark 2 and run with `--gpus all -p 8880:8880`. No NGC key required (Kokoro is Apache 2.0). Boots in ~5 seconds and uses only ~1.3 GB of GPU memory. (A one-click Kokoro install action is planned for a near-future release; for now you can install it manually or Spark Control will pick it up automatically once it's running on port 8880.)
---
## Endpoints exposed to your other apps
Once Spark Control is healthy, your other LAN apps can hit it as a single trusted backend:
| Path | Backend | Notes |
|---|---|---|
| `GET /api/endpoints` | (self) | Service discovery — JSON of base_urls + ready flags. Hit this first so you don't have to hardcode Spark IPs in other apps. |
| `POST /v1/chat/completions` | vLLM on Spark 1 | OpenAI-compatible; supports `stream: true` |
| `POST /v1/completions` | vLLM on Spark 1 | Legacy OpenAI completions |
| `POST /v1/audio/transcriptions` | Parakeet on Spark 2 | OpenAI-compatible STT |
| `POST /v1/audio/speech` | Kokoro on Spark 2 | OpenAI-compatible TTS. Default voice `bm_george`; pass `voice` to pick any of Kokoro's 67 voices. Reliable at any input length (no chunking/retry needed). |
| `POST /api/audio/diarize-chunk` | Sortformer + TitaNet | Per-chunk diarization with voice fingerprints for cross-chunk re-clustering |
| `POST /api/audio/transcribe-with-speakers` | Parakeet + Sortformer | One-shot transcribe + diarize, merged |
All of these inherit Spark Control's TLS cert and StartOS access controls. You only need one allowlist entry in downstream apps.
---
## Operational notes
- **vLLM does not auto-load a model after a power loss.** When your Sparks reboot, the dashboard will show "no model loaded" — you click "Switch to this" on whichever LLM you want. Parakeet/Kokoro auto-restart with their containers (Kokoro is `--restart unless-stopped` and Parakeet runs the same way).
- **Single-slot chunked workflows.** If you're calling `/v1/audio/transcriptions` or `/api/audio/diarize-chunk` in chunked workflows, send chunks **sequentially**, not in parallel. Parallel requests can trigger a known cuFFT race on the Spark 2 GPU that returns a 503 + Retry-After. Spark Control recovers automatically but each retry costs ~60s.
- **Context window**: the bundled Qwen3.6 entry runs at 64K total tokens (input + output combined). Adjust per-model via the Advanced button on each card.
- **Update path**: model-catalog overrides and custom services live in `/data/*` inside the volume; they survive s9pk updates.
---
## Resources
- `README.md` — repo overview, build instructions, dev environment
- `runbook.md` — model-swap recipes and operating notes
- `known-issues.md` — debugging fixes (Mamba block-size, vision token budget, historical Magpie notes, etc.)
- Source: `image/` is the FastAPI app; `package/` is the StartOS wrapper. The s9pk build is `cd package && make x86` (or `aarch64`).
---
## If you're an AI agent helping with this install
A few things worth knowing:
- The codebase is **two halves**: `image/` is a standalone FastAPI app you can run with `uvicorn app.server:app` for local dev. `package/` is the StartOS wrapper. Changes to either should be coordinated.
- **All connection info** comes from environment variables in `image/app/config.py`, populated from `package/startos/fileModels/sparkConfig.yaml.ts` via the Configure Sparks action. No IPs, usernames, or paths are hardcoded in runtime code.
- The **path `~/spark-vllm-docker`** *is* hardcoded in `swap.py`, `download.py`, `updates.py`, and `models.py`. If the user has cloned the upstream repo elsewhere, either fix the path or symlink it.
- **Persistent state** lives at `/data/` inside the container: `config.yaml`, `models-overrides.yaml`, `services-overrides.yaml`, `connectivity.json`, `ssh/`. These survive package updates.
- The dashboard polls every 5 s; check `image/app/health.py` and `image/app/connectivity.py` for the probing logic. External apps can also POST failures to `/api/health-event` to log between-poll blips.
- Debugging audio issues: SSH into Spark 2 and run `docker logs --tail 100 parakeet-asr`. cuFFT errors usually mean parallel requests; see the operational note above.
- Debugging LLM swaps: the swap log is streamed in the browser, but the underlying `docker logs -f vllm_node` on Spark 1 is the ground truth.
- The package supports both `x86_64` and `aarch64` builds. Match your Start9 server architecture.