v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API

- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:21 -05:00
parent 4a75274db3
commit 8d839e3714
37 changed files with 3763 additions and 197 deletions
@@ -0,0 +1,168 @@
+# Spark Control — handoff guide
+
+You've received a `spark-control.s9pk` file. This guide gets you from "fresh install" to "working dashboard" in about an hour, most of which is waiting for downloads.
+
+## What this is
+
+Spark Control is a StartOS 0.4 package that runs on your Start9 server and gives you a browser dashboard for a **dual-DGX-Spark vLLM cluster**. From the dashboard you can:
+
+- See which LLM is currently loaded
+- Swap to a different LLM with one click (live log streaming until ready)
+- Download new LLM weights from HuggingFace
+- Install and monitor audio services (Parakeet STT, Kokoro TTS, Sortformer diarization)
+- Expose OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/audio/transcriptions`, `/v1/audio/speech`, etc.) to other apps on your LAN through a single trusted host
+
+It does **not** run any models itself — it's a controller. The actual GPU work happens on your two Sparks. Spark Control SSHes into Spark 1 to invoke `launch-cluster.sh`, and HTTP-polls both Sparks for health.
+
+---
+
+## Prerequisites before installing the s9pk
+
+You need all of the following set up **first**. The s9pk assumes they exist.
+
+### Hardware
+
+- A **Start9 server** running StartOS 0.4.x with sideload-install enabled.
+- **Two NVIDIA DGX Sparks** (or similar boxes with NVIDIA GPUs + Docker). One will be "Spark 1" (head node) and one will be "Spark 2" (worker node + audio services). They must be on the same LAN as the Start9 server.
+
+### Spark 1 (the head node)
+
+- A Linux user account you can SSH into (any username — `ubuntu`, `nvidia`, your own — just be consistent). Note the username; you'll enter it later.
+- **Docker + NVIDIA Container Toolkit** installed and working.
+- **`~/spark-vllm-docker/`** cloned from the community repo:
+
+  ```bash
+  git clone https://github.com/eugr/spark-vllm-docker ~/spark-vllm-docker
+  cd ~/spark-vllm-docker
+  ./build-and-copy.sh -c    # builds the vLLM container image
+  ```
+
+  > **The path matters.** Spark Control hardcodes `~/spark-vllm-docker` as the working directory for cluster commands. If you clone it elsewhere, the dashboard's swap and download actions will silently fail.
+
+- A HuggingFace cache at `~/.cache/huggingface/hub/`. Either pre-download one model now, or use the dashboard's "Download a new model" button after install.
+
+### Spark 2 (the worker node)
+
+- Same Linux user account as Spark 1, with passwordless SSH from Spark 1 working.
+- **Docker + NVIDIA Container Toolkit** installed.
+- That's it — the rest can be installed through the Spark Control dashboard once it's running.
+
+### Optional but recommended
+
+- An **NVIDIA NGC personal API key** if you want to install Parakeet (STT) from `nvcr.io`. Free: <https://ngc.nvidia.com/setup/personal-key>. Starts with `nvapi-...`. (Not needed for Kokoro — it's Apache 2.0 and pulls from a public GitHub Container Registry image with no auth.)
+
+---
+
+## Install steps
+
+### 1. Sideload the s9pk
+
+In your Start9 web UI, go to **Sideload Service** and upload the `spark-control_*.s9pk` file (x86_64 or aarch64 depending on your Start9). Install it.
+
+### 2. Start the service once
+
+The first start generates an ed25519 SSH keypair inside the package volume. Wait until the service shows "Running" status; this should take only a few seconds.
+
+### 3. Show the public key and install it on both Sparks
+
+- Open Spark Control → **Actions → Show Public Key**.
+- If you haven't run Configure Sparks yet, you'll just see the raw key. Skip to step 4, then come back here.
+- Once Configure Sparks is filled in, this action produces a **ready-to-paste install command** (a multi-line `ssh ... 'echo ... >> authorized_keys'` block). Copy the entire block.
+- Run it in a terminal on a machine that already has SSH access to your Sparks. You'll be prompted for each Spark's SSH password once. After it completes, the Start9 server can SSH into both Sparks.
+
+### 4. Configure Sparks
+
+- Open Spark Control → **Actions → Configure Sparks**.
+- Fill in:
+  - **Spark 1 hostname or IP** — prefer the **IP** (e.g. `192.168.1.x`) over `.local` hostnames; vLLM only binds IPv4 and mDNS can resolve to IPv6 first.
+  - **Spark 1 SSH user** — whatever username you set up on Spark 1.
+  - **Spark 2 hostname or IP** + **SSH user** — same idea.
+  - Optional Parakeet/Kokoro overrides — leave blank if those services run on Spark 2 (the normal case).
+  - Optional **Open WebUI URL** — paste your Open WebUI LAN URL to get a deep-link button in the dashboard next to the current model.
+  - Optional **NGC API key** — paste it here if you have one.
+
+Save.
+
+### 5. Re-run Show Public Key (if you skipped earlier)
+
+Now that hosts are configured, Show Public Key will give you the paste-ready install command. Run it as described in step 3.
+
+### 6. Open the Web UI
+
+From the Spark Control service page, click the Web UI button. You should see:
+
+- A **top status bar** with the currently loaded LLM (or "no model loaded" if Spark 1's vLLM container is fresh).
+- An **LLM tab** with cards for each model in the bundled catalog. Models you've downloaded show "on disk" badges; others show "not downloaded".
+- An **Audio / Speech tab** with health status and Install / Start / Stop / Restart buttons for Parakeet and Kokoro.
+
+If the dashboard loads and both Spark hardware cards show CPU/RAM/GPU stats, **you're in**.
+
+### 7. Load your first LLM
+
+Click **"Switch to this"** on any model card. The dashboard will:
+
+1. SSH into Spark 1, stop any running vLLM container.
+2. Run `launch-cluster.sh` with the model's bundled flags.
+3. Stream `docker logs -f` back to your browser until `Application startup complete.` appears.
+4. Mark the new model as active.
+
+Typical times: solo-mode models (Qwen3.6, Gemma 4) take ~3–5 min. Cluster-mode models (Qwen3-VL 235B) take ~5–8 min — they have to coordinate across both Sparks via Ray.
+
+### 8. (Optional) install audio services
+
+From the Audio / Speech tab, click **Install Parakeet**. This pulls and starts the parakeet-asr container on Spark 2 with appropriate settings. Takes ~2–3 min for the first install.
+
+For diarization with speaker fingerprints, also click **Reapply patches** — that overlays Sortformer + TitaNet support onto the parakeet container. The patches survive `docker restart` but are wiped by `docker rm`; if you ever recreate the container, re-run Reapply patches.
+
+Kokoro TTS is a manual install for now: pull `ghcr.io/remsky/kokoro-fastapi-gpu:latest` on Spark 2 and run it with `--gpus all -p 8880:8880`. No NGC key is required (Kokoro is Apache 2.0). It boots in ~5 seconds and uses only ~1.3 GB of GPU memory. Spark Control picks it up automatically once it's running on port 8880; a one-click Kokoro install action is planned for a near-future release.
+
+---
+
+## Endpoints exposed to your other apps
+
+Once Spark Control is healthy, your other LAN apps can hit it as a single trusted backend:
+
+| Path | Backend | Notes |
+|---|---|---|
+| `GET /api/endpoints` | (self) | Service discovery — JSON of base_urls + ready flags. Hit this first so you don't have to hardcode Spark IPs in other apps. |
+| `POST /v1/chat/completions` | vLLM on Spark 1 | OpenAI-compatible; supports `stream: true` |
+| `POST /v1/completions` | vLLM on Spark 1 | Legacy OpenAI completions |
+| `POST /v1/audio/transcriptions` | Parakeet on Spark 2 | OpenAI-compatible STT |
+| `POST /v1/audio/speech` | Kokoro on Spark 2 | OpenAI-compatible TTS. Default voice `bm_george`; pass `voice` to pick any of Kokoro's 67 voices. Reliable at any input length (no chunking/retry needed). |
+| `POST /api/audio/diarize-chunk` | Sortformer + TitaNet | Per-chunk diarization with voice fingerprints for cross-chunk re-clustering |
+| `POST /api/audio/transcribe-with-speakers` | Parakeet + Sortformer | One-shot transcribe + diarize, merged |
+
+All of these inherit Spark Control's TLS cert and StartOS access controls. You only need one allowlist entry in downstream apps.
+
+---
+
+## Operational notes
+
+- **vLLM does not auto-load a model after a power loss.** When your Sparks reboot, the dashboard shows "no model loaded" until you click "Switch to this" on whichever LLM you want. Parakeet and Kokoro restart automatically with their containers (Kokoro runs with `--restart unless-stopped`, and Parakeet is configured the same way).
+- **Single-slot chunked workflows.** If you call `/v1/audio/transcriptions` or `/api/audio/diarize-chunk` on chunked audio, send the chunks **sequentially**, not in parallel. Parallel requests can trigger a known cuFFT race on the Spark 2 GPU, which surfaces as a `503` with a `Retry-After` header. Spark Control recovers automatically, but each retry costs ~60 s.
+- **Context window**: the bundled Qwen3.6 entry runs at 64K total tokens (input + output combined). Adjust per-model via the Advanced button on each card.
+- **Update path**: model-catalog overrides and custom services live in `/data/*` inside the volume; they survive s9pk updates.
+
+---
+
+## Resources
+
+- `README.md` — repo overview, build instructions, dev environment
+- `runbook.md` — model-swap recipes and operating notes
+- `known-issues.md` — debugging fixes (Mamba block-size, vision token budget, historical Magpie notes, etc.)
+- Source: `image/` is the FastAPI app; `package/` is the StartOS wrapper. To build the s9pk, run `cd package && make x86` (or `make aarch64`).
+
+---
+
+## If you're an AI agent helping with this install
+
+A few things worth knowing:
+
+- The codebase is **two halves**: `image/` is a standalone FastAPI app you can run with `uvicorn app.server:app` for local dev. `package/` is the StartOS wrapper. Changes to either should be coordinated.
+- **All connection info** comes from environment variables read in `image/app/config.py`, populated from `package/startos/fileModels/sparkConfig.yaml.ts` via the Configure Sparks action. No IPs or usernames are hardcoded in runtime code; the one hardcoded path is covered in the next bullet.
+- The **path `~/spark-vllm-docker`** *is* hardcoded in `swap.py`, `download.py`, `updates.py`, and `models.py`. If the user has cloned the upstream repo elsewhere, either fix the path or symlink it.
+- **Persistent state** lives at `/data/` inside the container: `config.yaml`, `models-overrides.yaml`, `services-overrides.yaml`, `connectivity.json`, `ssh/`. These survive package updates.
+- The dashboard polls every 5 s; check `image/app/health.py` and `image/app/connectivity.py` for the probing logic. External apps can also POST failures to `/api/health-event` to log between-poll blips.
+- Debugging audio issues: SSH into Spark 2 and run `docker logs --tail 100 parakeet-asr`. cuFFT errors usually mean parallel requests; see the operational note above.
+- Debugging LLM swaps: the swap log is streamed in the browser, but the underlying `docker logs -f vllm_node` on Spark 1 is the ground truth.
+- The package supports both `x86_64` and `aarch64` builds. Match your Start9 server architecture.
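The operational note in HANDOFF.md (send audio chunks sequentially, and expect a `503` with `Retry-After` if the backend wedges) can be sketched on the client side as follows. This is a minimal sketch, not Spark Control code: `send_sequentially` and the `post` callable are hypothetical names, `Retry-After` is assumed to carry a number of seconds, and the 60-second fallback simply mirrors the "~60 s per retry" figure from the guide.

```python
import time
from typing import Callable, Iterable, Mapping, Tuple

# Hypothetical transport: takes one chunk of audio bytes and returns
# (status_code, response_headers, parsed_json_body). Wrap requests/httpx here.
Post = Callable[[bytes], Tuple[int, Mapping[str, str], dict]]


def send_sequentially(
    chunks: Iterable[bytes],
    post: Post,
    max_retries: int = 3,
    default_wait_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Send chunks one at a time; on a 503, wait for Retry-After and retry."""
    results = []
    for index, chunk in enumerate(chunks):
        for attempt in range(max_retries + 1):
            status, headers, body = post(chunk)
            if status == 200:
                results.append(body)
                break
            if status == 503 and attempt < max_retries:
                # Assumption: Retry-After is given in seconds. Fall back to
                # the default wait if the header is missing or unparsable.
                try:
                    wait_s = float(headers.get("Retry-After", default_wait_s))
                except ValueError:
                    wait_s = default_wait_s
                sleep(wait_s)
                continue
            raise RuntimeError(f"chunk {index} failed with HTTP {status}")
    return results
```

Only one request is ever in flight, so the cuFFT race described above should not be triggered by this client. The `sleep` parameter is injectable so the retry path can be exercised in tests without real waiting.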
@@ -2,11 +2,14 @@

 A browser-based control panel for a dual-DGX-Spark vLLM cluster. Designed to run as a [StartOS 0.4](https://docs.start9.com/packaging/0.4.0.x/) package on a Start9 server on the same LAN as the Sparks.

+> **If you've just received this package from someone**, start with [HANDOFF.md](./HANDOFF.md) — it has the prereq checklist and a step-by-step install guide written for a fresh user.
+
 ## What it does

- Shows which LLM is currently loaded on the cluster (`:8888/v1/models`).
+- Shows which LLM is currently loaded on the cluster (`<spark1-host>:8888/v1/models`).
 - Click to swap to a different model — stops the current one, launches the new one, streams logs to the UI until `Application startup complete.` appears.
- Surfaces health for Parakeet (STT, `:8000`) and Magpie (TTS, `:9000`) on Spark 2.
+- Surfaces health for Parakeet (STT, `:8000`) and Kokoro (TTS, `:8880`) on Spark 2.
+- Proxies OpenAI-compatible chat-completions, transcribe, diarize, and TTS through one trusted host so external apps only need to know about Spark Control.

 ## Architecture

@@ -32,16 +35,16 @@ cd image
 python3 -m venv .venv && source .venv/bin/activate
 pip install -e .
 export SPARK1_HOST=<spark-1-ip>
-export SPARK1_USER=<spark-user>
+export SPARK1_USER=<your-ssh-user>
 export SPARK2_HOST=<spark-2-ip>
-export SPARK2_USER=<spark-user>
-export SSH_KEY_PATH="$HOME/Library/Application Support/NVIDIA/Sync/config/nvsync.key"
+export SPARK2_USER=<your-ssh-user>
+export SSH_KEY_PATH=<path-to-your-private-key>
 uvicorn app.server:app --host 0.0.0.0 --port 9999 --reload
 ```

 Open <http://localhost:9999>.

-> **Note:** use the **IP** `<spark-1-ip>` for Spark 1, not `<spark-1-host>.local`. mDNS resolves to IPv6 first and `httpx` hangs on it because vLLM only binds IPv4.
+> **Note:** prefer the **IP** for Spark 1 over a `.local` hostname. mDNS can resolve to IPv6 first, and `httpx` will hang on it because vLLM only binds IPv4.

 ## Build the StartOS package

@@ -49,6 +52,8 @@ Open <http://localhost:9999>.
 cd package
 npm i        # one-time
 make x86     # produces spark-control_x86_64.s9pk (~55 MB)
+# or
+make aarch64 # for ARM-based Start9 servers
 ```

 Requires [`start-cli`](https://docs.start9.com/latest/developer-guide/sdk/installing-the-sdk), Node ≥ 22, Docker. The build runs `tsc` + `ncc` for the TS bundle, then `docker build` on `image/Dockerfile`, then `start-cli s9pk pack` to produce the `.s9pk`.
@@ -57,15 +62,18 @@ To sideload onto your Start9: `make install` (needs `host:` set in `~/.startos/c

 ## Post-install setup (one-time per Start9 install)

-1. Open the Spark Control service → **Actions** → **Show Public Key** → copy the line.
-2. SSH to each Spark and append the line to `~/.ssh/authorized_keys` for the `<spark-user>` user.
-3. **Actions** → **Configure Sparks** → enter `<spark-1-ip>` / `<spark-user>` for Spark 1 and `<spark-2-ip>` / `<spark-user>` for Spark 2.
+1. Open the Spark Control service → **Actions** → **Show Public Key** → copy the install command it produces.
+2. Run that command from any machine that already has SSH access to your Sparks. It appends the package's public key to `~/.ssh/authorized_keys` on each Spark.
+3. **Actions** → **Configure Sparks** → enter your Spark 1 / Spark 2 IPs and the SSH username you use to log into them.
 4. Start the service. Open the Web UI — current model + health should show within ~5 s.

+See [HANDOFF.md](./HANDOFF.md) for a fuller prereq checklist and the hardware-side setup required *before* this package is useful.
+
 ## Repo layout

 - `image/` — Docker image source (FastAPI app + `models.yaml`)
 - `package/` — StartOS 0.4 package source
+- `HANDOFF.md` — prereqs + first-time install guide for a fresh user
 - `runbook.md` — operating notes
 - `known-issues.md` — known quirks and workarounds
 - `LICENSE` — MIT
@@ -76,9 +84,9 @@ Other services on your LAN can hit `GET /api/endpoints` to learn where the curre

 ```json
 {
-  "vllm":    { "ready": true,  "base_url": "http://<spark-1-ip>:8888/v1", "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "openai_compat": true },
-  "parakeet":{ "ready": true,  "base_url": "http://<spark-2-ip>:8000",   "kind": "stt", "model": "nvidia/parakeet-tdt-0.6b-v3" },
-  "magpie":  { "ready": false, "base_url": "http://<spark-2-ip>:9000",   "kind": "tts" }
+  "vllm":    { "ready": true,  "base_url": "http://<spark1-host>:8888/v1", "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "openai_compat": true },
+  "parakeet":{ "ready": true,  "base_url": "http://<spark2-host>:8000",   "kind": "stt", "model": "nvidia/parakeet-tdt-0.6b-v3" },
+  "kokoro":  { "ready": true,  "base_url": "http://<spark2-host>:8880",   "kind": "tts" }
 }
 ```
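A downstream app can consume the discovery response above with a few lines of client code. The following is a minimal sketch: `ready_base_urls` is a hypothetical helper, and the sample payload is the README example with `kokoro` flipped to not ready purely to show the filtering.

```python
import json


def ready_base_urls(endpoints: dict) -> dict:
    """Map service name -> base_url for every service whose ready flag is true."""
    return {
        name: info["base_url"]
        for name, info in endpoints.items()
        if isinstance(info, dict) and info.get("ready") and "base_url" in info
    }


# Sample payload based on the README example (placeholder hosts kept as-is).
# "kokoro" is set to ready=false here for illustration only.
sample = json.loads("""
{
  "vllm":     { "ready": true,  "base_url": "http://<spark1-host>:8888/v1", "openai_compat": true },
  "parakeet": { "ready": true,  "base_url": "http://<spark2-host>:8000", "kind": "stt" },
  "kokoro":   { "ready": false, "base_url": "http://<spark2-host>:8880", "kind": "tts" }
}
""")

usable = ready_base_urls(sample)
print(usable)
```

Filtering on `ready` before use lets a client skip a backend that is still loading instead of timing out against it.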

@@ -86,7 +94,7 @@ Other services on your LAN can hit `GET /api/endpoints` to learn where the curre

 ## Reporting failures from external apps

-Spark Control polls every 5 s, so a brief blip in Parakeet/Magpie/vLLM availability can slip between polls and never make it into the connectivity log. To capture short failures, an external app (e.g. Open WebUI) can POST whenever a call fails (or succeeds):
+Spark Control polls every 5 s, so a brief blip in Parakeet/Kokoro/vLLM availability can slip between polls and never make it into the connectivity log. To capture short failures, an external app (e.g. Open WebUI) can POST whenever a call fails (or succeeds):

 ```bash
 curl -X POST http://<dashboard-url>/api/health-event \
@@ -104,15 +112,17 @@ Fields: `service` (required), `ok` (required), `source` (optional, free-form), `

 ## Status

-**v0.2.3** — installed and verified on a Start9 server. Five bundled LLMs in the catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen2.5-72b), plus any custom models added through the UI.
+**v0.2.3 / s9pk version 0.13.0:4** — installed and verified on a Start9 server. Five bundled LLMs in the catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen2.5-72b), plus any custom models added through the UI.

 ### What v0.2 added on top of v0.1

 - **Service discovery API** (`/api/endpoints`) for other LAN services
- **Magpie crash fix** documented (chown the model-cache volume to uid 1000)
- **Always-on services panel** with Start/Stop/Restart for Parakeet + Magpie, plus per-service host configuration in Configure Sparks (so Parakeet/Magpie can live on Spark 1, Spark 2, or anywhere)
+- **Kokoro-82M TTS** replaces Magpie/Riva NIM as the default TTS backend (v0.14.0). Magpie's decoder had a ~30-50% truncation rate on multi-sentence inputs and ate 49 GB of GPU memory; Kokoro is 24/24 reliable at every input length tested, uses 1.3 GB GPU, and renders in ~1s. See HANDOFF.md and the release notes for the migration story.
+- **Always-on services panel** with Start/Stop/Restart for Parakeet + Kokoro, plus per-service host configuration in Configure Sparks (so they can live on Spark 1, Spark 2, or anywhere)
 - **Model download** from the dashboard — paste an HF repo, pick solo or cluster, watch percent progress with bytes/rate/ETA. After completion, an "Add to catalog" dialog appears pre-filled.
 - **spark-vllm-docker update check** — banner shows "N commits behind upstream"; Apply Update runs `git pull && ./build-and-copy.sh -c` over SSH with a streamed log
 - **Per-model Advanced settings** — knobs for max context, GPU memory %, and three optimization toggles (fastsafetensors, prefix caching, FP8 KV cache). Persisted to `/data/models-overrides.yaml` so they survive package updates. Bundled and custom models alike.
+- **Diarization with speaker fingerprints** via Sortformer + TitaNet, exposed at `/api/audio/diarize-chunk` for chunked workflows
+- **OpenAI chat-completions proxy** (`/v1/chat/completions`, `/v1/completions`) — forwards to the loaded vLLM so external apps need only one trusted host

 v0.3+ roadmap (loose): richer dashboard (SSH/GPU/tokens-per-sec), Open WebUI deep-link integration, optional auth, multi-cluster.
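For the chat-completions proxy listed above, a client request might be assembled as below. This is a sketch under stated assumptions: `build_chat_request` is a hypothetical helper, `https://spark-control.example` is a placeholder host, and the body uses the standard OpenAI chat-completions fields (`model`, `messages`, `stream`). The request is only constructed here; actually sending it would also require trusting the Start9 Root CA.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, stream: bool = False
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat-completions POST."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host; the model name is the one shown in the README sample.
req = build_chat_request(
    "https://spark-control.example/",
    "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    "Hello",
)
print(req.get_method(), req.full_url)
```

Because the proxy forwards to whichever model vLLM currently has loaded, a client may prefer to read the `model` value from `GET /api/endpoints` rather than hardcode it.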
@@ -0,0 +1,288 @@
+# Spark Control — Audio API reference (transcription + diarization)
+
+For external clients (e.g. the **Ten31 Transcripts** capture app) integrating with
+the transcription and diarization endpoints. All examples are **real responses**
+from the live deployment.
+
+---
+
+## 1. Connection / auth
+
+- **Base URL:** `https://<spark-control-host>` (the operator's Start9 LAN address,
+  e.g. `https://<spark-control-host>:62419`). A `.local` form also exists (survives IP
+  changes); the operator can provide it.
+- **TLS:** Start9's self-signed Root CA. On the LAN, set `verify=False` /
+  `rejectUnauthorized:false` (curl `-k`), or install the Start9 Root CA into your
+  trust store. Same story as every other Spark Control endpoint.
+- **Auth:** **none on the LAN** today — the endpoints sit behind StartOS access
+  control + TLS on a trusted network. No bearer token / API key. (If you need
+  per-client auth later, the operator can add it; it's not there now.)
+- **Limits:**
+  - Max upload: **200 MB** per request (`413` if exceeded). For long calls, chunk
+    (see §4).
+  - Request timeout: transcription ~300 s, diarization ~600 s per request.
+  - **Send requests sequentially, not in parallel.** Concurrent audio requests can
+    trip a GPU FFT race on the backend (returns `503` + `Retry-After`). One in
+    flight at a time is the safe pattern.
+
+---
+
+## 2. Audio format
+
+- The backend (NVIDIA Parakeet / NeMo) **decodes common formats** (WAV, FLAC, MP3,
+  m4a) and internally resamples to **16 kHz mono**. So **16 kHz mono WAV is ideal**;
+  anything decodable also works (stereo is downmixed).
+- **Single mixed-mono file** is what the endpoints expect. Diarization (Sortformer)
+  separates speakers *from one mixed stream*, so for diarization you want everyone
+  in one file — **mix your system-audio track + mic track to one mono WAV** before
+  sending.
+- **Your two-track capture is an asset:** your mic track is, by definition, *you* —
+  a known identity. Two clean options:
+  1. **Mix both tracks → diarize the mix** (simplest; pairs perfectly with your
+     visual-timeline name-merge — see §4 note).
+  2. **Diarize only the system track** (the other participants) and label your own
+     mic track as the user directly (no diarization needed for your own voice).
+- **Upload mechanism:** `multipart/form-data`, file field name **`file`** (OpenAI-
+  compatible). Not base64, not a path/URL — send the bytes.
+
+---
+
+## 3. Transcription endpoint
+
+**`POST /v1/audio/transcriptions`** — OpenAI-compatible. **Synchronous** (returns the
+result; no job/polling).
+
+Multipart fields:
+| field | required | notes |
+|---|---|---|
+| `file` | **yes** | the audio bytes |
+| `model` | no | default `parakeet-tdt-0.6b-v3` (one STT model server-side; you don't need to pick) |
+| `response_format` | no | `json` (default, just text) · `verbose_json` (timestamps) · `text` |
+| `language` | no | default auto/en |
+| `temperature`, `prompt` | no | passthrough |
+
+```bash
+curl -k -X POST https://<host>/v1/audio/transcriptions \
+  -F "file=@call.wav" -F "response_format=verbose_json"
+```
+
+**Real `verbose_json` response** — includes **word-level AND segment-level
+timestamps** (seconds):
+```json
+{
+  "task": "transcribe",
+  "language": "en",
+  "duration": 9.259,
+  "text": "Good morning everyone. I think the energy thesis is strong this quarter. I agree, but I am worried about the lockup terms and the fee load this time.",
+  "segments": [
+    { "start": 0.0,  "end": 1.28, "text": "Good morning everyone." },
+    { "start": 1.44, "end": 4.48, "text": "I think the energy thesis is strong this quarter." }
+  ],
+  "words": [
+    { "start": 0.0,  "end": 0.32, "text": "Good" },
+    { "start": 0.32, "end": 0.72, "text": "morning" },
+    { "start": 0.8,  "end": 1.28, "text": "everyone." }
+  ]
+}
+```
+(`json` → `{"text": "..."}`; `text` → plain text body.)
+
+---
+
+## 4. Diarization
+
+Two endpoints, both **synchronous**, both returning **anonymous clusters**
+(`Speaker_0`, `Speaker_1`, …) with timestamps. NVIDIA **Sortformer** owns
+segmentation; **TitaNet** produces a voiceprint per speaker.
+
+### `POST /api/audio/diarize-chunk` — segmentation + voice fingerprints
+
+Multipart: `file` (required). Designed to be called **per chunk** for long calls;
+returns a 192-dim fingerprint per local speaker so you can re-cluster the same
+person across chunks.
+
+```bash
+curl -k -X POST https://<host>/api/audio/diarize-chunk -F "file=@call.wav"
+```
+**Real response:**
+```json
+{
+  "duration": 9.259,
+  "segments": [
+    { "start_s": 0.0,  "end_s": 1.52, "speaker": "Speaker_0", "confidence": 0.931 },
+    { "start_s": 1.6,  "end_s": 4.56, "speaker": "Speaker_0", "confidence": 0.9662 },
+    { "start_s": 4.88, "end_s": 9.04, "speaker": "Speaker_1", "confidence": 0.9681 }
+  ],
+  "speakers_detected": ["Speaker_0", "Speaker_1"],
+  "fingerprints": {
+    "Speaker_0": [0.0028, 0.0173, -0.0114, "...192 floats"],
+    "Speaker_1": [0.0020, 0.0056, -0.0045, "...192 floats"]
+  },
+  "models": {
+    "diarization": "nvidia/diar_sortformer_4spk-v1",
+    "embedding": "nvidia/speakerverification_en_titanet_large"
+  }
+}
+```
+- `confidence` ∈ [0,1] = mean probability the assigned speaker was active over the
+  segment (threshold it to render uncertain segments as "Speaker_0?").
+- `speaker` labels are **local to this chunk** — use the fingerprints + cosine
+  similarity (NeMo default distance threshold ~0.7) to merge `chunkA.Speaker_0` with
+  `chunkB.Speaker_2` when they're the same voice.
+
+### `POST /api/audio/transcribe-with-speakers` — ASR + diarization merged
+
+Multipart: `file` (required). Runs transcription + diarization and **merges by
+timestamp** into speaker-attributed text blocks (`start_ms`/`end_ms`).
+```json
+{
+  "duration": 9.259, "language": "en",
+  "speakers_detected": ["Speaker_0", "Speaker_1"],
+  "segments": [
+    { "start_ms": 0,    "end_ms": 4480, "speaker": "Speaker_0",
+      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
+    { "start_ms": 4800, "end_ms": 9040, "speaker": "Speaker_1",
+      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
+  ],
+  "models": { "transcription": "parakeet", "diarization": "nvidia/diar_sortformer_4spk-v1" }
+}
+```
+
+### Your key question — can it accept a *prior* (named timeline) to label clusters?
+
+**Yes — that's exactly what `/api/audio/label-merge` does (§4.5 below).** You POST
+the audio plus your visual `(start, end, name)` timeline; the backend diarizes,
+runs the majority-temporal-overlap vote, and returns **named** segments — no
+client-side merge needed. (The two endpoints above still return anonymous clusters
+if you'd rather do the merge yourself.) Note the diarizer is fixed at **≤4 speakers
+per chunk** (NVIDIA Sortformer, not pyannote) and takes **no `num_speakers` hint** —
+for >4-person calls, chunk the audio; your visual timeline actually helps
+disambiguate across chunks.
+
+**The fingerprints compound.** Every named cluster comes back with its 192-dim
+voiceprint. Persist those (keyed by name) and pass them back as `known_voiceprints`
+on the next call — `label-merge` will recover a speaker by **voice** even when the
+visual cue is missing (camera off, a bad OCR frame). Your visual capture *enrolls a
+voice library for free*.
+
+---
+
+## 4.5 `POST /api/audio/label-merge` — named segments from a visual timeline
+
+Diarize + **name the clusters** from your screen-derived timeline (majority temporal
+overlap), with an optional **voiceprint fallback** for anyone the visual track
+missed. **Synchronous. Stateless** — you own the timeline and the voiceprint
+library; the backend just diarizes + merges and persists nothing.
+
+`multipart/form-data` fields:
+| field | required | notes |
+|---|---|---|
+| `file` | **yes** | mixed-mono audio |
+| `timeline` | **yes** | JSON array: `[{"start":0.0,"end":4.5,"name":"Alice","confidence":0.9}, ...]` (seconds) |
+| `known_voiceprints` | no | JSON object `{"Alice":[192 floats], "Bob":[...]}` — named voiceprints from past calls, used to label clusters with no visual overlap |
+| `transcribe` | no | `"true"` to also return per-segment text (default false) |
+| `min_overlap` | no | min fraction of a cluster's time that must overlap the winning name (default `0.0` = any overlap wins) |
+| `voiceprint_threshold` | no | cosine similarity to accept a voiceprint match (default `0.5`) |
+
+```bash
+curl -k -X POST https://<host>/api/audio/label-merge \
+  -F "file=@call.wav" \
+  -F 'timeline=[{"start":0,"end":4.5,"name":"Alice"},{"start":4.8,"end":9.3,"name":"Bob"}]' \
+  -F "transcribe=true"
+```
+
+**Real response** (the 2-speaker test clip; visual named both, with transcript):
+```json
+{
+  "duration": 9.259,
+  "speakers": [
+    { "cluster": "Speaker_0", "name": "Alice", "source": "visual", "overlap_confidence": 0.9866,
+      "fingerprint": [0.0028, 0.0173, "...192 floats"] },
+    { "cluster": "Speaker_1", "name": "Bob",   "source": "visual", "overlap_confidence": 1.0,
+      "fingerprint": [0.0020, 0.0056, "...192 floats"] }
+  ],
+  "segments": [
+    { "start_ms": 0,    "end_ms": 4480, "speaker": "Alice", "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
+    { "start_ms": 4800, "end_ms": 9040, "speaker": "Bob",   "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
+  ],
+  "fingerprints": { "Alice": ["...192 floats"], "Bob": ["...192 floats"] },
+  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1", "embedding": "nvidia/speakerverification_en_titanet_large" }
+}
+```
+
+**Name resolution per cluster, in order:** (1) the visual-timeline name with the
+most temporal overlap (`source: "visual"`); (2) if none, the closest
+`known_voiceprints` match above `voiceprint_threshold` (`source: "voiceprint"`,
+with `match_similarity`); (3) otherwise `Unknown_N` (`source: "unmatched"`) — never
+mislabeled. The `fingerprints` map (keyed by the resolved name) is what you persist
+to grow your voiceprint library for the next call. When `transcribe=false`, segments
+are `{start_s, end_s, speaker, confidence}` instead of text blocks.
+
+**Verified live** — visual match (both speakers named), voiceprint recovery (a
+camera-off speaker matched by voice), and unmatched (→ `Unknown_0`) all confirmed.
+
+### Dual-channel mode (recommended for Ten31 Transcripts)
+
+If you capture two sample-aligned tracks — **`mic_file`** (the local user) + **`system_file`**
+(everyone else, from screen capture) — send them *instead of* `file`. This is strictly
+better than mixing to mono: the diarizer over-segments a mono mix (a stereo clip of two
+clean voices comes back as **3** speakers), whereas the two channels let each model get
+the easiest possible mono input.
+
+Extra form fields for dual mode:
+| field | required | notes |
+|---|---|---|
+| `mic_file` + `system_file` | **yes (dual)** | the two sample-aligned 16 kHz mono tracks |
+| `self_name` | no | the local user's name (mic channel). Default `"Me"`. |
+| `self_vad` | no | JSON `[{"start","end"}]` — windows where the mic is active *and louder than* system. If omitted, computed server-side per-window. |
+
+How it works: the **mic track** → your words, gated to windows where the mic is genuinely
+you speaking (the mic also picks up the remote audio as quiet bleed, so this loudness gate
+is essential — without it the bleed gets transcribed as you). The **system track** →
+diarized (it only has to separate the *remote* people) and named via the timeline +
+voiceprints. Your clean voiceprint is **enrolled from the mic track** and injected into the
+library, so a system cluster that's you dialed in from a second device (dual-login) resolves
+to you, not a stranger. You also free a Sortformer speaker slot (you no longer consume one).
+
+```bash
+curl -k -X POST https://<host>/api/audio/label-merge \
+  -F "mic_file=@mic.wav" -F "system_file=@system.wav" \
+  -F "self_name=Alice" -F 'timeline=[...]' -F "transcribe=true" \
+  -F 'known_voiceprints={"Alice":[...],"Bob":[...]}'   # include your own
+```
+
+Response is the same shape with `"mode":"dual_channel"`; `speakers` includes a
+`{"name":self_name,"source":"mic_channel"}` entry, and `fingerprints[self_name]` is your
+clean mic-enrolled voiceprint to store.
+
+**Validated on a real call with known misattributions:** dual-channel fixed both mono-mix
+misattributions (a remote "Go Bitcoin" no longer credited to the user; a local "There"
+recovered from `Unknown`), and **correctly split overlapping speech** — two people saying
+"Hello" at once that the coarse ground truth itself had conflated.
+
+> **One known limit:** if *loud* remote bleed masks a *quiet* local word, the mic-track ASR
+> can miss it entirely (we can't attribute a word that was never transcribed). A cleaner mic
+> (headphones, so there's no speaker bleed) avoids it; channel-subtraction echo-cancellation
+> is a possible future enhancement since the tracks are sample-aligned.
+
+---
+
+## 5. Anything else
+
+- **No OpenAPI/Swagger yet.** This doc + the curl examples are the contract.
+- **Health / discovery:**
+  - `GET /api/status` — per-service health (`parakeet`, etc.).
+  - `GET /api/endpoints` — service-discovery JSON (base URLs + ready flags).
+  - `GET /v1/models` — lists the STT model + diarizer.
+- **Errors:** JSON body, conventional status codes — `400` malformed, `413` too
+  large, `503` + `Retry-After` if the backend briefly wedges (retry after the
+  interval; transcription auto-recovers). Most error bodies are `{"detail": "..."}`.
+- **Long calls:** chunk into ~2–3 min pieces, send **sequentially**, diarize each
+  with `/api/audio/diarize-chunk`, and stitch speakers across chunks via the
+  fingerprints. (The operator's other apps use exactly this pattern.)
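+
+The cross-chunk stitching step can be sketched as follows, assuming you have already
+pulled one fingerprint per local speaker label out of each chunk's response. The
+function names, the `Global_n` ids, and the 0.5 cosine threshold are illustrative
+assumptions, not part of the API.
+
+```python
+import math
+
+
+def cosine(a: list[float], b: list[float]) -> float:
+    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return dot / (na * nb) if na and nb else 0.0
+
+
+def stitch(chunks: list[dict], threshold: float = 0.5) -> list[dict]:
+    """chunks: one {local_label: fingerprint} dict per chunk, in call order.
+    Returns one {local_label: global_id} dict per chunk."""
+    library: dict[str, list[float]] = {}  # global_id -> first fingerprint seen
+    mappings: list[dict] = []
+    for fingerprints in chunks:
+        mapping: dict[str, str] = {}
+        for label, fp in fingerprints.items():
+            best_id, best_sim = None, threshold
+            for gid, ref in library.items():
+                sim = cosine(fp, ref)
+                if sim >= best_sim:
+                    best_id, best_sim = gid, sim
+            if best_id is None:  # nobody close enough: register a new speaker
+                best_id = f"Global_{len(library)}"
+                library[best_id] = fp
+            mapping[label] = best_id
+        mappings.append(mapping)
+    return mappings
+```
+
+This greedy version keeps only the first fingerprint per speaker and does not stop two
+local labels in one chunk from resolving to the same global id; a production stitcher
+would likely average fingerprints and enforce one-to-one assignment within a chunk.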
+
+---
+
+*Backend: NVIDIA Parakeet TDT 0.6B (STT) + Sortformer 4spk-v1 (diarization) +
+TitaNet (voice fingerprints) on DGX Spark, fronted by Spark Control. All on the
+operator's LAN — nothing leaves the box.*
@@ -0,0 +1,202 @@
+# Embeddings + Retrieval through Spark Control (v0.15.0)
+
+Spark Control now fronts a local RAG stack so your agent/CRM system can do
+dense embeddings, reranking, and hybrid vector search against one trusted host
+— same TLS cert and allowlist as the LLM and audio endpoints.
+
+## What runs where
+
+| Component | Host | Port | Role |
+|---|---|---|---|
+| **spark-embed** | Spark 2 (GPU) | 8088 | `BAAI/bge-m3` dense embeddings (1024-d) + `BAAI/bge-reranker-v2-m3` cross-encoder rerank |
+| **Qdrant** | Spark 2 (CPU) | 6333/6334 | Vector storage, hybrid dense+sparse retrieval, RRF fusion, payload filtering |
+| **Spark Control** | Start9 | (your LAN URL) | Proxies all of the above behind one host |
+
+`spark-embed` is a small FastAPI server built **from the NGC PyTorch image**
+(the torch we've proven runs on the GB10's sm_121 GPU). We did *not* use HF Text
+Embeddings Inference because as of 2026 it ships no arm64 CUDA image. No
+torchaudio, no flash-attn — bge-m3 + the reranker are plain XLM-RoBERTa
+encoders.
+
+## Endpoints (all on the Spark Control host)
+
+### `POST /v1/embeddings` — dense embeddings (OpenAI-compatible)
+
+```json
+// request
+{ "input": "text to embed", "model": "BAAI/bge-m3" }
+// or  { "input": ["batch", "of", "texts"] }
+
+// response (OpenAI shape)
+{
+  "object": "list",
+  "data": [ { "object": "embedding", "index": 0, "embedding": [0.01, ...1024 floats] } ],
+  "model": "BAAI/bge-m3",
+  "usage": { "prompt_tokens": 0, "total_tokens": 0 }
+}
+```
+
+Vectors are L2-normalized by default (cosine == dot product). Works with the
+stock OpenAI Python/JS client by pointing `base_url` at Spark Control.
+
+### `POST /v1/rerank` — cross-encoder rerank
+
+```json
+// request
+{ "query": "did Brightwater commit?", "documents": ["chunk a", "chunk b", ...],
+  "top_n": 5, "return_documents": false }
+
+// response (Cohere-ish)
+{ "object": "rerank.result", "model": "BAAI/bge-reranker-v2-m3",
+  "results": [ { "index": 3, "relevance_score": 5.21 }, { "index": 0, "relevance_score": 1.04 }, ... ] }
+```
+
+`relevance_score` is the reranker's raw logit: higher means more relevant, and it is
+unbounded in principle (in practice roughly −10 to +10). Sort results in descending
+order. Send a candidate set (≤200), not your whole corpus.
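+
+A small client-side sketch of consuming that response: order by `relevance_score`
+and map each `index` back to the document you sent. The helper names are
+hypothetical, and the `sigmoid` squashing is only a display convenience for turning
+a logit into a 0..1 number, not something the endpoint does.
+
+```python
+import math
+
+
+def sigmoid(logit: float) -> float:
+    """Squash a raw logit into (0, 1) for display; ranking order is unchanged."""
+    return 1.0 / (1.0 + math.exp(-logit))
+
+
+def top_documents(documents: list[str], results: list[dict],
+                  top_n: int = 5) -> list[tuple[str, float]]:
+    """Return (document, relevance_score) pairs, best first.
+    `results` is the `results` array from the /v1/rerank response."""
+    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
+    return [(documents[r["index"]], r["relevance_score"]) for r in ranked[:top_n]]
+```
+
+Keeping the original `documents` list around matters when `return_documents` is
+`false`, since the response then carries only indices and scores.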
+
+### `POST /api/search` — orchestrated hybrid retrieval
+
+One call: embeds the query (dense), retrieves from Qdrant (hybrid dense+sparse
+with RRF **when you supply a sparse vector**, else dense-only), optionally
+cross-encoder reranks, returns top_k.
+
+```json
+// request
+{
+  "query": "Did Brightwater commit to the Fund III close in Q1?",
+  "collection": "crm_chunks",
+  "top_k": 8,
+  "retrieve_n": 80,                 // first-stage candidates (default max(50, top_k*10))
+  "sparse": { "indices": [12, 904], "values": [0.7, 1.2] },  // optional BM25 vector for hybrid
+  "fusion": "rrf",                  // or "dbsf"
+  "filter": {                       // raw Qdrant filter (pre-filter, see below)
+    "must": [ { "key": "lp_id", "match": { "value": "lp_0427" } } ]
+  },
+  "rerank": true,
+  "text_field": "text",             // payload field holding the chunk text
+  "with_payload": true
+}
+
+// response
+{
+  "object": "search.result_list",
+  "model": "BAAI/bge-m3+bge-reranker-v2-m3",
+  "query": "...",
+  "collection": "crm_chunks",
+  "reranked": true,
+  "data": [
+    { "object": "search.result", "index": 0, "id": "chunk_99c1_3",
+      "score": 5.21, "fused_score": 0.41, "rerank_score": 5.21,
+      "text": "...Brightwater confirmed verbal commitment...",
+      "payload": { "lp_id": "lp_0427", "doc_type": "email", "date_ts": 1771027200, ... } }
+  ],
+  "usage": { "embed_ms": 21, "qdrant_ms": 9, "rerank_ms": 140, "candidates": 80 }
+}
+```
+
+`score` is the rerank score when reranked, else the fused/dense score.
+
+## The sparse (BM25) story — important for entity-heavy data
+
+bge-m3's dense vectors carry semantic meaning, but exact entity matches
+(fund names, tickers, people) are a **lexical** signal. For that you want hybrid
+dense + sparse. Two design facts:
+
+1. **spark-embed serves dense vectors only.** TEI and Infinity can't emit bge-m3's
+   learned sparse vectors either, and bge-m3's pretrained sparse weights underweight
+   novel entity tokens (brand-new fund names). So we don't use bge-m3 sparse.
+2. **Use BM25 sparse vectors with Qdrant's built-in IDF.** The term vectors are
+   generated client-side and Qdrant applies IDF server-side, computed over *your*
+   corpus, so novel entity strings get correct weight.
+
+### Your ingest pipeline (the part you own)
+
+For each chunk, produce and upsert **both** vectors:
+
+- **dense**: call Spark Control `POST /v1/embeddings` → 1024-d vector.
+- **sparse**: run [FastEmbed](https://github.com/qdrant/fastembed) BM25 client-side:
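+  ```python
+  from fastembed import SparseTextEmbedding
+  bm25 = SparseTextEmbedding(model_name="Qdrant/bm25")
+  # chunk_text: the text of the chunk being ingested
+  sp = next(iter(bm25.embed([chunk_text])))   # SparseEmbedding with .indices / .values
+  sparse = {"indices": sp.indices.tolist(), "values": sp.values.tolist()}  # JSON-ready
+  ```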
+
+Create the collection with a named dense vector and a named sparse vector that
+uses `modifier: idf` (so Qdrant applies IDF server-side):
+
+```json
+PUT /collections/crm_chunks
+{ "vectors": { "dense": { "size": 1024, "distance": "Cosine" } },
+  "sparse_vectors": { "sparse": { "modifier": "idf" } } }
+```
+
+Add payload indexes for your filter fields so filtered queries stay fast:
+
+```json
+PUT /collections/crm_chunks/index   { "field_name": "lp_id",     "field_schema": "keyword" }
+PUT /collections/crm_chunks/index   { "field_name": "doc_type",  "field_schema": "keyword" }
+PUT /collections/crm_chunks/index   { "field_name": "date_ts",   "field_schema": "integer" }
+```
+
+Upsert points with both vectors + payload:
+
+```json
+PUT /collections/crm_chunks/points
+{ "points": [ { "id": 1,
+    "vector": { "dense": [...1024...], "sparse": { "indices": [...], "values": [...] } },
+    "payload": { "lp_id": "lp_0427", "lp_name": "Brightwater Capital",
+                 "doc_type": "email", "date_ts": 1771027200, "text": "..." } } ] }
+```
+
+### At query time
+
+Generate the query's BM25 sparse vector with the **same** FastEmbed model, and
+pass it to `/api/search` as `sparse`. Spark Control fuses dense+sparse with RRF
+inside Qdrant, then reranks. If you omit `sparse`, you get dense + rerank
+(still good, just no exact-lexical leg).
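+
+To show what RRF does with the two legs, here is a minimal pure-Python sketch of
+reciprocal rank fusion. The real fusion happens inside Qdrant; the function name is
+hypothetical, and `k=60` is the constant from the original RRF paper, which may
+differ from the one Qdrant uses internally.
+
+```python
+def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
+    """Fuse several best-first id rankings into one.
+    Each id scores sum(1 / (k + rank)) over the rankings that contain it."""
+    scores: dict[str, float] = {}
+    for ranking in rankings:
+        for rank, doc_id in enumerate(ranking, start=1):
+            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
+    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
+
+
+dense_leg = ["a", "b", "c"]    # semantic neighbours, best first
+sparse_leg = ["c", "a", "d"]   # exact-lexical (BM25) hits, best first
+fused = rrf_fuse([dense_leg, sparse_leg])
+```
+
+Because only ranks are used, the dense cosine scores and the BM25 scores never have
+to be put on a common scale, which is the main reason RRF is a convenient default.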
+
+> You can talk to Qdrant directly (`http://<spark2>:6333`) for collection
+> management and upserts — that's the natural home for ingest. `/api/search` is
+> the convenience path for the read side so your agents hit one host. If you'd
+> rather Spark Control proxy Qdrant admin too, say the word.
+
+## Chunking guidance (entity-heavy CRM)
+
+- One chunk per email / note / transcript-turn; one chunk per memo *section* —
+  don't split mid-thought.
+- Keep entity + date as **payload fields** (filterable), not buried in embedded text.
+- Time-aware: don't merge a 2022 note with a 2026 update in one chunk; store
+  `date_ts` so you can pre-filter and recency-boost.
+- Resolve entity-name variants ("J. Smith" / "Jonathan Smith" / "JS") to one
+  canonical `lp_id` at ingest, or retrieval fragments across variants.
+
+## Pre-filtering
+
+Most agent queries aren't pure semantic — they're "recent emails with Bob about
+energy". Pass a Qdrant `filter` to restrict the search space *before* vector
+scoring (faster + more precise). Qdrant also supports server-side recency
+boosting via Formula/Score-Boosting if you want decay-by-`date_ts` without an
+app round-trip — ask and we can expose it through `/api/search`.
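+
+As a sketch of building such a filter client-side, the helper below combines keyword
+matches with a `date_ts` range condition. The helper name and its parameters are
+hypothetical; the field names follow the payload indexes created above, and the
+`match` / `range` shapes are standard Qdrant filter conditions.
+
+```python
+import time
+from typing import Optional
+
+
+def build_filter(lp_id: Optional[str] = None, doc_type: Optional[str] = None,
+                 since_days: Optional[int] = None,
+                 now_ts: Optional[int] = None) -> Optional[dict]:
+    """Build a Qdrant filter dict for /api/search, or None if no condition applies."""
+    must: list[dict] = []
+    if lp_id is not None:
+        must.append({"key": "lp_id", "match": {"value": lp_id}})
+    if doc_type is not None:
+        must.append({"key": "doc_type", "match": {"value": doc_type}})
+    if since_days is not None:
+        now = int(time.time()) if now_ts is None else now_ts
+        # Keep only chunks dated within the last `since_days` days.
+        must.append({"key": "date_ts", "range": {"gte": now - since_days * 86400}})
+    return {"must": must} if must else None
+```
+
+A call such as `build_filter(lp_id="lp_0427", doc_type="email", since_days=90)` would
+cover the "recent emails with Bob" case; omit the `filter` key entirely when the
+helper returns `None`.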
+
+## Service discovery + health
+
+- `GET /api/endpoints` includes `embeddings` and `qdrant` entries (base_url +
+  ready flag) alongside `vllm`, `parakeet`, `kokoro`.
+- The dashboard shows Embeddings + Qdrant health dots and Start/Restart/Stop
+  controls in the Always-On Services panel.
+- Spark Control will auto-restart a wedged **embedding** container (GPU CUDA
+  wedge recovery, like the audio services) but **never** auto-restarts Qdrant —
+  it holds your only copy of the index, so a restart is surfaced for manual
+  action instead.
+
+## Capacity
+
+At your scale (tens of thousands of chunks now → low hundreds of thousands),
+this is trivial: dense vectors take ~0.6–1.2 GB at 300k chunks, and spark-embed +
+Qdrant together use a few GB of GPU/RAM out of Spark 2's 122 GB. A full re-embed of
+300k chunks takes ~8–15 min, so re-indexing is cheap if you change models.
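+
+The dense-vector figure can be checked with simple arithmetic. The sketch below
+assumes the range corresponds to 2-byte (fp16) and 4-byte (fp32) storage of 1024-d
+vectors; which precision Qdrant actually stores is an assumption here, and index
+overhead and payloads are not counted.
+
+```python
+def dense_bytes(n_chunks: int, dim: int = 1024, bytes_per_float: int = 4) -> int:
+    """Raw size of n_chunks dense vectors, excluding index and payload overhead."""
+    return n_chunks * dim * bytes_per_float
+
+
+fp16_gb = dense_bytes(300_000, bytes_per_float=2) / 1e9   # about 0.61 GB
+fp32_gb = dense_bytes(300_000, bytes_per_float=4) / 1e9   # about 1.23 GB
+```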
+
+## Model upgrade path
+
+If dense recall becomes the bottleneck, `Qwen3-Embedding-4B` (Matryoshka-trained,
+tops 2026 MTEB) is the A/B candidate — same `/v1/embeddings` contract, swap the
+model in spark-embed. bge-m3 is the lower-risk starting point and ships now.
@@ -0,0 +1,105 @@
+# Redaction Gateway — `/scrub` + `/rehydrate` (Spark Control v0.16.0)
+
+The privacy boundary between sovereign LP data and the Claude API, living at the
+same trusted Spark Control host as `/v1/chat/completions`, `/v1/embeddings`,
+`/v1/rerank`, and `/api/search`. Built to **behavioral parity** with the CRM's
+reference `backend/redaction/scrub.py` — that engine is vendored verbatim into
+Spark Control and its leak test passes here, so `SCRUB_BACKEND=gateway` is a
+drop-in for the in-repo path.
+
+## What it is
+
+- `POST /scrub` — de-identify an agent's assembled context. Returns placeholder-only
+  text (the agent forwards that to Claude) plus an opaque `map_handle`.
+- `POST /rehydrate` — swap the real values back into Claude's placeholder-bearing
+  response, locally, for human review.
+
+Spark Control does **not** call Claude. It's the scrub/rehydrate transform pair
+plus a server-held pseudonym map.
+
+## Contract (matches the handover doc)
+
+`POST /scrub`
+```json
+{ "task_id": "...", "actor": "analyst",
+  "items": [{"id": "ctx_1", "text": "..."}],
+  "known_entities": {"persons": [], "orgs": [], "funds": [], "emails": [], "locations": []},
+  "tier1_action": "drop",            // or "reject" (fail-closed 422 on any Tier-1)
+  "bucket": {"amounts": false, "dates": false},
+  "ner": "auto",                     // "auto" | "rules_only" | "qwen"
+  "map_handle": null }               // pass to reuse/extend a task's map (stable tokens)
+```
+→ `200 { task_id, map_handle, items:[{id, scrubbed_text, tokens_used}], stats:{tier1_dropped, tier2_tokenized, distinct_entities, descriptive_flags:[{item, span, action}]}, expires_at }`
+- `422 {"error":"tier1_detected","spans":[{item, kinds}]}` when `tier1_action="reject"` and Tier-1 found (kinds only — never the raw value).
+- `422 {"error":"ner_unavailable", ...}` when `ner=auto|qwen` and the local Qwen is unreachable / no model loaded — **fail-closed, emits nothing**.
+- `400` on malformed input.
+
+`POST /rehydrate`
+```json
+{ "task_id": "...", "map_handle": "...", "items": [{"id": "out_1", "text": "...[PERSON_1]..."}],
+  "actor": "analyst", "strict": true }
+```
+→ `200 { items:[{id, rehydrated_text}], stats:{tokens_substituted, unknown_tokens} }`
+- `409 {"error":"unknown_tokens","tokens":[...]}` when `strict` and a token has no map entry (your tripwire for a Claude-hallucinated/smuggled token).
+- `410 {"error":"map_expired"}` if the map TTL lapsed or the handle is unknown.
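+
+To make the token-map idea concrete, here is a toy sketch of a scrub/rehydrate pair.
+It is *not* the gateway's engine (no tiers, no NER, persons only): the function
+names and the exact-string matching are assumptions chosen to illustrate stable
+tokens on map reuse and the strict unknown-token tripwire.
+
+```python
+import re
+
+TOKEN_RE = re.compile(r"\[[A-Z]+_\d+\]")
+
+
+def scrub(text: str, persons: list[str], token_map: dict | None = None):
+    """Replace each known person with a stable [PERSON_n] token.
+    Returns (scrubbed_text, token_map); pass the map back in to reuse tokens."""
+    token_map = dict(token_map or {})
+    reverse = {real: token for token, real in token_map.items()}
+    # Longest names first so "Sarah Kim" is replaced before a bare "Sarah".
+    for name in sorted(persons, key=len, reverse=True):
+        if name not in text:
+            continue
+        token = reverse.get(name)
+        if token is None:
+            token = f"[PERSON_{len(token_map) + 1}]"
+            token_map[token] = name
+            reverse[name] = token
+        text = text.replace(name, token)
+    return text, token_map
+
+
+def rehydrate(text: str, token_map: dict, strict: bool = True) -> str:
+    """Swap real values back in; in strict mode an unmapped token raises."""
+    unknown = sorted({t for t in TOKEN_RE.findall(text) if t not in token_map})
+    if strict and unknown:
+        raise KeyError(f"unknown_tokens: {unknown}")
+    return TOKEN_RE.sub(lambda m: token_map.get(m.group(0), m.group(0)), text)
+```
+
+The `KeyError` in strict mode plays the role of the `409 unknown_tokens` response: a
+token that appears in the model's output but was never issued is treated as a signal
+to stop, not something to pass through silently.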
+
+## The dictionary is caller-supplied — and treated as sensitive
+
+You supply `known_entities` (built by your `build_known_entities`, scoped to the LP
+in play) in each `/scrub` call. Spark Control never reads your CRM, which keeps the
+package portable and means it needs no CRM credentials. The gateway treats your
+dictionary as a slice of the LP list: used transiently for the scrub, **never
+persisted beyond the resulting tokens, never logged, never echoed**. Only the
+resulting `{token → real_value}` map is held server-side.
+
+## NER backstop is load-bearing, not optional
+
+The dictionary is the deterministic floor; the local-Qwen NER pass catches the
+names the dictionary can't know (new prospects, an advisor named in passing) and flags
+**descriptive re-identifiers** ("the family that sold the mining company in Texas" →
+redacted). Under `ner=auto` (default) or `ner=qwen`, if the local Qwen is unreachable
+or no model is loaded, `/scrub` **fails closed (422)** rather than passing name-blind
+text to Claude. `ner=rules_only` is the explicit, knowing opt-out — never the silent
+fallback. The NER uses the same local Qwen at `/v1/chat/completions`; the sensitive
+text never reaches a remote model.
+
+> Verified live against Qwen3.6: an unknown "Sarah Kim from Atlas Ventures" → `[PERSON_1] from [ORG_1]`; a descriptive re-identifier → `[redacted]` + flagged.
+
+## Map-stays-local
+
+The pseudonym map (the de-anonymization key) is held only on this box, keyed by
+`map_handle`, in a TTL-swept local store on the StartOS `/data` volume (default 2h;
+survives a Spark Control restart mid-review). Never returned in full, never logged,
+never in a Claude-bound payload. `REDACTION_MAP_TTL` and `REDACTION_MAP_DB` are
+configurable via env if you want a different TTL/path.
+
+## Logging stays on your side
+
+`/scrub` and `/rehydrate` return counts-only `stats`; **your app writes the
+`interaction_log` row** (you already have `log_scrub`/`log_rehydrate`). Spark Control
+does not write to your DB and keeps no audit log of its own that contains real values.
+The `descriptive_flags` span text is in the `/scrub` *response* (to you, the local
+caller) — strip it before you persist, per your own logging rule (payload = counts only).
+
+## Acceptance — what passed
+
+1. **Parity** — the reference leak fixtures run through the live `/scrub` endpoint: every Tier-1 + Tier-2 identifier absent from the response; substance survives verbatim.
+2. **Map-leak** — no real value (incl. Tier-1) in any response body; Tier-1 values absent from the server map entirely.
+3. **Round-trip** — `/rehydrate` via the server-held map reproduces the original (Tier-1 → `[redacted]`, the only lossy part).
+4. **Handle reuse** — same entity → same token across items and across calls reusing `map_handle` (cache-stable for Claude prompt caching).
+5. **Tripwires** — 409 on a strict unmapped token; 410 on expired/unknown handle; 422 fail-closed on `tier1_action=reject`.
+6. **Live NER** — unknown names tokenized + descriptive re-identifier redacted against the real local Qwen.
+
+## Cutover
+
+Point your `SCRUB_BACKEND=gateway` client at `https://<spark-control-host>/scrub` and
+`/rehydrate` (same TLS-skip / Root-CA story as the other endpoints). The request/
+response shapes match your in-repo module, so agents cut over with no app changes.
+
+## Honest caveat (unchanged from your design)
+
+The NER pass is the probabilistic layer — it will not catch every free-text or
+descriptive re-identifier. The strong defenses remain: **minimize-first** (does Claude
+need the record content at all?), the deterministic dictionary + rules, and the
+re-identification spot-check. Treat the gateway as the enforcement *point*, not a
+guarantee that any text is safe to send.
@@ -1,10 +1,12 @@
 """OpenAI-compatible audio proxy: lets any OpenAI-shaped client (Open WebUI,
-Home Assistant, etc.) talk to Parakeet (STT) and Magpie (TTS) through one URL.
+Home Assistant, etc.) talk to Parakeet (STT) and Kokoro (TTS) through one URL.

 Endpoints exposed on spark-control's port (same as the dashboard):
-  GET  /v1/models                 — lists STT model + Magpie voices in OpenAI shape
-  POST /v1/audio/speech           — OpenAI TTS → Magpie /v1/audio/synthesize
+  GET  /v1/models                       — lists STT model + Kokoro voices in OpenAI shape
+  POST /v1/audio/speech                 — OpenAI TTS → Kokoro /v1/audio/speech
  POST /v1/audio/transcriptions         — forward to Parakeet (already OpenAI-compatible)
+  POST /api/audio/diarize-chunk         — per-chunk diarization (Parakeet container, Sortformer+TitaNet)
+  POST /api/audio/transcribe-with-speakers — ASR + diarization merged

 Both downstream services already speak HTTP on the LAN; this module just adapts
 request/response shapes so OpenAI clients don't need a custom integration.
@@ -13,10 +15,20 @@ When Parakeet returns a 500 (commonly the recurring CUDA wedge), the proxy
 returns a clearer 503 with Retry-After=60, and fires the deep-health probe in
 the background — which detects the wedge and triggers a rate-limited container
 restart inside seconds. The client's next attempt ~60s later then succeeds.
+
+TTS is intentionally simple: forward the request body to Kokoro and stream the
+response back. Kokoro-82M is reliable enough (24/24 successful renders across
+the same input lengths that broke Magpie 13/24 times) that no retry, chunking,
+or duration-validation layer is needed. This used to be a ~150-line tangle
+under v0.13.0:6's Magpie-with-chunking workaround; it's now a single forward.
 """
 from __future__ import annotations
 import asyncio
+import io
+import json
 import logging
+import wave
+from array import array
 from typing import Any, Optional

 import httpx
@@ -28,38 +40,33 @@ from .config import Settings

 logger = logging.getLogger("spark-control.audio")

-# Magpie voice name encodes its language. Example:
-#   Magpie-Multilingual.EN-US.Mia        -> en-US
-#   Magpie-Multilingual.ES-US.Diego      -> es-US
-#   Magpie-Multilingual.FR-FR.Pascal     -> fr-FR
-def _lang_from_voice(voice: str) -> str:
-    try:
-        parts = voice.split(".")
-        # parts = ["Magpie-Multilingual", "EN-US", "Mia"] (or with emotion suffix)
-        if len(parts) >= 2 and "-" in parts[1]:
-            lang_part = parts[1]  # "EN-US"
-            primary, region = lang_part.split("-", 1)
-            return f"{primary.lower()}-{region.upper()}"
-    except Exception:
-        pass
-    return "en-US"

+# Kokoro default voice. The four curated voices below were Alice-tested for
+# narration/recap-style content; bm_george is the default. Clients can pass
+# any of Kokoro's 67 voices in the `voice` field — see /v1/models.
+DEFAULT_VOICE = "bm_george"

-# Default voice: configurable, falls back to a sensible English voice if unset.
-DEFAULT_VOICE = "Magpie-Multilingual.EN-US.Mia"
+# Curated quick-pick voices surfaced at the top of /v1/models. The full list
+# of 67 voices is fetched live from Kokoro and appended after these.
+CURATED_VOICES: list[dict] = [
+    {"id": "bm_george", "name": "George (British male, narrator-style)",      "language": "en-GB"},
+    {"id": "bf_emma",   "name": "Emma (British female, audiobook-style)",     "language": "en-GB"},
+    {"id": "am_michael","name": "Michael (American male, warm narrator)",     "language": "en-US"},
+    {"id": "af_heart",  "name": "Heart (American female, warm and balanced)", "language": "en-US"},
+]


 class SpeechRequest(BaseModel):
-    """OpenAI /v1/audio/speech request body."""
-    model: Optional[str] = None              # ignored — Magpie has one model
+    """OpenAI /v1/audio/speech request body. Forwarded to Kokoro mostly-verbatim.
+
+    Kokoro accepts the OpenAI shape natively, so we only need to substitute the
+    default voice when the client doesn't specify one.
+    """
+    model: Optional[str] = None              # Kokoro tolerates any model id
    input: str                               # the text to speak
-    voice: Optional[str] = None              # e.g. "Magpie-Multilingual.EN-US.Mia"
-    response_format: Optional[str] = "wav"   # only "wav" supported today
-    speed: Optional[float] = 1.0             # ignored by Magpie
-    # Magpie-specific extensions (clients may pass these through)
-    language: Optional[str] = None
-    sample_rate_hz: Optional[int] = 22050
-    encoding: Optional[str] = "LINEAR_PCM"
+    voice: Optional[str] = None              # e.g. "bm_george"; default: DEFAULT_VOICE
+    response_format: Optional[str] = "wav"   # Kokoro supports wav, mp3, opus, flac
+    speed: Optional[float] = 1.0


 def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
@@ -74,15 +81,17 @@ def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
    def _parakeet_base() -> str:
        return f"http://{settings.parakeet_host}:{settings.parakeet_port}"

-    def _magpie_base() -> str:
-        return f"http://{settings.magpie_host}:{settings.magpie_port}"
+    def _kokoro_base() -> str:
+        return f"http://{settings.kokoro_host}:{settings.kokoro_port}"

    # ---- /v1/models ----
    @router.get("/v1/models")
    async def list_models() -> dict:
-        """Advertise the STT model + a small voice menu so clients can
-        populate their voice-picker UIs. Falls back gracefully if Magpie
-        is offline (returns just the STT entry)."""
+        """Advertise the STT model + Kokoro voices in OpenAI list shape.
+
+        Curated voices appear first; the rest of Kokoro's catalog follows.
+        Falls back to just the STT entry + curated voices if Kokoro is offline.
+        """
        data: list[dict] = [
            {
                "id": "parakeet-tdt-0.6b-v3",
@@ -91,66 +100,82 @@ def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
                "kind": "stt",
            },
        ]
-        # Try to enumerate voices from Magpie; if unreachable, just skip.
+        # Curated first — these are the four Alice chose for narration/recap.
+        seen = set()
+        for v in CURATED_VOICES:
+            data.append({
+                "id": v["id"],
+                "object": "model",
+                "owned_by": "kokoro",
+                "kind": "tts",
+                "display_name": v.get("name"),
+                "language": v.get("language"),
+                "curated": True,
+            })
+            seen.add(v["id"])
+
+        # Append everything else Kokoro advertises (~63 more voices across many
+        # languages). Best-effort — if Kokoro is unreachable, the curated list
+        # alone is still usable.
        try:
            async with httpx.AsyncClient(timeout=5.0) as client:
-                r = await client.get(f"{_magpie_base()}/v1/audio/list_voices")
+                r = await client.get(f"{_kokoro_base()}/v1/audio/voices")
            if r.status_code == 200:
-                voices_by_locales = r.json()
-                seen = set()
-                for _locales, payload in voices_by_locales.items():
-                    for v in payload.get("voices", []):
-                        # Collapse emotion variants — expose only the base voice name.
-                        # "Magpie-Multilingual.EN-US.Mia.Angry" -> "Magpie-Multilingual.EN-US.Mia"
-                        parts = v.split(".")
-                        base = ".".join(parts[:3]) if len(parts) >= 3 else v
-                        if base not in seen:
-                            seen.add(base)
+                body = r.json()
+                for v in body.get("voices", []):
+                    vid = v.get("id") if isinstance(v, dict) else v
+                    if not vid or vid in seen:
+                        continue
                    data.append({
-                                "id": base,
+                        "id": vid,
                        "object": "model",
-                                "owned_by": "nvidia",
+                        "owned_by": "kokoro",
                        "kind": "tts",
                    })
+                    seen.add(vid)
        except Exception as e:
-            logger.warning("magpie voice list unavailable: %s", e)
+            logger.warning("kokoro voice list unavailable: %s", e)
        return {"object": "list", "data": data}

    # ---- /v1/audio/speech (TTS) ----
    @router.post("/v1/audio/speech")
    async def speech(body: SpeechRequest) -> Response:
-        """OpenAI-style TTS. Translates to Magpie's multipart synth call.
+        """OpenAI-style TTS. Forwards to Kokoro and returns the audio bytes.

-        Returns raw WAV bytes (Content-Type: audio/wav) — browsers and most
-        clients play these directly.
+        Kokoro accepts the OpenAI shape natively. We only substitute the
+        default voice when not specified. Response is whatever format Kokoro
+        produces (WAV by default, mp3/opus/flac if the client asked for one).
+
+        No retry layer: Kokoro rendered reliably at every input length tested
+        (see the module docstring).
        """
        text = (body.input or "").strip()
        if not text:
            raise HTTPException(400, "input text is required")

        voice = body.voice or DEFAULT_VOICE
-        language = body.language or _lang_from_voice(voice)
-        sample_rate = int(body.sample_rate_hz or 22050)
-        encoding = body.encoding or "LINEAR_PCM"
-
-        form = {
-            "text": text,
-            "language": language,
+        response_format = body.response_format or "wav"
+        payload = {
+            "model": body.model or "kokoro",
+            "input": text,
            "voice": voice,
-            "sample_rate_hz": str(sample_rate),
-            "encoding": encoding,
+            "response_format": response_format,
        }
+        if body.speed is not None:
+            payload["speed"] = body.speed
+
        try:
            async with httpx.AsyncClient(timeout=120.0) as client:
-                r = await client.post(f"{_magpie_base()}/v1/audio/synthesize", data=form)
+                r = await client.post(
+                    f"{_kokoro_base()}/v1/audio/speech", json=payload
+                )
        except httpx.HTTPError as e:
-            raise HTTPException(502, f"magpie unreachable: {e}")
+            raise HTTPException(502, f"kokoro unreachable: {e}")

        if r.status_code != 200:
-            # Surface Magpie's error message verbatim so clients can debug voice/lang typos.
+            # Surface Kokoro's error verbatim (bad voice, bad format, etc.).
            raise HTTPException(r.status_code, r.text[:500])

-        # Magpie returns WAV bytes already (Content-Type: audio/wav). Pass through.
+        # Forward Kokoro's content-type so the client knows the format.
        media_type = r.headers.get("content-type", "audio/wav")
        return Response(content=r.content, media_type=media_type)

@@ -209,11 +234,11 @@ def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
            raise HTTPException(r.status_code, r.text[:500])
        return Response(content=r.content, media_type=r.headers.get("content-type", "application/json"))

-    # ---- /api/audio/diarize-chunk (per-chunk worker for Recap Relay) ----
+    # ---- /api/audio/diarize-chunk (per-chunk worker for chunked workflows) ----
    @router.post("/api/audio/diarize-chunk")
    async def diarize_chunk(file: UploadFile = File(...)) -> dict:
-        """Per-chunk worker designed for orchestrators (Recap Relay) that
-        handle chunking + cross-chunk speaker clustering themselves.
+        """Per-chunk worker designed for orchestrators that handle chunking +
+        cross-chunk speaker clustering themselves.

        Given ONE audio chunk, returns diarization segments (with LOCAL
        speaker labels — Speaker_0/1/... reset per chunk) AND a 192-dim
@@ -271,7 +296,7 @@ def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
        """Diarized transcription: run Parakeet ASR and Sortformer diarization on
        the same audio in parallel, then merge by timestamp.

-        Response shape (designed for downstream UIs like recap-relay):
+        Response shape (designed for downstream UIs):

            {
              "duration": 90.5,
@@ -299,8 +324,6 @@ def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
        filename = file.filename or "audio.wav"
        content_type = file.content_type or "application/octet-stream"

-        # Parakeet ASR + Sortformer diarizer in parallel. (A WhisperX detour
-        # lived here briefly — reverted in v0.13.0:0; see release notes.)
        async def _call_transcribe(client: httpx.AsyncClient) -> dict:
            files = {"file": (filename, body, content_type)}
            data = {"response_format": "verbose_json"}
@@ -359,9 +382,353 @@ def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
            },
        }

+    # ---- /api/audio/label-merge (diarize + name clusters from a visual timeline) ----
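+    async def _diar(client: httpx.AsyncClient, b: bytes, fn: str) -> dict:
+        """POST one audio blob to Parakeet's diarize-chunk endpoint; return its JSON."""
+        r = await client.post(f"{_parakeet_base()}/v1/audio/diarize-chunk",
+                              files={"file": (fn, b, "audio/wav")})
+        r.raise_for_status()  # non-2xx raises httpx.HTTPStatusError
+        return r.json()
+
+    async def _txn(client: httpx.AsyncClient, b: bytes, fn: str) -> dict:
+        """POST one audio blob to Parakeet ASR; return the verbose_json transcript."""
+        r = await client.post(f"{_parakeet_base()}/v1/audio/transcriptions",
+                              files={"file": (fn, b, "audio/wav")},
+                              data={"response_format": "verbose_json"})
+        r.raise_for_status()  # non-2xx raises httpx.HTTPStatusError
+        return r.json()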
+
+    @router.post("/api/audio/label-merge")
+    async def label_merge(
+        file: Optional[UploadFile] = File(default=None),
+        mic_file: Optional[UploadFile] = File(default=None),
+        system_file: Optional[UploadFile] = File(default=None),
+        timeline: str = Form(...),
+        self_name: str = Form(default="Me"),
+        self_vad: Optional[str] = Form(default=None),
+        known_voiceprints: Optional[str] = Form(default=None),
+        transcribe: bool = Form(default=False),
+        min_overlap: float = Form(default=0.0),
+        voiceprint_threshold: float = Form(default=0.5),
+    ) -> dict:
+        """Diarize audio and NAME each anonymous cluster from a caller-supplied visual
+        timeline (who-was-on-screen-when) by majority temporal overlap, with a voice-
+        fingerprint fallback. Stateless + portable — the caller owns the timeline and
+        voiceprint library; nothing is persisted here.
+
+        TWO MODES:
+
+        * MONO (legacy): send `file` (mixed mono). Diarizes the mix, names clusters.
+
+        * DUAL-CHANNEL: send `mic_file` (the local user's mic) + `system_file`
+          (everyone else, from screen capture), sample-aligned to a shared t0. This
+          uses the channels to SPLIT the problem instead of forcing the diarizer to
+          re-disentangle a mono mix:
+            - mic track  -> the local user's words, gated to windows where the mic is
+              actually the user speaking (mic louder than system — a self-VAD computed
+              server-side from the two channels, or supplied via `self_vad`). The mic
+              picks up the remote audio as quiet bleed, so this gate is LOAD-BEARING:
+              without it the bleed would be transcribed as the user.
+            - system track -> diarized (only has to separate the *remote* people, a
+              strictly easier problem) and named via the visual timeline + voiceprints.
+            - the user's clean voiceprint is enrolled from the mic track and injected
+              into the voiceprint library, so a system-track cluster that's actually the
+              user dialed in from a second device (dual-login) resolves to the user, not
+              a stranger.
+          Self-attribution becomes near-perfect (dedicated channel), remote diarization
+          gets cleaner, overlapping speech is trivially separated, and the user no longer
+          consumes one of Sortformer's 4 speaker slots.
+
+        Form fields (multipart):
+          file | (mic_file + system_file)   audio — mono mix OR the two channels
+          timeline              JSON [{"start","end","name","confidence?"}, ...] (visual hints for remote folks)
+          self_name             name for the local user (mic channel). Default "Me".
+          self_vad              optional JSON [{"start","end"}] mic-active-and-louder windows;
+                                if omitted, computed server-side by per-window RMS.
+          known_voiceprints     optional JSON {name: [192 floats]} from past calls (include the user's)
+          transcribe            "true" to attach per-segment text (always on in dual-channel)
+          min_overlap           min fraction of a cluster's time overlapping the winning name (default 0)
+          voiceprint_threshold  cosine similarity to accept a voiceprint match (default 0.5)
+        """
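+        try:
+            tl = json.loads(timeline)
+            # Explicit checks rather than `assert`, which is stripped under `python -O`.
+            # Entries must be objects because the naming code calls .get() on each one.
+            if not isinstance(tl, list) or not all(isinstance(e, dict) for e in tl):
+                raise ValueError("timeline is not a list of objects")
+        except Exception:
+            raise HTTPException(400, "timeline must be a JSON array of {start,end,name}")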
+        known_vp: dict[str, list[float]] = {}
+        if known_voiceprints:
+            try:
+                known_vp = json.loads(known_voiceprints)
+                # Explicit check: `assert` is stripped under `python -O`.
+                if not isinstance(known_vp, dict):
+                    raise ValueError("known_voiceprints is not an object")
+            except Exception:
+                raise HTTPException(400, "known_voiceprints must be a JSON object {name: [floats]}")
+
+        dual = mic_file is not None and system_file is not None
+        if not dual and file is None:
+            raise HTTPException(400, "provide either 'file' (mono) or both 'mic_file' and 'system_file'")
+
+        try:
+            async with httpx.AsyncClient(timeout=600.0) as client:
+                if dual:
+                    return await _label_merge_dual(
+                        client, _diar, _txn, await mic_file.read(), await system_file.read(),
+                        tl, self_name, self_vad, known_vp, min_overlap, voiceprint_threshold)
+                body = await file.read()
+                if not body:
+                    raise HTTPException(400, "Empty file")
+                fn = file.filename or "audio.wav"
+                if transcribe:
+                    diar, stt = await asyncio.gather(_diar(client, body, fn), _txn(client, body, fn))
+                else:
+                    diar, stt = await _diar(client, body, fn), None
+        except HTTPException:
+            raise
+        except httpx.HTTPStatusError as e:
+            if e.response.status_code == 500 and deep_health is not None:
+                try:
+                    asyncio.create_task(deep_health.run_one("parakeet"))
+                except Exception:
+                    pass
+                raise HTTPException(503, "Parakeet transient error (likely CUDA wedge). Retry in ~60s.",
+                                    headers={"Retry-After": "60"})
+            raise HTTPException(e.response.status_code, e.response.text[:500])
+        except httpx.HTTPError as e:
+            raise HTTPException(502, f"parakeet unreachable: {e}")
+
+        # ---- MONO path ----
+        diar_segments = diar.get("segments", [])
+        fingerprints = diar.get("fingerprints", {}) or {}
+        clusters = diar.get("speakers_detected", [])
+        assignment = _name_clusters(diar_segments, fingerprints, clusters, tl, known_vp,
+                                    min_overlap, voiceprint_threshold)
+        relabeled_turns = [
+            {"start_s": s.get("start_s"), "end_s": s.get("end_s"),
+             "speaker": assignment[s.get("speaker")]["name"]}
+            for s in diar_segments if s.get("speaker") in assignment
+        ]
+        if transcribe and stt is not None:
+            out_segments = _merge_words_with_speakers(stt.get("words", []), relabeled_turns)
+        else:
+            out_segments = [{
+                "start_s": s.get("start_s"), "end_s": s.get("end_s"),
+                "speaker": assignment.get(s.get("speaker"), {}).get("name", s.get("speaker")),
+                "confidence": s.get("confidence"),
+            } for s in diar_segments]
+        speakers, named_fingerprints = _speaker_list(clusters, assignment, fingerprints)
+        return {
+            "mode": "mono",
+            "duration": diar.get("duration", 0.0),
+            "speakers": speakers,
+            "segments": out_segments,
+            "fingerprints": named_fingerprints,
+            "models": diar.get("models", {}),
+        }
+
    return router


+# ---- Label-merge helpers ----
+
+def _overlap_seconds(a0: float, a1: float, b0: float, b1: float) -> float:
+    return max(0.0, min(a1, b1) - max(a0, b0))
+
+
+def _cosine(a: Optional[list], b: Optional[list]) -> float:
+    if not a or not b or len(a) != len(b):
+        return 0.0
+    dot = sum(x * y for x, y in zip(a, b))
+    na = sum(x * x for x in a) ** 0.5
+    nb = sum(x * x for x in b) ** 0.5
+    if na == 0 or nb == 0:
+        return 0.0
+    return dot / (na * nb)
+
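As a quick sanity check of the two helpers above, the sketch below copies them under hypothetical public names (`overlap_seconds`, `cosine`) and runs them on toy values. The inputs are illustrative only and are not taken from real diarization output.

```python
from typing import Optional


def overlap_seconds(a0: float, a1: float, b0: float, b1: float) -> float:
    # Length of the intersection of [a0, a1] and [b0, b1], never negative.
    return max(0.0, min(a1, b1) - max(a0, b0))


def cosine(a: Optional[list], b: Optional[list]) -> float:
    # Cosine similarity; 0.0 for missing, mismatched, or zero-norm vectors.
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


# A diarized turn [2, 6] and a timeline entry [4, 9] share two seconds.
shared = overlap_seconds(2.0, 6.0, 4.0, 9.0)
# Disjoint windows share nothing.
disjoint = overlap_seconds(0.0, 1.0, 3.0, 4.0)
# Identical direction scores 1.0, orthogonal scores 0.0.
same = cosine([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
# A length mismatch (e.g. a voiceprint from a different model) scores 0.0.
mismatched = cosine([1.0, 2.0], [1.0])
```

Returning 0.0 for a length mismatch means a stale voiceprint of the wrong dimension can never clear `voiceprint_threshold`; it is simply ignored rather than raising.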
+
+def _name_clusters(diar_segments, fingerprints, clusters, tl, known_vp,
+                   min_overlap, voiceprint_threshold):
+    """Assign a name to each anonymous diarization cluster: visual-timeline overlap
+    winner -> closest known-voiceprint match -> Unknown_N. Shared by mono + dual."""
+    cluster_dur: dict[str, float] = {}
+    cluster_name_overlap: dict[str, dict[str, float]] = {}
+    for seg in diar_segments:
+        spk = seg.get("speaker")
+        s0, s1 = float(seg.get("start_s", 0)), float(seg.get("end_s", 0))
+        cluster_dur[spk] = cluster_dur.get(spk, 0.0) + max(0.0, s1 - s0)
+        for entry in tl:
+            name = (entry.get("name") or "").strip()
+            if not name:
+                continue
+            ov = _overlap_seconds(s0, s1, float(entry.get("start", 0)), float(entry.get("end", 0)))
+            if ov > 0:
+                cluster_name_overlap.setdefault(spk, {})
+                cluster_name_overlap[spk][name] = cluster_name_overlap[spk].get(name, 0.0) + ov
+    assignment: dict[str, dict] = {}
+    used_unknown = 0
+    for cluster in clusters:
+        names = cluster_name_overlap.get(cluster, {})
+        total = cluster_dur.get(cluster, 0.0) or 1.0
+        if names:
+            winner = max(names.items(), key=lambda kv: kv[1])
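+            # Clamp: timeline entries for the same name that overlap each other
+            # would otherwise double-count and report a confidence above 1.0.
+            conf = min(1.0, winner[1] / total)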
+            if conf >= min_overlap:
+                assignment[cluster] = {"name": winner[0], "source": "visual",
+                                       "overlap_confidence": round(conf, 4)}
+                continue
+        fp = fingerprints.get(cluster)
+        best_name, best_sim = None, 0.0
+        if fp and known_vp:
+            for nm, vec in known_vp.items():
+                sim = _cosine(fp, vec)
+                if sim > best_sim:
+                    best_name, best_sim = nm, sim
+        if best_name and best_sim >= voiceprint_threshold:
+            assignment[cluster] = {"name": best_name, "source": "voiceprint",
+                                   "match_similarity": round(best_sim, 4)}
+        else:
+            assignment[cluster] = {"name": f"Unknown_{used_unknown}", "source": "unmatched"}
+            used_unknown += 1
+    return assignment
+
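To make the overlap vote in `_name_clusters` concrete, here is a simplified, standalone sketch. The function name `name_by_overlap` and the toy segments and timeline are hypothetical, and the voiceprint fallback is deliberately left out, so treat this as an illustration of the majority-overlap step only.

```python
def name_by_overlap(segments, timeline, min_overlap=0.0):
    """Majority-overlap vote only; clusters without a winner become Unknown_N."""
    dur, votes = {}, {}
    for seg in segments:
        spk, s0, s1 = seg["speaker"], seg["start_s"], seg["end_s"]
        dur[spk] = dur.get(spk, 0.0) + max(0.0, s1 - s0)
        for entry in timeline:
            ov = max(0.0, min(s1, entry["end"]) - max(s0, entry["start"]))
            if ov > 0:
                by_name = votes.setdefault(spk, {})
                by_name[entry["name"]] = by_name.get(entry["name"], 0.0) + ov
    out, unknown = {}, 0
    for spk in dur:
        names = votes.get(spk, {})
        if names:
            name, secs = max(names.items(), key=lambda kv: kv[1])
            conf = secs / (dur[spk] or 1.0)
            if conf >= min_overlap:
                out[spk] = (name, round(conf, 4))
                continue
        out[spk] = (f"Unknown_{unknown}", None)
        unknown += 1
    return out


segments = [
    {"speaker": "speaker_0", "start_s": 0.0, "end_s": 10.0},
    {"speaker": "speaker_1", "start_s": 10.0, "end_s": 20.0},
    {"speaker": "speaker_0", "start_s": 20.0, "end_s": 25.0},
]
timeline = [
    {"start": 0.0, "end": 12.0, "name": "Alice"},
    {"start": 12.0, "end": 20.0, "name": "Bob"},
    {"start": 20.0, "end": 26.0, "name": "Alice"},
]

lenient = name_by_overlap(segments, timeline)
strict = name_by_overlap(segments, timeline, min_overlap=0.9)
```

In this toy data `speaker_1` overlaps Bob for 8 of its 10 seconds, so it is named Bob with the default `min_overlap` of 0 but falls through to `Unknown_0` when the caller demands 0.9. In the real function that fall-through is where the voiceprint match gets its chance.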
+
+def _speaker_list(clusters, assignment, fingerprints):
+    """Build the response `speakers` list + name->fingerprint map from an assignment."""
+    speakers, named = [], {}
+    for cluster in clusters:
+        a = assignment[cluster]
+        entry = {"cluster": cluster, "name": a["name"], "source": a["source"],
+                 "fingerprint": fingerprints.get(cluster)}
+        if "overlap_confidence" in a:
+            entry["overlap_confidence"] = a["overlap_confidence"]
+        if "match_similarity" in a:
+            entry["match_similarity"] = a["match_similarity"]
+        speakers.append(entry)
+        if fingerprints.get(cluster) is not None:
+            named[a["name"]] = fingerprints.get(cluster)
+    return speakers, named
+
+
+def _wav_pcm(b: bytes):
+    """Decode a 16-bit PCM WAV to (int16 array, sample_rate); multi-channel input
+    keeps channel 0 only. Returns (None, 0) if it can't decode, in which case the
+    caller requires a client-supplied self_vad."""
+    try:
+        with wave.open(io.BytesIO(b), "rb") as w:
+            sr, n, ch, sw = w.getframerate(), w.getnframes(), w.getnchannels(), w.getsampwidth()
+            raw = w.readframes(n)
+        if sw != 2:
+            return None, 0
+        a = array("h")
+        a.frombytes(raw)
+        if ch > 1:
+            a = a[0::ch]  # take channel 0
+        return a, sr
+    except Exception:
+        return None, 0
+
+
+def _win_rms(pcm_sr, s: float, e: float) -> float:
+    """Normalized RMS (0..1) of the [s,e]-second window of a decoded PCM array."""
+    a, sr = pcm_sr
+    if a is None or sr <= 0:
+        return 0.0
+    i, j = max(0, int(s * sr)), min(len(a), int(e * sr))
+    if j <= i:
+        return 0.0
+    ss = 0
+    for x in a[i:j]:
+        ss += x * x
+    return (ss / (j - i)) ** 0.5 / 32768.0
+
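The RMS comparison that drives the self-VAD gate can be tried on synthetic audio. The sketch below is an assumption-laden illustration: `tone`, `win_rms`, and `is_local` are hypothetical stand-ins, the sample rate is fixed at 16 kHz, and the 1.2 margin mirrors the `_LOCAL_MARGIN` value used in the dual-channel path.

```python
import math
from array import array

SR = 16000      # assumed sample rate for the synthetic tracks
MARGIN = 1.2    # mirrors _LOCAL_MARGIN in the dual-channel path


def tone(amplitude: int, seconds: float = 1.0, freq: float = 440.0) -> array:
    # Synthetic int16 sine wave standing in for one decoded channel.
    n = int(SR * seconds)
    return array("h", (int(amplitude * math.sin(2 * math.pi * freq * i / SR))
                       for i in range(n)))


def win_rms(pcm: array, s: float, e: float) -> float:
    # Normalized RMS (0..1) of the [s, e]-second window.
    i, j = max(0, int(s * SR)), min(len(pcm), int(e * SR))
    if j <= i:
        return 0.0
    return (sum(x * x for x in pcm[i:j]) / (j - i)) ** 0.5 / 32768.0


def is_local(mic: array, system: array, s: float, e: float) -> bool:
    # A window counts as local only when the mic clearly dominates.
    return win_rms(mic, s, e) > win_rms(system, s, e) * MARGIN


user_speaking = is_local(tone(8000), tone(1000), 0.0, 1.0)   # mic dominant
remote_bleed = is_local(tone(1000), tone(8000), 0.0, 1.0)    # system dominant
near_tie = is_local(tone(1100), tone(1000), 0.0, 1.0)        # inside the margin
```

The near-tie case shows what the margin buys: a mic that is only about 10% louder than the system track is not accepted as local speech.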
+
+async def _label_merge_dual(client, diar_fn, txn_fn, mic_b, sys_b, tl, self_name,
+                            self_vad_json, known_vp, min_overlap, voiceprint_threshold):
+    """Dual-channel label-merge: mic track = the local user (gated to mic-dominant
+    windows so remote bleed isn't transcribed as the user); system track = diarized +
+    named remote speakers. See label_merge docstring for the full rationale."""
+    if not mic_b or not sys_b:
+        raise HTTPException(400, "empty mic_file or system_file")
+
+    # Run all four requests concurrently: diarize + transcribe the system track, and
+    # transcribe + diarize the mic track. The mic diarization is only needed for its
+    # fingerprints, which supply the user's clean enrollment voiceprint.
+    sys_diar, sys_stt, mic_stt, mic_diar = await asyncio.gather(
+        diar_fn(client, sys_b, "system.wav"), txn_fn(client, sys_b, "system.wav"),
+        txn_fn(client, mic_b, "mic.wav"), diar_fn(client, mic_b, "mic.wav"))
+
+    # Enroll the user's voiceprint = fingerprint of the dominant cluster on the mic track.
+    self_vp = None
+    mic_fps = mic_diar.get("fingerprints", {}) or {}
+    if mic_fps:
+        durs: dict[str, float] = {}
+        for s in mic_diar.get("segments", []):
+            durs[s["speaker"]] = durs.get(s["speaker"], 0.0) + (s["end_s"] - s["start_s"])
+        top = max(durs, key=durs.get) if durs else next(iter(mic_fps))
+        self_vp = mic_fps.get(top)
+    # Inject self voiceprint so a dual-login (phone) system cluster resolves to the user.
+    vp_lib = dict(known_vp)
+    if self_vp is not None:
+        vp_lib.setdefault(self_name, self_vp)
+
+    # Name the SYSTEM clusters (remote people, possibly incl. phone-self via voiceprint).
+    sys_segments = sys_diar.get("segments", [])
+    sys_fps = sys_diar.get("fingerprints", {}) or {}
+    sys_clusters = sys_diar.get("speakers_detected", [])
+    sys_assign = _name_clusters(sys_segments, sys_fps, sys_clusters, tl, vp_lib,
+                                min_overlap, voiceprint_threshold)
+    sys_turns = [{"start_s": s["start_s"], "end_s": s["end_s"],
+                  "speaker": sys_assign[s["speaker"]]["name"]}
+                 for s in sys_segments if s["speaker"] in sys_assign]
+    remote_blocks = _merge_words_with_speakers(sys_stt.get("words", []), sys_turns)
+
+    # Self-VAD: keep only mic words where the mic is genuinely the local user (mic
+    # louder than system), excluding the remote bleed the mic also picks up.
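+    vad_windows = None
+    if self_vad_json:
+        try:
+            parsed = json.loads(self_vad_json)
+            # A malformed self_vad is ignored and the server-side RMS gate is used.
+            if isinstance(parsed, list) and all(isinstance(w, dict) for w in parsed):
+                vad_windows = parsed
+        except Exception:
+            vad_windows = None
+    mic_pcm = _wav_pcm(mic_b)
+    sys_pcm = _wav_pcm(sys_b)
+    # The RMS gate compares both channels, so both must decode: an undecodable system
+    # track would read as silence and let every mic word (bleed included) pass as local.
+    if vad_windows is None and (mic_pcm[0] is None or sys_pcm[0] is None):
+        raise HTTPException(400, "could not decode WAV for self-VAD; send 16-bit PCM WAV or a self_vad array")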
+
+    # Margin so the mic must be CLEARLY louder than system to count as local. This
+    # guards against brief remote bleed near utterance boundaries (local speech on
+    # the mic runs many times louder than the remote bleed it also picks up).
+    _LOCAL_MARGIN = 1.2
+
+    def _is_local(s: float, e: float) -> bool:
+        if vad_windows is not None:
+            return any(_overlap_seconds(s, e, float(w.get("start", 0)), float(w.get("end", 0))) > 0
+                       for w in vad_windows)
+        return _win_rms(mic_pcm, s, e) > _win_rms(sys_pcm, s, e) * _LOCAL_MARGIN
+
+    # Filter mic words to mic-dominant windows FIRST, then group the survivors into
+    # blocks. Filtering before grouping means a block never mixes local speech with
+    # loud bleed (which would average to system-dominant and drop the whole utterance).
+    local_words = [w for w in mic_stt.get("words", [])
+                   if _is_local(float(w.get("start", 0)), float(w.get("end", 0)))]
+    local_blocks = (_merge_words_with_speakers(
+        local_words, [{"start_s": 0.0, "end_s": 1e12, "speaker": self_name}])
+        if local_words else [])
+
+    segments = sorted(remote_blocks + local_blocks, key=lambda b: b.get("start_ms", 0))
+
+    speakers, named = _speaker_list(sys_clusters, sys_assign, sys_fps)
+    speakers.append({"cluster": "mic", "name": self_name, "source": "mic_channel",
+                     "fingerprint": self_vp})
+    if self_vp is not None:
+        named[self_name] = self_vp
+
+    return {
+        "mode": "dual_channel",
+        "duration": max(sys_diar.get("duration", 0.0), mic_stt.get("duration", 0.0)),
+        "speakers": speakers,
+        "segments": segments,
+        "fingerprints": named,
+        "models": sys_diar.get("models", {}),
+    }
+
+
 # ---- Merge helper: assign speaker to each word, then group into blocks ----

 def _assign_speaker_to_word(word_start_s: float, word_end_s: float, diar_turns: list[dict]) -> str:
@@ -32,15 +32,26 @@ class Settings:
    parakeet_host: str
    parakeet_user: str
    parakeet_container: str
-    magpie_host: str
-    magpie_user: str
-    magpie_container: str
+    kokoro_host: str
+    kokoro_user: str
+    kokoro_container: str
+    embed_host: str
+    embed_user: str
+    embed_container: str
+    qdrant_host: str
+    qdrant_user: str
+    qdrant_container: str
+    qdrant_collection: str
+    redaction_map_db: str
+    redaction_map_ttl: int
    ssh_key_path: str
    ssh_known_hosts: str
    models_yaml: str
    vllm_port: int
    parakeet_port: int
-    magpie_port: int
+    kokoro_port: int
+    embed_port: int
+    qdrant_port: int
    bind_port: int
    open_webui_url: str
    ngc_api_key: str
@@ -49,7 +60,7 @@ class Settings:
    def from_env(cls) -> "Settings":
        spark2_host = _env("SPARK2_HOST")
        spark2_user = _env("SPARK2_USER")
-        # Parakeet and Magpie default to Spark 2 unless explicitly overridden.
+        # Parakeet (STT) and Kokoro (TTS) default to Spark 2 unless overridden.
        return cls(
            spark1_host=_env("SPARK1_HOST"),
            spark1_user=_env("SPARK1_USER"),
@@ -58,15 +69,29 @@ class Settings:
            parakeet_host=_env("PARAKEET_HOST") or spark2_host,
            parakeet_user=_env("PARAKEET_USER") or spark2_user,
            parakeet_container=_env("PARAKEET_CONTAINER") or "parakeet-asr",
-            magpie_host=_env("MAGPIE_HOST") or spark2_host,
-            magpie_user=_env("MAGPIE_USER") or spark2_user,
-            magpie_container=_env("MAGPIE_CONTAINER") or "magpie-tts",
+            kokoro_host=_env("KOKORO_HOST") or spark2_host,
+            kokoro_user=_env("KOKORO_USER") or spark2_user,
+            kokoro_container=_env("KOKORO_CONTAINER") or "kokoro-tts",
+            # Embeddings (spark-embed: bge-m3 dense + reranker) and Qdrant
+            # (vector storage) default to Spark 2 unless overridden.
+            embed_host=_env("EMBED_HOST") or spark2_host,
+            embed_user=_env("EMBED_USER") or spark2_user,
+            embed_container=_env("EMBED_CONTAINER") or "spark-embed",
+            qdrant_host=_env("QDRANT_HOST") or spark2_host,
+            qdrant_user=_env("QDRANT_USER") or spark2_user,
+            qdrant_container=_env("QDRANT_CONTAINER") or "qdrant",
+            qdrant_collection=_env("QDRANT_COLLECTION", ""),
+            # Redaction gateway pseudonym-map store (server-held de-anon key).
+            redaction_map_db=_env("REDACTION_MAP_DB", "/data/redaction_maps.db"),
+            redaction_map_ttl=int(_env("REDACTION_MAP_TTL", "7200")),
            ssh_key_path=_env("SSH_KEY_PATH"),
            ssh_known_hosts=_env("SSH_KNOWN_HOSTS"),
            models_yaml=_resolve_models_yaml(),
            vllm_port=int(_env("VLLM_PORT", "8888")),
            parakeet_port=int(_env("PARAKEET_PORT", "8000")),
-            magpie_port=int(_env("MAGPIE_PORT", "9000")),
+            kokoro_port=int(_env("KOKORO_PORT", "8880")),
+            embed_port=int(_env("EMBED_PORT", "8088")),
+            qdrant_port=int(_env("QDRANT_PORT", "6333")),
            bind_port=int(_env("BIND_PORT", "9999")),
            open_webui_url=_env("OPEN_WEBUI_URL", ""),
            ngc_api_key=_env("NGC_API_KEY", ""),
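The `_env(...) or spark2_host` pattern in this hunk can be sketched in isolation. The helper below is hypothetical (`resolve_host` and the dict-based `env` are not part of the package, and the real `_env` is not shown in this diff); it only illustrates how an unset or empty override falls back to the Spark 2 value.

```python
def resolve_host(env: dict, name: str, fallback: str) -> str:
    # `or` treats both a missing key and an empty string as "not overridden".
    return env.get(name, "") or fallback


env = {"SPARK2_HOST": "192.0.2.20", "QDRANT_HOST": "192.0.2.30", "EMBED_HOST": ""}
spark2_host = env["SPARK2_HOST"]

kokoro_host = resolve_host(env, "KOKORO_HOST", spark2_host)   # unset
embed_host = resolve_host(env, "EMBED_HOST", spark2_host)     # set but empty
qdrant_host = resolve_host(env, "QDRANT_HOST", spark2_host)   # overridden
```

One consequence of using `or` is that an override cannot be used to blank a host: an empty `EMBED_HOST` still resolves to Spark 2.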
@@ -4,7 +4,7 @@ Persisted to /data/connectivity.json. Schema:

    {
      "macs": { "spark1": "aa:bb:..", "spark2": "11:22:.." },
-      "current": { "spark1": "up", "parakeet": "up", "magpie": "down", ... },
+      "current": { "spark1": "up", "parakeet": "up", "kokoro": "up", ... },
      "last_change": { ... },
      "events": [
        # Active-probe transition (logged when state flips during polling)
@@ -87,7 +87,7 @@ def record_state(subject: str, reachable: bool) -> Optional[dict]:
    was recorded, else None.

    `subject` can be a Spark host key (spark1/spark2) or a service name
-    (parakeet/magpie/vllm).
+    (parakeet/kokoro/vllm).
    """
    new_state = "up" if reachable else "down"
    now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
@@ -4,8 +4,8 @@ Format:
    custom:
      - key: my-riva
        kind: stt
-        host: <spark-2-ip>
-        user: <spark-user>
+        host: <spark-host-or-ip>
+        user: <ssh-user>
        container: riva-asr
        port: 8001
        health_path: /health
@@ -8,7 +8,7 @@ real transcription returns 500 cudaErrorUnknown.

 So this module sends *real* but tiny synthetic inference requests:
  - Parakeet: 1 second of digital silence (16 kHz mono PCM, in-memory WAV)
-  - Magpie:   short text-to-speech, response audio discarded
+  - Kokoro:   short text-to-speech, response audio discarded
  - vLLM:     1-token chat completion against whatever model is loaded

 All synthetic payloads are generated on demand into BytesIO, sent over HTTP,
@@ -98,7 +98,9 @@ class DeepHealth:
        self.interval_sec = interval_sec
        self.state: dict[str, ServiceState] = {
            "parakeet": ServiceState(),
-            "magpie": ServiceState(),
+            "kokoro": ServiceState(),
+            "embeddings": ServiceState(),
+            "qdrant": ServiceState(),
            "vllm": ServiceState(),
        }
        self._stop = asyncio.Event()
@@ -133,30 +135,30 @@ class DeepHealth:
        except Exception as e:
            return ProbeResult(ok=False, at=now_iso, error=f"{type(e).__name__}: {e}")

-    async def probe_magpie(self) -> ProbeResult:
+    async def probe_kokoro(self) -> ProbeResult:
        s = self.settings
        now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
-        if not s.magpie_host:
+        if not s.kokoro_host:
            return ProbeResult(ok=False, at=now_iso, error="not configured")
-        # Magpie /v1/audio/synthesize expects multipart form-data, not JSON.
-        # The (None, value) tuple in httpx's `files=` produces a non-file form field.
-        url = f"http://{s.magpie_host}:{s.magpie_port}/v1/audio/synthesize"
-        form: dict = {"text": (None, "hi"), "language": (None, "en-US")}
+        # Kokoro is OpenAI-shape: POST /v1/audio/speech with JSON body. We don't
+        # care about the audio body; just confirm the model produces a 200.
+        url = f"http://{s.kokoro_host}:{s.kokoro_port}/v1/audio/speech"
+        body = {"model": "kokoro", "input": "hi", "voice": "bm_george",
+                "response_format": "wav"}
        t0 = time.monotonic()
        try:
            async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
-                r = await c.post(url, files=form)
+                r = await c.post(url, json=body)
            latency = round((time.monotonic() - t0) * 1000)
            if 200 <= r.status_code < 300:
                return ProbeResult(ok=True, at=now_iso, latency_ms=latency)
-            # 4xx that aren't 5xx mean server is alive but our payload is off —
-            # don't classify as wedge.
+            # 4xx (bad voice, bad params) means server is alive — don't wedge-classify.
            if 400 <= r.status_code < 500:
                return ProbeResult(
                    ok=True,
                    at=now_iso,
                    latency_ms=latency,
-                    note=f"{r.status_code} — server alive (probe payload may need a voice name)",
+                    note=f"{r.status_code} — server alive (probe payload may need adjustment)",
                )
            return ProbeResult(
                ok=False,
@@ -167,6 +169,52 @@ class DeepHealth:
        except Exception as e:
            return ProbeResult(ok=False, at=now_iso, error=f"{type(e).__name__}: {e}")

+    async def probe_embeddings(self) -> ProbeResult:
+        s = self.settings
+        now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
+        if not s.embed_host:
+            return ProbeResult(ok=False, at=now_iso, error="not configured")
+        base = f"http://{s.embed_host}:{s.embed_port}"
+        t0 = time.monotonic()
+        try:
+            async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
+                # First check readiness; the model takes a while to load on boot.
+                h = await c.get(f"{base}/health")
+                # Parse the body once. A 200 whose status is not "ready" means the
+                # models are still loading (warming, not a wedge).
+                health = h.json() if h.status_code == 200 else None
+                if isinstance(health, dict) and health.get("status") != "ready":
+                    return ProbeResult(ok=True, at=now_iso, note="loading models (warming)")
+                r = await c.post(f"{base}/embed", json={"input": "health probe"})
+            latency = round((time.monotonic() - t0) * 1000)
+            if 200 <= r.status_code < 300:
+                return ProbeResult(ok=True, at=now_iso, latency_ms=latency)
+            if r.status_code == 503:
+                # spark-embed says model loading — warming, not wedged.
+                return ProbeResult(ok=True, at=now_iso, latency_ms=latency, note="model loading (503)")
+            return ProbeResult(ok=False, at=now_iso, latency_ms=latency,
+                               error=f"HTTP {r.status_code}: {r.text[:240]}")
+        except Exception as e:
+            # Connection refused during boot is warming, not a wedge — same
+            # philosophy as the vllm idle case; don't trigger auto-restart.
+            return ProbeResult(ok=True, at=now_iso, note=f"unreachable/warming: {type(e).__name__}")
+
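The status handling in the Kokoro and embeddings probes boils down to a small decision table. The sketch below restates it as two pure functions; the names `classify_tts_status` and `classify_embed_status` and the `(ok, note)` tuple are hypothetical simplifications of `ProbeResult`, not part of the module.

```python
from typing import Optional, Tuple


def classify_tts_status(status: int) -> Tuple[bool, Optional[str]]:
    # 2xx: healthy. 4xx: server is alive, the probe payload was rejected.
    # Anything else (5xx) is a real failure that may trigger a restart.
    if 200 <= status < 300:
        return True, None
    if 400 <= status < 500:
        return True, f"{status}: server alive (probe payload may need adjustment)"
    return False, f"HTTP {status}"


def classify_embed_status(status: int) -> Tuple[bool, Optional[str]]:
    # Same idea, except 503 is read as "model still loading", not a wedge.
    if 200 <= status < 300:
        return True, None
    if status == 503:
        return True, "model loading (503)"
    return False, f"HTTP {status}"


tts_ok = classify_tts_status(200)
tts_bad_voice = classify_tts_status(422)
tts_wedged = classify_tts_status(500)
embed_warming = classify_embed_status(503)
embed_failed = classify_embed_status(500)
```

Keeping "alive but unhappy" separate from "failed" matters because only a failed probe feeds the auto-restart path.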
+    async def probe_qdrant(self) -> ProbeResult:
+        s = self.settings
+        now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
+        if not s.qdrant_host:
+            return ProbeResult(ok=False, at=now_iso, error="not configured")
+        base = f"http://{s.qdrant_host}:{s.qdrant_port}"
+        t0 = time.monotonic()
+        try:
+            async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
+                r = await c.get(f"{base}/readyz")
+            latency = round((time.monotonic() - t0) * 1000)
+            if 200 <= r.status_code < 300:
+                return ProbeResult(ok=True, at=now_iso, latency_ms=latency)
+            return ProbeResult(ok=False, at=now_iso, latency_ms=latency,
+                               error=f"HTTP {r.status_code}: {r.text[:240]}")
+        except Exception as e:
+            return ProbeResult(ok=False, at=now_iso, error=f"{type(e).__name__}: {e}")
+
    async def probe_vllm(self) -> ProbeResult:
        s = self.settings
        now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
@@ -233,7 +281,9 @@ class DeepHealth:

    PROBES = {
        "parakeet": "probe_parakeet",
-        "magpie": "probe_magpie",
+        "kokoro": "probe_kokoro",
+        "embeddings": "probe_embeddings",
+        "qdrant": "probe_qdrant",
        "vllm": "probe_vllm",
    }

@@ -302,6 +352,18 @@ class DeepHealth:
        svc = services[service]
        if not svc.host or not svc.user:
            return
+        # Only auto-restart GPU model servers (stt/tts/embedding). A vector DB
+        # (qdrant, kind=vectordb) holds the only copy of the index — a restart
+        # on a benign/transient probe error (e.g. a 404 on a not-yet-created
+        # collection, or a 5xx during HNSW build) could corrupt or interrupt a
+        # write. Never auto-restart it; surface the failure instead.
+        from .services import RESTARTABLE_KINDS
+        if svc.kind not in RESTARTABLE_KINDS:
+            record_report(
+                service, ok=False, source="deep-health",
+                detail=f"probe failed but kind='{svc.kind}' is not auto-restartable; manual check needed",
+            )
+            return
        result = await run_action(self.settings, svc, "restart")
        st.restarts.append(now)
        ok = result.get("ok", False)
@@ -0,0 +1,338 @@
+"""OpenAI-compatible embeddings + rerank + hybrid-search proxy.
+
+Fronts two services that live on Spark 2:
+  * spark-embed (GPU): BAAI/bge-m3 dense embeddings + bge-reranker-v2-m3 rerank
+  * Qdrant (CPU): vector storage with hybrid dense+sparse retrieval
+
+So agent/CRM clients only ever talk to one trusted host (Spark Control) for
+embeddings, reranking, and retrieval — same TLS cert + allowlist as the LLM and
+audio proxies.
+
+Endpoints:
+  POST /v1/embeddings   — OpenAI-shape dense embeddings -> spark-embed /embed
+  POST /v1/rerank       — cross-encoder rerank -> spark-embed /rerank
+  POST /api/search      — orchestrated retrieval: embed query -> Qdrant
+                          (hybrid when a sparse vector is supplied, else dense)
+                          -> optional cross-encoder rerank -> top_k
+
+Sparse/BM25 design note: spark-embed serves DENSE only. For hybrid lexical
+retrieval (which matters for entity-heavy data — exact names/tickers), the
+caller's ingest pipeline generates BM25 term-weights client-side (FastEmbed
+Qdrant/bm25) and upserts them as a named sparse vector with Qdrant's
+modifier:idf. At query time the caller passes that sparse vector in the
+/api/search body and we fuse dense+sparse with RRF inside Qdrant. If no sparse
+vector is supplied, /api/search degrades cleanly to dense + rerank.
+"""
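As a rough illustration of the hybrid design described in the docstring above, the sketch below builds a request body for Qdrant's Query API (`POST /collections/{collection}/points/query`). The helper name `build_query_body` and its parameters are hypothetical, and the body that `/api/search` actually sends is not shown in this part of the diff, so read this as one plausible shape rather than the module's implementation.

```python
from typing import Optional


def build_query_body(dense: list, sparse: Optional[dict] = None, limit: int = 50,
                     dense_name: str = "dense", sparse_name: str = "sparse",
                     fusion: str = "rrf") -> dict:
    # Dense-only: a single nearest-neighbour query on the named dense vector.
    if sparse is None:
        return {"query": dense, "using": dense_name,
                "limit": limit, "with_payload": True}
    # Hybrid: one prefetch per vector space, fused server-side by Qdrant.
    return {
        "prefetch": [
            {"query": dense, "using": dense_name, "limit": limit},
            {"query": {"indices": sparse["indices"], "values": sparse["values"]},
             "using": sparse_name, "limit": limit},
        ],
        "query": {"fusion": fusion},
        "limit": limit,
        "with_payload": True,
    }


dense_only = build_query_body([0.1, 0.2, 0.3])
hybrid = build_query_body([0.1, 0.2, 0.3],
                          sparse={"indices": [7, 42], "values": [1.5, 0.4]})
```

The fallback in the first branch matches the "degrades cleanly to dense" behaviour described above: no sparse vector means no prefetch and no fusion step.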
+from __future__ import annotations
+import logging
+import time
+from typing import Any, Optional, Union
+
+import httpx
+from fastapi import APIRouter, HTTPException
+from pydantic import BaseModel, Field
+
+from .config import Settings
+
+logger = logging.getLogger("spark-control.embeddings")
+
+# Embedding/rerank can be slow on a cold model; search is interactive.
+EMBED_TIMEOUT = 120.0
+QDRANT_TIMEOUT = 30.0
+RERANK_TIMEOUT = 120.0
+# Max candidates sent to the reranker in one call. MUST match spark-embed's
+# RERANK_MAX_DOCS (200): exceeding it makes spark-embed return 413, and
+# /api/search would then silently fall back to fused order.
+RERANK_DOC_CAP = 200
+
+
+# Request models are defined at MODULE scope (not inside build_router). This
+# module uses `from __future__ import annotations`, so FastAPI resolves the
+# string annotations against module globals; a BaseModel defined locally can't
+# be resolved and its parameter is treated as a query parameter (422 "field
+# required"). A body param must therefore reference a module-level class.
+class EmbeddingsBody(BaseModel):
+    input: Union[str, list[str]]
+    model: Optional[str] = None              # advisory; spark-embed has one model
+    encoding_format: Optional[str] = "float"  # accepted for OpenAI compatibility; not forwarded to spark-embed
+    normalize: bool = True
+
+
+class RerankBody(BaseModel):
+    query: str
+    documents: list[str]
+    top_n: Optional[int] = None
+    model: Optional[str] = None
+    return_documents: bool = False
+
+
+class SearchBody(BaseModel):
+    query: str
+    collection: Optional[str] = None         # falls back to settings.qdrant_collection
+    top_k: int = 8
+    retrieve_n: Optional[int] = None         # first-stage candidates; default max(50, top_k*10)
+    # Optional caller-supplied BM25/sparse vector for hybrid retrieval.
+    sparse: Optional[dict] = None            # {"indices": [...], "values": [...]}
+    dense_vector_name: str = "dense"
+    sparse_vector_name: str = "sparse"
+    fusion: str = "rrf"                      # "rrf" | "dbsf"
+    filter: Optional[dict] = None            # raw Qdrant filter object
+    rerank: bool = True
+    text_field: str = "text"                 # payload field holding chunk text (for rerank)
+    with_payload: bool = True
+    min_score: Optional[float] = None
+
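The `retrieve_n` default and the reranker cap interact, and a small sketch makes the arithmetic visible. The helper `first_stage_limit` is hypothetical; it assumes the first-stage candidate count is clamped to `RERANK_DOC_CAP` only when reranking is enabled, which is an assumption about the search handler rather than something shown in this part of the diff.

```python
from typing import Optional

RERANK_DOC_CAP = 200  # same value as the module constant


def first_stage_limit(top_k: int, retrieve_n: Optional[int] = None,
                      rerank: bool = True) -> int:
    # Default candidate pool: max(50, top_k * 10), as documented on SearchBody.
    n = retrieve_n if retrieve_n is not None else max(50, top_k * 10)
    # Never fetch fewer candidates than the caller wants back.
    n = max(n, top_k)
    # Assumed: only the rerank path needs to respect the reranker's document cap.
    return min(n, RERANK_DOC_CAP) if rerank else n


default_pool = first_stage_limit(top_k=8)
small_top_k = first_stage_limit(top_k=3)
capped = first_stage_limit(top_k=30)
uncapped = first_stage_limit(top_k=30, rerank=False)
```

With the default `top_k` of 8 the pool is 80 candidates, comfortably under the cap; the cap only bites once `top_k` exceeds 20 or the caller passes a large `retrieve_n`.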
+
+def build_router(settings: Settings) -> APIRouter:
+    router = APIRouter()
+
+    def _embed_base() -> str:
+        return f"http://{settings.embed_host}:{settings.embed_port}"
+
+    def _qdrant_base() -> str:
+        return f"http://{settings.qdrant_host}:{settings.qdrant_port}"
+
+    async def _post(url: str, json_body: dict, timeout: float, who: str) -> httpx.Response:
+        try:
+            async with httpx.AsyncClient(timeout=timeout) as client:
+                return await client.post(url, json=json_body)
+        except httpx.HTTPError as e:
+            raise HTTPException(502, f"{who} unreachable: {e}")
+
+    # ---- POST /v1/embeddings (OpenAI-compatible) ----
+    @router.post("/v1/embeddings")
+    async def embeddings(body: EmbeddingsBody) -> dict:
+        """OpenAI /v1/embeddings. Forwards to spark-embed and returns the
+        OpenAI list shape so off-the-shelf OpenAI clients work unchanged."""
+        if not settings.embed_host:
+            raise HTTPException(503, "embedding service not configured")
+        texts = [body.input] if isinstance(body.input, str) else list(body.input)
+        if not texts:
+            raise HTTPException(400, "input is required")
+        r = await _post(
+            f"{_embed_base()}/embed",
+            {"input": texts, "normalize": body.normalize},
+            EMBED_TIMEOUT, "embedding service",
+        )
+        if r.status_code != 200:
+            raise HTTPException(r.status_code, r.text[:500])
+        payload = r.json()
+        vectors = payload.get("embeddings", [])
+        data = [
+            {"object": "embedding", "index": i, "embedding": v}
+            for i, v in enumerate(vectors)
+        ]
+        return {
+            "object": "list",
+            "data": data,
+            "model": payload.get("model", body.model or "BAAI/bge-m3"),
+            "usage": {"prompt_tokens": 0, "total_tokens": 0},
+        }
+
+    # ---- POST /v1/rerank (Cohere/Jina-ish) ----
+    @router.post("/v1/rerank")
+    async def rerank(body: RerankBody) -> dict:
+        """Cross-encoder rerank of `documents` against `query` -> spark-embed."""
+        if not settings.embed_host:
+            raise HTTPException(503, "embedding service not configured")
+        if not body.documents:
+            raise HTTPException(400, "documents is required")
+        r = await _post(
+            f"{_embed_base()}/rerank",
+            {
+                "query": body.query,
+                "documents": body.documents,
+                "top_n": body.top_n,
+                "return_documents": body.return_documents,
+            },
+            RERANK_TIMEOUT, "embedding service",
+        )
+        if r.status_code != 200:
+            raise HTTPException(r.status_code, r.text[:500])
+        payload = r.json()
+        # Normalize to a Cohere-ish shape: results[].relevance_score
+        results = []
+        for item in payload.get("results", []):
+            out = {"index": item["index"], "relevance_score": item["score"]}
+            if body.return_documents and "document" in item:
+                out["document"] = item["document"]
+            results.append(out)
+        return {"object": "rerank.result", "model": payload.get("model"), "results": results}
+
+    # ---- POST /api/search (orchestrated hybrid retrieval) ----
+    @router.post("/api/search")
+    async def search(body: SearchBody) -> dict:
+        """Embed the query (dense, spark-embed), retrieve from Qdrant (hybrid
+        dense+sparse with RRF when a sparse vector is supplied, else dense),
+        optionally cross-encoder rerank the candidates, return top_k.
+
+        Uses Qdrant's modern Query API (points/query with prefetch + fusion) —
+        NOT the deprecated points/search.
+        """
+        if not settings.embed_host:
+            raise HTTPException(503, "embedding service not configured")
+        if not settings.qdrant_host:
+            raise HTTPException(503, "qdrant not configured")
+        collection = body.collection or settings.qdrant_collection
+        if not collection:
+            raise HTTPException(400, "collection is required (no default configured)")
+
+        top_k = max(1, min(body.top_k, 100))
+        retrieve_n = body.retrieve_n or max(50, top_k * 10)
+        retrieve_n = max(top_k, min(retrieve_n, 500))
+        want_payload = body.with_payload or body.rerank  # rerank needs the text
+
+        t0 = time.time()
+        # 1. Dense-embed the query.
+        er = await _post(
+            f"{_embed_base()}/embed",
+            {"input": body.query, "normalize": True},
+            EMBED_TIMEOUT, "embedding service",
+        )
+        if er.status_code != 200:
+            raise HTTPException(er.status_code, er.text[:500])
+        dense_vec = (er.json().get("embeddings") or [[]])[0]
+        if not dense_vec:
+            raise HTTPException(502, "embedding service returned no vector")
+        embed_ms = round((time.time() - t0) * 1000)
+
+        # 2. Build the Qdrant Query API body.
+        dense_branch = {
+            "query": dense_vec,
+            "using": body.dense_vector_name,
+            "limit": retrieve_n,
+        }
+        if body.filter:
+            dense_branch["filter"] = body.filter
+
+        if body.sparse and body.sparse.get("indices"):
+            sparse_branch = {
+                "query": {
+                    "indices": body.sparse["indices"],
+                    "values": body.sparse.get("values", []),
+                },
+                "using": body.sparse_vector_name,
+                "limit": retrieve_n,
+            }
+            if body.filter:
+                sparse_branch["filter"] = body.filter
+            query_body: dict[str, Any] = {
+                "prefetch": [dense_branch, sparse_branch],
+                "query": {"fusion": body.fusion if body.fusion in ("rrf", "dbsf") else "rrf"},
+                "limit": retrieve_n,
+                "with_payload": want_payload,
+            }
+        else:
+            # Dense-only retrieval.
+            query_body = {
+                "query": dense_vec,
+                "using": body.dense_vector_name,
+                "limit": retrieve_n,
+                "with_payload": want_payload,
+            }
+            if body.filter:
+                query_body["filter"] = body.filter
+
+        t1 = time.time()
+        qr = await _post(
+            f"{_qdrant_base()}/collections/{collection}/points/query",
+            query_body, QDRANT_TIMEOUT, "qdrant",
+        )
+        if qr.status_code == 404:
+            raise HTTPException(404, f"qdrant collection '{collection}' not found")
+        if qr.status_code != 200:
+            raise HTTPException(qr.status_code, qr.text[:500])
+        points = (qr.json().get("result") or {}).get("points", [])
+        qdrant_ms = round((time.time() - t1) * 1000)
+
+        # 3. Optional cross-encoder rerank over retrieved candidates.
+        rerank_ms = 0
+        reranked = False
+        rerank_truncated = False
+        if body.rerank and points:
+            docs, idx_map = [], []
+            for i, p in enumerate(points):
+                # Cap candidates at the rerank service's per-call limit. Points
+                # are fused-ordered (best first), so the first RERANK_DOC_CAP
+                # with text are the strongest candidates — truncating the tail
+                # is safe and avoids a 413 that would silently disable rerank.
+                if len(docs) >= RERANK_DOC_CAP:
+                    rerank_truncated = True
+                    break
+                text = (p.get("payload") or {}).get(body.text_field)
+                if isinstance(text, str) and text.strip():
+                    docs.append(text)
+                    idx_map.append(i)
+            if docs:
+                t2 = time.time()
+                rr = await _post(
+                    f"{_embed_base()}/rerank",
+                    {"query": body.query, "documents": docs},
+                    RERANK_TIMEOUT, "embedding service",
+                )
+                if rr.status_code == 200:
+                    reranked = True
+                    rerank_ms = round((time.time() - t2) * 1000)
+                    order = rr.json().get("results", [])  # sorted desc by score
+                    new_points = []
+                    for res in order:
+                        p = points[idx_map[res["index"]]]
+                        p = dict(p)
+                        p["_rerank_score"] = res["score"]
+                        new_points.append(p)
+                    # Append every point that was not reranked (no text, past
+                    # RERANK_DOC_CAP, or omitted by the reranker) after the reranked ones.
+                    reranked_ids = {id(points[idx_map[r["index"]]]) for r in order}
+                    for p in points:
+                        if id(p) not in reranked_ids:
+                            new_points.append(dict(p))
+                    points = new_points
+                else:
+                    logger.warning("rerank failed (%s); returning fused order", rr.status_code)
+
+        # 4. Assemble top_k results. Filter THEN slice so a min_score cutoff
+        #    doesn't starve the result set (qualifying candidates past the raw
+        #    top_k position still count). Apply min_score per-score-type: when
+        #    reranked, only gate points that actually carry a rerank score —
+        #    don't compare a cross-encoder logit threshold against a fused
+        #    cosine/RRF score on the no-text points appended after reranking.
+        results = []
+        for p in points:
+            if len(results) >= top_k:
+                break
+            rerank_score = p.get("_rerank_score")
+            fused_score = p.get("score")
+            score = rerank_score if rerank_score is not None else fused_score
+            if body.min_score is not None:
+                if reranked:
+                    if rerank_score is not None and rerank_score < body.min_score:
+                        continue
+                elif score is not None and score < body.min_score:
+                    continue
+            payload = p.get("payload") or {}
+            results.append({
+                "object": "search.result",
+                "index": len(results),
+                "id": p.get("id"),
+                "score": score,
+                "fused_score": fused_score,
+                "rerank_score": rerank_score,
+                "text": payload.get(body.text_field) if body.with_payload else None,
+                "payload": payload if body.with_payload else None,
+            })
+
+        return {
+            "object": "search.result_list",
+            "model": "BAAI/bge-m3+bge-reranker-v2-m3" if reranked else "BAAI/bge-m3",
+            "query": body.query,
+            "collection": collection,
+            "reranked": reranked,
+            "data": results,
+            "usage": {
+                "embed_ms": embed_ms,
+                "qdrant_ms": qdrant_ms,
+                "rerank_ms": rerank_ms,
+                "candidates": len(points),
+                "rerank_truncated": rerank_truncated,
+            },
+        }
+
+    return router
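Editorial sketch (not part of the commit): the hybrid branch above asks Qdrant to fuse the dense and sparse candidate lists with `rrf`. For readers unfamiliar with reciprocal rank fusion, the textbook formula can be sketched as below. This is not Qdrant's implementation; the function name `rrf_fuse`, the sample IDs, and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists with reciprocal rank fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in, with
    rank starting at 1. k=60 is a commonly used default and an assumption
    here, not a value taken from Qdrant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))


dense_hits = ["a", "b", "c"]    # hypothetical dense-retrieval order
sparse_hits = ["b", "d", "a"]   # hypothetical sparse-retrieval order
fused = rrf_fuse([dense_hits, sparse_hits])
```

An ID that ranks well in both lists outscores one that tops only a single list, which is why the fused order is a reasonable input for the optional rerank step.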
@@ -46,17 +46,17 @@ async def check_parakeet(settings: Settings) -> dict:
        return {"ok": False, "error": str(e), "base_url": base_url}


-async def check_magpie(settings: Settings) -> dict:
+async def check_kokoro(settings: Settings) -> dict:
    base_url = (
-        f"http://{settings.magpie_host}:{settings.magpie_port}"
-        if settings.magpie_host
+        f"http://{settings.kokoro_host}:{settings.kokoro_port}"
+        if settings.kokoro_host
        else None
    )
-    if not settings.magpie_host:
-        return {"ok": False, "error": "magpie host not configured", "base_url": base_url}
+    if not settings.kokoro_host:
+        return {"ok": False, "error": "kokoro host not configured", "base_url": base_url}
    try:
        async with httpx.AsyncClient(timeout=_TIMEOUT) as c:
-            r = await c.get(f"http://{settings.magpie_host}:{settings.magpie_port}/v1/health/ready")
+            r = await c.get(f"http://{settings.kokoro_host}:{settings.kokoro_port}/health")
            r.raise_for_status()
            return {
                "ok": True,
@@ -65,3 +65,42 @@ async def check_magpie(settings: Settings) -> dict:
            }
    except Exception as e:
        return {"ok": False, "error": str(e), "base_url": base_url}
+
+
+async def check_embeddings(settings: Settings) -> dict:
+    base_url = (
+        f"http://{settings.embed_host}:{settings.embed_port}"
+        if settings.embed_host
+        else None
+    )
+    if not settings.embed_host:
+        return {"ok": False, "error": "embedding host not configured", "base_url": base_url}
+    try:
+        async with httpx.AsyncClient(timeout=_TIMEOUT) as c:
+            r = await c.get(f"{base_url}/health")
+            r.raise_for_status()
+            detail = r.json() if r.headers.get("content-type", "").startswith("application/json") else r.text
+            # spark-embed reports {"status":"ready"|"loading", ...} — only "ready" is healthy.
+            ready = isinstance(detail, dict) and detail.get("status") == "ready"
+            return {"ok": ready, "detail": detail, "base_url": base_url,
+                    "model": detail.get("dense_model") if isinstance(detail, dict) else None}
+    except Exception as e:
+        return {"ok": False, "error": str(e), "base_url": base_url}
+
+
+async def check_qdrant(settings: Settings) -> dict:
+    base_url = (
+        f"http://{settings.qdrant_host}:{settings.qdrant_port}"
+        if settings.qdrant_host
+        else None
+    )
+    if not settings.qdrant_host:
+        return {"ok": False, "error": "qdrant host not configured", "base_url": base_url}
+    try:
+        async with httpx.AsyncClient(timeout=_TIMEOUT) as c:
+            # /readyz returns 200 "all shards are ready" when serving.
+            r = await c.get(f"{base_url}/readyz")
+            r.raise_for_status()
+            return {"ok": True, "detail": r.text.strip()[:120], "base_url": base_url}
+    except Exception as e:
+        return {"ok": False, "error": str(e), "base_url": base_url}
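Editorial sketch (not part of the commit): each `check_*` helper above catches its own exceptions and returns a dict with an `ok` key, so the checks can be awaited concurrently without one failure aborting the rest. How the project actually aggregates them is not shown in this hunk; `run_checks` and the two fake checks below are hypothetical names used only for illustration.

```python
import asyncio


async def run_checks(checks):
    """Run named async health checks concurrently and summarize them.

    `checks` maps a label to a zero-argument coroutine function returning a
    dict with at least an "ok" key, mirroring the shape the helpers return.
    """
    names = list(checks)
    results = await asyncio.gather(*(checks[n]() for n in names))
    services = dict(zip(names, results))
    # Overall health is true only when every individual check reports ok.
    return {"ok": all(r.get("ok") for r in results), "services": services}


async def _fake_ready():
    # Stand-in for a check that reached its service.
    return {"ok": True, "detail": "ready"}


async def _fake_down():
    # Stand-in for a check whose host is not configured.
    return {"ok": False, "error": "host not configured"}


summary = asyncio.run(run_checks({"qdrant": _fake_ready, "embeddings": _fake_down}))
```

In real use the zero-argument callables would wrap the helpers with their `settings` argument, for example via `functools.partial`.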
@@ -1,10 +1,10 @@
 """OpenAI-compatible chat-completions proxy that forwards to the vLLM
 process currently running on Spark 1.

-Lets clients (recap-relay, Open WebUI, etc.) use a single Spark Control
+Lets clients (Open WebUI, custom apps, etc.) use a single Spark Control
 host for everything — same TLS cert, same allowlist, same place to add
 rate limiting/observability later — instead of having to also reach
-into <spark-1-ip>:8888 directly.
+into <spark1-host>:8888 directly.

 Endpoints:
  POST /v1/chat/completions   — OpenAI chat completions (streams when stream=true)
@@ -38,16 +38,6 @@ SUGGESTED_NIMS: list[dict] = [
        "description": "Streaming speech-to-text (English). Used by Open WebUI for voice input. ~1 GB.",
        "homepage": "https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/parakeet-tdt-0-6b-v3",
    },
-    {
-        "key": "magpie-tts-multilingual",
-        "name": "Magpie TTS Multilingual",
-        "image": "nvcr.io/nim/nvidia/magpie-tts-multilingual:latest",
-        "default_container": "magpie-tts",
-        "default_port": 9000,
-        "kind": "tts",
-        "description": "Multilingual text-to-speech. Counterpart to Parakeet for 'read aloud'. ~3 GB.",
-        "homepage": "https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/magpie-tts-multilingual",
-    },
    {
        "key": "riva-multilingual",
        "name": "Riva Multilingual ASR",
@@ -0,0 +1,17 @@
+"""Redaction engine — VENDORED from the CRM repo for behavioral parity.
+
+`scrub.py` and `test_scrub_leak.py` in this directory are byte-for-byte copies of
+the CRM's reference implementation, kept verbatim so re-syncing is a trivial `cp`
+and a diff. Do NOT edit scrub.py here — change it in the CRM repo, re-vendor, and
+re-run the leak test. The Spark Control *gateway* (server-held pseudonym map, TTL,
+map_handle, local-Qwen NER backstop, the /scrub + /rehydrate HTTP contract) is
+built AROUND this engine in app/redaction_gateway.py — the engine's detection
+logic is never reimplemented.
+
+Parity source: CRM backend/redaction/scrub.py
+  sha256: 412c5fdf7006275a98fa427457293a43256165e97eebaee878c310c68cea054b
+  (re-vendored after the upstream hardening pass: currency-only amounts with a
+   word-boundary suffix, SWIFT/letter-prefixed-account Tier-1, NFKC+zero-width
+   normalization, single-pass rehydrate, and the dictionary deleted_at fix.)
+Acceptance: backend/redaction/test_scrub_leak.py — must pass against this copy.
+"""
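Editorial sketch (not part of the commit): the parity note above pins the vendored engine to a SHA-256 digest. A re-vendoring step could confirm the copy with a check like the one below. The helper names `sha256_of` and `matches_pinned` are hypothetical, and the location of the vendored file in a checkout is an assumption the caller has to supply.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned(path, expected_hex):
    """True when the file's digest equals the pinned value (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `matches_pinned(<path to the vendored scrub.py>, "412c5fdf7006275a98fa427457293a43256165e97eebaee878c310c68cea054b")` would then return `False` if the copy had drifted from the pinned source. The leak test remains the actual acceptance gate.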
@@ -0,0 +1,411 @@
+"""Redaction / re-hydration boundary — the privacy gate between Ten31's sovereign
+data and the Claude API. Implements docs/redaction-rehydration.md, hardened against an
+adversarial leak-hunt (see docs/spark-control-scrub-endpoints.md for the gateway twin).
+
+Defense in depth — NO single layer is trusted as "leak-proof":
+  1. MINIMIZE-FIRST (caller): a local-Qwen summary strips most identity before scrub runs.
+  2. PRE-NEUTRALIZE: any pre-existing [TYPE_N]-shaped string in the input is tokenized
+     first, so every placeholder that reaches Claude is one WE minted (no injection).
+  3. TIER-1 DROP: labelled/structured account-wire-SSN-IBAN-passport data, separator
+     tolerant, excised entirely (never tokenized, never in the map).
+  4. KNOWN-ENTITY tokenize: the LP identities we own (dictionary from the canonical
+     layer), matched UNICODE-FOLDED (accents/case) with hyphenated-surname extension.
+  5. STRUCTURED-PII tokenize/bucket: emails, URLs (incl. scheme-less/social), phones
+     (intl + extensions), amounts (currency words/codes/symbols + worded + ranges),
+     dates (ISO + worded + numeric + quarter), street addresses, bare long digit runs.
+  6. NER BACKSTOP (ner_fn, on-infra local Qwen): tokenizes residual unknown person/org/
+     location names the dictionary can't know. Unknown names are the largest residual,
+     so callers in production pass ner_fn and FAIL CLOSED if it is unreachable.
+
+The pseudonym map ({token: real_value}) is the de-anonymization key: local-only, NEVER
+sent to Claude, NEVER written to interaction_log (only counts).
+"""
+import json
+import re
+import sqlite3
+import unicodedata
+import uuid
+from datetime import datetime, timezone
+
+TOKEN_TYPES = ("PERSON", "ORG", "FUND", "EMAIL", "PHONE", "URL", "ADDR", "AMOUNT", "DATE", "LOC", "MISC")
+_TOKEN_RE = re.compile(r"\[(?:" + "|".join(TOKEN_TYPES) + r")_\d+\]")
+
+# ── Tier-1: NEVER-SEND (dropped, not tokenized). Separator-tolerant + label-anchored. ──
+# Separators allow space/dot/dash/SLASH/COMMA so grouped account/SSN forms can't bypass.
+_SEP = r"[\s.\-/,]"
+_LABEL = (r"(?:acct|account|a/c|wire|routing|aba|sort\s?code|ssn|social\s?security|tax\s?id|"
+          r"ein|policy|member|ref)")
+TIER1_PATTERNS = [
+    ("ssn",      re.compile(r"\b\d{3}" + _SEP + r"\d{2}" + _SEP + r"\d{4}\b")),
+    ("ssn",      re.compile(r"(?i)\b(?:ssn|social\s?security|tax\s?id|ein)\b[^\d]{0,12}\(?\d{3}\)?" + _SEP + r"{0,3}\d{2}" + _SEP + r"{0,3}\d{4}\b")),
+    ("iban",     re.compile(r"\b[A-Z]{2}\d{2}(?:\s?[A-Z0-9]){11,30}\b")),                                 # IBAN >=15 chars; excludes 12-char ISIN
+    ("swift",    re.compile(r"(?i)\b(?:swift|bic)\b[^A-Za-z0-9]{0,8}[A-Z]{4}[A-Z]{2}[A-Z0-9]{2,5}\b")),
+    ("passport", re.compile(r"(?i)\bpassport\b(?:\s?(?:no|number|num|#)\.?)?[^\dA-Za-z]{0,6}[A-Za-z]{0,2}[\s\-]?\d{6,9}\b")),
+    ("labeled_account", re.compile(r"(?i)\b" + _LABEL + r"\b[^\dA-Za-z]{0,14}[#:]?\s*[\dXx](?:[\dXx]" + _SEP + r"?){5,}\b")),
+    # labelled identifier with a LETTER prefix or an intervening 'no/number/id/ref/to' word
+    # (e.g. 'acct A123456789012', 'member ID: X4451200931', 'Wire to GB123456789012') — these
+    # slip the digit-led rule above, the bare-digit catch, and the IBAN floor.
+    ("labeled_account", re.compile(r"(?i)\b" + _LABEL + r"\b(?:[\s.:#\-]{0,3}(?:no|number|num|id|ref|to)\b)?[\s.:#\-]{0,4}[A-Za-z]{0,4}\d[\dA-Za-z]{4,}\b")),
+]
+
+# ── structured PII (Tier-2) ────────────────────────────────────────────────────
+_EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b")
+_URL_RE = re.compile(
+    r"\bhttps?://[^\s)\]]+"
+    r"|\bwww\.[^\s)\]]+"
+    r"|\b(?:[a-z0-9\-]+\.)?(?:linkedin|twitter|github|facebook|instagram|x|substack|medium)\.com/[^\s)\]]+",
+    re.IGNORECASE)
+# Phones: NANP (3-3-4, optional +1, optional extension) OR E.164/international (leading +).
+# Tightened so plain 4-4 year ranges ('2019-2024') don't match.
+_PHONE_RE = re.compile(
+    r"(?<![\w.])(?:"
+    r"(?:\+?1[\s.\-]?)?(?:\(\d{3}\)[\s.\-]?|\d{3}[\s.\-])\d{3}[\s.\-]\d{4}"
+    r"|\+\d{1,3}(?:[\s.\-]?\d){7,14}"
+    r")(?:\s?(?:x|ext\.?|extension)\s?\d{1,6})?(?![\w])")
+# Amounts: ONLY currency-anchored (symbol / code / currency-word), so non-money quantities
+# ('3m tall', 'ten million tokens', '250k followers') are NOT eaten. Bare magnitudes without
+# a currency cue are left to minimize-first + NER, which strip real money amounts.
+_NUMWORD = (r"(?:one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|"
+            r"fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|"
+            r"sixty|seventy|eighty|ninety|hundred|couple|few|several|half|a)")
+_MAG = r"(?:mm|bn|tn|thousand|million|billion|trillion|k|m|b)"   # longest-first so 'MM' isn't split into 'M'
+_AMOUNT_RES = [
+    re.compile(r"[$€£]\s?\d[\d,. ]*\d?\s?-\s?[$€£]?\s?\d[\d,. ]*\d?(?:\s?" + _MAG + r"\b)?", re.IGNORECASE),   # $3-5M range
+    re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?(?:\s?" + _MAG + r"\b)?", re.IGNORECASE),                          # $5,000,000 / $5m
+    re.compile(r"\b(?:USD|EUR|GBP|CHF|CAD|AUD)\s?[$€£]?\s?\d[\d,]*(?:\.\d+)?(?:\s?" + _MAG + r"\b)?", re.IGNORECASE),
+    re.compile(r"\b\d[\d,]*(?:\.\d+)?\s?(?:dollars?|euros?|pounds?)\b", re.IGNORECASE),                     # 5,000,000 dollars
+    re.compile(r"(?i)\b(?:" + _NUMWORD + r"[\s\-]+){1,4}" + _MAG + r"\s+(?:dollars?|euros?|pounds?)\b"),     # five million dollars
+]
+_MONTHS = (r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec)[a-z]*\.?")
+_DATE_RES = [
+    re.compile(r"\b(?:19|20)\d{2}-\d{2}-\d{2}\b"),                                                          # ISO
+    re.compile(r"(?i)\b" + _MONTHS + r"\s+\d{1,2}(?:st|nd|rd|th)?,?\s+(?:19|20)?\d{2}\b"),                   # March 12, 1986
+    re.compile(r"(?i)\b\d{1,2}(?:st|nd|rd|th)?\s+" + _MONTHS + r",?\s+(?:19|20)?\d{2}\b"),                   # 12 March 1986
+    re.compile(r"\b(?:0?[1-9]|1[0-2])[/.\-](?:0?[1-9]|[12]\d|3[01])[/.\-](?:19|20)?\d{2}\b"),                # 3/12/86 (valid m/d only)
+    re.compile(r"(?i)\bQ[1-4][\s\-]?(?:19|20)\d{2}\b"),                                                      # Q1 1986
+    re.compile(r"(?i)\b" + _MONTHS + r"\s+(?:19|20)\d{2}\b"),                                                # March 1986
+]
+# Addresses: US number-first, PO Box, and European -strasse/-gasse + 'Rue/Calle/Via X N'.
+# Comprehensive international address detection relies on the NER LOC backstop + minimize-first.
+_ADDR_RE = re.compile(
+    r"\bP\.?\s?O\.?\s?Box\s+\d+"
+    r"|\b\d{1,6}\s+(?:[A-Z][A-Za-z'.]+\s?){1,4}"
+    r"(?:Street|St|Avenue|Ave|Road|Rd|Lane|Ln|Boulevard|Blvd|Drive|Dr|Court|Ct|Way|Place|Pl|Square|Sq|Terrace|Ter)\b\.?"
+    r"(?:,?\s+[A-Z][A-Za-z]+)*"
+    r"|\b[A-Z][A-Za-z]*(?:strasse|straße|gasse|weg)\s+\d{1,5}"
+    r"|\b(?:Rue|Calle|Via|Avenida)\s+(?:[A-Z][A-Za-z'.]+\s?){1,3}\d{1,5}",
+    re.IGNORECASE)
+_ZIP_RE = re.compile(r"\b[A-Z]{2}\s+\d{5}(?:-\d{4})?\b")
+# bare long unlabeled run -> reversible [MISC]. Not glued to letters (so an ISIN/ticker like
+# US0378331005 stays intact substance), and a trailing sentence period doesn't block it.
+_BARE_DIGITS_RE = re.compile(r"(?<![\dA-Za-z.\-])\d{9,}(?![A-Za-z]|\.?\d)")
+
+_WORDX = r"[^\W_]"   # unicode word char without underscore
+
+
+def _fold(s):
+    """1:1 length-preserving fold: strip diacritics per char + casefold, so 'Jonathán'
+    matches a stored ASCII 'Jonathan'. Length preserved so match spans map to the original."""
+    out = []
+    for ch in s:
+        d = unicodedata.normalize("NFKD", ch)
+        base = "".join(c for c in d if not unicodedata.combining(c))
+        out.append((base[0] if base else ch).lower())
+    return "".join(out)
+
+
+def _bucket_amount(s):
+    num = re.sub(r"[^\d.]", "", s)
+    try:
+        v = float(num)
+    except ValueError:
+        return "~$?"
+    low = s.lower()
+    if "billion" in low or re.search(r"\d\s?bn?\b", low):
+        v *= 1_000_000_000
+    elif "million" in low or re.search(r"\d\s?mm?\b", low):
+        v *= 1_000_000
+    elif "thousand" in low or re.search(r"\d\s?k\b", low):
+        v *= 1_000
+    if v >= 1_000_000_000:
+        return f"~${round(v/1_000_000_000)}B"
+    if v >= 1_000_000:
+        return f"~${round(v/1_000_000)}M"
+    if v >= 1_000:
+        return f"~${round(v/1_000)}k"
+    return "~$<1k"
+
+
+def _bucket_date(s):
+    iso = re.match(r"((?:19|20)\d{2})-(\d{2})-\d{2}", s)
+    if iso:
+        return f"Q{(int(iso.group(2))-1)//3 + 1} {iso.group(1)}"
+    q = re.search(r"(?i)Q([1-4])[\s\-]?((?:19|20)\d{2})", s)
+    if q:
+        return f"Q{q.group(1)} {q.group(2)}"
+    y = re.search(r"\b((?:19|20)\d{2})\b", s)
+    if y:
+        return y.group(1)
+    yy = re.search(r"[/.\-](\d{2})\b", s)   # 2-digit year fallback
+    if yy:
+        return "19" + yy.group(1) if int(yy.group(1)) > 30 else "20" + yy.group(1)
+    return "(period)"
+
+
+class ScrubState:
+    """Local pseudonym map for ONE task: same surface string -> same token (injective).
+    The map is the de-anon key — local-only, never sent/serialized to a third party."""
+    def __init__(self):
+        self.token_map = {}
+        self._by_value = {}
+        self._counters = {t: 0 for t in TOKEN_TYPES}
+        self.tier1_dropped = []
+
+    def token_for(self, ttype, surface):
+        key = (ttype, surface)
+        tok = self._by_value.get(key)
+        if tok is None:
+            self._counters[ttype] += 1
+            tok = f"[{ttype}_{self._counters[ttype]}]"
+            self._by_value[key] = tok
+            self.token_map[tok] = surface
+        return tok
+
+
+def _flatten_known(known_entities):
+    if not known_entities:
+        return []
+    type_by_key = {"persons": "PERSON", "orgs": "ORG", "funds": "FUND", "emails": "EMAIL", "locations": "LOC"}
+    out = []
+    for key, ttype in type_by_key.items():
+        for s in known_entities.get(key, []) or []:
+            s = (s or "").strip()
+            if s:
+                out.append((s, ttype))
+    return out
+
+
+def _match_known(text, known_list, state):
+    """Tokenize known entities, matched UNICODE-FOLDED + case-insensitive, longest-first,
+    extending over hyphen/apostrophe compounds so a known half of a double-barrelled
+    surname pulls in the whole token. Operates by span so we can fold for matching but
+    replace the ORIGINAL surface (preserved for rehydrate)."""
+    if not known_list:
+        return text
+    folded = _fold(text)
+    pairs = sorted(((_fold(unicodedata.normalize("NFKC", s)), t) for s, t in known_list),
+                   key=lambda x: len(x[0]), reverse=True)
+    type_by_folded = {}
+    for fs, t in pairs:
+        type_by_folded.setdefault(fs, t)
+    alt = "|".join(re.escape(fs) for fs, _ in pairs if fs)
+    if not alt:
+        return text
+    rx = re.compile(r"(?<![0-9A-Za-z])(?:" + alt + r")(?![0-9A-Za-z])")
+    spans = []
+    for m in rx.finditer(folded):
+        st, en = m.start(), m.end()
+        ttype = type_by_folded.get(folded[st:en], "MISC")
+        # extend over hyphen/apostrophe compounds on both sides
+        while st > 1 and folded[st - 1] in "-'’" and re.match(_WORDX, folded[st - 2] or ""):
+            k = st - 2
+            while k >= 0 and (re.match(_WORDX, folded[k]) or folded[k] in "-'’"):
+                k -= 1
+            st = k + 1
+        while en < len(folded) - 1 and folded[en] in "-'’" and re.match(_WORDX, folded[en + 1] or ""):
+            k = en + 1
+            while k < len(folded) and (re.match(_WORDX, folded[k]) or folded[k] in "-'’"):
+                k += 1
+            en = k
+        spans.append((st, en, ttype))
+    if not spans:
+        return text
+    # merge overlaps, replace right-to-left in the ORIGINAL
+    spans.sort()
+    merged = [spans[0]]
+    for st, en, tt in spans[1:]:
+        ps, pe, ptt = merged[-1]
+        if st <= pe:
+            merged[-1] = (ps, max(pe, en), ptt)
+        else:
+            merged.append((st, en, tt))
+    for st, en, tt in reversed(merged):
+        surface = text[st:en]
+        text = text[:st] + state.token_for(tt, surface) + text[en:]
+    return text
+
+
+def scrub(text, known_entities=None, bucket=False, state=None, ner_fn=None):
+    """De-identify `text`. Returns (outbound_text, token_map, audit). Pass ner_fn (a
+    local-model NER callable text->[(surface,type)]) in production to catch unknown
+    names; without it the dictionary+regex path leaves unknown free-text names as
+    residual (callers should minimize-first and/or fail closed)."""
+    if text is None:
+        text = ""
+    st = state or ScrubState()
+    # NFKC-normalize so decomposed (NFD) names and ligatures align with the dictionary
+    # (else 'Reyés' in NFD or 'Steﬀen' with a ligature would miss and leak), and strip
+    # zero-width characters that could split a known name ('Rey<U+200B>es').
+    s = unicodedata.normalize("NFKC", str(text))
+    s = re.sub(r"[\u200b\u200c\u200d\u2060\ufeff]", "", s)
+
+    # 1) PRE-NEUTRALIZE pre-existing [TYPE_N] strings so they can't collide with our tokens.
+    s = _TOKEN_RE.sub(lambda m: st.token_for("MISC", m.group(0)), s)
+
+    # 2) TIER-1 DROP (labelled/structured; separator tolerant). Neutral marker, no value.
+    for label, pat in TIER1_PATTERNS:
+        def _drop(_m, _label=label):
+            st.tier1_dropped.append(_label)
+            return "[redacted]"
+        s = pat.sub(_drop, s)
+
+    # 3) KNOWN ENTITIES (unicode-folded, hyphen-extended).
+    s = _match_known(s, _flatten_known(known_entities), st)
+
+    # 4) STRUCTURED PII. Order matters: emails/urls/addresses, then DATES and AMOUNTS
+    #    (so dashed ISO dates / ranges aren't swallowed by the permissive phone matcher),
+    #    then PHONES, then any bare long digit run left over.
+    s = _EMAIL_RE.sub(lambda m: st.token_for("EMAIL", m.group(0)), s)
+    s = _URL_RE.sub(lambda m: st.token_for("URL", m.group(0)), s)
+    s = _ZIP_RE.sub(lambda m: st.token_for("LOC", m.group(0)), s)     # state+ZIP before ADDR (which would eat the state)
+    s = _ADDR_RE.sub(lambda m: st.token_for("ADDR", m.group(0)), s)
+    for date_re in _DATE_RES:
+        if bucket:
+            s = date_re.sub(lambda m: _bucket_date(m.group(0)), s)
+        else:
+            s = date_re.sub(lambda m: st.token_for("DATE", m.group(0)), s)
+    for amt_re in _AMOUNT_RES:
+        if bucket:
+            s = amt_re.sub(lambda m: _bucket_amount(m.group(0)), s)
+        else:
+            s = amt_re.sub(lambda m: st.token_for("AMOUNT", m.group(0)), s)
+    s = _PHONE_RE.sub(lambda m: st.token_for("PHONE", m.group(0)), s)
+    # bare long unlabeled digit runs -> reversible [MISC] (never leak digits to Claude;
+    # don't DROP, since these may be substance like share counts / security ids).
+    s = _BARE_DIGITS_RE.sub(lambda m: st.token_for("MISC", m.group(0)), s)
+
+    # 5) NER BACKSTOP for unknown names (production: local Qwen). Tokenize what it finds.
+    #    A connection failure here propagates so the caller can FAIL CLOSED rather than
+    #    emit name-blind. Sort longest-first so a full name is tokenized before its parts.
+    if ner_fn is not None:
+        for surface, ntype in sorted((ner_fn(s) or []), key=lambda e: len(e[0] or ""), reverse=True):
+            surface = (surface or "").strip()
+            if not surface or _TOKEN_RE.search(surface):
+                continue
+            tt = ntype if ntype in TOKEN_TYPES else "PERSON"
+            s = re.sub(r"(?<![0-9A-Za-z])" + re.escape(surface) + r"(?![0-9A-Za-z])",
+                       lambda m: st.token_for(tt, m.group(0)), s)
+
+    audit = {
+        "token_count": len(st.token_map),
+        "tokens_by_type": _counts_by_type(st.token_map),
+        "tier1_dropped_count": len(st.tier1_dropped),
+        "tier1_dropped_kinds": sorted(set(st.tier1_dropped)),
+        "bucketed": bool(bucket),
+        "outbound_chars": len(s),
+    }
+    return s, dict(st.token_map), audit
+
+
+def _counts_by_type(token_map):
+    out = {}
+    for tok in token_map:
+        m = re.match(r"\[([A-Z]+)_\d+\]", tok)
+        if m:
+            out[m.group(1)] = out.get(m.group(1), 0) + 1
+    return out
+
+
+def rehydrate(text, token_map):
+    """Substitute real values back in via a SINGLE non-overlapping pass (one alternation,
+    longest tokens first) so an inserted value that is itself token-shaped can't be
+    re-substituted by a later pass. Tier-1 drops are not restorable — excluded by design."""
+    s = str(text or "")
+    if not token_map:
+        return s
+    rx = re.compile("|".join(re.escape(t) for t in sorted(token_map, key=len, reverse=True)))
+    return rx.sub(lambda m: token_map[m.group(0)], s)
+
+
+def residual_tokens(text):
+    return _TOKEN_RE.findall(str(text or ""))
+
+
+# ── known-entity dictionary from the CRM (read-only) ───────────────────────────
+
+def build_known_entities(db_path):
+    """Deterministic dictionary of OUR entities to tokenize, read-only from the CRM.
+    Includes full names AND every name part (so mid-prose surnames are caught) + email
+    local-parts. RAISES on read failure — callers must fail closed, never run name-blind."""
+    persons, orgs, funds, emails = set(), set(), set(), set()
+    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
+    conn.row_factory = sqlite3.Row
+
+    def _add_person(name):
+        name = (name or "").strip()
+        if len(name) >= 2:
+            persons.add(name)
+            for part in re.split(r"[\s'’\-]+", name):
+                if len(part) >= 2 and not part.isdigit():    # index every part incl. short surnames (Wu, Li)
+                    persons.add(part)
+
+    def _safe(q, fn):
+        try:
+            for r in conn.execute(q):
+                fn(r)
+        except sqlite3.OperationalError:
+            pass    # tolerate schema variance (missing table/column); NB: swallows every OperationalError
+
+    # No `deleted_at` filter: tokenizing a soft-deleted name is desirable, and the live
+    # contacts/canonical schemas vary on that column — filtering on it silently zeroed the
+    # whole dictionary (a missing-column OperationalError swallowed by _safe).
+    _safe("SELECT display_name, primary_email FROM canonical_entities WHERE entity_kind='person'",
+          lambda r: (_add_person(r["display_name"]), r["primary_email"] and emails.add(r["primary_email"].strip().lower())))
+    _safe("SELECT first_name, last_name, email FROM contacts",
+          lambda r: (_add_person(f"{r['first_name'] or ''} {r['last_name'] or ''}"),
+                     r["email"] and emails.add(r["email"].strip().lower())))
+    _safe("SELECT full_name, email FROM fundraising_contacts",
+          lambda r: (_add_person(r["full_name"]), r["email"] and emails.add(r["email"].strip().lower())))
+    _safe("SELECT display_name FROM canonical_entities WHERE entity_kind IN ('organization','investor','lp')",
+          lambda r: r["display_name"] and orgs.add(r["display_name"].strip()))
+    _safe("SELECT name FROM organizations", lambda r: r["name"] and orgs.add(r["name"].strip()))
+    _safe("SELECT investor_name FROM fundraising_investors", lambda r: r["investor_name"] and orgs.add(r["investor_name"].strip()))
+    _safe("SELECT fund_name FROM fundraising_funds", lambda r: r["fund_name"] and funds.add(r["fund_name"].strip()))
+    conn.close()
+
+    for e in list(emails):
+        lp = e.split("@")[0]
+        if len(lp) >= 3 and not lp.isdigit():
+            persons.add(lp)
+    return {"persons": sorted(persons, key=len, reverse=True),
+            "orgs": sorted(orgs, key=len, reverse=True),
+            "funds": sorted(funds, key=len, reverse=True),
+            "emails": sorted(emails, key=len, reverse=True)}
+
+
+# ── audit logging (metadata only — never the map or real values) ───────────────
+
+def _now():
+    return datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z"
+
+
+def log_scrub(conn, actor_id, audit, task=None, session_id=None, target_id=None, source="mcp"):
+    payload = {"task": task, "session_id": session_id,
+               "token_count": audit.get("token_count"), "tokens_by_type": audit.get("tokens_by_type"),
+               "tier1_dropped_count": audit.get("tier1_dropped_count"),
+               "tier1_dropped_kinds": audit.get("tier1_dropped_kinds"),
+               "bucketed": audit.get("bucketed"), "outbound_chars": audit.get("outbound_chars")}
+    conn.execute(
+        """INSERT INTO interaction_log (id, ts, actor_type, actor_id, action, target_type, target_id, payload, source, created_at)
+           VALUES (?,?, 'agent', ?, 'redaction.scrub', 'canonical_entity', ?, ?, ?, ?)""",
+        (str(uuid.uuid4()), _now(), actor_id, target_id, json.dumps(payload), source, _now()))
+
+
+def log_rehydrate(conn, actor_id, tokens_rehydrated, residual, human_decision="pending",
+                  reviewer_id=None, task=None, session_id=None, source="mcp"):
+    payload = {"task": task, "session_id": session_id, "tokens_rehydrated": tokens_rehydrated,
+               "residual_placeholders": residual, "human_decision": human_decision, "reviewer_id": reviewer_id}
+    conn.execute(
+        """INSERT INTO interaction_log (id, ts, actor_type, actor_id, action, target_type, target_id, payload, source, created_at)
+           VALUES (?,?, 'agent', ?, 'redaction.rehydrate', 'canonical_entity', NULL, ?, ?, ?)""",
+        (str(uuid.uuid4()), _now(), actor_id, json.dumps(payload), source, _now()))
@@ -0,0 +1,182 @@
+#!/usr/bin/env python3
+"""Gateway acceptance test: runs the reference leak fixtures THROUGH the live
+/scrub + /rehydrate ASGI endpoints (ner=rules_only, deterministic/offline) plus
+the gateway-specific security contract:
+
+  - parity: every must_vanish identifier absent from /scrub responses; substance survives
+  - map-leak: no real value (incl. Tier-1) appears in any response body OR the server map's
+    Claude-bound surface; Tier-1 values are absent from the stored map entirely
+  - round-trip: /rehydrate via the server-held map reproduces raw (Tier-1 -> [redacted])
+  - handle reuse: a 2nd /scrub with the same map_handle keeps tokens stable
+  - 409 tripwire: strict /rehydrate with an unmapped token
+  - 410: rehydrate against an unknown/expired handle
+  - 422 fail-closed: tier1_action=reject on Tier-1 input emits nothing
+
+Run: cd image && python3 -m app.redaction.test_gateway   (no Spark/Qwen/network needed)
+"""
+import asyncio
+import os
+import re
+import sys
+import tempfile
+
+import httpx
+from fastapi import FastAPI
+
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+import scrub as R                       # noqa: E402  (vendored engine)
+import test_scrub_leak as REF           # noqa: E402  (reference fixtures)
+
+# Build the gateway app against a throwaway map store.
+os.environ.setdefault("SPARK1_HOST", "<spark-1-ip>")
+os.environ.setdefault("SPARK2_HOST", "<spark-2-ip>")
+from app.config import Settings                       # noqa: E402
+from app.redaction_gateway import build_router, MapStore  # noqa: E402
+
+FAILS = []
+
+
+def check(cond, msg):
+    print(("  PASS " if cond else "  FAIL ") + msg)
+    if not cond:
+        FAILS.append(msg)
+
+
+def tier1_redacted(raw):
+    s = raw
+    for _, pat in R.TIER1_PATTERNS:
+        s = pat.sub("[redacted]", s)
+    return s
+
+
+async def main():
+    db = os.path.join(tempfile.mkdtemp(), "maps.db")
+    store = MapStore(db, ttl_seconds=3600)
+    app = FastAPI()
+    app.include_router(build_router(Settings.from_env(), store))
+    transport = httpx.ASGITransport(app=app)
+
+    async with httpx.AsyncClient(transport=transport, base_url="http://gw") as c:
+        for case in REF.CASES:
+            print(f"\n[{case['name']}]")
+            r = await c.post("/scrub", json={
+                "task_id": "t-" + case["name"][:8], "actor": "analyst",
+                "items": [{"id": "ctx_1", "text": case["raw"]}],
+                "known_entities": case["known"], "ner": "rules_only",
+            })
+            check(r.status_code == 200, f"/scrub 200 (got {r.status_code} {r.text[:120]})")
+            if r.status_code != 200:
+                continue
+            d = r.json()
+            scrubbed = d["items"][0]["scrubbed_text"]
+            handle = d["map_handle"]
+            body_blob = r.text
+
+            for v in case["must_vanish"]:
+                check(v not in scrubbed, f"identifier {v!r} absent from scrubbed_text")
+                check(v not in body_blob, f"identifier {v!r} absent from entire /scrub response body")
+            for s in case["substance"]:
+                check(s in scrubbed, f"substance survives: {s!r}")
+
+            # map-leak: Tier-1 values must not be in the server-held map at all
+            stored = store.get(handle)
+            for v in case["tier1_excluded"]:
+                check(all(v not in val for val in stored.values()),
+                      f"Tier-1 {v!r} not in server map (excluded, not tokenized)")
+
+            # round-trip via the server-held map
+            rr = await c.post("/rehydrate", json={
+                "task_id": "t", "map_handle": handle,
+                "items": [{"id": "out_1", "text": scrubbed}], "strict": True,
+            })
+            check(rr.status_code == 200, f"/rehydrate 200 (got {rr.status_code})")
+            if rr.status_code == 200:
+                rehy = rr.json()["items"][0]["rehydrated_text"]
+                check(rehy == tier1_redacted(case["raw"]),
+                      "rehydrate via server map == raw with Tier-1 redacted")
+
+        # ── handle reuse keeps tokens stable across calls ──
+        print("\n[map_handle reuse — stable tokens]")
+        r1 = await c.post("/scrub", json={"task_id": "reuse", "items": [{"id": "a", "text": "Dana Whitfield called."}],
+                                          "known_entities": {"persons": ["Dana Whitfield", "Dana", "Whitfield"]}, "ner": "rules_only"})
+        h = r1.json()["map_handle"]
+        tok1 = r1.json()["items"][0]["scrubbed_text"]
+        r2 = await c.post("/scrub", json={"task_id": "reuse", "map_handle": h,
+                                          "items": [{"id": "b", "text": "Dana Whitfield emailed again."}],
+                                          "known_entities": {"persons": ["Dana Whitfield", "Dana", "Whitfield"]}, "ner": "rules_only"})
+        tok2 = r2.json()["items"][0]["scrubbed_text"]
+        same_token = re.findall(r"\[PERSON_\d+\]", tok1) == re.findall(r"\[PERSON_\d+\]", tok2)
+        check("Dana Whitfield" not in tok1 and "Dana Whitfield" not in tok2, "name tokenized both calls")
+        check(same_token and bool(re.search(r"\[PERSON_1\]", tok2)), "same entity -> same token across calls (reuse)")
+
+        # ── 409 strict tripwire on unmapped token ──
+        print("\n[strict rehydrate tripwire]")
+        r409 = await c.post("/rehydrate", json={"task_id": "reuse", "map_handle": h,
+                                                "items": [{"id": "x", "text": "see [PERSON_99] smuggled"}], "strict": True})
+        check(r409.status_code == 409, f"unmapped token -> 409 (got {r409.status_code})")
+
+        # ── 410 unknown/expired handle ──
+        print("\n[unknown handle -> 410]")
+        r410 = await c.post("/rehydrate", json={"task_id": "z", "map_handle": "deadbeef" * 4,
+                                                "items": [{"id": "x", "text": "[PERSON_1]"}], "strict": True})
+        check(r410.status_code == 410, f"unknown handle -> 410 (got {r410.status_code})")
+
+        # ── 422 fail-closed: tier1_action=reject emits nothing ──
+        print("\n[fail-closed tier1 reject]")
+        r422 = await c.post("/scrub", json={"task_id": "fc", "tier1_action": "reject",
+                                            "items": [{"id": "x", "text": "Wire to acct 000123456789 today."}],
+                                            "known_entities": {}, "ner": "rules_only"})
+        check(r422.status_code == 422, f"Tier-1 + reject -> 422 (got {r422.status_code})")
+        check("000123456789" not in r422.text, "rejected call does NOT echo the Tier-1 value")
+
+        # ── error bodies expose top-level documented keys (NOT wrapped under "detail") ──
+        print("\n[error body shape]")
+        check(r409.json().get("error") == "unknown_tokens" and "tokens" in r409.json(),
+              "409 body top-level {error:unknown_tokens, tokens:[...]}")
+        check(r410.json().get("error") == "map_expired", "410 body top-level {error:map_expired}")
+        check(r422.json().get("error") == "tier1_detected", "422 body top-level {error:tier1_detected}")
+
+        # ── tokens_used is BARE (PERSON_1, not [PERSON_1]) per the handover contract ──
+        print("\n[tokens_used bare]")
+        rb = await c.post("/scrub", json={"task_id": "bare", "items": [{"id": "a", "text": "Dana Whitfield called."}],
+                                          "known_entities": {"persons": ["Dana Whitfield"]}, "ner": "rules_only"})
+        tu = rb.json()["items"][0]["tokens_used"]
+        check(tu and all("[" not in t and "]" not in t for t in tu), f"tokens_used bare: {tu}")
+
+    # ── P0 fix unit tests: descriptive token-substitution match + fail-closed ──
+    print("\n[descriptive redaction — P0 fail-open fix]")
+    from app.redaction_gateway import _redact_descriptive, _apply_tokenmap_to_span, _Contract
+    tmap = {"[ORG_1]": "Acme Mining"}
+    # The NER stashed the span with the plaintext name; the final text has it tokenized.
+    final_text = "He is part of [redacted-was-here] the family that sold [ORG_1] in Texas last year, big deal."
+    span = "the family that sold Acme Mining in Texas last year"
+    sub = _apply_tokenmap_to_span(span, tmap)
+    check(sub == "the family that sold [ORG_1] in Texas last year", "token-substituted span matches scrubbed form")
+    out, flags = _redact_descriptive(final_text, [span], tmap, "i")
+    check("[redacted]" in out and "the family that sold" not in out,
+          "descriptive span removed via token-substituted match (no fail-open leak)")
+    # substantial span that can't be located anywhere -> fail closed (422)
+    try:
+        _redact_descriptive("totally unrelated text", ["the founder who sold his company in Wyoming last year"], {}, "i")
+        check(False, "unremovable substantial span should fail closed")
+    except _Contract as e:
+        check(e.status == 422 and e.body.get("error") == "descriptive_unredactable",
+              "unremovable substantial descriptive span -> 422 fail-closed")
+
+    # ── P0 fix: map store db file is NOT world-readable ──
+    print("\n[map store file perms — P0]")
+    import stat as _stat
+    mode = _stat.S_IMODE(os.stat(db).st_mode)
+    check(mode & 0o077 == 0, f"map db is 0600-ish (mode={oct(mode)}, no group/other access)")
+
+    print()
+    if FAILS:
+        print(f"FAILED ({len(FAILS)}):")
+        for f in FAILS:
+            print("  - " + f)
+        sys.exit(1)
+    print("ALL PASS (gateway acceptance — parity + map-leak + round-trip + tripwires)")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
@@ -0,0 +1,187 @@
+#!/usr/bin/env python3
+"""Golden-file LEAK TEST for the redaction boundary, hardened across two adversarial
+leak-hunts. Synthetic fixtures only (guardrail #9).
+
+Per case: must_vanish (never reach Claude), tier1_excluded (also not in the map),
+substance (survives verbatim), perfect inverse, leak-proof audit. Plus a round-2
+"hardening vectors" section that regression-locks: NFD/ligature unicode names,
+slash/comma SSN + SWIFT + passport Tier-1 drops, sentence-final bare digits, the
+rehydrate collision fix, and the FALSE-POSITIVE survival of non-money quantities /
+version numbers / ISINs (we de-identify, we don't destroy substance).
+
+Deterministic + offline (the dictionary is each case's own lists; the unknown-name
+NER backstop is exercised in test_grounding_boundary.py). Currency-CUED amounts are
+caught here; bare magnitudes ('5MM') are left to minimize-first + NER by design.
+Run: cd backend && python3 redaction/test_scrub_leak.py
+"""
+import json
+import os
+import re
+import sqlite3
+import sys
+import unicodedata
+
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+import scrub as R  # noqa: E402
+
+CASES = [
+    {
+        "name": "labeled-tier1 + core tier2",
+        "raw": ("Jonathan Reyes (jon@cedarpoint.example) at Cedar Point Capital is cooling on Fund III. "
+                "Reyes would commit $5,000,000. Wire to acct 000123456789 spooked compliance. Met 1986-03-12. "
+                "Substance: the objection is fee load and lock-up; sentiment negative on the energy thesis."),
+        "known": {"persons": ["Jonathan Reyes", "Reyes"], "orgs": ["Cedar Point Capital"],
+                  "funds": ["Fund III"], "emails": ["jon@cedarpoint.example"]},
+        "must_vanish": ["Jonathan Reyes", "Reyes", "jon@cedarpoint.example", "Cedar Point Capital",
+                        "Fund III", "$5,000,000", "1986-03-12", "000123456789"],
+        "tier1_excluded": ["000123456789"],
+        "substance": ["the objection is fee load and lock-up", "sentiment negative on the energy thesis"],
+    },
+    {
+        "name": "worded/coded amounts, intl phone, urls, non-iso dates",
+        "raw": ("He would commit five million dollars; a $5MM ticket, USD 5,000,000, and a $3-5M range. "
+                "Reach +44 20 7946 0958 or www.cedarpoint.example; profile linkedin.com/in/jreyes. "
+                "Met March 12, 1986 and again 3/12/86. Concern: liquidity timeline only."),
+        "known": {"persons": [], "orgs": [], "funds": [], "emails": []},
+        "must_vanish": ["five million dollars", "$5MM", "USD 5,000,000", "$3-5M", "+44 20 7946 0958",
+                        "www.cedarpoint.example", "linkedin.com/in/jreyes", "March 12, 1986", "3/12/86"],
+        "tier1_excluded": [],
+        "substance": ["Concern: liquidity timeline only"],
+    },
+    {
+        "name": "diacritics + hyphenated + short surnames",
+        "raw": ("Spoke to Jonathán Reyés about the thesis. Reyes-Castellanos co-invests. "
+                "Wu is warm; Li wants a side letter on fees."),
+        "known": {"persons": ["Jonathan Reyes", "Reyes", "Li Wu", "Li", "Wu"], "orgs": [], "funds": [], "emails": []},
+        "must_vanish": ["Jonathán", "Reyés", "Castellanos", "Wu", "Li"],
+        "tier1_excluded": [],
+        "substance": ["wants a side letter on fees"],
+    },
+    {
+        "name": "tier1 separators (slash/comma/space) + swift + address + ext",
+        "raw": ("Wire to acct # 1234-5678-9012 spooked compliance. SSN 123/45/6789 and 123 45 6789 on file. "
+                "Via SWIFT CHASUS33XXX. Lives at 42 Maple Avenue, Greenwich, CT 06830. Office 212-555-0188 x4021. "
+                "Substance: wants a co-investment right."),
+        "known": {"persons": [], "orgs": [], "funds": [], "emails": []},
+        "must_vanish": ["1234-5678-9012", "123/45/6789", "123 45 6789", "CHASUS33XXX", "42 Maple Avenue",
+                        "212-555-0188", "x4021", "06830"],
+        "tier1_excluded": ["1234-5678-9012", "123/45/6789", "123 45 6789", "CHASUS33XXX"],
+        "substance": ["wants a co-investment right"],
+    },
+]
+
+FAILS = []
+
+
+def check(cond, msg):
+    print(("  PASS " if cond else "  FAIL ") + msg)
+    if not cond:
+        FAILS.append(msg)
+
+
+def tier1_redacted(raw):
+    s = unicodedata.normalize("NFKC", raw)
+    for _, pat in R.TIER1_PATTERNS:
+        s = pat.sub("[redacted]", s)
+    return s
+
+
+def main():
+    db = os.path.join(__import__("tempfile").mkdtemp(), "log.db")
+    conn = sqlite3.connect(db)
+    conn.execute("""CREATE TABLE interaction_log (id TEXT PRIMARY KEY, ts TEXT, actor_type TEXT, actor_id TEXT,
+        action TEXT, target_type TEXT, target_id TEXT, payload TEXT, source TEXT, created_at TEXT)""")
+
+    for case in CASES:
+        raw, known = case["raw"], case["known"]
+        print(f"\n[{case['name']}]")
+        check(not R.residual_tokens(raw), "raw fixture has no [TYPE_N]-shaped strings")
+        outbound, tmap, audit = R.scrub(raw, known_entities=known, bucket=False)
+        for v in case["must_vanish"]:
+            check(v not in outbound, f"identifier {v!r} absent from outbound")
+        for v in case["tier1_excluded"]:
+            check(all(v not in mv for mv in tmap.values()), f"Tier-1 {v!r} excluded, not tokenized")
+        for s in case["substance"]:
+            check(s in outbound, f"substance survives: {s!r}")
+        check(len(set(tmap.values())) == len(tmap), "map injective")
+        check(R.rehydrate(outbound, tmap) == tier1_redacted(raw), "rehydrate == raw w/ Tier-1 redacted (perfect inverse)")
+        check(not R.residual_tokens(R.rehydrate(outbound, tmap)), "no placeholder survives rehydrate")
+        R.log_scrub(conn, "architect", audit, task="g", session_id="t", source="mcp")
+        conn.commit()
+        blob = " ".join(r[0] for r in conn.execute("SELECT payload FROM interaction_log"))
+        check(all(v not in blob for v in case["must_vanish"]), "audit log carries NO sensitive value")
+
+    # ── round-2 hardening vectors ──
+    def out(raw, known=None):
+        o, _m, _a = R.scrub(raw, known_entities=known or {}, bucket=False)
+        return o
+
+    print("\n[unicode — NFD / ligature names]")
+    nfd = unicodedata.normalize("NFD", "Jonathan Reyés is cooling.")
+    check("Reyés" not in unicodedata.normalize("NFKC", out(nfd, {"persons": ["Jonathan Reyes", "Reyes"]})),
+          "NFD-decomposed accented name does not leak")
+    check("Steﬀen" not in out("LP Steﬀen is cooling.", {"persons": ["Steffen"]}),
+          "ligature name (Steﬀen) does not leak")
+
+    print("\n[tier1 — slash/comma/swift/passport]")
+    o, m, _ = R.scrub("Reyes SSN 123/45/6789 and 123,45,6789 on the W9.", known_entities={}, bucket=False)
+    check("123/45/6789" not in o and "123,45,6789" not in o, "slash/comma SSN dropped")
+    check(all("123/45/6789" not in v and "123,45,6789" not in v for v in m.values()), "SSN not in map (excluded)")
+    check("CHASUS33XXX" not in out("Wire via SWIFT CHASUS33XXX today."), "SWIFT/BIC dropped")
+    check("a1234567" not in out("Passport number a1234567 expires 2030."), "passport-with-'number' dropped")
+
+    print("\n[bare digits at sentence end]")
+    check("123456789012" not in out("The security ID is 123456789012."), "9+ digit run at sentence end tokenized")
+
+    print("\n[FALSE-POSITIVE survival — substance preserved]")
+    check("3m tall" in out("The wall is 3m tall."), "'3m tall' (meters) NOT eaten as money")
+    check("250k followers" in out("She has 250k followers on X."), "'250k followers' NOT eaten as money")
+    check("3.14.159" in out("Pi is roughly 3.14.159 here."), "version-ish number NOT eaten as a date")
+    check("US0378331005" in out("We hold ISIN US0378331005 in the sleeve."), "ISIN preserved (substance, not dropped)")
+    check("2019-2024" in out("Track record spans 2019-2024."), "year range NOT mislabeled as a phone")
+
+    print("\n[integrity — rehydrate single-pass, no cascade]")
+    raw = "Refer to [MISC_2] then [PERSON_9]."
+    oo, mm, _ = R.scrub(raw, known_entities={}, bucket=False)
+    check(R.rehydrate(oo, mm) == raw, "same-length placeholder literals round-trip without cascade")
+
+    print("\n[round-4 — alpha-prefixed accounts, MM, zero-width]")
+    o, m, _ = R.scrub("Acct A123456789012 flagged. Member ID: X4451200931 noted. Wire to GB123456789012 today.",
+                      known_entities={}, bucket=False)
+    for v in ["A123456789012", "X4451200931", "GB123456789012"]:
+        check(v not in o, f"alpha-prefixed labelled identifier {v!r} dropped")
+        check(all(v not in mv for mv in m.values()), f"{v!r} excluded, not tokenized")
+    o2 = out("Commit of $5MM and €10MM confirmed.")
+    check("$5MM" not in o2 and "5M " not in o2 and "MM" not in o2, "double-magnitude $5MM fully tokenized (no stray 'M')")
+    zw = "LP Re\u200byes is cooling."   # zero-width space (U+200B) splitting the surname
+    check("Reyes" not in out(zw, {"persons": ["Reyes"]}) and "Re\u200byes" not in out(zw, {"persons": ["Reyes"]}),
+          "zero-width-split known name does not leak")
+
+    print("\n[round-5 — magnitude suffix must not eat a following word]")
+    # A single-letter magnitude (k/m/b) immediately before a real word must NOT be
+    # consumed as a suffix: '$5,000,000 but' -> the 'b' of 'but' was being eaten,
+    # yielding '[AMOUNT_1]ut'. A \b after the magnitude fixes it. Money still vanishes,
+    # the following word survives intact, and legitimate suffixes still tokenize.
+    for raw, word in [("$5,000,000 but he hesitates", "but he hesitates"),
+                      ("committed $250,000 because timing", "because timing"),
+                      ("USD 5,000,000 but capped", "but capped"),
+                      ("between $3-5M but capped", "but capped")]:
+        o = out(raw)
+        check("[AMOUNT_1]ut" not in o and "[AMOUNT_1]ecause" not in o, f"magnitude does not bleed into next word: {raw!r}")
+        check(word in o, f"following word survives intact: {word!r}")
+        check("$" not in o and "USD 5" not in o, f"amount still tokenized: {raw!r}")
+    check(out("raised $5m but later") == "raised [AMOUNT_1] but later", "real 'm' suffix still tokenizes ($5m)")
+    check(out("about $5b in assets") == "about [AMOUNT_1] in assets", "real 'b' suffix still tokenizes ($5b)")
+
+    conn.close()
+    print()
+    if FAILS:
+        print(f"FAILED ({len(FAILS)}):")
+        for f in FAILS:
+            print(f"  - {f}")
+        sys.exit(1)
+    print("ALL PASS (redaction leak test — hardened x2)")
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,559 @@
+"""Redaction gateway — `POST /scrub` + `POST /rehydrate`.
+
+The privacy boundary between sovereign LP data and the Claude API. An agent sends
+its assembled LP-specific context to `/scrub`; we de-identify it (the real values
+never leave this box) and return placeholder-only text the agent forwards to
+Claude. Claude reasons over `[PERSON_1] introduced [PERSON_2] to [FUND_1]` and
+replies in the same placeholders; the agent sends Claude's reply to `/rehydrate`,
+which swaps the real values back in for human review.
+
+Design:
+  * Detection logic is the VENDORED reference engine (app/redaction/scrub.py),
+    never reimplemented — parity is by construction (its leak test must pass).
+  * The pseudonym map {token -> real_value} is the de-anonymization key. It is the
+    ONE place real values live; held server-side keyed by an opaque map_handle in a
+    TTL-swept local store on /data (0700 dir / 0600 file — never world-readable),
+    NEVER returned in full, NEVER logged, NEVER in a Claude-bound payload.
+  * The caller-supplied `known_entities` dictionary is itself a slice of the LP
+    list — treated as sensitive: used transiently for the scrub, never persisted
+    beyond the resulting tokens, never logged or echoed.
+  * The local-Qwen NER backstop is LOAD-BEARING, not optional, and FAILS CLOSED:
+    if Qwen is unreachable / returns a malformed or empty-schema result under
+    ner=auto/qwen, /scrub returns 422 and emits nothing rather than passing
+    name-blind text to Claude. Descriptive re-identifiers it flags are redacted,
+    and if a substantial flagged span cannot be located+removed from the final
+    text we ALSO fail closed (no identifier-blind prose reaches Claude).
+
+This gateway does NOT call Claude. It is the scrub/rehydrate transform pair plus
+the server-held map.
+"""
+from __future__ import annotations
+import asyncio
+import json
+import logging
+import os
+import re
+import sqlite3
+import time
+import uuid
+from datetime import datetime, timezone
+from typing import Any, Optional
+
+import httpx
+from fastapi import APIRouter
+from fastapi.responses import JSONResponse
+from pydantic import BaseModel
+
+from .config import Settings
+from .redaction import scrub as engine  # vendored parity-locked engine
+
+logger = logging.getLogger("spark-control.redaction")
+
+DEFAULT_TTL_SECONDS = 7200          # 2h — spans a human-review round-trip
+QWEN_NER_TIMEOUT = 60.0
+QWEN_NER_MAX_CHARS = 24000          # guard the NER prompt size per item
+# A descriptive re-identifier span is "substantial" (and so must be removable, or
+# we fail closed) when it's a real phrase, not model noise like "the founder".
+DESCRIPTIVE_MIN_WORDS = 4
+DESCRIPTIVE_MIN_CHARS = 25
+
+
+# ────────────────────────── typed control-flow errors ──────────────────────────
+
+class NerUnavailable(RuntimeError):
+    """Raised from the NER pass for ANY unreachable/malformed/empty-schema result,
+    so the endpoint can fail closed (422) without brittle string matching."""
+
+
+class _Contract(Exception):
+    """A documented gateway error. Carries the exact top-level body shape the
+    handover contract specifies (e.g. {"error":"tier1_detected","spans":[...]}),
+    returned via JSONResponse so keys sit at top level (NOT wrapped under
+    FastAPI's "detail")."""
+    def __init__(self, status: int, body: dict) -> None:
+        self.status = status
+        self.body = body
+
+
+# ────────────────────────── server-held pseudonym map store ──────────────────────────
+
+class MapStore:
+    """TTL-swept local store for pseudonym maps, keyed by map_handle.
+
+    Stored on the /data volume so an in-flight task survives a container restart.
+    Holds ONLY the {token -> real_value} map (the de-anon key) — never the raw
+    caller dictionary, never any Claude-bound text. The db + its WAL/journal/shm
+    sidecars are created 0600 under a 0700 dir, so no other local user/process can
+    read the real values. Expired rows are swept on the next create/extend.
+    """
+
+    def __init__(self, db_path: str, ttl_seconds: int = DEFAULT_TTL_SECONDS) -> None:
+        self.db_path = db_path
+        self.ttl_seconds = ttl_seconds
+        d = os.path.dirname(db_path) or "."
+        try:
+            os.makedirs(d, mode=0o700, exist_ok=True)
+            os.chmod(d, 0o700)
+        except Exception as e:
+            logger.warning("could not tighten map dir perms on %s: %s", d, e)
+        # Create the db (and sidecars) under a tight umask so they're 0600.
+        old_umask = os.umask(0o077)
+        try:
+            self._init_db()
+            for suffix in ("", "-wal", "-shm", "-journal"):
+                p = db_path + suffix
+                if os.path.exists(p):
+                    try:
+                        os.chmod(p, 0o600)
+                    except Exception as e:
+                        logger.warning("could not chmod %s to 0600: %s", p, e)
+        finally:
+            os.umask(old_umask)
+
+    def _conn(self) -> sqlite3.Connection:
+        c = sqlite3.connect(self.db_path)
+        c.row_factory = sqlite3.Row
+        return c
+
+    def _init_db(self) -> None:
+        with self._conn() as c:
+            c.execute(
+                """CREATE TABLE IF NOT EXISTS pseudonym_maps (
+                    map_handle TEXT PRIMARY KEY,
+                    task_id    TEXT NOT NULL,
+                    token_map  TEXT NOT NULL,
+                    created_at REAL NOT NULL,
+                    expires_at REAL NOT NULL
+                )"""
+            )
+
+    def _sweep(self, c: sqlite3.Connection) -> None:
+        c.execute("DELETE FROM pseudonym_maps WHERE expires_at < ?", (time.time(),))
+
+    def create(self, task_id: str, token_map: dict) -> tuple[str, float]:
+        handle = uuid.uuid4().hex
+        now = time.time()
+        expires = now + self.ttl_seconds
+        with self._conn() as c:
+            self._sweep(c)
+            c.execute(
+                "INSERT INTO pseudonym_maps (map_handle, task_id, token_map, created_at, expires_at) VALUES (?,?,?,?,?)",
+                (handle, task_id, json.dumps(token_map), now, expires),
+            )
+        return handle, expires
+
+    def extend(self, map_handle: str, token_map: dict) -> float:
+        now = time.time()
+        expires = now + self.ttl_seconds
+        with self._conn() as c:
+            self._sweep(c)
+            cur = c.execute(
+                "UPDATE pseudonym_maps SET token_map=?, expires_at=? WHERE map_handle=? AND expires_at>=?",
+                (json.dumps(token_map), expires, map_handle, now),
+            )
+            if cur.rowcount == 0:
+                raise KeyError("map_handle not found or expired")
+        return expires
+
+    def get(self, map_handle: str) -> Optional[dict]:
+        """Return the token_map, or None if the handle is unknown; raise _Expired if its TTL has lapsed."""
+        with self._conn() as c:
+            row = c.execute(
+                "SELECT token_map, expires_at FROM pseudonym_maps WHERE map_handle=?",
+                (map_handle,),
+            ).fetchone()
+        if row is None:
+            return None
+        if row["expires_at"] < time.time():
+            raise _Expired()
+        return json.loads(row["token_map"])
+
+
+class _Expired(Exception):
+    pass
+
+
+def _state_from_map(token_map: dict) -> engine.ScrubState:
+    """Reconstruct a ScrubState from a stored token_map so a reused map_handle keeps
+    token assignment stable (same surface -> same token) and continues numbering for
+    new entities. Does not modify the vendored engine."""
+    st = engine.ScrubState()
+    st.token_map = dict(token_map)
+    for tok, surface in token_map.items():
+        m = re.match(r"\[([A-Z]+)_(\d+)\]", tok)
+        if not m:
+            continue
+        ttype, n = m.group(1), int(m.group(2))
+        st._by_value[(ttype, surface)] = tok
+        if ttype in st._counters:
+            st._counters[ttype] = max(st._counters[ttype], n)
+    return st
+
+
+# ────────────────────────── local-Qwen NER backstop ──────────────────────────
+
+_NER_SYSTEM = (
+    "You are a PII extraction engine inside a privacy redaction gateway. You receive text "
+    "in which known names and structured identifiers may ALREADY be replaced by placeholder "
+    "tokens shaped like [PERSON_1] or [AMOUNT_2]. Your job is to find what is NOT yet redacted. "
+    "Return ONLY a single JSON object, no prose, no code fence. Schema:\n"
+    '{"entities":[{"text":"<exact surface substring>","type":"PERSON|ORG|FUND|LOC"}],'
+    '"descriptive":[{"span":"<exact substring that could re-identify a real person or org '
+    'WITHOUT naming them, e.g. occupation+location+event combinations like '
+    "'the family that sold the mining company in Texas'>\"}]}\n"
+    "Rules: include real person names, company/org names, fund names, and place names that are "
+    "NOT already a [TOKEN]. NEVER include any [TYPE_N] placeholder. 'text' and 'span' must be "
+    "exact substrings copied from the input. If nothing is found, return both arrays empty."
+)
+
+
+def _strip_think(s: str) -> str:
+    """Remove any <think>...</think> block so its braces can't confuse JSON extraction."""
+    return re.sub(r"<think>.*?</think>", "", s, flags=re.DOTALL | re.IGNORECASE).strip()
+
+
+def _parse_ner_json(content: str) -> Any:
+    s = _strip_think(content).strip()
+    if s.startswith("```"):
+        s = re.sub(r"^```[a-zA-Z]*\n?", "", s)
+        s = re.sub(r"\n?```$", "", s).strip()
+    try:
+        return json.loads(s)
+    except Exception:
+        a, b = s.find("{"), s.rfind("}")
+        if a != -1 and b != -1 and b > a:
+            return json.loads(s[a : b + 1])
+        raise
+
+
+class QwenNER:
+    """Synchronous NER caller (scrub() invokes ner_fn synchronously, so the whole
+    scrub runs in a threadpool and this uses a sync HTTP client). Fails CLOSED:
+    any unreachable/malformed/empty-schema/truncated result raises NerUnavailable,
+    so the endpoint returns 422 rather than emitting name-blind text."""
+
+    def __init__(self, base_url: str, model_id: str) -> None:
+        self.base_url = base_url
+        self.model_id = model_id
+        self.descriptive: list[str] = []
+
+    def _call(self, text: str) -> dict:
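+        body = {
+            "model": self.model_id,
+            "messages": [
+                {"role": "system", "content": _NER_SYSTEM},
+                # Only the first QWEN_NER_MAX_CHARS characters are sent; text past
+                # that limit is never seen by the NER backstop.
+                {"role": "user", "content": text[:QWEN_NER_MAX_CHARS]},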
+            ],
+            "temperature": 0,
+            "max_tokens": 2048,
+            "response_format": {"type": "json_object"},
+            "chat_template_kwargs": {"enable_thinking": False},
+        }
+        try:
+            with httpx.Client(timeout=QWEN_NER_TIMEOUT) as c:
+                r = c.post(f"{self.base_url}/v1/chat/completions", json=body)
+        except Exception as e:
+            raise NerUnavailable(f"local Qwen NER unreachable: {e}") from e
+        if r.status_code != 200:
+            raise NerUnavailable(f"local Qwen NER HTTP {r.status_code}")
+        try:
+            choice = r.json()["choices"][0]
+            if choice.get("finish_reason") == "length":
+                # Truncated NER output is unreliable -> fail closed.
+                raise NerUnavailable("local Qwen NER output truncated (finish_reason=length)")
+            data = _parse_ner_json(choice["message"]["content"])
+        except NerUnavailable:
+            raise
+        except Exception as e:
+            raise NerUnavailable(f"local Qwen NER unparseable: {e}") from e
+        # Schema validation: json_object guarantees valid JSON, not a populated
+        # schema. An empty {} or a missing or non-list field is a fail-OPEN trap,
+        # so fail closed.
+        if (not isinstance(data, dict)
+                or not isinstance(data.get("entities"), list)
+                or not isinstance(data.get("descriptive"), list)):
+            raise NerUnavailable("local Qwen NER returned a malformed/empty schema")
+        return data
+
+    def ner_fn(self, text: str):
+        """text -> [(surface, type)] for the engine to tokenize. Side-effect: stashes
+        descriptive re-identifier spans for the gateway to redact post-scrub."""
+        data = self._call(text)
+        for d in data.get("descriptive", []) or []:
+            span = (d.get("span") or "").strip() if isinstance(d, dict) else str(d).strip()
+            if span and not engine._TOKEN_RE.search(span):
+                self.descriptive.append(span)
+        out = []
+        for e in data.get("entities", []) or []:
+            if not isinstance(e, dict):
+                continue
+            t = (e.get("text") or "").strip()
+            ty = (e.get("type") or "").strip().upper()
+            if t and not engine._TOKEN_RE.search(t):
+                out.append((t, ty if ty in engine.TOKEN_TYPES else "PERSON"))
+        return out
+
+
+def _apply_tokenmap_to_span(span: str, token_map: dict) -> str:
+    """Rewrite real values inside a descriptive span into their tokens, longest value
+    first, so a span the NER returned BEFORE its embedded names were tokenized still
+    matches the final scrubbed text (the P0 fail-open fix)."""
+    s = span
+    for tok in sorted(token_map, key=lambda t: len(token_map.get(t, "")), reverse=True):
+        val = token_map[tok]
+        if val:
+            s = s.replace(val, tok)
+    return s
+
+
+def _redact_descriptive(scrubbed: str, spans: list[str], token_map: dict, item_id: str):
+    """Remove descriptive re-identifier spans from the final scrubbed text. For a
+    SUBSTANTIAL span that cannot be located+removed (even after applying the token
+    map), FAIL CLOSED (422) — never let identifier-blind prose reach Claude. Short/
+    generic model-noise spans are flagged but not blanket-removed (avoid over-redaction)."""
+    flags: list[dict] = []
+    for span in sorted(set(spans), key=len, reverse=True):
+        span = (span or "").strip()
+        if not span:
+            continue
+        substantial = (len(span.split()) >= DESCRIPTIVE_MIN_WORDS) or (len(span) >= DESCRIPTIVE_MIN_CHARS)
+        removed = False
+        for variant in (span, _apply_tokenmap_to_span(span, token_map)):
+            if variant and variant in scrubbed:
+                scrubbed = scrubbed.replace(variant, "[redacted]")
+                flags.append({"item": item_id, "span": span, "action": "redacted"})
+                removed = True
+                break
+        if not removed:
+            if substantial:
+                raise _Contract(422, {"error": "descriptive_unredactable", "item": item_id})
+            flags.append({"item": item_id, "span": span, "action": "skipped_generic"})
+    return scrubbed, flags
+
+
+async def _current_model_id(base_url: str) -> Optional[str]:
+    try:
+        async with httpx.AsyncClient(timeout=5.0) as c:
+            r = await c.get(f"{base_url}/v1/models")
+        if r.status_code == 200:
+            data = r.json().get("data") or []
+            return data[0]["id"] if data else None
+    except Exception:
+        return None
+    return None
+
+
+# ────────────────────────── request / response models ──────────────────────────
+
+class ScrubItem(BaseModel):
+    id: str
+    text: str
+
+
+class KnownEntities(BaseModel):
+    persons: list[str] = []
+    orgs: list[str] = []
+    funds: list[str] = []
+    emails: list[str] = []
+    locations: list[str] = []
+
+
+class BucketSpec(BaseModel):
+    amounts: bool = False
+    dates: bool = False
+
+
+class ScrubBody(BaseModel):
+    task_id: str
+    actor: Optional[str] = None
+    items: list[ScrubItem]
+    known_entities: Optional[KnownEntities] = None
+    tier1_action: str = "drop"
+    bucket: BucketSpec = BucketSpec()
+    ner: str = "auto"
+    map_handle: Optional[str] = None
+
+
+class RehydrateItem(BaseModel):
+    id: str
+    text: str
+
+
+class RehydrateBody(BaseModel):
+    task_id: str
+    map_handle: str
+    items: list[RehydrateItem]
+    actor: Optional[str] = None
+    strict: bool = True
+
+
+def _bare(tokens: list[str]) -> list[str]:
+    """[PERSON_1] -> PERSON_1 for the tokens_used field (matches the handover contract)."""
+    return [t.strip("[]") for t in tokens]
+
+
+# ────────────────────────── router ──────────────────────────
+
+def build_router(settings: Settings, map_store: MapStore) -> APIRouter:
+    router = APIRouter()
+
+    def _qwen_base() -> str:
+        return f"http://{settings.spark1_host}:{settings.vllm_port}"
+
+    async def _do_scrub(body: ScrubBody):
+        if not body.items:
+            raise _Contract(400, {"error": "bad_request", "detail": "items is required"})
+        if body.tier1_action not in ("drop", "reject"):
+            raise _Contract(400, {"error": "bad_request", "detail": "tier1_action must be 'drop' or 'reject'"})
+        if body.ner not in ("auto", "rules_only", "qwen"):
+            raise _Contract(400, {"error": "bad_request", "detail": "ner must be 'auto', 'rules_only', or 'qwen'"})
+
+        # Caller dictionary -> engine shape. Sensitive: transient, never logged/echoed.
+        known = None
+        if body.known_entities:
+            ke = body.known_entities
+            known = {"persons": ke.persons, "orgs": ke.orgs, "funds": ke.funds,
+                     "emails": ke.emails, "locations": ke.locations}
+
+        # NER backstop wiring (load-bearing under auto/qwen; fail-closed if unreachable).
+        ner_enabled = body.ner in ("auto", "qwen")
+        model_id: Optional[str] = None
+        if ner_enabled:
+            model_id = await _current_model_id(_qwen_base())
+            if not model_id:
+                raise _Contract(422, {
+                    "error": "ner_unavailable",
+                    "detail": "local Qwen NER is required (ner=%s) but no model is loaded; load a model "
+                              "or call with ner='rules_only' to knowingly skip the NER backstop" % body.ner,
+                })
+
+        # Reuse/extend an existing task map for stable cross-call tokens, else fresh.
+        if body.map_handle:
+            try:
+                existing = map_store.get(body.map_handle)
+            except _Expired:
+                raise _Contract(410, {"error": "map_expired"})
+            if existing is None:
+                raise _Contract(400, {"error": "unknown_map_handle"})
+            state = _state_from_map(existing)
+        else:
+            state = engine.ScrubState()
+
+        out_items: list[dict] = []
+        descriptive_flags: list[dict] = []
+        tier1_total = 0
+        bucket_on = bool(body.bucket.amounts or body.bucket.dates)
+
+        def _run_one(text: str, ner_obj: Optional[QwenNER]):
+            ner_fn = ner_obj.ner_fn if ner_obj is not None else None
+            return engine.scrub(text, known_entities=known, bucket=bucket_on,
+                                state=state, ner_fn=ner_fn)
+
+        for item in body.items:
+            item_ner = QwenNER(_qwen_base(), model_id) if (ner_enabled and model_id) else None
+            tier1_before = len(state.tier1_dropped)
+            try:
+                scrubbed, _full_map, audit = await asyncio.to_thread(_run_one, item.text, item_ner)
+            except NerUnavailable as e:
+                raise _Contract(422, {"error": "ner_unavailable", "detail": str(e)[:300]})
+            except _Contract:
+                raise
+            except Exception:
+                logger.exception("scrub failed for item %s", item.id)
+                # Generic message only — never interpolate engine exception text.
+                raise _Contract(500, {"error": "scrub_failed"})
+
+            # Per-item Tier-1 delta (state.tier1_dropped accumulates across items).
+            item_tier1_kinds = state.tier1_dropped[tier1_before:]
+            if body.tier1_action == "reject" and item_tier1_kinds:
+                # KINDS + item id only — never the raw Tier-1 values.
+                raise _Contract(422, {
+                    "error": "tier1_detected",
+                    "spans": [{"item": item.id, "kinds": sorted(set(item_tier1_kinds))}],
+                })
+            tier1_total += len(item_tier1_kinds)
+
+            # Redact descriptive re-identifiers (fail-closed on a substantial miss).
+            if item_ner is not None and item_ner.descriptive:
+                scrubbed, flags = _redact_descriptive(
+                    scrubbed, item_ner.descriptive, state.token_map, item.id)
+                descriptive_flags.extend(flags)
+
+            out_items.append({
+                "id": item.id,
+                "scrubbed_text": scrubbed,
+                "tokens_used": _bare(engine.residual_tokens(scrubbed)),
+            })
+
+        # Persist/refresh the resulting token map (the de-anon key) under a handle.
+        token_map = dict(state.token_map)
+        if body.map_handle:
+            try:
+                expires = map_store.extend(body.map_handle, token_map)
+            except KeyError:
+                raise _Contract(410, {"error": "map_expired"})
+            handle = body.map_handle
+        else:
+            handle, expires = map_store.create(body.task_id, token_map)
+
+        # tier2_tokenized = total placeholder OCCURRENCES across items;
+        # distinct_entities = distinct tokens in the map.
+        tier2_occurrences = sum(len(engine.residual_tokens(it["scrubbed_text"])) for it in out_items)
+        stats = {
+            "tier1_dropped": tier1_total,
+            "tier2_tokenized": tier2_occurrences,
+            "distinct_entities": len(token_map),
+            "descriptive_flags": descriptive_flags,
+        }
+        return {
+            "task_id": body.task_id,
+            "map_handle": handle,
+            "items": out_items,
+            "stats": stats,
+            "expires_at": datetime.fromtimestamp(expires, tz=timezone.utc).isoformat(),
+        }
+
+    @router.post("/scrub")
+    async def scrub_endpoint(body: ScrubBody):
+        try:
+            return await _do_scrub(body)
+        except _Contract as e:
+            return JSONResponse(status_code=e.status, content=e.body)
+
+    async def _do_rehydrate(body: RehydrateBody):
+        if not body.items:
+            raise _Contract(400, {"error": "bad_request", "detail": "items is required"})
+        try:
+            token_map = map_store.get(body.map_handle)
+        except _Expired:
+            raise _Contract(410, {"error": "map_expired"})
+        if token_map is None:
+            # Unknown handle == nothing to restore (doc: 410 on lapsed OR unknown handle).
+            raise _Contract(410, {"error": "map_expired"})
+
+        out_items = []
+        total_subbed = 0
+        all_unknown: set[str] = set()
+        for item in body.items:
+            present = engine.residual_tokens(item.text)
+            unknown = [t for t in present if t not in token_map]
+            if unknown and body.strict:
+                # Tripwire: a token with no map entry == hallucinated/smuggled.
+                raise _Contract(409, {"error": "unknown_tokens", "tokens": sorted(set(unknown))})
+            all_unknown.update(unknown)
+            rehydrated = engine.rehydrate(item.text, token_map)
+            total_subbed += sum(1 for t in present if t in token_map)
+            out_items.append({"id": item.id, "rehydrated_text": rehydrated})
+
+        return {
+            "items": out_items,
+            "stats": {"tokens_substituted": total_subbed, "unknown_tokens": sorted(all_unknown)},
+        }
+
+    @router.post("/rehydrate")
+    async def rehydrate_endpoint(body: RehydrateBody):
+        try:
+            return await _do_rehydrate(body)
+        except _Contract as e:
+            return JSONResponse(status_code=e.status, content=e.body)
+
+    return router
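
The token-stability behaviour described for `_state_from_map` and the strict tripwire in `/rehydrate` can be illustrated with a toy model. This is a minimal sketch, not the vendored engine: `ToyState`, `toy_scrub`, and `toy_rehydrate` are hypothetical names invented here, and the real `engine.ScrubState` may differ in structure. It only shows the idea that rebuilding state from a stored map keeps "same surface -> same token" and continues numbering for new entities.

```python
import re

TOKEN_RE = re.compile(r"\[([A-Z]+)_(\d+)\]")


class ToyState:
    """Hypothetical stand-in for the engine's ScrubState (illustration only)."""

    def __init__(self, token_map=None):
        self.token_map = dict(token_map or {})  # "[PERSON_1]" -> "Alice Ng"
        self.by_value = {}                      # (type, surface) -> token
        self.counters = {}                      # type -> highest number used so far
        # Rebuild the reverse index and counters from a stored map, if any.
        for tok, surface in self.token_map.items():
            m = TOKEN_RE.fullmatch(tok)
            if m:
                ttype, n = m.group(1), int(m.group(2))
                self.by_value[(ttype, surface)] = tok
                self.counters[ttype] = max(self.counters.get(ttype, 0), n)

    def token_for(self, ttype, surface):
        """Return the existing token for this surface, or mint the next number."""
        key = (ttype, surface)
        if key not in self.by_value:
            n = self.counters.get(ttype, 0) + 1
            self.counters[ttype] = n
            tok = f"[{ttype}_{n}]"
            self.by_value[key] = tok
            self.token_map[tok] = surface
        return self.by_value[key]


def toy_scrub(text, known, state):
    """Replace each known (surface, type) pair with its token, longest surface first."""
    for surface, ttype in sorted(known, key=lambda p: len(p[0]), reverse=True):
        if surface in text:
            text = text.replace(surface, state.token_for(ttype, surface))
    return text


def toy_rehydrate(text, token_map, strict=True):
    """Swap tokens back; in strict mode an unmapped token raises (the tripwire)."""
    unknown = sorted({m.group(0) for m in TOKEN_RE.finditer(text)} - set(token_map))
    if unknown and strict:
        raise ValueError(f"unknown tokens: {unknown}")
    return TOKEN_RE.sub(lambda m: token_map.get(m.group(0), m.group(0)), text)


# First call starts fresh; the second call resumes from the stored map.
state = ToyState()
first = toy_scrub("Alice Ng met Bob Roy.",
                  [("Alice Ng", "PERSON"), ("Bob Roy", "PERSON")], state)
resumed = ToyState(state.token_map)
second = toy_scrub("Bob Roy called Cy Lee.",
                   [("Bob Roy", "PERSON"), ("Cy Lee", "PERSON")], resumed)
restored = toy_rehydrate(second, resumed.token_map)
```

In this toy, "Bob Roy" keeps the token it was given in the first call, and "Cy Lee" receives the next free number rather than restarting at 1. That is the property a reused `map_handle` is meant to preserve across calls.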
@@ -17,8 +17,10 @@ from .deep_health import DeepHealth
 from .disk import delete_from_disk, probe_disk
 from .download import DownloadManager
 from .llm_proxy import build_router as build_llm_router
+from .embeddings_proxy import build_router as build_embeddings_router
+from .redaction_gateway import build_router as build_redaction_router, MapStore
 from .hardware import HardwareProbe
-from .health import check_magpie, check_parakeet, check_vllm
+from .health import check_kokoro, check_parakeet, check_vllm, check_embeddings, check_qdrant
 from .models import load_catalog
 from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager
 from .overrides import add_custom, delete_custom, extract_knobs_from_args, load_overrides, set_knobs
@@ -60,7 +62,7 @@ app.mount("/static", StaticFiles(directory=_STATIC_DIR), name="static")

 # OpenAI-compatible audio proxy: /v1/audio/speech, /v1/audio/transcriptions, /v1/models.
 # Lets Open WebUI, Home Assistant, and any other OpenAI-shaped client talk to
-# Parakeet (STT) and Magpie (TTS) through a single spark-control URL.
+# Parakeet (STT) and Kokoro (TTS) through a single spark-control URL.
 # Passing deep_health lets the proxy fire an immediate wedge-detect + auto-restart
 # when Parakeet returns 500, instead of waiting up to 5 min for the periodic probe.
 app.include_router(build_audio_router(settings, deep_health=deep_health))
@@ -71,6 +73,20 @@ app.include_router(build_audio_router(settings, deep_health=deep_health))
 # as the audio proxy — clients only need one URL for everything.
 app.include_router(build_llm_router(settings))

+# OpenAI-compatible embeddings + rerank + hybrid search proxy:
+# /v1/embeddings -> spark-embed (bge-m3 dense), /v1/rerank -> spark-embed
+# (bge-reranker-v2-m3), /api/search -> orchestrated dense(+sparse) retrieval
+# from Qdrant with optional cross-encoder rerank. Same single-trusted-host
+# model as the LLM and audio proxies.
+app.include_router(build_embeddings_router(settings))
+
+# Redaction gateway: /scrub + /rehydrate. The privacy boundary between sovereign
+# LP data and the Claude API — de-identify context before it leaves the box,
+# re-identify Claude's response locally. The pseudonym map (the de-anon key) is
+# held server-side in a TTL-swept store on /data and never leaves this host.
+redaction_map_store = MapStore(settings.redaction_map_db, settings.redaction_map_ttl)
+app.include_router(build_redaction_router(settings, redaction_map_store))
+

@app.get("/", include_in_schema=False)
 async def index() -> FileResponse:
@@ -274,7 +290,7 @@ async def run_deep_health(service: str) -> dict:


 class HealthEventBody(BaseModel):
-    service: str                 # e.g. "parakeet", "magpie", "vllm"
+    service: str                 # e.g. "parakeet", "kokoro", "vllm"
    ok: bool                     # true on success, false on failure
    source: str | None = None    # what app reported (e.g. "open-webui")
    error: str | None = None     # optional detail
@@ -344,7 +360,7 @@ async def wake_spark(name: str) -> dict:

@app.get("/api/services")
 async def get_services() -> dict:
-    """Lifecycle state of always-on support services (Parakeet, Magpie, …).
+    """Lifecycle state of always-on support services (Parakeet, Kokoro, …).

    Each entry includes:
      - host/port/container/user (configured)
@@ -362,8 +378,15 @@ async def get_services() -> dict:
        docker = await docker_state(settings, svc)
        if name == "parakeet":
            http = await check_parakeet(settings)
+        elif name == "kokoro":
+            http = await check_kokoro(settings)
+        elif name == "embeddings":
+            http = await check_embeddings(settings)
+        elif name == "qdrant":
+            http = await check_qdrant(settings)
        else:
-            http = await check_magpie(settings)
+            # Custom services: TTS-kind entries reuse the Kokoro check; other kinds
+            # are not probed here, so report ok=None with the configured base URL.
+            http = await check_kokoro(settings) if svc.kind == "tts" else {"ok": None, "base_url": svc.host and f"http://{svc.host}:{svc.port}"}
        return name, {
            "host": svc.host,
            "user": svc.user,
@@ -372,7 +395,10 @@ async def get_services() -> dict:
            "kind": svc.kind,
            "base_url": http.get("base_url"),
            "http_ready": bool(http.get("ok")),
-            "model": (http.get("detail") or {}).get("model") if isinstance(http.get("detail"), dict) else None,
+            # Prefer the check fn's own top-level model key (embeddings reports
+            # it there); fall back to a model field inside detail for services
+            # whose /health embeds it (parakeet).
+            "model": http.get("model") or ((http.get("detail") or {}).get("model") if isinstance(http.get("detail"), dict) else None),
            "docker_state": docker.get("state"),
            "restart_count": docker.get("restart_count"),
            "started_at": docker.get("started_at"),
@@ -484,8 +510,8 @@ async def stream_nim_install(job_id: str):

@app.delete("/api/services/{name}")
 async def del_service(name: str) -> dict:
-    # Only allow deleting custom services (not the bundled parakeet/magpie keys)
-    if name in ("parakeet", "magpie"):
+    # Only allow deleting custom services (not the bundled built-in keys)
+    if name in ("parakeet", "kokoro", "embeddings", "qdrant"):
        raise HTTPException(400, "built-in service; cannot delete (use Configure Sparks to point at a different host)")
    delete_custom_service(name)
    return {"ok": True, "name": name}
@@ -551,12 +577,15 @@ async def post_speech_models_restart() -> dict:
@app.get("/api/endpoints")
 async def get_endpoints() -> dict:
    """Service-discovery summary. Stable shape; other apps on the LAN can poll this
-    to learn the OpenAI-compatible vLLM endpoint, the Parakeet STT endpoint, and the
-    Magpie TTS endpoint without needing to know the individual Spark IPs."""
-    vllm, parakeet, magpie = await asyncio.gather(
+    to learn the OpenAI-compatible vLLM endpoint, the Parakeet STT endpoint, the
+    Kokoro TTS endpoint, and the embeddings + Qdrant retrieval endpoints without
+    needing to know the individual Spark IPs."""
+    vllm, parakeet, kokoro, embeddings, qdrant = await asyncio.gather(
        check_vllm(settings),
        check_parakeet(settings),
-        check_magpie(settings),
+        check_kokoro(settings),
+        check_embeddings(settings),
+        check_qdrant(settings),
    )
    return {
        "vllm": {
@@ -571,31 +600,51 @@ async def get_endpoints() -> dict:
            "kind": "stt",
            "model": (parakeet.get("detail") or {}).get("model") if isinstance(parakeet.get("detail"), dict) else None,
        },
-        "magpie": {
-            "ready": bool(magpie.get("ok")),
-            "base_url": magpie.get("base_url"),
+        "kokoro": {
+            "ready": bool(kokoro.get("ok")),
+            "base_url": kokoro.get("base_url"),
            "kind": "tts",
        },
+        "embeddings": {
+            "ready": bool(embeddings.get("ok")),
+            "base_url": embeddings.get("base_url"),
+            "kind": "embedding",
+            "model": embeddings.get("model"),
+            # The proxied OpenAI-compatible endpoints live on Spark Control itself.
+            "openai_endpoints": ["/v1/embeddings", "/v1/rerank", "/api/search"],
+        },
+        "qdrant": {
+            "ready": bool(qdrant.get("ok")),
+            "base_url": qdrant.get("base_url"),
+            "kind": "vectordb",
+            "collection": settings.qdrant_collection or None,
+        },
    }


@app.get("/api/status")
 async def get_status() -> dict:
-    vllm, parakeet, magpie = await asyncio.gather(
+    vllm, parakeet, kokoro, embeddings, qdrant = await asyncio.gather(
        check_vllm(settings),
        check_parakeet(settings),
-        check_magpie(settings),
+        check_kokoro(settings),
+        check_embeddings(settings),
+        check_qdrant(settings),
    )
    # Feed health into the connectivity log (deduped — only logs on transition)
    record_state("vllm", bool(vllm.get("ok")))
    record_state("parakeet", bool(parakeet.get("ok")))
-    record_state("magpie", bool(magpie.get("ok")))
+    record_state("kokoro", bool(kokoro.get("ok")))
+    record_state("embeddings", bool(embeddings.get("ok")))
+    record_state("qdrant", bool(qdrant.get("ok")))
    current_key = _identify_current_model(vllm.get("current_model"))
    return {
        "configured": settings.configured,
        "vllm": vllm,
        "parakeet": parakeet,
-        "magpie": magpie,
+        "kokoro": kokoro,
+        "embeddings": embeddings,
+        "qdrant": qdrant,
        "current_model_key": current_key,
        "current_swap_job": swap_manager.current_job_id,
    }
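
The health fan-out used by `get_endpoints` and `get_status` follows a common `asyncio.gather` pattern, sketched below. This is a rough illustration under assumptions: `check_stub` and `gather_status` are hypothetical names, the URLs are placeholders, and the sketch assumes each check returns a dict rather than raising, which may not match the real `check_*` functions.

```python
import asyncio


async def check_stub(name: str, ok: bool) -> dict:
    """Hypothetical stand-in for check_vllm / check_kokoro / ... (no network I/O)."""
    await asyncio.sleep(0)  # yield to the event loop, as a real probe would
    return {"ok": ok, "base_url": f"http://{name}.example:8000"}


async def gather_status(expected: dict[str, bool]) -> dict:
    names = list(expected)
    # gather() runs the probes concurrently and returns results in argument
    # order, so zipping names with results pairs each service correctly.
    results = await asyncio.gather(*(check_stub(n, expected[n]) for n in names))
    return {
        name: {"ready": bool(res.get("ok")), "base_url": res.get("base_url")}
        for name, res in zip(names, results)
    }


status = asyncio.run(gather_status({"vllm": True, "kokoro": False, "qdrant": True}))
```

Because the results come back in argument order, adding a service means adding one check to the `gather` call and one name to the unpacking, which is the shape of the change this diff makes for `embeddings` and `qdrant`.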
@@ -1,4 +1,4 @@
-"""Lifecycle controls for support-service containers (Parakeet, Magpie, etc.).
+"""Lifecycle controls for support-service containers (Parakeet, Kokoro, etc.).

 These are independent always-on containers that don't go through the LLM-swap
 machinery. We just run `docker start|stop|restart <container>` via SSH on the
@@ -32,9 +32,16 @@ def _clear_unreachable(host: str, user: str) -> None:
    _unreachable_cache.pop((host, user), None)


-ServiceName = Literal["parakeet", "magpie"]
+ServiceName = Literal["parakeet", "kokoro", "embeddings", "qdrant"]
 ServiceAction = Literal["start", "stop", "restart"]

+# Which service kinds are safe to auto-restart on a wedge probe. GPU model
+# servers can wedge their CUDA context and recover via restart. A vector DB
+# (qdrant) holds the only copy of the index and must NOT be auto-restarted on
+# a transient/benign probe error (e.g. a 404 on a missing collection) — a
+# restart mid-write/mid-snapshot is exactly what we don't want.
+RESTARTABLE_KINDS = {"stt", "tts", "embedding"}
+

@dataclass(frozen=True)
 class ServiceDef:
@@ -57,13 +64,29 @@ def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
            container=s.parakeet_container,
            port=s.parakeet_port,
        ),
-        "magpie": ServiceDef(
-            name="magpie",
+        "kokoro": ServiceDef(
+            name="kokoro",
            kind="tts",
-            host=s.magpie_host,
-            user=s.magpie_user,
-            container=s.magpie_container,
-            port=s.magpie_port,
+            host=s.kokoro_host,
+            user=s.kokoro_user,
+            container=s.kokoro_container,
+            port=s.kokoro_port,
+        ),
+        "embeddings": ServiceDef(
+            name="embeddings",
+            kind="embedding",
+            host=s.embed_host,
+            user=s.embed_user,
+            container=s.embed_container,
+            port=s.embed_port,
+        ),
+        "qdrant": ServiceDef(
+            name="qdrant",
+            kind="vectordb",
+            host=s.qdrant_host,
+            user=s.qdrant_user,
+            container=s.qdrant_container,
+            port=s.qdrant_port,
        ),
    }
    for entry in load_custom_services():
@@ -767,7 +767,9 @@ function renderHealth(status) {
  }
  setDot('#h-vllm', status.vllm && status.vllm.ok, status.vllm);
  setDot('#h-parakeet', status.parakeet && status.parakeet.ok, status.parakeet);
-  setDot('#h-magpie', status.magpie && status.magpie.ok, status.magpie);
+  setDot('#h-kokoro', status.kokoro && status.kokoro.ok, status.kokoro);
+  setDot('#h-embeddings', status.embeddings && status.embeddings.ok, status.embeddings);
+  setDot('#h-qdrant', status.qdrant && status.qdrant.ok, status.qdrant);
  el('#updated').textContent = `updated ${new Date().toLocaleTimeString()}`;
 }

@@ -352,7 +352,9 @@
      <div class="health">
        <span class="health-item" id="h-vllm"><span class="dot"></span> vLLM</span>
        <span class="health-item" id="h-parakeet"><span class="dot"></span> Parakeet</span>
-        <span class="health-item" id="h-magpie"><span class="dot"></span> Magpie</span>
+        <span class="health-item" id="h-kokoro"><span class="dot"></span> Kokoro</span>
+        <span class="health-item" id="h-embeddings"><span class="dot"></span> Embeddings</span>
+        <span class="health-item" id="h-qdrant"><span class="dot"></span> Qdrant</span>
      </div>
      <div class="muted small" id="updated"></div>
    </footer>
@@ -0,0 +1,36 @@
+# spark-embed — dense embeddings (bge-m3) + reranker (bge-reranker-v2-m3)
+# Built FROM the NGC PyTorch image that is already proven to run on the DGX
+# Spark's GB10 (sm_121) GPU — the same base behind our vLLM and Kokoro work.
+#
+# Why not HF Text Embeddings Inference (TEI)? As of 2026 TEI ships no arm64
+# CUDA image (all *-cuda tags are amd64-only), so it won't run on the Spark.
+# Building on NGC torch sidesteps that AND avoids torchaudio (the dependency
+# that sank the WhisperX attempt). bge-m3 + the reranker are XLM-RoBERTa
+# encoders — no flash-attn, no torchaudio, just SDPA attention on torch.
+FROM nvcr.io/nvidia/pytorch:25.11-py3
+
+WORKDIR /app
+
+# Hard-pin the NGC torch version in a constraints file so pip CANNOT replace it
+# while resolving sentence-transformers. NGC's torch carries a local version
+# string (…nv25.11) not on PyPI; pinning it makes pip treat the already-installed
+# build as satisfying the requirement instead of pulling a PyPI wheel that
+# wouldn't have sm_121 kernels. (Same technique as the v0.12.0 torch-ABI work.)
+# transformers is NOT preinstalled in this NGC base, so it installs fresh from
+# PyPI; we cap it (<5) so a future major can't silently change loading behavior.
+RUN python -c "import torch; \
+open('/tmp/constraints.txt','w').write('torch==%s\n' % torch.__version__)" \
+ && cat /tmp/constraints.txt \
+ && pip install --no-cache-dir -c /tmp/constraints.txt \
+      "sentence-transformers>=3.0" "transformers<5" "fastapi>=0.115" "uvicorn[standard]>=0.30"
+
+COPY main.py /app/main.py
+
+# Persist HuggingFace model downloads (bge-m3 ~2.3GB + reranker ~2.3GB) on a
+# mounted volume so container recreates don't re-download.
+ENV HF_HOME=/data/hf
+ENV DENSE_MODEL=BAAI/bge-m3
+ENV RERANK_MODEL=BAAI/bge-reranker-v2-m3
+
+EXPOSE 8088
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8088"]
@@ -0,0 +1,214 @@
+"""spark-embed — a tiny FastAPI server for dense text embeddings + reranking.
+
+Serves BAAI/bge-m3 (dense, 1024-d) and BAAI/bge-reranker-v2-m3 (cross-encoder
+rerank) on a DGX Spark (GB10 Grace-Blackwell, sm_121, ARM64).
+
+Why this exists instead of HF TEI: as of 2026 TEI publishes no arm64 CUDA
+image (every text-embeddings-inference:*-cuda tag is amd64-only), so the
+prebuilt-server path doesn't run on the Spark. This server is built FROM
+nvcr.io/nvidia/pytorch (the same NGC torch we've already proven runs on this
+GB10 for vLLM + Kokoro), so there's no Blackwell kernel risk and — crucially —
+no torchaudio (the dependency that sank the WhisperX attempt). bge-m3 and the
+reranker are XLM-RoBERTa encoders that run on standard SDPA attention; no
+flash-attn wheel needed.
+
+Endpoints:
+  GET  /health   — readiness + loaded model names + device
+  GET  /         — service info
+  POST /embed    — dense embeddings (OpenAI-ish raw arrays)
+  POST /rerank   — cross-encoder rerank of documents against a query
+
+Sparse/BM25 lexical retrieval is intentionally NOT served here. For the
+entity-heavy CRM use case we pair these dense vectors with Qdrant's built-in
+IDF (modifier:idf) over BM25 term-weights generated client-side at ingest +
+query time (FastEmbed Qdrant/bm25). Keeping BM25 in one place (the ingest
+pipeline) avoids vocabulary/IDF drift between ingest and query.
+"""
+from __future__ import annotations
+
+import os
+import time
+import logging
+from contextlib import asynccontextmanager
+from typing import Optional, Union
+
+import torch
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+)
+logger = logging.getLogger("spark-embed")
+
+DENSE_MODEL = os.getenv("DENSE_MODEL", "BAAI/bge-m3")
+RERANK_MODEL = os.getenv("RERANK_MODEL", "BAAI/bge-reranker-v2-m3")
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+USE_FP16 = os.getenv("EMBED_FP16", "1") == "1" and DEVICE == "cuda"
+EMBED_BATCH = int(os.getenv("EMBED_BATCH", "64"))
+RERANK_BATCH = int(os.getenv("RERANK_BATCH", "32"))
+MAX_DOCS = int(os.getenv("RERANK_MAX_DOCS", "200"))
+
+
+class _State:
+    dense = None
+    reranker = None
+    dims: Optional[int] = None
+    loaded: bool = False
+    error: Optional[str] = None
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    # Imported here so module import (and --help, tooling) doesn't require the
+    # heavy deps; the container always has them.
+    from sentence_transformers import SentenceTransformer, CrossEncoder
+
+    # Load inside try/except and ALWAYS yield: a load failure (cold HF download
+    # error, GPU OOM on the 2nd model, bad /data perms) must become an
+    # observable degraded state (/health -> status:error) rather than a uvicorn
+    # "startup failed" crashloop that hides the real cause from the proxy.
+    try:
+        t0 = time.time()
+        logger.info("Loading dense model %s on %s (fp16=%s)", DENSE_MODEL, DEVICE, USE_FP16)
+        _State.dense = SentenceTransformer(DENSE_MODEL, device=DEVICE)
+        if USE_FP16:
+            _State.dense.half()
+        # Probe the dimension once with a tiny encode.
+        probe = _State.dense.encode(["dimension probe"], normalize_embeddings=True,
+                                    convert_to_numpy=True)
+        _State.dims = int(probe.shape[1])
+        logger.info("Dense model ready: dims=%d in %.1fs", _State.dims, time.time() - t0)
+
+        t1 = time.time()
+        logger.info("Loading reranker %s on %s", RERANK_MODEL, DEVICE)
+        _State.reranker = CrossEncoder(
+            RERANK_MODEL, device=DEVICE,
+            model_kwargs={"torch_dtype": torch.float16} if USE_FP16 else {},
+        )
+        logger.info("Reranker ready in %.1fs", time.time() - t1)
+
+        _State.loaded = True
+        logger.info("spark-embed ready (total %.1fs)", time.time() - t0)
+    except Exception as e:
+        _State.error = f"{type(e).__name__}: {e}"
+        logger.exception("spark-embed model load FAILED — serving in degraded state")
+    yield
+
+
+app = FastAPI(title="spark-embed", version="1.0.0", lifespan=lifespan)
+
+
+@app.get("/")
+async def root() -> dict:
+    return {
+        "service": "spark-embed",
+        "dense_model": DENSE_MODEL,
+        "rerank_model": RERANK_MODEL,
+        "dims": _State.dims,
+        "device": DEVICE,
+        "endpoints": {"embed": "/embed", "rerank": "/rerank", "health": "/health"},
+    }
+
+
+@app.get("/health")
+async def health() -> dict:
+    if _State.error:
+        status = "error"
+    elif _State.loaded:
+        status = "ready"
+    else:
+        status = "loading"
+    out = {
+        "status": status,
+        "dense_model": DENSE_MODEL,
+        "rerank_model": RERANK_MODEL,
+        "dims": _State.dims,
+        "device": DEVICE,
+    }
+    if _State.error:
+        out["error"] = _State.error
+    return out
+
+
+class EmbedBody(BaseModel):
+    # Accept either a single string or a batch. `input` mirrors OpenAI's field
+    # name so callers can reuse OpenAI client request shapes loosely.
+    input: Union[str, list[str]]
+    normalize: bool = True
+
+
+@app.post("/embed")
+async def embed(body: EmbedBody) -> dict:
+    if not _State.loaded or _State.dense is None:
+        raise HTTPException(503, "model loading")
+    texts = [body.input] if isinstance(body.input, str) else list(body.input)
+    if not texts:
+        raise HTTPException(400, "input is required")
+    if any(not isinstance(t, str) for t in texts):
+        raise HTTPException(400, "all inputs must be strings")
+    t0 = time.time()
+    try:
+        vecs = _State.dense.encode(
+            texts,
+            normalize_embeddings=body.normalize,
+            batch_size=EMBED_BATCH,
+            convert_to_numpy=True,
+        )
+    except Exception as e:
+        logger.exception("embed failed")
+        raise HTTPException(500, f"embed failed: {e}") from e
+    elapsed = time.time() - t0
+    logger.info("embed %d texts in %.0fms", len(texts), elapsed * 1000)
+    return {
+        "model": DENSE_MODEL,
+        "dims": int(vecs.shape[1]),
+        "count": len(texts),
+        "embeddings": vecs.tolist(),
+    }
+
+
+class RerankBody(BaseModel):
+    query: str
+    documents: list[str]
+    top_n: Optional[int] = None
+    # When True, return the document text alongside each result (OpenAI/Cohere style).
+    return_documents: bool = False
+
+
+@app.post("/rerank")
+async def rerank(body: RerankBody) -> dict:
+    if not _State.loaded or _State.reranker is None:
+        raise HTTPException(503, "model loading")
+    if not body.query.strip():
+        raise HTTPException(400, "query is required")
+    docs = list(body.documents or [])
+    if not docs:
+        raise HTTPException(400, "documents is required")
+    if len(docs) > MAX_DOCS:
+        raise HTTPException(413, f"too many documents (>{MAX_DOCS}); rerank a smaller candidate set")
+    pairs = [[body.query, d] for d in docs]
+    t0 = time.time()
+    try:
+        scores = _State.reranker.predict(pairs, batch_size=RERANK_BATCH)
+    except Exception as e:
+        logger.exception("rerank failed")
+        raise HTTPException(500, f"rerank failed: {e}") from e
+    elapsed = time.time() - t0
+    ranked = sorted(
+        ((i, float(s)) for i, s in enumerate(scores)),
+        key=lambda x: x[1],
+        reverse=True,
+    )
+    # top_n <= 0 means "return all" (same as None) — never silently return [].
+    if body.top_n is not None and body.top_n > 0:
+        ranked = ranked[: body.top_n]
+    logger.info("rerank %d docs in %.0fms", len(docs), elapsed * 1000)
+    results = []
+    for idx, score in ranked:
+        item = {"index": idx, "score": score}
+        if body.return_documents:
+            item["document"] = docs[idx]
+        results.append(item)
+    return {"model": RERANK_MODEL, "results": results}
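A client for the two endpoints above might look like the following. This is a minimal sketch, not part of the commit: the base URL `http://localhost:8088`, the helper names `post_json`, `embed_texts`, and `rerank_docs`, and the injectable `post` parameter are assumptions for illustration. Only the request and response field names come from `main.py`.

```python
import json
import urllib.request

# Assumption: spark-embed is reachable here; the Dockerfile only fixes the port (8088).
BASE_URL = "http://localhost:8088"


def post_json(path, payload, base_url=BASE_URL, timeout=60.0):
    """POST a JSON body and decode the JSON response (stdlib only)."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def embed_texts(texts, normalize=True, post=post_json):
    """Return one dense vector per input text via POST /embed."""
    body = post("/embed", {"input": list(texts), "normalize": normalize})
    if body["count"] != len(body["embeddings"]):
        raise ValueError("count does not match number of embeddings")
    return body["embeddings"]


def rerank_docs(query, documents, top_n=None, post=post_json):
    """Return (index, score, document) tuples, best first, via POST /rerank."""
    payload = {"query": query, "documents": list(documents), "return_documents": True}
    if top_n is not None:
        payload["top_n"] = top_n
    results = post("/rerank", payload)["results"]
    return [(r["index"], r["score"], r["document"]) for r in results]
```

With the default `post_json`, a 503 from either endpoint surfaces as `urllib.error.HTTPError`; per the handlers above that means the models are not loaded, and `GET /health` reports whether the state is `loading` or `error`.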
@@ -1,6 +1,14 @@
 # Known issues

-## ~~magpie-tts crash loop (Spark 2)~~ — RESOLVED 2026-05-12
+## Magpie removed in v0.14.0 (2026-06-03)
+
+**Why**: Magpie/Riva's TTS decoder had a structural defect: it truncated audio on ~30% of short inputs and ~50%+ of multi-sentence inputs, and restarting with a fresh container did not help. The failure reproduced server-side and was confirmed in Riva's own logs (`status:0` with an implausibly short `audio_duration`). Switching to Riva's streaming endpoint gave the same failure rate. Even with v0.13.0:5's retry layer and v0.13.0:6's server-side chunking, end-to-end reliability capped at ~85%.
+
+**What replaced it**: Kokoro-82M (Apache 2.0) via `ghcr.io/remsky/kokoro-fastapi-gpu`. 24/24 successful renders across the same input lengths that broke Magpie 13/24 times, ~1s wallclock per call, 1.3 GB GPU memory (vs Magpie's 49 GB). No retry/chunking layer needed in the proxy. Default voice `bm_george`; curated quick-picks include `bf_emma`, `am_michael`, `af_heart`.
+
+The old chunking/retry workaround in `audio_proxy.py` and the Magpie sections in the dashboard, config, services, and deep_health modules were all removed in v0.14.0. Migration: existing users need to pull and run the Kokoro container on Spark 2 (one `docker run` command), then either let Spark Control auto-discover it or update Configure Sparks if running on a non-default host.
+
+## ~~magpie-tts crash loop (Spark 2)~~ — RESOLVED 2026-05-12, then Magpie removed entirely 2026-06-03

 **What Magpie is:** NVIDIA's multilingual text-to-speech (TTS) model, served via the NIM (NVIDIA Inference Microservices) framework — a Riva Speech Server container that converts text into spoken audio. It's the counterpart to Parakeet (which is speech-to-text / STT). When working, it exposes `/v1/audio/speech` on port 9000 and is used by clients like Open WebUI for the "read aloud" feature.

@@ -8,7 +8,7 @@ After install you have:

 - **A web UI** at the package's LAN address (HTTPS, .local).
 - **One-click model swaps** for any model in your `models.yaml` catalog.
- **Live status** of vLLM, Parakeet (STT), and Magpie (TTS).
+- **Live status** of vLLM, Parakeet (STT), and Kokoro (TTS).

 ## Getting set up

@@ -8,7 +8,7 @@ After install you have:

 - **A web UI** at the package's LAN address (HTTPS, .local).
 - **One-click model swaps** for any model in your `models.yaml` catalog.
- **Live status** of vLLM, Parakeet (STT), and Magpie (TTS).
+- **Live status** of vLLM, Parakeet (STT), and Kokoro (TTS).

 ## Getting set up

@@ -25,7 +25,7 @@ const inputSpec = InputSpec.of({
  spark2_host: Value.text({
    name: 'Spark 2 hostname or IP',
    description:
-      'The worker node of your DGX Spark cluster (also runs always-on services like Parakeet/Magpie). Enter its LAN IP or hostname.',
+      'The worker node of your DGX Spark cluster (also runs always-on services like Parakeet and Kokoro). Enter its LAN IP or hostname.',
    required: true,
    default: null,
    placeholder: 'e.g. 192.168.1.11',
@@ -58,22 +58,65 @@ const inputSpec = InputSpec.of({
    placeholder: 'parakeet-asr',
    masked: false,
  }),
-  magpie_host: Value.text({
-    name: 'Magpie host (optional)',
+  kokoro_host: Value.text({
+    name: 'Kokoro host (optional)',
    description:
-      'Override the host running the Magpie TTS container. Leave blank if Magpie runs on Spark 2.',
+      'Override the host running the Kokoro TTS container. Leave blank if Kokoro runs on Spark 2.',
    required: false,
    default: null,
    placeholder: 'leave blank to use Spark 2',
    masked: false,
  }),
-  magpie_container: Value.text({
-    name: 'Magpie container name (optional)',
+  kokoro_container: Value.text({
+    name: 'Kokoro container name (optional)',
    description:
-      'Docker container name for Magpie. Defaults to "magpie-tts".',
+      'Docker container name for Kokoro. Defaults to "kokoro-tts".',
    required: false,
    default: null,
-    placeholder: 'magpie-tts',
+    placeholder: 'kokoro-tts',
+    masked: false,
+  }),
+  embed_host: Value.text({
+    name: 'Embedding server host (optional)',
+    description:
+      'Override the host running the spark-embed container (bge-m3 dense embeddings + reranker). Leave blank if it runs on Spark 2.',
+    required: false,
+    default: null,
+    placeholder: 'leave blank to use Spark 2',
+    masked: false,
+  }),
+  embed_container: Value.text({
+    name: 'Embedding container name (optional)',
+    description: 'Docker container name for the embedding server. Defaults to "spark-embed".',
+    required: false,
+    default: null,
+    placeholder: 'spark-embed',
+    masked: false,
+  }),
+  qdrant_host: Value.text({
+    name: 'Qdrant host (optional)',
+    description:
+      'Override the host running the Qdrant vector database. Leave blank if it runs on Spark 2.',
+    required: false,
+    default: null,
+    placeholder: 'leave blank to use Spark 2',
+    masked: false,
+  }),
+  qdrant_container: Value.text({
+    name: 'Qdrant container name (optional)',
+    description: 'Docker container name for Qdrant. Defaults to "qdrant".',
+    required: false,
+    default: null,
+    placeholder: 'qdrant',
+    masked: false,
+  }),
+  qdrant_collection: Value.text({
+    name: 'Default Qdrant collection (optional)',
+    description:
+      'Default collection name used by /api/search when a request does not specify one. Leave blank to require callers to pass a collection.',
+    required: false,
+    default: null,
+    placeholder: 'e.g. crm_chunks',
    masked: false,
  }),
  open_webui_url: Value.text({
@@ -88,7 +131,7 @@ const inputSpec = InputSpec.of({
  ngc_api_key: Value.text({
    name: 'NGC API key (optional)',
    description:
-      'NVIDIA NGC personal API key — needed to install NIM containers (Parakeet, Magpie, etc.) from nvcr.io. Get one free at https://ngc.nvidia.com/setup/personal-key. Stored only on this Start9 server; passed to docker as the NGC_API_KEY env var when installing NIM services.',
+      'NVIDIA NGC personal API key — needed to install NIM containers (Parakeet, etc.) from nvcr.io. Get one free at https://ngc.nvidia.com/setup/personal-key. Stored only on this Start9 server; passed to docker as the NGC_API_KEY env var when installing NIM services. (Kokoro TTS is Apache 2.0 and does not need an NGC key.)',
    required: false,
    default: null,
    placeholder: 'starts with "nvapi-..."',
@@ -11,9 +11,17 @@ export const sparkConfigSchema = z.object({
  parakeet_host: z.string().catch(''),
  parakeet_user: z.string().catch(''),
  parakeet_container: z.string().catch(''),
-  magpie_host: z.string().catch(''),
-  magpie_user: z.string().catch(''),
-  magpie_container: z.string().catch(''),
+  kokoro_host: z.string().catch(''),
+  kokoro_user: z.string().catch(''),
+  kokoro_container: z.string().catch(''),
+  // Optional overrides for the embedding server (spark-embed) + Qdrant.
+  embed_host: z.string().catch(''),
+  embed_user: z.string().catch(''),
+  embed_container: z.string().catch(''),
+  qdrant_host: z.string().catch(''),
+  qdrant_user: z.string().catch(''),
+  qdrant_container: z.string().catch(''),
+  qdrant_collection: z.string().catch(''),
  // Optional Open WebUI deep-link
  open_webui_url: z.string().catch(''),
  // Optional NGC API key for pulling NIM containers from nvcr.io/nim/...
@@ -17,7 +17,7 @@ const dict = {

  // interfaces.ts (api)
  'OpenAI-compatible API': 8,
-  'Service-discovery JSON at /api/endpoints. Other apps on the LAN can GET this to learn the current vLLM, Parakeet, and Magpie URLs.': 9,
+  'Service-discovery JSON at /api/endpoints. Other apps on the LAN can GET this to learn the current vLLM, Parakeet, and Kokoro URLs.': 9,
 } as const

 /**
@@ -22,7 +22,7 @@ export const setInterfaces = sdk.setupInterfaces(async ({ effects }) => {
    name: i18n('OpenAI-compatible API'),
    id: 'api',
    description: i18n(
-      'Service-discovery JSON at /api/endpoints. Other apps on the LAN can GET this to learn the current vLLM, Parakeet, and Magpie URLs.',
+      'Service-discovery JSON at /api/endpoints. Other apps on the LAN can GET this to learn the current vLLM, Parakeet, and Kokoro URLs.',
    ),
    type: 'api',
    masked: false,
@@ -16,9 +16,16 @@ export const main = sdk.setupMain(async ({ effects }) => {
    parakeet_host: '',
    parakeet_user: '',
    parakeet_container: '',
-    magpie_host: '',
-    magpie_user: '',
-    magpie_container: '',
+    kokoro_host: '',
+    kokoro_user: '',
+    kokoro_container: '',
+    embed_host: '',
+    embed_user: '',
+    embed_container: '',
+    qdrant_host: '',
+    qdrant_user: '',
+    qdrant_container: '',
+    qdrant_collection: '',
    open_webui_url: '',
    ngc_api_key: '',
  }
@@ -45,9 +52,16 @@ export const main = sdk.setupMain(async ({ effects }) => {
        PARAKEET_HOST: cfg.parakeet_host,
        PARAKEET_USER: cfg.parakeet_user,
        PARAKEET_CONTAINER: cfg.parakeet_container,
-        MAGPIE_HOST: cfg.magpie_host,
-        MAGPIE_USER: cfg.magpie_user,
-        MAGPIE_CONTAINER: cfg.magpie_container,
+        KOKORO_HOST: cfg.kokoro_host,
+        KOKORO_USER: cfg.kokoro_user,
+        KOKORO_CONTAINER: cfg.kokoro_container,
+        EMBED_HOST: cfg.embed_host,
+        EMBED_USER: cfg.embed_user,
+        EMBED_CONTAINER: cfg.embed_container,
+        QDRANT_HOST: cfg.qdrant_host,
+        QDRANT_USER: cfg.qdrant_user,
+        QDRANT_CONTAINER: cfg.qdrant_container,
+        QDRANT_COLLECTION: cfg.qdrant_collection,
        MODELS_OVERRIDES: '/data/models-overrides.yaml',
        SERVICES_OVERRIDES: '/data/services-overrides.yaml',
        CONNECTIVITY_LOG: '/data/connectivity.json',
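The optional overrides reach the backend as environment variables that are empty strings when left blank, so the consumer has to treat "blank" as "use the default". A minimal sketch of that fallback, assuming a hypothetical `resolve_service` helper and a `spark2_host` value supplied by the caller (neither appears in this diff; the container defaults are the ones quoted in the Configure Sparks descriptions):

```python
import os

# Container-name defaults quoted in the Configure Sparks field descriptions.
DEFAULT_CONTAINERS = {"KOKORO": "kokoro-tts", "EMBED": "spark-embed", "QDRANT": "qdrant"}


def resolve_service(prefix, env=None, spark2_host=""):
    """Resolve (host, container) for a service, treating blank values as unset.

    Hypothetical helper: the real backend may resolve these differently.
    """
    env = os.environ if env is None else env
    # Blank or whitespace-only override falls back to Spark 2.
    host = env.get(f"{prefix}_HOST", "").strip() or spark2_host
    # Blank container name falls back to the documented default.
    container = env.get(f"{prefix}_CONTAINER", "").strip() or DEFAULT_CONTAINERS[prefix]
    return host, container
```

Passing `env` explicitly keeps the helper easy to exercise without touching the real process environment.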
@@ -5,10 +5,14 @@ export const manifest = setupManifest({
  id: 'spark-control',
  title: 'Spark Control',
  license: 'MIT',
-  packageRepo: 'https://github.com/grant/spark-control',
-  upstreamRepo: 'https://github.com/grant/spark-control',
-  marketingUrl: 'https://github.com/grant/spark-control',
-  donationUrl: 'https://github.com/grant/spark-control',
+  // Placeholder URLs — replace with a real repo before publishing the package
+  // publicly. The StartOS UI shows these as "Source" and "Marketing" links;
+  // example.com is RFC 2606 reserved-for-documentation so it's an obvious
+  // "fill me in" signal rather than pointing at anyone's personal account.
+  packageRepo: 'https://example.com',
+  upstreamRepo: 'https://example.com',
+  marketingUrl: 'https://example.com',
+  donationUrl: null,
  docsUrls: [],
  description: { short, long },
  volumes: ['main'],
@@ -1,10 +1,10 @@
 import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'

 export const v0_1_0 = VersionInfo.of({
-  version: '0.13.0:3',
+  version: '0.18.0:0',
  releaseNotes: {
    en_US:
-      'v0.13.0:3 — chat-completions proxy. Adds POST /v1/chat/completions (and /v1/completions for the legacy endpoint) to Spark Control that forwards to whichever vLLM is currently loaded on Spark 1. Supports SSE streaming when stream=true in the request body. Request body is passed through unchanged — any vLLM-supported field works (model, messages, max_tokens, temperature, response_format, chat_template_kwargs, tools, ...). Closes the last gap that forced clients to know about both Spark Control AND the direct vLLM URL — recap-relay and friends can now use one trusted host for everything (transcribe, diarize, analyze) with one cert and one allowlist. 30-min request timeout to accommodate large-context completions. No parakeet container changes; no Reapply patches needed.',
+      'v0.18.0 — dual-channel mode for POST /api/audio/label-merge. Instead of one mixed-mono file, a caller (Ten31 Transcripts) can send two sample-aligned tracks: mic_file (the local user) + system_file (everyone else, from screen capture). Rather than force the diarizer to re-disentangle a mono mix (which over-segments — proven: a stereo clip of 2 clean voices returned 3 speakers), we split the problem so each model gets the easiest mono input. The mic track yields the local user\'s words, gated to windows where the mic is genuinely the user speaking (mic louder than system — a self-VAD computed server-side per-window, or supplied via self_vad); this gate is load-bearing because the mic picks up the remote audio as quiet bleed. The system track is diarized (only has to separate the remote people) and named via the visual timeline + voiceprints. The user\'s clean voiceprint is enrolled from the mic track and injected into the voiceprint library, so a system cluster that is the user dialed in from a second device (dual-login) resolves to the user, not a stranger. Validated on a real misattributing call: fixes both mono-mix misattributions, recovers the dropped-to-Unknown local line, and correctly splits overlapping speech (two people saying "Hello" at once) that the coarse ground truth itself conflated. New form fields: mic_file + system_file (dual mode), self_name, self_vad (optional). The mono file path is unchanged and fully backward-compatible. Response gains a "mode" field (mono | dual_channel). Known limit: if loud remote bleed masks a quiet local word, the mic-track ASR may miss it — mitigated by a cleaner mic (headphones) or future echo-cancellation. See docs/AUDIO_API.md.',
  },
  migrations: {
    up: async ({ effects }) => {},
@@ -72,8 +72,8 @@ ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'
 # Parakeet
 curl -s http://<spark-2-ip>:8000/health

-# Magpie (see known-issues.md)
-curl -s http://<spark-2-ip>:9000/v1/health/ready
+# Kokoro TTS (v0.14.0+)
+curl -s http://<spark-2-ip>:8880/health
 ```

 ## Hard reset
@@ -0,0 +1,213 @@
+#!/bin/bash
+# End-to-end test of the v0.10 + v0.11 audio pipeline:
+#   audio file  →  spark-control /api/audio/transcribe-with-speakers
+#                  (Parakeet + Sortformer merged)
+#               →  Qwen3.6 via vLLM with long-form prompt + speaker name
+#                  resolution
+#               →  ~/Desktop/<filename>-analysis.md
+#
+# Usage:
+#   bash scripts/test-audio-with-speakers.sh <audio-file> [--people "Name1, Name2"]
+#
+# Examples:
+#   # No participants list (LLM will only resolve speakers it can verify from audio cues)
+#   bash scripts/test-audio-with-speakers.sh ~/Library/Application\ Support/hyprnote/sessions/*/audio.mp3
+#
+#   # With known participants (LLM constrained to these names)
+#   bash scripts/test-audio-with-speakers.sh ~/Downloads/podcast.mp3 --people "Dax, Will"
+#
+# Designed to mirror exactly what recap-relay's spark-control backend will do
+# once the PR lands. If the output looks good here, the recap-relay version
+# will look the same.
+
+set -e
+
+AUDIO="${1:?Usage: $0 <audio-file> [--people \"Name1, Name2\"]}"
+PEOPLE=""
+if [ "$2" = "--people" ] && [ -n "$3" ]; then
+  PEOPLE="$3"
+fi
+
+if [ ! -f "$AUDIO" ]; then
+  echo "ERROR: audio file not found: $AUDIO" >&2
+  exit 1
+fi
+
+SPARK_CONTROL="${SPARK_CONTROL:-https://spark.satsflows.com}"
+VLLM="${VLLM:-http://<spark-1-ip>:8888/v1}"
+
+echo "════════════════════════════════════════════════════════════════"
+echo "Audio:           $AUDIO ($(du -h "$AUDIO" | cut -f1))"
+echo "Spark Control:   $SPARK_CONTROL"
+echo "vLLM:            $VLLM"
+echo "Participants:    ${PEOPLE:-<none — LLM will only resolve speakers from audio cues>}"
+echo "════════════════════════════════════════════════════════════════"
+echo
+
+# ───────── Stage 1: transcribe + diarize ─────────
+echo "▶ Stage 1: transcribe + diarize (Parakeet + Sortformer in parallel)..."
+START=$(date +%s)
+HTTP=$(curl -sSk -X POST "$SPARK_CONTROL/api/audio/transcribe-with-speakers" \
+  -F "file=@$AUDIO" \
+  -o /tmp/diarized.json \
+  -w "%{http_code}")
+END=$(date +%s)
+echo "  HTTP $HTTP, $((END - START))s wall time"
+
+if [ "$HTTP" != "200" ]; then
+  echo "ERROR — non-200 response. Full body:"
+  cat /tmp/diarized.json
+  exit 1
+fi
+
+python3 -c "
+import json
+d = json.load(open('/tmp/diarized.json'))
+print(f\"  Duration: {d['duration']}s   Speakers: {d['speakers_detected']}   Segments: {len(d['segments'])}\")"
+
+# ───────── Stage 2: format transcript ─────────
+echo
+echo "▶ Stage 2: format diarized transcript as [MM:SS] Speaker_N: text..."
+python3 > /tmp/transcript-formatted.txt <<'PY'
+import json
+d = json.load(open('/tmp/diarized.json'))
+out = []
+for s in d['segments']:
+    total_s = s['start_ms'] // 1000  # whole seconds since the start of the audio
+    h, m, sec = total_s // 3600, (total_s % 3600) // 60, total_s % 60
+    ts = f"{h}:{m:02d}:{sec:02d}" if h else f"{m:02d}:{sec:02d}"
+    out.append(f"[{ts}] {s['speaker']}: {s['text']}")
+print("\n".join(out))
+PY
+echo "  $(wc -l < /tmp/transcript-formatted.txt) formatted lines"
+echo "  Sample (first 3):"
+head -3 /tmp/transcript-formatted.txt | sed 's/^/    /'
+
+# ───────── Stage 3: discover current LLM ─────────
+echo
+echo "▶ Stage 3: discover current vLLM model..."
+MODEL=$(curl -sS "$VLLM/models" | python3 -c "import json,sys; print(json.load(sys.stdin)['data'][0]['id'])")
+echo "  Model: $MODEL"
+
+# ───────── Stage 4: build LLM request ─────────
+echo
+echo "▶ Stage 4: build LLM request with speaker-name-resolution prompt..."
+python3 - "$MODEL" /tmp/transcript-formatted.txt "$PEOPLE" > /tmp/request.json <<'PY'
+import json, sys
+model, transcript_path, people = sys.argv[1], sys.argv[2], sys.argv[3]
+transcript = open(transcript_path).read()
+
+participants_block = ""
+if people.strip():
+    participants_block = f"""
+
+Known participants in this conversation: {people}
+Constrain your speaker→name mappings to this list. Still only assign a
+name when the audio cues unambiguously identify which participant is
+which — do not guess based on topic or role."""
+
+system = (
+    "You are a meeting analyst producing comprehensive long-form notes. "
+    "Preserve specific quotes, numbers, dates, names, and decisions verbatim. "
+    "Quote speakers directly when they said something memorable. "
+    "Generate as many sections as the meeting naturally has. "
+    "Do not summarize aggressively — aim for 3000-6000 words for a 60-90 min conversation."
+)
+
+user_prompt = f"""You will analyze a transcript with anonymous speaker labels (Speaker_0, Speaker_1, ...).
+
+CRITICAL — speaker name resolution rules:
+  Map a speaker label to a real name ONLY when you have direct, unambiguous evidence:
+    - The speaker explicitly identifies themselves ("I'm X", "this is X", "my name is X")
+    - Another speaker addresses them by name as a vocative ("thanks X", "X, what do you think?")
+  If you have ANY doubt, leave the mapping as null. False mappings are worse than no mapping.
+  Do NOT infer names from topic context, role descriptions, or weak associations.{participants_block}
+
+OUTPUT FORMAT — produce exactly two parts:
+
+PART 1: A JSON block at the very top of your response with this shape:
+```json
+{{
+  "speaker_mapping": {{
+    "Speaker_0": {{"name": "Real Name", "confidence": "high", "evidence": "quoted line + [MM:SS]"}},
+    "Speaker_1": {{"name": null, "confidence": null, "evidence": null}}
+  }}
+}}
+```
+
+PART 2: Below the JSON, a structured long-form report with these sections:
+
+# Detailed Discussion Log
+Chronological account of every topic discussed, with verbatim quotes from speakers for important points. Aim for 8+ bullets per major topic. Use sub-bullets for examples or supporting detail.
+
+# Decisions Made
+Every decision, with who proposed it, who agreed, any dissent, and rationale.
+
+# Action Items
+Every action item, with owner, deadline, and any context. Include even minor "I'll think about it" commitments.
+
+# Open Questions
+Things raised that weren't resolved, with who raised them.
+
+# Key Quotes
+Direct quotes worth preserving, with speaker attribution.
+
+In the report body: use REAL NAMES where you mapped them, and Speaker_N where you couldn't.
+
+---
+
+TRANSCRIPT:
+
+{transcript}"""
+
+print(json.dumps({
+    "model": model,
+    "messages": [
+        {"role": "system", "content": system},
+        {"role": "user", "content": user_prompt},
+    ],
+    "max_tokens": 16000,
+    "temperature": 0.3,
+    "chat_template_kwargs": {"enable_thinking": False},
+}))
+PY
+REQ_BYTES=$(wc -c < /tmp/request.json)
+echo "  Request size: $REQ_BYTES bytes"
+
+# ───────── Stage 5: LLM call ─────────
+echo
+echo "▶ Stage 5: send to Qwen3.6 (this is the slow part — 30-90s typical)..."
+START=$(date +%s)
+curl -sS "$VLLM/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d @/tmp/request.json \
+  > /tmp/llm-raw.json
+END=$(date +%s)
+echo "  Wall time: $((END - START))s"
+
+# Extract content
+python3 -c "
+import json, sys
+r = json.load(open('/tmp/llm-raw.json'))
+if 'choices' not in r:
+    # Fail loudly (set -e aborts the script) rather than saving an error
+    # dump as the report.
+    sys.exit('ERROR: unexpected response:\n' + json.dumps(r, indent=2))
+print(r['choices'][0]['message']['content'])
+" > /tmp/analysis.md
+
+# ───────── Stage 6: save + display ─────────
+BASENAME=$(basename "$AUDIO" | sed 's/\.[^.]*$//')
+DEST="$HOME/Desktop/${BASENAME}-analysis.md"
+cp /tmp/analysis.md "$DEST"
+echo
+echo "════════════════════════════════════════════════════════════════"
+echo "✔ Saved: $DEST"
+echo "  ($(wc -l < "$DEST") lines, $(wc -w < "$DEST") words)"
+echo "════════════════════════════════════════════════════════════════"
+echo
+echo "─── Top of the report (speaker mapping JSON, if produced) ───"
+head -30 "$DEST"
+echo "..."
+echo
+open -a "TextEdit" "$DEST"
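The script saves the model's two-part reply as-is and never consumes the `speaker_mapping` block. A downstream consumer could split the reply and apply the resolved names to the formatted transcript roughly as follows. This is a sketch under assumptions: the helper names `split_report`, `resolved_names`, and `apply_names` are hypothetical, and it assumes the model followed the requested format (a fenced `json` block first, report below).

```python
import json
import re

FENCE = "`" * 3  # built this way so the example itself stays fence-safe
_BLOCK = re.compile(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, re.DOTALL)
_LABEL = re.compile(r"\bSpeaker_\d+\b")


def split_report(text):
    """Return (speaker_mapping, report_body) from the model's two-part reply."""
    match = _BLOCK.search(text)
    if match is None:
        # No JSON block produced: treat the whole reply as the report.
        return {}, text.strip()
    try:
        mapping = json.loads(match.group(1)).get("speaker_mapping", {})
    except json.JSONDecodeError:
        return {}, text.strip()
    return mapping, text[match.end():].strip()


def resolved_names(mapping):
    """Keep only labels the model actually mapped to a non-null name."""
    return {label: info["name"] for label, info in mapping.items()
            if isinstance(info, dict) and info.get("name")}


def apply_names(transcript, names):
    """Replace Speaker_N labels with resolved names; leave the rest untouched."""
    return _LABEL.sub(lambda m: names.get(m.group(0), m.group(0)), transcript)
```

Matching whole labels with `\bSpeaker_\d+\b` avoids rewriting `Speaker_10` when only `Speaker_1` was resolved, and unmapped labels stay as `Speaker_N`, consistent with the prompt's rule that a missing mapping is better than a false one.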