v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API

- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:21 -05:00
parent 4a75274db3
commit 8d839e3714
37 changed files with 3763 additions and 197 deletions
@@ -0,0 +1,288 @@
+# Spark Control — Audio API reference (transcription + diarization)
+
+For external clients (e.g. the **Ten31 Transcripts** capture app) integrating with
+the transcription and diarization endpoints. All examples are **real responses**
+from the live deployment.
+
+---
+
+## 1. Connection / auth
+
+- **Base URL:** `https://<spark-control-host>` (the operator's Start9 LAN address,
+  e.g. `https://<spark-control-host>:62419`). A `.local` form also exists (survives IP
+  changes); the operator can provide it.
+- **TLS:** Start9's self-signed Root CA. On the LAN, set `verify=False` /
+  `rejectUnauthorized:false` (curl `-k`), or install the Start9 Root CA into your
+  trust store. Same story as every other Spark Control endpoint.
+- **Auth:** **none on the LAN** today — the endpoints sit behind StartOS access
+  control + TLS on a trusted network. No bearer token / API key. (If you need
+  per-client auth later, the operator can add it; it's not there now.)
+- **Limits:**
+  - Max upload: **200 MB** per request (`413` if exceeded). For long calls, chunk
+    (see §4).
+  - Request timeout: transcription ~300 s, diarization ~600 s per request.
+  - **Send requests sequentially, not in parallel.** Concurrent audio requests can
+    trip a GPU FFT race on the backend (returns `503` + `Retry-After`). One in
+    flight at a time is the safe pattern.
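
A minimal sketch of the one-in-flight pattern with `503` handling. The `post` callable is hypothetical (any wrapper around your HTTP client that uploads one chunk and returns `(status, headers, body)`), and `Retry-After` is assumed to carry a number of seconds; neither detail is specified above.

```python
import time
from typing import Callable, Iterable, List, Tuple

# A response is modelled as (status_code, headers, body).
Response = Tuple[int, dict, object]


def send_sequentially(
    chunks: Iterable[bytes],
    post: Callable[[bytes], Response],  # hypothetical upload wrapper
    max_retries: int = 3,
    default_wait_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Upload chunks one at a time, retrying a chunk when the backend returns 503."""
    results = []
    for chunk in chunks:
        for attempt in range(max_retries + 1):
            status, headers, body = post(chunk)
            if status != 503:
                break
            if attempt == max_retries:
                raise RuntimeError("backend still returning 503 after retries")
            # Assumption: Retry-After is a number of seconds, not an HTTP date.
            try:
                wait_s = float(headers.get("Retry-After", default_wait_s))
            except ValueError:
                wait_s = default_wait_s
            sleep(wait_s)
        if status >= 400:
            raise RuntimeError(f"request failed with HTTP {status}")
        results.append(body)
    return results
```

Because the next upload starts only after the previous one returns, there is never more than one audio request in flight.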
+
+---
+
+## 2. Audio format
+
+- The backend (NVIDIA Parakeet / NeMo) **decodes common formats** (WAV, FLAC, MP3,
+  m4a) and internally resamples to **16 kHz mono**. So **16 kHz mono WAV is ideal**;
+  anything decodable also works (stereo is downmixed).
+- **Single mixed-mono file** is what the endpoints expect. Diarization (Sortformer)
+  separates speakers *from one mixed stream*, so for diarization you want everyone
+  in one file — **mix your system-audio track + mic track to one mono WAV** before
+  sending.
+- **Your two-track capture is an asset:** your mic track is, by definition, *you* —
+  a known identity. Two clean options:
+  1. **Mix both tracks → diarize the mix** (simplest; pairs perfectly with your
+     visual-timeline name-merge — see §4 note).
+  2. **Diarize only the system track** (the other participants) and label your own
+     mic track as the user directly (no diarization needed for your own voice).
+- **Upload mechanism:** `multipart/form-data`, file field name **`file`** (OpenAI-
+  compatible). Not base64, not a path/URL — send the bytes.
+
+---
+
+## 3. Transcription endpoint
+
+**`POST /v1/audio/transcriptions`** — OpenAI-compatible. **Synchronous** (returns the
+result; no job/polling).
+
+Multipart fields:
+| field | required | notes |
+|---|---|---|
+| `file` | **yes** | the audio bytes |
+| `model` | no | default `parakeet-tdt-0.6b-v3` (one STT model server-side; you don't need to pick) |
+| `response_format` | no | `json` (default, just text) · `verbose_json` (timestamps) · `text` |
+| `language` | no | default auto/en |
+| `temperature`, `prompt` | no | passthrough |
+
+```bash
+curl -k -X POST https://<host>/v1/audio/transcriptions \
+  -F "file=@call.wav" -F "response_format=verbose_json"
+```
+
+**Real `verbose_json` response**, with **word-level AND segment-level
+timestamps** (seconds); the `segments` and `words` arrays are truncated here:
+```json
+{
+  "task": "transcribe",
+  "language": "en",
+  "duration": 9.259,
+  "text": "Good morning everyone. I think the energy thesis is strong this quarter. I agree, but I am worried about the lockup terms and the fee load this time.",
+  "segments": [
+    { "start": 0.0,  "end": 1.28, "text": "Good morning everyone." },
+    { "start": 1.44, "end": 4.48, "text": "I think the energy thesis is strong this quarter." }
+  ],
+  "words": [
+    { "start": 0.0,  "end": 0.32, "text": "Good" },
+    { "start": 0.32, "end": 0.72, "text": "morning" },
+    { "start": 0.8,  "end": 1.28, "text": "everyone." }
+  ]
+}
+```
+(`json` → `{"text": "..."}`; `text` → plain text body.)
+
+---
+
+## 4. Diarization
+
+Two endpoints, both **synchronous**, both returning **anonymous clusters**
+(`Speaker_0`, `Speaker_1`, …) with timestamps. NVIDIA **Sortformer** owns
+segmentation; **TitaNet** produces a voiceprint per speaker.
+
+### `POST /api/audio/diarize-chunk` — segmentation + voice fingerprints
+
+Multipart: `file` (required). Designed to be called **per chunk** for long calls;
+returns a 192-dim fingerprint per local speaker so you can re-cluster the same
+person across chunks.
+
+```bash
+curl -k -X POST https://<host>/api/audio/diarize-chunk -F "file=@call.wav"
+```
+**Real response:**
+```json
+{
+  "duration": 9.259,
+  "segments": [
+    { "start_s": 0.0,  "end_s": 1.52, "speaker": "Speaker_0", "confidence": 0.931 },
+    { "start_s": 1.6,  "end_s": 4.56, "speaker": "Speaker_0", "confidence": 0.9662 },
+    { "start_s": 4.88, "end_s": 9.04, "speaker": "Speaker_1", "confidence": 0.9681 }
+  ],
+  "speakers_detected": ["Speaker_0", "Speaker_1"],
+  "fingerprints": {
+    "Speaker_0": [0.0028, 0.0173, -0.0114, "...192 floats"],
+    "Speaker_1": [0.0020, 0.0056, -0.0045, "...192 floats"]
+  },
+  "models": {
+    "diarization": "nvidia/diar_sortformer_4spk-v1",
+    "embedding": "nvidia/speakerverification_en_titanet_large"
+  }
+}
+```
+- `confidence` ∈ [0,1] = mean probability the assigned speaker was active over the
+  segment (threshold it to render uncertain segments as "Speaker_0?").
+- `speaker` labels are **local to this chunk** — use the fingerprints + cosine
+  similarity (NeMo default distance threshold ~0.7) to merge `chunkA.Speaker_0` with
+  `chunkB.Speaker_2` when they're the same voice.
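
A sketch of the cross-chunk merge described above, in plain Python. The names `stitch_speakers` and `Global_N` are illustrative, and treating 0.7 as a minimum cosine *similarity* is an assumption; tune it on your own calls.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def stitch_speakers(
    global_prints: Dict[str, List[float]],
    chunk_prints: Dict[str, List[float]],
    threshold: float = 0.7,  # assumed minimum similarity for "same voice"
) -> Dict[str, str]:
    """Map one chunk's local labels to call-wide ids, enrolling unseen voices."""
    # Compare only against voices known before this chunk.
    known = list(global_prints.items())
    mapping = {}
    for local, vec in chunk_prints.items():
        best_id, best_sim = None, threshold
        for gid, gvec in known:
            sim = cosine_similarity(vec, gvec)
            if sim >= best_sim:
                best_id, best_sim = gid, sim
        if best_id is None:
            # No known voice is close enough: enroll a new call-wide speaker.
            best_id = f"Global_{len(global_prints)}"
            global_prints[best_id] = vec
        mapping[local] = best_id
    return mapping
```

Call it once per chunk with the `fingerprints` object from that chunk's response, reusing the same `global_prints` dict. This greedy version can map two local speakers to the same call-wide id; a one-to-one assignment would need an extra step.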
+
+### `POST /api/audio/transcribe-with-speakers` — ASR + diarization merged
+
+Multipart: `file` (required). Runs transcription + diarization and **merges by
+timestamp** into speaker-attributed text blocks (`start_ms`/`end_ms`).
+```json
+{
+  "duration": 9.259, "language": "en",
+  "speakers_detected": ["Speaker_0", "Speaker_1"],
+  "segments": [
+    { "start_ms": 0,    "end_ms": 4480, "speaker": "Speaker_0",
+      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
+    { "start_ms": 4800, "end_ms": 9040, "speaker": "Speaker_1",
+      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
+  ],
+  "models": { "transcription": "parakeet", "diarization": "nvidia/diar_sortformer_4spk-v1" }
+}
+```
+
+### Your key question — can it accept a *prior* (named timeline) to label clusters?
+
+**Yes — that's exactly what `/api/audio/label-merge` does (§4.5 below).** You POST
+the audio plus your visual `(start, end, name)` timeline; the backend diarizes,
+runs the majority-temporal-overlap vote, and returns **named** segments — no
+client-side merge needed. (The two endpoints above still return anonymous clusters
+if you'd rather do the merge yourself.) Note the diarizer is fixed at **≤4 speakers
+per chunk** (NVIDIA Sortformer, not pyannote) and takes **no `num_speakers` hint** —
+for >4-person calls, chunk the audio; your visual timeline actually helps
+disambiguate across chunks.
+
+**The fingerprints compound.** Every named cluster comes back with its 192-dim
+voiceprint. Persist those (keyed by name) and pass them back as `known_voiceprints`
+on the next call — `label-merge` will recover a speaker by **voice** even when the
+visual cue is missing (camera off, a bad OCR frame). Your visual capture *enrolls a
+voice library for free*.
+
+---
+
+## 4.5 `POST /api/audio/label-merge` — named segments from a visual timeline
+
+Diarize + **name the clusters** from your screen-derived timeline (majority temporal
+overlap), with an optional **voiceprint fallback** for anyone the visual track
+missed. **Synchronous. Stateless** — you own the timeline and the voiceprint
+library; the backend just diarizes + merges and persists nothing.
+
+`multipart/form-data` fields:
+| field | required | notes |
+|---|---|---|
+| `file` | **yes** | mixed-mono audio |
+| `timeline` | **yes** | JSON array: `[{"start":0.0,"end":4.5,"name":"Alice","confidence":0.9}, ...]` (seconds) |
+| `known_voiceprints` | no | JSON object `{"Alice":[192 floats], "Bob":[...]}` — named voiceprints from past calls, used to label clusters with no visual overlap |
+| `transcribe` | no | `"true"` to also return per-segment text (default false) |
+| `min_overlap` | no | min fraction of a cluster's time that must overlap the winning name (default `0.0` = any overlap wins) |
+| `voiceprint_threshold` | no | cosine similarity to accept a voiceprint match (default `0.5`) |
+
+```bash
+curl -k -X POST https://<host>/api/audio/label-merge \
+  -F "file=@call.wav" \
+  -F 'timeline=[{"start":0,"end":4.5,"name":"Alice"},{"start":4.8,"end":9.3,"name":"Bob"}]' \
+  -F "transcribe=true"
+```
+
+**Real response** (the 2-speaker test clip; visual named both, with transcript):
+```json
+{
+  "duration": 9.259,
+  "speakers": [
+    { "cluster": "Speaker_0", "name": "Alice", "source": "visual", "overlap_confidence": 0.9866,
+      "fingerprint": [0.0028, 0.0173, "...192 floats"] },
+    { "cluster": "Speaker_1", "name": "Bob",   "source": "visual", "overlap_confidence": 1.0,
+      "fingerprint": [0.0020, 0.0056, "...192 floats"] }
+  ],
+  "segments": [
+    { "start_ms": 0,    "end_ms": 4480, "speaker": "Alice", "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
+    { "start_ms": 4800, "end_ms": 9040, "speaker": "Bob",   "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
+  ],
+  "fingerprints": { "Alice": ["...192 floats"], "Bob": ["...192 floats"] },
+  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1", "embedding": "nvidia/speakerverification_en_titanet_large" }
+}
+```
+
+**Name resolution per cluster, in order:** (1) the visual-timeline name with the
+most temporal overlap (`source: "visual"`); (2) if none, the closest
+`known_voiceprints` match above `voiceprint_threshold` (`source: "voiceprint"`,
+with `match_similarity`); (3) otherwise `Unknown_N` (`source: "unmatched"`) — never
+mislabeled. The `fingerprints` map (keyed by the resolved name) is what you persist
+to grow your voiceprint library for the next call. When `transcribe=false`, segments
+are `{start_s, end_s, speaker, confidence}` instead of text blocks.
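
The resolution order above can be sketched client-side as follows. This illustrates the documented behavior and is not the server's code; the function names and the exact tie-breaking are assumptions.

```python
import math
from typing import Dict, List, Optional, Tuple


def overlap_s(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Seconds shared by two intervals (0.0 when they do not intersect)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_name(
    cluster_segments: List[Tuple[float, float]],
    fingerprint: List[float],
    timeline: List[dict],
    known_voiceprints: Dict[str, List[float]],
    unknown_index: int,
    min_overlap: float = 0.0,
    voiceprint_threshold: float = 0.5,
) -> Tuple[str, str]:
    """Return (name, source) for one diarized cluster."""
    # (1) Visual timeline: the name with the most overlapping seconds wins.
    votes: Dict[str, float] = {}
    for seg_start, seg_end in cluster_segments:
        for entry in timeline:
            secs = overlap_s(seg_start, seg_end, entry["start"], entry["end"])
            if secs > 0:
                votes[entry["name"]] = votes.get(entry["name"], 0.0) + secs
    if votes:
        name, secs = max(votes.items(), key=lambda kv: kv[1])
        total = sum(end - start for start, end in cluster_segments)
        if total > 0 and secs / total >= min_overlap:
            return name, "visual"
    # (2) Voiceprint fallback: closest known voice at or above the threshold.
    best: Optional[str] = None
    best_sim = voiceprint_threshold
    for name, vec in known_voiceprints.items():
        sim = cosine(fingerprint, vec)
        if sim >= best_sim:
            best, best_sim = name, sim
    if best is not None:
        return best, "voiceprint"
    # (3) Nothing matched: keep the cluster anonymous rather than guess.
    return f"Unknown_{unknown_index}", "unmatched"
```

Running this over the `diarize-chunk` output is one way to do the merge yourself if you prefer not to call `label-merge`.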
+
+**Verified live** — visual match (both speakers named), voiceprint recovery (a
+camera-off speaker matched by voice), and unmatched (→ `Unknown_0`) all confirmed.
+
+### Dual-channel mode (recommended for Ten31 Transcripts)
+
+If you capture two sample-aligned tracks — **`mic_file`** (the local user) + **`system_file`**
+(everyone else, from screen capture) — send them *instead of* `file`. This is strictly
+better than mixing to mono: the diarizer over-segments a mono mix (a stereo clip of two
+clean voices comes back as **3** speakers), whereas the two channels let each model get
+the easiest possible mono input.
+
+Extra form fields for dual mode:
+| field | required | notes |
+|---|---|---|
+| `mic_file` + `system_file` | **yes (dual)** | the two aligned mono-16k tracks |
+| `self_name` | no | the local user's name (mic channel). Default `"Me"`. |
+| `self_vad` | no | JSON `[{"start","end"}]` — windows where the mic is active *and louder than* system. If omitted, computed server-side per-window. |
+
+How it works: the **mic track** → your words, gated to windows where the mic is genuinely
+you speaking (the mic also picks up the remote audio as quiet bleed, so this loudness gate
+is essential — without it the bleed gets transcribed as you). The **system track** →
+diarized (it only has to separate the *remote* people) and named via the timeline +
+voiceprints. Your clean voiceprint is **enrolled from the mic track** and injected into the
+library, so a system cluster that's you dialed in from a second device (dual-login) resolves
+to you, not a stranger. You also free a Sortformer speaker slot (you no longer consume one).
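
If you want to supply `self_vad` yourself, the loudness gate can be approximated as below. This is a sketch under assumptions: samples are floats in [-1, 1] from two sample-aligned tracks, and the 0.5 s window and `floor` value are illustrative choices, not the server's settings.

```python
import math
from typing import Dict, List, Sequence


def rms(samples: Sequence[float]) -> float:
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def compute_self_vad(
    mic: Sequence[float],
    system: Sequence[float],
    sample_rate: int = 16000,
    window_s: float = 0.5,  # assumed window length
    floor: float = 0.01,    # assumed silence floor for the mic track
) -> List[Dict[str, float]]:
    """Windows where the mic is active and louder than the system track."""
    win = max(1, int(sample_rate * window_s))
    n = min(len(mic), len(system))
    windows: List[Dict[str, float]] = []
    for i in range(0, n, win):
        j = min(i + win, n)
        mic_rms = rms(mic[i:j])
        if mic_rms > floor and mic_rms > rms(system[i:j]):
            start, end = i / sample_rate, j / sample_rate
            if windows and windows[-1]["end"] == start:
                windows[-1]["end"] = end  # extend the previous adjacent window
            else:
                windows.append({"start": start, "end": end})
    return windows
```

Serialize the result with `json.dumps(...)` and send it as the `self_vad` form field.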
+
+```bash
+curl -k -X POST https://<host>/api/audio/label-merge \
+  -F "mic_file=@mic.wav" -F "system_file=@system.wav" \
+  -F "self_name=Alice" -F 'timeline=[...]' -F "transcribe=true" \
+  -F 'known_voiceprints={"Alice":[...],"Bob":[...]}'   # include your own
+```
+
+Response is the same shape with `"mode":"dual_channel"`; `speakers` includes a
+`{"name":self_name,"source":"mic_channel"}` entry, and `fingerprints[self_name]` is your
+clean mic-enrolled voiceprint to store.
+
+**Validated on a real misattributing call:** dual-channel fixed both mono-mix
+misattributions (a remote "Go Bitcoin" no longer credited to the user; a local "There"
+recovered from `Unknown`), and **correctly split overlapping speech** — two people saying
+"Hello" at once that the coarse ground truth itself had conflated.
+
+> **One known limit:** if *loud* remote bleed masks a *quiet* local word, the mic-track ASR
+> can miss it entirely (we can't attribute a word that was never transcribed). A cleaner mic
+> (headphones, so there's no speaker bleed) avoids it; channel-subtraction echo-cancellation
+> is a possible future enhancement since the tracks are sample-aligned.
+
+---
+
+## 5. Anything else
+
+- **No OpenAPI/Swagger yet.** This doc + the curl examples are the contract.
+- **Health / discovery:**
+  - `GET /api/status` — per-service health (`parakeet`, etc.).
+  - `GET /api/endpoints` — service-discovery JSON (base URLs + ready flags).
+  - `GET /v1/models` — lists the STT model + diarizer.
+- **Errors:** JSON body, conventional status codes — `400` malformed, `413` too
+  large, `503` + `Retry-After` if the backend briefly wedges (retry after the
+  interval; transcription auto-recovers). Most error bodies are `{"detail": "..."}`.
+- **Long calls:** chunk into ~2–3 min pieces, send **sequentially**, diarize each
+  with `/api/audio/diarize-chunk`, and stitch speakers across chunks via the
+  fingerprints. (The operator's other apps use exactly this pattern.)
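
One way to cut a long WAV into pieces for that pattern, using only the standard library. This is a sketch: the 150 s default is simply the midpoint of the 2–3 min guidance, and cuts fall at fixed offsets rather than at silences.

```python
import io
import wave
from typing import List


def split_wav(wav_bytes: bytes, chunk_s: float = 150.0) -> List[bytes]:
    """Split a WAV file into consecutive WAV chunks of at most chunk_s seconds."""
    chunks: List[bytes] = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * chunk_s)
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                # Copy the source format so each chunk is a valid standalone WAV.
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

Each returned element can be uploaded as the `file` field. Remember to add the chunk's start offset to the returned timestamps when you stitch results together.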
+
+---
+
+*Backend: NVIDIA Parakeet TDT 0.6B (STT) + Sortformer 4spk-v1 (diarization) +
+TitaNet (voice fingerprints) on DGX Spark, fronted by Spark Control. All on the
+operator's LAN — nothing leaves the box.*
@@ -0,0 +1,202 @@
+# Embeddings + Retrieval through Spark Control (v0.15.0)
+
+Spark Control now fronts a local RAG stack so your agent/CRM system can do
+dense embeddings, reranking, and hybrid vector search against one trusted host
+— same TLS cert and allowlist as the LLM and audio endpoints.
+
+## What runs where
+
+| Component | Host | Port | Role |
+|---|---|---|---|
+| **spark-embed** | Spark 2 (GPU) | 8088 | `BAAI/bge-m3` dense embeddings (1024-d) + `BAAI/bge-reranker-v2-m3` cross-encoder rerank |
+| **Qdrant** | Spark 2 (CPU) | 6333/6334 | Vector storage, hybrid dense+sparse retrieval, RRF fusion, payload filtering |
+| **Spark Control** | Start9 | (your LAN URL) | Proxies all of the above behind one host |
+
+`spark-embed` is a small FastAPI server built **from the NGC PyTorch image**
+(the torch we've proven runs on the GB10's sm_121 GPU). We did *not* use HF Text
+Embeddings Inference because as of 2026 it ships no arm64 CUDA image. No
+torchaudio, no flash-attn — bge-m3 + the reranker are plain XLM-RoBERTa
+encoders.
+
+## Endpoints (all on the Spark Control host)
+
+### `POST /v1/embeddings` — dense embeddings (OpenAI-compatible)
+
+```json
+// request
+{ "input": "text to embed", "model": "BAAI/bge-m3" }
+// or  { "input": ["batch", "of", "texts"] }
+
+// response (OpenAI shape)
+{
+  "object": "list",
+  "data": [ { "object": "embedding", "index": 0, "embedding": [0.01, ...1024 floats] } ],
+  "model": "BAAI/bge-m3",
+  "usage": { "prompt_tokens": 0, "total_tokens": 0 }
+}
+```
+
+Vectors are L2-normalized by default (cosine == dot product). Works with the
+stock OpenAI Python/JS client by pointing `base_url` at Spark Control.
+
+### `POST /v1/rerank` — cross-encoder rerank
+
+```json
+// request
+{ "query": "did Brightwater commit?", "documents": ["chunk a", "chunk b", ...],
+  "top_n": 5, "return_documents": false }
+
+// response (Cohere-ish)
+{ "object": "rerank.result", "model": "BAAI/bge-reranker-v2-m3",
+  "results": [ { "index": 3, "relevance_score": 5.21 }, { "index": 0, "relevance_score": 1.04 }, ... ] }
+```
+
+`relevance_score` is the reranker's raw logit (higher = more relevant; unbounded,
+roughly −10..+10). Sort desc. Send a candidate set (≤200), not your whole corpus.
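
A small sketch of consuming the rerank response: map each `index` back to your candidate list and sort by score. The `sigmoid` helper is optional and only for display; squashing the raw logit into 0..1 is a common convention for cross-encoder scores, not something this endpoint does for you.

```python
import math
from typing import List, Tuple


def top_documents(
    documents: List[str], results: List[dict], top_n: int = 5
) -> List[Tuple[str, float]]:
    """Pair each result with its source document, highest relevance first."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ranked[:top_n]]


def sigmoid(logit: float) -> float:
    """Squash a raw logit into (0, 1) for display or thresholding."""
    return 1.0 / (1.0 + math.exp(-logit))
```

Keep the raw logits for ranking; the squashed value is only easier to read and to threshold.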
+
+### `POST /api/search` — orchestrated hybrid retrieval
+
+One call: embeds the query (dense), retrieves from Qdrant (hybrid dense+sparse
+with RRF **when you supply a sparse vector**, else dense-only), optionally
+cross-encoder reranks, returns top_k.
+
+```json
+// request
+{
+  "query": "Did Brightwater commit to the Fund III close in Q1?",
+  "collection": "crm_chunks",
+  "top_k": 8,
+  "retrieve_n": 80,                 // first-stage candidates (default max(50, top_k*10))
+  "sparse": { "indices": [12, 904], "values": [0.7, 1.2] },  // optional BM25 vector for hybrid
+  "fusion": "rrf",                  // or "dbsf"
+  "filter": {                       // raw Qdrant filter (pre-filter, see below)
+    "must": [ { "key": "lp_id", "match": { "value": "lp_0427" } } ]
+  },
+  "rerank": true,
+  "text_field": "text",             // payload field holding the chunk text
+  "with_payload": true
+}
+
+// response
+{
+  "object": "search.result_list",
+  "model": "BAAI/bge-m3+bge-reranker-v2-m3",
+  "query": "...",
+  "collection": "crm_chunks",
+  "reranked": true,
+  "data": [
+    { "object": "search.result", "index": 0, "id": "chunk_99c1_3",
+      "score": 5.21, "fused_score": 0.41, "rerank_score": 5.21,
+      "text": "...Brightwater confirmed verbal commitment...",
+      "payload": { "lp_id": "lp_0427", "doc_type": "email", "date_ts": 1771027200, ... } }
+  ],
+  "usage": { "embed_ms": 21, "qdrant_ms": 9, "rerank_ms": 140, "candidates": 80 }
+}
+```
+
+`score` is the rerank score when reranked, else the fused/dense score.
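
For intuition about the `rrf` option, here is a minimal sketch of reciprocal rank fusion. It uses the textbook constant `k = 60`; the constant Qdrant applies internally may differ, so expect the ordering logic to match but not the absolute `fused_score` values.

```python
from typing import Dict, Hashable, List


def rrf_fuse(rankings: List[List[Hashable]], k: int = 60) -> Dict[Hashable, float]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: Dict[Hashable, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

An id that ranks well in both the dense and the sparse list beats one that tops only a single list, which is why hybrid retrieval helps with exact entity names.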
+
+## The sparse (BM25) story — important for entity-heavy data
+
+bge-m3's dense vectors carry semantic meaning, but exact entity matches
+(fund names, tickers, people) are a **lexical** signal. For that you want hybrid
+dense + sparse. Two design facts:
+
+1. **spark-embed serves dense only.** TEI/Infinity can't emit bge-m3's learned
+   sparse either, and bge-m3's pretrained sparse weights underweight novel
+   entity tokens (brand-new fund names). So we don't use bge-m3 sparse.
+2. **Use BM25 sparse vectors with Qdrant's built-in IDF.** You generate the term
+   vectors client-side; Qdrant applies IDF server-side over *your* corpus, so
+   novel entity strings get correct weight.
+
+### Your ingest pipeline (the part you own)
+
+For each chunk, produce and upsert **both** vectors:
+
+- **dense**: call Spark Control `POST /v1/embeddings` → 1024-d vector.
+- **sparse**: run [FastEmbed](https://github.com/qdrant/fastembed) BM25 client-side:
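+  ```python
+  from fastembed import SparseTextEmbedding
+
+  bm25 = SparseTextEmbedding(model_name="Qdrant/bm25")
+  emb = next(iter(bm25.embed([chunk_text])))  # chunk_text: the chunk's text (str)
+  # emb.indices / emb.values are numpy arrays; convert them for the JSON upsert
+  sp = {"indices": emb.indices.tolist(), "values": emb.values.tolist()}
+  ```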
+
+Create the collection with a named dense vector and a named sparse vector that
+uses `modifier: idf` (so Qdrant applies IDF server-side):
+
+```json
+PUT /collections/crm_chunks
+{ "vectors": { "dense": { "size": 1024, "distance": "Cosine" } },
+  "sparse_vectors": { "sparse": { "modifier": "idf" } } }
+```
+
+Add payload indexes for your filter fields so filtered queries stay fast:
+
+```json
+PUT /collections/crm_chunks/index   { "field_name": "lp_id",     "field_schema": "keyword" }
+PUT /collections/crm_chunks/index   { "field_name": "doc_type",  "field_schema": "keyword" }
+PUT /collections/crm_chunks/index   { "field_name": "date_ts",   "field_schema": "integer" }
+```
+
+Upsert points with both vectors + payload:
+
+```json
+PUT /collections/crm_chunks/points
+{ "points": [ { "id": 1,
+    "vector": { "dense": [...1024...], "sparse": { "indices": [...], "values": [...] } },
+    "payload": { "lp_id": "lp_0427", "lp_name": "Brightwater Capital",
+                 "doc_type": "email", "date_ts": 1771027200, "text": "..." } } ] }
+```
+
+### At query time
+
+Generate the query's BM25 sparse vector with the **same** FastEmbed model, and
+pass it to `/api/search` as `sparse`. Spark Control fuses dense+sparse with RRF
+inside Qdrant, then reranks. If you omit `sparse`, you get dense + rerank
+(still good, just no exact-lexical leg).
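
A sketch of assembling the `/api/search` body at query time. Field names follow the request example above; the helper name and its arguments are illustrative, and the sparse indices/values are whatever your BM25 encoder produced for the query.

```python
from typing import List, Optional


def build_search_request(
    query: str,
    collection: str,
    sparse_indices: Optional[List[int]] = None,
    sparse_values: Optional[List[float]] = None,
    top_k: int = 8,
    lp_id: Optional[str] = None,
) -> dict:
    """Build the JSON body; hybrid when a sparse vector is supplied, else dense-only."""
    body = {
        "query": query,
        "collection": collection,
        "top_k": top_k,
        "rerank": True,
        "with_payload": True,
    }
    if sparse_indices is not None and sparse_values is not None:
        body["sparse"] = {
            "indices": [int(i) for i in sparse_indices],
            "values": [float(v) for v in sparse_values],
        }
        body["fusion"] = "rrf"
    if lp_id is not None:
        # Pre-filter on the indexed payload field.
        body["filter"] = {"must": [{"key": "lp_id", "match": {"value": lp_id}}]}
    return body
```

POST the result as JSON to `/api/search`. Leaving out the sparse arguments gives you the dense + rerank path described above.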
+
+> You can talk to Qdrant directly (`http://<spark2>:6333`) for collection
+> management and upserts — that's the natural home for ingest. `/api/search` is
+> the convenience path for the read side so your agents hit one host. If you'd
+> rather Spark Control proxy Qdrant admin too, say the word.
+
+## Chunking guidance (entity-heavy CRM)
+
+- One chunk per email / note / transcript-turn; one chunk per memo *section* —
+  don't split mid-thought.
+- Keep entity + date as **payload fields** (filterable), not buried in embedded text.
+- Time-aware: don't merge a 2022 note with a 2026 update in one chunk; store
+  `date_ts` so you can pre-filter and recency-boost.
+- Resolve entity-name variants ("J. Smith" / "Jonathan Smith" / "JS") to one
+  canonical `lp_id` at ingest, or retrieval fragments across variants.
+
+## Pre-filtering
+
+Most agent queries aren't pure semantic — they're "recent emails with Bob about
+energy". Pass a Qdrant `filter` to restrict the search space *before* vector
+scoring (faster + more precise). Qdrant also supports server-side recency
+boosting via Formula/Score-Boosting if you want decay-by-`date_ts` without an
+app round-trip — ask and we can expose it through `/api/search`.
+
+## Service discovery + health
+
+- `GET /api/endpoints` includes `embeddings` and `qdrant` entries (base_url +
+  ready flag) alongside `vllm`, `parakeet`, `kokoro`.
+- The dashboard shows Embeddings + Qdrant health dots and Start/Restart/Stop
+  controls in the Always-On Services panel.
+- Spark Control will auto-restart a wedged **embedding** container (GPU CUDA
+  wedge recovery, like the audio services) but **never** auto-restarts Qdrant —
+  it holds your only copy of the index, so a restart is surfaced for manual
+  action instead.
+
+## Capacity
+
+At your scale (tens of thousands of chunks now → low hundreds of thousands),
+this is trivial: ~0.6–1.2 GB of dense vectors at 300k chunks, and spark-embed +
+Qdrant together use a few GB of GPU/RAM on Spark 2's 122 GB. Full re-embed of
+300k chunks is ~8–15 min, so re-indexing is cheap if you change models.
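
The dense-vector figure can be reproduced with quick arithmetic. Reading the quoted range as 16-bit versus 32-bit floats is an assumption; the estimate covers raw vectors only and ignores Qdrant's index and payload overhead.

```python
def dense_storage_gb(num_chunks: int, dims: int = 1024, bytes_per_value: int = 4) -> float:
    """Raw size of the dense vectors in decimal gigabytes."""
    return num_chunks * dims * bytes_per_value / 1e9


fp32_gb = dense_storage_gb(300_000)                     # 32-bit floats
fp16_gb = dense_storage_gb(300_000, bytes_per_value=2)  # 16-bit floats
print(f"fp32: {fp32_gb:.2f} GB, fp16: {fp16_gb:.2f} GB")
```

At 300k chunks this comes to roughly 1.23 GB at 32-bit and 0.61 GB at 16-bit.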
+
+## Model upgrade path
+
+If dense recall becomes the bottleneck, `Qwen3-Embedding-4B` (Matryoshka-trained,
+tops 2026 MTEB) is the A/B candidate — same `/v1/embeddings` contract, swap the
+model in spark-embed. bge-m3 is the lower-risk starting point and ships now.
@@ -0,0 +1,105 @@
+# Redaction Gateway — `/scrub` + `/rehydrate` (Spark Control v0.16.0)
+
+The privacy boundary between sovereign LP data and the Claude API, living at the
+same trusted Spark Control host as `/v1/chat/completions`, `/v1/embeddings`,
+`/v1/rerank`, and `/api/search`. Built to **behavioral parity** with the CRM's
+reference `backend/redaction/scrub.py` — that engine is vendored verbatim into
+Spark Control and its leak test passes here, so `SCRUB_BACKEND=gateway` is a
+drop-in for the in-repo path.
+
+## What it is
+
+- `POST /scrub` — de-identify an agent's assembled context. Returns placeholder-only
+  text (the agent forwards that to Claude) plus an opaque `map_handle`.
+- `POST /rehydrate` — swap the real values back into Claude's placeholder-bearing
+  response, locally, for human review.
+
+Spark Control does **not** call Claude. It's the scrub/rehydrate transform pair
+plus a server-held pseudonym map.
+
+## Contract (matches the handover doc)
+
+`POST /scrub`
+```json
+{ "task_id": "...", "actor": "analyst",
+  "items": [{"id": "ctx_1", "text": "..."}],
+  "known_entities": {"persons": [], "orgs": [], "funds": [], "emails": [], "locations": []},
+  "tier1_action": "drop",            // or "reject" (fail-closed 422 on any Tier-1)
+  "bucket": {"amounts": false, "dates": false},
+  "ner": "auto",                     // "auto" | "rules_only" | "qwen"
+  "map_handle": null }               // pass to reuse/extend a task's map (stable tokens)
+```
+→ `200 { task_id, map_handle, items:[{id, scrubbed_text, tokens_used}], stats:{tier1_dropped, tier2_tokenized, distinct_entities, descriptive_flags:[{item, span, action}]}, expires_at }`
+- `422 {"error":"tier1_detected","spans":[{item, kinds}]}` when `tier1_action="reject"` and Tier-1 found (kinds only — never the raw value).
+- `422 {"error":"ner_unavailable", ...}` when `ner=auto|qwen` and the local Qwen is unreachable / no model loaded — **fail-closed, emits nothing**.
+- `400` on malformed input.
+
+`POST /rehydrate`
+```json
+{ "task_id": "...", "map_handle": "...", "items": [{"id": "out_1", "text": "...[PERSON_1]..."}],
+  "actor": "analyst", "strict": true }
+```
+→ `200 { items:[{id, rehydrated_text}], stats:{tokens_substituted, unknown_tokens} }`
+- `409 {"error":"unknown_tokens","tokens":[...]}` when `strict` and a token has no map entry (your tripwire for a Claude-hallucinated/smuggled token).
+- `410 {"error":"map_expired"}` if the map TTL lapsed or the handle is unknown.
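
To make the `strict` tripwire concrete, here is a local illustration of the rehydrate semantics. It is not the gateway's implementation: the token pattern is inferred from the `[PERSON_1]` / `[ORG_1]` examples, and reporting `unknown_tokens` as a count is an assumption.

```python
import re
from typing import Dict, List, Tuple

# Assumed token shape, e.g. [PERSON_1] or [ORG_1]; "[redacted]" does not match.
TOKEN_RE = re.compile(r"\[[A-Z]+_\d+\]")


class UnknownTokens(Exception):
    """Raised in strict mode; stands in for the gateway's 409 response."""

    def __init__(self, tokens: List[str]):
        super().__init__(f"unknown tokens: {tokens}")
        self.tokens = tokens


def rehydrate(text: str, token_map: Dict[str, str], strict: bool = True) -> Tuple[str, dict]:
    """Swap mapped placeholders back to real values; refuse unmapped ones if strict."""
    unknown = sorted({t for t in TOKEN_RE.findall(text) if t not in token_map})
    if strict and unknown:
        raise UnknownTokens(unknown)
    substituted = 0

    def swap(match: "re.Match[str]") -> str:
        nonlocal substituted
        token = match.group(0)
        if token in token_map:
            substituted += 1
            return token_map[token]
        return token  # non-strict mode: leave unmapped tokens untouched

    out = TOKEN_RE.sub(swap, text)
    return out, {"tokens_substituted": substituted, "unknown_tokens": len(unknown)}
```

A token Claude invented never appears in the map, so strict mode stops before any text is substituted.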
+
+## The dictionary is caller-supplied — and treated as sensitive
+
+You supply `known_entities` (built by your `build_known_entities`, scoped to the LP
+in play) in each `/scrub` call. Spark Control never reads your CRM, which keeps the
+package portable and means it needs no CRM credentials. The gateway treats your dictionary
+as a slice of the LP list: used transiently for the scrub, **never persisted beyond
+the resulting tokens, never logged, never echoed**. Only the resulting
+`{token → real_value}` map is held server-side.
+
+## NER backstop is load-bearing, not optional
+
+The dictionary is the deterministic floor; the local-Qwen NER pass catches the
+unknown names it can't know (new prospects, an advisor named in passing) and flags
+**descriptive re-identifiers** ("the family that sold the mining company in Texas" →
+redacted). Under `ner=auto` (default) or `ner=qwen`, if the local Qwen is unreachable
+or no model is loaded, `/scrub` **fails closed (422)** rather than passing name-blind
+text to Claude. `ner=rules_only` is the explicit, knowing opt-out — never the silent
+fallback. The NER uses the same local Qwen at `/v1/chat/completions`; the sensitive
+text never reaches a remote model.
+
+> Verified live against Qwen3.6: an unknown "Sarah Kim from Atlas Ventures" → `[PERSON_1] from [ORG_1]`; a descriptive re-identifier → `[redacted]` + flagged.
+
+## Map-stays-local
+
+The pseudonym map (the de-anonymization key) is held only on this box, keyed by
+`map_handle`, in a TTL-swept local store on the StartOS `/data` volume (default 2h;
+survives a Spark Control restart mid-review). Never returned in full, never logged,
+never in a Claude-bound payload. `REDACTION_MAP_TTL` and `REDACTION_MAP_DB` are
+configurable via env if you want a different TTL/path.
+
+## Logging stays on your side
+
+`/scrub` and `/rehydrate` return counts-only `stats`; **your app writes the
+`interaction_log` row** (you already have `log_scrub`/`log_rehydrate`). Spark Control
+does not write to your DB and keeps no audit log of its own that contains real values.
+The `descriptive_flags` span text is in the `/scrub` *response* (to you, the local
+caller) — strip it before you persist, per your own logging rule (payload = counts only).
+
+## Acceptance — what passed
+
+1. **Parity** — the reference leak fixtures run through the live `/scrub` endpoint: every Tier-1 + Tier-2 identifier absent from the response; substance survives verbatim.
+2. **Map-leak** — no real value (incl. Tier-1) in any response body; Tier-1 values absent from the server map entirely.
+3. **Round-trip** — `/rehydrate` via the server-held map reproduces the original (Tier-1 → `[redacted]`, the only lossy part).
+4. **Handle reuse** — same entity → same token across items and across calls reusing `map_handle` (cache-stable for Claude prompt caching).
+5. **Tripwires** — 409 on a strict unmapped token; 410 on expired/unknown handle; 422 fail-closed on `tier1_action=reject`.
+6. **Live NER** — unknown names tokenized + descriptive re-identifier redacted against the real local Qwen.
+
+## Cutover
+
+Point your `SCRUB_BACKEND=gateway` client at `https://<spark-control-host>/scrub` and
+`/rehydrate` (same TLS-skip / Root-CA story as the other endpoints). The request/
+response shapes match your in-repo module, so agents cut over with no app changes.
+
+## Honest caveat (unchanged from your design)
+
+The NER pass is the probabilistic layer — it will not catch every free-text or
+descriptive re-identifier. The strong defenses remain: **minimize-first** (does Claude
+need the record content at all?), the deterministic dictionary + rules, and the
+re-identification spot-check. Treat the gateway as the enforcement *point*, not a
+guarantee that any text is safe to send.