# Redaction Gateway — `/scrub` + `/rehydrate` (Spark Control v0.16.0) The privacy boundary between sovereign LP data and the Claude API, living at the same trusted Spark Control host as `/v1/chat/completions`, `/v1/embeddings`, `/v1/rerank`, and `/api/search`. Built to **behavioral parity** with the CRM's reference `backend/redaction/scrub.py` — that engine is vendored verbatim into Spark Control and its leak test passes here, so `SCRUB_BACKEND=gateway` is a drop-in for the in-repo path. ## What it is - `POST /scrub` — de-identify an agent's assembled context. Returns placeholder-only text (the agent forwards that to Claude) plus an opaque `map_handle`. - `POST /rehydrate` — swap the real values back into Claude's placeholder-bearing response, locally, for human review. Spark Control does **not** call Claude. It's the scrub/rehydrate transform pair plus a server-held pseudonym map. ## Contract (matches the handover doc) `POST /scrub` ```json { "task_id": "...", "actor": "analyst", "items": [{"id": "ctx_1", "text": "..."}], "known_entities": {"persons": [], "orgs": [], "funds": [], "emails": [], "locations": []}, "tier1_action": "drop", // or "reject" (fail-closed 422 on any Tier-1) "bucket": {"amounts": false, "dates": false}, "ner": "auto", // "auto" | "rules_only" | "qwen" "map_handle": null } // pass to reuse/extend a task's map (stable tokens) ``` → `200 { task_id, map_handle, items:[{id, scrubbed_text, tokens_used}], stats:{tier1_dropped, tier2_tokenized, distinct_entities, descriptive_flags:[{item, span, action}]}, expires_at }` - `422 {"error":"tier1_detected","spans":[{item, kinds}]}` when `tier1_action="reject"` and Tier-1 found (kinds only — never the raw value). - `422 {"error":"ner_unavailable", ...}` when `ner=auto|qwen` and the local Qwen is unreachable / no model loaded — **fail-closed, emits nothing**. - `400` on malformed input. `POST /rehydrate` ```json { "task_id": "...", "map_handle": "...", "items": [{"id": "out_1", "text": "...[PERSON_1]..."}], "actor": "analyst", "strict": true } ``` → `200 { items:[{id, rehydrated_text}], stats:{tokens_substituted, unknown_tokens} }` - `409 {"error":"unknown_tokens","tokens":[...]}` when `strict` and a token has no map entry (your tripwire for a Claude-hallucinated/smuggled token). - `410 {"error":"map_expired"}` if the map TTL lapsed or the handle is unknown. ## The dictionary is caller-supplied — and treated as sensitive You supply `known_entities` (built by your `build_known_entities`, scoped to the LP in play) in each `/scrub` call. Spark Control never reads your CRM — keeps the package portable and needs no CRM credentials. The gateway treats your dictionary as a slice of the LP list: used transiently for the scrub, **never persisted beyond the resulting tokens, never logged, never echoed**. Only the resulting `{token → real_value}` map is held server-side. ## NER backstop is load-bearing, not optional The dictionary is the deterministic floor; the local-Qwen NER pass catches the unknown names it can't know (new prospects, an advisor named in passing) and flags **descriptive re-identifiers** ("the family that sold the mining company in Texas" → redacted). Under `ner=auto` (default) or `ner=qwen`, if the local Qwen is unreachable or no model is loaded, `/scrub` **fails closed (422)** rather than passing name-blind text to Claude. `ner=rules_only` is the explicit, knowing opt-out — never the silent fallback. The NER uses the same local Qwen at `/v1/chat/completions`; the sensitive text never reaches a remote model. > Verified live against Qwen3.6: an unknown "Sarah Kim from Atlas Ventures" → `[PERSON_1] from [ORG_1]`; a descriptive re-identifier → `[redacted]` + flagged. ## Map-stays-local The pseudonym map (the de-anonymization key) is held only on this box, keyed by `map_handle`, in a TTL-swept local store on the StartOS `/data` volume (default 2h; survives a Spark Control restart mid-review). Never returned in full, never logged, never in a Claude-bound payload. `REDACTION_MAP_TTL` and `REDACTION_MAP_DB` are configurable via env if you want a different TTL/path. ## Logging stays on your side `/scrub` and `/rehydrate` return counts-only `stats`; **your app writes the `interaction_log` row** (you already have `log_scrub`/`log_rehydrate`). Spark Control does not write to your DB and keeps no audit log of its own that contains real values. The `descriptive_flags` span text is in the `/scrub` *response* (to you, the local caller) — strip it before you persist, per your own logging rule (payload = counts only). ## Acceptance — what passed 1. **Parity** — the reference leak fixtures run through the live `/scrub` endpoint: every Tier-1 + Tier-2 identifier absent from the response; substance survives verbatim. 2. **Map-leak** — no real value (incl. Tier-1) in any response body; Tier-1 values absent from the server map entirely. 3. **Round-trip** — `/rehydrate` via the server-held map reproduces the original (Tier-1 → `[redacted]`, the only lossy part). 4. **Handle reuse** — same entity → same token across items and across calls reusing `map_handle` (cache-stable for Claude prompt caching). 5. **Tripwires** — 409 on a strict unmapped token; 410 on expired/unknown handle; 422 fail-closed on `tier1_action=reject`. 6. **Live NER** — unknown names tokenized + descriptive re-identifier redacted against the real local Qwen. ## Cutover Point your `SCRUB_BACKEND=gateway` client at `https:///scrub` and `/rehydrate` (same TLS-skip / Root-CA story as the other endpoints). The request/ response shapes match your in-repo module, so agents cut over with no app changes. ## Honest caveat (unchanged from your design) The NER pass is the probabilistic layer — it will not catch every free-text or descriptive re-identifier. The strong defenses remain: **minimize-first** (does Claude need the record content at all?), the deterministic dictionary + rules, and the re-identification spot-check. Treat the gateway as the enforcement *point*, not a guarantee that any text is safe to send.