# Redaction / Re-hydration Boundary *Design note for the privacy boundary between Ten31's sovereign data and the Claude API. Written in Phase 0 so it is a designed step, not an afterthought. **It is BUILT in Phase 2/3** (Analyst, Closer) — Phase 0 does not need it (see §1).* Related: `CLAUDE.md` guardrails #1 (sovereignty), #4 (human-in-the-loop outbound), #9 (dev-time data handling); `docs/Ten31_Agentic_Build_Plan.md` §4.5. --- ## 1. When this applies (and when it doesn't) The system has two very different data paths, and only one of them needs this: - **Local-only paths — no redaction needed, because Claude is never in the loop.** All of Phase 0 is here: ingest, chunking, embeddings (bge-m3), the vector index (Qdrant), and entity resolution (local Qwen). Sensitive data flows `CRM → Sparks → Qdrant`, all on Ten31 infrastructure. Nothing reaches Anthropic, so there is nothing to scrub. **Do not add redaction overhead to the ingest/retrieval path.** - **Claude-facing reasoning steps — this boundary applies.** When an agent asks Claude to *reason over* LP-specific content: **Analyst** (building a dossier from retrieved chunks), **Closer** (drafting outreach/nurture/meeting prep), and any Orchestrator step that forwards record content. These send sensitive context to a third-party API and are the reason this boundary exists. The guiding rule (guardrail #1) is *"only the minimum necessary, non-sensitive context per call."* Redaction/re-hydration is how we honor that rule **when the task genuinely needs record content** — as opposed to simply sending less. ## 2. The three-tier data classification Before any agent calls Claude, classify each piece of context: | Tier | Examples | Treatment | |---|---|---| | **Never send** | Full LP list/export, bulk relationship graph, raw account numbers, wire details, SSNs/passport, anything covered by a confidentiality obligation | Stays on Ten31 infra. Not even tokenized — just excluded. | | **Tokenize (pseudonymize)** | Person names, org/fund names, emails, phone, physical addresses, exact $ amounts, dates that pin identity | Replaced with stable placeholders before the call; real values swapped back locally after. | | **Send as-is** | The *substance* an agent needs Claude to reason about: thesis discussion, sentiment, objections, generic deal mechanics, the drafted message body (minus identifiers) | Sent in the de-identified prompt. | The art is the middle tier: keep enough semantic content for Claude to be useful, while every *identifier* is a placeholder. ## 3. The round-trip ``` ┌─────────────────────────── Ten31 infrastructure (sovereign) ───────────────────────────┐ │ │ │ 1. SCRUB (local model on the Sparks, via Spark Control) │ │ - Pull the minimal context the task needs (retrieved chunks + record fields). │ │ - NER + rule pass replaces Tier-2 identifiers with stable tokens: │ │ "Jonathan Reyes" -> [PERSON_1] "Cedar Point Capital" -> [ORG_1] │ │ "jon@cedarpoint..." -> [EMAIL_1] "$5,000,000" -> [AMOUNT_1] "Fund III" -> [FUND_1]│ │ - Tokens are STABLE within a task (same entity -> same token) and CONSISTENT across │ │ all chunks in the call, so Claude can reason about relationships. │ │ - The pseudonym map { [PERSON_1] -> "Jonathan Reyes", ... } is held LOCALLY, keyed to │ │ the task/session. It never leaves the box. │ │ - Drop Tier-1 content entirely. Log the scrub to the interaction_log. │ │ │ └──────────────────────────────────────────┬───────────────────────────────────────────────┘ │ de-identified prompt (placeholders only) ▼ 2. REASON — Claude API (Agent SDK) Drafts / synthesizes using [PERSON_1], [ORG_1], [FUND_1] ... │ response referencing the same placeholders ┌──────────────────────────────────────────┴───────────────────────────────────────────────┐ │ │ │ 3. RE-HYDRATE (local) │ │ - Substitute real values back in using the local pseudonym map. │ │ - A human reviews the re-hydrated draft (guardrail #4) before anything is sent. │ │ - Log the rehydrate + the human decision to the interaction_log. │ │ │ └──────────────────────────────────────────────────────────────────────────────────────────┘ ``` ## 4. Where it runs The natural home is **Spark Control** (the gateway that already fronts the local models): add a scrub endpoint and a rehydrate endpoint, or run the pair as middleware inside the Agent SDK tool loop so every outbound Claude call passes through it. Keeping it at the gateway means: - One enforcement point — agents can't accidentally bypass it. - The pseudonym map lives next to the local models, never in agent memory that might get logged. - The scrub uses the same local Qwen already used for entity resolution, so the NER is consistent with how entities were canonicalized at ingest (reuse the `canonical_entities` → token mapping). ## 5. Caveats (why this is a tool, not a magic switch) - **Free-text leakage is the hard part.** A note that says *"the family that sold the mining company in Texas last year"* re-identifies even with the name tokenized. The scrub model must catch *descriptive* identifiers, not just named entities — and it will not be perfect. For high-sensitivity tasks, prefer sending *less* (summary/thesis only) over trusting the scrub to catch everything. - **Re-identification by inference.** Enough tokenized-but-specific detail (amounts + dates + sector) can still single out a person. Keep Tier-2 amounts/dates *bucketed* ("~$5M", "Q1") when the exact value isn't needed. - **Map integrity.** The pseudonym map is sensitive (it's the de-anonymization key) — keep it local, in memory or short-lived, never logged to a third party, never sent in a prompt. - **It does not replace minimization.** First ask "does Claude need this record content at all?" Often a retrieval summary suffices. Redaction is for when the answer is genuinely yes. - **Consistency with retrieval.** Retrieval itself is already local, so chunks come back with real values; the scrub is applied at the *prompt-assembly* step, not at ingest. ## 6. Verification (when we build it) - A test harness that asserts no Tier-1 string and no real Tier-2 identifier appears in any outbound payload (golden-file diff over recorded prompts). - A re-identification spot-check: have the local model attempt to re-identify entities from the de-identified prompt alone; flag anything it gets right. - Every scrub/rehydrate logged to `interaction_log` (actor, task, token-count, what tier was dropped) for audit (guardrail #5). ## 7. Open questions (resolve at build time, Phase 2/3) 1. Token granularity — per-task ephemeral maps, or a stable per-entity token space reused across tasks (better for Claude's cross-call memory, worse for re-identification risk)? 2. Do we tokenize the *drafted outbound message itself* (Closer) and re-hydrate, or draft against placeholders and let the human fill specifics? (Affects how much the human edits.) 3. Bucketing policy for amounts/dates — what precision is "non-sensitive enough"? 4. Where exactly in the Agent SDK loop the middleware sits, and how it composes with prompt caching (placeholders must be cache-stable).