Files

T

Keysat c7ce44d963 Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server

Workstream A–C substrate for the Ten31 agentic system:
- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links,
  interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db,
  drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity
queries). Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-05 08:13:35 -05:00

8.6 KiB

Raw Blame History

Redaction / Re-hydration Boundary

Design note for the privacy boundary between Ten31's sovereign data and the Claude API. Written in Phase 0 so it is a designed step, not an afterthought. It is BUILT in Phase 2/3 (Analyst, Closer) — Phase 0 does not need it (see §1).

Related: CLAUDE.md guardrails #1 (sovereignty), #4 (human-in-the-loop outbound), #9 (dev-time data handling); docs/Ten31_Agentic_Build_Plan.md §4.5.

1. When this applies (and when it doesn't)

The system has two very different data paths, and only one of them needs this:

Local-only paths — no redaction needed, because Claude is never in the loop. All of Phase 0 is here: ingest, chunking, embeddings (bge-m3), the vector index (Qdrant), and entity resolution (local Qwen). Sensitive data flows CRM → Sparks → Qdrant, all on Ten31 infrastructure. Nothing reaches Anthropic, so there is nothing to scrub. Do not add redaction overhead to the ingest/retrieval path.
Claude-facing reasoning steps — this boundary applies. When an agent asks Claude to reason over LP-specific content: Analyst (building a dossier from retrieved chunks), Closer (drafting outreach/nurture/meeting prep), and any Orchestrator step that forwards record content. These send sensitive context to a third-party API and are the reason this boundary exists.

The guiding rule (guardrail #1) is "only the minimum necessary, non-sensitive context per call." Redaction/re-hydration is how we honor that rule when the task genuinely needs record content — as opposed to simply sending less.

2. The three-tier data classification

Before any agent calls Claude, classify each piece of context:

Tier	Examples	Treatment
Never send	Full LP list/export, bulk relationship graph, raw account numbers, wire details, SSNs/passport, anything covered by a confidentiality obligation	Stays on Ten31 infra. Not even tokenized — just excluded.
Tokenize (pseudonymize)	Person names, org/fund names, emails, phone, physical addresses, exact $ amounts, dates that pin identity	Replaced with stable placeholders before the call; real values swapped back locally after.
Send as-is	The substance an agent needs Claude to reason about: thesis discussion, sentiment, objections, generic deal mechanics, the drafted message body (minus identifiers)	Sent in the de-identified prompt.

The art is the middle tier: keep enough semantic content for Claude to be useful, while every identifier is a placeholder.

3. The round-trip

  ┌─────────────────────────── Ten31 infrastructure (sovereign) ───────────────────────────┐
  │                                                                                          │
  │  1. SCRUB (local model on the Sparks, via Spark Control)                                  │
  │     - Pull the minimal context the task needs (retrieved chunks + record fields).         │
  │     - NER + rule pass replaces Tier-2 identifiers with stable tokens:                      │
  │         "Jonathan Reyes" -> [PERSON_1]   "Cedar Point Capital" -> [ORG_1]                  │
  │         "jon@cedarpoint..." -> [EMAIL_1]  "$5,000,000" -> [AMOUNT_1]   "Fund III" -> [FUND_1]│
  │     - Tokens are STABLE within a task (same entity -> same token) and CONSISTENT across    │
  │       all chunks in the call, so Claude can reason about relationships.                    │
  │     - The pseudonym map { [PERSON_1] -> "Jonathan Reyes", ... } is held LOCALLY, keyed to  │
  │       the task/session. It never leaves the box.                                           │
  │     - Drop Tier-1 content entirely. Log the scrub to the interaction_log.                  │
  │                                                                                          │
  └──────────────────────────────────────────┬───────────────────────────────────────────────┘
                                              │  de-identified prompt (placeholders only)
                                              ▼
                         2. REASON  —  Claude API (Agent SDK)
                            Drafts / synthesizes using [PERSON_1], [ORG_1], [FUND_1] ...
                                              │  response referencing the same placeholders
  ┌──────────────────────────────────────────┴───────────────────────────────────────────────┐
  │                                                                                          │
  │  3. RE-HYDRATE (local)                                                                     │
  │     - Substitute real values back in using the local pseudonym map.                        │
  │     - A human reviews the re-hydrated draft (guardrail #4) before anything is sent.        │
  │     - Log the rehydrate + the human decision to the interaction_log.                       │
  │                                                                                          │
  └──────────────────────────────────────────────────────────────────────────────────────────┘

4. Where it runs

The natural home is Spark Control (the gateway that already fronts the local models): add a scrub endpoint and a rehydrate endpoint, or run the pair as middleware inside the Agent SDK tool loop so every outbound Claude call passes through it. Keeping it at the gateway means:

One enforcement point — agents can't accidentally bypass it.
The pseudonym map lives next to the local models, never in agent memory that might get logged.
The scrub uses the same local Qwen already used for entity resolution, so the NER is consistent with how entities were canonicalized at ingest (reuse the canonical_entities → token mapping).

5. Caveats (why this is a tool, not a magic switch)

Free-text leakage is the hard part. A note that says "the family that sold the mining company in Texas last year" re-identifies even with the name tokenized. The scrub model must catch descriptive identifiers, not just named entities — and it will not be perfect. For high-sensitivity tasks, prefer sending less (summary/thesis only) over trusting the scrub to catch everything.
Re-identification by inference. Enough tokenized-but-specific detail (amounts + dates + sector) can still single out a person. Keep Tier-2 amounts/dates bucketed ("~$5M", "Q1") when the exact value isn't needed.
Map integrity. The pseudonym map is sensitive (it's the de-anonymization key) — keep it local, in memory or short-lived, never logged to a third party, never sent in a prompt.
It does not replace minimization. First ask "does Claude need this record content at all?" Often a retrieval summary suffices. Redaction is for when the answer is genuinely yes.
Consistency with retrieval. Retrieval itself is already local, so chunks come back with real values; the scrub is applied at the prompt-assembly step, not at ingest.

6. Verification (when we build it)

A test harness that asserts no Tier-1 string and no real Tier-2 identifier appears in any outbound payload (golden-file diff over recorded prompts).
A re-identification spot-check: have the local model attempt to re-identify entities from the de-identified prompt alone; flag anything it gets right.
Every scrub/rehydrate logged to interaction_log (actor, task, token-count, what tier was dropped) for audit (guardrail #5).

7. Open questions (resolve at build time, Phase 2/3)

Token granularity — per-task ephemeral maps, or a stable per-entity token space reused across tasks (better for Claude's cross-call memory, worse for re-identification risk)?
Do we tokenize the drafted outbound message itself (Closer) and re-hydrate, or draft against placeholders and let the human fill specifics? (Affects how much the human edits.)
Bucketing policy for amounts/dates — what precision is "non-sensitive enough"?
Where exactly in the Agent SDK loop the middleware sits, and how it composes with prompt caching (placeholders must be cache-stable).

8.6 KiB Raw Blame History