Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server

Workstream A–C substrate for the Ten31 agentic system:

- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links, interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes); no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db, drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity queries). Real backfill runs on Ten31 infra; index holds synthetic data only. Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,202 @@
# Embeddings + Retrieval through Spark Control (v0.15.0)

Spark Control now fronts a local RAG stack so your agent/CRM system can do
dense embeddings, reranking, and hybrid vector search against one trusted host,
with the same TLS cert and allowlist as the LLM and audio endpoints.

## What runs where

| Component | Host | Port | Role |
|---|---|---|---|
| **spark-embed** | Spark 2 (GPU) | 8088 | `BAAI/bge-m3` dense embeddings (1024-d) + `BAAI/bge-reranker-v2-m3` cross-encoder rerank |
| **Qdrant** | Spark 2 (CPU) | 6333/6334 | Vector storage, hybrid dense+sparse retrieval, RRF fusion, payload filtering |
| **Spark Control** | Start9 | (your LAN URL) | Proxies all of the above behind one host |

`spark-embed` is a small FastAPI server built **from the NGC PyTorch image**
(the torch we've proven runs on the GB10's sm_121 GPU). We did *not* use HF Text
Embeddings Inference because as of 2026 it ships no arm64 CUDA image. No
torchaudio or flash-attn is needed: bge-m3 and the reranker are plain XLM-RoBERTa
encoders.

## Endpoints (all on the Spark Control host)

### `POST /v1/embeddings`: dense embeddings (OpenAI-compatible)

```json
// request
{ "input": "text to embed", "model": "BAAI/bge-m3" }
// or { "input": ["batch", "of", "texts"] }

// response (OpenAI shape)
{
  "object": "list",
  "data": [ { "object": "embedding", "index": 0, "embedding": [0.01, ...1024 floats] } ],
  "model": "BAAI/bge-m3",
  "usage": { "prompt_tokens": 0, "total_tokens": 0 }
}
```

Vectors are L2-normalized by default (cosine == dot product). Works with the
stock OpenAI Python/JS client by pointing `base_url` at Spark Control.
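
The note that cosine equals the dot product for L2-normalized vectors can be checked with a few lines of standard-library Python. This is a toy sketch on 3-d vectors rather than the service's 1024-d output, and the helper names are illustrative only.

```python
import math

def l2_normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for two embeddings.
a = l2_normalize([0.3, -1.2, 0.5])
b = l2_normalize([0.1, -0.9, 0.8])

# For unit vectors both norms are 1, so the two scores agree.
assert math.isclose(dot(a, b), cosine(a, b), rel_tol=1e-9)
```

In practice this means a client can rank by plain dot product without re-normalizing, as long as the default normalization is left on.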

### `POST /v1/rerank`: cross-encoder rerank

```json
// request
{ "query": "did Brightwater commit?", "documents": ["chunk a", "chunk b", ...],
  "top_n": 5, "return_documents": false }

// response (Cohere-ish)
{ "object": "rerank.result", "model": "BAAI/bge-reranker-v2-m3",
  "results": [ { "index": 3, "relevance_score": 5.21 }, { "index": 0, "relevance_score": 1.04 }, ... ] }
```

`relevance_score` is the reranker's raw logit (higher = more relevant; unbounded,
roughly −10..+10). Sort descending. Send a candidate set (≤200), not your whole corpus.
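
Since `return_documents` is false in the request above, the client has to map each `index` back to its own document list. A minimal sketch of that step follows. The function name is hypothetical, and the sigmoid is only a display convenience assumed here, not something the endpoint applies.

```python
import math

def order_reranked(documents, results, top_n=None):
    """Sort rerank results by raw logit (descending) and attach the source text."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    return [
        {
            "index": r["index"],
            "relevance_score": r["relevance_score"],
            # Optional: squash the unbounded logit into (0, 1) for display only.
            "probability_like": 1.0 / (1.0 + math.exp(-r["relevance_score"])),
            "document": documents[r["index"]],
        }
        for r in ranked
    ]

documents = ["chunk a", "chunk b", "chunk c", "chunk d"]
results = [
    {"index": 0, "relevance_score": 1.04},
    {"index": 3, "relevance_score": 5.21},
]
top = order_reranked(documents, results, top_n=1)
```

Ranking should always use the raw logit; the squashed value is monotonic, so it preserves order but compresses differences at the extremes.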

### `POST /api/search`: orchestrated hybrid retrieval

One call: embeds the query (dense), retrieves from Qdrant (hybrid dense+sparse
with RRF **when you supply a sparse vector**, else dense-only), optionally
cross-encoder reranks, returns top_k.

```json
// request
{
  "query": "Did Brightwater commit to the Fund III close in Q1?",
  "collection": "crm_chunks",
  "top_k": 8,
  "retrieve_n": 80,                 // first-stage candidates (default max(50, top_k*10))
  "sparse": { "indices": [12, 904], "values": [0.7, 1.2] },  // optional BM25 vector for hybrid
  "fusion": "rrf",                  // or "dbsf"
  "filter": {                       // raw Qdrant filter (pre-filter, see below)
    "must": [ { "key": "lp_id", "match": { "value": "lp_0427" } } ]
  },
  "rerank": true,
  "text_field": "text",             // payload field holding the chunk text
  "with_payload": true
}

// response
{
  "object": "search.result_list",
  "model": "BAAI/bge-m3+bge-reranker-v2-m3",
  "query": "...",
  "collection": "crm_chunks",
  "reranked": true,
  "data": [
    { "object": "search.result", "index": 0, "id": "chunk_99c1_3",
      "score": 5.21, "fused_score": 0.41, "rerank_score": 5.21,
      "text": "...Brightwater confirmed verbal commitment...",
      "payload": { "lp_id": "lp_0427", "doc_type": "email", "date_ts": 1771027200, ... } }
  ],
  "usage": { "embed_ms": 21, "qdrant_ms": 9, "rerank_ms": 140, "candidates": 80 }
}
```

`score` is the rerank score when reranked, else the fused/dense score.
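
A small helper can make the request shape and the documented `retrieve_n` default explicit on the client side. The sketch below only assembles the JSON body from the fields shown above. The function name and its keyword arguments are hypothetical, and it does not perform the HTTP call.

```python
def build_search_request(query, top_k=8, retrieve_n=None, sparse=None,
                         qdrant_filter=None, rerank=True,
                         collection="crm_chunks"):
    """Assemble a JSON-serializable body for POST /api/search."""
    body = {
        "query": query,
        "collection": collection,
        "top_k": top_k,
        # Mirrors the documented default so the value is visible in client logs.
        "retrieve_n": retrieve_n if retrieve_n is not None else max(50, top_k * 10),
        "rerank": rerank,
        "text_field": "text",
        "with_payload": True,
    }
    if sparse is not None:
        # Supplying a sparse vector is what switches retrieval to hybrid.
        body["sparse"] = sparse
        body["fusion"] = "rrf"
    if qdrant_filter is not None:
        body["filter"] = qdrant_filter
    return body

dense_only = build_search_request("Did Brightwater commit?")
hybrid = build_search_request(
    "Did Brightwater commit?",
    sparse={"indices": [12, 904], "values": [0.7, 1.2]},
)
```

With `top_k=8` the default candidate pool works out to 80, matching the example request; for small `top_k` the floor of 50 applies.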

## The sparse (BM25) story: important for entity-heavy data

bge-m3's dense vectors carry semantic meaning, but exact entity matches
(fund names, tickers, people) are a **lexical** signal. For that you want hybrid
dense + sparse. Two design facts:

1. **spark-embed serves dense only.** TEI/Infinity can't emit bge-m3's learned
   sparse either, and bge-m3's pretrained sparse weights underweight novel
   entity tokens (brand-new fund names). So we don't use bge-m3 sparse.
2. **Use Qdrant BM25 with its built-in IDF.** The sparse vectors are generated
   client-side and Qdrant applies IDF server-side. IDF is therefore computed
   over *your* corpus, so novel entity strings get correct weight.
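
To see why corpus-level IDF favours novel entity strings, here is a toy calculation using the common BM25 IDF form `ln((N - n + 0.5) / (n + 0.5) + 1)`. This is an illustrative sketch only: the whitespace tokenization and the three-document corpus are made up, and Qdrant's exact formula and tokenizer should be checked against its documentation.

```python
import math
from collections import Counter

def bm25_idf(corpus_tokens):
    """Return {token: idf} using ln((N - n + 0.5) / (n + 0.5) + 1)."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each token once per document
    return {
        tok: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
        for tok, n in doc_freq.items()
    }

# Tiny synthetic corpus: "brightwater" is rare, "fund" appears everywhere.
corpus = [
    "brightwater confirmed the fund close".split(),
    "the fund update was sent".split(),
    "the close date for the fund moved".split(),
]
idf = bm25_idf(corpus)
```

A token that appears in one document out of three gets a much larger weight than one that appears in all three, which is the behaviour the hybrid leg relies on for exact-name lookups.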

### Your ingest pipeline (the part you own)

For each chunk, produce and upsert **both** vectors:

- **dense**: call Spark Control `POST /v1/embeddings` → 1024-d vector.
- **sparse**: run [FastEmbed](https://github.com/qdrant/fastembed) BM25 client-side:
  ```python
  from fastembed import SparseTextEmbedding

  chunk_text = "..."  # the text of one chunk
  bm25 = SparseTextEmbedding(model_name="Qdrant/bm25")
  sp = next(iter(bm25.embed([chunk_text])))  # one SparseEmbedding per input text
  sparse = {"indices": sp.indices.tolist(), "values": sp.values.tolist()}  # JSON-ready
  ```

Create the collection with a named dense vector and a named sparse vector that
uses `modifier: idf` (so Qdrant applies IDF server-side):

```json
PUT /collections/crm_chunks
{ "vectors": { "dense": { "size": 1024, "distance": "Cosine" } },
  "sparse_vectors": { "sparse": { "modifier": "idf" } } }
```

Add payload indexes for your filter fields so filtered queries stay fast:

```json
PUT /collections/crm_chunks/index  { "field_name": "lp_id",    "field_schema": "keyword" }
PUT /collections/crm_chunks/index  { "field_name": "doc_type", "field_schema": "keyword" }
PUT /collections/crm_chunks/index  { "field_name": "date_ts",  "field_schema": "integer" }
```

Upsert points with both vectors + payload:

```json
PUT /collections/crm_chunks/points
{ "points": [ { "id": 1,
    "vector": { "dense": [...1024...], "sparse": { "indices": [...], "values": [...] } },
    "payload": { "lp_id": "lp_0427", "lp_name": "Brightwater Capital",
                 "doc_type": "email", "date_ts": 1771027200, "text": "..." } } ] }
```
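
Qdrant point IDs must be unsigned integers or UUIDs, so a re-runnable ingest needs a stable way to derive one from each chunk. One option, sketched below, is a UUIDv5 over the source record ID and chunk position, so re-ingesting the same chunk overwrites its point instead of duplicating it. The namespace string, the `source_record_id` format, and the function names are all hypothetical.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works as long as it never changes.
CRM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/crm_chunks")

def point_id(source_record_id, chunk_index):
    """Derive a stable UUID string from the source record and chunk position."""
    return str(uuid.uuid5(CRM_NAMESPACE, f"{source_record_id}:{chunk_index}"))

def build_point(source_record_id, chunk_index, dense, sparse, payload):
    """Shape one point for PUT /collections/crm_chunks/points."""
    return {
        "id": point_id(source_record_id, chunk_index),
        "vector": {"dense": dense, "sparse": sparse},
        "payload": payload,
    }

payload = {"lp_id": "lp_0427", "lp_name": "Brightwater Capital",
           "doc_type": "email", "date_ts": 1771027200, "text": "..."}
sparse = {"indices": [12, 904], "values": [0.7, 1.2]}

# Building the same chunk twice yields the same ID, so the upsert is idempotent.
p1 = build_point("email_99c1", 3, [0.0] * 1024, sparse, payload)
p2 = build_point("email_99c1", 3, [0.0] * 1024, sparse, payload)
```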

### At query time

Generate the query's BM25 sparse vector with the **same** FastEmbed model, and
pass it to `/api/search` as `sparse`. Spark Control fuses dense+sparse with RRF
inside Qdrant, then reranks. If you omit `sparse`, you get dense + rerank
(still good, just no exact-lexical leg).
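
For intuition about the fusion step, Reciprocal Rank Fusion scores each candidate by summing `1 / (k + rank)` over the ranked lists it appears in. The sketch below is a standalone illustration, not Qdrant's implementation. The constant `k=60` is the value commonly cited for RRF and is an assumption here; Qdrant's internal constant may differ.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists: score(id) = sum of 1 / (k + rank), rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

dense_ranking = ["c7", "c2", "c9"]    # semantic neighbours
sparse_ranking = ["c2", "c4", "c7"]   # exact-term matches
fused = rrf_fuse([dense_ranking, sparse_ranking])
```

Candidates that appear in both lists (`c2`, `c7`) rise above those found by only one leg, which is why an exact entity match that is also semantically close tends to land at the top before reranking.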

> You can talk to Qdrant directly (`http://<spark2>:6333`) for collection
> management and upserts; that's the natural home for ingest. `/api/search` is
> the convenience path for the read side so your agents hit one host. If you'd
> rather Spark Control proxy Qdrant admin too, say the word.

## Chunking guidance (entity-heavy CRM)

- One chunk per email / note / transcript-turn; one chunk per memo *section*.
  Don't split mid-thought.
- Keep entity + date as **payload fields** (filterable), not buried in embedded text.
- Time-aware: don't merge a 2022 note with a 2026 update in one chunk; store
  `date_ts` so you can pre-filter and recency-boost.
- Resolve entity-name variants ("J. Smith" / "Jonathan Smith" / "JS") to one
  canonical `lp_id` at ingest, or retrieval fragments across variants.
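
The last bullet can be approximated with a deterministic alias table before any model-based matching is involved. The sketch below is one possible shape, not the pipeline's actual resolver; the class name, the normalization rules, and the ID `lp_0001` are hypothetical.

```python
import re

def normalize_name(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())

class AliasResolver:
    """Maps known name variants to one canonical ID; unknown names return None."""

    def __init__(self):
        self._alias_to_id = {}

    def register(self, canonical_id, *variants):
        for variant in variants:
            self._alias_to_id[normalize_name(variant)] = canonical_id

    def resolve(self, name):
        return self._alias_to_id.get(normalize_name(name))

resolver = AliasResolver()
# Short initials such as "JS" are ambiguous in general; register them only
# when a human has confirmed the mapping.
resolver.register("lp_0001", "Jonathan Smith", "J. Smith", "JS")
```

Returning `None` for unknown names, rather than guessing, leaves room for a review queue or a heavier matching step downstream.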

## Pre-filtering

Most agent queries aren't pure semantic; they're "recent emails with Bob about
energy". Pass a Qdrant `filter` to restrict the search space *before* vector
scoring (faster + more precise). Qdrant also supports server-side recency
boosting via Formula/Score-Boosting if you want decay-by-`date_ts` without an
app round-trip. Ask and we can expose it through `/api/search`.
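
A query like "recent emails for this LP" translates into a `must` list combining keyword matches with a range on `date_ts`. The helper below is a minimal sketch of building that filter. Its name and arguments are hypothetical, and it assumes `date_ts` holds Unix seconds, which is consistent with the example value `1771027200` but not stated outright.

```python
from datetime import datetime, timezone

def build_filter(lp_id=None, doc_type=None, since=None):
    """Build a Qdrant filter dict; `since` is a timezone-aware datetime."""
    must = []
    if lp_id is not None:
        must.append({"key": "lp_id", "match": {"value": lp_id}})
    if doc_type is not None:
        must.append({"key": "doc_type", "match": {"value": doc_type}})
    if since is not None:
        # Assumes date_ts is integer Unix seconds, so compare on the same scale.
        must.append({"key": "date_ts", "range": {"gte": int(since.timestamp())}})
    return {"must": must}

recent_emails = build_filter(
    lp_id="lp_0427",
    doc_type="email",
    since=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
```

The resulting dict can be passed as the `filter` field of an `/api/search` request; all three fields are covered by the payload indexes created at ingest.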

## Service discovery + health

- `GET /api/endpoints` includes `embeddings` and `qdrant` entries (base_url +
  ready flag) alongside `vllm`, `parakeet`, `kokoro`.
- The dashboard shows Embeddings + Qdrant health dots and Start/Restart/Stop
  controls in the Always-On Services panel.
- Spark Control will auto-restart a wedged **embedding** container (GPU CUDA
  wedge recovery, like the audio services) but **never** auto-restarts Qdrant.
  It holds your only copy of the index, so a restart is surfaced for manual
  action instead.

## Capacity

At your scale (tens of thousands of chunks now → low hundreds of thousands),
this is trivial: ~0.6–1.2 GB of dense vectors at 300k chunks, and spark-embed +
Qdrant together use a few GB of GPU/RAM on Spark 2's 122 GB. Full re-embed of
300k chunks is ~8–15 min, so re-indexing is cheap if you change models.
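
The storage range is consistent with 1024-dimensional vectors stored at 2 bytes (fp16) to 4 bytes (fp32) per value. That reading is an inference from the numbers, not a statement of how they were derived. The arithmetic, ignoring index and payload overhead:

```python
def dense_vector_bytes(n_chunks, dim=1024, bytes_per_value=4):
    """Raw storage for dense vectors, ignoring index and payload overhead."""
    return n_chunks * dim * bytes_per_value

GB = 1000 ** 3
fp16_gb = dense_vector_bytes(300_000, bytes_per_value=2) / GB
fp32_gb = dense_vector_bytes(300_000, bytes_per_value=4) / GB
print(f"fp16: {fp16_gb:.2f} GB, fp32: {fp32_gb:.2f} GB")  # fp16: 0.61 GB, fp32: 1.23 GB

# Throughput implied by a full re-embed in 8 to 15 minutes.
fast = 300_000 / (8 * 60)    # chunks per second at the 8-minute end
slow = 300_000 / (15 * 60)   # chunks per second at the 15-minute end
```

The re-embed estimate works out to roughly 330 to 625 chunks per second, which is a useful sanity check when timing a real backfill.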

## Model upgrade path

If dense recall becomes the bottleneck, `Qwen3-Embedding-4B` (Matryoshka-trained,
tops 2026 MTEB) is the A/B candidate: same `/v1/embeddings` contract, swap the
model in spark-embed. bge-m3 is the lower-risk starting point and ships now.
@@ -0,0 +1,71 @@
# Phase 0: Foundation Kickoff Brief

**Goal:** stand up the data + retrieval substrate that every later agent depends on. By the end of Phase 0 we can ingest the CRM into a searchable index and query it through a CRM MCP server using multiple retrieval modes. **No outward-facing agent behavior in this phase.**

See `../CLAUDE.md` for settled architecture and non-negotiable guardrails. This brief assumes them.

## Prerequisites / inputs needed before building

- [ ] CRM source code present in this repo.
- [ ] `docs/EMBEDDINGS.md` read. It is the authoritative embedding/retrieval contract and ingest recipe. This brief defers to it for all vector mechanics.
- [ ] `.env` populated from `.env.example`: `ANTHROPIC_API_KEY`, `SPARK_CONTROL_URL` (gateway), `QDRANT_URL` (direct Qdrant on Spark 2 for ingest), `X_API_KEY`, CRM connection vars.
- [ ] Spark Control endpoints live (shipped in v0.15.0): `/v1/embeddings`, `/v1/rerank`, `/api/search`. Verify with `GET /api/endpoints`.
- [ ] `fastembed` installed in the ingest environment (client-side BM25 sparse vectors, model `Qdrant/bm25`).
- [ ] A small set of real example records + documents (a few LPs, a memo, a transcript) to test ingest and retrieval against.

## Workstream A: Understand the CRM, then extend the schema

1. Read the CRM code. Write `docs/crm-overview.md`: storage engine and where it runs, the data model (records, custom fields, and especially how notes/interactions/correspondence are stored), the API/route surface, and auth. Fill in the Conventions section and CRM env vars in `CLAUDE.md`.
2. Design and apply additive, reversible schema extensions:
   - LP/prospect fields: thesis fit, segment, accreditation/QP status, warmth score, source, owner, last-touch.
   - **Interaction log** table: append-only record of every agent action and every human touch (actor, action, target, timestamp, payload).
   - **Relationship graph** table: edges between people/entities with source + strength (seeded later from email/calendar + X follower overlap).
   - **Canonical entity IDs**: a canonical record each name variant resolves to (see Workstream B3).

*Guardrail: additive and reversible only; no destructive migrations; review before running.*
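
As a rough illustration of "additive and reversible", each migration can carry its own undo statement and be recorded in a ledger so re-runs are no-ops. The sketch below assumes SQLite purely for the sake of a runnable example. The ledger table name, migration ID, and column types are hypothetical; only the `interaction_log` column names come from the list above.

```python
import sqlite3

# Hypothetical migration list: every "up" only adds, and has a matching "down".
MIGRATIONS = [
    {
        "id": "0001_interaction_log",
        "up": """CREATE TABLE IF NOT EXISTS interaction_log (
                     id INTEGER PRIMARY KEY AUTOINCREMENT,
                     actor TEXT NOT NULL,
                     action TEXT NOT NULL,
                     target TEXT,
                     timestamp TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
                     payload TEXT)""",
        # Undo touches only the table this migration added.
        "down": "DROP TABLE IF EXISTS interaction_log",
    },
]

def _ensure_ledger(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS migration_ledger "
                 "(id TEXT PRIMARY KEY, applied_at TEXT DEFAULT CURRENT_TIMESTAMP)")

def apply_all(conn):
    """Apply migrations not yet recorded in the ledger; safe to re-run."""
    _ensure_ledger(conn)
    applied = {row[0] for row in conn.execute("SELECT id FROM migration_ledger")}
    for m in MIGRATIONS:
        if m["id"] not in applied:
            conn.execute(m["up"])
            conn.execute("INSERT INTO migration_ledger (id) VALUES (?)", (m["id"],))
    conn.commit()

def rollback(conn, migration_id):
    """Run the matching undo statement and remove the ledger entry."""
    m = next(m for m in MIGRATIONS if m["id"] == migration_id)
    conn.execute(m["down"])
    conn.execute("DELETE FROM migration_ledger WHERE id = ?", (migration_id,))
    conn.commit()

conn = sqlite3.connect(":memory:")
apply_all(conn)
apply_all(conn)  # second run is a no-op thanks to the ledger
```

Existing tables are never altered or dropped in this pattern, which keeps the review step focused on what each migration adds.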

## Workstream B: Ingest / sync pipeline

**Follow the ingest recipe in `docs/EMBEDDINGS.md` exactly** (collection schema, dual-vector upsert, FastEmbed BM25, payload fields). This brief covers only the CRM-specific decisions layered on top of it.

1. **Chunking** (a bigger quality lever than model choice). Per EMBEDDINGS.md: one chunk per email / note / transcript-turn; one chunk per memo *section*; never split mid-thought; keep entities + `date_ts` as filterable payload, not embedded text. The CRM-specific work is mapping *your* record types to chunks (how a thread, a note, a memo, a call transcript each decompose).
2. **Two vectors per chunk.** Dense via Spark Control `POST /v1/embeddings` (`BAAI/bge-m3`, 1024-d); sparse BM25 via FastEmbed (`Qdrant/bm25`) generated **client-side**. Create the `crm_chunks` collection with a named dense vector (1024, Cosine) + named sparse vector (`modifier: idf`), and payload indexes on `lp_id` (keyword), `doc_type` (keyword), `date_ts` (integer). Upsert both vectors + payload (`lp_id`, `lp_name`, `doc_type`, `date_ts`, `text`). **Ingest talks to Qdrant directly** (`QDRANT_URL`), not through the gateway.
3. **Entity resolution at ingest.** Resolve name variants ("John Smith" / "J. Smith" / "JS") to one canonical `lp_id` before storing, using a lightweight local-Qwen call via Spark Control `/v1/chat/completions`. Without it, retrieval fragments across variants. This is the same canonical-ID work as the canonical entity IDs in Workstream A2.
4. **Backfill + incremental sync, idempotent.** One-time full backfill, then incremental updates as records change. A full re-embed is ~8–15 min, so the index is cheaply rebuildable, but only if the pipeline is **idempotent and re-runnable** (Qdrant is never auto-restarted and holds the only live index). Prefer simple, correct, repeatable sync over premature optimization.
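
One simple way to keep incremental sync idempotent is to fingerprint each chunk's text and payload and skip anything whose fingerprint matches the previous run. The following is a sketch under that assumption. The `seen_hashes` ledger, the chunk-key format, and the function names are hypothetical, and where the ledger is persisted is left open.

```python
import hashlib

def content_hash(text, payload):
    """Stable fingerprint of what would be embedded and stored for a chunk."""
    h = hashlib.sha256()
    h.update(text.encode("utf-8"))
    for key in sorted(payload):  # sorted so dict ordering cannot change the hash
        h.update(f"\x1f{key}={payload[key]}".encode("utf-8"))
    return h.hexdigest()

def plan_sync(chunks, seen_hashes):
    """Split chunks into those needing (re-)embedding and those unchanged.

    chunks: iterable of (chunk_key, text, payload)
    seen_hashes: dict of chunk_key -> hash recorded by the previous run
    """
    to_upsert, unchanged = [], []
    for key, text, payload in chunks:
        digest = content_hash(text, payload)
        if seen_hashes.get(key) == digest:
            unchanged.append(key)
        else:
            to_upsert.append((key, digest))
    return to_upsert, unchanged

chunks = [("note_1:0", "Intro call went well.",
           {"lp_id": "lp_0427", "date_ts": 1771027200})]
first_run, _ = plan_sync(chunks, {})
ledger = dict(first_run)                  # record hashes after a successful upsert
second_run, skipped = plan_sync(chunks, ledger)
```

Running the planner twice over unchanged data yields an empty upsert list, so a crashed or repeated sync costs nothing beyond the hashing pass.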

## Workstream C: CRM MCP server

Wrap the CRM behind an MCP server that agents (Claude Agent SDK) call. The retrieval modes are thin wrappers over Spark Control `POST /api/search` (which orchestrates embed → Qdrant RRF → rerank); for hybrid/keyword modes, generate the query's BM25 sparse vector with the **same** FastEmbed model used at ingest and pass it as `sparse`. Tools to expose:

- **Reads:** fetch record(s), list/search records by structured filters, fetch interaction history for an entity.
- **Retrieval modes** (all via `/api/search`, varying parameters):
  - `semantic_search`: dense only (omit `sparse`), large `top_k`/`retrieve_n`. High recall (Scout).
  - `hybrid_search`: pass `sparse` + `rerank: true` + `filter`. Dense+BM25 RRF then rerank. **Default**, best for entity-heavy queries.
  - `keyword_search`: high-precision lexical ("did we ever discuss X with this LP?"); lean on the sparse leg + rerank. For a pure-BM25 result, query Qdrant's sparse vector directly via `QDRANT_URL`.
  - Always support a Qdrant `filter` (e.g. `lp_id`, `date_ts` range) to pre-filter before scoring.
- **Writes:** create/update records, append to the interaction log, write enrichment results back. (No outbound/contact actions; those are Phase 3, gated.)

Keep retrieval modes as small, composable functions so the orchestrator can select among them later.
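
For the write tools, one way to guarantee that a write and its interaction-log entry succeed or fail together is to wrap both in a single transaction. The sketch below assumes a SQLite connection and uses throwaway tables. The `lps` table, the `logged_write` helper, and the actor/action strings are hypothetical stand-ins for the real CRM schema.

```python
import json
import sqlite3
from datetime import datetime, timezone

def logged_write(conn, actor, action, target, payload, write_fn):
    """Run write_fn and append a matching interaction_log row in one transaction."""
    with conn:  # commits on success, rolls back both statements on error
        result = write_fn(conn)
        conn.execute(
            "INSERT INTO interaction_log (actor, action, target, timestamp, payload) "
            "VALUES (?, ?, ?, ?, ?)",
            (actor, action, target,
             datetime.now(timezone.utc).isoformat(), json.dumps(payload)),
        )
    return result

# Hypothetical minimal tables, standing in for the real CRM schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lps (lp_id TEXT PRIMARY KEY, warmth_score INTEGER)")
conn.execute("CREATE TABLE interaction_log (actor TEXT, action TEXT, target TEXT, "
             "timestamp TEXT, payload TEXT)")
conn.execute("INSERT INTO lps VALUES ('lp_0427', 2)")
conn.commit()

logged_write(
    conn, actor="agent:analyst", action="update_warmth", target="lp_0427",
    payload={"warmth_score": 4},
    write_fn=lambda c: c.execute(
        "UPDATE lps SET warmth_score = 4 WHERE lp_id = 'lp_0427'"),
)
```

If every MCP write tool is implemented through a helper of this kind, an unlogged write becomes a structural impossibility rather than a convention to remember.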

## Workstream D: Compliance groundwork (parallel, non-blocking)

Engage counsel to define solicitation posture (506(b) vs 506(c)), accreditation/QP verification, and recordkeeping rules. This does not block A–C, but it must be resolved before any Phase 3 outbound work begins.

## Acceptance criteria

- [ ] CRM schema extended (additive, reversible) and documented.
- [ ] `crm_chunks` collection created per EMBEDDINGS.md (named dense + sparse vectors, payload indexes).
- [ ] Full CRM corpus backfilled into Qdrant with **both** dense + BM25 sparse vectors and payload; incremental sync working and idempotent.
- [ ] Entity resolution collapses obvious name variants to one canonical `lp_id`.
- [ ] CRM MCP server live; retrieval modes wrap `/api/search`; hybrid (sparse + rerank) visibly beats dense-only on entity-heavy test queries (e.g. exact fund/LP name lookups).
- [ ] Every write goes through the interaction log.
- [ ] No outward-facing behavior exists anywhere in the codebase.

## Explicitly out of scope for Phase 0

- Any agent that sends email, posts, or contacts a prospect.
- Scout monitoring, Closer outreach, Scribe publishing, the Orchestrator.
- The workflow/scheduling engine (revisited at Phase 3).

## Suggested order

A1 → A2 → B1 → B2 → B3 → B4 → C. Start D in parallel.
@@ -0,0 +1,133 @@
# Ten31: Agentic Capability Build Plan

*Working document. Purpose: a concrete, sequenced plan for building an in-house system of AI agents to widen the top of the fundraising funnel, refine and propagate Ten31's thesis, and automate marketing/branding workflows, built with internal resources using Claude and Claude Code as the engineering partner.*

---

## 1. Approach in one paragraph

Build **six agents** (five workers plus a lightweight orchestrator) on the **Claude Agent SDK**, connected to your systems through **MCP**. Run the *reasoning* on **Claude** (frontier-quality judgment for research, messaging, drafting). **Self-host the data layer and the privacy-sensitive model work** on your existing Start9 server and your **dual DGX Sparks**. **Buy nothing for the core**: your self-built CRM becomes the system of record, and your existing Gmail/Superhuman + calendar connectors supply the relationship data. The real unit of reuse is not the agent count; it is one shared **LP graph** (your CRM) plus a library of **skills** every agent draws from.

---

## 2. Guiding principles

1. **Sovereignty first.** Sensitive LP and relationship data stays on infrastructure you control (Start9 + DGX Sparks). Only the minimum necessary context per call ever reaches a third-party model API.
2. **Frontier reasoning where it is best-in-class; local where privacy or cost dominate.** Claude for hard agentic reasoning and LP-facing output; local open models for embeddings, redaction, triage, transcription, and reasoning over data that must not leave your walls.
3. **Human-in-the-loop on anything outbound or thesis-defining.** Agents draft and prepare; partners approve and send.
4. **Compliant by design.** Log every agent action; gate all outbound; bring counsel in before any cold outreach goes live.
5. **One source of truth.** Every agent reads from and writes to the same LP graph, so research → outreach → nurture → meeting prep compound instead of fragmenting.

---

## 3. The agent roster (6)

| Agent | Job | Cadence | Brain | Human gate |
|---|---|---|---|---|
| **Scout** | Watches sources (X/nostr, filings, treasury announcements, conference rosters, podcast networks); flags trigger events; populates the pipeline. | Continuous / scheduled | Local (triage) + Claude (judgment calls) | None (internal only) |
| **Analyst** | Builds LP dossiers, enriches records, maps shortest warm-intro path through the team's network. | On-demand + triggered | Claude (synthesis); local for RAG/embeddings | None (internal only) |
| **Architect** | **Thesis articulation.** Owns and refines the canonical messaging: the scarcity / critical-infrastructure throughline tying bitcoin to AI infrastructure. The copilot partners sit with to sharpen the narrative. Output = a living "messaging source of truth." | On-demand, collaborative | Claude | Partner sign-off on canonical thesis |
| **Scribe** | **Distribution / amplification.** Takes the Architect's canonical thesis + your content (Bitcoin Alpha, partner shows, memos) and propagates segment-specific cuts across X, nostr, LinkedIn, email. | Scheduled + on-demand | Claude | Review before publish |
| **Closer** | Drafts personalized outreach and nurture sequences, preps partners before LP calls, writes follow-ups, keeps the CRM clean. | Triggered + on-demand | Claude | **Hard gate**: human sends all outbound |
| **Orchestrator** ("Chief of Staff") | Schedules runs, routes work between agents, escalates to a human. | Always on | Claude (light) | n/a |

**Why Architect and Scribe are separate.** Distribution is high-frequency and semi-mechanical; thesis articulation is low-frequency, high-judgment, and collaborative. Keeping them apart lets the Architect own a stable, partner-approved narrative that the Scribe then propagates consistently everywhere.

---

## 4. Architecture and hosting map

### 4.1 Model layer

- **Claude (API)**: the brains for Analyst synthesis, Architect thesis work, Scribe drafting, Closer judgment, and Orchestrator routing. Use a stronger model for Architect/Analyst, a faster one for high-volume Scout/Closer tasks.
- **Local model on the DGX Sparks**: current local model is **Qwen3.6 35B-A3B running on a single Spark**. Used for PII redaction before any data leaves your walls, inbound triage/classification, transcription orchestration, structuring/extraction, and local reasoning over data you choose never to send out.
  - The **A3B (~3B active params)** design means only a small slice of the model runs per token, so it largely sidesteps the Spark's memory-bandwidth limit and keeps decode fast despite being a 35B-total model. No need to link both Sparks for a larger model; that earlier ceiling is moot for this workload.
- **Embeddings + reranking (shipped, Spark Control v0.15.0).** Retrieval runs on `BAAI/bge-m3` (dense, 1024-dim, L2-normalized) plus `BAAI/bge-reranker-v2-m3` (cross-encoder), served by **spark-embed**, a small FastAPI server on **Spark 2** built from the NGC PyTorch image (HF TEI was ruled out: no arm64 CUDA image). Exposed through Spark Control as `/v1/embeddings`, `/v1/rerank`, and `/api/search` (orchestrated hybrid retrieval). Combined GPU footprint on Spark 2 is trivial (~3 GB).
- **Spark allocation.** Spark 1 = LLM serving (hot KV cache). Spark 2 = embeddings + reranker + audio + the Qdrant vector index. Both Sparks are treated as always-on production infrastructure.
- **All local model services are fronted by Spark Control** (the self-hosted gateway on Start9): agents hit one trusted URL for chat, embeddings, rerank, transcription, and TTS, with shared TLS, access control, and observability.
- **Auth note:** Agent SDK agents must authenticate with an **API key**, not a claude.ai login.

### 4.2 Data layer: the LP graph (self-hosted)

- **The CRM (self-hosted on Start9) is the canonical system of record.** Extend it to be the LP graph. Add: prospect/LP schema fields (thesis fit, segment, accreditation/QP status, warmth score, source, owner, last-touch), an interaction log (every agent action + every human touch), a derived **relationship graph** table, and **canonical entity IDs** for entity resolution (see ingest pipeline).
- **Vector store: Qdrant on Spark 2 (settled).** Holds the embedded chunks. It is a **rebuildable, derived index**, not a second source of truth: if lost, it re-embeds from the CRM in minutes. Qdrant provides dense search + native BM25 + payload filtering + Reciprocal Rank Fusion in one service.
- **Retrieval pipeline.** One orchestrated call to Spark Control `/api/search`: embed query (BGE-M3) → Qdrant dense + BM25 RRF with payload pre-filter → cross-encoder rerank → top_k. BM25 is generated **client-side** via FastEmbed (`Qdrant/bm25`) at both ingest and query time, with Qdrant applying IDF over *your* corpus, so domain entities (LP names, tickers, portfolio companies) are weighted by your own term statistics rather than BGE-M3's general-web sparse weights.
- **Ingest pipeline (the real Phase 0 work).** CRM record/change → chunk (one chunk per email/note/transcript-turn; one per memo *section*; time-aware; entities + `date_ts` kept as filterable payload, not embedded text) → resolve entities to a canonical `lp_id` (lightweight local-Qwen step) → produce **both** a dense vector (`/v1/embeddings`) and a sparse BM25 vector (FastEmbed) → upsert both + payload to Qdrant **directly** (not via the gateway). One-time backfill + idempotent incremental sync. Full recipe: `docs/EMBEDDINGS.md`.
- **Per-agent retrieval modes.** Don't force one pipeline on all agents. Build a small library the orchestrator picks from: high-recall dense at large K (Scout), high-precision keyword/BM25 (Closer: "did we ever discuss X with this LP?"), long-context + rerank (Architect). The CRM MCP server exposes these as tools.
- **Wrap the CRM in an MCP server** so all agents read/write through one uniform interface, including the retrieval modes above. Because the CRM is self-built, any endpoint the agents need can be added.

### 4.3 Integration layer (MCP fabric)

- MCP servers to stand up / connect:
  - **CRM / LP graph** (custom, self-hosted): primary.
  - **Email + calendar**: Gmail/Superhuman connectors are already live; these feed Closer (drafting, follow-ups) and the Analyst's warm-path derivation.
  - **Drive / notes**: internal documents and memos.
  - **Publishing channels**: X, nostr, LinkedIn, email/newsletter (for Scribe).
  - **Public data sources**: filings, web search, and the **X API (official key in hand)** for Scout/Analyst enrichment. X is a primary source here: per-prospect public profile/bio/activity and follower-following overlap for thesis-fit scoring and mutual-connection discovery (Analyst), plus account/list/keyword monitoring and follower-graph signals (Scout). Confirm what your X access *tier* permits (full-archive search, follower-graph pulls, streaming), since that sets the ceiling on heavier monitoring. nostr APIs as a complementary source.

### 4.4 Orchestration / runtime

- Inner loop: **Claude Agent SDK** handles each agent's tool-use loop and context management.
- Outer loop: a thin workflow engine decides *when* and *which* agent runs (Temporal for durable retries, or simpler cron/queue + n8n glue to start).
- **Observability:** structured logging of every agent action, with a simple dashboard. Required for both debugging and compliance.

### 4.5 Enrichment (privacy-preserving)

- Default: **one-way, per-prospect public lookups** that write results *into* the CRM. Never upload the LP list to a third party. The **X API** is the workhorse here (public, per-prospect, ToS-compliant via the official key), and its follower-graph data complements the email/calendar relationship graph for warm-path mapping.
- Optional: a **self-hosted scraper/enrichment pipeline on the Sparks** if you want zero third-party API exposure.

### 4.6 Redaction / re-hydration boundary (Claude-facing reasoning)

- For the steps where an agent must have **Claude reason over LP-specific content** (Analyst dossiers, Closer drafting), a local **scrub → reason → re-hydrate** round-trip keeps identifiers off the third-party API: the Sparks pseudonymize names/orgs/amounts to stable placeholders, Claude reasons over the de-identified prompt, and real values are swapped back locally before a human reviews. The ingest/retrieval path is already fully local and needs none of this.
- This is **designed now, built in Phase 2/3** (it is not needed in Phase 0). Full design: `docs/redaction-rehydration.md`.
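
The scrub → reason → re-hydrate round-trip can be illustrated with a small placeholder map. This is a sketch of the idea only, not the design in `docs/redaction-rehydration.md`. The class name and placeholder format are hypothetical, and entity detection (the job of the local model) is assumed to have already happened.

```python
import re

class Pseudonymizer:
    """Swaps known sensitive strings for stable placeholders and back again."""

    def __init__(self):
        self._to_placeholder = {}
        self._to_real = {}

    def _placeholder_for(self, value, kind):
        # Reuse the same placeholder for a value seen before, so it stays stable.
        if value not in self._to_placeholder:
            token = f"[{kind}_{len(self._to_placeholder) + 1}]"
            self._to_placeholder[value] = token
            self._to_real[token] = value
        return self._to_placeholder[value]

    def scrub(self, text, entities):
        """entities: dict of real string -> kind, e.g. {"Brightwater Capital": "ORG"}."""
        # Replace longer strings first so substrings do not clobber full names.
        for value in sorted(entities, key=len, reverse=True):
            text = text.replace(value, self._placeholder_for(value, entities[value]))
        return text

    def rehydrate(self, text):
        """Swap placeholders back; unknown placeholders are left untouched."""
        return re.sub(r"\[[A-Z]+_\d+\]",
                      lambda m: self._to_real.get(m.group(0), m.group(0)), text)

p = Pseudonymizer()
entities = {"Brightwater Capital": "ORG", "Jonathan Smith": "PERSON"}
original = "Jonathan Smith of Brightwater Capital asked about Fund III."
scrubbed = p.scrub(original, entities)   # this is what would be sent for reasoning
restored = p.rehydrate(scrubbed)         # done locally before human review
```

Keeping the map in memory on the local side means the third-party API only ever sees placeholders, while the human reviewer sees the restored text.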

---

## 5. Build sequence

### Phase 0: Foundation

The substrate: data layer + retrieval, no live-in-the-wild agents yet. Division of labor:

- **Spark developer (their side):** spark-embed serving BGE-M3 + BGE-Reranker-v2-m3 and Qdrant on Spark 2, exposed via Spark Control `/v1/embeddings`, `/v1/rerank`, and `/api/search`.
- **Claude Code + you (this project):**
  1. Read the CRM code; document the storage engine, schema, and API surface.
  2. Extend the CRM schema (LP/prospect fields, interaction log, relationship graph, canonical entity IDs).
  3. Build the ingest/sync pipeline (chunking + entity resolution + metadata payloads; backfill + incremental).
  4. Build the CRM MCP server wrapping CRM reads/writes and the per-agent retrieval modes.
  5. Bring counsel in to define outbound and recordkeeping rules so the system is compliant from day one.

### Phase 1: Architect + Scribe

- Stand up the **Architect** first: encode the current thesis, voice, and segment definitions as skills; use it collaboratively to produce the canonical messaging source of truth.
- Then **Scribe**: propagate that thesis into segment-specific content with human review before publish.
- Lowest risk, highest immediate awareness ROI, never touches cold outreach, and it proves the full pattern (SDK + skills + MCP + human review).

### Phase 2: Scout + Analyst

- **Scout** populates the pipeline from public signals (X monitoring via the API key); **Analyst** builds dossiers and derives warm paths from your own email/calendar graph plus X follower-graph overlap.
- Internal-facing, still no outbound. This is where the Sparks earn their keep (bulk classification, embeddings, RAG).

### Phase 3: Closer + Orchestrator

- **Closer** drafts outbound, nurture, and meeting prep, with hard human-in-the-loop gates and full logging. Highest-risk and regulated, so it comes last.
- **Orchestrator** added once there are multiple agents to coordinate and schedule.

---

## 6. Team and ownership model

- **Engineering partner:** Claude + Claude Code, supplying Agent SDK and MCP fluency, scaffolding the agents, writing the MCP servers and orchestration, and customizing the Start9 CRM package.
- **Operator:** you (and your partner). You own deployment, secrets/key management, uptime, and the human-review gates. Your prior Start9 CRM build demonstrates this is well within reach.
- **The one real risk is time, not capability.** Removing the part-time data/ops hire means operational ownership lands on the partners. If partner time is scarce, that (not tooling or skill) is the constraint to manage. Mitigations: keep the early phases internal-only (no on-call urgency), automate logging/monitoring, and stage the highest-maintenance agent (Closer) last.

---

## 7. Compliance by design

- Log every agent action and every outbound draft.
- Gate all outbound through human send.
- Resolve solicitation posture (e.g. 506(b) vs 506(c)), accreditation/QP verification, and recordkeeping with counsel **before** the Closer touches cold outreach.
- Start with distribution and inbound nurture, where constraints are lightest.
|
||||
|
||||
---
|
||||
|
||||
## 8. Open decisions
|
||||
|
||||
**Resolved:** local chat/triage model = Qwen3.6 35B-A3B (Spark 1); embedding = `BAAI/bge-m3` dense 1024-dim; reranker = `BAAI/bge-reranker-v2-m3`; vector DB = Qdrant v1.16.0 on Spark 2; serving = **spark-embed** (custom FastAPI on NGC PyTorch image, *not* TEI); BM25 sparse generated client-side via FastEmbed (`Qdrant/bm25`); all fronted by Spark Control (`/v1/embeddings`, `/v1/rerank`, `/api/search`), shipped v0.15.0. Embedding-model A/B upgrade candidate if dense recall lags: `Qwen3-Embedding-4B` (same `/v1/embeddings` contract).
|
||||
|
||||
**Still open:**
|
||||
1. Workflow engine for the outer loop (Phase 3): Temporal vs. cron/queue + n8n to start.
|
||||
2. Whether any third-party enrichment API is acceptable, or X + fully self-hosted enrichment only.
|
||||
3. Confirm **X API usage limits** (full-archive search, follower-graph pulls, streaming) to size Scout's monitoring scope. (Current access is pay-as-you-go credits.)
|
||||
4. Segment definitions for the Architect/Scribe (who are the distinct LP audiences, and what does each one need to hear?).
|
||||
5. Embedding dimension/quantization: currently left at BGE-M3's native 1024-dim fp16, since neither Matryoshka truncation nor int8 is needed at this corpus scale.
|
||||
@@ -0,0 +1,207 @@
|
||||
# CRM Overview — Storage, Data Model, API, Auth
|
||||
|
||||
*Workstream A1 deliverable (see `PHASE_0.md`). Read-only documentation of the existing CRM as of 2026-06. Every concrete claim is anchored to `file:line`. This is a description of what exists today, not a proposal — the schema-extension proposal for A2 lives separately.*
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR for Phase 0
|
||||
|
||||
- **One Python file, no framework.** The whole backend is `backend/server.py` (~4,530 lines): a stdlib `http.server.ThreadingHTTPServer` with a hand-written `CRMHandler(BaseHTTPRequestHandler)` and manual path dispatch. `requirements.txt` lists FastAPI/SQLAlchemy/Alembic/Pydantic but **none are imported** — they are vestigial.
|
||||
- **Storage is one SQLite file** (`data/crm.db`), WAL mode, opened fresh per request. Schema is created idempotently in-code at boot. There is no Alembic; "migrations" are `CREATE TABLE IF NOT EXISTS` + best-effort `ALTER TABLE ADD COLUMN`.
|
||||
- **Two parallel investor data models** coexist with no shared key: (1) the *classic* `contacts / organizations / opportunities / communications / lp_profiles` CRM, and (2) the *newer, actively-used* `fundraising_*` collaborative grid. They are bridged only by fuzzy name/email matching. **This duality is the central entity-resolution problem for Phase 0.**
|
||||
- **A real Gmail subsystem** (`backend/email_integration/`) stores threaded correspondence in `crm.db` and matches emails to investors — but is **self-disabling** (off unless a service-account key is present).
|
||||
- **Auth is a single scheme:** username/password → HS256 JWT (Bearer header), re-validated against the `users` table each request; two roles (`admin`/`member`). The `X_API_KEY` named in `CLAUDE.md`/`PHASE_0.md` **does not exist in the code** — it is aspirational.
|
||||
- **Guardrail flags:** all deletes are **hard deletes** (violates guardrail #3 as written); a destructive `POST /api/admin/reset-all-data` exists; `audit_log` is mutation-only and is *not* the append-only interaction log Phase 0 wants.
|
||||
|
||||
---
|
||||
|
||||
## 1. Storage engine & where it runs
|
||||
|
||||
### 1.1 Runtime
|
||||
- **Server:** `ThreadingHTTPServer((HOST, PORT), CRMHandler)`, `daemon_threads = True`, `serve_forever()` — `backend/server.py:4509`. Handler class at `backend/server.py:1418`. Pure Python stdlib (`http.server`); **not** FastAPI/uvicorn despite `backend/requirements.txt:1-2`.
|
||||
- **Concurrency model:** one OS thread per request. Safe because each request opens its own short-lived SQLite connection under WAL (rationale documented inline at `backend/server.py:4506-4508`).
|
||||
- **Request lifecycle:** banned-IP check → per-IP rate limit → (email module hook) → manual `if path == …` dispatch. Body is read once and cached on the handler instance (`get_body()`, `backend/server.py:1433`). Malformed JSON silently becomes `{}`.
|
||||
- **Background threads:** a backup scheduler loops every 60 s (`start_backup_scheduler`, `backend/server.py:1367`); an optional Gmail sync scheduler starts only if `CRM_GMAIL_INTEGRATION_ENABLED` is truthy (`backend/server.py:4498`).
|
||||
- **Default bind:** `0.0.0.0:8080`, plain HTTP — TLS is expected to be terminated upstream (Start9 / Tailscale).
|
||||
|
||||
### 1.2 SQLite configuration
|
||||
`get_db()` (`backend/server.py:77-84`) sets, on every connection:
|
||||
- `PRAGMA journal_mode=WAL` — concurrent readers + single writer (this is what makes the ingest reader safe against the live writer).
|
||||
- `PRAGMA foreign_keys=ON` — FKs are enforced at runtime (per-connection in SQLite, so re-set each time).
|
||||
- `PRAGMA busy_timeout=5000` — 5 s wait on a lock.
|
||||
- `row_factory = sqlite3.Row`.
|
||||
|
||||
The Gmail module re-implements the identical pragmas (`email_integration/scheduler.py:49`, `email_integration/routes.py:89`) rather than importing them from `server.py`, to avoid a circular import.
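A minimal sketch of a connection helper that applies the settings listed above. The signature (a `db_path` argument) is an assumption for illustration; the real `get_db()` at `backend/server.py:77-84` may differ in detail.

```python
import sqlite3


def get_db(db_path: str) -> sqlite3.Connection:
    """Open a fresh, short-lived connection with the documented pragmas."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row             # rows addressable by column name
    conn.execute("PRAGMA journal_mode=WAL")    # concurrent readers + one writer
    conn.execute("PRAGMA foreign_keys=ON")     # per-connection, so set every time
    conn.execute("PRAGMA busy_timeout=5000")   # wait up to 5 s on a lock
    return conn
```

Because `foreign_keys` and `busy_timeout` are per-connection settings, any second code path that opens the file (such as the Gmail module) has to repeat them, which is why the duplication noted above exists.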
|
||||
|
||||
### 1.3 Schema bootstrap & "migrations"
|
||||
- `init_db()` (`backend/server.py:86`) runs once at startup, before binding. One big `executescript` of `CREATE TABLE/INDEX IF NOT EXISTS` (`backend/server.py:91-405`) creates **both** data models plus `app_settings`.
|
||||
- **Core "migrations":** a hardcoded list of `ALTER TABLE … ADD COLUMN` wrapped in a try/except that swallows `OperationalError` (`backend/server.py:407-418`) — additive-only, idempotent-by-failure. No version table, no down-migrations. (Currently adds `city/state/country/location_query` to `contacts` and `lead_source` to `fundraising_investors`.)
|
||||
- **The only real migration runner** is in the Gmail module: `email_integration.db.apply_migrations()` (`email_integration/db.py:23`) runs numbered `NNNN_*.sql` files in lexicographic order. There is one today: `migrations/0001_email_tables.sql`. ⚠️ `init_db()` calls it inside a `try` that catches only `ImportError` (`backend/server.py:421-427`), so on any image where the package is importable, **the email tables are created even when Gmail sync is disabled**.
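The "idempotent-by-failure" pattern for the core schema can be sketched as follows. This is an illustration only: the helper name, the column list, and the `TEXT` declarations are assumptions, not a copy of `backend/server.py:407-418`.

```python
import sqlite3

# Hypothetical subset of the hardcoded list; column types are assumed.
ADDITIVE_COLUMNS = [
    ("contacts", "city", "TEXT"),
    ("fundraising_investors", "lead_source", "TEXT"),
]


def apply_additive_columns(conn, columns=ADDITIVE_COLUMNS):
    """Attempt each ADD COLUMN; return the (table, column) pairs actually added."""
    added = []
    for table, column, decl in columns:
        try:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
            added.append((table, column))
        except sqlite3.OperationalError:
            # "duplicate column name" means a previous boot already applied it.
            # Swallowing the error is what makes a re-run harmless.
            pass
    conn.commit()
    return added
```

The trade-off is visible in the `except` branch: any other `OperationalError` (a typo in a table name, for example) is swallowed too, and with no version table there is no record of what ran.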
|
||||
|
||||
### 1.4 `data/` layout
|
||||
- `crm.db`, `crm.db-wal`, `crm.db-shm` — the DB + WAL + shared-memory.
|
||||
- `backups/` — JSON snapshots of the **fundraising grid state only** (not the whole DB), written by the backup scheduler.
|
||||
- `secrets/` — holds `gmail-service-account.json` (mode 600).
|
||||
- `email_attachments/` — Gmail attachment bytes on disk (created on the 0.4 image).
|
||||
- `.crm-secret` — persisted JWT secret, written by the container entrypoint (not the app).
|
||||
|
||||
### 1.5 Production deployment (StartOS)
|
||||
Package id `ten-database` ("Ten31 Database"). Both generations run the **same** app (`python3 /app/backend/server.py`) in a `python:3.11-slim` container; all state on a single persistent volume `main` mounted at `/data`.
|
||||
|
||||
- **`start9/0.4/` — current/live target.** Manifest & lifecycle are TypeScript under `start9/0.4/startos/` (`manifest/index.ts`, `main.ts`, `backups.ts`, `interfaces.ts`). Built for `x86_64` + `aarch64`. Whole-volume backups (`sdk.Backups.ofVolumes('main')`). The richer `docker_entrypoint.sh` creates `/data/{backups,secrets,email_attachments}`, persists `CRM_SECRET_KEY` to `/data/.crm-secret`, and **conditionally enables Gmail** iff `/data/secrets/gmail-service-account.json` exists (then exports DWD env: `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`, sync interval 180 min). Version notes record the 0.3.5→0.4 migration is complete and the live `/data` volume is the sole source of truth (no more baked-in seed snapshot).
|
||||
- **`start9/0.3.5/` — legacy.** Hand-written YAML manifest (`manifest.yaml`), arm64-only, Tor `80→8080` + LAN `443(ssl)→8080`. Superseded by 0.4.
|
||||
- **Local/dev:** `start.sh` runs `python3 backend/server.py` with dev defaults. `start_beta.sh` is a Tailscale launcher that sources `.env.beta`, forces `CRM_ENV=production`, and requires a ≥24-char `CRM_SECRET_KEY`. In production mode the app refuses to start without `CRM_SECRET_KEY` (`backend/server.py:4487`).
|
||||
|
||||
### 1.6 Environment variables (for CLAUDE.md "CRM connection vars")
|
||||
**Core server (`backend/server.py:42-71`):** `CRM_DATA_DIR`, `CRM_FRONTEND_DIR`, `CRM_DB_PATH` (default `<DATA_DIR>/crm.db`), `CRM_SECRET_KEY` (JWT signing — required in production), `CRM_HOST` (default `0.0.0.0`), `CRM_PORT` (default `8080`), `CRM_CORS_ORIGIN` (default `*`), `CRM_ENV` (default `development`), `CRM_LOGIN_RATE_LIMIT_PER_MIN`, `CRM_WRITE_RATE_LIMIT_PER_MIN`, `CRM_GET_RATE_LIMIT_PER_MIN`, `CRM_ABUSE_404_THRESHOLD`, `CRM_ABUSE_404_WINDOW_SEC`, `CRM_ABUSE_BAN_SEC`, `CRM_SEED_DEMO_DATA`, `CRM_GMAIL_INTEGRATION_ENABLED`.
|
||||
|
||||
**Gmail module (`email_integration/config.py:80-101`):** `CRM_GMAIL_AUTH_METHOD` (`dwd`/`oauth`), `CRM_GMAIL_SA_KEY_PATH`, `CRM_GMAIL_WORKSPACE_DOMAIN`, `CRM_GMAIL_OAUTH_CLIENT_ID/SECRET/REDIRECT_URI`, `CRM_GMAIL_SECRET_KEY` (AES key for OAuth-token-at-rest — **separate** from `CRM_SECRET_KEY`), `CRM_GMAIL_SYNC_INTERVAL_MIN`, `CRM_GMAIL_BACKFILL_PAGE_SIZE`, `CRM_GMAIL_MAX_ATTACHMENT_MB`, plus rate/retry knobs.
|
||||
|
||||
> **There is no network DB protocol.** "Connecting to the CRM" means either (a) opening the same SQLite file (`CRM_DB_PATH`) — only possible co-located with the `/data` volume — or (b) HTTP at `http://<host>:8080` with a Bearer JWT.
|
||||
|
||||
---
|
||||
|
||||
## 2. Data model
|
||||
|
||||
PKs are **8-char truncated UUIDs** (`generate_id()` = `str(uuid.uuid4())[:8]`, `backend/server.py:522`). Timestamps are ISO-8601 UTC strings (`now()`, `backend/server.py:525`). JSON-bearing TEXT columns (`tags`, `attendees`, `options`) are `json.dumps`'d on write and auto-parsed on read by `row_to_dict()` (`backend/server.py:506-517`).
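To put a number on the truncated-ID risk, here is a small sketch using the standard birthday-bound approximation. It assumes the 8 hex characters are uniformly random (true for the leading characters of a UUIDv4); the function names other than `generate_id` are illustrative.

```python
import math
import uuid

ID_BITS = 32  # 8 hex characters, 4 bits each


def generate_id() -> str:
    """Same expression the document quotes for the CRM's primary keys."""
    return str(uuid.uuid4())[:8]


def collision_probability(n_ids: int, bits: int = ID_BITS) -> float:
    """Birthday bound: p is approximately 1 - exp(-n(n-1) / 2^(bits+1))."""
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2.0 * 2 ** bits))


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} ids -> p(at least one collision) ~ {collision_probability(n):.4f}")
```

Under this approximation the chance of at least one collision is on the order of 1% at 10,000 ids drawn into the same keyspace. That is tolerable per table while the primary-key constraint rejects duplicates, but it matters once ids from several tables share one index.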
|
||||
|
||||
### 2.1 Classic CRM model
|
||||
|
||||
| Table | Role | Key columns / notes |
|
||||
|---|---|---|
|
||||
| `users` | auth + ownership principal | `username`/`email` UNIQUE, `password_hash`, `role` ∈ {`admin`,`member`}, `is_active`. First user forced `admin`. (`backend/server.py:92`) |
|
||||
| `organizations` | weak parent of contacts/opps | `name` (not unique), `type` (free-text, default `other`), `tags` JSON, `description`. (`backend/server.py:104`) |
|
||||
| `contacts` | **the hub** | `first_name`/`last_name` (req), `organization_id` (FK SET NULL), `contact_type` (free-text; load-bearing values `prospect`/`investor`), `status` (default `active`), `source`, `tags` JSON, `notes`, `linkedin_url`. (`backend/server.py:123`) |
|
||||
| `opportunities` | deal pipeline | `contact_id` (req, FK **CASCADE**), `stage` (allowlist `PIPELINE_STAGES` at `backend/server.py:1380`, enforced **only** on the stage endpoint), `commitment_amount`, `expected_amount`, `fund_name`, `owner_id`, `lost_reason`. (`backend/server.py:148`) |
|
||||
| `lp_profiles` | closed-LP extension | 1:1 with a contact (`contact_id` UNIQUE, FK CASCADE). Holds `commitment_amount`, `funded_amount`, `accredited` (bare 0/1), `legal_docs_signed`, `wire_received`, `k1_sent`, `investor_type` (free-text). (`backend/server.py:186`) |
|
||||
| `custom_fields` / `custom_field_values` | EAV custom fields | **Dead**: schema exists but has **no routes/handlers**; only ever wiped by reset. Do not build on this. (`backend/server.py:206`) |
|
||||
| `tags` | global tag palette | `name` UNIQUE + `color`. Not FK-linked to the per-row `tags` JSON arrays; just an autocomplete source. (`backend/server.py:237`) |
|
||||
| `audit_log` | mutation diff trail | `user_id`, `entity_type`, `entity_id`, `action`, `changes` JSON. **Mutation-only**, no reads, no actor/agent dimension. (`backend/server.py:227`) |
|
||||
|
||||
**How an LP is represented:** a single `contacts` row is the canonical record; `contact_type` carries the funnel stage (`prospect`→`investor`). Promotion to `investor` is a side effect of creating an `lp_profiles` row (`backend/server.py:2834`) or of fundraising-grid sync (`backend/server.py:788`). The contact dossier is assembled by `GET /api/contacts/{id}` (`backend/server.py:2008`): contact + last-20 communications + all opportunities + the one lp_profile. **Note the unreconciled double-modeling of money:** in-flight commitment lives on `opportunities`, closed commitment lives on `lp_profiles`, and the grid has a *third* copy in `fundraising_commitments`.
|
||||
|
||||
### 2.2 Fundraising grid model (newer, actively used)
|
||||
|
||||
A **real-time collaborative spreadsheet** the partners actually edit: funds are columns, investors are rows, dollar commitments are cells, plus saved views, live presence/cell-locks, and a small automation engine.
|
||||
|
||||
- **Authoritative store = one JSON blob:** `fundraising_state.grid_json` + `views_json` (row `id='main'`, `backend/server.py:258`). Reads/exports come straight from this.
|
||||
- **Normalized tables are a derived mirror**, fully rebuilt from the JSON on **every** write by `sync_fundraising_relational()` (`backend/server.py:945`): `fundraising_investors` (keyed by `source_row_id`), `fundraising_funds` (by `column_id`), `fundraising_commitments` (cells), `fundraising_contacts`, `fundraising_views`. ⚠️ **`fundraising_contacts`/`fundraising_commitments`/`views` get fresh UUIDs on every save** (DELETE+reinsert); only `fundraising_investors.id` and `fundraising_funds.id` are stable. Don't persist external references to the volatile ones.
|
||||
- **Automation engine** (`run_fundraising_automations`, `backend/server.py:668`): currently **ignores** the rules' `condition_json`/`action_json` and uses hard-coded flag logic (`graveyard`→graveyard list, `follow_up`→follow_up list, everyone→`all`), rebuilding `fundraising_list_memberships` and logging changes to `fundraising_automation_runs`. So the rules table is display/config surface, not a live interpreter.
|
||||
- **Backups:** JSON-only filesystem snapshots of grid state to `data/backups/` (manual/auto/pre_restore), governed by a policy in `app_settings`. Restore overwrites state then re-syncs the mirror.
|
||||
|
||||
### 2.3 The two-model bridge (the central problem)
|
||||
|
||||
There is **no foreign key** between `fundraising_investors` and the classic `contacts/organizations/opportunities/lp_profiles`. They are joined only by best-effort name/email matching, essentially one-directional **grid → classic**:
|
||||
|
||||
1. **Grid → classic (write-through):** every grid save pushes each grid contact into classic `contacts`/`organizations` via `_upsert_contact_from_fundraising` (`backend/server.py:730`), matching by lowercased email else (name + org). No stored key links the resulting `contacts.id` back to `fundraising_investors.id` — it re-matches by name/email each time.
|
||||
2. **Grid-logged comms → classic `communications`:** `POST /api/fundraising/log-communication` (`backend/server.py:2561`) writes into the classic `communications` table *and* appends a dated line to the grid row's `notes`.
|
||||
3. **Classic → grid (partial reverse mirror):** `_sync_contact_to_fundraising_state` (`backend/server.py:815`) patches an existing matching grid row but will **not** create a new investor row.
|
||||
|
||||
**Net:** the same investor can simultaneously exist as a `fundraising_investors` row, one-or-more `contacts` rows, an `organizations` row, and an `lp_profiles` row, with **no authoritative join key**. The email matcher treats the **fundraising side as higher-signal** (`matcher.py:103`). `lp_profiles` is entirely outside the fundraising subsystem.
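The write-through match in step 1 (lowercased email, else name + org) can be sketched as a key function. This is a hypothetical illustration, not the body of `_upsert_contact_from_fundraising`; in particular, lowercasing and trimming the name and organization is an assumption.

```python
def match_key(email, first_name, last_name, org_name):
    """Return the key a grid contact would be matched on.

    Email wins when present; otherwise fall back to (name, org).
    """
    if email and email.strip():
        return ("email", email.strip().lower())
    name = f"{first_name or ''} {last_name or ''}".strip().lower()
    return ("name_org", name, (org_name or "").strip().lower())


# Two grid saves for the same person produce the same key only while the
# email (or, failing that, the exact name and org spelling) stays unchanged.
print(match_key("Ann@Acme.com", "Ann", "Lee", "Acme"))
print(match_key(None, "Ann", "Lee", "Acme Capital"))
```

Because nothing stores the resulting `contacts.id` against the grid row, a changed email or a re-spelled organization yields a different key and therefore a second `contacts` row, which is how the duplicates described above arise.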
|
||||
|
||||
### 2.4 Notes, interactions & correspondence (Phase-0 critical)
|
||||
|
||||
Three subsystems hold embeddable text:
|
||||
|
||||
**(a) `communications`** (`backend/server.py:168`) — the primary human-logged activity store. One row per `note`/`call`/`email`/`meeting`/`text` (type is free-text; UI offers those 5, `frontend/index.html:4220`). Columns: `contact_id` (req, CASCADE), `opportunity_id` (SET NULL), `subject`, `body`, `communication_date` (the **event** timestamp, distinct from `created_at` — this is the `date_ts` source), `outcome`, `next_action`, `attendees` JSON. Written by `POST /api/communications` and by the fundraising log endpoint. **Hard delete** at `backend/server.py:2758`.
|
||||
|
||||
**(b) Scattered free-text fields** worth embedding: `contacts.notes`, `lp_profiles.notes`, `fundraising_investors.notes` (a running, newline-appended outreach log mirroring the grid Notes column), `opportunities.description`/`next_step`, `organizations.description`.
|
||||
|
||||
**(c) Gmail correspondence** (`backend/email_integration/`, schema in `migrations/0001_email_tables.sql`):
|
||||
- `emails` — canonical record deduped by RFC `message_id`; `subject`, `from_*`, `to/cc/bcc` JSON, `sent_at`, `body_text`, `body_html`, `snippet`, `is_matched`, `match_status`. ⚠️ **Bodies are stored only for matched emails**; unmatched emails are metadata-only with the body nulled (`sync.py:319`).
|
||||
- `email_threads` — thread roll-up; threading via RFC `In-Reply-To`/`References` chain then Gmail thread id (`threads.py:38`).
|
||||
- `email_account_messages` — per-mailbox sighting (dedup across team inboxes).
|
||||
- `email_attachments` — metadata; bytes on disk, deduped by SHA-256.
|
||||
- **`email_investor_links`**: the entity linkage. Each row populates any subset of `fundraising_investor_id` / `fundraising_contact_id` / `contact_id` / `organization_id` (all **soft references, no FK**) and records a `match_kind` with its confidence (`exact_email` = 1.0, `domain_match` = 0.6). A single email can link to several entities at once. Separately, `email_sync_runs` records per-run observability.
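The two match kinds and their confidences can be illustrated with a small sketch. The function, its arguments, and the lookup dictionaries are hypothetical; only the `match_kind` labels and the 1.0 / 0.6 confidences come from the description above, and the real logic in `matcher.py` may be more involved.

```python
def match_email(sender, known_emails, known_domains):
    """Return (match_kind, confidence, entity_id), or None when unmatched.

    known_emails maps a lowercased address to an entity id;
    known_domains maps a lowercased domain to an entity id.
    """
    addr = sender.strip().lower()
    if addr in known_emails:
        return ("exact_email", 1.0, known_emails[addr])
    domain = addr.rpartition("@")[2]
    if domain in known_domains:
        return ("domain_match", 0.6, known_domains[domain])
    return None
```

Keeping the confidence on the link row lets downstream retrieval filter on it, for example restricting a query to `exact_email` links when precision matters more than recall.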
|
||||
|
||||
**`audit_log` is *not* the interaction log.** It is mutation-only (≈26 write sites), logs no reads, and has no agent/actor dimension. Phase 0's "append-only interaction log of every agent action and every human touch" (Workstream A2) needs a **new** table, not a repurpose.
|
||||
|
||||
---
|
||||
|
||||
## 3. API / route surface
|
||||
|
||||
Full REST verbs exist (mutations are **not** tunneled through POST): `do_GET` (1589), `do_POST` (1727), `do_PUT` (1791), `do_PATCH` (1817), `do_DELETE` (1845), `do_OPTIONS` (1580), all in `backend/server.py`. Routing is a flat `if/elif` ladder: exact string for collections, `re.match(r'^/api/x/[^/]+$')` for items, path params parsed positionally with `path.split('/')`. The Gmail module hooks in at the top of `do_GET`/`do_POST` via `try_handle(self)` (`email_integration/routes.py:49`), claiming any `/api/email/*` path.
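The routing style described here (exact strings for collections, a regex for items, positional `split('/')` for path parameters) looks roughly like the sketch below. The handler names are placeholders, and the real ladder in `backend/server.py` is split across the `do_*` methods rather than living in one function.

```python
import re


def dispatch(method: str, path: str):
    """Return (handler_name, path_params), or None when nothing matches."""
    # Collections: exact string comparison.
    if path == "/api/contacts":
        if method == "GET":
            return ("list_contacts", ())
        if method == "POST":
            return ("create_contact", ())
    # Items: regex match, then pull the id out positionally.
    if re.match(r"^/api/contacts/[^/]+$", path):
        contact_id = path.split("/")[3]  # ['', 'api', 'contacts', '<id>']
        handlers = {"GET": "get_contact", "PUT": "update_contact", "DELETE": "delete_contact"}
        if method in handlers:
            return (handlers[method], (contact_id,))
    # Sub-resource under an item.
    if method == "PATCH" and re.match(r"^/api/opportunities/[^/]+/stage$", path):
        return ("move_stage", (path.split("/")[3],))
    return None
```

The ladder is easy to read but order-sensitive, and every new route is one more branch, which is worth keeping in mind before adding an MCP-facing surface to it.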
|
||||
|
||||
**Response envelope:** reads/writes → `{"data": …}` (+`total`/`limit`/`offset` for lists); errors → `{"error": msg}`; create → 201; auth → bare `{"token","user"}`; email handlers use ad-hoc keys. CORS allows all verbs; `Access-Control-Allow-Origin` echoes `CRM_CORS_ORIGIN` (default `*`).
|
||||
|
||||
**Auth column:** None = public · Bearer = any active user · Admin = `require_admin`.
|
||||
|
||||
| Method | Path | Purpose | Auth |
|
||||
|---|---|---|---|
|
||||
| GET | `/`, `/index.html`, `/assets/*` | Serve SPA + static | None |
|
||||
| GET | `/api/health` | Liveness | None |
|
||||
| GET | `/api/bootstrap/status` | First-run check | None |
|
||||
| POST | `/api/auth/login` | Login → JWT | None |
|
||||
| POST | `/api/auth/register` | First-user registration (self-disables) | None |
|
||||
| GET/POST | `/api/contacts` | List/search · Create | Bearer |
|
||||
| GET/PUT/DELETE | `/api/contacts/{id}` | Detail (dossier) · Update · **Hard delete** | Bearer |
|
||||
| GET | `/api/contacts/{id}/communications` | Per-contact interaction history | Bearer |
|
||||
| GET/POST · GET/PUT/DELETE | `/api/organizations[/{id}]` | Org CRUD | Bearer |
|
||||
| GET/POST · GET/PUT/DELETE | `/api/opportunities[/{id}]` | Opp CRUD | Bearer |
|
||||
| PATCH | `/api/opportunities/{id}/stage` | Move pipeline stage (validated) | Bearer |
|
||||
| GET/POST · GET/PUT/DELETE | `/api/communications[/{id}]` | Comms CRUD | Bearer |
|
||||
| GET/POST · GET/PUT | `/api/lp-profiles[/{id}]` | LP-profile CRUD (no delete route) | Bearer |
|
||||
| GET | `/api/reports/{dashboard,pipeline,lp-breakdown,activity}` | Aggregates | Bearer |
|
||||
| GET | `/api/export/contacts` | Export **all** contacts (returns JSON, not CSV) | Bearer |
|
||||
| POST | `/api/import/csv` | Bulk import from JSON rows | Bearer |
|
||||
| GET/POST · PATCH | `/api/feature-requests[/{id}]` | Feature-request tracker | Bearer |
|
||||
| GET | `/api/users` | List users (no hashes) | Bearer |
|
||||
| POST · PATCH | `/api/admin/users[/{id}]` | Create / update user | **Admin** |
|
||||
| POST | `/api/admin/reset-all-data` | ⚠️ Wipe CRM (confirm phrase `RESET ALL DATA`) | **Admin** |
|
||||
| GET | `/api/audit-log` | Mutation audit trail | **Admin** |
|
||||
| GET | `/api/security/status` | Config/security status | **Admin** |
|
||||
| GET/PUT | `/api/fundraising/state` | Get / save grid (optimistic `version`, 409 on conflict) | Bearer |
|
||||
| GET/POST | `/api/fundraising/collab/{state,heartbeat}` | Presence + cell locks | Bearer |
|
||||
| POST | `/api/fundraising/log-communication` | Log comm + append grid note | Bearer |
|
||||
| GET | `/api/fundraising/{export,relational-summary}` | Export / counts | Bearer |
|
||||
| GET | `/api/fundraising/activity` | Merged audit+automation+backup feed | **Admin** |
|
||||
| GET/PATCH | `/api/fundraising/automations[/{id}]` | Automation rules | **Admin** |
|
||||
| GET | `/api/fundraising/automation-runs` | Run history | **Admin** |
|
||||
| GET/POST | `/api/fundraising/{backups,backup,backup-verify}` | Backup mgmt | **Admin** |
|
||||
| POST | `/api/fundraising/{restore-preview,restore}` | Restore grid | **Admin** |
|
||||
| GET/PATCH | `/api/fundraising/backup-policy` | Backup policy | **Admin** |
|
||||
| GET | `/api/email/{status,accounts,threads}` | Sync status / accounts / matched threads | Bearer + flag |
|
||||
| GET | `/api/email/oauth/{start,callback}` | Per-user OAuth (callback is state-token gated, no Bearer) | mixed + flag |
|
||||
| POST | `/api/email/accounts/{enroll-all,enroll}` · `/sync/run-now` · `/rematch` | Enrollment & sync ops | **Admin** + flag |
|
||||
|
||||
**Defined but NOT routed:** the tag handlers (`handle_list_tags` and its create/delete counterparts) exist (`backend/server.py:3366-3400`) but no `/api/tags` route is wired; the `custom_fields` tables have no routes at all. Treat both as dead for Phase 0.
|
||||
|
||||
---
|
||||
|
||||
## 4. Authentication & authorization
|
||||
|
||||
- **Login:** `POST /api/auth/login` → `handle_login` (`backend/server.py:1880`). Looks up active user, `verify_password` (bcrypt, PBKDF2-SHA256 fallback, `backend/server.py:444`), issues `create_token` (HS256 JWT via PyJWT, HMAC fallback; claims `user_id/username/role/exp/iat`; 24 h expiry; `backend/server.py:464`).
|
||||
- **Per-request verification:** `get_user()` (`backend/server.py:1458`) reads `Authorization: Bearer`, `decode_token` (pins `algorithms=["HS256"]` — no `alg:none` downgrade), then **re-loads the user row** and rejects if missing/inactive. Identity (incl. role) comes from the **DB row, not token claims**, so deactivation and role changes take effect immediately.
|
||||
- **No cookies, no logout, no refresh, no revocation.** The only early kill-switch is `is_active=0`.
|
||||
- **Bootstrap:** `GET /api/bootstrap/status` (public) reports `setup_required`. `POST /api/auth/register` (public, **self-disables** once any user exists) creates and force-promotes the first user to `admin`. A separate flag-gated `seed_demo_data()` hardcodes `admin/admin123` + `grant/password` and prints them — **dev-only, off by default** (`backend/server.py:4351-4374`).
|
||||
- **Roles:** only `admin`/`member`. Enforcement is an inline `require_admin(user)` (`backend/server.py:541`) at the top of each admin handler — no middleware. **No row-level authorization:** any active member can read/edit *all* LP and fundraising data; `created_by`/`owner_id` are informational only.
|
||||
- **`X_API_KEY` does not exist in code.** Repo-wide it appears only in `CLAUDE.md` and `docs/PHASE_0.md`. There is no API-key header path and no service-auth distinct from the user JWT. (The `Bearer` tokens in `email_integration/` are *outbound* Google OAuth tokens, unrelated.)
|
||||
- **Secrets:** JWT key `CRM_SECRET_KEY` (random per-process default; hard-fails in production if unset, `backend/server.py:4487`). Gmail OAuth tokens are encrypted at rest with **AES-256-GCM** keyed off the *separate* `CRM_GMAIL_SECRET_KEY` (`email_integration/crypto.py:37`) — a genuinely sound scheme. The Google service-account key lives at `data/secrets/gmail-service-account.json` (0600).
|
||||
- **Hardening present:** per-IP sliding-window rate limits (login 20/min, writes 300/min, GETs 600/min); auto-ban of scanner IPs after a 404 burst (`record_404`, `backend/server.py:1520`). **Gaps:** no security headers (HSTS/CSP/X-Frame-Options); CORS defaults to wildcard `*`; `X-Forwarded-For` is trusted (only safe behind a controlled proxy).
|
||||
|
||||
---
|
||||
|
||||
## 5. Phase-0 implications (carry into A2/B/C)
|
||||
|
||||
1. **Pick a canonical LP identity.** The classic vs fundraising duality (§2.3) means entity resolution (A4/B3) must collapse across **both** models — grid investor + grid contacts + classic contact/org/lp_profile — into one canonical `lp_id`, not just dedupe name variants within one table. The email matcher's preference (`fundraising_contacts` > `contacts`) is a hint that the **grid is the operationally-live LP graph**, but closed financials/accreditation live only in `lp_profiles`.
|
||||
2. **Canonical ID host.** `contacts.id` is the natural join target (lp_profiles + opportunities FK to it; the dossier is keyed on it), but the **8-char truncated UUID** is a uniqueness risk if it becomes the Qdrant payload key — resolve before indexing.
|
||||
3. **Interaction log is greenfield.** `audit_log` won't do (mutation-only, no actor/agent dimension). A2 should add a dedicated append-only `interaction_log` table and route all MCP/agent writes through it (guardrail #5).
|
||||
4. **Soft-delete gap.** Everything is hard-deleted today (guardrail #3 risk) and there's no tombstone for an idempotent re-embed to detect removals. A2 should add a `deleted_at`/archive flag and have ingest treat it as a delete-from-index signal.
|
||||
5. **Chunk sources** (per `EMBEDDINGS.md`): one chunk per `communications` row; one per **matched** `emails` row (only these have bodies); split the `fundraising_investors.notes` outreach log per dated line; one chunk each for the scattered note/description fields. Keep ids/names/dates/types/confidence as filterable payload; embed only subject/body/note prose. `date_ts` = epoch of the **event** timestamp (`communication_date`/`sent_at`), not `created_at`. Key Qdrant point ids on stable ids (`emails.id`), namespaced to avoid the 8-char collision risk.
|
||||
6. **Migration mechanism.** No Alembic. Either extend the in-code idempotent pattern (`CREATE … IF NOT EXISTS` + try/except `ALTER ADD COLUMN`) or adopt the email module's numbered-`.sql` runner for the core schema (recommended; add a `schema_migrations` ledger). SQLite's `ALTER TABLE` is limited to renames, `ADD COLUMN`, and (since SQLite 3.35) `DROP COLUMN`; it cannot change a column's type or constraints in place, which conveniently nudges the schema toward the additive/reversible guardrail.
|
||||
7. **Ingest connection.** No network DB protocol: open `data/crm.db` **read-only** (`sqlite3` URI `mode=ro`), co-located with the `/data` volume, and set the same `busy_timeout`. WAL mode is persistent in the database file, so a read-only connection inherits it rather than setting it. Under WAL a reader does not block the live writer, though a long-running read transaction can delay checkpoints, so keep ingest reads short. Reserve HTTP+JWT for any write-back. Decide the MCP↔CRM auth boundary explicitly (recommend: read-only direct SQLite for ingest; a constrained service principal for write-back) since `X_API_KEY` is unimplemented.
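Items 5 and 7 can be sketched together: a read-only ingest connection plus collision-resistant point ids. Qdrant accepts unsigned integers or UUIDs as point ids, so deriving a UUIDv5 from `table:row_id:part` is one way to namespace the 8-char keys. The namespace string, function names, and `part` convention below are assumptions for illustration.

```python
import sqlite3
import uuid
from pathlib import Path

# Hypothetical namespace: any fixed UUID works as long as it never changes.
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "crm://ten31/chunks")


def open_readonly(db_path: str) -> sqlite3.Connection:
    """Open the CRM database read-only via a file: URI."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row
    # journal_mode is stored in the file, so only the timeout is set here.
    conn.execute("PRAGMA busy_timeout=5000")
    return conn


def point_id(table: str, row_id: str, part: int = 0) -> str:
    """Deterministic UUID for one chunk; `part` distinguishes split rows."""
    return str(uuid.uuid5(POINT_NAMESPACE, f"{table}:{row_id}:{part}"))
```

Because the id is a pure function of its inputs, re-running ingest upserts the same points instead of duplicating them, and two tables that happen to share an 8-char key still map to different points.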
|
||||
|
||||
---
|
||||
|
||||
## 6. Open questions for the owner
|
||||
|
||||
1. **Which model is canonical for an LP** — the `contacts` row or the `fundraising_investors` grid row? (Determines the ingest spine and the canonical-ID target.)
|
||||
2. **Is the Gmail integration enabled on the live Start9 box, and has a backfill run?** If not, the Phase-0 corpus is just `communications` + note fields until it is. (The email schema exists regardless.)
|
||||
3. **Should `custom_fields` (EAV) and the unrouted `tags` CRUD be revived or removed?** Leaving the EAV risks a second divergent custom-data path next to the live fundraising custom columns.
|
||||
4. **Accreditation today is a single boolean** (`lp_profiles.accredited`) with no QP flag / method / date — where should the 506(b)/506(c) + accreditation/QP fields counsel will require (guardrail #6) live?
|
||||
5. **MCP↔CRM auth:** build a real `X_API_KEY` service-key path, authenticate the MCP server as a dedicated CRM user (24 h JWT, must re-login), or read SQLite directly? And does the agent principal need a new least-privilege role below `member`?
|
||||
6. **`fundraising_state.grid_json` vs the normalized mirror** — confirmed authoritative = the JSON blob; is reading the normalized mirror (consistent after each save) acceptable for ingest, treating the JSON as the re-derivable source of truth?
|
||||
|
||||
---
|
||||
|
||||
*Sources: `backend/server.py`, `backend/email_integration/*`, `backend/email_integration/migrations/0001_email_tables.sql`, `start9/0.3.5/*`, `start9/0.4/*`. Generated from a structured multi-agent read of the codebase, cross-checked against the live `data/crm.db` schema (currently a near-empty seeded instance: 1 user, 9 funds, 4 views, 2 automation rules — the real corpus lives on the Start9 deployment).*
|
||||
@@ -0,0 +1,115 @@
|
||||
# Gmail Integration — Enablement Runbook
|
||||
|
||||
*How to turn on the (already-built) Gmail correspondence integration on the live Start9 box, validate it with a small observed backfill, then roll out to the domain. Read-only capture; all mail stays on Ten31 infrastructure.*
|
||||
|
||||
Code: `backend/email_integration/`. Schema: `migrations/0001_email_tables.sql`. See `docs/crm-overview.md` §2.4 for the data model.
|
||||
|
||||
---
|
||||
|
||||
## What this does & the sovereignty posture
|
||||
|
||||
- Pulls Gmail messages for enrolled `@ten31.xyz` mailboxes into the CRM's own SQLite DB (`emails`, `email_threads`, `email_attachments`, …), **deduped across inboxes**, **threaded**, and **matched** to investors/contacts (`email_investor_links`).
|
||||
- **Scope is `https://www.googleapis.com/auth/gmail.readonly`** (`credentials.py:34`) — the integration can *read* mail, never send or modify. Lower risk, and it's all the ingest needs.
|
||||
- **Data path is Google → your Start9 box only.** No new third party, and per guardrail #9 Claude never reads the mail — the correspondence becomes ingest input for *local* embeddings (bge-m3 on the Sparks), not API context. (Contrast with Superhuman's MCP — see §7.)
|
||||
|
||||
## 0. Pick the auth method
|
||||
|
||||
| Method | When | What you provide |
|
||||
|---|---|---|
|
||||
| **DWD (domain-wide delegation)** — recommended | You administer the `ten31.xyz` Google Workspace and want to capture team mailboxes without per-user consent | One service-account JSON key + a Workspace admin authorization |
|
||||
| **Per-user OAuth** | Capturing a mailbox you don't admin, or avoiding DWD | OAuth client id/secret + each user clicks through `/api/email/oauth/start` |
|
||||
|
||||
The Start9 0.4 entrypoint is built around **DWD** (auto-detects the key, sets `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`). The rest of this runbook assumes DWD.
|
||||
|
||||
## 1. Google-side setup (one time)

You need Workspace **super-admin** + a GCP project.

1. **GCP project** → enable the **Gmail API** (`APIs & Services → Library → Gmail API → Enable`).
2. **Create a service account** (`IAM & Admin → Service Accounts`). Note its **client ID** (a long number) and its email.
3. **Create a JSON key** for it (`Keys → Add key → JSON`). This file is the secret; handle it per guardrail #7.
4. **Authorize domain-wide delegation** in the Workspace **Admin console** (`Security → Access and data control → API controls → Domain-wide delegation → Add new`):
   - **Client ID** = the service account's client ID from step 2.
   - **OAuth scopes** = `https://www.googleapis.com/auth/gmail.readonly`
   - Save. (Without this exact scope authorized, sync returns a non-retryable auth error; see `errors.py:21`.)
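
To make the delegation above concrete, here is a minimal sketch of how a service account impersonates one mailbox with the read-only scope, using the public `google-auth` and `google-api-python-client` libraries. It is not taken from `credentials.py`; the function name `delegated_gmail_profile` is hypothetical, and the sketch assumes both libraries are installed on the machine where it runs.

```python
"""Sketch: read-only Gmail access through domain-wide delegation (DWD)."""

# The only scope this runbook authorizes in the Admin console.
GMAIL_READONLY = "https://www.googleapis.com/auth/gmail.readonly"


def delegated_gmail_profile(key_path: str, mailbox: str) -> dict:
    """Impersonate `mailbox` with the service account and return its Gmail profile.

    Hypothetical helper for a manual check. Imports are deferred so this
    module still loads where the Google client libraries are absent.
    """
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        key_path,
        scopes=[GMAIL_READONLY],
        subject=mailbox,  # the Workspace user impersonated under DWD
    )
    gmail = build("gmail", "v1", credentials=creds, cache_discovery=False)
    # users.getProfile returns emailAddress, messagesTotal, threadsTotal, historyId.
    return gmail.users().getProfile(userId="me").execute()
```

Calling `delegated_gmail_profile("/data/secrets/gmail-service-account.json", "you@ten31.xyz")` is a quick way to confirm the Admin-console authorization took effect: if the scope or client ID is wrong, the token request fails before any mail is read.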
## 2. Install the key on Start9

1. Copy the JSON key to the service's data volume at **`/data/secrets/gmail-service-account.json`**.
2. Lock it down: `chmod 600 /data/secrets/gmail-service-account.json` (the entrypoint also `chmod 700`s `/data/secrets`).
3. **Restart the service.** On boot the 0.4 entrypoint detects the key and exports:
   `CRM_GMAIL_INTEGRATION_ENABLED=true`, `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_SA_KEY_PATH=/data/secrets/gmail-service-account.json`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`, `CRM_GMAIL_SYNC_INTERVAL_MIN=180`.
   It logs `Gmail integration: ENABLED (key at …)`.
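
Before restarting, it can help to sanity-check the key file. The sketch below is a hypothetical preflight (`check_sa_key` is not part of the entrypoint): it checks that the file exists, that no group/other permission bits are set, and that the JSON carries the fields a Google service-account key normally has.

```python
import json
import os
import stat

# Fields present in a standard Google service-account JSON key.
REQUIRED_FIELDS = ("type", "client_email", "client_id", "private_key")


def check_sa_key(path: str) -> list:
    """Return a list of problems with the key file; an empty list means OK."""
    if not os.path.isfile(path):
        return [f"key not found at {path}"]
    problems = []
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:  # any group/other permission bit is set
        problems.append(f"permissions too open: {oct(mode)} (expected 0o600)")
    try:
        with open(path, encoding="utf-8") as fh:
            key = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return problems + [f"unreadable JSON: {exc}"]
    if not isinstance(key, dict):
        return problems + ["key file is not a JSON object"]
    missing = [name for name in REQUIRED_FIELDS if name not in key]
    if missing:
        problems.append("missing fields: " + ", ".join(missing))
    elif key["type"] != "service_account":
        problems.append(f"unexpected key type: {key['type']!r}")
    return problems
```

The `client_id` value in the file should be the same number entered in the Admin console in §1, step 4, so printing it during the check is a cheap way to catch a mismatched key.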
## 3. Smoke test: ONE mailbox first (the "don't rush it" gate)

Do a single-mailbox run before enrolling the whole team, to shake out auth/matching bugs on a small surface. All calls need an **admin Bearer token**:

```bash
# Replace the placeholders; the quotes keep the shell from treating < > as redirections.
CRM='https://<your-start9-crm-host>'   # the CRM's address

# curl flags: -s is quiet, -k skips TLS certificate verification.
TOKEN=$(curl -sk "$CRM/api/auth/login" -H 'Content-Type: application/json' \
  -d '{"username":"<admin>","password":"<pw>"}' \
  | python3 -c 'import sys,json;print(json.load(sys.stdin)["token"])')

# integration alive?
curl -sk "$CRM/api/email/status" -H "Authorization: Bearer $TOKEN"

# enroll just yourself
curl -sk "$CRM/api/email/accounts/enroll" -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' -d '{"email":"you@ten31.xyz"}'

# trigger a sync now (otherwise it runs every 180 min)
curl -sk "$CRM/api/email/sync/run-now" -X POST -H "Authorization: Bearer $TOKEN"
```

**Tip:** to keep the first backfill small, set `CRM_GMAIL_BACKFILL_PAGE_SIZE` low (e.g. `50`) before the restart, watch one page land, then raise it.
## 4. Verify (on the box, read-only SQL)

```sql
-- sync ran cleanly?
SELECT kind, status, messages_seen, messages_stored, attachments_saved, error
FROM email_sync_runs
ORDER BY started_at DESC
LIMIT 3;

-- mail captured + how much got matched to investors/contacts
SELECT COUNT(*) AS total, SUM(is_matched) AS matched FROM emails;

-- who did it match, and how confidently?
SELECT match_kind, COUNT(*) FROM email_investor_links GROUP BY match_kind;
```

Or via the API: `GET /api/email/status` (counts) and `GET /api/email/threads?investor_id=<id>` (matched threads for one investor). If matching looks thin, run `POST /api/email/rematch` with `{"since":"<ISO8601>"}` after the investor list is populated.
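
The two match queries can also be wrapped in a small script that reports a match rate. This is a sketch using only the standard-library `sqlite3` module; `match_summary` is a hypothetical helper, and it relies only on the columns named in the queries above.

```python
import sqlite3


def match_summary(conn: sqlite3.Connection) -> dict:
    """Summarize capture and match counts from the verification queries."""
    total, matched = conn.execute(
        # COALESCE keeps `matched` at 0 (not NULL) when the table is empty.
        "SELECT COUNT(*), COALESCE(SUM(is_matched), 0) FROM emails"
    ).fetchone()
    by_kind = dict(
        conn.execute(
            "SELECT match_kind, COUNT(*) FROM email_investor_links "
            "GROUP BY match_kind"
        ).fetchall()
    )
    rate = matched / total if total else 0.0
    return {"total": total, "matched": matched, "match_rate": rate, "by_kind": by_kind}
```

To keep the check read-only, the connection can be opened with SQLite's URI syntax, for example `sqlite3.connect("file:<path-to>/crm.db?mode=ro", uri=True)`, where the path is whatever location the CRM database has on the box.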
## 5. Roll out to the domain

Once the single mailbox looks right:

```bash
curl -sk "$CRM/api/email/accounts/enroll-all" -X POST -H "Authorization: Bearer $TOKEN"
curl -sk "$CRM/api/email/sync/run-now" -X POST -H "Authorization: Bearer $TOKEN"
```

Incremental sync then runs every `CRM_GMAIL_SYNC_INTERVAL_MIN` (default 180) via the scheduler thread.
## 6. Tuning knobs (env, `config.py`)

`CRM_GMAIL_SYNC_INTERVAL_MIN` (180) · `CRM_GMAIL_BACKFILL_PAGE_SIZE` (500) · `CRM_GMAIL_MAX_ATTACHMENT_MB` (50) · `CRM_GMAIL_ATTACH_CONCURRENCY` (4) · `CRM_GMAIL_RATE_UNITS_SEC` (150) · `CRM_GMAIL_HISTORY_STALE_DAYS` (5, forces a backfill if Gmail pruned history).
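
As an illustration of how these knobs resolve, the sketch below reads each variable from the environment and falls back to the default listed above. It is not the contents of `config.py`; `load_gmail_knobs` is hypothetical, and treating every knob as a positive integer is an assumption.

```python
import os

# Defaults as listed in this runbook.
DEFAULTS = {
    "CRM_GMAIL_SYNC_INTERVAL_MIN": 180,
    "CRM_GMAIL_BACKFILL_PAGE_SIZE": 500,
    "CRM_GMAIL_MAX_ATTACHMENT_MB": 50,
    "CRM_GMAIL_ATTACH_CONCURRENCY": 4,
    "CRM_GMAIL_RATE_UNITS_SEC": 150,
    "CRM_GMAIL_HISTORY_STALE_DAYS": 5,
}


def load_gmail_knobs(env=None) -> dict:
    """Resolve each knob from `env` (default: os.environ), else its default."""
    env = os.environ if env is None else env
    knobs = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        if raw is None or raw.strip() == "":
            knobs[name] = default  # unset or blank: keep the documented default
            continue
        try:
            value = int(raw)
        except ValueError:
            raise ValueError(f"{name} must be an integer, got {raw!r}") from None
        if value <= 0:  # assumption: no knob is meaningful at zero or below
            raise ValueError(f"{name} must be positive, got {value}")
        knobs[name] = value
    return knobs
```

For the small first backfill suggested in §3, `load_gmail_knobs({"CRM_GMAIL_BACKFILL_PAGE_SIZE": "50"})` would yield a page size of 50 with every other knob left at its default.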
## 7. Where Superhuman fits (and where it doesn't)

You have Superhuman connected to Gmail, and it exposes an MCP server. The two are **complementary, not competing**, and it matters which job each does:

- **Canonical correspondence ingest → use this DWD integration, not Superhuman.** It pulls mail straight into your own `crm.db` on Start9 and feeds the *local* embedding pipeline. Routing bulk ingest through Superhuman's MCP would put your email content through Superhuman's servers and, because an agent/Claude would be driving those calls, through Anthropic, which is exactly what guardrail #1 keeps the corpus away from. DWD keeps the data path Google → your box.
- **Human mail workflow & drafting → Superhuman MCP is great.** Reading/triaging your own inbox, and Closer-style *draft* generation that a human reviews and sends, naturally happen in your real mail client. The `batch-draft-writer` skill already drives the Superhuman MCP for that, and it's usable today, independent of the CRM pipeline.

Net: **DWD = system-of-record correspondence (sovereign, for retrieval). Superhuman MCP = the human's working surface (drafting, triage).** Don't make Superhuman the ingest source of truth.
## 8. Disable / rollback

Remove (or rename) `/data/secrets/gmail-service-account.json` and restart → the entrypoint logs `DISABLED` and routes return 503; captured data remains. To pause one mailbox without disabling the whole integration, set its `email_accounts.sync_enabled = 0`.
## 9. Troubleshooting

- **401/403 from Google on sync** → DWD scope not authorized, wrong client ID, or Gmail API not enabled (§1, steps 1 and 4). This error is non-retryable by design (`errors.py`).
- **`status` says disabled / routes 503** → key not found at `CRM_GMAIL_SA_KEY_PATH`, or `CRM_GMAIL_INTEGRATION_ENABLED` not truthy (the entrypoint only sets it when the key file exists).
- **Mail captured but `matched = 0`** → the investor/contact list was empty or addresses don't match; populate the CRM/grid first, then `POST /api/email/rematch`.
- **Bodies missing on some emails** → by design, unmatched emails are stored metadata-only (no body) until matched (`sync.py`); re-match to backfill.
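
The retryable versus non-retryable split mentioned in the first bullet can be pictured as follows. This is a sketch only, not the logic in `errors.py`: the runbook states that 401/403 are not retried, while the retryable set and the backoff policy below are assumptions chosen for illustration.

```python
import random

NON_RETRYABLE = frozenset({401, 403})             # per this runbook: auth failures
RETRYABLE = frozenset({429, 500, 502, 503, 504})  # assumption: usual transient statuses


def should_retry(status: int) -> bool:
    """True only for statuses treated as transient in this sketch."""
    return status in RETRYABLE


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Exponential backoff with full jitter for 0-based retry `attempt`."""
    return rng() * min(cap, base * (2 ** attempt))
```

Failing fast on 401/403 matters operationally: retrying cannot fix a missing Admin-console authorization, so the useful response is to surface the error and revisit §1.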
@@ -0,0 +1,88 @@
# Redaction / Re-hydration Boundary

*Design note for the privacy boundary between Ten31's sovereign data and the Claude API. Written in Phase 0 so it is a designed step, not an afterthought. **It is BUILT in Phase 2/3** (Analyst, Closer); Phase 0 does not need it (see §1).*

Related: `CLAUDE.md` guardrails #1 (sovereignty), #4 (human-in-the-loop outbound), #9 (dev-time data handling); `docs/Ten31_Agentic_Build_Plan.md` §4.5.

---
## 1. When this applies (and when it doesn't)

The system has two very different data paths, and only one of them needs this:

- **Local-only paths: no redaction needed, because Claude is never in the loop.** All of Phase 0 is here: ingest, chunking, embeddings (bge-m3), the vector index (Qdrant), and entity resolution (local Qwen). Sensitive data flows `CRM → Sparks → Qdrant`, all on Ten31 infrastructure. Nothing reaches Anthropic, so there is nothing to scrub. **Do not add redaction overhead to the ingest/retrieval path.**
- **Claude-facing reasoning steps: this boundary applies.** This covers any step where an agent asks Claude to *reason over* LP-specific content: **Analyst** (building a dossier from retrieved chunks), **Closer** (drafting outreach/nurture/meeting prep), and any Orchestrator step that forwards record content. These send sensitive context to a third-party API and are the reason this boundary exists.

The guiding rule (guardrail #1) is *"only the minimum necessary, non-sensitive context per call."* Redaction/re-hydration is how we honor that rule **when the task genuinely needs record content**, as opposed to simply sending less.
## 2. The three-tier data classification

Before any agent calls Claude, classify each piece of context:

| Tier | Examples | Treatment |
|---|---|---|
| **Tier 1: Never send** | Full LP list/export, bulk relationship graph, raw account numbers, wire details, SSNs/passport, anything covered by a confidentiality obligation | Stays on Ten31 infra. Not even tokenized, just excluded. |
| **Tier 2: Tokenize (pseudonymize)** | Person names, org/fund names, emails, phone, physical addresses, exact $ amounts, dates that pin identity | Replaced with stable placeholders before the call; real values swapped back locally after. |
| **Tier 3: Send as-is** | The *substance* an agent needs Claude to reason about: thesis discussion, sentiment, objections, generic deal mechanics, the drafted message body (minus identifiers) | Sent in the de-identified prompt. |

The art is the middle tier: keep enough semantic content for Claude to be useful, while every *identifier* is a placeholder.
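
One way to carry this classification through code is to tag every piece of context with its tier before prompt assembly. The sketch below is illustrative only; `Tier`, `ContextItem`, and `partition` are hypothetical names, and the tiers are numbered the way the rest of this note refers to them (Tier 1 = never send, Tier 2 = tokenize).

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    NEVER_SEND = 1   # excluded outright, not even tokenized
    TOKENIZE = 2     # identifiers replaced with placeholders
    SEND_AS_IS = 3   # substance Claude is asked to reason about


@dataclass(frozen=True)
class ContextItem:
    label: str
    text: str
    tier: Tier


def partition(items):
    """Drop Tier 1 items and split the rest into (to_tokenize, send_as_is)."""
    to_tokenize = [item for item in items if item.tier is Tier.TOKENIZE]
    send_as_is = [item for item in items if item.tier is Tier.SEND_AS_IS]
    return to_tokenize, send_as_is
```

Making the tier a required field means an unclassified item cannot be constructed at all, which is a simple guard against context slipping into a prompt by default.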
## 3. The round-trip

```
┌─────────────────────────── Ten31 infrastructure (sovereign) ───────────────────────────┐
│                                                                                        │
│  1. SCRUB (local model on the Sparks, via Spark Control)                               │
│     - Pull the minimal context the task needs (retrieved chunks + record fields).      │
│     - NER + rule pass replaces Tier-2 identifiers with stable tokens:                  │
│         "Jonathan Reyes"      -> [PERSON_1]                                            │
│         "Cedar Point Capital" -> [ORG_1]                                               │
│         "jon@cedarpoint..."   -> [EMAIL_1]                                             │
│         "$5,000,000"          -> [AMOUNT_1]                                            │
│         "Fund III"            -> [FUND_1]                                              │
│     - Tokens are STABLE within a task (same entity -> same token) and CONSISTENT       │
│       across all chunks in the call, so Claude can reason about relationships.         │
│     - The pseudonym map { [PERSON_1] -> "Jonathan Reyes", ... } is held LOCALLY,       │
│       keyed to the task/session. It never leaves the box.                              │
│     - Drop Tier-1 content entirely. Log the scrub to the interaction_log.              │
│                                                                                        │
└────────────────────────────────────────────┬───────────────────────────────────────────┘
                                             │  de-identified prompt (placeholders only)
                                             ▼
                      2. REASON: Claude API (Agent SDK)
                         Drafts / synthesizes using [PERSON_1], [ORG_1], [FUND_1] ...
                                             │  response referencing the same placeholders
┌────────────────────────────────────────────┴───────────────────────────────────────────┐
│                                                                                        │
│  3. RE-HYDRATE (local)                                                                 │
│     - Substitute real values back in using the local pseudonym map.                    │
│     - A human reviews the re-hydrated draft (guardrail #4) before anything is sent.    │
│     - Log the rehydrate + the human decision to the interaction_log.                   │
│                                                                                        │
└────────────────────────────────────────────────────────────────────────────────────────┘
```
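
The scrub and re-hydrate steps in the diagram can be sketched as a task-local pseudonymizer. This is a minimal illustration, not the planned implementation: the class name `Pseudonymizer` is hypothetical, known names are assumed to come from local entity resolution, and the two regexes stand in for the NER + rule pass, which would need to be far more thorough.

```python
import re

# Illustrative rule pass only; real scrubbing also needs NER for names and
# descriptive identifiers that regexes cannot catch.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("AMOUNT", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
]
TOKEN_RE = re.compile(r"\[[A-Z]+_\d+\]")


class Pseudonymizer:
    """Holds one task's pseudonym map; instances are meant to stay local."""

    def __init__(self, known_entities=None):
        # e.g. {"PERSON": ["..."], "ORG": ["..."]} from local entity resolution
        self.known = known_entities or {}
        self.token_for = {}  # real value -> token
        self.value_for = {}  # token -> real value
        self.counters = {}

    def _token(self, kind, value):
        # Same value always maps to the same token within this task.
        if value not in self.token_for:
            n = self.counters[kind] = self.counters.get(kind, 0) + 1
            token = f"[{kind}_{n}]"
            self.token_for[value] = token
            self.value_for[token] = value
        return self.token_for[value]

    def scrub(self, text):
        # Known names first, longest first so longer names are not split.
        for kind, names in self.known.items():
            for name in sorted(names, key=len, reverse=True):
                if name in text:
                    text = text.replace(name, self._token(kind, name))
        for kind, pattern in PATTERNS:
            text = pattern.sub(lambda m, k=kind: self._token(k, m.group(0)), text)
        return text

    def rehydrate(self, text):
        # Unknown tokens are left untouched rather than guessed.
        return TOKEN_RE.sub(lambda m: self.value_for.get(m.group(0), m.group(0)), text)
```

Reusing one `Pseudonymizer` for every chunk in a call is what gives the "same entity, same token" property; creating a fresh instance per task keeps maps from accumulating across tasks.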
## 4. Where it runs

The natural home is **Spark Control** (the gateway that already fronts the local models): add a scrub endpoint and a rehydrate endpoint, or run the pair as middleware inside the Agent SDK tool loop so every outbound Claude call passes through it. Keeping it at the gateway means:

- One enforcement point: agents can't accidentally bypass it.
- The pseudonym map lives next to the local models, never in agent memory that might get logged.
- The scrub uses the same local Qwen already used for entity resolution, so the NER is consistent with how entities were canonicalized at ingest (reuse the `canonical_entities` → token mapping).
## 5. Caveats (why this is a tool, not a magic switch)

- **Free-text leakage is the hard part.** A note that says *"the family that sold the mining company in Texas last year"* re-identifies even with the name tokenized. The scrub model must catch *descriptive* identifiers, not just named entities, and it will not be perfect. For high-sensitivity tasks, prefer sending *less* (summary/thesis only) over trusting the scrub to catch everything.
- **Re-identification by inference.** Enough tokenized-but-specific detail (amounts + dates + sector) can still single out a person. Keep Tier-2 amounts/dates *bucketed* ("~$5M", "Q1") when the exact value isn't needed.
- **Map integrity.** The pseudonym map is sensitive (it's the de-anonymization key): keep it local, in memory or short-lived, never logged to a third party, never sent in a prompt.
- **It does not replace minimization.** First ask "does Claude need this record content at all?" Often a retrieval summary suffices. Redaction is for when the answer is genuinely yes.
- **Consistency with retrieval.** Retrieval itself is already local, so chunks come back with real values; the scrub is applied at the *prompt-assembly* step, not at ingest.
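
The bucketing suggested above ("~$5M", "Q1") could look like the sketch below. The precision chosen here, one significant figure for amounts and calendar quarter for dates, is an assumption for illustration; the actual policy is an open question in this note, and both function names are hypothetical.

```python
import math
from datetime import date


def bucket_amount(dollars: float) -> str:
    """Coarsen a dollar amount to one significant figure, e.g. '~$5M'."""
    if dollars <= 0:
        raise ValueError("amount must be positive")
    exponent = math.floor(math.log10(dollars))
    rounded = round(dollars, -exponent)  # keep one significant figure
    for unit, size in (("B", 1e9), ("M", 1e6), ("K", 1e3)):
        if rounded >= size:
            return f"~${rounded / size:g}{unit}"
    return "<$1K"


def bucket_date(d: date, with_year: bool = False) -> str:
    """Coarsen a date to its calendar quarter, optionally with the year."""
    quarter = (d.month - 1) // 3 + 1
    return f"Q{quarter} {d.year}" if with_year else f"Q{quarter}"
```

Bucketing is lossy by design, so bucketed values are not re-hydrated back to exact figures; if the final draft needs the exact amount, it has to come from the local record at the human-review step.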
## 6. Verification (when we build it)

- A test harness that asserts no Tier-1 string and no real Tier-2 identifier appears in any outbound payload (golden-file diff over recorded prompts).
- A re-identification spot-check: have the local model attempt to re-identify entities from the de-identified prompt alone; flag anything it gets right.
- Every scrub/rehydrate logged to `interaction_log` (actor, task, token-count, what tier was dropped) for audit (guardrail #5).
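
The first check could start from something as small as the sketch below: a function that scans an outbound payload for any real value from the pseudonym map or any Tier-1 string. `find_leaks` is a hypothetical helper, and a case-insensitive substring scan is only a floor; it will not catch paraphrased or descriptive identifiers.

```python
def find_leaks(payload: str, pseudonym_map: dict, never_send: list) -> list:
    """Return every real identifier or Tier-1 string found in `payload`.

    `pseudonym_map` maps token -> real value, so its values are exactly the
    strings that must not appear in anything sent to a third party.
    """
    haystack = payload.casefold()
    sensitive = list(pseudonym_map.values()) + list(never_send)
    return sorted({s for s in sensitive if s and s.casefold() in haystack})
```

In a test suite this becomes one assertion per recorded prompt, `assert find_leaks(prompt, pmap, tier1) == []`, which pairs naturally with the golden-file diff described above.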
## 7. Open questions (resolve at build time, Phase 2/3)

1. Token granularity: per-task ephemeral maps, or a stable per-entity token space reused across tasks (better for Claude's cross-call memory, worse for re-identification risk)?
2. Do we tokenize the *drafted outbound message itself* (Closer) and re-hydrate, or draft against placeholders and let the human fill specifics? (Affects how much the human edits.)
3. Bucketing policy for amounts/dates: what precision is "non-sensitive enough"?
4. Where exactly in the Agent SDK loop the middleware sits, and how it composes with prompt caching (placeholders must be cache-stable).