Keysat c7ce44d963 Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server
Workstream A–C substrate for the Ten31 agentic system:
- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links,
  interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db,
  drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity
queries). Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:13:35 -05:00

# Phase 0 — Foundation: Kickoff Brief
**Goal:** stand up the data + retrieval substrate that every later agent depends on. By the end of Phase 0 we can ingest the CRM into a searchable index and query it through a CRM MCP server using multiple retrieval modes. **No outward-facing agent behavior in this phase.**
See `../CLAUDE.md` for settled architecture and non-negotiable guardrails. This brief assumes them.
## Prerequisites / inputs needed before building
- [ ] CRM source code present in this repo.
- [ ] `docs/EMBEDDINGS.md` read — it is the authoritative embedding/retrieval contract and ingest recipe. This brief defers to it for all vector mechanics.
- [ ] `.env` populated from `.env.example`: `ANTHROPIC_API_KEY`, `SPARK_CONTROL_URL` (gateway), `QDRANT_URL` (direct Qdrant on Spark 2 for ingest), `X_API_KEY`, CRM connection vars.
- [ ] Spark Control endpoints live (shipped in v0.15.0): `/v1/embeddings`, `/v1/rerank`, `/api/search`. Verify with `GET /api/endpoints`.
- [ ] `fastembed` installed in the ingest environment (client-side BM25 sparse vectors, model `Qdrant/bm25`).
- [ ] A small set of real example records + documents (a few LPs, a memo, a transcript) to test ingest and retrieval against.
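
A quick preflight can catch a half-filled `.env` before any ingest run. The sketch below checks only the four variables this brief names; the CRM connection vars are left out because their names are not fixed here, and the helper name `missing_env` is hypothetical.

```python
import os
from typing import Mapping

# Variable names come from the checklist above. CRM connection vars are
# project-specific and are intentionally not guessed at here.
REQUIRED_VARS = ("ANTHROPIC_API_KEY", "SPARK_CONTROL_URL", "QDRANT_URL", "X_API_KEY")


def missing_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank, in checklist order."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Typical use at the top of an ingest script:
#     missing = missing_env()
#     if missing:
#         raise SystemExit("missing env vars: " + ", ".join(missing))
```

Passing the environment in as a mapping keeps the check testable without touching the real process environment.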
## Workstream A — Understand the CRM, then extend the schema
1. Read the CRM code. Write `docs/crm-overview.md`: storage engine and where it runs, the data model (records, custom fields, and especially how notes/interactions/correspondence are stored), the API/route surface, and auth. Fill in the Conventions section and CRM env vars in `CLAUDE.md`.
2. Design and apply additive, reversible schema extensions:
   - LP/prospect fields: thesis fit, segment, accreditation/QP status, warmth score, source, owner, last-touch.
   - **Interaction log** table: append-only record of every agent action and every human touch (actor, action, target, timestamp, payload).
   - **Relationship graph** table: edges between people/entities with source + strength (seeded later from email/calendar + X follower overlap).
   - **Canonical entity IDs**: a canonical record each name variant resolves to (see Workstream B3).
*Guardrail: additive and reversible only; no destructive migrations; review before running.*
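
As an illustration of what "additive and reversible" can look like for the interaction log, here is a minimal sketch that assumes a SQLite-backed CRM (the real storage engine is whatever Workstream A1 documents). The column names follow the fields listed above; the trigger names and the `migrate` helper are hypothetical.

```python
import sqlite3

# UP only adds objects; it never alters or drops existing CRM tables.
UP = """
CREATE TABLE IF NOT EXISTS interaction_log (
    id      INTEGER PRIMARY KEY,
    actor   TEXT NOT NULL,                            -- agent name or human user
    action  TEXT NOT NULL,
    target  TEXT NOT NULL,                            -- e.g. a canonical entity id
    ts      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    payload TEXT NOT NULL DEFAULT '{}'                -- JSON-encoded details
);
CREATE TRIGGER IF NOT EXISTS interaction_log_no_update
BEFORE UPDATE ON interaction_log
BEGIN SELECT RAISE(ABORT, 'interaction_log is append-only'); END;
CREATE TRIGGER IF NOT EXISTS interaction_log_no_delete
BEFORE DELETE ON interaction_log
BEGIN SELECT RAISE(ABORT, 'interaction_log is append-only'); END;
"""

# DOWN removes only what UP created. It discards logged rows, so it belongs
# behind the same human review as any other migration.
DOWN = """
DROP TRIGGER IF EXISTS interaction_log_no_update;
DROP TRIGGER IF EXISTS interaction_log_no_delete;
DROP TABLE IF EXISTS interaction_log;
"""


def migrate(conn: sqlite3.Connection, script: str) -> None:
    """Apply one migration script and commit it."""
    conn.executescript(script)
    conn.commit()
```

The triggers enforce append-only at the database level, so a buggy agent write fails loudly instead of silently rewriting history. `IF NOT EXISTS` makes `UP` safe to re-run.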
## Workstream B — Ingest / sync pipeline
**Follow the ingest recipe in `docs/EMBEDDINGS.md` exactly** (collection schema, dual-vector upsert, FastEmbed BM25, payload fields). This brief covers only the CRM-specific decisions layered on top of it.
1. **Chunking** (a bigger quality lever than model choice). Per EMBEDDINGS.md: one chunk per email / note / transcript-turn; one chunk per memo *section*; never split mid-thought; keep entities + `date_ts` as filterable payload, not embedded text. The CRM-specific work is mapping *your* record types to chunks (how a thread, a note, a memo, a call transcript each decompose).
2. **Two vectors per chunk.** Dense via Spark Control `POST /v1/embeddings` (`BAAI/bge-m3`, 1024-d); sparse BM25 via FastEmbed (`Qdrant/bm25`) generated **client-side**. Create the `crm_chunks` collection with a named dense vector (1024, Cosine) + named sparse vector (`modifier: idf`), and payload indexes on `lp_id` (keyword), `doc_type` (keyword), `date_ts` (integer). Upsert both vectors + payload (`lp_id`, `lp_name`, `doc_type`, `date_ts`, `text`). **Ingest talks to Qdrant directly** (`QDRANT_URL`), not through the gateway.
3. **Entity resolution at ingest.** Resolve name variants ("John Smith" / "J. Smith" / "JS") to one canonical `lp_id` before storing, using a lightweight local-Qwen call via Spark Control `/v1/chat/completions`. Without it, retrieval fragments across variants. This is the same canonical-ID work as the canonical entity IDs in Workstream A2.
4. **Backfill + incremental sync, idempotent.** One-time full backfill, then incremental updates as records change. A full re-embed is ~8-15 min, so the index is cheaply rebuildable, but only if the pipeline is **idempotent and re-runnable** (Qdrant is never auto-restarted and holds the only live index). Prefer simple, correct, repeatable sync over premature optimization.
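
The collection shape and the idempotency requirement above can be sketched as plain request bodies for Qdrant's REST API. The vector names `dense` and `sparse`, the `Chunk` fields `source_id` and `seq`, and the UUID namespace are assumptions for illustration; `docs/EMBEDDINGS.md` remains authoritative. Embedding calls are left out: the vectors are passed in by the caller.

```python
import uuid
from dataclasses import dataclass

DENSE, SPARSE = "dense", "sparse"  # hypothetical vector names

# Body for Qdrant `PUT /collections/crm_chunks`.
COLLECTION_CONFIG = {
    "vectors": {DENSE: {"size": 1024, "distance": "Cosine"}},
    "sparse_vectors": {SPARSE: {"modifier": "idf"}},
}

# One body per Qdrant `PUT /collections/crm_chunks/index` call.
PAYLOAD_INDEXES = [
    {"field_name": "lp_id", "field_schema": "keyword"},
    {"field_name": "doc_type", "field_schema": "keyword"},
    {"field_name": "date_ts", "field_schema": "integer"},
]

# Fixed namespace so the same chunk always maps to the same point ID.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "crm_chunks")


@dataclass(frozen=True)
class Chunk:
    source_id: str  # stable CRM record id (assumed to exist)
    seq: int        # position of the chunk within its record
    lp_id: str
    lp_name: str
    doc_type: str
    date_ts: int
    text: str


def point_id(chunk: Chunk) -> str:
    """Deterministic UUID: re-ingesting a chunk overwrites it instead of duplicating it."""
    return str(uuid.uuid5(NAMESPACE, f"{chunk.doc_type}:{chunk.source_id}:{chunk.seq}"))


def build_point(chunk: Chunk, dense: list[float],
                sparse_indices: list[int], sparse_values: list[float]) -> dict:
    """Build one point for a Qdrant upsert, carrying both vectors and the payload."""
    if len(dense) != 1024:
        raise ValueError(f"expected a 1024-d dense vector, got {len(dense)}")
    return {
        "id": point_id(chunk),
        "vector": {
            DENSE: dense,
            SPARSE: {"indices": sparse_indices, "values": sparse_values},
        },
        "payload": {
            "lp_id": chunk.lp_id, "lp_name": chunk.lp_name,
            "doc_type": chunk.doc_type, "date_ts": chunk.date_ts, "text": chunk.text,
        },
    }
```

Deriving the ID from the record and position rather than from the text means an edited note replaces its old point. The trade-off is that a record which shrinks to fewer chunks leaves stale trailing points, so incremental sync would also need to delete points for positions that no longer exist.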
## Workstream C — CRM MCP server
Wrap the CRM behind an MCP server that agents (Claude Agent SDK) call. The retrieval modes are thin wrappers over Spark Control `POST /api/search` (which orchestrates embed → Qdrant RRF → rerank); for hybrid/keyword modes, generate the query's BM25 sparse vector with the **same** FastEmbed model used at ingest and pass it as `sparse`. Tools to expose:
- **Reads:** fetch record(s), list/search records by structured filters, fetch interaction history for an entity.
- **Retrieval modes** (all via `/api/search`, varying parameters):
  - `semantic_search`: dense only (omit `sparse`), large `top_k`/`retrieve_n`. High recall (Scout).
  - `hybrid_search`: pass `sparse` + `rerank: true` + `filter`. Dense+BM25 RRF then rerank. **Default**, best for entity-heavy queries.
  - `keyword_search`: high-precision lexical ("did we ever discuss X with this LP?"); lean on the sparse leg + rerank. For a pure-BM25 result, query Qdrant's sparse vector directly via `QDRANT_URL`.
  - Always support a Qdrant `filter` (e.g. `lp_id`, `date_ts` range) to pre-filter before scoring.
- **Writes:** create/update records, append to the interaction log, write enrichment results back. (No outbound/contact actions — Phase 3, gated.)
Keep retrieval modes as small, composable functions so the orchestrator can select among them later.
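
One way to keep the modes small and composable is to have each one only build a request body and leave the HTTP call to a shared client. In the sketch below, `sparse`, `rerank`, `filter`, `top_k`, and `retrieve_n` are the parameters named above; the `query` field, the default sizes, and the `sparse` vector name are assumptions, and the pure-BM25 body assumes a Qdrant version that supports the Query API (`POST /collections/crm_chunks/points/query`).

```python
from typing import Any, Optional


def qdrant_filter(lp_id: Optional[str] = None, date_from: Optional[int] = None,
                  date_to: Optional[int] = None) -> Optional[dict]:
    """Build a Qdrant filter on lp_id and an inclusive date_ts range, or None if unconstrained."""
    must: list[dict] = []
    if lp_id is not None:
        must.append({"key": "lp_id", "match": {"value": lp_id}})
    date_range = {}
    if date_from is not None:
        date_range["gte"] = date_from
    if date_to is not None:
        date_range["lte"] = date_to
    if date_range:
        must.append({"key": "date_ts", "range": date_range})
    return {"must": must} if must else None


def _body(query: str, **opts: Any) -> dict:
    """Shared body builder; options left as None are omitted from the request."""
    body: dict = {"query": query}  # `query` field name is an assumption
    body.update({k: v for k, v in opts.items() if v is not None})
    return body


def semantic_search(query: str, *, flt: Optional[dict] = None,
                    top_k: int = 50, retrieve_n: int = 200) -> dict:
    """Dense only: no `sparse` key, wide candidate pool (placeholder sizes)."""
    return _body(query, filter=flt, top_k=top_k, retrieve_n=retrieve_n)


def hybrid_search(query: str, sparse: dict, *, flt: Optional[dict] = None,
                  top_k: int = 10) -> dict:
    """Dense + BM25 with rerank; `sparse` must come from the ingest-time FastEmbed model."""
    return _body(query, sparse=sparse, rerank=True, filter=flt, top_k=top_k)


def pure_bm25_query(sparse: dict, *, flt: Optional[dict] = None, limit: int = 10) -> dict:
    """Body for a direct Qdrant sparse-only query, bypassing the gateway."""
    body = {"query": sparse, "using": "sparse", "limit": limit, "with_payload": True}
    if flt is not None:
        body["filter"] = flt
    return body
```

For example, `hybrid_search("Q3 allocation", sparse_vec, flt=qdrant_filter(lp_id="lp_1"))` yields a body ready to post to `/api/search`, where `sparse_vec` is an `{"indices": [...], "values": [...]}` mapping. Because the builders are pure functions, an orchestrator can choose a mode without knowing anything about transport.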
## Workstream D — Compliance groundwork (parallel, non-blocking)
Engage counsel to define solicitation posture (506(b) vs 506(c)), accreditation/QP verification, and recordkeeping rules. This does not block Workstreams A–C, but it must be resolved before any Phase 3 outbound work begins.
## Acceptance criteria
- [ ] CRM schema extended (additive, reversible) and documented.
- [ ] `crm_chunks` collection created per EMBEDDINGS.md (named dense + sparse vectors, payload indexes).
- [ ] Full CRM corpus backfilled into Qdrant with **both** dense + BM25 sparse vectors and payload; incremental sync working and idempotent.
- [ ] Entity resolution collapses obvious name variants to one canonical `lp_id`.
- [ ] CRM MCP server live; retrieval modes wrap `/api/search`; hybrid (sparse + rerank) visibly beats dense-only on entity-heavy test queries (e.g. exact fund/LP name lookups).
- [ ] Every write goes through the interaction log.
- [ ] No outward-facing behavior exists anywhere in the codebase.
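
To make "visibly beats dense-only" checkable rather than anecdotal, the comparison can be reduced to a hit rate over a fixed set of entity-heavy test queries. This is a minimal sketch: the metric choice (hit rate at `k`) and the function name are assumptions, and the inputs are whatever `lp_id` lists the two retrieval modes return for each query.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose expected lp_id appears in the top-k returned lp_ids.

    results:  query text -> ranked lp_ids from one retrieval mode
    expected: query text -> the lp_id a correct answer must surface
    """
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits = sum(1 for query, lp_id in expected.items()
               if lp_id in results.get(query, [])[:k])
    return hits / len(expected)


# Usage: run the same queries through both modes, then compare.
#     dense_score  = hit_rate_at_k(dense_results, expected)
#     hybrid_score = hit_rate_at_k(hybrid_results, expected)
#     assert hybrid_score > dense_score
```

Keeping the query set and expected IDs in the synthetic test data makes the criterion repeatable across re-ingests.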
## Explicitly out of scope for Phase 0
- Any agent that sends email, posts, or contacts a prospect.
- Scout monitoring, Closer outreach, Scribe publishing, the Orchestrator.
- The workflow/scheduling engine (revisited at Phase 3).
## Suggested order
A1 → A2 → B1 → B2 → B3 → B4 → C. Start D in parallel.