Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server
Workstream A–C substrate for the Ten31 agentic system:
- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links, interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes); no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db, drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity queries). Real backfill runs on Ten31 infra; index holds synthetic data only. Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,71 @@
# Phase 0 — Foundation: Kickoff Brief
**Goal:** stand up the data + retrieval substrate that every later agent depends on. By the end of Phase 0 we can ingest the CRM into a searchable index and query it through a CRM MCP server using multiple retrieval modes. **No outward-facing agent behavior in this phase.**
See `../CLAUDE.md` for settled architecture and non-negotiable guardrails. This brief assumes them.
## Prerequisites / inputs needed before building
- [ ] CRM source code present in this repo.
- [ ] `docs/EMBEDDINGS.md` read — it is the authoritative embedding/retrieval contract and ingest recipe. This brief defers to it for all vector mechanics.
- [ ] `.env` populated from `.env.example`: `ANTHROPIC_API_KEY`, `SPARK_CONTROL_URL` (gateway), `QDRANT_URL` (direct Qdrant on Spark 2 for ingest), `X_API_KEY`, CRM connection vars.
- [ ] Spark Control endpoints live (shipped in v0.15.0): `/v1/embeddings`, `/v1/rerank`, `/api/search`. Verify with `GET /api/endpoints`.
- [ ] `fastembed` installed in the ingest environment (client-side BM25 sparse vectors, model `Qdrant/bm25`).
- [ ] A small set of real example records + documents (a few LPs, a memo, a transcript) to test ingest and retrieval against.
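A quick preflight can catch a half-filled `.env` before any ingest run. The sketch below is one hypothetical way to do that: the variable names come from the list above, the CRM connection variables are left out because the brief does not name them, and `missing_env` is an illustrative helper rather than something in the repo.

```python
import os

# Names taken from the prerequisites list; CRM connection vars are not
# named in the brief, so they are deliberately absent here.
REQUIRED_VARS = ("ANTHROPIC_API_KEY", "SPARK_CONTROL_URL", "QDRANT_URL", "X_API_KEY")


def missing_env(env=None, required=REQUIRED_VARS):
    """Return the required variables that are unset or blank, in list order."""
    env = os.environ if env is None else env
    return [name for name in required if not str(env.get(name, "")).strip()]
```

Calling `missing_env()` with no arguments inspects the live process environment; an empty list means all four variables are set.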
## Workstream A — Understand the CRM, then extend the schema
1. Read the CRM code. Write `docs/crm-overview.md`: storage engine and where it runs, the data model (records, custom fields, and especially how notes/interactions/correspondence are stored), the API/route surface, and auth. Fill in the Conventions section and CRM env vars in `CLAUDE.md`.
2. Design and apply additive, reversible schema extensions:
- LP/prospect fields: thesis fit, segment, accreditation/QP status, warmth score, source, owner, last-touch.
- **Interaction log** table: append-only record of every agent action and every human touch (actor, action, target, timestamp, payload).
- **Relationship graph** table: edges between people/entities with source + strength (seeded later from email/calendar + X follower overlap).
- **Canonical entity IDs**: a canonical record each name variant resolves to (see Workstream B3).
*Guardrail: additive and reversible only; no destructive migrations; review before running.*
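To make "additive and reversible" concrete, here is a minimal sketch that assumes the CRM is backed by SQLite; the real storage engine is whatever `docs/crm-overview.md` establishes. The table names follow the brief, but every column, the `schema_migrations` ledger, and the migration ID are hypothetical. The LP/prospect fields are left out because they depend on the existing record table, which this brief does not describe.

```python
import sqlite3

MIGRATION_ID = "0001_phase0_core"  # hypothetical ledger key

# UP only creates new objects, so existing CRM tables are untouched.
UP = """
CREATE TABLE IF NOT EXISTS canonical_entities (
    id INTEGER PRIMARY KEY,
    display_name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS interaction_log (
    id INTEGER PRIMARY KEY,
    actor TEXT NOT NULL,
    action TEXT NOT NULL,
    target TEXT NOT NULL,
    ts TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    payload TEXT
);
CREATE TABLE IF NOT EXISTS relationship_edges (
    id INTEGER PRIMARY KEY,
    from_entity INTEGER NOT NULL REFERENCES canonical_entities(id),
    to_entity INTEGER NOT NULL REFERENCES canonical_entities(id),
    source TEXT NOT NULL,
    strength REAL NOT NULL DEFAULT 0
);
CREATE TRIGGER IF NOT EXISTS interaction_log_no_update
BEFORE UPDATE ON interaction_log
BEGIN SELECT RAISE(ABORT, 'interaction_log is append-only'); END;
CREATE TRIGGER IF NOT EXISTS interaction_log_no_delete
BEFORE DELETE ON interaction_log
BEGIN SELECT RAISE(ABORT, 'interaction_log is append-only'); END;
"""

# DOWN removes only what UP created (dropping a table drops its triggers).
DOWN = """
DROP TABLE IF EXISTS relationship_edges;
DROP TABLE IF EXISTS interaction_log;
DROP TABLE IF EXISTS canonical_entities;
"""


def _ensure_ledger(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "id TEXT PRIMARY KEY, "
        "applied_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )


def apply(conn):
    """Run UP once; return False when the ledger shows it already ran."""
    _ensure_ledger(conn)
    row = conn.execute(
        "SELECT 1 FROM schema_migrations WHERE id = ?", (MIGRATION_ID,)
    ).fetchone()
    if row:
        return False
    conn.executescript(UP)
    conn.execute("INSERT INTO schema_migrations (id) VALUES (?)", (MIGRATION_ID,))
    conn.commit()
    return True


def rollback(conn):
    """Run DOWN and clear the ledger entry so apply() can run again."""
    _ensure_ledger(conn)
    conn.executescript(DOWN)
    conn.execute("DELETE FROM schema_migrations WHERE id = ?", (MIGRATION_ID,))
    conn.commit()
```

The two triggers are one possible way to enforce the append-only property at the database level instead of relying on application discipline.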
## Workstream B — Ingest / sync pipeline
**Follow the ingest recipe in `docs/EMBEDDINGS.md` exactly** (collection schema, dual-vector upsert, FastEmbed BM25, payload fields). This brief covers only the CRM-specific decisions layered on top of it.
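For orientation, the collection schema that step 2 of this workstream describes could be expressed as Qdrant REST request bodies roughly as follows. This is a sketch, not the recipe itself: the vector names `dense` and `sparse` are assumptions, and `docs/EMBEDDINGS.md` remains the authority on the actual names and any extra settings.

```python
COLLECTION = "crm_chunks"
DENSE_NAME = "dense"    # assumed vector name; use the one in docs/EMBEDDINGS.md
SPARSE_NAME = "sparse"  # assumed vector name


def collection_body():
    """JSON body for Qdrant's `PUT /collections/crm_chunks`."""
    return {
        "vectors": {DENSE_NAME: {"size": 1024, "distance": "Cosine"}},
        "sparse_vectors": {SPARSE_NAME: {"modifier": "idf"}},
    }


def payload_index_bodies():
    """JSON bodies for `PUT /collections/crm_chunks/index`, one per field."""
    fields = {"lp_id": "keyword", "doc_type": "keyword", "date_ts": "integer"}
    return [
        {"field_name": name, "field_schema": schema}
        for name, schema in fields.items()
    ]
```

Keeping these as plain dictionaries makes the schema easy to diff against EMBEDDINGS.md before anything is sent to `QDRANT_URL`.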
1. **Chunking** (a bigger quality lever than model choice). Per EMBEDDINGS.md: one chunk per email / note / transcript-turn; one chunk per memo *section*; never split mid-thought; keep entities + `date_ts` as filterable payload, not embedded text. The CRM-specific work is mapping *your* record types to chunks (how a thread, a note, a memo, a call transcript each decompose).
2. **Two vectors per chunk.** Dense via Spark Control `POST /v1/embeddings` (`BAAI/bge-m3`, 1024-d); sparse BM25 via FastEmbed (`Qdrant/bm25`) generated **client-side**. Create the `crm_chunks` collection with a named dense vector (1024, Cosine) + named sparse vector (`modifier: idf`), and payload indexes on `lp_id` (keyword), `doc_type` (keyword), `date_ts` (integer). Upsert both vectors + payload (`lp_id`, `lp_name`, `doc_type`, `date_ts`, `text`). **Ingest talks to Qdrant directly** (`QDRANT_URL`), not through the gateway.
3. **Entity resolution at ingest.** Resolve name variants ("John Smith" / "J. Smith" / "JS") to one canonical `lp_id` before storing, using a lightweight local-Qwen call via Spark Control `/v1/chat/completions`. Without it, retrieval fragments across variants. This is the same canonical-ID work as the canonical entity IDs item in Workstream A2.
4. **Backfill + incremental sync, idempotent.** One-time full backfill, then incremental updates as records change. A full re-embed is ~8–15 min, so the index is cheaply rebuildable — but only if the pipeline is **idempotent and re-runnable** (Qdrant is never auto-restarted and holds the only live index). Prefer simple, correct, repeatable sync over premature optimization.
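One common way to get the idempotence that step 4 asks for is to derive each point ID deterministically from the source record, so a re-run overwrites rather than duplicates. The sketch below combines that with the chunking rule from step 1. It is illustrative only: the record shape (`record_id`, `date`, `text`), the heading-based memo split, and the namespace are assumptions, and the dense and sparse vectors would be attached afterwards per EMBEDDINGS.md.

```python
import re
import uuid
from datetime import datetime, timezone

# Fixed namespace for point IDs (an arbitrary choice for this sketch).
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "crm_chunks")


def split_memo(text):
    """One chunk per memo section, assuming sections start with '#' headings."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]


def to_points(record):
    """Map one CRM record to point dicts; vectors are attached in a later step."""
    if record["doc_type"] == "memo":
        texts = split_memo(record["text"])
    else:
        # Emails, notes and single transcript turns stay whole.
        texts = [record["text"].strip()]
    # Assumes `date` is an ISO string without an offset, interpreted as UTC.
    day = datetime.fromisoformat(record["date"]).replace(tzinfo=timezone.utc)
    points = []
    for position, text in enumerate(texts):
        # Same record and position always give the same ID, so re-running
        # the ingest overwrites points instead of duplicating them.
        point_id = str(uuid.uuid5(NAMESPACE, f"{record['record_id']}:{position}"))
        points.append({
            "id": point_id,
            "payload": {
                "lp_id": record["lp_id"],
                "lp_name": record["lp_name"],
                "doc_type": record["doc_type"],
                "date_ts": int(day.timestamp()),
                "text": text,
            },
        })
    return points
```

One caveat with position-based IDs: if a memo loses a section between runs, the old trailing point stays in the index unless the sync also deletes points for that record before upserting.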
## Workstream C — CRM MCP server
Wrap the CRM behind an MCP server that agents (Claude Agent SDK) call. The retrieval modes are thin wrappers over Spark Control `POST /api/search` (which orchestrates embed → Qdrant RRF → rerank); for hybrid/keyword modes, generate the query's BM25 sparse vector with the **same** FastEmbed model used at ingest and pass it as `sparse`. Tools to expose:
- **Reads:** fetch record(s), list/search records by structured filters, fetch interaction history for an entity.
- **Retrieval modes** (all via `/api/search`, varying parameters):
- `semantic_search` — dense only (omit `sparse`), large `top_k`/`retrieve_n`. High recall (Scout).
- `hybrid_search` — pass `sparse` + `rerank: true` + `filter`. Dense+BM25 RRF then rerank. **Default**, best for entity-heavy queries.
- `keyword_search` — high-precision lexical ("did we ever discuss X with this LP?"): lean on the sparse leg + rerank. For a pure-BM25 result, query Qdrant's sparse vector directly via `QDRANT_URL`.
- Always support a Qdrant `filter` (e.g. `lp_id`, `date_ts` range) to pre-filter before scoring.
- **Writes:** create/update records, append to the interaction log, write enrichment results back. (No outbound/contact actions — Phase 3, gated.)
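For the write tools, one way to guarantee that a change and its log entry never drift apart is to commit both in a single transaction. The sketch below assumes a SQLite-backed CRM with a hypothetical `records` table and an `interaction_log` table shaped like the one described in Workstream A; the column allow-list and the `update_record` action name are illustrative.

```python
import json
import sqlite3

# Hypothetical allow-list; column names are interpolated into SQL, so they
# must never come straight from tool input.
WRITABLE_COLUMNS = {"owner", "segment", "warmth_score"}


def logged_update(conn, actor, record_id, fields):
    """Update one record and append the matching log row atomically."""
    unknown = set(fields) - WRITABLE_COLUMNS
    if not fields or unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    assignments = ", ".join(f"{col} = ?" for col in fields)
    with conn:  # commits on success, rolls back both statements on error
        cur = conn.execute(
            f"UPDATE records SET {assignments} WHERE id = ?",
            (*fields.values(), record_id),
        )
        if cur.rowcount != 1:
            raise LookupError(f"record {record_id} not found")
        conn.execute(
            "INSERT INTO interaction_log (actor, action, target, payload) "
            "VALUES (?, ?, ?, ?)",
            (actor, "update_record", str(record_id), json.dumps(fields)),
        )
```

If every MCP write tool funnels through a helper like this, a failed update leaves no log row behind and a logged action always corresponds to a committed change.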
Keep retrieval modes as small, composable functions so the orchestrator can select among them later.
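A minimal sketch of what "small, composable" might look like: each mode is a pure function that builds the request body for `POST /api/search`, with the HTTP call kept elsewhere. The parameter names `sparse`, `rerank`, `filter`, `top_k`, and `retrieve_n` come from this brief; the `query` field name, the default sizes, and the `MODES` registry are assumptions. `keyword_search` is omitted because the brief does not specify how it differs at the request level.

```python
def lp_filter(lp_id, since_ts=None, until_ts=None):
    """Qdrant-style filter on `lp_id`, optionally bounded by a `date_ts` range."""
    must = [{"key": "lp_id", "match": {"value": lp_id}}]
    date_range = {}
    if since_ts is not None:
        date_range["gte"] = since_ts
    if until_ts is not None:
        date_range["lte"] = until_ts
    if date_range:
        must.append({"key": "date_ts", "range": date_range})
    return {"must": must}


def _base(query, top_k, flt):
    body = {"query": query, "top_k": top_k}  # "query" field name is assumed
    if flt is not None:
        body["filter"] = flt
    return body


def semantic_search(query, top_k=50, retrieve_n=200, flt=None):
    """Dense only: no `sparse` key, generous candidate counts for recall."""
    body = _base(query, top_k, flt)
    body["retrieve_n"] = retrieve_n
    return body


def hybrid_search(query, sparse, top_k=10, flt=None):
    """Dense + BM25 with rerank; `sparse` is the query's FastEmbed vector."""
    body = _base(query, top_k, flt)
    body["sparse"] = sparse
    body["rerank"] = True
    return body


# Registry an orchestrator could select from by name.
MODES = {"semantic": semantic_search, "hybrid": hybrid_search}
```

Because the builders only return dictionaries, they can be unit-tested without a live gateway and combined freely with `lp_filter`.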
## Workstream D — Compliance groundwork (parallel, non-blocking)
Engage counsel to define solicitation posture (506(b) vs 506(c)), accreditation/QP verification, and recordkeeping rules. This does not block A–C, but it must be resolved before any Phase 3 outbound work begins.
## Acceptance criteria
- [ ] CRM schema extended (additive, reversible) and documented.
- [ ] `crm_chunks` collection created per EMBEDDINGS.md (named dense + sparse vectors, payload indexes).
- [ ] Full CRM corpus backfilled into Qdrant with **both** dense + BM25 sparse vectors and payload; incremental sync working and idempotent.
- [ ] Entity resolution collapses obvious name variants to one canonical `lp_id`.
- [ ] CRM MCP server live; retrieval modes wrap `/api/search`; hybrid (sparse + rerank) visibly beats dense-only on entity-heavy test queries (e.g. exact fund/LP name lookups).
- [ ] Every write goes through the interaction log.
- [ ] No outward-facing behavior exists anywhere in the codebase.
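To turn "visibly beats" in the retrieval criterion into something checkable, one option is to score each mode on the same labelled test queries with mean reciprocal rank. This is a suggested metric, not one the brief prescribes, and the function names are illustrative.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant hit, or 0.0 if none was returned."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
```

Running the entity-heavy query set once through the dense-only mode and once through the hybrid mode, then comparing the two scores, gives a repeatable version of the acceptance check.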
## Explicitly out of scope for Phase 0
- Any agent that sends email, posts, or contacts a prospect.
- Scout monitoring, Closer outreach, Scribe publishing, the Orchestrator.
- The workflow/scheduling engine (revisited at Phase 3).
## Suggested order
A1 → A2 → B1 → B2 → B3 → B4 → C. Start D in parallel.