ten31-database/docs/PHASE_0.md
Keysat c7ce44d963 Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server
Workstream A–C substrate for the Ten31 agentic system:
- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links,
  interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db,
  drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity
queries). Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:13:35 -05:00


Phase 0 — Foundation: Kickoff Brief

Goal: stand up the data + retrieval substrate that every later agent depends on. By the end of Phase 0 we can ingest the CRM into a searchable index and query it through a CRM MCP server using multiple retrieval modes. No outward-facing agent behavior in this phase.

See ../CLAUDE.md for settled architecture and non-negotiable guardrails. This brief assumes them.

Prerequisites / inputs needed before building

  • CRM source code present in this repo.
  • docs/EMBEDDINGS.md read — it is the authoritative embedding/retrieval contract and ingest recipe. This brief defers to it for all vector mechanics.
  • .env populated from .env.example: ANTHROPIC_API_KEY, SPARK_CONTROL_URL (gateway), QDRANT_URL (direct Qdrant on Spark 2 for ingest), X_API_KEY, CRM connection vars.
  • Spark Control endpoints live (shipped in v0.15.0): /v1/embeddings, /v1/rerank, /api/search. Verify with GET /api/endpoints.
  • fastembed installed in the ingest environment (client-side BM25 sparse vectors, model Qdrant/bm25).
  • A small set of real example records + documents (a few LPs, a memo, a transcript) to test ingest and retrieval against.

Workstream A — Understand the CRM, then extend the schema

  1. Read the CRM code. Write docs/crm-overview.md: storage engine and where it runs, the data model (records, custom fields, and especially how notes/interactions/correspondence are stored), the API/route surface, and auth. Fill in the Conventions section and CRM env vars in CLAUDE.md.

  2. Design and apply additive, reversible schema extensions:

    • LP/prospect fields: thesis fit, segment, accreditation/QP status, warmth score, source, owner, last-touch.
    • Interaction log table: append-only record of every agent action and every human touch (actor, action, target, timestamp, payload).
    • Relationship graph table: edges between people/entities with source + strength (seeded later from email/calendar + X follower overlap).
    • Canonical entity IDs: one canonical record that every name variant resolves to (see Workstream B3).

    Guardrail: additive and reversible only; no destructive migrations; review before running.
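
As an illustration of "additive and reversible", the sketch below shows one way a ledgered migration runner could look. It assumes a SQLite store and uses hypothetical table, column, and migration names; the real engine and schema are whatever docs/crm-overview.md documents. Each migration only adds objects, and its `down` step removes only what its own `up` step created.

```python
import sqlite3

# Hypothetical migrations: names and columns are illustrative, not the real schema.
MIGRATIONS = [
    {
        "id": "0001_interaction_log",
        "up": """
            CREATE TABLE IF NOT EXISTS interaction_log (
                id INTEGER PRIMARY KEY,
                actor TEXT NOT NULL,
                action TEXT NOT NULL,
                target TEXT NOT NULL,
                ts INTEGER NOT NULL,
                payload TEXT
            );
        """,
        "down": "DROP TABLE IF EXISTS interaction_log;",
    },
    {
        "id": "0002_relationship_edges",
        "up": """
            CREATE TABLE IF NOT EXISTS relationship_edges (
                id INTEGER PRIMARY KEY,
                from_entity TEXT NOT NULL,
                to_entity TEXT NOT NULL,
                source TEXT NOT NULL,
                strength REAL NOT NULL
            );
        """,
        "down": "DROP TABLE IF EXISTS relationship_edges;",
    },
]


def ensure_ledger(conn):
    # The ledger records which migrations have been applied.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "id TEXT PRIMARY KEY, applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )


def applied_ids(conn):
    return {row[0] for row in conn.execute("SELECT id FROM schema_migrations")}


def migrate_up(conn):
    """Apply every migration not yet in the ledger; return the ids that ran."""
    ensure_ledger(conn)
    done = applied_ids(conn)
    ran = []
    for m in MIGRATIONS:
        if m["id"] in done:
            continue  # already applied, so re-running is a no-op
        conn.executescript(m["up"])
        conn.execute("INSERT INTO schema_migrations (id) VALUES (?)", (m["id"],))
        conn.commit()
        ran.append(m["id"])
    return ran


def migrate_down(conn, migration_id):
    """Reverse one migration: drop only what its 'up' step added."""
    m = next(m for m in MIGRATIONS if m["id"] == migration_id)
    conn.executescript(m["down"])
    conn.execute("DELETE FROM schema_migrations WHERE id = ?", (migration_id,))
    conn.commit()
```

Running `migrate_up` twice against the same database applies nothing the second time, which is what makes the runner safe to re-run after review.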

Workstream B — Ingest / sync pipeline

Follow the ingest recipe in docs/EMBEDDINGS.md exactly (collection schema, dual-vector upsert, FastEmbed BM25, payload fields). This brief covers only the CRM-specific decisions layered on top of it.

  1. Chunking (a bigger quality lever than model choice). Per EMBEDDINGS.md: one chunk per email / note / transcript-turn; one chunk per memo section; never split mid-thought; keep entities + date_ts as filterable payload, not embedded text. The CRM-specific work is mapping your record types to chunks (how a thread, a note, a memo, a call transcript each decompose).
  2. Two vectors per chunk. Dense via Spark Control POST /v1/embeddings (BAAI/bge-m3, 1024-d); sparse BM25 via FastEmbed (Qdrant/bm25) generated client-side. Create the crm_chunks collection with a named dense vector (1024, Cosine) + named sparse vector (modifier: idf), and payload indexes on lp_id (keyword), doc_type (keyword), date_ts (integer). Upsert both vectors + payload (lp_id, lp_name, doc_type, date_ts, text). Ingest talks to Qdrant directly (QDRANT_URL), not through the gateway.
  3. Entity resolution at ingest. Resolve name variants ("John Smith" / "J. Smith" / "JS") to one canonical lp_id before storing, using a lightweight local-Qwen call via Spark Control /v1/chat/completions. Without it, retrieval fragments across variants. This is the same canonical-ID work as the canonical entity IDs item in Workstream A2.
  4. Backfill + incremental sync, idempotent. One-time full backfill, then incremental updates as records change. A full re-embed is ~8–15 min, so the index is cheaply rebuildable, but only if the pipeline is idempotent and re-runnable (Qdrant is never auto-restarted and holds the only live index). Prefer simple, correct, repeatable sync over premature optimization.
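
The collection shape in step 2 and the idempotency requirement in step 4 can be sketched together. The snippet below only builds the JSON bodies that Qdrant's REST API accepts for collection creation, payload indexes, and point upsert; it sends nothing. The vector names `dense` and `sparse`, the UUID namespace, and the `source_record_id:chunk_index` key are assumptions made for illustration. docs/EMBEDDINGS.md remains the authority on the actual names. Deriving the point ID from the source record makes a re-run overwrite the same point instead of adding a duplicate.

```python
import uuid

COLLECTION = "crm_chunks"
DENSE_NAME = "dense"    # assumed vector name; check docs/EMBEDDINGS.md
SPARSE_NAME = "sparse"  # assumed vector name; check docs/EMBEDDINGS.md
DENSE_SIZE = 1024       # bge-m3 output dimension, per the brief

# Hypothetical namespace so point IDs are stable across runs and machines.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, COLLECTION)


def collection_body():
    """Body for PUT /collections/crm_chunks: named dense + named sparse vector."""
    return {
        "vectors": {DENSE_NAME: {"size": DENSE_SIZE, "distance": "Cosine"}},
        "sparse_vectors": {SPARSE_NAME: {"modifier": "idf"}},
    }


def payload_index_bodies():
    """Bodies for PUT /collections/crm_chunks/index, one per filterable field."""
    return [
        {"field_name": "lp_id", "field_schema": "keyword"},
        {"field_name": "doc_type", "field_schema": "keyword"},
        {"field_name": "date_ts", "field_schema": "integer"},
    ]


def point_id(source_record_id, chunk_index):
    """Deterministic ID: the same record and chunk always map to the same point."""
    return str(uuid.uuid5(NAMESPACE, f"{source_record_id}:{chunk_index}"))


def build_point(source_record_id, chunk_index, dense, sparse_indices,
                sparse_values, payload):
    """One upsert point carrying both vectors plus the filterable payload."""
    if len(dense) != DENSE_SIZE:
        raise ValueError(f"expected {DENSE_SIZE}-d dense vector, got {len(dense)}")
    if len(sparse_indices) != len(sparse_values):
        raise ValueError("sparse indices and values must have the same length")
    return {
        "id": point_id(source_record_id, chunk_index),
        "vector": {
            DENSE_NAME: dense,
            SPARSE_NAME: {"indices": sparse_indices, "values": sparse_values},
        },
        "payload": payload,
    }
```

In a real run, `dense` would come from POST /v1/embeddings and the sparse indices and values from the FastEmbed Qdrant/bm25 model; both are left as plain arguments here so the sketch stays offline.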

Workstream C — CRM MCP server

Wrap the CRM behind an MCP server that agents (Claude Agent SDK) call. The retrieval modes are thin wrappers over Spark Control POST /api/search (which orchestrates embed → Qdrant RRF → rerank); for hybrid/keyword modes, generate the query's BM25 sparse vector with the same FastEmbed model used at ingest and pass it as sparse. Tools to expose:

  • Reads: fetch record(s), list/search records by structured filters, fetch interaction history for an entity.
  • Retrieval modes (all via /api/search, varying parameters):
    • semantic_search — dense only (omit sparse), large top_k/retrieve_n. High recall (Scout).
    • hybrid_search — pass sparse + rerank: true + filter. Dense+BM25 RRF then rerank. Default, best for entity-heavy queries.
    • keyword_search — high-precision lexical ("did we ever discuss X with this LP?"): lean on the sparse leg + rerank. For a pure-BM25 result, query Qdrant's sparse vector directly via QDRANT_URL.
    • Always support a Qdrant filter (e.g. lp_id, date_ts range) to pre-filter before scoring.
  • Writes: create/update records, append to the interaction log, write enrichment results back. (No outbound/contact actions — Phase 3, gated.)

Keep retrieval modes as small, composable functions so the orchestrator can select among them later.
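
A minimal sketch of "small, composable functions" for two of the modes follows. It only builds request bodies for POST /api/search and does not call the gateway. The `sparse`, `rerank`, `filter`, `top_k`, and `retrieve_n` parameters are named in this brief; the `query` field name, the `{indices, values}` shape of the sparse vector, and the default sizes are assumptions to check against the Spark Control API. The filter uses Qdrant's standard `must` / `match` / `range` syntax. keyword_search is left out because the brief does not pin down how its parameters differ from hybrid_search.

```python
def qdrant_filter(lp_id=None, date_from=None, date_to=None):
    """Build a Qdrant filter on lp_id and/or a date_ts range; None if empty."""
    must = []
    if lp_id is not None:
        must.append({"key": "lp_id", "match": {"value": lp_id}})
    date_range = {}
    if date_from is not None:
        date_range["gte"] = date_from
    if date_to is not None:
        date_range["lte"] = date_to
    if date_range:
        must.append({"key": "date_ts", "range": date_range})
    return {"must": must} if must else None


def base_request(query, top_k, flt=None):
    """Fields shared by every mode. The 'query' field name is an assumption."""
    body = {"query": query, "top_k": top_k}
    if flt is not None:
        body["filter"] = flt
    return body


def semantic_search(query, top_k=50, retrieve_n=200, flt=None):
    """Dense only: no 'sparse' key, large top_k/retrieve_n for recall."""
    body = base_request(query, top_k, flt)
    body["retrieve_n"] = retrieve_n
    return body


def hybrid_search(query, sparse, top_k=10, flt=None):
    """Dense + BM25 with rerank: pass the query's sparse vector and rerank=True."""
    body = base_request(query, top_k, flt)
    body["sparse"] = sparse  # assumed shape: {"indices": [...], "values": [...]}
    body["rerank"] = True
    return body
```

Because each mode returns a plain dict, an orchestrator can pick a mode, attach a filter, and hand the result to whatever HTTP client posts to the gateway.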

Workstream D — Compliance groundwork (parallel, non-blocking)

Engage counsel to define solicitation posture (506(b) vs 506(c)), accreditation/QP verification, and recordkeeping rules. This does not block Workstreams A–C, but it must be resolved before any Phase 3 outbound work begins.

Acceptance criteria

  • CRM schema extended (additive, reversible) and documented.
  • crm_chunks collection created per EMBEDDINGS.md (named dense + sparse vectors, payload indexes).
  • Full CRM corpus backfilled into Qdrant with both dense + BM25 sparse vectors and payload; incremental sync working and idempotent.
  • Entity resolution collapses obvious name variants to one canonical lp_id.
  • CRM MCP server live; retrieval modes wrap /api/search; hybrid (sparse + rerank) visibly beats dense-only on entity-heavy test queries (e.g. exact fund/LP name lookups).
  • Every write goes through the interaction log.
  • No outward-facing behavior exists anywhere in the codebase.

Explicitly out of scope for Phase 0

  • Any agent that sends email, posts, or contacts a prospect.
  • Scout monitoring, Closer outreach, Scribe publishing, the Orchestrator.
  • The workflow/scheduling engine (revisited at Phase 3).

Suggested order

A1 → A2 → B1 → B2 → B3 → B4 → C. Start D in parallel.