Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server

Workstream A–C substrate for the Ten31 agentic system: - A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9 - A2: additive/reversible core migration (canonical_entities, entity_links, interaction_log, relationship_edges, soft-delete) + ledgered runner - B1/B3: chunking + deterministic entity resolution (backend/ingest) - B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks - C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools - docs: redaction/re-hydration, Gmail enablement runbook - synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db, drop legacy files + start9/0.3.5) Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity queries). Real backfill runs on Ten31 infra; index holds synthetic data only. Branch snapshot also captures pre-existing working-tree changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:11:28 -05:00
parent 7027efd777
commit c7ce44d963
99 changed files with 10676 additions and 7817 deletions
@@ -0,0 +1,40 @@
+"""Client-side BM25 sparse vectors.
+
+EMBEDDINGS.md specifies FastEmbed `Qdrant/bm25` so Qdrant applies IDF (via the
+sparse vector's `modifier: idf`) over OUR corpus. FastEmbed pulls onnxruntime,
+which has no wheel for this Python (3.14) yet, so this module provides a
+dependency-free BM25 term-frequency encoder with the same contract:
+`encode(text) -> {"indices": [...], "values": [...]}`.
+
+Qdrant computes IDF server-side from the stored sparse vectors regardless of how
+indices are assigned, so this is a legitimate corpus-IDF BM25 leg. The ONLY hard
+requirement is that ingest and query use the SAME encoder — they both import this
+one. For production, swap `encode()` for FastEmbed `Qdrant/bm25` (and re-index, so
+ingest and query stay on the same tokenizer).
+"""
+import hashlib
+import math
+import re
+
+_TOKEN_RE = re.compile(r"[a-z0-9]+")
+
+
+def tokenize(text: str):
+    return _TOKEN_RE.findall((text or "").lower())
+
+
+def _index(token: str) -> int:
+    # Stable unsigned 32-bit index for a token (Qdrant sparse indices are u32).
+    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")
+
+
+def encode(text: str):
+    """Return a sparse vector {indices, values}. Value is 1 + ln(tf) (sublinear
+    term frequency); IDF is applied by Qdrant via modifier:idf."""
+    tf = {}
+    for tok in tokenize(text):
+        tf[tok] = tf.get(tok, 0) + 1
+    idx_val = {}
+    for tok, count in tf.items():
+        idx_val[_index(tok)] = 1.0 + math.log(count)
+    return {"indices": list(idx_val.keys()), "values": list(idx_val.values())}