Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server
Workstream A–C substrate for the Ten31 agentic system:

- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links, interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes), no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db, drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity queries).
Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,40 @@
"""Client-side BM25 sparse vectors.

EMBEDDINGS.md specifies FastEmbed `Qdrant/bm25` so Qdrant applies IDF (via the
sparse vector's `modifier: idf`) over OUR corpus. FastEmbed pulls onnxruntime,
which has no wheel for this Python (3.14) yet, so this module provides a
dependency-free BM25 term-frequency encoder with the same contract:
`encode(text) -> {"indices": [...], "values": [...]}`.

Qdrant computes IDF server-side from the stored sparse vectors regardless of how
indices are assigned, so this is a legitimate corpus-IDF BM25 leg. The ONLY hard
requirement is that ingest and query use the SAME encoder; they both import this
one. For production, swap `encode()` for FastEmbed `Qdrant/bm25` (and re-index, so
ingest and query stay on the same tokenizer).
"""
import hashlib
import math
import re

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str):
    return _TOKEN_RE.findall((text or "").lower())


def _index(token: str) -> int:
    # Stable unsigned 32-bit index for a token (Qdrant sparse indices are u32).
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")


def encode(text: str):
    """Return a sparse vector {indices, values}. Value is 1 + ln(tf) (sublinear
    term frequency); IDF is applied by Qdrant via modifier:idf."""
    tf = {}
    for tok in tokenize(text):
        tf[tok] = tf.get(tok, 0) + 1
    idx_val = {}
    for tok, count in tf.items():
        idx_val[_index(tok)] = 1.0 + math.log(count)
    return {"indices": list(idx_val.keys()), "values": list(idx_val.values())}
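The module docstring says the encoder only emits term-frequency weights and that IDF is applied server-side over the stored vectors. The sketch below illustrates that split locally, using only the standard library. It restates `encode()` so the block runs on its own; `idf_table` and `score` are hypothetical helpers, the sample documents are invented, and the IDF formula is the standard BM25 one, assumed here as a stand-in for whatever Qdrant computes with `modifier: idf`.

```python
import hashlib
import math
import re

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def encode(text: str) -> dict:
    """Same contract as the module: hashed u32 indices, 1 + ln(tf) values."""
    tf = {}
    for tok in _TOKEN_RE.findall((text or "").lower()):
        tf[tok] = tf.get(tok, 0) + 1
    out = {}
    for tok, count in tf.items():
        idx = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:4], "big")
        out[idx] = 1.0 + math.log(count)
    return {"indices": list(out.keys()), "values": list(out.values())}


def idf_table(doc_vectors: list) -> dict:
    """Hypothetical local stand-in for server-side IDF (assumed formula):
    ln((N - n + 0.5) / (n + 0.5) + 1), where n is the document frequency."""
    n_docs = len(doc_vectors)
    df = {}
    for vec in doc_vectors:
        for idx in vec["indices"]:
            df[idx] = df.get(idx, 0) + 1
    return {
        idx: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
        for idx, n in df.items()
    }


def score(query_vec: dict, doc_vec: dict, idf: dict) -> float:
    """Sparse dot product with each shared term scaled by its IDF."""
    doc = dict(zip(doc_vec["indices"], doc_vec["values"]))
    return sum(
        q_val * idf.get(idx, 0.0) * doc[idx]
        for idx, q_val in zip(query_vec["indices"], query_vec["values"])
        if idx in doc
    )


# Invented sample corpus: "fund" appears everywhere, "acme" only once.
docs = [
    "Intro call with Acme Capital about the fund",
    "Follow-up email about the fund terms",
    "Notes about the quarterly fund update",
]
doc_vectors = [encode(d) for d in docs]
idf = idf_table(doc_vectors)

query = encode("acme fund")
scores = [score(query, vec, idf) for vec in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because the query and the documents go through the same `encode()`, the hashed indices line up and the rare term ("acme") dominates the common one ("fund"). Encoding the query with a different tokenizer or hash would produce indices that never match, which is why the docstring treats a shared encoder as the one hard requirement.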