Phase 0 complete: fuzzy entity tier, incremental sync, Start9 packaging

- Fuzzy tier (backend/ingest/fuzzy_resolve.py + llm.py): local Qwen adjudicates the deterministic resolver's flagged name-variant candidates; merges are durable via entity_merges (deterministic re-runs respect them), losers soft-deleted, logged. Idempotent.
- Incremental sync (backend/ingest/sync.py): re-embeds only rows changed since a watermark (ingest_sync_state); first run / --recreate = full. Tested full→0→1.
- Start9 packaging (start9/0.4): Dockerfile bundles ingest+mcp + fastembed/mcp; "Build search index" action runs the init in a subcontainer; MCP shipped as a manual stdio server (not a daemon); version 0.1.0:44. INGEST_PACKAGING.md.
- backfill.py: factored embed_and_upsert() shared with sync.

Verified end-to-end on synthetic data + live Sparks/Qwen/Qdrant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:55:12 -05:00
parent c7ce44d963
commit f357c23c75
16 changed files with 808 additions and 48 deletions
@@ -16,25 +16,41 @@ import hashlib
 import math
 import re

-_TOKEN_RE = re.compile(r"[a-z0-9]+")
+# Prefer FastEmbed Qdrant/bm25 (the EMBEDDINGS.md-specified encoder) when it is
+# installable — true on the Start9 box (Python 3.11). Fall back to the
+# dependency-free encoder below where it is not (e.g. this dev Mac on 3.14).
+# Whichever is active, ingest and query in the SAME environment use it, so they
+# stay consistent; production rebuilds the index on the box, so it uses FastEmbed
+# end-to-end. BACKEND reports which is live.
+try:
+    from fastembed import SparseTextEmbedding  # type: ignore
+    _MODEL = None

+    def _model():
+        global _MODEL
+        if _MODEL is None:
+            _MODEL = SparseTextEmbedding(model_name="Qdrant/bm25")
+        return _MODEL

-def tokenize(text: str):
-    return _TOKEN_RE.findall((text or "").lower())
+    def encode(text: str):
+        emb = next(_model().embed([text or ""]))
+        return {"indices": [int(i) for i in emb.indices], "values": [float(v) for v in emb.values]}

+    BACKEND = "fastembed:Qdrant/bm25"
+except Exception:
+    BACKEND = "pure-python-bm25"
+    _TOKEN_RE = re.compile(r"[a-z0-9]+")

-def _index(token: str) -> int:
-    # Stable unsigned 32-bit index for a token (Qdrant sparse indices are u32).
-    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")
+    def tokenize(text: str):
+        return _TOKEN_RE.findall((text or "").lower())

+    def _index(token: str) -> int:
+        # Stable unsigned 32-bit index for a token (Qdrant sparse indices are u32).
+        return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")

-def encode(text: str):
-    """Return a sparse vector {indices, values}. Value is 1 + ln(tf) (sublinear
-    term frequency); IDF is applied by Qdrant via modifier:idf."""
-    tf = {}
-    for tok in tokenize(text):
-        tf[tok] = tf.get(tok, 0) + 1
-    idx_val = {}
-    for tok, count in tf.items():
-        idx_val[_index(tok)] = 1.0 + math.log(count)
-    return {"indices": list(idx_val.keys()), "values": list(idx_val.values())}
+    def encode(text: str):
+        """Sparse vector {indices, values}; value = 1 + ln(tf). Qdrant applies IDF."""
+        tf = {}
+        for tok in tokenize(text):
+            tf[tok] = tf.get(tok, 0) + 1
+        return {"indices": [_index(t) for t in tf], "values": [1.0 + math.log(c) for c in tf.values()]}
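
One point the fallback encoder above leaves open: two distinct tokens can, rarely, map to the same 32-bit index, and the list-comprehension form would then emit that index twice. The sketch below is a variant, not part of this commit, that sums the values of colliding tokens so each index appears once. It assumes the consumer expects unique sparse indices; the names `token_index` and `encode_merged` and the injectable `index_fn` parameter are hypothetical and exist only to make the collision case testable.

```python
import hashlib
import math
import re
from collections import Counter

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    # Lowercase, then keep runs of ASCII letters and digits.
    return _TOKEN_RE.findall((text or "").lower())


def token_index(token):
    # Same scheme as the diff: first 4 bytes of MD5 as an unsigned 32-bit int.
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")


def encode_merged(text, index_fn=token_index):
    """Sparse vector {indices, values} with value = 1 + ln(tf) per token.

    Tokens whose indices collide have their values summed, so every index
    in the result is unique.
    """
    merged = {}
    for tok, count in Counter(tokenize(text)).items():
        idx = index_fn(tok)
        merged[idx] = merged.get(idx, 0.0) + 1.0 + math.log(count)
    return {"indices": list(merged.keys()), "values": list(merged.values())}
```

For example, `encode_merged("Foo foo bar")` yields two entries, one weighted `1 + ln 2` for `foo` and one weighted `1.0` for `bar`. Summing is one reasonable choice on collision; keeping the maximum would also preserve uniqueness.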