Phase 0 complete: fuzzy entity tier, incremental sync, Start9 packaging

- Fuzzy tier (backend/ingest/fuzzy_resolve.py + llm.py): local Qwen adjudicates
  the deterministic resolver's flagged name-variant candidates; merges are
  durable via entity_merges (deterministic re-runs respect them), losers
  soft-deleted, logged. Idempotent.
- Incremental sync (backend/ingest/sync.py): re-embeds only rows changed since a
  watermark (ingest_sync_state); first run / --recreate = full. Tested full→0→1.
- Start9 packaging (start9/0.4): Dockerfile bundles ingest+mcp + fastembed/mcp;
  "Build search index" action runs the init in a subcontainer; MCP shipped as a
  manual stdio server (not a daemon); version 0.1.0:44. INGEST_PACKAGING.md.
- backfill.py: factored embed_and_upsert() shared with sync.

Verified end-to-end on synthetic data + live Sparks/Qwen/Qdrant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Keysat
2026-06-05 08:55:12 -05:00
parent c7ce44d963
commit f357c23c75
16 changed files with 808 additions and 48 deletions
+17 -4
View File
@@ -31,14 +31,27 @@ RUN apt-get update \
&& rm -rf /var/lib/apt/lists/*
# ── Python dependencies ─────────────────────────────────────────
# Only one hard dep for now: `cryptography` is required by the Gmail
# integration's RS256 JWT signing (DWD bearer tokens). Everything else
# server.py needs is stdlib.
RUN pip install --no-cache-dir cryptography==42.0.5
# `cryptography` is required by the Gmail integration's RS256 JWT signing
# (DWD bearer tokens). The two Phase-0 deps are runtime-only for the ingest
# pipeline + MCP server (the CRM web server itself still needs no new deps):
# * fastembed — client-side BM25 (Qdrant/bm25) for the sparse retrieval leg
# (backend/ingest/sparse.py auto-detects it).
# * mcp — MCP Python SDK, only needed to run backend/mcp/server.py.
# Everything else server.py needs is stdlib.
RUN pip install --no-cache-dir \
cryptography==42.0.5 \
fastembed==0.4.2 \
mcp==1.2.0
# ── Application source ──────────────────────────────────────────
COPY backend/server.py /app/backend/server.py
COPY backend/email_integration /app/backend/email_integration
# Phase-0 substrate: ingest pipeline (entity resolution + backfill) and the
# CRM MCP server. Shipped alongside the web server so the one-time index build
# and the (manually-run) MCP server can execute on the box where /data/crm.db
# lives. See start9/0.4/INGEST_PACKAGING.md.
COPY backend/ingest /app/backend/ingest
COPY backend/mcp /app/backend/mcp
COPY frontend /app/frontend
# ── StartOS wrapper scripts ─────────────────────────────────────