Phase 0 complete: fuzzy entity tier, incremental sync, Start9 packaging

- Fuzzy tier (backend/ingest/fuzzy_resolve.py + llm.py): local Qwen adjudicates the deterministic resolver's flagged name-variant candidates; merges are durable via entity_merges (deterministic re-runs respect them), losers soft-deleted, logged. Idempotent.
- Incremental sync (backend/ingest/sync.py): re-embeds only rows changed since a watermark (ingest_sync_state); first run / --recreate = full. Tested full→0→1.
- Start9 packaging (start9/0.4): Dockerfile bundles ingest+mcp + fastembed/mcp; "Build search index" action runs the init in a subcontainer; MCP shipped as a manual stdio server (not a daemon); version 0.1.0:44. INGEST_PACKAGING.md.
- backfill.py: factored embed_and_upsert() shared with sync.

Verified end-to-end on synthetic data + live Sparks/Qwen/Qdrant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:55:12 -05:00
parent c7ce44d963
commit f357c23c75
16 changed files with 808 additions and 48 deletions
@@ -0,0 +1,87 @@
+# Go-Live Runbook — Phase 0 substrate on the live Start9 box
+
+*How to take the Phase-0 data substrate from "tested on synthetic data" to "running against the real CRM" on the Start9 server. You run this on your infrastructure; no real LP data goes to Claude/Anthropic (guardrails #1, #9). The live `/data/crm.db` on the box is the canonical source — not the possibly-stale `start9/0.4/seed/` snapshot.*
+
+Recap of the three moves (see also `docs/crm-overview.md`): (1) ship code → empty new tables appear; (2) run the one-time init → fills the canonical IDs + search index from your real data; (3) run the MCP server.
+
+---
+
+## Prerequisites
+
+- Spark Control + Qdrant reachable from the box: `SPARK_CONTROL_URL`, `QDRANT_URL` (see `.env.example`). Verify with `curl -sk "$SPARK_CONTROL_URL/api/endpoints"`.
+- The `backend/ingest/` + `backend/mcp/` code present on the box (ships with the package; see the open packaging decision near the end of this runbook).
+- Python deps in the ingest environment: `fastembed` (BM25; installs cleanly on the box's Python 3.11) and `mcp` (only to run the MCP server). The CRM server itself needs no new deps.
+
+## Step 1 — Deploy the new CRM version (auto-creates the empty tables)
+
+1. Bump the package version, rebuild the `.s9pk`, sideload it. StartOS preserves `/data`, so your real data is undisturbed.
+2. On first boot, `init_db()` runs `backend/core_migrations.py`, which applies `migrations/0001_phase0_foundation.sql` **once** (tracked in `schema_migrations`) — additively creating `canonical_entities`, `entity_links`, `interaction_log`, `relationship_edges`, and the `deleted_at` columns. Nothing existing changes.
+3. Verify: `sqlite3 /data/crm.db "SELECT filename FROM schema_migrations;"` → should list `0001_phase0_foundation.sql`.
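The apply-once behavior described in item 2 can be sketched as below. This is a minimal illustration and not the contents of `backend/core_migrations.py`: the `apply_once` helper, the single-column layout of `schema_migrations`, and the demo SQL are assumptions. Only the table name and its `filename` column are taken from the verify query above.

```python
import sqlite3


def apply_once(conn: sqlite3.Connection, filename: str, sql: str) -> bool:
    """Run a migration script unless its filename is already recorded.

    Returns True if the script ran, False if it was skipped.
    """
    # Assumed layout: one row per applied migration file.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (filename TEXT PRIMARY KEY)"
    )
    already = conn.execute(
        "SELECT 1 FROM schema_migrations WHERE filename = ?", (filename,)
    ).fetchone()
    if already:
        return False
    conn.executescript(sql)  # the migration itself (additive DDL in this runbook)
    conn.execute("INSERT INTO schema_migrations (filename) VALUES (?)", (filename,))
    conn.commit()
    return True


# Hypothetical stand-in for the real migration file's contents.
DEMO_SQL = "CREATE TABLE canonical_entities (id INTEGER PRIMARY KEY, entity_kind TEXT);"

conn = sqlite3.connect(":memory:")
first = apply_once(conn, "0001_phase0_foundation.sql", DEMO_SQL)
second = apply_once(conn, "0001_phase0_foundation.sql", DEMO_SQL)
print(first, second)
```

The second call is skipped because the filename is already recorded, which is why a plain `CREATE TABLE` (without `IF NOT EXISTS`) does not fail on a later boot.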
+
+## Step 2 — Prepare the ingest environment (on the box)
+
+```bash
+pip install fastembed                 # BM25 sparse encoder, model Qdrant/bm25 (sparse.py auto-detects it)
+export CRM_DB_PATH=/data/crm.db
+export SPARK_CONTROL_URL=https://192.168.1.72:62419
+export SPARK_CONTROL_VERIFY_TLS=false
+export QDRANT_URL=http://192.168.1.87:6333
+```
+
+`sparse.py` will report `BACKEND = fastembed:Qdrant/bm25` here (vs the pure-Python fallback used on the dev Mac). Because the index is built **and** queried on the box, the encoder is consistent end-to-end.
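Before running the ingest steps, it may help to confirm that the variables exported above are actually set. The following is a hedged preflight sketch: `missing_settings` and `verify_tls` are hypothetical helpers, and the way the real scripts interpret `SPARK_CONTROL_VERIFY_TLS` (including its default) is an assumption.

```python
import os
from typing import Mapping

# Variables exported in Step 2 that the ingest scripts need.
REQUIRED = ("CRM_DB_PATH", "SPARK_CONTROL_URL", "QDRANT_URL")


def missing_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED if not env.get(name, "").strip()]


def verify_tls(env: Mapping[str, str]) -> bool:
    """Interpret SPARK_CONTROL_VERIFY_TLS; defaulting to True is an assumption."""
    raw = env.get("SPARK_CONTROL_VERIFY_TLS", "true").strip().lower()
    return raw not in {"false", "0", "no"}


missing = missing_settings(os.environ)
if missing:
    print("missing settings:", ", ".join(missing))
else:
    print("all settings present; TLS verification:", verify_tls(os.environ))
```

The check only inspects the environment. It does not contact Spark Control or Qdrant, so the `curl` check from the prerequisites is still worth running.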
+
+## Step 3 — Build the canonical IDs from your real data
+
+```bash
+python3 backend/ingest/entity_resolution.py --db /data/crm.db --show-candidates
+```
+
+This reads your real contacts / fundraising investors / organizations and fills `canonical_entities` + `entity_links` (the "create entity IDs from existing data" step). It is **read-only on your CRM source tables**, idempotent, and logs a run to `interaction_log`. Review the printed fuzzy candidates — those are the name-variant pairs the deterministic tier wouldn't merge on a guess (the local-Qwen fuzzy tier, still to be built, resolves these).
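To illustrate the split between "merge deterministically" and "flag as a fuzzy candidate", here is a small sketch. It is not the logic of `entity_resolution.py`, which this runbook does not show: the normalization rule, the `difflib` similarity measure, and the `0.85` threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())


def resolve(names: list[str], threshold: float = 0.85):
    """Group exact normalized matches; flag near matches for later review."""
    groups: dict[str, list[str]] = {}
    for name in names:
        # Deterministic tier: identical after normalization means same entity.
        groups.setdefault(normalize(name), []).append(name)

    candidates = []
    for a, b in combinations(sorted(groups), 2):
        # Similar but not identical: report the pair, do not merge it.
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            candidates.append((a, b))
    return groups, candidates


groups, candidates = resolve(
    ["Acme Capital LLC", "Acme Capital, L.L.C.", "Jon Smith", "John Smith", "Jane Doe"]
)
print(groups["acme capital llc"])
print(candidates)
```

In this toy input the two Acme spellings collapse into one group, while the Jon/John pair is only reported as a candidate, mirroring the "wouldn't merge on a guess" behavior described above.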
+
+## Step 4 — Build the search index
+
+```bash
+python3 backend/ingest/backfill.py --db /data/crm.db --recreate
+```
+
+This chunks your real records, embeds each chunk (dense bge-m3 via Spark Control plus BM25 sparse), and upserts the results to the Qdrant `crm_chunks` collection. A full corpus takes roughly 8–15 minutes. The run is idempotent (deterministic point ids), so re-running is safe. `--recreate` drops and rebuilds the collection; omit it to update in place.
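The idempotency claim rests on deterministic point ids: if the same chunk always maps to the same id, a repeated upsert overwrites the existing point instead of adding a duplicate. A minimal sketch of one way to derive such ids follows. The key format, the namespace, and the `point_id` helper are assumptions and not necessarily what `backfill.py` does. Qdrant accepts unsigned integers or UUIDs as point ids, so a UUIDv5 is a valid choice.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "crm_chunks")


def point_id(source_table: str, row_id: int, chunk_index: int) -> str:
    """Derive a stable UUID for a chunk from where it came from."""
    key = f"{source_table}:{row_id}:{chunk_index}"
    return str(uuid.uuid5(CHUNK_NAMESPACE, key))


first_run = point_id("contacts", 42, 0)
second_run = point_id("contacts", 42, 0)
print(first_run == second_run)  # same chunk, same id on every run
```

Because the id depends only on the source location of the chunk, editing a record's text changes the stored vector and payload but not the id.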
+
+Note: your live CRM's text is concentrated in the **fundraising grid notes** + grid contacts (the seed snapshot had 0 communications / 0 lp_profiles), plus Gmail once enabled (see `docs/gmail-enablement-runbook.md`). The chunker already handles all of these.
+
+## Step 5 — Start the MCP server
+
+```bash
+pip install mcp
+CRM_DB_PATH=/data/crm.db python3 backend/mcp/server.py
+```
+
+Register it with the Agent SDK / Claude Code as an stdio MCP server. It exposes reads, the three retrieval modes, and logged writes — **no outbound/contact tools** (Phase 3 gate). For Phase 0 there are no live agents; this is for testing and the internal-only Analyst work later.
+
+## Step 6 — Incremental sync (NOT YET BUILT — Workstream B4)
+
+The full backfill is one-shot. Keeping the index fresh as the CRM changes (new grid edits, new emails) needs an incremental, idempotent sync on a schedule. This is the remaining Phase-0 ingest piece; until it's built, re-run Steps 3–4 to refresh.
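One possible shape for such a sync, following the commit message's description (re-embed only rows changed since a watermark held in `ingest_sync_state`), is sketched below. The column layout of `ingest_sync_state`, the `notes` table with an `updated_at` column, and the `changed_rows` helper are hypothetical; the real `sync.py` may differ.

```python
import sqlite3


def changed_rows(conn: sqlite3.Connection, source: str) -> list[tuple]:
    """Return rows updated after the stored watermark, then advance it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingest_sync_state "
        "(source TEXT PRIMARY KEY, watermark TEXT NOT NULL)"
    )
    row = conn.execute(
        "SELECT watermark FROM ingest_sync_state WHERE source = ?", (source,)
    ).fetchone()
    watermark = row[0] if row else ""  # empty string sorts before any timestamp
    # Strict ">" can miss rows that share the watermark timestamp;
    # a real sync would need a tiebreaker such as the row id.
    rows = conn.execute(
        "SELECT id, body, updated_at FROM notes "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if rows:
        conn.execute(
            "INSERT OR REPLACE INTO ingest_sync_state (source, watermark) VALUES (?, ?)",
            (source, rows[-1][2]),
        )
        conn.commit()
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO notes VALUES (?, ?, ?)",
    [(1, "intro call", "2026-06-01T09:00:00"), (2, "follow-up", "2026-06-02T09:00:00")],
)
full = changed_rows(conn, "notes")   # first run: no watermark, so everything
idle = changed_rows(conn, "notes")   # nothing changed since the watermark
conn.execute(
    "UPDATE notes SET body = 'follow-up sent', updated_at = '2026-06-03T09:00:00' "
    "WHERE id = 2"
)
delta = changed_rows(conn, "notes")  # only the edited row
print(len(full), len(idle), len(delta))
```

The three calls reproduce the full→0→1 pattern mentioned in the commit message: a full first pass, an empty pass when nothing changed, and a single row after one edit.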
+
+## Verification
+
+```sql
+SELECT entity_kind, COUNT(*) FROM canonical_entities GROUP BY entity_kind;   -- IDs built
+SELECT COUNT(*) FROM entity_links;                                            -- source rows linked
+```
+```bash
+curl -s "$QDRANT_URL/collections/crm_chunks" | python3 -c "import sys,json;print('points:', json.load(sys.stdin)['result']['points_count'])"
+python3 backend/ingest/search.py "Fund III wire timeline" --mode hybrid       # sanity query
+```
+
+## Open decision — packaging (how the init + MCP run on the box)
+
+The ingest scripts read `/data/crm.db` by file path, so they must run **where that file lives** — inside or beside the CRM container (the dev Mac cannot open the container's SQLite file directly). Options, to decide before go-live:
+
+- **A (recommended): same image.** Bundle `backend/ingest` + `backend/mcp` (+ `fastembed`, `mcp`) into the CRM container image; expose the init as a one-shot Start9 action and run the MCP server as a second daemon in the 0.4 `startos` manifest. The image is already Python 3.11 with the volume mounted.
+- **B: sidecar container** on the box mounting the same `/data` volume.
+- **C: co-located host** with a copy of `/data` and LAN access to the Sparks (involves copying the DB — least clean).
+
+This packaging wiring (and Step 6) is the remaining build work for a fully turn-key go-live.
+
+## Sovereignty checkpoint
+
+Every step above runs on Ten31 infrastructure. Real records flow `crm.db → local Spark (bge-m3) → local Qdrant` and never reach Anthropic. The scripts print counts, not records. Keep it that way: don't paste query *results* over real data back into a Claude session (guardrail #9).