Phase 0 complete: fuzzy entity tier, incremental sync, Start9 packaging

- Fuzzy tier (backend/ingest/fuzzy_resolve.py + llm.py): local Qwen adjudicates
  the deterministic resolver's flagged name-variant candidates; merges are
  durable via entity_merges (deterministic re-runs respect them), losers
  soft-deleted, logged. Idempotent.
- Incremental sync (backend/ingest/sync.py): re-embeds only rows changed since a
  watermark (ingest_sync_state); first run / --recreate = full. Tested full→0→1.
- Start9 packaging (start9/0.4): Dockerfile bundles ingest+mcp + fastembed/mcp;
  "Build search index" action runs the init in a subcontainer; MCP shipped as a
  manual stdio server (not a daemon); version 0.1.0:44. INGEST_PACKAGING.md.
- backfill.py: factored embed_and_upsert() shared with sync.

Verified end-to-end on synthetic data + live Sparks/Qwen/Qdrant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Keysat
2026-06-05 08:55:12 -05:00
parent c7ce44d963
commit f357c23c75
16 changed files with 808 additions and 48 deletions
+87
View File
@@ -0,0 +1,87 @@
# Go-Live Runbook — Phase 0 substrate on the live Start9 box
*How to take the Phase-0 data substrate from "tested on synthetic data" to "running against the real CRM" on the Start9 server. You run this on your infrastructure; no real LP data goes to Claude/Anthropic (guardrails #1, #9). The live `/data/crm.db` on the box is the canonical source — not the possibly-stale `start9/0.4/seed/` snapshot.*
Recap of the three moves (see also `docs/crm-overview.md`): (1) ship code → empty new tables appear; (2) run the one-time init → fills the canonical IDs + search index from your real data; (3) run the MCP server.
---
## Prerequisites
- Spark Control + Qdrant reachable from the box: `SPARK_CONTROL_URL`, `QDRANT_URL` (see `.env.example`). Verify with `curl -sk $SPARK_CONTROL_URL/api/endpoints`.
- The `backend/ingest/` + `backend/mcp/` code present on the box (ships with the package — see "Packaging decision" below).
- Python deps in the ingest environment: `fastembed` (BM25; installs cleanly on the box's Python 3.11) and `mcp` (only to run the MCP server). The CRM server itself needs no new deps.
## Step 1 — Deploy the new CRM version (auto-creates the empty tables)
1. Bump the package version, rebuild the `.s9pk`, sideload it. StartOS preserves `/data`, so your real data is undisturbed.
2. On first boot, `init_db()` runs `backend/core_migrations.py`, which applies `migrations/0001_phase0_foundation.sql` **once** (tracked in `schema_migrations`) — additively creating `canonical_entities`, `entity_links`, `interaction_log`, `relationship_edges`, and the `deleted_at` columns. Nothing existing changes.
3. Verify: `sqlite3 /data/crm.db "SELECT filename FROM schema_migrations;"` → should list `0001_phase0_foundation.sql`.
## Step 2 — Prepare the ingest environment (on the box)
```bash
pip install fastembed # BM25 Qdrant/bm25 (sparse.py auto-detects it)
export CRM_DB_PATH=/data/crm.db
export SPARK_CONTROL_URL=https://192.168.1.72:62419
export SPARK_CONTROL_VERIFY_TLS=false
export QDRANT_URL=http://192.168.1.87:6333
```
`sparse.py` will report `BACKEND = fastembed:Qdrant/bm25` here (vs the pure-Python fallback used on the dev Mac). Because the index is built **and** queried on the box, the encoder is consistent end-to-end.
## Step 3 — Build the canonical IDs from your real data
```bash
python3 backend/ingest/entity_resolution.py --db /data/crm.db --show-candidates
```
This reads your real contacts / fundraising investors / organizations and fills `canonical_entities` + `entity_links` (the "create entity IDs from existing data" step). It is **read-only on your CRM source tables**, idempotent, and logs a run to `interaction_log`. Review the printed fuzzy candidates — those are the name-variant pairs the deterministic tier wouldn't merge on a guess (the local-Qwen fuzzy tier, still to be built, resolves these).
## Step 4 — Build the search index
```bash
python3 backend/ingest/backfill.py --db /data/crm.db --recreate
```
Chunks your real records → dense (bge-m3 via Spark Control) + BM25 sparse → upserts to Qdrant `crm_chunks`. ~815 min for a full corpus. Idempotent (deterministic point ids), so re-running is safe. `--recreate` drops and rebuilds the collection; omit it to update in place.
Note: your live CRM's text is concentrated in the **fundraising grid notes** + grid contacts (the seed snapshot had 0 communications / 0 lp_profiles), plus Gmail once enabled (see `docs/gmail-enablement-runbook.md`). The chunker already handles all of these.
## Step 5 — Start the MCP server
```bash
pip install mcp
CRM_DB_PATH=/data/crm.db python3 backend/mcp/server.py
```
Register it with the Agent SDK / Claude Code as an stdio MCP server. It exposes reads, the three retrieval modes, and logged writes — **no outbound/contact tools** (Phase 3 gate). For Phase 0 there are no live agents; this is for testing and the internal-only Analyst work later.
## Step 6 — Incremental sync (NOT YET BUILT — Workstream B4)
The full backfill is one-shot. Keeping the index fresh as the CRM changes (new grid edits, new emails) needs an incremental, idempotent sync on a schedule. This is the remaining Phase-0 ingest piece; until it's built, re-run Steps 34 to refresh.
## Verification
```sql
SELECT entity_kind, COUNT(*) FROM canonical_entities GROUP BY entity_kind; -- IDs built
SELECT COUNT(*) FROM entity_links; -- source rows linked
```
```bash
curl -s "$QDRANT_URL/collections/crm_chunks" | python3 -c "import sys,json;print('points:', json.load(sys.stdin)['result']['points_count'])"
python3 backend/ingest/search.py "Fund III wire timeline" --mode hybrid # sanity query
```
## Open decision — packaging (how the init + MCP run on the box)
The ingest scripts read `/data/crm.db` by file path, so they must run **where that file lives** — inside or beside the CRM container (the dev Mac cannot open the container's SQLite file directly). Options, to decide before go-live:
- **A (recommended): same image.** Bundle `backend/ingest` + `backend/mcp` (+ `fastembed`, `mcp`) into the CRM container image; expose the init as a one-shot Start9 action and run the MCP server as a second daemon in the 0.4 `startos` manifest. The image is already Python 3.11 with the volume mounted.
- **B: sidecar container** on the box mounting the same `/data` volume.
- **C: co-located host** with a copy of `/data` and LAN access to the Sparks (involves copying the DB — least clean).
This packaging wiring (and Step 6) is the remaining build work for a fully turn-key go-live.
## Sovereignty checkpoint
Every step above runs on Ten31 infrastructure. Real records flow `crm.db → local Spark (bge-m3) → local Qdrant` and never reach Anthropic. The scripts print counts, not records. Keep it that way: don't paste query *results* over real data back into a Claude session (guardrail #9).