Keysat f357c23c75 Phase 0 complete: fuzzy entity tier, incremental sync, Start9 packaging

- Fuzzy tier (backend/ingest/fuzzy_resolve.py + llm.py): local Qwen adjudicates
  the deterministic resolver's flagged name-variant candidates; merges are
  durable via entity_merges (deterministic re-runs respect them), losers
  soft-deleted, logged. Idempotent.
- Incremental sync (backend/ingest/sync.py): re-embeds only rows changed since a
  watermark (ingest_sync_state); first run / --recreate = full. Tested full→0→1.
- Start9 packaging (start9/0.4): Dockerfile bundles ingest+mcp + fastembed/mcp;
  "Build search index" action runs the init in a subcontainer; MCP shipped as a
  manual stdio server (not a daemon); version 0.1.0:44. INGEST_PACKAGING.md.
- backfill.py: factored embed_and_upsert() shared with sync.

Verified end-to-end on synthetic data + live Sparks/Qwen/Qdrant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-05 08:55:12 -05:00

Go-Live Runbook — Phase 0 substrate on the live Start9 box

How to take the Phase-0 data substrate from "tested on synthetic data" to "running against the real CRM" on the Start9 server. You run this on your infrastructure; no real LP data goes to Claude/Anthropic (guardrails #1, #9). The live /data/crm.db on the box is the canonical source — not the possibly-stale start9/0.4/seed/ snapshot.

Recap of the three moves (see also docs/crm-overview.md): (1) ship code → empty new tables appear; (2) run the one-time init → fills the canonical IDs + search index from your real data; (3) run the MCP server.

Prerequisites

Spark Control + Qdrant reachable from the box: SPARK_CONTROL_URL, QDRANT_URL (see .env.example). Verify with curl -sk $SPARK_CONTROL_URL/api/endpoints.
The backend/ingest/ + backend/mcp/ code present on the box (ships with the package; see the open packaging decision below).
Python deps in the ingest environment: fastembed (BM25; installs cleanly on the box's Python 3.11) and mcp (only to run the MCP server). The CRM server itself needs no new deps.

Step 1 — Deploy the new CRM version (auto-creates the empty tables)

Bump the package version, rebuild the .s9pk, sideload it. StartOS preserves /data, so your real data is undisturbed.
On first boot, init_db() runs backend/core_migrations.py, which applies migrations/0001_phase0_foundation.sql once (tracked in schema_migrations) — additively creating canonical_entities, entity_links, interaction_log, relationship_edges, and the deleted_at columns. Nothing existing changes.
Verify: sqlite3 /data/crm.db "SELECT filename FROM schema_migrations;" → should list 0001_phase0_foundation.sql.
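The same check can be scripted so it is repeatable after every sideload. This is a minimal sketch that assumes only what Step 1 states: a schema_migrations table with a filename column, and the four new tables. The function names migration_applied and missing_tables are hypothetical and not part of the repo.

```python
import sqlite3

# Tables that 0001_phase0_foundation.sql is described as creating.
EXPECTED_TABLES = {
    "canonical_entities",
    "entity_links",
    "interaction_log",
    "relationship_edges",
}


def migration_applied(conn: sqlite3.Connection, filename: str) -> bool:
    """Return True if `filename` is recorded in schema_migrations."""
    row = conn.execute(
        "SELECT 1 FROM schema_migrations WHERE filename = ?", (filename,)
    ).fetchone()
    return row is not None


def missing_tables(conn: sqlite3.Connection) -> set[str]:
    """Return the expected Phase-0 tables that do not exist yet."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    present = {name for (name,) in rows}
    return EXPECTED_TABLES - present
```

On the box, opening the database read-only keeps the check side-effect free, for example sqlite3.connect("file:/data/crm.db?mode=ro", uri=True). If schema_migrations does not exist at all, migration_applied raises sqlite3.OperationalError, which is itself a sign that init_db() has not run.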

Step 2 — Prepare the ingest environment (on the box)

pip install fastembed                 # BM25 sparse encoder, model Qdrant/bm25 (sparse.py auto-detects it)
export CRM_DB_PATH=/data/crm.db
export SPARK_CONTROL_URL=https://192.168.1.72:62419
export SPARK_CONTROL_VERIFY_TLS=false
export QDRANT_URL=http://192.168.1.87:6333

sparse.py will report BACKEND = fastembed:Qdrant/bm25 here (vs the pure-Python fallback used on the dev Mac). Because the index is built and queried on the box, the encoder is consistent end-to-end.
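Before running the ingest scripts it can help to confirm the environment is complete. The sketch below checks only the variables exported in Step 2; the helper name missing_env is hypothetical, and treating SPARK_CONTROL_VERIFY_TLS as optional is an assumption, since the runbook does not say what its default is.

```python
import os

# Variables the ingest scripts are described as reading in Step 2.
REQUIRED_VARS = ("CRM_DB_PATH", "SPARK_CONTROL_URL", "QDRANT_URL")


def missing_env(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env() with no argument inspects the current process environment; an empty list means all three variables are set.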

Step 3 — Build the canonical IDs from your real data

python3 backend/ingest/entity_resolution.py --db /data/crm.db --show-candidates

This reads your real contacts / fundraising investors / organizations and fills canonical_entities + entity_links (the "create entity IDs from existing data" step). It is read-only on your CRM source tables, idempotent, and logs a run to interaction_log. Review the printed fuzzy candidates: these are the name-variant pairs that the deterministic tier declined to merge on a guess. The local-Qwen fuzzy tier, still to be built, will resolve them.
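To illustrate the split between a certain merge and a flagged candidate, here is a small sketch. It is not the logic in entity_resolution.py, which this runbook does not show: the normalization rule, the difflib similarity measure, the 0.85 threshold, and the function names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def resolve(names, threshold=0.85):
    """Split name pairs into certain merges and fuzzy candidates.

    Pairs whose normalized forms are identical are merged outright.
    Pairs that are merely similar are only flagged for later review.
    """
    merges, candidates = [], []
    for a, b in combinations(names, 2):
        na, nb = normalize(a), normalize(b)
        if na == nb:
            merges.append((a, b))
        elif SequenceMatcher(None, na, nb).ratio() >= threshold:
            candidates.append((a, b))
    return merges, candidates
```

With the names "Acme Capital", "ACME Capital." and "Acme Capitol", the first two normalize to the same string and merge, while "Acme Capitol" is only flagged against each of them, which is the kind of pair --show-candidates is meant to surface.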

Step 4 — Build the search index

python3 backend/ingest/backfill.py --db /data/crm.db --recreate

Chunks your real records, embeds each chunk as a dense vector (bge-m3 via Spark Control) plus a BM25 sparse vector, and upserts both to the Qdrant collection crm_chunks. A full corpus takes roughly 8–15 min. The run is idempotent because point ids are deterministic, so re-running it is safe. --recreate drops and rebuilds the collection; omit it to update in place.
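Idempotence here rests on deterministic point ids: the same chunk always maps to the same id, so an upsert overwrites instead of duplicating. How backfill.py derives its ids is not shown in this runbook, so the sketch below is only one common way to do it, using a name-based UUID (Qdrant accepts UUIDs and unsigned integers as point ids). The namespace, the key layout, and the function name are assumptions.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "crm_chunks")


def point_id(source_table: str, row_id: int, chunk_index: int) -> str:
    """Derive a stable point id from the chunk's origin."""
    key = f"{source_table}:{row_id}:{chunk_index}"
    return str(uuid.uuid5(CHUNK_NAMESPACE, key))
```

Because uuid5 is a pure function of its inputs, rebuilding the index from an unchanged row produces the same id on every run and on every machine.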

Note: your live CRM's text is concentrated in the fundraising grid notes + grid contacts (the seed snapshot had 0 communications / 0 lp_profiles), plus Gmail once enabled (see docs/gmail-enablement-runbook.md). The chunker already handles all of these.

Step 5 — Start the MCP server

pip install mcp
CRM_DB_PATH=/data/crm.db python3 backend/mcp/server.py

Register it with the Agent SDK / Claude Code as an stdio MCP server. It exposes reads, the three retrieval modes, and logged writes — no outbound/contact tools (Phase 3 gate). For Phase 0 there are no live agents; this is for testing and the internal-only Analyst work later.

Step 6 — Incremental sync (NOT YET BUILT — Workstream B4)

The full backfill is one-shot. Keeping the index fresh as the CRM changes (new grid edits, new emails) needs an incremental, idempotent sync on a schedule. This is the remaining Phase-0 ingest piece; until it's built, re-run Steps 3–4 to refresh.
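One common shape for such a sync is a watermark: remember the latest modification time already indexed and re-embed only rows changed after it. The sketch below is a design illustration, not the Workstream B4 implementation. It assumes each source table has id and updated_at columns (not confirmed by this runbook), and embed_and_upsert is a stand-in callable supplied by the caller.

```python
import sqlite3


def changed_since(conn: sqlite3.Connection, table: str, watermark: str):
    """Return (id, updated_at) for rows modified after the watermark."""
    # `table` is interpolated into the SQL, so it must come from a fixed
    # allow-list in the calling code, never from user input.
    sql = f"SELECT id, updated_at FROM {table} WHERE updated_at > ? ORDER BY updated_at"
    return conn.execute(sql, (watermark,)).fetchall()


def sync_once(conn, table, watermark, embed_and_upsert):
    """Re-embed changed rows and return the new watermark."""
    rows = changed_since(conn, table, watermark)
    if not rows:
        return watermark  # nothing changed; watermark stays put
    embed_and_upsert([row_id for row_id, _ in rows])
    return rows[-1][1]  # latest updated_at seen in this batch
```

The watermark advances only after embed_and_upsert returns, so a failed run is retried from the same point. Combined with deterministic point ids, a repeated batch overwrites rather than duplicates. A strict > comparison can miss a row written with the same timestamp as the watermark, so a real implementation would need a tie-breaker or a small overlap window.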

Verification

SELECT entity_kind, COUNT(*) FROM canonical_entities GROUP BY entity_kind;   -- IDs built
SELECT COUNT(*) FROM entity_links;                                            -- source rows linked

curl -s "$QDRANT_URL/collections/crm_chunks" | python3 -c "import sys,json;print('points:', json.load(sys.stdin)['result']['points_count'])"
python3 backend/ingest/search.py "Fund III wire timeline" --mode hybrid       # sanity query
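If the curl one-liner is awkward to reuse, the point count can be read with the standard library alone. This sketch relies on the same result.points_count field that the curl command above parses; the function names are hypothetical.

```python
import json
import urllib.request


def points_count(payload: dict) -> int:
    """Extract the point count from a collection-info response body."""
    return int(payload["result"]["points_count"])


def fetch_points_count(qdrant_url: str, collection: str = "crm_chunks") -> int:
    """GET the collection info from Qdrant and return its point count."""
    url = f"{qdrant_url}/collections/{collection}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return points_count(json.load(resp))
```

On the box, fetch_points_count(os.environ["QDRANT_URL"]) (after import os) returns a single integer, which keeps the check in line with the rule that scripts print counts, not records.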

Open decision — packaging (how the init + MCP run on the box)

The ingest scripts read /data/crm.db by file path, so they must run where that file lives — inside or beside the CRM container (the dev Mac cannot open the container's SQLite file directly). Options, to decide before go-live:

A (recommended): same image. Bundle backend/ingest + backend/mcp (+ fastembed, mcp) into the CRM container image; expose the init as a one-shot Start9 action and run the MCP server as a second daemon in the 0.4 startos manifest. The image is already Python 3.11 with the volume mounted.
B: sidecar container on the box mounting the same /data volume.
C: co-located host with a copy of /data and LAN access to the Sparks (involves copying the DB — least clean).

This packaging wiring (and Step 6) is the remaining build work for a fully turn-key go-live.

Sovereignty checkpoint

Every step above runs on Ten31 infrastructure. Real records flow crm.db → local Spark (bge-m3) → local Qdrant and never reach Anthropic. The scripts print counts, not records. Keep it that way: don't paste query results from real data back into a Claude session (guardrail #9).
