Go-Live Runbook — Phase 0 substrate on the live Start9 box
How to take the Phase-0 data substrate from "tested on synthetic data" to "running against the real CRM" on the Start9 server. You run this on your infrastructure; no real LP data goes to Claude/Anthropic (guardrails #1, #9). The live /data/crm.db on the box is the canonical source — not the possibly-stale start9/0.4/seed/ snapshot.
Recap of the three moves (see also docs/crm-overview.md): (1) ship code → empty new tables appear; (2) run the one-time init → fills the canonical IDs + search index from your real data; (3) run the MCP server.
Prerequisites
- Spark Control + Qdrant reachable from the box: SPARK_CONTROL_URL, QDRANT_URL (see .env.example). Verify with curl -sk $SPARK_CONTROL_URL/api/endpoints.
- The backend/ingest/ + backend/mcp/ code present on the box (ships with the package; see "Packaging decision" below).
- Python deps in the ingest environment: fastembed (BM25; installs cleanly on the box's Python 3.11) and mcp (only to run the MCP server). The CRM server itself needs no new deps.
Step 1 — Deploy the new CRM version (auto-creates the empty tables)
- Bump the package version, rebuild the .s9pk, sideload it. StartOS preserves /data, so your real data is undisturbed.
- On first boot, init_db() runs backend/core_migrations.py, which applies migrations/0001_phase0_foundation.sql once (tracked in schema_migrations), additively creating canonical_entities, entity_links, interaction_log, relationship_edges, and the deleted_at columns. Nothing existing changes.
- Verify: sqlite3 /data/crm.db "SELECT filename FROM schema_migrations;" → should list 0001_phase0_foundation.sql.
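The Step 1 verification can also be scripted. The following is a minimal sketch that assumes only the migration filename and table names listed above; the helper name check_phase0_schema is hypothetical and not part of the shipped code, and it does not inspect the deleted_at columns.

```python
import sqlite3

# Names taken from Step 1; the helper itself is a hypothetical illustration.
EXPECTED_TABLES = {
    "canonical_entities",
    "entity_links",
    "interaction_log",
    "relationship_edges",
}
MIGRATION = "0001_phase0_foundation.sql"


def check_phase0_schema(conn: sqlite3.Connection) -> list[str]:
    """Return a list of problems; an empty list means the migration looks applied."""
    tables = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if "schema_migrations" not in tables:
        return ["schema_migrations table is missing"]

    problems = []
    applied = {row[0] for row in conn.execute("SELECT filename FROM schema_migrations")}
    if MIGRATION not in applied:
        problems.append(f"{MIGRATION} not recorded in schema_migrations")
    for name in sorted(EXPECTED_TABLES - tables):
        problems.append(f"table {name} is missing")
    return problems
```

On the box, pass a read-only connection such as sqlite3.connect("file:/data/crm.db?mode=ro", uri=True) so the check cannot modify live data.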
Step 2 — Prepare the ingest environment (on the box)
pip install fastembed # BM25 Qdrant/bm25 (sparse.py auto-detects it)
export CRM_DB_PATH=/data/crm.db
export SPARK_CONTROL_URL=https://192.168.1.72:62419
export SPARK_CONTROL_VERIFY_TLS=false
export QDRANT_URL=http://192.168.1.87:6333
sparse.py will report BACKEND = fastembed:Qdrant/bm25 here (vs the pure-Python fallback used on the dev Mac). Because the index is built and queried on the box, the encoder is consistent end-to-end.
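Before running the ingest scripts, it can help to confirm the environment from Step 2 is complete. This is a hedged sketch of such a preflight check: the function name preflight and the choice of which variables are mandatory are assumptions for illustration, not behaviour of the shipped scripts.

```python
from urllib.parse import urlparse

# Variables exported in Step 2; treating all three as required is an assumption.
REQUIRED_VARS = ("CRM_DB_PATH", "SPARK_CONTROL_URL", "QDRANT_URL")
URL_VARS = ("SPARK_CONTROL_URL", "QDRANT_URL")


def preflight(env: dict[str, str]) -> list[str]:
    """Return a list of configuration problems; empty means the env looks usable."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    for name in URL_VARS:
        value = env.get(name)
        if not value:
            continue  # already reported as missing above
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            problems.append(f"{name} is not an http(s) URL: {value!r}")
    return problems
```

Calling preflight(dict(os.environ)) after the exports in Step 2 should return an empty list; a non-empty list names what to fix before continuing.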
Step 3 — Build the canonical IDs from your real data
python3 backend/ingest/entity_resolution.py --db /data/crm.db --show-candidates
This reads your real contacts / fundraising investors / organizations and fills canonical_entities + entity_links (the "create entity IDs from existing data" step). It is read-only on your CRM source tables, idempotent, and logs a run to interaction_log. Review the printed fuzzy candidates — those are the name-variant pairs the deterministic tier wouldn't merge on a guess (the local-Qwen fuzzy tier, still to be built, resolves these).
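To illustrate the split between merging deterministically and merely flagging a candidate, here is a small sketch. It is not the logic of entity_resolution.py, whose rules are not described here: the normalization, the difflib similarity measure, and the 0.85 threshold are all assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def resolve(
    names: list[str], threshold: float = 0.85
) -> tuple[dict[str, list[str]], list[tuple[str, str, float]]]:
    """Merge only exact normalized matches; report near-matches for review."""
    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(normalize(name), []).append(name)

    # Near-matches are flagged, never merged: a human or later tier decides.
    candidates = []
    for a, b in combinations(sorted(groups), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            candidates.append((a, b, round(score, 2)))
    return groups, candidates
```

With this sketch, "Acme Capital" and "acme  capital." land in one group because they normalize identically, while "Jon Smith" and "John Smith" stay separate and appear as a candidate pair.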
Step 4 — Build the search index
python3 backend/ingest/backfill.py --db /data/crm.db --recreate
Chunks your real records → dense (bge-m3 via Spark Control) + BM25 sparse → upserts to Qdrant crm_chunks. ~8–15 min for a full corpus. Idempotent (deterministic point ids), so re-running is safe. --recreate drops and rebuilds the collection; omit it to update in place.
Note: your live CRM's text is concentrated in the fundraising grid notes + grid contacts (the seed snapshot had 0 communications / 0 lp_profiles), plus Gmail once enabled (see docs/gmail-enablement-runbook.md). The chunker already handles all of these.
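Re-running the backfill is safe because point ids are deterministic: the same chunk always maps to the same id, so an upsert overwrites rather than duplicates. One common way to get that property is a name-based UUID, sketched below. The key layout (source table, row id, chunk index) and the namespace are assumptions for illustration; backfill.py may derive its ids differently.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
CRM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "crm_chunks")


def point_id(source_table: str, row_id: int, chunk_index: int) -> str:
    """Derive a stable UUID string from the chunk's identity, not its content."""
    key = f"{source_table}:{row_id}:{chunk_index}"
    return str(uuid.uuid5(CRM_NAMESPACE, key))
```

Qdrant accepts UUID strings as point ids, so an id built this way can be upserted repeatedly and will replace the earlier point for the same chunk.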
Step 5 — Start the MCP server
pip install mcp
CRM_DB_PATH=/data/crm.db python3 backend/mcp/server.py
Register it with the Agent SDK / Claude Code as an stdio MCP server. It exposes reads, the three retrieval modes, and logged writes — no outbound/contact tools (Phase 3 gate). For Phase 0 there are no live agents; this is for testing and the internal-only Analyst work later.
Step 6 — Incremental sync (NOT YET BUILT — Workstream B4)
The full backfill is one-shot. Keeping the index fresh as the CRM changes (new grid edits, new emails) needs an incremental, idempotent sync on a schedule. This is the remaining Phase-0 ingest piece; until it's built, re-run Steps 3–4 to refresh.
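The usual shape for an incremental, idempotent sync is a per-source watermark: process only rows changed since the last recorded timestamp, then advance the timestamp. The sketch below shows that pattern under loud assumptions: the updated_at column, the sync_state table, and the process callback are all hypothetical names, and timestamps are assumed to be ISO-8601 text that sorts lexicographically.

```python
import sqlite3
from typing import Callable


def sync_once(
    conn: sqlite3.Connection, table: str, process: Callable[[list[int]], None]
) -> int:
    """Process rows changed since the stored watermark; return how many were handled."""
    row = conn.execute(
        "SELECT watermark FROM sync_state WHERE source = ?", (table,)
    ).fetchone()
    watermark = row[0] if row else ""  # empty watermark means a full first run

    # `table` must come from a fixed allowlist; it is interpolated, not bound.
    changed = conn.execute(
        f"SELECT id, updated_at FROM {table} WHERE updated_at > ? "
        "ORDER BY updated_at, id",
        (watermark,),
    ).fetchall()
    if not changed:
        return 0

    process([row_id for row_id, _ in changed])  # e.g. re-embed and upsert
    # Advance the watermark only after processing succeeds.
    conn.execute(
        "INSERT OR REPLACE INTO sync_state (source, watermark) VALUES (?, ?)",
        (table, changed[-1][1]),
    )
    conn.commit()
    return len(changed)
```

Because the watermark moves only after the callback returns, a failed run is retried in full on the next schedule tick, and deterministic point ids make that retry harmless. A strict greater-than comparison can miss a row written with exactly the watermark's timestamp, so a real implementation would need to account for that.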
Verification
SELECT entity_kind, COUNT(*) FROM canonical_entities GROUP BY entity_kind; -- IDs built
SELECT COUNT(*) FROM entity_links; -- source rows linked
curl -s "$QDRANT_URL/collections/crm_chunks" | python3 -c "import sys,json;print('points:', json.load(sys.stdin)['result']['points_count'])"
python3 backend/ingest/search.py "Fund III wire timeline" --mode hybrid # sanity query
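The checks above can be gathered into one script that prints counts only. This is a minimal sketch: the function names are hypothetical, and it relies on the same result.points_count field the curl one-liner reads from the Qdrant collection response.

```python
import json
import sqlite3
from urllib.request import urlopen


def entity_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Count canonical entities per kind (the first verification query)."""
    rows = conn.execute(
        "SELECT entity_kind, COUNT(*) FROM canonical_entities GROUP BY entity_kind"
    )
    return dict(rows.fetchall())


def points_count(payload: str) -> int:
    """Extract the point count from a Qdrant collection-info JSON response."""
    return json.loads(payload)["result"]["points_count"]


def fetch_points_count(qdrant_url: str, collection: str = "crm_chunks") -> int:
    """Fetch collection info over HTTP and return its point count."""
    with urlopen(f"{qdrant_url}/collections/{collection}", timeout=10) as resp:
        return points_count(resp.read().decode("utf-8"))
```

Like the rest of the runbook, this reports counts rather than records, which keeps its output safe to share.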
Open decision — packaging (how the init + MCP run on the box)
The ingest scripts read /data/crm.db by file path, so they must run where that file lives — inside or beside the CRM container (the dev Mac cannot open the container's SQLite file directly). Options, to decide before go-live:
- A (recommended): same image. Bundle backend/ingest + backend/mcp (+ fastembed, mcp) into the CRM container image; expose the init as a one-shot Start9 action and run the MCP server as a second daemon in the 0.4 startos manifest. The image is already Python 3.11 with the volume mounted.
- B: sidecar container on the box mounting the same /data volume.
- C: co-located host with a copy of /data and LAN access to the Sparks (involves copying the DB; least clean).
This packaging wiring (and Step 6) is the remaining build work for a fully turn-key go-live.
Sovereignty checkpoint
Every step above runs on Ten31 infrastructure. Real records flow crm.db → local Spark (bge-m3) → local Qdrant and never reach Anthropic. The scripts print counts, not records. Keep it that way: don't paste query results over real data back into a Claude session (guardrail #9).