ten31-database/docs/go-live-runbook.md
Keysat f357c23c75 Phase 0 complete: fuzzy entity tier, incremental sync, Start9 packaging
- Fuzzy tier (backend/ingest/fuzzy_resolve.py + llm.py): local Qwen adjudicates
  the deterministic resolver's flagged name-variant candidates; merges are
  durable via entity_merges (deterministic re-runs respect them), losers
  soft-deleted, logged. Idempotent.
- Incremental sync (backend/ingest/sync.py): re-embeds only rows changed since a
  watermark (ingest_sync_state); first run / --recreate = full. Tested full→0→1.
- Start9 packaging (start9/0.4): Dockerfile bundles ingest+mcp + fastembed/mcp;
  "Build search index" action runs the init in a subcontainer; MCP shipped as a
  manual stdio server (not a daemon); version 0.1.0:44. INGEST_PACKAGING.md.
- backfill.py: factored embed_and_upsert() shared with sync.

Verified end-to-end on synthetic data + live Sparks/Qwen/Qdrant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:55:12 -05:00

# Go-Live Runbook — Phase 0 substrate on the live Start9 box
*How to take the Phase-0 data substrate from "tested on synthetic data" to "running against the real CRM" on the Start9 server. You run this on your infrastructure; no real LP data goes to Claude/Anthropic (guardrails #1, #9). The live `/data/crm.db` on the box is the canonical source — not the possibly-stale `start9/0.4/seed/` snapshot.*
Recap of the three moves (see also `docs/crm-overview.md`): (1) ship code → empty new tables appear; (2) run the one-time init → fills the canonical IDs + search index from your real data; (3) run the MCP server.
---
## Prerequisites
- Spark Control + Qdrant reachable from the box: `SPARK_CONTROL_URL`, `QDRANT_URL` (see `.env.example`). Verify with `curl -sk $SPARK_CONTROL_URL/api/endpoints`.
- The `backend/ingest/` + `backend/mcp/` code present on the box (ships with the package — see "Packaging decision" below).
- Python deps in the ingest environment: `fastembed` (BM25; installs cleanly on the box's Python 3.11) and `mcp` (only to run the MCP server). The CRM server itself needs no new deps.
## Step 1 — Deploy the new CRM version (auto-creates the empty tables)
1. Bump the package version, rebuild the `.s9pk`, sideload it. StartOS preserves `/data`, so your real data is undisturbed.
2. On first boot, `init_db()` runs `backend/core_migrations.py`, which applies `migrations/0001_phase0_foundation.sql` **once** (tracked in `schema_migrations`) — additively creating `canonical_entities`, `entity_links`, `interaction_log`, `relationship_edges`, and the `deleted_at` columns. Nothing existing changes.
3. Verify: `sqlite3 /data/crm.db "SELECT filename FROM schema_migrations;"` → should list `0001_phase0_foundation.sql`.
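The same check can be scripted so it also confirms the four new tables exist. This is a minimal sketch that assumes only what the steps above state (the `schema_migrations.filename` column and the table names); the function name `check_phase0_schema` is hypothetical and not part of the repo.

```python
import sqlite3

MIGRATION = "0001_phase0_foundation.sql"
EXPECTED_TABLES = (
    "canonical_entities",
    "entity_links",
    "interaction_log",
    "relationship_edges",
)


def check_phase0_schema(conn):
    """Return (migration_recorded, missing_tables) for an open SQLite connection."""
    existing = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    recorded = False
    if "schema_migrations" in existing:
        row = conn.execute(
            "SELECT 1 FROM schema_migrations WHERE filename = ?", (MIGRATION,)
        ).fetchone()
        recorded = row is not None
    missing = [name for name in EXPECTED_TABLES if name not in existing]
    return recorded, missing
```

To keep the check strictly read-only on the box, open the database with `sqlite3.connect("file:/data/crm.db?mode=ro", uri=True)`. A healthy result is `(True, [])`.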
## Step 2 — Prepare the ingest environment (on the box)
```bash
pip install fastembed  # provides the Qdrant/bm25 sparse encoder (sparse.py auto-detects it)
export CRM_DB_PATH=/data/crm.db
export SPARK_CONTROL_URL=https://192.168.1.72:62419
export SPARK_CONTROL_VERIFY_TLS=false
export QDRANT_URL=http://192.168.1.87:6333
```
`sparse.py` will report `BACKEND = fastembed:Qdrant/bm25` here (vs the pure-Python fallback used on the dev Mac). Because the index is built **and** queried on the box, the encoder is consistent end-to-end.
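Before moving on, it can help to confirm the environment is complete and both services answer. The sketch below is a hypothetical preflight helper, not a script in the repo. It assumes `SPARK_CONTROL_VERIFY_TLS=false` means "skip certificate verification" (the equivalent of `curl -k`) and uses Qdrant's `GET /collections` listing endpoint as a liveness probe.

```python
import os
import ssl
import urllib.request

REQUIRED_VARS = ("CRM_DB_PATH", "SPARK_CONTROL_URL", "QDRANT_URL")


def missing_env(env):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def tls_verification_enabled(env):
    """Assumption: only the literal value 'false' (any case) disables verification."""
    return env.get("SPARK_CONTROL_VERIFY_TLS", "true").strip().lower() != "false"


def probe(url, verify_tls=True, timeout=5):
    """GET the URL and return the HTTP status; raises on connection failure."""
    context = None
    if url.startswith("https://") and not verify_tls:
        # Acceptable only for a self-signed endpoint on the trusted LAN.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.status


def main():
    missing = missing_env(os.environ)
    if missing:
        raise SystemExit("unset: " + ", ".join(missing))
    spark = os.environ["SPARK_CONTROL_URL"].rstrip("/") + "/api/endpoints"
    print("spark control:", probe(spark, tls_verification_enabled(os.environ)))
    qdrant = os.environ["QDRANT_URL"].rstrip("/") + "/collections"
    print("qdrant:", probe(qdrant))
```

Call `main()` from the shell where the exports above were run. It prints status codes only, so nothing from the CRM leaves the box.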
## Step 3 — Build the canonical IDs from your real data
```bash
python3 backend/ingest/entity_resolution.py --db /data/crm.db --show-candidates
```
This reads your real contacts / fundraising investors / organizations and fills `canonical_entities` + `entity_links` (the "create entity IDs from existing data" step). It is **read-only on your CRM source tables**, idempotent, and logs a run to `interaction_log`. Review the printed fuzzy candidates — those are the name-variant pairs the deterministic tier wouldn't merge on a guess (the local-Qwen fuzzy tier, still to be built, resolves these).
## Step 4 — Build the search index
```bash
python3 backend/ingest/backfill.py --db /data/crm.db --recreate
```
Chunks your real records → dense (bge-m3 via Spark Control) + BM25 sparse → upserts to Qdrant `crm_chunks`. ~8–15 min for a full corpus. Idempotent (deterministic point ids), so re-running is safe. `--recreate` drops and rebuilds the collection; omit it to update in place.
Note: your live CRM's text is concentrated in the **fundraising grid notes** + grid contacts (the seed snapshot had 0 communications / 0 lp_profiles), plus Gmail once enabled (see `docs/gmail-enablement-runbook.md`). The chunker already handles all of these.
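Why deterministic point ids make the backfill idempotent: if a chunk's id is derived from its identity rather than generated randomly, a re-run upserts onto the same Qdrant point instead of creating a duplicate. The sketch below illustrates the idea with UUIDv5 (Qdrant accepts UUIDs and unsigned integers as point ids). The key layout `table:row:chunk` and the namespace are assumptions for illustration; the actual scheme in `backfill.py` may differ.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "crm_chunks")


def point_id(source_table, row_id, chunk_index):
    """Derive a stable UUID string from a chunk's identity, not its content."""
    key = f"{source_table}:{row_id}:{chunk_index}"
    return str(uuid.uuid5(CHUNK_NAMESPACE, key))
```

Because the id depends only on where the chunk came from, editing a record's text and re-running overwrites the old vector in place. Rows deleted from the CRM still need separate handling, since nothing re-derives their ids.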
## Step 5 — Start the MCP server
```bash
pip install mcp
CRM_DB_PATH=/data/crm.db python3 backend/mcp/server.py
```
Register it with the Agent SDK / Claude Code as an stdio MCP server. It exposes reads, the three retrieval modes, and logged writes — **no outbound/contact tools** (Phase 3 gate). For Phase 0 there are no live agents; this is for testing and the internal-only Analyst work later.
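A quick smoke test before registering the server is to spawn it over stdio and list its tools, which also lets you confirm that no outbound/contact tools are exposed. This sketch uses the `mcp` Python SDK's stdio client. The helper names are hypothetical, and it assumes the server reads the same environment variables as the ingest scripts, so they are passed through explicitly (the SDK hands child processes only a minimal default environment).

```python
import asyncio
import os

PASS_THROUGH = ("SPARK_CONTROL_URL", "SPARK_CONTROL_VERIFY_TLS", "QDRANT_URL")


def server_env(db_path, parent_env):
    """Build the child environment: the DB path plus any endpoint settings that are set."""
    env = {name: parent_env[name] for name in PASS_THROUGH if name in parent_env}
    env["CRM_DB_PATH"] = db_path
    return env


async def list_tool_names(db_path="/data/crm.db"):
    """Spawn the server over stdio, complete the MCP handshake, return its tool names."""
    # Imported lazily so server_env() stays usable without the `mcp` package.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(
        command="python3",
        args=["backend/mcp/server.py"],
        env=server_env(db_path, os.environ),
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.list_tools()
            return [tool.name for tool in result.tools]


# Run from the repo root with: asyncio.run(list_tool_names())
```

Only tool names are returned, never tool output, so the check stays within the "counts, not records" habit used elsewhere in this runbook.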
## Step 6 — Incremental sync (NOT YET BUILT — Workstream B4)
The full backfill is one-shot. Keeping the index fresh as the CRM changes (new grid edits, new emails) needs an incremental, idempotent sync on a schedule. This is the remaining Phase-0 ingest piece; until it's built, re-run Steps 3–4 to refresh.
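One common shape for such a sync is a watermark: remember the newest change already indexed, and on each run re-embed only rows modified after it. The sketch below is illustrative only and not the Workstream B4 design. It assumes each source table has an `id` and an ISO-8601 `updated_at` column, and keeps state in a table named `ingest_sync_state` with hypothetical columns `(source, watermark)`.

```python
import sqlite3


def read_watermark(conn, source):
    """Return the stored watermark for a source, or '' if it has never synced."""
    row = conn.execute(
        "SELECT watermark FROM ingest_sync_state WHERE source = ?", (source,)
    ).fetchone()
    return row[0] if row else ""


def rows_changed_since(conn, table, watermark):
    """Return (id, updated_at) for rows modified after the watermark, oldest first."""
    # `table` is interpolated into SQL, so pass only trusted, hard-coded names.
    query = f"SELECT id, updated_at FROM {table} WHERE updated_at > ? ORDER BY updated_at"
    return conn.execute(query, (watermark,)).fetchall()


def advance_watermark(conn, source, rows):
    """Persist the newest updated_at seen; call only after the upsert succeeded."""
    if not rows:
        return
    conn.execute(
        "INSERT OR REPLACE INTO ingest_sync_state (source, watermark) VALUES (?, ?)",
        (source, rows[-1][1]),
    )
    conn.commit()
```

An empty watermark makes the first run a full pass, and a run with no changes touches nothing. The strict `>` comparison can miss a row written with the same timestamp as the watermark, so a real implementation would need finer-grained timestamps or a tie-breaker; deterministic point ids keep any re-embedding safe.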
## Verification
```sql
SELECT entity_kind, COUNT(*) FROM canonical_entities GROUP BY entity_kind; -- IDs built
SELECT COUNT(*) FROM entity_links; -- source rows linked
```
```bash
curl -s "$QDRANT_URL/collections/crm_chunks" | python3 -c "import sys,json;print('points:', json.load(sys.stdin)['result']['points_count'])"
python3 backend/ingest/search.py "Fund III wire timeline" --mode hybrid # sanity query
```
## Open decision — packaging (how the init + MCP run on the box)
The ingest scripts read `/data/crm.db` by file path, so they must run **where that file lives** — inside or beside the CRM container (the dev Mac cannot open the container's SQLite file directly). Options, to decide before go-live:
- **A (recommended): same image.** Bundle `backend/ingest` + `backend/mcp` (+ `fastembed`, `mcp`) into the CRM container image; expose the init as a one-shot Start9 action and run the MCP server as a second daemon in the 0.4 `startos` manifest. The image is already Python 3.11 with the volume mounted.
- **B: sidecar container** on the box mounting the same `/data` volume.
- **C: co-located host** with a copy of `/data` and LAN access to the Sparks (involves copying the DB — least clean).
This packaging wiring (and Step 6) is the remaining build work for a fully turn-key go-live.
## Sovereignty checkpoint
Every step above runs on Ten31 infrastructure. Real records flow `crm.db → local Spark (bge-m3) → local Qdrant` and never reach Anthropic. The scripts print counts, not records. Keep it that way: don't paste query *results* over real data back into a Claude session (guardrail #9).