ten31-database/start9/0.4/INGEST_PACKAGING.md
Keysat f357c23c75 Phase 0 complete: fuzzy entity tier, incremental sync, Start9 packaging
- Fuzzy tier (backend/ingest/fuzzy_resolve.py + llm.py): local Qwen adjudicates
  the deterministic resolver's flagged name-variant candidates; merges are
  durable via entity_merges (deterministic re-runs respect them), losers
  soft-deleted, logged. Idempotent.
- Incremental sync (backend/ingest/sync.py): re-embeds only rows changed since a
  watermark (ingest_sync_state); first run / --recreate = full. Tested full→0→1.
- Start9 packaging (start9/0.4): Dockerfile bundles ingest+mcp + fastembed/mcp;
  "Build search index" action runs the init in a subcontainer; MCP shipped as a
  manual stdio server (not a daemon); version 0.1.0:44. INGEST_PACKAGING.md.
- backfill.py: factored embed_and_upsert() shared with sync.

Verified end-to-end on synthetic data + live Sparks/Qwen/Qdrant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:55:12 -05:00


Phase-0 ingest packaging (StartOS 0.4)

How the Phase-0 data substrate — the ingest pipeline (backend/ingest/) and the CRM MCP server (backend/mcp/) — ships and runs on the live StartOS 0.4 package, without changing the CRM web server. This implements Option A ("same image") from docs/go-live-runbook.md §"Open decision — packaging".

The CRM web server (backend/server.py) is untouched and gains no new dependencies. The primary daemon and its checkPortListening health check are unchanged.

What changed

| File | Change |
| --- | --- |
| Dockerfile | COPY backend/ingest and COPY backend/mcp into the image alongside backend/server.py. Added two runtime deps to the existing pip install: fastembed==0.4.2 (client-side BM25 / Qdrant/bm25 for the sparse retrieval leg) and mcp==1.2.0 (MCP Python SDK, only for backend/mcp/server.py). |
| docker_entrypoint.sh | Added an export block for the ingest/retrieval env: CRM_DB_PATH, SPARK_CONTROL_URL, SPARK_CONTROL_VERIFY_TLS, QDRANT_URL, with LAN-default placeholder values and an operator comment. The CRM web server ignores these; they exist so manual python3 /app/backend/ingest/... and backend/mcp/server.py runs on the box inherit them. |
| startos/actions/buildSearchIndex.ts | New. A one-shot "Build search index" StartOS action (Steps 3–4 of the runbook). |
| startos/actions/index.ts | Registered the new action: sdk.Actions.of().addAction(buildSearchIndex). |
| startos/versions/v0.1.0.44.ts + versions/index.ts | New version 0.1.0:44 (image-only change, no data migration) set as current; 0.1.0:43 moved to other. |
| startos/utils.ts | Bumped the informational PACKAGE_VERSION constant to 0.1.0:44. |

Action registration mechanism (verified)

Actions are collected in startos/actions/index.ts as export const actions = sdk.Actions.of().addAction(...), and that actions object is passed into sdk.setupInit(...) in startos/init/index.ts (and re-exported from startos/index.ts). Adding .addAction(buildSearchIndex) is the entire registration — no manifest entry is required for actions in the 0.4 SDK.

How the operator triggers the index build

  1. Build/sideload the new .s9pk (version 0.1.0:44). StartOS preserves /data, so live data is undisturbed. On first boot the CRM's own migration runner creates the Phase-0 tables (see runbook Step 1) — that is independent of this packaging change.
  2. In the StartOS UI, open the Ten31 Database service → Actions → Build search index, and run it. It:
    • runs entity_resolution.py --db /data/crm.db (canonical ids + links), then
    • runs backfill.py --db /data/crm.db --recreate (chunk → dense via Spark Control + BM25 → upsert to Qdrant crm_chunks). Both steps are idempotent and read-only on the CRM source tables, so the action is safe to re-run any time to refresh the index. A full re-embed is ~8–15 min (the action allows up to 30 min before timing out).

The action runs in its own subcontainer with the same main volume mounted at /data, with cwd=/app/backend/ingest (the ingest scripts import their siblings by bare name, e.g. import config, so they must run from that directory). It uses allowedStatuses: 'any' — SQLite WAL mode makes a concurrently-running CRM safe for these reads/derived writes.
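For reference, a minimal Python sketch of what the action's two steps amount to when run by hand from inside the container. This is not the action's actual implementation (that lives in buildSearchIndex.ts); the helper names `build_steps` and `run_index_build` are hypothetical, and the 30 min limit is applied per step here because the document does not say whether the action's limit is per step or overall.

```python
import subprocess
from pathlib import Path

# The ingest scripts import siblings by bare name (e.g. `import config`),
# so they must be launched with this directory as the working directory.
INGEST_DIR = Path("/app/backend/ingest")
DB_PATH = "/data/crm.db"
TIMEOUT_S = 30 * 60  # assumption: reuse the action's 30 min ceiling, per step


def build_steps(db_path: str = DB_PATH) -> list[list[str]]:
    """Return the two commands in the order the action runs them."""
    return [
        ["python3", "entity_resolution.py", "--db", db_path],
        ["python3", "backfill.py", "--db", db_path, "--recreate"],
    ]


def run_index_build(db_path: str = DB_PATH, cwd: Path = INGEST_DIR) -> None:
    """Run entity resolution, then the full backfill, stopping on failure."""
    for cmd in build_steps(db_path):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the backfill never runs on top of a failed resolution pass.
        subprocess.run(cmd, cwd=cwd, check=True, timeout=TIMEOUT_S)
```

Calling `run_index_build()` inside the container would run both steps against /data/crm.db. Because both steps are described as idempotent, re-running after a failure should be safe.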

Env / config the operator must set (Spark URLs)

The ingest run reaches out to Spark Control (dense embeddings) and Qdrant (upserts). Those endpoints are LAN-specific and are defined in two places, both of which the operator must point at their own network. The current values are the Ten31 LAN defaults:

| Variable | Default | Used by |
| --- | --- | --- |
| SPARK_CONTROL_URL | https://192.168.1.72:62419 | dense embeds (/v1/embeddings) |
| SPARK_CONTROL_VERIFY_TLS | false (Spark Control uses a self-signed cert) | TLS verification toggle |
| QDRANT_URL | http://192.168.1.87:6333 | Qdrant collection admin + upserts |
| CRM_DB_PATH | /data/crm.db | both scripts + MCP server (already correct) |

Where to set them:

  • docker_entrypoint.sh — for manual python3 / MCP runs via the running container. Edit the ${VAR:-default} block, or override via the StartOS service environment.
  • startos/actions/buildSearchIndex.ts (ingestEnv) — for the "Build search index" action, which runs in its own subcontainer and does not execute the entrypoint, so it carries its own copy of the values. Edit these to match.

Keep the two copies in sync. They are duplicated because the action's subcontainer never runs docker_entrypoint.sh; there is no shared config store wired into this package today (see "Still needed" below).
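The fallback behaviour of the entrypoint's ${VAR:-default} block can be sketched in Python as follows. This is an illustration only: `resolve_ingest_env` and `verify_tls` are hypothetical helpers, and the way the real backend/ingest/config.py parses the TLS flag is an assumption.

```python
import os
from typing import Mapping

# Ten31 LAN defaults, copied from the table above.
DEFAULTS = {
    "CRM_DB_PATH": "/data/crm.db",
    "SPARK_CONTROL_URL": "https://192.168.1.72:62419",
    "SPARK_CONTROL_VERIFY_TLS": "false",
    "QDRANT_URL": "http://192.168.1.87:6333",
}


def resolve_ingest_env(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Mimic shell ${VAR:-default}: unset OR empty falls back to the default."""
    return {key: (env.get(key) or default) for key, default in DEFAULTS.items()}


def verify_tls(cfg: Mapping[str, str]) -> bool:
    """Assumed parsing: only an explicit truthy string enables verification."""
    return cfg["SPARK_CONTROL_VERIFY_TLS"].strip().lower() in {"1", "true", "yes"}
```

For example, `resolve_ingest_env({"QDRANT_URL": "http://10.0.0.5:6333"})` overrides only the Qdrant endpoint and leaves the other three values at their defaults.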

Verify reachability from the box before running the action: curl -sk $SPARK_CONTROL_URL/api/endpoints and curl -s $QDRANT_URL/collections.
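Where curl is not convenient, the same two probes can be approximated with the Python standard library. This is a sketch, assuming the endpoints answer a plain GET with a 2xx status; `probe_targets` and `is_reachable` are hypothetical names. Disabling certificate checks mirrors curl's -k flag and is only appropriate because Spark Control uses a self-signed cert.

```python
import http.client
import ssl
import urllib.request


def probe_targets(spark_url: str, qdrant_url: str) -> list[str]:
    """Build the two URLs probed by the curl commands above."""
    return [
        spark_url.rstrip("/") + "/api/endpoints",
        qdrant_url.rstrip("/") + "/collections",
    ]


def is_reachable(url: str, verify_tls: bool = False, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` yields a 2xx response within `timeout`."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Equivalent of `curl -k`; check_hostname must be cleared first.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError, TLS failures, refused connections and timeouts
        # are all OSError subclasses; malformed responses are HTTPException.
        return False
```

A typical use is `all(is_reachable(u) for u in probe_targets(spark_url, qdrant_url))` before triggering the action.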

MCP server: decision and how to run it

Decision: the MCP server is NOT a daemon in this release — it is shipped in the image and run manually. Rationale:

  • backend/mcp/server.py is an stdio MCP server (mcp.run() with FastMCP): it has no network port to bind, so the StartOS daemon model (a long-running process with a checkPortListening health check, like primary) does not fit it. There is nothing to port-probe and no meaningful liveness signal.
  • Phase 0 has no live agents (per CLAUDE.md and the runbook): nothing on the box would connect to it. An always-on daemon would idle with no client on its stdin and no health semantics.
  • It exposes reads, the three retrieval modes, and logged writes — no outbound/contact tools (Phase 3 compliance gate). It is for testing and later internal-only Analyst work.

To run it manually on the box (it is present at /app/backend/mcp/server.py with mcp already installed):

```shell
# from inside the running container
CRM_DB_PATH=/data/crm.db python3 /app/backend/mcp/server.py
```

Then register it with the Agent SDK / Claude Code as an stdio MCP server pointing at that script (it inherits the Spark/Qdrant env exported by the entrypoint).
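As an illustration of that registration, the sketch below generates a client config entry in the `mcpServers` layout that Claude Code reads from a project `.mcp.json`. Assumptions: the client is launched somewhere `python3` and the script path resolve directly (for example inside the container), and the server name `ten31-crm` is a hypothetical label, not something defined by this package.

```python
import json


def stdio_server_entry(
    script: str = "/app/backend/mcp/server.py",
    db_path: str = "/data/crm.db",
) -> dict:
    """Describe how a client should spawn the stdio MCP server."""
    return {
        "command": "python3",
        "args": [script],
        # Only CRM_DB_PATH is set here; the Spark/Qdrant variables are
        # expected to come from the environment exported by the entrypoint.
        "env": {"CRM_DB_PATH": db_path},
    }


# "ten31-crm" is an illustrative name chosen for this sketch.
config = {"mcpServers": {"ten31-crm": stdio_server_entry()}}
print(json.dumps(config, indent=2))
```

The client spawns the command and speaks MCP over the child's stdin/stdout, which is why no port or health check is involved.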

If/when a live agent needs it as a persistent service, the cleanest upgrade is to add it as a second daemon in startos/main.ts mirroring the primary daemon — but only after giving it a network transport (e.g. an HTTP/SSE MCP endpoint on its own port) so it has a real checkPortListening health check. That is deliberately deferred to a later phase.

Still needed for a fully turn-key deploy

  • MCP-as-a-service — see above. Deferred until there is a live agent and a network transport; today it is manual/stdio only.
  • Incremental sync (runbook Step 6 / Workstream B4) — the action does a full one-shot rebuild. Keeping the index fresh as the CRM changes needs an incremental, idempotent sync on a schedule. Until that exists, re-running the "Build search index" action is the refresh path. When built, it could be wired as a recurring StartOS action/task rather than a manual re-run.
  • Single source of truth for Spark/Qdrant config — currently duplicated in docker_entrypoint.sh and buildSearchIndex.ts. A small StartOS config store + input form (the SDK supports Action.withInput and a service config) would let the operator set the endpoints once in the UI; deferred to keep this change minimal and reviewable.
  • .env on the box: backend/ingest/config.py also reads /app/.env if present (via os.environ.setdefault, so it does not override the exported env). Not required given the exported env above, but available as an alternative if the operator prefers a file.
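The "does not override" behaviour of the .env fallback in the last item can be sketched as below. This illustrates the setdefault pattern only and is not the actual code in backend/ingest/config.py; the function name `load_dotenv_defaults` and the simple KEY=VALUE parsing rules are assumptions.

```python
import os
from pathlib import Path
from typing import MutableMapping


def load_dotenv_defaults(
    path: str = "/app/.env",
    env: MutableMapping[str, str] = os.environ,
) -> list[str]:
    """Apply KEY=VALUE lines from `path` without overriding existing keys.

    Returns the keys that were actually taken from the file.
    """
    dotenv = Path(path)
    if not dotenv.is_file():
        return []  # the file is optional
    applied = []
    for raw in dotenv.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and malformed lines
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if key not in env:
            applied.append(key)
        # setdefault keeps any value already exported by the entrypoint.
        env.setdefault(key, value)
    return applied
```

With this pattern a value exported by docker_entrypoint.sh always wins, and the file only fills in variables that are otherwise unset.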

Constraints honored

  • No files under backend/ingest/, backend/mcp/, backend/server.py, backend/core_migrations.py, backend/migrations/, or data/ were modified; only start9/0.4/** and this new doc.
  • No build/deploy commands were run. npx tsc --noEmit was used only to verify the new TypeScript compiles against the SDK types.