# Phase-0 ingest packaging (StartOS 0.4)

How the Phase-0 data substrate (the ingest pipeline in `backend/ingest/` and the CRM MCP server in `backend/mcp/`) ships and runs on the live StartOS 0.4 package, **without changing the CRM web server**.

This implements **Option A** ("same image") from `docs/go-live-runbook.md` §"Open decision — packaging". The CRM web server (`backend/server.py`) is untouched and gains no new dependencies. The `primary` daemon and its `checkPortListening` health check are unchanged.

## What changed

| File | Change |
| --- | --- |
| `Dockerfile` | `COPY backend/ingest` and `COPY backend/mcp` into the image alongside `backend/server.py`. Added two runtime deps to the existing `pip install`: `fastembed==0.4.2` (client-side BM25 / `Qdrant/bm25` for the sparse retrieval leg) and `mcp==1.2.0` (MCP Python SDK, only for `backend/mcp/server.py`). **These two pins carry a pre-release multi-arch verification requirement; see "Pre-release checks" below.** |
| `docker_entrypoint.sh` | Added an export block for the ingest/retrieval env: `CRM_DB_PATH`, `SPARK_CONTROL_URL`, `SPARK_CONTROL_VERIFY_TLS`, `QDRANT_URL`, `CRM_INGEST_SYNC_INTERVAL_MIN`, with LAN-default placeholder values and an operator comment. The CRM web server ignores these; they exist so manual `python3 /app/backend/ingest/...` and `backend/mcp/server.py` runs on the box inherit them. Also launches the **background ingest sync scheduler** (`sync_scheduler.py`) before `exec`-ing the web server, guarded so it only starts when Spark Control + Qdrant are configured; see "Automatic: background sync scheduler" below. |
| `startos/actions/buildSearchIndex.ts` | **New.** A one-shot "Build search index" StartOS action (Steps 3–4 of the runbook): full rebuild with `--recreate`. |
| `startos/actions/refreshSearchIndex.ts` | **New.** A manual "Refresh search index" action: incremental, idempotent `sync.py` (no `--recreate`); the manual counterpart to the background scheduler. |
| `startos/actions/index.ts` | Registered both actions: `sdk.Actions.of().addAction(buildSearchIndex).addAction(refreshSearchIndex)`. |
| `startos/versions/v0.1.0.44.ts` + `versions/index.ts` | New version `0.1.0:44` (image-only change, no data migration) set as `current`; `0.1.0:43` moved to `other`. |
| `startos/utils.ts` | Bumped the informational `PACKAGE_VERSION` constant to `0.1.0:44`. |

### Action registration mechanism (verified)

Actions are collected in `startos/actions/index.ts` as `export const actions = sdk.Actions.of().addAction(...)`, and that `actions` object is passed into `sdk.setupInit(...)` in `startos/init/index.ts` (and re-exported from `startos/index.ts`). Chaining `.addAction(...)` for each action is the entire registration; no manifest entry is required for actions in the 0.4 SDK.

## How the operator triggers the index build

1. Build/sideload the new `.s9pk` (version `0.1.0:44`). StartOS preserves `/data`, so live data is undisturbed. On first boot the CRM's own migration runner creates the Phase-0 tables (see runbook Step 1); that is independent of this packaging change.
2. In the StartOS UI, open the **Ten31 Database** service → **Actions** → **Build search index**, and run it. It:
   - runs `entity_resolution.py --db /data/crm.db` (canonical ids + links), then
   - runs `backfill.py --db /data/crm.db --recreate` (chunk → dense via Spark Control + BM25 → upsert to Qdrant `crm_chunks`).

   Both steps are idempotent and read-only on the CRM source tables, so the action is safe to re-run at any time to refresh the index. A full re-embed takes roughly 8–15 min (the action allows up to 30 min before timing out).

The action runs in its **own subcontainer** with the same `main` volume mounted at `/data`, with `cwd=/app/backend/ingest` (the ingest scripts import their siblings by bare name, e.g. `import config`, so they must run from that directory). It uses `allowedStatuses: 'any'`: SQLite WAL mode makes a concurrently-running CRM safe for these reads/derived writes.
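
For a manual run from inside the container, the two steps of the build action can be approximated as below. This is a minimal sketch, not the action's implementation (which lives in `buildSearchIndex.ts`): `build_index_commands` and `run_build` are hypothetical helper names, and applying the 30-minute limit per step is an assumption.

```python
import subprocess
from pathlib import Path

# The ingest scripts import their siblings by bare name, so they must run
# with this directory as the working directory.
INGEST_DIR = Path("/app/backend/ingest")


def build_index_commands(db_path: str = "/data/crm.db") -> list[list[str]]:
    """Return the two full-rebuild commands in the order they must run."""
    return [
        # Step 1: canonical ids + links.
        ["python3", "entity_resolution.py", "--db", db_path],
        # Step 2: chunk, embed, and upsert; --recreate makes it a full rebuild.
        ["python3", "backfill.py", "--db", db_path, "--recreate"],
    ]


def run_build(db_path: str = "/data/crm.db", timeout_s: int = 30 * 60) -> None:
    """Run both steps sequentially and stop at the first failure.

    Assumption: the timeout is applied to each step separately.
    """
    for argv in build_index_commands(db_path):
        subprocess.run(argv, cwd=INGEST_DIR, check=True, timeout=timeout_s)
```

Calling `run_build()` only makes sense inside the running container, where `/app/backend/ingest` exists and the Spark Control and Qdrant variables are already exported.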
## Keeping the index fresh (hands-off refresh)

The "Build search index" action above is a full one-shot rebuild. To keep the index current as the CRM changes, there are now two incremental paths. Both run `sync.py` (chunk → dense+BM25 → Qdrant upsert) for **changed records only**, with no `--recreate`, so they never drop the collection and are safe to run at any time.

### Manual: "Refresh search index" action

`startos/actions/refreshSearchIndex.ts` adds a second StartOS action, **Refresh search index** (id `refresh-search-index`). It mirrors `buildSearchIndex.ts` exactly (same subcontainer, same `/data` mount, same explicit `ingestEnv`) but runs `python3 sync.py --db /data/crm.db` (no `--recreate`) with `cwd = /app/backend/ingest`. An incremental delta usually takes seconds to a few minutes (the action allows up to 30 min of headroom). Use it for an on-demand refresh; use "Build search index" only for a full rebuild.

### Automatic: background sync scheduler

For hands-off freshness, `docker_entrypoint.sh` launches `backend/ingest/sync_scheduler.py` as a **background process** just before it `exec`s the web server. `sync_scheduler.py` loops the incremental sync every `CRM_INGEST_SYNC_INTERVAL_MIN` minutes (default **60**, exported in the entrypoint's env block with an operator comment). It logs to `/data/ingest-sync.log`.

The launch is **guarded**: it only starts when both `SPARK_CONTROL_URL` and `QDRANT_URL` are set (both are exported just above it, so the default LAN values satisfy the guard; an operator who clears them to disable ingest also disables the scheduler). The entrypoint prints `STARTED` or `SKIPPED (Spark/Qdrant not configured)` so the choice is visible in the service logs.
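
The guard and interval handling described above can be expressed in Python for illustration. This is a sketch only: the real guard is shell code in `docker_entrypoint.sh`, the helper names are hypothetical, and both the fallback to 60 minutes on invalid input and the exact log prefix on the `SKIPPED` line are assumptions.

```python
from typing import Mapping

DEFAULT_INTERVAL_MIN = 60  # documented default for CRM_INGEST_SYNC_INTERVAL_MIN


def scheduler_enabled(env: Mapping[str, str]) -> bool:
    """Mirror the entrypoint guard: both endpoints must be set and non-empty."""
    return bool(env.get("SPARK_CONTROL_URL")) and bool(env.get("QDRANT_URL"))


def sync_interval_seconds(env: Mapping[str, str]) -> int:
    """Read the interval in minutes and convert it to seconds.

    Assumption: missing, non-numeric, or non-positive values fall back
    to the default rather than aborting.
    """
    raw = env.get("CRM_INGEST_SYNC_INTERVAL_MIN", "")
    try:
        minutes = int(raw)
    except ValueError:
        return DEFAULT_INTERVAL_MIN * 60
    return (minutes if minutes > 0 else DEFAULT_INTERVAL_MIN) * 60


def startup_message(env: Mapping[str, str]) -> str:
    """Build the line the entrypoint would print for the chosen branch."""
    if scheduler_enabled(env):
        state = "STARTED"
    else:
        state = "SKIPPED (Spark/Qdrant not configured)"
    return f"[entrypoint] ingest sync scheduler: {state}"
```

Passing `os.environ` to `startup_message` from inside the container shows which branch the entrypoint would take with the current configuration.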
#### Why a background process and not a StartOS daemon

The stdio MCP server was deliberately not added as a daemon, because StartOS daemons are built around a network port + `checkPortListening` health check, and a portless process has no liveness signal to probe (see "MCP server" below). `sync_scheduler.py` is the same shape, a long-running loop with no port, so adding it as a second daemon in `main.ts` would hit the same mismatch. Launching it as a child of the entrypoint sidesteps that entirely:

- **Pro:** no portless-daemon contortion; it shares the `primary` container's `/data` and inherited env; the existing `primary` daemon and its `checkPortListening` health check are untouched.
- **Con:** StartOS does not supervise it independently. If the scheduler dies it is not auto-restarted on its own (the container as a whole is still health-checked via the web server), and it has no separate status tile in the UI. Crashes surface only in `/data/ingest-sync.log`. The manual "Refresh search index" action is the always-available fallback.

If a future phase wants first-class supervision/visibility, promote it to a real StartOS daemon, but, as with the MCP server, only after giving the work a network transport (e.g. a tiny HTTP health endpoint) so it has a meaningful `checkPortListening` probe.

## Env / config the operator must set (Spark URLs)

The ingest run reaches out to **Spark Control** (dense embeddings) and **Qdrant** (upserts). Those endpoints are LAN-specific, so they are defined in **two places** (the entrypoint and the actions' `ingestEnv`, three files in total) that the operator must point at their network.
The current values are the Ten31 LAN defaults:

| Variable | Default | Used by |
| --- | --- | --- |
| `SPARK_CONTROL_URL` | `https://192.168.1.72:62419` | dense embeds (`/v1/embeddings`) |
| `SPARK_CONTROL_VERIFY_TLS` | `false` (Spark Control uses a self-signed cert) | TLS verification toggle |
| `QDRANT_URL` | `http://192.168.1.87:6333` | Qdrant collection admin + upserts |
| `CRM_DB_PATH` | `/data/crm.db` | both scripts + MCP server (already correct) |
| `CRM_INGEST_SYNC_INTERVAL_MIN` | `60` | background sync scheduler loop interval (entrypoint only) |

Where to set them:

- **`docker_entrypoint.sh`**: for manual `python3` / MCP runs via the running container and for the background sync scheduler. Edit the `${VAR:-default}` block, or override via the StartOS service environment.
- **`startos/actions/buildSearchIndex.ts`** and **`startos/actions/refreshSearchIndex.ts`** (`ingestEnv`): for the "Build search index" and "Refresh search index" actions, which run in their own subcontainers and do **not** execute the entrypoint, so each carries its own copy of the values. Edit these to match. (`CRM_INGEST_SYNC_INTERVAL_MIN` only matters to the entrypoint's scheduler loop, not to the actions.)

> Keep the copies in sync. They are duplicated because the actions'
> subcontainers never run `docker_entrypoint.sh`; there is no shared config
> store wired into this package today (see "Still needed" below).

Verify reachability from the box before running the action: `curl -sk $SPARK_CONTROL_URL/api/endpoints` and `curl -s $QDRANT_URL/collections`.

## MCP server: decision and how to run it

**Decision: the MCP server is NOT a daemon in this release. It is shipped in the image and run manually.**

Rationale:

- `backend/mcp/server.py` is an **stdio** MCP server (`mcp.run()` with FastMCP): it has no network port to bind, so the StartOS daemon model (a long-running process with a `checkPortListening` health check, like `primary`) does not fit it.
  There is nothing to port-probe and no meaningful liveness signal.
- **Phase 0 has no live agents** (per `CLAUDE.md` and the runbook): nothing on the box would connect to it. An always-on daemon would idle with no client on its stdin and no health semantics.
- It exposes reads, the three retrieval modes, and logged writes, but **no outbound/contact tools** (Phase 3 compliance gate). It is for testing and later internal-only Analyst work.

To run it manually on the box (it is present at `/app/backend/mcp/server.py` with `mcp` already installed):

```sh
# from inside the running container
CRM_DB_PATH=/data/crm.db python3 /app/backend/mcp/server.py
```

Then register it with the Agent SDK / Claude Code as an stdio MCP server pointing at that script (it inherits the Spark/Qdrant env exported by the entrypoint).

If/when a live agent needs it as a persistent service, the cleanest upgrade is to add it as a **second daemon** in `startos/main.ts` mirroring the `primary` daemon, but only after giving it a network transport (e.g. an HTTP/SSE MCP endpoint on its own port) so it has a real `checkPortListening` health check. That is deliberately deferred to a later phase.

## Still needed for a fully turn-key deploy

- **MCP-as-a-service**: see above. Deferred until there is a live agent and a network transport; today it is manual/stdio only.
- ~~**Incremental sync (runbook Step 6 / Workstream B4)**~~: **done.** The background sync scheduler (`sync_scheduler.py`, started by the entrypoint) keeps the index fresh automatically, and the manual "Refresh search index" action provides an on-demand incremental sync. See "Keeping the index fresh" above. A future enhancement could still promote the scheduler to a first-class StartOS daemon (with a network transport for a real health check) for independent supervision/visibility.
- **Single source of truth for Spark/Qdrant config**: currently duplicated in `docker_entrypoint.sh`, `buildSearchIndex.ts`, and `refreshSearchIndex.ts`.
  A small StartOS config store + input form (the SDK supports `Action.withInput` and a service config) would let the operator set the endpoints once in the UI; deferred to keep this change minimal and reviewable.
- **`.env` on the box**: `backend/ingest/config.py` also reads `/app/.env` if present (via `os.environ.setdefault`, so it does not override the exported env). Not required given the exported env above, but available as an alternative if the operator prefers a file.

## Pre-release checks

Verify before cutting a release:

- **Multi-arch dependency build (BLOCKER).** The `fastembed==0.4.2` and `mcp==1.2.0` pins in `Dockerfile` were chosen best-effort and have **not** been confirmed to build on **both** `x86_64` and `aarch64`. StartOS targets arm64, and `fastembed` pulls `onnxruntime` (which may have no prebuilt arm64 wheel and fall back to a slow source build) plus downloads a model on first use. Build the image on aarch64 and run the ingest once end-to-end before release. Do not bump either pin without re-verifying on both arches. (Flagged inline above the pip line in `Dockerfile`.)
- **Scheduler smoke test.** With Spark Control + Qdrant reachable, start the container and confirm the entrypoint logs `[entrypoint] ingest sync scheduler: STARTED`, that `/data/ingest-sync.log` accumulates sync output, and that clearing one of the endpoints flips the log to `SKIPPED`.
- **Actions present.** Confirm both **Build search index** and **Refresh search index** appear under the service's Actions in the StartOS UI and run to success.

## Constraints honored

- No files under `backend/ingest/`, `backend/mcp/`, `backend/server.py`, `backend/core_migrations.py`, `backend/migrations/`, or `data/` were modified; only `start9/0.4/**` and this doc. The entrypoint and the refresh action reference `backend/ingest/sync_scheduler.py` and `backend/ingest/sync.py` by path only; those scripts are owned/created by a separate process.
- No build/deploy commands were run. `npx tsc --noEmit` was used only to verify the new TypeScript compiles against the SDK types.