Keysat 6be2e40f54 Phase 0 go-live polish: hands-off incremental sync + refresh action
- backend/ingest/sync_scheduler.py: periodic incremental-sync loop (every
  CRM_INGEST_SYNC_INTERVAL_MIN min); resilient, --once for testing.
- start9/0.4: "Refresh search index" action (incremental sync.py); entrypoint
  launches the scheduler as a background process when Spark/Qdrant are set;
  CRM_INGEST_SYNC_INTERVAL_MIN env; pre-release note on fastembed/mcp pins.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:36:06 -05:00

# Phase-0 ingest packaging (StartOS 0.4)
How the Phase-0 data substrate — the ingest pipeline (`backend/ingest/`) and the
CRM MCP server (`backend/mcp/`) — ships and runs on the live StartOS 0.4 package,
**without changing the CRM web server**. This implements **Option A** ("same
image") from `docs/go-live-runbook.md` §"Open decision — packaging".
The CRM web server (`backend/server.py`) is untouched and gains no new
dependencies. The `primary` daemon and its `checkPortListening` health check are
unchanged.
## What changed
| File | Change |
| --- | --- |
| `Dockerfile` | `COPY backend/ingest` and `COPY backend/mcp` into the image alongside `backend/server.py`. Added two runtime deps to the existing `pip install`: `fastembed==0.4.2` (client-side BM25 / `Qdrant/bm25` for the sparse retrieval leg) and `mcp==1.2.0` (MCP Python SDK, only for `backend/mcp/server.py`). **These two pins carry a pre-release multi-arch verification requirement — see "Pre-release checks" below.** |
| `docker_entrypoint.sh` | Added an export block for the ingest/retrieval env: `CRM_DB_PATH`, `SPARK_CONTROL_URL`, `SPARK_CONTROL_VERIFY_TLS`, `QDRANT_URL`, `CRM_INGEST_SYNC_INTERVAL_MIN`, with LAN-default placeholder values and an operator comment. The CRM web server ignores these; they exist so manual `python3 /app/backend/ingest/...` and `backend/mcp/server.py` runs on the box inherit them. Also launches the **background ingest sync scheduler** (`sync_scheduler.py`) before `exec`-ing the web server, guarded so it only starts when Spark Control + Qdrant are configured — see "Automatic scheduled refresh" below. |
| `startos/actions/buildSearchIndex.ts` | **New.** A one-shot "Build search index" StartOS action (Steps 3–4 of the runbook) that performs a full rebuild with `--recreate`. |
| `startos/actions/refreshSearchIndex.ts` | **New.** A manual "Refresh search index" action — incremental, idempotent `sync.py` (no `--recreate`); the manual counterpart to the background scheduler. |
| `startos/actions/index.ts` | Registered both actions: `sdk.Actions.of().addAction(buildSearchIndex).addAction(refreshSearchIndex)`. |
| `startos/versions/v0.1.0.44.ts` + `versions/index.ts` | New version `0.1.0:44` (image-only change, no data migration) set as `current`; `0.1.0:43` moved to `other`. |
| `startos/utils.ts` | Bumped the informational `PACKAGE_VERSION` constant to `0.1.0:44`. |
### Action registration mechanism (verified)
Actions are collected in `startos/actions/index.ts` as
`export const actions = sdk.Actions.of().addAction(...)`, and that `actions`
object is passed into `sdk.setupInit(...)` in `startos/init/index.ts` (and
re-exported from `startos/index.ts`). Chaining one `.addAction(...)` call per
action (here `buildSearchIndex` and `refreshSearchIndex`) is the entire
registration; no manifest entry is required for actions in the 0.4 SDK.
## How the operator triggers the index build
1. Build/sideload the new `.s9pk` (version `0.1.0:44`). StartOS preserves
   `/data`, so live data is undisturbed. On first boot the CRM's own migration
   runner creates the Phase-0 tables (see runbook Step 1); that is independent
   of this packaging change.
2. In the StartOS UI, open the **Ten31 Database** service → **Actions** →
   **Build search index**, and run it. It:
   - runs `entity_resolution.py --db /data/crm.db` (canonical ids + links), then
   - runs `backfill.py --db /data/crm.db --recreate` (chunk → dense via Spark
     Control + BM25 → upsert to Qdrant `crm_chunks`).

   Both steps are idempotent and read-only on the CRM source tables, so the
   action is safe to re-run any time to refresh the index. A full re-embed takes
   roughly 8–15 min (the action allows up to 30 min before timing out).
The action runs in its **own subcontainer** with the same `main` volume mounted
at `/data`, with `cwd=/app/backend/ingest` (the ingest scripts import their
siblings by bare name, e.g. `import config`, so they must run from that
directory). It uses `allowedStatuses: 'any'` — SQLite WAL mode makes a
concurrently-running CRM safe for these reads/derived writes.
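
The two steps the action runs, and the reason for the fixed working directory, can be sketched as follows. This is a hypothetical Python illustration, not the action's implementation (that lives in `buildSearchIndex.ts`). The helper names `build_index_commands` and `run_steps` are assumptions; the script names, flags, and paths come from the steps above.

```python
import subprocess
from typing import Callable, List, Sequence

# The ingest scripts import their siblings by bare name (e.g. `import config`),
# so every step has to run with this directory as its working directory.
INGEST_DIR = "/app/backend/ingest"


def build_index_commands(db_path: str = "/data/crm.db") -> List[List[str]]:
    """Return the full-rebuild steps in the order described above."""
    return [
        ["python3", "entity_resolution.py", "--db", db_path],
        ["python3", "backfill.py", "--db", db_path, "--recreate"],
    ]


def run_steps(
    commands: Sequence[Sequence[str]],
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run each step from INGEST_DIR, stopping at the first failure."""
    for argv in commands:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit status, so backfill never runs after a failed
        # entity resolution.
        runner(list(argv), cwd=INGEST_DIR, check=True)
```

Passing `runner` as a parameter is only there so the sequence can be exercised without launching real processes.
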
## Keeping the index fresh (hands-off refresh)
The "Build search index" action above is a full one-shot rebuild. To keep the
index current as the CRM changes, there are now two incremental paths — both run
`sync.py` (chunk → dense+BM25 → Qdrant upsert) for **changed records only**, with
NO `--recreate`, so they never drop the collection and are safe to run any time.
### Manual: "Refresh search index" action
`startos/actions/refreshSearchIndex.ts` adds a second StartOS action,
**Refresh search index** (id `refresh-search-index`). It mirrors
`buildSearchIndex.ts` exactly — same subcontainer, same `/data` mount, same
explicit `ingestEnv` — but runs `python3 sync.py --db /data/crm.db` (no
`--recreate`) with `cwd = /app/backend/ingest`. An incremental delta is usually
seconds to a few minutes (the action allows up to 30 min of headroom). Use it
for an on-demand refresh; use "Build search index" only for a full rebuild.
### Automatic: background sync scheduler
For hands-off freshness, `docker_entrypoint.sh` launches
`backend/ingest/sync_scheduler.py` as a **background process** just before it
`exec`s the web server. `sync_scheduler.py` loops the incremental sync every
`CRM_INGEST_SYNC_INTERVAL_MIN` minutes (default **60**, exported in the
entrypoint's env block with an operator comment). It logs to
`/data/ingest-sync.log`.
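
The loop described above can be pictured with a minimal sketch. It is not the contents of `sync_scheduler.py`, which this document does not reproduce. The function names, the `max_runs` parameter, and the log prefix are assumptions made for illustration. Only the environment variable, the default of 60 minutes, the resilience to failed runs, and the single-run mode for testing come from the source.

```python
import time
from typing import Callable, Mapping, Optional

DEFAULT_INTERVAL_MIN = 60  # the documented default


def read_interval_min(env: Mapping[str, str]) -> int:
    """Parse CRM_INGEST_SYNC_INTERVAL_MIN, falling back to the default."""
    raw = env.get("CRM_INGEST_SYNC_INTERVAL_MIN", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_INTERVAL_MIN
    # Assumption: a zero or negative interval is treated as "use the default".
    return value if value > 0 else DEFAULT_INTERVAL_MIN


def run_scheduler(
    sync_once: Callable[[], None],
    interval_min: int,
    once: bool = False,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call sync_once repeatedly; a failed run is logged and does not stop the loop."""
    runs = 0
    while True:
        try:
            sync_once()
        except Exception as exc:
            # Hypothetical log line; the real scheduler's format may differ.
            print(f"[sync-scheduler] sync failed: {exc!r}", flush=True)
        runs += 1
        if once or (max_runs is not None and runs >= max_runs):
            return runs
        sleep(interval_min * 60)
```

Catching the exception inside the loop is what keeps one failed sync (for example, Qdrant briefly unreachable) from ending the background process.
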
The launch is **guarded**: it only starts when both `SPARK_CONTROL_URL` and
`QDRANT_URL` are set (both are exported just above it, so the default LAN values
satisfy the guard; an operator who clears them to disable ingest also disables
the scheduler). The entrypoint prints `STARTED` or `SKIPPED (Spark/Qdrant not
configured)` so the choice is visible in the service logs.
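
The guard's decision can be restated in a few lines. The real check lives in the shell entrypoint, so this Python version is only an illustration of the rule, and the function names are hypothetical. Treating a whitespace-only value as unset is an assumption; the two status strings are the ones quoted above.

```python
from typing import Mapping

REQUIRED_VARS = ("SPARK_CONTROL_URL", "QDRANT_URL")


def scheduler_should_start(env: Mapping[str, str]) -> bool:
    """True only when both endpoints are set to non-empty values."""
    return all(env.get(name, "").strip() for name in REQUIRED_VARS)


def guard_status(env: Mapping[str, str]) -> str:
    """Status word matching the two outcomes the entrypoint prints."""
    if scheduler_should_start(env):
        return "STARTED"
    return "SKIPPED (Spark/Qdrant not configured)"
```
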
#### Why a background process and not a StartOS daemon
The prior agent deliberately avoided adding the stdio MCP server as a daemon
because StartOS daemons are built around a network port + `checkPortListening`
health check, and a portless process has no liveness signal to probe (see "MCP
server" below). `sync_scheduler.py` is the same shape — a long-running loop with
no port — so adding it as a second daemon in `main.ts` would hit the same
mismatch.
Launching it as a child of the entrypoint sidesteps that entirely:
- **Pro:** no portless-daemon contortion; it shares the `primary` container's
`/data` and inherited env; the existing `primary` daemon and its
`checkPortListening` health check are untouched.
- **Con:** StartOS does not supervise it independently. If the scheduler dies it
is not auto-restarted on its own (the container as a whole is still
health-checked via the web server), and it has no separate status tile in the
UI. Crashes surface only in `/data/ingest-sync.log`. The manual "Refresh
search index" action is the always-available fallback.
If a future phase wants first-class supervision/visibility, promote it to a real
StartOS daemon — but, as with the MCP server, only after giving the work a
network transport (e.g. a tiny HTTP health endpoint) so it has a meaningful
`checkPortListening` probe.
## Env / config the operator must set (Spark URLs)
The ingest run reaches out to **Spark Control** (dense embeddings) and **Qdrant**
(upserts). Those endpoints are LAN-specific, so they are defined in **two kinds
of places** (the entrypoint and each action's `ingestEnv`, three files in total)
that the operator must point at their network. The current values are the Ten31
LAN defaults:
| Variable | Default | Used by |
| --- | --- | --- |
| `SPARK_CONTROL_URL` | `https://192.168.1.72:62419` | dense embeds (`/v1/embeddings`) |
| `SPARK_CONTROL_VERIFY_TLS` | `false` (Spark Control uses a self-signed cert) | TLS verification toggle |
| `QDRANT_URL` | `http://192.168.1.87:6333` | Qdrant collection admin + upserts |
| `CRM_DB_PATH` | `/data/crm.db` | ingest scripts + MCP server (already correct) |
| `CRM_INGEST_SYNC_INTERVAL_MIN` | `60` | background sync scheduler loop interval (entrypoint only) |
Where to set them:
- **`docker_entrypoint.sh`** — for manual `python3` / MCP runs via the running
container and for the background sync scheduler. Edit the `${VAR:-default}`
block, or override via the StartOS service environment.
- **`startos/actions/buildSearchIndex.ts`** and
**`startos/actions/refreshSearchIndex.ts`** (`ingestEnv`) — for the "Build
search index" and "Refresh search index" actions, which run in their own
subcontainers and do **not** execute the entrypoint, so each carries its own
copy of the values. Edit these to match. (`CRM_INGEST_SYNC_INTERVAL_MIN` only
matters to the entrypoint's scheduler loop, not to the actions.)
> Keep the copies in sync. They are duplicated because the actions'
> subcontainers never run `docker_entrypoint.sh`; there is no shared config
> store wired into this package today (see "Still needed" below).
Verify reachability from the box before running the action:
`curl -sk $SPARK_CONTROL_URL/api/endpoints` and
`curl -s $QDRANT_URL/collections`.
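
Where `curl` is not convenient, the same two probes can be approximated from Python's standard library. This is a sketch under assumptions: the helper names are hypothetical, and reading `SPARK_CONTROL_VERIFY_TLS` as the literal string `false` is a guess at how the flag is interpreted. Unlike `curl -s`, `is_reachable` treats a non-2xx response as a failure.

```python
import ssl
import urllib.request
from typing import Dict, Mapping


def probe_urls(env: Mapping[str, str]) -> Dict[str, str]:
    """Build the two probe URLs used by the curl commands above."""
    return {
        "spark": env["SPARK_CONTROL_URL"].rstrip("/") + "/api/endpoints",
        "qdrant": env["QDRANT_URL"].rstrip("/") + "/collections",
    }


def tls_context(env: Mapping[str, str]) -> ssl.SSLContext:
    """Skip certificate checks when SPARK_CONTROL_VERIFY_TLS is 'false'."""
    ctx = ssl.create_default_context()
    if env.get("SPARK_CONTROL_VERIFY_TLS", "true").strip().lower() == "false":
        # check_hostname must be cleared before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def is_reachable(url: str, ctx: ssl.SSLContext, timeout_s: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s, context=ctx) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

A typical use is `all(is_reachable(u, tls_context(os.environ)) for u in probe_urls(os.environ).values())`, run inside the container so it sees the exported env.
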
## MCP server: decision and how to run it
**Decision: the MCP server is NOT a daemon in this release — it is shipped in the
image and run manually.** Rationale:
- `backend/mcp/server.py` is an **stdio** MCP server (`mcp.run()` with FastMCP):
it has no network port to bind, so the StartOS daemon model (a long-running
process with a `checkPortListening` health check, like `primary`) does not fit
it. There is nothing to port-probe and no meaningful liveness signal.
- **Phase 0 has no live agents** (per `CLAUDE.md` and the runbook): nothing on
the box would connect to it. An always-on daemon would idle with no client on
its stdin and no health semantics.
- It exposes reads, the three retrieval modes, and logged writes — **no
outbound/contact tools** (Phase 3 compliance gate). It is for testing and
later internal-only Analyst work.
To run it manually on the box (it is present at `/app/backend/mcp/server.py` with
`mcp` already installed):
```sh
# from inside the running container
CRM_DB_PATH=/data/crm.db python3 /app/backend/mcp/server.py
```
Then register it with the Agent SDK / Claude Code as an stdio MCP server pointing
at that script (it inherits the Spark/Qdrant env exported by the entrypoint).
If/when a live agent needs it as a persistent service, the cleanest upgrade is to
add it as a **second daemon** in `startos/main.ts` mirroring the `primary`
daemon — but only after giving it a network transport (e.g. an HTTP/SSE MCP
endpoint on its own port) so it has a real `checkPortListening` health check.
That is deliberately deferred to a later phase.
## Still needed for a fully turn-key deploy
- **MCP-as-a-service** — see above. Deferred until there is a live agent and a
network transport; today it is manual/stdio only.
- ~~**Incremental sync (runbook Step 6 / Workstream B4)**~~ — **done.** The
background sync scheduler (`sync_scheduler.py`, started by the entrypoint) keeps
the index fresh automatically, and the manual "Refresh search index" action
provides an on-demand incremental sync. See "Keeping the index fresh" above. A
future enhancement could still promote the scheduler to a first-class StartOS
daemon (with a network transport for a real health check) for independent
supervision/visibility.
- **Single source of truth for Spark/Qdrant config** — currently duplicated in
`docker_entrypoint.sh`, `buildSearchIndex.ts`, and `refreshSearchIndex.ts`. A
small StartOS config
store + input form (the SDK supports `Action.withInput` and a service config)
would let the operator set the endpoints once in the UI; deferred to keep this
change minimal and reviewable.
- **`.env` on the box** — `backend/ingest/config.py` also reads `/app/.env` if
present (via `os.environ.setdefault`, so it does not override the exported
env). Not required given the exported env above, but available as an
alternative if the operator prefers a file.
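
The `setdefault` behaviour mentioned in the last bullet is why a `.env` file cannot override the exported env. A minimal sketch of that precedence rule follows; `apply_env_file` is a hypothetical helper, not the code in `backend/ingest/config.py`, and the `KEY=VALUE` parsing shown is an assumption.

```python
import os
from typing import Iterable, MutableMapping


def apply_env_file(
    lines: Iterable[str],
    environ: MutableMapping[str, str] = os.environ,
) -> None:
    """Apply KEY=VALUE lines without overriding variables already set."""
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault only writes when the key is absent, so exported env wins.
        environ.setdefault(key.strip(), value.strip())
```

With `QDRANT_URL` already exported by the entrypoint, a different `QDRANT_URL` in the file is ignored, while a variable that is missing from the environment is filled in from the file.
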
## Pre-release checks
Verify before cutting a release:
- **Multi-arch dependency build (BLOCKER).** The `fastembed==0.4.2` and
`mcp==1.2.0` pins in `Dockerfile` were chosen best-effort and have **not** been
confirmed to build on **both** `x86_64` and `aarch64`. StartOS targets arm64,
and `fastembed` pulls `onnxruntime` (which may have no prebuilt arm64 wheel and
fall back to a slow source build) plus downloads a model on first use. Build the
image on aarch64 and run the ingest once end-to-end before release. Do not bump
either pin without re-verifying on both arches. (Flagged inline above the pip
line in `Dockerfile`.)
- **Scheduler smoke test.** With Spark Control + Qdrant reachable, start the
container and confirm the entrypoint logs
`[entrypoint] ingest sync scheduler: STARTED`, that `/data/ingest-sync.log`
accumulates sync output, and that clearing one of the endpoints flips the log to
`SKIPPED`.
- **Actions present.** Confirm both **Build search index** and **Refresh search
index** appear under the service's Actions in the StartOS UI and run to success.
## Constraints honored
- No files under `backend/ingest/`, `backend/mcp/`, `backend/server.py`,
`backend/core_migrations.py`, `backend/migrations/`, or `data/` were modified;
only `start9/0.4/**` and this doc. The entrypoint and the refresh action
reference `backend/ingest/sync_scheduler.py` and `backend/ingest/sync.py` by
path only — those scripts are owned/created by a separate process.
- No build/deploy commands were run. `npx tsc --noEmit` was used only to verify
the new TypeScript compiles against the SDK types.