Phase 0 complete: fuzzy entity tier, incremental sync, Start9 packaging
- Fuzzy tier (backend/ingest/fuzzy_resolve.py + llm.py): local Qwen adjudicates the deterministic resolver's flagged name-variant candidates; merges are durable via entity_merges (deterministic re-runs respect them), losers soft-deleted, logged. Idempotent.
- Incremental sync (backend/ingest/sync.py): re-embeds only rows changed since a watermark (ingest_sync_state); first run / --recreate = full. Tested full→0→1.
- Start9 packaging (start9/0.4): Dockerfile bundles ingest+mcp + fastembed/mcp; "Build search index" action runs the init in a subcontainer; MCP shipped as a manual stdio server (not a daemon); version 0.1.0:44. INGEST_PACKAGING.md.
- backfill.py: factored embed_and_upsert() shared with sync.

Verified end-to-end on synthetic data + live Sparks/Qwen/Qdrant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
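The watermark-based incremental sync described in the commit message can be illustrated with a minimal sketch. This is not the code in backend/ingest/sync.py: the `notes` table, its `updated_at` integer version column, and the `(name, watermark)` layout of `ingest_sync_state` are assumptions made for the example, and `embed_and_upsert` is passed in as a plain callable rather than imported from backfill.py.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical source table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT, updated_at INTEGER)"
    )
    # Assumed layout: one watermark per synced source, keyed by name.
    conn.execute(
        "CREATE TABLE ingest_sync_state (name TEXT PRIMARY KEY, watermark INTEGER)"
    )
    conn.executemany(
        "INSERT INTO notes (id, body, updated_at) VALUES (?, ?, ?)",
        [(1, "first", 1), (2, "second", 2), (3, "third", 3)],
    )
    conn.commit()
    return conn


def sync(conn, embed_and_upsert, recreate=False):
    """Re-embed rows changed since the stored watermark; return the row count."""
    row = conn.execute(
        "SELECT watermark FROM ingest_sync_state WHERE name = 'notes'"
    ).fetchone()
    # No stored watermark (first run) or recreate=True means a full pass.
    watermark = None if (recreate or row is None) else row[0]

    if watermark is None:
        rows = conn.execute(
            "SELECT id, body, updated_at FROM notes ORDER BY updated_at"
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT id, body, updated_at FROM notes "
            "WHERE updated_at > ? ORDER BY updated_at",
            (watermark,),
        ).fetchall()

    if rows:
        embed_and_upsert([(r[0], r[1]) for r in rows])
        # Advance the watermark only after the batch was handed off.
        conn.execute(
            "INSERT OR REPLACE INTO ingest_sync_state (name, watermark) "
            "VALUES ('notes', ?)",
            (rows[-1][2],),
        )
        conn.commit()
    return len(rows)
```

Under these assumptions, a first run processes every row, an immediate second run processes none, and a run after a single edit processes exactly that row. The strict `>` comparison relies on `updated_at` increasing with every write; with coarse timestamps, rows that share the watermark value could be missed.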
@@ -0,0 +1,47 @@
import { VersionInfo } from '@start9labs/start-sdk'

// Phase-0 substrate packaging release.
//
// Context:
//   * Ships the Phase-0 ingest pipeline (backend/ingest/) and the CRM MCP
//     server (backend/mcp/) inside the existing CRM container image, alongside
//     the web server. Two runtime deps are added to the image: `fastembed`
//     (client-side BM25 for the sparse retrieval leg) and `mcp` (the MCP
//     Python SDK, used only to run backend/mcp/server.py). The CRM web server
//     itself gains no new dependencies and is unchanged.
//   * Adds a one-shot "Build search index" StartOS action that runs the
//     one-time init on the box where /data/crm.db lives:
//       entity_resolution.py --db /data/crm.db        (canonical ids)
//       backfill.py --db /data/crm.db --recreate      (Qdrant search index)
//     Both steps are idempotent and read-only on the CRM source tables.
//   * docker_entrypoint.sh now exports the Spark Control / Qdrant env
//     (SPARK_CONTROL_URL, SPARK_CONTROL_VERIFY_TLS, QDRANT_URL) with LAN
//     defaults so manual ingest / MCP runs on the box inherit them.
//
// The MCP server is intentionally NOT a daemon in this release: it is an
// stdio server with no port to bind and (in Phase 0) no live agent on the box
// to talk to it, so it is run manually for testing. See
// start9/0.4/INGEST_PACKAGING.md.
//
// No schema changes and no data migration: the SQLite schema is unchanged and
// the live /data volume is left exactly as-is. The new tables the ingest
// pipeline reads/writes are created by the CRM's own migration runner
// (migrations/0001_phase0_foundation.sql), independent of this package change.
export const v_0_1_0_44 = VersionInfo.of({
  version: '0.1.0:44',
  releaseNotes: {
    en_US: [
      'Ships the Phase-0 data substrate inside the CRM image: the ingest',
      'pipeline (entity resolution + Qdrant backfill) and the CRM MCP server,',
      'plus the fastembed and mcp runtime dependencies. Adds a one-time',
      '"Build search index" action that resolves canonical entity ids from',
      'your live CRM and rebuilds the Qdrant search index — both steps are',
      'idempotent and read-only on your CRM source data. The CRM web server',
      'is unchanged and gains no new dependencies. No data migration.',
    ].join(' '),
  },
  migrations: {
    up: async () => {},
    down: async () => {},
  },
})
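The two-step init that the "Build search index" action performs (per the comment block above) can be sketched as an ordered pair of commands. This is only an illustration of the ordering and flags named in the comment, not the StartOS action itself, which runs in a subcontainer through the Start9 SDK. The `backend/ingest` script directory, the function names, and the injectable `run` parameter are assumptions introduced for the example.

```python
import subprocess
import sys

# Assumption: both scripts live in the ingest package directory.
INGEST_DIR = "backend/ingest"


def init_commands(db_path="/data/crm.db", ingest_dir=INGEST_DIR, python=sys.executable):
    """Return the init steps in the order they should run."""
    return [
        # Step 1: resolve canonical entity ids.
        [python, f"{ingest_dir}/entity_resolution.py", "--db", db_path],
        # Step 2: rebuild the Qdrant search index from scratch.
        [python, f"{ingest_dir}/backfill.py", "--db", db_path, "--recreate"],
    ]


def run_init(run=subprocess.run, **kwargs):
    """Run each step in order; check=True stops at the first failing step."""
    for argv in init_commands(**kwargs):
        run(argv, check=True)
```

Calling `run_init()` on the box would execute both steps against /data/crm.db. Because the comment describes both steps as idempotent, re-running after a failure should be safe. Passing a stub as `run` allows the ordering to be checked without touching the database.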