Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server

Workstream A–C substrate for the Ten31 agentic system:
- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links,
  interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db,
  drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity
queries). Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keysat
2026-06-05 08:11:28 -05:00
parent 7027efd777
commit c7ce44d963
99 changed files with 10676 additions and 7817 deletions
@@ -0,0 +1,11 @@
import { VersionGraph } from '@start9labs/start-sdk'
import { v_0_1_0_39 } from './v0.1.0.39'
import { v_0_1_0_40 } from './v0.1.0.40'
import { v_0_1_0_41 } from './v0.1.0.41'
import { v_0_1_0_42 } from './v0.1.0.42'
import { v_0_1_0_43 } from './v0.1.0.43'

export const versionGraph = VersionGraph.of({
  current: v_0_1_0_43,
  other: [v_0_1_0_39, v_0_1_0_40, v_0_1_0_41, v_0_1_0_42],
})
@@ -0,0 +1,38 @@
import { VersionInfo } from '@start9labs/start-sdk'

// First StartOS 0.4 release of Ten31 Database.
//
// Upgrade context:
//   * The 0.3.5 wrapper shipped at 0.1.0.38 (legacy, aarch64 only).
//   * This 0.4 wrapper is built for x86_64 and is intended for a
//     parallel install on a new StartOS 0.4 host.
//   * Data continuity is NOT handled by a StartOS-level in-place
//     upgrade (that path does not exist across StartOS majors).
//     Instead the container image is pre-seeded with a snapshot of
//     /data (crm.db, backups/, optional .crm-secret). On first boot
//     docker_entrypoint.sh copies that snapshot into the mounted
//     `main` volume if it is empty.
//
// Because both "up" and "down" paths are inside the same wrapper
// lineage (and the first 0.4 release has no earlier 0.4 version
// to migrate from), the migration functions are intentionally
// no-ops. Future 0.4.x releases can chain off this node in the
// version graph.
export const v_0_1_0_39 = VersionInfo.of({
  version: '0.1.0:39',
  releaseNotes: {
    en_US: [
      'First StartOS 0.4 package for Ten31 Database.',
      'Built for x86_64; sideload-only during beta.',
      'Container image ships with a baked-in /data snapshot so the',
      'service boots with the existing investor and fundraising data,',
      'saved views, backups, users, and app settings already in place.',
      'No StartOS-level migration is performed from the 0.3.5 package;',
      'this package is installed fresh on a 0.4 host.',
    ].join(' '),
  },
  migrations: {
    up: async () => {},
    down: async () => {},
  },
})
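The first-boot seeding described in the header comment above is done by docker_entrypoint.sh, which is not part of this diff. As a rough illustration of the "copy only if the volume is empty" guard, here is a minimal TypeScript sketch that models both the baked-in snapshot and the `main` volume as in-memory maps. The `Volume` type and `seedIfEmpty` function are hypothetical names introduced for illustration, not code from the package.

```typescript
// Illustrative sketch only: the real seeding is a shell step in
// docker_entrypoint.sh. `Volume` and `seedIfEmpty` are hypothetical names.

// In-memory stand-in for a directory: relative path -> file contents.
type Volume = Map<string, string>

// Copies the snapshot into the main volume only when the volume holds no
// files. Returns true if seeding happened, false if the volume was left alone.
function seedIfEmpty(snapshot: Volume, main: Volume): boolean {
  if (main.size > 0) return false
  snapshot.forEach((contents, path) => {
    main.set(path, contents)
  })
  return true
}
```

The emptiness check is what makes the step safe to run on every boot: once the volume holds live data, a later boot never overwrites it.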
@@ -0,0 +1,57 @@
import { VersionInfo } from '@start9labs/start-sdk'

// Post-migration cleanup + hardening release.
//
// Context:
//   * 0.1.0:39 was the first 0.4 package and shipped a baked-in
//     /data snapshot that docker_entrypoint.sh copied into the
//     mounted `main` volume on first boot (only if the volume was
//     empty). That snapshot did its job and the live host now has
//     a populated /data with all real investor + fundraising data.
//   * 0.1.0:40 removes the seed snapshot from the image and the
//     seeding logic from the entrypoint. The live /data volume is
//     the sole source of truth from here on. StartOS preserves the
//     volume across sideloads, so this upgrade does not disturb
//     any data; it just slims the image and removes a code path
//     that should never run again.
//   * 0.1.0:40 also hardens the backend HTTP server against the
//     vulnerability scanners that find the StartTunnel-exposed
//     interface within hours of going live:
//       - HTTPServer → ThreadingHTTPServer so one slow request or
//         a wave of scanner probes can't block legit users.
//       - Per-IP GET rate limit (default 600/min) in addition to
//         the existing login/write limits.
//       - 404-burst auto-ban: any IP that produces ABUSE_404_THRESHOLD
//         404s within ABUSE_404_WINDOW_SEC (default 15 in 60s) is
//         parked on a class-level blacklist for ABUSE_BAN_SEC
//         (default 15 minutes). Banned IPs get an instant 429 with
//         no DB or filesystem work.
//       - All limits stay tunable via env vars
//         (CRM_GET_RATE_LIMIT_PER_MIN, CRM_ABUSE_404_THRESHOLD,
//         CRM_ABUSE_404_WINDOW_SEC, CRM_ABUSE_BAN_SEC).
//
// No data migration is required: the SQLite schema is unchanged
// and the live DB on /data is left exactly as-is.
export const v_0_1_0_40 = VersionInfo.of({
  version: '0.1.0:40',
  releaseNotes: {
    en_US: [
      'Removes the baked-in /data seed snapshot now that the',
      '0.3.5 → 0.4 migration is complete. The live /data volume',
      'on the StartOS host is the sole source of truth and is',
      'preserved across sideloads, so no live data is touched by',
      'this upgrade. Image is smaller and the first-boot seeding',
      'code path has been removed. Also hardens the backend',
      'against vulnerability scanners hitting the public',
      'StartTunnel interface: the HTTP server is now multi-threaded',
      'so one slow request can no longer block legit users, GET',
      'requests are rate-limited per IP, and any IP that bursts',
      'too many 404s in a short window is auto-banned for 15',
      'minutes with no DB work performed.',
    ].join(' '),
  },
  migrations: {
    up: async () => {},
    down: async () => {},
  },
})
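The 404-burst auto-ban described in the header comment above lives in the Python backend, which this diff does not show. The sketch below restates the rule in TypeScript under stated assumptions: a sliding window of 404 timestamps per IP, a ban once the count reaches the threshold, and a cheap ban check intended to run before any other work. The class name `AbuseGuard` and its methods are hypothetical; only the default values (15 hits, 60 seconds, 15 minutes) come from the comment. It does not cover the separate per-IP GET rate limit.

```typescript
// Illustrative sketch only: the real logic is in the Python backend.
// AbuseGuard, record404 and isBanned are hypothetical names.

interface AbuseConfig {
  threshold: number // 404s that trigger a ban (ABUSE_404_THRESHOLD)
  windowSec: number // sliding window length (ABUSE_404_WINDOW_SEC)
  banSec: number // ban duration (ABUSE_BAN_SEC)
}

const DEFAULT_ABUSE_CONFIG: AbuseConfig = {
  threshold: 15,
  windowSec: 60,
  banSec: 15 * 60,
}

class AbuseGuard {
  private cfg: AbuseConfig
  private hits = new Map<string, number[]>() // ip -> recent 404 times (seconds)
  private bannedUntil = new Map<string, number>() // ip -> ban expiry (seconds)

  constructor(cfg: AbuseConfig = DEFAULT_ABUSE_CONFIG) {
    this.cfg = cfg
  }

  // Cheap lookup meant to run before any DB or filesystem work.
  // Expired bans are dropped lazily on the next check.
  isBanned(ip: string, nowSec: number): boolean {
    const until = this.bannedUntil.get(ip)
    if (until === undefined) return false
    if (nowSec >= until) {
      this.bannedUntil.delete(ip)
      return false
    }
    return true
  }

  // Records one 404 for this IP. Returns true if this hit triggered a ban.
  record404(ip: string, nowSec: number): boolean {
    const cutoff = nowSec - this.cfg.windowSec
    const recent = (this.hits.get(ip) ?? []).filter((t) => t > cutoff)
    recent.push(nowSec)
    if (recent.length >= this.cfg.threshold) {
      this.bannedUntil.set(ip, nowSec + this.cfg.banSec)
      this.hits.delete(ip)
      return true
    }
    this.hits.set(ip, recent)
    return false
  }
}
```

A request handler would call `isBanned` first and answer 429 immediately when it returns true, then call `record404` only after a request actually resolves to a 404. Passing the clock in as `nowSec` keeps the sketch deterministic and easy to test.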
@@ -0,0 +1,42 @@
import { VersionInfo } from '@start9labs/start-sdk'

// Frontend convenience release: persist auth across page reloads.
//
// Background: through 0.1.0:40 the auth token + user object were held only
// in React state in memory. Any refresh, tab close, or browser restart
// dropped the token and forced the user back to the login screen. Since
// the JWT is signed with /data/.crm-secret (which already survives sideloads
// and container restarts), the underlying token is still valid for its full
// 24-hour lifetime — we just weren't keeping it anywhere persistent.
//
// 0.1.0:41 stores the JWT and user object in localStorage on login (and
// rehydrates from there on app mount), so refreshes and reopened tabs stay
// signed in until the token expires. The api() helper now also dispatches
// a 'crm:unauthorized' event whenever an authenticated request comes back
// with a 401, and the AuthProvider listens for that event to clear the
// stored auth — so an expired or rejected token immediately bounces the
// user back to the login screen instead of leaving the app in a broken
// "loaded but every request fails" state.
//
// Backend is unchanged: the JWT still carries the user's true role and is
// re-verified on every request, so a tampered localStorage user object
// cannot escalate privileges (the next admin call would just return 401/403).
//
// No data migration is required.
export const v_0_1_0_41 = VersionInfo.of({
  version: '0.1.0:41',
  releaseNotes: {
    en_US: [
      'Logins now persist across page refreshes and tab closures for',
      'the full 24-hour token lifetime. Previously every reload bounced',
      'you to the login screen even though the token was still valid.',
      'If the server later rejects a stored token (expired, secret key',
      'changed, etc.) the app automatically clears it and shows the',
      'login screen instead of leaving requests silently failing.',
    ].join(' '),
  },
  migrations: {
    up: async () => {},
    down: async () => {},
  },
})
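The frontend change described in the header comment above (persist on login, rehydrate on mount, clear on a `crm:unauthorized` event) is not included in this diff. The following is a framework-free sketch of that flow, assuming an injected storage object with the same three methods as `localStorage` and a tiny event bus standing in for `window` events. The storage key `crm:auth`, the `StoredAuth` shape, and the names `AuthSession`, `EventBus`, and `handleResponseStatus` are all hypothetical; only the event name comes from the comment.

```typescript
// Illustrative sketch only. Apart from the 'crm:unauthorized' event name,
// every identifier here is hypothetical.

// Same method shapes as window.localStorage, injected so the logic is testable.
interface KeyValueStore {
  getItem(key: string): string | null
  setItem(key: string, value: string): void
  removeItem(key: string): void
}

interface StoredAuth {
  token: string
  user: { id: number; role: string }
}

const AUTH_KEY = 'crm:auth' // assumed storage key
const UNAUTHORIZED_EVENT = 'crm:unauthorized'

// Minimal stand-in for window.addEventListener / dispatchEvent.
class EventBus {
  private listeners = new Map<string, Array<() => void>>()
  on(name: string, fn: () => void): void {
    this.listeners.set(name, [...(this.listeners.get(name) ?? []), fn])
  }
  emit(name: string): void {
    for (const fn of this.listeners.get(name) ?? []) fn()
  }
}

class AuthSession {
  private store: KeyValueStore
  current: StoredAuth | null

  constructor(store: KeyValueStore, bus: EventBus) {
    this.store = store
    this.current = this.rehydrate() // "on app mount"
    bus.on(UNAUTHORIZED_EVENT, () => this.clear())
  }

  login(auth: StoredAuth): void {
    this.current = auth
    this.store.setItem(AUTH_KEY, JSON.stringify(auth))
  }

  clear(): void {
    this.current = null
    this.store.removeItem(AUTH_KEY)
  }

  private rehydrate(): StoredAuth | null {
    const raw = this.store.getItem(AUTH_KEY)
    if (raw === null) return null
    try {
      return JSON.parse(raw) as StoredAuth
    } catch {
      this.store.removeItem(AUTH_KEY) // drop unreadable state
      return null
    }
  }
}

// api() helper side: a 401 on a request that carried a token signals logout.
function handleResponseStatus(status: number, hadToken: boolean, bus: EventBus): void {
  if (status === 401 && hadToken) bus.emit(UNAUTHORIZED_EVENT)
}
```

In a browser, `window.localStorage` would be passed as the store. The rehydrated `user` object is only a display convenience, since the server still verifies the token on every request.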
@@ -0,0 +1,58 @@
import { VersionInfo } from '@start9labs/start-sdk'

// Gmail integration: Phase 1.
//
// Background: the CRM previously had no ingestion path for email
// activity. Contacts were logged manually; correspondence history lived
// only in our mailboxes. This release adds a one-way capture pipeline
// that ingests sent and received mail for every Workspace user at
// ten31.xyz, matches messages against existing investor records, and
// records metadata (+ bodies and attachments for matched threads) into
// the CRM database.
//
// Auth model: domain-wide delegation via a Google service account. The
// service-account JSON key is stored on the /data volume at
// /data/secrets/gmail-service-account.json (chmod 600, operator-dropped).
// The integration is self-disabling: if the key file is absent, the
// scheduler doesn't start and /api/email/* routes return 503. No key →
// no behavior change from 0.1.0:41.
//
// When the key IS present, docker_entrypoint.sh auto-enables the
// integration and sets sensible defaults (3-hour sync interval, domain
// ten31.xyz, DWD auth). All defaults can still be overridden via env.
//
// Database: migration 0001 adds eight new tables under the email_
// namespace (emails, email_accounts, email_recipients,
// email_account_messages, email_attachments, email_threads,
// email_investor_links, email_sync_runs). Every table is created with
// CREATE TABLE IF NOT EXISTS, so the migration is idempotent and
// re-applying it is a no-op.
//
// Backend: wholly isolated under backend/email_integration/. Three tiny,
// feature-flag-guarded hooks in server.py (migration call, scheduler
// startup, /api/email/* route dispatch). Removing or disabling the
// integration leaves server behavior identical to 0.1.0:41.
//
// New Python dep: cryptography==42.0.5 (required for RS256 JWT signing
// in DWD bearer token exchange). Now installed in the image.
//
// No data migration code needed — new tables, additive only.
export const v_0_1_0_42 = VersionInfo.of({
  version: '0.1.0:42',
  releaseNotes: {
    en_US: [
      'Adds a Gmail capture pipeline. When a Google Workspace',
      "service-account key is dropped into the server's /data/secrets",
      'folder, the CRM begins pulling sent and received mail for every',
      'ten31.xyz user on a 3-hour cycle, matching messages against',
      'existing investor records and storing metadata (plus bodies and',
      'attachments for matched threads) in the database. With no key',
      'present the feature is dormant and this release behaves',
      'identically to 0.1.0:41. Eight new email_* tables are added',
      'additively; no existing data is touched.',
    ].join(' '),
  },
  migrations: {
    up: async () => {},
    down: async () => {},
  },
})
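The self-disabling behaviour described in the header comment above (no key file means no scheduler and a 503 on `/api/email/*`) is implemented in docker_entrypoint.sh and server.py, neither of which appears in this diff. The sketch below restates the gate in TypeScript. The env var names `CRM_EMAIL_SYNC_INTERVAL_HOURS` and `CRM_EMAIL_DOMAIN`, and every function and type name, are assumptions made for illustration; the 3-hour interval and `ten31.xyz` defaults come from the comment.

```typescript
// Illustrative sketch only. The real gating lives in docker_entrypoint.sh and
// server.py; every name and env var below is hypothetical.

type Env = Record<string, string | undefined>

interface EmailConfig {
  enabled: boolean
  syncIntervalHours: number
  domain: string
}

// Presence of the key file is the on/off switch; env only tunes the defaults.
function resolveEmailConfig(keyFilePresent: boolean, env: Env): EmailConfig {
  const parsed = Number(env['CRM_EMAIL_SYNC_INTERVAL_HOURS'])
  return {
    enabled: keyFilePresent,
    syncIntervalHours: Number.isFinite(parsed) && parsed > 0 ? parsed : 3,
    domain: env['CRM_EMAIL_DOMAIN'] ?? 'ten31.xyz',
  }
}

function shouldStartScheduler(cfg: EmailConfig): boolean {
  return cfg.enabled
}

type RouteOutcome = 'passthrough' | 'unavailable-503' | 'dispatch'

// Non-email paths are never touched, which keeps behaviour identical to the
// previous release when the integration is off.
function routeEmailRequest(cfg: EmailConfig, path: string): RouteOutcome {
  if (!path.startsWith('/api/email/')) return 'passthrough'
  return cfg.enabled ? 'dispatch' : 'unavailable-503'
}
```

Keeping the gate in one config object means the scheduler start and the route dispatch cannot disagree about whether the integration is on.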
@@ -0,0 +1,44 @@
import { VersionInfo } from '@start9labs/start-sdk'

// Hotfix for 0.1.0:42.
//
// Issue 1 (critical): POST requests to /api/email/* hung indefinitely.
// server.py's do_POST called get_body() early in the dispatch to support
// /api/auth/login, which reads bytes off the request stream. The Gmail
// integration hook then ran route handlers that called get_body() a
// second time, but the stream was already drained, so the second read
// blocked waiting for bytes that never came. GET requests (which don't
// read a body) were unaffected.
//
// Fix: get_body() now caches the parsed JSON on the handler instance
// on first call. Repeat calls return the cached value. Handler
// instances are per-request in ThreadingHTTPServer, so the cache is
// naturally request-scoped and thread-safe.
//
// Issue 2 (minor): the /api/email/accounts/enroll endpoint required
// both `email_address` and `user_id` in the body, making it painful to
// call for the common single-admin-enrolling-themselves case.
//
// Fix: the endpoint now also accepts `email` as an alias, and if
// user_id isn't supplied it auto-resolves by looking up the email in
// the users table (falling back to the authenticated admin's own id
// if no match).
//
// No schema changes, no data migration.
export const v_0_1_0_43 = VersionInfo.of({
  version: '0.1.0:43',
  releaseNotes: {
    en_US: [
      'Hotfix for the Gmail integration in 0.1.0:42. POST requests to',
      '/api/email/* endpoints were hanging because the request body was',
      'being read twice from a single-shot stream. This release caches',
      'the parsed body on the request so subsequent reads are safe, and',
      'also relaxes the enroll endpoint to accept just an email and',
      'auto-resolve the CRM user.',
    ].join(' '),
  },
  migrations: {
    up: async () => {},
    down: async () => {},
  },
})
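Both fixes described in the header comment above were made in the Python server.py, which this diff does not include. The TypeScript sketch below restates them under stated assumptions: a request body that can be consumed only once, a per-request cache that makes repeat `getBody()` calls safe, and an enroll resolver that accepts `email` as an alias and falls back from an explicit `user_id` to a users-table lookup to the calling admin's id. All class, function, and parameter names are hypothetical, and the stand-in stream throws on a second read where the real socket read would block.

```typescript
// Illustrative sketch only; the real fix is in the Python server.py handler.
// All class and function names are hypothetical.

// A body stream that can be consumed once. The real socket read blocks
// forever on a second attempt; this stand-in throws so misuse is visible.
class OneShotStream {
  private data: string | null
  constructor(data: string) {
    this.data = data
  }
  read(): string {
    if (this.data === null) throw new Error('stream already drained')
    const out = this.data
    this.data = null
    return out
  }
}

// One handler instance exists per request, so the cache is request-scoped.
class RequestHandler {
  private stream: OneShotStream
  private bodyLoaded = false
  private cachedBody: unknown = null

  constructor(stream: OneShotStream) {
    this.stream = stream
  }

  // The first call drains and parses the stream; later calls reuse the result.
  getBody(): unknown {
    if (!this.bodyLoaded) {
      this.cachedBody = JSON.parse(this.stream.read())
      this.bodyLoaded = true
    }
    return this.cachedBody
  }
}

interface EnrollBody {
  email_address?: string
  email?: string // alias accepted since 0.1.0:43
  user_id?: number
}

// Resolution order: explicit user_id, then users-table match, then the admin.
function resolveEnrollment(
  body: EnrollBody,
  userIdsByEmail: Map<string, number>,
  adminId: number,
): { email: string; userId: number } {
  const email = body.email_address ?? body.email
  if (email === undefined) throw new Error('email_address or email is required')
  const userId = body.user_id ?? userIdsByEmail.get(email) ?? adminId
  return { email, userId }
}
```

A separate `bodyLoaded` flag is used instead of checking the cached value so that a legitimately `null` JSON body is still read only once.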