ten31-database/docs/crm-overview.md
Keysat c7ce44d963 Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server
Workstream A–C substrate for the Ten31 agentic system:
- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links,
  interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db,
  drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity
queries). Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:13:35 -05:00


# CRM Overview — Storage, Data Model, API, Auth
*Workstream A1 deliverable (see `PHASE_0.md`). Read-only documentation of the existing CRM as of 2026-06. Every concrete claim is anchored to `file:line`. This is a description of what exists today, not a proposal — the schema-extension proposal for A2 lives separately.*
---
## 0. TL;DR for Phase 0
- **One Python file, no framework.** The whole backend is `backend/server.py` (~4,530 lines): a stdlib `http.server.ThreadingHTTPServer` with a hand-written `CRMHandler(BaseHTTPRequestHandler)` and manual path dispatch. `requirements.txt` lists FastAPI/SQLAlchemy/Alembic/Pydantic but **none are imported** — they are vestigial.
- **Storage is one SQLite file** (`data/crm.db`), WAL mode, opened fresh per request. Schema is created idempotently in-code at boot. There is no Alembic; "migrations" are `CREATE TABLE IF NOT EXISTS` + best-effort `ALTER TABLE ADD COLUMN`.
- **Two parallel investor data models** coexist with no shared key: (1) the *classic* `contacts / organizations / opportunities / communications / lp_profiles` CRM, and (2) the *newer, actively-used* `fundraising_*` collaborative grid. They are bridged only by fuzzy name/email matching. **This duality is the central entity-resolution problem for Phase 0.**
- **A real Gmail subsystem** (`backend/email_integration/`) stores threaded correspondence in `crm.db` and matches emails to investors — but is **self-disabling** (off unless a service-account key is present).
- **Auth is a single scheme:** username/password → HS256 JWT (Bearer header), re-validated against the `users` table each request; two roles (`admin`/`member`). The `X_API_KEY` named in `CLAUDE.md`/`PHASE_0.md` **does not exist in the code** — it is aspirational.
- **Guardrail flags:** all deletes are **hard deletes** (violates guardrail #3 as written); a destructive `POST /api/admin/reset-all-data` exists; `audit_log` is mutation-only and is *not* the append-only interaction log Phase 0 wants.
---
## 1. Storage engine & where it runs
### 1.1 Runtime
- **Server:** `ThreadingHTTPServer((HOST, PORT), CRMHandler)`, `daemon_threads = True`, `serve_forever()` at `backend/server.py:4509`. Handler class at `backend/server.py:1418`. Pure Python stdlib (`http.server`); **not** FastAPI/uvicorn despite `backend/requirements.txt:1-2`.
- **Concurrency model:** one OS thread per request. Safe because each request opens its own short-lived SQLite connection under WAL (rationale documented inline at `backend/server.py:4506-4508`).
- **Request lifecycle:** banned-IP check → per-IP rate limit → (email module hook) → manual `if path == …` dispatch. Body is read once and cached on the handler instance (`get_body()`, `backend/server.py:1433`). Malformed JSON silently becomes `{}`.
- **Background threads:** a backup scheduler loops every 60 s (`start_backup_scheduler`, `backend/server.py:1367`); an optional Gmail sync scheduler starts only if `CRM_GMAIL_INTEGRATION_ENABLED` is truthy (`backend/server.py:4498`).
- **Default bind:** `0.0.0.0:8080`, plain HTTP — TLS is expected to be terminated upstream (Start9 / Tailscale).
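
The per-IP rate-limit step in the request lifecycle above can be pictured with a small sliding-window limiter. This is a minimal sketch, not the code in `backend/server.py`: the class name `SlidingWindowLimiter`, its methods, and the injectable `now` parameter are assumptions made for illustration.

```python
import threading
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per `window_sec` for each key (e.g. a client IP)."""

    def __init__(self, limit, window_sec=60.0):
        self.limit = limit
        self.window_sec = window_sec
        self._hits = defaultdict(deque)  # key -> timestamps of recent hits
        self._lock = threading.Lock()    # one thread per request, so guard shared state

    def allow(self, key, now=None):
        """Record a hit for `key` and return True if it is within the limit."""
        now = time.monotonic() if now is None else now
        with self._lock:
            hits = self._hits[key]
            # Drop timestamps that have slid out of the window.
            while hits and now - hits[0] >= self.window_sec:
                hits.popleft()
            if len(hits) >= self.limit:
                return False
            hits.append(now)
            return True
```

A handler would call something like `limiter.allow(client_ip)` before dispatch and reject the request when it returns `False`. The lock matters here because the server runs one OS thread per request.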
### 1.2 SQLite configuration
`get_db()` (`backend/server.py:77-84`) sets, on every connection:
- `PRAGMA journal_mode=WAL` — concurrent readers + single writer (this is what makes the ingest reader safe against the live writer).
- `PRAGMA foreign_keys=ON` — FKs are enforced at runtime (per-connection in SQLite, so re-set each time).
- `PRAGMA busy_timeout=5000` — 5 s wait on a lock.
- `row_factory = sqlite3.Row`.
The Gmail module re-implements the identical pragmas (`email_integration/scheduler.py:49`, `email_integration/routes.py:89`) rather than import `server.py`, to avoid a circular import.
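
For orientation, a connection helper with the pragmas listed above might look like the following. It is a sketch of the described behaviour, not a copy of `get_db()`; the name `open_crm_db` and its `path` argument are assumptions.

```python
import sqlite3


def open_crm_db(path):
    """Open a short-lived connection configured like the pragmas described above."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row                 # rows addressable by column name
    conn.execute("PRAGMA journal_mode=WAL")        # concurrent readers + single writer
    conn.execute("PRAGMA foreign_keys=ON")         # per-connection in SQLite, so set each time
    conn.execute("PRAGMA busy_timeout=5000")       # wait up to 5 s on a locked database
    return conn
```

Because `foreign_keys` and `busy_timeout` are connection-scoped, any second code path that opens the file (such as the Gmail module) has to repeat them, which is consistent with the duplication noted above.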
### 1.3 Schema bootstrap & "migrations"
- `init_db()` (`backend/server.py:86`) runs once at startup, before binding. One big `executescript` of `CREATE TABLE/INDEX IF NOT EXISTS` (`backend/server.py:91-405`) creates **both** data models plus `app_settings`.
- **Core "migrations":** a hardcoded list of `ALTER TABLE … ADD COLUMN` wrapped in a try/except that swallows `OperationalError` (`backend/server.py:407-418`) — additive-only, idempotent-by-failure. No version table, no down-migrations. (Currently adds `city/state/country/location_query` to `contacts` and `lead_source` to `fundraising_investors`.)
- **The only real migration runner** is in the Gmail module: `email_integration.db.apply_migrations()` (`email_integration/db.py:23`) runs numbered `NNNN_*.sql` files lexicographically. There is one today: `migrations/0001_email_tables.sql`. ⚠️ This is called from `init_db()` guarded only by `ImportError` (`backend/server.py:421-427`), so on any image where the package is importable, **the email tables are created even when Gmail sync is disabled**.
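
The "idempotent-by-failure" core pattern described above can be sketched as follows. The column list here is an illustrative subset and `apply_additive_columns` is a hypothetical name; the real list lives at the cited lines of `backend/server.py`.

```python
import sqlite3

# Illustrative subset only; (table, column, declared type).
ADDITIVE_COLUMNS = [
    ("contacts", "city", "TEXT"),
    ("contacts", "country", "TEXT"),
]


def apply_additive_columns(conn, columns=ADDITIVE_COLUMNS):
    """Try each ADD COLUMN; treat OperationalError as 'already applied'."""
    added = []
    for table, column, decl in columns:
        try:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
            added.append((table, column))
        except sqlite3.OperationalError:
            # Usually "duplicate column name" on a re-run. Note that this also
            # hides unrelated failures such as a missing table.
            pass
    conn.commit()
    return added
```

Running it twice is harmless: the second pass adds nothing. The trade-off is that a blanket `except` cannot distinguish "already applied" from a genuine error, which is one argument for a ledgered runner.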
### 1.4 `data/` layout
- `crm.db`, `crm.db-wal`, `crm.db-shm` — the DB + WAL + shared-memory.
- `backups/` — JSON snapshots of the **fundraising grid state only** (not the whole DB), written by the backup scheduler.
- `secrets/` — holds `gmail-service-account.json` (mode 600).
- `email_attachments/` — Gmail attachment bytes on disk (created on the 0.4 image).
- `.crm-secret` — persisted JWT secret, written by the container entrypoint (not the app).
### 1.5 Production deployment (StartOS)
Package id `ten-database` ("Ten31 Database"). Both generations run the **same** app (`python3 /app/backend/server.py`) in a `python:3.11-slim` container; all state on a single persistent volume `main` mounted at `/data`.
- **`start9/0.4/` — current/live target.** Manifest & lifecycle are TypeScript under `start9/0.4/startos/` (`manifest/index.ts`, `main.ts`, `backups.ts`, `interfaces.ts`). Built for `x86_64` + `aarch64`. Whole-volume backups (`sdk.Backups.ofVolumes('main')`). The richer `docker_entrypoint.sh` creates `/data/{backups,secrets,email_attachments}`, persists `CRM_SECRET_KEY` to `/data/.crm-secret`, and **conditionally enables Gmail** iff `/data/secrets/gmail-service-account.json` exists (then exports DWD env: `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`, sync interval 180 min). Version notes record the 0.3.5→0.4 migration is complete and the live `/data` volume is the sole source of truth (no more baked-in seed snapshot).
- **`start9/0.3.5/` — legacy.** Hand-written YAML manifest (`manifest.yaml`), arm64-only, Tor `80→8080` + LAN `443(ssl)→8080`. Superseded by 0.4.
- **Local/dev:** `start.sh` runs `python3 backend/server.py` with dev defaults. `start_beta.sh` is a Tailscale launcher that sources `.env.beta`, forces `CRM_ENV=production`, and requires a ≥24-char `CRM_SECRET_KEY`. In production mode the app refuses to start without `CRM_SECRET_KEY` (`backend/server.py:4487`).
### 1.6 Environment variables (for CLAUDE.md "CRM connection vars")
**Core server (`backend/server.py:42-71`):** `CRM_DATA_DIR`, `CRM_FRONTEND_DIR`, `CRM_DB_PATH` (default `<DATA_DIR>/crm.db`), `CRM_SECRET_KEY` (JWT signing — required in production), `CRM_HOST` (default `0.0.0.0`), `CRM_PORT` (default `8080`), `CRM_CORS_ORIGIN` (default `*`), `CRM_ENV` (default `development`), `CRM_LOGIN_RATE_LIMIT_PER_MIN`, `CRM_WRITE_RATE_LIMIT_PER_MIN`, `CRM_GET_RATE_LIMIT_PER_MIN`, `CRM_ABUSE_404_THRESHOLD`, `CRM_ABUSE_404_WINDOW_SEC`, `CRM_ABUSE_BAN_SEC`, `CRM_SEED_DEMO_DATA`, `CRM_GMAIL_INTEGRATION_ENABLED`.
**Gmail module (`email_integration/config.py:80-101`):** `CRM_GMAIL_AUTH_METHOD` (`dwd`/`oauth`), `CRM_GMAIL_SA_KEY_PATH`, `CRM_GMAIL_WORKSPACE_DOMAIN`, `CRM_GMAIL_OAUTH_CLIENT_ID/SECRET/REDIRECT_URI`, `CRM_GMAIL_SECRET_KEY` (AES key for OAuth-token-at-rest — **separate** from `CRM_SECRET_KEY`), `CRM_GMAIL_SYNC_INTERVAL_MIN`, `CRM_GMAIL_BACKFILL_PAGE_SIZE`, `CRM_GMAIL_MAX_ATTACHMENT_MB`, plus rate/retry knobs.
> **There is no network DB protocol.** "Connecting to the CRM" means either (a) opening the same SQLite file (`CRM_DB_PATH`) — only possible co-located with the `/data` volume — or (b) HTTP at `http://<host>:8080` with a Bearer JWT.
---
## 2. Data model
PKs are **8-char truncated UUIDs** (`generate_id()` = `str(uuid.uuid4())[:8]`, `backend/server.py:522`). Timestamps are ISO-8601 UTC strings (`now()`, `backend/server.py:525`). JSON-bearing TEXT columns (`tags`, `attendees`, `options`) are `json.dumps`'d on write and auto-parsed on read by `row_to_dict()` (`backend/server.py:506-517`).
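
To put a number on what truncation costs, the sketch below reproduces the ID expression quoted above and applies the standard birthday approximation to an 8-hex-character space (16^8 = 2^32 values). The function `collision_probability` is an illustrative helper, not something in the codebase.

```python
import math
import uuid

ID_SPACE = 16 ** 8  # 8 hex characters -> 2**32 possible ids


def generate_id():
    """Same expression as quoted above: the first 8 characters of a UUID4 string."""
    return str(uuid.uuid4())[:8]


def collision_probability(n, space=ID_SPACE):
    """Birthday approximation: chance that n random ids contain at least one clash."""
    return -math.expm1(-n * (n - 1) / (2.0 * space))
```

Under this approximation, 10,000 ids drawn into one shared key space give a clash probability of a little over 1%, and 100,000 ids give well over 60%. Per-table primary keys see far fewer rows, but a single index keyed on these ids across tables would pool them.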
### 2.1 Classic CRM model
| Table | Role | Key columns / notes |
|---|---|---|
| `users` | auth + ownership principal | `username`/`email` UNIQUE, `password_hash`, `role` ∈ {`admin`,`member`}, `is_active`. First user forced `admin`. (`backend/server.py:92`) |
| `organizations` | weak parent of contacts/opps | `name` (not unique), `type` (free-text, default `other`), `tags` JSON, `description`. (`backend/server.py:104`) |
| `contacts` | **the hub** | `first_name`/`last_name` (req), `organization_id` (FK SET NULL), `contact_type` (free-text; load-bearing values `prospect`/`investor`), `status` (default `active`), `source`, `tags` JSON, `notes`, `linkedin_url`. (`backend/server.py:123`) |
| `opportunities` | deal pipeline | `contact_id` (req, FK **CASCADE**), `stage` (allowlist `PIPELINE_STAGES` at `backend/server.py:1380`, enforced **only** on the stage endpoint), `commitment_amount`, `expected_amount`, `fund_name`, `owner_id`, `lost_reason`. (`backend/server.py:148`) |
| `lp_profiles` | closed-LP extension | 1:1 with a contact (`contact_id` UNIQUE, FK CASCADE). Holds `commitment_amount`, `funded_amount`, `accredited` (bare 0/1), `legal_docs_signed`, `wire_received`, `k1_sent`, `investor_type` (free-text). (`backend/server.py:186`) |
| `custom_fields` / `custom_field_values` | EAV custom fields | **Dead**: schema exists but has **no routes/handlers**; only ever wiped by reset. Do not build on this. (`backend/server.py:206`) |
| `tags` | global tag palette | `name` UNIQUE + `color`. Not FK-linked to the per-row `tags` JSON arrays; just an autocomplete source. (`backend/server.py:237`) |
| `audit_log` | mutation diff trail | `user_id`, `entity_type`, `entity_id`, `action`, `changes` JSON. **Mutation-only**, no reads, no actor/agent dimension. (`backend/server.py:227`) |
**How an LP is represented:** a single `contacts` row is the canonical record; `contact_type` carries the funnel stage (`prospect` → `investor`). Promotion to `investor` is a side effect of creating an `lp_profiles` row (`backend/server.py:2834`) or of fundraising-grid sync (`backend/server.py:788`). The contact dossier is assembled by `GET /api/contacts/{id}` (`backend/server.py:2008`): contact + last-20 communications + all opportunities + the one lp_profile. **Note the unreconciled double-modeling of money:** in-flight commitment lives on `opportunities`, closed commitment lives on `lp_profiles`, and the grid has a *third* copy in `fundraising_commitments`.
### 2.2 Fundraising grid model (newer, actively used)
A **real-time collaborative spreadsheet** the partners actually edit: funds are columns, investors are rows, dollar commitments are cells, plus saved views, live presence/cell-locks, and a small automation engine.
- **Authoritative store = one JSON blob:** `fundraising_state.grid_json` + `views_json` (row `id='main'`, `backend/server.py:258`). Reads/exports come straight from this.
- **Normalized tables are a derived mirror**, fully rebuilt from the JSON on **every** write by `sync_fundraising_relational()` (`backend/server.py:945`): `fundraising_investors` (keyed by `source_row_id`), `fundraising_funds` (by `column_id`), `fundraising_commitments` (cells), `fundraising_contacts`, `fundraising_views`. ⚠️ **`fundraising_contacts`/`fundraising_commitments`/`fundraising_views` get fresh UUIDs on every save** (DELETE+reinsert); only `fundraising_investors.id` and `fundraising_funds.id` are stable. Don't persist external references to the volatile ones.
- **Automation engine** (`run_fundraising_automations`, `backend/server.py:668`): currently **ignores** the rules' `condition_json`/`action_json` and uses hard-coded flag logic (`graveyard`→graveyard list, `follow_up`→follow_up list, everyone→`all`), rebuilding `fundraising_list_memberships` and logging changes to `fundraising_automation_runs`. So the rules table is display/config surface, not a live interpreter.
- **Backups:** JSON-only filesystem snapshots of grid state to `data/backups/` (manual/auto/pre_restore), governed by a policy in `app_settings`. Restore overwrites state then re-syncs the mirror.
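
The stable-versus-volatile id behaviour of the mirror can be illustrated with a heavily simplified rebuild. Everything below other than the two table names is an assumption: the grid JSON shape (`rows`, `cells`), the reduced column set, and the function name `sync_mirror` are hypothetical and do not reflect `sync_fundraising_relational()` itself.

```python
import sqlite3
import uuid

# Hypothetical, reduced schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS fundraising_investors (
    id TEXT PRIMARY KEY, source_row_id TEXT UNIQUE, name TEXT);
CREATE TABLE IF NOT EXISTS fundraising_commitments (
    id TEXT PRIMARY KEY, investor_id TEXT, column_id TEXT, amount REAL);
"""


def sync_mirror(conn, grid):
    """Rebuild the mirror from an assumed grid shape: {"rows": [{"id", "name", "cells"}]}."""
    conn.executescript(SCHEMA)
    # Volatile table: wiped and reinserted, so its ids change on every save.
    conn.execute("DELETE FROM fundraising_commitments")
    for row in grid["rows"]:
        found = conn.execute(
            "SELECT id FROM fundraising_investors WHERE source_row_id = ?",
            (row["id"],),
        ).fetchone()
        if found:
            investor_id = found[0]  # stable: reused because it is keyed by source_row_id
            conn.execute(
                "UPDATE fundraising_investors SET name = ? WHERE id = ?",
                (row["name"], investor_id),
            )
        else:
            investor_id = str(uuid.uuid4())
            conn.execute(
                "INSERT INTO fundraising_investors VALUES (?, ?, ?)",
                (investor_id, row["id"], row["name"]),
            )
        for column_id, amount in row.get("cells", {}).items():
            conn.execute(
                "INSERT INTO fundraising_commitments VALUES (?, ?, ?, ?)",
                (str(uuid.uuid4()), investor_id, column_id, amount),
            )
    conn.commit()
```

Saving the same grid twice leaves the investor id unchanged while the commitment id differs, which is why an ingest pipeline would want to key on the investor id plus `column_id` rather than on a commitment row id. The sketch does not handle rows removed from the grid.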
### 2.3 The two-model bridge (the central problem)
There is **no foreign key** between `fundraising_investors` and the classic `contacts/organizations/opportunities/lp_profiles`. They are joined only by best-effort name/email matching, essentially one-directional **grid → classic**:
1. **Grid → classic (write-through):** every grid save pushes each grid contact into classic `contacts`/`organizations` via `_upsert_contact_from_fundraising` (`backend/server.py:730`), matching by lowercased email else (name + org). No stored key links the resulting `contacts.id` back to `fundraising_investors.id` — it re-matches by name/email each time.
2. **Grid-logged comms → classic `communications`:** `POST /api/fundraising/log-communication` (`backend/server.py:2561`) writes into the classic `communications` table *and* appends a dated line to the grid row's `notes`.
3. **Classic → grid (partial reverse mirror):** `_sync_contact_to_fundraising_state` (`backend/server.py:815`) patches an existing matching grid row but will **not** create a new investor row.
**Net:** the same investor can simultaneously exist as a `fundraising_investors` row, one-or-more `contacts` rows, an `organizations` row, and an `lp_profiles` row, with **no authoritative join key**. The email matcher treats the **fundraising side as higher-signal** (`matcher.py:103`). `lp_profiles` is entirely outside the fundraising subsystem.
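
The matching rule described in step 1 (lowercased email, else name + org) can be sketched as a pure function. This is an illustration of the rule as summarized here, not `_upsert_contact_from_fundraising`; the dict keys and the choice to fall back when the email is absent *or* unmatched are assumptions.

```python
def _norm(value):
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return " ".join((value or "").lower().split())


def find_contact_match(candidate, contacts):
    """Return the first contact matching by email, else by (name, organization)."""
    email = _norm(candidate.get("email"))
    if email:
        for contact in contacts:
            if _norm(contact.get("email")) == email:
                return contact
    # Assumed fallback: applies when the email is missing or matched nothing.
    key = (_norm(candidate.get("name")), _norm(candidate.get("organization")))
    if key[0]:
        for contact in contacts:
            if (_norm(contact.get("name")), _norm(contact.get("organization"))) == key:
                return contact
    return None
```

Because nothing stores the result of the match, every save repeats this lookup, and a renamed contact or a changed email silently produces a second `contacts` row instead of updating the first.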
### 2.4 Notes, interactions & correspondence (Phase-0 critical)
Three subsystems hold embeddable text:
**(a) `communications`** (`backend/server.py:168`) — the primary human-logged activity store. One row per `note`/`call`/`email`/`meeting`/`text` (type is free-text; UI offers those 5, `frontend/index.html:4220`). Columns: `contact_id` (req, CASCADE), `opportunity_id` (SET NULL), `subject`, `body`, `communication_date` (the **event** timestamp, distinct from `created_at` — this is the `date_ts` source), `outcome`, `next_action`, `attendees` JSON. Written by `POST /api/communications` and by the fundraising log endpoint. **Hard delete** at `backend/server.py:2758`.
**(b) Scattered free-text fields** worth embedding: `contacts.notes`, `lp_profiles.notes`, `fundraising_investors.notes` (a running, newline-appended outreach log mirroring the grid Notes column), `opportunities.description`/`next_step`, `organizations.description`.
**(c) Gmail correspondence** (`backend/email_integration/`, schema in `migrations/0001_email_tables.sql`):
- `emails` — canonical record deduped by RFC `message_id`; `subject`, `from_*`, `to/cc/bcc` JSON, `sent_at`, `body_text`, `body_html`, `snippet`, `is_matched`, `match_status`. ⚠️ **Bodies are stored only for matched emails**; unmatched emails are metadata-only with the body nulled (`sync.py:319`).
- `email_threads` — thread roll-up; threading via RFC `In-Reply-To`/`References` chain then Gmail thread id (`threads.py:38`).
- `email_account_messages` — per-mailbox sighting (dedup across team inboxes).
- `email_attachments` — metadata; bytes on disk, deduped by SHA-256.
- **`email_investor_links`** — the entity linkage. Populates any subset of `fundraising_investor_id` / `fundraising_contact_id` / `contact_id` / `organization_id` (all **soft references, no FK**) with `match_kind` (`exact_email` conf 1.0 / `domain_match` conf 0.6) and confidence. A single email can link to several entities at once. `email_sync_runs` records per-run observability.
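
The two `match_kind` values and their confidences suggest a classification step along these lines. It is a sketch under assumptions: `classify_match` and its arguments are hypothetical, and the real logic in `matcher.py` may apply further rules that are not described here.

```python
def classify_match(address, known_emails, known_domains):
    """Return (match_kind, confidence) for an address, or None if nothing matches."""
    addr = (address or "").strip().lower()
    if "@" not in addr:
        return None
    if addr in known_emails:
        return ("exact_email", 1.0)
    domain = addr.rpartition("@")[2]
    if domain in known_domains:
        return ("domain_match", 0.6)
    return None
```

Keeping the confidence alongside the link is what lets downstream consumers filter out the weaker `domain_match` links when precision matters more than recall.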
**`audit_log` is *not* the interaction log.** It is mutation-only (≈26 write sites), logs no reads, and has no agent/actor dimension. Phase 0's "append-only interaction log of every agent action and every human touch" (Workstream A2) needs a **new** table, not a repurpose.
---
## 3. API / route surface
Full REST verbs exist (mutations are **not** tunneled through POST): `do_GET` (1589), `do_POST` (1727), `do_PUT` (1791), `do_PATCH` (1817), `do_DELETE` (1845), `do_OPTIONS` (1580), all in `backend/server.py`. Routing is a flat `if/elif` ladder: exact string for collections, `re.match(r'^/api/x/[^/]+$')` for items, path params parsed positionally with `path.split('/')`. The Gmail module hooks in at the top of `do_GET`/`do_POST` via `try_handle(self)` (`email_integration/routes.py:49`), claiming any `/api/email/*` path.
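
The routing style described above (exact strings for collections, a regex for items, positional `split('/')` for the id) reduces to something like the sketch below. The handler names returned are placeholders, not the real method names in `CRMHandler`.

```python
import re


def dispatch(method, path):
    """Return (handler_name, path_param) using a flat if-ladder, as described above."""
    if path == "/api/contacts":
        if method == "GET":
            return ("list_contacts", None)
        if method == "POST":
            return ("create_contact", None)
    if re.match(r"^/api/contacts/[^/]+$", path):
        contact_id = path.split("/")[3]  # ['', 'api', 'contacts', '<id>']
        if method == "GET":
            return ("get_contact", contact_id)
        if method == "PUT":
            return ("update_contact", contact_id)
        if method == "DELETE":
            return ("delete_contact", contact_id)
    if method == "GET" and re.match(r"^/api/contacts/[^/]+/communications$", path):
        return ("list_contact_communications", path.split("/")[3])
    return ("not_found", None)
```

The positional index is the fragile part of this style: adding a path prefix shifts every `split('/')` offset, so any new routes need the same care.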
**Response envelope:** reads/writes → `{"data": …}` (+`total`/`limit`/`offset` for lists); errors → `{"error": msg}`; create → 201; auth → bare `{"token","user"}`; email handlers use ad-hoc keys. CORS allows all verbs; `Access-Control-Allow-Origin` echoes `CRM_CORS_ORIGIN` (default `*`).
**Auth column:** None = public · Bearer = any active user · Admin = `require_admin`.
| Method | Path | Purpose | Auth |
|---|---|---|---|
| GET | `/`, `/index.html`, `/assets/*` | Serve SPA + static | None |
| GET | `/api/health` | Liveness | None |
| GET | `/api/bootstrap/status` | First-run check | None |
| POST | `/api/auth/login` | Login → JWT | None |
| POST | `/api/auth/register` | First-user registration (self-disables) | None |
| GET/POST | `/api/contacts` | List/search · Create | Bearer |
| GET/PUT/DELETE | `/api/contacts/{id}` | Detail (dossier) · Update · **Hard delete** | Bearer |
| GET | `/api/contacts/{id}/communications` | Per-contact interaction history | Bearer |
| GET/POST · GET/PUT/DELETE | `/api/organizations[/{id}]` | Org CRUD | Bearer |
| GET/POST · GET/PUT/DELETE | `/api/opportunities[/{id}]` | Opp CRUD | Bearer |
| PATCH | `/api/opportunities/{id}/stage` | Move pipeline stage (validated) | Bearer |
| GET/POST · GET/PUT/DELETE | `/api/communications[/{id}]` | Comms CRUD | Bearer |
| GET/POST · GET/PUT | `/api/lp-profiles[/{id}]` | LP-profile CRUD (no delete route) | Bearer |
| GET | `/api/reports/{dashboard,pipeline,lp-breakdown,activity}` | Aggregates | Bearer |
| GET | `/api/export/contacts` | Export **all** contacts (returns JSON, not CSV) | Bearer |
| POST | `/api/import/csv` | Bulk import from JSON rows | Bearer |
| GET/POST · PATCH | `/api/feature-requests[/{id}]` | Feature-request tracker | Bearer |
| GET | `/api/users` | List users (no hashes) | Bearer |
| POST · PATCH | `/api/admin/users[/{id}]` | Create / update user | **Admin** |
| POST | `/api/admin/reset-all-data` | ⚠️ Wipe CRM (confirm phrase `RESET ALL DATA`) | **Admin** |
| GET | `/api/audit-log` | Mutation audit trail | **Admin** |
| GET | `/api/security/status` | Config/security status | **Admin** |
| GET/PUT | `/api/fundraising/state` | Get / save grid (optimistic `version`, 409 on conflict) | Bearer |
| GET/POST | `/api/fundraising/collab/{state,heartbeat}` | Presence + cell locks | Bearer |
| POST | `/api/fundraising/log-communication` | Log comm + append grid note | Bearer |
| GET | `/api/fundraising/{export,relational-summary}` | Export / counts | Bearer |
| GET | `/api/fundraising/activity` | Merged audit+automation+backup feed | **Admin** |
| GET/PATCH | `/api/fundraising/automations[/{id}]` | Automation rules | **Admin** |
| GET | `/api/fundraising/automation-runs` | Run history | **Admin** |
| GET/POST | `/api/fundraising/{backups,backup,backup-verify}` | Backup mgmt | **Admin** |
| POST | `/api/fundraising/{restore-preview,restore}` | Restore grid | **Admin** |
| GET/PATCH | `/api/fundraising/backup-policy` | Backup policy | **Admin** |
| GET | `/api/email/{status,accounts,threads}` | Sync status / accounts / matched threads | Bearer + flag |
| GET | `/api/email/oauth/{start,callback}` | Per-user OAuth (callback is state-token gated, no Bearer) | mixed + flag |
| POST | `/api/email/accounts/{enroll-all,enroll}` · `/sync/run-now` · `/rematch` | Enrollment & sync ops | **Admin** + flag |
**Defined but NOT routed:** `handle_list_tags`/`create`/`delete` exist (`backend/server.py:3366-3400`) but no `/api/tags` route is wired; the `custom_fields` tables have no routes at all. Treat both as dead for Phase 0.
---
## 4. Authentication & authorization
- **Login:** `POST /api/auth/login` → `handle_login` (`backend/server.py:1880`). Looks up active user, `verify_password` (bcrypt, PBKDF2-SHA256 fallback, `backend/server.py:444`), issues `create_token` (HS256 JWT via PyJWT, HMAC fallback; claims `user_id/username/role/exp/iat`; 24 h expiry; `backend/server.py:464`).
- **Per-request verification:** `get_user()` (`backend/server.py:1458`) reads `Authorization: Bearer`, `decode_token` (pins `algorithms=["HS256"]` — no `alg:none` downgrade), then **re-loads the user row** and rejects if missing/inactive. Identity (incl. role) comes from the **DB row, not token claims**, so deactivation and role changes take effect immediately.
- **No cookies, no logout, no refresh, no revocation.** The only early kill-switch is `is_active=0`.
- **Bootstrap:** `GET /api/bootstrap/status` (public) reports `setup_required`. `POST /api/auth/register` (public, **self-disables** once any user exists) creates and force-promotes the first user to `admin`. A separate flag-gated `seed_demo_data()` hardcodes `admin/admin123` + `grant/password` and prints them — **dev-only, off by default** (`backend/server.py:4351-4374`).
- **Roles:** only `admin`/`member`. Enforcement is an inline `require_admin(user)` (`backend/server.py:541`) at the top of each admin handler — no middleware. **No row-level authorization:** any active member can read/edit *all* LP and fundraising data; `created_by`/`owner_id` are informational only.
- **`X_API_KEY` does not exist in code.** Repo-wide it appears only in `CLAUDE.md` and `docs/PHASE_0.md`. There is no API-key header path and no service-auth distinct from the user JWT. (The `Bearer` tokens in `email_integration/` are *outbound* Google OAuth tokens, unrelated.)
- **Secrets:** JWT key `CRM_SECRET_KEY` (random per-process default; hard-fails in production if unset, `backend/server.py:4487`). Gmail OAuth tokens are encrypted at rest with **AES-256-GCM** keyed off the *separate* `CRM_GMAIL_SECRET_KEY` (`email_integration/crypto.py:37`) — a genuinely sound scheme. The Google service-account key lives at `data/secrets/gmail-service-account.json` (0600).
- **Hardening present:** per-IP sliding-window rate limits (login 20/min, writes 300/min, GETs 600/min); auto-ban of scanner IPs after a 404 burst (`record_404`, `backend/server.py:1520`). **Absent:** security headers (HSTS/CSP/X-Frame-Options), CORS defaults to wildcard `*`, `X-Forwarded-For` is trusted (only safe behind a controlled proxy).
---
## 5. Phase-0 implications (carry into A2/B/C)
1. **Pick a canonical LP identity.** The classic vs fundraising duality (§2.3) means entity resolution (A4/B3) must collapse across **both** models — grid investor + grid contacts + classic contact/org/lp_profile — into one canonical `lp_id`, not just dedupe name variants within one table. The email matcher's preference (`fundraising_contacts` > `contacts`) is a hint that the **grid is the operationally-live LP graph**, but closed financials/accreditation live only in `lp_profiles`.
2. **Canonical ID host.** `contacts.id` is the natural join target (lp_profiles + opportunities FK to it; the dossier is keyed on it), but the **8-char truncated UUID** is a uniqueness risk if it becomes the Qdrant payload key — resolve before indexing.
3. **Interaction log is greenfield.** `audit_log` won't do (mutation-only, no actor/agent dimension). A2 should add a dedicated append-only `interaction_log` table and route all MCP/agent writes through it (guardrail #5).
4. **Soft-delete gap.** Everything is hard-deleted today (guardrail #3 risk) and there's no tombstone for an idempotent re-embed to detect removals. A2 should add a `deleted_at`/archive flag and have ingest treat it as a delete-from-index signal.
5. **Chunk sources** (per `EMBEDDINGS.md`): one chunk per `communications` row; one per **matched** `emails` row (only these have bodies); split the `fundraising_investors.notes` outreach log per dated line; one chunk each for the scattered note/description fields. Keep ids/names/dates/types/confidence as filterable payload; embed only subject/body/note prose. `date_ts` = epoch of the **event** timestamp (`communication_date`/`sent_at`), not `created_at`. Key Qdrant point ids on stable ids (`emails.id`), namespaced to avoid the 8-char collision risk.
6. **Migration mechanism.** No Alembic. Either extend the in-code idempotent pattern (`CREATE … IF NOT EXISTS` + try/except `ALTER ADD COLUMN`) or adopt the email module's numbered-`.sql` runner for the core schema (recommended; add a `schema_migrations` ledger). SQLite's `ALTER TABLE` is limited to add-column and rename (plus `DROP COLUMN` from SQLite 3.35.0 onward), which conveniently nudges schema changes toward the additive/reversible guardrail.
7. **Ingest connection.** No network DB protocol — open `data/crm.db` **read-only** (`sqlite3` URI `mode=ro`) with the same WAL/`busy_timeout` pragmas, co-located with the `/data` volume; WAL guarantees the reader never blocks the live writer. Reserve HTTP+JWT for any write-back. Decide the MCP↔CRM auth boundary explicitly (recommend: read-only direct SQLite for ingest; a constrained service principal for write-back) since `X_API_KEY` is unimplemented.
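
The read-only ingest connection from item 7 could be opened roughly as follows. This is a sketch with an assumed helper name (`open_readonly`); it relies on the standard `sqlite3` URI form with `mode=ro` and does not set `journal_mode`, since WAL is a persistent property of the database file rather than of the connection.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path):
    """Open the CRM database read-only via a file: URI, for ingest-style readers."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA busy_timeout=5000")  # same 5 s lock wait as the server
    return conn
```

Any write attempted on this connection raises `sqlite3.OperationalError`, which gives the ingest process a hard guarantee that it cannot mutate CRM state. One caveat: a read-only reader of a WAL database still needs access to the `-wal` and `-shm` files next to `crm.db`, so the process has to see the same directory, not just a copy of the main file.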
---
## 6. Open questions for the owner
1. **Which model is canonical for an LP** — the `contacts` row or the `fundraising_investors` grid row? (Determines the ingest spine and the canonical-ID target.)
2. **Is the Gmail integration enabled on the live Start9 box, and has a backfill run?** If not, the Phase-0 corpus is just `communications` + note fields until it is. (The email schema exists regardless.)
3. **Should `custom_fields` (EAV) and the unrouted `tags` CRUD be revived or removed?** Leaving the EAV risks a second divergent custom-data path next to the live fundraising custom columns.
4. **Accreditation today is a single boolean** (`lp_profiles.accredited`) with no QP flag / method / date — where should the 506(b)/506(c) + accreditation/QP fields counsel will require (guardrail #6) live?
5. **MCP↔CRM auth:** build a real `X_API_KEY` service-key path, authenticate the MCP server as a dedicated CRM user (24 h JWT, must re-login), or read SQLite directly? And does the agent principal need a new least-privilege role below `member`?
6. **`fundraising_state.grid_json` vs the normalized mirror** — confirmed authoritative = the JSON blob; is reading the normalized mirror (consistent after each save) acceptable for ingest, treating the JSON as the re-derivable source of truth?
---
*Sources: `backend/server.py`, `backend/email_integration/*`, `backend/email_integration/migrations/0001_email_tables.sql`, `start9/0.3.5/*`, `start9/0.4/*`. Generated from a structured multi-agent read of the codebase, cross-checked against the live `data/crm.db` schema (currently a near-empty seeded instance: 1 user, 9 funds, 4 views, 2 automation rules — the real corpus lives on the Start9 deployment).*