Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server

Workstream A–C substrate for the Ten31 agentic system:

- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links, interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes); no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db, drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity queries). Real backfill runs on Ten31 infra; index holds synthetic data only. Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:11:28 -05:00
parent 7027efd777
commit c7ce44d963
99 changed files with 10676 additions and 7817 deletions
@@ -0,0 +1,30 @@
+# Keep Docker build context lean and reproducible.
+# Used by start9/0.4/Dockerfile; workdir is the repo root.
+
+# Python artifacts — never copy stale bytecode into the image.
+**/__pycache__/
+**/*.pyc
+**/*.pyo
+
+# Local dev / editor noise
+.venv/
+venv/
+.env
+.env.beta
+.env.beta.example
+.idea/
+.vscode/
+*.swp
+.DS_Store
+
+# Start9 package build artifacts (huge; re-generated by `make`).
+start9/**/javascript/
+start9/**/javascript.old/
+start9/**/javascript.tmp.*/
+start9/**/node_modules/
+start9/**/*.s9pk
+start9/**/image.tar
+start9/**/seed/
+
+# Runtime CRM data — never leak dev state into a prod image.
+data/
@@ -0,0 +1,21 @@
+# Ten31 agentic system — environment template.
+# Copy to .env (gitignored) and fill in. Secret values NEVER go in .env.example.
+
+# ── Claude (frontier reasoning; Agent SDK uses an API key, not claude.ai login) ──
+ANTHROPIC_API_KEY=
+
+# ── Spark Control gateway (local model services; reads + dense embeds) ──
+# HTTPS with the Start9 self-signed cert -> clients must skip TLS verification.
+SPARK_CONTROL_URL=https://<spark-control-host>:<port>
+SPARK_CONTROL_VERIFY_TLS=false
+
+# ── Qdrant (direct, for ingest: create collection + upsert points) ──
+# Plain HTTP on the trusted LAN, no auth currently.
+QDRANT_URL=http://<spark2-host>:6333
+
+# ── X (Twitter) API for Scout/Analyst enrichment (NOT a CRM key) ──
+X_API_KEY=
+
+# ── CRM (ingest opens the SQLite file directly, read-only) ──
+CRM_DB_PATH=./data/crm.db
+CRM_DEV_DB_PATH=./data/crm_dev.db
@@ -0,0 +1,34 @@
+# ── macOS ──
+.DS_Store
+**/.DS_Store
+
+# ── Python ──
+__pycache__/
+*.py[cod]
+
+# ── Secrets & env (CLAUDE.md guardrail #7 — never commit secrets) ──
+.env
+.env.*
+!.env.example
+!.env.beta.example
+data/secrets/
+
+# ── Local databases & SQLite sidecars ──
+# CRM is canonical and lives on the Start9 /data volume; never commit live LP
+# data into git (guardrail #1). Dev/synthetic DBs are local-only.
+data/*.db
+data/*.db-wal
+data/*.db-shm
+data/crm_dev.db*
+
+# ── FUSE / build temp ──
+*.fuse_hidden*
+start9/**/javascript.tmp.*
+node_modules/
+
+# Real CRM data bundled in packaging seed dirs — never commit (guardrail #1)
+**/seed/data/
+**/crm.db
+**/crm.db-wal
+**/crm.db-shm
+**/fundraising_state_auto_*.json
@@ -0,0 +1,89 @@
+# Ten31 Agentic System — Project Memory
+
+This file is the project constitution. Read it first; it states settled decisions and non-negotiable guardrails. Where anything here conflicts with a one-off prompt, this file wins.
+
+## What we're building
+
+Ten31 is an investment platform (bitcoin ecosystem, energy, AI infrastructure, freedom tech) that raises from LPs and deploys into private companies. We are building an in-house system of AI agents to widen the fundraising funnel, sharpen and propagate our investment thesis, and automate marketing/branding. Build agents on the **Claude Agent SDK**, connected to our systems via **MCP**. Frontier reasoning runs on **Claude**; privacy-sensitive and high-volume work runs on **local models on our DGX Sparks**, fronted by **Spark Control**.
+
+Full architecture and rationale: see `@./docs/Ten31_Agentic_Build_Plan.md`.
+Current phase tasks and acceptance criteria: see `@./docs/PHASE_0.md`.
+Embedding/retrieval API contract + ingest recipe (authoritative): see `@./docs/EMBEDDINGS.md`.
+
+**We are in Phase 0.** Phase 0 builds the data + retrieval substrate. There are NO live, outward-facing agents in Phase 0.
+
+## Settled architecture
+
+- **Reasoning model:** Claude via the Agent SDK / API (API-key auth, not claude.ai login).
+- **Local models (Sparks, via Spark Control gateway):**
+  - Chat/triage: Qwen3.6 35B-A3B on Spark 1.
+  - Embeddings: `BAAI/bge-m3` (dense, 1024-dim, L2-normalized) → `/v1/embeddings` (OpenAI shape).
+  - Reranker: `BAAI/bge-reranker-v2-m3` (cross-encoder) → `/v1/rerank` (Cohere shape).
+  - Served by **spark-embed**, a small FastAPI server on Spark 2 (NGC PyTorch image — *not* HF TEI, which ships no arm64 CUDA image). Shipped in Spark Control v0.15.0.
+  - Audio: transcription + diarization + TTS on Spark 2.
+- **Canonical data store:** the self-built CRM on the Start9 server. This is the single source of truth for LP/prospect data.
+- **Vector index:** Qdrant v1.16.0 on Spark 2 (ports 6333/6334). Derived and rebuildable from the CRM (~8–15 min full re-embed) — NOT a second source of truth. But it holds the only *live* copy of the index, so it is never auto-restarted; the ingest pipeline must be idempotent so a rebuild is always safe.
+- **Retrieval:** one orchestrated call, `POST /api/search` (embed query → Qdrant dense+sparse RRF with payload pre-filter → cross-encoder rerank → top_k). The sparse/BM25 leg is generated **client-side** with FastEmbed (`Qdrant/bm25`) at both ingest and query time, with Qdrant applying IDF over our own corpus — so exact entity/name matching is weighted by our term statistics, not bge-m3's pretrained sparse. Authoritative contract + ingest recipe: `@./docs/EMBEDDINGS.md`.
+- **Gateway:** Spark Control (on Start9) fronts all local model services behind one trusted URL with shared TLS, access control, and observability.
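To make the dense+sparse fusion step in the Retrieval bullet concrete, here is a minimal sketch of reciprocal rank fusion (RRF) in plain Python. In the real system Qdrant performs the fusion server-side; the function name `rrf_fuse`, the chunk IDs, and the constant `k=60` (a value commonly used for RRF) are illustrative assumptions and not part of the `/api/search` contract.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical chunk IDs: the dense leg and the sparse (BM25) leg disagree on order.
dense_hits = ["chunk-a", "chunk-b", "chunk-c"]
sparse_hits = ["chunk-c", "chunk-a", "chunk-d"]
fused = rrf_fuse([dense_hits, sparse_hits])
```

A chunk that appears near the top of both legs (`chunk-a`) outranks one that only a single leg returns, which is why exact name matches from the sparse leg can lift an entity hit that the dense leg ranks lower.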
+
+## Environment & services
+
+- All local model calls go through **Spark Control**, never directly to a Spark.
+- Endpoints: `/v1/chat/completions`, `/v1/embeddings`, `/v1/rerank`, `/api/search` (orchestrated hybrid retrieval), `/v1/audio/transcriptions`, `/v1/audio/speech`.
+- **Secrets live in `.env` (gitignored). Never commit secret values.** Required variables (names only):
+  - `ANTHROPIC_API_KEY`
+  - `SPARK_CONTROL_URL` — gateway for `/v1/embeddings`, `/v1/rerank`, `/api/search` (reads + dense embeds)
+  - `QDRANT_URL` — direct Qdrant on Spark 2 (`http://<spark2>:6333`) for collection admin + ingest upserts
+  - `X_API_KEY` — the X (Twitter) API key for Scout/Analyst enrichment. **Note:** this is *not* a CRM auth key; the CRM has no service-key/API-key path today (see below).
+  - CRM connection vars:
+    - `CRM_DB_PATH` — absolute path to the SQLite file (default `<CRM_DATA_DIR>/crm.db`). The CRM has **no network DB protocol** — ingest "connects" by opening this file directly (read-only, `mode=ro`), co-located with the Start9 `/data` volume.
+    - `CRM_DATA_DIR` — the `/data` volume root (holds `crm.db`, `backups/`, `secrets/`, `email_attachments/`).
+    - `CRM_BASE_URL` — `http://<host>:8080` (env `CRM_HOST`/`CRM_PORT`), for any HTTP access to the running CRM.
+    - `CRM_SECRET_KEY` — the CRM's own JWT signing secret (set on the Start9 deployment, persisted at `/data/.crm-secret`); only needed if the MCP server authenticates over HTTP rather than reading SQLite directly.
+- A `.env.example` lists the variable names with empty values.
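The read-only open described for `CRM_DB_PATH` can be sketched with the standard library alone. The helper name `open_crm_readonly` and the throwaway demo database are assumptions for illustration; the actual ingest code may open the file differently.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_crm_readonly(db_path: str) -> sqlite3.Connection:
    """Open the CRM SQLite file read-only via a file: URI with mode=ro."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row
    return conn


# Demo against a throwaway synthetic DB (never point examples at real LP data).
demo_path = Path(tempfile.mkdtemp()) / "crm_dev.db"
writer = sqlite3.connect(demo_path)
writer.execute("CREATE TABLE contacts (id TEXT PRIMARY KEY, name TEXT)")
writer.execute("INSERT INTO contacts VALUES ('c1', 'Synthetic Contact')")
writer.commit()
writer.close()

ro = open_crm_readonly(str(demo_path))
names = [row["name"] for row in ro.execute("SELECT name FROM contacts")]
try:
    ro.execute("INSERT INTO contacts VALUES ('c2', 'Should Fail')")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True  # SQLite refuses writes on a mode=ro connection
ro.close()
```

One caveat worth checking on the real deployment: a WAL-mode database opened with `mode=ro` generally still needs its `-wal`/`-shm` sidecar files to exist or be creatable, so directory permissions matter even for a reader.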
+
+## The agents (target roster — built in later phases)
+
+- **Scout** — monitors public sources (X via API, filings, etc.); flags trigger events. (Phase 2)
+- **Analyst** — builds LP dossiers, enriches records, maps warm-intro paths. (Phase 2)
+- **Architect** — owns/refines the canonical thesis; collaborative copilot. (Phase 1)
+- **Scribe** — distributes the thesis as content across channels. (Phase 1)
+- **Closer** — drafts outreach, nurture, meeting prep. Humans approve/send everything. (Phase 3)
+- **Orchestrator** — schedules and routes work; picks per-agent retrieval modes. (Phase 3)
+
+## Division of labor
+
+- **Spark developer (separate):** spark-embed serving (BGE-M3 + reranker) and Qdrant on Spark 2, exposed via Spark Control `/v1/embeddings` + `/v1/rerank`.
+- **This repo (Claude Code + the partners):** CRM schema extensions, ingest/sync pipeline, CRM MCP server, retrieval-mode library, and (later phases) the agents.
+
+## Guardrails — NON-NEGOTIABLE
+
+1. **Sovereignty.** Sensitive LP and relationship data stays on our infrastructure (Start9 + Sparks). Send only the minimum necessary, non-sensitive context to the Claude API. Never bulk-export the LP list to any third party.
+2. **CRM is canonical.** Qdrant and any other store are derived. Never treat a derived index as the source of truth; never let them silently diverge.
+3. **No destructive data ops.** Never hard-delete CRM records or history. Soft-delete/archive only. Migrations must be reversible and reviewed before running.
+4. **Human-in-the-loop on anything outbound.** No agent sends email, posts publicly, or contacts an LP/prospect autonomously. Agents draft; a partner approves and sends. (Especially Closer and Scribe.)
+5. **Log every agent action** to the interaction log, for compliance and debugging.
+6. **Compliance gate before Phase 3.** No cold/outbound capability ships until counsel has defined solicitation posture (e.g. 506(b) vs 506(c)), accreditation/QP verification, and recordkeeping rules.
+7. **Secrets never committed.** Use `.env` / a secrets store. No keys, tokens, or credentials in code, configs, or docs.
+8. **Enrichment is one-way and public.** Per-prospect public lookups that write INTO the CRM; never push our data outward.
+9. **Development data handling — keep real LP data out of Claude during the build.** Claude Code (the engineering partner) runs on the Anthropic API, so anything it reads is sent to a third party. Therefore Claude Code works only on **code, the schema, and synthetic or properly-redacted data** — never the real LP list, live records, or raw note/email prose. The real backfill and ingest **run on Ten31 infrastructure** (Start9 + Sparks) via **local models**; sensitive rows are never pasted into a Claude Code session or sent to the Claude API during development. To produce a realistic test corpus, redact/pseudonymize a copy **on the Sparks** (local) — do not hand-feed real records to Claude to "clean up." This is the same sovereignty boundary as guardrail #1, applied to the engineering workflow itself.
+
+## Conventions
+
+*Filled in from the CRM code (2026-06). Full detail: `@./docs/crm-overview.md`.*
+
+- **Language / runtime:** Python 3.11, standard library only at runtime. The CRM is one file, `backend/server.py` (~4.5k lines): a stdlib `http.server.ThreadingHTTPServer` + hand-written `CRMHandler` with manual path dispatch. **Not** FastAPI — `backend/requirements.txt` lists FastAPI/SQLAlchemy/Alembic/Pydantic but **none are imported** (vestigial). The only non-stdlib runtime deps are optional `bcrypt`/`jwt` and (for the Gmail module) `cryptography`.
+- **Storage:** a single SQLite DB (`data/crm.db`), WAL mode, `foreign_keys=ON`, opened per-request via `get_db()`. Two parallel investor models coexist (classic `contacts`/`lp_profiles` + the `fundraising_*` grid) — see `docs/crm-overview.md` §2.3; reconciling them to canonical IDs is the core Phase-0 entity-resolution task.
+- **Migrations:** **additive and reversible only.** Core schema uses ordered `backend/migrations/NNNN_*.sql` files applied once at startup by `backend/core_migrations.py`, tracked in a `schema_migrations` ledger; ship a paired `NNNN_*.down.sql` for rollback. (The Gmail module has its own runner under `backend/email_integration/migrations/`.) SQLite ALTER is add-column/rename only — which enforces the additive guardrail.
+- **Run locally:** `./start.sh` (dev defaults, port 8080). `./start_beta.sh` for a Tailscale/production-mode launch (requires `CRM_SECRET_KEY`). No build step.
+- **Tests / lint:** none in-repo. Sanity-check edits with `python3 -m py_compile backend/server.py`. Verify migrations against a *copy* of `crm.db`, never production.
+- **Production:** Start9 package `ten-database`. **`start9/0.4/` is the live target** (TypeScript SDK manifest under `start9/0.4/startos/`); `start9/0.3.5/` (YAML manifest) is the superseded prior generation. All state on the persistent `/data` volume.
+- **Auth:** username/password → HS256 JWT (Bearer header), two roles (`admin`/`member`), no row-level authorization. `X_API_KEY` (in this file's env list) is the *X/Twitter* key — there is **no CRM service-key path in code**; an MCP/ingest client must read SQLite directly or authenticate as a real CRM user.
+- Prefer clear, reviewable changes over cleverness. Keep the ingest pipeline and MCP server modular so retrieval modes and sources can be added without rewrites.
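The ledgered, apply-once behaviour described in the Migrations bullet can be sketched as follows. This is not the contents of `backend/core_migrations.py`; the function name `apply_migrations`, the ledger columns (`name`, `applied_at`), and the demo migration files are assumptions made for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def apply_migrations(conn: sqlite3.Connection, migrations_dir: Path) -> list[str]:
    """Apply each NNNN_*.sql exactly once, in filename order; return newly applied names."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "name TEXT PRIMARY KEY, applied_at TEXT DEFAULT (datetime('now')))"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    applied = []
    for path in sorted(migrations_dir.glob("[0-9][0-9][0-9][0-9]_*.sql")):
        if path.name.endswith(".down.sql") or path.name in done:
            continue  # rollback scripts are never applied on the way up
        conn.executescript(path.read_text())
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (path.name,))
        conn.commit()
        applied.append(path.name)
    return applied


# Demo with a hypothetical additive migration and its paired rollback script.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "0001_demo.sql").write_text("CREATE TABLE demo (id TEXT PRIMARY KEY);")
(demo_dir / "0001_demo.down.sql").write_text("DROP TABLE demo;")
conn = sqlite3.connect(":memory:")
first_run = apply_migrations(conn, demo_dir)
second_run = apply_migrations(conn, demo_dir)  # ledger makes the rerun a no-op
```

A production runner would also want the script and its ledger row to land atomically; `executescript` commits any open transaction before it runs, so this sketch does not guarantee that.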
+
+## First actions for a new session
+
+1. Read `@./docs/PHASE_0.md` and `@./docs/EMBEDDINGS.md` (the latter is the authoritative embedding/retrieval contract and ingest recipe).
+2. Read the CRM source in the repo; produce a short written summary of the storage engine, schema, and API surface, and fill in the Conventions section above and the CRM env vars.
+3. Confirm Spark Control is reachable and `/v1/embeddings`, `/v1/rerank`, and `/api/search` respond (these shipped in v0.15.0; check `GET /api/endpoints`).
+4. Proceed through the Phase 0 workstreams in order. Do not build any outward-facing agent behavior in Phase 0.
@@ -0,0 +1,826 @@
+# Gmail Integration — Technical Architecture
+
+**Status:** Design / Phase 1 scaffolding
+**Target:** Venture CRM (self-hosted, StartOS 0.4 beta)
+**Author:** Grant + Claude
+**Last updated:** 2026-04-21
+
+---
+
+## 1. Goals and non-goals
+
+### In scope for Phase 1
+1. Authorize Gmail access for all 5 team members.
+2. Continuous, incremental sync of sent and received mail.
+3. Parse sender, recipients, subject, body, timestamp, attachments.
+4. Store locally in the CRM SQLite DB.
+5. Match each email to an investor record by sender/recipient email.
+6. Thread related messages using Gmail's native `threadId` plus RFC 2822 `Message-ID` / `In-Reply-To` / `References`.
+7. Store attachments on local disk alongside email records.
+
+### Non-goals for Phase 1 (future phases)
+- Sending email from the CRM.
+- Full-text search (we'll store bodies; search UI comes later).
+- Real-time push via Pub/Sub watch (we'll poll every few hours).
+- Contact auto-creation from unknown senders.
+- Shared team inbox features.
+
+### Guiding constraints
+- **Do not break the running CRM.** All new code lives in an isolated module. Exactly two small additions are made to the existing `server.py`: a startup call and a route-dispatch hook (both behind a feature flag, both safe to roll back).
+- **Match the existing codebase's style.** Pure Python stdlib + sqlite3. The only new hard dependency is `cryptography` (for signing DWD JWTs and encrypting stored OAuth refresh tokens); the `google-auth` / `google-api-python-client` libs are optional (see §6b). Everything else is stdlib.
+- **Single-process, single-sqlite-writer discipline.** The sync worker writes through the same connection pattern as the rest of the CRM (WAL mode, short transactions) so we don't introduce lock contention.
+
+---
+
+## 2. Decision summary
+
+| # | Decision | Choice | Rationale |
+|---|----------|--------|-----------|
+| D1 | Auth method | Domain-wide delegation (DWD) primary, per-user OAuth2 as pluggable fallback | You're Workspace super-admin; one-time setup, no per-user consent flow, centralized revocation. OAuth path preserved behind a common `CredentialProvider` interface so the CRM isn't locked in. |
+| D2 | Schema | New dedicated `emails` / `email_*` tables (rather than reusing `communications`) | Gmail-specific fields (`gmail_message_id`, `thread_id`, `history_id`, labels, RFC headers) don't belong in the generic `communications` table. `communications` can later project email rows via a read-only view if desired. |
+| D3 | Sync scope | Entire mailbox, match-only full storage | Fetch cheap metadata for every message. Persist the full body + attachments only for messages that match a known investor email. Minimizes disk use and privacy surface. |
+| D4 | Incremental strategy | Gmail `history.list` driven by per-account `historyId` checkpoint | Canonical Gmail-supported mechanism for "what changed since last sync" — cheap, reliable, dedupe-free. `messages.list` only on initial backfill. |
+| D5 | Delivery | Architecture doc + scaffolded modules in `backend/email_integration/`, no edits to `server.py` yet | Lets you review the shape before any wiring happens. Activation is a 3-line patch to `server.py` when you're ready. |
+
+---
+
+## 3. Google Cloud / Workspace setup (one-time, ~20 minutes)
+
+This is the runbook you'll follow once the code is ready. It's in the doc so the setup steps and the code that relies on them are never out of sync.
+
+### 3a. Create a GCP project
+1. `console.cloud.google.com` → new project, name it `ten31-crm-gmail` (or similar). Note the **Project ID**.
+2. Enable the Gmail API: APIs & Services → Library → "Gmail API" → Enable.
+
+### 3b. Create a service account (for DWD)
+3. IAM & Admin → Service Accounts → Create service account.
+   - Name: `crm-gmail-sync`
+   - Role: leave blank (it doesn't need any GCP IAM roles; Gmail scope comes from DWD).
+4. On the service account's detail page → Keys → Add Key → Create new key → JSON. Download the file.
+5. Advanced settings → enable **domain-wide delegation**. Copy the **numeric client ID** (a ~21-digit number).
+
+### 3c. Authorize the service account in Google Workspace
+6. `admin.google.com` → Security → Access and data control → API controls → Manage domain-wide delegation → Add new.
+7. Client ID: paste the numeric client ID from step 5.
+8. OAuth scopes (comma-separated, read-only — no send):
+   ```
+   https://www.googleapis.com/auth/gmail.readonly,
+   https://www.googleapis.com/auth/gmail.metadata
+   ```
+   (Phase 1 uses `gmail.readonly` only. `gmail.metadata` is listed for a possible future "metadata-only" sync mode.)
+9. Authorize.
+
+### 3d. Install the service account key on the CRM server
+10. Copy the JSON key file to `<CRM_ROOT>/data/secrets/gmail-service-account.json`.
+11. `chmod 600` the file. `chown` to the CRM process owner.
+12. Set env var in `.env.beta`:
+    ```
+    CRM_GMAIL_SA_KEY_PATH=/path/to/CRM/data/secrets/gmail-service-account.json
+    CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz
+    CRM_GMAIL_INTEGRATION_ENABLED=true
+    CRM_GMAIL_SYNC_INTERVAL_MIN=180   # 3 hours
+    CRM_GMAIL_SECRET_KEY=<32+ random bytes base64> # for encrypting stored OAuth refresh tokens (unused in DWD mode but required if fallback ever activates)
+    ```
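One way to produce a value for `CRM_GMAIL_SECRET_KEY` is sketched below. It draws 32 bytes from the OS CSPRNG and base64-encodes them; whether `crypto.py` expects standard or URL-safe base64 is an assumption here, so match the encoding to whatever the loader actually decodes.

```python
import base64
import secrets

# 32 random bytes from the OS CSPRNG, base64-encoded for the env file.
key_bytes = secrets.token_bytes(32)
env_value = base64.b64encode(key_bytes).decode("ascii")

# Paste the printed line into .env.beta; do not commit it or log it.
print(f"CRM_GMAIL_SECRET_KEY={env_value}")
```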
+
+### 3e. Configure per-user mailboxes
+13. In the CRM, Admin → Email Accounts → "Enroll all Workspace users via DWD". This calls `POST /api/email/accounts/enroll-all`, which creates an `email_accounts` row for each existing CRM user with an `@ten31.xyz` email, bound to that address.
+14. Each teammate sees a "Gmail sync: active" chip on their profile.
+
+---
+
+## 4. Database schema
+
+All new tables share the `email` prefix (`emails`, `email_*`) so they're grouped in tooling. Migration script: `backend/email_integration/migrations/0001_email_tables.sql`. It is **idempotent** (uses `CREATE TABLE IF NOT EXISTS` + targeted `ALTER TABLE` guards) so re-running is safe.
+
+### 4a. `email_accounts` — one row per enrolled mailbox
+
+```sql
+CREATE TABLE IF NOT EXISTS email_accounts (
+    id                TEXT PRIMARY KEY,              -- uuid
+    user_id           TEXT NOT NULL,                 -- FK users.id
+    email_address     TEXT NOT NULL UNIQUE,          -- e.g. grant@ten31.xyz
+    auth_method       TEXT NOT NULL,                 -- 'dwd' | 'oauth'
+    oauth_refresh_enc BLOB,                          -- NULL for DWD; AES-GCM ciphertext for oauth
+    oauth_token_exp   TEXT,                          -- ISO8601, cached access token expiry
+    oauth_token_enc   BLOB,                          -- short-lived access token cache (encrypted)
+    sync_enabled      INTEGER NOT NULL DEFAULT 1,
+    sync_status       TEXT NOT NULL DEFAULT 'pending', -- pending|active|paused|error
+    sync_error        TEXT,                          -- last error message, if any
+    last_history_id   TEXT,                          -- Gmail historyId checkpoint
+    last_synced_at    TEXT,
+    backfill_complete INTEGER NOT NULL DEFAULT 0,    -- 0 = initial backfill still running
+    backfill_cursor   TEXT,                          -- nextPageToken during backfill
+    created_at        TEXT DEFAULT (datetime('now')),
+    updated_at        TEXT DEFAULT (datetime('now')),
+    FOREIGN KEY(user_id) REFERENCES users(id)
+);
+CREATE INDEX IF NOT EXISTS idx_email_accounts_user ON email_accounts(user_id);
+CREATE INDEX IF NOT EXISTS idx_email_accounts_sync ON email_accounts(sync_enabled, sync_status);
+```
+
+### 4b. `emails` — canonical email record (dedup'd across accounts)
+
+One row per distinct RFC `Message-ID`. If two teammates are both on a thread, we store one `emails` row and link both via `email_account_messages`.
+
+```sql
+CREATE TABLE IF NOT EXISTS emails (
+    id                 TEXT PRIMARY KEY,              -- uuid
+    rfc_message_id     TEXT NOT NULL UNIQUE,          -- RFC 2822 Message-ID, dedup key
+    gmail_thread_id    TEXT,                          -- primary Gmail thread id (first account that saw it)
+    rfc_thread_root_id TEXT,                          -- root of References chain (for cross-account threading)
+    subject            TEXT,
+    from_email         TEXT NOT NULL,
+    from_name          TEXT,
+    to_emails_json     TEXT NOT NULL DEFAULT '[]',    -- JSON array
+    cc_emails_json     TEXT NOT NULL DEFAULT '[]',
+    bcc_emails_json    TEXT NOT NULL DEFAULT '[]',
+    reply_to           TEXT,
+    sent_at            TEXT NOT NULL,                 -- from Date: header, ISO8601
+    body_text          TEXT,                          -- plain/text part
+    body_html          TEXT,                          -- text/html part
+    snippet            TEXT,                          -- Gmail-provided, useful for unmatched rows
+    in_reply_to        TEXT,                          -- RFC Message-ID of parent
+    references_json    TEXT DEFAULT '[]',             -- parsed References: header
+    has_attachments    INTEGER NOT NULL DEFAULT 0,
+    size_estimate      INTEGER,                       -- bytes reported by Gmail
+    is_matched         INTEGER NOT NULL DEFAULT 0,    -- 1 if any investor link exists
+    match_status       TEXT NOT NULL DEFAULT 'unmatched', -- unmatched|matched|skipped
+    raw_headers_json   TEXT,                          -- full header dump for debugging/forensics
+    created_at         TEXT DEFAULT (datetime('now')),
+    updated_at         TEXT DEFAULT (datetime('now'))
+);
+CREATE INDEX IF NOT EXISTS idx_emails_thread        ON emails(gmail_thread_id);
+CREATE INDEX IF NOT EXISTS idx_emails_rfc_thread    ON emails(rfc_thread_root_id);
+CREATE INDEX IF NOT EXISTS idx_emails_from          ON emails(from_email);
+CREATE INDEX IF NOT EXISTS idx_emails_sent_at       ON emails(sent_at);
+CREATE INDEX IF NOT EXISTS idx_emails_matched       ON emails(is_matched, sent_at);
+```
+
+> **Match-only storage note:** For unmatched emails we still write the `emails` row, but `body_text` / `body_html` / `raw_headers_json` are left NULL; only the parsed header columns (sender, recipients, subject, date) and the Gmail `snippet` are persisted. A nightly job can prune unmatched rows older than 90 days if storage becomes a concern. See §12.
+
+### 4c. `email_recipients` — exploded recipients for fast investor matching
+
+Denormalized so we can index and JOIN directly.
+
+```sql
+CREATE TABLE IF NOT EXISTS email_recipients (
+    id           TEXT PRIMARY KEY,
+    email_id     TEXT NOT NULL,
+    address      TEXT NOT NULL,                -- lowercased, trimmed
+    display_name TEXT,
+    kind         TEXT NOT NULL,                -- 'from'|'to'|'cc'|'bcc'|'reply_to'
+    FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
+);
+CREATE INDEX IF NOT EXISTS idx_email_recipients_addr ON email_recipients(address);
+CREATE INDEX IF NOT EXISTS idx_email_recipients_email ON email_recipients(email_id);
+```
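As an illustration of how raw address headers could be exploded into `email_recipients`-shaped rows, the sketch below uses the stdlib `email.utils.getaddresses`. The function name `explode_recipients` and the sample addresses are hypothetical; the real `parser.py` may work from the Gmail payload structure instead.

```python
from email.utils import getaddresses


def explode_recipients(headers: dict[str, str]) -> list[dict[str, str]]:
    """Turn raw address headers into rows with address, display_name, and kind."""
    kinds = {"From": "from", "To": "to", "Cc": "cc", "Bcc": "bcc", "Reply-To": "reply_to"}
    rows = []
    for header, kind in kinds.items():
        raw = headers.get(header)
        if not raw:
            continue
        for display_name, address in getaddresses([raw]):
            address = address.strip().lower()  # matches the "lowercased, trimmed" column rule
            if address:
                rows.append({"address": address, "display_name": display_name, "kind": kind})
    return rows


# Synthetic headers only.
rows = explode_recipients({
    "From": "Grant <Grant@Ten31.xyz>",
    "To": "Jane Doe <JANE@example.com>, bob@example.org",
})
```

Normalizing at write time means the investor matcher can join on `address` with a plain equality lookup against the index.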
+
+### 4d. `email_account_messages` — per-account sighting of an email
+
+The same `emails` row may be visible to multiple teammates under different Gmail message IDs (each user's mailbox has its own copy). This join table records each sighting.
+
+```sql
+CREATE TABLE IF NOT EXISTS email_account_messages (
+    id                TEXT PRIMARY KEY,
+    email_id          TEXT NOT NULL,
+    account_id        TEXT NOT NULL,
+    gmail_message_id  TEXT NOT NULL,                  -- per-mailbox unique
+    gmail_thread_id   TEXT NOT NULL,
+    labels_json       TEXT DEFAULT '[]',              -- Gmail labels from this account's view
+    is_sent           INTEGER NOT NULL DEFAULT 0,     -- 1 if SENT label present
+    first_seen_at     TEXT DEFAULT (datetime('now')),
+    FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE,
+    FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE,
+    UNIQUE(account_id, gmail_message_id)
+);
+CREATE INDEX IF NOT EXISTS idx_eam_email      ON email_account_messages(email_id);
+CREATE INDEX IF NOT EXISTS idx_eam_account    ON email_account_messages(account_id);
+CREATE INDEX IF NOT EXISTS idx_eam_gmail_msg  ON email_account_messages(gmail_message_id);
+```
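The dedup-then-sighting pattern from §4b and §4d can be sketched against trimmed-down copies of the two tables. The helper `record_sighting`, the reduced column set, and the sample IDs are assumptions for illustration; the real `db.py` would populate the full rows.

```python
import sqlite3
import uuid

# Trimmed-down versions of the two tables: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE emails (id TEXT PRIMARY KEY, rfc_message_id TEXT NOT NULL UNIQUE);
CREATE TABLE email_account_messages (
    id TEXT PRIMARY KEY,
    email_id TEXT NOT NULL REFERENCES emails(id) ON DELETE CASCADE,
    account_id TEXT NOT NULL,
    gmail_message_id TEXT NOT NULL,
    gmail_thread_id TEXT NOT NULL,
    UNIQUE(account_id, gmail_message_id)
);
"""


def record_sighting(conn, rfc_message_id, account_id, gmail_message_id, gmail_thread_id):
    """Upsert the canonical emails row by Message-ID, then record this mailbox's copy."""
    conn.execute(
        "INSERT OR IGNORE INTO emails (id, rfc_message_id) VALUES (?, ?)",
        (str(uuid.uuid4()), rfc_message_id),
    )
    email_id = conn.execute(
        "SELECT id FROM emails WHERE rfc_message_id = ?", (rfc_message_id,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO email_account_messages "
        "(id, email_id, account_id, gmail_message_id, gmail_thread_id) VALUES (?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), email_id, account_id, gmail_message_id, gmail_thread_id),
    )
    return email_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = record_sighting(conn, "<m1@example.com>", "acct-a", "gm-111", "th-1")
second = record_sighting(conn, "<m1@example.com>", "acct-b", "gm-999", "th-7")
record_sighting(conn, "<m1@example.com>", "acct-a", "gm-111", "th-1")  # replay is a no-op
email_rows = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
sighting_rows = conn.execute("SELECT COUNT(*) FROM email_account_messages").fetchone()[0]
```

Because both inserts are `INSERT OR IGNORE` against a unique key, replaying a sync page leaves the tables unchanged.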
+
+### 4e. `email_attachments` — attachments stored on disk
+
+Attachment bytes live at `data/email_attachments/<email_id>/<attachment_id>-<sanitized_filename>`. DB stores metadata + content hash.
+
+```sql
+CREATE TABLE IF NOT EXISTS email_attachments (
+    id                    TEXT PRIMARY KEY,
+    email_id              TEXT NOT NULL,
+    gmail_attachment_id   TEXT NOT NULL,         -- opaque Gmail handle
+    filename              TEXT NOT NULL,
+    sanitized_filename    TEXT NOT NULL,         -- what's actually on disk
+    mime_type             TEXT,
+    size_bytes            INTEGER,
+    sha256_hex            TEXT,                  -- integrity + dedup across emails
+    storage_path          TEXT NOT NULL,         -- relative to CRM_DATA_DIR
+    download_status       TEXT NOT NULL DEFAULT 'pending', -- pending|downloaded|failed|skipped
+    download_attempts     INTEGER NOT NULL DEFAULT 0,
+    download_error        TEXT,
+    downloaded_at         TEXT,
+    created_at            TEXT DEFAULT (datetime('now')),
+    FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
+);
+CREATE INDEX IF NOT EXISTS idx_attach_email ON email_attachments(email_id);
+CREATE INDEX IF NOT EXISTS idx_attach_sha   ON email_attachments(sha256_hex);
+```
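A minimal sketch of the on-disk layout and content hash described in §4e follows. The sanitization rule (keep the last path component, allow only `A-Za-z0-9._-`) is an assumption; the document does not specify how `sanitized_filename` is derived, and the helper names are hypothetical.

```python
import hashlib
import re
from pathlib import PurePosixPath


def sanitize_filename(name: str, max_len: int = 120) -> str:
    """Keep only the final path component, restricted to a conservative character set."""
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", base).strip("._")
    return cleaned[:max_len] or "attachment"


def attachment_storage_path(email_id: str, attachment_id: str, filename: str) -> str:
    """Path relative to CRM_DATA_DIR: email_attachments/<email_id>/<attachment_id>-<name>."""
    safe = sanitize_filename(filename)
    return str(PurePosixPath("email_attachments") / email_id / f"{attachment_id}-{safe}")


# Synthetic payload; sha256_hex is what would be stored for integrity and dedup.
payload = b"synthetic attachment bytes"
sha256_hex = hashlib.sha256(payload).hexdigest()
path = attachment_storage_path("e1", "a1", "../../etc/Q3 report (final).pdf")
```

Dropping everything before the last separator is what prevents a hostile filename from escaping the per-email directory.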
+
+### 4f. `email_threads` — thread roll-up (materialized for UI)
+
+```sql
+CREATE TABLE IF NOT EXISTS email_threads (
+    id                 TEXT PRIMARY KEY,          -- uuid
+    gmail_thread_id    TEXT,                      -- NULL if we only have RFC threading
+    rfc_thread_root_id TEXT,                      -- fallback / cross-account key
+    subject_normalized TEXT,                      -- lowercased + stripped of Re:/Fwd: prefixes
+    first_message_at   TEXT,
+    last_message_at    TEXT,
+    message_count      INTEGER NOT NULL DEFAULT 0,
+    participant_count  INTEGER NOT NULL DEFAULT 0,
+    participants_json  TEXT DEFAULT '[]',
+    is_matched         INTEGER NOT NULL DEFAULT 0,
+    created_at         TEXT DEFAULT (datetime('now')),
+    updated_at         TEXT DEFAULT (datetime('now'))
+);
+CREATE UNIQUE INDEX IF NOT EXISTS idx_threads_gmail ON email_threads(gmail_thread_id) WHERE gmail_thread_id IS NOT NULL;
+CREATE INDEX IF NOT EXISTS idx_threads_rfc_root ON email_threads(rfc_thread_root_id);
+CREATE INDEX IF NOT EXISTS idx_threads_last_msg ON email_threads(last_message_at);
+
+-- Back-link emails → threads
+ALTER TABLE emails ADD COLUMN thread_id TEXT;  -- wrapped in PRAGMA-based guard at migration time
+CREATE INDEX IF NOT EXISTS idx_emails_thread_fk ON emails(thread_id);
+```
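SQLite has no `ADD COLUMN IF NOT EXISTS`, so the guard mentioned in the comment above has to be done by hand. The sketch below shows one common way to do it with `PRAGMA table_info`; the helper name `add_column_if_missing` is an assumption, and the table and column names are interpolated directly, so they must be trusted constants from the migration itself.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Run ALTER TABLE ... ADD COLUMN only if the column is absent; return True if added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id TEXT PRIMARY KEY)")
first = add_column_if_missing(conn, "emails", "thread_id", "TEXT")
second = add_column_if_missing(conn, "emails", "thread_id", "TEXT")  # safe to re-run
columns = [row[1] for row in conn.execute("PRAGMA table_info(emails)")]
```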
+
+### 4g. `email_investor_links` — matched investors
+
+```sql
+CREATE TABLE IF NOT EXISTS email_investor_links (
+    id                      TEXT PRIMARY KEY,
+    email_id                TEXT NOT NULL,
+    -- Match target: exactly one of the following is non-null
+    fundraising_investor_id TEXT,                  -- fundraising_investors.id
+    fundraising_contact_id  TEXT,                  -- fundraising_contacts.id
+    contact_id              TEXT,                  -- contacts.id
+    organization_id         TEXT,                  -- organizations.id (if matched by domain only)
+    matched_address         TEXT NOT NULL,         -- which email address produced the match
+    match_kind              TEXT NOT NULL,         -- 'exact_email'|'domain_match'|'manual'
+    match_confidence        REAL NOT NULL DEFAULT 1.0, -- 0..1
+    created_at              TEXT DEFAULT (datetime('now')),
+    FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
+);
+CREATE INDEX IF NOT EXISTS idx_eil_email        ON email_investor_links(email_id);
+CREATE INDEX IF NOT EXISTS idx_eil_investor     ON email_investor_links(fundraising_investor_id);
+CREATE INDEX IF NOT EXISTS idx_eil_fr_contact   ON email_investor_links(fundraising_contact_id);
+CREATE INDEX IF NOT EXISTS idx_eil_contact      ON email_investor_links(contact_id);
+```
+
+### 4h. `email_sync_runs` — observability
+
+```sql
+CREATE TABLE IF NOT EXISTS email_sync_runs (
+    id                TEXT PRIMARY KEY,
+    account_id        TEXT NOT NULL,
+    kind              TEXT NOT NULL,              -- 'backfill'|'incremental'
+    started_at        TEXT NOT NULL,
+    finished_at       TEXT,
+    status            TEXT NOT NULL,              -- 'running'|'ok'|'error'|'partial'
+    messages_seen     INTEGER NOT NULL DEFAULT 0,
+    messages_stored   INTEGER NOT NULL DEFAULT 0,
+    attachments_saved INTEGER NOT NULL DEFAULT 0,
+    api_calls         INTEGER NOT NULL DEFAULT 0,
+    retries           INTEGER NOT NULL DEFAULT 0,
+    error             TEXT,
+    FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE
+);
+CREATE INDEX IF NOT EXISTS idx_sync_runs_account ON email_sync_runs(account_id, started_at);
+```
+
+---
+
+## 5. Module layout
+
+```
+backend/
+├── server.py                           # unchanged in Phase 1 (hooks added at end of Phase 1)
+└── email_integration/
+    ├── __init__.py                     # re-exports public API
+    ├── README.md                       # setup instructions (copy of §3 in code-local form)
+    ├── config.py                       # env-var loader + singleton Config()
+    ├── credentials.py                  # CredentialProvider interface + DWD/OAuth impls
+    ├── db.py                           # thin data-access layer; all SQL lives here
+    ├── gmail_client.py                 # Gmail API wrapper w/ retry + rate-limit
+    ├── parser.py                       # MIME/payload parser
+    ├── matcher.py                      # investor matching logic
+    ├── threads.py                      # thread roll-up logic
+    ├── attachments.py                  # attachment download + disk storage
+    ├── sync.py                         # orchestrator (backfill + incremental)
+    ├── scheduler.py                    # background thread running periodic sync
+    ├── routes.py                       # HTTP handlers (plug into CRMHandler)
+    ├── errors.py                       # exception types + classification
+    ├── crypto.py                       # AES-GCM wrapper for encrypting OAuth tokens
+    └── migrations/
+        └── 0001_email_tables.sql
+```
+
+---
+
+## 6. Auth flow
+
+### 6a. Credential abstraction
+
+```python
+class CredentialProvider(Protocol):
+    def access_token_for(self, email_address: str) -> AccessToken: ...
+    # AccessToken = namedtuple("AccessToken", ["token", "expires_at"])
+
+class DWDCredentialProvider:
+    """Service-account impersonation. No per-user state."""
+
+class OAuthCredentialProvider:
+    """Per-user refresh tokens stored encrypted in email_accounts.oauth_refresh_enc."""
+```
+
+`sync.py` and `gmail_client.py` only depend on `CredentialProvider`. Switching modes is a config flag.
+
+### 6b. DWD flow (primary)
+
+1. On startup: load service-account JSON from `CRM_GMAIL_SA_KEY_PATH`.
+2. For each sync request, build a signed JWT claim set:
+   ```
+   iss: <service account email>
+   sub: <target user email>          ← impersonation
+   scope: https://www.googleapis.com/auth/gmail.readonly
+   aud: https://oauth2.googleapis.com/token
+   exp: now + 3600
+   iat: now
+   ```
+3. Sign RS256 with the service account's private key.
+4. POST to `https://oauth2.googleapis.com/token` with `grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer&assertion=<jwt>`.
+5. Cache the returned access token (typ. 1h lifetime) in `email_accounts.oauth_token_enc` + `oauth_token_exp`. If expired/missing, re-mint.
+
+We implement this ourselves with `cryptography` rather than pulling in `google-auth` as a hard dependency, which keeps the vendor footprint small. If you want to use `google-auth` instead, replace `_mint_dwd_access_token()` with `google.oauth2.service_account.Credentials.from_service_account_file(...).with_subject(...)` followed by a `refresh()` call; it is a small, localized change.
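Steps 2 to 4 can be sketched with the standard library alone. This is an illustration, not the module's actual `_mint_dwd_access_token()`: the function names are hypothetical, and the RS256 signature (step 3) is delegated to an injected `sign` callable that an RSA library such as `cryptography` would have to provide.

```python
import base64
import json
from typing import Callable
from urllib.parse import urlencode

TOKEN_URL = "https://oauth2.googleapis.com/token"
GMAIL_READONLY = "https://www.googleapis.com/auth/gmail.readonly"


def b64url(raw: bytes) -> str:
    """Base64url without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def build_claims(sa_email: str, user_email: str, now: int) -> dict:
    """Claim set from step 2; `sub` is the impersonated mailbox."""
    return {
        "iss": sa_email,
        "sub": user_email,
        "scope": GMAIL_READONLY,
        "aud": TOKEN_URL,
        "iat": now,
        "exp": now + 3600,
    }


def build_assertion(claims: dict, sign: Callable[[bytes], bytes]) -> str:
    """Assemble header.payload.signature; `sign` is a hypothetical RS256 hook."""
    header = {"alg": "RS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = sign(signing_input.encode("ascii"))
    return f"{signing_input}.{b64url(signature)}"


def token_request_body(assertion: str) -> str:
    """Form-encoded body for the POST in step 4."""
    return urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
        "assertion": assertion,
    })
```

Keeping the signer injectable means the claim-building logic can be unit-tested without a real service-account key.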
+
+### 6c. OAuth fallback flow
+
+Endpoints scaffolded but inactive by default:
+
+- `GET  /api/email/oauth/start?account_email=<email>` → redirects to Google consent screen with `state` CSRF token, `access_type=offline`, `prompt=consent`.
+- `GET  /api/email/oauth/callback?code=...&state=...` → exchanges code for tokens, encrypts refresh token with AES-GCM using `CRM_GMAIL_SECRET_KEY`, stores in `email_accounts.oauth_refresh_enc`.
+- Access tokens are refreshed on-demand just before any Gmail call if `oauth_token_exp` is within 60s of expiring.
+
+Refresh tokens **never** leave the DB in plaintext; decryption happens in memory for the duration of a single API call and is then discarded.
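The 60-second refresh window can be expressed as a small predicate. This is a sketch only; `needs_refresh` and `REFRESH_SKEW` are hypothetical names, and it assumes `oauth_token_exp` has already been parsed into a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REFRESH_SKEW = timedelta(seconds=60)


def needs_refresh(expires_at: Optional[datetime],
                  now: Optional[datetime] = None) -> bool:
    """True when no token is cached or it expires within the skew window."""
    if expires_at is None:
        return True
    if now is None:
        now = datetime.now(timezone.utc)
    return expires_at - now <= REFRESH_SKEW
```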
+
+### 6d. Token storage security
+
+- At rest: `oauth_refresh_enc` is AES-256-GCM with a random per-record nonce. Key material is `CRM_GMAIL_SECRET_KEY` (separate from the CRM's JWT `CRM_SECRET_KEY` so you can rotate them independently).
+- At rest (DWD): no refresh tokens to store. The cached access token (§6b) expires after about 1h, so the only long-lived secret is the service account JSON, which lives at `0600` on the host.
+- In transit: all Gmail API calls use HTTPS. Our HTTP client requires TLS 1.2 or newer.
+- Revocation:
+  - DWD: remove the client ID from the Workspace Admin console → no new tokens can be minted for any of the 5 accounts; access tokens already issued stay valid until they expire (up to 1h).
+  - OAuth: per-user `DELETE /api/email/accounts/:id/revoke` calls `https://oauth2.googleapis.com/revoke` then zeroes the DB row.
+
+---
+
+## 7. Sync algorithm
+
+### 7a. Initial backfill (runs once per account)
+
+When a new `email_accounts` row is created:
+
+```
+while backfill_complete == 0:
+    resp = GET /gmail/v1/users/me/messages
+           ?pageToken=<backfill_cursor>&maxResults=500&q=
+    for msg_id in resp.messages:
+        fetch metadata (format=metadata, headers=[From,To,Cc,Subject,Date,
+                                                  Message-ID,In-Reply-To,References])
+        decide match via matcher.is_match(addresses)
+        if matched:
+            fetch full (format=full)
+            parser.extract(...)
+            db.upsert_email(...), db.upsert_links(...)
+            enqueue attachment downloads
+        else:
+            db.upsert_email(headers_only=True)
+        db.upsert_account_sighting(account_id, msg_id, thread_id, labels)
+    backfill_cursor = resp.nextPageToken
+    if not backfill_cursor:
+        backfill_complete = 1
+        last_history_id = GET /users/me/profile → historyId
+```
+
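+Batch the metadata fetches using Gmail's batch endpoint (hard cap of 100 requests per batch) to cut HTTP round-trips. Batching does not lower quota cost, since each inner request is billed individually, and Google recommends keeping batches to 50 requests or fewer to avoid tripping rate limits.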
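Splitting one page of message IDs into batch-sized groups might look like the following. `chunked` is a hypothetical helper, and the batch size is left as a parameter so it can be tuned below the 100-request cap.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` message IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])
```

A 500-message page from `messages.list` then becomes five batch requests at size 100, or ten at size 50.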
+
+### 7b. Incremental sync (runs every `CRM_GMAIL_SYNC_INTERVAL_MIN` minutes)
+
+```
+for account where sync_enabled AND backfill_complete:
+    try:
+        resp = GET /users/me/history?startHistoryId=<last_history_id>
+        for record in resp.history:
+            for msg_added in record.messagesAdded:
+                process_message(msg_added.message.id)   # same as backfill
+            for label_change in record.labelsAdded/labelsRemoved:
+                update labels on email_account_messages (cheap)
+            for msg_deleted in record.messagesDeleted:
+                # Gmail delete ≠ CRM delete. We tombstone the sighting, keep email.
+                mark email_account_messages.deleted_at
+        last_history_id = resp.historyId
+    except HistoryExpiredError (HTTP 404):
+        # Gmail typically keeps history records for about a week (occasionally
+        # less). If we fall behind, trigger a partial re-backfill using
+        # q=after:<last_synced_at>.
+        reset_to_date_backfill(account)
+```
+
+**Idempotency guarantee:** `email_account_messages (account_id, gmail_message_id)` is unique. `emails.rfc_message_id` is unique. A retry can't double-insert. The `INSERT OR IGNORE` patterns in `db.py` make this explicit.
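The guarantee can be demonstrated against an in-memory SQLite database. The two tables below are trimmed-down stand-ins for the real schema, and `record_sighting` plus the `em_` ID scheme are hypothetical; only the two unique constraints and the `INSERT OR IGNORE` pattern come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emails (
    id             TEXT PRIMARY KEY,
    rfc_message_id TEXT NOT NULL UNIQUE
);
CREATE TABLE email_account_messages (
    account_id       TEXT NOT NULL,
    gmail_message_id TEXT NOT NULL,
    email_id         TEXT NOT NULL REFERENCES emails(id),
    UNIQUE (account_id, gmail_message_id)
);
""")


def record_sighting(conn, account_id, gmail_message_id, rfc_message_id):
    """Insert the email once, then the per-mailbox sighting; safe to retry."""
    conn.execute(
        "INSERT OR IGNORE INTO emails (id, rfc_message_id) VALUES (?, ?)",
        ("em_" + gmail_message_id, rfc_message_id),
    )
    # Look up whichever row won, so later sightings link to the same email.
    (email_id,) = conn.execute(
        "SELECT id FROM emails WHERE rfc_message_id = ?", (rfc_message_id,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO email_account_messages "
        "(account_id, gmail_message_id, email_id) VALUES (?, ?, ?)",
        (account_id, gmail_message_id, email_id),
    )
    conn.commit()
    return email_id
```

Calling `record_sighting` twice for the same mailbox, or once more from a second mailbox, leaves a single `emails` row and one sighting per mailbox.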
+
+### 7c. Scheduler
+
+A single daemon thread started from `server.py` startup:
+
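+```python
+# scheduler.py
+import logging
+import threading
+
+log = logging.getLogger(__name__)
+_stop = threading.Event()   # set() this to shut the loop down
+
+def run_forever():
+    while not _stop.is_set():
+        try:
+            run_all_accounts()
+        except Exception:
+            log.exception("sync loop failed")   # keep the loop alive
+        # wait() is an interruptible sleep: it returns early once _stop is set
+        _stop.wait(CONFIG.sync_interval_sec)
+```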
+
+Within `run_all_accounts()`, accounts are processed **sequentially** (not in parallel) because:
+- Our sqlite setup has one writer.
+- It keeps the per-user-quota-per-second math simple (250 units/user/sec per account, no cross-account contention).
+- Serial processing takes about 30s/account × 5 = 2.5 min per cycle, well within the 3-hour interval.
+
+Manual trigger: `POST /api/email/sync/run-now` (admin-only) skips the sleep and runs one cycle immediately.
+
+---
+
+## 8. Parsing
+
+Gmail returns messages as a recursive tree of MIME parts. `parser.py` walks the tree and produces a `ParsedEmail` dataclass.
+
+Headers we extract:
+- `From` (single address, parsed with `email.utils.getaddresses`)
+- `To`, `Cc`, `Bcc`, `Reply-To` (address lists)
+- `Subject`
+- `Date` (parsed with `email.utils.parsedate_to_datetime`, stored UTC ISO8601)
+- `Message-ID`
+- `In-Reply-To`
+- `References` (space-separated → JSON array)
+
+Body extraction rules:
+1. Prefer `text/plain` part; fall back to HTML-stripped text from `text/html` using stdlib `html.parser.HTMLParser`.
+2. Keep `text/html` alongside `text/plain` when both exist (both columns stored).
+3. Ignore inline images for body purposes; they get stored as attachments with `content_disposition='inline'`.
+4. Decode base64url bodies correctly (Gmail uses `-_` alphabet with no padding).
+5. Normalize line endings to `\n`.
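Rules 4 and 5 can be sketched as follows. The helper names are hypothetical; the padding step is there because `base64.urlsafe_b64decode` rejects input whose length is not a multiple of four.

```python
import base64


def decode_gmail_body(data: str) -> bytes:
    """Decode a base64url string, restoring any '=' padding that is missing."""
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded)


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Input that already carries padding passes through unchanged, because the computed pad length is then zero.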
+
+Attachments are identified as any MIME part with a non-empty `filename` or `Content-Disposition: attachment`. Large attachment bodies are **not** inlined in the message `get` response: Gmail returns an opaque `attachmentId` instead, and we fetch the content separately via `messages.attachments.get`.
+
+Edge cases explicitly handled:
+- Malformed `Date:` headers (fall back to `internalDate` from Gmail response, always present).
+- Missing `Message-ID` header (generate synthetic: `synthetic-<gmail_message_id>@ten31.local`).
+- Duplicate `Message-ID` across mailboxes (expected — primary dedup path).
+- RFC 2047 encoded subjects (`=?UTF-8?B?...?=`) — `email.header.decode_header`.
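Three of these edge cases fit in a few stdlib calls. The function names are hypothetical, and the sketch assumes `internalDate` arrives as epoch milliseconds, as in Gmail message resources.

```python
from datetime import datetime, timezone
from email.header import decode_header, make_header
from email.utils import parsedate_to_datetime


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded words into a plain str."""
    return str(make_header(decode_header(raw)))


def sent_at_utc(date_header, internal_date_ms) -> str:
    """Parse Date:, falling back to Gmail's internalDate; returns UTC ISO 8601."""
    try:
        parsed = parsedate_to_datetime(date_header)
        if parsed.tzinfo is None:          # "-0000" yields a naive datetime
            parsed = parsed.replace(tzinfo=timezone.utc)
    except (TypeError, ValueError):        # malformed or missing header
        parsed = datetime.fromtimestamp(int(internal_date_ms) / 1000, tz=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def message_id_or_synthetic(header_value, gmail_message_id: str) -> str:
    """Use the real Message-ID when present, else the synthetic form."""
    if header_value and header_value.strip():
        return header_value.strip()
    return f"synthetic-{gmail_message_id}@ten31.local"
```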
+
+---
+
+## 9. Matching
+
+### 9a. Source of truth for investor addresses
+
+At matcher startup and periodically (every 15 min), we build an in-memory index:
+
+```python
+INVESTOR_EMAIL_INDEX = {
+    "alice@acme.vc": {
+        "kind": "fundraising_contact",
+        "id": "fc_abc",
+        "investor_id": "fi_xyz",
+        "investor_name": "Acme Capital"
+    },
+    ...
+}
+
+INVESTOR_DOMAIN_INDEX = {
+    "acme.vc": [("fi_xyz", "Acme Capital"), ...]
+}
+```
+
+Pulled from:
+1. `fundraising_contacts.email` (primary LP list)
+2. `contacts.email` (general CRM contacts)
+3. `organizations.email` and domain of `organizations.website` (weak domain match only)
+
+### 9b. Match algorithm
+
+For each parsed email, collect the set of participant addresses (from + to + cc + bcc), **excluding** the owning account's own address (we don't match our own team).
+
+```python
+def match(participants: set[str]) -> list[InvestorLink]:
+    links = []
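+    for addr in participants:
+        # Index keys are assumed to be stored lowercased, like the domain keys.
+        entry = INVESTOR_EMAIL_INDEX.get(addr.lower())
+        if entry:
+            links.append(InvestorLink(
+                match_kind="exact_email",
+                confidence=1.0,
+                matched_address=addr,
+                # entry is the plain dict shown in §9a
+                fundraising_investor_id=entry.get("investor_id"),
+            ))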
+    if not links:  # only fall back to domain if no exact match
+        for addr in participants:
+            domain = addr.split("@", 1)[-1].lower()
+            if domain in INVESTOR_DOMAIN_INDEX and domain not in COMMON_PERSONAL_DOMAINS:
+                for inv_id, inv_name in INVESTOR_DOMAIN_INDEX[domain]:
+                    links.append(InvestorLink(
+                        match_kind="domain_match",
+                        confidence=0.6,
+                        matched_address=addr,
+                        fundraising_investor_id=inv_id
+                    ))
+    return dedupe(links)
+```
+
+`COMMON_PERSONAL_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com", "icloud.com", ...}` — we don't domain-match on these.
+
+Match status flows:
+- 0 links → `emails.match_status = 'unmatched'`, body not stored.
+- ≥1 exact_email link → `match_status = 'matched'`, full body + attachments stored.
+- Only domain_match links → `match_status = 'matched'` with confidence 0.6, full body stored but flagged for review in UI.
+
+### 9c. Re-matching
+
+If a new investor contact is added to the CRM, existing unmatched emails can be retroactively matched. `POST /api/email/rematch?since=<iso>` rebuilds the index and scans emails where `match_status='unmatched'` in the window. Full bodies are fetched from Gmail on demand for the newly matched ones (we still have the `gmail_message_id` via the sightings table).
+
+---
+
+## 10. Threading
+
+Two-level strategy:
+
+1. **Primary: Gmail `threadId`.** Within one mailbox, Gmail's grouping is authoritative. We trust it.
+2. **Cross-account reconciliation: RFC `References` / `In-Reply-To`.** If Alice sees thread `thr_A` and Bob sees thread `thr_B` and they share RFC `Message-ID`s, they're really one conversation.
+
+Algorithm:
+
+```python
+def resolve_thread(email_row):
+    # Step 1: normalize subject (strip "Re:", "Fwd:", whitespace)
+    subj_norm = normalize_subject(email_row.subject)
+
+    # Step 2: collect candidate thread keys
+    rfc_refs = email_row.references + [email_row.in_reply_to]
+    rfc_refs = [r for r in rfc_refs if r]
+
+    # Step 3: find any existing email that shares a References link
+    parent = db.find_email_by_any_rfc_id(rfc_refs + [email_row.rfc_message_id])
+
+    if parent and parent.thread_id:
+        return parent.thread_id
+
+    # Step 4: fall back to gmail_thread_id (first sighting wins)
+    if email_row.gmail_thread_id:
+        existing = db.find_thread_by_gmail_id(email_row.gmail_thread_id)
+        if existing:
+            return existing.id
+
+    # Step 5: create new thread
+    return db.create_thread(
+        gmail_thread_id=email_row.gmail_thread_id,
+        rfc_thread_root_id=rfc_refs[0] if rfc_refs else email_row.rfc_message_id,
+        subject_normalized=subj_norm,
+        first_message_at=email_row.sent_at
+    )
+```
+
+Thread roll-up (`email_threads.message_count`, `participants_json`, `last_message_at`) is recomputed opportunistically on each insert using a single `UPDATE ... FROM (SELECT ...)` — cheap at the volumes a 5-person team produces.
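`normalize_subject` is referenced in step 1 but not defined in this document. One plausible shape is sketched below; the prefix list and the lowercasing are assumptions, not the module's confirmed behavior.

```python
import re

# One or more leading "Re:" / "Fw:" / "Fwd:" markers, in any case.
_PREFIX_RE = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject) -> str:
    """Strip reply/forward prefixes, collapse whitespace, and lowercase."""
    stripped = _PREFIX_RE.sub("", subject or "")
    return re.sub(r"\s+", " ", stripped).strip().lower()
```

Because the pattern requires a colon after the prefix, a subject such as "Regarding Q2" is left intact.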
+
+---
+
+## 11. Attachments
+
+### 11a. Storage layout
+
+```
+data/
+├── crm.db
+├── email_attachments/
+│   └── <email_id[0:2]>/               # shard by first 2 chars of email_id to keep dirs small
+│       └── <email_id>/
+│           ├── <attachment_id>-pitch-deck.pdf
+│           └── <attachment_id>-financials.xlsx
+```
+
+### 11b. Download flow
+
+Run asynchronously from message parsing (so a giant attachment doesn't block the sync loop):
+
+1. Parser records attachment stubs in `email_attachments` with `download_status='pending'`.
+2. An attachment-download worker pulls up to N at a time (default 4), calls `messages.attachments.get`, streams to a temp file, computes SHA-256, renames to final path.
+3. On success: `download_status='downloaded'`, `downloaded_at`, `sha256_hex`, `size_bytes` filled.
+4. On failure: `download_status='failed'`, `download_attempts` incremented, `download_error` set. Retry with backoff up to 5 attempts.
+5. Sanitize filenames aggressively: strip path separators, null bytes, control chars, truncate to 200 chars. Keep extension.
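Step 5 and the §11a layout can be sketched together. This is not the module's `_sanitize_filename`; the names are hypothetical, and unsafe characters are replaced with `_` rather than dropped, which is a design choice of this sketch.

```python
import os
import re

# Control characters (including NUL), DEL, and both path separators.
_UNSAFE = re.compile(r"[\x00-\x1f\x7f/\\]")


def sanitize_filename(name: str, max_len: int = 200) -> str:
    """Return a single safe path component, keeping the extension."""
    cleaned = _UNSAFE.sub("_", name).strip(" .")
    if not cleaned:
        return "attachment"
    if len(cleaned) > max_len:
        stem, ext = os.path.splitext(cleaned)
        ext = ext[:max_len]
        cleaned = stem[: max_len - len(ext)] + ext
    return cleaned


def storage_path(root: str, email_id: str, attachment_id: str, filename: str) -> str:
    """Build <root>/<email_id[0:2]>/<email_id>/<attachment_id>-<safe name>."""
    safe = sanitize_filename(filename)
    return os.path.join(root, email_id[:2], email_id, f"{attachment_id}-{safe}")
```

Stripping leading dots as well as separators means a name like `../../etc/passwd` cannot escape the per-email directory.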
+
+### 11c. Dedup and size limits
+
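+- Content-addressed by `sha256_hex`. The hash is only known once the download finishes, so dedup saves disk rather than bandwidth: if a file with the same hash already exists on disk, we discard the temp file and link the second `email_attachments` row to the existing `storage_path`.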
+- Default per-file cap: 50 MB. Above that, we record metadata only with `download_status='skipped'` and a note. Configurable via `CRM_GMAIL_MAX_ATTACHMENT_MB`.
+- Default total storage cap: 10 GB (logged warning at 80%).
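A minimal sketch of the hash-then-link step, assuming the hash is computed while streaming to a temp file as in §11b. `store_with_dedup` and the in-memory `path_by_hash` map are hypothetical; the real lookup would presumably be a query on `sha256_hex`.

```python
import hashlib
import os
import tempfile
from typing import Dict, Iterable, Tuple


def store_with_dedup(chunks: Iterable[bytes], final_path: str,
                     path_by_hash: Dict[str, str]) -> Tuple[str, str]:
    """Stream to a temp file, hash it, then keep it or reuse an existing copy."""
    directory = os.path.dirname(final_path)
    os.makedirs(directory, exist_ok=True)
    digest = hashlib.sha256()
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                digest.update(chunk)
                tmp.write(chunk)
        sha = digest.hexdigest()
        existing = path_by_hash.get(sha)
        if existing is not None:
            os.remove(tmp_path)              # duplicate content: drop our copy
            return sha, existing
        os.replace(tmp_path, final_path)     # atomic rename on the same filesystem
        path_by_hash[sha] = final_path
        return sha, final_path
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)              # never leave partial files behind
        raise
```

Creating the temp file in the destination directory keeps the final `os.replace` on one filesystem, which is what makes the rename atomic.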
+
+### 11d. Retrieval
+
+- `GET /api/email/attachments/:id/download` — authenticated, streams the file. 404 if status != `downloaded`.
+- No direct filesystem exposure.
+
+---
+
+## 12. Errors, rate limits, backoff
+
+### 12a. Error taxonomy (`errors.py`)
+
+```python
+class GmailError(Exception): ...
+class AuthError(GmailError): ...              # 401/403 — broken creds, needs operator attention
+class RateLimitError(GmailError): ...         # 429 or 403 rate_limit — backoff + retry
+class TransientError(GmailError): ...         # 500, 502, 503, 504, network — backoff + retry
+class NotFoundError(GmailError): ...          # 404 — message deleted, history expired
+class HistoryExpiredError(NotFoundError): ... # specifically for expired startHistoryId
+class PermanentError(GmailError): ...         # 400 bad request etc — don't retry, log+skip
+```
+
+Every HTTP call is routed through `gmail_client._call()` which maps status codes → exception types → retry decision.
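The status-to-exception mapping could look like this. It is a sketch of what `gmail_client._call()` might delegate to, not its actual code: `classify`, its `history_call` flag, and the `RETRYABLE` tuple are hypothetical, and the classes are redeclared only to keep the block self-contained.

```python
class GmailError(Exception): ...
class AuthError(GmailError): ...
class RateLimitError(GmailError): ...
class TransientError(GmailError): ...
class NotFoundError(GmailError): ...
class HistoryExpiredError(NotFoundError): ...
class PermanentError(GmailError): ...

# Error reasons Google APIs use for quota-related 403 responses.
RATE_LIMIT_REASONS = {"rateLimitExceeded", "userRateLimitExceeded"}
TRANSIENT_STATUSES = {500, 502, 503, 504}
RETRYABLE = (RateLimitError, TransientError)


def classify(status: int, reason: str = "", history_call: bool = False) -> type:
    """Map an HTTP status (plus error reason) to an exception class."""
    if status == 429:
        return RateLimitError
    if status == 403:
        return RateLimitError if reason in RATE_LIMIT_REASONS else AuthError
    if status == 401:
        return AuthError
    if status == 404:
        return HistoryExpiredError if history_call else NotFoundError
    if status in TRANSIENT_STATUSES:
        return TransientError
    return PermanentError


def should_retry(exc_type: type) -> bool:
    """Only rate-limit and transient failures are worth retrying."""
    return issubclass(exc_type, RETRYABLE)
```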
+
+### 12b. Retry policy
+
+Exponential backoff with full jitter:
+- Initial delay: 1s
+- Max delay: 60s
+- Max attempts: 5 for transient/rate-limit, 0 for auth/permanent
+- On the 6th rate-limit error in a row for an account, mark account `sync_status='paused'` and alert via audit log.
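With full jitter, each delay is drawn uniformly between zero and an exponentially growing ceiling. A minimal sketch using the parameters above; `backoff_delay` is a hypothetical name and the random source is injectable so the ceiling can be tested deterministically.

```python
import random
from typing import Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Delay before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

For five retries the ceilings are 1, 2, 4, 8 and 16 seconds, so the 60s cap only matters if `CRM_GMAIL_RETRY_MAX` is raised.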
+
+### 12c. Gmail quota budget
+
+- Per-project: 1,000,000,000 units/day (enormous; we won't get near it).
+- Per-user: 250 units/second. We pace to ~150 units/sec/account via a token bucket in `gmail_client.py`. With 5 accounts processed serially, that's naturally below the per-project burst limit of 1,200 units/sec.
+
+Call costs:
+- `messages.list` = 5
+- `messages.get` = 5
+- `messages.attachments.get` = 5
+- `history.list` = 2
+
+Typical incremental cycle for one active user (~50 new messages in 3h): ~260 units. Negligible.
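The pacing and cost figures in this section can be sketched as a token bucket keyed on quota units. The class and its debt-based `reserve` method are an illustration rather than the code in `gmail_client.py`; the clock is injectable for testing.

```python
import time
from typing import Callable

# Quota units per call, from the list above.
COSTS = {
    "messages.list": 5,
    "messages.get": 5,
    "messages.attachments.get": 5,
    "history.list": 2,
}


class TokenBucket:
    """Refills at `rate` units/sec up to `capacity`; callers sleep off any debt."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def reserve(self, cost: float) -> float:
        """Spend `cost` units and return how many seconds to sleep first."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        self.tokens -= cost
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate
```

Usage would be `time.sleep(bucket.reserve(COSTS["messages.get"]))` before each call. One `history.list` plus 50 metadata `messages.get` calls comes to 2 + 50 × 5 = 252 units, in line with the ~260 estimate once a few full fetches are added.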
+
+### 12d. Data validation
+
+All DB inserts go through `db.py` helpers that:
+- Enforce required fields (from_email, sent_at, rfc_message_id).
+- Cap TEXT field lengths (subject 998 bytes per RFC 5322, body_text 10 MB soft cap with truncation log).
+- Coerce timestamps to UTC ISO8601.
+- Wrap everything in a single transaction per email.
+
+---
+
+## 13. Security considerations
+
+| Concern | Mitigation |
+|---------|-----------|
+| Service account key theft | File at 0600, not in git, `.gitignore` covers `data/secrets/`, key usable only for Gmail scope, restricted to Workspace domain by DWD. |
+| SQL injection | All SQL uses parameterized queries; strict `?` placeholders. |
+| Path traversal in attachment filenames | Filenames are sanitized (`_sanitize_filename`) and written under the ID-sharded directory from §11a; the unsanitized original name is stored in the DB only. |
+| Rate-limit DoS on sync endpoint | `POST /api/email/sync/run-now` is admin-only and subject to existing `WRITE_RATE_LIMIT_PER_MIN`. |
+| Exfiltration via rendered email HTML | UI must render in a sandboxed iframe with CSP; no remote image/script loads. (Frontend change, not in scope for this scaffolding.) |
+| Secret exposure in logs | Access tokens and refresh tokens never logged. Error messages are redacted by a central `redact_secrets()` filter. |
+| Backup includes email bodies | Existing backup system writes JSON of `fundraising_state` only — it does **not** include the new `emails_*` tables. Future decision: include encrypted? See §17. |
+
+---
+
+## 14. Integration plan — zero-breakage activation
+
+The scaffolded code is **inert** until three very small changes are made to `server.py`. Each is individually reversible.
+
+### Patch 1: Table creation (append to `init_db()`)
+
+```python
+# server.py, near the end of init_db()
+try:
+    from email_integration.db import apply_migrations
+    apply_migrations(cursor)
+except ImportError:
+    pass  # module not deployed yet, skip silently
+```
+
+### Patch 2: Scheduler startup (append to `main()` before `serve_forever()`)
+
+```python
+# server.py, in main() after start_backup_scheduler()
+if os.environ.get("CRM_GMAIL_INTEGRATION_ENABLED", "").lower() in ("1","true","yes"):
+    from email_integration.scheduler import start_sync_scheduler
+    start_sync_scheduler()
+```
+
+### Patch 3: Route dispatch hook (in `CRMHandler.do_GET` / `do_POST`, before the 404)
+
+```python
+# server.py, before the 404 fallthrough in CRMHandler
+try:
+    from email_integration.routes import try_handle
+    if try_handle(self):   # returns True if it claimed the request
+        return
+except ImportError:
+    pass  # module not deployed yet, fall through to the existing 404
+```
+
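+The integration is controlled by the env var `CRM_GMAIL_INTEGRATION_ENABLED`, which defaults to off. Patch 2 checks the flag directly; Patches 1 and 3 contain no flag check of their own, so any gating for them has to live inside `email_integration`. Turning the flag off takes effect at the next restart. Migrations are idempotent, so there is no "partially-migrated" state.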
+
+---
+
+## 15. Testing strategy
+
+### 15a. Unit tests (pure Python, no network)
+
+- `test_parser.py` — fixture Gmail payloads covering: plain-only, html-only, multipart/alternative, multipart/mixed with attachments, RFC 2047 subjects, malformed dates, missing Message-ID.
+- `test_matcher.py` — exact match, domain match, common-personal-domain exclusion, team-own-address exclusion, dedup.
+- `test_threads.py` — new thread, Gmail-only thread, RFC-only thread, cross-account reconciliation.
+- `test_db.py` — migrations run on empty DB, re-run is no-op, all upserts idempotent.
+- `test_crypto.py` — encrypt/decrypt roundtrip, wrong key fails, nonce uniqueness.
+
+### 15b. Integration tests (mocked HTTP)
+
+Use `responses` lib or a tiny in-proc mock server to simulate Gmail API responses:
+- Full backfill of 200 messages across 5 pages.
+- Incremental history with messagesAdded + labelsAdded + messagesDeleted.
+- Expired `startHistoryId` triggers date-backfill fallback.
+- 429 triggers retry; 6 consecutive 429s pauses account.
+- Attachment download happy path + large-file skip.
+
+### 15c. Manual smoke test before production
+
+1. Enable feature flag for a single test account first (`grant@ten31.xyz` only).
+2. Run initial backfill, confirm no errors in `email_sync_runs`.
+3. Check sample of matched emails in UI against Gmail source.
+4. Verify attachments download and hash matches.
+5. After 24h of incremental sync, enroll remaining 4 teammates.
+
+---
+
+## 16. Rollout plan
+
+| Week | Milestone |
+|------|-----------|
+| 1 | Review this doc. Adjust schema / decisions as needed. |
+| 2 | Implement scaffolded modules (code is already stubbed — fill in bodies). Unit tests. |
+| 3 | Integration tests against a GCP sandbox project. |
+| 4 | Apply server.py patches behind flag, enable for Grant's account only. Monitor for 1 week. |
+| 5 | Enroll remaining 4 teammates. Document ops runbook. |
+| 6+ | Phase 2: UI for threads, search, manual match override. |
+
+---
+
+## 17. Open questions / deferred decisions
+
+1. **Backup inclusion.** Should the existing nightly backup cover `emails_*` tables? Pro: completeness. Con: bodies contain sensitive content; backups currently written unencrypted. Recommend adding a separate encrypted email backup stream rather than merging into `fundraising_backup`.
+2. **Retention policy for unmatched emails.** Current plan: keep metadata indefinitely, never store body. Alternative: delete metadata after 90 days to reduce DB size. User preference needed.
+3. **Real-time push.** Gmail supports `users.watch` + Pub/Sub to get sub-second notifications. Out of scope for Phase 1 (requires ingress URL, Pub/Sub, harder on a self-hosted box). Revisit if latency becomes an issue.
+4. **Contact auto-creation.** Should an email from a new `@foo.vc` address automatically create a `contacts` row? Currently no — matching is passive. Easy to add later.
+5. **Sent email enrichment.** When a teammate sends email to a known investor, should it create/update a CRM task (e.g., "awaiting reply")? Phase 3 roadmap item.
+6. **Multi-domain Workspace.** If ten31.xyz ever adds a second domain (ten31.io etc), DWD needs updating. Document for future.
+
+---
+
+## Appendix A: Example end-to-end flow
+
+A concrete walkthrough to ground the design.
+
+**Scenario:** Alice from Acme Capital emails Grant with a fundraising question, cc'ing her partner Bob.
+
+1. **T+0.** Alice sends `Subject: "Re: Q2 allocation"` to `grant@ten31.xyz`, cc `bob@acme.vc`.
+2. **T+1 min.** Gmail delivers. Historically this is invisible to the CRM.
+3. **T+1-180 min.** Sync scheduler wakes up. For Grant's account, `history.list(startHistoryId=<checkpoint>)` returns this new message.
+4. **Metadata fetch.** `messages.get(id=..., format=metadata)` returns headers. Participant set = `{alice@acme.vc, grant@ten31.xyz, bob@acme.vc}`. Excluding Grant's own address → `{alice@acme.vc, bob@acme.vc}`.
+5. **Match.** `INVESTOR_EMAIL_INDEX[alice@acme.vc]` → Acme Capital (fundraising_investor_id=`fi_xyz`). Exact match, confidence 1.0.
+6. **Full fetch.** `messages.get(id=..., format=full)` returns body + attachment stubs.
+7. **Parse.** `parser.extract` produces `ParsedEmail(from=alice, to=[grant], cc=[bob], subject="Re: Q2 allocation", body_text="...", attachments=[{name:"term_sheet.pdf", gmail_attachment_id:"..."}])`.
+8. **DB writes (one transaction):**
+   - `INSERT INTO emails` with `rfc_message_id=<CAA...@mail.gmail.com>`, `match_status='matched'`.
+   - `INSERT INTO email_recipients` × 3.
+   - `INSERT INTO email_account_messages` with Grant's account + Gmail message id.
+   - `INSERT INTO email_investor_links` × 1 (to fi_xyz).
+   - `INSERT INTO email_attachments` with `download_status='pending'`.
+   - `resolve_thread()` finds existing thread via References, updates `email_threads.last_message_at`.
+9. **Attachment download.** Worker picks up the pending attachment, calls `messages.attachments.get`, writes `data/email_attachments/<emailid[:2]>/<emailid>/<attid>-term_sheet.pdf`, updates row with hash + status.
+10. **Sightings.** If a second teammate with an enrolled mailbox had also received this email, that teammate's sync run would pick up its own copy. Because `rfc_message_id` already exists, only `INSERT INTO email_account_messages` runs (no new `emails` row), and the copy is recorded as a second sighting.
+11. **Result in UI.** Acme Capital's investor page shows a new email in the conversation feed, threaded with prior Q2 allocation messages. Term sheet attachment is one click away, served from local disk.
+
+---
+
+## Appendix B: What gets scaffolded now vs. filled in later
+
+The scaffolded modules (`backend/email_integration/*.py`) include:
+- All module-level docstrings explaining purpose.
+- Complete type signatures and dataclasses.
+- Full SQL for migrations.
+- Working implementations for: config loading, credential abstraction, DB migration application, HTTP route dispatch.
+- Skeleton implementations with `TODO` markers for: full Gmail API call bodies, MIME tree walking, retry loops.
+
+Lines of code delivered in scaffolding: ~1,500. Remaining to implement: ~1,000 (mostly the Gmail client and parser edge cases). Nothing in `server.py` or `crm.db` is changed.
@@ -0,0 +1,90 @@
+"""Core-schema migration runner for the Ten31 CRM.
+
+Phase 0 (Workstream A2) introduces ordered, reviewable SQL migrations for the
+*core* schema, generalizing the pattern the Gmail subsystem already uses
+(email_integration/db.py). Unlike the Gmail runner, this one keeps a
+`schema_migrations` ledger so each file runs exactly once and the applied set is
+auditable.
+
+Design rules (CLAUDE.md guardrails):
+  * Additive and reversible only — migrations add tables / nullable columns;
+    they never drop or rewrite existing data. Each NNNN_*.sql may ship a paired
+    NNNN_*.down.sql for manual rollback (the .down files are never auto-applied).
+  * Idempotent — files use `CREATE TABLE/INDEX IF NOT EXISTS`. For the few
+    non-idempotent `ALTER TABLE ... ADD COLUMN` statements (SQLite has no
+    IF NOT EXISTS for columns), a partial-apply is tolerated by skipping
+    "duplicate column name" errors, matching the existing defensive pattern in
+    server.py:init_db().
+
+Call `apply_core_migrations(conn)` from init_db() after the base tables exist.
+"""
+
+import os
+import sqlite3
+
+MIGRATIONS_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "migrations")
+
+
+def _split_statements(sql: str):
+    """Split a SQL script into individual statements on ';' boundaries.
+
+    Only safe while migrations contain no procedural blocks and no semicolons
+    inside string literals or comments.
+    """
+    return [s.strip() for s in sql.split(";") if s.strip()]
+
+
+def _apply_statementwise(cursor, sql: str) -> None:
+    """Execute a migration one statement at a time, tolerating an already-applied
+    `ALTER TABLE ... ADD COLUMN` (duplicate column). Used only as a fallback when
+    executescript() trips over a partially-applied migration."""
+    for stmt in _split_statements(sql):
+        try:
+            cursor.execute(stmt)
+        except sqlite3.OperationalError as exc:
+            if "duplicate column name" in str(exc).lower():
+                continue
+            raise
+
+
+def apply_core_migrations(conn) -> None:
+    """Apply any pending backend/migrations/NNNN_*.sql files once, in order.
+
+    Records each applied file in the `schema_migrations` ledger. `*.down.sql`
+    files are ignored (manual rollback only).
+    """
+    cursor = conn.cursor()
+    cursor.execute(
+        """
+        CREATE TABLE IF NOT EXISTS schema_migrations (
+            filename   TEXT PRIMARY KEY,
+            applied_at TEXT DEFAULT (datetime('now'))
+        )
+        """
+    )
+    conn.commit()
+
+    if not os.path.isdir(MIGRATIONS_DIR):
+        return
+
+    applied = {row[0] for row in cursor.execute("SELECT filename FROM schema_migrations")}
+
+    pending = sorted(
+        fn for fn in os.listdir(MIGRATIONS_DIR)
+        if fn.endswith(".sql") and not fn.endswith(".down.sql") and fn not in applied
+    )
+
+    for filename in pending:
+        path = os.path.join(MIGRATIONS_DIR, filename)
+        with open(path, "r", encoding="utf-8") as handle:
+            sql = handle.read()
+        try:
+            cursor.executescript(sql)
+        except sqlite3.OperationalError as exc:
+            if "duplicate column name" in str(exc).lower():
+                _apply_statementwise(cursor, sql)
+            else:
+                raise
+        cursor.execute("INSERT INTO schema_migrations (filename) VALUES (?)", (filename,))
+        conn.commit()
+        print(f"[migrations] applied {filename}")
@@ -0,0 +1,175 @@
+# `email_integration` — Gmail capture for the Venture CRM
+
+Scaffolded Phase 1 of the Gmail integration described in
+`GMAIL_INTEGRATION_ARCHITECTURE.md` (repo root). Everything in this module is
+isolated from `server.py` until you wire it in explicitly.
+
+## Contents
+
+| File | Purpose |
+|------|---------|
+| `config.py` | Env-var loader; exposes `CONFIG` singleton. |
+| `errors.py` | Exception taxonomy used by the retry loop. |
+| `crypto.py` | AES-GCM wrapper for OAuth refresh-token encryption (only used in OAuth mode). |
+| `credentials.py` | `CredentialProvider` protocol + `DWDCredentialProvider` / `OAuthCredentialProvider`. |
+| `gmail_client.py` | Gmail API HTTP wrapper (rate limit, retry, pagination). |
+| `db.py` | All SQL touching `emails_*` tables. Migrations live under `migrations/`. |
+| `parser.py` | Gmail payload → canonical dict (headers, body, attachments). |
+| `matcher.py` | Investor address index + match logic. |
+| `threads.py` | Thread resolution using Gmail threadId + RFC References. |
+| `attachments.py` | Stub rows + on-disk storage + download worker. |
+| `sync.py` | Orchestrator for backfill + incremental sync of one account. |
+| `scheduler.py` | Background thread that runs `sync.sync_all` on an interval. |
+| `routes.py` | HTTP handlers under `/api/email/*` compatible with `CRMHandler`. |
+| `migrations/0001_email_tables.sql` | Table DDL. |
+
+## Wiring it in
+
+All changes are in `backend/server.py`. Each is independently revertible, and
+the integration stays inactive unless `CRM_GMAIL_INTEGRATION_ENABLED=true`.
+
+**Patch 1 — migrations** (append to `init_db()` after all existing
+`cursor.executescript(...)` calls, before `conn.commit()`):
+
+```python
+try:
+    from email_integration.db import apply_migrations
+    apply_migrations(cursor)
+except ImportError:
+    pass
+```
+
+**Patch 2 — scheduler** (in `main()`, after `start_backup_scheduler()`):
+
+```python
+if os.environ.get("CRM_GMAIL_INTEGRATION_ENABLED", "").lower() in ("1", "true", "yes", "on"):
+    from email_integration.scheduler import start_sync_scheduler
+    start_sync_scheduler()
+```
+
+**Patch 3 — routes** (add near the top of `CRMHandler.do_GET` and
+`CRMHandler.do_POST`, after auth/rate-limit pre-checks, before API routing):
+
+```python
+try:
+    from email_integration.routes import try_handle
+    if try_handle(self):
+        return
+except ImportError:
+    pass
+```
+
+## Environment variables
+
+```bash
+# Master on/off. Default off; scheduler won't start, routes return 503.
+CRM_GMAIL_INTEGRATION_ENABLED=true
+
+# Auth method: "dwd" (default, recommended) or "oauth"
+CRM_GMAIL_AUTH_METHOD=dwd
+
+# DWD mode
+CRM_GMAIL_SA_KEY_PATH=/path/to/CRM/data/secrets/gmail-service-account.json
+CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz
+
+# OAuth mode (fallback; not required for DWD)
+CRM_GMAIL_OAUTH_CLIENT_ID=...
+CRM_GMAIL_OAUTH_CLIENT_SECRET=...
+CRM_GMAIL_OAUTH_REDIRECT_URI=https://crm.ten31.xyz/api/email/oauth/callback
+CRM_GMAIL_SECRET_KEY=<base64-32-random-bytes>   # for encrypting refresh tokens
+
+# Sync
+CRM_GMAIL_SYNC_INTERVAL_MIN=180          # default 3h
+CRM_GMAIL_BACKFILL_PAGE_SIZE=500
+CRM_GMAIL_MAX_ATTACHMENT_MB=50
+CRM_GMAIL_ATTACH_CONCURRENCY=4
+CRM_GMAIL_RATE_UNITS_SEC=150             # per account, leaves 40% headroom
+CRM_GMAIL_RETRY_MAX=5
+CRM_GMAIL_HISTORY_STALE_DAYS=5
+```
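A loader for a few of these variables might look like the sketch below. It is not the actual `config.py`: the field names and helpers are hypothetical, and only the variable names, defaults, and truthy values are taken from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(environ: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Interpret common truthy spellings, case-insensitively."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def env_int(environ: Mapping[str, str], name: str, default: int) -> int:
    """Parse an integer variable, using the default when unset or blank."""
    raw = environ.get(name, "").strip()
    return int(raw) if raw else default


@dataclass(frozen=True)
class Config:
    enabled: bool
    auth_method: str
    sync_interval_sec: int
    max_attachment_mb: int
    rate_units_sec: int


def load_config(environ: Optional[Mapping[str, str]] = None) -> Config:
    """Build a Config from the process environment or a supplied mapping."""
    env = os.environ if environ is None else environ
    return Config(
        enabled=env_flag(env, "CRM_GMAIL_INTEGRATION_ENABLED"),
        auth_method=env.get("CRM_GMAIL_AUTH_METHOD", "dwd").strip().lower(),
        sync_interval_sec=env_int(env, "CRM_GMAIL_SYNC_INTERVAL_MIN", 180) * 60,
        max_attachment_mb=env_int(env, "CRM_GMAIL_MAX_ATTACHMENT_MB", 50),
        rate_units_sec=env_int(env, "CRM_GMAIL_RATE_UNITS_SEC", 150),
    )
```

Accepting a mapping argument lets tests pass a plain dict instead of mutating `os.environ`.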
+
+## Google Cloud / Workspace setup (DWD)
+
+See `GMAIL_INTEGRATION_ARCHITECTURE.md` §3 for the full runbook. Short form:
+
+1. Create GCP project, enable Gmail API.
+2. Create service account, download JSON key, enable domain-wide delegation.
+3. In Google Admin console → Security → API controls → Manage domain-wide
+   delegation, authorize the service account's client ID with scope
+   `https://www.googleapis.com/auth/gmail.readonly`.
+4. Copy the JSON key to `data/secrets/gmail-service-account.json`, `chmod 600`.
+5. Set env vars in `.env.beta`, restart CRM.
+6. As admin, POST `/api/email/accounts/enroll-all` to create `email_accounts`
+   rows for every active user whose email ends in the Workspace domain.
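
Before restarting the CRM in step 5, it may help to sanity-check the key file from step 4. The following is a minimal sketch, assuming the key is the JSON file Google issues for a service account; `check_service_account_key` is a hypothetical helper, not part of the module. It checks only the two fields the DWD provider reads (`client_email`, `private_key`) and the `chmod 600` requirement.

```python
import json
import os
import stat


def check_service_account_key(path: str) -> list[str]:
    """Return a list of problems found with the key file (empty means OK)."""
    problems = []
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        # Any group/other permission bit means the file is not 0o600-tight.
        problems.append(f"permissions {oct(mode)} are too open; expected 0o600")
    with open(path, "r", encoding="utf-8") as f:
        key = json.load(f)
    for field in ("client_email", "private_key"):
        if not key.get(field):
            problems.append(f"missing field: {field}")
    return problems
```

An empty list means the file looks usable; it does not prove that delegation was authorized in the Admin console.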
+
+## Adding the crypto dependency
+
+Both auth modes need `cryptography`: DWD mode uses it for the RSA signing of
+the JWT bearer token, and OAuth mode uses it for AES-GCM encryption of stored
+tokens. If you enable the integration in either mode, append this to
+`backend/requirements.txt`:
+
+```
+cryptography==42.0.5
+```
+
+## Rollback
+
+To disable instantly: set `CRM_GMAIL_INTEGRATION_ENABLED=false` and restart.
+The scheduler won't start, routes return 503, DB tables remain (unused).
+
+To remove completely: drop the env var, delete `data/email_attachments/`,
+and drop all `emails_*` and `email_*` tables. The migration is idempotent
+and create-only, so this requires a separate drop script, which Phase 1
+does not provide.
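
A drop script along these lines could fill that gap. This is a sketch under stated assumptions, not a shipped tool: it assumes the integration owns the `emails` table plus every table whose name starts with `email_` or `emails_`, and that foreign-key enforcement is off on the connection (the SQLite default for a new connection), so drop order does not matter. `drop_email_tables` is a hypothetical name.

```python
import sqlite3


def drop_email_tables(conn: sqlite3.Connection, *, dry_run: bool = True) -> list[str]:
    """Drop `emails`, `email_*`, and `emails_*` tables; return the names affected."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND (name = 'emails' "
        "     OR name LIKE 'email\\_%' ESCAPE '\\' "
        "     OR name LIKE 'emails\\_%' ESCAPE '\\') "
        "ORDER BY name"
    ).fetchall()
    names = [r[0] for r in rows]
    if not dry_run:
        for name in names:
            # Names come from sqlite_master, not user input; quote them anyway.
            conn.execute(f'DROP TABLE IF EXISTS "{name}"')
        conn.commit()
    return names
```

Running it with `dry_run=True` first lists what would be removed, which is worth reviewing against a backup before dropping anything.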
+
+## Local development
+
+The module makes no network calls at import time as long as the scheduler is
+not started, so individual pieces can be exercised offline. For example, to
+run the parser against a saved message fixture:
+
+```bash
+python3 -c "from email_integration.parser import parse; \
+    import json; \
+    print(parse(json.load(open('fixture.json'))))"
+```
+
+## Testing checklist (before enabling in production)
+
+- [ ] Enable `CRM_GMAIL_INTEGRATION_ENABLED=true` on a staging copy of the DB only.
+- [ ] Verify migrations applied: `emails`, `email_accounts`, etc. present.
+- [ ] Enroll one account (yours) via `/api/email/accounts/enroll`.
+- [ ] Trigger `POST /api/email/sync/run-now`.
+- [ ] Check `email_sync_runs` for `status='ok'`.
+- [ ] Spot-check `emails` rows against Gmail.
+- [ ] Verify an attachment downloaded correctly (hash and size).
+- [ ] Let the scheduler run for 24 hours; monitor `/api/email/status`.
+- [ ] Enroll remaining 4 teammates.
+
+## What's scaffolded vs. TODO
+
+**Scaffolded and complete:**
+- Schema (migration 0001)
+- Config and env parsing
+- Error taxonomy + retry classifier
+- AES-GCM crypto helpers
+- DWD JWT minting + access token caching
+- OAuth refresh + consent flow endpoints
+- Gmail client (list/get/history/attachments/profile) with rate limit + retry
+- Full DB data-access layer
+- MIME parser including RFC 2047 subjects and HTML→text fallback
+- Investor matcher with exact + domain strategies
+- Thread resolution (Gmail threadId + RFC References cross-account)
+- Attachment storage with SHA-256 dedup
+- Sync orchestrator (backfill + incremental with history-expired fallback)
+- Scheduler with manual-trigger hook
+- HTTP routes (status, accounts, threads, enroll, run-now, rematch, oauth)
+
+**TODO before production (see architecture doc §15):**
+- Multipart batch metadata fetch in `gmail_client.batch_get_metadata`
+  (currently serial fallback).
+- Unit tests (fixtures for parser, matcher, threads; integration tests with
+  responses-style HTTP mock).
+- Frontend UI: a thread list + detail pane in `frontend/index.html`.
+- Sandboxed HTML rendering for email bodies (out of scope here).
@@ -0,0 +1,15 @@
+"""
+Gmail Integration for Venture CRM.
+
+Phase 1 scope: OAuth2/DWD authentication, incremental Gmail sync, MIME parsing,
+investor matching, threading, attachment storage. All logic isolated to this
+module; server.py integration is three small patches guarded by
+CRM_GMAIL_INTEGRATION_ENABLED.
+
+See GMAIL_INTEGRATION_ARCHITECTURE.md at the repo root for full design.
+"""
+
+from . import config  # noqa: F401
+from . import errors  # noqa: F401
+
+__all__ = ["config", "errors"]
@@ -0,0 +1,234 @@
+"""
+Attachment download + on-disk storage.
+
+Two usage patterns:
+
+  1. During message parsing we call `register_stubs(conn, email_id, parsed)`
+     which writes pending rows to email_attachments.
+
+  2. A separate worker (kicked off by sync after each account completes)
+     calls `drain_pending()` which fetches attachment bytes from Gmail and
+     writes them to disk under CONFIG.attachments_dir.
+
+Files are named: <CRM_DATA_DIR>/email_attachments/<email_id[:2]>/<email_id>/<attachment_id>-<sanitized_filename>
+
+Sanitization prevents path traversal and keeps cross-platform-safe names.
+"""
+
+import base64
+import hashlib
+import os
+import re
+import sqlite3
+from typing import Iterable, Optional
+
+from . import config as _cfg
+from . import db as _db
+from . import errors as _errors
+from . import gmail_client as _gmail
+
+
+_MAX_FILENAME_LEN = 200
+_BAD_FILENAME_CHARS = re.compile(r'[/\\\x00-\x1f\x7f:*?"<>|]+')
+
+
+def _sanitize_filename(name: str) -> str:
+    if not name:
+        return "unnamed.bin"
+    # strip path components first
+    name = os.path.basename(name.replace("\\", "/"))
+    name = _BAD_FILENAME_CHARS.sub("_", name).strip(" .")
+    if not name:
+        name = "unnamed.bin"
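+    if len(name) > _MAX_FILENAME_LEN:
+        stem, dot, ext = name.rpartition(".")
+        # Keep the extension only when it leaves room for a stem; an oversized
+        # "extension" would otherwise defeat the length cap.
+        if dot and len(ext) + 1 < _MAX_FILENAME_LEN:
+            name = stem[: _MAX_FILENAME_LEN - len(ext) - 1] + "." + ext
+        else:
+            name = name[:_MAX_FILENAME_LEN]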
+    return name
+
+
+def _storage_path_for(email_id: str, attachment_id: str, sanitized_filename: str) -> str:
+    root = _cfg.CONFIG.attachments_dir
+    bucket = email_id[:2] or "_0"
+    dir_ = os.path.join(root, bucket, email_id)
+    os.makedirs(dir_, exist_ok=True)
+    return os.path.join(dir_, f"{attachment_id}-{sanitized_filename}")
+
+
+# ---------------------------------------------------------------------------- phase 1: register stubs
+
+def register_stubs(conn: sqlite3.Connection, *, email_id: str,
+                   parsed_attachments: Iterable[dict]) -> list[str]:
+    """Write pending attachment rows from parsed message data.
+
+    Also handles tiny inline attachments whose bytes arrived with the message
+    body (body.data present, no separate attachmentId) by writing them
+    directly and marking as downloaded.
+
+    Returns list of attachment ids created.
+    """
+    max_bytes = _cfg.CONFIG.max_attachment_mb * 1024 * 1024
+    ids = []
+
+    for att in parsed_attachments:
+        filename = att.get("filename") or "unnamed.bin"
+        sanitized = _sanitize_filename(filename)
+        gmail_att_id = att.get("gmail_attachment_id") or ""
+        mime = att.get("mime_type")
+        size = att.get("size")
+
+        # Determine storage path (we write the path whether or not the download
+        # succeeded; missing files surface via download_status).
+        att_row_id = _db.insert_attachment_stub(
+            conn,
+            email_id=email_id,
+            gmail_attachment_id=gmail_att_id,
+            filename=filename,
+            sanitized_filename=sanitized,
+            mime_type=mime,
+            size_bytes=size,
+            storage_path=_storage_path_for(email_id, gmail_att_id or att_row_id_fallback(), sanitized),
+        )
+        ids.append(att_row_id)
+
+        # Oversize guard.
+        if isinstance(size, int) and size > max_bytes:
+            conn.execute(
+                "UPDATE email_attachments SET download_status = 'skipped', "
+                "download_error = ? WHERE id = ?",
+                (f"exceeds max size {_cfg.CONFIG.max_attachment_mb}MB", att_row_id),
+            )
+            continue
+
+        # Inline data fast-path.
+        inline_b64 = att.get("inline_data_b64")
+        if inline_b64:
+            try:
+                raw = base64.urlsafe_b64decode(_pad(inline_b64).encode("ascii"))
+                path = _storage_path_for(email_id, att_row_id, sanitized)
+                _write_bytes(path, raw)
+                sha = hashlib.sha256(raw).hexdigest()
+                conn.execute(
+                    "UPDATE email_attachments SET storage_path = ? WHERE id = ?",
+                    (path, att_row_id),
+                )
+                _db.mark_attachment_downloaded(
+                    conn, att_row_id, sha256_hex=sha, size_bytes=len(raw)
+                )
+            except Exception as e:
+                _db.mark_attachment_failed(conn, att_row_id, error=f"inline decode: {e}")
+
+    return ids
+
+
+def att_row_id_fallback() -> str:
+    # Placeholder so the path template always produces something if gmail_att_id
+    # was missing at stub time; the real path is rewritten when the worker
+    # picks it up.
+    import uuid
+    return uuid.uuid4().hex
+
+
+# ---------------------------------------------------------------------------- phase 2: worker
+
+def drain_pending(conn_factory, client: _gmail.GmailClient, account_id: str,
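+                  *, limit: int = 50) -> int:
+    """Download pending attachments that belong to `account_id`.
+
+    The pending query fetches up to `limit` rows and is not filtered by
+    account here; rows for other accounts are skipped, so a single call may
+    download fewer than `limit` attachments.
+
+    Returns count of successfully downloaded attachments. Called after each
+    account's sync completes so large files don't block the sync loop.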
+    """
+    conn = conn_factory()
+    try:
+        pending = _db.pending_attachments(conn, limit=limit)
+    finally:
+        conn.close()
+
+    downloaded = 0
+    for row in pending:
+        if row["account_id"] != account_id:
+            continue
+        conn = conn_factory()
+        try:
+            ok = _download_one(conn, client, row)
+            if ok:
+                downloaded += 1
+            conn.commit()
+        finally:
+            conn.close()
+    return downloaded
+
+
+def _download_one(conn: sqlite3.Connection, client: _gmail.GmailClient, row) -> bool:
+    try:
+        resp = client.get_attachment(row["gmail_message_id"], row["gmail_attachment_id"])
+    except _errors.RETRYABLE as e:
+        _db.mark_attachment_failed(conn, row["id"], error=f"transient: {type(e).__name__}")
+        return False
+    except _errors.GmailError as e:
+        _db.mark_attachment_failed(conn, row["id"], error=f"{type(e).__name__}: {e}")
+        return False
+
+    data_b64 = resp.get("data")
+    if not data_b64:
+        _db.mark_attachment_failed(conn, row["id"], error="empty data in response")
+        return False
+
+    try:
+        raw = base64.urlsafe_b64decode(_pad(data_b64).encode("ascii"))
+    except Exception as e:
+        _db.mark_attachment_failed(conn, row["id"], error=f"decode: {e}")
+        return False
+
+    sha = hashlib.sha256(raw).hexdigest()
+    # If an existing attachment has the same SHA, re-point storage_path and skip write.
+    existing = _find_existing_by_sha(conn, sha, exclude_id=row["id"])
+    if existing:
+        conn.execute(
+            "UPDATE email_attachments SET storage_path = ? WHERE id = ?",
+            (existing["storage_path"], row["id"]),
+        )
+        _db.mark_attachment_downloaded(conn, row["id"], sha256_hex=sha, size_bytes=len(raw))
+        return True
+
+    path = _storage_path_for(row["email_id"], row["id"], row["sanitized_filename"])
+    try:
+        _write_bytes(path, raw)
+    except OSError as e:
+        _db.mark_attachment_failed(conn, row["id"], error=f"disk: {e}")
+        return False
+
+    conn.execute(
+        "UPDATE email_attachments SET storage_path = ? WHERE id = ?",
+        (path, row["id"]),
+    )
+    _db.mark_attachment_downloaded(conn, row["id"], sha256_hex=sha, size_bytes=len(raw))
+    return True
+
+
+def _find_existing_by_sha(conn: sqlite3.Connection, sha: str, *, exclude_id: str) -> Optional[sqlite3.Row]:
+    cur = conn.cursor()
+    cur.execute(
+        "SELECT * FROM email_attachments WHERE sha256_hex = ? AND id != ? "
+        "AND download_status = 'downloaded' LIMIT 1",
+        (sha, exclude_id),
+    )
+    return cur.fetchone()
+
+
+# ---------------------------------------------------------------------------- utils
+
+def _pad(b64: str) -> str:
+    pad = 4 - (len(b64) % 4)
+    return b64 + ("=" * pad if pad != 4 else "")
+
+
+def _write_bytes(path: str, data: bytes) -> None:
+    os.makedirs(os.path.dirname(path), exist_ok=True)
+    tmp = path + ".tmp"
+    with open(tmp, "wb") as f:
+        f.write(data)
+    os.chmod(tmp, 0o600)
+    os.replace(tmp, path)
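
The `_pad` helper above exists because base64url payloads may arrive without trailing `=` padding, which `base64.urlsafe_b64decode` rejects. A compact, standalone equivalent is sketched here for reference; `decode_b64url` is a hypothetical name and is not used by the module.

```python
import base64


def decode_b64url(data: str) -> bytes:
    """Decode base64url text, tolerating missing '=' padding."""
    # -len % 4 yields 0..3, the number of '=' characters needed.
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded.encode("ascii"))
```

Input that is already padded passes through unchanged, so the helper is safe to apply unconditionally.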
@@ -0,0 +1,112 @@
+"""
+Email integration configuration.
+
+Reads from the same env-var surface as the rest of the CRM (server.py style),
+no pydantic/dotenv magic — stdlib only.
+"""
+
+import os
+from dataclasses import dataclass
+from typing import Optional
+
+# Reuse the CRM's data dir so backups and email storage live together.
+_PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+_DEFAULT_DATA_DIR = os.path.join(_PROJECT_DIR, "data")
+
+
+def _bool_env(name: str, default: bool = False) -> bool:
+    v = os.environ.get(name, "").strip().lower()
+    if v in ("1", "true", "yes", "on"):
+        return True
+    if v in ("0", "false", "no", "off"):
+        return False
+    return default
+
+
+def _int_env(name: str, default: int) -> int:
+    try:
+        return int(os.environ.get(name, str(default)))
+    except (TypeError, ValueError):
+        return default
+
+
+@dataclass(frozen=True)
+class EmailConfig:
+    # Master kill switch. When False, scheduler doesn't start and routes
+    # return 503. Migrations are still applied (so schema is ready).
+    enabled: bool
+
+    # Primary auth path. "dwd" means service account / domain-wide delegation.
+    # "oauth" means per-user refresh tokens. DWD is default; OAuth is the
+    # pluggable fallback.
+    primary_auth: str
+
+    # DWD specifics
+    dwd_key_path: Optional[str]
+    workspace_domain: Optional[str]
+
+    # OAuth specifics (used for fallback + admin UI)
+    oauth_client_id: Optional[str]
+    oauth_client_secret: Optional[str]
+    oauth_redirect_uri: Optional[str]
+
+    # Encryption key (base64) for OAuth refresh-token-at-rest encryption.
+    # Required whenever oauth path is in use. DWD path never persists tokens.
+    secret_key_b64: Optional[str]
+
+    # Sync scheduling
+    sync_interval_sec: int
+    backfill_page_size: int
+    max_attachment_mb: int
+    max_parallel_attachment_downloads: int
+
+    # Storage
+    data_dir: str
+    attachments_dir: str
+    secrets_dir: str
+
+    # Rate limit / retry
+    rate_limit_units_per_sec_per_account: int
+    retry_max_attempts: int
+    retry_initial_delay_sec: float
+    retry_max_delay_sec: float
+
+    # Gmail history retention — if we fall this far behind, switch to date
+    # backfill since Gmail may have pruned history records.
+    history_stale_days: int
+
+
+def load() -> EmailConfig:
+    data_dir = os.environ.get("CRM_DATA_DIR", _DEFAULT_DATA_DIR)
+    return EmailConfig(
+        enabled=_bool_env("CRM_GMAIL_INTEGRATION_ENABLED", False),
+        primary_auth=os.environ.get("CRM_GMAIL_AUTH_METHOD", "dwd").lower(),
+        dwd_key_path=os.environ.get("CRM_GMAIL_SA_KEY_PATH") or None,
+        workspace_domain=os.environ.get("CRM_GMAIL_WORKSPACE_DOMAIN") or None,
+        oauth_client_id=os.environ.get("CRM_GMAIL_OAUTH_CLIENT_ID") or None,
+        oauth_client_secret=os.environ.get("CRM_GMAIL_OAUTH_CLIENT_SECRET") or None,
+        oauth_redirect_uri=os.environ.get("CRM_GMAIL_OAUTH_REDIRECT_URI") or None,
+        secret_key_b64=os.environ.get("CRM_GMAIL_SECRET_KEY") or None,
+        sync_interval_sec=_int_env("CRM_GMAIL_SYNC_INTERVAL_MIN", 180) * 60,
+        backfill_page_size=_int_env("CRM_GMAIL_BACKFILL_PAGE_SIZE", 500),
+        max_attachment_mb=_int_env("CRM_GMAIL_MAX_ATTACHMENT_MB", 50),
+        max_parallel_attachment_downloads=_int_env("CRM_GMAIL_ATTACH_CONCURRENCY", 4),
+        data_dir=data_dir,
+        attachments_dir=os.path.join(data_dir, "email_attachments"),
+        secrets_dir=os.path.join(data_dir, "secrets"),
+        rate_limit_units_per_sec_per_account=_int_env("CRM_GMAIL_RATE_UNITS_SEC", 150),
+        retry_max_attempts=_int_env("CRM_GMAIL_RETRY_MAX", 5),
+        retry_initial_delay_sec=float(os.environ.get("CRM_GMAIL_RETRY_INITIAL_SEC", "1.0")),
+        retry_max_delay_sec=float(os.environ.get("CRM_GMAIL_RETRY_MAX_SEC", "60.0")),
+        history_stale_days=_int_env("CRM_GMAIL_HISTORY_STALE_DAYS", 5),
+    )
+
+
+# Singleton. Reload with `reload_config()` if env changes (mostly for tests).
+CONFIG = load()
+
+
+def reload_config() -> EmailConfig:
+    global CONFIG
+    CONFIG = load()
+    return CONFIG
@@ -0,0 +1,297 @@
+"""
+Credential providers for Gmail API access.
+
+Two implementations behind a common protocol:
+
+- DWDCredentialProvider: signs a JWT with the Workspace-authorized service
+  account, exchanges for a short-lived access token that impersonates a
+  specific user. No per-user persistent state.
+
+- OAuthCredentialProvider: uses a per-user refresh token (stored encrypted
+  in email_accounts.oauth_refresh_enc) to mint access tokens. Supports the
+  'connect Gmail' UI flow.
+
+Both provide the same interface:
+
+    provider.access_token_for(email_address: str) -> AccessToken
+"""
+
+import base64
+import json
+import os
+import threading
+import time
+from dataclasses import dataclass
+from typing import Optional, Protocol
+import urllib.error
+import urllib.parse
+import urllib.request
+
+from . import config as _cfg
+from . import crypto
+from . import errors
+
+
+GMAIL_READONLY_SCOPE = "https://www.googleapis.com/auth/gmail.readonly"
+GOOGLE_TOKEN_URL = "https://oauth2.googleapis.com/token"
+
+
+@dataclass
+class AccessToken:
+    token: str
+    expires_at: float   # epoch seconds
+
+
+class CredentialProvider(Protocol):
+    def access_token_for(self, email_address: str) -> AccessToken: ...
+    def revoke(self, email_address: str) -> None: ...
+
+
+# ============================================================================
+# Domain-wide delegation
+# ============================================================================
+
+class DWDCredentialProvider:
+    """Impersonation via service-account JWT bearer grant."""
+
+    def __init__(self, key_path: str):
+        with open(key_path, "r") as f:
+            self._key = json.load(f)
+        self._client_email = self._key["client_email"]
+        self._private_key_pem = self._key["private_key"].encode("utf-8")
+        self._cache: dict[str, AccessToken] = {}
+        self._lock = threading.Lock()
+
+    def access_token_for(self, email_address: str) -> AccessToken:
+        with self._lock:
+            cached = self._cache.get(email_address)
+            if cached and cached.expires_at - time.time() > 60:
+                return cached
+            token = self._mint(email_address)
+            self._cache[email_address] = token
+            return token
+
+    def revoke(self, email_address: str) -> None:
+        # DWD tokens expire naturally in <1h. Revocation is via Admin console.
+        # We just drop the cache so next call mints fresh.
+        with self._lock:
+            self._cache.pop(email_address, None)
+
+    # ------------------------------------------------------------------ helpers
+
+    def _mint(self, subject_email: str) -> AccessToken:
+        try:
+            from cryptography.hazmat.primitives import hashes, serialization  # type: ignore
+            from cryptography.hazmat.primitives.asymmetric import padding  # type: ignore
+        except ImportError as e:  # pragma: no cover
+            raise errors.AuthError(
+                "DWD requires the `cryptography` package. Add to requirements.txt."
+            ) from e
+
+        now = int(time.time())
+        header = {"alg": "RS256", "typ": "JWT"}
+        claim = {
+            "iss": self._client_email,
+            "sub": subject_email,
+            "scope": GMAIL_READONLY_SCOPE,
+            "aud": GOOGLE_TOKEN_URL,
+            "iat": now,
+            "exp": now + 3600,
+        }
+        signing_input = _b64url(_json(header)) + b"." + _b64url(_json(claim))
+
+        private_key = serialization.load_pem_private_key(self._private_key_pem, password=None)
+        signature = private_key.sign(signing_input, padding.PKCS1v15(), hashes.SHA256())
+        jwt = signing_input + b"." + _b64url(signature)
+
+        body = urllib.parse.urlencode({
+            "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
+            "assertion": jwt.decode("ascii"),
+        }).encode("ascii")
+
+        req = urllib.request.Request(
+            GOOGLE_TOKEN_URL,
+            data=body,
+            headers={"Content-Type": "application/x-www-form-urlencoded"},
+        )
+        try:
+            with urllib.request.urlopen(req, timeout=15) as resp:
+                payload = json.loads(resp.read())
+        except urllib.error.HTTPError as e:
+            body_text = e.read().decode("utf-8", errors="replace")
+            try:
+                payload = json.loads(body_text)
+            except Exception:
+                payload = {"raw": body_text}
+            raise errors.classify_http(e.code, payload)
+
+        if "access_token" not in payload:
+            raise errors.AuthError("DWD token exchange returned no access_token", payload=payload)
+        return AccessToken(
+            token=payload["access_token"],
+            expires_at=time.time() + float(payload.get("expires_in", 3600)) - 30,
+        )
+
+
+# ============================================================================
+# Per-user OAuth (fallback)
+# ============================================================================
+
+class OAuthCredentialProvider:
+    """Refreshes access tokens using a stored encrypted refresh token.
+
+    Refresh tokens are obtained via the consent-flow routes in routes.py and
+    stored in email_accounts.oauth_refresh_enc (AES-GCM ciphertext).
+    """
+
+    def __init__(self, db_conn_factory, client_id: str, client_secret: str, secret_key_b64: str):
+        self._db = db_conn_factory
+        self._client_id = client_id
+        self._client_secret = client_secret
+        self._secret_key_b64 = secret_key_b64
+        self._lock = threading.Lock()
+
+    def access_token_for(self, email_address: str) -> AccessToken:
+        with self._lock:
+            row = self._load_account(email_address)
+            if row is None:
+                raise errors.AuthError(f"no email_accounts row for {email_address}")
+            # Cached access token still valid?
+            if row["oauth_token_enc"] and row["oauth_token_exp"]:
+                try:
+                    exp = float(row["oauth_token_exp"])
+                except ValueError:
+                    exp = 0.0
+                if exp - time.time() > 60:
+                    token = crypto.decrypt(row["oauth_token_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
+                    return AccessToken(token=token, expires_at=exp)
+            # Refresh.
+            return self._refresh(email_address, row)
+
+    def revoke(self, email_address: str) -> None:
+        row = self._load_account(email_address)
+        if not row or not row["oauth_refresh_enc"]:
+            return
+        refresh = crypto.decrypt(row["oauth_refresh_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
+        body = urllib.parse.urlencode({"token": refresh}).encode("ascii")
+        req = urllib.request.Request(
+            "https://oauth2.googleapis.com/revoke",
+            data=body,
+            headers={"Content-Type": "application/x-www-form-urlencoded"},
+        )
+        try:
+            urllib.request.urlopen(req, timeout=10).read()
+        except Exception:
+            pass  # best effort; we zero locally regardless
+        self._zero_account(email_address)
+
+    # ------------------------------------------------------------------ helpers
+
+    def _refresh(self, email_address: str, row) -> AccessToken:
+        if not row["oauth_refresh_enc"]:
+            raise errors.AuthError(f"no refresh token stored for {email_address}")
+        refresh = crypto.decrypt(row["oauth_refresh_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
+        body = urllib.parse.urlencode({
+            "grant_type": "refresh_token",
+            "refresh_token": refresh,
+            "client_id": self._client_id,
+            "client_secret": self._client_secret,
+        }).encode("ascii")
+        req = urllib.request.Request(
+            GOOGLE_TOKEN_URL,
+            data=body,
+            headers={"Content-Type": "application/x-www-form-urlencoded"},
+        )
+        try:
+            with urllib.request.urlopen(req, timeout=15) as resp:
+                payload = json.loads(resp.read())
+        except urllib.error.HTTPError as e:
+            body_text = e.read().decode("utf-8", errors="replace")
+            try:
+                payload = json.loads(body_text)
+            except Exception:
+                payload = {"raw": body_text}
+            raise errors.classify_http(e.code, payload)
+
+        if "access_token" not in payload:
+            raise errors.AuthError("OAuth refresh returned no access_token", payload=payload)
+
+        token_str = payload["access_token"]
+        exp = time.time() + float(payload.get("expires_in", 3600)) - 30
+        enc_token = crypto.encrypt(token_str.encode("ascii"), secret_key_b64=self._secret_key_b64)
+        self._save_token(email_address, enc_token, exp)
+        return AccessToken(token=token_str, expires_at=exp)
+
+    def _load_account(self, email_address: str):
+        conn = self._db()
+        try:
+            cur = conn.cursor()
+            cur.execute(
+                "SELECT id, oauth_refresh_enc, oauth_token_enc, oauth_token_exp "
+                "FROM email_accounts WHERE email_address = ?",
+                (email_address,),
+            )
+            return cur.fetchone()
+        finally:
+            conn.close()
+
+    def _save_token(self, email_address: str, enc_token: bytes, exp: float):
+        conn = self._db()
+        try:
+            conn.execute(
+                "UPDATE email_accounts SET oauth_token_enc = ?, oauth_token_exp = ?, "
+                "updated_at = datetime('now') WHERE email_address = ?",
+                (enc_token, str(exp), email_address),
+            )
+            conn.commit()
+        finally:
+            conn.close()
+
+    def _zero_account(self, email_address: str):
+        conn = self._db()
+        try:
+            conn.execute(
+                "UPDATE email_accounts SET oauth_refresh_enc = NULL, oauth_token_enc = NULL, "
+                "oauth_token_exp = NULL, sync_enabled = 0, sync_status = 'paused', "
+                "updated_at = datetime('now') WHERE email_address = ?",
+                (email_address,),
+            )
+            conn.commit()
+        finally:
+            conn.close()
+
+
+# ============================================================================
+# Factory — resolves CONFIG.primary_auth to a concrete provider
+# ============================================================================
+
+def build_provider(db_conn_factory) -> CredentialProvider:
+    cfg = _cfg.CONFIG
+    if cfg.primary_auth == "dwd":
+        if not cfg.dwd_key_path or not os.path.exists(cfg.dwd_key_path):
+            raise errors.AuthError(
+                f"CRM_GMAIL_SA_KEY_PATH not found: {cfg.dwd_key_path!r}"
+            )
+        return DWDCredentialProvider(cfg.dwd_key_path)
+    if cfg.primary_auth == "oauth":
+        if not (cfg.oauth_client_id and cfg.oauth_client_secret and cfg.secret_key_b64):
+            raise errors.AuthError(
+                "OAuth mode requires CRM_GMAIL_OAUTH_CLIENT_ID, "
+                "CRM_GMAIL_OAUTH_CLIENT_SECRET, and CRM_GMAIL_SECRET_KEY."
+            )
+        return OAuthCredentialProvider(
+            db_conn_factory,
+            cfg.oauth_client_id,
+            cfg.oauth_client_secret,
+            cfg.secret_key_b64,
+        )
+    raise errors.AuthError(f"unknown primary_auth: {cfg.primary_auth!r}")
+
+
+# ---------------------------------------------------------------------------- utils
+
+def _b64url(data: bytes) -> bytes:
+    return base64.urlsafe_b64encode(data).rstrip(b"=")
+
+
+def _json(obj) -> bytes:
+    return json.dumps(obj, separators=(",", ":")).encode("utf-8")
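
When debugging a failed DWD token exchange, it can be useful to inspect the claim set that `_mint` signs. The sketch below is a standalone illustration using only the standard library; the helper names are hypothetical, and it deliberately does not verify the signature, so it is for inspection only.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode without '=' padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Inverse of b64url_encode; restores the stripped padding first."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def decode_jwt_claims(jwt: str) -> dict:
    """Return the claim set of a JWT WITHOUT verifying its signature."""
    _header_seg, claim_seg, _signature_seg = jwt.split(".")
    return json.loads(b64url_decode(claim_seg))
```

Checking `iss`, `sub`, `scope`, and `aud` in the decoded claims against the Admin console authorization is usually the quickest way to spot a mismatch.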
@@ -0,0 +1,79 @@
+"""
+AES-256-GCM encryption for OAuth tokens at rest (refresh tokens and cached
+access tokens).
+
+Key material comes from CONFIG.secret_key_b64 (env: CRM_GMAIL_SECRET_KEY).
+Must be at least 32 bytes of entropy, base64-encoded.
+
+Storage format (as stored in BLOB columns):
+    version(1 byte) || nonce(12 bytes) || ciphertext+tag(N bytes)
+
+version = 1 for AES-GCM-256.
+
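+Uses the `cryptography` library. If it is not installed (optional at scaffold
+time), encrypt/decrypt raise CryptoUnavailable and the OAuth fallback path is
+disabled with a clear error. The DWD path does not use this module, but it
+also needs `cryptography` to RSA-sign its JWT.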
+"""
+
+import base64
+import os
+import secrets
+from typing import Optional
+
+try:
+    from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # type: ignore
+    _AVAILABLE = True
+except ImportError:  # pragma: no cover
+    AESGCM = None  # type: ignore
+    _AVAILABLE = False
+
+
+VERSION = 1
+NONCE_LEN = 12
+
+
+class CryptoUnavailable(RuntimeError):
+    pass
+
+
+def _load_key(secret_key_b64: Optional[str]) -> bytes:
+    if not secret_key_b64:
+        raise CryptoUnavailable(
+            "CRM_GMAIL_SECRET_KEY not set; cannot encrypt/decrypt OAuth tokens. "
+            "DWD auth does not require this."
+        )
+    try:
+        key = base64.b64decode(secret_key_b64)
+    except Exception as e:
+        raise CryptoUnavailable(f"CRM_GMAIL_SECRET_KEY not valid base64: {e}") from e
+    if len(key) < 32:
+        raise CryptoUnavailable(
+            f"CRM_GMAIL_SECRET_KEY decodes to {len(key)} bytes; need >= 32."
+        )
+    return key[:32]  # AES-256
+
+
+def encrypt(plaintext: bytes, *, secret_key_b64: Optional[str]) -> bytes:
+    if not _AVAILABLE:
+        raise CryptoUnavailable("cryptography library not installed")
+    key = _load_key(secret_key_b64)
+    nonce = secrets.token_bytes(NONCE_LEN)
+    ct = AESGCM(key).encrypt(nonce, plaintext, None)
+    return bytes([VERSION]) + nonce + ct
+
+
+def decrypt(blob: bytes, *, secret_key_b64: Optional[str]) -> bytes:
+    if not _AVAILABLE:
+        raise CryptoUnavailable("cryptography library not installed")
+    if not blob or len(blob) < 1 + NONCE_LEN + 16:
+        raise ValueError("ciphertext too short")
+    version = blob[0]
+    if version != VERSION:
+        raise ValueError(f"unsupported crypto version: {version}")
+    nonce = blob[1:1 + NONCE_LEN]
+    ct = blob[1 + NONCE_LEN:]
+    key = _load_key(secret_key_b64)
+    return AESGCM(key).decrypt(nonce, ct, None)
+
+
+def generate_secret_key_b64() -> str:
+    """Helper for initial setup: returns a fresh key you can drop into env."""
+    return base64.b64encode(os.urandom(32)).decode("ascii")
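
The storage format described in the module docstring (`version || nonce || ciphertext+tag`) can be illustrated without the `cryptography` dependency. The following sketch only splits a blob into its parts and performs no decryption; `Blob` and `split_blob` are hypothetical names, and the 16-byte tag length is the size AES-GCM appends in the `cryptography` library's `AESGCM`.

```python
from typing import NamedTuple

VERSION_LEN = 1
NONCE_LEN = 12
TAG_LEN = 16  # AES-GCM authentication tag appended to the ciphertext


class Blob(NamedTuple):
    version: int
    nonce: bytes
    ciphertext_and_tag: bytes


def split_blob(blob: bytes) -> Blob:
    """Split a stored blob into version, nonce, and ciphertext+tag."""
    if len(blob) < VERSION_LEN + NONCE_LEN + TAG_LEN:
        raise ValueError("ciphertext too short")
    return Blob(blob[0], blob[1:1 + NONCE_LEN], blob[1 + NONCE_LEN:])
```

Keeping the version byte first leaves room to introduce a new scheme later while still reading old rows.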
@@ -0,0 +1,416 @@
+"""
+Data-access layer for the email_integration module.
+
+SQL touching the email tables is meant to live here. Other modules should
+call named helpers rather than writing SQL inline, which keeps schema changes
+contained.
+
+Connection pattern matches server.py get_db():
+    - WAL mode, foreign keys on, busy_timeout
+    - sqlite3.Row row_factory
+The caller is responsible for committing / closing.
+"""
+
+import json
+import os
+import sqlite3
+import uuid
+from datetime import datetime, timezone
+from typing import Iterable, Optional
+
+
+# ------------------------------------------------------------------ migrations
+
+def apply_migrations(cursor: sqlite3.Cursor) -> None:
+    """Apply all .sql migration files in migrations/ in lexicographic order.
+
+    Called from server.init_db(). Idempotent. Does not log past migrations in
+    a table yet — each file is guarded by CREATE ... IF NOT EXISTS etc. If
+    we ever need more complex migrations, add a schema_migrations table.
+    """
+    here = os.path.dirname(os.path.abspath(__file__))
+    mdir = os.path.join(here, "migrations")
+    if not os.path.isdir(mdir):
+        return
+    for name in sorted(os.listdir(mdir)):
+        if not name.endswith(".sql"):
+            continue
+        path = os.path.join(mdir, name)
+        with open(path, "r") as f:
+            sql = f.read()
+        cursor.executescript(sql)
+
+
+# ------------------------------------------------------------------ utils
+
+def _uuid() -> str:
+    return str(uuid.uuid4())
+
+
+def _now_iso() -> str:
+    return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+
+
+def _json(v) -> str:
+    return json.dumps(v, separators=(",", ":"))
+
+
+# ------------------------------------------------------------------ email_accounts
+
+def list_sync_ready_accounts(conn: sqlite3.Connection) -> list[sqlite3.Row]:
+    cur = conn.cursor()
+    cur.execute(
+        "SELECT * FROM email_accounts "
+        "WHERE sync_enabled = 1 AND sync_status IN ('pending','active') "
+        "ORDER BY last_synced_at IS NOT NULL, last_synced_at"
+    )
+    return cur.fetchall()
+
+
+def get_account_by_email(conn: sqlite3.Connection, email_address: str) -> Optional[sqlite3.Row]:
+    cur = conn.cursor()
+    cur.execute("SELECT * FROM email_accounts WHERE email_address = ?", (email_address,))
+    return cur.fetchone()
+
+
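+def upsert_account(conn: sqlite3.Connection, *, user_id: str, email_address: str,
+                   auth_method: str) -> str:
+    """Return the account id for email_address, inserting a row if none exists.
+
+    Despite the name, an existing row is returned unchanged: user_id and
+    auth_method are not updated.
+    """
+    existing = get_account_by_email(conn, email_address)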
+    if existing:
+        return existing["id"]
+    account_id = _uuid()
+    conn.execute(
+        "INSERT INTO email_accounts (id, user_id, email_address, auth_method) "
+        "VALUES (?, ?, ?, ?)",
+        (account_id, user_id, email_address, auth_method),
+    )
+    return account_id
+
+
+def set_account_status(conn: sqlite3.Connection, account_id: str, *,
+                       status: str, error: Optional[str] = None) -> None:
+    conn.execute(
+        "UPDATE email_accounts SET sync_status = ?, sync_error = ?, "
+        "updated_at = datetime('now') WHERE id = ?",
+        (status, error, account_id),
+    )
+
+
+def set_account_checkpoint(conn: sqlite3.Connection, account_id: str, *,
+                           history_id: Optional[str] = None,
+                           backfill_cursor: Optional[str] = None,
+                           backfill_complete: Optional[bool] = None,
+                           last_synced_at: Optional[str] = None) -> None:
+    sets, params = [], []
+    if history_id is not None:
+        sets.append("last_history_id = ?"); params.append(history_id)
+    if backfill_cursor is not None:
+        sets.append("backfill_cursor = ?"); params.append(backfill_cursor)
+    if backfill_complete is not None:
+        sets.append("backfill_complete = ?"); params.append(1 if backfill_complete else 0)
+    if last_synced_at is not None:
+        sets.append("last_synced_at = ?"); params.append(last_synced_at)
+    if not sets:
+        return
+    sets.append("updated_at = datetime('now')")
+    params.append(account_id)
+    conn.execute(f"UPDATE email_accounts SET {', '.join(sets)} WHERE id = ?", params)
+
+
+# ------------------------------------------------------------------ emails
+
+def find_email_by_rfc_id(conn: sqlite3.Connection, rfc_message_id: str) -> Optional[sqlite3.Row]:
+    cur = conn.cursor()
+    cur.execute("SELECT * FROM emails WHERE rfc_message_id = ?", (rfc_message_id,))
+    return cur.fetchone()
+
+
+def find_email_id_by_any_rfc_id(conn: sqlite3.Connection,
+                                rfc_ids: Iterable[str]) -> Optional[str]:
+    ids = [r for r in rfc_ids if r]
+    if not ids:
+        return None
+    placeholders = ",".join("?" for _ in ids)
+    cur = conn.cursor()
+    cur.execute(
+        f"SELECT id FROM emails WHERE rfc_message_id IN ({placeholders}) "
+        "ORDER BY sent_at ASC LIMIT 1",
+        ids,
+    )
+    row = cur.fetchone()
+    return row["id"] if row else None
+
+
+def insert_email(conn: sqlite3.Connection, *, parsed: dict, match_status: str) -> str:
+    """Insert a fresh emails row. Returns email_id.
+
+    Caller must ensure no row exists for parsed['rfc_message_id']; use
+    find_email_by_rfc_id first.
+    """
+    email_id = _uuid()
+    conn.execute(
+        """INSERT INTO emails
+        (id, rfc_message_id, gmail_thread_id, rfc_thread_root_id, subject,
+         from_email, from_name, to_emails_json, cc_emails_json, bcc_emails_json,
+         reply_to, sent_at, body_text, body_html, snippet, in_reply_to,
+         references_json, has_attachments, size_estimate, is_matched,
+         match_status, raw_headers_json)
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+        (
+            email_id,
+            parsed["rfc_message_id"],
+            parsed.get("gmail_thread_id"),
+            parsed.get("rfc_thread_root_id"),
+            parsed.get("subject"),
+            parsed["from_email"],
+            parsed.get("from_name"),
+            _json(parsed.get("to", [])),
+            _json(parsed.get("cc", [])),
+            _json(parsed.get("bcc", [])),
+            parsed.get("reply_to"),
+            parsed["sent_at"],
+            parsed.get("body_text"),
+            parsed.get("body_html"),
+            parsed.get("snippet"),
+            parsed.get("in_reply_to"),
+            _json(parsed.get("references", [])),
+            1 if parsed.get("attachments") else 0,
+            parsed.get("size_estimate"),
+            1 if match_status == "matched" else 0,
+            match_status,
+            _json(parsed.get("raw_headers", {})) if parsed.get("raw_headers") else None,
+        ),
+    )
+    # recipients: one email_recipients row per address, tagged by kind
+    for kind in ("from", "to", "cc", "bcc", "reply_to"):
+        addrs = []
+        if kind == "from":
+            if parsed.get("from_email"):
+                addrs = [(parsed["from_email"], parsed.get("from_name"))]
+        elif kind == "reply_to":
+            if parsed.get("reply_to"):
+                addrs = [(parsed["reply_to"], None)]
+        else:
+            # `or []` guards against keys that are present but set to None.
+            for a in parsed.get(kind) or []:
+                if isinstance(a, dict):
+                    addrs.append((a.get("email"), a.get("name")))
+                else:
+                    addrs.append((a, None))
+        for address, name in addrs:
+            if not address:
+                continue
+            conn.execute(
+                "INSERT INTO email_recipients (id, email_id, address, display_name, kind) "
+                "VALUES (?, ?, ?, ?, ?)",
+                (_uuid(), email_id, address.lower().strip(), name, kind),
+            )
+    return email_id
+
+
+def set_email_thread(conn: sqlite3.Connection, email_id: str, thread_id: str) -> None:
+    conn.execute(
+        "UPDATE emails SET thread_id = ?, updated_at = datetime('now') WHERE id = ?",
+        (thread_id, email_id),
+    )
+
+
+# ------------------------------------------------------------------ sightings
+
+def upsert_sighting(conn: sqlite3.Connection, *, email_id: str, account_id: str,
+                    gmail_message_id: str, gmail_thread_id: str,
+                    labels: list[str], is_sent: bool) -> None:
+    conn.execute(
+        """INSERT OR IGNORE INTO email_account_messages
+        (id, email_id, account_id, gmail_message_id, gmail_thread_id,
+         labels_json, is_sent)
+        VALUES (?, ?, ?, ?, ?, ?, ?)""",
+        (_uuid(), email_id, account_id, gmail_message_id, gmail_thread_id,
+         _json(labels), 1 if is_sent else 0),
+    )
+
+
+def update_sighting_labels(conn: sqlite3.Connection, *, account_id: str,
+                           gmail_message_id: str, labels: list[str]) -> None:
+    conn.execute(
+        "UPDATE email_account_messages SET labels_json = ? "
+        "WHERE account_id = ? AND gmail_message_id = ?",
+        (_json(labels), account_id, gmail_message_id),
+    )
+
+
+def tombstone_sighting(conn: sqlite3.Connection, *, account_id: str,
+                       gmail_message_id: str) -> None:
+    conn.execute(
+        "UPDATE email_account_messages SET deleted_at = datetime('now') "
+        "WHERE account_id = ? AND gmail_message_id = ?",
+        (account_id, gmail_message_id),
+    )
+
+
+# ------------------------------------------------------------------ attachments
+
+def insert_attachment_stub(conn: sqlite3.Connection, *, email_id: str,
+                           gmail_attachment_id: str, filename: str,
+                           sanitized_filename: str, mime_type: Optional[str],
+                           size_bytes: Optional[int], storage_path: str) -> str:
+    att_id = _uuid()
+    conn.execute(
+        """INSERT INTO email_attachments
+        (id, email_id, gmail_attachment_id, filename, sanitized_filename,
+         mime_type, size_bytes, storage_path)
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
+        (att_id, email_id, gmail_attachment_id, filename, sanitized_filename,
+         mime_type, size_bytes, storage_path),
+    )
+    return att_id
+
+
+def mark_attachment_downloaded(conn: sqlite3.Connection, attachment_id: str, *,
+                               sha256_hex: str, size_bytes: int) -> None:
+    conn.execute(
+        "UPDATE email_attachments SET download_status = 'downloaded', "
+        "sha256_hex = ?, size_bytes = ?, downloaded_at = datetime('now') "
+        "WHERE id = ?",
+        (sha256_hex, size_bytes, attachment_id),
+    )
+
+
+def mark_attachment_failed(conn: sqlite3.Connection, attachment_id: str, *,
+                           error: str) -> None:
+    conn.execute(
+        "UPDATE email_attachments SET download_status = 'failed', "
+        "download_attempts = download_attempts + 1, download_error = ? "
+        "WHERE id = ?",
+        (error, attachment_id),
+    )
+
+
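+def pending_attachments(conn: sqlite3.Connection, limit: int = 50) -> list[sqlite3.Row]:
+    """Return pending attachments joined to the sightings that can fetch them.
+
+    An attachment whose email was seen by several accounts yields one row per
+    sighting, so callers may see the same attachment id more than once.
+    """
+    cur = conn.cursor()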
+    cur.execute(
+        "SELECT a.*, eam.gmail_message_id, eam.account_id "
+        "FROM email_attachments a "
+        "JOIN email_account_messages eam ON eam.email_id = a.email_id "
+        "WHERE a.download_status = 'pending' AND a.download_attempts < 5 "
+        "LIMIT ?",
+        (limit,),
+    )
+    return cur.fetchall()
+
+
+# ------------------------------------------------------------------ threads
+
+def find_thread_by_gmail_id(conn: sqlite3.Connection, gmail_thread_id: str) -> Optional[sqlite3.Row]:
+    cur = conn.cursor()
+    cur.execute(
+        "SELECT * FROM email_threads WHERE gmail_thread_id = ?",
+        (gmail_thread_id,),
+    )
+    return cur.fetchone()
+
+
+def find_thread_by_rfc_root(conn: sqlite3.Connection, rfc_root: str) -> Optional[sqlite3.Row]:
+    cur = conn.cursor()
+    cur.execute(
+        "SELECT * FROM email_threads WHERE rfc_thread_root_id = ?",
+        (rfc_root,),
+    )
+    return cur.fetchone()
+
+
+def create_thread(conn: sqlite3.Connection, *, gmail_thread_id: Optional[str],
+                  rfc_thread_root_id: Optional[str], subject_normalized: Optional[str],
+                  first_message_at: Optional[str]) -> str:
+    thread_id = _uuid()
+    conn.execute(
+        """INSERT INTO email_threads
+        (id, gmail_thread_id, rfc_thread_root_id, subject_normalized,
+         first_message_at, last_message_at, message_count)
+        VALUES (?, ?, ?, ?, ?, ?, 0)""",
+        (thread_id, gmail_thread_id, rfc_thread_root_id, subject_normalized,
+         first_message_at, first_message_at),
+    )
+    return thread_id
+
+
+def rollup_thread(conn: sqlite3.Connection, thread_id: str) -> None:
+    """Recompute count / last_message_at / participants from member emails.
+
+    Cheap at 5-person team volumes. For larger deployments swap to triggers.
+    """
+    cur = conn.cursor()
+    cur.execute(
+        "SELECT COUNT(*) AS n, MIN(sent_at) AS first, MAX(sent_at) AS last, "
+        "MAX(is_matched) AS matched FROM emails WHERE thread_id = ?",
+        (thread_id,),
+    )
+    row = cur.fetchone()
+    if not row or row["n"] == 0:
+        return
+    cur.execute(
+        "SELECT DISTINCT address FROM email_recipients er "
+        "JOIN emails e ON e.id = er.email_id WHERE e.thread_id = ?",
+        (thread_id,),
+    )
+    participants = [r["address"] for r in cur.fetchall()]
+    conn.execute(
+        "UPDATE email_threads SET message_count = ?, first_message_at = ?, "
+        "last_message_at = ?, participant_count = ?, participants_json = ?, "
+        "is_matched = ?, updated_at = datetime('now') WHERE id = ?",
+        (row["n"], row["first"], row["last"], len(participants),
+         _json(participants), int(row["matched"] or 0), thread_id),
+    )
+
+
+# ------------------------------------------------------------------ investor links
+
+def insert_investor_link(conn: sqlite3.Connection, *, email_id: str,
+                         link: dict) -> None:
+    conn.execute(
+        """INSERT INTO email_investor_links
+        (id, email_id, fundraising_investor_id, fundraising_contact_id,
+         contact_id, organization_id, matched_address, match_kind,
+         match_confidence)
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+        (
+            _uuid(),
+            email_id,
+            link.get("fundraising_investor_id"),
+            link.get("fundraising_contact_id"),
+            link.get("contact_id"),
+            link.get("organization_id"),
+            link["matched_address"],
+            link["match_kind"],
+            float(link.get("match_confidence", 1.0)),
+        ),
+    )
+
+
+# ------------------------------------------------------------------ sync runs
+
+def start_sync_run(conn: sqlite3.Connection, *, account_id: str, kind: str) -> str:
+    run_id = _uuid()
+    conn.execute(
+        "INSERT INTO email_sync_runs (id, account_id, kind, started_at, status) "
+        "VALUES (?, ?, ?, ?, 'running')",
+        (run_id, account_id, kind, _now_iso()),
+    )
+    return run_id
+
+
+def finish_sync_run(conn: sqlite3.Connection, run_id: str, *, status: str,
+                    stats: Optional[dict] = None, error: Optional[str] = None) -> None:
+    stats = stats or {}
+    conn.execute(
+        """UPDATE email_sync_runs
+           SET finished_at = ?, status = ?, messages_seen = ?, messages_stored = ?,
+               attachments_saved = ?, api_calls = ?, retries = ?, error = ?
+           WHERE id = ?""",
+        (
+            _now_iso(), status,
+            int(stats.get("messages_seen", 0)),
+            int(stats.get("messages_stored", 0)),
+            int(stats.get("attachments_saved", 0)),
+            int(stats.get("api_calls", 0)),
+            int(stats.get("retries", 0)),
+            error,
+            run_id,
+        ),
+    )
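The partial-update pattern used by set_account_checkpoint() above can be tried in isolation. The following is a minimal sketch, not the module's code: the `demo_accounts` table and the `update_checkpoint` name are hypothetical stand-ins, and only two of the checkpoint fields are modelled.

```python
import sqlite3
from typing import Optional


def update_checkpoint(conn: sqlite3.Connection, account_id: str, *,
                      history_id: Optional[str] = None,
                      backfill_complete: Optional[bool] = None) -> bool:
    """Update only the fields that were passed; return True if a write ran."""
    sets, params = [], []
    if history_id is not None:
        sets.append("last_history_id = ?")
        params.append(history_id)
    if backfill_complete is not None:
        sets.append("backfill_complete = ?")
        params.append(1 if backfill_complete else 0)
    if not sets:
        return False  # nothing to change, so no UPDATE is issued
    params.append(account_id)
    # Column names come from the fixed strings above, never from input;
    # only the values are bound as parameters.
    conn.execute(f"UPDATE demo_accounts SET {', '.join(sets)} WHERE id = ?", params)
    return True


# Hypothetical table, created in memory purely for the demonstration.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE demo_accounts ("
    "id TEXT PRIMARY KEY, last_history_id TEXT, backfill_complete INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO demo_accounts (id, last_history_id) VALUES ('a1', '100')")

update_checkpoint(conn, "a1", backfill_complete=True)  # history id is left alone
row = conn.execute("SELECT * FROM demo_accounts WHERE id = 'a1'").fetchone()
```

Passing None for a field means "leave as is", so this shape cannot reset a column to NULL; a caller that needs that would require a separate sentinel or helper.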
@@ -0,0 +1,79 @@
+"""
+Exception taxonomy for Gmail integration.
+
+gmail_client._call() maps HTTP status codes to these exception types. The retry
+loop in gmail_client._with_retry() inspects the class hierarchy to decide
+whether to back off + retry or fail fast.
+"""
+
+
+class GmailError(Exception):
+    """Base class for all Gmail-integration errors."""
+
+    def __init__(self, message: str = "", *, status: int = 0, payload: object = None):
+        super().__init__(message)
+        self.status = status
+        self.payload = payload
+
+
+class AuthError(GmailError):
+    """401 / 403 that is not a rate-limit. Requires operator intervention
+    (bad service account key, revoked OAuth, missing DWD scope). Not retried."""
+
+
+class RateLimitError(GmailError):
+    """429, or 403 with reason in {rateLimitExceeded, userRateLimitExceeded,
+    quotaExceeded}. Retried with exponential backoff."""
+
+
+class TransientError(GmailError):
+    """5xx or network error. Retried with exponential backoff."""
+
+
+class NotFoundError(GmailError):
+    """404. For messages this usually means 'deleted in Gmail after we saw it';
+    for history this is HistoryExpiredError."""
+
+
+class HistoryExpiredError(NotFoundError):
+    """404 on history.list with startHistoryId — Gmail only retains history
+    for a limited window (~7 days). Triggers date-based backfill fallback."""
+
+
+class PermanentError(GmailError):
+    """400 or other permanent failure. Skip and log; do not retry."""
+
+
+def classify_http(status: int, payload: object) -> GmailError:
+    """Map a Gmail API response to the appropriate exception type.
+
+    `payload` is the decoded JSON body if any; used to distinguish rate-limit
+    403s from pure auth 403s via the `reason` field Google returns.
+    """
+    reason = ""
+    if isinstance(payload, dict):
+        try:
+            errs = payload.get("error", {}).get("errors") or []
+            if errs:
+                reason = str(errs[0].get("reason", ""))
+        except Exception:  # pragma: no cover — defensive
+            pass
+
+    if status == 429:
+        return RateLimitError(f"rate limited: {reason}", status=status, payload=payload)
+    if status == 403:
+        if reason in ("rateLimitExceeded", "userRateLimitExceeded", "quotaExceeded"):
+            return RateLimitError(f"quota: {reason}", status=status, payload=payload)
+        return AuthError(f"forbidden: {reason}", status=status, payload=payload)
+    if status == 401:
+        return AuthError("unauthorized", status=status, payload=payload)
+    if status == 404:
+        return NotFoundError("not found", status=status, payload=payload)
+    if 500 <= status < 600:
+        return TransientError(f"server error {status}", status=status, payload=payload)
+    if 400 <= status < 500:
+        return PermanentError(f"client error {status}: {reason}", status=status, payload=payload)
+    return GmailError(f"unexpected status {status}", status=status, payload=payload)
+
+
+RETRYABLE = (RateLimitError, TransientError)
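The retry decision encoded by classify_http() and RETRYABLE can be condensed into a single predicate. This is a standalone sketch under stated assumptions: `first_reason` and `should_retry` are hypothetical names that do not exist in the module, and the payload shape is the `error.errors[0].reason` layout that classify_http() reads.

```python
from typing import Any

# Same reason strings that classify_http() treats as rate limiting.
RATE_LIMIT_REASONS = {"rateLimitExceeded", "userRateLimitExceeded", "quotaExceeded"}


def first_reason(payload: Any) -> str:
    """Pull error.errors[0].reason out of a decoded body, or '' if absent."""
    if not isinstance(payload, dict):
        return ""
    error = payload.get("error")
    if not isinstance(error, dict):
        return ""
    errs = error.get("errors") or []
    if errs and isinstance(errs[0], dict):
        return str(errs[0].get("reason", ""))
    return ""


def should_retry(status: int, payload: Any) -> bool:
    """True for statuses the taxonomy maps to RateLimitError or TransientError."""
    if status == 429 or 500 <= status < 600:
        return True
    # A 403 is retryable only when the reason marks it as a quota problem;
    # every other 403 is an auth failure that needs an operator.
    return status == 403 and first_reason(payload) in RATE_LIMIT_REASONS


quota_403 = {"error": {"errors": [{"reason": "rateLimitExceeded"}]}}
auth_403 = {"error": {"errors": [{"reason": "insufficientPermissions"}]}}
```

The explicit isinstance checks play the role of the broad `except Exception` in classify_http(): a malformed body yields an empty reason and never raises.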
@@ -0,0 +1,249 @@
+"""
+Thin Gmail API wrapper.
+
+Responsibilities:
+- HTTPS calls to https://gmail.googleapis.com/gmail/v1/users/me/*
+- Per-account access-token injection via CredentialProvider
+- Rate limiting via token bucket
+- Retry loop with exponential backoff + jitter for RETRYABLE errors
+- Batch requests for metadata fetches (multipart/mixed) — sketch provided
+- Call-count accounting for observability (plumbed to email_sync_runs)
+
+We call Gmail over raw urllib instead of the google-api-python-client to keep
+the dependency surface small. If you prefer the Google SDK, replace _call()
+with client calls; everything else is independent.
+"""
+
+import json
+import random
+import threading
+import time
+import urllib.error
+import urllib.parse
+import urllib.request
+from dataclasses import dataclass, field
+from typing import Any, Iterator, Optional
+
+from . import config as _cfg
+from . import errors
+
+
+BASE = "https://gmail.googleapis.com/gmail/v1/users"
+
+
+# ---------------------------------------------------------------------------- token bucket
+
+class _TokenBucket:
+    """Simple per-account rate limiter. Call wait(cost) before each API call."""
+
+    def __init__(self, units_per_sec: int, burst: Optional[int] = None):
+        self._rate = float(units_per_sec)
+        self._burst = float(burst if burst is not None else units_per_sec)
+        self._tokens = self._burst
+        self._last = time.monotonic()
+        self._lock = threading.Lock()
+
+    def wait(self, cost: float) -> None:
+        while True:
+            with self._lock:
+                now = time.monotonic()
+                self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
+                self._last = now
+                if self._tokens >= cost:
+                    self._tokens -= cost
+                    return
+                needed = cost - self._tokens
+                sleep_for = needed / self._rate
+            time.sleep(sleep_for)
+
+
+# ---------------------------------------------------------------------------- call stats
+
+@dataclass
+class CallStats:
+    api_calls: int = 0
+    retries: int = 0
+    bytes_in: int = 0
+    last_errors: list[str] = field(default_factory=list)
+
+
+# ---------------------------------------------------------------------------- client
+
+class GmailClient:
+    """Per-account Gmail client. Bind one instance per sync run."""
+
+    def __init__(self, credential_provider, email_address: str, stats: Optional[CallStats] = None):
+        self._creds = credential_provider
+        self._email = email_address
+        self._bucket = _TokenBucket(units_per_sec=_cfg.CONFIG.rate_limit_units_per_sec_per_account)
+        self.stats = stats or CallStats()
+
+    # -------------------------------------------------------------- messages.*
+
+    def list_messages(self, *, q: str = "", page_token: Optional[str] = None,
+                      max_results: int = 500, label_ids: Optional[list[str]] = None) -> dict:
+        """https://developers.google.com/gmail/api/reference/rest/v1/users.messages/list"""
+        params = {"maxResults": str(max_results)}
+        if q:
+            params["q"] = q
+        if page_token:
+            params["pageToken"] = page_token
+        if label_ids:
+            # Sent as repeated labelIds params; _call() encodes lists with doseq=True.
+            params["labelIds"] = list(label_ids)
+        return self._get("/messages", params=params, cost=5)
+
+    def get_message(self, message_id: str, *, format: str = "metadata",
+                    metadata_headers: Optional[list[str]] = None) -> dict:
+        params = {"format": format}
+        if format == "metadata" and metadata_headers:
+            params["metadataHeaders"] = metadata_headers
+        return self._get(f"/messages/{message_id}", params=params, cost=5)
+
+    def get_attachment(self, message_id: str, attachment_id: str) -> dict:
+        return self._get(
+            f"/messages/{message_id}/attachments/{attachment_id}",
+            params=None,
+            cost=5,
+        )
+
+    # -------------------------------------------------------------- history.*
+
+    def list_history(self, *, start_history_id: str, page_token: Optional[str] = None,
+                     history_types: Optional[list[str]] = None) -> dict:
+        params = {"startHistoryId": start_history_id, "maxResults": "500"}
+        if page_token:
+            params["pageToken"] = page_token
+        if history_types:
+            params["historyTypes"] = history_types
+        try:
+            return self._get("/history", params=params, cost=2)
+        except errors.NotFoundError as e:
+            # Gmail returns 404 when startHistoryId is too old. Wrap for callers.
+            raise errors.HistoryExpiredError(
+                "startHistoryId no longer available", status=404, payload=getattr(e, "payload", None)
+            ) from e
+
+    # -------------------------------------------------------------- profile
+
+    def get_profile(self) -> dict:
+        return self._get("/profile", params=None, cost=1)
+
+    # -------------------------------------------------------------- iteration helpers
+
+    def iter_messages(self, *, q: str = "") -> Iterator[dict]:
+        page_token: Optional[str] = None
+        while True:
+            resp = self.list_messages(q=q, page_token=page_token,
+                                      max_results=_cfg.CONFIG.backfill_page_size)
+            for m in resp.get("messages") or []:
+                yield m
+            page_token = resp.get("nextPageToken")
+            if not page_token:
+                return
+
+    def iter_history(self, *, start_history_id: str,
+                     history_types: Optional[list[str]] = None) -> Iterator[dict]:
+        page_token: Optional[str] = None
+        while True:
+            resp = self.list_history(
+                start_history_id=start_history_id,
+                page_token=page_token,
+                history_types=history_types,
+            )
+            for h in resp.get("history") or []:
+                yield h
+            page_token = resp.get("nextPageToken")
+            if not page_token:
+                # Cache final historyId for caller to checkpoint.
+                self._last_history_id = resp.get("historyId")
+                return
+
+    @property
+    def last_history_id(self) -> Optional[str]:
+        return getattr(self, "_last_history_id", None)
+
+    # -------------------------------------------------------------- internals
+
+    def _get(self, path: str, *, params: Optional[dict], cost: float) -> dict:
+        return self._with_retry(lambda: self._call("GET", path, params=params, cost=cost))
+
+    def _call(self, method: str, path: str, *, params: Optional[dict] = None,
+              body: Optional[bytes] = None, cost: float = 1.0) -> dict:
+        self._bucket.wait(cost)
+        self.stats.api_calls += 1
+
+        qs = ""
+        if params:
+            # urllib.parse.urlencode with doseq=True handles repeated params
+            # like metadataHeaders=Foo&metadataHeaders=Bar correctly.
+            qs = "?" + urllib.parse.urlencode(params, doseq=True)
+        url = f"{BASE}/me{path}{qs}"
+
+        token = self._creds.access_token_for(self._email)
+        req = urllib.request.Request(url, method=method, data=body)
+        req.add_header("Authorization", f"Bearer {token.token}")
+        req.add_header("Accept", "application/json")
+        if body:
+            req.add_header("Content-Type", "application/json")
+
+        try:
+            with urllib.request.urlopen(req, timeout=30) as resp:
+                raw = resp.read()
+                self.stats.bytes_in += len(raw)
+                if not raw:
+                    return {}
+                return json.loads(raw)
+        except urllib.error.HTTPError as e:
+            raw = e.read() or b""
+            self.stats.bytes_in += len(raw)
+            try:
+                payload = json.loads(raw) if raw else {}
+            except Exception:
+                payload = {"raw": raw.decode("utf-8", errors="replace")}
+            err = errors.classify_http(e.code, payload)
+            # short-message logging hook (redacted of tokens by design)
+            self.stats.last_errors.append(f"{e.code} {type(err).__name__}")
+            self.stats.last_errors = self.stats.last_errors[-10:]
+            raise err
+        except (urllib.error.URLError, TimeoutError) as e:
+            raise errors.TransientError(f"network error: {e}") from e
+
+    def _with_retry(self, fn):
+        cfg = _cfg.CONFIG
+        attempts = 0
+        delay = cfg.retry_initial_delay_sec
+        while True:
+            try:
+                return fn()
+            except errors.RETRYABLE as e:
+                attempts += 1
+                if attempts >= cfg.retry_max_attempts:
+                    raise
+                self.stats.retries += 1
+                # Full jitter
+                sleep_for = random.uniform(0, min(delay, cfg.retry_max_delay_sec))
+                time.sleep(sleep_for)
+                delay = min(delay * 2, cfg.retry_max_delay_sec)
+            # Non-retryable errors propagate immediately.
+
+
+# ---------------------------------------------------------------------------- batch fetch sketch
+
+def batch_get_metadata(client: GmailClient, message_ids: list[str],
+                       headers: list[str]) -> dict[str, dict]:
+    """Fetch metadata for up to ~100 messages.
+
+    TODO: implement using Gmail's multipart/mixed batch endpoint at
+    https://www.googleapis.com/batch/gmail/v1 for efficiency. In the scaffold
+    we fall back to serial gets so the logic is correct from day 1.
+    """
+    out: dict[str, dict] = {}
+    for mid in message_ids:
+        try:
+            out[mid] = client.get_message(mid, format="metadata", metadata_headers=headers)
+        except errors.NotFoundError:
+            # Message deleted between list and get — skip.
+            continue
+    return out
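The backoff schedule in _with_retry() is easier to check with the clock and the random source injected. The sketch below is an illustration and not the client's code: `Retryable`, `with_retry`, and the default delays are hypothetical stand-ins for errors.RETRYABLE and the CONFIG values, which are not shown in this file.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Retryable(Exception):
    """Stand-in for the RETRYABLE error classes."""


def with_retry(fn: Callable[[], T], *, max_attempts: int = 5,
               initial_delay: float = 1.0, max_delay: float = 60.0,
               sleep: Callable[[float], None] = time.sleep,
               rng: Callable[[float, float], float] = random.uniform) -> T:
    """Call fn until it succeeds, sleeping a jittered, doubling delay between tries."""
    attempts = 0
    delay = initial_delay
    while True:
        try:
            return fn()
        except Retryable:
            attempts += 1
            if attempts >= max_attempts:
                raise
            # Full jitter: sleep anywhere in [0, current ceiling].
            sleep(rng(0, min(delay, max_delay)))
            delay = min(delay * 2, max_delay)


# Deterministic run: record sleeps instead of waiting, and always pick the
# top of the jitter range so the ceiling sequence is visible.
slept: list[float] = []
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 4:
        raise Retryable("try again")
    return "ok"


result = with_retry(flaky, sleep=slept.append, rng=lambda lo, hi: hi)
```

With the upper bound forced, the recorded sleeps are the ceilings 1, 2, 4; under real jitter each sleep falls somewhere between zero and that ceiling, which spreads out retries from accounts that were throttled at the same moment.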
@@ -0,0 +1,215 @@
+"""
+Investor matching.
+
+Builds an in-memory index of investor email addresses from:
+  - fundraising_contacts.email
+  - contacts.email
+  - organizations.email + organizations.website (domain only)
+
+For each synced email, returns a list of investor links. Exact-email matches
+beat domain matches; if any exact match exists, domain matches are suppressed.
+
+The index is rebuilt every `REFRESH_INTERVAL_SEC` or on demand via rebuild().
+"""
+
+import re
+import threading
+import time
+from dataclasses import dataclass
+from typing import Optional
+
+
+REFRESH_INTERVAL_SEC = 900  # 15 minutes
+
+# Domains we never domain-match against (personal mailboxes).
+COMMON_PERSONAL_DOMAINS = {
+    "gmail.com", "googlemail.com",
+    "outlook.com", "hotmail.com", "live.com", "msn.com",
+    "yahoo.com", "yahoo.co.uk", "ymail.com",
+    "icloud.com", "me.com", "mac.com",
+    "aol.com", "proton.me", "protonmail.com",
+    "pm.me", "fastmail.com", "tuta.io", "hey.com",
+    "duck.com", "zoho.com",
+}
+
+
+# Also skip matching on the team's own domain (they email each other).
+# Populated from CONFIG.workspace_domain at rebuild time.
+
+
+@dataclass
+class MatchTarget:
+    fundraising_investor_id: Optional[str] = None
+    fundraising_contact_id: Optional[str] = None
+    contact_id: Optional[str] = None
+    organization_id: Optional[str] = None
+    investor_name: Optional[str] = None
+
+
+@dataclass
+class InvestorLink:
+    matched_address: str
+    match_kind: str          # exact_email | domain_match | manual
+    match_confidence: float
+    target: MatchTarget
+
+
+class InvestorIndex:
+
+    def __init__(self, own_domain: Optional[str] = None):
+        self._email_index: dict[str, MatchTarget] = {}
+        self._domain_index: dict[str, list[MatchTarget]] = {}
+        self._own_domain = (own_domain or "").lower() or None
+        self._last_built = 0.0
+        self._lock = threading.Lock()
+
+    # ------------------------------------------------------------------ build
+
+    def rebuild(self, db_conn_factory) -> None:
+        with self._lock:
+            email_idx: dict[str, MatchTarget] = {}
+            domain_idx: dict[str, list[MatchTarget]] = {}
+
+            conn = db_conn_factory()
+            try:
+                cur = conn.cursor()
+
+                # fundraising_contacts
+                cur.execute(
+                    "SELECT fc.id, fc.email, fc.investor_id, fi.investor_name "
+                    "FROM fundraising_contacts fc "
+                    "LEFT JOIN fundraising_investors fi ON fi.id = fc.investor_id "
+                    "WHERE fc.email IS NOT NULL AND fc.email != ''"
+                )
+                for r in cur.fetchall():
+                    addr = (r["email"] or "").lower().strip()
+                    if not _valid_email(addr):
+                        continue
+                    email_idx[addr] = MatchTarget(
+                        fundraising_contact_id=r["id"],
+                        fundraising_investor_id=r["investor_id"],
+                        investor_name=r["investor_name"],
+                    )
+
+                # contacts
+                cur.execute(
+                    "SELECT id, email, organization_id FROM contacts "
+                    "WHERE email IS NOT NULL AND email != ''"
+                )
+                for r in cur.fetchall():
+                    addr = (r["email"] or "").lower().strip()
+                    if not _valid_email(addr):
+                        continue
+                    # Don't overwrite a fundraising_contact match; they're higher signal.
+                    email_idx.setdefault(addr, MatchTarget(
+                        contact_id=r["id"],
+                        organization_id=r["organization_id"],
+                    ))
+
+                # organizations — domain-only match source
+                cur.execute(
+                    "SELECT id, name, email, website FROM organizations "
+                    "WHERE (email IS NOT NULL AND email != '') OR (website IS NOT NULL AND website != '')"
+                )
+                for r in cur.fetchall():
+                    for d in _domains_for_org(r):
+                        if d in COMMON_PERSONAL_DOMAINS:
+                            continue
+                        if self._own_domain and d == self._own_domain:
+                            continue
+                        domain_idx.setdefault(d, []).append(MatchTarget(
+                            organization_id=r["id"],
+                            investor_name=r["name"],
+                        ))
+            finally:
+                conn.close()
+
+            self._email_index = email_idx
+            self._domain_index = domain_idx
+            self._last_built = time.time()
+
+    def rebuild_if_stale(self, db_conn_factory) -> None:
+        if time.time() - self._last_built > REFRESH_INTERVAL_SEC:
+            self.rebuild(db_conn_factory)
+
+    # ------------------------------------------------------------------ query
+
+    def match(self, addresses: set[str], *,
+              exclude_addresses: Optional[set[str]] = None) -> list[InvestorLink]:
+        excl = {a.lower().strip() for a in (exclude_addresses or set()) if a}
+        candidates = {a.lower().strip() for a in addresses if a} - excl
+
+        # Exclude own domain addresses (teammates emailing each other).
+        if self._own_domain:
+            candidates = {a for a in candidates
+                          if not a.endswith("@" + self._own_domain)}
+
+        links: list[InvestorLink] = []
+        seen_targets: set[tuple] = set()
+
+        # Exact email matches first.
+        for addr in candidates:
+            t = self._email_index.get(addr)
+            if t:
+                key = (t.fundraising_contact_id, t.contact_id)
+                if key in seen_targets:
+                    continue
+                seen_targets.add(key)
+                links.append(InvestorLink(
+                    matched_address=addr,
+                    match_kind="exact_email",
+                    match_confidence=1.0,
+                    target=t,
+                ))
+
+        if links:  # exact hits short-circuit domain matching
+            return links
+
+        # Domain fallback.
+        for addr in candidates:
+            _, _, domain = addr.partition("@")
+            if not domain or domain in COMMON_PERSONAL_DOMAINS:
+                continue
+            for t in self._domain_index.get(domain, []):
+                key = ("org", t.organization_id)
+                if key in seen_targets:
+                    continue
+                seen_targets.add(key)
+                links.append(InvestorLink(
+                    matched_address=addr,
+                    match_kind="domain_match",
+                    match_confidence=0.6,
+                    target=t,
+                ))
+        return links
+
+
+# ---------------------------------------------------------------------------- helpers
+
+_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
+
+
+def _valid_email(s: str) -> bool:
+    return bool(_EMAIL_RE.match(s))
+
+
+def _domains_for_org(row) -> list[str]:
+    out: list[str] = []
+    if row["email"]:
+        _, _, d = row["email"].lower().strip().partition("@")
+        if d:
+            out.append(d)
+    if row["website"]:
+        d = _domain_from_url(row["website"])
+        if d:
+            out.append(d)
+    return list({d for d in out if d})
+
+
+def _domain_from_url(url: str) -> Optional[str]:
+    if not url:
+        return None
+    m = re.match(r"^\s*(?:https?://)?(?:www\.)?([^/:?#\s]+)", url.strip(), re.IGNORECASE)
+    if not m:
+        return None
+    return m.group(1).lower()
@@ -0,0 +1,192 @@
+-- Gmail Integration — Phase 1 migration
+-- Creates all tables for email capture, matching, threading, attachments.
+-- This migration is IDEMPOTENT: safe to re-run.
+-- Applied by email_integration.db.apply_migrations() on server startup when
+-- CRM_GMAIL_INTEGRATION_ENABLED is truthy.
+--
+-- DO NOT modify this file in place after it ships. Create 0002_*.sql, etc.
+
+-- ============================================================================
+-- email_accounts — one row per enrolled team-member mailbox
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS email_accounts (
+    id                TEXT PRIMARY KEY,
+    user_id           TEXT NOT NULL,
+    email_address     TEXT NOT NULL UNIQUE,
+    auth_method       TEXT NOT NULL,                 -- 'dwd' | 'oauth'
+    oauth_refresh_enc BLOB,
+    oauth_token_enc   BLOB,
+    oauth_token_exp   TEXT,
+    sync_enabled      INTEGER NOT NULL DEFAULT 1,
+    sync_status       TEXT NOT NULL DEFAULT 'pending',
+    sync_error        TEXT,
+    last_history_id   TEXT,
+    last_synced_at    TEXT,
+    backfill_complete INTEGER NOT NULL DEFAULT 0,
+    backfill_cursor   TEXT,
+    created_at        TEXT DEFAULT (datetime('now')),
+    updated_at        TEXT DEFAULT (datetime('now')),
+    FOREIGN KEY(user_id) REFERENCES users(id)
+);
+CREATE INDEX IF NOT EXISTS idx_email_accounts_user ON email_accounts(user_id);
+CREATE INDEX IF NOT EXISTS idx_email_accounts_sync ON email_accounts(sync_enabled, sync_status);
+
+-- ============================================================================
+-- emails — canonical email record, dedup'd across accounts by RFC Message-ID
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS emails (
+    id                 TEXT PRIMARY KEY,
+    rfc_message_id     TEXT NOT NULL UNIQUE,
+    gmail_thread_id    TEXT,
+    rfc_thread_root_id TEXT,
+    thread_id          TEXT,                         -- FK email_threads.id (populated by threads.py)
+    subject            TEXT,
+    from_email         TEXT NOT NULL,
+    from_name          TEXT,
+    to_emails_json     TEXT NOT NULL DEFAULT '[]',
+    cc_emails_json     TEXT NOT NULL DEFAULT '[]',
+    bcc_emails_json    TEXT NOT NULL DEFAULT '[]',
+    reply_to           TEXT,
+    sent_at            TEXT NOT NULL,
+    body_text          TEXT,
+    body_html          TEXT,
+    snippet            TEXT,
+    in_reply_to        TEXT,
+    references_json    TEXT DEFAULT '[]',
+    has_attachments    INTEGER NOT NULL DEFAULT 0,
+    size_estimate      INTEGER,
+    is_matched         INTEGER NOT NULL DEFAULT 0,
+    match_status       TEXT NOT NULL DEFAULT 'unmatched',   -- unmatched|matched|skipped
+    raw_headers_json   TEXT,
+    created_at         TEXT DEFAULT (datetime('now')),
+    updated_at         TEXT DEFAULT (datetime('now'))
+);
+CREATE INDEX IF NOT EXISTS idx_emails_thread      ON emails(gmail_thread_id);
+CREATE INDEX IF NOT EXISTS idx_emails_rfc_thread  ON emails(rfc_thread_root_id);
+CREATE INDEX IF NOT EXISTS idx_emails_thread_fk   ON emails(thread_id);
+CREATE INDEX IF NOT EXISTS idx_emails_from        ON emails(from_email);
+CREATE INDEX IF NOT EXISTS idx_emails_sent_at     ON emails(sent_at);
+CREATE INDEX IF NOT EXISTS idx_emails_matched     ON emails(is_matched, sent_at);
+CREATE INDEX IF NOT EXISTS idx_emails_in_reply_to ON emails(in_reply_to);
+
+-- ============================================================================
+-- email_recipients — denormalized for fast address lookups
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS email_recipients (
+    id           TEXT PRIMARY KEY,
+    email_id     TEXT NOT NULL,
+    address      TEXT NOT NULL,
+    display_name TEXT,
+    kind         TEXT NOT NULL,              -- from|to|cc|bcc|reply_to
+    FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
+);
+CREATE INDEX IF NOT EXISTS idx_email_recipients_addr  ON email_recipients(address);
+CREATE INDEX IF NOT EXISTS idx_email_recipients_email ON email_recipients(email_id);
+
+-- ============================================================================
+-- email_account_messages — per-mailbox sighting of an email
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS email_account_messages (
+    id                TEXT PRIMARY KEY,
+    email_id          TEXT NOT NULL,
+    account_id        TEXT NOT NULL,
+    gmail_message_id  TEXT NOT NULL,
+    gmail_thread_id   TEXT NOT NULL,
+    labels_json       TEXT DEFAULT '[]',
+    is_sent           INTEGER NOT NULL DEFAULT 0,
+    first_seen_at     TEXT DEFAULT (datetime('now')),
+    deleted_at        TEXT,
+    FOREIGN KEY(email_id)   REFERENCES emails(id) ON DELETE CASCADE,
+    FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE,
+    UNIQUE(account_id, gmail_message_id)
+);
+CREATE INDEX IF NOT EXISTS idx_eam_email     ON email_account_messages(email_id);
+CREATE INDEX IF NOT EXISTS idx_eam_account   ON email_account_messages(account_id);
+CREATE INDEX IF NOT EXISTS idx_eam_gmail_msg ON email_account_messages(gmail_message_id);
+
+-- ============================================================================
+-- email_attachments — metadata; bytes on disk under data/email_attachments/
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS email_attachments (
+    id                  TEXT PRIMARY KEY,
+    email_id            TEXT NOT NULL,
+    gmail_attachment_id TEXT NOT NULL,
+    filename            TEXT NOT NULL,
+    sanitized_filename  TEXT NOT NULL,
+    mime_type           TEXT,
+    size_bytes          INTEGER,
+    sha256_hex          TEXT,
+    storage_path        TEXT NOT NULL,
+    download_status     TEXT NOT NULL DEFAULT 'pending',  -- pending|downloaded|failed|skipped
+    download_attempts   INTEGER NOT NULL DEFAULT 0,
+    download_error      TEXT,
+    downloaded_at       TEXT,
+    created_at          TEXT DEFAULT (datetime('now')),
+    FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
+);
+CREATE INDEX IF NOT EXISTS idx_attach_email  ON email_attachments(email_id);
+CREATE INDEX IF NOT EXISTS idx_attach_sha    ON email_attachments(sha256_hex);
+CREATE INDEX IF NOT EXISTS idx_attach_status ON email_attachments(download_status);
+
+-- ============================================================================
+-- email_threads — thread roll-up for UI
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS email_threads (
+    id                 TEXT PRIMARY KEY,
+    gmail_thread_id    TEXT,
+    rfc_thread_root_id TEXT,
+    subject_normalized TEXT,
+    first_message_at   TEXT,
+    last_message_at    TEXT,
+    message_count      INTEGER NOT NULL DEFAULT 0,
+    participant_count  INTEGER NOT NULL DEFAULT 0,
+    participants_json  TEXT DEFAULT '[]',
+    is_matched         INTEGER NOT NULL DEFAULT 0,
+    created_at         TEXT DEFAULT (datetime('now')),
+    updated_at         TEXT DEFAULT (datetime('now'))
+);
+CREATE UNIQUE INDEX IF NOT EXISTS idx_threads_gmail_uniq ON email_threads(gmail_thread_id)
+    WHERE gmail_thread_id IS NOT NULL;
+CREATE INDEX IF NOT EXISTS idx_threads_rfc_root ON email_threads(rfc_thread_root_id);
+CREATE INDEX IF NOT EXISTS idx_threads_last_msg ON email_threads(last_message_at);
+
+-- ============================================================================
+-- email_investor_links — matched investors
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS email_investor_links (
+    id                      TEXT PRIMARY KEY,
+    email_id                TEXT NOT NULL,
+    fundraising_investor_id TEXT,
+    fundraising_contact_id  TEXT,
+    contact_id              TEXT,
+    organization_id         TEXT,
+    matched_address         TEXT NOT NULL,
+    match_kind              TEXT NOT NULL,            -- exact_email|domain_match|manual
+    match_confidence        REAL NOT NULL DEFAULT 1.0,
+    created_at              TEXT DEFAULT (datetime('now')),
+    FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
+);
+CREATE INDEX IF NOT EXISTS idx_eil_email      ON email_investor_links(email_id);
+CREATE INDEX IF NOT EXISTS idx_eil_investor   ON email_investor_links(fundraising_investor_id);
+CREATE INDEX IF NOT EXISTS idx_eil_fr_contact ON email_investor_links(fundraising_contact_id);
+CREATE INDEX IF NOT EXISTS idx_eil_contact    ON email_investor_links(contact_id);
+
+-- ============================================================================
+-- email_sync_runs — per-run observability
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS email_sync_runs (
+    id                TEXT PRIMARY KEY,
+    account_id        TEXT NOT NULL,
+    kind              TEXT NOT NULL,                  -- backfill|incremental|manual
+    started_at        TEXT NOT NULL,
+    finished_at       TEXT,
+    status            TEXT NOT NULL,                  -- running|ok|error|partial
+    messages_seen     INTEGER NOT NULL DEFAULT 0,
+    messages_stored   INTEGER NOT NULL DEFAULT 0,
+    attachments_saved INTEGER NOT NULL DEFAULT 0,
+    api_calls         INTEGER NOT NULL DEFAULT 0,
+    retries           INTEGER NOT NULL DEFAULT 0,
+    error             TEXT,
+    FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE
+);
+CREATE INDEX IF NOT EXISTS idx_sync_runs_account ON email_sync_runs(account_id, started_at);
@@ -0,0 +1,283 @@
+"""
+Parse a Gmail `users.messages.get` response (format=full) into a flat dict
+ready for db.insert_email().
+
+Input shape (abbreviated):
+  {
+    "id": "...",                     # Gmail message id
+    "threadId": "...",
+    "labelIds": ["INBOX","IMPORTANT",...],
+    "snippet": "...",
+    "historyId": "...",
+    "internalDate": "1713657600000", # ms epoch, authoritative
+    "sizeEstimate": 12345,
+    "payload": {
+      "headers": [{"name":"Subject","value":"..."}, ...],
+      "mimeType": "multipart/mixed",
+      "parts": [...recursive...],
+      "body": {"data": "<base64url>", "size": ...}
+    }
+  }
+"""
+
+import base64
+import email.utils
+import email.header
+import re
+from datetime import datetime, timezone
+from typing import Any, Iterable, Optional
+from html.parser import HTMLParser
+
+
+# ---------------------------------------------------------------------------- public
+
+def parse(message: dict, *, owning_account_address: Optional[str] = None) -> dict:
+    """Parse a Gmail message payload into our canonical dict shape."""
+    headers = _header_map(message.get("payload", {}).get("headers") or [])
+
+    from_name, from_email = _split_addr(headers.get("from", ""))
+    to_list = _parse_address_list(headers.get("to", ""))
+    cc_list = _parse_address_list(headers.get("cc", ""))
+    bcc_list = _parse_address_list(headers.get("bcc", ""))
+    reply_to = _split_addr(headers.get("reply-to", ""))[1] or None
+
+    sent_at = _parse_date_header(headers.get("date"), fallback_ms=message.get("internalDate"))
+
+    rfc_mid = headers.get("message-id", "").strip() or f"synthetic-{message.get('id')}@ten31.local"
+    rfc_mid = _strip_angle_brackets(rfc_mid)
+    in_reply_to = _strip_angle_brackets(headers.get("in-reply-to", "").strip()) or None
+    references = _split_references(headers.get("references", ""))
+    rfc_thread_root_id = references[0] if references else (in_reply_to or rfc_mid)
+
+    body_text, body_html, attachments = _walk_payload(message.get("payload", {}))
+
+    subject = _decode_rfc2047(headers.get("subject") or "")
+
+    labels = message.get("labelIds") or []
+    is_sent = "SENT" in labels
+
+    return {
+        "gmail_message_id": message.get("id"),
+        "gmail_thread_id": message.get("threadId"),
+        "rfc_message_id": rfc_mid,
+        "rfc_thread_root_id": rfc_thread_root_id,
+        "in_reply_to": in_reply_to,
+        "references": references,
+        "subject": subject,
+        "from_email": (from_email or "").lower(),
+        "from_name": from_name,
+        "to": [{"email": e.lower(), "name": n} for n, e in to_list if e],
+        "cc": [{"email": e.lower(), "name": n} for n, e in cc_list if e],
+        "bcc": [{"email": e.lower(), "name": n} for n, e in bcc_list if e],
+        "reply_to": reply_to.lower() if reply_to else None,
+        "sent_at": sent_at,
+        "body_text": _cap_text(body_text),
+        "body_html": _cap_text(body_html),
+        "snippet": message.get("snippet"),
+        "attachments": attachments,
+        "size_estimate": message.get("sizeEstimate"),
+        "labels": labels,
+        "is_sent": is_sent,
+        "raw_headers": headers,
+        "owning_account": owning_account_address,
+    }
+
+
+# ---------------------------------------------------------------------------- headers
+
+def _header_map(header_list: Iterable[dict]) -> dict[str, str]:
+    """Lowercase header names; when a header repeats (e.g. Received), the last value wins."""
+    out: dict[str, str] = {}
+    for h in header_list:
+        name = (h.get("name") or "").lower()
+        out[name] = h.get("value") or ""
+    return out
+
+
+def _decode_rfc2047(s: str) -> str:
+    if not s:
+        return ""
+    try:
+        parts = email.header.decode_header(s)
+        pieces = []
+        for text, charset in parts:
+            if isinstance(text, bytes):
+                try:
+                    pieces.append(text.decode(charset or "utf-8", errors="replace"))
+                except LookupError:
+                    pieces.append(text.decode("utf-8", errors="replace"))
+            else:
+                pieces.append(text)
+        return "".join(pieces)
+    except Exception:
+        return s
+
+
+def _split_addr(raw: str) -> tuple[Optional[str], Optional[str]]:
+    if not raw:
+        return (None, None)
+    name, addr = email.utils.parseaddr(raw)
+    return (_decode_rfc2047(name) or None, addr or None)
+
+
+def _parse_address_list(raw: str) -> list[tuple[Optional[str], Optional[str]]]:
+    if not raw:
+        return []
+    parsed = email.utils.getaddresses([raw])
+    return [(_decode_rfc2047(n) or None, a or None) for n, a in parsed if a]
+
+
+def _parse_date_header(raw: Optional[str], *, fallback_ms: Optional[str]) -> str:
+    # Prefer RFC Date header, fall back to Gmail internalDate (epoch ms).
+    if raw:
+        try:
+            dt = email.utils.parsedate_to_datetime(raw)
+            if dt is not None:
+                if dt.tzinfo is None:
+                    dt = dt.replace(tzinfo=timezone.utc)
+                return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+        except (TypeError, ValueError):
+            pass
+    if fallback_ms:
+        try:
+            dt = datetime.fromtimestamp(int(fallback_ms) / 1000.0, tz=timezone.utc)
+            return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
+        except (TypeError, ValueError):
+            pass
+    return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+
+
+def _split_references(raw: str) -> list[str]:
+    if not raw:
+        return []
+    return [_strip_angle_brackets(p) for p in raw.split() if p.strip()]
+
+
+def _strip_angle_brackets(s: str) -> str:
+    s = (s or "").strip()
+    if s.startswith("<") and s.endswith(">"):
+        return s[1:-1]
+    return s
+
+
+# ---------------------------------------------------------------------------- MIME walk
+
+def _walk_payload(payload: dict) -> tuple[Optional[str], Optional[str], list[dict]]:
+    """Returns (body_text, body_html, attachments).
+
+    Depth-first walk. First text/plain wins for body_text; first text/html
+    wins for body_html. Anything with a filename or attachment disposition
+    becomes an attachment entry.
+    """
+    text: Optional[str] = None
+    html_body: Optional[str] = None
+    attachments: list[dict] = []
+
+    def visit(part: dict):
+        nonlocal text, html_body
+        mime = (part.get("mimeType") or "").lower()
+        filename = part.get("filename") or ""
+        body = part.get("body") or {}
+        parts = part.get("parts") or []
+
+        headers = _header_map(part.get("headers") or [])
+        disposition = (headers.get("content-disposition") or "").lower()
+        is_attachment = bool(filename) or disposition.startswith("attachment")
+
+        if is_attachment:
+            attachments.append({
+                "filename": filename or f"unnamed.{_ext_for(mime)}",
+                "mime_type": mime or "application/octet-stream",
+                "size": body.get("size"),
+                "gmail_attachment_id": body.get("attachmentId"),
+                # Some tiny attachments come inlined as base64; attachmentId is
+                # then missing and data is in body.data. sync.py handles both.
+                "inline_data_b64": body.get("data"),
+                "content_disposition": "inline" if disposition.startswith("inline") else "attachment",
+            })
+        else:
+            if mime == "text/plain" and text is None:
+                text = _decode_body(body)
+            elif mime == "text/html" and html_body is None:
+                html_body = _decode_body(body)
+
+        for child in parts:
+            visit(child)
+
+    visit(payload)
+
+    # Derive a plain-text body from HTML if only HTML exists.
+    if text is None and html_body:
+        text = _strip_html(html_body)
+
+    return text, html_body, attachments
+
+
+def _decode_body(body: dict) -> Optional[str]:
+    data = body.get("data")
+    if not data:
+        return None
+    try:
+        padding = 4 - (len(data) % 4)
+        if padding != 4:
+            data = data + ("=" * padding)
+        raw = base64.urlsafe_b64decode(data.encode("ascii"))
+        return raw.decode("utf-8", errors="replace").replace("\r\n", "\n")
+    except Exception:
+        return None
+
+
+# ---------------------------------------------------------------------------- HTML stripping
+
+class _HTMLToText(HTMLParser):
+    def __init__(self):
+        super().__init__()
+        self._parts: list[str] = []
+        self._skip_depth = 0
+
+    def handle_starttag(self, tag, attrs):
+        if tag in ("script", "style"):
+            self._skip_depth += 1
+        if tag in ("br", "p", "div", "tr", "li"):
+            self._parts.append("\n")
+
+    def handle_endtag(self, tag):
+        if tag in ("script", "style"):
+            self._skip_depth = max(0, self._skip_depth - 1)
+        if tag in ("p", "div", "tr"):
+            self._parts.append("\n")
+
+    def handle_data(self, data):
+        if self._skip_depth == 0:
+            self._parts.append(data)
+
+    def text(self) -> str:
+        raw = "".join(self._parts)
+        return re.sub(r"\n{3,}", "\n\n", raw).strip()
+
+
+def _strip_html(html: str) -> str:
+    p = _HTMLToText()
+    try:
+        p.feed(html)
+        return p.text()
+    except Exception:
+        return re.sub(r"<[^>]+>", " ", html)
+
+
+def _ext_for(mime: str) -> str:
+    return mime.split("/")[-1] if "/" in mime else "bin"
+
+
+# ---------------------------------------------------------------------------- caps
+
+# Keep bodies bounded to avoid a pathological 500MB message exploding the DB.
+_BODY_CAP_BYTES = 10 * 1024 * 1024  # 10MB
+
+
+def _cap_text(s: Optional[str]) -> Optional[str]:
+    if s is None:
+        return None
+    if len(s.encode("utf-8", errors="ignore")) <= _BODY_CAP_BYTES:
+        return s
+    return s[: _BODY_CAP_BYTES // 2] + "\n\n[TRUNCATED BY CRM — body exceeded 10MB]"
@@ -0,0 +1,462 @@
+"""
+HTTP route handlers for the Gmail integration.
+
+Designed to plug into server.py's CRMHandler (BaseHTTPRequestHandler) pattern.
+The hook is a single function call near the top of do_GET / do_POST that
+lets this module claim any /api/email/* request:
+
+    # in CRMHandler.do_GET and CRMHandler.do_POST, before the 404 fallthrough:
+    from email_integration.routes import try_handle
+    if try_handle(self):
+        return
+
+`try_handle(handler)` inspects `handler.command` and `handler.get_path()` and
+returns True if it handled the request (sent a response).
+
+Every handler respects the same auth / rate-limit model as the rest of server.py
+by calling handler.get_user() and handler.rate_limited(...).
+"""
+
+import json
+import sqlite3
+from typing import Optional
+
+from . import config as _cfg
+from . import credentials as _creds
+from . import crypto as _crypto
+from . import db as _db
+from . import scheduler as _sched
+
+
+# ---------------------------------------------------------------------------- dispatch
+
+_GET_ROUTES = {
+    "/api/email/status": "status",
+    "/api/email/accounts": "list_accounts",
+    "/api/email/threads": "list_threads",
+    "/api/email/oauth/start": "oauth_start",
+    "/api/email/oauth/callback": "oauth_callback",
+}
+
+_POST_ROUTES = {
+    "/api/email/accounts/enroll-all": "enroll_all",
+    "/api/email/accounts/enroll": "enroll_one",
+    "/api/email/sync/run-now": "run_now",
+    "/api/email/rematch": "rematch",
+}
+
+
+def try_handle(handler) -> bool:
+    path = handler.get_path()
+    method = handler.command
+    table = _GET_ROUTES if method == "GET" else _POST_ROUTES if method == "POST" else {}
+    if not path.startswith("/api/email/"):
+        return False
+    name = table.get(path)
+    if not name:
+        # Route is owned by this module but unknown — return a proper 404
+        # instead of letting the main dispatcher's 404 abuse counter fire.
+        handler.send_error_json("Not found", 404)
+        return True
+
+    if not _cfg.CONFIG.enabled:
+        handler.send_error_json("Email integration disabled", 503)
+        return True
+
+    # Also enforce attachment streaming under a different prefix
+    # (handled above via prefix check).
+
+    impl = globals().get(f"_h_{name}")
+    if impl is None:
+        handler.send_error_json("Not implemented", 500)
+        return True
+
+    try:
+        impl(handler)
+    except Exception as e:
+        handler.send_error_json(f"Internal error: {e}", 500)
+    return True
+
+
+# ---------------------------------------------------------------------------- helpers
+
+def _conn() -> sqlite3.Connection:
+    import os
+    db_path = os.environ.get(
+        "CRM_DB_PATH",
+        os.path.join(_cfg.CONFIG.data_dir, "crm.db"),
+    )
+    conn = sqlite3.connect(db_path)
+    conn.execute("PRAGMA journal_mode=WAL")
+    conn.execute("PRAGMA foreign_keys=ON")
+    conn.execute("PRAGMA busy_timeout=5000")
+    conn.row_factory = sqlite3.Row
+    return conn
+
+
+def _require_auth(handler) -> Optional[dict]:
+    user = handler.get_user()
+    if not user:
+        handler.send_error_json("Unauthorized", 401)
+        return None
+    return user
+
+
+def _require_admin(handler) -> Optional[dict]:
+    user = _require_auth(handler)
+    if user is None:
+        return None
+    if user.get("role") != "admin":
+        handler.send_error_json("Admin required", 403)
+        return None
+    return user
+
+
+# ---------------------------------------------------------------------------- GET handlers
+
+def _h_status(handler):
+    user = _require_auth(handler)
+    if not user:
+        return
+    snap = _sched.status_snapshot()
+    conn = _conn()
+    try:
+        cur = conn.cursor()
+        cur.execute(
+            "SELECT COUNT(*) AS n_accounts, "
+            "SUM(CASE WHEN sync_status='active' THEN 1 ELSE 0 END) AS n_active, "
+            "SUM(CASE WHEN sync_status='error' THEN 1 ELSE 0 END) AS n_error "
+            "FROM email_accounts"
+        )
+        counts = dict(cur.fetchone() or {})
+        cur.execute("SELECT COUNT(*) AS n FROM emails WHERE match_status = 'matched'")
+        snap["matched_emails"] = cur.fetchone()["n"]
+    finally:
+        conn.close()
+    snap["accounts_summary"] = counts
+    handler.send_json(snap)
+
+
+def _h_list_accounts(handler):
+    user = _require_auth(handler)
+    if not user:
+        return
+    conn = _conn()
+    try:
+        cur = conn.cursor()
+        cur.execute(
+            "SELECT id, user_id, email_address, auth_method, sync_enabled, "
+            "sync_status, sync_error, last_synced_at, backfill_complete "
+            "FROM email_accounts ORDER BY email_address"
+        )
+        rows = [dict(r) for r in cur.fetchall()]
+    finally:
+        conn.close()
+    # Non-admins only see their own row
+    if user.get("role") != "admin":
+        rows = [r for r in rows if r["user_id"] == user["user_id"]]
+    handler.send_json({"accounts": rows})
+
+
+def _h_list_threads(handler):
+    user = _require_auth(handler)
+    if not user:
+        return
+    q = handler.get_query_params()
+    investor_id = q.get("investor_id")
+    limit = max(1, min(int(q.get("limit", 50)), 500))
+    conn = _conn()
+    try:
+        cur = conn.cursor()
+        if investor_id:
+            cur.execute(
+                """SELECT t.*
+                   FROM email_threads t
+                   JOIN emails e ON e.thread_id = t.id
+                   JOIN email_investor_links l ON l.email_id = e.id
+                   WHERE l.fundraising_investor_id = ?
+                      OR l.fundraising_contact_id IN (
+                         SELECT id FROM fundraising_contacts WHERE investor_id = ?
+                      )
+                   GROUP BY t.id
+                   ORDER BY t.last_message_at DESC
+                   LIMIT ?""",
+                (investor_id, investor_id, limit),
+            )
+        else:
+            cur.execute(
+                "SELECT * FROM email_threads WHERE is_matched = 1 "
+                "ORDER BY last_message_at DESC LIMIT ?",
+                (limit,),
+            )
+        threads = [dict(r) for r in cur.fetchall()]
+    finally:
+        conn.close()
+    handler.send_json({"threads": threads})
+
+
+def _h_oauth_start(handler):
+    """Begin per-user OAuth consent flow (fallback path)."""
+    user = _require_auth(handler)
+    if not user:
+        return
+    if _cfg.CONFIG.primary_auth != "oauth":
+        return handler.send_error_json(
+            "Per-user OAuth disabled (set CRM_GMAIL_AUTH_METHOD=oauth to enable)", 400
+        )
+    q = handler.get_query_params()
+    account_email = q.get("account_email") or ""
+    if not account_email:
+        return handler.send_error_json("account_email required", 400)
+
+    import secrets
+    import urllib.parse
+    state = secrets.token_urlsafe(32)
+    _oauth_state_store(state, user["user_id"], account_email)
+
+    params = {
+        "client_id": _cfg.CONFIG.oauth_client_id,
+        "redirect_uri": _cfg.CONFIG.oauth_redirect_uri,
+        "response_type": "code",
+        "scope": _creds.GMAIL_READONLY_SCOPE,
+        "access_type": "offline",
+        "prompt": "consent",
+        "state": state,
+        "login_hint": account_email,
+    }
+    url = "https://accounts.google.com/o/oauth2/v2/auth?" + urllib.parse.urlencode(params)
+    handler.send_json({"redirect_url": url})
+
+
+def _h_oauth_callback(handler):
+    """Exchange code for tokens, encrypt refresh token, store."""
+    q = handler.get_query_params()
+    code = q.get("code")
+    state = q.get("state")
+    if not code or not state:
+        return handler.send_error_json("code and state required", 400)
+
+    state_row = _oauth_state_consume(state)
+    if not state_row:
+        return handler.send_error_json("Invalid state", 400)
+
+    import urllib.parse
+    import urllib.request
+    body = urllib.parse.urlencode({
+        "code": code,
+        "client_id": _cfg.CONFIG.oauth_client_id,
+        "client_secret": _cfg.CONFIG.oauth_client_secret,
+        "redirect_uri": _cfg.CONFIG.oauth_redirect_uri,
+        "grant_type": "authorization_code",
+    }).encode("ascii")
+    req = urllib.request.Request(
+        "https://oauth2.googleapis.com/token",
+        data=body,
+        headers={"Content-Type": "application/x-www-form-urlencoded"},
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=15) as resp:
+            payload = json.loads(resp.read())
+    except Exception as e:
+        return handler.send_error_json(f"Token exchange failed: {e}", 500)
+
+    refresh = payload.get("refresh_token")
+    if not refresh:
+        return handler.send_error_json("No refresh_token returned (user may have previously consented; prompt=consent required)", 400)
+
+    enc = _crypto.encrypt(refresh.encode("ascii"), secret_key_b64=_cfg.CONFIG.secret_key_b64)
+
+    conn = _conn()
+    try:
+        _db.upsert_account(conn, user_id=state_row["user_id"],
+                           email_address=state_row["account_email"],
+                           auth_method="oauth")
+        conn.execute(
+            "UPDATE email_accounts SET oauth_refresh_enc = ?, sync_status = 'pending', "
+            "updated_at = datetime('now') WHERE email_address = ?",
+            (enc, state_row["account_email"]),
+        )
+        conn.commit()
+    finally:
+        conn.close()
+
+    handler.send_json({"ok": True, "account_email": state_row["account_email"]})
+
+
+# ---------------------------------------------------------------------------- POST handlers
+
+def _h_enroll_all(handler):
+    """Admin: enroll every CRM user whose email is @workspace_domain via DWD."""
+    user = _require_admin(handler)
+    if not user:
+        return
+    if _cfg.CONFIG.primary_auth != "dwd":
+        return handler.send_error_json("enroll-all only valid in DWD mode", 400)
+    domain = _cfg.CONFIG.workspace_domain
+    if not domain:
+        return handler.send_error_json("CRM_GMAIL_WORKSPACE_DOMAIN not set", 400)
+
+    conn = _conn()
+    try:
+        cur = conn.cursor()
+        cur.execute(
+            "SELECT id, email FROM users WHERE is_active = 1 AND email LIKE ?",
+            (f"%@{domain}",),
+        )
+        users = cur.fetchall()
+        created = []
+        for u in users:
+            aid = _db.upsert_account(conn, user_id=u["id"],
+                                     email_address=u["email"].lower(),
+                                     auth_method="dwd")
+            created.append({"account_id": aid, "email": u["email"]})
+        conn.commit()
+    finally:
+        conn.close()
+    handler.send_json({"enrolled": created, "count": len(created)})
+
+
+def _h_enroll_one(handler):
+    user = _require_admin(handler)
+    if not user:
+        return
+    body = handler.get_body() or {}
+    # Accept either `email` or `email_address` for ergonomics.
+    email_address = (body.get("email_address") or body.get("email") or "").lower().strip()
+    user_id = body.get("user_id")
+    auth_method = body.get("auth_method") or _cfg.CONFIG.primary_auth
+
+    if not email_address:
+        return handler.send_error_json("email (or email_address) required", 400)
+
+    # If the caller didn't specify a CRM user_id, resolve it from the
+    # users table by matching email. Falls back to the authenticated
+    # admin's own id (handles the common case of a single admin
+    # enrolling themselves without having to paste their UUID).
+    if not user_id:
+        conn = _conn()
+        try:
+            cur = conn.cursor()
+            cur.execute("SELECT id FROM users WHERE LOWER(email) = ?",
+                        (email_address,))
+            row = cur.fetchone()
+            user_id = row["id"] if row else user.get("id")
+        finally:
+            conn.close()
+
+    if not user_id:
+        return handler.send_error_json("could not resolve user_id for that email", 400)
+
+    conn = _conn()
+    try:
+        aid = _db.upsert_account(conn, user_id=user_id,
+                                 email_address=email_address,
+                                 auth_method=auth_method)
+        conn.commit()
+    finally:
+        conn.close()
+    handler.send_json({"account_id": aid, "email": email_address, "user_id": user_id})
+
+
+def _h_run_now(handler):
+    user = _require_admin(handler)
+    if not user:
+        return
+    # Reuse existing rate limit so admins can't hammer this.
+    if handler.rate_limited("email-sync-now", 6):
+        return handler.send_error_json("Too many requests", 429)
+    result = _sched.trigger_run_now()
+    handler.send_json(result)
+
+
+def _h_rematch(handler):
+    """Re-evaluate unmatched emails against the current investor index."""
+    user = _require_admin(handler)
+    if not user:
+        return
+    body = handler.get_body() or {}
+    since = body.get("since")  # optional ISO8601
+    conn = _conn()
+    scanned = 0
+    matched = 0
+    try:
+        from .matcher import InvestorIndex
+        index = InvestorIndex(own_domain=_cfg.CONFIG.workspace_domain)
+        index.rebuild(_conn)
+        cur = conn.cursor()
+        sql = ("SELECT id, from_email, to_emails_json, cc_emails_json "
+               "FROM emails WHERE match_status = 'unmatched'")
+        params: list = []
+        if since:
+            sql += " AND sent_at >= ?"
+            params.append(since)
+        sql += " ORDER BY sent_at DESC LIMIT 10000"
+        cur.execute(sql, params)
+        for row in cur.fetchall():
+            scanned += 1
+            participants = set()
+            if row["from_email"]:
+                participants.add(row["from_email"].lower())
+            for col in ("to_emails_json", "cc_emails_json"):
+                try:
+                    arr = json.loads(row[col] or "[]")
+                except Exception:
+                    arr = []
+                for a in arr:
+                    e = a.get("email") if isinstance(a, dict) else a
+                    if e:
+                        participants.add(e.lower())
+            links = index.match(participants)
+            if not links:
+                continue
+            matched += 1
+            conn.execute(
+                "UPDATE emails SET match_status='matched', is_matched=1, "
+                "updated_at=datetime('now') WHERE id=?",
+                (row["id"],),
+            )
+            for link in links:
+                _db.insert_investor_link(conn, email_id=row["id"], link={
+                    "matched_address": link.matched_address,
+                    "match_kind": link.match_kind,
+                    "match_confidence": link.match_confidence,
+                    "fundraising_investor_id": link.target.fundraising_investor_id,
+                    "fundraising_contact_id": link.target.fundraising_contact_id,
+                    "contact_id": link.target.contact_id,
+                    "organization_id": link.target.organization_id,
+                })
+            # NOTE: body is still missing — we only have headers. A follow-up
+            # job can re-fetch the full message from Gmail using the sighting's
+            # gmail_message_id. Not done inline to keep this endpoint fast.
+        conn.commit()
+    finally:
+        conn.close()
+    handler.send_json({"scanned": scanned, "newly_matched": matched})
+
+
+# ---------------------------------------------------------------------------- OAuth state store (in-memory)
+# For a 5-person CRM the state store doesn't need to be durable — a server
+# restart between start and callback is rare and just requires a retry.
+
+_oauth_states: dict[str, dict] = {}
+_oauth_state_lock = __import__("threading").Lock()
+
+
+def _oauth_state_store(state: str, user_id: str, account_email: str) -> None:
+    import time
+    with _oauth_state_lock:
+        # Prune stale entries (>10 min).
+        cutoff = time.time() - 600
+        for k, v in list(_oauth_states.items()):
+            if v["created"] < cutoff:
+                _oauth_states.pop(k, None)
+        _oauth_states[state] = {
+            "user_id": user_id,
+            "account_email": account_email.lower().strip(),
+            "created": time.time(),
+        }
+
+
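+def _oauth_state_consume(state: str) -> Optional[dict]:
+    import time
+    with _oauth_state_lock:
+        row = _oauth_states.pop(state, None)
+    # Enforce the same 10-minute lifetime that _oauth_state_store prunes by.
+    if row and row["created"] < time.time() - 600:
+        return None
+    return row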
@@ -0,0 +1,143 @@
+"""
+Background sync scheduler.
+
+Runs as a daemon thread started from server.py main(). One thread; it wakes
+every `sync_interval_sec`, processes all accounts serially, sleeps again.
+
+Singleton: start_sync_scheduler() is idempotent — calling twice won't spawn
+a second thread. stop_sync_scheduler() gracefully signals shutdown (not
+strictly needed since it's daemon, but useful for tests).
+"""
+
+import logging
+import sqlite3
+import threading
+import time
+from typing import Callable, Optional
+
+from . import config as _cfg
+from . import credentials as _creds
+from . import sync as _sync
+from .matcher import InvestorIndex
+
+
+log = logging.getLogger("email_integration.scheduler")
+
+
+_state: dict[str, object] = {
+    "thread": None,
+    "stop": threading.Event(),
+    "last_run": 0.0,
+    "last_result": None,
+    "running_now": False,
+}
+
+
+def _conn_factory_from_env() -> Callable[[], sqlite3.Connection]:
+    """Build a get_db() compatible with server.py's pattern.
+
+    We don't import server.py (avoid circular / startup ordering). Instead
+    we re-implement the same settings. If server.py's DB path differs from
+    the default, CRM_DB_PATH env var should be set — same mechanism.
+    """
+    import os
+    db_path = os.environ.get(
+        "CRM_DB_PATH",
+        os.path.join(_cfg.CONFIG.data_dir, "crm.db"),
+    )
+
+    def get_db() -> sqlite3.Connection:
+        conn = sqlite3.connect(db_path)
+        conn.execute("PRAGMA journal_mode=WAL")
+        conn.execute("PRAGMA foreign_keys=ON")
+        conn.execute("PRAGMA busy_timeout=5000")
+        conn.row_factory = sqlite3.Row
+        return conn
+
+    return get_db
+
+
+def start_sync_scheduler(conn_factory: Optional[Callable] = None) -> None:
+    if _state["thread"] is not None:
+        return  # already running
+
+    if not _cfg.CONFIG.enabled:
+        log.info("email_integration not enabled; scheduler will not start")
+        return
+
+    factory = conn_factory or _conn_factory_from_env()
+
+    try:
+        provider = _creds.build_provider(factory)
+    except Exception as e:
+        log.exception("cannot build credential provider: %s", e)
+        return
+
+    index = InvestorIndex(own_domain=_cfg.CONFIG.workspace_domain)
+    try:
+        index.rebuild(factory)
+    except Exception:
+        log.exception("initial investor-index build failed; scheduler continues")
+
+    stop = threading.Event()
+    _state["stop"] = stop
+
+    def _loop():
+        log.info("email sync scheduler started; interval=%ss", _cfg.CONFIG.sync_interval_sec)
+        # First cycle: short delay to let server finish startup.
+        if stop.wait(10):
+            return
+        while not stop.is_set():
+            _state["running_now"] = True
+            t0 = time.time()
+            try:
+                result = _sync.sync_all(factory, provider, index)
+                _state["last_result"] = result
+            except Exception:
+                log.exception("sync loop crashed; will retry next cycle")
+            finally:
+                _state["running_now"] = False
+                _state["last_run"] = t0
+            if stop.wait(_cfg.CONFIG.sync_interval_sec):
+                return
+
+    t = threading.Thread(target=_loop, name="email-sync", daemon=True)
+    t.start()
+    _state["thread"] = t
+    _state["provider"] = provider
+    _state["index"] = index
+    _state["factory"] = factory
+
+
+def stop_sync_scheduler() -> None:
+    ev: threading.Event = _state["stop"]  # type: ignore
+    ev.set()
+    t = _state.get("thread")
+    if t:
+        try:
+            t.join(timeout=5)
+        except Exception:
+            pass
+    _state["thread"] = None
+
+
+def trigger_run_now() -> dict:
+    """Force a single sync pass synchronously (admin 'sync now' endpoint)."""
+    if _state.get("running_now"):
+        return {"status": "already_running"}
+    factory = _state.get("factory")
+    provider = _state.get("provider")
+    index = _state.get("index")
+    if not (factory and provider and index):
+        return {"status": "not_initialized"}
+    return _sync.sync_all(factory, provider, index)  # type: ignore
+
+
+def status_snapshot() -> dict:
+    return {
+        "enabled": _cfg.CONFIG.enabled,
+        "running": _state["running_now"],
+        "last_run_unix": _state.get("last_run"),
+        "last_result": _state.get("last_result"),
+        "interval_sec": _cfg.CONFIG.sync_interval_sec,
+    }
@@ -0,0 +1,390 @@
+"""
+Sync orchestrator.
+
+Top-level entry points:
+
+  sync_account(conn_factory, credential_provider, account_row, matcher)
+      Full sync pass for one mailbox. Decides backfill vs. incremental based
+      on email_accounts.backfill_complete. Writes a sync_runs row.
+
+  sync_all(conn_factory, credential_provider, matcher)
+      Iterates every sync-enabled account sequentially. Called from
+      scheduler.py every CRM_GMAIL_SYNC_INTERVAL_MIN minutes.
+
+Design: match-only storage (see architecture doc §7). For each message:
+  1. Fetch metadata (cheap, 5 units).
+  2. Run matcher against participant addresses.
+  3. If matched → fetch full message, parse, persist body + register attachments.
+  4. If unmatched → persist header-only row.
+  5. In both cases, record the per-account sighting.
+"""
+
+import logging
+import sqlite3
+import traceback
+from typing import Optional
+
+from . import attachments as _attach
+from . import config as _cfg
+from . import db as _db
+from . import errors as _errors
+from . import gmail_client as _gmail
+from . import parser as _parser
+from . import threads as _threads
+from .matcher import InvestorIndex, InvestorLink
+
+
+log = logging.getLogger("email_integration.sync")
+
+
+METADATA_HEADERS = [
+    "From", "To", "Cc", "Bcc", "Subject", "Date",
+    "Message-ID", "In-Reply-To", "References", "Reply-To",
+]
+
+
+# ---------------------------------------------------------------------------- public
+
+def sync_all(conn_factory, credential_provider, index: InvestorIndex) -> dict:
+    """Run one pass across all enabled accounts. Returns summary stats."""
+    index.rebuild_if_stale(conn_factory)
+
+    conn = conn_factory()
+    try:
+        accounts = _db.list_sync_ready_accounts(conn)
+    finally:
+        conn.close()
+
+    totals = {"accounts": 0, "messages_stored": 0, "errors": 0}
+    for acc in accounts:
+        totals["accounts"] += 1
+        try:
+            stats = sync_account(conn_factory, credential_provider, acc, index)
+            totals["messages_stored"] += stats.get("messages_stored", 0)
+        except Exception:
+            totals["errors"] += 1
+            log.exception("sync failed for account %s", acc["email_address"])
+    return totals
+
+
+def sync_account(conn_factory, credential_provider, account,
+                 index: InvestorIndex) -> dict:
+    """Sync a single mailbox. Returns stats dict."""
+    email_addr = account["email_address"]
+    stats = _gmail.CallStats()
+    client = _gmail.GmailClient(credential_provider, email_addr, stats=stats)
+
+    # Mark running
+    conn = conn_factory()
+    try:
+        run_id = _db.start_sync_run(conn,
+                                    account_id=account["id"],
+                                    kind="backfill" if not account["backfill_complete"] else "incremental")
+        _db.set_account_status(conn, account["id"], status="active", error=None)
+        conn.commit()
+    finally:
+        conn.close()
+
+    run_stats = {"messages_seen": 0, "messages_stored": 0, "attachments_saved": 0}
+    error_str: Optional[str] = None
+    status = "ok"
+
+    try:
+        if not account["backfill_complete"]:
+            _run_backfill(conn_factory, client, account, index, run_stats)
+        else:
+            _run_incremental(conn_factory, client, account, index, run_stats)
+
+        # Drain attachments for this account, limited to a few cycles'
+        # worth per pass. drain_pending receives conn_factory, so no
+        # connection needs to be opened here.
+        batched = _attach.drain_pending(conn_factory, client, account["id"], limit=100)
+        run_stats["attachments_saved"] = batched
+
+    except _errors.AuthError as e:
+        error_str = f"auth: {e}"
+        status = "error"
+    except _errors.HistoryExpiredError:
+        # Recover: reset to date-based backfill from last_synced_at.
+        error_str = "history expired; fallback to date backfill"
+        status = "partial"
+        _fallback_date_backfill(conn_factory, client, account, index, run_stats)
+    except Exception as e:
+        error_str = f"unexpected: {type(e).__name__}: {e}"
+        status = "error"
+        log.exception("unexpected during sync of %s", email_addr)
+    finally:
+        run_stats["api_calls"] = stats.api_calls
+        run_stats["retries"] = stats.retries
+        conn = conn_factory()
+        try:
+            _db.finish_sync_run(conn, run_id, status=status, stats=run_stats, error=error_str)
+            _db.set_account_status(conn, account["id"],
+                                   status="active" if status == "ok" else status,
+                                   error=error_str)
+            _db.set_account_checkpoint(conn, account["id"],
+                                       last_synced_at=_db._now_iso())
+            conn.commit()
+        finally:
+            conn.close()
+
+    return run_stats
+
+
+# ---------------------------------------------------------------------------- backfill
+
+def _run_backfill(conn_factory, client, account, index: InvestorIndex,
+                  run_stats: dict) -> None:
+    """Initial full-mailbox backfill, resumable via backfill_cursor."""
+    page_token = account["backfill_cursor"]
+    while True:
+        resp = client.list_messages(page_token=page_token,
+                                    max_results=_cfg.CONFIG.backfill_page_size)
+        messages = resp.get("messages") or []
+        for m in messages:
+            run_stats["messages_seen"] += 1
+            try:
+                _process_one_message(conn_factory, client, account, index,
+                                     gmail_message_id=m["id"], run_stats=run_stats)
+            except _errors.GmailError as e:
+                log.warning("skip msg %s on %s: %s", m["id"], account["email_address"], e)
+                continue
+
+        page_token = resp.get("nextPageToken")
+        conn = conn_factory()
+        try:
+            _db.set_account_checkpoint(conn, account["id"],
+                                       backfill_cursor=page_token,
+                                       backfill_complete=(not page_token))
+            conn.commit()
+        finally:
+            conn.close()
+
+        if not page_token:
+            # Capture current historyId as checkpoint for future incrementals.
+            prof = client.get_profile()
+            hid = prof.get("historyId")
+            if hid:
+                conn = conn_factory()
+                try:
+                    _db.set_account_checkpoint(conn, account["id"], history_id=str(hid))
+                    conn.commit()
+                finally:
+                    conn.close()
+            return
+
+
+# ---------------------------------------------------------------------------- incremental
+
+def _run_incremental(conn_factory, client, account, index: InvestorIndex,
+                     run_stats: dict) -> None:
+    start_hid = account["last_history_id"]
+    if not start_hid:
+        # Safety: if checkpoint is missing, re-enter backfill.
+        _run_backfill(conn_factory, client, account, index, run_stats)
+        return
+
+    # history_types filter limits bandwidth to what we care about.
+    new_hid: Optional[str] = None
+    try:
+        for h in client.iter_history(
+            start_history_id=start_hid,
+            history_types=["messageAdded", "messageDeleted", "labelAdded", "labelRemoved"],
+        ):
+            for ma in h.get("messagesAdded") or []:
+                msg = ma.get("message") or {}
+                run_stats["messages_seen"] += 1
+                try:
+                    _process_one_message(conn_factory, client, account, index,
+                                         gmail_message_id=msg.get("id"),
+                                         run_stats=run_stats)
+                except _errors.GmailError as e:
+                    log.warning("skip msg %s on %s: %s", msg.get("id"), account["email_address"], e)
+
+            for md in h.get("messagesDeleted") or []:
+                msg = md.get("message") or {}
+                conn = conn_factory()
+                try:
+                    _db.tombstone_sighting(
+                        conn,
+                        account_id=account["id"],
+                        gmail_message_id=msg.get("id"),
+                    )
+                    conn.commit()
+                finally:
+                    conn.close()
+
+            for la in (h.get("labelsAdded") or []) + (h.get("labelsRemoved") or []):
+                msg = la.get("message") or {}
+                # msg["labelIds"] is the message's resulting label set after
+                # the change, so the stored labels are replaced wholesale.
+                labels = msg.get("labelIds") or []
+                conn = conn_factory()
+                try:
+                    _db.update_sighting_labels(
+                        conn,
+                        account_id=account["id"],
+                        gmail_message_id=msg.get("id"),
+                        labels=labels,
+                    )
+                    conn.commit()
+                finally:
+                    conn.close()
+        new_hid = client.last_history_id
+    except _errors.HistoryExpiredError:
+        raise
+
+    if new_hid:
+        conn = conn_factory()
+        try:
+            _db.set_account_checkpoint(conn, account["id"], history_id=str(new_hid))
+            conn.commit()
+        finally:
+            conn.close()
+
+
+def _fallback_date_backfill(conn_factory, client, account, index, run_stats):
+    """Used when startHistoryId has been pruned by Gmail.
+
+    Pulls everything since last_synced_at (or the last 14 days if unknown).
+    This overlaps heavily with existing data, but upserts are idempotent.
+    """
+    from datetime import datetime, timedelta, timezone
+    since = account["last_synced_at"] or (
+        datetime.now(tz=timezone.utc) - timedelta(days=14)
+    ).strftime("%Y-%m-%dT%H:%M:%SZ")
+    q = f"after:{since[:10].replace('-', '/')}"
+    for m in client.iter_messages(q=q):
+        run_stats["messages_seen"] += 1
+        try:
+            _process_one_message(conn_factory, client, account, index,
+                                 gmail_message_id=m["id"], run_stats=run_stats)
+        except _errors.GmailError as e:
+            log.warning("skip during date-backfill msg %s: %s", m["id"], e)
+    prof = client.get_profile()
+    hid = prof.get("historyId")
+    if hid:
+        conn = conn_factory()
+        try:
+            _db.set_account_checkpoint(conn, account["id"], history_id=str(hid))
+            conn.commit()
+        finally:
+            conn.close()
+
+
+# ---------------------------------------------------------------------------- per-message
+
+def _process_one_message(conn_factory, client, account, index: InvestorIndex,
+                         *, gmail_message_id: str, run_stats: dict) -> None:
+    """Fetch, match, persist one message. Idempotent."""
+    if not gmail_message_id:
+        return
+
+    # Skip if we've already sighted this message for this account.
+    conn = conn_factory()
+    try:
+        cur = conn.cursor()
+        cur.execute(
+            "SELECT email_id FROM email_account_messages "
+            "WHERE account_id = ? AND gmail_message_id = ?",
+            (account["id"], gmail_message_id),
+        )
+        if cur.fetchone():
+            return
+    finally:
+        conn.close()
+
+    # 1. Metadata fetch (cheap).
+    meta = client.get_message(gmail_message_id, format="metadata",
+                              metadata_headers=METADATA_HEADERS)
+    meta_parsed = _parser.parse(meta, owning_account_address=account["email_address"])
+
+    participants = set()  # lowercased so they compare equal to `own` below
+    if meta_parsed.get("from_email"):
+        participants.add(meta_parsed["from_email"].lower())
+    for kind in ("to", "cc", "bcc"):
+        for a in meta_parsed.get(kind, []):
+            if isinstance(a, dict) and a.get("email"):
+                participants.add(a["email"].lower())
+
+    # Exclude owning account's own address so we don't try to "match" ourselves.
+    own = {account["email_address"].lower()}
+    links = index.match(participants, exclude_addresses=own)
+    is_matched = bool(links)
+
+    # 2. If matched, fetch full and parse for body + attachments.
+    if is_matched:
+        full = client.get_message(gmail_message_id, format="full")
+        parsed = _parser.parse(full, owning_account_address=account["email_address"])
+    else:
+        parsed = meta_parsed
+        # Strip any body fields (metadata fetch shouldn't have them but be safe).
+        parsed["body_text"] = None
+        parsed["body_html"] = None
+        parsed["attachments"] = []
+
+    # 3. Persist (idempotent on rfc_message_id).
+    conn = conn_factory()
+    try:
+        existing = _db.find_email_by_rfc_id(conn, parsed["rfc_message_id"])
+        if existing:
+            email_id = existing["id"]
+            # If the email was previously unmatched but now matches (e.g. user
+            # added the investor after first sight), upgrade the row.
+            if is_matched and existing["match_status"] == "unmatched":
+                conn.execute(
+                    "UPDATE emails SET match_status = 'matched', is_matched = 1, "
+                    "body_text = ?, body_html = ?, updated_at = datetime('now') "
+                    "WHERE id = ?",
+                    (parsed.get("body_text"), parsed.get("body_html"), email_id),
+                )
+                _attach.register_stubs(conn,
+                                       email_id=email_id,
+                                       parsed_attachments=parsed.get("attachments") or [])
+                for link in links:
+                    _db.insert_investor_link(conn, email_id=email_id, link=_flatten_link(link))
+        else:
+            match_status = "matched" if is_matched else "unmatched"
+            email_id = _db.insert_email(conn, parsed=parsed, match_status=match_status)
+            thread_id = _threads.resolve_thread_id(conn, parsed)
+            _db.set_email_thread(conn, email_id, thread_id)
+            if is_matched:
+                _attach.register_stubs(conn,
+                                       email_id=email_id,
+                                       parsed_attachments=parsed.get("attachments") or [])
+                for link in links:
+                    _db.insert_investor_link(conn, email_id=email_id, link=_flatten_link(link))
+            _db.rollup_thread(conn, thread_id)
+            run_stats["messages_stored"] += 1
+
+        # Record sighting (always, even if email row was pre-existing).
+        _db.upsert_sighting(
+            conn,
+            email_id=email_id,
+            account_id=account["id"],
+            gmail_message_id=gmail_message_id,
+            gmail_thread_id=parsed.get("gmail_thread_id") or "",
+            labels=parsed.get("labels", []),
+            is_sent=parsed.get("is_sent", False),
+        )
+        conn.commit()
+    except sqlite3.IntegrityError:
+        # Concurrent insert race: skip. Nothing is re-read; close() drops the uncommitted work.
+        pass
+    finally:
+        conn.close()
+
+
+def _flatten_link(link: InvestorLink) -> dict:
+    return {
+        "matched_address": link.matched_address,
+        "match_kind": link.match_kind,
+        "match_confidence": link.match_confidence,
+        "fundraising_investor_id": link.target.fundraising_investor_id,
+        "fundraising_contact_id": link.target.fundraising_contact_id,
+        "contact_id": link.target.contact_id,
+        "organization_id": link.target.organization_id,
+    }
@@ -0,0 +1,75 @@
+"""
+Threading resolution.
+
+Given a freshly-inserted emails row (or its about-to-be-inserted parsed dict),
+figure out which email_threads row it belongs to. If none exists, create one.
+
+Priority order (see architecture doc §10):
+  1. Existing email in our DB that shares any RFC Message-ID with this one's
+     References/In-Reply-To chain — inherit its thread.
+  2. Existing thread with the same gmail_thread_id.
+  3. Existing thread with the same rfc_thread_root_id.
+  4. Create a new thread.
+"""
+
+import re
+import sqlite3
+from typing import Optional
+
+from . import db as _db
+
+
+SUBJECT_PREFIX_RE = re.compile(r"^\s*(re|fwd?|aw|sv|antw|回复)\s*:\s*", re.IGNORECASE)
+
+
+def normalize_subject(s: Optional[str]) -> Optional[str]:
+    if not s:
+        return None
+    out = s
+    # Strip up to 5 nested Re:/Fwd: prefixes.
+    for _ in range(5):
+        new = SUBJECT_PREFIX_RE.sub("", out, count=1)
+        if new == out:
+            break
+        out = new
+    return out.strip().lower()
+
+
+def resolve_thread_id(conn: sqlite3.Connection, parsed: dict) -> str:
+    """Returns a thread_id — either an existing one or a newly created one."""
+    # Step 1: RFC cross-link.
+    candidates = list(parsed.get("references") or [])
+    if parsed.get("in_reply_to"):
+        candidates.append(parsed["in_reply_to"])
+
+    if candidates:
+        existing_email_id = _db.find_email_id_by_any_rfc_id(conn, candidates)
+        if existing_email_id:
+            cur = conn.cursor()
+            cur.execute("SELECT thread_id FROM emails WHERE id = ?", (existing_email_id,))
+            row = cur.fetchone()
+            if row and row["thread_id"]:
+                return row["thread_id"]
+
+    # Step 2: gmail_thread_id match.
+    gt = parsed.get("gmail_thread_id")
+    if gt:
+        existing = _db.find_thread_by_gmail_id(conn, gt)
+        if existing:
+            return existing["id"]
+
+    # Step 3: RFC thread-root match.
+    rfc_root = parsed.get("rfc_thread_root_id")
+    if rfc_root:
+        existing = _db.find_thread_by_rfc_root(conn, rfc_root)
+        if existing:
+            return existing["id"]
+
+    # Step 4: create.
+    return _db.create_thread(
+        conn,
+        gmail_thread_id=gt,
+        rfc_thread_root_id=rfc_root,
+        subject_normalized=normalize_subject(parsed.get("subject")),
+        first_message_at=parsed.get("sent_at"),
+    )
@@ -0,0 +1,5 @@
+"""Ten31 Phase-0 ingest pipeline (entity resolution, chunking, embed, Qdrant upsert).
+
+All modules are local-only and read the CRM by SQLite file path (CRM is canonical;
+the canonical/vector layers are derived). No real data is sent to Claude here.
+"""
@@ -0,0 +1,64 @@
+#!/usr/bin/env python3
+"""Phase-0 Workstream B — backfill the CRM into Qdrant.
+
+Chunk -> dense (bge-m3 via Spark Control) + sparse (BM25 client-side) -> upsert
+to Qdrant `crm_chunks` with payload. Idempotent: deterministic point ids mean
+re-running upserts in place. Reads the CRM by file path; never sends data to Claude.
+
+    python3 backend/ingest/backfill.py --db data/crm_dev.db --recreate
+"""
+import argparse
+import sqlite3
+
+import chunking
+import config
+import embed
+import qdrant_io
+import sparse
+
+
+def run(db, recreate=False, batch=32):
+    conn = sqlite3.connect(db)
+    conn.row_factory = sqlite3.Row
+    chunks = chunking.build_chunks(conn)
+    conn.close()
+    print(f"Built {len(chunks)} chunks from {db}")
+
+    state = qdrant_io.create_collection(recreate=recreate)
+    qdrant_io.ensure_indexes()
+    print(f"Collection '{config.COLLECTION}': {state}")
+
+    total = 0
+    for i in range(0, len(chunks), batch):
+        group = chunks[i:i + batch]
+        dense = embed.dense_embed([c["text"] for c in group])
+        points = []
+        for c, dv in zip(group, dense):
+            sv = sparse.encode(c["text"])
+            points.append({
+                "id": c["point_id"],
+                "vector": {"dense": dv, "sparse": {"indices": sv["indices"], "values": sv["values"]}},
+                "payload": {
+                    "lp_id": c["lp_id"], "lp_name": c["lp_name"], "person_id": c["person_id"],
+                    "doc_type": c["doc_type"], "date_ts": c["date_ts"], "text": c["text"],
+                    "source_model": c["source_model"], "source_id": c["source_id"], "chunk_key": c["chunk_key"],
+                },
+            })
+        qdrant_io.upsert(points)
+        total += len(points)
+        print(f"  upserted {total}/{len(chunks)}")
+
+    print(f"Done. Qdrant '{config.COLLECTION}' now holds {qdrant_io.count()} points.")
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--db", default=config.DEFAULT_DB)
+    ap.add_argument("--recreate", action="store_true", help="drop & recreate the collection first")
+    ap.add_argument("--batch", type=int, default=32)
+    args = ap.parse_args()
+    run(args.db, recreate=args.recreate, batch=args.batch)
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,184 @@
+"""Phase-0 Workstream B1 — chunk the CRM for retrieval.
+
+Maps each CRM record type to one or more chunks per docs/EMBEDDINGS.md:
+  * one chunk per communications row          (doc_type = the comm type)
+  * one chunk per MATCHED email               (doc_type = email; body only when matched)
+  * one chunk per fundraising_investors notes LINE (the outreach log; split per line)
+  * one chunk each for free-text fields: contacts.notes, lp_profiles.notes,
+    opportunities (description + next_step), organizations.description
+
+Each chunk carries a canonical `lp_id` (resolved via entity_links) and a `date_ts`
+(epoch of the event time if the row has one, else updated_at; never created_at) for
+pre-filtering and recency ranking. Entities/names/dates/types are payload; only prose is embedded.
+
+A chunk's stable `chunk_key` -> deterministic point id (uuid5), so re-ingest
+upserts in place (idempotent).
+"""
+import sqlite3
+import uuid
+from datetime import datetime, timezone
+
+_NS = uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")  # uuid5 namespace for chunk ids (same value as uuid.NAMESPACE_URL)
+
+
+def to_epoch(ts: str):
+    """Parse an ISO-8601 timestamp or a YYYY-MM-DD date into epoch seconds, else None."""
+    if not ts:
+        return None
+    s = ts.strip().replace("Z", "+00:00")
+    try:
+        dt = datetime.fromisoformat(s)
+    except ValueError:
+        # date-only fallback: keep just the leading YYYY-MM-DD
+        try:
+            dt = datetime.strptime(s[:10], "%Y-%m-%d")
+        except ValueError:
+            return None
+    # Naive timestamps are treated as UTC.
+    if dt.tzinfo is None:
+        dt = dt.replace(tzinfo=timezone.utc)
+    return int(dt.timestamp())
+
+
+def _point_id(chunk_key: str) -> str:
+    return str(uuid.uuid5(_NS, chunk_key))
+
+
+def _mk(chunk_key, lp_id, lp_name, person_id, doc_type, date_ts, text, source_model, source_id):
+    text = (text or "").strip()
+    if not text or not lp_id:
+        return None
+    return {
+        "chunk_key": chunk_key,
+        "point_id": _point_id(chunk_key),
+        "lp_id": lp_id,
+        "lp_name": lp_name,
+        "person_id": person_id,
+        "doc_type": doc_type,
+        "date_ts": date_ts,
+        "text": text,
+        "source_model": source_model,
+        "source_id": source_id,
+    }
+
+
+def _canon_maps(conn):
+    """Resolution lookups from entity_links / canonical_entities."""
+    person_canon, org_canon, inv_canon = {}, {}, {}
+    for r in conn.execute("SELECT source_model, source_id, canonical_id FROM entity_links"):
+        if r["source_model"] == "contacts":
+            person_canon[r["source_id"]] = r["canonical_id"]
+        elif r["source_model"] == "organizations":
+            org_canon[r["source_id"]] = r["canonical_id"]
+        elif r["source_model"] == "fundraising_investors":
+            inv_canon[r["source_id"]] = r["canonical_id"]
+    name = {r["id"]: r["display_name"] for r in conn.execute("SELECT id, display_name FROM canonical_entities")}
+    contact_org = {r["id"]: r["organization_id"] for r in conn.execute("SELECT id, organization_id FROM contacts")}
+    return person_canon, org_canon, inv_canon, name, contact_org
+
+
+def _contact_lp(cid, person_canon, org_canon, name, contact_org):
+    """Best lp_id for a contact-anchored chunk: the firm if known, else the person."""
+    person = person_canon.get(cid)
+    firm = org_canon.get(contact_org.get(cid))
+    lp = firm or person
+    return lp, name.get(lp), person
+
+
+def build_chunks(conn):
+    person_canon, org_canon, inv_canon, name, contact_org = _canon_maps(conn)
+    chunks = []
+
+    # communications
+    for r in conn.execute("""SELECT id, contact_id, type, subject, body, outcome, next_action, communication_date
+                             FROM communications"""):
+        lp, lp_name, person = _contact_lp(r["contact_id"], person_canon, org_canon, name, contact_org)
+        parts = [p for p in (r["subject"], r["body"], r["outcome"], r["next_action"]) if (p or "").strip()]
+        chunks.append(_mk(f"communications:{r['id']}", lp, lp_name, person,
+                          r["type"] or "note", to_epoch(r["communication_date"]),
+                          "\n".join(parts), "communications", r["id"]))
+
+    # contacts.notes
+    for r in conn.execute("SELECT id, notes, updated_at FROM contacts WHERE notes IS NOT NULL AND notes <> ''"):
+        lp, lp_name, person = _contact_lp(r["id"], person_canon, org_canon, name, contact_org)
+        chunks.append(_mk(f"contacts.notes:{r['id']}", lp, lp_name, person,
+                          "contact_note", to_epoch(r["updated_at"]), r["notes"], "contacts", r["id"]))
+
+    # lp_profiles.notes
+    for r in conn.execute("""SELECT lp.id, lp.contact_id, lp.notes, lp.updated_at
+                             FROM lp_profiles lp WHERE lp.notes IS NOT NULL AND lp.notes <> ''"""):
+        lp, lp_name, person = _contact_lp(r["contact_id"], person_canon, org_canon, name, contact_org)
+        chunks.append(_mk(f"lp_profiles.notes:{r['id']}", lp, lp_name, person,
+                          "lp_note", to_epoch(r["updated_at"]), r["notes"], "lp_profiles", r["id"]))
+
+    # opportunities (description + next_step)
+    for r in conn.execute("""SELECT id, contact_id, name, description, next_step, updated_at
+                             FROM opportunities"""):
+        lp, lp_name, person = _contact_lp(r["contact_id"], person_canon, org_canon, name, contact_org)
+        parts = [p for p in (r["name"], r["description"], r["next_step"]) if (p or "").strip()]
+        chunks.append(_mk(f"opportunities:{r['id']}", lp, lp_name, person,
+                          "opportunity", to_epoch(r["updated_at"]), "\n".join(parts), "opportunities", r["id"]))
+
+    # organizations.description
+    for r in conn.execute("""SELECT id, description, updated_at FROM organizations
+                             WHERE description IS NOT NULL AND description <> ''"""):
+        lp = org_canon.get(r["id"])
+        chunks.append(_mk(f"organizations.description:{r['id']}", lp, name.get(lp), None,
+                          "org_note", to_epoch(r["updated_at"]), r["description"], "organizations", r["id"]))
+
+    # fundraising_investors.notes: outreach log, one chunk per non-empty line (key uses the line index, so inserting a line shifts later keys)
+    for r in conn.execute("""SELECT id, notes, updated_at FROM fundraising_investors
+                             WHERE notes IS NOT NULL AND notes <> ''"""):
+        lp = inv_canon.get(r["id"])
+        for i, line in enumerate(str(r["notes"]).splitlines()):
+            if line.strip():
+                chunks.append(_mk(f"fundraising_investors.notes:{r['id']}:{i}", lp, name.get(lp), None,
+                                  "outreach_note", to_epoch(r["updated_at"]), line, "fundraising_investors", r["id"]))
+
+    # MATCHED emails (only matched rows carry a body; key lp via email_investor_links)
+    if _has_table(conn, "emails") and _has_table(conn, "email_investor_links"):
+        for r in conn.execute("""SELECT id, subject, body_text, snippet, sent_at FROM emails WHERE is_matched=1"""):
+            lp, lp_name = _email_lp(conn, r["id"], inv_canon, org_canon, person_canon, name)
+            text = "\n".join(p for p in (r["subject"], r["body_text"] or r["snippet"]) if (p or "").strip())
+            chunks.append(_mk(f"emails:{r['id']}", lp, lp_name, None, "email",
+                              to_epoch(r["sent_at"]), text, "emails", r["id"]))
+
+    return [c for c in chunks if c]
+
+
+def _has_table(conn, name):
+    return conn.execute("SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (name,)).fetchone() is not None
+
+
+def _email_lp(conn, email_id, inv_canon, org_canon, person_canon, name):
+    """Resolve a matched email's lp_id via email_investor_links, precedence:
+    fundraising_investor -> contact -> organization."""
+    row = conn.execute("""SELECT fundraising_investor_id, contact_id, organization_id
+                          FROM email_investor_links WHERE email_id=? ORDER BY match_confidence DESC LIMIT 1""",
+                       (email_id,)).fetchone()
+    if not row:
+        return None, None
+    lp = (inv_canon.get(row["fundraising_investor_id"]) or person_canon.get(row["contact_id"])
+          or org_canon.get(row["organization_id"]))
+    return lp, name.get(lp)
+
+
+if __name__ == "__main__":
+    import argparse
+    from collections import Counter
+    from config import DEFAULT_DB
+
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--db", default=DEFAULT_DB)
+    args = ap.parse_args()
+    conn = sqlite3.connect(args.db)
+    conn.row_factory = sqlite3.Row
+    chunks = build_chunks(conn)
+    print(f"{len(chunks)} chunks from {args.db}")
+    for dt, n in Counter(c["doc_type"] for c in chunks).most_common():
+        print(f"  {dt:<16} {n}")
+    unresolved = sum(1 for c in chunks if not c["lp_id"])  # always 0: _mk() drops rows without an lp_id
+    print(f"  (all chunks have an lp_id: {unresolved == 0})")
+    print("\nSample chunk:")
+    s = chunks[0] if chunks else {}  # guard: an empty DB yields no chunks
+    print({k: (v[:80] + '…' if k == 'text' and v and len(v) > 80 else v) for k, v in s.items()})
@@ -0,0 +1,28 @@
+"""Ingest config — loads .env and exposes the Spark/Qdrant/CRM settings."""
+import os
+
+_ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+
+def load_env(path=None):
+    path = path or os.path.join(_ROOT, ".env")
+    if not os.path.exists(path):
+        return
+    with open(path, "r", encoding="utf-8") as fh:
+        for line in fh:
+            line = line.strip()
+            if not line or line.startswith("#") or "=" not in line:
+                continue
+            k, v = line.split("=", 1)
+            os.environ.setdefault(k.strip(), v.strip().strip("'\""))  # tolerate quoted values
+
+
+load_env()
+
+SPARK_CONTROL_URL = os.environ.get("SPARK_CONTROL_URL", "").rstrip("/")
+SPARK_VERIFY_TLS = os.environ.get("SPARK_CONTROL_VERIFY_TLS", "false").lower() in ("1", "true", "yes", "on")
+QDRANT_URL = os.environ.get("QDRANT_URL", "").rstrip("/")
+COLLECTION = os.environ.get("CRM_QDRANT_COLLECTION", "crm_chunks")
+EMBED_MODEL = os.environ.get("CRM_EMBED_MODEL", "BAAI/bge-m3")
+DENSE_DIM = int(os.environ.get("CRM_EMBED_DIM", "1024"))
+DEFAULT_DB = os.environ.get("CRM_DEV_DB_PATH", os.path.join(_ROOT, "data", "crm_dev.db"))
@@ -0,0 +1,17 @@
+"""Dense embeddings via Spark Control /v1/embeddings (BAAI/bge-m3, 1024-d)."""
+import config
+import http_util
+
+
+def dense_embed(texts, batch=32):
+    out = []
+    for i in range(0, len(texts), batch):
+        group = texts[i:i + batch]
+        status, data = http_util.request(
+            "POST", f"{config.SPARK_CONTROL_URL}/v1/embeddings",
+            {"input": group, "model": config.EMBED_MODEL}, verify=config.SPARK_VERIFY_TLS)
+        if status != 200:
+            raise RuntimeError(f"/v1/embeddings -> {status}: {data}")
+        rows = sorted(data["data"], key=lambda d: d["index"])
+        out.extend(r["embedding"] for r in rows)
+    return out
@@ -0,0 +1,258 @@
+#!/usr/bin/env python3
+"""Phase-0 Workstream B3 / A4 — entity resolution (deterministic tier).
+
+Collapses the CRM's two parallel investor models into the canonical identity
+layer created by migration 0001:
+
+    organizations          ─┐
+    fundraising_investors  ─┴─►  canonical_entities (entity_kind = lp | organization)
+    contacts               ─┐
+    fundraising_contacts   ─┴─►  canonical_entities (entity_kind = person)
+    lp_profiles            ───►  linked to its contact's person entity
+
+Every source row is recorded in `entity_links` so any name variant resolves to
+one canonical id. This is the DETERMINISTIC tier — it merges only what we can
+prove (exact email; exact normalized name within the same canonical org). The
+HARD cases (nicknames like "Jon" vs "Jonathan", typos) are NOT guessed; they are
+emitted as *fuzzy candidates* for the local-Qwen tier (Spark Control
+/v1/chat/completions) to adjudicate later. Honest separation: we never silently
+merge on a guess.
+
+Properties:
+  * Local-only, read-mostly: reads CRM source tables, writes only the derived
+    canonical_entities / entity_links and an interaction_log audit row. Never
+    mutates a CRM source record (guardrail #2/#3).
+  * Idempotent: canonical ids are deterministic (sha1 of the resolution key), so
+    re-running upserts in place and keeps ids stable across runs — which keeps
+    downstream Qdrant point ids valid (no churn on re-embed).
+  * Logged: writes one interaction_log row per run (guardrail #5).
+
+Usage:
+    python3 backend/ingest/entity_resolution.py --db data/crm_dev.db
+    python3 backend/ingest/entity_resolution.py --db data/crm_dev.db --show-candidates
+"""
+import argparse
+import hashlib
+import json
+import re
+import sqlite3
+import uuid
+from collections import defaultdict
+from datetime import datetime, timezone
+
+
+# ── normalization ─────────────────────────────────────────────────────────────
+
+def norm_text(s: str) -> str:
+    s = (s or "").strip().lower()
+    s = re.sub(r"[^\w\s]", " ", s)
+    return re.sub(r"\s+", " ", s).strip()
+
+
+def norm_email(s: str) -> str:
+    return (s or "").strip().lower()
+
+
+def _eid(prefix: str, key: str) -> str:
+    """Deterministic canonical id: stable across runs for the same resolution key."""
+    return f"{prefix}_{hashlib.sha1(key.encode('utf-8')).hexdigest()[:12]}"
+
+
+def _now() -> str:
+    return datetime.now(timezone.utc).isoformat()
+
+
+def _split_name(full: str):
+    parts = norm_text(full).split()
+    if not parts:
+        return "", ""
+    return parts[0], parts[-1] if len(parts) > 1 else ""
+
+
+# ── upsert helpers ────────────────────────────────────────────────────────────
+
+def _upsert_entity(conn, eid, kind, display_name, primary_email):
+    conn.execute(
+        """
+        INSERT INTO canonical_entities (id, entity_kind, display_name, primary_email, source, created_at, updated_at)
+        VALUES (?, ?, ?, ?, 'entity_resolution', ?, ?)
+        ON CONFLICT(id) DO UPDATE SET
+            display_name  = excluded.display_name,
+            primary_email = COALESCE(excluded.primary_email, canonical_entities.primary_email),
+            entity_kind   = excluded.entity_kind,
+            updated_at    = excluded.updated_at
+        """,
+        (eid, kind, display_name, primary_email or None, _now(), _now()),
+    )
+
+
+def _link(conn, canonical_id, source_model, source_id, match_value, match_kind, confidence):
+    conn.execute(
+        """
+        INSERT INTO entity_links (id, canonical_id, source_model, source_id, match_value, match_kind, confidence, created_at)
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+        ON CONFLICT(source_model, source_id, match_value) DO UPDATE SET
+            canonical_id = excluded.canonical_id,
+            match_kind   = excluded.match_kind,
+            confidence   = excluded.confidence
+        """,
+        (str(uuid.uuid4()), canonical_id, source_model, source_id, match_value, match_kind, confidence, _now()),
+    )
+
+
+# ── resolution passes ─────────────────────────────────────────────────────────
+
+def resolve_organizations(conn):
+    """Merge organizations + fundraising_investors by normalized name.
+
+    Returns (org_canon_by_orgid, org_canon_by_fundinv) so the people pass can
+    attach each person to their firm's canonical id.
+    """
+    groups = defaultdict(lambda: {"orgs": [], "investors": [], "name": "", "email": ""})
+
+    for r in conn.execute("SELECT id, name, email FROM organizations"):
+        key = norm_text(r["name"])
+        if not key:
+            continue
+        g = groups[key]
+        g["orgs"].append(r["id"])
+        if len(r["name"] or "") > len(g["name"]):
+            g["name"] = r["name"]
+        if not g["email"] and (r["email"] or "").strip():
+            g["email"] = r["email"].strip()
+
+    for r in conn.execute("SELECT id, investor_name FROM fundraising_investors"):
+        key = norm_text(r["investor_name"])
+        if not key:
+            continue
+        g = groups[key]
+        g["investors"].append(r["id"])
+        if not g["name"]:
+            g["name"] = r["investor_name"]
+
+    org_canon_by_orgid, org_canon_by_fundinv = {}, {}
+    for key, g in groups.items():
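+        # An org we are actively raising from (has a fundraising row) is an 'lp', otherwise a plain 'organization'.
+        # NOTE: the id prefix follows the kind, so an org that later gains a fundraising row moves from org_… to lp_….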
+        kind = "lp" if g["investors"] else "organization"
+        cid = _eid("lp" if kind == "lp" else "org", key)
+        _upsert_entity(conn, cid, kind, g["name"], g["email"])
+        for oid in g["orgs"]:
+            _link(conn, cid, "organizations", oid, key, "exact_name", 1.0)
+            org_canon_by_orgid[oid] = cid
+        for iid in g["investors"]:
+            _link(conn, cid, "fundraising_investors", iid, key, "exact_name", 1.0)
+            org_canon_by_fundinv[iid] = cid
+
+    return org_canon_by_orgid, org_canon_by_fundinv
+
+
+def resolve_people(conn, org_canon_by_orgid, org_canon_by_fundinv):
+    """Merge contacts + fundraising_contacts by exact email, else exact name within the same
+    canonical org (rows with no resolved org share one bucket per name). Returns canonical id -> person metadata."""
+    # gather (model, source_id, full_name, email, org_canon)
+    people = []
+    for r in conn.execute("SELECT id, first_name, last_name, email, organization_id FROM contacts"):
+        full = f"{r['first_name'] or ''} {r['last_name'] or ''}".strip()
+        people.append(("contacts", r["id"], full, norm_email(r["email"]),
+                       org_canon_by_orgid.get(r["organization_id"])))
+    for r in conn.execute("SELECT id, full_name, email, investor_id FROM fundraising_contacts"):
+        people.append(("fundraising_contacts", r["id"], r["full_name"] or "", norm_email(r["email"]),
+                       org_canon_by_fundinv.get(r["investor_id"])))
+
+    contact_to_person = {}
+    person_meta = {}  # canonical_id -> {"org": org_canon, "last": last_norm, "name": display, "email": email}
+
+    for model, sid, full, email, org_canon in people:
+        name_norm = norm_text(full)
+        if email:
+            key = f"e|{email}"
+            match_kind, conf, match_value = "exact_email", 1.0, email
+        elif name_norm:
+            key = f"n|{name_norm}|{org_canon or ''}"
+            match_kind, conf, match_value = "name_org", 0.8, name_norm
+        else:
+            continue
+        cid = _eid("per", key)
+        display = full.strip() or email
+        _upsert_entity(conn, cid, "person", display, email)
+        _link(conn, cid, model, sid, match_value, match_kind, conf)
+        if model == "contacts":
+            contact_to_person[sid] = cid
+        meta = person_meta.setdefault(cid, {"org": org_canon, "last": _split_name(full)[1],
+                                             "name": display, "email": email})
+        if org_canon and not meta["org"]:
+            meta["org"] = org_canon
+
+    # lp_profiles -> the person entity of its contact
+    for r in conn.execute("SELECT id, contact_id FROM lp_profiles"):
+        cid = contact_to_person.get(r["contact_id"])
+        if cid:
+            _link(conn, cid, "lp_profiles", r["id"], r["contact_id"], "contact_fk", 1.0)
+
+    return person_meta
+
+
+def find_fuzzy_candidates(person_meta):
+    """Distinct person entities sharing the same canonical org AND surname are
+    likely the same individual under a name variant (e.g. Jon/Jonathan). Emit them
+    for the local-Qwen tier; do NOT merge here."""
+    by_org_last = defaultdict(list)
+    for cid, m in person_meta.items():
+        if m["org"] and m["last"]:
+            by_org_last[(m["org"], m["last"])].append((cid, m["name"], m["email"]))
+    return [{"org": org, "surname": last, "members": members}
+            for (org, last), members in by_org_last.items() if len(members) > 1]
+
+
+def run(db_path: str):
+    conn = sqlite3.connect(db_path)
+    conn.row_factory = sqlite3.Row
+    conn.execute("PRAGMA foreign_keys=ON")
+
+    org_by_oid, org_by_inv = resolve_organizations(conn)
+    conn.commit()
+    person_meta = resolve_people(conn, org_by_oid, org_by_inv)
+    conn.commit()
+    candidates = find_fuzzy_candidates(person_meta)
+
+    counts = {
+        "canonical_total": conn.execute("SELECT COUNT(*) FROM canonical_entities").fetchone()[0],
+        "lp": conn.execute("SELECT COUNT(*) FROM canonical_entities WHERE entity_kind='lp'").fetchone()[0],
+        "organization": conn.execute("SELECT COUNT(*) FROM canonical_entities WHERE entity_kind='organization'").fetchone()[0],
+        "person": conn.execute("SELECT COUNT(*) FROM canonical_entities WHERE entity_kind='person'").fetchone()[0],
+        "links": conn.execute("SELECT COUNT(*) FROM entity_links").fetchone()[0],
+        "fuzzy_candidates": len(candidates),
+    }
+
+    conn.execute(
+        """
+        INSERT INTO interaction_log (id, ts, actor_type, actor_id, action, target_type, payload, source, created_at)
+        VALUES (?, ?, 'system', 'entity_resolver', 'entity_resolution.run', 'canonical_entities', ?, 'ingest', ?)
+        """,
+        (str(uuid.uuid4()), _now(), json.dumps(counts), _now()),
+    )
+    conn.commit()
+    conn.close()
+    return counts, candidates
+
+
+def main():
+    ap = argparse.ArgumentParser(description="Deterministic entity resolution into the canonical layer.")
+    ap.add_argument("--db", default="data/crm_dev.db", help="path to the CRM SQLite DB")
+    ap.add_argument("--show-candidates", action="store_true", help="print fuzzy merge candidates")
+    args = ap.parse_args()
+
+    counts, candidates = run(args.db)
+    print(f"Entity resolution on {args.db}:")
+    for k, v in counts.items():
+        print(f"  {k:<18} {v}")
+    if args.show_candidates and candidates:
+        print("\nFuzzy candidates (same org + surname, different person — for the local-Qwen tier):")
+        for c in candidates:
+            names = ", ".join(f"{n!r}{(' <'+e+'>') if e else ''}" for _, n, e in c["members"])
+            print(f"  [{c['surname']}] {names}")
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,35 @@
+"""Tiny stdlib JSON HTTP client (no third-party deps).
+
+Handles the Spark Control self-signed cert (verify=False) and plain-HTTP Qdrant.
+"""
+import json
+import ssl
+import urllib.error
+import urllib.request
+
+
+def _ctx(verify: bool):
+    if verify:
+        return None
+    ctx = ssl.create_default_context()
+    ctx.check_hostname = False
+    ctx.verify_mode = ssl.CERT_NONE
+    return ctx
+
+
+def request(method: str, url: str, body=None, verify: bool = True, timeout: int = 180):
+    data = json.dumps(body).encode("utf-8") if body is not None else None
+    req = urllib.request.Request(url, data=data, method=method,
+                                 headers={"Content-Type": "application/json"})
+    ctx = _ctx(verify) if url.lower().startswith("https") else None
+    try:
+        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
+            raw = resp.read()
+            return resp.status, (json.loads(raw) if raw else {})
+    except urllib.error.HTTPError as exc:
+        raw = exc.read()
+        try:
+            payload = json.loads(raw) if raw else {}
+        except Exception:
+            payload = {"raw": raw.decode("utf-8", "replace")}
+        return exc.code, payload
@@ -0,0 +1,50 @@
+"""Minimal Qdrant REST client for the ingest pipeline (direct to QDRANT_URL).
+
+Creates the crm_chunks collection per EMBEDDINGS.md: a named dense vector
+(1024, Cosine) + a named sparse vector with modifier:idf, plus payload indexes.
+"""
+import config
+import http_util
+
+Q = config.QDRANT_URL
+COL = config.COLLECTION
+
+
+def _req(method, path, body=None):
+    return http_util.request(method, f"{Q}{path}", body, verify=False)
+
+
+def exists() -> bool:
+    status, _ = _req("GET", f"/collections/{COL}")
+    return status == 200
+
+
+def create_collection(recreate=False, dim=config.DENSE_DIM):
+    if exists():
+        if not recreate:
+            return "exists"
+        _req("DELETE", f"/collections/{COL}")
+    status, data = _req("PUT", f"/collections/{COL}", {
+        "vectors": {"dense": {"size": dim, "distance": "Cosine"}},
+        "sparse_vectors": {"sparse": {"modifier": "idf"}},
+    })
+    if status not in (200, 201):
+        raise RuntimeError(f"create collection -> {status}: {data}")
+    return "created"
+
+
+def ensure_indexes():
+    for field, schema in (("lp_id", "keyword"), ("doc_type", "keyword"), ("date_ts", "integer")):
+        _req("PUT", f"/collections/{COL}/index", {"field_name": field, "field_schema": schema})
+
+
+def upsert(points):
+    status, data = _req("PUT", f"/collections/{COL}/points?wait=true", {"points": points})
+    if status not in (200, 201):
+        raise RuntimeError(f"upsert -> {status}: {data}")
+    return data
+
+
+def count():
+    status, data = _req("POST", f"/collections/{COL}/points/count", {"exact": True})
+    return ((data or {}).get("result") or {}).get("count") if status == 200 else None
@@ -0,0 +1,109 @@
+#!/usr/bin/env python3
+"""Phase-0 retrieval — thin wrappers over Spark Control /api/search.
+
+These are the retrieval modes the CRM MCP server (Workstream C) will expose:
+  * semantic_search  — dense only (omit sparse), high recall
+  * hybrid_search    — dense + BM25 sparse (RRF) + rerank; best for entity queries
+  * keyword_search   — meant to lean on the sparse leg; currently issues the same request as hybrid_search
+All support a Qdrant `filter` (e.g. lp_id / date_ts range) to pre-filter.
+
+`--demo` runs an entity-heavy query in dense-only vs hybrid to show the BM25
+lexical leg surfacing the right LP. The query's sparse vector uses the SAME
+encoder as ingest (sparse.encode).
+"""
+import argparse
+
+import config
+import http_util
+import sparse
+
+
+def _search(query, sparse_vec=None, rerank=False, top_k=5, lp_id=None, retrieve_n=80, filt=None):
+    body = {"query": query, "collection": config.COLLECTION, "top_k": top_k,
+            "retrieve_n": retrieve_n, "fusion": "rrf", "text_field": "text",
+            "with_payload": True, "rerank": rerank}
+    if sparse_vec is not None:
+        body["sparse"] = {"indices": sparse_vec["indices"], "values": sparse_vec["values"]}
+    # An explicit raw Qdrant filter (filt) wins; otherwise build one from lp_id.
+    if filt is not None:
+        body["filter"] = filt
+    elif lp_id:
+        body["filter"] = {"must": [{"key": "lp_id", "match": {"value": lp_id}}]}
+    status, data = http_util.request("POST", f"{config.SPARK_CONTROL_URL}/api/search",
+                                     body, verify=config.SPARK_VERIFY_TLS)
+    if status != 200:
+        raise RuntimeError(f"/api/search -> {status}: {data}")
+    return data.get("data", [])
+
+
+def semantic_search(query, **kw):
+    return _search(query, sparse_vec=None, rerank=kw.pop("rerank", False), **kw)
+
+
+def hybrid_search(query, **kw):
+    return _search(query, sparse_vec=sparse.encode(query), rerank=kw.pop("rerank", True), **kw)
+
+
+def keyword_search(query, **kw):
+    return _search(query, sparse_vec=sparse.encode(query), rerank=kw.pop("rerank", True), **kw)
+
+
+def _row(r):
+    p = r.get("payload", {}) or {}
+    text = (r.get("text") or p.get("text") or "").replace("\n", " ")
+    return f"{p.get('lp_name', '?'):<22} [{p.get('doc_type', '?'):<13}] {text[:58]}"
+
+
+def _print(title, rows):
+    print(f"\n  {title}")
+    if not rows:
+        print("    (no results)")
+    for i, r in enumerate(rows, 1):
+        print(f"    {i}. score={r.get('score', 0):+.3f}  {_row(r)}")
+
+
+def demo():
+    target = "Cedar Point Capital"
+    q = "Fund III diligence and wire timeline for Cedar Point"
+    print(f"QUERY: {q!r}\nTarget LP: {target}")
+
+    dense = semantic_search(q, top_k=5)
+    hybrid = hybrid_search(q, top_k=5, rerank=False)   # rerank off to isolate the BM25 leg
+    _print("dense-only (semantic):", dense)
+    _print("hybrid (dense + BM25 RRF):", hybrid)
+
+    def first_rank(rows):
+        for i, r in enumerate(rows, 1):
+            if (r.get("payload", {}) or {}).get("lp_name") == target:
+                return i
+        return None
+    print(f"\n  First '{target}' chunk — dense rank: {first_rank(dense)}, hybrid rank: {first_rank(hybrid)}")
+
+    # Pre-filter demo: same query, restricted to one LP's chunks.
+    lp_id = None
+    for r in hybrid:
+        p = r.get("payload", {}) or {}
+        if p.get("lp_name") == target:
+            lp_id = p.get("lp_id")
+            break
+    if lp_id:
+        _print(f"hybrid + payload pre-filter (lp_id={lp_id}):",
+               hybrid_search(q, top_k=5, rerank=True, lp_id=lp_id))
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("query", nargs="?")
+    ap.add_argument("--mode", choices=["semantic", "hybrid", "keyword"], default="hybrid")
+    ap.add_argument("--top-k", type=int, default=5)
+    ap.add_argument("--lp-id")
+    ap.add_argument("--demo", action="store_true")
+    args = ap.parse_args()
+    if args.demo or not args.query:
+        return demo()
+    fn = {"semantic": semantic_search, "hybrid": hybrid_search, "keyword": keyword_search}[args.mode]
+    _print(f"{args.mode}: {args.query!r}", fn(args.query, top_k=args.top_k, lp_id=args.lp_id))
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,40 @@
+"""Client-side BM25 sparse vectors.
+
+EMBEDDINGS.md specifies FastEmbed `Qdrant/bm25` so Qdrant applies IDF (via the
+sparse vector's `modifier: idf`) over OUR corpus. FastEmbed pulls onnxruntime,
+which has no wheel for this Python (3.14) yet, so this module provides a
+dependency-free BM25 term-frequency encoder with the same contract:
+`encode(text) -> {"indices": [...], "values": [...]}`.
+
+Qdrant computes IDF server-side from the stored sparse vectors regardless of how
+indices are assigned, so this is a legitimate corpus-IDF BM25 leg. The ONLY hard
+requirement is that ingest and query use the SAME encoder — they both import this
+one. For production, swap `encode()` for FastEmbed `Qdrant/bm25` (and re-index, so
+ingest and query stay on the same tokenizer).
+"""
+import hashlib
+import math
+import re
+
+_TOKEN_RE = re.compile(r"[a-z0-9]+")
+
+
+def tokenize(text: str):
+    return _TOKEN_RE.findall((text or "").lower())
+
+
+def _index(token: str) -> int:
+    # Stable unsigned 32-bit index for a token (Qdrant sparse indices are u32).
+    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")
+
+
+def encode(text: str):
+    """Return a sparse vector {indices, values}. Value is 1 + ln(tf) (sublinear
+    term frequency); IDF is applied by Qdrant via modifier:idf."""
+    tf = {}
+    for tok in tokenize(text):
+        tf[tok] = tf.get(tok, 0) + 1
+    idx_val = {}
+    for tok, count in tf.items():  # sum on a (rare) 32-bit hash collision instead of overwriting
+        idx_val[_index(tok)] = idx_val.get(_index(tok), 0.0) + 1.0 + math.log(count)
+    return {"indices": list(idx_val.keys()), "values": list(idx_val.values())}
@@ -0,0 +1,3 @@
+"""Ten31 CRM MCP server (Workstream C) — wraps CRM reads, retrieval modes, and
+logged writes for the Claude Agent SDK. No outbound/contact tools (Phase 0 gate).
+"""
@@ -0,0 +1,200 @@
+"""CRM MCP tool logic (Workstream C) — plain functions, transport-agnostic.
+
+Kept separate from the MCP server wiring so it is unit-testable without the MCP
+SDK. Each function returns JSON-serializable dicts. Reads go against the CRM
+SQLite DB by path; retrieval wraps Spark Control /api/search; writes go through
+the interaction_log (guardrail #5).
+
+Tool surface:
+  reads      get_entity, search_records, get_interaction_history
+  retrieval  semantic_search, hybrid_search, keyword_search
+  writes     log_interaction, set_entity_enrichment
+NO outbound/contact tools — that capability is gated to Phase 3.
+"""
+import json
+import os
+import sqlite3
+import sys
+import uuid
+from datetime import datetime, timezone
+
+sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "ingest"))
+import config  # noqa: E402
+import search  # noqa: E402
+
+
+def _conn(db=None):
+    c = sqlite3.connect(db or os.environ.get("CRM_DB_PATH") or config.DEFAULT_DB)
+    c.row_factory = sqlite3.Row
+    c.execute("PRAGMA foreign_keys=ON")
+    return c
+
+
+def _now():
+    return datetime.now(timezone.utc).isoformat()
+
+
+# ── read tools ────────────────────────────────────────────────────────────────
+
+def _contact_ids_for(c, lp_id):
+    """All contact ids belonging to a canonical entity: directly linked contacts
+    plus contacts whose organization resolves to this entity."""
+    ids = {r["source_id"] for r in c.execute(
+        "SELECT source_id FROM entity_links WHERE canonical_id=? AND source_model='contacts'", (lp_id,))}
+    org_src = [r["source_id"] for r in c.execute(
+        "SELECT source_id FROM entity_links WHERE canonical_id=? AND source_model='organizations'", (lp_id,))]
+    if org_src:
+        q = "SELECT id FROM contacts WHERE organization_id IN (%s)" % ",".join("?" * len(org_src))
+        ids.update(r["id"] for r in c.execute(q, org_src))
+    return ids
+
+
+def get_entity(lp_id, db=None):
+    """Fetch a canonical entity + its linked source rows and interaction count."""
+    c = _conn(db)
+    e = c.execute("SELECT * FROM canonical_entities WHERE id=?", (lp_id,)).fetchone()
+    if not e:
+        c.close()
+        return {"error": "not_found", "lp_id": lp_id}
+    out = dict(e)
+    out["links"] = [dict(r) for r in c.execute(
+        "SELECT source_model, source_id, match_kind, confidence FROM entity_links WHERE canonical_id=?", (lp_id,))]
+    cids = _contact_ids_for(c, lp_id)
+    out["interaction_count"] = (c.execute(
+        "SELECT COUNT(*) FROM communications WHERE contact_id IN (%s)" % ",".join("?" * len(cids)),
+        list(cids)).fetchone()[0] if cids else 0)
+    c.close()
+    return out
+
+
+def search_records(query=None, entity_kind=None, limit=20, db=None):
+    """Structured search over canonical entities (name substring + kind)."""
+    c = _conn(db)
+    sql = ("SELECT id, entity_kind, display_name, primary_email, segment, warmth_score "
+           "FROM canonical_entities WHERE deleted_at IS NULL")
+    args = []
+    if entity_kind:
+        sql += " AND entity_kind=?"
+        args.append(entity_kind)
+    if query:
+        sql += " AND lower(display_name) LIKE ?"
+        args.append(f"%{query.lower()}%")
+    sql += " ORDER BY display_name LIMIT ?"
+    args.append(limit)
+    rows = [dict(r) for r in c.execute(sql, args)]
+    c.close()
+    return {"results": rows, "count": len(rows)}
+
+
+def get_interaction_history(lp_id, limit=20, db=None):
+    """Merged, dated interaction history for an entity: communications + grid notes."""
+    c = _conn(db)
+    items = []
+    cids = _contact_ids_for(c, lp_id)
+    if cids:
+        q = ("SELECT type, subject, body, communication_date FROM communications "
+             "WHERE contact_id IN (%s) ORDER BY communication_date DESC LIMIT ?" % ",".join("?" * len(cids)))
+        for r in c.execute(q, [*cids, limit]):
+            items.append({"kind": r["type"], "date": r["communication_date"],
+                          "subject": r["subject"], "text": (r["body"] or "")[:240]})
+    inv_src = [r["source_id"] for r in c.execute(
+        "SELECT source_id FROM entity_links WHERE canonical_id=? AND source_model='fundraising_investors'", (lp_id,))]
+    if inv_src:
+        q = "SELECT notes, updated_at FROM fundraising_investors WHERE id IN (%s)" % ",".join("?" * len(inv_src))
+        for r in c.execute(q, inv_src):
+            if (r["notes"] or "").strip():
+                items.append({"kind": "grid_note", "date": r["updated_at"],
+                              "subject": "Fundraising grid notes", "text": r["notes"][:300]})
+    c.close()
+    items.sort(key=lambda x: (x["date"] or ""), reverse=True)
+    items = items[:limit]  # truncate first so "count" matches the items returned
+    return {"lp_id": lp_id, "items": items, "count": len(items)}
+
+
+# ── retrieval tools (wrap /api/search) ────────────────────────────────────────
+
+def _filter(lp_id=None, doc_type=None, date_from=None, date_to=None):
+    must = []
+    if lp_id:
+        must.append({"key": "lp_id", "match": {"value": lp_id}})
+    if doc_type:
+        must.append({"key": "doc_type", "match": {"value": doc_type}})
+    if date_from is not None or date_to is not None:
+        rng = {}
+        if date_from is not None:
+            rng["gte"] = date_from
+        if date_to is not None:
+            rng["lte"] = date_to
+        must.append({"key": "date_ts", "range": rng})
+    return {"must": must} if must else None
+
+
+def _shape(rows):
+    out = []
+    for r in rows:
+        p = r.get("payload", {}) or {}
+        out.append({"score": r.get("score"), "lp_id": p.get("lp_id"), "lp_name": p.get("lp_name"),
+                    "doc_type": p.get("doc_type"), "date_ts": p.get("date_ts"),
+                    "text": r.get("text") or p.get("text"),
+                    "source": f"{p.get('source_model')}:{p.get('source_id')}"})
+    return out
+
+
+def hybrid_search(query, top_k=8, lp_id=None, doc_type=None, date_from=None, date_to=None):
+    """Dense + BM25 + rerank. Default mode; best for entity-heavy queries."""
+    return {"mode": "hybrid", "query": query,
+            "results": _shape(search.hybrid_search(query, top_k=top_k,
+                              filt=_filter(lp_id, doc_type, date_from, date_to)))}
+
+
+def semantic_search(query, top_k=8, lp_id=None, doc_type=None, date_from=None, date_to=None):
+    """Dense only, high recall."""
+    return {"mode": "semantic", "query": query,
+            "results": _shape(search.semantic_search(query, top_k=top_k,
+                              filt=_filter(lp_id, doc_type, date_from, date_to)))}
+
+
+def keyword_search(query, top_k=8, lp_id=None, doc_type=None, date_from=None, date_to=None):
+    """High-precision lexical (sparse leg + rerank)."""
+    return {"mode": "keyword", "query": query,
+            "results": _shape(search.keyword_search(query, top_k=top_k,
+                              filt=_filter(lp_id, doc_type, date_from, date_to)))}
+
+
+# ── write tools (every write logged — guardrail #5) ───────────────────────────
+
+def log_interaction(action, actor_type="agent", actor_id=None, target_id=None,
+                    target_type="canonical_entity", payload=None, source="mcp", db=None):
+    """Append an entry to the append-only interaction log."""
+    c = _conn(db)
+    iid = str(uuid.uuid4())
+    c.execute("""INSERT INTO interaction_log
+                 (id, ts, actor_type, actor_id, action, target_type, target_id, payload, source, created_at)
+                 VALUES (?,?,?,?,?,?,?,?,?,?)""",
+              (iid, _now(), actor_type, actor_id, action, target_type, target_id,
+               json.dumps(payload) if payload is not None else None, source, _now()))
+    c.commit()
+    c.close()
+    return {"id": iid, "logged": True}
+
+
+_ENRICH_FIELDS = {"thesis_fit", "segment", "accreditation_status", "qp_status",
+                  "warmth_score", "source", "owner_id", "last_touch_at", "notes"}
+
+
+def set_entity_enrichment(lp_id, fields, actor_id="analyst", db=None):
+    """One-way enrichment write INTO the canonical entity (guardrail #8). Logged."""
+    upd = {k: v for k, v in (fields or {}).items() if k in _ENRICH_FIELDS}
+    if not upd:
+        return {"error": "no_valid_fields", "allowed": sorted(_ENRICH_FIELDS)}
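+    c = _conn(db)
+    sets = ", ".join(f"{k}=?" for k in upd) + ", updated_at=?"
+    cur = c.execute(f"UPDATE canonical_entities SET {sets} WHERE id=?", [*upd.values(), _now(), lp_id])
+    if cur.rowcount == 0:
+        # Unknown lp_id: nothing was written, so do not log a phantom enrichment.
+        c.close()
+        return {"error": "not_found", "lp_id": lp_id}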
+    iid = str(uuid.uuid4())
+    c.execute("""INSERT INTO interaction_log
+                 (id, ts, actor_type, actor_id, action, target_type, target_id, payload, source, created_at)
+                 VALUES (?,?,?,?,?,?,?,?,?,?)""",
+              (iid, _now(), "agent", actor_id, "enrichment.written", "canonical_entity", lp_id,
+               json.dumps(upd), "mcp", _now()))
+    c.commit()
+    c.close()
+    return {"lp_id": lp_id, "updated": list(upd.keys()), "log_id": iid}
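Reviewer's sketch, not part of the commit: a self-contained version of the whitelist-then-log write pattern used by `set_entity_enrichment`, run against an in-memory SQLite database. The two-table schema, the `ALLOWED` subset, and the `enrich` helper are hypothetical stand-ins; the real tables have more columns. It checks `rowcount` so that an unknown id yields `not_found` and no log row, and it wraps the update and the log insert in one transaction.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

ALLOWED = {"segment", "warmth_score"}  # hypothetical subset of the real whitelist


def make_db():
    # Minimal stand-in schema for illustration only.
    c = sqlite3.connect(":memory:")
    c.execute("CREATE TABLE canonical_entities (id TEXT PRIMARY KEY, segment TEXT, "
              "warmth_score REAL, updated_at TEXT)")
    c.execute("CREATE TABLE interaction_log (id TEXT PRIMARY KEY, ts TEXT, action TEXT, "
              "target_id TEXT, payload TEXT)")
    c.execute("INSERT INTO canonical_entities (id) VALUES ('lp_demo')")
    c.commit()
    return c


def enrich(c, lp_id, fields):
    # Only whitelisted keys ever reach the SQL string, so the f-string is safe.
    upd = {k: v for k, v in fields.items() if k in ALLOWED}
    if not upd:
        return {"error": "no_valid_fields"}
    ts = datetime.now(timezone.utc).isoformat()
    sets = ", ".join(f"{k}=?" for k in upd) + ", updated_at=?"
    with c:  # one transaction: commit on success, roll back on exception
        cur = c.execute(f"UPDATE canonical_entities SET {sets} WHERE id=?",
                        [*upd.values(), ts, lp_id])
        if cur.rowcount == 0:
            return {"error": "not_found", "lp_id": lp_id}
        log_id = str(uuid.uuid4())
        c.execute("INSERT INTO interaction_log VALUES (?,?,?,?,?)",
                  (log_id, ts, "enrichment.written", lp_id, json.dumps(upd)))
    return {"lp_id": lp_id, "updated": list(upd), "log_id": log_id}


db = make_db()
ok = enrich(db, "lp_demo", {"segment": "family_office", "id": "ignored"})
missing = enrich(db, "lp_nope", {"segment": "x"})
```

The `id` key in the first call is silently dropped by the whitelist, and the second call leaves the log untouched because no row matched.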
@@ -0,0 +1,88 @@
+#!/usr/bin/env python3
+"""Ten31 CRM MCP server (Workstream C).
+
+Exposes CRM reads, retrieval modes, and logged writes to the Claude Agent SDK
+over MCP (stdio). All logic lives in crm_tools.py (tested independently); this
+file is the thin transport wrapper.
+
+Run:
+    pip install mcp            # one-time (MCP Python SDK)
+    CRM_DB_PATH=/data/crm.db python3 backend/mcp/server.py
+
+Register with the Agent SDK / Claude Code as an stdio MCP server pointing at this
+script. NO outbound/contact tools are exposed — that capability is gated to
+Phase 3 behind the compliance review (CLAUDE.md guardrails #4, #6).
+"""
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+import crm_tools as t  # noqa: E402
+
+from mcp.server.fastmcp import FastMCP  # noqa: E402
+
+mcp = FastMCP("ten31-crm")
+
+
+# ── reads ──
+@mcp.tool()
+def get_entity(lp_id: str) -> dict:
+    """Fetch a canonical LP/organization/person entity by id, with its linked
+    source records and interaction count."""
+    return t.get_entity(lp_id)
+
+
+@mcp.tool()
+def search_records(query: str = "", entity_kind: str = "", limit: int = 20) -> dict:
+    """Structured search over canonical entities by name substring and kind
+    ('lp' | 'organization' | 'person')."""
+    return t.search_records(query=query or None, entity_kind=entity_kind or None, limit=limit)
+
+
+@mcp.tool()
+def get_interaction_history(lp_id: str, limit: int = 20) -> dict:
+    """Merged, dated interaction history (communications + fundraising grid notes)
+    for a canonical entity."""
+    return t.get_interaction_history(lp_id, limit=limit)
+
+
+# ── retrieval modes ──
+@mcp.tool()
+def hybrid_search(query: str, top_k: int = 8, lp_id: str = "", doc_type: str = "",
+                  date_from: int = 0, date_to: int = 0) -> dict:
+    """Dense + BM25 + rerank retrieval (default; best for entity-heavy queries).
+    Optional filters: lp_id, doc_type, date_from/date_to (epoch seconds)."""
+    return t.hybrid_search(query, top_k=top_k, lp_id=lp_id or None, doc_type=doc_type or None,
+                           date_from=date_from or None, date_to=date_to or None)
+
+
+@mcp.tool()
+def semantic_search(query: str, top_k: int = 8, lp_id: str = "", doc_type: str = "") -> dict:
+    """Dense-only retrieval (high recall)."""
+    return t.semantic_search(query, top_k=top_k, lp_id=lp_id or None, doc_type=doc_type or None)
+
+
+@mcp.tool()
+def keyword_search(query: str, top_k: int = 8, lp_id: str = "", doc_type: str = "") -> dict:
+    """High-precision lexical retrieval (sparse leg + rerank)."""
+    return t.keyword_search(query, top_k=top_k, lp_id=lp_id or None, doc_type=doc_type or None)
+
+
+# ── writes (logged) ──
+@mcp.tool()
+def log_interaction(action: str, actor_type: str = "agent", actor_id: str = "",
+                    target_id: str = "", payload: dict | None = None, source: str = "mcp") -> dict:
+    """Append an entry to the append-only interaction log (guardrail #5)."""
+    return t.log_interaction(action, actor_type=actor_type, actor_id=actor_id or None,
+                             target_id=target_id or None, payload=payload, source=source)
+
+
+@mcp.tool()
+def set_entity_enrichment(lp_id: str, fields: dict, actor_id: str = "analyst") -> dict:
+    """One-way enrichment write into a canonical entity (thesis_fit, segment,
+    warmth_score, accreditation_status, etc.). Logged automatically."""
+    return t.set_entity_enrichment(lp_id, fields, actor_id=actor_id)
+
+
+if __name__ == "__main__":
+    mcp.run()
@@ -0,0 +1,23 @@
+-- Reversal of 0001_phase0_foundation.sql.
+--
+-- Run manually (it is NOT applied automatically) if the Phase-0 foundation
+-- schema needs to be rolled back. Drops the new tables. The added `deleted_at`
+-- columns on existing tables are left in place by default because they are
+-- inert (nullable, unused unless the DELETE handlers are changed) and dropping
+-- columns requires SQLite >= 3.35; uncomment the DROP COLUMN lines to remove
+-- them on a modern SQLite.
+
+DROP TABLE IF EXISTS relationship_edges;
+DROP TABLE IF EXISTS interaction_log;
+DROP TABLE IF EXISTS entity_links;
+DROP TABLE IF EXISTS canonical_entities;
+
+-- Requires SQLite >= 3.35.0. Safe to leave the columns if unsure.
+-- ALTER TABLE contacts       DROP COLUMN deleted_at;
+-- ALTER TABLE organizations  DROP COLUMN deleted_at;
+-- ALTER TABLE opportunities  DROP COLUMN deleted_at;
+-- ALTER TABLE communications DROP COLUMN deleted_at;
+-- ALTER TABLE lp_profiles    DROP COLUMN deleted_at;
+
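+-- Forget the migration so it can be re-applied. Note: if the deleted_at columns
+-- above were left in place, re-applying will fail with "duplicate column name"
+-- at the ALTER TABLE ... ADD COLUMN statements unless the runner tolerates it.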
+DELETE FROM schema_migrations WHERE filename = '0001_phase0_foundation.sql';
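Reviewer's sketch, not part of the commit: before uncommenting the `DROP COLUMN` lines above, the linked SQLite version can be checked from Python. The helper names are hypothetical; `sqlite3.sqlite_version_info` reports the version of the SQLite library Python is linked against, which may differ from the `sqlite3` CLI on the same host.

```python
import sqlite3


def can_drop_column() -> bool:
    # ALTER TABLE ... DROP COLUMN was added in SQLite 3.35.0.
    return sqlite3.sqlite_version_info >= (3, 35, 0)


def drop_column_works() -> bool:
    # Empirical check on a throwaway in-memory table.
    c = sqlite3.connect(":memory:")
    c.execute("CREATE TABLE t (id TEXT, deleted_at TEXT)")
    try:
        c.execute("ALTER TABLE t DROP COLUMN deleted_at")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        c.close()


print(sqlite3.sqlite_version, "DROP COLUMN supported:", can_drop_column())
```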
@@ -0,0 +1,116 @@
+-- Phase 0 — Workstream A2: foundation schema for the agentic system.
+--
+-- ADDITIVE AND REVERSIBLE ONLY (CLAUDE.md guardrail #3): this migration adds
+-- new tables and new nullable columns alongside the existing CRM. It never
+-- drops, renames, or rewrites existing data. Its reversal is 0001_phase0_foundation.down.sql.
+--
+-- Applied once at startup by backend/core_migrations.py, tracked in the
+-- schema_migrations ledger. Safe to leave in place; the canonical layer it
+-- creates starts EMPTY and is populated later by entity resolution (A4/B3).
+
+-- ============================================================================
+-- 1. canonical_entities — the single, model-agnostic identity for an LP /
+--    organization / person. Both the classic contacts/lp_profiles model and the
+--    fundraising_* grid map INTO this; neither existing model is demoted.
+--    IDs are full-length (e.g. 'lp_' + uuid4 hex), NOT the 8-char truncated
+--    UUIDs used elsewhere in the CRM, so they are safe as the index/payload key.
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS canonical_entities (
+    id                    TEXT PRIMARY KEY,
+    entity_kind           TEXT NOT NULL,            -- 'lp' | 'organization' | 'person'
+    display_name          TEXT NOT NULL,
+    primary_email         TEXT,
+    -- Phase-0 LP/prospect fields (model-agnostic home):
+    thesis_fit            TEXT,
+    segment               TEXT,
+    accreditation_status  TEXT,                     -- free-text until counsel defines the vocabulary (guardrail #6)
+    qp_status             TEXT,
+    warmth_score          REAL,
+    source                TEXT,
+    owner_id              TEXT REFERENCES users(id),
+    last_touch_at         TEXT,
+    notes                 TEXT,
+    created_at            TEXT DEFAULT (datetime('now')),
+    updated_at            TEXT DEFAULT (datetime('now')),
+    deleted_at            TEXT                       -- soft-delete (never hard-delete; guardrail #3)
+);
+CREATE INDEX IF NOT EXISTS idx_canonical_kind  ON canonical_entities(entity_kind);
+CREATE INDEX IF NOT EXISTS idx_canonical_email ON canonical_entities(primary_email);
+CREATE INDEX IF NOT EXISTS idx_canonical_owner ON canonical_entities(owner_id);
+
+-- ============================================================================
+-- 2. entity_links — resolution map. Every source row (a contacts row, a
+--    fundraising_investors row, etc.) and every email/name variant points at the
+--    canonical entity it resolves to. This is how name variants collapse to one id.
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS entity_links (
+    id            TEXT PRIMARY KEY,
+    canonical_id  TEXT NOT NULL REFERENCES canonical_entities(id) ON DELETE CASCADE,
+    source_model  TEXT NOT NULL,    -- contacts|organizations|lp_profiles|fundraising_investors|fundraising_contacts|email_address|alias
+    source_id     TEXT,             -- the local PK in that model (NULL for a bare email/name alias)
+    match_value   TEXT,             -- normalized email or name variant
+    match_kind    TEXT NOT NULL,    -- exact_email|name_variant|domain|manual
+    confidence    REAL DEFAULT 1.0,
+    created_at    TEXT DEFAULT (datetime('now')),
+    UNIQUE(source_model, source_id, match_value)
+);
+CREATE INDEX IF NOT EXISTS idx_entity_links_canonical ON entity_links(canonical_id);
+CREATE INDEX IF NOT EXISTS idx_entity_links_match     ON entity_links(match_value);
+CREATE INDEX IF NOT EXISTS idx_entity_links_source    ON entity_links(source_model, source_id);
+
+-- ============================================================================
+-- 3. interaction_log — APPEND-ONLY record of every agent action and every human
+--    touch (guardrail #5). Distinct from audit_log (which is mutation-diff-only
+--    and has no actor/agent dimension). Nothing in this table is ever updated or
+--    deleted by convention.
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS interaction_log (
+    id           TEXT PRIMARY KEY,
+    ts           TEXT NOT NULL DEFAULT (datetime('now')),  -- event time
+    actor_type   TEXT NOT NULL,    -- human | agent | system
+    actor_id     TEXT,             -- users.id, or an agent name (Scout/Analyst/...)
+    action       TEXT NOT NULL,    -- e.g. note.created | email.matched | enrichment.written | search.run
+    target_type  TEXT,             -- canonical_entity | contact | communication | opportunity | ...
+    target_id    TEXT,             -- canonical_entities.id where possible
+    payload      TEXT,             -- JSON blob with the action detail
+    source       TEXT,             -- crm_ui | mcp | ingest | scout | ...
+    created_at   TEXT DEFAULT (datetime('now'))
+);
+CREATE INDEX IF NOT EXISTS idx_interaction_target ON interaction_log(target_type, target_id);
+CREATE INDEX IF NOT EXISTS idx_interaction_ts     ON interaction_log(ts);
+CREATE INDEX IF NOT EXISTS idx_interaction_actor  ON interaction_log(actor_type, actor_id);
+
+-- ============================================================================
+-- 4. relationship_edges — derived graph of who-knows-whom between canonical
+--    entities. Starts EMPTY; seeded later from email_investor_links + calendar +
+--    X follower overlap (Analyst, Phase 2).
+-- ============================================================================
+CREATE TABLE IF NOT EXISTS relationship_edges (
+    id            TEXT PRIMARY KEY,
+    src_id        TEXT NOT NULL REFERENCES canonical_entities(id) ON DELETE CASCADE,
+    dst_id        TEXT NOT NULL REFERENCES canonical_entities(id) ON DELETE CASCADE,
+    edge_type     TEXT NOT NULL,    -- email_corr | calendar | x_follow | intro | colleague
+    source        TEXT NOT NULL,    -- provenance of this edge
+    strength      REAL DEFAULT 0,
+    directed      INTEGER DEFAULT 0,
+    evidence      TEXT,             -- JSON supporting detail
+    first_seen_at TEXT,
+    last_seen_at  TEXT,
+    created_at    TEXT DEFAULT (datetime('now')),
+    updated_at    TEXT DEFAULT (datetime('now')),
+    UNIQUE(src_id, dst_id, edge_type, source)
+);
+CREATE INDEX IF NOT EXISTS idx_rel_src ON relationship_edges(src_id);
+CREATE INDEX IF NOT EXISTS idx_rel_dst ON relationship_edges(dst_id);
+
+-- ============================================================================
+-- 5. Soft-delete columns on existing tables. Additive nullable columns; the CRM
+--    currently HARD-deletes everywhere (guardrail #3 gap). Adding the column is
+--    safe now; switching the DELETE handlers to set it instead of hard-deleting
+--    is a separate, reviewed code change.
+-- ============================================================================
+ALTER TABLE contacts       ADD COLUMN deleted_at TEXT;
+ALTER TABLE organizations  ADD COLUMN deleted_at TEXT;
+ALTER TABLE opportunities  ADD COLUMN deleted_at TEXT;
+ALTER TABLE communications ADD COLUMN deleted_at TEXT;
+ALTER TABLE lp_profiles    ADD COLUMN deleted_at TEXT;
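Reviewer's sketch, not part of the commit: `backend/core_migrations.py` is not shown in this section, so the following is only a guess at what a ledgered runner of this kind might look like. The in-memory `MIGRATIONS` dict, the `applied_at` column, and the `apply_pending` name are assumptions; only the `schema_migrations` table and its `filename` column come from the migration files above.

```python
import sqlite3

MIGRATIONS = {  # hypothetical stand-in for the .sql files on disk
    "0001_demo.sql": """
        CREATE TABLE IF NOT EXISTS canonical_entities (id TEXT PRIMARY KEY);
        ALTER TABLE contacts ADD COLUMN deleted_at TEXT;
    """,
}


def apply_pending(conn):
    # The ledger records which files have run, so each is applied once.
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations "
                 "(filename TEXT PRIMARY KEY, applied_at TEXT DEFAULT (datetime('now')))")
    done = {r[0] for r in conn.execute("SELECT filename FROM schema_migrations")}
    applied = []
    for name in sorted(MIGRATIONS):
        if name in done:
            continue
        conn.executescript(MIGRATIONS[name])
        conn.execute("INSERT INTO schema_migrations (filename) VALUES (?)", (name,))
        conn.commit()
        applied.append(name)
    return applied


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id TEXT PRIMARY KEY)")  # pre-existing CRM table
first = apply_pending(conn)
second = apply_pending(conn)
columns = [r[1] for r in conn.execute("PRAGMA table_info(contacts)")]
```

The second call is a no-op because of the ledger. Without it, the non-idempotent `ALTER TABLE ... ADD COLUMN` would raise "duplicate column name" on a re-run, which is why the ledger matters more for section 5 than for the `CREATE TABLE IF NOT EXISTS` statements.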
@@ -9,3 +9,4 @@ passlib[bcrypt]==1.7.4
 python-multipart==0.0.9
 aiofiles==23.2.1
 httpx==0.27.0
+cryptography==42.0.5
@@ -0,0 +1,279 @@
+#!/usr/bin/env python3
+"""Seed a SYNTHETIC dev database for Ten31 CRM ingest/retrieval testing.
+
+ALL DATA IS FAKE. No real LP/prospect information appears here (CLAUDE.md
+guardrail #9: Claude works only on synthetic/redacted data). This produces a
+realistic-shaped corpus so the Phase-0 ingest, chunking, and entity-resolution
+work can be developed and tested without ever touching the live CRM.
+
+What it builds (into a SEPARATE dev DB, never crm.db):
+  * The full real schema, via server.init_db() — which also runs the new
+    core migration (backend/migrations/), so the canonical/interaction/graph
+    tables exist.
+  * A classic-model dataset: organizations, contacts (investors + prospects),
+    opportunities across pipeline stages, communications with entity-rich prose
+    notes, and lp_profiles.
+  * A fundraising grid (fundraising_state.grid_json) populated via the real
+    sync_fundraising_relational() code path, so the normalized mirror + the
+    grid->classic bridge behave exactly as in production.
+  * DELIBERATE entity-resolution test cases: several investors appear in BOTH
+    models with NAME VARIANTS (e.g. "Jonathan Reyes" vs grid contact "Jon
+    Reyes"), some with matching email (easy merge) and some without (hard case).
+
+Usage:
+    python3 backend/scripts/seed_synthetic.py                 # -> data/crm_dev.db
+    python3 backend/scripts/seed_synthetic.py --db /tmp/x.db
+"""
+import argparse
+import datetime
+import json
+import os
+import random
+import sys
+
+HERE = os.path.dirname(os.path.abspath(__file__))
+BACKEND_DIR = os.path.dirname(HERE)
+PROJECT_DIR = os.path.dirname(BACKEND_DIR)
+sys.path.insert(0, BACKEND_DIR)
+
+ORGS = [
+    ("Cedar Point Capital", "family_office"),
+    ("Harbor & Vine Family Office", "family_office"),
+    ("Northwind Endowment", "endowment"),
+    ("Granite Peak Partners", "institutional"),
+    ("Solano Ventures", "wealth_management"),
+    ("Tidewater Holdings", "family_office"),
+    ("Brightwater Capital", "institutional"),
+    ("Meridian Trust", "foundation"),
+]
+
+FIRST = ["Jonathan", "Katherine", "Michael", "William", "Robert", "Elena", "Priya",
+         "David", "Sarah", "James", "Maria", "Thomas", "Laura", "Daniel", "Rachel",
+         "Steven", "Nicole", "Andrew", "Jessica", "Brian"]
+LAST = ["Reyes", "Calder", "Okonkwo", "Brandt", "Sutter", "Vance", "Mehta", "Ellison",
+        "Cho", "Whitlock", "Santos", "Aldridge", "Kerr", "Nilsson", "Pope", "Devlin",
+        "Frye", "Osei", "Lindqvist", "Marsh"]
+NICK = {"Jonathan": "Jon", "Katherine": "Kate", "Michael": "Mike", "William": "Bill",
+        "Robert": "Bob", "James": "Jim", "Thomas": "Tom", "Daniel": "Dan",
+        "Steven": "Steve", "Jessica": "Jess"}
+
+FUND_COLS = ["fund_i", "fund_ii", "fund_iii", "tactical_fund", "pawn_to_e4",
+             "ten31_terahash", "sats_and_stats", "pawn_to_f4", "join_the_fold"]
+FUND_LABELS = {"fund_i": "Fund I", "fund_ii": "Fund II", "fund_iii": "Fund III",
+               "tactical_fund": "Tactical Fund"}
+AMOUNTS = [250_000, 500_000, 1_000_000, 2_500_000, 5_000_000]
+LEADS = ["JK", "Grant", "MB", "Parker"]
+
+COMM_TEMPLATES = [
+    ("call", "Intro call recap",
+     "Spoke with {person} ({org}) for {dur} min about {fund}. Strong interest in the "
+     "bitcoin-energy and AI-infrastructure thesis; wants the latest deck and DPI figures. "
+     "Flagged accreditation paperwork still outstanding. Next: send one-pager and schedule a partner call."),
+    ("email", "Follow-up: {fund} allocation",
+     "Sent {person} the {fund} summary and the scarcity/critical-infrastructure memo. "
+     "They asked how Ten31 Terahash relates to the energy thesis. Following up next week on commitment size."),
+    ("meeting", "Partner meeting notes",
+     "Met {person} at {org}. Discussed pacing into {fund} and co-invest appetite. "
+     "Concern about lockup; reassured on secondary options. Warm — wants to meet the GP again before committing."),
+    ("note", "Diligence status",
+     "{person} is mid-diligence on {fund}. Legal reviewing subscription docs; wire expected within 30 days. "
+     "Keep warm; send the Q update."),
+    ("text", "Quick ping",
+     "Texted {person} re: the {fund} close timeline. Said they're 'in for at least a unit' pending IC approval."),
+]
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--db", default=os.path.join(PROJECT_DIR, "data", "crm_dev.db"))
+    args = ap.parse_args()
+    db = os.path.abspath(args.db)
+    if os.path.basename(db) == "crm.db":
+        sys.exit("Refusing to seed the real crm.db. Use a dev path like data/crm_dev.db.")
+
+    for ext in ("", "-wal", "-shm"):
+        if os.path.exists(db + ext):
+            os.remove(db + ext)
+
+    os.environ["CRM_DB_PATH"] = db
+    import server  # noqa: E402  (must follow CRM_DB_PATH assignment)
+
+    server.init_db()
+    gen, now = server.generate_id, server.now
+    conn = server.get_db()
+    random.seed(31)
+
+    def past(days_ago):
+        ts = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=days_ago)
+        return ts.replace(tzinfo=None).isoformat() + "Z"  # utcnow() is deprecated since Python 3.12
+
+    # ── dev user (FK target for created_by/owner_id) ──
+    uid = gen()
+    conn.execute(
+        "INSERT INTO users (id, username, email, password_hash, full_name, role) VALUES (?,?,?,?,?,?)",
+        (uid, "dev_admin", "dev@example.invalid", server.hash_password("devpassword"),
+         "Dev Admin", "admin"))
+
+    # ── organizations ──
+    org_ids = {}
+    for name, otype in ORGS:
+        oid = gen()
+        org_ids[name] = oid
+        conn.execute(
+            "INSERT INTO organizations (id, name, type, industry, country, description, created_by, updated_at) "
+            "VALUES (?,?,?,?,?,?,?,?)",
+            (oid, name, otype, "Investment Management", "USA",
+             f"{name} — synthetic {otype.replace('_', ' ')} used for ingest testing.", uid, now()))
+
+    # ── classic contacts (investors + prospects) ──
+    contacts = []          # (cid, first, last, org_name, contact_type)
+    overlap_specs = []     # investors we will also place in the grid, with variants
+    used = set()
+    for i, (org_name, _) in enumerate(ORGS):
+        # one "primary" investor contact per org
+        first, last = FIRST[i], LAST[i]
+        used.add((first, last))
+        cid = gen()
+        email = f"{first.lower()}.{last.lower()}@{org_name.split()[0].lower()}.invalid"
+        conn.execute(
+            "INSERT INTO contacts (id, first_name, last_name, email, title, organization_id, contact_type, "
+            "status, source, notes, created_by, updated_at) VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
+            (cid, first, last, email, "Managing Partner", org_ids[org_name], "investor", "active",
+             "referral", f"Primary relationship at {org_name}. Met via conference intro.", uid, now()))
+        contacts.append((cid, first, last, org_name, "investor"))
+        # mark 5 of 8 for grid overlap with a NAME VARIANT
+        if i < 5:
+            variant = NICK.get(first, first[0] + ".")  # nickname or initial
+            match_email = email if i % 2 == 0 else ""    # even i share the email (easy merge, 3 of 5); odd i don't (hard case)
+            overlap_specs.append((org_name, f"{variant} {last}", match_email))
+
+    # extra prospect contacts (no org sometimes)
+    for j in range(12):
+        first = FIRST[(j + 8) % len(FIRST)]
+        last = LAST[(j + 8) % len(LAST)]
+        if (first, last) in used:
+            last = LAST[(j + 11) % len(LAST)]
+        used.add((first, last))
+        org_name = ORGS[j % len(ORGS)][0] if j % 3 else None
+        cid = gen()
+        email = f"{first.lower()}{last.lower()}@example.invalid"
+        conn.execute(
+            "INSERT INTO contacts (id, first_name, last_name, email, title, organization_id, contact_type, "
+            "status, source, notes, created_by, updated_at) VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
+            (cid, first, last, email, "Principal", org_ids.get(org_name) if org_name else None,
+             "prospect", "active", random.choice(["inbound", "referral", "conference", "x"]),
+             f"Prospect sourced via {random.choice(['X DM', 'warm intro', 'podcast'])}.", uid, now()))
+        contacts.append((cid, first, last, org_name, "prospect"))
+
+    # ── opportunities + lp_profiles + communications ──
+    stages = server.PIPELINE_STAGES
+    for idx, (cid, first, last, org_name, ctype) in enumerate(contacts):
+        person = f"{first} {last}"
+        # opportunity for most contacts
+        if idx % 5 != 4:
+            stage = stages[idx % len(stages)]
+            fund_label = random.choice(list(FUND_LABELS.values()))
+            amt = random.choice(AMOUNTS)
+            conn.execute(
+                "INSERT INTO opportunities (id, name, contact_id, organization_id, stage, commitment_amount, "
+                "expected_amount, probability, fund_name, description, next_step, owner_id, priority, updated_at) "
+                "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
+                (gen(), f"{org_name or person} — {fund_label}", cid, org_ids.get(org_name) if org_name else None,
+                 stage, amt if stage in ("committed", "funded") else 0, amt,
+                 {"lead": 10, "outreach": 25, "meeting": 40, "due_diligence": 60, "committed": 90, "funded": 100}.get(stage, 0),
+                 fund_label, f"Potential {fund_label} allocation for {person}.",
+                 random.choice(["Send deck", "Schedule call", "Await IC", "Send subdocs"]),
+                 uid, random.choice(["low", "medium", "high"]), now()))
+
+        # lp_profile for ~closed investors
+        if ctype == "investor" and idx % 2 == 0:
+            amt = random.choice(AMOUNTS)
+            conn.execute(
+                "INSERT INTO lp_profiles (id, contact_id, commitment_amount, funded_amount, commitment_date, "
+                "fund_name, investor_type, accredited, legal_docs_signed, wire_received, k1_sent, notes, updated_at) "
+                "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)",
+                (gen(), cid, amt, amt if idx % 4 == 0 else 0, past(120),
+                 random.choice(list(FUND_LABELS.values())),
+                 random.choice(["family_office", "institutional", "endowment", "individual"]),
+                 1, 1 if idx % 3 else 0, 1 if idx % 4 == 0 else 0, 0,
+                 f"Closed LP. Accreditation on file. Primary contact {person}.", now()))
+
+        # 2-4 communications each, entity-rich prose
+        for k in range(random.randint(2, 4)):
+            ctype_comm, subj, body = random.choice(COMM_TEMPLATES)
+            fund = random.choice(["Fund III", "Tactical Fund", "Ten31 Terahash", "Fund II"])
+            conn.execute(
+                "INSERT INTO communications (id, contact_id, type, subject, body, communication_date, "
+                "duration_minutes, outcome, next_action, created_by, updated_at) VALUES (?,?,?,?,?,?,?,?,?,?,?)",
+                (gen(), cid, ctype_comm,
+                 subj.format(fund=fund),
+                 body.format(person=person, org=org_name or "their firm", fund=fund, dur=random.choice([20, 30, 45])),
+                 past(random.randint(1, 200)), random.choice([20, 30, 45, None]),
+                 random.choice(["positive", "neutral", "needs follow-up"]),
+                 random.choice(["Send deck", "Schedule call", "Send subdocs", None]),
+                 uid, now()))
+
+    # ── fundraising grid (authoritative grid_json -> real sync path) ──
+    columns = server.DEFAULT_FUNDRAISING_COLUMNS
+    views = server.DEFAULT_GRID_VIEWS
+    rows = []
+
+    # (a) overlap investors — same org as a classic investor, but a NAME-VARIANT contact
+    for org_name, variant_name, match_email in overlap_specs:
+        row = {"id": "row-" + gen(), "investor_name": org_name, "lead": random.choice(LEADS),
+               "lead_source": random.choice(["Conference", "Warm intro", "X"]),
+               "notes": f"[call] {variant_name}: discussed Fund III pacing and co-invest. Warm.\n"
+                        f"[email] {variant_name}: sent the energy-thesis memo.",
+               "priority": random.random() < 0.4, "follow_up": random.random() < 0.5, "graveyard": False,
+               "contacts": [{"name": variant_name,
+                             "email": match_email or f"{variant_name.split()[0].lower()}@{org_name.split()[0].lower()}.invalid",
+                             "title": "Managing Partner"}]}
+        for fc in random.sample(FUND_COLS, k=random.randint(1, 3)):
+            row[fc] = random.choice(AMOUNTS)
+        rows.append(row)
+
+    # (b) grid-only investors (no classic counterpart) — exercise the create path
+    for n in range(7):
+        nm = f"{random.choice(['Slate', 'Copper', 'Ridgeline', 'Anchor', 'Falcon', 'Quarry', 'Beacon'])} " \
+             f"{random.choice(['Capital', 'Partners', 'Holdings', 'Group'])}"
+        row = {"id": "row-" + gen(), "investor_name": nm, "lead": random.choice(LEADS),
+               "lead_source": random.choice(["Inbound", "Referral", "Podcast"]),
+               "notes": f"[note] First touch with {nm}. Sourced via X. Gauging thesis fit.",
+               "priority": False, "follow_up": random.random() < 0.6,
+               "graveyard": n >= 5,  # a couple in the graveyard list
+               "contacts": [{"name": f"{random.choice(FIRST)} {random.choice(LAST)}",
+                             "email": f"contact{n}@{nm.split()[0].lower()}.invalid", "title": "Partner"}]}
+        for fc in random.sample(FUND_COLS, k=random.randint(0, 2)):
+            row[fc] = random.choice(AMOUNTS)
+        rows.append(row)
+
+    grid = {"columns": columns, "rows": rows}
+    conn.execute(
+        "INSERT INTO fundraising_state (id, grid_json, views_json, version, updated_by, created_at, updated_at) "
+        "VALUES ('main', ?, ?, 1, ?, ?, ?)",
+        (json.dumps(grid), json.dumps(views), uid, now(), now()))
+    server.sync_fundraising_relational(conn, grid, views, actor_user_id=uid)
+    conn.commit()
+
+    # ── summary ──
+    def count(t):
+        return conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
+
+    print(f"\nSynthetic dev DB written to: {db}")
+    print("  Classic model:")
+    for t in ("organizations", "contacts", "opportunities", "communications", "lp_profiles"):
+        print(f"    {t:<24} {count(t)}")
+    print("  Fundraising grid (after real sync):")
+    for t in ("fundraising_investors", "fundraising_contacts", "fundraising_funds",
+              "fundraising_commitments", "fundraising_list_memberships"):
+        print(f"    {t:<24} {count(t)}")
+    print("  Phase-0 foundation tables (from migration, empty until entity resolution):")
+    for t in ("canonical_entities", "entity_links", "interaction_log", "relationship_edges"):
+        print(f"    {t:<24} {count(t)}")
+    total_contacts = count("contacts")  # the grid bridge may have created extra investor contacts (the variants)
+    print(f"\n  Entity-resolution test bed: {len(overlap_specs)} investors intentionally appear in BOTH models "
+          f"with name variants; total contacts now {total_contacts} (including any variant rows the grid bridge added).")
+    conn.close()
+
+
+if __name__ == "__main__":
+    main()
@@ -18,7 +18,7 @@ import re
 import base64
 import threading
 from datetime import datetime, timedelta
-from http.server import HTTPServer, BaseHTTPRequestHandler
+from http.server import HTTPServer, ThreadingHTTPServer, BaseHTTPRequestHandler
 from urllib.parse import urlparse, parse_qs, unquote
 from functools import wraps

@@ -52,6 +52,14 @@ CORS_ORIGIN = os.environ.get("CRM_CORS_ORIGIN", "*")
 ENV = os.environ.get("CRM_ENV", "development")
 LOGIN_RATE_LIMIT_PER_MIN = int(os.environ.get("CRM_LOGIN_RATE_LIMIT_PER_MIN", "20"))
 WRITE_RATE_LIMIT_PER_MIN = int(os.environ.get("CRM_WRITE_RATE_LIMIT_PER_MIN", "300"))
+GET_RATE_LIMIT_PER_MIN = int(os.environ.get("CRM_GET_RATE_LIMIT_PER_MIN", "600"))
+# Auto-ban any IP that racks up too many 404s in a short window — almost always
+# a vulnerability scanner blasting common paths (/.env, /.git/config, /swagger,
+# /actuator/env, wp-json, etc.). Banned IPs get instant 429s with no DB or
+# filesystem work, so they can't keep the single SQLite writer busy.
+ABUSE_404_THRESHOLD = int(os.environ.get("CRM_ABUSE_404_THRESHOLD", "15"))
+ABUSE_404_WINDOW_SEC = int(os.environ.get("CRM_ABUSE_404_WINDOW_SEC", "60"))
+ABUSE_BAN_SEC = int(os.environ.get("CRM_ABUSE_BAN_SEC", "900"))  # 15 minutes
 BACKUP_POLICY_SETTING_KEY = "fundraising_backup_policy"
 DEFAULT_BACKUP_POLICY = {
    "enabled": True,
@@ -127,6 +135,10 @@ def init_db():
            tags TEXT DEFAULT '[]',
            notes TEXT,
            linkedin_url TEXT,
+            city TEXT,
+            state TEXT,
+            country TEXT,
+            location_query TEXT,
            preferred_contact TEXT DEFAULT 'email',
            created_by TEXT REFERENCES users(id),
            created_at TEXT DEFAULT (datetime('now')),
@@ -258,6 +270,7 @@ def init_db():
            investor_name TEXT NOT NULL,
            notes TEXT,
            lead TEXT,
+            lead_source TEXT,
            priority INTEGER DEFAULT 0,
            follow_up INTEGER DEFAULT 0,
            graveyard INTEGER DEFAULT 0,
@@ -337,6 +350,30 @@ def init_db():
            UNIQUE(investor_id, list_key)
        );

+        CREATE TABLE IF NOT EXISTS fundraising_presence (
+            user_id TEXT PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
+            username TEXT NOT NULL,
+            full_name TEXT,
+            active_view TEXT,
+            row_id TEXT,
+            col_id TEXT,
+            is_editing INTEGER DEFAULT 0,
+            cell_key TEXT,
+            last_seen_at TEXT DEFAULT (datetime('now')),
+            expires_at_epoch INTEGER NOT NULL
+        );
+
+        CREATE TABLE IF NOT EXISTS fundraising_cell_locks (
+            cell_key TEXT PRIMARY KEY,
+            row_id TEXT NOT NULL,
+            col_id TEXT NOT NULL,
+            locked_by_user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
+            locked_by_username TEXT NOT NULL,
+            locked_by_full_name TEXT,
+            last_seen_at TEXT DEFAULT (datetime('now')),
+            expires_at_epoch INTEGER NOT NULL
+        );
+
        CREATE TABLE IF NOT EXISTS app_settings (
            key TEXT PRIMARY KEY,
            value_json TEXT NOT NULL,
@@ -363,9 +400,42 @@ def init_db():
        CREATE INDEX IF NOT EXISTS idx_fr_commitments_fund ON fundraising_commitments(fund_id);
        CREATE INDEX IF NOT EXISTS idx_fr_automation_runs_created ON fundraising_automation_runs(created_at);
        CREATE INDEX IF NOT EXISTS idx_fr_memberships_list ON fundraising_list_memberships(list_key);
+        CREATE INDEX IF NOT EXISTS idx_fr_presence_expires ON fundraising_presence(expires_at_epoch);
+        CREATE INDEX IF NOT EXISTS idx_fr_locks_expires ON fundraising_cell_locks(expires_at_epoch);
    """)

+    # Lightweight schema migrations for existing databases.
+    for stmt in [
+        "ALTER TABLE contacts ADD COLUMN city TEXT",
+        "ALTER TABLE contacts ADD COLUMN state TEXT",
+        "ALTER TABLE contacts ADD COLUMN country TEXT",
+        "ALTER TABLE contacts ADD COLUMN location_query TEXT",
+        "ALTER TABLE fundraising_investors ADD COLUMN lead_source TEXT",
+    ]:
+        try:
+            conn.execute(stmt)
+        except sqlite3.OperationalError:
+            pass  # expected when the column already exists (re-run on a migrated DB)
+
+    # ─── Gmail integration migrations (feature-flag-guarded import) ───
+    try:
+        from email_integration.db import apply_migrations as _email_apply_migrations
+        _email_apply_migrations(cursor)
+    except ImportError:
+        pass
+    except Exception as _e:
+        print(f"[email_integration] migration warning: {_e}")
+
    conn.commit()
+
+    # ─── Core schema migrations (Phase 0+; ordered .sql files w/ ledger) ───
+    # Additive/reversible only; tracked in schema_migrations. See core_migrations.py.
+    try:
+        from core_migrations import apply_core_migrations as _apply_core_migrations
+        _apply_core_migrations(conn)
+    except Exception as _e:
+        print(f"[migrations] core migration warning: {_e}")
+
    conn.close()
    print(f"Database initialized at {DB_PATH}")

@@ -563,6 +633,16 @@ def _split_full_name(full_name):
 def _normalize_text(value):
    return str(value or '').strip().lower()

+def _parse_location_text(text):
+    raw = str(text or '').strip()
+    if not raw:
+        return '', '', '', ''
+    parts = [p.strip() for p in raw.split(',') if p.strip()]
+    city = parts[0] if len(parts) >= 1 else ''
+    state = parts[1] if len(parts) >= 2 else ''
+    country = parts[2] if len(parts) >= 3 else ''
+    return city, state, country, raw
+
 def ensure_default_automation_rules(conn):
    defaults = [
        {
@@ -662,6 +742,11 @@ def _upsert_contact_from_fundraising(conn, investor_name, contact, actor_user_id
    full_name = str(contact.get('name') or '').strip()
    email = str(contact.get('email') or '').strip()
    title = str(contact.get('title') or '').strip()
+    source = str(contact.get('source') or '').strip()
+    city = str(contact.get('city') or '').strip()
+    state = str(contact.get('state') or '').strip()
+    country = str(contact.get('country') or '').strip()
+    location_query = str(contact.get('location_query') or '').strip()
    if not full_name and not email:
        return None
    first_name, last_name = _split_full_name(full_name)
@@ -700,20 +785,25 @@ def _upsert_contact_from_fundraising(conn, investor_name, contact, actor_user_id
        next_last = last_name if (last_name or full_name) else str(existing['last_name'] or '')
        next_email = email or str(existing['email'] or '')
        next_title = title or str(existing['title'] or '')
+        next_source = source or str(existing['source'] or '')
+        next_city = city or str(existing['city'] or '')
+        next_state = state or str(existing['state'] or '')
+        next_country = country or str(existing['country'] or '')
+        next_location_query = location_query or str(existing['location_query'] or '')
        next_org = org_id or existing['organization_id']
        conn.execute("""
            UPDATE contacts
            SET first_name = ?, last_name = ?, email = ?, title = ?,
-                organization_id = ?, contact_type = 'investor', updated_at = ?
+                organization_id = ?, source = ?, contact_type = 'investor', city = ?, state = ?, country = ?, location_query = ?, updated_at = ?
            WHERE id = ?
-        """, (next_first, next_last, next_email, next_title, next_org, now(), existing['id']))
+        """, (next_first, next_last, next_email, next_title, next_org, next_source, next_city, next_state, next_country, next_location_query, now(), existing['id']))
        return existing['id']

    contact_id = generate_id()
    conn.execute("""
        INSERT INTO contacts (
-            id, first_name, last_name, email, title, organization_id, contact_type, status, created_by, updated_at
-        ) VALUES (?, ?, ?, ?, ?, ?, 'investor', 'active', ?, ?)
+            id, first_name, last_name, email, title, organization_id, source, contact_type, status, city, state, country, location_query, created_by, updated_at
+        ) VALUES (?, ?, ?, ?, ?, ?, ?, 'investor', 'active', ?, ?, ?, ?, ?, ?)
    """, (
        contact_id,
        first_name or 'Unknown',
@@ -721,6 +811,11 @@ def _upsert_contact_from_fundraising(conn, investor_name, contact, actor_user_id
        email,
        title,
        org_id,
+        source,
+        city,
+        state,
+        country,
+        location_query,
        actor_user_id,
        now()
    ))
@@ -748,6 +843,11 @@ def _sync_contact_to_fundraising_state(conn, contact_row, actor_user_id=None, re
    email = str(contact_row.get('email') or '').strip()
    full_name = ' '.join([str(contact_row.get('first_name') or '').strip(), str(contact_row.get('last_name') or '').strip()]).strip()
    title = str(contact_row.get('title') or '').strip()
+    source = str(contact_row.get('source') or '').strip()
+    city = str(contact_row.get('city') or '').strip()
+    state = str(contact_row.get('state') or '').strip()
+    country = str(contact_row.get('country') or '').strip()
+    location_query = str(contact_row.get('location_query') or '').strip()
    if not full_name and not email:
        return

@@ -814,22 +914,24 @@ def _sync_contact_to_fundraising_state(conn, contact_row, actor_user_id=None, re
                "name": full_name or existing.get('name') or '',
                "email": email or existing.get('email') or '',
                "title": title or existing.get('title') or '',
-                "city": str(existing.get('city') or ''),
-                "state": str(existing.get('state') or ''),
-                "country": str(existing.get('country') or ''),
-                "location_query": str(existing.get('location_query') or '')
+                "city": city or str(existing.get('city') or ''),
+                "state": state or str(existing.get('state') or ''),
+                "country": country or str(existing.get('country') or ''),
+                "location_query": location_query or str(existing.get('location_query') or '')
            }
        else:
            next_contacts.append({
                "name": full_name,
                "email": email,
                "title": title,
-                "city": "",
-                "state": "",
-                "country": "",
-                "location_query": ""
+                "city": city,
+                "state": state,
+                "country": country,
+                "location_query": location_query
            })
        inv['contacts'] = next_contacts
+        if source and not str(inv.get('lead_source') or '').strip():
+            inv['lead_source'] = source
        changed = True

    if not changed:
@@ -901,6 +1003,7 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
        investor_name = str(row.get('investor_name') or '').strip() or 'Untitled Investor'
        notes = str(row.get('notes') or '')
        lead = str(row.get('lead') or '')
+        lead_source = str(row.get('lead_source') or row.get('combined_lead_source') or '').strip()
        total_invested = 0.0
        for _, col in fund_columns:
            total_invested += _to_number(row.get(str(col.get('id'))))
@@ -909,13 +1012,17 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
        investor_id = existing['id'] if existing else generate_id()
        conn.execute("""
            INSERT INTO fundraising_investors (
-                id, investor_name, notes, lead, priority, follow_up, graveyard,
+                id, investor_name, notes, lead, lead_source, priority, follow_up, graveyard,
                source_row_id, total_invested, updated_at
-            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(source_row_id) DO UPDATE SET
                investor_name = excluded.investor_name,
                notes = excluded.notes,
                lead = excluded.lead,
+                lead_source = CASE
+                    WHEN COALESCE(trim(fundraising_investors.lead_source), '') = '' THEN excluded.lead_source
+                    ELSE fundraising_investors.lead_source
+                END,
                priority = excluded.priority,
                follow_up = excluded.follow_up,
                graveyard = excluded.graveyard,
@@ -926,6 +1033,7 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
            investor_name,
            notes,
            lead,
+            lead_source,
            1 if _to_bool(row.get('priority')) else 0,
            1 if _to_bool(row.get('follow_up')) else 0,
            1 if _to_bool(row.get('graveyard')) else 0,
@@ -947,7 +1055,10 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
                email = str(c.get('email') or '').strip()
                if not full_name and not email:
                    continue
-                _upsert_contact_from_fundraising(conn, investor_name, c, actor_user_id=actor_user_id)
+                contact_payload = dict(c)
+                if lead_source and not str(contact_payload.get('source') or '').strip():
+                    contact_payload['source'] = lead_source
+                _upsert_contact_from_fundraising(conn, investor_name, contact_payload, actor_user_id=actor_user_id)
                conn.execute("""
                    INSERT INTO fundraising_contacts (
                        id, investor_id, full_name, email, title, city, state, country, location_query, sort_order, updated_at
@@ -958,7 +1069,7 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
                    str(c.get('location_query') or ''), i, now()
                ))
        elif isinstance(contacts, str) and contacts.strip():
-            _upsert_contact_from_fundraising(conn, investor_name, {"name": contacts.strip(), "email": "", "title": ""}, actor_user_id=actor_user_id)
+            _upsert_contact_from_fundraising(conn, investor_name, {"name": contacts.strip(), "email": "", "title": "", "source": lead_source}, actor_user_id=actor_user_id)
            conn.execute("""
                INSERT INTO fundraising_contacts (
                    id, investor_id, full_name, email, title, city, state, country, location_query, sort_order, updated_at
@@ -1291,6 +1402,7 @@ DEFAULT_FUNDRAISING_COLUMNS = [
    {"id": "contacts", "label": "Contacts", "type": "contacts", "width": 260},
    {"id": "log_action", "label": "Log", "type": "action", "readOnly": True, "width": 90},
    {"id": "notes", "label": "Notes / Communication / Outreach", "type": "longtext", "width": 420},
+    {"id": "lead_source", "label": "Lead Source", "type": "text", "width": 180},
    {"id": "notes_last_modified", "label": "Notes Last Modified", "type": "date", "readOnly": True, "width": 180},
    {"id": "last_communication_date", "label": "Last Communication Date", "type": "date", "readOnly": True, "width": 195},
    {"id": "priority", "label": "Priority", "type": "checkbox", "width": 110},
@@ -1314,7 +1426,12 @@ DEFAULT_FUNDRAISING_ROWS = []

 class CRMHandler(BaseHTTPRequestHandler):
    """Main HTTP request handler for the CRM API."""
-    _rate_limit_buckets = {}
+    # Class-level state shared across all handler threads. Protected by
+    # _abuse_lock; see rate_limited() and record_404() for usage.
+    _rate_limit_buckets = {}   # "scope:ip" -> [timestamps]
+    _404_buckets = {}          # ip -> [timestamps] of recent 404 responses
+    _banned_ips = {}           # ip -> ban_until_epoch
+    _abuse_lock = threading.Lock()

    def log_message(self, format, *args):
        """Override to use cleaner logging."""
@@ -1323,14 +1440,22 @@ class CRMHandler(BaseHTTPRequestHandler):
    # ── Request Parsing ──

    def get_body(self):
+        # Cache the parsed body on the handler instance so repeated calls
+        # don't re-read an already-consumed stream. Handler instances are
+        # one-per-connection; with the default HTTP/1.0 protocol_version that
+        # means one per request, so the cache is effectively request-scoped.
+        if hasattr(self, '_cached_body'):
+            return self._cached_body
        content_length = int(self.headers.get('Content-Length', 0))
        if content_length == 0:
-            return {}
+            self._cached_body = {}
+            return self._cached_body
        body = self.rfile.read(content_length)
        try:
-            return json.loads(body.decode('utf-8'))
+            self._cached_body = json.loads(body.decode('utf-8'))
        except json.JSONDecodeError:
-            return {}
+            self._cached_body = {}
+        return self._cached_body

    def get_query_params(self):
        parsed = urlparse(self.path)
@@ -1373,15 +1498,57 @@ class CRMHandler(BaseHTTPRequestHandler):
    def rate_limited(self, scope, limit_per_minute):
        now_ts = time.time()
        bucket_key = f"{scope}:{self.get_client_ip()}"
-        bucket = self._rate_limit_buckets.get(bucket_key, [])
-        cutoff = now_ts - 60.0
-        bucket = [t for t in bucket if t >= cutoff]
-        if len(bucket) >= max(1, int(limit_per_minute)):
+        with self._abuse_lock:
+            bucket = self._rate_limit_buckets.get(bucket_key, [])
+            cutoff = now_ts - 60.0
+            bucket = [t for t in bucket if t >= cutoff]
+            if len(bucket) >= max(1, int(limit_per_minute)):
+                self._rate_limit_buckets[bucket_key] = bucket
+                return True
+            bucket.append(now_ts)
            self._rate_limit_buckets[bucket_key] = bucket
+            return False
+
+    def is_banned(self):
+        """Return True if the client IP is currently in the abuse blacklist.
+
+        Uses a coarse class-level dict — fine for a handful of scanners hitting
+        a small team CRM. Auto-expires entries when their ban window passes.
+        """
+        ip = self.get_client_ip()
+        now_ts = time.time()
+        with self._abuse_lock:
+            until = self._banned_ips.get(ip)
+            if until is None:
+                return False
+            if now_ts >= until:
+                self._banned_ips.pop(ip, None)
+                return False
            return True
-        bucket.append(now_ts)
-        self._rate_limit_buckets[bucket_key] = bucket
-        return False
+
+    def record_404(self):
+        """Track 404s per IP and auto-ban IPs that exceed the burst threshold.
+
+        Called from send_error_json whenever we send a 404. A scanner probing
+        /.env, /.git/config, /swagger, /actuator/env etc. will trip this fast
+        and get parked on the blacklist for ABUSE_BAN_SEC seconds.
+        """
+        ip = self.get_client_ip()
+        now_ts = time.time()
+        with self._abuse_lock:
+            bucket = self._404_buckets.get(ip, [])
+            cutoff = now_ts - ABUSE_404_WINDOW_SEC
+            bucket = [t for t in bucket if t >= cutoff]
+            bucket.append(now_ts)
+            if len(bucket) >= ABUSE_404_THRESHOLD:
+                self._banned_ips[ip] = now_ts + ABUSE_BAN_SEC
+                self._404_buckets.pop(ip, None)
+                sys.stderr.write(
+                    f"[abuse] Banning {ip} for {ABUSE_BAN_SEC}s after "
+                    f"{len(bucket)} 404s in {ABUSE_404_WINDOW_SEC}s\n"
+                )
+            else:
+                self._404_buckets[ip] = bucket

    # ── Response Helpers ──

@@ -1395,6 +1562,14 @@ class CRMHandler(BaseHTTPRequestHandler):
        self.wfile.write(json.dumps(data, default=str).encode('utf-8'))

    def send_error_json(self, message, status=400):
+        # Record 404s for the abuser-detection blacklist before responding.
+        # We do this on send (not on routing) so that any code path that 404s
+        # contributes to the burst counter, including unknown POST paths.
+        if status == 404:
+            try:
+                self.record_404()
+            except Exception:
+                pass
        self.send_json({"error": message}, status)

    def send_file(self, filepath, content_type='text/html'):
@@ -1412,6 +1587,8 @@ class CRMHandler(BaseHTTPRequestHandler):
    # ── Routing ──

    def do_OPTIONS(self):
+        if self.is_banned():
+            return self.send_error_json("Too many requests", 429)
        self.send_response(200)
        self.send_header('Access-Control-Allow-Origin', CORS_ORIGIN)
        self.send_header('Access-Control-Allow-Methods', 'GET, POST, PUT, PATCH, DELETE, OPTIONS')
@@ -1419,8 +1596,25 @@ class CRMHandler(BaseHTTPRequestHandler):
        self.end_headers()

    def do_GET(self):
+        # Short-circuit known abusers before any work, including file reads.
+        if self.is_banned():
+            return self.send_error_json("Too many requests", 429)
+        # Generic per-IP GET rate limit. Generous enough for a normal user
+        # (page load fans out ~15 GETs + heartbeats every ~6s) but blocks a
+        # scanner blasting hundreds of paths per second.
+        if self.rate_limited('get', GET_RATE_LIMIT_PER_MIN):
+            return self.send_error_json("Too many requests", 429)
+
        path = self.get_path()

+        # ─── Gmail integration routes (feature-flag-guarded) ─────────
+        try:
+            from email_integration.routes import try_handle as _email_try_handle
+            if _email_try_handle(self):
+                return
+        except ImportError:
+            pass
+
        # Serve frontend
        if path == '/' or path == '/index.html':
            return self.send_file(os.path.join(FRONTEND_DIR, 'index.html'))
@@ -1510,6 +1704,8 @@ class CRMHandler(BaseHTTPRequestHandler):
        # Fundraising grid state
        if path == '/api/fundraising/state':
            return self.handle_get_fundraising_state(user)
+        if path == '/api/fundraising/collab/state':
+            return self.handle_get_fundraising_collab_state(user)
        if path == '/api/fundraising/export':
            return self.handle_export_fundraising_state(user)
        if path == '/api/fundraising/backups':
@@ -1538,12 +1734,23 @@ class CRMHandler(BaseHTTPRequestHandler):
        self.send_error_json("Not found", 404)

    def do_POST(self):
+        if self.is_banned():
+            return self.send_error_json("Too many requests", 429)
+
        path = self.get_path()
        body = self.get_body()

        if self.rate_limited('write', WRITE_RATE_LIMIT_PER_MIN):
            return self.send_error_json("Too many requests", 429)

+        # ─── Gmail integration routes (feature-flag-guarded) ─────────
+        try:
+            from email_integration.routes import try_handle as _email_try_handle
+            if _email_try_handle(self):
+                return
+        except ImportError:
+            pass
+
        # Auth (no token needed)
        if path == '/api/auth/login':
            if self.rate_limited('login', LOGIN_RATE_LIMIT_PER_MIN):
@@ -1573,8 +1780,12 @@ class CRMHandler(BaseHTTPRequestHandler):
            return self.handle_create_feature_request(user, body)
        if path == '/api/fundraising/log-communication':
            return self.handle_log_fundraising_communication(user, body)
+        if path == '/api/fundraising/collab/heartbeat':
+            return self.handle_fundraising_collab_heartbeat(user, body)
        if path == '/api/admin/users':
            return self.handle_admin_create_user(user, body)
+        if path == '/api/admin/reset-all-data':
+            return self.handle_admin_reset_all_data(user, body)
        if path == '/api/fundraising/backup':
            return self.handle_backup_fundraising_state(user)
        if path == '/api/fundraising/restore-preview':
@@ -1587,6 +1798,8 @@ class CRMHandler(BaseHTTPRequestHandler):
        self.send_error_json("Not found", 404)

    def do_PUT(self):
+        if self.is_banned():
+            return self.send_error_json("Too many requests", 429)
        path = self.get_path()
        body = self.get_body()
        if self.rate_limited('write', WRITE_RATE_LIMIT_PER_MIN):
@@ -1611,6 +1824,8 @@ class CRMHandler(BaseHTTPRequestHandler):
        self.send_error_json("Not found", 404)

    def do_PATCH(self):
+        if self.is_banned():
+            return self.send_error_json("Too many requests", 429)
        path = self.get_path()
        body = self.get_body()
        if self.rate_limited('write', WRITE_RATE_LIMIT_PER_MIN):
@@ -1637,6 +1852,8 @@ class CRMHandler(BaseHTTPRequestHandler):
        self.send_error_json("Not found", 404)

    def do_DELETE(self):
+        if self.is_banned():
+            return self.send_error_json("Too many requests", 429)
        path = self.get_path()
        if self.rate_limited('write', WRITE_RATE_LIMIT_PER_MIN):
            return self.send_error_json("Too many requests", 429)
@@ -1761,8 +1978,8 @@ class CRMHandler(BaseHTTPRequestHandler):
            args.append(params['status'])
        if params.get('search'):
            search = f"%{params['search']}%"
-            query += " AND (c.first_name LIKE ? OR c.last_name LIKE ? OR c.email LIKE ? OR o.name LIKE ?)"
-            args.extend([search, search, search, search])
+            query += " AND (c.first_name LIKE ? OR c.last_name LIKE ? OR c.email LIKE ? OR o.name LIKE ? OR c.source LIKE ?)"
+            args.extend([search, search, search, search, search])
        if params.get('organization_id'):
            query += " AND c.organization_id = ?"
            args.append(params['organization_id'])
@@ -1772,7 +1989,7 @@ class CRMHandler(BaseHTTPRequestHandler):

        sort = params.get('sort', 'updated_at')
        order = 'DESC' if params.get('order', 'desc').lower() == 'desc' else 'ASC'
-        allowed_sorts = ['first_name', 'last_name', 'email', 'created_at', 'updated_at', 'contact_type']
+        allowed_sorts = ['first_name', 'last_name', 'email', 'created_at', 'updated_at', 'contact_type', 'source']
        if sort in allowed_sorts:
            query += f" ORDER BY c.{sort} {order}"
        else:
@@ -1837,20 +2054,25 @@ class CRMHandler(BaseHTTPRequestHandler):

        contact_id = generate_id()
        conn = get_db()
+        organization_id = body.get('organization_id')
+        if not organization_id and body.get('organization'):
+            organization_id = _ensure_organization_by_name(conn, body.get('organization'), user['user_id'])

        tags = json.dumps(body.get('tags', []))
        conn.execute("""
            INSERT INTO contacts (id, first_name, last_name, email, phone, mobile, title,
                organization_id, contact_type, status, source, tags, notes, linkedin_url,
-                preferred_contact, created_by)
-            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                city, state, country, location_query, preferred_contact, created_by)
+            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            contact_id, body['first_name'], body['last_name'],
            body.get('email'), body.get('phone'), body.get('mobile'),
-            body.get('title'), body.get('organization_id'),
+            body.get('title'), organization_id,
            body.get('contact_type', 'prospect'), body.get('status', 'active'),
            body.get('source'), tags, body.get('notes'),
-            body.get('linkedin_url'), body.get('preferred_contact', 'email'),
+            body.get('linkedin_url'), body.get('city'), body.get('state'),
+            body.get('country'), body.get('location_query'),
+            body.get('preferred_contact', 'email'),
            user['user_id']
        ))

@@ -1879,13 +2101,17 @@ class CRMHandler(BaseHTTPRequestHandler):

        updatable = ['first_name', 'last_name', 'email', 'phone', 'mobile', 'title',
                     'organization_id', 'contact_type', 'status', 'source', 'notes',
-                     'linkedin_url', 'preferred_contact']
+                     'linkedin_url', 'city', 'state', 'country', 'location_query', 'preferred_contact']
        sets = []
        args = []
        for field in updatable:
            if field in body:
                sets.append(f"{field} = ?")
                args.append(body[field])
+        if 'organization' in body and 'organization_id' not in body:
+            org_id = _ensure_organization_by_name(conn, body.get('organization'), user['user_id'])
+            sets.append("organization_id = ?")
+            args.append(org_id)

        if 'tags' in body:
            sets.append("tags = ?")
@@ -2876,12 +3102,17 @@ class CRMHandler(BaseHTTPRequestHandler):
        entity_type = body.get('entity_type', 'contacts')
        mapping = body.get('mapping', {})
        dry_run = body.get('dry_run', False)
+        update_existing = bool(body.get('update_existing', True))
+        action_overrides_raw = body.get('action_overrides', {}) or {}

        if not csv_data:
            return self.send_error_json("No data provided. Send 'data' as array of objects.")

        conn = get_db()
-        results = {"created": 0, "updated": 0, "skipped": 0, "errors": []}
+        results = {"created": 0, "updated": 0, "skipped": 0, "errors": [], "matches": []}
+        # Keep in-memory email matches so dry-run mirrors real behavior for
+        # duplicate emails appearing multiple times in the same CSV batch.
+        batch_email_matches = {}

        try:
            for i, row in enumerate(csv_data):
@@ -2911,11 +3142,44 @@ class CRMHandler(BaseHTTPRequestHandler):
                            continue

                        email = data.get('email', '').strip()
+                        email_key = email.lower()
+                        linkedin_url = data.get('linkedin_url', data.get('linkedin', '')).strip()
+                        city = data.get('city', '').strip()
+                        state = data.get('state', '').strip()
+                        country = data.get('country', '').strip()
+                        location_query = data.get('location_query', '').strip()
+                        raw_location = data.get('location', data.get('city_location', data.get('city/location', ''))).strip()
+                        if raw_location:
+                            p_city, p_state, p_country, p_query = _parse_location_text(raw_location)
+                            city = city or p_city
+                            state = state or p_state
+                            country = country or p_country
+                            location_query = location_query or p_query

                        # Check for existing contact by email
                        existing = None
+                        existing_summary = None
                        if email:
-                            existing = conn.execute("SELECT id FROM contacts WHERE email = ?", (email,)).fetchone()
+                            if email_key in batch_email_matches:
+                                existing_summary = batch_email_matches[email_key]
+                                existing = {"id": existing_summary.get('id')}
+                            else:
+                                existing = conn.execute("""
+                                    SELECT c.id, c.first_name, c.last_name, c.email, o.name as organization_name
+                                    FROM contacts c
+                                    LEFT JOIN organizations o ON c.organization_id = o.id
+                                    WHERE lower(c.email) = lower(?)
+                                    ORDER BY c.updated_at DESC
+                                    LIMIT 1
+                                """, (email,)).fetchone()
+                                if existing:
+                                    existing_summary = {
+                                        "id": existing['id'],
+                                        "name": f"{str(existing['first_name'] or '').strip()} {str(existing['last_name'] or '').strip()}".strip(),
+                                        "email": str(existing['email'] or ''),
+                                        "organization": str(existing['organization_name'] or '')
+                                    }
+                                    batch_email_matches[email_key] = existing_summary

                        # Handle organization
                        org_id = None
@@ -2931,32 +3195,149 @@ class CRMHandler(BaseHTTPRequestHandler):
                                    (org_id, org_name, user['user_id'])
                                )

+                        action_override = None
+                        if isinstance(action_overrides_raw, dict):
+                            action_override = action_overrides_raw.get(str(i + 1)) or action_overrides_raw.get(i + 1)
+                        default_action = 'update' if update_existing else 'skip'
+                        action = action_override if action_override in ('update', 'skip', 'create_duplicate') else default_action
+                        if existing:
+                            incoming_name = f"{first_name} {last_name}".strip()
+                            results['matches'].append({
+                                "row": i + 1,
+                                "incoming_name": incoming_name,
+                                "incoming_email": email,
+                                "incoming_organization": org_name,
+                                "existing_id": existing_summary.get('id') if isinstance(existing_summary, dict) else existing['id'],
+                                "existing_name": existing_summary.get('name') if isinstance(existing_summary, dict) else '',
+                                "existing_email": existing_summary.get('email') if isinstance(existing_summary, dict) else email,
+                                "existing_organization": existing_summary.get('organization') if isinstance(existing_summary, dict) else '',
+                                "default_action": default_action,
+                                "action": action
+                            })
+
                        if not dry_run:
                            if existing:
-                                conn.execute("""
-                                    UPDATE contacts SET first_name=?, last_name=?, phone=?, title=?,
-                                    organization_id=COALESCE(?, organization_id),
-                                    contact_type=COALESCE(?, contact_type), updated_at=?
-                                    WHERE id=?
-                                """, (first_name, last_name, data.get('phone'),
-                                      data.get('title'), org_id,
-                                      data.get('contact_type'), now(), existing['id']))
-                                results['updated'] += 1
+                                if action == 'update':
+                                    conn.execute("""
+                                        UPDATE contacts SET first_name=?, last_name=?, phone=?, title=?,
+                                        organization_id=COALESCE(?, organization_id),
+                                        contact_type=COALESCE(?, contact_type),
+                                        linkedin_url=COALESCE(?, linkedin_url),
+                                        city=COALESCE(?, city),
+                                        state=COALESCE(?, state),
+                                        country=COALESCE(?, country),
+                                        location_query=COALESCE(?, location_query),
+                                        updated_at=?
+                                        WHERE id=?
+                                    """, (first_name, last_name, data.get('phone'),
+                                          data.get('title'), org_id,
+                                          data.get('contact_type'),
+                                          linkedin_url if linkedin_url else None,
+                                          city if city else None,
+                                          state if state else None,
+                                          country if country else None,
+                                          location_query if location_query else None,
+                                      now(), existing['id']))
+                                    if email:
+                                        batch_email_matches[email_key] = {
+                                            "id": existing['id'],
+                                            "name": f"{first_name} {last_name}".strip(),
+                                            "email": email,
+                                            "organization": org_name
+                                        }
+                                    updated_contact = row_to_dict(conn.execute("""
+                                        SELECT c.*, o.name as organization_name
+                                        FROM contacts c LEFT JOIN organizations o ON c.organization_id = o.id
+                                        WHERE c.id = ?
+                                    """, (existing['id'],)).fetchone())
+                                    _sync_contact_to_fundraising_state(conn, updated_contact, actor_user_id=user['user_id'], remove=False)
+                                    results['updated'] += 1
+                                elif action == 'create_duplicate':
+                                    contact_id = generate_id()
+                                    conn.execute("""
+                                        INSERT INTO contacts (id, first_name, last_name, email, phone,
+                                            title, organization_id, contact_type, status, source,
+                                            linkedin_url, city, state, country, location_query, created_by)
+                                        VALUES (?, ?, ?, ?, ?, ?, ?, ?, 'active', 'import', ?, ?, ?, ?, ?, ?)
+                                    """, (contact_id, first_name, last_name, email,
+                                          data.get('phone'), data.get('title'), org_id,
+                                          data.get('contact_type', 'prospect'), linkedin_url,
+                                          city, state, country, location_query, user['user_id']))
+                                    if email:
+                                        batch_email_matches[email_key] = {
+                                            "id": contact_id,
+                                            "name": f"{first_name} {last_name}".strip(),
+                                            "email": email,
+                                            "organization": org_name
+                                        }
+                                    created_contact = row_to_dict(conn.execute("""
+                                        SELECT c.*, o.name as organization_name
+                                        FROM contacts c LEFT JOIN organizations o ON c.organization_id = o.id
+                                        WHERE c.id = ?
+                                    """, (contact_id,)).fetchone())
+                                    _sync_contact_to_fundraising_state(conn, created_contact, actor_user_id=user['user_id'], remove=False)
+                                    results['created'] += 1
+                                else:
+                                    results['skipped'] += 1
+                                    results['errors'].append(f"Row {i+1}: Existing contact matched by email; skipped")
                            else:
                                contact_id = generate_id()
                                conn.execute("""
                                    INSERT INTO contacts (id, first_name, last_name, email, phone,
-                                        title, organization_id, contact_type, status, source, created_by)
-                                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, 'active', 'import', ?)
+                                        title, organization_id, contact_type, status, source,
+                                        linkedin_url, city, state, country, location_query, created_by)
+                                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, 'active', 'import', ?, ?, ?, ?, ?, ?)
                                """, (contact_id, first_name, last_name, email,
                                      data.get('phone'), data.get('title'), org_id,
-                                      data.get('contact_type', 'prospect'), user['user_id']))
+                                      data.get('contact_type', 'prospect'), linkedin_url,
+                                      city, state, country, location_query, user['user_id']))
+                                if email:
+                                    batch_email_matches[email_key] = {
+                                        "id": contact_id,
+                                        "name": f"{first_name} {last_name}".strip(),
+                                        "email": email,
+                                        "organization": org_name
+                                    }
+                                created_contact = row_to_dict(conn.execute("""
+                                    SELECT c.*, o.name as organization_name
+                                    FROM contacts c LEFT JOIN organizations o ON c.organization_id = o.id
+                                    WHERE c.id = ?
+                                """, (contact_id,)).fetchone())
+                                _sync_contact_to_fundraising_state(conn, created_contact, actor_user_id=user['user_id'], remove=False)
                                results['created'] += 1
                        else:
                            if existing:
-                                results['updated'] += 1
+                                if action == 'update':
+                                    results['updated'] += 1
+                                    if email:
+                                        batch_email_matches[email_key] = {
+                                            "id": existing['id'],
+                                            "name": f"{first_name} {last_name}".strip(),
+                                            "email": email,
+                                            "organization": org_name
+                                        }
+                                elif action == 'create_duplicate':
+                                    results['created'] += 1
+                                    if email:
+                                        batch_email_matches[email_key] = {
+                                            "id": f"dryrun-{i+1}",
+                                            "name": f"{first_name} {last_name}".strip(),
+                                            "email": email,
+                                            "organization": org_name
+                                        }
+                                else:
+                                    results['skipped'] += 1
+                                    results['errors'].append(f"Row {i+1}: Existing contact matched by email; would be skipped")
                            else:
                                results['created'] += 1
+                                if email:
+                                    # Simulate that the row now exists for subsequent duplicate-email rows.
+                                    batch_email_matches[email_key] = {
+                                        "id": f"dryrun-{i+1}",
+                                        "name": f"{first_name} {last_name}".strip(),
+                                        "email": email,
+                                        "organization": org_name
+                                    }

                except Exception as e:
                    results['errors'].append(f"Row {i+1}: {str(e)}")
@@ -3141,6 +3522,58 @@ class CRMHandler(BaseHTTPRequestHandler):
        conn.close()
        return self.send_json({"data": updated})

+    def handle_admin_reset_all_data(self, user, body):
+        if not require_admin(user):
+            return self.send_error_json("Admin only", 403)
+
+        confirm_phrase = str(body.get('confirm_phrase') or '').strip()
+        if confirm_phrase != 'RESET ALL DATA':
+            return self.send_error_json("Confirmation phrase must be exactly: RESET ALL DATA", 400)
+
+        conn = get_db()
+        try:
+            ensure_fundraising_state_row(conn)
+            state = conn.execute("SELECT * FROM fundraising_state WHERE id = 'main'").fetchone()
+            pre_backup = create_fundraising_backup_file(state, kind="pre_restore") if state else None
+
+            conn.execute("DELETE FROM communications")
+            conn.execute("DELETE FROM opportunities")
+            conn.execute("DELETE FROM lp_profiles")
+            conn.execute("DELETE FROM custom_field_values")
+            conn.execute("DELETE FROM custom_fields")
+            conn.execute("DELETE FROM feature_requests")
+            conn.execute("DELETE FROM contacts")
+            conn.execute("DELETE FROM organizations")
+
+            default_grid = {
+                "columns": deep_copy_json(DEFAULT_FUNDRAISING_COLUMNS),
+                "rows": deep_copy_json(DEFAULT_FUNDRAISING_ROWS)
+            }
+            default_views = sanitize_grid_views(deep_copy_json(DEFAULT_GRID_VIEWS))
+            conn.execute("""
+                UPDATE fundraising_state
+                SET grid_json = ?, views_json = ?, version = COALESCE(version, 1) + 1, updated_by = ?, updated_at = ?
+                WHERE id = 'main'
+            """, (json.dumps(default_grid), json.dumps(default_views), user['user_id'], now()))
+            sync_fundraising_relational(conn, default_grid, default_views, actor_user_id=user['user_id'])
+
+            log_audit(conn, user['user_id'], 'system', 'all-data', 'reset', {
+                "pre_backup": pre_backup['filename'] if pre_backup else None
+            })
+            conn.commit()
+        except Exception as exc:
+            conn.rollback()
+            conn.close()
+            return self.send_error_json(f"Failed to reset data: {str(exc)}", 500)
+
+        conn.close()
+        return self.send_json({
+            "data": {
+                "status": "ok",
+                "pre_backup": pre_backup
+            }
+        })
+
    def handle_list_audit_log(self, user, params):
        if not require_admin(user):
            return self.send_error_json("Admin access required", 403)
@@ -3574,6 +4007,128 @@ class CRMHandler(BaseHTTPRequestHandler):
            }
        })

+    def _cleanup_fundraising_collab(self, conn):
+        now_epoch = int(time.time())
+        conn.execute("DELETE FROM fundraising_presence WHERE expires_at_epoch <= ?", (now_epoch,))
+        conn.execute("DELETE FROM fundraising_cell_locks WHERE expires_at_epoch <= ?", (now_epoch,))
+
+    def _list_fundraising_collab_state(self, conn):
+        presence_rows = rows_to_list(conn.execute("""
+            SELECT user_id, username, full_name, active_view, row_id, col_id, is_editing, cell_key, last_seen_at
+            FROM fundraising_presence
+            ORDER BY last_seen_at DESC
+        """).fetchall())
+        lock_rows = rows_to_list(conn.execute("""
+            SELECT cell_key, row_id, col_id, locked_by_user_id, locked_by_username, locked_by_full_name, last_seen_at
+            FROM fundraising_cell_locks
+            ORDER BY last_seen_at DESC
+        """).fetchall())
+        for row in presence_rows:
+            row['is_editing'] = bool(row.get('is_editing'))
+        return {"presence": presence_rows, "locks": lock_rows}
+
+    def handle_get_fundraising_collab_state(self, user):
+        conn = get_db()
+        self._cleanup_fundraising_collab(conn)
+        snapshot = self._list_fundraising_collab_state(conn)
+        conn.commit()
+        conn.close()
+        return self.send_json({"data": snapshot})
+
+    def handle_fundraising_collab_heartbeat(self, user, body):
+        active_view = str(body.get('active_view') or '').strip()
+        selected = body.get('selected') if isinstance(body.get('selected'), dict) else {}
+        editing = body.get('editing') if isinstance(body.get('editing'), dict) else {}
+        selected_row_id = str(selected.get('row_id') or '').strip()
+        selected_col_id = str(selected.get('col_id') or '').strip()
+        editing_row_id = str(editing.get('row_id') or '').strip()
+        editing_col_id = str(editing.get('col_id') or '').strip()
+        is_editing = bool(editing_row_id and editing_col_id)
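+        try:
+            ttl_seconds = int(body.get('ttl_seconds') or 25)
+        except (TypeError, ValueError):
+            ttl_seconds = 25  # fall back to the default on non-numeric input
+        ttl_seconds = max(10, min(120, ttl_seconds))  # clamp to 10..120 seconds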
+        now_epoch = int(time.time())
+        expires_at_epoch = now_epoch + ttl_seconds
+        seen_at = now()
+        lock_conflict = None
+
+        conn = get_db()
+        self._cleanup_fundraising_collab(conn)
+
+        user_row = conn.execute("SELECT username, full_name FROM users WHERE id = ?", (user['user_id'],)).fetchone()
+        username = str(user_row['username']) if user_row and user_row['username'] else str(user.get('username') or '')
+        full_name = str(user_row['full_name']) if user_row and user_row['full_name'] else ''
+        editing_cell_key = f"{editing_row_id}:{editing_col_id}" if is_editing else None
+
+        if is_editing and editing_cell_key:
+            existing_lock = conn.execute("""
+                SELECT cell_key, row_id, col_id, locked_by_user_id, locked_by_username, locked_by_full_name, last_seen_at
+                FROM fundraising_cell_locks
+                WHERE cell_key = ? AND locked_by_user_id != ? AND expires_at_epoch > ?
+                LIMIT 1
+            """, (editing_cell_key, user['user_id'], now_epoch)).fetchone()
+            if existing_lock:
+                lock_conflict = row_to_dict(existing_lock)
+                is_editing = False
+                editing_cell_key = None
+            else:
+                conn.execute("""
+                    INSERT INTO fundraising_cell_locks (
+                        cell_key, row_id, col_id, locked_by_user_id, locked_by_username, locked_by_full_name, last_seen_at, expires_at_epoch
+                    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+                    ON CONFLICT(cell_key) DO UPDATE SET
+                        row_id = excluded.row_id,
+                        col_id = excluded.col_id,
+                        locked_by_user_id = excluded.locked_by_user_id,
+                        locked_by_username = excluded.locked_by_username,
+                        locked_by_full_name = excluded.locked_by_full_name,
+                        last_seen_at = excluded.last_seen_at,
+                        expires_at_epoch = excluded.expires_at_epoch
+                """, (editing_cell_key, editing_row_id, editing_col_id, user['user_id'], username, full_name, seen_at, expires_at_epoch))
+                conn.execute("""
+                    DELETE FROM fundraising_cell_locks
+                    WHERE locked_by_user_id = ? AND cell_key != ?
+                """, (user['user_id'], editing_cell_key))
+        else:
+            conn.execute("DELETE FROM fundraising_cell_locks WHERE locked_by_user_id = ?", (user['user_id'],))
+
+        conn.execute("""
+            INSERT INTO fundraising_presence (
+                user_id, username, full_name, active_view, row_id, col_id, is_editing, cell_key, last_seen_at, expires_at_epoch
+            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+            ON CONFLICT(user_id) DO UPDATE SET
+                username = excluded.username,
+                full_name = excluded.full_name,
+                active_view = excluded.active_view,
+                row_id = excluded.row_id,
+                col_id = excluded.col_id,
+                is_editing = excluded.is_editing,
+                cell_key = excluded.cell_key,
+                last_seen_at = excluded.last_seen_at,
+                expires_at_epoch = excluded.expires_at_epoch
+        """, (
+            user['user_id'],
+            username,
+            full_name,
+            active_view,
+            selected_row_id or editing_row_id,
+            selected_col_id or editing_col_id,
+            1 if is_editing else 0,
+            editing_cell_key,
+            seen_at,
+            expires_at_epoch
+        ))
+
+        snapshot = self._list_fundraising_collab_state(conn)
+        conn.commit()
+        conn.close()
+
+        return self.send_json({
+            "data": {
+                **snapshot,
+                "lock_conflict": lock_conflict
+            }
+        })
+
    def handle_update_fundraising_state(self, user, body):
        grid = body.get('grid', {})
        views = body.get('views')
@@ -3948,7 +4503,20 @@ def main():
        print("Demo data seeding disabled (set CRM_SEED_DEMO_DATA=1 to enable).")
    start_backup_scheduler()

-    server = HTTPServer((HOST, PORT), CRMHandler)
+    # ─── Gmail sync scheduler (feature-flag-guarded) ─────────────────
+    if os.environ.get("CRM_GMAIL_INTEGRATION_ENABLED", "").lower() in ("1", "true", "yes", "on"):
+        try:
+            from email_integration.scheduler import start_sync_scheduler
+            start_sync_scheduler()
+            print("[email_integration] Gmail sync scheduler started")
+        except Exception as _e:
+            print(f"[email_integration] failed to start scheduler: {_e}")
+
+    # ThreadingHTTPServer keeps one slow request (or a wave of scanner probes)
+    # from blocking legitimate users. SQLite is opened per-request via get_db(),
+    # and WAL mode allows concurrent readers plus a single writer, so this is safe.
+    server = ThreadingHTTPServer((HOST, PORT), CRMHandler)
+    server.daemon_threads = True
    print(f"\n{'='*60}")
    print(f"  Venture Fund CRM Server")
    print(f"  Running at http://{HOST}:{PORT}")
@@ -0,0 +1,202 @@
+# Embeddings + Retrieval through Spark Control (v0.15.0)
+
+Spark Control now fronts a local RAG stack so your agent/CRM system can do
+dense embeddings, reranking, and hybrid vector search against one trusted host
+— same TLS cert and allowlist as the LLM and audio endpoints.
+
+## What runs where
+
+| Component | Host | Port | Role |
+|---|---|---|---|
+| **spark-embed** | Spark 2 (GPU) | 8088 | `BAAI/bge-m3` dense embeddings (1024-d) + `BAAI/bge-reranker-v2-m3` cross-encoder rerank |
+| **Qdrant** | Spark 2 (CPU) | 6333/6334 | Vector storage, hybrid dense+sparse retrieval, RRF fusion, payload filtering |
+| **Spark Control** | Start9 | (your LAN URL) | Proxies all of the above behind one host |
+
+`spark-embed` is a small FastAPI server built **from the NGC PyTorch image**
+(the torch we've proven runs on the GB10's sm_121 GPU). We did *not* use HF Text
+Embeddings Inference because as of 2026 it ships no arm64 CUDA image. No
+torchaudio, no flash-attn — bge-m3 + the reranker are plain XLM-RoBERTa
+encoders.
+
+## Endpoints (all on the Spark Control host)
+
+### `POST /v1/embeddings` — dense embeddings (OpenAI-compatible)
+
+```json
+// request
+{ "input": "text to embed", "model": "BAAI/bge-m3" }
+// or  { "input": ["batch", "of", "texts"] }
+
+// response (OpenAI shape)
+{
+  "object": "list",
+  "data": [ { "object": "embedding", "index": 0, "embedding": [0.01, ...1024 floats] } ],
+  "model": "BAAI/bge-m3",
+  "usage": { "prompt_tokens": 0, "total_tokens": 0 }
+}
+```
+
+Vectors are L2-normalized by default, so cosine similarity equals the dot
+product. The stock OpenAI Python/JS clients work if you point `base_url` at
+Spark Control.
+
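A minimal sketch of the client side of this endpoint, assuming only the request and response shapes shown above. The helper names (`build_embedding_request`, `vectors_from_response`) are hypothetical, and the 2-d vectors stand in for real 1024-d embeddings. It also illustrates why unit-length vectors make cosine similarity and the dot product interchangeable.

```python
import json
import math


def build_embedding_request(texts, model="BAAI/bge-m3"):
    """Return the JSON body for POST /v1/embeddings (single string or batch)."""
    payload = {"input": texts if len(texts) > 1 else texts[0], "model": model}
    return json.dumps(payload)


def vectors_from_response(response):
    """Extract embeddings from an OpenAI-shaped response, ordered by `index`."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy response: 2-d unit vectors instead of real 1024-d embeddings.
fake_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.6, 0.8]},
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
    ],
    "model": "BAAI/bge-m3",
}
first, second = vectors_from_response(fake_response)
```

Because both toy vectors have length 1, `dot(first, second)` and `cosine(first, second)` agree up to floating-point error, which is the property the normalization note relies on.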
+### `POST /v1/rerank` — cross-encoder rerank
+
+```json
+// request
+{ "query": "did Brightwater commit?", "documents": ["chunk a", "chunk b", ...],
+  "top_n": 5, "return_documents": false }
+
+// response (Cohere-ish)
+{ "object": "rerank.result", "model": "BAAI/bge-reranker-v2-m3",
+  "results": [ { "index": 3, "relevance_score": 5.21 }, { "index": 0, "relevance_score": 1.04 }, ... ] }
+```
+
+`relevance_score` is the reranker's raw logit (higher = more relevant; unbounded,
+roughly −10..+10). Sort descending. Send a candidate set (≤200), not your whole corpus.
+
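As an illustration of consuming this response, the sketch below maps each result's `index` back to the submitted `documents` list and orders by `relevance_score`. `apply_rerank` and its `min_score` cutoff are hypothetical; because the score is a raw logit, any threshold would have to be tuned on your own data.

```python
def apply_rerank(documents, rerank_response, min_score=None):
    """Pair each reranked result with its source document, best first.

    `min_score` is an optional, illustrative cutoff on the raw logit.
    """
    results = sorted(
        rerank_response["results"],
        key=lambda r: r["relevance_score"],
        reverse=True,
    )
    ranked = []
    for result in results:
        if min_score is not None and result["relevance_score"] < min_score:
            continue
        ranked.append((documents[result["index"]], result["relevance_score"]))
    return ranked


documents = ["chunk a", "chunk b", "chunk c", "chunk d"]
response = {
    "object": "rerank.result",
    "model": "BAAI/bge-reranker-v2-m3",
    "results": [
        {"index": 0, "relevance_score": 1.04},
        {"index": 3, "relevance_score": 5.21},
    ],
}
ranked = apply_rerank(documents, response)
```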
+### `POST /api/search` — orchestrated hybrid retrieval
+
+One call: embeds the query (dense), retrieves from Qdrant (hybrid dense+sparse
+with RRF **when you supply a sparse vector**, else dense-only), optionally
+cross-encoder reranks, returns top_k.
+
+```json
+// request
+{
+  "query": "Did Brightwater commit to the Fund III close in Q1?",
+  "collection": "crm_chunks",
+  "top_k": 8,
+  "retrieve_n": 80,                 // first-stage candidates (default max(50, top_k*10))
+  "sparse": { "indices": [12, 904], "values": [0.7, 1.2] },  // optional BM25 vector for hybrid
+  "fusion": "rrf",                  // or "dbsf"
+  "filter": {                       // raw Qdrant filter (pre-filter, see below)
+    "must": [ { "key": "lp_id", "match": { "value": "lp_0427" } } ]
+  },
+  "rerank": true,
+  "text_field": "text",             // payload field holding the chunk text
+  "with_payload": true
+}
+
+// response
+{
+  "object": "search.result_list",
+  "model": "BAAI/bge-m3+bge-reranker-v2-m3",
+  "query": "...",
+  "collection": "crm_chunks",
+  "reranked": true,
+  "data": [
+    { "object": "search.result", "index": 0, "id": "chunk_99c1_3",
+      "score": 5.21, "fused_score": 0.41, "rerank_score": 5.21,
+      "text": "...Brightwater confirmed verbal commitment...",
+      "payload": { "lp_id": "lp_0427", "doc_type": "email", "date_ts": 1771027200, ... } }
+  ],
+  "usage": { "embed_ms": 21, "qdrant_ms": 9, "rerank_ms": 140, "candidates": 80 }
+}
+```
+
+`score` is the rerank score when reranked, else the fused/dense score.
+
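A hedged sketch of assembling the request body on the client. `build_search_request` is a hypothetical helper; it mirrors the documented `retrieve_n` default of `max(50, top_k*10)` locally so the value is explicit, and only adds `sparse` and `fusion` when a sparse vector is supplied.

```python
def build_search_request(query, collection, top_k=8, retrieve_n=None,
                         sparse=None, query_filter=None, rerank=True):
    """Build a /api/search body using the field names from the example above."""
    body = {
        "query": query,
        "collection": collection,
        "top_k": top_k,
        # Mirror the documented server default when the caller gives no value.
        "retrieve_n": retrieve_n if retrieve_n is not None else max(50, top_k * 10),
        "rerank": rerank,
        "text_field": "text",
        "with_payload": True,
    }
    if sparse is not None:
        # Hybrid leg: only meaningful when a BM25 sparse vector is provided.
        body["sparse"] = sparse
        body["fusion"] = "rrf"
    if query_filter:
        body["filter"] = query_filter
    return body


dense_only = build_search_request("Did Brightwater commit?", "crm_chunks")
hybrid = build_search_request(
    "Did Brightwater commit?",
    "crm_chunks",
    top_k=3,
    sparse={"indices": [12, 904], "values": [0.7, 1.2]},
)
```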
+## The sparse (BM25) story — important for entity-heavy data
+
+bge-m3's dense vectors carry semantic meaning, but exact entity matches
+(fund names, tickers, people) are a **lexical** signal. For that you want hybrid
+dense + sparse. Two design facts:
+
+1. **spark-embed serves dense only.** TEI and Infinity can't emit bge-m3's
+   learned sparse vectors either, and bge-m3's pretrained sparse weights
+   underweight novel entity tokens (brand-new fund names). So we don't use
+   bge-m3 sparse.
+2. **Use BM25 sparse vectors with Qdrant's built-in IDF.** The vectors are
+   generated client-side; Qdrant computes IDF over *your* corpus, so novel
+   entity strings get correct weight.
+
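To make the IDF point concrete, here is a toy calculation using the standard BM25 IDF formula, `ln(1 + (N - n + 0.5) / (n + 0.5))`. It is an illustration only: the exact computation Qdrant performs server-side is not shown in this document, and the whitespace tokenizer and sample chunks are assumptions.

```python
import math
from collections import Counter


def document_frequencies(chunks):
    """Count, for each token, how many chunks contain it at least once."""
    df = Counter()
    for chunk in chunks:
        df.update(set(chunk.lower().split()))
    return df


def bm25_idf(term, df, n_docs):
    """Standard BM25 inverse document frequency for one term."""
    n = df.get(term, 0)
    return math.log(1 + (n_docs - n + 0.5) / (n + 0.5))


chunks = [
    "brightwater confirmed the fund commitment",
    "the fund close is scheduled",
    "the fund update was sent",
]
df = document_frequencies(chunks)
rare = bm25_idf("brightwater", df, len(chunks))  # appears in 1 of 3 chunks
common = bm25_idf("fund", df, len(chunks))       # appears in 3 of 3 chunks
```

The entity token that occurs in a single chunk receives a much larger weight than the token present everywhere, which is the behavior a corpus-derived IDF gives for brand-new names.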
+### Your ingest pipeline (the part you own)
+
+For each chunk, produce and upsert **both** vectors:
+
+- **dense**: call Spark Control `POST /v1/embeddings` → 1024-d vector.
+- **sparse**: run [FastEmbed](https://github.com/qdrant/fastembed) BM25 client-side:
+  ```python
+  from fastembed import SparseTextEmbedding
+  bm25 = SparseTextEmbedding(model_name="Qdrant/bm25")
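+  sp = next(bm25.embed([chunk_text]))   # SparseEmbedding with .indices and .values (NumPy arrays)
+  sparse = {"indices": sp.indices.tolist(), "values": sp.values.tolist()}  # JSON-serializable form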
+  ```
+
+Create the collection with a named dense vector and a named sparse vector that
+uses `modifier: idf` (so Qdrant applies IDF server-side):
+
+```json
+PUT /collections/crm_chunks
+{ "vectors": { "dense": { "size": 1024, "distance": "Cosine" } },
+  "sparse_vectors": { "sparse": { "modifier": "idf" } } }
+```
+
+Add payload indexes for your filter fields so filtered queries stay fast:
+
+```json
+PUT /collections/crm_chunks/index   { "field_name": "lp_id",     "field_schema": "keyword" }
+PUT /collections/crm_chunks/index   { "field_name": "doc_type",  "field_schema": "keyword" }
+PUT /collections/crm_chunks/index   { "field_name": "date_ts",   "field_schema": "integer" }
+```
+
+Upsert points with both vectors + payload:
+
+```json
+PUT /collections/crm_chunks/points
+{ "points": [ { "id": 1,
+    "vector": { "dense": [...1024...], "sparse": { "indices": [...], "values": [...] } },
+    "payload": { "lp_id": "lp_0427", "lp_name": "Brightwater Capital",
+                 "doc_type": "email", "date_ts": 1771027200, "text": "..." } } ] }
+```
+
+### At query time
+
+Generate the query's BM25 sparse vector with the **same** FastEmbed model, and
+pass it to `/api/search` as `sparse`. Spark Control fuses dense+sparse with RRF
+inside Qdrant, then reranks. If you omit `sparse`, you get dense + rerank
+(still good, just no exact-lexical leg).
+
+> You can talk to Qdrant directly (`http://<spark2>:6333`) for collection
+> management and upserts — that's the natural home for ingest. `/api/search` is
+> the convenience path for the read side so your agents hit one host. If you'd
+> rather Spark Control proxy Qdrant admin too, say the word.
+
+## Chunking guidance (entity-heavy CRM)
+
+- One chunk per email / note / transcript-turn; one chunk per memo *section* —
+  don't split mid-thought.
+- Keep entity + date as **payload fields** (filterable), not buried in embedded text.
+- Time-aware: don't merge a 2022 note with a 2026 update in one chunk; store
+  `date_ts` so you can pre-filter and recency-boost.
+- Resolve entity-name variants ("J. Smith" / "Jonathan Smith" / "JS") to one
+  canonical `lp_id` at ingest, or retrieval fragments across variants.
+
+## Pre-filtering
+
+Most agent queries aren't pure semantic — they're "recent emails with Bob about
+energy". Pass a Qdrant `filter` to restrict the search space *before* vector
+scoring (faster + more precise). Qdrant also supports server-side recency
+boosting via Formula/Score-Boosting if you want decay-by-`date_ts` without an
+app round-trip — ask and we can expose it through `/api/search`.
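+
+A sketch of building such a filter for the indexed payload fields. The
+`must` / `match` / `range` shape is Qdrant's filter format; the helper name
+`lp_date_filter` and its arguments are assumptions for illustration.
+
+```python
+def lp_date_filter(lp_id, since_ts=None, until_ts=None, doc_type=None):
+    """Build a Qdrant filter on lp_id, optional doc_type, optional date range."""
+    must = [{"key": "lp_id", "match": {"value": lp_id}}]
+    if doc_type is not None:
+        must.append({"key": "doc_type", "match": {"value": doc_type}})
+    date_range = {}
+    if since_ts is not None:
+        date_range["gte"] = int(since_ts)
+    if until_ts is not None:
+        date_range["lte"] = int(until_ts)
+    if date_range:
+        must.append({"key": "date_ts", "range": date_range})
+    return {"must": must}
+
+# "Emails with this LP since a given Unix timestamp"
+flt = lp_date_filter("lp_0427", since_ts=1767225600, doc_type="email")
+```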
+
+## Service discovery + health
+
+- `GET /api/endpoints` includes `embeddings` and `qdrant` entries (base_url +
+  ready flag) alongside `vllm`, `parakeet`, `kokoro`.
+- The dashboard shows Embeddings + Qdrant health dots and Start/Restart/Stop
+  controls in the Always-On Services panel.
+- Spark Control will auto-restart a wedged **embedding** container (GPU CUDA
+  wedge recovery, like the audio services) but **never** auto-restarts Qdrant —
+  it holds your only copy of the index, so a restart is surfaced for manual
+  action instead.
+
+## Capacity
+
+At your scale (tens of thousands of chunks now → low hundreds of thousands),
+this is trivial: 300k chunks come to ~0.6–1.2 GB of dense vectors (1024-d at
+fp16 vs fp32), and spark-embed + Qdrant together use a few GB of GPU/RAM on
+Spark 2's 122 GB. A full re-embed of 300k chunks takes ~8–15 min, so
+re-indexing is cheap if you change models.
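+
+The dense-vector figure can be checked with back-of-envelope arithmetic,
+assuming 1024-d vectors stored at 2 bytes (fp16) or 4 bytes (fp32) per value
+and ignoring index and payload overhead. `dense_bytes` is a hypothetical
+helper.
+
+```python
+def dense_bytes(n_chunks, dim=1024, bytes_per_value=4):
+    """Raw storage for n_chunks dense vectors, excluding index overhead."""
+    return n_chunks * dim * bytes_per_value
+
+fp16_gb = dense_bytes(300_000, bytes_per_value=2) / 1e9
+fp32_gb = dense_bytes(300_000, bytes_per_value=4) / 1e9
+print(f"{fp16_gb:.2f} GB to {fp32_gb:.2f} GB")  # 0.61 GB to 1.23 GB
+```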
+
+## Model upgrade path
+
+If dense recall becomes the bottleneck, `Qwen3-Embedding-4B` (Matryoshka-trained,
+tops 2026 MTEB) is the A/B candidate — same `/v1/embeddings` contract, swap the
+model in spark-embed. bge-m3 is the lower-risk starting point and ships now.
@@ -0,0 +1,71 @@
+# Phase 0 — Foundation: Kickoff Brief
+
+**Goal:** stand up the data + retrieval substrate that every later agent depends on. By the end of Phase 0 we can ingest the CRM into a searchable index and query it through a CRM MCP server using multiple retrieval modes. **No outward-facing agent behavior in this phase.**
+
+See `../CLAUDE.md` for settled architecture and non-negotiable guardrails. This brief assumes them.
+
+## Prerequisites / inputs needed before building
+
+- [ ] CRM source code present in this repo.
+- [ ] `docs/EMBEDDINGS.md` read — it is the authoritative embedding/retrieval contract and ingest recipe. This brief defers to it for all vector mechanics.
+- [ ] `.env` populated from `.env.example`: `ANTHROPIC_API_KEY`, `SPARK_CONTROL_URL` (gateway), `QDRANT_URL` (direct Qdrant on Spark 2 for ingest), `X_API_KEY`, CRM connection vars.
+- [ ] Spark Control endpoints live (shipped in v0.15.0): `/v1/embeddings`, `/v1/rerank`, `/api/search`. Verify with `GET /api/endpoints`.
+- [ ] `fastembed` installed in the ingest environment (client-side BM25 sparse vectors, model `Qdrant/bm25`).
+- [ ] A small set of real example records + documents (a few LPs, a memo, a transcript) to test ingest and retrieval against.
+
+## Workstream A — Understand the CRM, then extend the schema
+
+1. Read the CRM code. Write `docs/crm-overview.md`: storage engine and where it runs, the data model (records, custom fields, and especially how notes/interactions/correspondence are stored), the API/route surface, and auth. Fill in the Conventions section and CRM env vars in `CLAUDE.md`.
+2. Design and apply additive, reversible schema extensions:
+   - LP/prospect fields: thesis fit, segment, accreditation/QP status, warmth score, source, owner, last-touch.
+   - **Interaction log** table: append-only record of every agent action and every human touch (actor, action, target, timestamp, payload).
+   - **Relationship graph** table: edges between people/entities with source + strength (seeded later from email/calendar + X follower overlap).
+   - **Canonical entity IDs**: a canonical record each name variant resolves to (see Workstream B3).
+
+   *Guardrail: additive and reversible only; no destructive migrations; review before running.*
+
+## Workstream B — Ingest / sync pipeline
+
+**Follow the ingest recipe in `docs/EMBEDDINGS.md` exactly** (collection schema, dual-vector upsert, FastEmbed BM25, payload fields). This brief covers only the CRM-specific decisions layered on top of it.
+
+1. **Chunking** (a bigger quality lever than model choice). Per EMBEDDINGS.md: one chunk per email / note / transcript-turn; one chunk per memo *section*; never split mid-thought; keep entities + `date_ts` as filterable payload, not embedded text. The CRM-specific work is mapping *your* record types to chunks (how a thread, a note, a memo, a call transcript each decompose).
+2. **Two vectors per chunk.** Dense via Spark Control `POST /v1/embeddings` (`BAAI/bge-m3`, 1024-d); sparse BM25 via FastEmbed (`Qdrant/bm25`) generated **client-side**. Create the `crm_chunks` collection with a named dense vector (1024, Cosine) + named sparse vector (`modifier: idf`), and payload indexes on `lp_id` (keyword), `doc_type` (keyword), `date_ts` (integer). Upsert both vectors + payload (`lp_id`, `lp_name`, `doc_type`, `date_ts`, `text`). **Ingest talks to Qdrant directly** (`QDRANT_URL`), not through the gateway.
+3. **Entity resolution at ingest.** Resolve name variants ("John Smith" / "J. Smith" / "JS") to one canonical `lp_id` before storing, using a lightweight local-Qwen call via Spark Control `/v1/chat/completions`. Without it, retrieval fragments across variants. This is the same canonical-ID work as Workstream A2 (canonical entity IDs).
+4. **Backfill + incremental sync, idempotent.** One-time full backfill, then incremental updates as records change. A full re-embed is ~8–15 min, so the index is cheaply rebuildable — but only if the pipeline is **idempotent and re-runnable** (Qdrant is never auto-restarted and holds the only live index). Prefer simple, correct, repeatable sync over premature optimization.
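+
+One way to make the upsert idempotent is to derive each Qdrant point ID deterministically from the source record, so a re-run overwrites the same points instead of duplicating them. The sketch below is an assumption, not a prescribed scheme: the namespace string, the `table:record:chunk` key format, and the function name `chunk_point_id` are all hypothetical.
+
+```python
+import uuid
+
+# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
+CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "ten31-crm/crm_chunks")
+
+def chunk_point_id(source_table, record_id, chunk_index):
+    """Same (table, record, chunk) always yields the same UUID point ID."""
+    key = f"{source_table}:{record_id}:{chunk_index}"
+    return str(uuid.uuid5(CHUNK_NAMESPACE, key))
+
+a = chunk_point_id("communications", "3f9a1c2e", 0)
+b = chunk_point_id("communications", "3f9a1c2e", 0)  # re-run: identical ID
+c = chunk_point_id("communications", "3f9a1c2e", 1)  # next chunk: different ID
+```
+
+Qdrant accepts UUID strings as point IDs, so the backfill and the incremental sync can share this function.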
+
+## Workstream C — CRM MCP server
+
+Wrap the CRM behind an MCP server that agents (Claude Agent SDK) call. The retrieval modes are thin wrappers over Spark Control `POST /api/search` (which orchestrates embed → Qdrant RRF → rerank); for hybrid/keyword modes, generate the query's BM25 sparse vector with the **same** FastEmbed model used at ingest and pass it as `sparse`. Tools to expose:
+- **Reads:** fetch record(s), list/search records by structured filters, fetch interaction history for an entity.
+- **Retrieval modes** (all via `/api/search`, varying parameters):
+  - `semantic_search` — dense only (omit `sparse`), large `top_k`/`retrieve_n`. High recall (Scout).
+  - `hybrid_search` — pass `sparse` + `rerank: true` + `filter`. Dense+BM25 RRF then rerank. **Default**, best for entity-heavy queries.
+  - `keyword_search` — high-precision lexical ("did we ever discuss X with this LP?"): lean on the sparse leg + rerank. For a pure-BM25 result, query Qdrant's sparse vector directly via `QDRANT_URL`.
+  - Always support a Qdrant `filter` (e.g. `lp_id`, `date_ts` range) to pre-filter before scoring.
+- **Writes:** create/update records, append to the interaction log, write enrichment results back. (No outbound/contact actions — Phase 3, gated.)
+
+Keep retrieval modes as small, composable functions so the orchestrator can select among them later.
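+
+A hedged sketch of what "small, composable functions" could look like: each mode only builds the request body for `POST /api/search`. The parameter names `sparse`, `rerank`, `filter`, `top_k`, and `retrieve_n` come from this brief; the `query` field name and the default sizes are assumptions, and `docs/EMBEDDINGS.md` remains the authoritative contract.
+
+```python
+def _request(query, top_k, retrieve_n, qfilter=None, sparse=None, rerank=None):
+    """Shared body builder; optional keys are omitted rather than sent as null."""
+    body = {"query": query, "top_k": top_k, "retrieve_n": retrieve_n}
+    if qfilter is not None:
+        body["filter"] = qfilter
+    if sparse is not None:
+        body["sparse"] = sparse
+    if rerank is not None:
+        body["rerank"] = rerank
+    return body
+
+def semantic_search(query, qfilter=None, top_k=50, retrieve_n=200):
+    """Dense only: no `sparse` key, large candidate pool for recall."""
+    return _request(query, top_k, retrieve_n, qfilter=qfilter)
+
+def hybrid_search(query, sparse, qfilter=None, top_k=10, retrieve_n=50):
+    """Dense + BM25 with rerank; `sparse` is the query's FastEmbed BM25 vector."""
+    return _request(query, top_k, retrieve_n, qfilter=qfilter,
+                    sparse=sparse, rerank=True)
+
+dense_body = semantic_search("energy infrastructure thesis")
+hybrid_body = hybrid_search(
+    "Brightwater Capital",
+    {"indices": [17], "values": [1.0]},
+    qfilter={"must": [{"key": "lp_id", "match": {"value": "lp_0427"}}]},
+)
+```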
+
+## Workstream D — Compliance groundwork (parallel, non-blocking)
+
+Engage counsel to define solicitation posture (506(b) vs 506(c)), accreditation/QP verification, and recordkeeping rules. This does not block A–C, but it must be resolved before any Phase 3 outbound work begins.
+
+## Acceptance criteria
+
+- [ ] CRM schema extended (additive, reversible) and documented.
+- [ ] `crm_chunks` collection created per EMBEDDINGS.md (named dense + sparse vectors, payload indexes).
+- [ ] Full CRM corpus backfilled into Qdrant with **both** dense + BM25 sparse vectors and payload; incremental sync working and idempotent.
+- [ ] Entity resolution collapses obvious name variants to one canonical `lp_id`.
+- [ ] CRM MCP server live; retrieval modes wrap `/api/search`; hybrid (sparse + rerank) visibly beats dense-only on entity-heavy test queries (e.g. exact fund/LP name lookups).
+- [ ] Every write goes through the interaction log.
+- [ ] No outward-facing behavior exists anywhere in the codebase.
+
+## Explicitly out of scope for Phase 0
+
+- Any agent that sends email, posts, or contacts a prospect.
+- Scout monitoring, Closer outreach, Scribe publishing, the Orchestrator.
+- The workflow/scheduling engine (revisited at Phase 3).
+
+## Suggested order
+
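+A1 → A2 → B1 → B3 → B2 → B4 → C. Start D in parallel. B3 is pulled ahead of B2 because the canonical `lp_id` must be resolved before chunks are upserted with it as payload.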
@@ -0,0 +1,133 @@
+# Ten31 — Agentic Capability Build Plan
+
+*Working document. Purpose: a concrete, sequenced plan for building an in-house system of AI agents to widen the top of the fundraising funnel, refine and propagate Ten31's thesis, and automate marketing/branding workflows — built with internal resources using Claude and Claude Code as the engineering partner.*
+
+---
+
+## 1. Approach in one paragraph
+
+Build **six agents** — five workers plus a lightweight orchestrator — on the **Claude Agent SDK**, connected to your systems through **MCP**. Run the *reasoning* on **Claude** (frontier-quality judgment for research, messaging, drafting). **Self-host the data layer and the privacy-sensitive model work** on your existing Start9 server and your **dual DGX Sparks**. **Buy nothing for the core**: your self-built CRM becomes the system of record, and your existing Gmail/Superhuman + calendar connectors supply the relationship data. The real unit of reuse is not the agent count — it is one shared **LP graph** (your CRM) plus a library of **skills** every agent draws from.
+
+---
+
+## 2. Guiding principles
+
+1. **Sovereignty first.** Sensitive LP and relationship data stays on infrastructure you control (Start9 + DGX Sparks). Only the minimum necessary context per call ever reaches a third-party model API.
+2. **Frontier reasoning where it is best-in-class; local where privacy or cost dominate.** Claude for hard agentic reasoning and LP-facing output; local open models for embeddings, redaction, triage, transcription, and reasoning over data that must not leave your walls.
+3. **Human-in-the-loop on anything outbound or thesis-defining.** Agents draft and prepare; partners approve and send.
+4. **Compliant by design.** Log every agent action; gate all outbound; bring counsel in before any cold outreach goes live.
+5. **One source of truth.** Every agent reads from and writes to the same LP graph, so research → outreach → nurture → meeting prep compound instead of fragmenting.
+
+---
+
+## 3. The agent roster (6)
+
+| Agent | Job | Cadence | Brain | Human gate |
+|---|---|---|---|---|
+| **Scout** | Watches sources (X/nostr, filings, treasury announcements, conference rosters, podcast networks); flags trigger events; populates the pipeline. | Continuous / scheduled | Local (triage) + Claude (judgment calls) | None (internal only) |
+| **Analyst** | Builds LP dossiers, enriches records, maps shortest warm-intro path through the team's network. | On-demand + triggered | Claude (synthesis); local for RAG/embeddings | None (internal only) |
+| **Architect** | **Thesis articulation.** Owns and refines the canonical messaging — the scarcity / critical-infrastructure throughline tying bitcoin to AI infrastructure. The copilot partners sit with to sharpen the narrative. Output = a living "messaging source of truth." | On-demand, collaborative | Claude | Partner sign-off on canonical thesis |
+| **Scribe** | **Distribution / amplification.** Takes the Architect's canonical thesis + your content (Bitcoin Alpha, partner shows, memos) and propagates segment-specific cuts across X, nostr, LinkedIn, email. | Scheduled + on-demand | Claude | Review before publish |
+| **Closer** | Drafts personalized outreach and nurture sequences, preps partners before LP calls, writes follow-ups, keeps the CRM clean. | Triggered + on-demand | Claude | **Hard gate** — human sends all outbound |
+| **Orchestrator** ("Chief of Staff") | Schedules runs, routes work between agents, escalates to a human. | Always on | Claude (light) | n/a |
+
+**Why Architect and Scribe are separate.** Distribution is high-frequency and semi-mechanical; thesis articulation is low-frequency, high-judgment, and collaborative. Keeping them apart lets the Architect own a stable, partner-approved narrative that the Scribe then propagates consistently everywhere.
+
+---
+
+## 4. Architecture and hosting map
+
+### 4.1 Model layer
+- **Claude (API)** — the brains for Analyst synthesis, Architect thesis work, Scribe drafting, Closer judgment, and Orchestrator routing. Use a stronger model for Architect/Analyst, a faster one for high-volume Scout/Closer tasks.
+- **Local model on the DGX Sparks** — current local model is **Qwen3.6 35B-A3B running on a single Spark**. Used for PII redaction before any data leaves your walls, inbound triage/classification, transcription orchestration, structuring/extraction, and local reasoning over data you choose never to send out.
+  - The **A3B (~3B active params)** design means only a small slice of the model runs per token, so it largely sidesteps the Spark's memory-bandwidth limit and keeps decode fast despite being a 35B-total model. No need to link both Sparks for a larger model — that earlier ceiling is moot for this workload.
+  - **Embeddings + reranking (shipped, Spark Control v0.15.0).** Retrieval runs on `BAAI/bge-m3` (dense, 1024-dim, L2-normalized) plus `BAAI/bge-reranker-v2-m3` (cross-encoder), served by **spark-embed** — a small FastAPI server on **Spark 2** built from the NGC PyTorch image (HF TEI was ruled out: no arm64 CUDA image). Exposed through Spark Control as `/v1/embeddings`, `/v1/rerank`, and `/api/search` (orchestrated hybrid retrieval). Combined GPU footprint on Spark 2 is trivial (~3 GB).
+  - **Spark allocation.** Spark 1 = LLM serving (hot KV cache). Spark 2 = embeddings + reranker + audio + the Qdrant vector index. Both Sparks are treated as always-on production infrastructure.
+- **All local model services are fronted by Spark Control** (the self-hosted gateway on Start9): agents hit one trusted URL for chat, embeddings, rerank, transcription, and TTS, with shared TLS, access control, and observability.
+- **Auth note:** Agent SDK agents must authenticate with an **API key**, not a claude.ai login.
+
+### 4.2 Data layer — the LP graph (self-hosted)
+- **The CRM (self-hosted on Start9) is the canonical system of record.** Extend it to be the LP graph. Add: prospect/LP schema fields (thesis fit, segment, accreditation/QP status, warmth score, source, owner, last-touch), an interaction log (every agent action + every human touch), a derived **relationship graph** table, and **canonical entity IDs** for entity resolution (see ingest pipeline).
+- **Vector store: Qdrant on Spark 2 (settled).** Holds the embedded chunks. It is a **rebuildable, derived index**, not a second source of truth — if lost, it re-embeds from the CRM in minutes. Qdrant provides dense search + native BM25 + payload filtering + Reciprocal Rank Fusion in one service.
+- **Retrieval pipeline.** One orchestrated call to Spark Control `/api/search`: embed query (BGE-M3) → Qdrant dense + BM25 RRF with payload pre-filter → cross-encoder rerank → top_k. BM25 is generated **client-side** via FastEmbed (`Qdrant/bm25`) at both ingest and query time, with Qdrant applying IDF over *your* corpus — so domain entities (LP names, tickers, portfolio companies) are weighted by your own term statistics rather than BGE-M3's general-web sparse weights.
+- **Ingest pipeline (the real Phase 0 work).** CRM record/change → chunk (one chunk per email/note/transcript-turn; one per memo *section*; time-aware; entities + `date_ts` kept as filterable payload, not embedded text) → resolve entities to a canonical `lp_id` (lightweight local-Qwen step) → produce **both** a dense vector (`/v1/embeddings`) and a sparse BM25 vector (FastEmbed) → upsert both + payload to Qdrant **directly** (not via the gateway). One-time backfill + idempotent incremental sync. Full recipe: `docs/EMBEDDINGS.md`.
+- **Per-agent retrieval modes.** Don't force one pipeline on all agents. Build a small library the orchestrator picks from: high-recall dense at large K (Scout), high-precision keyword/BM25 (Closer — "did we ever discuss X with this LP?"), long-context + rerank (Architect). The CRM MCP server exposes these as tools.
+- **Wrap the CRM in an MCP server** so all agents read/write through one uniform interface, including the retrieval modes above. Because the CRM is self-built, any endpoint the agents need can be added.
+
+### 4.3 Integration layer (MCP fabric)
+- MCP servers to stand up / connect:
+  - **CRM / LP graph** (custom, self-hosted) — primary.
+  - **Email + calendar** — Gmail/Superhuman connectors are already live; these feed Closer (drafting, follow-ups) and the Analyst's warm-path derivation.
+  - **Drive / notes** — internal documents and memos.
+  - **Publishing channels** — X, nostr, LinkedIn, email/newsletter (for Scribe).
+  - **Public data sources** — filings, web search, and the **X API (official key in hand)** for Scout/Analyst enrichment. X is a primary source here: per-prospect public profile/bio/activity and follower-following overlap for thesis-fit scoring and mutual-connection discovery (Analyst), plus account/list/keyword monitoring and follower-graph signals (Scout). Confirm what your X access *tier* permits (full-archive search, follower-graph pulls, streaming) — that sets the ceiling on heavier monitoring. nostr APIs as a complementary source.
+
+### 4.4 Orchestration / runtime
+- Inner loop: **Claude Agent SDK** handles each agent's tool-use loop and context management.
+- Outer loop: a thin workflow engine decides *when* and *which* agent runs (Temporal for durable retries, or simpler cron/queue + n8n glue to start).
+- **Observability:** structured logging of every agent action, with a simple dashboard. Required for both debugging and compliance.
+
+### 4.5 Enrichment (privacy-preserving)
+- Default: **one-way, per-prospect public lookups** that write results *into* the CRM. Never upload the LP list to a third party. The **X API** is the workhorse here — public, per-prospect, ToS-compliant via the official key — and its follower-graph data complements the email/calendar relationship graph for warm-path mapping.
+- Optional: a **self-hosted scraper/enrichment pipeline on the Sparks** if you want zero third-party API exposure.
+
+### 4.6 Redaction / re-hydration boundary (Claude-facing reasoning)
+- For the steps where an agent must have **Claude reason over LP-specific content** (Analyst dossiers, Closer drafting), a local **scrub → reason → re-hydrate** round-trip keeps identifiers off the third-party API: the Sparks pseudonymize names/orgs/amounts to stable placeholders, Claude reasons over the de-identified prompt, and real values are swapped back locally before a human reviews. The ingest/retrieval path is already fully local and needs none of this.
+- This is **designed now, built in Phase 2/3** (it is not needed in Phase 0). Full design: `docs/redaction-rehydration.md`.
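+
+To make the round-trip concrete, here is a toy sketch of the scrub and re-hydrate halves. It assumes the sensitive identifiers have already been detected (in the real design that detection runs on the Sparks); the function names, the `[[ENTITY_n]]` placeholder format, and the sample text are hypothetical, and the full design in `docs/redaction-rehydration.md` takes precedence.
+
+```python
+def scrub(text, identifiers):
+    """Replace each identifier with a stable placeholder; return text + mapping."""
+    mapping = {}
+    # Longest first, so "Brightwater Capital" is replaced before "Brightwater".
+    for ident in sorted(set(identifiers), key=lambda s: (-len(s), s)):
+        if ident in text:
+            placeholder = f"[[ENTITY_{len(mapping) + 1}]]"
+            text = text.replace(ident, placeholder)
+            mapping[placeholder] = ident
+    return text, mapping
+
+def rehydrate(text, mapping):
+    """Swap placeholders back to the real values (runs locally only)."""
+    for placeholder, ident in mapping.items():
+        text = text.replace(placeholder, ident)
+    return text
+
+original = "Brightwater Capital committed $5M; Brightwater wants a call."
+scrubbed, mapping = scrub(original, ["Brightwater", "Brightwater Capital", "$5M"])
+restored = rehydrate(scrubbed, mapping)
+```
+
+Only `scrubbed` would be sent to the third-party API; `mapping` never leaves local infrastructure.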
+
+---
+
+## 5. Build sequence
+
+### Phase 0 — Foundation
+The substrate: data layer + retrieval, no live-in-the-wild agents yet. Division of labor:
+- **Spark developer (their side):** spark-embed serving BGE-M3 + BGE-Reranker-v2-m3, plus Qdrant, on Spark 2, exposed via Spark Control `/v1/embeddings`, `/v1/rerank`, and `/api/search`.
+- **Claude Code + you (this project):**
+  1. Read the CRM code; document the storage engine, schema, and API surface.
+  2. Extend the CRM schema (LP/prospect fields, interaction log, relationship graph, canonical entity IDs).
+  3. Build the ingest/sync pipeline (chunking + entity resolution + metadata payloads; backfill + incremental).
+  4. Build the CRM MCP server wrapping CRM reads/writes and the per-agent retrieval modes.
+  5. Bring counsel in to define outbound and recordkeeping rules so the system is compliant from day one.
+
+### Phase 1 — Architect + Scribe
+- Stand up the **Architect** first: encode the current thesis, voice, and segment definitions as skills; use it collaboratively to produce the canonical messaging source of truth.
+- Then **Scribe**: propagate that thesis into segment-specific content with human review before publish.
+- Lowest risk, highest immediate awareness ROI, never touches cold outreach — and it proves the full pattern (SDK + skills + MCP + human review).
+
+### Phase 2 — Scout + Analyst
+- **Scout** populates the pipeline from public signals (X monitoring via the API key); **Analyst** builds dossiers and derives warm paths from your own email/calendar graph plus X follower-graph overlap.
+- Internal-facing, still no outbound. This is where the Sparks earn their keep (bulk classification, embeddings, RAG).
+
+### Phase 3 — Closer + Orchestrator
+- **Closer** drafts outbound, nurture, and meeting prep — with hard human-in-the-loop gates and full logging. Highest-risk and regulated, so it comes last.
+- **Orchestrator** added once there are multiple agents to coordinate and schedule.
+
+---
+
+## 6. Team and ownership model
+
+- **Engineering partner:** Claude + Claude Code, supplying Agent SDK and MCP fluency, scaffolding the agents, writing the MCP servers and orchestration, and customizing the Start9 CRM package.
+- **Operator:** you (and your partner). You own deployment, secrets/key management, uptime, and the human-review gates. Your prior Start9 CRM build demonstrates this is well within reach.
+- **The one real risk is time, not capability.** Removing the part-time data/ops hire means operational ownership lands on the partners. If partner time is scarce, that — not tooling or skill — is the constraint to manage. Mitigations: keep the early phases internal-only (no on-call urgency), automate logging/monitoring, and stage the highest-maintenance agent (Closer) last.
+
+---
+
+## 7. Compliance by design
+
+- Log every agent action and every outbound draft.
+- Gate all outbound through human send.
+- Resolve solicitation posture (e.g. 506(b) vs 506(c)), accreditation/QP verification, and recordkeeping with counsel **before** the Closer touches cold outreach.
+- Start with distribution and inbound nurture, where constraints are lightest.
+
+---
+
+## 8. Open decisions
+
+**Resolved:** local chat/triage model = Qwen3.6 35B-A3B (Spark 1); embedding = `BAAI/bge-m3` dense 1024-dim; reranker = `BAAI/bge-reranker-v2-m3`; vector DB = Qdrant v1.16.0 on Spark 2; serving = **spark-embed** (custom FastAPI on NGC PyTorch image, *not* TEI); BM25 sparse generated client-side via FastEmbed (`Qdrant/bm25`); all fronted by Spark Control (`/v1/embeddings`, `/v1/rerank`, `/api/search`), shipped v0.15.0. Embedding-model A/B upgrade candidate if dense recall lags: `Qwen3-Embedding-4B` (same `/v1/embeddings` contract).
+
+**Still open:**
+1. Workflow engine for the outer loop (Phase 3): Temporal vs. cron/queue + n8n to start.
+2. Whether any third-party enrichment API is acceptable, or X + fully self-hosted enrichment only.
+3. Confirm **X API usage limits** (full-archive search, follower-graph pulls, streaming) to size Scout's monitoring scope. (Current access is pay-as-you-go credits.)
+4. Segment definitions for the Architect/Scribe (who are the distinct LP audiences, and what does each one need to hear?).
+5. Whether to revisit embedding dimension/quantization: currently left at BGE-M3 native 1024-dim fp16, with no Matryoshka truncation or int8 needed at this corpus scale.
@@ -0,0 +1,207 @@
+# CRM Overview — Storage, Data Model, API, Auth
+
+*Workstream A1 deliverable (see `PHASE_0.md`). Read-only documentation of the existing CRM as of 2026-06. Every concrete claim is anchored to `file:line`. This is a description of what exists today, not a proposal — the schema-extension proposal for A2 lives separately.*
+
+---
+
+## 0. TL;DR for Phase 0
+
+- **One Python file, no framework.** The whole backend is `backend/server.py` (~4,530 lines): a stdlib `http.server.ThreadingHTTPServer` with a hand-written `CRMHandler(BaseHTTPRequestHandler)` and manual path dispatch. `requirements.txt` lists FastAPI/SQLAlchemy/Alembic/Pydantic but **none are imported** — they are vestigial.
+- **Storage is one SQLite file** (`data/crm.db`), WAL mode, opened fresh per request. Schema is created idempotently in-code at boot. There is no Alembic; "migrations" are `CREATE TABLE IF NOT EXISTS` + best-effort `ALTER TABLE ADD COLUMN`.
+- **Two parallel investor data models** coexist with no shared key: (1) the *classic* `contacts / organizations / opportunities / communications / lp_profiles` CRM, and (2) the *newer, actively-used* `fundraising_*` collaborative grid. They are bridged only by fuzzy name/email matching. **This duality is the central entity-resolution problem for Phase 0.**
+- **A real Gmail subsystem** (`backend/email_integration/`) stores threaded correspondence in `crm.db` and matches emails to investors — but is **self-disabling** (off unless a service-account key is present).
+- **Auth is a single scheme:** username/password → HS256 JWT (Bearer header), re-validated against the `users` table each request; two roles (`admin`/`member`). The `X_API_KEY` named in `CLAUDE.md`/`PHASE_0.md` **does not exist in the code** — it is aspirational.
+- **Guardrail flags:** all deletes are **hard deletes** (violates guardrail #3 as written); a destructive `POST /api/admin/reset-all-data` exists; `audit_log` is mutation-only and is *not* the append-only interaction log Phase 0 wants.
+
+---
+
+## 1. Storage engine & where it runs
+
+### 1.1 Runtime
+- **Server:** `ThreadingHTTPServer((HOST, PORT), CRMHandler)`, `daemon_threads = True`, `serve_forever()` — `backend/server.py:4509`. Handler class at `backend/server.py:1418`. Pure Python stdlib (`http.server`); **not** FastAPI/uvicorn despite `backend/requirements.txt:1-2`.
+- **Concurrency model:** one OS thread per request. Safe because each request opens its own short-lived SQLite connection under WAL (rationale documented inline at `backend/server.py:4506-4508`).
+- **Request lifecycle:** banned-IP check → per-IP rate limit → (email module hook) → manual `if path == …` dispatch. Body is read once and cached on the handler instance (`get_body()`, `backend/server.py:1433`). Malformed JSON silently becomes `{}`.
+- **Background threads:** a backup scheduler loops every 60 s (`start_backup_scheduler`, `backend/server.py:1367`); an optional Gmail sync scheduler starts only if `CRM_GMAIL_INTEGRATION_ENABLED` is truthy (`backend/server.py:4498`).
+- **Default bind:** `0.0.0.0:8080`, plain HTTP — TLS is expected to be terminated upstream (Start9 / Tailscale).
+
+### 1.2 SQLite configuration
+`get_db()` (`backend/server.py:77-84`) sets, on every connection:
+- `PRAGMA journal_mode=WAL` — concurrent readers + single writer (this is what makes the ingest reader safe against the live writer).
+- `PRAGMA foreign_keys=ON` — FKs are enforced at runtime (per-connection in SQLite, so re-set each time).
+- `PRAGMA busy_timeout=5000` — 5 s wait on a lock.
+- `row_factory = sqlite3.Row`.
+
+The Gmail module re-implements the identical pragmas (`email_integration/scheduler.py:49`, `email_integration/routes.py:89`) rather than import `server.py`, to avoid a circular import.
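+
+For readers building the ingest reader, the per-connection setup described above looks roughly like the following. This is a reconstruction from the description, not a copy of `backend/server.py:77-84`; the `db_path` parameter and the temp-file usage are assumptions for the example.
+
+```python
+import os
+import sqlite3
+import tempfile
+
+def get_db(db_path):
+    """Open a short-lived connection with the pragmas listed above."""
+    conn = sqlite3.connect(db_path)
+    conn.execute("PRAGMA journal_mode=WAL")    # concurrent readers + one writer
+    conn.execute("PRAGMA foreign_keys=ON")     # per-connection in SQLite
+    conn.execute("PRAGMA busy_timeout=5000")   # wait up to 5 s on a lock
+    conn.row_factory = sqlite3.Row
+    return conn
+
+# WAL needs a real file; an in-memory database would report "memory" instead.
+path = os.path.join(tempfile.mkdtemp(), "crm.db")
+conn = get_db(path)
+mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
+fk = conn.execute("PRAGMA foreign_keys").fetchone()[0]
+conn.close()
+```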
+
+### 1.3 Schema bootstrap & "migrations"
+- `init_db()` (`backend/server.py:86`) runs once at startup, before binding. One big `executescript` of `CREATE TABLE/INDEX IF NOT EXISTS` (`backend/server.py:91-405`) creates **both** data models plus `app_settings`.
+- **Core "migrations":** a hardcoded list of `ALTER TABLE … ADD COLUMN` wrapped in a try/except that swallows `OperationalError` (`backend/server.py:407-418`) — additive-only, idempotent-by-failure. No version table, no down-migrations. (Currently adds `city/state/country/location_query` to `contacts` and `lead_source` to `fundraising_investors`.)
+- **The only real migration runner** is in the Gmail module: `email_integration.db.apply_migrations()` (`email_integration/db.py:23`) runs numbered `NNNN_*.sql` files lexicographically. There is one today: `migrations/0001_email_tables.sql`. ⚠️ This is called from `init_db()` guarded only by `ImportError` (`backend/server.py:421-427`), so on any image where the package is importable, **the email tables are created even when Gmail sync is disabled**.
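+
+The "idempotent-by-failure" pattern used for the core migrations can be sketched as below. This illustrates the pattern as described, not the code at `backend/server.py:407-418`; the helper name and the in-memory table are assumptions.
+
+```python
+import sqlite3
+
+def add_columns_if_missing(conn, statements):
+    """Run each ALTER TABLE; swallow the error raised when the column exists."""
+    applied = []
+    for stmt in statements:
+        try:
+            conn.execute(stmt)
+            applied.append(stmt)
+        except sqlite3.OperationalError:
+            pass  # typically "duplicate column name"; already applied
+    conn.commit()
+    return applied
+
+conn = sqlite3.connect(":memory:")
+conn.execute("CREATE TABLE contacts (id TEXT PRIMARY KEY)")
+stmts = ["ALTER TABLE contacts ADD COLUMN city TEXT"]
+first = add_columns_if_missing(conn, stmts)    # column is added
+second = add_columns_if_missing(conn, stmts)   # no-op on the second boot
+```
+
+The trade-off is that swallowing every `OperationalError` also hides genuine failures such as a misspelled table name, which is one reason a ledgered runner with a version table is the safer choice for new migrations.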
+
+### 1.4 `data/` layout
+- `crm.db`, `crm.db-wal`, `crm.db-shm` — the DB + WAL + shared-memory.
+- `backups/` — JSON snapshots of the **fundraising grid state only** (not the whole DB), written by the backup scheduler.
+- `secrets/` — holds `gmail-service-account.json` (mode 600).
+- `email_attachments/` — Gmail attachment bytes on disk (created on the 0.4 image).
+- `.crm-secret` — persisted JWT secret, written by the container entrypoint (not the app).
+
+### 1.5 Production deployment (StartOS)
+Package id `ten-database` ("Ten31 Database"). Both generations run the **same** app (`python3 /app/backend/server.py`) in a `python:3.11-slim` container; all state on a single persistent volume `main` mounted at `/data`.
+
+- **`start9/0.4/` — current/live target.** Manifest & lifecycle are TypeScript under `start9/0.4/startos/` (`manifest/index.ts`, `main.ts`, `backups.ts`, `interfaces.ts`). Built for `x86_64` + `aarch64`. Whole-volume backups (`sdk.Backups.ofVolumes('main')`). The richer `docker_entrypoint.sh` creates `/data/{backups,secrets,email_attachments}`, persists `CRM_SECRET_KEY` to `/data/.crm-secret`, and **conditionally enables Gmail** iff `/data/secrets/gmail-service-account.json` exists (then exports DWD env: `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`, sync interval 180 min). Version notes record the 0.3.5→0.4 migration is complete and the live `/data` volume is the sole source of truth (no more baked-in seed snapshot).
+- **`start9/0.3.5/` — legacy.** Hand-written YAML manifest (`manifest.yaml`), arm64-only, Tor `80→8080` + LAN `443(ssl)→8080`. Superseded by 0.4.
+- **Local/dev:** `start.sh` runs `python3 backend/server.py` with dev defaults. `start_beta.sh` is a Tailscale launcher that sources `.env.beta`, forces `CRM_ENV=production`, and requires a ≥24-char `CRM_SECRET_KEY`. In production mode the app refuses to start without `CRM_SECRET_KEY` (`backend/server.py:4487`).
+
+### 1.6 Environment variables (for CLAUDE.md "CRM connection vars")
+**Core server (`backend/server.py:42-71`):** `CRM_DATA_DIR`, `CRM_FRONTEND_DIR`, `CRM_DB_PATH` (default `<DATA_DIR>/crm.db`), `CRM_SECRET_KEY` (JWT signing — required in production), `CRM_HOST` (default `0.0.0.0`), `CRM_PORT` (default `8080`), `CRM_CORS_ORIGIN` (default `*`), `CRM_ENV` (default `development`), `CRM_LOGIN_RATE_LIMIT_PER_MIN`, `CRM_WRITE_RATE_LIMIT_PER_MIN`, `CRM_GET_RATE_LIMIT_PER_MIN`, `CRM_ABUSE_404_THRESHOLD`, `CRM_ABUSE_404_WINDOW_SEC`, `CRM_ABUSE_BAN_SEC`, `CRM_SEED_DEMO_DATA`, `CRM_GMAIL_INTEGRATION_ENABLED`.
+
+**Gmail module (`email_integration/config.py:80-101`):** `CRM_GMAIL_AUTH_METHOD` (`dwd`/`oauth`), `CRM_GMAIL_SA_KEY_PATH`, `CRM_GMAIL_WORKSPACE_DOMAIN`, `CRM_GMAIL_OAUTH_CLIENT_ID/SECRET/REDIRECT_URI`, `CRM_GMAIL_SECRET_KEY` (AES key for OAuth-token-at-rest — **separate** from `CRM_SECRET_KEY`), `CRM_GMAIL_SYNC_INTERVAL_MIN`, `CRM_GMAIL_BACKFILL_PAGE_SIZE`, `CRM_GMAIL_MAX_ATTACHMENT_MB`, plus rate/retry knobs.
+
+> **There is no network DB protocol.** "Connecting to the CRM" means either (a) opening the same SQLite file (`CRM_DB_PATH`) — only possible co-located with the `/data` volume — or (b) HTTP at `http://<host>:8080` with a Bearer JWT.
+
+---
+
+## 2. Data model
+
+PKs are **8-char truncated UUIDs** (`generate_id()` = `str(uuid.uuid4())[:8]`, `backend/server.py:522`). Timestamps are ISO-8601 UTC strings (`now()`, `backend/server.py:525`). JSON-bearing TEXT columns (`tags`, `attendees`, `options`) are `json.dumps`'d on write and auto-parsed on read by `row_to_dict()` (`backend/server.py:506-517`).
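+
+To put a number on what an 8-character id implies: eight hex characters carry 32 random bits, so the chance of at least one collision among `n` ids follows the usual birthday approximation. The sketch below is only that arithmetic; the function name is illustrative and nothing here is taken from the codebase.
+
+```python
+import math
+
+ID_BITS = 32  # 8 hex characters x 4 bits each
+
+
+def collision_probability(n_ids: int, bits: int = ID_BITS) -> float:
+    """Birthday approximation: P(at least one collision) among n random ids."""
+    space = 2 ** bits
+    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2 * space))
+
+
+for n in (1_000, 10_000, 100_000):
+    print(f"{n:>7} ids -> P(collision) ~ {collision_probability(n):.4f}")
+```
+
+Within a single table this works out to roughly 1% at 10,000 rows and well over half at 100,000, which is comfortable for a small CRM but worth knowing before the ids are reused as keys in a larger shared index.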
+
+### 2.1 Classic CRM model
+
+| Table | Role | Key columns / notes |
+|---|---|---|
+| `users` | auth + ownership principal | `username`/`email` UNIQUE, `password_hash`, `role` ∈ {`admin`,`member`}, `is_active`. First user forced `admin`. (`backend/server.py:92`) |
+| `organizations` | weak parent of contacts/opps | `name` (not unique), `type` (free-text, default `other`), `tags` JSON, `description`. (`backend/server.py:104`) |
+| `contacts` | **the hub** | `first_name`/`last_name` (req), `organization_id` (FK SET NULL), `contact_type` (free-text; load-bearing values `prospect`/`investor`), `status` (default `active`), `source`, `tags` JSON, `notes`, `linkedin_url`. (`backend/server.py:123`) |
+| `opportunities` | deal pipeline | `contact_id` (req, FK **CASCADE**), `stage` (allowlist `PIPELINE_STAGES` at `backend/server.py:1380`, enforced **only** on the stage endpoint), `commitment_amount`, `expected_amount`, `fund_name`, `owner_id`, `lost_reason`. (`backend/server.py:148`) |
+| `lp_profiles` | closed-LP extension | 1:1 with a contact (`contact_id` UNIQUE, FK CASCADE). Holds `commitment_amount`, `funded_amount`, `accredited` (bare 0/1), `legal_docs_signed`, `wire_received`, `k1_sent`, `investor_type` (free-text). (`backend/server.py:186`) |
+| `custom_fields` / `custom_field_values` | EAV custom fields | **Dead**: schema exists but has **no routes/handlers**; only ever wiped by reset. Do not build on this. (`backend/server.py:206`) |
+| `tags` | global tag palette | `name` UNIQUE + `color`. Not FK-linked to the per-row `tags` JSON arrays; just an autocomplete source. (`backend/server.py:237`) |
+| `audit_log` | mutation diff trail | `user_id`, `entity_type`, `entity_id`, `action`, `changes` JSON. **Mutation-only**, no reads, no actor/agent dimension. (`backend/server.py:227`) |
+
+**How an LP is represented:** a single `contacts` row is the canonical record; `contact_type` carries the funnel stage (`prospect`→`investor`). Promotion to `investor` is a side effect of creating an `lp_profiles` row (`backend/server.py:2834`) or of fundraising-grid sync (`backend/server.py:788`). The contact dossier is assembled by `GET /api/contacts/{id}` (`backend/server.py:2008`): contact + last-20 communications + all opportunities + the one lp_profile. **Note the unreconciled double-modeling of money:** in-flight commitment lives on `opportunities`, closed commitment lives on `lp_profiles`, and the grid has a *third* copy in `fundraising_commitments`.
+
+### 2.2 Fundraising grid model (newer, actively used)
+
+A **real-time collaborative spreadsheet** the partners actually edit: funds are columns, investors are rows, dollar commitments are cells, plus saved views, live presence/cell-locks, and a small automation engine.
+
+- **Authoritative store = one JSON blob:** `fundraising_state.grid_json` + `views_json` (row `id='main'`, `backend/server.py:258`). Reads/exports come straight from this.
+- **Normalized tables are a derived mirror**, fully rebuilt from the JSON on **every** write by `sync_fundraising_relational()` (`backend/server.py:945`): `fundraising_investors` (keyed by `source_row_id`), `fundraising_funds` (by `column_id`), `fundraising_commitments` (cells), `fundraising_contacts`, `fundraising_views`. ⚠️ **`fundraising_contacts`/`fundraising_commitments`/`fundraising_views` get fresh UUIDs on every save** (DELETE+reinsert); only `fundraising_investors.id` and `fundraising_funds.id` are stable. Don't persist external references to the volatile ones.
+- **Automation engine** (`run_fundraising_automations`, `backend/server.py:668`): currently **ignores** the rules' `condition_json`/`action_json` and uses hard-coded flag logic (`graveyard`→graveyard list, `follow_up`→follow_up list, everyone→`all`), rebuilding `fundraising_list_memberships` and logging changes to `fundraising_automation_runs`. So the rules table is display/config surface, not a live interpreter.
+- **Backups:** JSON-only filesystem snapshots of grid state to `data/backups/` (manual/auto/pre_restore), governed by a policy in `app_settings`. Restore overwrites state then re-syncs the mirror.
+
+### 2.3 The two-model bridge (the central problem)
+
+There is **no foreign key** between `fundraising_investors` and the classic `contacts/organizations/opportunities/lp_profiles`. They are joined only by best-effort name/email matching, essentially one-directional **grid → classic**:
+
+1. **Grid → classic (write-through):** every grid save pushes each grid contact into classic `contacts`/`organizations` via `_upsert_contact_from_fundraising` (`backend/server.py:730`), matching by lowercased email else (name + org). No stored key links the resulting `contacts.id` back to `fundraising_investors.id` — it re-matches by name/email each time.
+2. **Grid-logged comms → classic `communications`:** `POST /api/fundraising/log-communication` (`backend/server.py:2561`) writes into the classic `communications` table *and* appends a dated line to the grid row's `notes`.
+3. **Classic → grid (partial reverse mirror):** `_sync_contact_to_fundraising_state` (`backend/server.py:815`) patches an existing matching grid row but will **not** create a new investor row.
+
+**Net:** the same investor can simultaneously exist as a `fundraising_investors` row, one-or-more `contacts` rows, an `organizations` row, and an `lp_profiles` row, with **no authoritative join key**. The email matcher treats the **fundraising side as higher-signal** (`matcher.py:103`). `lp_profiles` is entirely outside the fundraising subsystem.
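+
+The "lowercased email else (name + org)" rule from step 1 can be pictured as a key function. This is a hypothetical sketch: `match_key` and its whitespace normalization are assumptions for illustration, not the logic of `_upsert_contact_from_fundraising`.
+
+```python
+from typing import Optional, Tuple
+
+
+def _norm(value: Optional[str]) -> str:
+    """Lowercase and collapse whitespace (assumed normalization)."""
+    return " ".join((value or "").lower().split())
+
+
+def match_key(email: Optional[str], name: str, org: Optional[str]) -> Tuple[str, ...]:
+    """Email wins when present; otherwise fall back to (name, org)."""
+    if email and email.strip():
+        return ("email", email.strip().lower())
+    return ("name_org", _norm(name), _norm(org))
+```
+
+The same person entered once with an email and once without produces two different keys, which is one way the duplicate `contacts` rows described above can arise.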
+
+### 2.4 Notes, interactions & correspondence (Phase-0 critical)
+
+Three subsystems hold embeddable text:
+
+**(a) `communications`** (`backend/server.py:168`) — the primary human-logged activity store. One row per `note`/`call`/`email`/`meeting`/`text` (type is free-text; UI offers those 5, `frontend/index.html:4220`). Columns: `contact_id` (req, CASCADE), `opportunity_id` (SET NULL), `subject`, `body`, `communication_date` (the **event** timestamp, distinct from `created_at` — this is the `date_ts` source), `outcome`, `next_action`, `attendees` JSON. Written by `POST /api/communications` and by the fundraising log endpoint. **Hard delete** at `backend/server.py:2758`.
+
+**(b) Scattered free-text fields** worth embedding: `contacts.notes`, `lp_profiles.notes`, `fundraising_investors.notes` (a running, newline-appended outreach log mirroring the grid Notes column), `opportunities.description`/`next_step`, `organizations.description`.
+
+**(c) Gmail correspondence** (`backend/email_integration/`, schema in `migrations/0001_email_tables.sql`):
+- `emails` — canonical record deduped by RFC `message_id`; `subject`, `from_*`, `to/cc/bcc` JSON, `sent_at`, `body_text`, `body_html`, `snippet`, `is_matched`, `match_status`. ⚠️ **Bodies are stored only for matched emails**; unmatched emails are metadata-only with the body nulled (`sync.py:319`).
+- `email_threads` — thread roll-up; threading via RFC `In-Reply-To`/`References` chain then Gmail thread id (`threads.py:38`).
+- `email_account_messages` — per-mailbox sighting (dedup across team inboxes).
+- `email_attachments` — metadata; bytes on disk, deduped by SHA-256.
+- **`email_investor_links`** — the entity linkage. Populates any subset of `fundraising_investor_id` / `fundraising_contact_id` / `contact_id` / `organization_id` (all **soft references, no FK**) with `match_kind` (`exact_email` conf 1.0 / `domain_match` conf 0.6) and confidence. A single email can link to several entities at once. `email_sync_runs` records per-run observability.
+
+**`audit_log` is *not* the interaction log.** It is mutation-only (≈26 write sites), logs no reads, and has no agent/actor dimension. Phase 0's "append-only interaction log of every agent action and every human touch" (Workstream A2) needs a **new** table, not a repurpose.
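+
+For the embeddable rows in (a) to (c), two small helpers make the event-timestamp and id conventions concrete. This is a sketch under stated assumptions: the namespace string and the `point_id`/`date_ts` helper names are hypothetical, and it assumes the vector store accepts UUID strings as point ids.
+
+```python
+import uuid
+from datetime import datetime, timezone
+
+# Hypothetical namespace; any fixed UUID works as long as it never changes.
+CRM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "crm.ten31.example")
+
+
+def point_id(table: str, row_id: str, part: int = 0) -> str:
+    """Deterministic id: the same (table, row, part) always maps to the same UUID."""
+    return str(uuid.uuid5(CRM_NAMESPACE, f"{table}:{row_id}:{part}"))
+
+
+def date_ts(iso_utc: str) -> int:
+    """Epoch seconds of an ISO-8601 event timestamp; naive values are treated as UTC."""
+    dt = datetime.fromisoformat(iso_utc.replace("Z", "+00:00"))
+    if dt.tzinfo is None:
+        dt = dt.replace(tzinfo=timezone.utc)
+    return int(dt.timestamp())
+```
+
+Including the table name in the hashed string keeps a `communications` row and an `emails` row with the same short id from landing on the same point, and a deterministic id makes re-ingest an overwrite rather than a duplicate.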
+
+---
+
+## 3. API / route surface
+
+Full REST verbs exist (mutations are **not** tunneled through POST): `do_GET` (1589), `do_POST` (1727), `do_PUT` (1791), `do_PATCH` (1817), `do_DELETE` (1845), `do_OPTIONS` (1580), all in `backend/server.py`. Routing is a flat `if/elif` ladder: exact string for collections, `re.match(r'^/api/x/[^/]+$')` for items, path params parsed positionally with `path.split('/')`. The Gmail module hooks in at the top of `do_GET`/`do_POST` via `try_handle(self)` (`email_integration/routes.py:49`), claiming any `/api/email/*` path.
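+
+The ladder style described above looks roughly like the following. It is a simplified illustration only: the handler names are hypothetical and the real `do_GET` covers many more paths.
+
+```python
+import re
+from typing import Optional, Tuple
+
+ITEM = re.compile(r"^/api/contacts/[^/]+$")
+ITEM_COMMS = re.compile(r"^/api/contacts/[^/]+/communications$")
+
+
+def route_get(path: str) -> Tuple[str, Optional[str]]:
+    """Return (handler_name, contact_id) for a GET path."""
+    if path == "/api/contacts":  # exact string for the collection
+        return ("list_contacts", None)
+    elif ITEM.match(path):  # regex for a single item
+        # "/api/contacts/<id>".split("/") -> ["", "api", "contacts", "<id>"]
+        return ("get_contact", path.split("/")[3])
+    elif ITEM_COMMS.match(path):
+        return ("list_contact_communications", path.split("/")[3])
+    return ("not_found", None)
+```
+
+Positional parsing is why the id always sits at index 3 for `/api/<collection>/<id>` paths: the leading slash produces an empty first element.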
+
+**Response envelope:** reads/writes → `{"data": …}` (+`total`/`limit`/`offset` for lists); errors → `{"error": msg}`; create → 201; auth → bare `{"token","user"}`; email handlers use ad-hoc keys. CORS allows all verbs; `Access-Control-Allow-Origin` echoes `CRM_CORS_ORIGIN` (default `*`).
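+
+A client can normalize these shapes with one small function. The sketch below assumes only the envelope keys listed above; `CrmApiError` and `unwrap` are hypothetical names, not part of the CRM.
+
+```python
+import json
+from typing import Any
+
+
+class CrmApiError(Exception):
+    """Raised when the CRM answers with an {"error": ...} envelope."""
+
+
+def unwrap(status: int, body: str) -> Any:
+    """Return the payload of a CRM response, raising on the error envelope."""
+    payload = json.loads(body)
+    if isinstance(payload, dict) and "error" in payload:
+        raise CrmApiError(f"{status}: {payload['error']}")
+    if isinstance(payload, dict) and "data" in payload:
+        return payload["data"]
+    # Auth and email handlers return bare objects with no "data" wrapper.
+    return payload
+```
+
+Callers that need `total`/`limit`/`offset` for pagination would read them from the parsed object before unwrapping, since this helper discards everything except `data`.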
+
+**Auth column:** None = public · Bearer = any active user · Admin = `require_admin`.
+
+| Method | Path | Purpose | Auth |
+|---|---|---|---|
+| GET | `/`, `/index.html`, `/assets/*` | Serve SPA + static | None |
+| GET | `/api/health` | Liveness | None |
+| GET | `/api/bootstrap/status` | First-run check | None |
+| POST | `/api/auth/login` | Login → JWT | None |
+| POST | `/api/auth/register` | First-user registration (self-disables) | None |
+| GET/POST | `/api/contacts` | List/search · Create | Bearer |
+| GET/PUT/DELETE | `/api/contacts/{id}` | Detail (dossier) · Update · **Hard delete** | Bearer |
+| GET | `/api/contacts/{id}/communications` | Per-contact interaction history | Bearer |
+| GET/POST · GET/PUT/DELETE | `/api/organizations[/{id}]` | Org CRUD | Bearer |
+| GET/POST · GET/PUT/DELETE | `/api/opportunities[/{id}]` | Opp CRUD | Bearer |
+| PATCH | `/api/opportunities/{id}/stage` | Move pipeline stage (validated) | Bearer |
+| GET/POST · GET/PUT/DELETE | `/api/communications[/{id}]` | Comms CRUD | Bearer |
+| GET/POST · GET/PUT | `/api/lp-profiles[/{id}]` | LP-profile CRUD (no delete route) | Bearer |
+| GET | `/api/reports/{dashboard,pipeline,lp-breakdown,activity}` | Aggregates | Bearer |
+| GET | `/api/export/contacts` | Export **all** contacts (returns JSON, not CSV) | Bearer |
+| POST | `/api/import/csv` | Bulk import from JSON rows | Bearer |
+| GET/POST · PATCH | `/api/feature-requests[/{id}]` | Feature-request tracker | Bearer |
+| GET | `/api/users` | List users (no hashes) | Bearer |
+| POST · PATCH | `/api/admin/users[/{id}]` | Create / update user | **Admin** |
+| POST | `/api/admin/reset-all-data` | ⚠️ Wipe CRM (confirm phrase `RESET ALL DATA`) | **Admin** |
+| GET | `/api/audit-log` | Mutation audit trail | **Admin** |
+| GET | `/api/security/status` | Config/security status | **Admin** |
+| GET/PUT | `/api/fundraising/state` | Get / save grid (optimistic `version`, 409 on conflict) | Bearer |
+| GET/POST | `/api/fundraising/collab/{state,heartbeat}` | Presence + cell locks | Bearer |
+| POST | `/api/fundraising/log-communication` | Log comm + append grid note | Bearer |
+| GET | `/api/fundraising/{export,relational-summary}` | Export / counts | Bearer |
+| GET | `/api/fundraising/activity` | Merged audit+automation+backup feed | **Admin** |
+| GET/PATCH | `/api/fundraising/automations[/{id}]` | Automation rules | **Admin** |
+| GET | `/api/fundraising/automation-runs` | Run history | **Admin** |
+| GET/POST | `/api/fundraising/{backups,backup,backup-verify}` | Backup mgmt | **Admin** |
+| POST | `/api/fundraising/{restore-preview,restore}` | Restore grid | **Admin** |
+| GET/PATCH | `/api/fundraising/backup-policy` | Backup policy | **Admin** |
+| GET | `/api/email/{status,accounts,threads}` | Sync status / accounts / matched threads | Bearer + flag |
+| GET | `/api/email/oauth/{start,callback}` | Per-user OAuth (callback is state-token gated, no Bearer) | mixed + flag |
+| POST | `/api/email/accounts/{enroll-all,enroll}` · `/sync/run-now` · `/rematch` | Enrollment & sync ops | **Admin** + flag |
+
+**Defined but NOT routed:** `handle_list_tags`/`create`/`delete` exist (`backend/server.py:3366-3400`) but no `/api/tags` route is wired; the `custom_fields` tables have no routes at all. Treat both as dead for Phase 0.
+
+---
+
+## 4. Authentication & authorization
+
+- **Login:** `POST /api/auth/login` → `handle_login` (`backend/server.py:1880`). Looks up active user, `verify_password` (bcrypt, PBKDF2-SHA256 fallback, `backend/server.py:444`), issues `create_token` (HS256 JWT via PyJWT, HMAC fallback; claims `user_id/username/role/exp/iat`; 24 h expiry; `backend/server.py:464`).
+- **Per-request verification:** `get_user()` (`backend/server.py:1458`) reads `Authorization: Bearer`, `decode_token` (pins `algorithms=["HS256"]` — no `alg:none` downgrade), then **re-loads the user row** and rejects if missing/inactive. Identity (incl. role) comes from the **DB row, not token claims**, so deactivation and role changes take effect immediately.
+- **No cookies, no logout, no refresh, no revocation.** The only early kill-switch is `is_active=0`.
+- **Bootstrap:** `GET /api/bootstrap/status` (public) reports `setup_required`. `POST /api/auth/register` (public, **self-disables** once any user exists) creates and force-promotes the first user to `admin`. A separate flag-gated `seed_demo_data()` hardcodes `admin/admin123` + `grant/password` and prints them — **dev-only, off by default** (`backend/server.py:4351-4374`).
+- **Roles:** only `admin`/`member`. Enforcement is an inline `require_admin(user)` (`backend/server.py:541`) at the top of each admin handler — no middleware. **No row-level authorization:** any active member can read/edit *all* LP and fundraising data; `created_by`/`owner_id` are informational only.
+- **`X_API_KEY` does not exist in code.** Repo-wide it appears only in `CLAUDE.md` and `docs/PHASE_0.md`. There is no API-key header path and no service-auth distinct from the user JWT. (The `Bearer` tokens in `email_integration/` are *outbound* Google OAuth tokens, unrelated.)
+- **Secrets:** JWT key `CRM_SECRET_KEY` (random per-process default; hard-fails in production if unset, `backend/server.py:4487`). Gmail OAuth tokens are encrypted at rest with **AES-256-GCM** keyed off the *separate* `CRM_GMAIL_SECRET_KEY` (`email_integration/crypto.py:37`) — a genuinely sound scheme. The Google service-account key lives at `data/secrets/gmail-service-account.json` (0600).
+- **Hardening present:** per-IP sliding-window rate limits (login 20/min, writes 300/min, GETs 600/min); auto-ban of scanner IPs after a 404 burst (`record_404`, `backend/server.py:1520`). **Absent or weak:** no security headers (HSTS/CSP/X-Frame-Options); CORS defaults to wildcard `*`; `X-Forwarded-For` is trusted, which is only safe behind a controlled proxy.
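+
+Why pinning the algorithm matters can be shown with a standard-library HS256 signer and verifier. This is an illustrative sketch of the general technique, not the code in `backend/server.py`; the function names are hypothetical.
+
+```python
+import base64
+import hashlib
+import hmac
+import json
+import time
+from typing import Optional
+
+
+def _b64url(data: bytes) -> str:
+    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")
+
+
+def _b64url_decode(text: str) -> bytes:
+    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))
+
+
+def sign_hs256(claims: dict, secret: bytes) -> str:
+    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
+    payload = _b64url(json.dumps(claims).encode())
+    sig = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
+    return f"{header}.{payload}.{_b64url(sig)}"
+
+
+def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
+    """Verify signature and expiry; reject any header whose alg is not HS256."""
+    header_b64, payload_b64, sig_b64 = token.split(".")
+    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
+        raise ValueError("unexpected alg")  # blocks the alg:none downgrade
+    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256).digest()
+    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
+        raise ValueError("bad signature")
+    claims = json.loads(_b64url_decode(payload_b64))
+    if claims.get("exp", 0) <= (time.time() if now is None else now):
+        raise ValueError("expired")
+    return claims
+```
+
+A valid signature only proves the token was issued by the server. As noted above, the CRM still reloads the user row on every request, so the claims should be treated as a lookup key rather than as the source of role or active status.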
+
+---
+
+## 5. Phase-0 implications (carry into A2/B/C)
+
+1. **Pick a canonical LP identity.** The classic vs fundraising duality (§2.3) means entity resolution (A4/B3) must collapse across **both** models — grid investor + grid contacts + classic contact/org/lp_profile — into one canonical `lp_id`, not just dedupe name variants within one table. The email matcher's preference (`fundraising_contacts` > `contacts`) is a hint that the **grid is the operationally-live LP graph**, but closed financials/accreditation live only in `lp_profiles`.
+2. **Canonical ID host.** `contacts.id` is the natural join target (lp_profiles + opportunities FK to it; the dossier is keyed on it), but the **8-char truncated UUID** is a uniqueness risk if it becomes the Qdrant payload key — resolve before indexing.
+3. **Interaction log is greenfield.** `audit_log` won't do (mutation-only, no actor/agent dimension). A2 should add a dedicated append-only `interaction_log` table and route all MCP/agent writes through it (guardrail #5).
+4. **Soft-delete gap.** Everything is hard-deleted today (guardrail #3 risk) and there's no tombstone for an idempotent re-embed to detect removals. A2 should add a `deleted_at`/archive flag and have ingest treat it as a delete-from-index signal.
+5. **Chunk sources** (per `EMBEDDINGS.md`): one chunk per `communications` row; one per **matched** `emails` row (only these have bodies); split the `fundraising_investors.notes` outreach log per dated line; one chunk each for the scattered note/description fields. Keep ids/names/dates/types/confidence as filterable payload; embed only subject/body/note prose. `date_ts` = epoch of the **event** timestamp (`communication_date`/`sent_at`), not `created_at`. Key Qdrant point ids on stable ids (`emails.id`), namespaced to avoid the 8-char collision risk.
+6. **Migration mechanism.** No Alembic. Either extend the in-code idempotent pattern (`CREATE … IF NOT EXISTS` + try/except `ALTER ADD COLUMN`) or adopt the email module's numbered-`.sql` runner for the core schema (recommended; add a `schema_migrations` ledger). SQLite's `ALTER TABLE` is limited to renames and `ADD COLUMN` (plus `DROP COLUMN` from SQLite 3.35.0, with restrictions), which conveniently nudges migrations toward the additive/reversible guardrail.
+7. **Ingest connection.** There is no network DB protocol, so open `data/crm.db` **read-only** (`sqlite3` URI `mode=ro`) with the same `busy_timeout`, co-located with the `/data` volume. The journal mode is stored in the database file, so a read-only connection inherits WAL without setting it; in WAL mode readers do not block the live writer, though a long-running read transaction can hold back checkpointing, so keep ingest reads short. Reserve HTTP+JWT for any write-back. Decide the MCP↔CRM auth boundary explicitly (recommend: read-only direct SQLite for ingest; a constrained service principal for write-back) since `X_API_KEY` is unimplemented.
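+
+The ledgered runner suggested in item 6 could look something like this. It is a minimal sketch, not the email module's runner: `apply_migrations` is a hypothetical name, the migration names are examples, and only the `schema_migrations` table name comes from the text above.
+
+```python
+import sqlite3
+from typing import Iterable, List, Tuple
+
+
+def apply_migrations(conn: sqlite3.Connection, migrations: Iterable[Tuple[str, str]]) -> List[str]:
+    """Apply (name, sql) pairs in name order, skipping names already in the ledger."""
+    conn.execute(
+        "CREATE TABLE IF NOT EXISTS schema_migrations ("
+        "name TEXT PRIMARY KEY, applied_at TEXT NOT NULL DEFAULT (datetime('now')))"
+    )
+    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
+    ran: List[str] = []
+    for name, sql in sorted(migrations):
+        if name in applied:
+            continue
+        # executescript runs outside our transaction, so the script and the
+        # ledger insert are not atomic; keep each migration small and additive.
+        conn.executescript(sql)
+        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
+        conn.commit()
+        ran.append(name)
+    return ran
+```
+
+Running it twice with the same list is a no-op the second time, which is the property that lets the runner execute on every boot.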
+
+---
+
+## 6. Open questions for the owner
+
+1. **Which model is canonical for an LP** — the `contacts` row or the `fundraising_investors` grid row? (Determines the ingest spine and the canonical-ID target.)
+2. **Is the Gmail integration enabled on the live Start9 box, and has a backfill run?** If not, the Phase-0 corpus is just `communications` + note fields until it is. (The email schema exists regardless.)
+3. **Should `custom_fields` (EAV) and the unrouted `tags` CRUD be revived or removed?** Leaving the EAV risks a second divergent custom-data path next to the live fundraising custom columns.
+4. **Accreditation today is a single boolean** (`lp_profiles.accredited`) with no QP flag / method / date — where should the 506(b)/506(c) + accreditation/QP fields counsel will require (guardrail #6) live?
+5. **MCP↔CRM auth:** build a real `X_API_KEY` service-key path, authenticate the MCP server as a dedicated CRM user (24 h JWT, must re-login), or read SQLite directly? And does the agent principal need a new least-privilege role below `member`?
+6. **`fundraising_state.grid_json` vs the normalized mirror** — confirmed authoritative = the JSON blob; is reading the normalized mirror (consistent after each save) acceptable for ingest, treating the JSON as the re-derivable source of truth?
+
+---
+
+*Sources: `backend/server.py`, `backend/email_integration/*`, `backend/email_integration/migrations/0001_email_tables.sql`, `start9/0.3.5/*`, `start9/0.4/*`. Generated from a structured multi-agent read of the codebase, cross-checked against the live `data/crm.db` schema (currently a near-empty seeded instance: 1 user, 9 funds, 4 views, 2 automation rules — the real corpus lives on the Start9 deployment).*
@@ -0,0 +1,115 @@
+# Gmail Integration — Enablement Runbook
+
+*How to turn on the (already-built) Gmail correspondence integration on the live Start9 box, validate it with a small observed backfill, then roll out to the domain. Read-only capture; all mail stays on Ten31 infrastructure.*
+
+Code: `backend/email_integration/`. Schema: `migrations/0001_email_tables.sql`. See `docs/crm-overview.md` §2.4 for the data model.
+
+---
+
+## What this does & the sovereignty posture
+
+- Pulls Gmail messages for enrolled `@ten31.xyz` mailboxes into the CRM's own SQLite DB (`emails`, `email_threads`, `email_attachments`, …), **deduped across inboxes**, **threaded**, and **matched** to investors/contacts (`email_investor_links`).
+- **Scope is `https://www.googleapis.com/auth/gmail.readonly`** (`credentials.py:34`) — the integration can *read* mail, never send or modify. Lower risk, and it's all the ingest needs.
+- **Data path is Google → your Start9 box only.** No new third party, and per guardrail #9 Claude never reads the mail — the correspondence becomes ingest input for *local* embeddings (bge-m3 on the Sparks), not API context. (Contrast with Superhuman's MCP — see §7.)
+
+## 0. Pick the auth method
+
+| Method | When | What you provide |
+|---|---|---|
+| **DWD (domain-wide delegation)** — recommended | You administer the `ten31.xyz` Google Workspace and want to capture team mailboxes without per-user consent | One service-account JSON key + a Workspace admin authorization |
+| **Per-user OAuth** | Capturing a mailbox you don't admin, or avoiding DWD | OAuth client id/secret + each user clicks through `/api/email/oauth/start` |
+
+The Start9 0.4 entrypoint is built around **DWD** (auto-detects the key, sets `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`). The rest of this runbook assumes DWD.
+
+## 1. Google-side setup (one time)
+
+You need Workspace **super-admin** + a GCP project.
+
+1. **GCP project** → enable the **Gmail API** (`APIs & Services → Library → Gmail API → Enable`).
+2. **Create a service account** (`IAM & Admin → Service Accounts`). Note its **client ID** (a long number) and its email.
+3. **Create a JSON key** for it (`Keys → Add key → JSON`). This file is the secret — handle per guardrail #7.
+4. **Authorize domain-wide delegation** in the Workspace **Admin console** (`Security → Access and data control → API controls → Domain-wide delegation → Add new`):
+   - **Client ID** = the service account's client ID from step 2.
+   - **OAuth scopes** = `https://www.googleapis.com/auth/gmail.readonly`
+   - Save. (Without this exact scope authorized, sync returns a non-retryable auth error — see `errors.py:21`.)
+
+## 2. Install the key on Start9
+
+1. Copy the JSON key to the service's data volume at **`/data/secrets/gmail-service-account.json`**.
+2. Lock it down: `chmod 600 /data/secrets/gmail-service-account.json` (the entrypoint also `chmod 700`s `/data/secrets`).
+3. **Restart the service.** On boot the 0.4 entrypoint detects the key and exports:
+   `CRM_GMAIL_INTEGRATION_ENABLED=true`, `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_SA_KEY_PATH=/data/secrets/gmail-service-account.json`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`, `CRM_GMAIL_SYNC_INTERVAL_MIN=180`.
+   It logs `Gmail integration: ENABLED (key at …)`.
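+
+The gating in step 3 boils down to "key file present means enabled". The real entrypoint is a shell script; the sketch below restates that rule in Python purely for illustration, with `gmail_env` as a hypothetical helper and the values copied from the list above.
+
+```python
+import os
+from typing import Dict
+
+DEFAULT_KEY_PATH = "/data/secrets/gmail-service-account.json"
+
+
+def gmail_env(key_path: str = DEFAULT_KEY_PATH) -> Dict[str, str]:
+    """Return the Gmail env vars to export, or an empty dict if the key is absent."""
+    if not os.path.isfile(key_path):
+        return {}  # integration stays disabled
+    return {
+        "CRM_GMAIL_INTEGRATION_ENABLED": "true",
+        "CRM_GMAIL_AUTH_METHOD": "dwd",
+        "CRM_GMAIL_SA_KEY_PATH": key_path,
+        "CRM_GMAIL_WORKSPACE_DOMAIN": "ten31.xyz",
+        "CRM_GMAIL_SYNC_INTERVAL_MIN": "180",
+    }
+```
+
+This also explains the rollback path: removing or renaming the key file and restarting is enough to turn the integration off.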
+
+## 3. Smoke test — ONE mailbox first (the "don't rush it" gate)
+
+Do a single-mailbox run before enrolling the whole team, to shake out auth/matching bugs on a small surface. All calls need an **admin Bearer token**:
+
+```bash
+CRM='https://<your-start9-crm-host>'         # the CRM's address; -k below skips TLS verification, drop it if the cert is trusted
+TOKEN=$(curl -sk $CRM/api/auth/login -H 'Content-Type: application/json' \
+  -d '{"username":"<admin>","password":"<pw>"}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["token"])')
+
+# integration alive?
+curl -sk $CRM/api/email/status -H "Authorization: Bearer $TOKEN"
+
+# enroll just yourself
+curl -sk $CRM/api/email/accounts/enroll -H "Authorization: Bearer $TOKEN" \
+  -H 'Content-Type: application/json' -d '{"email":"you@ten31.xyz"}'
+
+# trigger a sync now (otherwise it runs every 180 min)
+curl -sk $CRM/api/email/sync/run-now -X POST -H "Authorization: Bearer $TOKEN"
+```
+
+**Tip:** to keep the first backfill small, set `CRM_GMAIL_BACKFILL_PAGE_SIZE` low (e.g. `50`) before the restart, watch one page land, then raise it.
+
+## 4. Verify (on the box, read-only SQL)
+
+```sql
+-- sync ran cleanly?
+SELECT kind, status, messages_seen, messages_stored, attachments_saved, error
+FROM email_sync_runs ORDER BY started_at DESC LIMIT 3;
+
+-- mail captured + how much got matched to investors/contacts
+SELECT COUNT(*) total, SUM(is_matched) matched FROM emails;
+
+-- who did it match, and how confidently?
+SELECT match_kind, COUNT(*) FROM email_investor_links GROUP BY match_kind;
+```
+
+Or via the API: `GET /api/email/status` (counts) and `GET /api/email/threads?investor_id=<id>` (matched threads for one investor). If matching looks thin, run `POST /api/email/rematch` with `{"since":"<ISO8601>"}` after the investor list is populated.
+
+## 5. Roll out to the domain
+
+Once the single mailbox looks right:
+
+```bash
+curl -sk $CRM/api/email/accounts/enroll-all -X POST -H "Authorization: Bearer $TOKEN"
+curl -sk $CRM/api/email/sync/run-now -X POST -H "Authorization: Bearer $TOKEN"
+```
+
+Incremental sync then runs every `CRM_GMAIL_SYNC_INTERVAL_MIN` (default 180) via the scheduler thread.
+
+## 6. Tuning knobs (env, `config.py`)
+
+`CRM_GMAIL_SYNC_INTERVAL_MIN` (180) · `CRM_GMAIL_BACKFILL_PAGE_SIZE` (500) · `CRM_GMAIL_MAX_ATTACHMENT_MB` (50) · `CRM_GMAIL_ATTACH_CONCURRENCY` (4) · `CRM_GMAIL_RATE_UNITS_SEC` (150) · `CRM_GMAIL_HISTORY_STALE_DAYS` (5, forces a backfill if Gmail pruned history).
+
+## 7. Where Superhuman fits (and where it doesn't)
+
+You have Superhuman connected to Gmail, and it exposes an MCP server. The two are **complementary, not competing**, and it matters which job each does:
+
+- **Canonical correspondence ingest → use this DWD integration, not Superhuman.** It pulls mail straight into your own `crm.db` on Start9 and feeds the *local* embedding pipeline. Routing bulk ingest through Superhuman's MCP would put your email content through Superhuman's servers and — because an agent/Claude would be driving those calls — through Anthropic, which is exactly what guardrail #1 keeps the corpus away from. DWD keeps the data path Google → your box.
+- **Human mail workflow & drafting → Superhuman MCP is great.** Reading/triaging your own inbox, and Closer-style *draft* generation that a human reviews and sends, naturally happen in your real mail client. The `batch-draft-writer` skill already drives the Superhuman MCP for that, and it's usable today — independent of the CRM pipeline.
+
+Net: **DWD = system-of-record correspondence (sovereign, for retrieval). Superhuman MCP = the human's working surface (drafting, triage).** Don't make Superhuman the ingest source of truth.
+
+## 8. Disable / rollback
+
+Remove (or rename) `/data/secrets/gmail-service-account.json` and restart → the entrypoint logs `DISABLED` and routes return 503; captured data remains. To pause one mailbox without disabling the whole integration, set its `email_accounts.sync_enabled = 0`.
+
+## 9. Troubleshooting
+
+- **401/403 from Google on sync** → DWD scope not authorized, wrong client ID, or Gmail API not enabled (steps 1 & 4). This error is non-retryable by design (`errors.py`).
+- **`status` says disabled / routes 503** → key not found at `CRM_GMAIL_SA_KEY_PATH`, or `CRM_GMAIL_INTEGRATION_ENABLED` not truthy (the entrypoint only sets it when the key file exists).
+- **Mail captured but `matched = 0`** → the investor/contact list was empty or addresses don't match; populate the CRM/grid first, then `POST /api/email/rematch`.
+- **Bodies missing on some emails** → by design, unmatched emails are stored metadata-only (no body) until matched (`sync.py`); re-match to backfill.
@@ -0,0 +1,88 @@
+# Redaction / Re-hydration Boundary
+
+*Design note for the privacy boundary between Ten31's sovereign data and the Claude API. Written in Phase 0 so it is a designed step, not an afterthought. **It is BUILT in Phase 2/3** (Analyst, Closer) — Phase 0 does not need it (see §1).*
+
+Related: `CLAUDE.md` guardrails #1 (sovereignty), #4 (human-in-the-loop outbound), #9 (dev-time data handling); `docs/Ten31_Agentic_Build_Plan.md` §4.5.
+
+---
+
+## 1. When this applies (and when it doesn't)
+
+The system has two very different data paths, and only one of them needs this:
+
+- **Local-only paths — no redaction needed, because Claude is never in the loop.** All of Phase 0 is here: ingest, chunking, embeddings (bge-m3), the vector index (Qdrant), and entity resolution (local Qwen). Sensitive data flows `CRM → Sparks → Qdrant`, all on Ten31 infrastructure. Nothing reaches Anthropic, so there is nothing to scrub. **Do not add redaction overhead to the ingest/retrieval path.**
+- **Claude-facing reasoning steps — this boundary applies.** When an agent asks Claude to *reason over* LP-specific content: **Analyst** (building a dossier from retrieved chunks), **Closer** (drafting outreach/nurture/meeting prep), and any Orchestrator step that forwards record content. These send sensitive context to a third-party API and are the reason this boundary exists.
+
+The guiding rule (guardrail #1) is *"only the minimum necessary, non-sensitive context per call."* Redaction/re-hydration is how we honor that rule **when the task genuinely needs record content** — as opposed to simply sending less.
+
+## 2. The three-tier data classification
+
+Before any agent calls Claude, classify each piece of context:
+
+| Tier | Examples | Treatment |
+|---|---|---|
+| **Never send** | Full LP list/export, bulk relationship graph, raw account numbers, wire details, SSNs/passport, anything covered by a confidentiality obligation | Stays on Ten31 infra. Not even tokenized — just excluded. |
+| **Tokenize (pseudonymize)** | Person names, org/fund names, emails, phone, physical addresses, exact $ amounts, dates that pin identity | Replaced with stable placeholders before the call; real values swapped back locally after. |
+| **Send as-is** | The *substance* an agent needs Claude to reason about: thesis discussion, sentiment, objections, generic deal mechanics, the drafted message body (minus identifiers) | Sent in the de-identified prompt. |
+
+The art is the middle tier: keep enough semantic content for Claude to be useful, while every *identifier* is a placeholder.
+
+## 3. The round-trip
+
+```
+  ┌─────────────────────────── Ten31 infrastructure (sovereign) ───────────────────────────┐
+  │                                                                                          │
+  │  1. SCRUB (local model on the Sparks, via Spark Control)                                  │
+  │     - Pull the minimal context the task needs (retrieved chunks + record fields).         │
+  │     - NER + rule pass replaces Tier-2 identifiers with stable tokens:                      │
+  │         "Jonathan Reyes" -> [PERSON_1]   "Cedar Point Capital" -> [ORG_1]                  │
+  │         "jon@cedarpoint..." -> [EMAIL_1]  "$5,000,000" -> [AMOUNT_1]   "Fund III" -> [FUND_1]│
+  │     - Tokens are STABLE within a task (same entity -> same token) and CONSISTENT across    │
+  │       all chunks in the call, so Claude can reason about relationships.                    │
+  │     - The pseudonym map { [PERSON_1] -> "Jonathan Reyes", ... } is held LOCALLY, keyed to  │
+  │       the task/session. It never leaves the box.                                           │
+  │     - Drop Tier-1 content entirely. Log the scrub to the interaction_log.                  │
+  │                                                                                          │
+  └──────────────────────────────────────────┬───────────────────────────────────────────────┘
+                                              │  de-identified prompt (placeholders only)
+                                              ▼
+                         2. REASON  —  Claude API (Agent SDK)
+                            Drafts / synthesizes using [PERSON_1], [ORG_1], [FUND_1] ...
+                                              │  response referencing the same placeholders
+  ┌──────────────────────────────────────────┴───────────────────────────────────────────────┐
+  │                                                                                          │
+  │  3. RE-HYDRATE (local)                                                                     │
+  │     - Substitute real values back in using the local pseudonym map.                        │
+  │     - A human reviews the re-hydrated draft (guardrail #4) before anything is sent.        │
+  │     - Log the rehydrate + the human decision to the interaction_log.                       │
+  │                                                                                          │
+  └──────────────────────────────────────────────────────────────────────────────────────────┘
+```
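The scrub and re-hydrate steps in the diagram can be sketched as follows. This is a minimal illustration, not the planned implementation: `PseudonymMap`, `scrub`, and `rehydrate` are hypothetical names, and the entity list is passed in by hand where the real design would use the NER + rule pass on the local model.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PseudonymMap:
    """Task-scoped token map (hypothetical); in the design it stays on the local box."""
    forward: dict[tuple[str, str], str] = field(default_factory=dict)  # (kind, value) -> token
    reverse: dict[str, str] = field(default_factory=dict)              # token -> real value
    counters: dict[str, int] = field(default_factory=dict)             # kind -> last index used

    def token_for(self, kind: str, value: str) -> str:
        # Same entity -> same token for the lifetime of this map.
        key = (kind, value)
        if key not in self.forward:
            self.counters[kind] = self.counters.get(kind, 0) + 1
            token = f"[{kind}_{self.counters[kind]}]"
            self.forward[key] = token
            self.reverse[token] = value
        return self.forward[key]


def scrub(text: str, entities: list[tuple[str, str]], pmap: PseudonymMap) -> str:
    """Replace each (kind, value) pair found in the text with its stable token."""
    # Longest values first so a long name is not split by a shorter one inside it.
    for kind, value in sorted(entities, key=lambda e: len(e[1]), reverse=True):
        if value in text:
            text = text.replace(value, pmap.token_for(kind, value))
    return text


TOKEN_RE = re.compile(r"\[[A-Z]+_\d+\]")


def rehydrate(text: str, pmap: PseudonymMap) -> str:
    """Swap tokens back to real values; unknown tokens are left untouched."""
    return TOKEN_RE.sub(lambda m: pmap.reverse.get(m.group(0), m.group(0)), text)
```

Reusing one `PseudonymMap` for every chunk in a call is what keeps the tokens consistent across chunks; discarding it when the task ends corresponds to the per-task ephemeral option raised in the open questions.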
+
+## 4. Where it runs
+
+The natural home is **Spark Control** (the gateway that already fronts the local models): add a scrub endpoint and a rehydrate endpoint, or run the pair as middleware inside the Agent SDK tool loop so every outbound Claude call passes through it. Keeping it at the gateway means:
+- One enforcement point — agents can't accidentally bypass it.
+- The pseudonym map lives next to the local models, never in agent memory that might get logged.
+- The scrub uses the same local Qwen already used for entity resolution, so the NER is consistent with how entities were canonicalized at ingest (reuse the `canonical_entities` → token mapping).
+
+## 5. Caveats (why this is a tool, not a magic switch)
+
+- **Free-text leakage is the hard part.** A note that says *"the family that sold the mining company in Texas last year"* re-identifies even with the name tokenized. The scrub model must catch *descriptive* identifiers, not just named entities — and it will not be perfect. For high-sensitivity tasks, prefer sending *less* (summary/thesis only) over trusting the scrub to catch everything.
+- **Re-identification by inference.** Enough tokenized-but-specific detail (amounts + dates + sector) can still single out a person. Keep Tier-2 amounts/dates *bucketed* ("~$5M", "Q1") when the exact value isn't needed.
+- **Map integrity.** The pseudonym map is sensitive (it's the de-anonymization key) — keep it local, in memory or short-lived, never logged to a third party, never sent in a prompt.
+- **It does not replace minimization.** First ask "does Claude need this record content at all?" Often a retrieval summary suffices. Redaction is for when the answer is genuinely yes.
+- **Consistency with retrieval.** Retrieval itself is already local, so chunks come back with real values; the scrub is applied at the *prompt-assembly* step, not at ingest.
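The bucketing idea from the re-identification caveat could look like the sketch below. The function names and the bucket widths (whole thousands, millions, or billions, and calendar quarters) are assumptions chosen for illustration; the actual precision policy is still an open question.

```python
from datetime import date

# (unit size in dollars, suffix), largest first.
_UNITS = [(1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")]


def bucket_amount(usd: float) -> str:
    """Floor a dollar amount to a whole unit, e.g. 5_200_000 -> "~$5M"."""
    for size, suffix in _UNITS:
        if usd >= size:
            return f"~${int(usd // size)}{suffix}"
    return "<$1K"


def bucket_date(d: date) -> str:
    """Reduce an exact date to its calendar quarter and year."""
    quarter = (d.month - 1) // 3 + 1
    return f"Q{quarter} {d.year}"
```

Flooring rather than rounding keeps every bucket a fixed, non-overlapping range, so a value never jumps to the next unit at a boundary.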
+
+## 6. Verification (when we build it)
+
+- A test harness that asserts no Tier-1 string and no real Tier-2 identifier appears in any outbound payload (golden-file diff over recorded prompts).
+- A re-identification spot-check: have the local model attempt to re-identify entities from the de-identified prompt alone; flag anything it gets right.
+- Every scrub/rehydrate logged to `interaction_log` (actor, task, token-count, what tier was dropped) for audit (guardrail #5).
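The first check in the list above might start from something like this sketch. `find_leaks` and `assert_clean` are hypothetical helpers, and the forbidden list is assumed to be the real values from the pseudonym map plus any known Tier-1 strings. Plain substring matching will not catch descriptive identifiers, so it complements the re-identification spot-check rather than replacing it.

```python
def find_leaks(payload: str, forbidden: list[str]) -> list[str]:
    """Return every forbidden string that appears in the outbound payload."""
    haystack = payload.casefold()
    # Case-insensitive match; empty strings are skipped so they never match.
    return [s for s in forbidden if s and s.casefold() in haystack]


def assert_clean(payload: str, forbidden: list[str]) -> None:
    """Raise if any forbidden identifier survived the scrub."""
    leaks = find_leaks(payload, forbidden)
    if leaks:
        # Report only the count so the failure message itself leaks nothing.
        raise AssertionError(f"{len(leaks)} identifier(s) leaked into outbound payload")
```

In a golden-file setup, `assert_clean` would run over each recorded prompt before the diff is accepted.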
+
+## 7. Open questions (resolve at build time, Phase 2/3)
+
+1. Token granularity — per-task ephemeral maps, or a stable per-entity token space reused across tasks (better for Claude's cross-call memory, worse for re-identification risk)?
+2. Do we tokenize the *drafted outbound message itself* (Closer) and re-hydrate, or draft against placeholders and let the human fill specifics? (Affects how much the human edits.)
+3. Bucketing policy for amounts/dates — what precision is "non-sensitive enough"?
+4. Where exactly in the Agent SDK loop the middleware sits, and how it composes with prompt caching (placeholders must be cache-stable).
@@ -1,33 +0,0 @@
-# Deploy on StartOS 0.3.5 (Raspberry Pi)
-
-## 1) Build the package on your Mac
-```bash
-cd /Users/macpro/Projects/CRM
-make -C start9/0.3.5 package
-```
-
-This creates:
- `start9/0.3.5/image.tar`
- `start9/0.3.5/ten31-database.s9pk`
-
-## 2) Upload package to StartOS
-1. Open StartOS web UI.
-2. Go to Services -> Sideload Package (or equivalent 0.3.5 menu).
-3. Upload `ten31-database.s9pk`.
-4. Install and start the service.
-
-## 3) First run
-1. Open the service UI.
-2. Create first admin account on the login screen.
-3. In Settings, run one manual backup immediately.
-
-## 4) Data persistence contract
- App DB path: `/data/crm.db`
- Backup path: `/data/backups`
-
-Because these are in the persistent service volume, app restarts/upgrades do not erase data.
-
-## 5) Before any upgrade/migration
-1. Run manual backup in-app.
-2. Export fundraising state in-app.
-3. Keep both files off-device as recovery copy.
@@ -1,25 +0,0 @@
-FROM python:3.11-slim
-
-ENV PYTHONDONTWRITEBYTECODE=1 \
-    PYTHONUNBUFFERED=1 \
-    CRM_ENV=production \
-    CRM_HOST=0.0.0.0 \
-    CRM_PORT=8080 \
-    CRM_DATA_DIR=/data \
-    CRM_FRONTEND_DIR=/app/frontend
-
-WORKDIR /app
-
-RUN apt-get update \
-  && apt-get install -y --no-install-recommends ca-certificates curl \
-  && rm -rf /var/lib/apt/lists/*
-
-COPY backend/server.py /app/backend/server.py
-COPY frontend /app/frontend
-COPY start9/0.3.5/docker_entrypoint.sh /usr/local/bin/docker_entrypoint.sh
-COPY start9/0.3.5/healthcheck.sh /usr/local/bin/healthcheck.sh
-
-RUN chmod +x /usr/local/bin/docker_entrypoint.sh /usr/local/bin/healthcheck.sh
-
-EXPOSE 8080
-ENTRYPOINT ["/usr/local/bin/docker_entrypoint.sh"]
@@ -1,23 +0,0 @@
-PKG_ID := ten-database
-PKG_VERSION := 0.1.0.1
-REPO_ROOT := $(abspath ../..)
-WRAPPER_DIR := $(CURDIR)
-IMAGE_NAME := start9/$(PKG_ID)/main:$(PKG_VERSION)
-
-.PHONY: image-arm package verify clean
-
-image-arm:
-	docker buildx build --platform=linux/arm64 \
-		-f $(WRAPPER_DIR)/Dockerfile \
-		-t $(IMAGE_NAME) \
-		-o type=docker,dest=$(WRAPPER_DIR)/image.tar \
-		$(REPO_ROOT)
-
-package: image-arm
-	start-sdk pack
-
-verify:
-	start-sdk verify s9pk $(PKG_ID).s9pk
-
-clean:
-	rm -f $(WRAPPER_DIR)/image.tar $(WRAPPER_DIR)/$(PKG_ID).s9pk
@@ -1,23 +0,0 @@
-# Start9 Wrapper (0.3.5)
-
-This directory contains the StartOS 0.3.5 package wrapper for Ten31 Database.
-
-## Build prerequisites
- Docker with buildx
- `start-sdk` installed on build machine
-
-## Build package
-```bash
-cd /Users/macpro/Projects/CRM
-make -C start9/0.3.5 package
-```
-
-## Verify package
-```bash
-cd /Users/macpro/Projects/CRM
-make -C start9/0.3.5 verify
-```
-
-## Outputs
- `start9/0.3.5/image.tar`
- `start9/0.3.5/ten-database.s9pk`
@@ -1,20 +0,0 @@
-#!/bin/sh
-set -eu
-
-DATA_DIR="${CRM_DATA_DIR:-/data}"
-SECRET_FILE="$DATA_DIR/.crm-secret"
-
-mkdir -p "$DATA_DIR" "$DATA_DIR/backups"
-
-if [ -z "${CRM_SECRET_KEY:-}" ]; then
-  if [ -f "$SECRET_FILE" ]; then
-    CRM_SECRET_KEY="$(cat "$SECRET_FILE")"
-  else
-    CRM_SECRET_KEY="$(head -c 48 /dev/urandom | base64 | tr -d '\n' | tr '/+' 'ab')"
-    printf '%s' "$CRM_SECRET_KEY" > "$SECRET_FILE"
-    chmod 600 "$SECRET_FILE"
-  fi
-  export CRM_SECRET_KEY
-fi
-
-exec python3 /app/backend/server.py
@@ -1,5 +0,0 @@
-#!/bin/sh
-set -eu
-
-PORT="${CRM_PORT:-8080}"
-curl -fsS "http://127.0.0.1:${PORT}/api/health" >/dev/null
@@ -1,24 +0,0 @@
-# Ten31 Database (StartOS 0.3.5)
-
-## What this package does
- Runs Ten31 Database as a private web app.
- Persists all data under the StartOS service volume (`/data`).
- Exposes web UI/API on internal port `8080`.
-
-## First launch
-1. Open the service UI from StartOS.
-2. If this is a fresh install, create the first admin account from the login screen.
-3. Go to Settings and run a manual backup once.
-
-## Airtable migration
-1. Open Settings -> Migration.
-2. Choose "Import from Airtable CSV".
-3. Confirm row/column mappings before final import.
-
-## Data safety
- Database path in container: `/data/crm.db`.
- Backups path in container: `/data/backups/`.
- Before StartOS or package upgrades, run a backup and export from Settings.
-
-## Upgrade note
-This 0.3.5 wrapper keeps app/runtime files separate from data volume so migration to a future 0.4 wrapper can preserve the same data directory layout.
@@ -1,95 +0,0 @@
-id: ten-database
-title: Ten31 Database
-version: 0.1.0.1
-release-notes: >-
-  Initial StartOS 0.3.5 package wrapper for Ten31 Database.
-license: MIT
-wrapper-repo: https://github.com/ten31/ten31-database-startos
-upstream-repo: https://github.com/ten31/ten31-database
-support-site: https://github.com/ten31/ten31-database/issues
-marketing-site: https://ten31.vc
-build: ["make image-arm"]
-min-os-version: 0.3.5
-
-description:
-  short: Self-hosted investor and fundraising database for Ten31.
-  long: >-
-    Ten31 Database is an Airtable-style investor CRM with fundraising grid,
-    communications logging, views, backups, and CSV import. This package stores
-    all runtime data in the service volume for upgrade-safe persistence.
-
-assets:
-  license: LICENSE
-  icon: icon.png
-  instructions: instructions.md
-  docker-images: image.tar
-
-main:
-  type: docker
-  image: main
-  entrypoint: docker_entrypoint.sh
-  args: []
-  mounts:
-    main: /data
-
-health-checks:
-  main:
-    name: API health
-    success-message: CRM API is responding.
-    type: docker
-    image: main
-    entrypoint: healthcheck.sh
-    args: []
-    inject: true
-
-config: ~
-dependencies: {}
-
-volumes:
-  main:
-    type: data
-
-interfaces:
-  main:
-    name: Web Interface
-    description: Browser UI and API for Ten31 Database.
-    tor-config:
-      port-mapping:
-        80: "8080"
-    lan-config:
-      8080:
-        ssl: false
-        internal: 8080
-    ui: true
-    protocols: [http]
-
-backup:
-  create:
-    type: docker
-    image: main
-    system: false
-    entrypoint: sh
-    args:
-      - -c
-      - |
-        set -eu
-        rm -rf /backup/*
-        cp -a /data/. /backup/
-    mounts:
-      main: /data
-      backup: /backup
-  restore:
-    type: docker
-    image: main
-    system: false
-    entrypoint: sh
-    args:
-      - -c
-      - |
-        set -eu
-        cp -a /backup/. /data/
-    mounts:
-      main: /data
-      backup: /backup
-
-actions: {}
@@ -0,0 +1,6 @@
+node_modules/
+javascript/
+javascript.old/
+.DS_Store
+.home/
+*.s9pk
@@ -0,0 +1,231 @@
+# Ten31 Database — StartOS 0.4 deployment guide
+
+This guide walks through building the `ten-database` 0.4 service package and
+sideloading it onto an x86_64 StartOS 0.4 beta machine.
+
+The `start9/0.4/` folder is intentionally self-contained. It does not share
+any files with `start9/0.3.5/`, so the legacy package stays intact and can
+be rebuilt later if needed.
+
+---
+
+## 0 — How data preservation works
+
+Starting with **0.1.0:40**, this package no longer ships a seed snapshot. The
+0.3.5 → 0.4 migration is complete and the live `/data` volume on the StartOS
+host is the sole source of truth.
+
+Key facts:
+
+- StartOS preserves the `main` volume across sideloads. Reinstalling a new
+  `.s9pk` does **not** touch `/data/crm.db`, `/data/backups/`, or
+  `/data/.crm-secret`. Live edits made between releases are kept.
+- Only `Uninstall` from the StartOS UI destroys the `main` volume. As long
+  as you only `Stop → Sideload new .s9pk → Start`, your data persists.
+- Use StartOS-level **Backups → Create Backup** for full volume snapshots,
+  and the in-app **Settings → Admin → Run Backup** for JSON exports under
+  `/data/backups/`.
+
+Container paths (unchanged from 0.3.5):
+
+- `/data/crm.db` — primary SQLite DB (WAL journal mode)
+- `/data/backups/` — JSON exports
+- `/data/.crm-secret` — JWT signing key (kept across restarts so sessions stick)
+
+> Historical note: `0.1.0:39` shipped a baked-in seed snapshot and a
+> first-boot copy guarded by `! -f /data/crm.db`. That code path was removed
+> in `0.1.0:40`. If you ever need to bootstrap a fresh host again, sideload
+> `0.1.0:39` first, let it seed, then upgrade to the latest.
+
+---
+
+## 1 — Build-machine prerequisites
+
+The 0.4 build runs on any machine with:
+
+- Node.js ≥ 20 and npm
+- Docker with buildx enabled (Docker Desktop on macOS works; Linux Docker
+  must have the `buildx` plugin)
+- `start-cli` (Start9 SDK) — install per
+  https://docs.start9.com/packaging/0.4.0.x/environment-setup.html
+- `jq`, `make`, `s3cmd` (s3cmd only if you also plan to `make publish`)
+
+Recommended one-time setup:
+
+```sh
+# Initialize the Start9 developer key (run once per build machine)
+start-cli init-key
+
+# Create ~/.startos/config.yaml so `make install` can sideload:
+cat > ~/.startos/config.yaml <<'YAML'
+# Replace with the hostname of your 0.4 beta node
+host: http://start9.local
+YAML
+```
+
+---
+
+## 2 — Build the x86_64 .s9pk
+
+From the repo root:
+
+```sh
+cd start9/0.4
+
+# One-time dependency install (pulls start-sdk + friends):
+npm ci
+
+# Clean build (produces ten-database_x86_64.s9pk):
+make clean
+make x86
+```
+
+Output:
+
+- `ten-database_x86_64.s9pk` in `start9/0.4/`
+- Build summary printed by s9pk.mk (title, version, arch, SDK version,
+  git hash)
+
+> Note: `make` by default builds x86, arm, and riscv. The `Makefile`
+> in this folder overrides `ARCHES := x86` so only x86_64 is produced.
+> If you later need arm64 too, switch to `ARCHES := x86 arm`.
+
+### If the build fails
+
+Common causes and fixes:
+
+- **`.git/HEAD` or `.git/index` missing** — s9pk.mk requires a real git
+  repo. It looks at `../../.git` relative to `start9/0.4/` (i.e. the repo
+  root). Make sure you're building inside the actual repo.
+- **`start-cli not found`** — install the Start9 SDK CLI.
+- **docker buildx error** — run `docker buildx create --use` once.
+- **Permission denied removing `javascript/` between builds** — macOS
+  extended attributes can make ncc output files immutable. From the repo
+  root, run `chmod -R u+w start9/0.4/javascript` and retry, or remove the
+  directory with `rm -rf start9/0.4/javascript` (or delete it in Finder).
+
+---
+
+## 3 — Sideload onto the StartOS 0.4 beta node
+
+Two options:
+
+### Option 1 — `make install` (uses ~/.startos/config.yaml)
+
+```sh
+cd start9/0.4
+make install
+```
+
+This runs `start-cli package install -s ten-database_x86_64.s9pk` against
+whatever host you set in `~/.startos/config.yaml`.
+
+### Option 2 — StartOS web UI
+
+1. Copy `ten-database_x86_64.s9pk` onto a machine that can reach the
+   StartOS 0.4 UI.
+2. In the UI: **System → Sideload Service → pick the .s9pk → Install.**
+3. After the install completes, open the service and click **Start**.
+
+### First-boot verification
+
+After upgrading the service:
+
+1. Open the Ten31 Database UI from the Interfaces page.
+2. Log in with your existing account. Accounts live in `/data/crm.db`, and
+   sessions stay valid because `/data/.crm-secret` is preserved.
+3. Spot-check a few rows in the fundraising grid against what you saw
+   before the upgrade.
+4. Run one manual backup (Settings → Admin → Run Backup) to confirm the
+   app's write path works.
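The checklist above is manual. As a complement, a scripted probe of the same `/api/health` endpoint that `healthcheck.sh` curls could look like the sketch below. The base URL is an assumption: use whatever address the StartOS Interfaces page shows for the service. `health_url` and `check_health` are hypothetical helper names.

```python
import http.client
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health endpoint URL from the service's base URL."""
    return base_url.rstrip("/") + "/api/health"


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Return True only if the health endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False
```

A passing probe only shows that the API is responding. The login, spot-check, and backup steps are still needed to confirm the data and the write path.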
+
+---
+
+## 4 — Rollback plan
+
+If a new sideload misbehaves:
+
+1. **Stop** the service in StartOS — do not Uninstall (that deletes the
+   `main` volume).
+2. Sideload the previous `.s9pk` (keep one around) and Start.
+3. Investigate by opening the service logs from the StartOS UI.
+
+For full disaster recovery, restore the `main` volume from a StartOS-level
+Backup.
+
+---
+
+## 5 — File map (what lives where)
+
+```
+start9/0.4/
+├── DEPLOY_040.md              # this file
+├── README.md                  # short overview
+├── Dockerfile                 # self-contained; refs only start9/0.4/ paths
+├── Makefile                   # thin override: ARCHES := x86
+├── s9pk.mk                    # shared 0.4 build plumbing (do not edit)
+├── package.json, -lock.json   # start-sdk + build tooling
+├── tsconfig.json
+├── docker_entrypoint.sh       # ensures /data dirs + JWT secret, starts server.py
+├── healthcheck.sh             # curl /api/health (diagnostics only)
+├── icon.svg                   # service icon
+├── LICENSE
+├── refresh_seed.sh            # (LEGACY) scp helper from 0.3.5; kept for reference
+├── assets/
+│   └── ABOUT.md               # user-facing install description
+├── seed/                      # (LEGACY) historical seed snapshot, NOT shipped
+│   ├── README.md
+│   └── data/                  # crm.db + backups from initial 0.3.5 → 0.4 cut
+└── startos/                   # SDK source (manifest, main, interfaces…)
+    ├── index.ts               # SDK entry (no edits normally needed)
+    ├── sdk.ts                 # typed SDK instance
+    ├── utils.ts               # shared constants
+    ├── i18n.ts                # simple passthrough
+    ├── manifest/
+    │   ├── index.ts           # id, title, images, arches, volumes, alerts
+    │   └── i18n.ts            # localized short/long description
+    ├── versions/
+    │   ├── index.ts           # versionGraph wiring
+    │   ├── v0.1.0.39.ts       # first 0.4 release (with seed)
+    │   └── v0.1.0.40.ts       # current release (seed removed)
+    ├── init/index.ts          # setupInit ordering
+    ├── main.ts                # daemon + health check
+    ├── interfaces.ts          # HTTP interface on port 8080
+    ├── backups.ts             # Backups.ofVolumes('main')
+    ├── dependencies.ts        # (none)
+    └── actions/index.ts       # (none)
+```
+
+The `seed/` directory and `refresh_seed.sh` are no longer referenced by the
+build and can be deleted from the repo at any time. They are kept on disk
+purely as a historical snapshot of the data that was migrated off the 0.3.5
+host on first cutover.
+
+---
+
+## 6 — Things to remember
+
+- Package id stays `ten-database` across both 0.3.5 and 0.4 so there is
+  exactly one service to manage on each host.
+- The service volume id is `main` on both sides and mounts at `/data`
+  inside the container. This is what makes data preservation trivial.
+- The 0.4 release is x86_64 only. If you later deploy to aarch64, change
+  `ARCHES` in the Makefile and rebuild.
+- The built `.s9pk` is not committed — treat it as a build artifact.
+  `.gitignore` already ignores `*.s9pk` and `javascript/`.
+- If you change anything under `startos/`, run `npm run check` (tsc) and
+  `npm run build` (ncc) before re-packaging.
+
+---
+
+## 7 — Quick cheat sheet
+
+```sh
+cd start9/0.4
+make clean
+make x86
+make install
+```
+
+After `make install` completes, open the service in the StartOS UI,
+hit Start, and verify the app still works.
@@ -0,0 +1,52 @@
+# ─────────────────────────────────────────────────────────────────
+#  Ten31 Database — StartOS 0.4 container image
+# ─────────────────────────────────────────────────────────────────
+#  Build context (from the startos manifest dockerBuild.workdir)
+#  is the repository root (two levels up from start9/0.4/), so all
+#  COPY paths below are relative to the repo root.
+#
+#  This image is intentionally self-contained under start9/0.4/:
+#  no files are pulled from start9/0.3.5/ so the two packages can
+#  evolve independently.
+#
+#  As of 0.1.0:40 the image NO LONGER ships a seed snapshot. The
+#  initial migration from 0.3.5 has been completed; from this
+#  release forward the live /data volume on the StartOS host is
+#  the sole source of truth and is preserved across sideloads.
+# ─────────────────────────────────────────────────────────────────
+FROM python:3.11-slim
+
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    CRM_ENV=production \
+    CRM_HOST=0.0.0.0 \
+    CRM_PORT=8080 \
+    CRM_DATA_DIR=/data \
+    CRM_FRONTEND_DIR=/app/frontend
+
+WORKDIR /app
+
+RUN apt-get update \
+  && apt-get install -y --no-install-recommends ca-certificates curl \
+  && rm -rf /var/lib/apt/lists/*
+
+# ── Python dependencies ─────────────────────────────────────────
+# Only one hard dep for now: `cryptography` is required by the Gmail
+# integration's RS256 JWT signing (DWD bearer tokens). Everything else
+# server.py needs is stdlib.
+RUN pip install --no-cache-dir cryptography==42.0.5
+
+# ── Application source ──────────────────────────────────────────
+COPY backend/server.py        /app/backend/server.py
+COPY backend/email_integration /app/backend/email_integration
+COPY frontend                 /app/frontend
+
+# ── StartOS wrapper scripts ─────────────────────────────────────
+COPY start9/0.4/docker_entrypoint.sh /usr/local/bin/docker_entrypoint.sh
+COPY start9/0.4/healthcheck.sh       /usr/local/bin/healthcheck.sh
+
+RUN chmod +x /usr/local/bin/docker_entrypoint.sh \
+             /usr/local/bin/healthcheck.sh
+
+EXPOSE 8080
+ENTRYPOINT ["/usr/local/bin/docker_entrypoint.sh"]
@@ -0,0 +1,4 @@
+# overrides to s9pk.mk must precede the include statement
+ARCHES := x86
+
+include s9pk.mk
@@ -1,9 +1,52 @@
-# Start9 Wrapper (0.4 placeholder)
+# Ten31 Database — StartOS 0.4 wrapper (x86_64)

-This directory is reserved for the StartOS 0.4 package wrapper.
+This directory is the self-contained StartOS 0.4 service package for
+Ten31 Database. It is the x86_64 successor to the 0.3.5 (aarch64)
+wrapper in `../0.3.5/`. Both packages share the same package id
+(`ten-database`) and the same `/data` volume layout so data can be
+preserved across the migration.

-Migration plan from 0.3.5:
-1. Keep package id stable (`ten-database`) if StartOS migration path allows.
-2. Keep mounted data directory contract unchanged (`/data/crm.db`, `/data/backups`).
-3. Rebuild wrapper files against 0.4 packaging spec and verify with current start-sdk.
-4. Test upgrade on a staging node using production backup restore before live cutover.
+## Start here
+
+**Read `DEPLOY_040.md` first.** It covers:
+
+1. How data preservation works: the live `/data` volume persists across
+   sideloads, and no seed snapshot is shipped as of `0.1.0:40`.
+2. How to install the build prerequisites (Node, Docker, `start-cli`).
+3. How to build the x86_64 `.s9pk`.
+4. How to sideload onto the StartOS 0.4 beta node.
+5. A rollback plan and a post-install verification checklist.
+6. A file map and a short list of things to remember.
+
+## Quick cheat sheet
+
+```sh
+# From this directory:
+npm ci                                    # one-time dependency install
+make clean
+make x86
+make install                              # uses ~/.startos/config.yaml
+```
+
+## Data layout (unchanged from 0.3.5)
+
+Inside the container:
+
+- `/data/crm.db`         — SQLite database
+- `/data/backups/`       — app-level JSON exports
+- `/data/.crm-secret`    — JWT signing key (created on first boot if absent)
+
+As of `0.1.0:40` the entrypoint no longer seeds the volume from a baked-in
+snapshot. It only creates missing directories and the JWT secret, so
+existing volumes are never overwritten.
+
+## Status
+
+- Source scaffold: complete and `tsc --noEmit` clean against
+  `@start9labs/start-sdk` 0.4.0.
+- Dockerfile: self-contained under `start9/0.4/` with no cross-folder
+  references to `start9/0.3.5/`.
+- Seed snapshot: `seed/data/` is legacy and is no longer shipped in the
+  image (see `DEPLOY_040.md`).
+- Not yet built into a `.s9pk` here; build on a machine with Docker +
+  `start-cli` per `DEPLOY_040.md`.
@@ -0,0 +1,11 @@
+Ten31 Database is a self-hosted investor CRM and fundraising database.
+
+This StartOS 0.4 package is the x86_64 successor to the 0.3.5 (aarch64) wrapper. It preserves the original runtime data layout inside the service volume:
+
+- `/data/crm.db` — SQLite database (investors, contacts, fundraising grid, views, users, backups, feature requests, app settings)
+- `/data/backups/` — app-level JSON snapshot exports
+- `/data/.crm-secret` — JWT signing key (generated on first boot if absent)
+
+As of `0.1.0:40` the package no longer ships a seed snapshot. The live `main` volume on the StartOS host is the sole source of truth and is preserved across sideloads, so it is safe to reinstall or restore from a StartOS 0.4 backup without losing data.
+
+The wrapper's only differences from upstream are StartOS container wiring, the private web interface on internal port 8080, and backup integration (the whole `main` volume is included in StartOS backups).
@@ -0,0 +1,61 @@
+#!/bin/sh
+# ═══════════════════════════════════════════════════════════════
+#  Ten31 Database container entrypoint (StartOS 0.4 wrapper)
+# ═══════════════════════════════════════════════════════════════
+#
+#  Responsibilities:
+#    1. Ensure the mounted /data volume directories exist.
+#    2. Ensure a persistent CRM_SECRET_KEY exists so issued JWTs
+#       survive container restarts.
+#    3. Launch the Python backend server.
+#
+#  Note: This entrypoint NO LONGER seeds /data from a baked-in
+#  snapshot. The 0.3.5 → 0.4 migration is complete; from 0.1.0:40
+#  forward the live /data volume on the StartOS host is the sole
+#  source of truth. StartOS preserves /data across sideloads, so
+#  upgrades will not disturb live data.
+# ═══════════════════════════════════════════════════════════════
+
+set -eu
+
+DATA_DIR="${CRM_DATA_DIR:-/data}"
+SECRET_FILE="$DATA_DIR/.crm-secret"
+SECRETS_DIR="$DATA_DIR/secrets"
+EMAIL_ATTACHMENTS_DIR="$DATA_DIR/email_attachments"
+GMAIL_SA_KEY="$SECRETS_DIR/gmail-service-account.json"
+
+mkdir -p "$DATA_DIR" "$DATA_DIR/backups" "$SECRETS_DIR" "$EMAIL_ATTACHMENTS_DIR"
+# /data/secrets holds the Gmail service-account key; lock it down so only
+# the container user can read the directory. chmod on the file itself is
+# the operator's responsibility when they drop the key in.
+chmod 700 "$SECRETS_DIR" 2>/dev/null || true
+
+# ── Persistent JWT secret ───────────────────────────────────────
+if [ -z "${CRM_SECRET_KEY:-}" ]; then
+  if [ -f "$SECRET_FILE" ]; then
+    CRM_SECRET_KEY="$(cat "$SECRET_FILE")"
+  else
+    CRM_SECRET_KEY="$(head -c 48 /dev/urandom | base64 | tr -d '\n' | tr '/+' 'ab')"
+    printf '%s' "$CRM_SECRET_KEY" > "$SECRET_FILE"
+    chmod 600 "$SECRET_FILE"
+  fi
+  export CRM_SECRET_KEY
+fi
+
+# ── Gmail integration env vars ──────────────────────────────────
+# The integration is enabled only if the service-account key file is
+# actually present on the /data volume. This makes the package
+# self-disabling on fresh installs until an operator drops the key in.
+if [ -f "$GMAIL_SA_KEY" ]; then
+  export CRM_GMAIL_INTEGRATION_ENABLED="${CRM_GMAIL_INTEGRATION_ENABLED:-true}"
+  export CRM_GMAIL_AUTH_METHOD="${CRM_GMAIL_AUTH_METHOD:-dwd}"
+  export CRM_GMAIL_SA_KEY_PATH="${CRM_GMAIL_SA_KEY_PATH:-$GMAIL_SA_KEY}"
+  export CRM_GMAIL_WORKSPACE_DOMAIN="${CRM_GMAIL_WORKSPACE_DOMAIN:-ten31.xyz}"
+  export CRM_GMAIL_SYNC_INTERVAL_MIN="${CRM_GMAIL_SYNC_INTERVAL_MIN:-180}"
+  echo "[entrypoint] Gmail integration: ENABLED (key at $GMAIL_SA_KEY)"
+else
+  echo "[entrypoint] Gmail integration: DISABLED (no key at $GMAIL_SA_KEY)"
+fi
+
+# ── Launch the app ──────────────────────────────────────────────
+exec python3 /app/backend/server.py
@@ -0,0 +1,9 @@
+#!/bin/sh
+# Container-side health probe for the Ten31 Database service.
+# The StartOS 0.4 daemon uses checkPortListening at the platform
+# level, but this script is kept for parity with the 0.3.5 wrapper
+# and so the same image can be exec'd directly for diagnostics.
+set -eu
+
+PORT="${CRM_PORT:-8080}"
+curl -fsS "http://127.0.0.1:${PORT}/api/health" >/dev/null
@@ -0,0 +1,43 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<svg id="Layer_1" data-name="Layer 1" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 722.69 280.85">
+  <defs>
+    <style>
+      .cls-1 {
+        font-family: LTCGoudyOldstylePro-Bold, 'LTC Goudy Oldstyle Pro';
+        font-size: 192px;
+        font-weight: 700;
+      }
+
+      .cls-1, .cls-2, .cls-3 {
+        fill: #fff;
+      }
+
+      .cls-2, .cls-4 {
+        stroke-width: 3px;
+      }
+
+      .cls-2, .cls-4, .cls-3 {
+        stroke: #fff;
+        stroke-miterlimit: 10;
+      }
+
+      .cls-4 {
+        fill: none;
+      }
+
+      .cls-5 {
+        letter-spacing: -.06em;
+      }
+    </style>
+  </defs>
+  <text class="cls-1" transform="translate(120.54 208.45)"><tspan class="cls-5" x="0" y="0">T</tspan><tspan x="120.96" y="0">en31</tspan></text>
+  <g>
+    <polygon class="cls-3" points="95.52 140.42 54.54 154.4 54.54 126.45 95.52 140.42"/>
+    <line class="cls-2" x1="0" y1="140.42" x2="60.54" y2="140.42"/>
+  </g>
+  <rect class="cls-4" x="97.1" y="1.5" width="527.95" height="277.85"/>
+  <g>
+    <polygon class="cls-3" points="721.15 140.42 680.16 154.4 680.16 126.45 721.15 140.42"/>
+    <line class="cls-2" x1="625.62" y1="140.42" x2="686.16" y2="140.42"/>
+  </g>
+</svg>
@@ -0,0 +1,330 @@
+{
+  "name": "ten-database-startos-040",
+  "lockfileVersion": 3,
+  "requires": true,
+  "packages": {
+    "": {
+      "name": "ten-database-startos-040",
+      "dependencies": {
+        "@start9labs/start-sdk": "^0.4.0-beta.66"
+      },
+      "devDependencies": {
+        "@types/node": "^22.19.0",
+        "@vercel/ncc": "^0.38.4",
+        "prettier": "^3.6.2",
+        "typescript": "^5.9.3"
+      }
+    },
+    "node_modules/@iarna/toml": {
+      "version": "3.0.0",
+      "resolved": "https://registry.npmjs.org/@iarna/toml/-/toml-3.0.0.tgz",
+      "integrity": "sha512-td6ZUkz2oS3VeleBcN+m//Q6HlCFCPrnI0FZhrt/h4XqLEdOyYp2u21nd8MdsR+WJy5r9PTDaHTDDfhf4H4l6Q==",
+      "license": "ISC"
+    },
+    "node_modules/@noble/curves": {
+      "version": "1.9.7",
+      "resolved": "https://registry.npmjs.org/@noble/curves/-/curves-1.9.7.tgz",
+      "integrity": "sha512-gbKGcRUYIjA3/zCCNaWDciTMFI0dCkvou3TL8Zmy5Nc7sJ47a0jtOeZoTaMxkuqRo9cRhjOdZJXegxYE5FN/xw==",
+      "license": "MIT",
+      "dependencies": {
+        "@noble/hashes": "1.8.0"
+      },
+      "engines": {
+        "node": "^14.21.3 || >=16"
+      },
+      "funding": {
+        "url": "https://paulmillr.com/funding/"
+      }
+    },
+    "node_modules/@noble/hashes": {
+      "version": "1.8.0",
+      "resolved": "https://registry.npmjs.org/@noble/hashes/-/hashes-1.8.0.tgz",
+      "integrity": "sha512-jCs9ldd7NwzpgXDIf6P3+NrHh9/sD6CQdxHyjQI+h/6rDNo88ypBxxz45UDuZHz9r3tNz7N/VInSVoVdtXEI4A==",
+      "license": "MIT",
+      "engines": {
+        "node": "^14.21.3 || >=16"
+      },
+      "funding": {
+        "url": "https://paulmillr.com/funding/"
+      }
+    },
+    "node_modules/@start9labs/start-sdk": {
+      "version": "0.4.0",
+      "resolved": "https://registry.npmjs.org/@start9labs/start-sdk/-/start-sdk-0.4.0.tgz",
+      "integrity": "sha512-PFfO7tV9nzQFZL3KXaZyf16C5VZtM+dCDlRhLHpmwssTKtcjyCEhBrB9locuS2yFqu69rj+5kLFzCWDHeRRibg==",
+      "license": "MIT",
+      "dependencies": {
+        "@iarna/toml": "^3.0.0",
+        "@noble/curves": "^1.8.2",
+        "@noble/hashes": "^1.7.2",
+        "@types/ini": "^4.1.1",
+        "deep-equality-data-structures": "^2.0.0",
+        "fast-xml-parser": "^5.5.6",
+        "ini": "^5.0.0",
+        "isomorphic-fetch": "^3.0.0",
+        "mime": "^4.0.7",
+        "yaml": "^2.7.1",
+        "zod": "^4.3.6",
+        "zod-deep-partial": "^1.2.0"
+      }
+    },
+    "node_modules/@types/ini": {
+      "version": "4.1.1",
+      "resolved": "https://registry.npmjs.org/@types/ini/-/ini-4.1.1.tgz",
+      "integrity": "sha512-MIyNUZipBTbyUNnhvuXJTY7B6qNI78meck9Jbv3wk0OgNwRyOOVEKDutAkOs1snB/tx0FafyR6/SN4Ps0hZPeg==",
+      "license": "MIT"
+    },
+    "node_modules/@types/node": {
+      "version": "22.19.17",
+      "resolved": "https://registry.npmjs.org/@types/node/-/node-22.19.17.tgz",
+      "integrity": "sha512-wGdMcf+vPYM6jikpS/qhg6WiqSV/OhG+jeeHT/KlVqxYfD40iYJf9/AE1uQxVWFvU7MipKRkRv8NSHiCGgPr8Q==",
+      "dev": true,
+      "license": "MIT",
+      "dependencies": {
+        "undici-types": "~6.21.0"
+      }
+    },
+    "node_modules/@vercel/ncc": {
+      "version": "0.38.4",
+      "resolved": "https://registry.npmjs.org/@vercel/ncc/-/ncc-0.38.4.tgz",
+      "integrity": "sha512-8LwjnlP39s08C08J5NstzriPvW1SP8Zfpp1BvC2sI35kPeZnHfxVkCwu4/+Wodgnd60UtT1n8K8zw+Mp7J9JmQ==",
+      "dev": true,
+      "license": "MIT",
+      "bin": {
+        "ncc": "dist/ncc/cli.js"
+      }
+    },
+    "node_modules/deep-equality-data-structures": {
+      "version": "2.0.0",
+      "resolved": "https://registry.npmjs.org/deep-equality-data-structures/-/deep-equality-data-structures-2.0.0.tgz",
+      "integrity": "sha512-qgrUr7MKXq7VRN+WUpQ48QlXVGL0KdibAoTX8KRg18lgOgqbEKMAW1WZsVCtakY4+XX42pbAJzTz/DlXEFM2Fg==",
+      "license": "MIT",
+      "dependencies": {
+        "object-hash": "^3.0.0"
+      }
+    },
+    "node_modules/fast-xml-builder": {
+      "version": "1.1.4",
+      "resolved": "https://registry.npmjs.org/fast-xml-builder/-/fast-xml-builder-1.1.4.tgz",
+      "integrity": "sha512-f2jhpN4Eccy0/Uz9csxh3Nu6q4ErKxf0XIsasomfOihuSUa3/xw6w8dnOtCDgEItQFJG8KyXPzQXzcODDrrbOg==",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/NaturalIntelligence"
+        }
+      ],
+      "license": "MIT",
+      "dependencies": {
+        "path-expression-matcher": "^1.1.3"
+      }
+    },
+    "node_modules/fast-xml-parser": {
+      "version": "5.5.12",
+      "resolved": "https://registry.npmjs.org/fast-xml-parser/-/fast-xml-parser-5.5.12.tgz",
+      "integrity": "sha512-nUR0q8PPfoA/svPM43Gup7vLOZWppaNrYgGmrVqrAVJa7cOH4hMG6FX9M4mQ8dZA1/ObGZHzES7Ed88hxEBSJg==",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/NaturalIntelligence"
+        }
+      ],
+      "license": "MIT",
+      "dependencies": {
+        "fast-xml-builder": "^1.1.4",
+        "path-expression-matcher": "^1.5.0",
+        "strnum": "^2.2.3"
+      },
+      "bin": {
+        "fxparser": "src/cli/cli.js"
+      }
+    },
+    "node_modules/ini": {
+      "version": "5.0.0",
+      "resolved": "https://registry.npmjs.org/ini/-/ini-5.0.0.tgz",
+      "integrity": "sha512-+N0ngpO3e7cRUWOJAS7qw0IZIVc6XPrW4MlFBdD066F2L4k1L6ker3hLqSq7iXxU5tgS4WGkIUElWn5vogAEnw==",
+      "license": "ISC",
+      "engines": {
+        "node": "^18.17.0 || >=20.5.0"
+      }
+    },
+    "node_modules/isomorphic-fetch": {
+      "version": "3.0.0",
+      "resolved": "https://registry.npmjs.org/isomorphic-fetch/-/isomorphic-fetch-3.0.0.tgz",
+      "integrity": "sha512-qvUtwJ3j6qwsF3jLxkZ72qCgjMysPzDfeV240JHiGZsANBYd+EEuu35v7dfrJ9Up0Ak07D7GGSkGhCHTqg/5wA==",
+      "license": "MIT",
+      "dependencies": {
+        "node-fetch": "^2.6.1",
+        "whatwg-fetch": "^3.4.1"
+      }
+    },
+    "node_modules/mime": {
+      "version": "4.1.0",
+      "resolved": "https://registry.npmjs.org/mime/-/mime-4.1.0.tgz",
+      "integrity": "sha512-X5ju04+cAzsojXKes0B/S4tcYtFAJ6tTMuSPBEn9CPGlrWr8Fiw7qYeLT0XyH80HSoAoqWCaz+MWKh22P7G1cw==",
+      "funding": [
+        "https://github.com/sponsors/broofa"
+      ],
+      "license": "MIT",
+      "bin": {
+        "mime": "bin/cli.js"
+      },
+      "engines": {
+        "node": ">=16"
+      }
+    },
+    "node_modules/node-fetch": {
+      "version": "2.7.0",
+      "resolved": "https://registry.npmjs.org/node-fetch/-/node-fetch-2.7.0.tgz",
+      "integrity": "sha512-c4FRfUm/dbcWZ7U+1Wq0AwCyFL+3nt2bEw05wfxSz+DWpWsitgmSgYmy2dQdWyKC1694ELPqMs/YzUSNozLt8A==",
+      "license": "MIT",
+      "dependencies": {
+        "whatwg-url": "^5.0.0"
+      },
+      "engines": {
+        "node": "4.x || >=6.0.0"
+      },
+      "peerDependencies": {
+        "encoding": "^0.1.0"
+      },
+      "peerDependenciesMeta": {
+        "encoding": {
+          "optional": true
+        }
+      }
+    },
+    "node_modules/object-hash": {
+      "version": "3.0.0",
+      "resolved": "https://registry.npmjs.org/object-hash/-/object-hash-3.0.0.tgz",
+      "integrity": "sha512-RSn9F68PjH9HqtltsSnqYC1XXoWe9Bju5+213R98cNGttag9q9yAOTzdbsqvIa7aNm5WffBZFpWYr2aWrklWAw==",
+      "license": "MIT",
+      "engines": {
+        "node": ">= 6"
+      }
+    },
+    "node_modules/path-expression-matcher": {
+      "version": "1.5.0",
+      "resolved": "https://registry.npmjs.org/path-expression-matcher/-/path-expression-matcher-1.5.0.tgz",
+      "integrity": "sha512-cbrerZV+6rvdQrrD+iGMcZFEiiSrbv9Tfdkvnusy6y0x0GKBXREFg/Y65GhIfm0tnLntThhzCnfKwp1WRjeCyQ==",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/NaturalIntelligence"
+        }
+      ],
+      "license": "MIT",
+      "engines": {
+        "node": ">=14.0.0"
+      }
+    },
+    "node_modules/prettier": {
+      "version": "3.8.2",
+      "resolved": "https://registry.npmjs.org/prettier/-/prettier-3.8.2.tgz",
+      "integrity": "sha512-8c3mgTe0ASwWAJK+78dpviD+A8EqhndQPUBpNUIPt6+xWlIigCwfN01lWr9MAede4uqXGTEKeQWTvzb3vjia0Q==",
+      "dev": true,
+      "license": "MIT",
+      "bin": {
+        "prettier": "bin/prettier.cjs"
+      },
+      "engines": {
+        "node": ">=14"
+      },
+      "funding": {
+        "url": "https://github.com/prettier/prettier?sponsor=1"
+      }
+    },
+    "node_modules/strnum": {
+      "version": "2.2.3",
+      "resolved": "https://registry.npmjs.org/strnum/-/strnum-2.2.3.tgz",
+      "integrity": "sha512-oKx6RUCuHfT3oyVjtnrmn19H1SiCqgJSg+54XqURKp5aCMbrXrhLjRN9TjuwMjiYstZ0MzDrHqkGZ5dFTKd+zg==",
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/NaturalIntelligence"
+        }
+      ],
+      "license": "MIT"
+    },
+    "node_modules/tr46": {
+      "version": "0.0.3",
+      "resolved": "https://registry.npmjs.org/tr46/-/tr46-0.0.3.tgz",
+      "integrity": "sha512-N3WMsuqV66lT30CrXNbEjx4GEwlow3v6rr4mCcv6prnfwhS01rkgyFdjPNBYd9br7LpXV1+Emh01fHnq2Gdgrw==",
+      "license": "MIT"
+    },
+    "node_modules/typescript": {
+      "version": "5.9.3",
+      "resolved": "https://registry.npmjs.org/typescript/-/typescript-5.9.3.tgz",
+      "integrity": "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw==",
+      "dev": true,
+      "license": "Apache-2.0",
+      "bin": {
+        "tsc": "bin/tsc",
+        "tsserver": "bin/tsserver"
+      },
+      "engines": {
+        "node": ">=14.17"
+      }
+    },
+    "node_modules/undici-types": {
+      "version": "6.21.0",
+      "resolved": "https://registry.npmjs.org/undici-types/-/undici-types-6.21.0.tgz",
+      "integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ==",
+      "dev": true,
+      "license": "MIT"
+    },
+    "node_modules/webidl-conversions": {
+      "version": "3.0.1",
+      "resolved": "https://registry.npmjs.org/webidl-conversions/-/webidl-conversions-3.0.1.tgz",
+      "integrity": "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ==",
+      "license": "BSD-2-Clause"
+    },
+    "node_modules/whatwg-fetch": {
+      "version": "3.6.20",
+      "resolved": "https://registry.npmjs.org/whatwg-fetch/-/whatwg-fetch-3.6.20.tgz",
+      "integrity": "sha512-EqhiFU6daOA8kpjOWTL0olhVOF3i7OrFzSYiGsEMB8GcXS+RrzauAERX65xMeNWVqxA6HXH2m69Z9LaKKdisfg==",
+      "license": "MIT"
+    },
+    "node_modules/whatwg-url": {
+      "version": "5.0.0",
+      "resolved": "https://registry.npmjs.org/whatwg-url/-/whatwg-url-5.0.0.tgz",
+      "integrity": "sha512-saE57nupxk6v3HY35+jzBwYa0rKSy0XR8JSxZPwgLr7ys0IBzhGviA1/TUGJLmSVqs8pb9AnvICXEuOHLprYTw==",
+      "license": "MIT",
+      "dependencies": {
+        "tr46": "~0.0.3",
+        "webidl-conversions": "^3.0.0"
+      }
+    },
+    "node_modules/yaml": {
+      "version": "2.8.3",
+      "resolved": "https://registry.npmjs.org/yaml/-/yaml-2.8.3.tgz",
+      "integrity": "sha512-AvbaCLOO2Otw/lW5bmh9d/WEdcDFdQp2Z2ZUH3pX9U2ihyUY0nvLv7J6TrWowklRGPYbB/IuIMfYgxaCPg5Bpg==",
+      "license": "ISC",
+      "bin": {
+        "yaml": "bin.mjs"
+      },
+      "engines": {
+        "node": ">= 14.6"
+      },
+      "funding": {
+        "url": "https://github.com/sponsors/eemeli"
+      }
+    },
+    "node_modules/zod": {
+      "version": "4.3.6",
+      "resolved": "https://registry.npmjs.org/zod/-/zod-4.3.6.tgz",
+      "integrity": "sha512-rftlrkhHZOcjDwkGlnUtZZkvaPHCsDATp4pGpuOOMDaTdDDXF91wuVDJoWoPsKX/3YPQ5fHuF3STjcYyKr+Qhg==",
+      "license": "MIT",
+      "funding": {
+        "url": "https://github.com/sponsors/colinhacks"
+      }
+    },
+    "node_modules/zod-deep-partial": {
+      "version": "1.4.4",
+      "resolved": "https://registry.npmjs.org/zod-deep-partial/-/zod-deep-partial-1.4.4.tgz",
+      "integrity": "sha512-aWkPl7hVStgE01WzbbSxCgX4O+sSpgt8JOjvFUtMTF75VgL6MhWQbiZi+AWGN85SfSTtI9gsOtL1vInoqfDVaA==",
+      "license": "MIT",
+      "peerDependencies": {
+        "zod": "^4.1.13"
+      }
+    }
+  }
+}
@@ -0,0 +1,23 @@
+{
+  "name": "ten-database-startos-040",
+  "scripts": {
+    "build": "rm -rf ./javascript && ncc build startos/index.ts -o ./javascript",
+    "prettier": "prettier --write startos",
+    "check": "tsc --noEmit"
+  },
+  "dependencies": {
+    "@start9labs/start-sdk": "^0.4.0-beta.66"
+  },
+  "devDependencies": {
+    "@types/node": "^22.19.0",
+    "@vercel/ncc": "^0.38.4",
+    "prettier": "^3.6.2",
+    "typescript": "^5.9.3"
+  },
+  "prettier": {
+    "trailingComma": "all",
+    "tabWidth": 2,
+    "semi": false,
+    "singleQuote": true
+  }
+}
@@ -0,0 +1,116 @@
+#!/bin/bash
+# ═══════════════════════════════════════════════════════════════
+#  refresh_seed.sh
+#  Pull the live Ten31 Database data off a StartOS 0.3.5 host
+#  and stage it as the seed snapshot baked into the 0.4 image.
+# ═══════════════════════════════════════════════════════════════
+#
+#  Usage:
+#    ./refresh_seed.sh <ssh-user@host> [remote-data-dir]
+#
+#  Examples:
+#    ./refresh_seed.sh start9@192.168.1.50
+#    ./refresh_seed.sh embassy@embassy.local \
+#        /embassy-data/package-data/volumes/ten-database/data/main
+#
+#  What it does:
+#    1. Finds the remote /data directory for the ten-database service.
+#    2. Copies crm.db, backups/, and (optionally) .crm-secret into
+#       start9/0.4/seed/data/ on this machine.
+#    3. Prints a row-count summary so you can verify content.
+#
+#  After it finishes, run:
+#      make clean && make x86
+#  from this (start9/0.4/) directory to rebuild the .s9pk.
+# ═══════════════════════════════════════════════════════════════
+
+set -eu
+
+if [ $# -lt 1 ]; then
+  echo "Usage: $0 <ssh-user@host> [remote-data-dir]"
+  echo ""
+  echo "Remote data dir defaults (tried in order):"
+  echo "  /embassy-data/package-data/volumes/ten-database/data/main"
+  echo "  /mnt/embassy-os/package-data/volumes/ten-database/data/main"
+  echo "  /var/lib/embassy/services/ten-database/data"
+  exit 1
+fi
+
+REMOTE="$1"
+REMOTE_DIR="${2:-}"
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+SEED_DIR="$SCRIPT_DIR/seed/data"
+
+echo ""
+echo "  Staging production seed from $REMOTE"
+echo "  into $SEED_DIR"
+echo ""
+
+# Auto-detect remote data dir if not supplied
+if [ -z "$REMOTE_DIR" ]; then
+  echo "  Probing for remote data directory..."
+  for candidate in \
+    "/embassy-data/package-data/volumes/ten-database/data/main" \
+    "/mnt/embassy-os/package-data/volumes/ten-database/data/main" \
+    "/var/lib/embassy/services/ten-database/data"; do
+    if ssh "$REMOTE" "[ -f \"$candidate/crm.db\" ]" 2>/dev/null; then
+      REMOTE_DIR="$candidate"
+      echo "    found: $REMOTE_DIR"
+      break
+    fi
+  done
+  if [ -z "$REMOTE_DIR" ]; then
+    echo "  Could not auto-detect a valid data directory with crm.db on $REMOTE."
+    echo "  Re-run this script and pass the path explicitly as the 2nd argument."
+    exit 2
+  fi
+fi
+
+mkdir -p "$SEED_DIR/backups"
+
+echo ""
+echo "  Copying crm.db ..."
+scp "$REMOTE:$REMOTE_DIR/crm.db" "$SEED_DIR/crm.db"
+
+echo "  Copying backups/ (if present) ..."
+if ssh "$REMOTE" "[ -d \"$REMOTE_DIR/backups\" ]" 2>/dev/null; then
+  scp -r "$REMOTE:$REMOTE_DIR/backups/." "$SEED_DIR/backups/" || true
+else
+  echo "    (none found, skipping)"
+fi
+
+echo "  Copying .crm-secret (optional — keeps existing JWTs valid) ..."
+if ssh "$REMOTE" "[ -f \"$REMOTE_DIR/.crm-secret\" ]" 2>/dev/null; then
+  read -r -p "  Include .crm-secret in the baked image? [y/N] " ans
+  case "$ans" in
+    [yY]*) scp "$REMOTE:$REMOTE_DIR/.crm-secret" "$SEED_DIR/.crm-secret" ;;
+    *) echo "    skipping .crm-secret; a fresh secret will be generated on first boot" ;;
+  esac
+else
+  echo "    (no .crm-secret on remote)"
+fi
+
+echo ""
+echo "  Summary of staged seed:"
+ls -la "$SEED_DIR"
+echo ""
+
+if command -v python3 >/dev/null 2>&1 && [ -f "$SEED_DIR/crm.db" ]; then
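+  # Pass the path as an argument and quote the heredoc delimiter so the shell
+  # never interpolates into the Python source.
+  python3 - "$SEED_DIR/crm.db" <<'PY'
+import sqlite3
+import sys
+db = sqlite3.connect(sys.argv[1])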
+cur = db.cursor()
+cur.execute("PRAGMA integrity_check")
+print("  integrity_check:", cur.fetchone()[0])
+for t in ("users","fundraising_state","fundraising_funds","fundraising_views",
+          "contacts","organizations","audit_log","feature_requests","app_settings"):
+    try:
+        cur.execute(f"SELECT COUNT(*) FROM {t}")
+        print(f"  {t:30s} {cur.fetchone()[0]} rows")
+    except Exception as e:
+        print(f"  {t}: n/a ({e})")
+PY
+fi
+
+echo ""
+echo "  Seed refreshed. Next: cd \"$SCRIPT_DIR\" && make clean && make x86"
@@ -0,0 +1,130 @@
+# ** Plumbing. DO NOT EDIT **.
+# This file is imported by ./Makefile. Make edits there.
+
+PACKAGE_ID := $(shell awk -F"'" '/id:/ {print $$2}' startos/manifest/index.ts)
+INGREDIENTS := $(shell start-cli s9pk list-ingredients 2>/dev/null)
+REPO_GIT_DIR := ../../.git
+
+ARCHES ?= x86 arm riscv
+TARGETS ?= arches
+
+ifdef VARIANT
+BASE_NAME := $(PACKAGE_ID)_$(VARIANT)
+else
+BASE_NAME := $(PACKAGE_ID)
+endif
+
+.PHONY: all arches universal aarch64 x86_64 riscv64 arm arm64 x86 riscv arch/* clean install publish check-deps check-init package ingredients
+.DELETE_ON_ERROR:
+.SECONDARY:
+
+define SUMMARY
+	@manifest=$$(start-cli s9pk inspect $(1) manifest); \
+	size=$$(du -h $(1) | awk '{print $$1}'); \
+	title=$$(printf '%s' "$$manifest" | jq -r .title); \
+	version=$$(printf '%s' "$$manifest" | jq -r .version); \
+	arches=$$(printf '%s' "$$manifest" | jq -r '[.images[].arch // []] | flatten | unique | join(", ")'); \
+	sdkv=$$(printf '%s' "$$manifest" | jq -r .sdkVersion); \
+	gitHash=$$(printf '%s' "$$manifest" | jq -r .gitHash | sed -E 's/(.*-modified)$$/\x1b[0;31m\1\x1b[0m/'); \
+	printf "\n"; \
+	printf "\033[1;32m✅ Build Complete!\033[0m\n"; \
+	printf "\n"; \
+	printf "\033[1;37m $$title\033[0m \033[36mv$$version\033[0m\n"; \
+	printf "───────────────────────────────\n"; \
+	printf " \033[1;36mFilename:\033[0m %s\n" "$(1)"; \
+	printf " \033[1;36mSize:\033[0m %s\n" "$$size"; \
+	printf " \033[1;36mArch:\033[0m %s\n" "$$arches"; \
+	printf " \033[1;36mSDK:\033[0m %s\n" "$$sdkv"; \
+	printf " \033[1;36mGit:\033[0m %s\n" "$$gitHash"; \
+	echo ""
+endef
+
+all: $(TARGETS)
+
+arches: $(ARCHES)
+
+universal: $(BASE_NAME).s9pk
+	$(call SUMMARY,$<)
+
+arch/%: $(BASE_NAME)_%.s9pk
+	$(call SUMMARY,$<)
+
+x86 x86_64: arch/x86_64
+arm arm64 aarch64: arch/aarch64
+riscv riscv64: arch/riscv64
+
+$(BASE_NAME).s9pk: $(INGREDIENTS) $(REPO_GIT_DIR)/HEAD $(REPO_GIT_DIR)/index
+	@$(MAKE) --no-print-directory ingredients
+	@echo " Packing '$@'..."
+	start-cli s9pk pack -o $@
+
+$(BASE_NAME)_%.s9pk: $(INGREDIENTS) $(REPO_GIT_DIR)/HEAD $(REPO_GIT_DIR)/index
+	@$(MAKE) --no-print-directory ingredients
+	@echo " Packing '$@'..."
+	start-cli s9pk pack --arch=$* -o $@
+
+ingredients: $(INGREDIENTS)
+	@echo " Re-evaluating ingredients..."
+
+install: | check-deps check-init
+	@HOST=$$(awk -F'/' '/^host:/ {print $$3}' ~/.startos/config.yaml); \
+	if [ -z "$$HOST" ]; then \
+		echo "Error: You must define \"host: http://server-name.local\" in ~/.startos/config.yaml"; \
+		exit 1; \
+	fi; \
+	S9PK=$$(ls -t *.s9pk 2>/dev/null | head -1); \
+	if [ -z "$$S9PK" ]; then \
+		echo "Error: No .s9pk file found. Run 'make' first."; \
+		exit 1; \
+	fi; \
+	printf "\n Installing %s to %s ...\n" "$$S9PK" "$$HOST"; \
+	start-cli package install -s "$$S9PK"
+
+publish: | all
+	@REGISTRY=$$(awk -F'/' '/^registry:/ {print $$3}' ~/.startos/config.yaml); \
+	if [ -z "$$REGISTRY" ]; then \
+		echo "Error: You must define \"registry: https://my-registry.tld\" in ~/.startos/config.yaml"; \
+		exit 1; \
+	fi; \
+	S3BASE=$$(awk -F'/' '/^s9pk-s3base:/ {print $$3}' ~/.startos/config.yaml); \
+	if [ -z "$$S3BASE" ]; then \
+		echo "Error: You must define \"s9pk-s3base: https://s9pks.my-s3-bucket.tld\" in ~/.startos/config.yaml"; \
+		exit 1; \
+	fi; \
+	command -v s3cmd >/dev/null || \
+		(echo "Error: s3cmd not found. It must be installed to publish using s3." && exit 1); \
+	printf "\n Publishing to %s; indexing on %s ...\n" "$$S3BASE" "$$REGISTRY"; \
+	for s9pk in *.s9pk; do \
+		age=$$(( $$(date +%s) - $$(stat -f %m "$$s9pk" 2>/dev/null || stat -c %Y "$$s9pk") )); \
+		if [ "$$age" -gt 3600 ]; then \
+			printf "\033[1;33m⚠️ %s is %d minutes old. Publish anyway? [y/N] \033[0m" "$$s9pk" "$$((age / 60))"; \
+			read -r ans; \
+			case "$$ans" in [yY]*) ;; *) echo "Skipping $$s9pk"; continue ;; esac; \
+		fi; \
+		start-cli s9pk publish "$$s9pk"; \
+	done
+
+check-deps:
+	@command -v start-cli >/dev/null || \
+		(echo "Error: start-cli not found. Please see https://docs.start9.com/latest/developer-guide/sdk/installing-the-sdk" && exit 1)
+	@command -v npm >/dev/null || \
+		(echo "Error: npm not found. Please install Node.js and npm." && exit 1)
+
+check-init:
+	@if [ ! -f ~/.startos/developer.key.pem ]; then \
+		echo "Initializing StartOS developer environment..."; \
+		start-cli init-key; \
+	fi
+
+javascript/index.js: $(shell find startos -type f) tsconfig.json node_modules
+	npm run build
+
+node_modules: package-lock.json
+	npm ci
+
+package-lock.json: package.json
+	npm i
+
+clean:
+	@echo "Cleaning up build artifacts..."
+	@rm -rf $(PACKAGE_ID).s9pk $(PACKAGE_ID)_x86_64.s9pk $(PACKAGE_ID)_aarch64.s9pk $(PACKAGE_ID)_riscv64.s9pk javascript node_modules
@@ -0,0 +1,58 @@
+# Seed Snapshot (baked into the Docker image)
+
+Anything under `seed/data/` is copied into the container image at build time
+and placed at `/app/seed/data/`. On first boot, if `/data/crm.db` is not
+present on the StartOS service volume, `docker_entrypoint.sh` copies
+`/app/seed/data/.` into `/data/.` so the new 0.4 install starts with the
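+preserved data instead of an empty database.
+
+This describes the 0.1.0:39 migration build. Release 0.1.0:40 removed the
+seed snapshot from the image and the seeding logic from the entrypoint, so
+later builds rely solely on the live `/data` volume.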
+
+## What's currently baked in
+
+Initial snapshot was taken from the repo-level `data/` directory at build
+time (the same DB that the 0.3.5 dev workflow pointed at). Files:
+
+- `seed/data/crm.db` — SQLite database (investors, contacts, fundraising
+  rows, views, feature_requests, users, app_settings, etc.)
+- `seed/data/backups/*.json` — app-level snapshot exports
+
+## Refreshing the seed before a build
+
+If you want the 0.4 deploy to come up with the absolute latest production
+state from the 0.3.5 StartOS server, replace the files in `seed/data/`
+BEFORE running `make`:
+
+```sh
+# 1) On the 0.3.5 StartOS server, take a fresh app backup and/or grab
+#    the live database file:
+#      /media/embassy/services/ten-database/data/crm.db         (canonical)
+#      /media/embassy/services/ten-database/data/backups/*.json (optional)
+#      /media/embassy/services/ten-database/data/.crm-secret    (optional)
+#    Exact path may differ by StartOS 0.3.5 build.
+#
+# 2) scp them into this folder:
+scp embassy@<old-host>:/media/.../ten-database/data/crm.db \
+    start9/0.4/seed/data/crm.db
+# (Optional) include backups + secret:
+scp embassy@<old-host>:/media/.../ten-database/data/backups/* \
+    start9/0.4/seed/data/backups/
+scp embassy@<old-host>:/media/.../ten-database/data/.crm-secret \
+    start9/0.4/seed/data/.crm-secret
+#
+# 3) Rebuild:
+cd start9/0.4 && make clean && make x86
+```
+
+## Keeping `.crm-secret` out of the image
+
+By default `seed/data/.crm-secret` is NOT included. The first boot on the
+new machine generates a fresh JWT secret. Existing password hashes in
+`crm.db` remain valid, so users just log in once on the new host.
+
+If you WANT to preserve the exact secret (so already-issued JWTs remain
+valid), drop the file at `seed/data/.crm-secret` and rebuild.
+
+## Safety
+
+The entrypoint never overwrites an existing `/data/crm.db`. If the volume
+already contains data (StartOS restore, manual SSH pre-seed, prior install)
+the seed is skipped and a `.seeded` marker is written.
@@ -0,0 +1,3 @@
+import { sdk } from '../sdk'
+
+export const actions = sdk.Actions.of()
@@ -0,0 +1,7 @@
+import { sdk } from './sdk'
+
+export const { createBackup, restoreInit } = sdk.setupBackups(async () =>
+  // Preserve the entire service volume so crm.db, backup JSON files, and the
+  // persisted JWT secret all remain compatible with the prior package layout.
+  sdk.Backups.ofVolumes('main'),
+)
@@ -0,0 +1,5 @@
+import { sdk } from './sdk'
+
+export const setDependencies = sdk.setupDependencies(async () => {
+  return {}
+})
@@ -0,0 +1 @@
+export const i18n = (text: string) => text
@@ -0,0 +1,13 @@
+/**
+ * Plumbing. DO NOT EDIT.
+ */
+export { createBackup } from './backups'
+export { main } from './main'
+export { init, uninit } from './init'
+export { actions } from './actions'
+
+import { buildManifest } from '@start9labs/start-sdk'
+import { manifest as sdkManifest } from './manifest'
+import { versionGraph } from './versions'
+
+export const manifest = buildManifest(versionGraph, sdkManifest)
@@ -0,0 +1,16 @@
+import { sdk } from '../sdk'
+import { setDependencies } from '../dependencies'
+import { setInterfaces } from '../interfaces'
+import { versionGraph } from '../versions'
+import { actions } from '../actions'
+import { restoreInit } from '../backups'
+
+export const init = sdk.setupInit(
+  restoreInit,
+  versionGraph,
+  setInterfaces,
+  setDependencies,
+  actions,
+)
+
+export const uninit = sdk.setupUninit(versionGraph)
@@ -0,0 +1,25 @@
+import { i18n } from './i18n'
+import { sdk } from './sdk'
+import { WEB_PORT } from './utils'
+
+export const setInterfaces = sdk.setupInterfaces(async ({ effects }) => {
+  const uiMulti = sdk.MultiHost.of(effects, 'ui-multi')
+  const uiMultiOrigin = await uiMulti.bindPort(WEB_PORT, {
+    protocol: 'http',
+  })
+
+  const ui = sdk.createInterface(effects, {
+    name: i18n('Web UI'),
+    id: 'ui',
+    description: i18n('The web interface of Ten31 Database'),
+    type: 'ui',
+    masked: false,
+    schemeOverride: null,
+    username: null,
+    path: '',
+    query: {},
+  })
+
+  const uiReceipt = await uiMultiOrigin.export([ui])
+  return [uiReceipt]
+})
@@ -0,0 +1,33 @@
+import { i18n } from './i18n'
+import { sdk } from './sdk'
+import { DATA_MOUNT_PATH, IMAGE_ID, WEB_PORT } from './utils'
+
+export const main = sdk.setupMain(async ({ effects }) => {
+  console.info(i18n('Starting Ten31 Database'))
+
+  return sdk.Daemons.of(effects).addDaemon('primary', {
+    subcontainer: await sdk.SubContainer.of(
+      effects,
+      { imageId: IMAGE_ID },
+      sdk.Mounts.of().mountVolume({
+        volumeId: 'main',
+        subpath: null,
+        mountpoint: DATA_MOUNT_PATH,
+        readonly: false,
+      }),
+      'ten31-database-main',
+    ),
+    exec: {
+      command: ['/usr/local/bin/docker_entrypoint.sh'],
+    },
+    ready: {
+      display: i18n('Web Interface'),
+      fn: () =>
+        sdk.healthCheck.checkPortListening(effects, WEB_PORT, {
+          successMessage: i18n('CRM API is responding.'),
+          errorMessage: i18n('CRM API is not responding.'),
+        }),
+    },
+    requires: [],
+  })
+})
@@ -0,0 +1,13 @@
+export const short = {
+  en_US: 'Self-hosted investor and fundraising database for Ten31.',
+}
+
+export const long = {
+  en_US:
+    'Ten31 Database is an Airtable-style investor CRM with a fundraising grid, communications logging, views, backups, and CSV import. This StartOS 0.4 wrapper preserves the existing /data layout for upgrade-safe persistence.',
+}
+
+export const alertUpdate = {
+  en_US:
+    'This 0.4 package is designed to keep using the existing /data/crm.db, /data/backups, and /data/.crm-secret layout from the 0.3.5.1 package.',
+}
@@ -0,0 +1,35 @@
+import { setupManifest } from '@start9labs/start-sdk'
+import { alertUpdate, long, short } from './i18n'
+
+export const manifest = setupManifest({
+  id: 'ten-database',
+  title: 'Ten31 Database',
+  license: 'MIT',
+  packageRepo: 'https://github.com/ten31/ten31-database-startos',
+  upstreamRepo: 'https://github.com/ten31/ten31-database',
+  marketingUrl: 'https://ten31.vc',
+  donationUrl: null,
+  docsUrls: ['https://docs.start9.com/packaging/0.4.0.x/'],
+  description: { short, long },
+  volumes: ['main'],
+  images: {
+    main: {
+      source: {
+        dockerBuild: {
+          dockerfile: './Dockerfile',
+          workdir: '../..',
+        },
+      },
+      arch: ['x86_64', 'aarch64'],
+    },
+  },
+  alerts: {
+    install: null,
+    update: alertUpdate,
+    uninstall: null,
+    restore: null,
+    start: null,
+    stop: null,
+  },
+  dependencies: {},
+})
@@ -0,0 +1,9 @@
+import { StartSdk } from '@start9labs/start-sdk'
+import { manifest } from './manifest'
+
+/**
+ * Plumbing. DO NOT EDIT.
+ *
+ * The exported `sdk` const is used throughout this package codebase.
+ */
+export const sdk = StartSdk.of().withManifest(manifest).build(true)
@@ -0,0 +1,16 @@
+// Informational constants shared across the startos/ modules.
+// The authoritative id, title and version for the package come
+// from manifest/index.ts (id, title) and versions/ (version).
+export const PACKAGE_ID = 'ten-database'
+export const PACKAGE_TITLE = 'Ten31 Database'
+// ExVer form of the current 0.4 wrapper release (upstream 0.1.0, wrapper rev 43).
+// Keep this in sync with `current` in versions/index.ts.
+//   * 0.3.5 wrapper:  0.1.0.38 (legacy, aarch64)
+//   * First 0.4:      0.1.0:39 (shipped seed snapshot for migration)
+//   * Cleanup:        0.1.0:40 (seed removed + multi-threaded server + abuser auto-ban)
+//   * Auth:           0.1.0:41 (frontend persists auth across refreshes)
+//   * Gmail:          0.1.0:42 (Gmail integration, Phase 1)
+//   * Current:        0.1.0:43 (see versions/v0.1.0.43.ts)
+export const PACKAGE_VERSION = '0.1.0:43'
+
+export const DATA_MOUNT_PATH = '/data'
+export const WEB_PORT = 8080
+export const IMAGE_ID = 'main'
+export const VOLUME_ID = 'main'
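Review note on the `upstream:rev` version strings listed in the comment above: a minimal sketch of splitting such a string into its two parts. This is not the SDK's ExVer parser; it handles only the simple `0.1.0:39` shape used in this package, and `WrapperVersion` and `splitWrapperVersion` are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper: splits 'upstream:rev' strings such as '0.1.0:39'.
// Simplified on purpose; flavors, prereleases and other ExVer features are ignored.
interface WrapperVersion {
  upstream: string
  wrapperRev: number
}

function splitWrapperVersion(version: string): WrapperVersion {
  const sep = version.lastIndexOf(':')
  if (sep < 0) {
    throw new Error(`not in upstream:rev form: ${version}`)
  }
  const upstream = version.slice(0, sep)
  const rev = version.slice(sep + 1)
  // Upstream must be dotted digits, the wrapper revision plain digits.
  if (!/^\d+(\.\d+)*$/.test(upstream) || !/^\d+$/.test(rev)) {
    throw new Error(`not in upstream:rev form: ${version}`)
  }
  return { upstream, wrapperRev: Number(rev) }
}
```

The legacy 0.3.5 string `0.1.0.38` uses a dot rather than a colon, so this sketch rejects it instead of guessing at a revision.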
@@ -0,0 +1,11 @@
+import { VersionGraph } from '@start9labs/start-sdk'
+import { v_0_1_0_39 } from './v0.1.0.39'
+import { v_0_1_0_40 } from './v0.1.0.40'
+import { v_0_1_0_41 } from './v0.1.0.41'
+import { v_0_1_0_42 } from './v0.1.0.42'
+import { v_0_1_0_43 } from './v0.1.0.43'
+
+export const versionGraph = VersionGraph.of({
+  current: v_0_1_0_43,
+  other: [v_0_1_0_39, v_0_1_0_40, v_0_1_0_41, v_0_1_0_42],
+})
@@ -0,0 +1,38 @@
+import { VersionInfo } from '@start9labs/start-sdk'
+
+// First StartOS 0.4 release of Ten31 Database.
+//
+// Upgrade context:
+//   * The 0.3.5 wrapper shipped at 0.1.0.38 (legacy, aarch64 only).
+//   * This 0.4 wrapper is built for x86_64 and is intended for a
+//     parallel install on a new StartOS 0.4 host.
+//   * Data continuity is NOT handled by a StartOS-level in-place
+//     upgrade (that path does not exist across StartOS majors).
+//     Instead the container image is pre-seeded with a snapshot of
+//     /data (crm.db, backups/, optional .crm-secret). On first boot
+//     docker_entrypoint.sh copies that snapshot into the mounted
+//     `main` volume if it is empty.
+//
+// Because both "up" and "down" paths are inside the same wrapper
+// lineage (and the first 0.4 release has no earlier 0.4 version
+// to migrate from), the migration functions are intentionally
+// no-ops. Future 0.4.x releases can chain off this node in the
+// version graph.
+export const v_0_1_0_39 = VersionInfo.of({
+  version: '0.1.0:39',
+  releaseNotes: {
+    en_US: [
+      'First StartOS 0.4 package for Ten31 Database.',
+      'Built for x86_64; sideload-only during beta.',
+      'Container image ships with a baked-in /data snapshot so the',
+      'service boots with the existing investor and fundraising data,',
+      'saved views, backups, users, and app settings already in place.',
+      'No StartOS-level migration is performed from the 0.3.5 package;',
+      'this package is installed fresh on a 0.4 host.',
+    ].join(' '),
+  },
+  migrations: {
+    up: async () => {},
+    down: async () => {},
+  },
+})
@@ -0,0 +1,57 @@
+import { VersionInfo } from '@start9labs/start-sdk'
+
+// Post-migration cleanup + hardening release.
+//
+// Context:
+//   * 0.1.0:39 was the first 0.4 package and shipped a baked-in
+//     /data snapshot that docker_entrypoint.sh copied into the
+//     mounted `main` volume on first boot (only if the volume was
+//     empty). That snapshot did its job and the live host now has
+//     a populated /data with all real investor + fundraising data.
+//   * 0.1.0:40 removes the seed snapshot from the image and the
+//     seeding logic from the entrypoint. The live /data volume is
+//     the sole source of truth from here on. StartOS preserves the
+//     volume across sideloads, so this upgrade does not disturb
+//     any data — it just slims the image and removes a code path
+//     that should never run again.
+//   * 0.1.0:40 also hardens the backend HTTP server against the
+//     vulnerability scanners that find the StartTunnel-exposed
+//     interface within hours of going live:
+//       - HTTPServer → ThreadingHTTPServer so one slow request or
+//         a wave of scanner probes can't block legit users.
+//       - Per-IP GET rate limit (default 600/min) in addition to
+//         the existing login/write limits.
+//       - 404-burst auto-ban: any IP that produces ABUSE_404_THRESHOLD
+//         404s within ABUSE_404_WINDOW_SEC (default 15 in 60s) is
+//         parked on a class-level blacklist for ABUSE_BAN_SEC
+//         (default 15 minutes). Banned IPs get an instant 429 with
+//         no DB or filesystem work.
+//       - All limits stay tunable via env vars
+//         (CRM_GET_RATE_LIMIT_PER_MIN, CRM_ABUSE_404_THRESHOLD,
+//         CRM_ABUSE_404_WINDOW_SEC, CRM_ABUSE_BAN_SEC).
+//
+// No data migration is required: the SQLite schema is unchanged
+// and the live DB on /data is left exactly as-is.
+export const v_0_1_0_40 = VersionInfo.of({
+  version: '0.1.0:40',
+  releaseNotes: {
+    en_US: [
+      'Removes the baked-in /data seed snapshot now that the',
+      '0.3.5 → 0.4 migration is complete. The live /data volume',
+      'on the StartOS host is the sole source of truth and is',
+      'preserved across sideloads, so no live data is touched by',
+      'this upgrade. Image is smaller and the first-boot seeding',
+      'code path has been removed. Also hardens the backend',
+      'against vulnerability scanners hitting the public',
+      'StartTunnel interface: the HTTP server is now multi-threaded',
+      'so one slow request can no longer block legit users, GET',
+      'requests are rate-limited per IP, and any IP that bursts',
+      'too many 404s in a short window is auto-banned for 15',
+      'minutes with no DB work performed.',
+    ].join(' '),
+  },
+  migrations: {
+    up: async () => {},
+    down: async () => {},
+  },
+})
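Review note on the 404-burst auto-ban described in the comment above: a minimal sketch of a per-IP sliding-window counter with a timed ban. The backend is a Python HTTP server whose code is not part of this hunk, so this is written in TypeScript to match the wrapper files and is not the shipped implementation. `NotFoundBanList`, `record404`, and `isBanned` are hypothetical names; only the defaults (15 hits, 60-second window, 15-minute ban) come from the comment.

```typescript
// Sketch only: hypothetical names, defaults taken from the release comment.
interface AbuseConfig {
  threshold: number // 404s inside the window that trigger a ban
  windowSec: number // length of the sliding window
  banSec: number // how long a ban lasts
}

const DEFAULT_ABUSE_CONFIG: AbuseConfig = {
  threshold: 15,
  windowSec: 60,
  banSec: 15 * 60,
}

class NotFoundBanList {
  private readonly cfg: AbuseConfig
  private readonly hits = new Map<string, number[]>()
  private readonly bannedUntil = new Map<string, number>()

  constructor(cfg: AbuseConfig = DEFAULT_ABUSE_CONFIG) {
    this.cfg = cfg
  }

  // Cheap lookup intended to run before any DB or filesystem work.
  isBanned(ip: string, nowSec: number): boolean {
    const until = this.bannedUntil.get(ip)
    if (until === undefined) return false
    if (nowSec >= until) {
      this.bannedUntil.delete(ip) // ban expired
      return false
    }
    return true
  }

  // Call whenever a request from `ip` ends in a 404.
  record404(ip: string, nowSec: number): void {
    const cutoff = nowSec - this.cfg.windowSec
    const recent = (this.hits.get(ip) ?? []).filter((t) => t > cutoff)
    recent.push(nowSec)
    if (recent.length >= this.cfg.threshold) {
      this.bannedUntil.set(ip, nowSec + this.cfg.banSec)
      this.hits.delete(ip)
    } else {
      this.hits.set(ip, recent)
    }
  }
}
```

A handler would check `isBanned` first and answer 429 straight away, which matches the "no DB or filesystem work" goal stated in the comment. In a multi-threaded server the two maps would also need a lock, which this sketch leaves out.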
@@ -0,0 +1,42 @@
+import { VersionInfo } from '@start9labs/start-sdk'
+
+// Frontend convenience release: persist auth across page reloads.
+//
+// Background: through 0.1.0:40 the auth token + user object were held only
+// in React state in memory. Any refresh, tab close, or browser restart
+// dropped the token and forced the user back to the login screen. Since
+// the JWT is signed with /data/.crm-secret (which already survives sideloads
+// and container restarts), the underlying token is still valid for its full
+// 24-hour lifetime — we just weren't keeping it anywhere persistent.
+//
+// 0.1.0:41 stores the JWT and user object in localStorage on login (and
+// rehydrates from there on app mount), so refreshes and reopened tabs stay
+// signed in until the token expires. The api() helper now also dispatches
+// a 'crm:unauthorized' event whenever an authenticated request comes back
+// with a 401, and the AuthProvider listens for that event to clear the
+// stored auth — so an expired or rejected token immediately bounces the
+// user back to the login screen instead of leaving the app in a broken
+// "loaded but every request fails" state.
+//
+// Backend is unchanged: the JWT still carries the user's true role and is
+// re-verified on every request, so a tampered localStorage user object
+// cannot escalate privileges (the next admin call would just 401/403).
+//
+// No data migration is required.
+export const v_0_1_0_41 = VersionInfo.of({
+  version: '0.1.0:41',
+  releaseNotes: {
+    en_US: [
+      'Logins now persist across page refreshes and tab closures for',
+      'the full 24-hour token lifetime. Previously every reload bounced',
+      'you to the login screen even though the token was still valid.',
+      'If the server later rejects a stored token (expired, secret key',
+      'changed, etc.) the app automatically clears it and shows the',
+      'login screen instead of leaving requests silently failing.',
+    ].join(' '),
+  },
+  migrations: {
+    up: async () => {},
+    down: async () => {},
+  },
+})
@@ -0,0 +1,58 @@
+import { VersionInfo } from '@start9labs/start-sdk'
+
+// Gmail integration — Phase 1.
+//
+// Background: the CRM previously had no ingestion path for email
+// activity. Contacts were logged manually; correspondence history lived
+// only in our mailboxes. This release adds a one-way capture pipeline
+// that ingests sent and received mail for every Workspace user at
+// ten31.xyz, matches messages against existing investor records, and
+// records metadata (+ bodies and attachments for matched threads) into
+// the CRM database.
+//
+// Auth model: domain-wide delegation via a Google service account. The
+// service-account JSON key is stored on the /data volume at
+// /data/secrets/gmail-service-account.json (chmod 600, operator-dropped).
+// The integration is self-disabling: if the key file is absent, the
+// scheduler doesn't start and /api/email/* routes return 503. No key →
+// no behavior change from 0.1.0:41.
+//
+// When the key IS present, docker_entrypoint.sh auto-enables the
+// integration and sets sensible defaults (3-hour sync interval, domain
+// ten31.xyz, DWD auth). All defaults can still be overridden via env.
+//
+// Database: migration 0001 adds eight new tables under the email_
+// namespace (emails, email_accounts, email_recipients,
+// email_account_messages, email_attachments, email_threads,
+// email_investor_links, email_sync_runs). All CREATE TABLE IF NOT EXISTS,
+// so the migration is safely idempotent — re-applying is a no-op.
+//
+// Backend: wholly isolated under backend/email_integration/. Three tiny,
+// feature-flag-guarded hooks in server.py (migration call, scheduler
+// startup, /api/email/* route dispatch). Removing or disabling the
+// integration leaves server behavior identical to 0.1.0:41.
+//
+// New Python dep: cryptography==42.0.5 (required for RS256 JWT signing
+// in DWD bearer token exchange). Now installed in the image.
+//
+// No data migration code needed — new tables, additive only.
+export const v_0_1_0_42 = VersionInfo.of({
+  version: '0.1.0:42',
+  releaseNotes: {
+    en_US: [
+      'Adds a Gmail capture pipeline. When a Google Workspace',
+      "service-account key is dropped into the server's /data/secrets",
+      'folder, the CRM begins pulling sent and received mail for every',
+      'ten31.xyz user on a 3-hour cycle, matching messages against',
+      'existing investor records and storing metadata (plus bodies and',
+      'attachments for matched threads) in the database. With no key',
+      'present the feature is dormant and this release behaves',
+      'identically to 0.1.0:41. Eight new email_* tables are added',
+      'additively; no existing data is touched.',
+    ].join(' '),
+  },
+  migrations: {
+    up: async () => {},
+    down: async () => {},
+  },
+})
@@ -0,0 +1,44 @@
+import { VersionInfo } from '@start9labs/start-sdk'
+
+// Hotfix for 0.1.0:42.
+//
+// Issue 1 (critical): POST requests to /api/email/* hung indefinitely.
+// server.py's do_POST called get_body() early in the dispatch to support
+// /api/auth/login, which reads bytes off the request stream. My Gmail
+// integration hook then ran route handlers that called get_body() a
+// second time — but the stream was already drained, so the second read
+// blocked waiting for bytes that never came. GET requests (which don't
+// read a body) were unaffected.
+//
+// Fix: get_body() now caches the parsed JSON on the handler instance
+// on first call. Repeat calls return the cached value. Handler
+// instances are per-request in ThreadingHTTPServer, so the cache is
+// naturally request-scoped and thread-safe.
+//
+// Issue 2 (minor): the /api/email/accounts/enroll endpoint required
+// both `email_address` and `user_id` in the body, making it painful to
+// call for the common single-admin-enrolling-themselves case.
+//
+// Fix: the endpoint now also accepts `email` as an alias, and if
+// user_id isn't supplied it auto-resolves by looking up the email in
+// the users table (falling back to the authenticated admin's own id
+// if no match).
+//
+// No schema changes, no data migration.
+export const v_0_1_0_43 = VersionInfo.of({
+  version: '0.1.0:43',
+  releaseNotes: {
+    en_US: [
+      'Hotfix for the Gmail integration in 0.1.0:42. POST requests to',
+      '/api/email/* endpoints were hanging because the request body was',
+      'being read twice from a single-shot stream. This release caches',
+      'the parsed body on the request so subsequent reads are safe, and',
+      'also relaxes the enroll endpoint to accept just an email and',
+      'auto-resolve the CRM user.',
+    ].join(' '),
+  },
+  migrations: {
+    up: async () => {},
+    down: async () => {},
+  },
+})
@@ -0,0 +1,11 @@
+{
+  "include": ["startos/**/*.ts", "node_modules/**/startos"],
+  "compilerOptions": {
+    "target": "ES2018",
+    "module": "CommonJS",
+    "moduleResolution": "node",
+    "esModuleInterop": true,
+    "strict": true,
+    "skipLibCheck": true
+  }
+}
@@ -0,0 +1,115 @@
+# Start9 Packaging Checklist (0.3.5 style)
+
+This checklist is written for the StartOS 0.3.5 packaging flow used in this repo.
+Use it as an indicative template for other projects, not a literal one-size-fits-all script.
+
+## 1) Required packaging scaffold (inside `start9/<version>/`)
+
+- `manifest.yaml`
+- `Makefile`
+- `Dockerfile`
+- `docker_entrypoint.sh`
+- `healthcheck.sh`
+- `instructions.md`
+- `icon.png` (and/or `icon.svg` if desired)
+
+Optional:
+- `scripts/` for package-specific helper scripts
+- prebuilt artifacts (`image.tar`, `.s9pk`) generated by packaging
+
+## 2) Project-specific values to change
+
+In `manifest.yaml`:
+- `id`
+- `title`
+- `version`
+- `description`
+- `upstream-repo`, `support-site`, `marketing-site`
+- `interfaces` (port, protocol, TLS, UI flags)
+- `config` (runtime env/config options)
+- `backup` mounts/commands
+- `actions` (if you expose maintenance actions)
+
+In `Makefile`:
+- package id/version variables
+- image name/tag
+- paths/targets used by `make ... package`
+
+In `Dockerfile`:
+- base image
+- runtime dependencies
+- app copy paths
+- entrypoint/cmd
+
+In scripts:
+- read config/env from StartOS mount/env conventions
+- write data only to mounted persistent directories
+
+## 3) What must exist outside `start9/`
+
+The wrapper is not fully standalone. It builds an image from your app source.
+
+For this CRM package specifically:
+- `backend/server.py`
+- `frontend/` (all static assets/UI)
+
+These are copied in Docker build steps. In other projects, these paths, filenames, and build inputs can be different.
+This document is meant to show the pattern; each project must map to its own app layout.
+
+## 4) Data + persistence checklist
+
+- Persist DB/files under the mounted data path (not an ephemeral container path).
+- Confirm backup/restore mounts and commands match the manifest volume names exactly.
+- Verify restore can start app cleanly and preserve schema/data.
+
+## 5) Network/interface checklist
+
+- Confirm service listens on the internal container port expected by `manifest.yaml`.
+- Confirm LAN interface protocol settings match actual service behavior (HTTP vs HTTPS/TCP).
+- Confirm UI launches from StartOS Interfaces page without cert/protocol mismatch.
+
+## 6) Build + install flow
+
+1. Bump version in:
+   - `start9/<version>/manifest.yaml`
+   - `start9/<version>/Makefile`
+2. Build package:
+   - `make -C start9/<version> package`
+3. Install resulting `.s9pk` in StartOS.
+4. Start service and check:
+   - health/logs
+   - UI launch
+   - persistence after restart
+   - backup/restore smoke test
+
+## 7) Reusable vs non-reusable parts
+
+Reusable:
+- overall folder structure and file roles in `start9/<version>/`
+- packaging workflow (`manifest` + `Makefile` + `Dockerfile` + scripts)
+
+Non-reusable without edits:
+- app copy paths in Dockerfile
+- app-specific env/config keys
+- ports/interfaces/protocol values
+- backup/restore commands tied to app data layout
+
+## 8) Planned migration path to StartOS 0.4
+
+When 0.4 is ready for your deployment, use this approach:
+
+1. Keep 0.3.5 package stable as the production branch.
+2. Create a parallel package folder for 0.4 (for example `start9/0.4/`).
+3. Port wrapper files (`manifest`, `Makefile`, Docker packaging scripts) to the 0.4 schema/tooling.
+4. Update interface/config/backup definitions to 0.4 expectations.
+5. Build and install the 0.4 package on a test server first.
+6. Restore a real backup into 0.4 and validate:
+   - app starts
+   - UI works
+   - data integrity is preserved
+   - backup/restore still works
+7. Only after successful validation, promote 0.4 package for primary use.
+
+Notes:
+- Keep database path and backup format stable where possible to make migration low-risk.
+- If schema changes are required, add explicit migration steps and rollback steps before production cutover.
@@ -1,258 +0,0 @@
-# Venture Fund CRM — Project Context for Claude
-
-You are continuing development on a self-hosted CRM system for a venture fund. Below is everything you need to know about what has been built, how it works, and what comes next.
-
---
-
-## Business Context
-
- **Fund:** ~$200M AUM, currently fundraising for Fund II
- **Users:** Team of 5 people, accessing via browser on local network or remotely via Tailscale VPN
- **Current LPs:** 150 investors
- **Prospects:** 250+ being tracked
- **Migrating from:** Airtable (CSV exports available)
- **Core goals:**
-  1. Eliminate sensitive LP/prospect data from third-party servers (Airtable, CRMs)
-  2. Stop paying monthly subscription costs
-  3. Purpose-built tool for fundraising workflow: managing existing investors, tracking new prospects, raising capital
- **User:** Grant (grant@ten31.xyz)
-
---
-
-## What Has Been Built (Sprint 1 — Complete)
-
-A fully functional prototype with backend API, frontend UI, demo data, and utility scripts. Everything runs locally; the only external dependencies are two Python packages and the React and Babel scripts the frontend loads from a CDN.
-
-### Tech Stack (Actual — differs from original plan)
-
-The original plan called for FastAPI + SQLAlchemy + separate React build, but the build environment lacked pip/npm access. The stack was adapted to:
-
- **Backend:** Python 3 stdlib HTTP server + `sqlite3` + `bcrypt` + `PyJWT` — single file, no framework
- **Database:** SQLite with WAL mode (concurrent reads, serialized writes — fine for 5 users)
- **Frontend:** Single self-contained HTML file loading React 18 + Babel from CDN (unpkg)
- **Deployment:** Run `python3 backend/server.py` — serves both API and frontend on port 8080
- **Remote access:** Tailscale mesh VPN (each device gets a private IP, peer-to-peer encrypted)
-
-### Project Structure
-
-```
-venture-crm/
-├── backend/
-│   ├── server.py              # Complete API server (1,873 lines)
-│   └── requirements.txt       # bcrypt, PyJWT (for reference)
-├── frontend/
-│   └── index.html             # Complete React SPA (2,982 lines)
-├── data/
-│   └── crm.db                 # SQLite database (created on first run)
-├── scripts/
-│   ├── create_user.py         # CLI tool to add users
-│   ├── reset_password.py      # CLI tool to reset passwords
-│   └── backup.sh              # Database backup with 30-day retention
-└── start.sh                   # Launch script
-```
-
-### Database Schema
-
-All tables use TEXT primary keys (8-char UUIDs). The database is at `data/crm.db`.
-
-**Tables:**
- `users` — id, username, email, password_hash, full_name, role (admin/manager/member), is_active
- `contacts` — id, first_name, last_name, email, phone, mobile, title, organization_id (FK), contact_type (investor/prospect/advisor/other), status, source, tags (JSON), notes, linkedin_url, preferred_contact, created_by (FK)
- `organizations` — id, name, type, industry, website, phone, email, address, city, state, country, description, tags (JSON), created_by (FK)
- `opportunities` — id, name, contact_id (FK), organization_id (FK), stage (lead/outreach/meeting/due_diligence/committed/funded), commitment_amount, expected_amount, probability, expected_close_date, fund_name, description, next_step, owner_id (FK), priority (low/medium/high), lost_reason
- `communications` — id, contact_id (FK), opportunity_id (FK), type (email/call/meeting/note/text), subject, body, communication_date, duration_minutes, outcome, next_action, next_action_date, attendees (JSON), created_by (FK)
- `lp_profiles` — id, contact_id (FK, unique), commitment_amount, funded_amount, commitment_date, fund_name, investor_type, accredited, legal_docs_signed, signed_date, wire_received, wire_date, k1_sent, preferred_communication, notes
- `custom_fields` — id, name, entity_type, field_type, options (JSON), required, display_order
- `custom_field_values` — id, custom_field_id (FK), entity_id, entity_type, value
- `audit_log` — id, user_id (FK), entity_type, entity_id, action, changes (JSON), created_at
- `tags` — id, name (unique), color
-
-**Key indexes:** contacts(contact_type, status, organization_id), opportunities(stage, owner_id, contact_id), communications(contact_id, communication_date), audit_log(entity_type, entity_id), lp_profiles(contact_id)
-
-### API Endpoints
-
-All endpoints except auth require `Authorization: Bearer <jwt_token>` header. Server runs at `http://0.0.0.0:8080`.
-
-**Auth:**
- `POST /api/auth/login` — body: {username, password} → {token, user}
- `POST /api/auth/register` — body: {username, password, email, full_name} → {token, user}
-
-**Contacts:**
- `GET /api/contacts?type=&status=&search=&sort=&order=&limit=&offset=&organization_id=&tag=` → {data[], total, limit, offset}
- `GET /api/contacts/:id` → {data: {contact + communications[], opportunities[], lp_profile}}
- `POST /api/contacts` — full CRUD
- `PUT /api/contacts/:id`
- `DELETE /api/contacts/:id`
-
-**Organizations:**
- `GET /api/organizations?search=&type=&limit=&offset=` → {data[], total}
- `GET /api/organizations/:id` → {data: {org + contacts[], opportunities[]}}
- `POST /api/organizations` — full CRUD
- `PUT /api/organizations/:id`
- `DELETE /api/organizations/:id`
-
-**Opportunities (Pipeline):**
- `GET /api/opportunities?stage=&owner_id=&search=&priority=&fund_name=&limit=&offset=` → {data[], total}
- `GET /api/opportunities/:id` → {data: {opp + communications[], stage_history[]}}
- `POST /api/opportunities`
- `PUT /api/opportunities/:id`
- `PATCH /api/opportunities/:id/stage` — body: {stage} (logs stage change in audit)
- `DELETE /api/opportunities/:id`
-
-**Communications:**
- `GET /api/communications?contact_id=&type=&search=&limit=&offset=` → {data[], total}
- `GET /api/contacts/:id/communications` → same as above, scoped to contact
- `POST /api/communications`
- `PUT /api/communications/:id`
- `DELETE /api/communications/:id`
-
-**LP Profiles:**
- `GET /api/lp-profiles?fund_name=&search=` → {data[], total}
- `GET /api/lp-profiles/:id` → {data}
- `POST /api/lp-profiles` — also sets contact type to 'investor'
- `PUT /api/lp-profiles/:id`
-
-**Reports:**
- `GET /api/reports/dashboard` → {metrics, pipeline_stages[], recent_communications[], upcoming_actions[], recent_stage_changes[]}
- `GET /api/reports/pipeline` → {by_stage[], by_owner[], by_priority[]}
- `GET /api/reports/lp-breakdown` → {lps[], summary, by_type[]}
- `GET /api/reports/activity?days=30` → {by_user[], by_day[]}
-
-**Import/Export:**
- `POST /api/import/csv` — body: {data: [...objects], entity_type, mapping: {csv_col: crm_field}, dry_run: bool}
- `GET /api/export/contacts` → {data[]}
-
-**Other:**
- `GET /api/tags` / `POST /api/tags`
- `GET /api/users`
- `GET /api/audit-log?entity_type=&entity_id=`
- `GET /api/health`
-
-### Frontend Pages
-
-The frontend is a single HTML file with inline CSS (dark theme) and React via CDN. Pages:
-
-1. **Login** — username/password form, registration option
-2. **Dashboard** — KPI cards (Total LPs, Committed $, Pipeline Value, Active Opportunities, Prospects, Monthly Comms), pipeline stage visualization, recent communications, upcoming actions, recent stage changes
-3. **Contacts** — tabbed (All/Investors/Prospects), searchable sortable table, slide-over detail panel with communications timeline and opportunities, add/edit modal
-4. **Pipeline** — Kanban-style board (Lead → Outreach → Meeting → DD → Committed → Funded), stage summary bar with $ per stage, opportunity cards with stage selector, add/edit modal
-5. **Communications** — chronological list, filter by type/contact, log new communication form
-6. **LP Tracker** — summary cards (Total Committed, Funded, Avg Check, LP Count), table with status indicators (checkmarks) for docs/wire/K1
-7. **Import** — CSV paste/upload, preview table, field mapping interface, dry-run validation, execute import
-8. **Settings** — user profile, tag management
-
-### Demo Data (Seeded Automatically)
-
-On first run, the server seeds:
- 2 users: `admin`/`admin123` (admin role), `grant`/`password` (admin role)
- 8 organizations (Sovereign Wealth Holdings, Pacific Capital Partners, Northeast Pension Fund, Redwood Endowment, Atlas Family Office, Summit Insurance Group, Cascade Wealth Management, Blue Harbor Foundation)
- 12 contacts (6 investors, 6 prospects)
- 6 LP profiles totaling $83M committed (all Fund I, all fully funded)
- 6 pipeline opportunities totaling $40M expected (Fund II prospects at various stages)
- 8 communication records (emails, calls, meetings)
- 6 tags (High Priority, Fund I LP, Fund II Prospect, Family Office, Institutional, Re-up Target)
-
-### How to Run
-
-```bash
-pip3 install bcrypt PyJWT
-cd venture-crm
-python3 backend/server.py
-# Open http://localhost:8080
-# Login: grant / password
-```
-
-### What Has Been Tested
-
-All API endpoints have been verified via curl:
- Auth (login, register)
- Contact CRUD + search
- Organization CRUD
- Opportunity CRUD + stage changes
- Communication CRUD
- LP profile CRUD
- Dashboard, pipeline, LP breakdown reports
- CSV import with dry-run and field mapping
- Frontend serves correctly from the backend
-
---
-
-## What Has NOT Been Built Yet (Remaining Sprints)
-
-### Sprint 2 items still needed:
- Custom fields UI (backend schema exists but not wired to frontend forms)
- Drag-and-drop on pipeline board (currently uses dropdown stage selector)
-
-### Sprint 3: Airtable Migration + Custom Fields
- Custom field definition admin UI
- Display custom fields on contact/opportunity forms
- Actual Airtable data migration (import wizard exists but hasn't been used with real data)
-
-### Sprint 4: Reporting + Polish
- Pipeline analytics (deal velocity, conversion rates between stages)
- User activity report page
- CSV export buttons on all reports
- Bulk actions on contact list (tag multiple, assign, bulk export)
- Automated daily backup via cron
- Team setup documentation
-
-### Future Enhancements (discussed but not planned):
- Email integration (auto-log emails via IMAP)
- Calendar sync
- Task assignments linked to opportunities
- Bulk email with templates
- Two-factor authentication
- Advanced saved search filters
- Audit trail UI page
-
---
-
-## Architecture Decisions & Constraints
-
-1. **Single-file backend:** The Python server is one file (`server.py`) using stdlib `http.server`. No framework. This keeps deployment dead simple but means no middleware pattern, no auto-docs, no async. If the codebase grows significantly, consider migrating to FastAPI.
-
-2. **Single-file frontend:** The React app is one HTML file loading from CDN. No build step. This means no TypeScript, no tree-shaking, no code splitting. Babel compiles JSX in the browser. If the UI grows significantly, consider splitting into a proper Vite/React project.
-
-3. **SQLite WAL mode:** Handles 5 concurrent readers + 1 writer. Fine for this team size. If the team grows past 10-15, migrate to PostgreSQL.
-
-4. **No localStorage:** JWT token stored in React state only (memory). Page refresh = re-login. This is intentional for security.
-
-5. **8-char UUIDs:** Generated via `str(uuid.uuid4())[:8]`. Collision probability is negligible at this data scale.
-
-6. **Tailscale for remote access:** Server binds to 0.0.0.0. Tailscale gives each device a 100.x.x.x IP. No port forwarding, no public exposure.
-
---
-
-## Key Files to Read
-
-When making changes, these are the files:
-
- **`backend/server.py`** (1,873 lines) — ALL backend logic: database schema, auth, every API endpoint, seed data, server startup. Search for handler method names like `handle_list_contacts`, `handle_create_opportunity`, etc.
-
- **`frontend/index.html`** (2,982 lines) — ALL frontend logic: CSS styles, React components, API client, every page. Search for component names like `Dashboard`, `ContactsPage`, `PipelinePage`, etc.
-
- **`scripts/create_user.py`** — CLI to add team members
- **`scripts/backup.sh`** — Database backup with rotation
- **`start.sh`** — Launch script
-
---
-
-## Common Modification Patterns
-
-**Adding a new field to contacts:**
-1. In `server.py`: add column to CREATE TABLE, add to INSERT/UPDATE in handler methods
-2. In `index.html`: add field to the contact form component and detail view
-3. Delete `data/crm.db` to recreate schema (or use sqlite3 ALTER TABLE)
-
-**Adding a new pipeline stage:**
-1. In `server.py`: add to `PIPELINE_STAGES` list
-2. In `index.html`: add to the stages array in the Pipeline component
-
-**Changing the color scheme:**
-1. In `index.html`: modify the CSS variables in the `<style>` tag (search for hex colors like `#0f172a`, `#1e293b`, `#6366f1`)
-
-**Adding a new API endpoint:**
-1. In `server.py`: add route matching in `do_GET`/`do_POST`/etc., then add handler method
-
-**Adding a new page:**
-1. In `index.html`: create a new component, add it to the navigation sidebar and the page router