Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server

Workstream A–C substrate for the Ten31 agentic system:
- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links,
  interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db,
  drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity
queries). Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Keysat
2026-06-05 08:11:28 -05:00
parent 7027efd777
commit c7ce44d963
99 changed files with 10676 additions and 7817 deletions
+826
View File
@@ -0,0 +1,826 @@
# Gmail Integration — Technical Architecture
**Status:** Design / Phase 1 scaffolding
**Target:** Venture CRM (self-hosted, StartOS 0.4 beta)
**Author:** Grant + Claude
**Last updated:** 2026-04-21
---
## 1. Goals and non-goals
### In scope for Phase 1
1. Authorize Gmail access for all 5 team members.
2. Continuous, incremental sync of sent and received mail.
3. Parse sender, recipients, subject, body, timestamp, attachments.
4. Store locally in the CRM SQLite DB.
5. Match each email to an investor record by sender/recipient email.
6. Thread related messages using Gmail's native `threadId` plus RFC 2822 `Message-ID` / `In-Reply-To` / `References`.
7. Store attachments on local disk alongside email records.
### Non-goals for Phase 1 (future phases)
- Sending email from the CRM.
- Full-text search (we'll store bodies; search UI comes later).
- Real-time push via Pub/Sub watch (we'll poll every few hours).
- Contact auto-creation from unknown senders.
- Shared team inbox features.
### Guiding constraints
- **Do not break the running CRM.** All new code lives in an isolated module. Exactly two small additions are made to the existing `server.py`: a startup call and a route-dispatch hook (both behind a feature flag, both safe to roll back).
- **Match the existing codebase's style.** Pure Python stdlib + sqlite3, no new hard dependencies beyond `cryptography` (for encrypting stored OAuth refresh tokens) and the `google-auth` / `google-api-python-client` libs for the Gmail SDK. Everything else is stdlib.
- **Single-process, single-sqlite-writer discipline.** The sync worker writes through the same connection pattern as the rest of the CRM (WAL mode, short transactions) so we don't introduce lock contention.
---
## 2. Decision summary
| # | Decision | Choice | Rationale |
|---|----------|--------|-----------|
| D1 | Auth method | Domain-wide delegation (DWD) primary, per-user OAuth2 as pluggable fallback | You're Workspace super-admin; one-time setup, no per-user consent flow, centralized revocation. OAuth path preserved behind a common `CredentialProvider` interface so the CRM isn't locked in. |
| D2 | Schema | New dedicated `emails_*` tables (not reuse `communications`) | Gmail-specific fields (`gmail_message_id`, `thread_id`, `history_id`, labels, RFC headers) don't belong in the generic `communications` table. `communications` can later project email rows via a read-only view if desired. |
| D3 | Sync scope | Entire mailbox, match-only full storage | Fetch cheap metadata for every message. Persist the full body + attachments only for messages that match a known investor email. Minimizes disk use and privacy surface. |
| D4 | Incremental strategy | Gmail `history.list` driven by per-account `historyId` checkpoint | Canonical Gmail-supported mechanism for "what changed since last sync" — cheap, reliable, dedupe-free. `messages.list` only on initial backfill. |
| D5 | Delivery | Architecture doc + scaffolded modules in `backend/email_integration/`, no edits to `server.py` yet | Lets you review the shape before any wiring happens. Activation is a 3-line patch to `server.py` when you're ready. |
---
## 3. Google Cloud / Workspace setup (one-time, ~20 minutes)
This is the runbook you'll follow once the code is ready. It's in the doc so the setup steps and the code that relies on them are never out of sync.
### 3a. Create a GCP project
1. `console.cloud.google.com` → new project, name it `ten31-crm-gmail` (or similar). Note the **Project ID**.
2. Enable the Gmail API: APIs & Services → Library → "Gmail API" → Enable.
### 3b. Create a service account (for DWD)
3. IAM & Admin → Service Accounts → Create service account.
- Name: `crm-gmail-sync`
- Role: leave blank (it doesn't need any GCP IAM roles; Gmail scope comes from DWD).
4. On the service account's detail page → Keys → Add Key → Create new key → JSON. Download the file.
5. Advanced settings → enable **domain-wide delegation**. Copy the **numeric client ID** (a ~21-digit number).
### 3c. Authorize the service account in Google Workspace
6. `admin.google.com` → Security → Access and data control → API controls → Manage domain-wide delegation → Add new.
7. Client ID: paste the numeric client ID from step 5.
8. OAuth scopes (comma-separated, read-only — no send):
```
https://www.googleapis.com/auth/gmail.readonly,
https://www.googleapis.com/auth/gmail.metadata
```
(Phase 1 uses `gmail.readonly` only. `gmail.metadata` is listed for a possible future "metadata-only" sync mode.)
9. Authorize.
### 3d. Install the service account key on the CRM server
10. Copy the JSON key file to `<CRM_ROOT>/data/secrets/gmail-service-account.json`.
11. `chmod 600` the file. `chown` to the CRM process owner.
12. Set env var in `.env.beta`:
```
CRM_GMAIL_SA_KEY_PATH=/path/to/CRM/data/secrets/gmail-service-account.json
CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz
CRM_GMAIL_INTEGRATION_ENABLED=true
CRM_GMAIL_SYNC_INTERVAL_MIN=180 # 3 hours
CRM_GMAIL_SECRET_KEY=<32+ random bytes base64> # for encrypting stored OAuth refresh tokens (unused in DWD mode but required if fallback ever activates)
```
### 3e. Configure per-user mailboxes
13. In the CRM, Admin → Email Accounts → "Enroll all Workspace users via DWD". This calls `POST /api/email/accounts/enroll-all` which for each existing CRM user with an `@ten31.xyz` email creates an `email_accounts` row bound to their address.
14. Each teammate sees a "Gmail sync: active" chip on their profile.
---
## 4. Database schema
All new tables live under the `emails_` prefix so they're grouped in tooling. Migration script: `backend/email_integration/migrations/0001_email_tables.sql`. It is **idempotent** (uses `CREATE TABLE IF NOT EXISTS` + targeted `ALTER TABLE` guards) so re-running is safe.
### 4a. `email_accounts` — one row per enrolled mailbox
```sql
CREATE TABLE IF NOT EXISTS email_accounts (
id TEXT PRIMARY KEY, -- uuid
user_id TEXT NOT NULL, -- FK users.id
email_address TEXT NOT NULL UNIQUE, -- e.g. grant@ten31.xyz
auth_method TEXT NOT NULL, -- 'dwd' | 'oauth'
oauth_refresh_enc BLOB, -- NULL for DWD; AES-GCM ciphertext for oauth
oauth_token_exp TEXT, -- ISO8601, cached access token expiry
oauth_token_enc BLOB, -- short-lived access token cache (encrypted)
sync_enabled INTEGER NOT NULL DEFAULT 1,
sync_status TEXT NOT NULL DEFAULT 'pending', -- pending|active|paused|error
sync_error TEXT, -- last error message, if any
last_history_id TEXT, -- Gmail historyId checkpoint
last_synced_at TEXT,
backfill_complete INTEGER NOT NULL DEFAULT 0, -- 0 = initial backfill still running
backfill_cursor TEXT, -- nextPageToken during backfill
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(user_id) REFERENCES users(id)
);
CREATE INDEX IF NOT EXISTS idx_email_accounts_user ON email_accounts(user_id);
CREATE INDEX IF NOT EXISTS idx_email_accounts_sync ON email_accounts(sync_enabled, sync_status);
```
### 4b. `emails` — canonical email record (dedup'd across accounts)
One row per distinct RFC `Message-ID`. If two teammates are both on a thread, we store one `emails` row and link both via `email_account_messages`.
```sql
CREATE TABLE IF NOT EXISTS emails (
id TEXT PRIMARY KEY, -- uuid
rfc_message_id TEXT NOT NULL UNIQUE, -- RFC 2822 Message-ID, dedup key
gmail_thread_id TEXT, -- primary Gmail thread id (first account that saw it)
rfc_thread_root_id TEXT, -- root of References chain (for cross-account threading)
subject TEXT,
from_email TEXT NOT NULL,
from_name TEXT,
to_emails_json TEXT NOT NULL DEFAULT '[]', -- JSON array
cc_emails_json TEXT NOT NULL DEFAULT '[]',
bcc_emails_json TEXT NOT NULL DEFAULT '[]',
reply_to TEXT,
sent_at TEXT NOT NULL, -- from Date: header, ISO8601
body_text TEXT, -- plain/text part
body_html TEXT, -- text/html part
snippet TEXT, -- Gmail-provided, useful for unmatched rows
in_reply_to TEXT, -- RFC Message-ID of parent
references_json TEXT DEFAULT '[]', -- parsed References: header
has_attachments INTEGER NOT NULL DEFAULT 0,
size_estimate INTEGER, -- bytes reported by Gmail
is_matched INTEGER NOT NULL DEFAULT 0, -- 1 if any investor link exists
match_status TEXT NOT NULL DEFAULT 'unmatched', -- unmatched|matched|skipped
raw_headers_json TEXT, -- full header dump for debugging/forensics
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_emails_thread ON emails(gmail_thread_id);
CREATE INDEX IF NOT EXISTS idx_emails_rfc_thread ON emails(rfc_thread_root_id);
CREATE INDEX IF NOT EXISTS idx_emails_from ON emails(from_email);
CREATE INDEX IF NOT EXISTS idx_emails_sent_at ON emails(sent_at);
CREATE INDEX IF NOT EXISTS idx_emails_matched ON emails(is_matched, sent_at);
```
> **Match-only storage note:** For unmatched emails we still write the `emails` row, but `body_text` / `body_html` / `raw_headers_json` are left NULL (only headers and snippet are persisted). A nightly job can prune unmatched rows older than 90 days if storage becomes a concern. See §12.
### 4c. `email_recipients` — exploded recipients for fast investor matching
Denormalized so we can index and JOIN directly.
```sql
CREATE TABLE IF NOT EXISTS email_recipients (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
address TEXT NOT NULL, -- lowercased, trimmed
display_name TEXT,
kind TEXT NOT NULL, -- 'from'|'to'|'cc'|'bcc'|'reply_to'
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_email_recipients_addr ON email_recipients(address);
CREATE INDEX IF NOT EXISTS idx_email_recipients_email ON email_recipients(email_id);
```
### 4d. `email_account_messages` — per-account sighting of an email
Same `emails` row may be visible to multiple teammates with different Gmail message IDs (each user's mailbox has its own copy). This join table records each sighting.
```sql
CREATE TABLE IF NOT EXISTS email_account_messages (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
account_id TEXT NOT NULL,
gmail_message_id TEXT NOT NULL, -- per-mailbox unique
gmail_thread_id TEXT NOT NULL,
labels_json TEXT DEFAULT '[]', -- Gmail labels from this account's view
is_sent INTEGER NOT NULL DEFAULT 0, -- 1 if SENT label present
first_seen_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE,
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE,
UNIQUE(account_id, gmail_message_id)
);
CREATE INDEX IF NOT EXISTS idx_eam_email ON email_account_messages(email_id);
CREATE INDEX IF NOT EXISTS idx_eam_account ON email_account_messages(account_id);
CREATE INDEX IF NOT EXISTS idx_eam_gmail_msg ON email_account_messages(gmail_message_id);
```
### 4e. `email_attachments` — attachments stored on disk
Attachment bytes live at `data/email_attachments/<email_id>/<attachment_id>-<sanitized_filename>`. DB stores metadata + content hash.
```sql
CREATE TABLE IF NOT EXISTS email_attachments (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
gmail_attachment_id TEXT NOT NULL, -- opaque Gmail handle
filename TEXT NOT NULL,
sanitized_filename TEXT NOT NULL, -- what's actually on disk
mime_type TEXT,
size_bytes INTEGER,
sha256_hex TEXT, -- integrity + dedup across emails
storage_path TEXT NOT NULL, -- relative to CRM_DATA_DIR
download_status TEXT NOT NULL DEFAULT 'pending', -- pending|downloaded|failed|skipped
download_attempts INTEGER NOT NULL DEFAULT 0,
download_error TEXT,
downloaded_at TEXT,
created_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_attach_email ON email_attachments(email_id);
CREATE INDEX IF NOT EXISTS idx_attach_sha ON email_attachments(sha256_hex);
```
### 4f. `email_threads` — thread roll-up (materialized for UI)
```sql
CREATE TABLE IF NOT EXISTS email_threads (
id TEXT PRIMARY KEY, -- uuid
gmail_thread_id TEXT, -- NULL if we only have RFC threading
rfc_thread_root_id TEXT, -- fallback / cross-account key
subject_normalized TEXT, -- lowercased + stripped of Re:/Fwd: prefixes
first_message_at TEXT,
last_message_at TEXT,
message_count INTEGER NOT NULL DEFAULT 0,
participant_count INTEGER NOT NULL DEFAULT 0,
participants_json TEXT DEFAULT '[]',
is_matched INTEGER NOT NULL DEFAULT 0,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_threads_gmail ON email_threads(gmail_thread_id) WHERE gmail_thread_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_threads_rfc_root ON email_threads(rfc_thread_root_id);
CREATE INDEX IF NOT EXISTS idx_threads_last_msg ON email_threads(last_message_at);
-- Back-link emails → threads
ALTER TABLE emails ADD COLUMN thread_id TEXT; -- wrapped in PRAGMA-based guard at migration time
CREATE INDEX IF NOT EXISTS idx_emails_thread_fk ON emails(thread_id);
```
### 4g. `email_investor_links` — matched investors
```sql
CREATE TABLE IF NOT EXISTS email_investor_links (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
-- Match target: exactly one of the following is non-null
fundraising_investor_id TEXT, -- fundraising_investors.id
fundraising_contact_id TEXT, -- fundraising_contacts.id
contact_id TEXT, -- contacts.id
organization_id TEXT, -- organizations.id (if matched by domain only)
matched_address TEXT NOT NULL, -- which email address produced the match
match_kind TEXT NOT NULL, -- 'exact_email'|'domain_match'|'manual'
match_confidence REAL NOT NULL DEFAULT 1.0, -- 0..1
created_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_eil_email ON email_investor_links(email_id);
CREATE INDEX IF NOT EXISTS idx_eil_investor ON email_investor_links(fundraising_investor_id);
CREATE INDEX IF NOT EXISTS idx_eil_fr_contact ON email_investor_links(fundraising_contact_id);
CREATE INDEX IF NOT EXISTS idx_eil_contact ON email_investor_links(contact_id);
```
### 4h. `email_sync_runs` — observability
```sql
CREATE TABLE IF NOT EXISTS email_sync_runs (
id TEXT PRIMARY KEY,
account_id TEXT NOT NULL,
kind TEXT NOT NULL, -- 'backfill'|'incremental'
started_at TEXT NOT NULL,
finished_at TEXT,
status TEXT NOT NULL, -- 'running'|'ok'|'error'|'partial'
messages_seen INTEGER NOT NULL DEFAULT 0,
messages_stored INTEGER NOT NULL DEFAULT 0,
attachments_saved INTEGER NOT NULL DEFAULT 0,
api_calls INTEGER NOT NULL DEFAULT 0,
retries INTEGER NOT NULL DEFAULT 0,
error TEXT,
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_sync_runs_account ON email_sync_runs(account_id, started_at);
```
---
## 5. Module layout
```
backend/
├── server.py # unchanged in Phase 1 (hooks added at end of Phase 1)
└── email_integration/
├── __init__.py # re-exports public API
├── README.md # setup instructions (copy of §3 in code-local form)
├── config.py # env-var loader + singleton Config()
├── credentials.py # CredentialProvider interface + DWD/OAuth impls
├── db.py # thin data-access layer; all SQL lives here
├── gmail_client.py # Gmail API wrapper w/ retry + rate-limit
├── parser.py # MIME/payload parser
├── matcher.py # investor matching logic
├── threads.py # thread roll-up logic
├── attachments.py # attachment download + disk storage
├── sync.py # orchestrator (backfill + incremental)
├── scheduler.py # background thread running periodic sync
├── routes.py # HTTP handlers (plug into CRMHandler)
├── errors.py # exception types + classification
├── crypto.py # AES-GCM wrapper for encrypting OAuth tokens
└── migrations/
└── 0001_email_tables.sql
```
---
## 6. Auth flow
### 6a. Credential abstraction
```python
class CredentialProvider(Protocol):
def access_token_for(self, email_address: str) -> AccessToken: ...
# AccessToken = namedtuple("AccessToken", ["token", "expires_at"])
class DWDCredentialProvider:
"""Service-account impersonation. No per-user state."""
class OAuthCredentialProvider:
"""Per-user refresh tokens stored encrypted in email_accounts.oauth_refresh_enc."""
```
`sync.py` and `gmail_client.py` only depend on `CredentialProvider`. Switching modes is a config flag.
### 6b. DWD flow (primary)
1. On startup: load service-account JSON from `CRM_GMAIL_SA_KEY_PATH`.
2. For each sync request, build a signed JWT claim set:
```
iss: <service account email>
sub: <target user email> ← impersonation
scope: gmail.readonly
aud: https://oauth2.googleapis.com/token
exp: now + 3600
iat: now
```
3. Sign RS256 with the service account's private key.
4. POST to `https://oauth2.googleapis.com/token` with `grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer&assertion=<jwt>`.
5. Cache the returned access token (typ. 1h lifetime) in `email_accounts.oauth_token_enc` + `oauth_token_exp`. If expired/missing, re-mint.
We implement this ourselves with `cryptography` rather than pulling `google-auth` as a hard dependency — keeps the vendor footprint small. If you want to use `google-auth` instead, swap `_mint_dwd_access_token()` for `google.oauth2.service_account.Credentials.with_subject(...)` — one-line change.
### 6c. OAuth fallback flow
Endpoints scaffolded but inactive by default:
- `GET /api/email/oauth/start?account_email=<email>` → redirects to Google consent screen with `state` CSRF token, `access_type=offline`, `prompt=consent`.
- `GET /api/email/oauth/callback?code=...&state=...` → exchanges code for tokens, encrypts refresh token with AES-GCM using `CRM_GMAIL_SECRET_KEY`, stores in `email_accounts.oauth_refresh_enc`.
- Access tokens are refreshed on-demand just before any Gmail call if `oauth_token_exp` is within 60s of expiring.
Refresh tokens **never** leave the DB in plaintext; decryption happens in memory for the duration of a single API call and is then discarded.
### 6d. Token storage security
- At rest: `oauth_refresh_enc` is AES-256-GCM with a random per-record nonce. Key material is `CRM_GMAIL_SECRET_KEY` (separate from the CRM's JWT `CRM_SECRET_KEY` so you can rotate them independently).
- At rest (DWD): no tokens to store. Only the service account JSON is sensitive, and it lives at `0600` on the host.
- In transit: all Gmail API calls use HTTPS. Our HTTP client pins TLS 1.2+.
- Revocation:
- DWD: remove client ID from Workspace Admin console → instantly revokes all 5 accounts.
- OAuth: per-user `DELETE /api/email/accounts/:id/revoke` calls `https://oauth2.googleapis.com/revoke` then zeroes the DB row.
---
## 7. Sync algorithm
### 7a. Initial backfill (runs once per account)
When a new `email_accounts` row is created:
```
while backfill_complete == 0:
resp = GET /gmail/v1/users/me/messages
?pageToken=<backfill_cursor>&maxResults=500&q=
for msg_id in resp.messages:
fetch metadata (format=metadata, headers=[From,To,Cc,Subject,Date,
Message-ID,In-Reply-To,References])
decide match via matcher.is_match(addresses)
if matched:
fetch full (format=full)
parser.extract(...)
db.upsert_email(...), db.upsert_links(...)
enqueue attachment downloads
else:
db.upsert_email(headers_only=True)
db.upsert_account_sighting(account_id, msg_id, thread_id, labels)
backfill_cursor = resp.nextPageToken
if not backfill_cursor:
backfill_complete = 1
last_history_id = GET /users/me/profile → historyId
```
Batch the metadata fetches using Gmail's batch endpoint (up to 100 per batch) to stay well under quota.
### 7b. Incremental sync (runs every `CRM_GMAIL_SYNC_INTERVAL_MIN` minutes)
```
for account where sync_enabled AND backfill_complete:
try:
resp = GET /users/me/history?startHistoryId=<last_history_id>
for record in resp.history:
for msg_added in record.messagesAdded:
process_message(msg_added.message.id) # same as backfill
for label_change in record.labelsAdded/labelsRemoved:
update labels on email_account_messages (cheap)
for msg_deleted in record.messagesDeleted:
# Gmail delete ≠ CRM delete. We tombstone the sighting, keep email.
mark email_account_messages.deleted_at
last_history_id = resp.historyId
except HistoryExpiredError (HTTP 404):
# Gmail only keeps history for ~7 days. If we fall behind, trigger
# a partial re-backfill using q=after:<last_synced_at>.
reset_to_date_backfill(account)
```
**Idempotency guarantee:** `email_account_messages (account_id, gmail_message_id)` is unique. `emails.rfc_message_id` is unique. A retry can't double-insert. The `INSERT OR IGNORE` patterns in `db.py` make this explicit.
### 7c. Scheduler
A single daemon thread started from `server.py` startup:
```python
# scheduler.py
def run_forever():
while not _stop.is_set():
try:
run_all_accounts()
except Exception as e:
log.exception("sync loop failed")
_stop.wait(CONFIG.sync_interval_sec)
```
Within `run_all_accounts()`, accounts are processed **sequentially** (not in parallel) because:
- Our sqlite setup has one writer.
- It keeps the per-user-quota-per-second math simple (250 units/user/sec per account, no cross-account contention).
- Serial processing at ~30s/account × 5 = 2.5 min per cycle, which is well within a 3-hour interval.
Manual trigger: `POST /api/email/sync/run-now` (admin-only) skips the sleep and runs one cycle immediately.
---
## 8. Parsing
Gmail returns messages as a recursive tree of MIME parts. `parser.py` walks the tree and produces a `ParsedEmail` dataclass.
Headers we extract:
- `From` (single address, parsed with `email.utils.getaddresses`)
- `To`, `Cc`, `Bcc`, `Reply-To` (address lists)
- `Subject`
- `Date` (parsed with `email.utils.parsedate_to_datetime`, stored UTC ISO8601)
- `Message-ID`
- `In-Reply-To`
- `References` (space-separated → JSON array)
Body extraction rules:
1. Prefer `text/plain` part; fall back to HTML-stripped text from `text/html` using stdlib `html.parser.HTMLParser`.
2. Keep `text/html` alongside `text/plain` when both exist (both columns stored).
3. Ignore inline images for body purposes; they get stored as attachments with `content_disposition='inline'`.
4. Decode base64url bodies correctly (Gmail uses `-_` alphabet with no padding).
5. Normalize line endings to `\n`.
Attachments are identified by any MIME part with a non-empty `filename` or `Content-Disposition: attachment`. Attachment bodies are **not** inlined in the message `get` response when it's large — Gmail returns an opaque `attachmentId`, and we fetch them separately via `messages.attachments.get`.
Edge cases explicitly handled:
- Malformed `Date:` headers (fall back to `internalDate` from Gmail response, always present).
- Missing `Message-ID` header (generate synthetic: `synthetic-<gmail_message_id>@ten31.local`).
- Duplicate `Message-ID` across mailboxes (expected — primary dedup path).
- RFC 2047 encoded subjects (`=?UTF-8?B?...?=`) — `email.header.decode_header`.
---
## 9. Matching
### 9a. Source of truth for investor addresses
At matcher startup and periodically (every 15 min), we build an in-memory index:
```python
INVESTOR_EMAIL_INDEX = {
"alice@acme.vc": {
"kind": "fundraising_contact",
"id": "fc_abc",
"investor_id": "fi_xyz",
"investor_name": "Acme Capital"
},
...
}
INVESTOR_DOMAIN_INDEX = {
"acme.vc": [("fi_xyz", "Acme Capital"), ...]
}
```
Pulled from:
1. `fundraising_contacts.email` (primary LP list)
2. `contacts.email` (general CRM contacts)
3. `organizations.email` and domain of `organizations.website` (weak domain match only)
### 9b. Match algorithm
For each parsed email, collect the set of participant addresses (from + to + cc + bcc), **excluding** the owning account's own address (we don't match our own team).
```python
def match(participants: set[str]) -> list[InvestorLink]:
links = []
for addr in participants:
if addr in INVESTOR_EMAIL_INDEX:
entry = INVESTOR_EMAIL_INDEX[addr]
links.append(InvestorLink(
match_kind="exact_email",
confidence=1.0,
matched_address=addr,
**entry.targets
))
if not links: # only fall back to domain if no exact match
for addr in participants:
domain = addr.split("@", 1)[-1].lower()
if domain in INVESTOR_DOMAIN_INDEX and domain not in COMMON_PERSONAL_DOMAINS:
for inv_id, inv_name in INVESTOR_DOMAIN_INDEX[domain]:
links.append(InvestorLink(
match_kind="domain_match",
confidence=0.6,
matched_address=addr,
fundraising_investor_id=inv_id
))
return dedupe(links)
```
`COMMON_PERSONAL_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com", "icloud.com", ...}` — we don't domain-match on these.
Match status flows:
- 0 links → `emails.match_status = 'unmatched'`, body not stored.
- ≥1 exact_email link → `match_status = 'matched'`, full body + attachments stored.
- Only domain_match links → `match_status = 'matched'` with confidence 0.6, full body stored but flagged for review in UI.
### 9c. Re-matching
If a new investor contact is added to the CRM, existing unmatched emails can be retroactively matched. `POST /api/email/rematch?since=<iso>` rebuilds the index and scans emails where `match_status='unmatched'` in the window. Full bodies are fetched from Gmail on demand for the newly matched ones (we still have the `gmail_message_id` via the sightings table).
---
## 10. Threading
Two-level strategy:
1. **Primary: Gmail `threadId`.** Within one mailbox, Gmail's grouping is authoritative. We trust it.
2. **Cross-account reconciliation: RFC `References` / `In-Reply-To`.** If Alice sees thread `thr_A` and Bob sees thread `thr_B` and they share RFC `Message-ID`s, they're really one conversation.
Algorithm:
```python
def resolve_thread(email_row):
# Step 1: normalize subject (strip "Re:", "Fwd:", whitespace)
subj_norm = normalize_subject(email_row.subject)
# Step 2: collect candidate thread keys
rfc_refs = email_row.references + [email_row.in_reply_to]
rfc_refs = [r for r in rfc_refs if r]
# Step 3: find any existing email that shares a References link
parent = db.find_email_by_any_rfc_id(rfc_refs + [email_row.rfc_message_id])
if parent and parent.thread_id:
return parent.thread_id
# Step 4: fall back to gmail_thread_id (first sighting wins)
if email_row.gmail_thread_id:
existing = db.find_thread_by_gmail_id(email_row.gmail_thread_id)
if existing:
return existing.id
# Step 5: create new thread
return db.create_thread(
gmail_thread_id=email_row.gmail_thread_id,
rfc_thread_root_id=rfc_refs[0] if rfc_refs else email_row.rfc_message_id,
subject_normalized=subj_norm,
first_message_at=email_row.sent_at
)
```
Thread roll-up (`email_threads.message_count`, `participants_json`, `last_message_at`) is recomputed opportunistically on each insert using a single `UPDATE ... FROM (SELECT ...)` — cheap at the volumes a 5-person team produces.
---
## 11. Attachments
### 11a. Storage layout
```
data/
├── crm.db
├── email_attachments/
│ └── <email_id[0:2]>/ # shard by first 2 chars of email_id to keep dirs small
│ └── <email_id>/
│ ├── <attachment_id>-pitch-deck.pdf
│ └── <attachment_id>-financials.xlsx
```
### 11b. Download flow
Run asynchronously from message parsing (so a giant attachment doesn't block the sync loop):
1. Parser records attachment stubs in `email_attachments` with `download_status='pending'`.
2. An attachment-download worker pulls up to N at a time (default 4), calls `messages.attachments.get`, streams to a temp file, computes SHA-256, renames to final path.
3. On success: `download_status='downloaded'`, `downloaded_at`, `sha256_hex`, `size_bytes` filled.
4. On failure: `download_status='failed'`, `download_attempts` incremented, `download_error` set. Retry with backoff up to 5 attempts.
5. Sanitize filenames aggressively: strip path separators, null bytes, control chars, truncate to 200 chars. Keep extension.
### 11c. Dedup and size limits
- Content-addressed by `sha256_hex`. If the same file already exists on disk (same hash), we don't re-download — we link the second `email_attachments` row to the same `storage_path`.
- Default per-file cap: 50 MB. Above that, we record metadata only with `download_status='skipped'` and a note. Configurable via `CRM_GMAIL_MAX_ATTACHMENT_MB`.
- Default total storage cap: 10 GB (logged warning at 80%).
### 11d. Retrieval
- `GET /api/email/attachments/:id/download` — authenticated, streams the file. 404 if status != `downloaded`.
- No direct filesystem exposure.
---
## 12. Errors, rate limits, backoff
### 12a. Error taxonomy (`errors.py`)
```python
class GmailError(Exception): ...
class AuthError(GmailError): ... # 401/403 — broken creds, needs operator attention
class RateLimitError(GmailError): ... # 429 or 403 rate_limit — backoff + retry
class TransientError(GmailError): ... # 500, 502, 503, 504, network — backoff + retry
class NotFoundError(GmailError): ... # 404 — message deleted, history expired
class HistoryExpiredError(NotFoundError): ... # specifically for expired startHistoryId
class PermanentError(GmailError): ... # 400 bad request etc — don't retry, log+skip
```
Every HTTP call is routed through `gmail_client._call()` which maps status codes → exception types → retry decision.
### 12b. Retry policy
Exponential backoff with full jitter:
- Initial delay: 1s
- Max delay: 60s
- Max attempts: 5 for transient/rate-limit, 0 for auth/permanent
- On the 6th rate-limit error in a row for an account, mark account `sync_status='paused'` and alert via audit log.
### 12c. Gmail quota budget
- Per-project: 1,000,000,000 units/day (enormous; we won't get near it).
- Per-user: 250 units/second. We pace to ~150 units/sec/account via a token bucket in `gmail_client.py`. With 5 accounts processed serially, that's naturally below the per-project burst limit of 1,200 units/sec.
Call costs:
- `messages.list` = 5
- `messages.get` = 5
- `messages.attachments.get` = 5
- `history.list` = 2
Typical incremental cycle for one active user (~50 new messages in 3h): ~260 units. Negligible.
### 12d. Data validation
All DB inserts go through `db.py` helpers that:
- Enforce required fields (from_email, sent_at, rfc_message_id).
- Cap TEXT field lengths (subject 998 bytes per RFC 5322, body_text 10 MB soft cap with truncation log).
- Coerce timestamps to UTC ISO8601.
- Wrap everything in a single transaction per email.
---
## 13. Security considerations
| Concern | Mitigation |
|---------|-----------|
| Service account key theft | File at 0600, not in git, `.gitignore` covers `data/secrets/`, key usable only for Gmail scope, restricted to Workspace domain by DWD. |
| SQL injection | All SQL uses parameterized queries; strict `?` placeholders. |
| Path traversal in attachment filenames | Filenames are sanitized (`_sanitize_filename`) and written to a hash-based path; original name is stored in DB only. |
| Rate-limit DoS on sync endpoint | `POST /api/email/sync/run-now` is admin-only and subject to existing `WRITE_RATE_LIMIT_PER_MIN`. |
| Exfiltration via rendered email HTML | UI must render in a sandboxed iframe with CSP; no remote image/script loads. (Frontend change, not in scope for this scaffolding.) |
| Secret exposure in logs | Access tokens and refresh tokens never logged. Error messages are redacted by a central `redact_secrets()` filter. |
| Backup includes email bodies | Existing backup system writes JSON of `fundraising_state` only — it does **not** include the new `emails_*` tables. Future decision: include encrypted? See §17. |
---
## 14. Integration plan — zero-breakage activation
The scaffolded code is **inert** until three very small changes are made to `server.py`. Each is individually reversible.
### Patch 1: Table creation (append to `init_db()`)
```python
# server.py, near the end of init_db()
try:
from email_integration.db import apply_migrations
apply_migrations(cursor)
except ImportError:
pass # module not deployed yet, skip silently
```
### Patch 2: Scheduler startup (append to `main()` before `server_forever()`)
```python
# server.py, in main() after start_backup_scheduler()
if os.environ.get("CRM_GMAIL_INTEGRATION_ENABLED", "").lower() in ("1","true","yes"):
from email_integration.scheduler import start_sync_scheduler
start_sync_scheduler()
```
### Patch 3: Route dispatch hook (in `CRMHandler.do_GET` / `do_POST`, before the 404)
```python
# server.py, before the 404 fallthrough in CRMHandler
from email_integration.routes import try_handle
if try_handle(self): # returns True if it claimed the request
return
```
All three are guarded by the env var `CRM_GMAIL_INTEGRATION_ENABLED`. Default is off. Setting it off instantly disables the integration on restart. Migrations are idempotent so there's no "partially-migrated" state.
---
## 15. Testing strategy
### 15a. Unit tests (pure Python, no network)
- `test_parser.py` — fixture Gmail payloads covering: plain-only, html-only, multipart/alternative, multipart/mixed with attachments, RFC 2047 subjects, malformed dates, missing Message-ID.
- `test_matcher.py` — exact match, domain match, common-personal-domain exclusion, team-own-address exclusion, dedup.
- `test_threads.py` — new thread, Gmail-only thread, RFC-only thread, cross-account reconciliation.
- `test_db.py` — migrations run on empty DB, re-run is no-op, all upserts idempotent.
- `test_crypto.py` — encrypt/decrypt roundtrip, wrong key fails, nonce uniqueness.
### 15b. Integration tests (mocked HTTP)
Use `responses` lib or a tiny in-proc mock server to simulate Gmail API responses:
- Full backfill of 200 messages across 5 pages.
- Incremental history with messagesAdded + labelsAdded + messagesDeleted.
- Expired `startHistoryId` triggers date-backfill fallback.
- 429 triggers retry; 6 consecutive 429s pauses account.
- Attachment download happy path + large-file skip.
### 15c. Manual smoke test before production
1. Enable feature flag for a single test account first (`grant@ten31.xyz` only).
2. Run initial backfill, confirm no errors in `email_sync_runs`.
3. Check sample of matched emails in UI against Gmail source.
4. Verify attachments download and hash matches.
5. After 24h of incremental sync, enroll remaining 4 teammates.
---
## 16. Rollout plan
| Week | Milestone |
|------|-----------|
| 1 | Review this doc. Adjust schema / decisions as needed. |
| 2 | Implement scaffolded modules (code is already stubbed — fill in bodies). Unit tests. |
| 3 | Integration tests against a GCP sandbox project. |
| 4 | Apply server.py patches behind flag, enable for Grant's account only. Monitor for 1 week. |
| 5 | Enroll remaining 4 teammates. Document ops runbook. |
| 6+ | Phase 2: UI for threads, search, manual match override. |
---
## 17. Open questions / deferred decisions
1. **Backup inclusion.** Should the existing nightly backup cover `emails_*` tables? Pro: completeness. Con: bodies contain sensitive content; backups currently written unencrypted. Recommend adding a separate encrypted email backup stream rather than merging into `fundraising_backup`.
2. **Retention policy for unmatched emails.** Current plan: keep metadata indefinitely, never store body. Alternative: delete metadata after 90 days to reduce DB size. User preference needed.
3. **Real-time push.** Gmail supports `users.watch` + Pub/Sub to get sub-second notifications. Out of scope for Phase 1 (requires ingress URL, Pub/Sub, harder on a self-hosted box). Revisit if latency becomes an issue.
4. **Contact auto-creation.** Should an email from a new `@foo.vc` address automatically create a `contacts` row? Currently no — matching is passive. Easy to add later.
5. **Sent email enrichment.** When a teammate sends email to a known investor, should it create/update a CRM task (e.g., "awaiting reply")? Phase 3 roadmap item.
6. **Multi-domain Workspace.** If ten31.xyz ever adds a second domain (ten31.io etc), DWD needs updating. Document for future.
---
## Appendix A: Example end-to-end flow
A concrete walkthrough to ground the design.
**Scenario:** Alice from Acme Capital emails Grant with a fundraising question, cc'ing her partner Bob.
1. **T+0.** Alice sends `Subject: "Re: Q2 allocation"` to `grant@ten31.xyz`, cc `bob@acme.vc`.
2. **T+1 min.** Gmail delivers. Historically this is invisible to the CRM.
3. **T+1-180 min.** Sync scheduler wakes up. For Grant's account, `history.list(startHistoryId=<checkpoint>)` returns this new message.
4. **Metadata fetch.** `messages.get(id=..., format=metadata)` returns headers. Participant set = `{alice@acme.vc, grant@ten31.xyz, bob@acme.vc}`. Excluding Grant's own address → `{alice@acme.vc, bob@acme.vc}`.
5. **Match.** `INVESTOR_EMAIL_INDEX[alice@acme.vc]` → Acme Capital (fundraising_investor_id=`fi_xyz`). Exact match, confidence 1.0.
6. **Full fetch.** `messages.get(id=..., format=full)` returns body + attachment stubs.
7. **Parse.** `parser.extract` produces `ParsedEmail(from=alice, to=[grant], cc=[bob], subject="Re: Q2 allocation", body_text="...", attachments=[{name:"term_sheet.pdf", gmail_attachment_id:"..."}])`.
8. **DB writes (one transaction):**
- `INSERT INTO emails` with `rfc_message_id=<CAA...@mail.gmail.com>`, `match_status='matched'`.
- `INSERT INTO email_recipients` × 3.
- `INSERT INTO email_account_messages` with Grant's account + Gmail message id.
- `INSERT INTO email_investor_links` × 1 (to fi_xyz).
- `INSERT INTO email_attachments` with `download_status='pending'`.
- `resolve_thread()` finds existing thread via References, updates `email_threads.last_message_at`.
9. **Attachment download.** Worker picks up the pending attachment, calls `messages.attachments.get`, writes `data/email_attachments/<emailid[:2]>/<emailid>/<attid>-term_sheet.pdf`, updates row with hash + status.
10. **Sightings.** If Bob is also a CRM user with an enrolled mailbox, his sync run also picks up his copy of the same email. `rfc_message_id` already exists → `INSERT INTO email_account_messages` only (no new `emails` row). Emerges as a second sighting.
11. **Result in UI.** Acme Capital's investor page shows a new email in the conversation feed, threaded with prior Q2 allocation messages. Term sheet attachment is one click away, served from local disk.
---
## Appendix B: What gets scaffolded now vs. filled in later
The scaffolded modules (`backend/email_integration/*.py`) include:
- All module-level docstrings explaining purpose.
- Complete type signatures and dataclasses.
- Full SQL for migrations.
- Working implementations for: config loading, credential abstraction, DB migration application, HTTP route dispatch.
- Skeleton implementations with `TODO` markers for: full Gmail API call bodies, MIME tree walking, retry loops.
Lines of code delivered in scaffolding: ~1,500. Remaining to implement: ~1,000 (mostly the Gmail client and parser edge cases). Nothing in `server.py` or `crm.db` is changed.