# Gmail Integration: Enablement Runbook

*How to turn on the (already-built) Gmail correspondence integration on the live Start9 box, validate it with a small observed backfill, then roll out to the domain. Read-only capture; all mail stays on Ten31 infrastructure.*

Code: `backend/email_integration/`. Schema: `migrations/0001_email_tables.sql`. See `docs/crm-overview.md` §2.4 for the data model.

---

## What this does & the sovereignty posture

- Pulls Gmail messages for enrolled `@ten31.xyz` mailboxes into the CRM's own SQLite DB (`emails`, `email_threads`, `email_attachments`, …), **deduped across inboxes**, **threaded**, and **matched** to investors/contacts (`email_investor_links`).
- **Scope is `https://www.googleapis.com/auth/gmail.readonly`** (`credentials.py:34`): the integration can *read* mail but never send or modify it. That lowers the risk, and it is all the ingest needs.
- **Data path is Google → your Start9 box only.** No new third party is involved, and per guardrail #9 Claude never reads the mail: the correspondence becomes ingest input for *local* embeddings (bge-m3 on the Sparks), not API context. (Contrast with Superhuman's MCP; see §7.)

## 0. Pick the auth method

| Method | When | What you provide |
|---|---|---|
| **DWD (domain-wide delegation)** (recommended) | You administer the `ten31.xyz` Google Workspace and want to capture team mailboxes without per-user consent | One service-account JSON key + a Workspace admin authorization |
| **Per-user OAuth** | Capturing a mailbox you don't admin, or avoiding DWD | OAuth client id/secret + each user clicks through `/api/email/oauth/start` |

The Start9 0.4 entrypoint is built around **DWD** (auto-detects the key, sets `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`). The rest of this runbook assumes DWD.

## 1. Google-side setup (one time)

You need Workspace **super-admin** rights and a GCP project.

1. **GCP project** → enable the **Gmail API** (`APIs & Services → Library → Gmail API → Enable`).
2. **Create a service account** (`IAM & Admin → Service Accounts`). Note its **client ID** (a long number) and its email.
3. **Create a JSON key** for it (`Keys → Add key → JSON`). This file is the secret; handle it per guardrail #7.
4. **Authorize domain-wide delegation** in the Workspace **Admin console** (`Security → Access and data control → API controls → Domain-wide delegation → Add new`):
   - **Client ID** = the service account's client ID from step 2.
   - **OAuth scopes** = `https://www.googleapis.com/auth/gmail.readonly`
   - Save. (Without this exact scope authorized, sync returns a non-retryable auth error; see `errors.py:21`.)

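
Before copying the key to the box, it can help to confirm that the downloaded file really is a service-account key and that it carries the numeric client ID that step 4 asks for. The following is a minimal sketch, not part of the integration: the helper names `check_service_account_key` and `check_key_file` are hypothetical, and the field list covers only the standard fields of a Google service-account JSON key that this runbook relies on.

```python
import json

# Standard fields in a Google service-account JSON key that this runbook depends on.
REQUIRED_FIELDS = ("type", "client_id", "client_email", "private_key")


def check_service_account_key(key: dict) -> list[str]:
    """Return a list of problems found in a parsed key (empty list means it looks sane)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)]

    kind = key.get("type")
    if kind and kind != "service_account":
        problems.append(f"unexpected type: {kind!r} (expected 'service_account')")

    # The Admin console's domain-wide delegation form wants the numeric client ID.
    client_id = str(key.get("client_id") or "")
    if client_id and not client_id.isdigit():
        problems.append("client_id is not numeric")

    return problems


def check_key_file(path: str) -> list[str]:
    """Hypothetical convenience wrapper: load the JSON file and check it."""
    with open(path, encoding="utf-8") as fh:
        return check_service_account_key(json.load(fh))
```

An empty list from `check_key_file("gmail-service-account.json")` only says the file is well-formed; it does not prove that the delegation in step 4 has been authorized.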
## 2. Install the key on Start9

1. Copy the JSON key to the service's data volume at **`/data/secrets/gmail-service-account.json`**.
2. Lock it down: `chmod 600 /data/secrets/gmail-service-account.json` (the entrypoint also `chmod 700`s `/data/secrets`).
3. **Restart the service.** On boot the 0.4 entrypoint detects the key and exports:

   `CRM_GMAIL_INTEGRATION_ENABLED=true`, `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_SA_KEY_PATH=/data/secrets/gmail-service-account.json`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`, `CRM_GMAIL_SYNC_INTERVAL_MIN=180`.

   It logs `Gmail integration: ENABLED (key at …)`.

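
The permission steps above can be double-checked with a short script. This is a sketch under the assumption that "locked down" means no group or other access on either the key file or its directory. The helper name `check_key_permissions` is hypothetical and is not something the entrypoint provides.

```python
import os
import stat


def mode_of(path: str) -> int:
    """Return only the permission bits of a path, e.g. 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode)


def check_key_permissions(key_path: str) -> list[str]:
    """Return a list of permission problems for the key file and its parent directory."""
    if not os.path.isfile(key_path):
        return [f"key not found: {key_path}"]

    problems = []
    # 0o077 masks the group and other bits; any of them set means the file is too open.
    if mode_of(key_path) & 0o077:
        problems.append(f"{key_path} is accessible by group/others (want 600)")

    parent = os.path.dirname(key_path) or "."
    if mode_of(parent) & 0o077:
        problems.append(f"{parent} is accessible by group/others (want 700)")

    return problems
```

Calling `check_key_permissions("/data/secrets/gmail-service-account.json")` on the box should return an empty list once steps 1 and 2 are done.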
## 3. Smoke test: ONE mailbox first (the "don't rush it" gate)

Do a single-mailbox run before enrolling the whole team, to shake out auth/matching bugs on a small surface. All calls need an **admin Bearer token**:

```bash
CRM='https://<your-start9-crm-host>'   # the CRM's address (quoted so the shell does not treat < > as redirection)

# -k skips TLS certificate verification; drop it if the host presents a trusted certificate
TOKEN=$(curl -sk "$CRM/api/auth/login" -H 'Content-Type: application/json' \
  -d '{"username":"<admin>","password":"<pw>"}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["token"])')

# integration alive?
curl -sk "$CRM/api/email/status" -H "Authorization: Bearer $TOKEN"

# enroll just yourself
curl -sk "$CRM/api/email/accounts/enroll" -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' -d '{"email":"you@ten31.xyz"}'

# trigger a sync now (otherwise it runs every 180 min)
curl -sk "$CRM/api/email/sync/run-now" -X POST -H "Authorization: Bearer $TOKEN"
```

**Tip:** to keep the first backfill small, set `CRM_GMAIL_BACKFILL_PAGE_SIZE` low (e.g. `50`) before the restart in §2, watch one page land, then raise it.

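
If the smoke test is going to be repeated, the same three calls can be scripted. The sketch below only builds the requests with the standard library and does not send them. The paths and JSON body are taken from the curl commands above, while the helper names `api_request` and `smoke_test_requests` are hypothetical.

```python
import json
import urllib.request


def api_request(base_url, path, token=None, payload=None, method=None):
    """Build (but do not send) a request against the CRM API."""
    headers = {}
    data = None
    if token:
        headers["Authorization"] = f"Bearer {token}"
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    # With method=None, urllib uses POST when a body is present and GET otherwise.
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def smoke_test_requests(base_url, token, mailbox):
    """The status, enroll, and run-now calls from the smoke test, in order."""
    return [
        api_request(base_url, "/api/email/status", token),
        api_request(base_url, "/api/email/accounts/enroll", token, {"email": mailbox}),
        api_request(base_url, "/api/email/sync/run-now", token, method="POST"),
    ]
```

Each request can then be passed to `urllib.request.urlopen`. Unlike `curl -k`, `urlopen` verifies TLS certificates by default, so a self-signed host needs an explicitly configured `ssl.SSLContext`.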
## 4. Verify (on the box, read-only SQL)

```sql
-- sync ran cleanly?
SELECT kind, status, messages_seen, messages_stored, attachments_saved, error
FROM email_sync_runs ORDER BY started_at DESC LIMIT 3;

-- mail captured + how much got matched to investors/contacts
SELECT COUNT(*) AS total, SUM(is_matched) AS matched FROM emails;

-- who did it match, and how confidently?
SELECT match_kind, COUNT(*) FROM email_investor_links GROUP BY match_kind;
```

Or via the API: `GET /api/email/status` (counts) and `GET /api/email/threads?investor_id=<id>` (matched threads for one investor). If matching looks thin, run `POST /api/email/rematch` with `{"since":"<ISO8601>"}` after the investor list is populated.

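
The match-rate check can also be run from Python while keeping the database strictly read-only. This is a minimal sketch: it assumes only the `emails.is_matched` column used in the SQL above, and the helper names `open_read_only` and `match_summary` are hypothetical.

```python
import sqlite3


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open a SQLite file in read-only mode so a verification run cannot modify it."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def match_summary(conn: sqlite3.Connection) -> dict:
    """Return total emails, matched emails, and the matched fraction."""
    total, matched = conn.execute(
        # COALESCE covers the empty-table case, where SUM() returns NULL.
        "SELECT COUNT(*), COALESCE(SUM(is_matched), 0) FROM emails"
    ).fetchone()
    rate = matched / total if total else 0.0
    return {"total": total, "matched": matched, "match_rate": rate}
```

The location of `crm.db` on the box is not stated in this runbook, so pass whichever path applies to `open_read_only`.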
## 5. Roll out to the domain

Once the single mailbox looks right:

```bash
curl -sk "$CRM/api/email/accounts/enroll-all" -X POST -H "Authorization: Bearer $TOKEN"
curl -sk "$CRM/api/email/sync/run-now" -X POST -H "Authorization: Bearer $TOKEN"
```

Incremental sync then runs every `CRM_GMAIL_SYNC_INTERVAL_MIN` (default 180) via the scheduler thread.

## 6. Tuning knobs (env, `config.py`)

Defaults are shown in parentheses:

- `CRM_GMAIL_SYNC_INTERVAL_MIN` (180)
- `CRM_GMAIL_BACKFILL_PAGE_SIZE` (500)
- `CRM_GMAIL_MAX_ATTACHMENT_MB` (50)
- `CRM_GMAIL_ATTACH_CONCURRENCY` (4)
- `CRM_GMAIL_RATE_UNITS_SEC` (150)
- `CRM_GMAIL_HISTORY_STALE_DAYS` (5; forces a backfill if Gmail pruned history)

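
To see which values a given environment would yield, the knobs can be resolved against their documented defaults. This sketch is not the project's `config.py`: the loader `load_gmail_knobs` is hypothetical, and treating every knob as a positive integer is an assumption based on the defaults listed above.

```python
import os

# Defaults as documented in this runbook.
DEFAULTS = {
    "CRM_GMAIL_SYNC_INTERVAL_MIN": 180,
    "CRM_GMAIL_BACKFILL_PAGE_SIZE": 500,
    "CRM_GMAIL_MAX_ATTACHMENT_MB": 50,
    "CRM_GMAIL_ATTACH_CONCURRENCY": 4,
    "CRM_GMAIL_RATE_UNITS_SEC": 150,
    "CRM_GMAIL_HISTORY_STALE_DAYS": 5,
}


def load_gmail_knobs(env=None) -> dict:
    """Resolve each knob from the environment, falling back to the documented default."""
    env = os.environ if env is None else env
    knobs = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        if raw is None or raw.strip() == "":
            knobs[name] = default
            continue
        value = int(raw)  # raises ValueError if the override is not an integer
        if value <= 0:
            # Assumption: zero or negative values are never meaningful for these knobs.
            raise ValueError(f"{name} must be a positive integer, got {raw!r}")
        knobs[name] = value
    return knobs
```

For the small first backfill suggested in §3, `load_gmail_knobs({"CRM_GMAIL_BACKFILL_PAGE_SIZE": "50"})` resolves the page size to 50 and leaves the other knobs at their defaults.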
## 7. Where Superhuman fits (and where it doesn't)

You have Superhuman connected to Gmail, and it exposes an MCP server. The two are **complementary, not competing**, and it matters which job each does:

- **Canonical correspondence ingest → use this DWD integration, not Superhuman.** It pulls mail straight into your own `crm.db` on Start9 and feeds the *local* embedding pipeline. Routing bulk ingest through Superhuman's MCP would put your email content through Superhuman's servers and, because an agent/Claude would be driving those calls, through Anthropic, which is exactly what guardrail #1 keeps the corpus away from. DWD keeps the data path Google → your box.
- **Human mail workflow & drafting → Superhuman MCP is great.** Reading/triaging your own inbox, and Closer-style *draft* generation that a human reviews and sends, naturally happen in your real mail client. The `batch-draft-writer` skill already drives the Superhuman MCP for that, and it's usable today, independent of the CRM pipeline.

Net: **DWD = system-of-record correspondence (sovereign, for retrieval). Superhuman MCP = the human's working surface (drafting, triage).** Don't make Superhuman the ingest source of truth.

## 8. Disable / rollback

Remove (or rename) `/data/secrets/gmail-service-account.json` and restart → the entrypoint logs `DISABLED` and routes return 503; captured data remains. To pause one mailbox without disabling the whole integration, set its `email_accounts.sync_enabled = 0`.

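
Pausing a single mailbox comes down to one `UPDATE`. The sketch below shows it with the standard `sqlite3` module. The table and the `sync_enabled` column come from the text above, but the `email` column used to select the row is an assumption about the schema, and the helper `set_sync_enabled` is hypothetical.

```python
import sqlite3


def set_sync_enabled(conn: sqlite3.Connection, address: str, enabled: bool) -> int:
    """Pause (enabled=False) or resume (enabled=True) sync for one mailbox.

    Returns the number of rows changed, so 0 means no such account was found.
    """
    cur = conn.execute(
        # Assumption: email_accounts identifies the mailbox in a column named `email`.
        "UPDATE email_accounts SET sync_enabled = ? WHERE email = ?",
        (1 if enabled else 0, address),
    )
    conn.commit()
    return cur.rowcount
```

Check the column names in `migrations/0001_email_tables.sql` before running anything like this against the live database.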
## 9. Troubleshooting

- **401/403 from Google on sync** → DWD scope not authorized, wrong client ID, or Gmail API not enabled (§1, steps 1 and 4). This error is non-retryable by design (`errors.py`).
- **`status` says disabled / routes 503** → key not found at `CRM_GMAIL_SA_KEY_PATH`, or `CRM_GMAIL_INTEGRATION_ENABLED` not truthy (the entrypoint only sets it when the key file exists).
- **Mail captured but `matched = 0`** → the investor/contact list was empty or addresses don't match; populate the CRM/grid first, then `POST /api/email/rematch`.
- **Bodies missing on some emails** → by design, unmatched emails are stored metadata-only (no body) until matched (`sync.py`); re-match to backfill.

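
The retryable/non-retryable split in the first bullet can be illustrated with a small classifier. This is a sketch and not the contents of `errors.py`: only the 401/403 rule comes from this runbook. Treating 429 and 5xx responses as retryable, and the exponential backoff schedule, are assumptions based on common practice.

```python
# From this runbook: auth failures are never retried.
NON_RETRYABLE_STATUSES = {401, 403}


def is_retryable(status: int) -> bool:
    """Decide whether a failed Gmail API call is worth retrying."""
    if status in NON_RETRYABLE_STATUSES:
        return False
    # Assumption: rate limiting (429) and server-side errors (5xx) are transient.
    return status == 429 or 500 <= status <= 599


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff delays in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```

With this split, an auth error surfaces immediately and points back to the §1 setup, while transient failures are retried with increasing delays.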