Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server

Workstream A–C substrate for the Ten31 agentic system:
- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links,
  interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db,
  drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity
queries). Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Keysat
2026-06-05 08:11:28 -05:00
parent 7027efd777
commit c7ce44d963
99 changed files with 10676 additions and 7817 deletions
+175
View File
@@ -0,0 +1,175 @@
# `email_integration` — Gmail capture for the Venture CRM
Scaffolded Phase 1 of the Gmail integration described in
`GMAIL_INTEGRATION_ARCHITECTURE.md` (repo root). Everything in this module is
isolated from `server.py` until you wire it in explicitly.
## Contents
| File | Purpose |
|------|---------|
| `config.py` | Env-var loader; exposes `CONFIG` singleton. |
| `errors.py` | Exception taxonomy used by the retry loop. |
| `crypto.py` | AES-GCM wrapper for OAuth refresh-token encryption (only used in OAuth mode). |
| `credentials.py` | `CredentialProvider` protocol + `DWDCredentialProvider` / `OAuthCredentialProvider`. |
| `gmail_client.py` | Gmail API HTTP wrapper (rate limit, retry, pagination). |
| `db.py` | All SQL touching `emails_*` tables. Migrations live under `migrations/`. |
| `parser.py` | Gmail payload → canonical dict (headers, body, attachments). |
| `matcher.py` | Investor address index + match logic. |
| `threads.py` | Thread resolution using Gmail threadId + RFC References. |
| `attachments.py` | Stub rows + on-disk storage + download worker. |
| `sync.py` | Orchestrator for backfill + incremental sync of one account. |
| `scheduler.py` | Background thread that runs `sync.sync_all` on an interval. |
| `routes.py` | HTTP handlers under `/api/email/*` compatible with `CRMHandler`. |
| `migrations/0001_email_tables.sql` | Table DDL. |
## Wiring it in
All changes are in `backend/server.py`, all guarded by an env flag. Each is
independently revertible. None run unless `CRM_GMAIL_INTEGRATION_ENABLED=true`.
**Patch 1 — migrations** (append to `init_db()` after all existing
`cursor.executescript(...)` calls, before `conn.commit()`):
```python
try:
from email_integration.db import apply_migrations
apply_migrations(cursor)
except ImportError:
pass
```
**Patch 2 — scheduler** (in `main()`, after `start_backup_scheduler()`):
```python
if os.environ.get("CRM_GMAIL_INTEGRATION_ENABLED", "").lower() in ("1", "true", "yes", "on"):
from email_integration.scheduler import start_sync_scheduler
start_sync_scheduler()
```
**Patch 3 — routes** (add near the top of `CRMHandler.do_GET` and
`CRMHandler.do_POST`, after auth/rate-limit pre-checks, before API routing):
```python
try:
from email_integration.routes import try_handle
if try_handle(self):
return
except ImportError:
pass
```
## Environment variables
```bash
# Master on/off. Default off; scheduler won't start, routes return 503.
CRM_GMAIL_INTEGRATION_ENABLED=true
# Auth method: "dwd" (default, recommended) or "oauth"
CRM_GMAIL_AUTH_METHOD=dwd
# DWD mode
CRM_GMAIL_SA_KEY_PATH=/path/to/CRM/data/secrets/gmail-service-account.json
CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz
# OAuth mode (fallback; not required for DWD)
CRM_GMAIL_OAUTH_CLIENT_ID=...
CRM_GMAIL_OAUTH_CLIENT_SECRET=...
CRM_GMAIL_OAUTH_REDIRECT_URI=https://crm.ten31.xyz/api/email/oauth/callback
CRM_GMAIL_SECRET_KEY=<base64-32-random-bytes> # for encrypting refresh tokens
# Sync
CRM_GMAIL_SYNC_INTERVAL_MIN=180 # default 3h
CRM_GMAIL_BACKFILL_PAGE_SIZE=500
CRM_GMAIL_MAX_ATTACHMENT_MB=50
CRM_GMAIL_ATTACH_CONCURRENCY=4
CRM_GMAIL_RATE_UNITS_SEC=150 # per account, leaves 40% headroom
CRM_GMAIL_RETRY_MAX=5
CRM_GMAIL_HISTORY_STALE_DAYS=5
```
## Google Cloud / Workspace setup (DWD)
See `GMAIL_INTEGRATION_ARCHITECTURE.md` §3 for the full runbook. Short form:
1. Create GCP project, enable Gmail API.
2. Create service account, download JSON key, enable domain-wide delegation.
3. In Google Admin console → Security → API controls → Manage domain-wide
delegation, authorize the service account's client ID with scope
`https://www.googleapis.com/auth/gmail.readonly`.
4. Copy the JSON key to `data/secrets/gmail-service-account.json`, `chmod 600`.
5. Set env vars in `.env.beta`, restart CRM.
6. As admin, POST `/api/email/accounts/enroll-all` to create `email_accounts`
rows for every active user whose email ends in the Workspace domain.
## Adding the crypto dependency (only for OAuth mode)
If you use OAuth fallback you need `cryptography`:
```
cryptography==42.0.5
```
Append to `backend/requirements.txt`. DWD mode also uses `cryptography` for
the RSA signing of the JWT bearer token — so if you enable the integration in
either mode, add the dep.
## Rollback
To disable instantly: set `CRM_GMAIL_INTEGRATION_ENABLED=false` and restart.
The scheduler won't start, routes return 503, DB tables remain (unused).
To remove completely: drop the env var, delete `data/email_attachments/`,
drop all `emails_*` tables and `email_*` tables (migration is idempotent
create-only; a separate drop script would be required — not provided in
Phase 1 since you said you're not rushing).
## Local development
The module has zero network dependencies when imported without the scheduler
starting. You can:
```python
python3 -c "from email_integration.parser import parse; \
import json; \
print(parse(json.load(open('fixture.json'))))"
```
## Testing checklist (before enabling in production)
- [ ] Enable `CRM_GMAIL_INTEGRATION_ENABLED=true` on a staging copy of the DB only.
- [ ] Verify migrations applied: `emails`, `email_accounts`, etc. present.
- [ ] Enroll one account (yours) via `/api/email/accounts/enroll`.
- [ ] Trigger `POST /api/email/sync/run-now`.
- [ ] Check `email_sync_runs` for `status='ok'`.
- [ ] Spot-check `emails` rows against Gmail.
- [ ] Verify an attachment downloaded correctly (hash and size).
- [ ] Let the scheduler run for 24 hours; monitor `/api/email/status`.
- [ ] Enroll remaining 4 teammates.
## What's scaffolded vs. TODO
**Scaffolded and complete:**
- Schema (migration 0001)
- Config and env parsing
- Error taxonomy + retry classifier
- AES-GCM crypto helpers
- DWD JWT minting + access token caching
- OAuth refresh + consent flow endpoints
- Gmail client (list/get/history/attachments/profile) with rate limit + retry
- Full DB data-access layer
- MIME parser including RFC 2047 subjects and HTML→text fallback
- Investor matcher with exact + domain strategies
- Thread resolution (Gmail threadId + RFC References cross-account)
- Attachment storage with SHA-256 dedup
- Sync orchestrator (backfill + incremental with history-expired fallback)
- Scheduler with manual-trigger hook
- HTTP routes (status, accounts, threads, enroll, run-now, rematch, oauth)
**TODO before production (see architecture doc §15):**
- Multipart batch metadata fetch in `gmail_client.batch_get_metadata`
(currently serial fallback).
- Unit tests (fixtures for parser, matcher, threads; integration tests with
responses-style HTTP mock).
- Frontend UI: a thread list + detail pane in `frontend/index.html`.
- Sandboxed HTML rendering for email bodies (out of scope here).
+15
View File
@@ -0,0 +1,15 @@
"""
Gmail Integration for Venture CRM.
Phase 1 scope: OAuth2/DWD authentication, incremental Gmail sync, MIME parsing,
investor matching, threading, attachment storage. All logic isolated to this
module; server.py integration is a 3-line patch guarded by
CRM_GMAIL_INTEGRATION_ENABLED.
See GMAIL_INTEGRATION_ARCHITECTURE.md at the repo root for full design.
"""
from . import config # noqa: F401
from . import errors # noqa: F401
__all__ = ["config", "errors"]
+234
View File
@@ -0,0 +1,234 @@
"""
Attachment download + on-disk storage.
Two usage patterns:
1. During message parsing we call `register_stubs(conn, email_id, parsed)`
which writes pending rows to email_attachments.
2. A separate worker (kicked off by sync after each account completes)
calls `drain_pending()` which fetches attachment bytes from Gmail and
writes them to disk under CONFIG.attachments_dir.
Files are named: <CRM_DATA_DIR>/email_attachments/<email_id[:2]>/<email_id>/<attachment_id>-<sanitized_filename>
Sanitization prevents path traversal and keeps cross-platform-safe names.
"""
import base64
import hashlib
import os
import re
import sqlite3
from typing import Iterable, Optional
from . import config as _cfg
from . import db as _db
from . import errors as _errors
from . import gmail_client as _gmail
_MAX_FILENAME_LEN = 200
_BAD_FILENAME_CHARS = re.compile(r'[/\\\x00-\x1f\x7f:*?"<>|]+')
def _sanitize_filename(name: str) -> str:
if not name:
return "unnamed.bin"
# strip path components first
name = os.path.basename(name.replace("\\", "/"))
name = _BAD_FILENAME_CHARS.sub("_", name).strip(" .")
if not name:
name = "unnamed.bin"
if len(name) > _MAX_FILENAME_LEN:
stem, dot, ext = name.rpartition(".")
if dot:
name = stem[: _MAX_FILENAME_LEN - len(ext) - 1] + "." + ext
else:
name = name[:_MAX_FILENAME_LEN]
return name
def _storage_path_for(email_id: str, attachment_id: str, sanitized_filename: str) -> str:
root = _cfg.CONFIG.attachments_dir
bucket = email_id[:2] or "_0"
dir_ = os.path.join(root, bucket, email_id)
os.makedirs(dir_, exist_ok=True)
return os.path.join(dir_, f"{attachment_id}-{sanitized_filename}")
# ---------------------------------------------------------------------------- phase 1: register stubs
def register_stubs(conn: sqlite3.Connection, *, email_id: str,
parsed_attachments: Iterable[dict]) -> list[str]:
"""Write pending attachment rows from parsed message data.
Also handles tiny inline attachments whose bytes arrived with the message
body (body.data present, no separate attachmentId) by writing them
directly and marking as downloaded.
Returns list of attachment ids created.
"""
max_bytes = _cfg.CONFIG.max_attachment_mb * 1024 * 1024
ids = []
for att in parsed_attachments:
filename = att.get("filename") or "unnamed.bin"
sanitized = _sanitize_filename(filename)
gmail_att_id = att.get("gmail_attachment_id") or ""
mime = att.get("mime_type")
size = att.get("size")
# Determine storage path (we write the path whether or not the download
# succeeded; missing files surface via download_status).
att_row_id = _db.insert_attachment_stub(
conn,
email_id=email_id,
gmail_attachment_id=gmail_att_id,
filename=filename,
sanitized_filename=sanitized,
mime_type=mime,
size_bytes=size,
storage_path=_storage_path_for(email_id, gmail_att_id or att_row_id_fallback(), sanitized),
)
ids.append(att_row_id)
# Oversize guard.
if isinstance(size, int) and size > max_bytes:
conn.execute(
"UPDATE email_attachments SET download_status = 'skipped', "
"download_error = ? WHERE id = ?",
(f"exceeds max size {_cfg.CONFIG.max_attachment_mb}MB", att_row_id),
)
continue
# Inline data fast-path.
inline_b64 = att.get("inline_data_b64")
if inline_b64:
try:
raw = base64.urlsafe_b64decode(_pad(inline_b64).encode("ascii"))
path = _storage_path_for(email_id, att_row_id, sanitized)
_write_bytes(path, raw)
sha = hashlib.sha256(raw).hexdigest()
conn.execute(
"UPDATE email_attachments SET storage_path = ? WHERE id = ?",
(path, att_row_id),
)
_db.mark_attachment_downloaded(
conn, att_row_id, sha256_hex=sha, size_bytes=len(raw)
)
except Exception as e:
_db.mark_attachment_failed(conn, att_row_id, error=f"inline decode: {e}")
return ids
def att_row_id_fallback() -> str:
# Placeholder so the path template always produces something if gmail_att_id
# was missing at stub time; the real path is rewritten when the worker
# picks it up.
import uuid
return uuid.uuid4().hex
# ---------------------------------------------------------------------------- phase 2: worker
def drain_pending(conn_factory, client: _gmail.GmailClient, account_id: str,
*, limit: int = 50) -> int:
"""Download up to `limit` pending attachments for `account_id`.
Returns count of successfully downloaded attachments. Called after each
account's sync completes so large files don't block the sync loop.
"""
conn = conn_factory()
try:
pending = _db.pending_attachments(conn, limit=limit)
finally:
conn.close()
downloaded = 0
for row in pending:
if row["account_id"] != account_id:
continue
conn = conn_factory()
try:
ok = _download_one(conn, client, row)
if ok:
downloaded += 1
conn.commit()
finally:
conn.close()
return downloaded
def _download_one(conn: sqlite3.Connection, client: _gmail.GmailClient, row) -> bool:
try:
resp = client.get_attachment(row["gmail_message_id"], row["gmail_attachment_id"])
except _errors.RETRYABLE as e:
_db.mark_attachment_failed(conn, row["id"], error=f"transient: {type(e).__name__}")
return False
except _errors.GmailError as e:
_db.mark_attachment_failed(conn, row["id"], error=f"{type(e).__name__}: {e}")
return False
data_b64 = resp.get("data")
if not data_b64:
_db.mark_attachment_failed(conn, row["id"], error="empty data in response")
return False
try:
raw = base64.urlsafe_b64decode(_pad(data_b64).encode("ascii"))
except Exception as e:
_db.mark_attachment_failed(conn, row["id"], error=f"decode: {e}")
return False
sha = hashlib.sha256(raw).hexdigest()
# If an existing attachment has the same SHA, re-point storage_path and skip write.
existing = _find_existing_by_sha(conn, sha, exclude_id=row["id"])
if existing:
conn.execute(
"UPDATE email_attachments SET storage_path = ? WHERE id = ?",
(existing["storage_path"], row["id"]),
)
_db.mark_attachment_downloaded(conn, row["id"], sha256_hex=sha, size_bytes=len(raw))
return True
path = _storage_path_for(row["email_id"], row["id"], row["sanitized_filename"])
try:
_write_bytes(path, raw)
except OSError as e:
_db.mark_attachment_failed(conn, row["id"], error=f"disk: {e}")
return False
conn.execute(
"UPDATE email_attachments SET storage_path = ? WHERE id = ?",
(path, row["id"]),
)
_db.mark_attachment_downloaded(conn, row["id"], sha256_hex=sha, size_bytes=len(raw))
return True
def _find_existing_by_sha(conn: sqlite3.Connection, sha: str, *, exclude_id: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute(
"SELECT * FROM email_attachments WHERE sha256_hex = ? AND id != ? "
"AND download_status = 'downloaded' LIMIT 1",
(sha, exclude_id),
)
return cur.fetchone()
# ---------------------------------------------------------------------------- utils
def _pad(b64: str) -> str:
pad = 4 - (len(b64) % 4)
return b64 + ("=" * pad if pad != 4 else "")
def _write_bytes(path: str, data: bytes) -> None:
os.makedirs(os.path.dirname(path), exist_ok=True)
tmp = path + ".tmp"
with open(tmp, "wb") as f:
f.write(data)
os.chmod(tmp, 0o600)
os.replace(tmp, path)
+112
View File
@@ -0,0 +1,112 @@
"""
Email integration configuration.
Reads from the same env-var surface as the rest of the CRM (server.py style),
no pydantic/dotenv magic — stdlib only.
"""
import os
from dataclasses import dataclass
from typing import Optional
# Reuse the CRM's data dir so backups and email storage live together.
_PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
_DEFAULT_DATA_DIR = os.path.join(_PROJECT_DIR, "data")
def _bool_env(name: str, default: bool = False) -> bool:
v = os.environ.get(name, "").strip().lower()
if v in ("1", "true", "yes", "on"):
return True
if v in ("0", "false", "no", "off"):
return False
return default
def _int_env(name: str, default: int) -> int:
try:
return int(os.environ.get(name, str(default)))
except (TypeError, ValueError):
return default
@dataclass(frozen=True)
class EmailConfig:
# Master kill switch. When False, scheduler doesn't start and routes
# return 503. Migrations are still applied (so schema is ready).
enabled: bool
# Primary auth path. "dwd" means service account / domain-wide delegation.
# "oauth" means per-user refresh tokens. DWD is default; OAuth is the
# pluggable fallback.
primary_auth: str
# DWD specifics
dwd_key_path: Optional[str]
workspace_domain: Optional[str]
# OAuth specifics (used for fallback + admin UI)
oauth_client_id: Optional[str]
oauth_client_secret: Optional[str]
oauth_redirect_uri: Optional[str]
# Encryption key (base64) for OAuth refresh-token-at-rest encryption.
# Required whenever oauth path is in use. DWD path never persists tokens.
secret_key_b64: Optional[str]
# Sync scheduling
sync_interval_sec: int
backfill_page_size: int
max_attachment_mb: int
max_parallel_attachment_downloads: int
# Storage
data_dir: str
attachments_dir: str
secrets_dir: str
# Rate limit / retry
rate_limit_units_per_sec_per_account: int
retry_max_attempts: int
retry_initial_delay_sec: float
retry_max_delay_sec: float
# Gmail history retention — if we fall this far behind, switch to date
# backfill since Gmail may have pruned history records.
history_stale_days: int
def load() -> EmailConfig:
data_dir = os.environ.get("CRM_DATA_DIR", _DEFAULT_DATA_DIR)
return EmailConfig(
enabled=_bool_env("CRM_GMAIL_INTEGRATION_ENABLED", False),
primary_auth=os.environ.get("CRM_GMAIL_AUTH_METHOD", "dwd").lower(),
dwd_key_path=os.environ.get("CRM_GMAIL_SA_KEY_PATH") or None,
workspace_domain=os.environ.get("CRM_GMAIL_WORKSPACE_DOMAIN") or None,
oauth_client_id=os.environ.get("CRM_GMAIL_OAUTH_CLIENT_ID") or None,
oauth_client_secret=os.environ.get("CRM_GMAIL_OAUTH_CLIENT_SECRET") or None,
oauth_redirect_uri=os.environ.get("CRM_GMAIL_OAUTH_REDIRECT_URI") or None,
secret_key_b64=os.environ.get("CRM_GMAIL_SECRET_KEY") or None,
sync_interval_sec=_int_env("CRM_GMAIL_SYNC_INTERVAL_MIN", 180) * 60,
backfill_page_size=_int_env("CRM_GMAIL_BACKFILL_PAGE_SIZE", 500),
max_attachment_mb=_int_env("CRM_GMAIL_MAX_ATTACHMENT_MB", 50),
max_parallel_attachment_downloads=_int_env("CRM_GMAIL_ATTACH_CONCURRENCY", 4),
data_dir=data_dir,
attachments_dir=os.path.join(data_dir, "email_attachments"),
secrets_dir=os.path.join(data_dir, "secrets"),
rate_limit_units_per_sec_per_account=_int_env("CRM_GMAIL_RATE_UNITS_SEC", 150),
retry_max_attempts=_int_env("CRM_GMAIL_RETRY_MAX", 5),
retry_initial_delay_sec=float(os.environ.get("CRM_GMAIL_RETRY_INITIAL_SEC", "1.0")),
retry_max_delay_sec=float(os.environ.get("CRM_GMAIL_RETRY_MAX_SEC", "60.0")),
history_stale_days=_int_env("CRM_GMAIL_HISTORY_STALE_DAYS", 5),
)
# Singleton. Reload with `reload_config()` if env changes (mostly for tests).
CONFIG = load()
def reload_config() -> EmailConfig:
global CONFIG
CONFIG = load()
return CONFIG
+297
View File
@@ -0,0 +1,297 @@
"""
Credential providers for Gmail API access.
Two implementations behind a common protocol:
- DWDCredentialProvider: signs a JWT with the Workspace-authorized service
account, exchanges for a short-lived access token that impersonates a
specific user. No per-user persistent state.
- OAuthCredentialProvider: uses a per-user refresh token (stored encrypted
in email_accounts.oauth_refresh_enc) to mint access tokens. Supports the
'connect Gmail' UI flow.
Both provide the same interface:
provider.access_token_for(email_address: str) -> AccessToken
"""
import base64
import json
import os
import threading
import time
from dataclasses import dataclass
from typing import Optional, Protocol
import urllib.parse
import urllib.request
from . import config as _cfg
from . import crypto
from . import errors
GMAIL_READONLY_SCOPE = "https://www.googleapis.com/auth/gmail.readonly"
GOOGLE_TOKEN_URL = "https://oauth2.googleapis.com/token"
@dataclass
class AccessToken:
token: str
expires_at: float # epoch seconds
class CredentialProvider(Protocol):
def access_token_for(self, email_address: str) -> AccessToken: ...
def revoke(self, email_address: str) -> None: ...
# ============================================================================
# Domain-wide delegation
# ============================================================================
class DWDCredentialProvider:
"""Impersonation via service-account JWT bearer grant."""
def __init__(self, key_path: str):
with open(key_path, "r") as f:
self._key = json.load(f)
self._client_email = self._key["client_email"]
self._private_key_pem = self._key["private_key"].encode("utf-8")
self._cache: dict[str, AccessToken] = {}
self._lock = threading.Lock()
def access_token_for(self, email_address: str) -> AccessToken:
with self._lock:
cached = self._cache.get(email_address)
if cached and cached.expires_at - time.time() > 60:
return cached
token = self._mint(email_address)
self._cache[email_address] = token
return token
def revoke(self, email_address: str) -> None:
# DWD tokens expire naturally in <1h. Revocation is via Admin console.
# We just drop the cache so next call mints fresh.
with self._lock:
self._cache.pop(email_address, None)
# ------------------------------------------------------------------ helpers
def _mint(self, subject_email: str) -> AccessToken:
try:
from cryptography.hazmat.primitives import hashes, serialization # type: ignore
from cryptography.hazmat.primitives.asymmetric import padding # type: ignore
except ImportError as e: # pragma: no cover
raise errors.AuthError(
"DWD requires the `cryptography` package. Add to requirements.txt."
) from e
now = int(time.time())
header = {"alg": "RS256", "typ": "JWT"}
claim = {
"iss": self._client_email,
"sub": subject_email,
"scope": GMAIL_READONLY_SCOPE,
"aud": GOOGLE_TOKEN_URL,
"iat": now,
"exp": now + 3600,
}
signing_input = _b64url(_json(header)) + b"." + _b64url(_json(claim))
private_key = serialization.load_pem_private_key(self._private_key_pem, password=None)
signature = private_key.sign(signing_input, padding.PKCS1v15(), hashes.SHA256())
jwt = signing_input + b"." + _b64url(signature)
body = urllib.parse.urlencode({
"grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
"assertion": jwt.decode("ascii"),
}).encode("ascii")
req = urllib.request.Request(
GOOGLE_TOKEN_URL,
data=body,
headers={"Content-Type": "application/x-www-form-urlencoded"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
payload = json.loads(resp.read())
except urllib.error.HTTPError as e:
body = e.read().decode("utf-8", errors="replace")
try:
payload = json.loads(body)
except Exception:
payload = {"raw": body}
raise errors.classify_http(e.code, payload)
if "access_token" not in payload:
raise errors.AuthError("DWD token exchange returned no access_token", payload=payload)
return AccessToken(
token=payload["access_token"],
expires_at=time.time() + float(payload.get("expires_in", 3600)) - 30,
)
# ============================================================================
# Per-user OAuth (fallback)
# ============================================================================
class OAuthCredentialProvider:
"""Refreshes access tokens using a stored encrypted refresh token.
Refresh tokens are obtained via the consent-flow routes in routes.py and
stored in email_accounts.oauth_refresh_enc (AES-GCM ciphertext).
"""
def __init__(self, db_conn_factory, client_id: str, client_secret: str, secret_key_b64: str):
self._db = db_conn_factory
self._client_id = client_id
self._client_secret = client_secret
self._secret_key_b64 = secret_key_b64
self._lock = threading.Lock()
def access_token_for(self, email_address: str) -> AccessToken:
with self._lock:
row = self._load_account(email_address)
if row is None:
raise errors.AuthError(f"no email_accounts row for {email_address}")
# Cached access token still valid?
if row["oauth_token_enc"] and row["oauth_token_exp"]:
try:
exp = float(row["oauth_token_exp"])
except ValueError:
exp = 0.0
if exp - time.time() > 60:
token = crypto.decrypt(row["oauth_token_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
return AccessToken(token=token, expires_at=exp)
# Refresh.
return self._refresh(email_address, row)
def revoke(self, email_address: str) -> None:
row = self._load_account(email_address)
if not row or not row["oauth_refresh_enc"]:
return
refresh = crypto.decrypt(row["oauth_refresh_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
body = urllib.parse.urlencode({"token": refresh}).encode("ascii")
req = urllib.request.Request(
"https://oauth2.googleapis.com/revoke",
data=body,
headers={"Content-Type": "application/x-www-form-urlencoded"},
)
try:
urllib.request.urlopen(req, timeout=10).read()
except Exception:
pass # best effort; we zero locally regardless
self._zero_account(email_address)
# ------------------------------------------------------------------ helpers
def _refresh(self, email_address: str, row) -> AccessToken:
if not row["oauth_refresh_enc"]:
raise errors.AuthError(f"no refresh token stored for {email_address}")
refresh = crypto.decrypt(row["oauth_refresh_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
body = urllib.parse.urlencode({
"grant_type": "refresh_token",
"refresh_token": refresh,
"client_id": self._client_id,
"client_secret": self._client_secret,
}).encode("ascii")
req = urllib.request.Request(
GOOGLE_TOKEN_URL,
data=body,
headers={"Content-Type": "application/x-www-form-urlencoded"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
payload = json.loads(resp.read())
except urllib.error.HTTPError as e:
body_text = e.read().decode("utf-8", errors="replace")
try:
payload = json.loads(body_text)
except Exception:
payload = {"raw": body_text}
raise errors.classify_http(e.code, payload)
if "access_token" not in payload:
raise errors.AuthError("OAuth refresh returned no access_token", payload=payload)
token_str = payload["access_token"]
exp = time.time() + float(payload.get("expires_in", 3600)) - 30
enc_token = crypto.encrypt(token_str.encode("ascii"), secret_key_b64=self._secret_key_b64)
self._save_token(email_address, enc_token, exp)
return AccessToken(token=token_str, expires_at=exp)
def _load_account(self, email_address: str):
conn = self._db()
try:
cur = conn.cursor()
cur.execute(
"SELECT id, oauth_refresh_enc, oauth_token_enc, oauth_token_exp "
"FROM email_accounts WHERE email_address = ?",
(email_address,),
)
return cur.fetchone()
finally:
conn.close()
def _save_token(self, email_address: str, enc_token: bytes, exp: float):
conn = self._db()
try:
conn.execute(
"UPDATE email_accounts SET oauth_token_enc = ?, oauth_token_exp = ?, "
"updated_at = datetime('now') WHERE email_address = ?",
(enc_token, str(exp), email_address),
)
conn.commit()
finally:
conn.close()
def _zero_account(self, email_address: str):
conn = self._db()
try:
conn.execute(
"UPDATE email_accounts SET oauth_refresh_enc = NULL, oauth_token_enc = NULL, "
"oauth_token_exp = NULL, sync_enabled = 0, sync_status = 'paused', "
"updated_at = datetime('now') WHERE email_address = ?",
(email_address,),
)
conn.commit()
finally:
conn.close()
# ============================================================================
# Factory — resolves CONFIG.primary_auth to a concrete provider
# ============================================================================
def build_provider(db_conn_factory) -> CredentialProvider:
cfg = _cfg.CONFIG
if cfg.primary_auth == "dwd":
if not cfg.dwd_key_path or not os.path.exists(cfg.dwd_key_path):
raise errors.AuthError(
f"CRM_GMAIL_SA_KEY_PATH not found: {cfg.dwd_key_path!r}"
)
return DWDCredentialProvider(cfg.dwd_key_path)
if cfg.primary_auth == "oauth":
if not (cfg.oauth_client_id and cfg.oauth_client_secret and cfg.secret_key_b64):
raise errors.AuthError(
"OAuth mode requires CRM_GMAIL_OAUTH_CLIENT_ID, "
"CRM_GMAIL_OAUTH_CLIENT_SECRET, and CRM_GMAIL_SECRET_KEY."
)
return OAuthCredentialProvider(
db_conn_factory,
cfg.oauth_client_id,
cfg.oauth_client_secret,
cfg.secret_key_b64,
)
raise errors.AuthError(f"unknown primary_auth: {cfg.primary_auth!r}")
# ---------------------------------------------------------------------------- utils
def _b64url(data: bytes) -> bytes:
return base64.urlsafe_b64encode(data).rstrip(b"=")
def _json(obj) -> bytes:
return json.dumps(obj, separators=(",", ":")).encode("utf-8")
+79
View File
@@ -0,0 +1,79 @@
"""
AES-256-GCM encryption for OAuth refresh tokens at rest.
Key material comes from CONFIG.secret_key_b64 (env: CRM_GMAIL_SECRET_KEY).
Must be at least 32 bytes of entropy, base64-encoded.
Storage format (as stored in BLOB columns):
version(1 byte) || nonce(12 bytes) || ciphertext+tag(N bytes)
version = 1 for AES-GCM-256.
Uses the `cryptography` library. If not available (optional at scaffold time),
the OAuth fallback path is disabled with a clear error — DWD path is unaffected.
"""
import base64
import os
import secrets
from typing import Optional
try:
from cryptography.hazmat.primitives.ciphers.aead import AESGCM # type: ignore
_AVAILABLE = True
except ImportError: # pragma: no cover
AESGCM = None # type: ignore
_AVAILABLE = False
VERSION = 1
NONCE_LEN = 12
class CryptoUnavailable(RuntimeError):
pass
def _load_key(secret_key_b64: Optional[str]) -> bytes:
if not secret_key_b64:
raise CryptoUnavailable(
"CRM_GMAIL_SECRET_KEY not set; cannot encrypt/decrypt OAuth tokens. "
"DWD auth does not require this."
)
try:
key = base64.b64decode(secret_key_b64)
except Exception as e:
raise CryptoUnavailable(f"CRM_GMAIL_SECRET_KEY not valid base64: {e}") from e
if len(key) < 32:
raise CryptoUnavailable(
f"CRM_GMAIL_SECRET_KEY decodes to {len(key)} bytes; need >= 32."
)
return key[:32] # AES-256
def encrypt(plaintext: bytes, *, secret_key_b64: Optional[str]) -> bytes:
if not _AVAILABLE:
raise CryptoUnavailable("cryptography library not installed")
key = _load_key(secret_key_b64)
nonce = secrets.token_bytes(NONCE_LEN)
ct = AESGCM(key).encrypt(nonce, plaintext, None)
return bytes([VERSION]) + nonce + ct
def decrypt(blob: bytes, *, secret_key_b64: Optional[str]) -> bytes:
if not _AVAILABLE:
raise CryptoUnavailable("cryptography library not installed")
if not blob or len(blob) < 1 + NONCE_LEN + 16:
raise ValueError("ciphertext too short")
version = blob[0]
if version != VERSION:
raise ValueError(f"unsupported crypto version: {version}")
nonce = blob[1:1 + NONCE_LEN]
ct = blob[1 + NONCE_LEN:]
key = _load_key(secret_key_b64)
return AESGCM(key).decrypt(nonce, ct, None)
def generate_secret_key_b64() -> str:
"""Helper for initial setup: prints a fresh key you can drop into env."""
return base64.b64encode(os.urandom(32)).decode("ascii")
+416
View File
@@ -0,0 +1,416 @@
"""
Data-access layer for the email_integration module.
All SQL touching emails_* tables lives here. Other modules call named
helpers — they never write SQL inline. This keeps schema changes contained.
Connection pattern matches server.py get_db():
- WAL mode, foreign keys on, busy_timeout
- sqlite3.Row row_factory
The caller is responsible for committing / closing.
"""
import json
import os
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Iterable, Optional
# ------------------------------------------------------------------ migrations
def apply_migrations(cursor: sqlite3.Cursor) -> None:
"""Apply all .sql migration files in migrations/ in lexicographic order.
Called from server.init_db(). Idempotent. Does not log past migrations in
a table yet — each file is guarded by CREATE ... IF NOT EXISTS etc. If
we ever need more complex migrations, add a schema_migrations table.
"""
here = os.path.dirname(os.path.abspath(__file__))
mdir = os.path.join(here, "migrations")
if not os.path.isdir(mdir):
return
for name in sorted(os.listdir(mdir)):
if not name.endswith(".sql"):
continue
path = os.path.join(mdir, name)
with open(path, "r") as f:
sql = f.read()
cursor.executescript(sql)
# ------------------------------------------------------------------ utils
def _uuid() -> str:
return str(uuid.uuid4())
def _now_iso() -> str:
return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
def _json(v) -> str:
return json.dumps(v, separators=(",", ":"))
# ------------------------------------------------------------------ email_accounts
def list_sync_ready_accounts(conn: sqlite3.Connection) -> list[sqlite3.Row]:
cur = conn.cursor()
cur.execute(
"SELECT * FROM email_accounts "
"WHERE sync_enabled = 1 AND sync_status IN ('pending','active') "
"ORDER BY last_synced_at IS NOT NULL, last_synced_at"
)
return cur.fetchall()
def get_account_by_email(conn: sqlite3.Connection, email_address: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute("SELECT * FROM email_accounts WHERE email_address = ?", (email_address,))
return cur.fetchone()
def upsert_account(conn: sqlite3.Connection, *, user_id: str, email_address: str,
auth_method: str) -> str:
existing = get_account_by_email(conn, email_address)
if existing:
return existing["id"]
account_id = _uuid()
conn.execute(
"INSERT INTO email_accounts (id, user_id, email_address, auth_method) "
"VALUES (?, ?, ?, ?)",
(account_id, user_id, email_address, auth_method),
)
return account_id
def set_account_status(conn: sqlite3.Connection, account_id: str, *,
status: str, error: Optional[str] = None) -> None:
conn.execute(
"UPDATE email_accounts SET sync_status = ?, sync_error = ?, "
"updated_at = datetime('now') WHERE id = ?",
(status, error, account_id),
)
def set_account_checkpoint(conn: sqlite3.Connection, account_id: str, *,
history_id: Optional[str] = None,
backfill_cursor: Optional[str] = None,
backfill_complete: Optional[bool] = None,
last_synced_at: Optional[str] = None) -> None:
sets, params = [], []
if history_id is not None:
sets.append("last_history_id = ?"); params.append(history_id)
if backfill_cursor is not None:
sets.append("backfill_cursor = ?"); params.append(backfill_cursor)
if backfill_complete is not None:
sets.append("backfill_complete = ?"); params.append(1 if backfill_complete else 0)
if last_synced_at is not None:
sets.append("last_synced_at = ?"); params.append(last_synced_at)
if not sets:
return
sets.append("updated_at = datetime('now')")
params.append(account_id)
conn.execute(f"UPDATE email_accounts SET {', '.join(sets)} WHERE id = ?", params)
# ------------------------------------------------------------------ emails
def find_email_by_rfc_id(conn: sqlite3.Connection, rfc_message_id: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute("SELECT * FROM emails WHERE rfc_message_id = ?", (rfc_message_id,))
return cur.fetchone()
def find_email_id_by_any_rfc_id(conn: sqlite3.Connection,
rfc_ids: Iterable[str]) -> Optional[str]:
ids = [r for r in rfc_ids if r]
if not ids:
return None
placeholders = ",".join("?" for _ in ids)
cur = conn.cursor()
cur.execute(
f"SELECT id FROM emails WHERE rfc_message_id IN ({placeholders}) "
"ORDER BY sent_at ASC LIMIT 1",
ids,
)
row = cur.fetchone()
return row["id"] if row else None
def insert_email(conn: sqlite3.Connection, *, parsed: dict, match_status: str) -> str:
"""Insert a fresh emails row. Returns email_id.
Caller must ensure no row exists for parsed['rfc_message_id']; use
find_email_by_rfc_id first.
"""
email_id = _uuid()
conn.execute(
"""INSERT INTO emails
(id, rfc_message_id, gmail_thread_id, rfc_thread_root_id, subject,
from_email, from_name, to_emails_json, cc_emails_json, bcc_emails_json,
reply_to, sent_at, body_text, body_html, snippet, in_reply_to,
references_json, has_attachments, size_estimate, is_matched,
match_status, raw_headers_json)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
email_id,
parsed["rfc_message_id"],
parsed.get("gmail_thread_id"),
parsed.get("rfc_thread_root_id"),
parsed.get("subject"),
parsed["from_email"],
parsed.get("from_name"),
_json(parsed.get("to", [])),
_json(parsed.get("cc", [])),
_json(parsed.get("bcc", [])),
parsed.get("reply_to"),
parsed["sent_at"],
parsed.get("body_text"),
parsed.get("body_html"),
parsed.get("snippet"),
parsed.get("in_reply_to"),
_json(parsed.get("references", [])),
1 if parsed.get("attachments") else 0,
parsed.get("size_estimate"),
1 if match_status == "matched" else 0,
match_status,
_json(parsed.get("raw_headers", {})) if parsed.get("raw_headers") else None,
),
)
# recipients
for kind in ("from", "to", "cc", "bcc", "reply_to"):
addrs = []
if kind == "from" and parsed.get("from_email"):
addrs = [(parsed["from_email"], parsed.get("from_name"))]
elif kind == "reply_to" and parsed.get("reply_to"):
addrs = [(parsed["reply_to"], None)]
else:
for a in parsed.get(kind, []):
if isinstance(a, dict):
addrs.append((a.get("email"), a.get("name")))
else:
addrs.append((a, None))
for address, name in addrs:
if not address:
continue
conn.execute(
"INSERT INTO email_recipients (id, email_id, address, display_name, kind) "
"VALUES (?, ?, ?, ?, ?)",
(_uuid(), email_id, address.lower().strip(), name, kind),
)
return email_id
def set_email_thread(conn: sqlite3.Connection, email_id: str, thread_id: str) -> None:
conn.execute(
"UPDATE emails SET thread_id = ?, updated_at = datetime('now') WHERE id = ?",
(thread_id, email_id),
)
# ------------------------------------------------------------------ sightings
def upsert_sighting(conn: sqlite3.Connection, *, email_id: str, account_id: str,
gmail_message_id: str, gmail_thread_id: str,
labels: list[str], is_sent: bool) -> None:
conn.execute(
"""INSERT OR IGNORE INTO email_account_messages
(id, email_id, account_id, gmail_message_id, gmail_thread_id,
labels_json, is_sent)
VALUES (?, ?, ?, ?, ?, ?, ?)""",
(_uuid(), email_id, account_id, gmail_message_id, gmail_thread_id,
_json(labels), 1 if is_sent else 0),
)
def update_sighting_labels(conn: sqlite3.Connection, *, account_id: str,
gmail_message_id: str, labels: list[str]) -> None:
conn.execute(
"UPDATE email_account_messages SET labels_json = ? "
"WHERE account_id = ? AND gmail_message_id = ?",
(_json(labels), account_id, gmail_message_id),
)
def tombstone_sighting(conn: sqlite3.Connection, *, account_id: str,
gmail_message_id: str) -> None:
conn.execute(
"UPDATE email_account_messages SET deleted_at = datetime('now') "
"WHERE account_id = ? AND gmail_message_id = ?",
(account_id, gmail_message_id),
)
# ------------------------------------------------------------------ attachments
def insert_attachment_stub(conn: sqlite3.Connection, *, email_id: str,
gmail_attachment_id: str, filename: str,
sanitized_filename: str, mime_type: Optional[str],
size_bytes: Optional[int], storage_path: str) -> str:
att_id = _uuid()
conn.execute(
"""INSERT INTO email_attachments
(id, email_id, gmail_attachment_id, filename, sanitized_filename,
mime_type, size_bytes, storage_path)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
(att_id, email_id, gmail_attachment_id, filename, sanitized_filename,
mime_type, size_bytes, storage_path),
)
return att_id
def mark_attachment_downloaded(conn: sqlite3.Connection, attachment_id: str, *,
sha256_hex: str, size_bytes: int) -> None:
conn.execute(
"UPDATE email_attachments SET download_status = 'downloaded', "
"sha256_hex = ?, size_bytes = ?, downloaded_at = datetime('now') "
"WHERE id = ?",
(sha256_hex, size_bytes, attachment_id),
)
def mark_attachment_failed(conn: sqlite3.Connection, attachment_id: str, *,
error: str) -> None:
conn.execute(
"UPDATE email_attachments SET download_status = 'failed', "
"download_attempts = download_attempts + 1, download_error = ? "
"WHERE id = ?",
(error, attachment_id),
)
def pending_attachments(conn: sqlite3.Connection, limit: int = 50) -> list[sqlite3.Row]:
cur = conn.cursor()
cur.execute(
"SELECT a.*, eam.gmail_message_id, eam.account_id "
"FROM email_attachments a "
"JOIN email_account_messages eam ON eam.email_id = a.email_id "
"WHERE a.download_status = 'pending' AND a.download_attempts < 5 "
"LIMIT ?",
(limit,),
)
return cur.fetchall()
# ------------------------------------------------------------------ threads
def find_thread_by_gmail_id(conn: sqlite3.Connection, gmail_thread_id: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute(
"SELECT * FROM email_threads WHERE gmail_thread_id = ?",
(gmail_thread_id,),
)
return cur.fetchone()
def find_thread_by_rfc_root(conn: sqlite3.Connection, rfc_root: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute(
"SELECT * FROM email_threads WHERE rfc_thread_root_id = ?",
(rfc_root,),
)
return cur.fetchone()
def create_thread(conn: sqlite3.Connection, *, gmail_thread_id: Optional[str],
rfc_thread_root_id: Optional[str], subject_normalized: Optional[str],
first_message_at: Optional[str]) -> str:
thread_id = _uuid()
conn.execute(
"""INSERT INTO email_threads
(id, gmail_thread_id, rfc_thread_root_id, subject_normalized,
first_message_at, last_message_at, message_count)
VALUES (?, ?, ?, ?, ?, ?, 0)""",
(thread_id, gmail_thread_id, rfc_thread_root_id, subject_normalized,
first_message_at, first_message_at),
)
return thread_id
def rollup_thread(conn: sqlite3.Connection, thread_id: str) -> None:
"""Recompute count / last_message_at / participants from member emails.
Cheap at 5-person team volumes. For larger deployments swap to triggers.
"""
cur = conn.cursor()
cur.execute(
"SELECT COUNT(*) AS n, MIN(sent_at) AS first, MAX(sent_at) AS last, "
"MAX(is_matched) AS matched FROM emails WHERE thread_id = ?",
(thread_id,),
)
row = cur.fetchone()
if not row or row["n"] == 0:
return
cur.execute(
"SELECT DISTINCT address FROM email_recipients er "
"JOIN emails e ON e.id = er.email_id WHERE e.thread_id = ?",
(thread_id,),
)
participants = [r["address"] for r in cur.fetchall()]
conn.execute(
"UPDATE email_threads SET message_count = ?, first_message_at = ?, "
"last_message_at = ?, participant_count = ?, participants_json = ?, "
"is_matched = ?, updated_at = datetime('now') WHERE id = ?",
(row["n"], row["first"], row["last"], len(participants),
_json(participants), int(row["matched"] or 0), thread_id),
)
# ------------------------------------------------------------------ investor links
def insert_investor_link(conn: sqlite3.Connection, *, email_id: str,
link: dict) -> None:
conn.execute(
"""INSERT INTO email_investor_links
(id, email_id, fundraising_investor_id, fundraising_contact_id,
contact_id, organization_id, matched_address, match_kind,
match_confidence)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
_uuid(),
email_id,
link.get("fundraising_investor_id"),
link.get("fundraising_contact_id"),
link.get("contact_id"),
link.get("organization_id"),
link["matched_address"],
link["match_kind"],
float(link.get("match_confidence", 1.0)),
),
)
# ------------------------------------------------------------------ sync runs
def start_sync_run(conn: sqlite3.Connection, *, account_id: str, kind: str) -> str:
run_id = _uuid()
conn.execute(
"INSERT INTO email_sync_runs (id, account_id, kind, started_at, status) "
"VALUES (?, ?, ?, ?, 'running')",
(run_id, account_id, kind, _now_iso()),
)
return run_id
def finish_sync_run(conn: sqlite3.Connection, run_id: str, *, status: str,
stats: Optional[dict] = None, error: Optional[str] = None) -> None:
stats = stats or {}
conn.execute(
"""UPDATE email_sync_runs
SET finished_at = ?, status = ?, messages_seen = ?, messages_stored = ?,
attachments_saved = ?, api_calls = ?, retries = ?, error = ?
WHERE id = ?""",
(
_now_iso(), status,
int(stats.get("messages_seen", 0)),
int(stats.get("messages_stored", 0)),
int(stats.get("attachments_saved", 0)),
int(stats.get("api_calls", 0)),
int(stats.get("retries", 0)),
error,
run_id,
),
)
+79
View File
@@ -0,0 +1,79 @@
"""
Exception taxonomy for Gmail integration.
gmail_client._call() maps HTTP status codes to these exception types. The retry
loop in gmail_client._with_retry() inspects the class hierarchy to decide
whether to back off + retry or fail fast.
"""
class GmailError(Exception):
"""Base class for all Gmail-integration errors."""
def __init__(self, message: str = "", *, status: int = 0, payload: object = None):
super().__init__(message)
self.status = status
self.payload = payload
class AuthError(GmailError):
"""401 / 403 that is not a rate-limit. Requires operator intervention
(bad service account key, revoked OAuth, missing DWD scope). Not retried."""
class RateLimitError(GmailError):
"""429 or 403 with reason in {rateLimitExceeded, userRateLimitExceeded}.
Retried with exponential backoff."""
class TransientError(GmailError):
"""5xx or network error. Retried with exponential backoff."""
class NotFoundError(GmailError):
"""404. For messages this usually means 'deleted in Gmail after we saw it';
for history this is HistoryExpiredError."""
class HistoryExpiredError(NotFoundError):
"""404 on history.list with startHistoryId — Gmail only retains history
for a limited window (~7 days). Triggers date-based backfill fallback."""
class PermanentError(GmailError):
"""400 or other permanent failure. Skip and log; do not retry."""
def classify_http(status: int, payload: object) -> GmailError:
"""Map a Gmail API response to the appropriate exception type.
`payload` is the decoded JSON body if any; used to distinguish rate-limit
403s from pure auth 403s via the `reason` field Google returns.
"""
reason = ""
if isinstance(payload, dict):
try:
errs = payload.get("error", {}).get("errors") or []
if errs:
reason = str(errs[0].get("reason", ""))
except Exception: # pragma: no cover — defensive
pass
if status == 429:
return RateLimitError(f"rate limited: {reason}", status=status, payload=payload)
if status == 403:
if reason in ("rateLimitExceeded", "userRateLimitExceeded", "quotaExceeded"):
return RateLimitError(f"quota: {reason}", status=status, payload=payload)
return AuthError(f"forbidden: {reason}", status=status, payload=payload)
if status == 401:
return AuthError("unauthorized", status=status, payload=payload)
if status == 404:
return NotFoundError("not found", status=status, payload=payload)
if 500 <= status < 600:
return TransientError(f"server error {status}", status=status, payload=payload)
if 400 <= status < 500:
return PermanentError(f"client error {status}: {reason}", status=status, payload=payload)
return GmailError(f"unexpected status {status}", status=status, payload=payload)
RETRYABLE = (RateLimitError, TransientError)
+249
View File
@@ -0,0 +1,249 @@
"""
Thin Gmail API wrapper.
Responsibilities:
- HTTPS calls to https://gmail.googleapis.com/gmail/v1/users/me/*
- Per-account access-token injection via CredentialProvider
- Rate limiting via token bucket
- Retry loop with exponential backoff + jitter for RETRYABLE errors
- Batch requests for metadata fetches (multipart/mixed) — sketch provided
- Call-count accounting for observability (plumbed to email_sync_runs)
We call Gmail over raw urllib instead of the google-api-python-client to keep
the dependency surface small. If you prefer the Google SDK, replace _call()
with client calls; everything else is independent.
"""
import json
import random
import threading
import time
import urllib.error
import urllib.parse
import urllib.request
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional
from . import config as _cfg
from . import errors
BASE = "https://gmail.googleapis.com/gmail/v1/users"
# ---------------------------------------------------------------------------- token bucket
class _TokenBucket:
"""Simple per-account rate limiter. Call wait(cost) before each API call."""
def __init__(self, units_per_sec: int, burst: Optional[int] = None):
self._rate = float(units_per_sec)
self._burst = float(burst if burst is not None else units_per_sec)
self._tokens = self._burst
self._last = time.monotonic()
self._lock = threading.Lock()
def wait(self, cost: float) -> None:
while True:
with self._lock:
now = time.monotonic()
self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
self._last = now
if self._tokens >= cost:
self._tokens -= cost
return
needed = cost - self._tokens
sleep_for = needed / self._rate
time.sleep(sleep_for)
# ---------------------------------------------------------------------------- call stats
@dataclass
class CallStats:
api_calls: int = 0
retries: int = 0
bytes_in: int = 0
last_errors: list[str] = field(default_factory=list)
# ---------------------------------------------------------------------------- client
class GmailClient:
"""Per-account Gmail client. Bind one instance per sync run."""
def __init__(self, credential_provider, email_address: str, stats: Optional[CallStats] = None):
self._creds = credential_provider
self._email = email_address
self._bucket = _TokenBucket(units_per_sec=_cfg.CONFIG.rate_limit_units_per_sec_per_account)
self.stats = stats or CallStats()
# -------------------------------------------------------------- messages.*
def list_messages(self, *, q: str = "", page_token: Optional[str] = None,
max_results: int = 500, label_ids: Optional[list[str]] = None) -> dict:
"""https://developers.google.com/gmail/api/reference/rest/v1/users.messages/list"""
params = {"maxResults": str(max_results)}
if q:
params["q"] = q
if page_token:
params["pageToken"] = page_token
if label_ids:
for lid in label_ids:
params.setdefault("labelIds", [])
params["labelIds"].append(lid) if False else None
return self._get("/messages", params=params, cost=5)
def get_message(self, message_id: str, *, format: str = "metadata",
metadata_headers: Optional[list[str]] = None) -> dict:
params = {"format": format}
if format == "metadata" and metadata_headers:
params["metadataHeaders"] = metadata_headers
return self._get(f"/messages/{message_id}", params=params, cost=5)
def get_attachment(self, message_id: str, attachment_id: str) -> dict:
return self._get(
f"/messages/{message_id}/attachments/{attachment_id}",
params=None,
cost=5,
)
# -------------------------------------------------------------- history.*
def list_history(self, *, start_history_id: str, page_token: Optional[str] = None,
history_types: Optional[list[str]] = None) -> dict:
params = {"startHistoryId": start_history_id, "maxResults": "500"}
if page_token:
params["pageToken"] = page_token
if history_types:
params["historyTypes"] = history_types
try:
return self._get("/history", params=params, cost=2)
except errors.NotFoundError as e:
# Gmail returns 404 when startHistoryId is too old. Wrap for callers.
raise errors.HistoryExpiredError(
"startHistoryId no longer available", status=404, payload=getattr(e, "payload", None)
) from e
# -------------------------------------------------------------- profile
def get_profile(self) -> dict:
return self._get("/profile", params=None, cost=1)
# -------------------------------------------------------------- iteration helpers
def iter_messages(self, *, q: str = "") -> Iterator[dict]:
page_token: Optional[str] = None
while True:
resp = self.list_messages(q=q, page_token=page_token,
max_results=_cfg.CONFIG.backfill_page_size)
for m in resp.get("messages") or []:
yield m
page_token = resp.get("nextPageToken")
if not page_token:
return
def iter_history(self, *, start_history_id: str,
history_types: Optional[list[str]] = None) -> Iterator[dict]:
page_token: Optional[str] = None
while True:
resp = self.list_history(
start_history_id=start_history_id,
page_token=page_token,
history_types=history_types,
)
for h in resp.get("history") or []:
yield h
page_token = resp.get("nextPageToken")
if not page_token:
# Cache final historyId for caller to checkpoint.
self._last_history_id = resp.get("historyId")
return
@property
def last_history_id(self) -> Optional[str]:
return getattr(self, "_last_history_id", None)
# -------------------------------------------------------------- internals
def _get(self, path: str, *, params: Optional[dict], cost: float) -> dict:
return self._with_retry(lambda: self._call("GET", path, params=params, cost=cost))
def _call(self, method: str, path: str, *, params: Optional[dict] = None,
body: Optional[bytes] = None, cost: float = 1.0) -> dict:
self._bucket.wait(cost)
self.stats.api_calls += 1
qs = ""
if params:
# urllib.parse.urlencode with doseq=True handles repeated params
# like metadataHeaders=Foo&metadataHeaders=Bar correctly.
qs = "?" + urllib.parse.urlencode(params, doseq=True)
url = f"{BASE}/me{path}{qs}"
token = self._creds.access_token_for(self._email)
req = urllib.request.Request(url, method=method, data=body)
req.add_header("Authorization", f"Bearer {token.token}")
req.add_header("Accept", "application/json")
if body:
req.add_header("Content-Type", "application/json")
try:
with urllib.request.urlopen(req, timeout=30) as resp:
raw = resp.read()
self.stats.bytes_in += len(raw)
if not raw:
return {}
return json.loads(raw)
except urllib.error.HTTPError as e:
raw = e.read() or b""
self.stats.bytes_in += len(raw)
try:
payload = json.loads(raw) if raw else {}
except Exception:
payload = {"raw": raw.decode("utf-8", errors="replace")}
err = errors.classify_http(e.code, payload)
# short-message logging hook (redacted of tokens by design)
self.stats.last_errors.append(f"{e.code} {type(err).__name__}")
self.stats.last_errors = self.stats.last_errors[-10:]
raise err
except (urllib.error.URLError, TimeoutError) as e:
raise errors.TransientError(f"network error: {e}") from e
def _with_retry(self, fn):
cfg = _cfg.CONFIG
attempts = 0
delay = cfg.retry_initial_delay_sec
while True:
try:
return fn()
except errors.RETRYABLE as e:
attempts += 1
if attempts >= cfg.retry_max_attempts:
raise
self.stats.retries += 1
# Full jitter
sleep_for = random.uniform(0, min(delay, cfg.retry_max_delay_sec))
time.sleep(sleep_for)
delay = min(delay * 2, cfg.retry_max_delay_sec)
# Non-retryable errors propagate immediately.
# ---------------------------------------------------------------------------- batch fetch sketch
def batch_get_metadata(client: GmailClient, message_ids: list[str],
headers: list[str]) -> dict[str, dict]:
"""Fetch metadata for up to ~100 messages.
TODO: implement using Gmail's multipart/mixed batch endpoint at
https://www.googleapis.com/batch/gmail/v1 for efficiency. In the scaffold
we fall back to serial gets so the logic is correct from day 1.
"""
out: dict[str, dict] = {}
for mid in message_ids:
try:
out[mid] = client.get_message(mid, format="metadata", metadata_headers=headers)
except errors.NotFoundError:
# Message deleted between list and get — skip.
continue
return out
+215
View File
@@ -0,0 +1,215 @@
"""
Investor matching.
Builds an in-memory index of investor email addresses from:
- fundraising_contacts.email
- contacts.email
- organizations.email + organizations.website (domain only)
For each synced email, returns a list of investor links. Exact-email matches
beat domain matches; if any exact match exists, domain matches are suppressed.
The index is rebuilt every `REFRESH_INTERVAL_SEC` or on demand via rebuild().
"""
import re
import threading
import time
from dataclasses import dataclass
from typing import Optional
REFRESH_INTERVAL_SEC = 900 # 15 minutes
# Domains we never domain-match against (personal mailboxes).
COMMON_PERSONAL_DOMAINS = {
"gmail.com", "googlemail.com",
"outlook.com", "hotmail.com", "live.com", "msn.com",
"yahoo.com", "yahoo.co.uk", "ymail.com",
"icloud.com", "me.com", "mac.com",
"aol.com", "proton.me", "protonmail.com",
"pm.me", "fastmail.com", "tuta.io", "hey.com",
"duck.com", "zoho.com",
}
# Also skip matching on the team's own domain (they email each other).
# Populated from CONFIG.workspace_domain at rebuild time.
@dataclass
class MatchTarget:
fundraising_investor_id: Optional[str] = None
fundraising_contact_id: Optional[str] = None
contact_id: Optional[str] = None
organization_id: Optional[str] = None
investor_name: Optional[str] = None
@dataclass
class InvestorLink:
matched_address: str
match_kind: str # exact_email | domain_match | manual
match_confidence: float
target: MatchTarget
class InvestorIndex:
def __init__(self, own_domain: Optional[str] = None):
self._email_index: dict[str, MatchTarget] = {}
self._domain_index: dict[str, list[MatchTarget]] = {}
self._own_domain = (own_domain or "").lower() or None
self._last_built = 0.0
self._lock = threading.Lock()
# ------------------------------------------------------------------ build
def rebuild(self, db_conn_factory) -> None:
with self._lock:
email_idx: dict[str, MatchTarget] = {}
domain_idx: dict[str, list[MatchTarget]] = {}
conn = db_conn_factory()
try:
cur = conn.cursor()
# fundraising_contacts
cur.execute(
"SELECT fc.id, fc.email, fc.investor_id, fi.investor_name "
"FROM fundraising_contacts fc "
"LEFT JOIN fundraising_investors fi ON fi.id = fc.investor_id "
"WHERE fc.email IS NOT NULL AND fc.email != ''"
)
for r in cur.fetchall():
addr = (r["email"] or "").lower().strip()
if not _valid_email(addr):
continue
email_idx[addr] = MatchTarget(
fundraising_contact_id=r["id"],
fundraising_investor_id=r["investor_id"],
investor_name=r["investor_name"],
)
# contacts
cur.execute(
"SELECT id, email, organization_id FROM contacts "
"WHERE email IS NOT NULL AND email != ''"
)
for r in cur.fetchall():
addr = (r["email"] or "").lower().strip()
if not _valid_email(addr):
continue
# Don't overwrite a fundraising_contact match; they're higher signal.
email_idx.setdefault(addr, MatchTarget(
contact_id=r["id"],
organization_id=r["organization_id"],
))
# organizations — domain-only match source
cur.execute(
"SELECT id, name, email, website FROM organizations "
"WHERE (email IS NOT NULL AND email != '') OR (website IS NOT NULL AND website != '')"
)
for r in cur.fetchall():
for d in _domains_for_org(r):
if d in COMMON_PERSONAL_DOMAINS:
continue
if self._own_domain and d == self._own_domain:
continue
domain_idx.setdefault(d, []).append(MatchTarget(
organization_id=r["id"],
investor_name=r["name"],
))
finally:
conn.close()
self._email_index = email_idx
self._domain_index = domain_idx
self._last_built = time.time()
def rebuild_if_stale(self, db_conn_factory) -> None:
if time.time() - self._last_built > REFRESH_INTERVAL_SEC:
self.rebuild(db_conn_factory)
# ------------------------------------------------------------------ query
def match(self, addresses: set[str], *,
exclude_addresses: Optional[set[str]] = None) -> list[InvestorLink]:
excl = {a.lower() for a in (exclude_addresses or set())}
candidates = {a.lower().strip() for a in addresses if a} - excl
# Exclude own domain addresses (teammates emailing each other).
if self._own_domain:
candidates = {a for a in candidates
if not a.endswith("@" + self._own_domain)}
links: list[InvestorLink] = []
seen_targets: set[tuple] = set()
# Exact email matches first.
for addr in candidates:
t = self._email_index.get(addr)
if t:
key = (t.fundraising_contact_id, t.contact_id)
if key in seen_targets:
continue
seen_targets.add(key)
links.append(InvestorLink(
matched_address=addr,
match_kind="exact_email",
match_confidence=1.0,
target=t,
))
if links: # exact hits short-circuit domain matching
return links
# Domain fallback.
for addr in candidates:
_, _, domain = addr.partition("@")
if not domain or domain in COMMON_PERSONAL_DOMAINS:
continue
for t in self._domain_index.get(domain, []):
key = ("org", t.organization_id)
if key in seen_targets:
continue
seen_targets.add(key)
links.append(InvestorLink(
matched_address=addr,
match_kind="domain_match",
match_confidence=0.6,
target=t,
))
return links
# ---------------------------------------------------------------------------- helpers
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
def _valid_email(s: str) -> bool:
return bool(_EMAIL_RE.match(s))
def _domains_for_org(row) -> list[str]:
out: list[str] = []
if row["email"]:
_, _, d = row["email"].lower().partition("@")
if d:
out.append(d)
if row["website"]:
d = _domain_from_url(row["website"])
if d:
out.append(d)
return list({d for d in out if d})
def _domain_from_url(url: str) -> Optional[str]:
if not url:
return None
m = re.match(r"^\s*(?:https?://)?(?:www\.)?([^/:?#\s]+)", url.strip(), re.IGNORECASE)
if not m:
return None
return m.group(1).lower()
@@ -0,0 +1,192 @@
-- Gmail Integration — Phase 1 migration
-- Creates all tables for email capture, matching, threading, attachments.
-- This migration is IDEMPOTENT: safe to re-run.
-- Applied by email_integration.db.apply_migrations() on server startup when
-- CRM_GMAIL_INTEGRATION_ENABLED is truthy.
--
-- DO NOT modify this file in place after it ships. Create 0002_*.sql, etc.
-- ============================================================================
-- email_accounts — one row per enrolled team-member mailbox
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_accounts (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL,
email_address TEXT NOT NULL UNIQUE,
auth_method TEXT NOT NULL, -- 'dwd' | 'oauth'
oauth_refresh_enc BLOB,
oauth_token_enc BLOB,
oauth_token_exp TEXT,
sync_enabled INTEGER NOT NULL DEFAULT 1,
sync_status TEXT NOT NULL DEFAULT 'pending',
sync_error TEXT,
last_history_id TEXT,
last_synced_at TEXT,
backfill_complete INTEGER NOT NULL DEFAULT 0,
backfill_cursor TEXT,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(user_id) REFERENCES users(id)
);
CREATE INDEX IF NOT EXISTS idx_email_accounts_user ON email_accounts(user_id);
CREATE INDEX IF NOT EXISTS idx_email_accounts_sync ON email_accounts(sync_enabled, sync_status);
-- ============================================================================
-- emails — canonical email record, dedup'd across accounts by RFC Message-ID
-- ============================================================================
CREATE TABLE IF NOT EXISTS emails (
id TEXT PRIMARY KEY,
rfc_message_id TEXT NOT NULL UNIQUE,
gmail_thread_id TEXT,
rfc_thread_root_id TEXT,
thread_id TEXT, -- FK email_threads.id (populated by threads.py)
subject TEXT,
from_email TEXT NOT NULL,
from_name TEXT,
to_emails_json TEXT NOT NULL DEFAULT '[]',
cc_emails_json TEXT NOT NULL DEFAULT '[]',
bcc_emails_json TEXT NOT NULL DEFAULT '[]',
reply_to TEXT,
sent_at TEXT NOT NULL,
body_text TEXT,
body_html TEXT,
snippet TEXT,
in_reply_to TEXT,
references_json TEXT DEFAULT '[]',
has_attachments INTEGER NOT NULL DEFAULT 0,
size_estimate INTEGER,
is_matched INTEGER NOT NULL DEFAULT 0,
match_status TEXT NOT NULL DEFAULT 'unmatched', -- unmatched|matched|skipped
raw_headers_json TEXT,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_emails_thread ON emails(gmail_thread_id);
CREATE INDEX IF NOT EXISTS idx_emails_rfc_thread ON emails(rfc_thread_root_id);
CREATE INDEX IF NOT EXISTS idx_emails_thread_fk ON emails(thread_id);
CREATE INDEX IF NOT EXISTS idx_emails_from ON emails(from_email);
CREATE INDEX IF NOT EXISTS idx_emails_sent_at ON emails(sent_at);
CREATE INDEX IF NOT EXISTS idx_emails_matched ON emails(is_matched, sent_at);
CREATE INDEX IF NOT EXISTS idx_emails_in_reply_to ON emails(in_reply_to);
-- ============================================================================
-- email_recipients — denormalized for fast address lookups
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_recipients (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
address TEXT NOT NULL,
display_name TEXT,
kind TEXT NOT NULL, -- from|to|cc|bcc|reply_to
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_email_recipients_addr ON email_recipients(address);
CREATE INDEX IF NOT EXISTS idx_email_recipients_email ON email_recipients(email_id);
-- ============================================================================
-- email_account_messages — per-mailbox sighting of an email
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_account_messages (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
account_id TEXT NOT NULL,
gmail_message_id TEXT NOT NULL,
gmail_thread_id TEXT NOT NULL,
labels_json TEXT DEFAULT '[]',
is_sent INTEGER NOT NULL DEFAULT 0,
first_seen_at TEXT DEFAULT (datetime('now')),
deleted_at TEXT,
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE,
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE,
UNIQUE(account_id, gmail_message_id)
);
CREATE INDEX IF NOT EXISTS idx_eam_email ON email_account_messages(email_id);
CREATE INDEX IF NOT EXISTS idx_eam_account ON email_account_messages(account_id);
CREATE INDEX IF NOT EXISTS idx_eam_gmail_msg ON email_account_messages(gmail_message_id);
-- ============================================================================
-- email_attachments — metadata; bytes on disk under data/email_attachments/
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_attachments (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
gmail_attachment_id TEXT NOT NULL,
filename TEXT NOT NULL,
sanitized_filename TEXT NOT NULL,
mime_type TEXT,
size_bytes INTEGER,
sha256_hex TEXT,
storage_path TEXT NOT NULL,
download_status TEXT NOT NULL DEFAULT 'pending', -- pending|downloaded|failed|skipped
download_attempts INTEGER NOT NULL DEFAULT 0,
download_error TEXT,
downloaded_at TEXT,
created_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_attach_email ON email_attachments(email_id);
CREATE INDEX IF NOT EXISTS idx_attach_sha ON email_attachments(sha256_hex);
CREATE INDEX IF NOT EXISTS idx_attach_status ON email_attachments(download_status);
-- ============================================================================
-- email_threads — thread roll-up for UI
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_threads (
id TEXT PRIMARY KEY,
gmail_thread_id TEXT,
rfc_thread_root_id TEXT,
subject_normalized TEXT,
first_message_at TEXT,
last_message_at TEXT,
message_count INTEGER NOT NULL DEFAULT 0,
participant_count INTEGER NOT NULL DEFAULT 0,
participants_json TEXT DEFAULT '[]',
is_matched INTEGER NOT NULL DEFAULT 0,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_threads_gmail_uniq ON email_threads(gmail_thread_id)
WHERE gmail_thread_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_threads_rfc_root ON email_threads(rfc_thread_root_id);
CREATE INDEX IF NOT EXISTS idx_threads_last_msg ON email_threads(last_message_at);
-- ============================================================================
-- email_investor_links — matched investors
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_investor_links (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
fundraising_investor_id TEXT,
fundraising_contact_id TEXT,
contact_id TEXT,
organization_id TEXT,
matched_address TEXT NOT NULL,
match_kind TEXT NOT NULL, -- exact_email|domain_match|manual
match_confidence REAL NOT NULL DEFAULT 1.0,
created_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_eil_email ON email_investor_links(email_id);
CREATE INDEX IF NOT EXISTS idx_eil_investor ON email_investor_links(fundraising_investor_id);
CREATE INDEX IF NOT EXISTS idx_eil_fr_contact ON email_investor_links(fundraising_contact_id);
CREATE INDEX IF NOT EXISTS idx_eil_contact ON email_investor_links(contact_id);
-- ============================================================================
-- email_sync_runs — per-run observability
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_sync_runs (
id TEXT PRIMARY KEY,
account_id TEXT NOT NULL,
kind TEXT NOT NULL, -- backfill|incremental|manual
started_at TEXT NOT NULL,
finished_at TEXT,
status TEXT NOT NULL, -- running|ok|error|partial
messages_seen INTEGER NOT NULL DEFAULT 0,
messages_stored INTEGER NOT NULL DEFAULT 0,
attachments_saved INTEGER NOT NULL DEFAULT 0,
api_calls INTEGER NOT NULL DEFAULT 0,
retries INTEGER NOT NULL DEFAULT 0,
error TEXT,
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_sync_runs_account ON email_sync_runs(account_id, started_at);
+283
View File
@@ -0,0 +1,283 @@
"""
Parse a Gmail `users.messages.get` response (format=full) into a flat dict
ready for db.insert_email().
Input shape (abbreviated):
{
"id": "...", # Gmail message id
"threadId": "...",
"labelIds": ["INBOX","IMPORTANT",...],
"snippet": "...",
"historyId": "...",
"internalDate": "1713657600000", # ms epoch, authoritative
"sizeEstimate": 12345,
"payload": {
"headers": [{"name":"Subject","value":"..."}, ...],
"mimeType": "multipart/mixed",
"parts": [...recursive...],
"body": {"data": "<base64url>", "size": ...}
}
}
"""
import base64
import email.utils
import email.header
import re
from datetime import datetime, timezone
from typing import Any, Iterable, Optional
from html.parser import HTMLParser
# ---------------------------------------------------------------------------- public
def parse(message: dict, *, owning_account_address: Optional[str] = None) -> dict:
"""Parse a Gmail message payload into our canonical dict shape."""
headers = _header_map(message.get("payload", {}).get("headers") or [])
from_name, from_email = _split_addr(headers.get("from", ""))
to_list = _parse_address_list(headers.get("to", ""))
cc_list = _parse_address_list(headers.get("cc", ""))
bcc_list = _parse_address_list(headers.get("bcc", ""))
reply_to = _split_addr(headers.get("reply-to", ""))[1] or None
sent_at = _parse_date_header(headers.get("date"), fallback_ms=message.get("internalDate"))
rfc_mid = headers.get("message-id", "").strip() or f"synthetic-{message.get('id')}@ten31.local"
rfc_mid = _strip_angle_brackets(rfc_mid)
in_reply_to = _strip_angle_brackets(headers.get("in-reply-to", "").strip()) or None
references = _split_references(headers.get("references", ""))
rfc_thread_root_id = references[0] if references else (in_reply_to or rfc_mid)
body_text, body_html, attachments = _walk_payload(message.get("payload", {}))
subject = _decode_rfc2047(headers.get("subject") or "")
labels = message.get("labelIds") or []
is_sent = "SENT" in labels
return {
"gmail_message_id": message.get("id"),
"gmail_thread_id": message.get("threadId"),
"rfc_message_id": rfc_mid,
"rfc_thread_root_id": rfc_thread_root_id,
"in_reply_to": in_reply_to,
"references": references,
"subject": subject,
"from_email": (from_email or "").lower(),
"from_name": from_name,
"to": [{"email": e.lower(), "name": n} for n, e in to_list if e],
"cc": [{"email": e.lower(), "name": n} for n, e in cc_list if e],
"bcc": [{"email": e.lower(), "name": n} for n, e in bcc_list if e],
"reply_to": reply_to.lower() if reply_to else None,
"sent_at": sent_at,
"body_text": _cap_text(body_text),
"body_html": _cap_text(body_html),
"snippet": message.get("snippet"),
"attachments": attachments,
"size_estimate": message.get("sizeEstimate"),
"labels": labels,
"is_sent": is_sent,
"raw_headers": headers,
"owning_account": owning_account_address,
}
# ---------------------------------------------------------------------------- headers
def _header_map(header_list: Iterable[dict]) -> dict[str, str]:
"""Case-insensitive keys. Last-write-wins for duplicates (rare)."""
out: dict[str, str] = {}
for h in header_list:
name = (h.get("name") or "").lower()
out[name] = h.get("value") or ""
return out
def _decode_rfc2047(s: str) -> str:
if not s:
return ""
try:
parts = email.header.decode_header(s)
pieces = []
for text, charset in parts:
if isinstance(text, bytes):
try:
pieces.append(text.decode(charset or "utf-8", errors="replace"))
except LookupError:
pieces.append(text.decode("utf-8", errors="replace"))
else:
pieces.append(text)
return "".join(pieces)
except Exception:
return s
def _split_addr(raw: str) -> tuple[Optional[str], Optional[str]]:
if not raw:
return (None, None)
name, addr = email.utils.parseaddr(raw)
return (_decode_rfc2047(name) or None, addr or None)
def _parse_address_list(raw: str) -> list[tuple[Optional[str], Optional[str]]]:
if not raw:
return []
parsed = email.utils.getaddresses([raw])
return [(_decode_rfc2047(n) or None, a or None) for n, a in parsed if a]
def _parse_date_header(raw: Optional[str], *, fallback_ms: Optional[str]) -> str:
# Prefer RFC Date header, fall back to Gmail internalDate (epoch ms).
if raw:
try:
dt = email.utils.parsedate_to_datetime(raw)
if dt is not None:
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
except (TypeError, ValueError):
pass
if fallback_ms:
try:
dt = datetime.fromtimestamp(int(fallback_ms) / 1000.0, tz=timezone.utc)
return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
except (TypeError, ValueError):
pass
return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
def _split_references(raw: str) -> list[str]:
if not raw:
return []
return [_strip_angle_brackets(p) for p in raw.split() if p.strip()]
def _strip_angle_brackets(s: str) -> str:
s = (s or "").strip()
if s.startswith("<") and s.endswith(">"):
return s[1:-1]
return s
# ---------------------------------------------------------------------------- MIME walk
def _walk_payload(payload: dict) -> tuple[Optional[str], Optional[str], list[dict]]:
"""Returns (body_text, body_html, attachments).
Depth-first walk. First plain/text wins for body_text; first text/html
wins for body_html. Anything with a filename or attachment disposition
becomes an attachment entry.
"""
text: Optional[str] = None
html_body: Optional[str] = None
attachments: list[dict] = []
def visit(part: dict):
nonlocal text, html_body
mime = (part.get("mimeType") or "").lower()
filename = part.get("filename") or ""
body = part.get("body") or {}
parts = part.get("parts") or []
headers = _header_map(part.get("headers") or [])
disposition = (headers.get("content-disposition") or "").lower()
is_attachment = bool(filename) or disposition.startswith("attachment")
if is_attachment:
attachments.append({
"filename": filename or f"unnamed.{_ext_for(mime)}",
"mime_type": mime or "application/octet-stream",
"size": body.get("size"),
"gmail_attachment_id": body.get("attachmentId"),
# Some tiny attachments come inlined as base64; attachmentId is
# then missing and data is in body.data. sync.py handles both.
"inline_data_b64": body.get("data"),
"content_disposition": "inline" if disposition.startswith("inline") else "attachment",
})
else:
if mime == "text/plain" and text is None:
text = _decode_body(body)
elif mime == "text/html" and html_body is None:
html_body = _decode_body(body)
for child in parts:
visit(child)
visit(payload)
# Derive a plain-text body from HTML if only HTML exists.
if text is None and html_body:
text = _strip_html(html_body)
return text, html_body, attachments
def _decode_body(body: dict) -> Optional[str]:
data = body.get("data")
if not data:
return None
try:
padding = 4 - (len(data) % 4)
if padding != 4:
data = data + ("=" * padding)
raw = base64.urlsafe_b64decode(data.encode("ascii"))
return raw.decode("utf-8", errors="replace").replace("\r\n", "\n")
except Exception:
return None
# ---------------------------------------------------------------------------- HTML stripping
class _HTMLToText(HTMLParser):
def __init__(self):
super().__init__()
self._parts: list[str] = []
self._skip_depth = 0
def handle_starttag(self, tag, attrs):
if tag in ("script", "style"):
self._skip_depth += 1
if tag in ("br", "p", "div", "tr", "li"):
self._parts.append("\n")
def handle_endtag(self, tag):
if tag in ("script", "style"):
self._skip_depth = max(0, self._skip_depth - 1)
if tag in ("p", "div", "tr"):
self._parts.append("\n")
def handle_data(self, data):
if self._skip_depth == 0:
self._parts.append(data)
def text(self) -> str:
raw = "".join(self._parts)
return re.sub(r"\n{3,}", "\n\n", raw).strip()
def _strip_html(html: str) -> str:
p = _HTMLToText()
try:
p.feed(html)
return p.text()
except Exception:
return re.sub(r"<[^>]+>", " ", html)
def _ext_for(mime: str) -> str:
return mime.split("/")[-1] if "/" in mime else "bin"
# ---------------------------------------------------------------------------- caps
# Keep bodies bounded to avoid a pathological 500MB message exploding the DB.
_BODY_CAP_BYTES = 10 * 1024 * 1024 # 10MB
def _cap_text(s: Optional[str]) -> Optional[str]:
if s is None:
return None
if len(s.encode("utf-8", errors="ignore")) <= _BODY_CAP_BYTES:
return s
return s[: _BODY_CAP_BYTES // 2] + "\n\n[TRUNCATED BY CRM — body exceeded 10MB]"
+462
View File
@@ -0,0 +1,462 @@
"""
HTTP route handlers for the Gmail integration.
Designed to plug into server.py's CRMHandler (BaseHTTPRequestHandler) pattern.
The hook is a single function call near the top of do_GET / do_POST that
lets this module claim any /api/email/* request:
# in CRMHandler.do_GET and CRMHandler.do_POST, before the 404 fallthrough:
from email_integration.routes import try_handle
if try_handle(self):
return
`try_handle(handler)` inspects `handler.command` and `handler.get_path()` and
returns True if it handled the request (sent a response).
Every handler respects the same auth / rate-limit model as the rest of server.py
by calling handler.get_user() and handler.rate_limited(...).
"""
import json
import sqlite3
from typing import Optional
from . import config as _cfg
from . import credentials as _creds
from . import crypto as _crypto
from . import db as _db
from . import scheduler as _sched
# ---------------------------------------------------------------------------- dispatch
_GET_ROUTES = {
"/api/email/status": "status",
"/api/email/accounts": "list_accounts",
"/api/email/threads": "list_threads",
"/api/email/oauth/start": "oauth_start",
"/api/email/oauth/callback": "oauth_callback",
}
_POST_ROUTES = {
"/api/email/accounts/enroll-all": "enroll_all",
"/api/email/accounts/enroll": "enroll_one",
"/api/email/sync/run-now": "run_now",
"/api/email/rematch": "rematch",
}
def try_handle(handler) -> bool:
path = handler.get_path()
method = handler.command
table = _GET_ROUTES if method == "GET" else _POST_ROUTES if method == "POST" else {}
name = table.get(path)
if not path.startswith("/api/email/"):
return False
if not name:
# Route is owned by this module but unknown — return a proper 404
# instead of letting the main dispatcher's 404 abuse counter fire.
handler.send_error_json("Not found", 404)
return True
if not _cfg.CONFIG.enabled:
handler.send_error_json("Email integration disabled", 503)
return True
# Also enforce attachment streaming under a different prefix
# (handled above via prefix check).
impl = globals().get(f"_h_{name}")
if impl is None:
handler.send_error_json("Not implemented", 500)
return True
try:
impl(handler)
except Exception as e:
handler.send_error_json(f"Internal error: {e}", 500)
return True
# ---------------------------------------------------------------------------- helpers
def _conn() -> sqlite3.Connection:
import os
db_path = os.environ.get(
"CRM_DB_PATH",
os.path.join(_cfg.CONFIG.data_dir, "crm.db"),
)
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA foreign_keys=ON")
conn.execute("PRAGMA busy_timeout=5000")
conn.row_factory = sqlite3.Row
return conn
def _require_auth(handler) -> Optional[dict]:
user = handler.get_user()
if not user:
handler.send_error_json("Unauthorized", 401)
return None
return user
def _require_admin(handler) -> Optional[dict]:
user = _require_auth(handler)
if user is None:
return None
if user.get("role") != "admin":
handler.send_error_json("Admin required", 403)
return None
return user
# ---------------------------------------------------------------------------- GET handlers
def _h_status(handler):
user = _require_auth(handler)
if not user:
return
snap = _sched.status_snapshot()
conn = _conn()
try:
cur = conn.cursor()
cur.execute(
"SELECT COUNT(*) AS n_accounts, "
"SUM(CASE WHEN sync_status='active' THEN 1 ELSE 0 END) AS n_active, "
"SUM(CASE WHEN sync_status='error' THEN 1 ELSE 0 END) AS n_error "
"FROM email_accounts"
)
counts = dict(cur.fetchone() or {})
cur.execute("SELECT COUNT(*) AS n FROM emails WHERE match_status = 'matched'")
snap["matched_emails"] = cur.fetchone()["n"]
finally:
conn.close()
snap["accounts_summary"] = counts
handler.send_json(snap)
def _h_list_accounts(handler):
user = _require_auth(handler)
if not user:
return
conn = _conn()
try:
cur = conn.cursor()
cur.execute(
"SELECT id, user_id, email_address, auth_method, sync_enabled, "
"sync_status, sync_error, last_synced_at, backfill_complete "
"FROM email_accounts ORDER BY email_address"
)
rows = [dict(r) for r in cur.fetchall()]
finally:
conn.close()
# Non-admins only see their own row
if user.get("role") != "admin":
rows = [r for r in rows if r["user_id"] == user["user_id"]]
handler.send_json({"accounts": rows})
def _h_list_threads(handler):
user = _require_auth(handler)
if not user:
return
q = handler.get_query_params()
investor_id = q.get("investor_id")
limit = min(int(q.get("limit", 50)), 500)
conn = _conn()
try:
cur = conn.cursor()
if investor_id:
cur.execute(
"""SELECT t.*
FROM email_threads t
JOIN emails e ON e.thread_id = t.id
JOIN email_investor_links l ON l.email_id = e.id
WHERE l.fundraising_investor_id = ?
OR l.fundraising_contact_id IN (
SELECT id FROM fundraising_contacts WHERE investor_id = ?
)
GROUP BY t.id
ORDER BY t.last_message_at DESC
LIMIT ?""",
(investor_id, investor_id, limit),
)
else:
cur.execute(
"SELECT * FROM email_threads WHERE is_matched = 1 "
"ORDER BY last_message_at DESC LIMIT ?",
(limit,),
)
threads = [dict(r) for r in cur.fetchall()]
finally:
conn.close()
handler.send_json({"threads": threads})
def _h_oauth_start(handler):
"""Begin per-user OAuth consent flow (fallback path)."""
user = _require_auth(handler)
if not user:
return
if _cfg.CONFIG.primary_auth != "oauth":
return handler.send_error_json(
"Per-user OAuth disabled (set CRM_GMAIL_AUTH_METHOD=oauth to enable)", 400
)
q = handler.get_query_params()
account_email = q.get("account_email") or ""
if not account_email:
return handler.send_error_json("account_email required", 400)
import secrets
import urllib.parse
state = secrets.token_urlsafe(32)
_oauth_state_store(state, user["user_id"], account_email)
params = {
"client_id": _cfg.CONFIG.oauth_client_id,
"redirect_uri": _cfg.CONFIG.oauth_redirect_uri,
"response_type": "code",
"scope": _creds.GMAIL_READONLY_SCOPE,
"access_type": "offline",
"prompt": "consent",
"state": state,
"login_hint": account_email,
}
url = "https://accounts.google.com/o/oauth2/v2/auth?" + urllib.parse.urlencode(params)
handler.send_json({"redirect_url": url})
def _h_oauth_callback(handler):
"""Exchange code for tokens, encrypt refresh token, store."""
q = handler.get_query_params()
code = q.get("code")
state = q.get("state")
if not code or not state:
return handler.send_error_json("code and state required", 400)
state_row = _oauth_state_consume(state)
if not state_row:
return handler.send_error_json("Invalid state", 400)
import urllib.parse
import urllib.request
body = urllib.parse.urlencode({
"code": code,
"client_id": _cfg.CONFIG.oauth_client_id,
"client_secret": _cfg.CONFIG.oauth_client_secret,
"redirect_uri": _cfg.CONFIG.oauth_redirect_uri,
"grant_type": "authorization_code",
}).encode("ascii")
req = urllib.request.Request(
"https://oauth2.googleapis.com/token",
data=body,
headers={"Content-Type": "application/x-www-form-urlencoded"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
payload = json.loads(resp.read())
except Exception as e:
return handler.send_error_json(f"Token exchange failed: {e}", 500)
refresh = payload.get("refresh_token")
if not refresh:
return handler.send_error_json("No refresh_token returned (user may have previously consented; prompt=consent required)", 400)
enc = _crypto.encrypt(refresh.encode("ascii"), secret_key_b64=_cfg.CONFIG.secret_key_b64)
conn = _conn()
try:
_db.upsert_account(conn, user_id=state_row["user_id"],
email_address=state_row["account_email"],
auth_method="oauth")
conn.execute(
"UPDATE email_accounts SET oauth_refresh_enc = ?, sync_status = 'pending', "
"updated_at = datetime('now') WHERE email_address = ?",
(enc, state_row["account_email"]),
)
conn.commit()
finally:
conn.close()
handler.send_json({"ok": True, "account_email": state_row["account_email"]})
# ---------------------------------------------------------------------------- POST handlers
def _h_enroll_all(handler):
"""Admin: enroll every CRM user whose email is @workspace_domain via DWD."""
user = _require_admin(handler)
if not user:
return
if _cfg.CONFIG.primary_auth != "dwd":
return handler.send_error_json("enroll-all only valid in DWD mode", 400)
domain = _cfg.CONFIG.workspace_domain
if not domain:
return handler.send_error_json("CRM_GMAIL_WORKSPACE_DOMAIN not set", 400)
conn = _conn()
try:
cur = conn.cursor()
cur.execute(
"SELECT id, email FROM users WHERE is_active = 1 AND email LIKE ?",
(f"%@{domain}",),
)
users = cur.fetchall()
created = []
for u in users:
aid = _db.upsert_account(conn, user_id=u["id"],
email_address=u["email"].lower(),
auth_method="dwd")
created.append({"account_id": aid, "email": u["email"]})
conn.commit()
finally:
conn.close()
handler.send_json({"enrolled": created, "count": len(created)})
def _h_enroll_one(handler):
user = _require_admin(handler)
if not user:
return
body = handler.get_body() or {}
# Accept either `email` or `email_address` for ergonomics.
email_address = (body.get("email_address") or body.get("email") or "").lower().strip()
user_id = body.get("user_id")
auth_method = body.get("auth_method") or _cfg.CONFIG.primary_auth
if not email_address:
return handler.send_error_json("email (or email_address) required", 400)
# If the caller didn't specify a CRM user_id, resolve it from the
# users table by matching email. Falls back to the authenticated
# admin's own id (handles the common case of a single admin
# enrolling themselves without having to paste their UUID).
if not user_id:
conn = _conn()
try:
cur = conn.cursor()
cur.execute("SELECT id FROM users WHERE LOWER(email) = ?",
(email_address,))
row = cur.fetchone()
user_id = row["id"] if row else user.get("id")
finally:
conn.close()
if not user_id:
return handler.send_error_json("could not resolve user_id for that email", 400)
conn = _conn()
try:
aid = _db.upsert_account(conn, user_id=user_id,
email_address=email_address,
auth_method=auth_method)
conn.commit()
finally:
conn.close()
handler.send_json({"account_id": aid, "email": email_address, "user_id": user_id})
def _h_run_now(handler):
user = _require_admin(handler)
if not user:
return
# Reuse existing rate limit so admins can't hammer this.
if handler.rate_limited("email-sync-now", 6):
return handler.send_error_json("Too many requests", 429)
result = _sched.trigger_run_now()
handler.send_json(result)
def _h_rematch(handler):
"""Re-evaluate unmatched emails against the current investor index."""
user = _require_admin(handler)
if not user:
return
body = handler.get_body() or {}
since = body.get("since") # optional ISO8601
conn = _conn()
scanned = 0
matched = 0
try:
from .matcher import InvestorIndex
index = InvestorIndex(own_domain=_cfg.CONFIG.workspace_domain)
index.rebuild(_conn)
cur = conn.cursor()
sql = ("SELECT id, from_email, to_emails_json, cc_emails_json "
"FROM emails WHERE match_status = 'unmatched'")
params: list = []
if since:
sql += " AND sent_at >= ?"
params.append(since)
sql += " ORDER BY sent_at DESC LIMIT 10000"
cur.execute(sql, params)
for row in cur.fetchall():
scanned += 1
participants = set()
if row["from_email"]:
participants.add(row["from_email"].lower())
for col in ("to_emails_json", "cc_emails_json"):
try:
arr = json.loads(row[col] or "[]")
except Exception:
arr = []
for a in arr:
e = a.get("email") if isinstance(a, dict) else a
if e:
participants.add(e.lower())
links = index.match(participants)
if not links:
continue
matched += 1
conn.execute(
"UPDATE emails SET match_status='matched', is_matched=1, "
"updated_at=datetime('now') WHERE id=?",
(row["id"],),
)
for link in links:
_db.insert_investor_link(conn, email_id=row["id"], link={
"matched_address": link.matched_address,
"match_kind": link.match_kind,
"match_confidence": link.match_confidence,
"fundraising_investor_id": link.target.fundraising_investor_id,
"fundraising_contact_id": link.target.fundraising_contact_id,
"contact_id": link.target.contact_id,
"organization_id": link.target.organization_id,
})
# NOTE: body is still missing — we only have headers. A follow-up
# job can re-fetch the full message from Gmail using the sighting's
# gmail_message_id. Not done inline to keep this endpoint fast.
conn.commit()
finally:
conn.close()
handler.send_json({"scanned": scanned, "newly_matched": matched})
# ---------------------------------------------------------------------------- OAuth state store (in-memory)
# For a 5-person CRM the state store doesn't need to be durable — a server
# restart between start and callback is rare and just requires a retry.
_oauth_states: dict[str, dict] = {}
_oauth_state_lock = __import__("threading").Lock()
def _oauth_state_store(state: str, user_id: str, account_email: str) -> None:
import time
with _oauth_state_lock:
# Prune stale entries (>10 min).
cutoff = time.time() - 600
for k, v in list(_oauth_states.items()):
if v["created"] < cutoff:
_oauth_states.pop(k, None)
_oauth_states[state] = {
"user_id": user_id,
"account_email": account_email.lower().strip(),
"created": time.time(),
}
def _oauth_state_consume(state: str) -> Optional[dict]:
with _oauth_state_lock:
return _oauth_states.pop(state, None)
+143
View File
@@ -0,0 +1,143 @@
"""
Background sync scheduler.
Runs as a daemon thread started from server.py main(). One thread; it wakes
every `sync_interval_sec`, processes all accounts serially, sleeps again.
Singleton: start_sync_scheduler() is idempotent — calling twice won't spawn
a second thread. stop_sync_scheduler() gracefully signals shutdown (not
strictly needed since it's daemon, but useful for tests).
"""
import logging
import sqlite3
import threading
import time
from typing import Callable, Optional
from . import config as _cfg
from . import credentials as _creds
from . import sync as _sync
from .matcher import InvestorIndex
log = logging.getLogger("email_integration.scheduler")
_state: dict[str, object] = {
"thread": None,
"stop": threading.Event(),
"last_run": 0.0,
"last_result": None,
"running_now": False,
}
def _conn_factory_from_env() -> Callable[[], sqlite3.Connection]:
"""Build a get_db() compatible with server.py's pattern.
We don't import server.py (avoid circular / startup ordering). Instead
we re-implement the same settings. If server.py's DB path differs from
the default, CRM_DB_PATH env var should be set — same mechanism.
"""
import os
db_path = os.environ.get(
"CRM_DB_PATH",
os.path.join(_cfg.CONFIG.data_dir, "crm.db"),
)
def get_db() -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA foreign_keys=ON")
conn.execute("PRAGMA busy_timeout=5000")
conn.row_factory = sqlite3.Row
return conn
return get_db
def start_sync_scheduler(conn_factory: Optional[Callable] = None) -> None:
if _state["thread"] is not None:
return # already running
if not _cfg.CONFIG.enabled:
log.info("email_integration not enabled; scheduler will not start")
return
factory = conn_factory or _conn_factory_from_env()
try:
provider = _creds.build_provider(factory)
except Exception as e:
log.exception("cannot build credential provider: %s", e)
return
index = InvestorIndex(own_domain=_cfg.CONFIG.workspace_domain)
try:
index.rebuild(factory)
except Exception:
log.exception("initial investor-index build failed; scheduler continues")
stop = threading.Event()
_state["stop"] = stop
def _loop():
log.info("email sync scheduler started; interval=%ss", _cfg.CONFIG.sync_interval_sec)
# First cycle: short delay to let server finish startup.
if stop.wait(10):
return
while not stop.is_set():
_state["running_now"] = True
t0 = time.time()
try:
result = _sync.sync_all(factory, provider, index)
_state["last_result"] = result
except Exception:
log.exception("sync loop crashed; will retry next cycle")
finally:
_state["running_now"] = False
_state["last_run"] = t0
if stop.wait(_cfg.CONFIG.sync_interval_sec):
return
t = threading.Thread(target=_loop, name="email-sync", daemon=True)
t.start()
_state["thread"] = t
_state["provider"] = provider
_state["index"] = index
_state["factory"] = factory
def stop_sync_scheduler() -> None:
ev: threading.Event = _state["stop"] # type: ignore
ev.set()
t = _state.get("thread")
if t:
try:
t.join(timeout=5)
except Exception:
pass
_state["thread"] = None
def trigger_run_now() -> dict:
"""Force a single sync pass synchronously (admin 'sync now' endpoint)."""
if _state.get("running_now"):
return {"status": "already_running"}
factory = _state.get("factory")
provider = _state.get("provider")
index = _state.get("index")
if not (factory and provider and index):
return {"status": "not_initialized"}
return _sync.sync_all(factory, provider, index) # type: ignore
def status_snapshot() -> dict:
return {
"enabled": _cfg.CONFIG.enabled,
"running": _state["running_now"],
"last_run_unix": _state.get("last_run"),
"last_result": _state.get("last_result"),
"interval_sec": _cfg.CONFIG.sync_interval_sec,
}
+390
View File
@@ -0,0 +1,390 @@
"""
Sync orchestrator.
Top-level entry points:
sync_account(conn_factory, credential_provider, account_row, matcher)
Full sync pass for one mailbox. Decides backfill vs. incremental based
on email_accounts.backfill_complete. Writes a sync_runs row.
sync_all(conn_factory, credential_provider, matcher)
Iterates every sync-enabled account sequentially. Called from
scheduler.py every CRM_GMAIL_SYNC_INTERVAL_MIN minutes.
Design: match-only storage (see architecture doc §7). For each message:
1. Fetch metadata (cheap, 5 units).
2. Run matcher against participant addresses.
3. If matched → fetch full message, parse, persist body + register attachments.
4. If unmatched → persist header-only row.
5. In both cases, record the per-account sighting.
"""
import logging
import sqlite3
import traceback
from typing import Optional
from . import attachments as _attach
from . import config as _cfg
from . import db as _db
from . import errors as _errors
from . import gmail_client as _gmail
from . import parser as _parser
from . import threads as _threads
from .matcher import InvestorIndex, InvestorLink
log = logging.getLogger("email_integration.sync")
METADATA_HEADERS = [
"From", "To", "Cc", "Bcc", "Subject", "Date",
"Message-ID", "In-Reply-To", "References", "Reply-To",
]
# ---------------------------------------------------------------------------- public
def sync_all(conn_factory, credential_provider, index: InvestorIndex) -> dict:
"""Run one pass across all enabled accounts. Returns summary stats."""
index.rebuild_if_stale(conn_factory)
conn = conn_factory()
try:
accounts = _db.list_sync_ready_accounts(conn)
finally:
conn.close()
totals = {"accounts": 0, "messages_stored": 0, "errors": 0}
for acc in accounts:
totals["accounts"] += 1
try:
stats = sync_account(conn_factory, credential_provider, acc, index)
totals["messages_stored"] += stats.get("messages_stored", 0)
except Exception:
totals["errors"] += 1
log.exception("sync failed for account %s", acc["email_address"])
return totals
def sync_account(conn_factory, credential_provider, account,
index: InvestorIndex) -> dict:
"""Sync a single mailbox. Returns stats dict."""
email_addr = account["email_address"]
stats = _gmail.CallStats()
client = _gmail.GmailClient(credential_provider, email_addr, stats=stats)
# Mark running
conn = conn_factory()
try:
run_id = _db.start_sync_run(conn,
account_id=account["id"],
kind="backfill" if not account["backfill_complete"] else "incremental")
_db.set_account_status(conn, account["id"], status="active", error=None)
conn.commit()
finally:
conn.close()
run_stats = {"messages_seen": 0, "messages_stored": 0, "attachments_saved": 0}
error_str: Optional[str] = None
status = "ok"
try:
if not account["backfill_complete"]:
_run_backfill(conn_factory, client, account, index, run_stats)
else:
_run_incremental(conn_factory, client, account, index, run_stats)
# Drain attachments for this account.
conn = conn_factory()
try:
# Limit to a few cycles' worth of attachments per pass.
batched = _attach.drain_pending(conn_factory, client, account["id"], limit=100)
run_stats["attachments_saved"] = batched
finally:
conn.close()
except _errors.AuthError as e:
error_str = f"auth: {e}"
status = "error"
except _errors.HistoryExpiredError:
# Recover: reset to date-based backfill from last_synced_at.
error_str = "history expired; fallback to date backfill"
status = "partial"
_fallback_date_backfill(conn_factory, client, account, index, run_stats)
except Exception as e:
error_str = f"unexpected: {type(e).__name__}: {e}"
status = "error"
log.exception("unexpected during sync of %s", email_addr)
finally:
run_stats["api_calls"] = stats.api_calls
run_stats["retries"] = stats.retries
conn = conn_factory()
try:
_db.finish_sync_run(conn, run_id, status=status, stats=run_stats, error=error_str)
_db.set_account_status(conn, account["id"],
status="active" if status == "ok" else status,
error=error_str)
_db.set_account_checkpoint(conn, account["id"],
last_synced_at=_db._now_iso())
conn.commit()
finally:
conn.close()
return run_stats
# ---------------------------------------------------------------------------- backfill
def _run_backfill(conn_factory, client, account, index: InvestorIndex,
run_stats: dict) -> None:
"""Initial full-mailbox backfill, resumable via backfill_cursor."""
page_token = account["backfill_cursor"]
while True:
resp = client.list_messages(page_token=page_token,
max_results=_cfg.CONFIG.backfill_page_size)
messages = resp.get("messages") or []
for m in messages:
run_stats["messages_seen"] += 1
try:
_process_one_message(conn_factory, client, account, index,
gmail_message_id=m["id"], run_stats=run_stats)
except _errors.GmailError as e:
log.warning("skip msg %s on %s: %s", m["id"], account["email_address"], e)
continue
page_token = resp.get("nextPageToken")
conn = conn_factory()
try:
_db.set_account_checkpoint(conn, account["id"],
backfill_cursor=page_token,
backfill_complete=(not page_token))
conn.commit()
finally:
conn.close()
if not page_token:
# Capture current historyId as checkpoint for future incrementals.
prof = client.get_profile()
hid = prof.get("historyId")
if hid:
conn = conn_factory()
try:
_db.set_account_checkpoint(conn, account["id"], history_id=str(hid))
conn.commit()
finally:
conn.close()
return
# ---------------------------------------------------------------------------- incremental
def _run_incremental(conn_factory, client, account, index: InvestorIndex,
run_stats: dict) -> None:
start_hid = account["last_history_id"]
if not start_hid:
# Safety: if checkpoint is missing, re-enter backfill.
_run_backfill(conn_factory, client, account, index, run_stats)
return
# history_types filter limits bandwidth to what we care about.
new_hid: Optional[str] = None
try:
for h in client.iter_history(
start_history_id=start_hid,
history_types=["messageAdded", "messageDeleted", "labelAdded", "labelRemoved"],
):
for ma in h.get("messagesAdded") or []:
msg = ma.get("message") or {}
run_stats["messages_seen"] += 1
try:
_process_one_message(conn_factory, client, account, index,
gmail_message_id=msg.get("id"),
run_stats=run_stats)
except _errors.GmailError as e:
log.warning("skip msg %s on %s: %s", msg.get("id"), account["email_address"], e)
for md in h.get("messagesDeleted") or []:
msg = md.get("message") or {}
conn = conn_factory()
try:
_db.tombstone_sighting(
conn,
account_id=account["id"],
gmail_message_id=msg.get("id"),
)
conn.commit()
finally:
conn.close()
for la in (h.get("labelsAdded") or []) + (h.get("labelsRemoved") or []):
msg = la.get("message") or {}
# labels are the resulting label set in Gmail's payload after
# the change. We refresh them wholesale.
labels = msg.get("labelIds") or []
conn = conn_factory()
try:
_db.update_sighting_labels(
conn,
account_id=account["id"],
gmail_message_id=msg.get("id"),
labels=labels,
)
conn.commit()
finally:
conn.close()
new_hid = client.last_history_id
except _errors.HistoryExpiredError:
raise
if new_hid:
conn = conn_factory()
try:
_db.set_account_checkpoint(conn, account["id"], history_id=str(new_hid))
conn.commit()
finally:
conn.close()
def _fallback_date_backfill(conn_factory, client, account, index, run_stats):
"""Used when startHistoryId has been pruned by Gmail.
Pulls everything since last_synced_at (or 14d if unknown), which will
hit a large overlap with existing data but upserts are idempotent.
"""
from datetime import datetime, timedelta, timezone
since = account["last_synced_at"] or (
datetime.now(tz=timezone.utc) - timedelta(days=14)
).strftime("%Y-%m-%dT%H:%M:%SZ")
q = f"after:{since.replace('-', '/').split('T')[0]}"
for m in client.iter_messages(q=q):
run_stats["messages_seen"] += 1
try:
_process_one_message(conn_factory, client, account, index,
gmail_message_id=m["id"], run_stats=run_stats)
except _errors.GmailError as e:
log.warning("skip during date-backfill msg %s: %s", m["id"], e)
prof = client.get_profile()
hid = prof.get("historyId")
if hid:
conn = conn_factory()
try:
_db.set_account_checkpoint(conn, account["id"], history_id=str(hid))
conn.commit()
finally:
conn.close()
# ---------------------------------------------------------------------------- per-message
def _process_one_message(conn_factory, client, account, index: InvestorIndex,
*, gmail_message_id: str, run_stats: dict) -> None:
"""Fetch, match, persist one message. Idempotent."""
if not gmail_message_id:
return
# Skip if we've already sighted this message for this account.
conn = conn_factory()
try:
cur = conn.cursor()
cur.execute(
"SELECT email_id FROM email_account_messages "
"WHERE account_id = ? AND gmail_message_id = ?",
(account["id"], gmail_message_id),
)
if cur.fetchone():
return
finally:
conn.close()
# 1. Metadata fetch (cheap).
meta = client.get_message(gmail_message_id, format="metadata",
metadata_headers=METADATA_HEADERS)
meta_parsed = _parser.parse(meta, owning_account_address=account["email_address"])
participants = set()
if meta_parsed.get("from_email"):
participants.add(meta_parsed["from_email"])
for kind in ("to", "cc", "bcc"):
for a in meta_parsed.get(kind, []):
if isinstance(a, dict) and a.get("email"):
participants.add(a["email"])
# Exclude owning account's own address so we don't try to "match" ourselves.
own = {account["email_address"].lower()}
links = index.match(participants, exclude_addresses=own)
is_matched = bool(links)
# 2. If matched, fetch full and parse for body + attachments.
if is_matched:
full = client.get_message(gmail_message_id, format="full")
parsed = _parser.parse(full, owning_account_address=account["email_address"])
else:
parsed = meta_parsed
# Strip any body fields (metadata fetch shouldn't have them but be safe).
parsed["body_text"] = None
parsed["body_html"] = None
parsed["attachments"] = []
# 3. Persist (idempotent on rfc_message_id).
conn = conn_factory()
try:
existing = _db.find_email_by_rfc_id(conn, parsed["rfc_message_id"])
if existing:
email_id = existing["id"]
# If the email was previously unmatched but now matches (e.g. user
# added the investor after first sight), upgrade the row.
if is_matched and existing["match_status"] == "unmatched":
conn.execute(
"UPDATE emails SET match_status = 'matched', is_matched = 1, "
"body_text = ?, body_html = ?, updated_at = datetime('now') "
"WHERE id = ?",
(parsed.get("body_text"), parsed.get("body_html"), email_id),
)
_attach.register_stubs(conn,
email_id=email_id,
parsed_attachments=parsed.get("attachments") or [])
for link in links:
_db.insert_investor_link(conn, email_id=email_id, link=_flatten_link(link))
else:
match_status = "matched" if is_matched else "unmatched"
email_id = _db.insert_email(conn, parsed=parsed, match_status=match_status)
thread_id = _threads.resolve_thread_id(conn, parsed)
_db.set_email_thread(conn, email_id, thread_id)
if is_matched:
_attach.register_stubs(conn,
email_id=email_id,
parsed_attachments=parsed.get("attachments") or [])
for link in links:
_db.insert_investor_link(conn, email_id=email_id, link=_flatten_link(link))
_db.rollup_thread(conn, thread_id)
run_stats["messages_stored"] += 1
# Record sighting (always, even if email row was pre-existing).
_db.upsert_sighting(
conn,
email_id=email_id,
account_id=account["id"],
gmail_message_id=gmail_message_id,
gmail_thread_id=parsed.get("gmail_thread_id") or "",
labels=parsed.get("labels", []),
is_sent=parsed.get("is_sent", False),
)
conn.commit()
except sqlite3.IntegrityError:
# Concurrent insert race — re-read and proceed.
pass
finally:
conn.close()
def _flatten_link(link: InvestorLink) -> dict:
return {
"matched_address": link.matched_address,
"match_kind": link.match_kind,
"match_confidence": link.match_confidence,
"fundraising_investor_id": link.target.fundraising_investor_id,
"fundraising_contact_id": link.target.fundraising_contact_id,
"contact_id": link.target.contact_id,
"organization_id": link.target.organization_id,
}
+75
View File
@@ -0,0 +1,75 @@
"""
Threading resolution.
Given a freshly-inserted emails row (or its about-to-be-inserted parsed dict),
figure out which email_threads row it belongs to. If none exists, create one.
Priority order (see architecture doc §10):
1. Existing email in our DB that shares any RFC Message-ID with this one's
References/In-Reply-To chain — inherit its thread.
2. Existing thread with the same gmail_thread_id.
3. Existing thread with the same rfc_thread_root_id.
4. Create a new thread.
"""
import re
import sqlite3
from typing import Optional
from . import db as _db
SUBJECT_PREFIX_RE = re.compile(r"^\s*(re|fwd?|aw|sv|antw|回复|fw)\s*:\s*", re.IGNORECASE)
def normalize_subject(s: Optional[str]) -> Optional[str]:
if not s:
return None
out = s
# Strip up to 5 nested Re:/Fwd: prefixes.
for _ in range(5):
new = SUBJECT_PREFIX_RE.sub("", out, count=1)
if new == out:
break
out = new
return out.strip().lower()
def resolve_thread_id(conn: sqlite3.Connection, parsed: dict) -> str:
"""Returns a thread_id — either an existing one or a newly created one."""
# Step 1: RFC cross-link.
candidates = list(parsed.get("references") or [])
if parsed.get("in_reply_to"):
candidates.append(parsed["in_reply_to"])
if candidates:
existing_email_id = _db.find_email_id_by_any_rfc_id(conn, candidates)
if existing_email_id:
cur = conn.cursor()
cur.execute("SELECT thread_id FROM emails WHERE id = ?", (existing_email_id,))
row = cur.fetchone()
if row and row["thread_id"]:
return row["thread_id"]
# Step 2: gmail_thread_id match.
gt = parsed.get("gmail_thread_id")
if gt:
existing = _db.find_thread_by_gmail_id(conn, gt)
if existing:
return existing["id"]
# Step 3: RFC thread-root match.
rfc_root = parsed.get("rfc_thread_root_id")
if rfc_root:
existing = _db.find_thread_by_rfc_root(conn, rfc_root)
if existing:
return existing["id"]
# Step 4: create.
return _db.create_thread(
conn,
gmail_thread_id=gt,
rfc_thread_root_id=rfc_root,
subject_normalized=normalize_subject(parsed.get("subject")),
first_message_at=parsed.get("sent_at"),
)