Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server

Workstream A–C substrate for the Ten31 agentic system:

- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links,
  interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes); no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db,
  drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity
queries). Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,175 @@
# `email_integration`: Gmail capture for the Venture CRM

Scaffolded Phase 1 of the Gmail integration described in
`GMAIL_INTEGRATION_ARCHITECTURE.md` (repo root). Everything in this module is
isolated from `server.py` until you wire it in explicitly.

## Contents

| File | Purpose |
|------|---------|
| `config.py` | Env-var loader; exposes `CONFIG` singleton. |
| `errors.py` | Exception taxonomy used by the retry loop. |
| `crypto.py` | AES-GCM wrapper for OAuth refresh-token encryption (only used in OAuth mode). |
| `credentials.py` | `CredentialProvider` protocol + `DWDCredentialProvider` / `OAuthCredentialProvider`. |
| `gmail_client.py` | Gmail API HTTP wrapper (rate limit, retry, pagination). |
| `db.py` | All SQL touching `emails_*` tables. Migrations live under `migrations/`. |
| `parser.py` | Gmail payload → canonical dict (headers, body, attachments). |
| `matcher.py` | Investor address index + match logic. |
| `threads.py` | Thread resolution using Gmail threadId + RFC References. |
| `attachments.py` | Stub rows + on-disk storage + download worker. |
| `sync.py` | Orchestrator for backfill + incremental sync of one account. |
| `scheduler.py` | Background thread that runs `sync.sync_all` on an interval. |
| `routes.py` | HTTP handlers under `/api/email/*` compatible with `CRMHandler`. |
| `migrations/0001_email_tables.sql` | Table DDL. |

## Wiring it in

All changes are in `backend/server.py`, and each is independently revertible.
The scheduler starts only when `CRM_GMAIL_INTEGRATION_ENABLED=true`. With the
flag off, migrations are still applied (so the schema is ready) and the
`/api/email/*` routes return 503.

**Patch 1: migrations** (append to `init_db()` after all existing
`cursor.executescript(...)` calls, before `conn.commit()`):

```python
try:
    from email_integration.db import apply_migrations
    apply_migrations(cursor)
except ImportError:
    pass
```

**Patch 2: scheduler** (in `main()`, after `start_backup_scheduler()`):

```python
if os.environ.get("CRM_GMAIL_INTEGRATION_ENABLED", "").lower() in ("1", "true", "yes", "on"):
    from email_integration.scheduler import start_sync_scheduler
    start_sync_scheduler()
```

**Patch 3: routes** (add near the top of `CRMHandler.do_GET` and
`CRMHandler.do_POST`, after auth/rate-limit pre-checks, before API routing):

```python
try:
    from email_integration.routes import try_handle
    if try_handle(self):
        return
except ImportError:
    pass
```

## Environment variables

```bash
# Master on/off. Default off; scheduler won't start, routes return 503.
CRM_GMAIL_INTEGRATION_ENABLED=true

# Auth method: "dwd" (default, recommended) or "oauth"
CRM_GMAIL_AUTH_METHOD=dwd

# DWD mode
CRM_GMAIL_SA_KEY_PATH=/path/to/CRM/data/secrets/gmail-service-account.json
CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz

# OAuth mode (fallback; not required for DWD)
CRM_GMAIL_OAUTH_CLIENT_ID=...
CRM_GMAIL_OAUTH_CLIENT_SECRET=...
CRM_GMAIL_OAUTH_REDIRECT_URI=https://crm.ten31.xyz/api/email/oauth/callback
CRM_GMAIL_SECRET_KEY=<base64-32-random-bytes>   # for encrypting refresh tokens

# Sync
CRM_GMAIL_SYNC_INTERVAL_MIN=180       # default 3h
CRM_GMAIL_BACKFILL_PAGE_SIZE=500
CRM_GMAIL_MAX_ATTACHMENT_MB=50
CRM_GMAIL_ATTACH_CONCURRENCY=4
CRM_GMAIL_RATE_UNITS_SEC=150          # per account, leaves 40% headroom
CRM_GMAIL_RETRY_MAX=5
CRM_GMAIL_HISTORY_STALE_DAYS=5
```
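
`CRM_GMAIL_SECRET_KEY` is described above as base64 of 32 random bytes. One way to produce such a value with the standard library is sketched below. The helper name `generate_secret_key` is illustrative, and the use of standard (rather than URL-safe) base64 is an assumption; check `crypto.py` for the alphabet it actually decodes.

```python
import base64
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return `num_bytes` of CSPRNG output, base64-encoded as ASCII text.

    Hypothetical helper, not part of the module. 32 bytes matches an
    AES-256 key length.
    """
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Paste the printed line into the env file used by the CRM.
    print(f"CRM_GMAIL_SECRET_KEY={generate_secret_key()}")
```

Generate the key once and keep it stable: refresh tokens encrypted under one key cannot be decrypted after the key changes.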

## Google Cloud / Workspace setup (DWD)

See `GMAIL_INTEGRATION_ARCHITECTURE.md` §3 for the full runbook. Short form:

1. Create GCP project, enable Gmail API.
2. Create service account, download JSON key, enable domain-wide delegation.
3. In Google Admin console → Security → API controls → Manage domain-wide
   delegation, authorize the service account's client ID with scope
   `https://www.googleapis.com/auth/gmail.readonly`.
4. Copy the JSON key to `data/secrets/gmail-service-account.json`, `chmod 600`.
5. Set env vars in `.env.beta`, restart CRM.
6. As admin, POST `/api/email/accounts/enroll-all` to create `email_accounts`
   rows for every active user whose email ends in the Workspace domain.
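
Step 4 is easy to get subtly wrong (key copied with default permissions, or the wrong JSON file). The sketch below is a pre-flight check, assuming a POSIX filesystem. `check_key_file` is a hypothetical helper, not part of the module; the two required fields are the ones `DWDCredentialProvider` reads from the key.

```python
import json
import os
import stat

# Fields that credentials.py reads from the service-account key.
REQUIRED_FIELDS = ("client_email", "private_key")


def check_key_file(path: str) -> list[str]:
    """Return a list of human-readable problems; empty means the file looks OK."""
    problems = []

    # Any group/other permission bit means the file is not owner-only (600).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions {mode:o} allow group/other access; expected 600")

    with open(path, "r") as f:
        key = json.load(f)
    for field in REQUIRED_FIELDS:
        if not key.get(field):
            problems.append(f"missing field: {field}")
    return problems
```

Running it against the path in `CRM_GMAIL_SA_KEY_PATH` before restarting the CRM should surface these mistakes earlier than the first failed token exchange.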

## Adding the crypto dependency

Both auth modes need `cryptography`: OAuth mode uses it for AES-GCM encryption
of refresh tokens, and DWD mode uses it for the RSA signing of the JWT bearer
token. If you enable the integration in either mode, append this to
`backend/requirements.txt`:

```
cryptography==42.0.5
```

## Rollback

To disable instantly: set `CRM_GMAIL_INTEGRATION_ENABLED=false` and restart.
The scheduler won't start, routes return 503, DB tables remain (unused).

To remove completely: drop the env var, delete `data/email_attachments/`, and
drop all `emails_*` and `email_*` tables. The migration is idempotent and
create-only, so removal needs a separate drop script, which Phase 1 does not
provide.
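
Since no drop script ships with Phase 1, here is a rough sketch of what one could look like. It is not part of the module: `drop_email_tables` is a hypothetical name, the name patterns are taken from the paragraph above, and it assumes plain tables (no virtual/FTS tables) and that foreign-key enforcement will not block the drops. It defaults to a dry run; try it on a staging copy of the DB first.

```python
import sqlite3


def drop_email_tables(conn: sqlite3.Connection, *, dry_run: bool = True) -> list[str]:
    """List (and, when dry_run is False, drop) the email integration tables.

    Matches `emails`, `emails_*`, and `email_*`. GLOB is used instead of LIKE
    because `_` is a wildcard in LIKE but a literal character in GLOB.
    """
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND (name = 'emails' OR name GLOB 'emails_*' OR name GLOB 'email_*') "
        "ORDER BY name"
    ).fetchall()
    names = [r[0] for r in rows]
    if not dry_run:
        for name in names:
            # Names come from sqlite_master, so quoting them is sufficient here.
            conn.execute(f'DROP TABLE IF EXISTS "{name}"')
        conn.commit()
    return names
```

Calling it once with the default `dry_run=True` and reading the returned list is a cheap way to confirm nothing unrelated matches before dropping anything.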

## Local development

The module has zero network dependencies when imported without the scheduler
starting, so you can exercise the parser against a saved fixture:

```bash
python3 -c "from email_integration.parser import parse; \
import json; \
print(parse(json.load(open('fixture.json'))))"
```
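
If you have no captured message to use as `fixture.json`, a small synthetic one can be generated. The sketch below follows the general shape of a Gmail API message resource (`payload.headers` as name/value pairs, `payload.body.data` as base64url). `make_fixture` and all field values are illustrative, and whether `parse` needs additional fields is an assumption to verify against `parser.py`.

```python
import base64
import json


def make_fixture(subject: str, sender: str, to: str, body_text: str) -> dict:
    """Build a minimal single-part text/plain message in Gmail API shape."""
    raw = body_text.encode("utf-8")
    return {
        "id": "fixture-msg-1",            # made-up ids, not real Gmail ids
        "threadId": "fixture-thread-1",
        "labelIds": ["INBOX"],
        "internalDate": "1700000000000",  # epoch milliseconds, as a string
        "payload": {
            "mimeType": "text/plain",
            "headers": [
                {"name": "From", "value": sender},
                {"name": "To", "value": to},
                {"name": "Subject", "value": subject},
            ],
            "body": {
                "size": len(raw),
                "data": base64.urlsafe_b64encode(raw).decode("ascii"),
            },
        },
    }


if __name__ == "__main__":
    fixture = make_fixture(
        "Q3 update", "investor@example.com", "me@example.com", "Numbers attached."
    )
    with open("fixture.json", "w") as f:
        json.dump(fixture, f, indent=2)
```

Synthetic fixtures also keep real investor mail out of the repository, which matters if fixtures are later committed for unit tests.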

## Testing checklist (before enabling in production)

- [ ] Enable `CRM_GMAIL_INTEGRATION_ENABLED=true` on a staging copy of the DB only.
- [ ] Verify migrations applied: `emails`, `email_accounts`, etc. present.
- [ ] Enroll one account (yours) via `/api/email/accounts/enroll`.
- [ ] Trigger `POST /api/email/sync/run-now`.
- [ ] Check `email_sync_runs` for `status='ok'`.
- [ ] Spot-check `emails` rows against Gmail.
- [ ] Verify an attachment downloaded correctly (hash and size).
- [ ] Let the scheduler run for 24 hours; monitor `/api/email/status`.
- [ ] Enroll remaining 4 teammates.
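
For the attachment check in the list above, the on-disk file can be hashed and compared with what the worker recorded. A minimal sketch follows; `file_digest_and_size` is a hypothetical helper. `sha256_hex` is the column the dedup query in `attachments.py` uses, while the name of the size column is an assumption.

```python
import hashlib
import os


def file_digest_and_size(path: str, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Return (sha256 hex digest, size in bytes) for the file at `path`.

    Reads in chunks so large attachments are not loaded into memory at once.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest(), os.path.getsize(path)
```

Compare the returned pair with the `sha256_hex` and size stored on the matching `email_attachments` row, using that row's `storage_path` as the input path.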

## What's scaffolded vs. TODO

**Scaffolded and complete:**
- Schema (migration 0001)
- Config and env parsing
- Error taxonomy + retry classifier
- AES-GCM crypto helpers
- DWD JWT minting + access token caching
- OAuth refresh + consent flow endpoints
- Gmail client (list/get/history/attachments/profile) with rate limit + retry
- Full DB data-access layer
- MIME parser including RFC 2047 subjects and HTML→text fallback
- Investor matcher with exact + domain strategies
- Thread resolution (Gmail threadId + RFC References cross-account)
- Attachment storage with SHA-256 dedup
- Sync orchestrator (backfill + incremental with history-expired fallback)
- Scheduler with manual-trigger hook
- HTTP routes (status, accounts, threads, enroll, run-now, rematch, oauth)

**TODO before production (see architecture doc §15):**
- Multipart batch metadata fetch in `gmail_client.batch_get_metadata`
  (currently serial fallback).
- Unit tests (fixtures for parser, matcher, threads; integration tests with
  responses-style HTTP mock).
- Frontend UI: a thread list + detail pane in `frontend/index.html`.
- Sandboxed HTML rendering for email bodies (out of scope here).
@@ -0,0 +1,15 @@
"""
Gmail Integration for Venture CRM.

Phase 1 scope: OAuth2/DWD authentication, incremental Gmail sync, MIME parsing,
investor matching, threading, attachment storage. All logic isolated to this
module; server.py integration is three small patches guarded by
CRM_GMAIL_INTEGRATION_ENABLED.

See GMAIL_INTEGRATION_ARCHITECTURE.md at the repo root for full design.
"""

from . import config  # noqa: F401
from . import errors  # noqa: F401

__all__ = ["config", "errors"]
@@ -0,0 +1,234 @@
"""
Attachment download + on-disk storage.

Two usage patterns:

1. During message parsing we call `register_stubs` (keyword-only `email_id`
   and `parsed_attachments`), which writes pending rows to email_attachments.

2. A separate worker (kicked off by sync after each account completes)
   calls `drain_pending()` which fetches attachment bytes from Gmail and
   writes them to disk under CONFIG.attachments_dir.

Files are named: <CRM_DATA_DIR>/email_attachments/<email_id[:2]>/<email_id>/<attachment_id>-<sanitized_filename>

Sanitization prevents path traversal and keeps cross-platform-safe names.
"""

import base64
import hashlib
import os
import re
import sqlite3
from typing import Iterable, Optional

from . import config as _cfg
from . import db as _db
from . import errors as _errors
from . import gmail_client as _gmail


_MAX_FILENAME_LEN = 200
_BAD_FILENAME_CHARS = re.compile(r'[/\\\x00-\x1f\x7f:*?"<>|]+')


def _sanitize_filename(name: str) -> str:
    if not name:
        return "unnamed.bin"
    # strip path components first
    name = os.path.basename(name.replace("\\", "/"))
    name = _BAD_FILENAME_CHARS.sub("_", name).strip(" .")
    if not name:
        name = "unnamed.bin"
    if len(name) > _MAX_FILENAME_LEN:
        stem, dot, ext = name.rpartition(".")
        if dot:
            name = stem[: _MAX_FILENAME_LEN - len(ext) - 1] + "." + ext
        else:
            name = name[:_MAX_FILENAME_LEN]
    return name


def _storage_path_for(email_id: str, attachment_id: str, sanitized_filename: str) -> str:
    root = _cfg.CONFIG.attachments_dir
    bucket = email_id[:2] or "_0"
    dir_ = os.path.join(root, bucket, email_id)
    os.makedirs(dir_, exist_ok=True)
    return os.path.join(dir_, f"{attachment_id}-{sanitized_filename}")


# ---------------------------------------------------------------------------- phase 1: register stubs

def register_stubs(conn: sqlite3.Connection, *, email_id: str,
                   parsed_attachments: Iterable[dict]) -> list[str]:
    """Write pending attachment rows from parsed message data.

    Also handles tiny inline attachments whose bytes arrived with the message
    body (body.data present, no separate attachmentId) by writing them
    directly and marking as downloaded.

    Returns list of attachment ids created.
    """
    max_bytes = _cfg.CONFIG.max_attachment_mb * 1024 * 1024
    ids = []

    for att in parsed_attachments:
        filename = att.get("filename") or "unnamed.bin"
        sanitized = _sanitize_filename(filename)
        gmail_att_id = att.get("gmail_attachment_id") or ""
        mime = att.get("mime_type")
        size = att.get("size")

        # Determine storage path (we write the path whether or not the download
        # succeeded; missing files surface via download_status).
        att_row_id = _db.insert_attachment_stub(
            conn,
            email_id=email_id,
            gmail_attachment_id=gmail_att_id,
            filename=filename,
            sanitized_filename=sanitized,
            mime_type=mime,
            size_bytes=size,
            storage_path=_storage_path_for(email_id, gmail_att_id or att_row_id_fallback(), sanitized),
        )
        ids.append(att_row_id)

        # Oversize guard.
        if isinstance(size, int) and size > max_bytes:
            conn.execute(
                "UPDATE email_attachments SET download_status = 'skipped', "
                "download_error = ? WHERE id = ?",
                (f"exceeds max size {_cfg.CONFIG.max_attachment_mb}MB", att_row_id),
            )
            continue

        # Inline data fast-path.
        inline_b64 = att.get("inline_data_b64")
        if inline_b64:
            try:
                raw = base64.urlsafe_b64decode(_pad(inline_b64).encode("ascii"))
                path = _storage_path_for(email_id, att_row_id, sanitized)
                _write_bytes(path, raw)
                sha = hashlib.sha256(raw).hexdigest()
                conn.execute(
                    "UPDATE email_attachments SET storage_path = ? WHERE id = ?",
                    (path, att_row_id),
                )
                _db.mark_attachment_downloaded(
                    conn, att_row_id, sha256_hex=sha, size_bytes=len(raw)
                )
            except Exception as e:
                _db.mark_attachment_failed(conn, att_row_id, error=f"inline decode: {e}")

    return ids


def att_row_id_fallback() -> str:
    # Placeholder so the path template always produces something if gmail_att_id
    # was missing at stub time; the real path is rewritten when the worker
    # picks it up.
    import uuid
    return uuid.uuid4().hex


# ---------------------------------------------------------------------------- phase 2: worker

def drain_pending(conn_factory, client: _gmail.GmailClient, account_id: str,
                  *, limit: int = 50) -> int:
    """Download up to `limit` pending attachments for `account_id`.

    Returns count of successfully downloaded attachments. Called after each
    account's sync completes so large files don't block the sync loop.
    """
    conn = conn_factory()
    try:
        pending = _db.pending_attachments(conn, limit=limit)
    finally:
        conn.close()

    downloaded = 0
    for row in pending:
        if row["account_id"] != account_id:
            continue
        conn = conn_factory()
        try:
            ok = _download_one(conn, client, row)
            if ok:
                downloaded += 1
            conn.commit()
        finally:
            conn.close()
    return downloaded


def _download_one(conn: sqlite3.Connection, client: _gmail.GmailClient, row) -> bool:
    try:
        resp = client.get_attachment(row["gmail_message_id"], row["gmail_attachment_id"])
    except _errors.RETRYABLE as e:
        _db.mark_attachment_failed(conn, row["id"], error=f"transient: {type(e).__name__}")
        return False
    except _errors.GmailError as e:
        _db.mark_attachment_failed(conn, row["id"], error=f"{type(e).__name__}: {e}")
        return False

    data_b64 = resp.get("data")
    if not data_b64:
        _db.mark_attachment_failed(conn, row["id"], error="empty data in response")
        return False

    try:
        raw = base64.urlsafe_b64decode(_pad(data_b64).encode("ascii"))
    except Exception as e:
        _db.mark_attachment_failed(conn, row["id"], error=f"decode: {e}")
        return False

    sha = hashlib.sha256(raw).hexdigest()
    # If an existing attachment has the same SHA, re-point storage_path and skip write.
    existing = _find_existing_by_sha(conn, sha, exclude_id=row["id"])
    if existing:
        conn.execute(
            "UPDATE email_attachments SET storage_path = ? WHERE id = ?",
            (existing["storage_path"], row["id"]),
        )
        _db.mark_attachment_downloaded(conn, row["id"], sha256_hex=sha, size_bytes=len(raw))
        return True

    path = _storage_path_for(row["email_id"], row["id"], row["sanitized_filename"])
    try:
        _write_bytes(path, raw)
    except OSError as e:
        _db.mark_attachment_failed(conn, row["id"], error=f"disk: {e}")
        return False

    conn.execute(
        "UPDATE email_attachments SET storage_path = ? WHERE id = ?",
        (path, row["id"]),
    )
    _db.mark_attachment_downloaded(conn, row["id"], sha256_hex=sha, size_bytes=len(raw))
    return True


def _find_existing_by_sha(conn: sqlite3.Connection, sha: str, *, exclude_id: str) -> Optional[sqlite3.Row]:
    cur = conn.cursor()
    cur.execute(
        "SELECT * FROM email_attachments WHERE sha256_hex = ? AND id != ? "
        "AND download_status = 'downloaded' LIMIT 1",
        (sha, exclude_id),
    )
    return cur.fetchone()


# ---------------------------------------------------------------------------- utils

def _pad(b64: str) -> str:
    pad = 4 - (len(b64) % 4)
    return b64 + ("=" * pad if pad != 4 else "")


def _write_bytes(path: str, data: bytes) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
    os.chmod(tmp, 0o600)
    os.replace(tmp, path)
@@ -0,0 +1,112 @@
"""
Email integration configuration.

Reads from the same env-var surface as the rest of the CRM (server.py style),
no pydantic/dotenv magic; stdlib only.
"""

import os
from dataclasses import dataclass
from typing import Optional

# Reuse the CRM's data dir so backups and email storage live together.
_PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
_DEFAULT_DATA_DIR = os.path.join(_PROJECT_DIR, "data")


def _bool_env(name: str, default: bool = False) -> bool:
    v = os.environ.get(name, "").strip().lower()
    if v in ("1", "true", "yes", "on"):
        return True
    if v in ("0", "false", "no", "off"):
        return False
    return default


def _int_env(name: str, default: int) -> int:
    try:
        return int(os.environ.get(name, str(default)))
    except (TypeError, ValueError):
        return default


@dataclass(frozen=True)
class EmailConfig:
    # Master kill switch. When False, scheduler doesn't start and routes
    # return 503. Migrations are still applied (so schema is ready).
    enabled: bool

    # Primary auth path. "dwd" means service account / domain-wide delegation.
    # "oauth" means per-user refresh tokens. DWD is default; OAuth is the
    # pluggable fallback.
    primary_auth: str

    # DWD specifics
    dwd_key_path: Optional[str]
    workspace_domain: Optional[str]

    # OAuth specifics (used for fallback + admin UI)
    oauth_client_id: Optional[str]
    oauth_client_secret: Optional[str]
    oauth_redirect_uri: Optional[str]

    # Encryption key (base64) for OAuth refresh-token-at-rest encryption.
    # Required whenever oauth path is in use. DWD path never persists tokens.
    secret_key_b64: Optional[str]

    # Sync scheduling
    sync_interval_sec: int
    backfill_page_size: int
    max_attachment_mb: int
    max_parallel_attachment_downloads: int

    # Storage
    data_dir: str
    attachments_dir: str
    secrets_dir: str

    # Rate limit / retry
    rate_limit_units_per_sec_per_account: int
    retry_max_attempts: int
    retry_initial_delay_sec: float
    retry_max_delay_sec: float

    # Gmail history retention: if we fall this far behind, switch to date
    # backfill since Gmail may have pruned history records.
    history_stale_days: int


def load() -> EmailConfig:
    data_dir = os.environ.get("CRM_DATA_DIR", _DEFAULT_DATA_DIR)
    return EmailConfig(
        enabled=_bool_env("CRM_GMAIL_INTEGRATION_ENABLED", False),
        primary_auth=os.environ.get("CRM_GMAIL_AUTH_METHOD", "dwd").lower(),
        dwd_key_path=os.environ.get("CRM_GMAIL_SA_KEY_PATH") or None,
        workspace_domain=os.environ.get("CRM_GMAIL_WORKSPACE_DOMAIN") or None,
        oauth_client_id=os.environ.get("CRM_GMAIL_OAUTH_CLIENT_ID") or None,
        oauth_client_secret=os.environ.get("CRM_GMAIL_OAUTH_CLIENT_SECRET") or None,
        oauth_redirect_uri=os.environ.get("CRM_GMAIL_OAUTH_REDIRECT_URI") or None,
        secret_key_b64=os.environ.get("CRM_GMAIL_SECRET_KEY") or None,
        sync_interval_sec=_int_env("CRM_GMAIL_SYNC_INTERVAL_MIN", 180) * 60,
        backfill_page_size=_int_env("CRM_GMAIL_BACKFILL_PAGE_SIZE", 500),
        max_attachment_mb=_int_env("CRM_GMAIL_MAX_ATTACHMENT_MB", 50),
        max_parallel_attachment_downloads=_int_env("CRM_GMAIL_ATTACH_CONCURRENCY", 4),
        data_dir=data_dir,
        attachments_dir=os.path.join(data_dir, "email_attachments"),
        secrets_dir=os.path.join(data_dir, "secrets"),
        rate_limit_units_per_sec_per_account=_int_env("CRM_GMAIL_RATE_UNITS_SEC", 150),
        retry_max_attempts=_int_env("CRM_GMAIL_RETRY_MAX", 5),
        retry_initial_delay_sec=float(os.environ.get("CRM_GMAIL_RETRY_INITIAL_SEC", "1.0")),
        retry_max_delay_sec=float(os.environ.get("CRM_GMAIL_RETRY_MAX_SEC", "60.0")),
        history_stale_days=_int_env("CRM_GMAIL_HISTORY_STALE_DAYS", 5),
    )


# Singleton. Reload with `reload_config()` if env changes (mostly for tests).
CONFIG = load()


def reload_config() -> EmailConfig:
    global CONFIG
    CONFIG = load()
    return CONFIG
@@ -0,0 +1,297 @@
"""
Credential providers for Gmail API access.

Two implementations behind a common protocol:

- DWDCredentialProvider: signs a JWT with the Workspace-authorized service
  account, exchanges for a short-lived access token that impersonates a
  specific user. No per-user persistent state.

- OAuthCredentialProvider: uses a per-user refresh token (stored encrypted
  in email_accounts.oauth_refresh_enc) to mint access tokens. Supports the
  'connect Gmail' UI flow.

Both provide the same interface:

    provider.access_token_for(email_address: str) -> AccessToken
"""

import base64
import json
import os
import threading
import time
from dataclasses import dataclass
from typing import Optional, Protocol
import urllib.error  # HTTPError is caught below; import it explicitly
import urllib.parse
import urllib.request

from . import config as _cfg
from . import crypto
from . import errors


GMAIL_READONLY_SCOPE = "https://www.googleapis.com/auth/gmail.readonly"
GOOGLE_TOKEN_URL = "https://oauth2.googleapis.com/token"


@dataclass
class AccessToken:
    token: str
    expires_at: float  # epoch seconds


class CredentialProvider(Protocol):
    def access_token_for(self, email_address: str) -> AccessToken: ...
    def revoke(self, email_address: str) -> None: ...


# ============================================================================
# Domain-wide delegation
# ============================================================================

class DWDCredentialProvider:
    """Impersonation via service-account JWT bearer grant."""

    def __init__(self, key_path: str):
        with open(key_path, "r") as f:
            self._key = json.load(f)
        self._client_email = self._key["client_email"]
        self._private_key_pem = self._key["private_key"].encode("utf-8")
        self._cache: dict[str, AccessToken] = {}
        self._lock = threading.Lock()

    def access_token_for(self, email_address: str) -> AccessToken:
        with self._lock:
            cached = self._cache.get(email_address)
            if cached and cached.expires_at - time.time() > 60:
                return cached
            token = self._mint(email_address)
            self._cache[email_address] = token
            return token

    def revoke(self, email_address: str) -> None:
        # DWD tokens expire naturally in <1h. Revocation is via Admin console.
        # We just drop the cache so next call mints fresh.
        with self._lock:
            self._cache.pop(email_address, None)

    # ------------------------------------------------------------------ helpers

    def _mint(self, subject_email: str) -> AccessToken:
        try:
            from cryptography.hazmat.primitives import hashes, serialization  # type: ignore
            from cryptography.hazmat.primitives.asymmetric import padding  # type: ignore
        except ImportError as e:  # pragma: no cover
            raise errors.AuthError(
                "DWD requires the `cryptography` package. Add to requirements.txt."
            ) from e

        now = int(time.time())
        header = {"alg": "RS256", "typ": "JWT"}
        claim = {
            "iss": self._client_email,
            "sub": subject_email,
            "scope": GMAIL_READONLY_SCOPE,
            "aud": GOOGLE_TOKEN_URL,
            "iat": now,
            "exp": now + 3600,
        }
        signing_input = _b64url(_json(header)) + b"." + _b64url(_json(claim))

        private_key = serialization.load_pem_private_key(self._private_key_pem, password=None)
        signature = private_key.sign(signing_input, padding.PKCS1v15(), hashes.SHA256())
        jwt = signing_input + b"." + _b64url(signature)

        body = urllib.parse.urlencode({
            "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
            "assertion": jwt.decode("ascii"),
        }).encode("ascii")

        req = urllib.request.Request(
            GOOGLE_TOKEN_URL,
            data=body,
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
        try:
            with urllib.request.urlopen(req, timeout=15) as resp:
                payload = json.loads(resp.read())
        except urllib.error.HTTPError as e:
            body = e.read().decode("utf-8", errors="replace")
            try:
                payload = json.loads(body)
            except Exception:
                payload = {"raw": body}
            raise errors.classify_http(e.code, payload)

        if "access_token" not in payload:
            raise errors.AuthError("DWD token exchange returned no access_token", payload=payload)
        return AccessToken(
            token=payload["access_token"],
            expires_at=time.time() + float(payload.get("expires_in", 3600)) - 30,
        )


# ============================================================================
# Per-user OAuth (fallback)
# ============================================================================

class OAuthCredentialProvider:
    """Refreshes access tokens using a stored encrypted refresh token.

    Refresh tokens are obtained via the consent-flow routes in routes.py and
    stored in email_accounts.oauth_refresh_enc (AES-GCM ciphertext).
    """

    def __init__(self, db_conn_factory, client_id: str, client_secret: str, secret_key_b64: str):
        self._db = db_conn_factory
        self._client_id = client_id
        self._client_secret = client_secret
        self._secret_key_b64 = secret_key_b64
        self._lock = threading.Lock()

    def access_token_for(self, email_address: str) -> AccessToken:
        with self._lock:
            row = self._load_account(email_address)
            if row is None:
                raise errors.AuthError(f"no email_accounts row for {email_address}")
            # Cached access token still valid?
            if row["oauth_token_enc"] and row["oauth_token_exp"]:
                try:
                    exp = float(row["oauth_token_exp"])
                except ValueError:
                    exp = 0.0
                if exp - time.time() > 60:
                    token = crypto.decrypt(row["oauth_token_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
                    return AccessToken(token=token, expires_at=exp)
            # Refresh.
            return self._refresh(email_address, row)

    def revoke(self, email_address: str) -> None:
        row = self._load_account(email_address)
        if not row or not row["oauth_refresh_enc"]:
            return
        refresh = crypto.decrypt(row["oauth_refresh_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
        body = urllib.parse.urlencode({"token": refresh}).encode("ascii")
        req = urllib.request.Request(
            "https://oauth2.googleapis.com/revoke",
            data=body,
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
        try:
            urllib.request.urlopen(req, timeout=10).read()
        except Exception:
            pass  # best effort; we zero locally regardless
        self._zero_account(email_address)

    # ------------------------------------------------------------------ helpers

    def _refresh(self, email_address: str, row) -> AccessToken:
        if not row["oauth_refresh_enc"]:
            raise errors.AuthError(f"no refresh token stored for {email_address}")
        refresh = crypto.decrypt(row["oauth_refresh_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
        body = urllib.parse.urlencode({
            "grant_type": "refresh_token",
            "refresh_token": refresh,
            "client_id": self._client_id,
            "client_secret": self._client_secret,
        }).encode("ascii")
        req = urllib.request.Request(
            GOOGLE_TOKEN_URL,
            data=body,
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
        try:
            with urllib.request.urlopen(req, timeout=15) as resp:
                payload = json.loads(resp.read())
        except urllib.error.HTTPError as e:
            body_text = e.read().decode("utf-8", errors="replace")
            try:
                payload = json.loads(body_text)
            except Exception:
                payload = {"raw": body_text}
            raise errors.classify_http(e.code, payload)

        if "access_token" not in payload:
            raise errors.AuthError("OAuth refresh returned no access_token", payload=payload)

        token_str = payload["access_token"]
        exp = time.time() + float(payload.get("expires_in", 3600)) - 30
        enc_token = crypto.encrypt(token_str.encode("ascii"), secret_key_b64=self._secret_key_b64)
        self._save_token(email_address, enc_token, exp)
        return AccessToken(token=token_str, expires_at=exp)

    def _load_account(self, email_address: str):
        conn = self._db()
        try:
            cur = conn.cursor()
            cur.execute(
                "SELECT id, oauth_refresh_enc, oauth_token_enc, oauth_token_exp "
                "FROM email_accounts WHERE email_address = ?",
                (email_address,),
            )
            return cur.fetchone()
        finally:
            conn.close()

    def _save_token(self, email_address: str, enc_token: bytes, exp: float):
        conn = self._db()
        try:
            conn.execute(
                "UPDATE email_accounts SET oauth_token_enc = ?, oauth_token_exp = ?, "
                "updated_at = datetime('now') WHERE email_address = ?",
                (enc_token, str(exp), email_address),
            )
            conn.commit()
        finally:
            conn.close()

    def _zero_account(self, email_address: str):
        conn = self._db()
        try:
            conn.execute(
                "UPDATE email_accounts SET oauth_refresh_enc = NULL, oauth_token_enc = NULL, "
                "oauth_token_exp = NULL, sync_enabled = 0, sync_status = 'paused', "
                "updated_at = datetime('now') WHERE email_address = ?",
                (email_address,),
            )
            conn.commit()
        finally:
            conn.close()
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Factory — resolves CONFIG.primary_auth to a concrete provider
|
||||
# ============================================================================
|
||||
|
||||
def build_provider(db_conn_factory) -> CredentialProvider:
|
||||
cfg = _cfg.CONFIG
|
||||
if cfg.primary_auth == "dwd":
|
||||
if not cfg.dwd_key_path or not os.path.exists(cfg.dwd_key_path):
|
||||
raise errors.AuthError(
|
||||
f"CRM_GMAIL_SA_KEY_PATH not found: {cfg.dwd_key_path!r}"
|
||||
)
|
||||
return DWDCredentialProvider(cfg.dwd_key_path)
|
||||
if cfg.primary_auth == "oauth":
|
||||
if not (cfg.oauth_client_id and cfg.oauth_client_secret and cfg.secret_key_b64):
|
||||
raise errors.AuthError(
|
||||
"OAuth mode requires CRM_GMAIL_OAUTH_CLIENT_ID, "
|
||||
"CRM_GMAIL_OAUTH_CLIENT_SECRET, and CRM_GMAIL_SECRET_KEY."
|
||||
)
|
||||
return OAuthCredentialProvider(
|
||||
db_conn_factory,
|
||||
cfg.oauth_client_id,
|
||||
cfg.oauth_client_secret,
|
||||
cfg.secret_key_b64,
|
||||
)
|
||||
raise errors.AuthError(f"unknown primary_auth: {cfg.primary_auth!r}")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- utils
|
||||
|
||||
def _b64url(data: bytes) -> bytes:
|
||||
return base64.urlsafe_b64encode(data).rstrip(b"=")
|
||||
|
||||
|
||||
def _json(obj) -> bytes:
|
||||
return json.dumps(obj, separators=(",", ":")).encode("utf-8")
|
||||
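
The `_b64url` helper above drops the trailing `=` padding, which is the unpadded base64url form that JWT segments use. The sketch below shows the round trip. `b64url_decode` is a hypothetical inverse added for illustration only; it is not part of the module, and the header dict is just sample input.

```python
import base64
import json


def b64url_encode(data: bytes) -> bytes:
    """URL-safe base64 without '=' padding (same transform as _b64url)."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")


def b64url_decode(segment: bytes) -> bytes:
    """Hypothetical inverse: restore the stripped padding, then decode."""
    padding = b"=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


# Compact JSON (no spaces), mirroring the separators used by _json().
header = json.dumps({"alg": "RS256", "typ": "JWT"}, separators=(",", ":")).encode("utf-8")
segment = b64url_encode(header)
```

Restoring the padding before decoding matters because `base64.urlsafe_b64decode` rejects input whose length is not a multiple of four.
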
@@ -0,0 +1,79 @@
"""
AES-256-GCM encryption for OAuth tokens at rest (refresh tokens and cached
access tokens).

Key material comes from CONFIG.secret_key_b64 (env: CRM_GMAIL_SECRET_KEY).
It must be base64-encoded and decode to at least 32 bytes; only the first
32 bytes are used as the AES-256 key.

Storage format (as stored in BLOB columns):
    version(1 byte) || nonce(12 bytes) || ciphertext+tag(N bytes)

version = 1 for AES-GCM-256.

Uses the `cryptography` library. If not available (optional at scaffold time),
the OAuth fallback path is disabled with a clear error; the DWD path is unaffected.
"""

import base64
import os
import secrets
from typing import Optional

try:
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # type: ignore
    _AVAILABLE = True
except ImportError:  # pragma: no cover
    AESGCM = None  # type: ignore
    _AVAILABLE = False


VERSION = 1
NONCE_LEN = 12


class CryptoUnavailable(RuntimeError):
    pass


def _load_key(secret_key_b64: Optional[str]) -> bytes:
    if not secret_key_b64:
        raise CryptoUnavailable(
            "CRM_GMAIL_SECRET_KEY not set; cannot encrypt/decrypt OAuth tokens. "
            "DWD auth does not require this."
        )
    try:
        key = base64.b64decode(secret_key_b64)
    except Exception as e:
        raise CryptoUnavailable(f"CRM_GMAIL_SECRET_KEY not valid base64: {e}") from e
    if len(key) < 32:
        raise CryptoUnavailable(
            f"CRM_GMAIL_SECRET_KEY decodes to {len(key)} bytes; need >= 32."
        )
    return key[:32]  # AES-256


def encrypt(plaintext: bytes, *, secret_key_b64: Optional[str]) -> bytes:
    if not _AVAILABLE:
        raise CryptoUnavailable("cryptography library not installed")
    key = _load_key(secret_key_b64)
    nonce = secrets.token_bytes(NONCE_LEN)
    ct = AESGCM(key).encrypt(nonce, plaintext, None)
    return bytes([VERSION]) + nonce + ct


def decrypt(blob: bytes, *, secret_key_b64: Optional[str]) -> bytes:
    if not _AVAILABLE:
        raise CryptoUnavailable("cryptography library not installed")
    # 1 version byte + nonce + 16-byte GCM tag is the smallest valid blob.
    if not blob or len(blob) < 1 + NONCE_LEN + 16:
        raise ValueError("ciphertext too short")
    version = blob[0]
    if version != VERSION:
        raise ValueError(f"unsupported crypto version: {version}")
    nonce = blob[1:1 + NONCE_LEN]
    ct = blob[1 + NONCE_LEN:]
    key = _load_key(secret_key_b64)
    return AESGCM(key).decrypt(nonce, ct, None)


def generate_secret_key_b64() -> str:
    """Helper for initial setup: returns a fresh key you can drop into env."""
    return base64.b64encode(os.urandom(32)).decode("ascii")
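
The storage format described in the module docstring can be illustrated without the `cryptography` dependency. The sketch below only splits a blob into its three fields; it does not decrypt anything. `split_blob` and `TAG_LEN` are hypothetical names introduced here, and the blob is filler bytes, not real ciphertext.

```python
VERSION = 1
NONCE_LEN = 12
TAG_LEN = 16  # AES-GCM appends a 16-byte authentication tag to the ciphertext


def split_blob(blob: bytes) -> tuple[int, bytes, bytes]:
    """Split a stored blob into (version, nonce, ciphertext_with_tag)."""
    if len(blob) < 1 + NONCE_LEN + TAG_LEN:
        raise ValueError("ciphertext too short")
    return blob[0], blob[1:1 + NONCE_LEN], blob[1 + NONCE_LEN:]


# Filler bytes laid out like a stored blob: 1 + 12 + 20 = 33 bytes.
fake_blob = bytes([VERSION]) + b"\x00" * NONCE_LEN + b"\xaa" * 20
version, nonce, ct = split_blob(fake_blob)
```

Keeping the version in the first byte leaves room to change the cipher or key-derivation scheme later while still being able to read old rows.
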
@@ -0,0 +1,416 @@
"""
Data-access layer for the email_integration module.

All SQL touching the `emails` and `email_*` tables lives here. Other modules
call named helpers; they never write SQL inline. This keeps schema changes
contained.

Connection pattern matches server.py get_db():
  - WAL mode, foreign keys on, busy_timeout
  - sqlite3.Row row_factory
The caller is responsible for committing / closing.
"""

import json
import os
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Iterable, Optional


# ------------------------------------------------------------------ migrations

def apply_migrations(cursor: sqlite3.Cursor) -> None:
    """Apply all .sql migration files in migrations/ in lexicographic order.

    Called from server.init_db(). Applied migrations are not recorded in a
    table yet, so this is idempotent only because each file is guarded by
    CREATE ... IF NOT EXISTS etc. If we ever need more complex migrations,
    add a schema_migrations table.
    """
    here = os.path.dirname(os.path.abspath(__file__))
    mdir = os.path.join(here, "migrations")
    if not os.path.isdir(mdir):
        return
    for name in sorted(os.listdir(mdir)):
        if not name.endswith(".sql"):
            continue
        path = os.path.join(mdir, name)
        with open(path, "r", encoding="utf-8") as f:
            sql = f.read()
        cursor.executescript(sql)


# ------------------------------------------------------------------ utils

def _uuid() -> str:
    return str(uuid.uuid4())


def _now_iso() -> str:
    return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def _json(v) -> str:
    return json.dumps(v, separators=(",", ":"))


# ------------------------------------------------------------------ email_accounts

def list_sync_ready_accounts(conn: sqlite3.Connection) -> list[sqlite3.Row]:
    cur = conn.cursor()
    # Never-synced accounts sort first, then the least recently synced.
    cur.execute(
        "SELECT * FROM email_accounts "
        "WHERE sync_enabled = 1 AND sync_status IN ('pending','active') "
        "ORDER BY last_synced_at IS NOT NULL, last_synced_at"
    )
    return cur.fetchall()


def get_account_by_email(conn: sqlite3.Connection, email_address: str) -> Optional[sqlite3.Row]:
    cur = conn.cursor()
    cur.execute("SELECT * FROM email_accounts WHERE email_address = ?", (email_address,))
    return cur.fetchone()


def upsert_account(conn: sqlite3.Connection, *, user_id: str, email_address: str,
                   auth_method: str) -> str:
    # Returns the existing row's id unchanged if the address is already known.
    existing = get_account_by_email(conn, email_address)
    if existing:
        return existing["id"]
    account_id = _uuid()
    conn.execute(
        "INSERT INTO email_accounts (id, user_id, email_address, auth_method) "
        "VALUES (?, ?, ?, ?)",
        (account_id, user_id, email_address, auth_method),
    )
    return account_id


def set_account_status(conn: sqlite3.Connection, account_id: str, *,
                       status: str, error: Optional[str] = None) -> None:
    conn.execute(
        "UPDATE email_accounts SET sync_status = ?, sync_error = ?, "
        "updated_at = datetime('now') WHERE id = ?",
        (status, error, account_id),
    )


def set_account_checkpoint(conn: sqlite3.Connection, account_id: str, *,
                           history_id: Optional[str] = None,
                           backfill_cursor: Optional[str] = None,
                           backfill_complete: Optional[bool] = None,
                           last_synced_at: Optional[str] = None) -> None:
    # Only the fields passed as non-None are updated.
    sets, params = [], []
    if history_id is not None:
        sets.append("last_history_id = ?")
        params.append(history_id)
    if backfill_cursor is not None:
        sets.append("backfill_cursor = ?")
        params.append(backfill_cursor)
    if backfill_complete is not None:
        sets.append("backfill_complete = ?")
        params.append(1 if backfill_complete else 0)
    if last_synced_at is not None:
        sets.append("last_synced_at = ?")
        params.append(last_synced_at)
    if not sets:
        return
    sets.append("updated_at = datetime('now')")
    params.append(account_id)
    conn.execute(f"UPDATE email_accounts SET {', '.join(sets)} WHERE id = ?", params)
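
The partial-update pattern in `set_account_checkpoint` can be tried against an in-memory SQLite database. This is a minimal sketch: `set_checkpoint` is a trimmed-down stand-in for the real helper, and the three-column table is a hypothetical schema for the demo, not the one from `migrations/`.

```python
import sqlite3
from typing import Optional


def set_checkpoint(conn: sqlite3.Connection, account_id: str, *,
                   history_id: Optional[str] = None,
                   backfill_complete: Optional[bool] = None) -> None:
    # Column names are fixed in code; only values are bound as parameters.
    sets, params = [], []
    if history_id is not None:
        sets.append("last_history_id = ?")
        params.append(history_id)
    if backfill_complete is not None:
        sets.append("backfill_complete = ?")
        params.append(1 if backfill_complete else 0)
    if not sets:
        return
    params.append(account_id)
    conn.execute(f"UPDATE email_accounts SET {', '.join(sets)} WHERE id = ?", params)


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Hypothetical, trimmed-down schema for the demo.
conn.execute(
    "CREATE TABLE email_accounts ("
    "id TEXT PRIMARY KEY, last_history_id TEXT, backfill_complete INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO email_accounts (id, last_history_id) VALUES ('a1', '100')")
set_checkpoint(conn, "a1", backfill_complete=True)  # history id is left untouched
row = conn.execute("SELECT * FROM email_accounts WHERE id = 'a1'").fetchone()
```

Because the f-string only ever interpolates column fragments written in the function itself, the dynamic SQL stays safe from injection; caller-supplied values always go through `?` placeholders.
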


# ------------------------------------------------------------------ emails

def find_email_by_rfc_id(conn: sqlite3.Connection, rfc_message_id: str) -> Optional[sqlite3.Row]:
    cur = conn.cursor()
    cur.execute("SELECT * FROM emails WHERE rfc_message_id = ?", (rfc_message_id,))
    return cur.fetchone()


def find_email_id_by_any_rfc_id(conn: sqlite3.Connection,
                                rfc_ids: Iterable[str]) -> Optional[str]:
    ids = [r for r in rfc_ids if r]
    if not ids:
        return None
    placeholders = ",".join("?" for _ in ids)
    cur = conn.cursor()
    cur.execute(
        f"SELECT id FROM emails WHERE rfc_message_id IN ({placeholders}) "
        "ORDER BY sent_at ASC LIMIT 1",
        ids,
    )
    row = cur.fetchone()
    return row["id"] if row else None
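
`find_email_id_by_any_rfc_id` builds one `?` per candidate ID and returns the earliest stored match. The sketch below reproduces that query shape against an in-memory database. `earliest_known` and the three-column `emails` table are hypothetical stand-ins, and the assumption that callers pass IDs taken from a message's `References` / `In-Reply-To` headers is mine, not stated in the source.

```python
import sqlite3
from typing import Iterable, Optional


def earliest_known(conn: sqlite3.Connection, rfc_ids: Iterable[str]) -> Optional[str]:
    ids = [r for r in rfc_ids if r]          # drop empty strings / None
    if not ids:
        return None
    placeholders = ",".join("?" for _ in ids)
    row = conn.execute(
        f"SELECT id FROM emails WHERE rfc_message_id IN ({placeholders}) "
        "ORDER BY sent_at ASC LIMIT 1",
        ids,
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema for the demo.
conn.execute("CREATE TABLE emails (id TEXT, rfc_message_id TEXT, sent_at TEXT)")
conn.executemany("INSERT INTO emails VALUES (?, ?, ?)", [
    ("e1", "<a@x>", "2024-01-02T00:00:00Z"),
    ("e2", "<b@x>", "2024-01-01T00:00:00Z"),
])
root = earliest_known(conn, ["<a@x>", "", "<b@x>", "<missing@x>"])
```

Sorting on `sent_at` as text works here only because the timestamps are ISO 8601 strings in a single time zone, which sort lexicographically in chronological order.
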


def insert_email(conn: sqlite3.Connection, *, parsed: dict, match_status: str) -> str:
    """Insert a fresh emails row. Returns email_id.

    Caller must ensure no row exists for parsed['rfc_message_id']; use
    find_email_by_rfc_id first.
    """
    email_id = _uuid()
    conn.execute(
        """INSERT INTO emails
           (id, rfc_message_id, gmail_thread_id, rfc_thread_root_id, subject,
            from_email, from_name, to_emails_json, cc_emails_json, bcc_emails_json,
            reply_to, sent_at, body_text, body_html, snippet, in_reply_to,
            references_json, has_attachments, size_estimate, is_matched,
            match_status, raw_headers_json)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            email_id,
            parsed["rfc_message_id"],
            parsed.get("gmail_thread_id"),
            parsed.get("rfc_thread_root_id"),
            parsed.get("subject"),
            parsed["from_email"],
            parsed.get("from_name"),
            _json(parsed.get("to", [])),
            _json(parsed.get("cc", [])),
            _json(parsed.get("bcc", [])),
            parsed.get("reply_to"),
            parsed["sent_at"],
            parsed.get("body_text"),
            parsed.get("body_html"),
            parsed.get("snippet"),
            parsed.get("in_reply_to"),
            _json(parsed.get("references", [])),
            1 if parsed.get("attachments") else 0,
            parsed.get("size_estimate"),
            1 if match_status == "matched" else 0,
            match_status,
            _json(parsed.get("raw_headers", {})) if parsed.get("raw_headers") else None,
        ),
    )
    # recipients
    for kind in ("from", "to", "cc", "bcc", "reply_to"):
        addrs = []
        if kind == "from" and parsed.get("from_email"):
            addrs = [(parsed["from_email"], parsed.get("from_name"))]
        elif kind == "reply_to" and parsed.get("reply_to"):
            addrs = [(parsed["reply_to"], None)]
        else:
            # `or []` covers keys that are present but None (e.g. reply_to=None),
            # which would otherwise raise TypeError when iterated.
            for a in parsed.get(kind) or []:
                if isinstance(a, dict):
                    addrs.append((a.get("email"), a.get("name")))
                else:
                    addrs.append((a, None))
        for address, name in addrs:
            if not address:
                continue
            conn.execute(
                "INSERT INTO email_recipients (id, email_id, address, display_name, kind) "
                "VALUES (?, ?, ?, ?, ?)",
                (_uuid(), email_id, address.lower().strip(), name, kind),
            )
    return email_id


def set_email_thread(conn: sqlite3.Connection, email_id: str, thread_id: str) -> None:
    conn.execute(
        "UPDATE emails SET thread_id = ?, updated_at = datetime('now') WHERE id = ?",
        (thread_id, email_id),
    )


# ------------------------------------------------------------------ sightings

def upsert_sighting(conn: sqlite3.Connection, *, email_id: str, account_id: str,
                    gmail_message_id: str, gmail_thread_id: str,
                    labels: list[str], is_sent: bool) -> None:
    conn.execute(
        """INSERT OR IGNORE INTO email_account_messages
           (id, email_id, account_id, gmail_message_id, gmail_thread_id,
            labels_json, is_sent)
           VALUES (?, ?, ?, ?, ?, ?, ?)""",
        (_uuid(), email_id, account_id, gmail_message_id, gmail_thread_id,
         _json(labels), 1 if is_sent else 0),
    )


def update_sighting_labels(conn: sqlite3.Connection, *, account_id: str,
                           gmail_message_id: str, labels: list[str]) -> None:
    conn.execute(
        "UPDATE email_account_messages SET labels_json = ? "
        "WHERE account_id = ? AND gmail_message_id = ?",
        (_json(labels), account_id, gmail_message_id),
    )


def tombstone_sighting(conn: sqlite3.Connection, *, account_id: str,
                       gmail_message_id: str) -> None:
    conn.execute(
        "UPDATE email_account_messages SET deleted_at = datetime('now') "
        "WHERE account_id = ? AND gmail_message_id = ?",
        (account_id, gmail_message_id),
    )


# ------------------------------------------------------------------ attachments

def insert_attachment_stub(conn: sqlite3.Connection, *, email_id: str,
                           gmail_attachment_id: str, filename: str,
                           sanitized_filename: str, mime_type: Optional[str],
                           size_bytes: Optional[int], storage_path: str) -> str:
    att_id = _uuid()
    conn.execute(
        """INSERT INTO email_attachments
           (id, email_id, gmail_attachment_id, filename, sanitized_filename,
            mime_type, size_bytes, storage_path)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (att_id, email_id, gmail_attachment_id, filename, sanitized_filename,
         mime_type, size_bytes, storage_path),
    )
    return att_id


def mark_attachment_downloaded(conn: sqlite3.Connection, attachment_id: str, *,
                               sha256_hex: str, size_bytes: int) -> None:
    conn.execute(
        "UPDATE email_attachments SET download_status = 'downloaded', "
        "sha256_hex = ?, size_bytes = ?, downloaded_at = datetime('now') "
        "WHERE id = ?",
        (sha256_hex, size_bytes, attachment_id),
    )


def mark_attachment_failed(conn: sqlite3.Connection, attachment_id: str, *,
                           error: str) -> None:
    conn.execute(
        "UPDATE email_attachments SET download_status = 'failed', "
        "download_attempts = download_attempts + 1, download_error = ? "
        "WHERE id = ?",
        (error, attachment_id),
    )


def pending_attachments(conn: sqlite3.Connection, limit: int = 50) -> list[sqlite3.Row]:
    cur = conn.cursor()
    cur.execute(
        "SELECT a.*, eam.gmail_message_id, eam.account_id "
        "FROM email_attachments a "
        "JOIN email_account_messages eam ON eam.email_id = a.email_id "
        "WHERE a.download_status = 'pending' AND a.download_attempts < 5 "
        "LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


# ------------------------------------------------------------------ threads

def find_thread_by_gmail_id(conn: sqlite3.Connection, gmail_thread_id: str) -> Optional[sqlite3.Row]:
    cur = conn.cursor()
    cur.execute(
        "SELECT * FROM email_threads WHERE gmail_thread_id = ?",
        (gmail_thread_id,),
    )
    return cur.fetchone()


def find_thread_by_rfc_root(conn: sqlite3.Connection, rfc_root: str) -> Optional[sqlite3.Row]:
    cur = conn.cursor()
    cur.execute(
        "SELECT * FROM email_threads WHERE rfc_thread_root_id = ?",
        (rfc_root,),
    )
    return cur.fetchone()


def create_thread(conn: sqlite3.Connection, *, gmail_thread_id: Optional[str],
                  rfc_thread_root_id: Optional[str], subject_normalized: Optional[str],
                  first_message_at: Optional[str]) -> str:
    thread_id = _uuid()
    conn.execute(
        """INSERT INTO email_threads
           (id, gmail_thread_id, rfc_thread_root_id, subject_normalized,
            first_message_at, last_message_at, message_count)
           VALUES (?, ?, ?, ?, ?, ?, 0)""",
        (thread_id, gmail_thread_id, rfc_thread_root_id, subject_normalized,
         first_message_at, first_message_at),
    )
    return thread_id


def rollup_thread(conn: sqlite3.Connection, thread_id: str) -> None:
    """Recompute count / last_message_at / participants from member emails.

    Cheap at 5-person team volumes. For larger deployments swap to triggers.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) AS n, MIN(sent_at) AS first, MAX(sent_at) AS last, "
        "MAX(is_matched) AS matched FROM emails WHERE thread_id = ?",
        (thread_id,),
    )
    row = cur.fetchone()
    if not row or row["n"] == 0:
        return
    cur.execute(
        "SELECT DISTINCT address FROM email_recipients er "
        "JOIN emails e ON e.id = er.email_id WHERE e.thread_id = ?",
        (thread_id,),
    )
    participants = [r["address"] for r in cur.fetchall()]
    conn.execute(
        "UPDATE email_threads SET message_count = ?, first_message_at = ?, "
        "last_message_at = ?, participant_count = ?, participants_json = ?, "
        "is_matched = ?, updated_at = datetime('now') WHERE id = ?",
        (row["n"], row["first"], row["last"], len(participants),
         _json(participants), int(row["matched"] or 0), thread_id),
    )


# ------------------------------------------------------------------ investor links

def insert_investor_link(conn: sqlite3.Connection, *, email_id: str,
                         link: dict) -> None:
    conn.execute(
        """INSERT INTO email_investor_links
           (id, email_id, fundraising_investor_id, fundraising_contact_id,
            contact_id, organization_id, matched_address, match_kind,
            match_confidence)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            _uuid(),
            email_id,
            link.get("fundraising_investor_id"),
            link.get("fundraising_contact_id"),
            link.get("contact_id"),
            link.get("organization_id"),
            link["matched_address"],
            link["match_kind"],
            float(link.get("match_confidence", 1.0)),
        ),
    )


# ------------------------------------------------------------------ sync runs

def start_sync_run(conn: sqlite3.Connection, *, account_id: str, kind: str) -> str:
    run_id = _uuid()
    conn.execute(
        "INSERT INTO email_sync_runs (id, account_id, kind, started_at, status) "
        "VALUES (?, ?, ?, ?, 'running')",
        (run_id, account_id, kind, _now_iso()),
    )
    return run_id


def finish_sync_run(conn: sqlite3.Connection, run_id: str, *, status: str,
                    stats: Optional[dict] = None, error: Optional[str] = None) -> None:
    stats = stats or {}
    conn.execute(
        """UPDATE email_sync_runs
           SET finished_at = ?, status = ?, messages_seen = ?, messages_stored = ?,
               attachments_saved = ?, api_calls = ?, retries = ?, error = ?
           WHERE id = ?""",
        (
            _now_iso(), status,
            int(stats.get("messages_seen", 0)),
            int(stats.get("messages_stored", 0)),
            int(stats.get("attachments_saved", 0)),
            int(stats.get("api_calls", 0)),
            int(stats.get("retries", 0)),
            error,
            run_id,
        ),
    )
@@ -0,0 +1,79 @@
"""
Exception taxonomy for Gmail integration.

gmail_client._call() maps HTTP status codes to these exception types via
classify_http(). The retry loop in gmail_client._with_retry() checks whether
the raised exception is one of the RETRYABLE classes to decide whether to
back off and retry or fail fast.
"""


class GmailError(Exception):
    """Base class for all Gmail-integration errors."""

    def __init__(self, message: str = "", *, status: int = 0, payload: object = None):
        super().__init__(message)
        self.status = status
        self.payload = payload


class AuthError(GmailError):
    """401 / 403 that is not a rate-limit. Requires operator intervention
    (bad service account key, revoked OAuth, missing DWD scope). Not retried."""


class RateLimitError(GmailError):
    """429, or 403 with reason in {rateLimitExceeded, userRateLimitExceeded,
    quotaExceeded}. Retried with exponential backoff."""


class TransientError(GmailError):
    """5xx or network error. Retried with exponential backoff."""


class NotFoundError(GmailError):
    """404. For messages this usually means 'deleted in Gmail after we saw it';
    for history this is HistoryExpiredError."""


class HistoryExpiredError(NotFoundError):
    """404 on history.list with startHistoryId. Gmail only retains history
    for a limited window (~7 days). Triggers date-based backfill fallback."""


class PermanentError(GmailError):
    """400 or other permanent failure. Skip and log; do not retry."""


def classify_http(status: int, payload: object) -> GmailError:
    """Map a Gmail API response to the appropriate exception type.

    `payload` is the decoded JSON body if any; used to distinguish rate-limit
    403s from pure auth 403s via the `reason` field Google returns.
    """
    reason = ""
    if isinstance(payload, dict):
        try:
            errs = payload.get("error", {}).get("errors") or []
            if errs:
                reason = str(errs[0].get("reason", ""))
        except Exception:  # pragma: no cover (defensive)
            pass

    if status == 429:
        return RateLimitError(f"rate limited: {reason}", status=status, payload=payload)
    if status == 403:
        if reason in ("rateLimitExceeded", "userRateLimitExceeded", "quotaExceeded"):
            return RateLimitError(f"quota: {reason}", status=status, payload=payload)
        return AuthError(f"forbidden: {reason}", status=status, payload=payload)
    if status == 401:
        return AuthError("unauthorized", status=status, payload=payload)
    if status == 404:
        return NotFoundError("not found", status=status, payload=payload)
    if 500 <= status < 600:
        return TransientError(f"server error {status}", status=status, payload=payload)
    if 400 <= status < 500:
        return PermanentError(f"client error {status}: {reason}", status=status, payload=payload)
    return GmailError(f"unexpected status {status}", status=status, payload=payload)


RETRYABLE = (RateLimitError, TransientError)
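
The 403 split in `classify_http` hinges on the `reason` string nested inside the error body. The sketch below isolates that lookup and the resulting retry decision as plain functions. `error_reason`, `is_retryable`, and the sample payload are illustrative names and data, not part of the module; the payload is only an assumed example of the nested `error.errors[].reason` shape the code reads.

```python
RATE_LIMIT_REASONS = {"rateLimitExceeded", "userRateLimitExceeded", "quotaExceeded"}


def error_reason(payload: object) -> str:
    """Return the first error's `reason`, or '' if the body has no such field."""
    if not isinstance(payload, dict):
        return ""
    error = payload.get("error")
    if not isinstance(error, dict):
        # e.g. an OAuth token response where "error" is a plain string
        return ""
    errs = error.get("errors") or []
    if errs and isinstance(errs[0], dict):
        return str(errs[0].get("reason", ""))
    return ""


def is_retryable(status: int, payload: object) -> bool:
    """Same outcome as the RETRYABLE check: 429, rate-limit 403s, and 5xx."""
    if status == 429 or 500 <= status < 600:
        return True
    return status == 403 and error_reason(payload) in RATE_LIMIT_REASONS


# Assumed example body; real responses carry more fields.
sample = {"error": {"code": 403, "errors": [{"reason": "userRateLimitExceeded"}]}}
```

A 403 without a recognized reason is treated as an auth problem and not retried, which avoids hammering the API when a key or scope is simply wrong.
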
@@ -0,0 +1,249 @@
"""
Thin Gmail API wrapper.

Responsibilities:
  - HTTPS calls to https://gmail.googleapis.com/gmail/v1/users/me/*
  - Per-account access-token injection via CredentialProvider
  - Rate limiting via token bucket
  - Retry loop with exponential backoff + jitter for RETRYABLE errors
  - Batch requests for metadata fetches (multipart/mixed): not implemented
    yet; batch_get_metadata() currently falls back to serial gets
  - Call-count accounting for observability (plumbed to email_sync_runs)

We call Gmail over raw urllib instead of the google-api-python-client to keep
the dependency surface small. If you prefer the Google SDK, replace _call()
with client calls; everything else is independent.
"""

import json
import random
import threading
import time
import urllib.error
import urllib.parse
import urllib.request
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional

from . import config as _cfg
from . import errors


BASE = "https://gmail.googleapis.com/gmail/v1/users"


# ---------------------------------------------------------------------------- token bucket

class _TokenBucket:
    """Simple per-account rate limiter. Call wait(cost) before each API call."""

    def __init__(self, units_per_sec: int, burst: Optional[int] = None):
        self._rate = float(units_per_sec)
        self._burst = float(burst if burst is not None else units_per_sec)
        self._tokens = self._burst
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def wait(self, cost: float) -> None:
        # A cost above the burst size could never be satisfied; clamp it so
        # the loop below always terminates.
        cost = min(cost, self._burst)
        while True:
            with self._lock:
                now = time.monotonic()
                self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
                self._last = now
                if self._tokens >= cost:
                    self._tokens -= cost
                    return
                needed = cost - self._tokens
                sleep_for = needed / self._rate
            # Sleep outside the lock so other threads can refill / consume.
            time.sleep(sleep_for)
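
The arithmetic inside `_TokenBucket.wait` is easier to follow with the clock taken out. The sketch below restates the refill and sleep calculations as pure functions. `refill`, `seconds_until`, and the numbers (10 units/sec, burst of 10, 5-unit calls) are assumptions chosen for illustration, not values from `CONFIG`.

```python
def refill(tokens: float, elapsed_sec: float, rate: float, burst: float) -> float:
    """Tokens available after `elapsed_sec`, capped at `burst`."""
    return min(burst, tokens + elapsed_sec * rate)


def seconds_until(tokens: float, cost: float, rate: float) -> float:
    """How long to sleep before `cost` units are available (0 if already there)."""
    return max(0.0, (cost - tokens) / rate)


# Assumed settings for the walk-through.
rate, burst = 10.0, 10.0
tokens = burst
tokens -= 5  # first 5-unit call: 5 tokens left
tokens -= 5  # second 5-unit call: bucket is empty
wait = seconds_until(tokens, 5, rate)        # time until a third call fits
tokens = refill(tokens, wait, rate, burst)   # bucket level after sleeping
```

With these numbers the bucket allows two back-to-back calls, after which each further 5-unit call has to wait for the refill rate to catch up.
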


# ---------------------------------------------------------------------------- call stats

@dataclass
class CallStats:
    api_calls: int = 0
    retries: int = 0
    bytes_in: int = 0
    last_errors: list[str] = field(default_factory=list)


# ---------------------------------------------------------------------------- client

class GmailClient:
    """Per-account Gmail client. Bind one instance per sync run."""

    def __init__(self, credential_provider, email_address: str, stats: Optional[CallStats] = None):
        self._creds = credential_provider
        self._email = email_address
        self._bucket = _TokenBucket(units_per_sec=_cfg.CONFIG.rate_limit_units_per_sec_per_account)
        self.stats = stats or CallStats()
        self._last_history_id: Optional[str] = None

    # -------------------------------------------------------------- messages.*

    def list_messages(self, *, q: str = "", page_token: Optional[str] = None,
                      max_results: int = 500, label_ids: Optional[list[str]] = None) -> dict:
        """https://developers.google.com/gmail/api/reference/rest/v1/users.messages/list"""
        params: dict[str, Any] = {"maxResults": str(max_results)}
        if q:
            params["q"] = q
        if page_token:
            params["pageToken"] = page_token
        if label_ids:
            # Sent as repeated labelIds=... params; _call() encodes lists
            # with doseq=True.
            params["labelIds"] = list(label_ids)
        return self._get("/messages", params=params, cost=5)

    def get_message(self, message_id: str, *, format: str = "metadata",
                    metadata_headers: Optional[list[str]] = None) -> dict:
        params: dict[str, Any] = {"format": format}
        if format == "metadata" and metadata_headers:
            params["metadataHeaders"] = metadata_headers
        return self._get(f"/messages/{message_id}", params=params, cost=5)

    def get_attachment(self, message_id: str, attachment_id: str) -> dict:
        return self._get(
            f"/messages/{message_id}/attachments/{attachment_id}",
            params=None,
            cost=5,
        )

    # -------------------------------------------------------------- history.*

    def list_history(self, *, start_history_id: str, page_token: Optional[str] = None,
                     history_types: Optional[list[str]] = None) -> dict:
        params: dict[str, Any] = {"startHistoryId": start_history_id, "maxResults": "500"}
        if page_token:
            params["pageToken"] = page_token
        if history_types:
            params["historyTypes"] = history_types
        try:
            return self._get("/history", params=params, cost=2)
        except errors.NotFoundError as e:
            # Gmail returns 404 when startHistoryId is too old. Wrap for callers.
            raise errors.HistoryExpiredError(
                "startHistoryId no longer available", status=404, payload=getattr(e, "payload", None)
            ) from e

    # -------------------------------------------------------------- profile

    def get_profile(self) -> dict:
        return self._get("/profile", params=None, cost=1)

    # -------------------------------------------------------------- iteration helpers

    def iter_messages(self, *, q: str = "") -> Iterator[dict]:
        page_token: Optional[str] = None
        while True:
            resp = self.list_messages(q=q, page_token=page_token,
                                      max_results=_cfg.CONFIG.backfill_page_size)
            for m in resp.get("messages") or []:
                yield m
            page_token = resp.get("nextPageToken")
            if not page_token:
                return

    def iter_history(self, *, start_history_id: str,
                     history_types: Optional[list[str]] = None) -> Iterator[dict]:
        page_token: Optional[str] = None
        while True:
            resp = self.list_history(
                start_history_id=start_history_id,
                page_token=page_token,
                history_types=history_types,
            )
            for h in resp.get("history") or []:
                yield h
            page_token = resp.get("nextPageToken")
            if not page_token:
                # Cache final historyId for caller to checkpoint.
                self._last_history_id = resp.get("historyId")
                return

    @property
    def last_history_id(self) -> Optional[str]:
        """historyId from the last page of the most recent iter_history() run."""
        return self._last_history_id
# -------------------------------------------------------------- internals
|
||||
|
||||
def _get(self, path: str, *, params: Optional[dict], cost: float) -> dict:
|
||||
return self._with_retry(lambda: self._call("GET", path, params=params, cost=cost))
|
||||
|
||||
def _call(self, method: str, path: str, *, params: Optional[dict] = None,
|
||||
body: Optional[bytes] = None, cost: float = 1.0) -> dict:
|
||||
self._bucket.wait(cost)
|
||||
self.stats.api_calls += 1
|
||||
|
||||
qs = ""
|
||||
if params:
|
||||
# urllib.parse.urlencode with doseq=True handles repeated params
|
||||
# like metadataHeaders=Foo&metadataHeaders=Bar correctly.
|
||||
qs = "?" + urllib.parse.urlencode(params, doseq=True)
|
||||
url = f"{BASE}/me{path}{qs}"
|
||||
|
||||
token = self._creds.access_token_for(self._email)
|
||||
req = urllib.request.Request(url, method=method, data=body)
|
||||
req.add_header("Authorization", f"Bearer {token.token}")
|
||||
req.add_header("Accept", "application/json")
|
||||
if body:
|
||||
req.add_header("Content-Type", "application/json")
|
||||
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=30) as resp:
|
||||
raw = resp.read()
|
||||
self.stats.bytes_in += len(raw)
|
||||
if not raw:
|
||||
return {}
|
||||
return json.loads(raw)
|
||||
except urllib.error.HTTPError as e:
|
||||
raw = e.read() or b""
|
||||
self.stats.bytes_in += len(raw)
|
||||
try:
|
||||
payload = json.loads(raw) if raw else {}
|
||||
except Exception:
|
||||
payload = {"raw": raw.decode("utf-8", errors="replace")}
|
||||
err = errors.classify_http(e.code, payload)
|
||||
# short-message logging hook (redacted of tokens by design)
|
||||
self.stats.last_errors.append(f"{e.code} {type(err).__name__}")
|
||||
self.stats.last_errors = self.stats.last_errors[-10:]
|
||||
raise err
|
||||
except (urllib.error.URLError, TimeoutError) as e:
|
||||
raise errors.TransientError(f"network error: {e}") from e
|
||||
|
||||
    def _with_retry(self, fn):
        cfg = _cfg.CONFIG
        attempts = 0
        delay = cfg.retry_initial_delay_sec
        while True:
            try:
                return fn()
            except errors.RETRYABLE:
                attempts += 1
                if attempts >= cfg.retry_max_attempts:
                    raise
                self.stats.retries += 1
                # Full jitter: sleep a uniform random time between 0 and the
                # current backoff ceiling, then double the ceiling (capped).
                sleep_for = random.uniform(0, min(delay, cfg.retry_max_delay_sec))
                time.sleep(sleep_for)
                delay = min(delay * 2, cfg.retry_max_delay_sec)
            # Non-retryable errors are not caught here and propagate immediately.
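The backoff schedule used by `_with_retry` can be looked at in isolation. The following is a minimal sketch, not the module's code: `backoff_delays` and its parameters are hypothetical names chosen for illustration, and the random source is injected so the result is reproducible in tests.

```python
import random


def backoff_delays(initial: float, cap: float, attempts: int,
                   rng: random.Random) -> list[float]:
    """Return the sleep durations a full-jitter retry loop would draw.

    Hypothetical helper for illustration; it mirrors the arithmetic of the
    retry loop (uniform draw below a doubling, capped ceiling) without
    sleeping or calling any API.
    """
    delays: list[float] = []
    ceiling = initial
    for _ in range(attempts):
        # Draw uniformly in [0, ceiling], never above the configured cap.
        delays.append(rng.uniform(0, min(ceiling, cap)))
        # Double the ceiling for the next attempt, still capped.
        ceiling = min(ceiling * 2, cap)
    return delays


if __name__ == "__main__":
    for i, d in enumerate(backoff_delays(1.0, 8.0, 6, random.Random(0)), 1):
        print(f"attempt {i}: sleep {d:.3f}s")
```

Drawing from the whole interval rather than adding a small random offset spreads concurrent retries more evenly, which tends to matter when several mailboxes hit a rate limit at the same time.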
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- batch fetch sketch
|
||||
|
||||
def batch_get_metadata(client: GmailClient, message_ids: list[str],
|
||||
headers: list[str]) -> dict[str, dict]:
|
||||
"""Fetch metadata for up to ~100 messages.
|
||||
|
||||
TODO: implement using Gmail's multipart/mixed batch endpoint at
|
||||
https://www.googleapis.com/batch/gmail/v1 for efficiency. In the scaffold
|
||||
we fall back to serial gets so the logic is correct from day 1.
|
||||
"""
|
||||
out: dict[str, dict] = {}
|
||||
for mid in message_ids:
|
||||
try:
|
||||
out[mid] = client.get_message(mid, format="metadata", metadata_headers=headers)
|
||||
except errors.NotFoundError:
|
||||
# Message deleted between list and get — skip.
|
||||
continue
|
||||
return out
|
||||
@@ -0,0 +1,215 @@
|
||||
"""
|
||||
Investor matching.
|
||||
|
||||
Builds an in-memory index of investor email addresses from:
|
||||
- fundraising_contacts.email
|
||||
- contacts.email
|
||||
- organizations.email + organizations.website (domain only)
|
||||
|
||||
For each synced email, returns a list of investor links. Exact-email matches
|
||||
beat domain matches; if any exact match exists, domain matches are suppressed.
|
||||
|
||||
The index is rebuilt by rebuild_if_stale() once it is older than `REFRESH_INTERVAL_SEC` seconds, or on demand via rebuild().
|
||||
"""
|
||||
|
||||
import re
|
||||
import threading
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
|
||||
|
||||
REFRESH_INTERVAL_SEC = 900 # 15 minutes
|
||||
|
||||
# Domains we never domain-match against (personal mailboxes).
|
||||
COMMON_PERSONAL_DOMAINS = {
|
||||
"gmail.com", "googlemail.com",
|
||||
"outlook.com", "hotmail.com", "live.com", "msn.com",
|
||||
"yahoo.com", "yahoo.co.uk", "ymail.com",
|
||||
"icloud.com", "me.com", "mac.com",
|
||||
"aol.com", "proton.me", "protonmail.com",
|
||||
"pm.me", "fastmail.com", "tuta.io", "hey.com",
|
||||
"duck.com", "zoho.com",
|
||||
}
|
||||
|
||||
|
||||
# Addresses on the team's own domain are skipped as well (teammates email each other).
# The domain is passed to InvestorIndex(own_domain=...), e.g. from CONFIG.workspace_domain.
|
||||
|
||||
|
||||
@dataclass
|
||||
class MatchTarget:
|
||||
fundraising_investor_id: Optional[str] = None
|
||||
fundraising_contact_id: Optional[str] = None
|
||||
contact_id: Optional[str] = None
|
||||
organization_id: Optional[str] = None
|
||||
investor_name: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class InvestorLink:
|
||||
matched_address: str
|
||||
match_kind: str # exact_email | domain_match | manual
|
||||
match_confidence: float
|
||||
target: MatchTarget
|
||||
|
||||
|
||||
class InvestorIndex:
|
||||
|
||||
def __init__(self, own_domain: Optional[str] = None):
|
||||
self._email_index: dict[str, MatchTarget] = {}
|
||||
self._domain_index: dict[str, list[MatchTarget]] = {}
|
||||
self._own_domain = (own_domain or "").lower() or None
|
||||
self._last_built = 0.0
|
||||
self._lock = threading.Lock()
|
||||
|
||||
# ------------------------------------------------------------------ build
|
||||
|
||||
def rebuild(self, db_conn_factory) -> None:
|
||||
with self._lock:
|
||||
email_idx: dict[str, MatchTarget] = {}
|
||||
domain_idx: dict[str, list[MatchTarget]] = {}
|
||||
|
||||
conn = db_conn_factory()
|
||||
try:
|
||||
cur = conn.cursor()
|
||||
|
||||
# fundraising_contacts
|
||||
cur.execute(
|
||||
"SELECT fc.id, fc.email, fc.investor_id, fi.investor_name "
|
||||
"FROM fundraising_contacts fc "
|
||||
"LEFT JOIN fundraising_investors fi ON fi.id = fc.investor_id "
|
||||
"WHERE fc.email IS NOT NULL AND fc.email != ''"
|
||||
)
|
||||
for r in cur.fetchall():
|
||||
addr = (r["email"] or "").lower().strip()
|
||||
if not _valid_email(addr):
|
||||
continue
|
||||
email_idx[addr] = MatchTarget(
|
||||
fundraising_contact_id=r["id"],
|
||||
fundraising_investor_id=r["investor_id"],
|
||||
investor_name=r["investor_name"],
|
||||
)
|
||||
|
||||
# contacts
|
||||
cur.execute(
|
||||
"SELECT id, email, organization_id FROM contacts "
|
||||
"WHERE email IS NOT NULL AND email != ''"
|
||||
)
|
||||
for r in cur.fetchall():
|
||||
addr = (r["email"] or "").lower().strip()
|
||||
if not _valid_email(addr):
|
||||
continue
|
||||
# Don't overwrite a fundraising_contact match; they're higher signal.
|
||||
email_idx.setdefault(addr, MatchTarget(
|
||||
contact_id=r["id"],
|
||||
organization_id=r["organization_id"],
|
||||
))
|
||||
|
||||
# organizations — domain-only match source
|
||||
cur.execute(
|
||||
"SELECT id, name, email, website FROM organizations "
|
||||
"WHERE (email IS NOT NULL AND email != '') OR (website IS NOT NULL AND website != '')"
|
||||
)
|
||||
for r in cur.fetchall():
|
||||
for d in _domains_for_org(r):
|
||||
if d in COMMON_PERSONAL_DOMAINS:
|
||||
continue
|
||||
if self._own_domain and d == self._own_domain:
|
||||
continue
|
||||
domain_idx.setdefault(d, []).append(MatchTarget(
|
||||
organization_id=r["id"],
|
||||
investor_name=r["name"],
|
||||
))
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
self._email_index = email_idx
|
||||
self._domain_index = domain_idx
|
||||
self._last_built = time.time()
|
||||
|
||||
def rebuild_if_stale(self, db_conn_factory) -> None:
|
||||
if time.time() - self._last_built > REFRESH_INTERVAL_SEC:
|
||||
self.rebuild(db_conn_factory)
|
||||
|
||||
# ------------------------------------------------------------------ query
|
||||
|
||||
def match(self, addresses: set[str], *,
|
||||
exclude_addresses: Optional[set[str]] = None) -> list[InvestorLink]:
|
||||
excl = {a.lower().strip() for a in (exclude_addresses or set()) if a}
|
||||
candidates = {a.lower().strip() for a in addresses if a} - excl
|
||||
|
||||
# Exclude own domain addresses (teammates emailing each other).
|
||||
if self._own_domain:
|
||||
candidates = {a for a in candidates
|
||||
if not a.endswith("@" + self._own_domain)}
|
||||
|
||||
links: list[InvestorLink] = []
|
||||
seen_targets: set[tuple] = set()
|
||||
|
||||
# Exact email matches first.
|
||||
for addr in candidates:
|
||||
t = self._email_index.get(addr)
|
||||
if t:
|
||||
key = (t.fundraising_contact_id, t.contact_id)
|
||||
if key in seen_targets:
|
||||
continue
|
||||
seen_targets.add(key)
|
||||
links.append(InvestorLink(
|
||||
matched_address=addr,
|
||||
match_kind="exact_email",
|
||||
match_confidence=1.0,
|
||||
target=t,
|
||||
))
|
||||
|
||||
if links: # exact hits short-circuit domain matching
|
||||
return links
|
||||
|
||||
# Domain fallback.
|
||||
for addr in candidates:
|
||||
_, _, domain = addr.partition("@")
|
||||
if not domain or domain in COMMON_PERSONAL_DOMAINS:
|
||||
continue
|
||||
for t in self._domain_index.get(domain, []):
|
||||
key = ("org", t.organization_id)
|
||||
if key in seen_targets:
|
||||
continue
|
||||
seen_targets.add(key)
|
||||
links.append(InvestorLink(
|
||||
matched_address=addr,
|
||||
match_kind="domain_match",
|
||||
match_confidence=0.6,
|
||||
target=t,
|
||||
))
|
||||
return links
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- helpers
|
||||
|
||||
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
|
||||
|
||||
|
||||
def _valid_email(s: str) -> bool:
|
||||
return bool(_EMAIL_RE.match(s))
|
||||
|
||||
|
||||
def _domains_for_org(row) -> list[str]:
|
||||
out: list[str] = []
|
||||
if row["email"]:
|
||||
_, _, d = row["email"].strip().lower().partition("@")
|
||||
if d:
|
||||
out.append(d)
|
||||
if row["website"]:
|
||||
d = _domain_from_url(row["website"])
|
||||
if d:
|
||||
out.append(d)
|
||||
return list({d for d in out if d})
|
||||
|
||||
|
||||
def _domain_from_url(url: str) -> Optional[str]:
|
||||
if not url:
|
||||
return None
|
||||
m = re.match(r"^\s*(?:https?://)?(?:www\.)?([^/:?#\s]+)", url.strip(), re.IGNORECASE)
|
||||
if not m:
|
||||
return None
|
||||
return m.group(1).lower()
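The precedence rule described in the module docstring (exact-email hits suppress domain hits, and personal mailbox domains never domain-match) can be sketched on plain dictionaries. This is an illustrative reduction under stated assumptions: `match_addresses`, `PERSONAL`, and the tuple result shape are hypothetical and stand in for `InvestorIndex.match` and `InvestorLink`; only the regular expression is copied from `_domain_from_url`.

```python
import re
from typing import Optional

# Hypothetical, shortened stand-in for COMMON_PERSONAL_DOMAINS.
PERSONAL = {"gmail.com", "outlook.com"}


def domain_from_url(url: str) -> Optional[str]:
    """Extract a bare host from a website field (same pattern as above)."""
    m = re.match(r"^\s*(?:https?://)?(?:www\.)?([^/:?#\s]+)", url.strip(), re.IGNORECASE)
    return m.group(1).lower() if m else None


def match_addresses(addresses, email_index, domain_index):
    """Return (address, kind, confidence, target) tuples.

    Exact matches win outright; domain matches are only considered when no
    address matched exactly.
    """
    candidates = sorted({a.lower().strip() for a in addresses if a})
    exact = [(a, "exact_email", 1.0, email_index[a])
             for a in candidates if a in email_index]
    if exact:
        return exact
    links = []
    for a in candidates:
        domain = a.partition("@")[2]
        if not domain or domain in PERSONAL:
            continue
        for target in domain_index.get(domain, []):
            links.append((a, "domain_match", 0.6, target))
    return links


if __name__ == "__main__":
    emails = {"ann@fund.example": "contact-1"}
    domains = {domain_from_url("https://www.Fund.example/team"): ["org-1"]}
    print(match_addresses({"Ann@Fund.example", "bob@fund.example"}, emails, domains))
    print(match_addresses({"bob@fund.example"}, emails, domains))
```

The first call yields only the exact hit for Ann even though Bob shares the organization's domain; the second call falls back to a domain match because no address is known exactly.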
|
||||
@@ -0,0 +1,192 @@
|
||||
-- Gmail Integration — Phase 1 migration
|
||||
-- Creates all tables for email capture, matching, threading, attachments.
|
||||
-- This migration is IDEMPOTENT: safe to re-run.
|
||||
-- Applied by email_integration.db.apply_migrations() on server startup when
|
||||
-- CRM_GMAIL_INTEGRATION_ENABLED is truthy.
|
||||
--
|
||||
-- DO NOT modify this file in place after it ships. Create 0002_*.sql, etc.
|
||||
|
||||
-- ============================================================================
|
||||
-- email_accounts — one row per enrolled team-member mailbox
|
||||
-- ============================================================================
|
||||
CREATE TABLE IF NOT EXISTS email_accounts (
|
||||
id TEXT PRIMARY KEY,
|
||||
user_id TEXT NOT NULL,
|
||||
email_address TEXT NOT NULL UNIQUE,
|
||||
auth_method TEXT NOT NULL, -- 'dwd' | 'oauth'
|
||||
oauth_refresh_enc BLOB,
|
||||
oauth_token_enc BLOB,
|
||||
oauth_token_exp TEXT,
|
||||
sync_enabled INTEGER NOT NULL DEFAULT 1,
|
||||
sync_status TEXT NOT NULL DEFAULT 'pending',
|
||||
sync_error TEXT,
|
||||
last_history_id TEXT,
|
||||
last_synced_at TEXT,
|
||||
backfill_complete INTEGER NOT NULL DEFAULT 0,
|
||||
backfill_cursor TEXT,
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
updated_at TEXT DEFAULT (datetime('now')),
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_email_accounts_user ON email_accounts(user_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_email_accounts_sync ON email_accounts(sync_enabled, sync_status);
|
||||
|
||||
-- ============================================================================
|
||||
-- emails — canonical email record, dedup'd across accounts by RFC Message-ID
|
||||
-- ============================================================================
|
||||
CREATE TABLE IF NOT EXISTS emails (
|
||||
id TEXT PRIMARY KEY,
|
||||
rfc_message_id TEXT NOT NULL UNIQUE,
|
||||
gmail_thread_id TEXT,
|
||||
rfc_thread_root_id TEXT,
|
||||
thread_id TEXT, -- FK email_threads.id (populated by threads.py)
|
||||
subject TEXT,
|
||||
from_email TEXT NOT NULL,
|
||||
from_name TEXT,
|
||||
to_emails_json TEXT NOT NULL DEFAULT '[]',
|
||||
cc_emails_json TEXT NOT NULL DEFAULT '[]',
|
||||
bcc_emails_json TEXT NOT NULL DEFAULT '[]',
|
||||
reply_to TEXT,
|
||||
sent_at TEXT NOT NULL,
|
||||
body_text TEXT,
|
||||
body_html TEXT,
|
||||
snippet TEXT,
|
||||
in_reply_to TEXT,
|
||||
references_json TEXT DEFAULT '[]',
|
||||
has_attachments INTEGER NOT NULL DEFAULT 0,
|
||||
size_estimate INTEGER,
|
||||
is_matched INTEGER NOT NULL DEFAULT 0,
|
||||
match_status TEXT NOT NULL DEFAULT 'unmatched', -- unmatched|matched|skipped
|
||||
raw_headers_json TEXT,
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
updated_at TEXT DEFAULT (datetime('now'))
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_emails_thread ON emails(gmail_thread_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_emails_rfc_thread ON emails(rfc_thread_root_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_emails_thread_fk ON emails(thread_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_emails_from ON emails(from_email);
|
||||
CREATE INDEX IF NOT EXISTS idx_emails_sent_at ON emails(sent_at);
|
||||
CREATE INDEX IF NOT EXISTS idx_emails_matched ON emails(is_matched, sent_at);
|
||||
CREATE INDEX IF NOT EXISTS idx_emails_in_reply_to ON emails(in_reply_to);
|
||||
|
||||
-- ============================================================================
|
||||
-- email_recipients — denormalized for fast address lookups
|
||||
-- ============================================================================
|
||||
CREATE TABLE IF NOT EXISTS email_recipients (
|
||||
id TEXT PRIMARY KEY,
|
||||
email_id TEXT NOT NULL,
|
||||
address TEXT NOT NULL,
|
||||
display_name TEXT,
|
||||
kind TEXT NOT NULL, -- from|to|cc|bcc|reply_to
|
||||
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_email_recipients_addr ON email_recipients(address);
|
||||
CREATE INDEX IF NOT EXISTS idx_email_recipients_email ON email_recipients(email_id);
|
||||
|
||||
-- ============================================================================
|
||||
-- email_account_messages — per-mailbox sighting of an email
|
||||
-- ============================================================================
|
||||
CREATE TABLE IF NOT EXISTS email_account_messages (
|
||||
id TEXT PRIMARY KEY,
|
||||
email_id TEXT NOT NULL,
|
||||
account_id TEXT NOT NULL,
|
||||
gmail_message_id TEXT NOT NULL,
|
||||
gmail_thread_id TEXT NOT NULL,
|
||||
labels_json TEXT DEFAULT '[]',
|
||||
is_sent INTEGER NOT NULL DEFAULT 0,
|
||||
first_seen_at TEXT DEFAULT (datetime('now')),
|
||||
deleted_at TEXT,
|
||||
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE,
|
||||
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE,
|
||||
UNIQUE(account_id, gmail_message_id)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_eam_email ON email_account_messages(email_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_eam_account ON email_account_messages(account_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_eam_gmail_msg ON email_account_messages(gmail_message_id);
|
||||
|
||||
-- ============================================================================
|
||||
-- email_attachments — metadata; bytes on disk under data/email_attachments/
|
||||
-- ============================================================================
|
||||
CREATE TABLE IF NOT EXISTS email_attachments (
|
||||
id TEXT PRIMARY KEY,
|
||||
email_id TEXT NOT NULL,
|
||||
gmail_attachment_id TEXT NOT NULL,
|
||||
filename TEXT NOT NULL,
|
||||
sanitized_filename TEXT NOT NULL,
|
||||
mime_type TEXT,
|
||||
size_bytes INTEGER,
|
||||
sha256_hex TEXT,
|
||||
storage_path TEXT NOT NULL,
|
||||
download_status TEXT NOT NULL DEFAULT 'pending', -- pending|downloaded|failed|skipped
|
||||
download_attempts INTEGER NOT NULL DEFAULT 0,
|
||||
download_error TEXT,
|
||||
downloaded_at TEXT,
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_attach_email ON email_attachments(email_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_attach_sha ON email_attachments(sha256_hex);
|
||||
CREATE INDEX IF NOT EXISTS idx_attach_status ON email_attachments(download_status);
|
||||
|
||||
-- ============================================================================
|
||||
-- email_threads — thread roll-up for UI
|
||||
-- ============================================================================
|
||||
CREATE TABLE IF NOT EXISTS email_threads (
|
||||
id TEXT PRIMARY KEY,
|
||||
gmail_thread_id TEXT,
|
||||
rfc_thread_root_id TEXT,
|
||||
subject_normalized TEXT,
|
||||
first_message_at TEXT,
|
||||
last_message_at TEXT,
|
||||
message_count INTEGER NOT NULL DEFAULT 0,
|
||||
participant_count INTEGER NOT NULL DEFAULT 0,
|
||||
participants_json TEXT DEFAULT '[]',
|
||||
is_matched INTEGER NOT NULL DEFAULT 0,
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
updated_at TEXT DEFAULT (datetime('now'))
|
||||
);
|
||||
CREATE UNIQUE INDEX IF NOT EXISTS idx_threads_gmail_uniq ON email_threads(gmail_thread_id)
|
||||
WHERE gmail_thread_id IS NOT NULL;
|
||||
CREATE INDEX IF NOT EXISTS idx_threads_rfc_root ON email_threads(rfc_thread_root_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_threads_last_msg ON email_threads(last_message_at);
|
||||
|
||||
-- ============================================================================
|
||||
-- email_investor_links — matched investors
|
||||
-- ============================================================================
|
||||
CREATE TABLE IF NOT EXISTS email_investor_links (
|
||||
id TEXT PRIMARY KEY,
|
||||
email_id TEXT NOT NULL,
|
||||
fundraising_investor_id TEXT,
|
||||
fundraising_contact_id TEXT,
|
||||
contact_id TEXT,
|
||||
organization_id TEXT,
|
||||
matched_address TEXT NOT NULL,
|
||||
match_kind TEXT NOT NULL, -- exact_email|domain_match|manual
|
||||
match_confidence REAL NOT NULL DEFAULT 1.0,
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_eil_email ON email_investor_links(email_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_eil_investor ON email_investor_links(fundraising_investor_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_eil_fr_contact ON email_investor_links(fundraising_contact_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_eil_contact ON email_investor_links(contact_id);
|
||||
|
||||
-- ============================================================================
|
||||
-- email_sync_runs — per-run observability
|
||||
-- ============================================================================
|
||||
CREATE TABLE IF NOT EXISTS email_sync_runs (
|
||||
id TEXT PRIMARY KEY,
|
||||
account_id TEXT NOT NULL,
|
||||
kind TEXT NOT NULL, -- backfill|incremental|manual
|
||||
started_at TEXT NOT NULL,
|
||||
finished_at TEXT,
|
||||
status TEXT NOT NULL, -- running|ok|error|partial
|
||||
messages_seen INTEGER NOT NULL DEFAULT 0,
|
||||
messages_stored INTEGER NOT NULL DEFAULT 0,
|
||||
attachments_saved INTEGER NOT NULL DEFAULT 0,
|
||||
api_calls INTEGER NOT NULL DEFAULT 0,
|
||||
retries INTEGER NOT NULL DEFAULT 0,
|
||||
error TEXT,
|
||||
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_sync_runs_account ON email_sync_runs(account_id, started_at);
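The split between `emails` (one row per RFC Message-ID) and `email_account_messages` (one row per mailbox sighting) implies a two-step write. The sketch below shows one way this could work against a trimmed copy of the schema; `record_sighting` is a hypothetical helper, the real `db.insert_email()` is not shown in this section and may differ, and most columns and the `email_accounts` foreign key are left out for brevity.

```python
import sqlite3
import uuid

# Trimmed copy of the two tables; only the columns needed for the sketch.
SCHEMA = """
CREATE TABLE emails (
    id             TEXT PRIMARY KEY,
    rfc_message_id TEXT NOT NULL UNIQUE,
    from_email     TEXT NOT NULL,
    sent_at        TEXT NOT NULL
);
CREATE TABLE email_account_messages (
    id               TEXT PRIMARY KEY,
    email_id         TEXT NOT NULL REFERENCES emails(id) ON DELETE CASCADE,
    account_id       TEXT NOT NULL,
    gmail_message_id TEXT NOT NULL,
    gmail_thread_id  TEXT NOT NULL,
    UNIQUE(account_id, gmail_message_id)
);
"""


def record_sighting(conn: sqlite3.Connection, *, account_id: str,
                    gmail_message_id: str, gmail_thread_id: str,
                    rfc_message_id: str, from_email: str, sent_at: str) -> str:
    """Store the canonical email once, then the per-mailbox sighting."""
    # The UNIQUE constraint on rfc_message_id makes a repeat insert a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO emails (id, rfc_message_id, from_email, sent_at) "
        "VALUES (?, ?, ?, ?)",
        (uuid.uuid4().hex, rfc_message_id, from_email, sent_at),
    )
    email_id = conn.execute(
        "SELECT id FROM emails WHERE rfc_message_id = ?", (rfc_message_id,)
    ).fetchone()[0]
    # UNIQUE(account_id, gmail_message_id) makes re-syncing a mailbox idempotent.
    conn.execute(
        "INSERT OR IGNORE INTO email_account_messages "
        "(id, email_id, account_id, gmail_message_id, gmail_thread_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (uuid.uuid4().hex, email_id, account_id, gmail_message_id, gmail_thread_id),
    )
    return email_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    common = dict(rfc_message_id="abc@mail.example",
                  from_email="ann@fund.example", sent_at="2024-04-21T00:00:00Z")
    a = record_sighting(conn, account_id="acct-a", gmail_message_id="g1",
                        gmail_thread_id="t1", **common)
    b = record_sighting(conn, account_id="acct-b", gmail_message_id="g2",
                        gmail_thread_id="t9", **common)
    print(a == b)  # both mailboxes point at the same canonical email row
```

Two teammates who receive the same message thus share one `emails` row while each keeps their own Gmail message and thread ids.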
|
||||
@@ -0,0 +1,283 @@
|
||||
"""
|
||||
Parse a Gmail `users.messages.get` response (format=full) into a flat dict
|
||||
ready for db.insert_email().
|
||||
|
||||
Input shape (abbreviated):
|
||||
{
|
||||
"id": "...", # Gmail message id
|
||||
"threadId": "...",
|
||||
"labelIds": ["INBOX","IMPORTANT",...],
|
||||
"snippet": "...",
|
||||
"historyId": "...",
|
||||
"internalDate": "1713657600000", # ms epoch, authoritative
|
||||
"sizeEstimate": 12345,
|
||||
"payload": {
|
||||
"headers": [{"name":"Subject","value":"..."}, ...],
|
||||
"mimeType": "multipart/mixed",
|
||||
"parts": [...recursive...],
|
||||
"body": {"data": "<base64url>", "size": ...}
|
||||
}
|
||||
}
|
||||
"""
|
||||
|
||||
import base64
|
||||
import email.utils
|
||||
import email.header
|
||||
import re
|
||||
from datetime import datetime, timezone
|
||||
from typing import Any, Iterable, Optional
|
||||
from html.parser import HTMLParser
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- public
|
||||
|
||||
def parse(message: dict, *, owning_account_address: Optional[str] = None) -> dict:
|
||||
"""Parse a Gmail message payload into our canonical dict shape."""
|
||||
headers = _header_map(message.get("payload", {}).get("headers") or [])
|
||||
|
||||
from_name, from_email = _split_addr(headers.get("from", ""))
|
||||
to_list = _parse_address_list(headers.get("to", ""))
|
||||
cc_list = _parse_address_list(headers.get("cc", ""))
|
||||
bcc_list = _parse_address_list(headers.get("bcc", ""))
|
||||
reply_to = _split_addr(headers.get("reply-to", ""))[1] or None
|
||||
|
||||
sent_at = _parse_date_header(headers.get("date"), fallback_ms=message.get("internalDate"))
|
||||
|
||||
rfc_mid = headers.get("message-id", "").strip() or f"synthetic-{message.get('id')}@ten31.local"
|
||||
rfc_mid = _strip_angle_brackets(rfc_mid)
|
||||
in_reply_to = _strip_angle_brackets(headers.get("in-reply-to", "").strip()) or None
|
||||
references = _split_references(headers.get("references", ""))
|
||||
rfc_thread_root_id = references[0] if references else (in_reply_to or rfc_mid)
|
||||
|
||||
body_text, body_html, attachments = _walk_payload(message.get("payload", {}))
|
||||
|
||||
subject = _decode_rfc2047(headers.get("subject") or "")
|
||||
|
||||
labels = message.get("labelIds") or []
|
||||
is_sent = "SENT" in labels
|
||||
|
||||
return {
|
||||
"gmail_message_id": message.get("id"),
|
||||
"gmail_thread_id": message.get("threadId"),
|
||||
"rfc_message_id": rfc_mid,
|
||||
"rfc_thread_root_id": rfc_thread_root_id,
|
||||
"in_reply_to": in_reply_to,
|
||||
"references": references,
|
||||
"subject": subject,
|
||||
"from_email": (from_email or "").lower(),
|
||||
"from_name": from_name,
|
||||
"to": [{"email": e.lower(), "name": n} for n, e in to_list if e],
|
||||
"cc": [{"email": e.lower(), "name": n} for n, e in cc_list if e],
|
||||
"bcc": [{"email": e.lower(), "name": n} for n, e in bcc_list if e],
|
||||
"reply_to": reply_to.lower() if reply_to else None,
|
||||
"sent_at": sent_at,
|
||||
"body_text": _cap_text(body_text),
|
||||
"body_html": _cap_text(body_html),
|
||||
"snippet": message.get("snippet"),
|
||||
"attachments": attachments,
|
||||
"size_estimate": message.get("sizeEstimate"),
|
||||
"labels": labels,
|
||||
"is_sent": is_sent,
|
||||
"raw_headers": headers,
|
||||
"owning_account": owning_account_address,
|
||||
}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- headers
|
||||
|
||||
def _header_map(header_list: Iterable[dict]) -> dict[str, str]:
|
||||
"""Case-insensitive keys. Last-write-wins for duplicates (rare)."""
|
||||
out: dict[str, str] = {}
|
||||
for h in header_list:
|
||||
name = (h.get("name") or "").lower()
|
||||
out[name] = h.get("value") or ""
|
||||
return out
|
||||
|
||||
|
||||
def _decode_rfc2047(s: str) -> str:
|
||||
if not s:
|
||||
return ""
|
||||
try:
|
||||
parts = email.header.decode_header(s)
|
||||
pieces = []
|
||||
for text, charset in parts:
|
||||
if isinstance(text, bytes):
|
||||
try:
|
||||
pieces.append(text.decode(charset or "utf-8", errors="replace"))
|
||||
except LookupError:
|
||||
pieces.append(text.decode("utf-8", errors="replace"))
|
||||
else:
|
||||
pieces.append(text)
|
||||
return "".join(pieces)
|
||||
except Exception:
|
||||
return s
|
||||
|
||||
|
||||
def _split_addr(raw: str) -> tuple[Optional[str], Optional[str]]:
|
||||
if not raw:
|
||||
return (None, None)
|
||||
name, addr = email.utils.parseaddr(raw)
|
||||
return (_decode_rfc2047(name) or None, addr or None)
|
||||
|
||||
|
||||
def _parse_address_list(raw: str) -> list[tuple[Optional[str], Optional[str]]]:
|
||||
if not raw:
|
||||
return []
|
||||
parsed = email.utils.getaddresses([raw])
|
||||
return [(_decode_rfc2047(n) or None, a or None) for n, a in parsed if a]
|
||||
|
||||
|
||||
def _parse_date_header(raw: Optional[str], *, fallback_ms: Optional[str]) -> str:
|
||||
# Prefer RFC Date header, fall back to Gmail internalDate (epoch ms).
|
||||
if raw:
|
||||
try:
|
||||
dt = email.utils.parsedate_to_datetime(raw)
|
||||
if dt is not None:
|
||||
if dt.tzinfo is None:
|
||||
dt = dt.replace(tzinfo=timezone.utc)
|
||||
return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
|
||||
except (TypeError, ValueError):
|
||||
pass
|
||||
if fallback_ms:
|
||||
try:
|
||||
dt = datetime.fromtimestamp(int(fallback_ms) / 1000.0, tz=timezone.utc)
|
||||
return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
|
||||
except (TypeError, ValueError):
|
||||
pass
|
||||
return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
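The fallback order in `_parse_date_header` (RFC `Date` header first, then Gmail's `internalDate` in epoch milliseconds) can be exercised on its own. This is a sketch under assumptions: `normalize_sent_at` is a hypothetical name, and unlike the function above it returns `None` instead of the current time when both inputs fail, so the outcome stays deterministic.

```python
import email.utils
from datetime import datetime, timezone
from typing import Optional

FMT = "%Y-%m-%dT%H:%M:%SZ"


def normalize_sent_at(date_header: Optional[str],
                      internal_date_ms: Optional[str]) -> Optional[str]:
    """Return a UTC ISO-8601 timestamp, or None if nothing is parseable."""
    if date_header:
        try:
            dt = email.utils.parsedate_to_datetime(date_header)
            if dt.tzinfo is None:
                # A naive result (e.g. a "-0000" offset) is treated as UTC.
                dt = dt.replace(tzinfo=timezone.utc)
            return dt.astimezone(timezone.utc).strftime(FMT)
        except (TypeError, ValueError):
            pass  # fall through to internalDate
    if internal_date_ms:
        try:
            seconds = int(internal_date_ms) / 1000.0
            return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(FMT)
        except (TypeError, ValueError):
            pass
    return None


if __name__ == "__main__":
    print(normalize_sent_at("Mon, 22 Apr 2024 10:00:00 -0400", None))
    print(normalize_sent_at("garbage", "1713657600000"))
```

Normalizing everything to UTC with a fixed format keeps `sent_at` sortable as plain text, which the `idx_emails_sent_at` index relies on.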
|
||||
|
||||
|
||||
def _split_references(raw: str) -> list[str]:
|
||||
if not raw:
|
||||
return []
|
||||
return [_strip_angle_brackets(p) for p in raw.split() if p.strip()]
|
||||
|
||||
|
||||
def _strip_angle_brackets(s: str) -> str:
|
||||
s = (s or "").strip()
|
||||
if s.startswith("<") and s.endswith(">"):
|
||||
return s[1:-1]
|
||||
return s
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- MIME walk
|
||||
|
||||
def _walk_payload(payload: dict) -> tuple[Optional[str], Optional[str], list[dict]]:
|
||||
"""Returns (body_text, body_html, attachments).
|
||||
|
||||
Depth-first walk. First text/plain wins for body_text; first text/html
|
||||
wins for body_html. Anything with a filename or attachment disposition
|
||||
becomes an attachment entry.
|
||||
"""
|
||||
text: Optional[str] = None
|
||||
html_body: Optional[str] = None
|
||||
attachments: list[dict] = []
|
||||
|
||||
def visit(part: dict):
|
||||
nonlocal text, html_body
|
||||
mime = (part.get("mimeType") or "").lower()
|
||||
filename = part.get("filename") or ""
|
||||
body = part.get("body") or {}
|
||||
parts = part.get("parts") or []
|
||||
|
||||
headers = _header_map(part.get("headers") or [])
|
||||
disposition = (headers.get("content-disposition") or "").lower()
|
||||
is_attachment = bool(filename) or disposition.startswith("attachment")
|
||||
|
||||
if is_attachment:
|
||||
attachments.append({
|
||||
"filename": filename or f"unnamed.{_ext_for(mime)}",
|
||||
"mime_type": mime or "application/octet-stream",
|
||||
"size": body.get("size"),
|
||||
"gmail_attachment_id": body.get("attachmentId"),
|
||||
# Some tiny attachments come inlined as base64; attachmentId is
|
||||
# then missing and data is in body.data. sync.py handles both.
|
||||
"inline_data_b64": body.get("data"),
|
||||
"content_disposition": "inline" if disposition.startswith("inline") else "attachment",
|
||||
})
|
||||
else:
|
||||
if mime == "text/plain" and text is None:
|
||||
text = _decode_body(body)
|
||||
elif mime == "text/html" and html_body is None:
|
||||
html_body = _decode_body(body)
|
||||
|
||||
for child in parts:
|
||||
visit(child)
|
||||
|
||||
visit(payload)
|
||||
|
||||
# Derive a plain-text body from HTML if only HTML exists.
|
||||
if text is None and html_body:
|
||||
text = _strip_html(html_body)
|
||||
|
||||
return text, html_body, attachments
|
||||
|
||||
|
||||
def _decode_body(body: dict) -> Optional[str]:
|
||||
data = body.get("data")
|
||||
if not data:
|
||||
return None
|
||||
try:
|
||||
padding = 4 - (len(data) % 4)
|
||||
if padding != 4:
|
||||
data = data + ("=" * padding)
|
||||
raw = base64.urlsafe_b64decode(data.encode("ascii"))
|
||||
return raw.decode("utf-8", errors="replace").replace("\r\n", "\n")
|
||||
except Exception:
|
||||
return None
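To make the MIME walk and the base64url decoding concrete, here is a reduced sketch run against a hand-built payload. The helper names (`b64url`, `decode`, `walk`) and the `SAMPLE` message are hypothetical and exist only for illustration; the sketch records attachment filenames only and omits the disposition handling of `_walk_payload`.

```python
import base64
from typing import Optional


def b64url(text: str) -> str:
    """Encode like Gmail does: URL-safe base64 without '=' padding."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii").rstrip("=")


def decode(data: str) -> str:
    """Restore the stripped padding, then decode as UTF-8."""
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")


def walk(part: dict, found: Optional[dict] = None) -> dict:
    """Depth-first walk: first text/plain and first text/html win."""
    if found is None:
        found = {"text": None, "html": None, "attachments": []}
    mime = (part.get("mimeType") or "").lower()
    body = part.get("body") or {}
    if part.get("filename"):
        found["attachments"].append(part["filename"])
    elif mime == "text/plain" and found["text"] is None and body.get("data"):
        found["text"] = decode(body["data"])
    elif mime == "text/html" and found["html"] is None and body.get("data"):
        found["html"] = decode(body["data"])
    for child in part.get("parts") or []:
        walk(child, found)
    return found


# Hand-built payload in the shape returned by users.messages.get (format=full).
SAMPLE = {
    "mimeType": "multipart/mixed",
    "parts": [
        {"mimeType": "multipart/alternative", "parts": [
            {"mimeType": "text/plain", "body": {"data": b64url("Hi there")}},
            {"mimeType": "text/html", "body": {"data": b64url("<p>Hi there</p>")}},
        ]},
        {"mimeType": "application/pdf", "filename": "deck.pdf",
         "body": {"attachmentId": "att-1", "size": 1024}},
    ],
}

if __name__ == "__main__":
    print(walk(SAMPLE))
```

The expression `-len(data) % 4` yields the number of missing padding characters directly, which is equivalent to the `4 - (len % 4)` check used in `_decode_body`.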
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- HTML stripping
|
||||
|
||||
class _HTMLToText(HTMLParser):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self._parts: list[str] = []
|
||||
self._skip_depth = 0
|
||||
|
||||
def handle_starttag(self, tag, attrs):
|
||||
if tag in ("script", "style"):
|
||||
self._skip_depth += 1
|
||||
if tag in ("br", "p", "div", "tr", "li"):
|
||||
self._parts.append("\n")
|
||||
|
||||
def handle_endtag(self, tag):
|
||||
if tag in ("script", "style"):
|
||||
self._skip_depth = max(0, self._skip_depth - 1)
|
||||
if tag in ("p", "div", "tr"):
|
||||
self._parts.append("\n")
|
||||
|
||||
def handle_data(self, data):
|
||||
if self._skip_depth == 0:
|
||||
self._parts.append(data)
|
||||
|
||||
def text(self) -> str:
|
||||
raw = "".join(self._parts)
|
||||
return re.sub(r"\n{3,}", "\n\n", raw).strip()
|
||||
|
||||
|
||||
def _strip_html(html: str) -> str:
|
||||
p = _HTMLToText()
|
||||
try:
|
||||
p.feed(html)
|
||||
return p.text()
|
||||
except Exception:
|
||||
return re.sub(r"<[^>]+>", " ", html)
|
||||
|
||||
|
||||
def _ext_for(mime: str) -> str:
|
||||
return mime.split("/")[-1] if "/" in mime else "bin"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- caps
|
||||
|
||||
# Keep bodies bounded to avoid a pathological 500MB message exploding the DB.
|
||||
_BODY_CAP_BYTES = 10 * 1024 * 1024 # 10MB
|
||||
|
||||
|
||||
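def _cap_text(s: Optional[str]) -> Optional[str]:
    if s is None:
        return None
    encoded = s.encode("utf-8", errors="ignore")
    if len(encoded) <= _BODY_CAP_BYTES:
        return s
    # Truncate the encoded bytes, not the characters, so the cap also holds
    # for multi-byte text; errors="ignore" drops a code point split by the cut.
    truncated = encoded[: _BODY_CAP_BYTES // 2].decode("utf-8", errors="ignore")
    return truncated + "\n\n[TRUNCATED BY CRM — body exceeded 10MB]"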
|
||||
@@ -0,0 +1,462 @@
|
||||
"""
|
||||
HTTP route handlers for the Gmail integration.
|
||||
|
||||
Designed to plug into server.py's CRMHandler (BaseHTTPRequestHandler) pattern.
|
||||
The hook is a single function call near the top of do_GET / do_POST that
|
||||
lets this module claim any /api/email/* request:
|
||||
|
||||
# in CRMHandler.do_GET and CRMHandler.do_POST, before the 404 fallthrough:
|
||||
from email_integration.routes import try_handle
|
||||
if try_handle(self):
|
||||
return
|
||||
|
||||
`try_handle(handler)` inspects `handler.command` and `handler.get_path()` and
|
||||
returns True if it handled the request (sent a response).
|
||||
|
||||
Every handler respects the same auth / rate-limit model as the rest of server.py
|
||||
by calling handler.get_user() and handler.rate_limited(...).
|
||||
"""
|
||||
|
||||
import json
|
||||
import sqlite3
|
||||
from typing import Optional
|
||||
|
||||
from . import config as _cfg
|
||||
from . import credentials as _creds
|
||||
from . import crypto as _crypto
|
||||
from . import db as _db
|
||||
from . import scheduler as _sched
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- dispatch
|
||||
|
||||
_GET_ROUTES = {
|
||||
"/api/email/status": "status",
|
||||
"/api/email/accounts": "list_accounts",
|
||||
"/api/email/threads": "list_threads",
|
||||
"/api/email/oauth/start": "oauth_start",
|
||||
"/api/email/oauth/callback": "oauth_callback",
|
||||
}
|
||||
|
||||
_POST_ROUTES = {
|
||||
"/api/email/accounts/enroll-all": "enroll_all",
|
||||
"/api/email/accounts/enroll": "enroll_one",
|
||||
"/api/email/sync/run-now": "run_now",
|
||||
"/api/email/rematch": "rematch",
|
||||
}
|
||||
|
||||
|
||||
def try_handle(handler) -> bool:
    path = handler.get_path()
    if not path.startswith("/api/email/"):
        return False

    method = handler.command
    table = _GET_ROUTES if method == "GET" else _POST_ROUTES if method == "POST" else {}
    name = table.get(path)
    if not name:
        # Route is owned by this module but unknown: return a proper 404
        # instead of letting the main dispatcher's 404 abuse counter fire.
        handler.send_error_json("Not found", 404)
        return True

    if not _cfg.CONFIG.enabled:
        handler.send_error_json("Email integration disabled", 503)
        return True

    # Also enforce attachment streaming under a different prefix
    # (handled above via prefix check).

    impl = globals().get(f"_h_{name}")
    if impl is None:
        handler.send_error_json("Not implemented", 500)
        return True

    try:
        impl(handler)
    except Exception as e:
        handler.send_error_json(f"Internal error: {e}", 500)
    return True
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------- helpers
|
||||
|
||||
def _conn() -> sqlite3.Connection:
|
||||
import os
|
||||
db_path = os.environ.get(
|
||||
"CRM_DB_PATH",
|
||||
os.path.join(_cfg.CONFIG.data_dir, "crm.db"),
|
||||
)
|
||||
conn = sqlite3.connect(db_path)
|
||||
conn.execute("PRAGMA journal_mode=WAL")
|
||||
conn.execute("PRAGMA foreign_keys=ON")
|
||||
conn.execute("PRAGMA busy_timeout=5000")
|
||||
conn.row_factory = sqlite3.Row
|
||||
return conn
|
||||
|
||||
|
||||
def _require_auth(handler) -> Optional[dict]:
    user = handler.get_user()
    if not user:
        handler.send_error_json("Unauthorized", 401)
        return None
    return user


def _require_admin(handler) -> Optional[dict]:
    user = _require_auth(handler)
    if user is None:
        return None
    if user.get("role") != "admin":
        handler.send_error_json("Admin required", 403)
        return None
    return user


# ---------------------------------------------------------------------------- GET handlers

def _h_status(handler):
    user = _require_auth(handler)
    if not user:
        return
    snap = _sched.status_snapshot()
    conn = _conn()
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT COUNT(*) AS n_accounts, "
            "SUM(CASE WHEN sync_status='active' THEN 1 ELSE 0 END) AS n_active, "
            "SUM(CASE WHEN sync_status='error' THEN 1 ELSE 0 END) AS n_error "
            "FROM email_accounts"
        )
        counts = dict(cur.fetchone() or {})
        cur.execute("SELECT COUNT(*) AS n FROM emails WHERE match_status = 'matched'")
        snap["matched_emails"] = cur.fetchone()["n"]
    finally:
        conn.close()
    snap["accounts_summary"] = counts
    handler.send_json(snap)


def _h_list_accounts(handler):
    user = _require_auth(handler)
    if not user:
        return
    conn = _conn()
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT id, user_id, email_address, auth_method, sync_enabled, "
            "sync_status, sync_error, last_synced_at, backfill_complete "
            "FROM email_accounts ORDER BY email_address"
        )
        rows = [dict(r) for r in cur.fetchall()]
    finally:
        conn.close()
    # Non-admins only see their own row
    if user.get("role") != "admin":
        rows = [r for r in rows if r["user_id"] == user["user_id"]]
    handler.send_json({"accounts": rows})


def _h_list_threads(handler):
    user = _require_auth(handler)
    if not user:
        return
    q = handler.get_query_params()
    investor_id = q.get("investor_id")
    try:
        # Clamp to 1..500. A non-integer value would otherwise surface as a
        # 500, and SQLite treats a negative LIMIT as "no limit".
        limit = max(1, min(int(q.get("limit", 50)), 500))
    except (TypeError, ValueError):
        return handler.send_error_json("limit must be an integer", 400)
    conn = _conn()
    try:
        cur = conn.cursor()
        if investor_id:
            cur.execute(
                """SELECT t.*
                   FROM email_threads t
                   JOIN emails e ON e.thread_id = t.id
                   JOIN email_investor_links l ON l.email_id = e.id
                   WHERE l.fundraising_investor_id = ?
                      OR l.fundraising_contact_id IN (
                          SELECT id FROM fundraising_contacts WHERE investor_id = ?
                      )
                   GROUP BY t.id
                   ORDER BY t.last_message_at DESC
                   LIMIT ?""",
                (investor_id, investor_id, limit),
            )
        else:
            cur.execute(
                "SELECT * FROM email_threads WHERE is_matched = 1 "
                "ORDER BY last_message_at DESC LIMIT ?",
                (limit,),
            )
        threads = [dict(r) for r in cur.fetchall()]
    finally:
        conn.close()
    handler.send_json({"threads": threads})


def _h_oauth_start(handler):
    """Begin per-user OAuth consent flow (fallback path)."""
    user = _require_auth(handler)
    if not user:
        return
    if _cfg.CONFIG.primary_auth != "oauth":
        return handler.send_error_json(
            "Per-user OAuth disabled (set CRM_GMAIL_AUTH_METHOD=oauth to enable)", 400
        )
    q = handler.get_query_params()
    account_email = q.get("account_email") or ""
    if not account_email:
        return handler.send_error_json("account_email required", 400)

    import secrets
    import urllib.parse
    state = secrets.token_urlsafe(32)
    _oauth_state_store(state, user["user_id"], account_email)

    params = {
        "client_id": _cfg.CONFIG.oauth_client_id,
        "redirect_uri": _cfg.CONFIG.oauth_redirect_uri,
        "response_type": "code",
        "scope": _creds.GMAIL_READONLY_SCOPE,
        "access_type": "offline",
        "prompt": "consent",
        "state": state,
        "login_hint": account_email,
    }
    url = "https://accounts.google.com/o/oauth2/v2/auth?" + urllib.parse.urlencode(params)
    handler.send_json({"redirect_url": url})


def _h_oauth_callback(handler):
    """Exchange code for tokens, encrypt refresh token, store."""
    q = handler.get_query_params()
    code = q.get("code")
    state = q.get("state")
    if not code or not state:
        return handler.send_error_json("code and state required", 400)

    state_row = _oauth_state_consume(state)
    if not state_row:
        return handler.send_error_json("Invalid state", 400)

    import urllib.parse
    import urllib.request
    body = urllib.parse.urlencode({
        "code": code,
        "client_id": _cfg.CONFIG.oauth_client_id,
        "client_secret": _cfg.CONFIG.oauth_client_secret,
        "redirect_uri": _cfg.CONFIG.oauth_redirect_uri,
        "grant_type": "authorization_code",
    }).encode("ascii")
    req = urllib.request.Request(
        "https://oauth2.googleapis.com/token",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            payload = json.loads(resp.read())
    except Exception as e:
        return handler.send_error_json(f"Token exchange failed: {e}", 500)

    refresh = payload.get("refresh_token")
    if not refresh:
        return handler.send_error_json(
            "No refresh_token returned (user may have previously consented; "
            "prompt=consent required)", 400)

    enc = _crypto.encrypt(refresh.encode("ascii"), secret_key_b64=_cfg.CONFIG.secret_key_b64)

    conn = _conn()
    try:
        _db.upsert_account(conn, user_id=state_row["user_id"],
                           email_address=state_row["account_email"],
                           auth_method="oauth")
        conn.execute(
            "UPDATE email_accounts SET oauth_refresh_enc = ?, sync_status = 'pending', "
            "updated_at = datetime('now') WHERE email_address = ?",
            (enc, state_row["account_email"]),
        )
        conn.commit()
    finally:
        conn.close()

    handler.send_json({"ok": True, "account_email": state_row["account_email"]})


# ---------------------------------------------------------------------------- POST handlers

def _h_enroll_all(handler):
    """Admin: enroll every CRM user whose email is @workspace_domain via DWD."""
    user = _require_admin(handler)
    if not user:
        return
    if _cfg.CONFIG.primary_auth != "dwd":
        return handler.send_error_json("enroll-all only valid in DWD mode", 400)
    domain = _cfg.CONFIG.workspace_domain
    if not domain:
        return handler.send_error_json("CRM_GMAIL_WORKSPACE_DOMAIN not set", 400)

    conn = _conn()
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT id, email FROM users WHERE is_active = 1 AND email LIKE ?",
            (f"%@{domain}",),
        )
        users = cur.fetchall()
        created = []
        for u in users:
            aid = _db.upsert_account(conn, user_id=u["id"],
                                     email_address=u["email"].lower(),
                                     auth_method="dwd")
            created.append({"account_id": aid, "email": u["email"]})
        conn.commit()
    finally:
        conn.close()
    handler.send_json({"enrolled": created, "count": len(created)})


def _h_enroll_one(handler):
    user = _require_admin(handler)
    if not user:
        return
    body = handler.get_body() or {}
    # Accept either `email` or `email_address` for ergonomics.
    email_address = (body.get("email_address") or body.get("email") or "").lower().strip()
    user_id = body.get("user_id")
    auth_method = body.get("auth_method") or _cfg.CONFIG.primary_auth

    if not email_address:
        return handler.send_error_json("email (or email_address) required", 400)

    # If the caller didn't specify a CRM user_id, resolve it from the
    # users table by matching email. Falls back to the authenticated
    # admin's own id (handles the common case of a single admin
    # enrolling themselves without having to paste their UUID).
    if not user_id:
        conn = _conn()
        try:
            cur = conn.cursor()
            cur.execute("SELECT id FROM users WHERE LOWER(email) = ?",
                        (email_address,))
            row = cur.fetchone()
            # The other handlers in this module read the session user's id
            # from "user_id"; try that key first so the fallback works.
            user_id = row["id"] if row else (user.get("user_id") or user.get("id"))
        finally:
            conn.close()

    if not user_id:
        return handler.send_error_json("could not resolve user_id for that email", 400)

    conn = _conn()
    try:
        aid = _db.upsert_account(conn, user_id=user_id,
                                 email_address=email_address,
                                 auth_method=auth_method)
        conn.commit()
    finally:
        conn.close()
    handler.send_json({"account_id": aid, "email": email_address, "user_id": user_id})


def _h_run_now(handler):
    user = _require_admin(handler)
    if not user:
        return
    # Reuse existing rate limit so admins can't hammer this.
    if handler.rate_limited("email-sync-now", 6):
        return handler.send_error_json("Too many requests", 429)
    result = _sched.trigger_run_now()
    handler.send_json(result)


def _h_rematch(handler):
    """Re-evaluate unmatched emails against the current investor index."""
    user = _require_admin(handler)
    if not user:
        return
    body = handler.get_body() or {}
    since = body.get("since")  # optional ISO8601
    conn = _conn()
    scanned = 0
    matched = 0
    try:
        from .matcher import InvestorIndex
        index = InvestorIndex(own_domain=_cfg.CONFIG.workspace_domain)
        index.rebuild(_conn)
        cur = conn.cursor()
        sql = ("SELECT id, from_email, to_emails_json, cc_emails_json "
               "FROM emails WHERE match_status = 'unmatched'")
        params: list = []
        if since:
            sql += " AND sent_at >= ?"
            params.append(since)
        sql += " ORDER BY sent_at DESC LIMIT 10000"
        cur.execute(sql, params)
        for row in cur.fetchall():
            scanned += 1
            participants = set()
            if row["from_email"]:
                participants.add(row["from_email"].lower())
            for col in ("to_emails_json", "cc_emails_json"):
                try:
                    arr = json.loads(row[col] or "[]")
                except Exception:
                    arr = []
                for a in arr:
                    e = a.get("email") if isinstance(a, dict) else a
                    if e:
                        participants.add(e.lower())
            links = index.match(participants)
            if not links:
                continue
            matched += 1
            conn.execute(
                "UPDATE emails SET match_status='matched', is_matched=1, "
                "updated_at=datetime('now') WHERE id=?",
                (row["id"],),
            )
            for link in links:
                _db.insert_investor_link(conn, email_id=row["id"], link={
                    "matched_address": link.matched_address,
                    "match_kind": link.match_kind,
                    "match_confidence": link.match_confidence,
                    "fundraising_investor_id": link.target.fundraising_investor_id,
                    "fundraising_contact_id": link.target.fundraising_contact_id,
                    "contact_id": link.target.contact_id,
                    "organization_id": link.target.organization_id,
                })
            # NOTE: body is still missing; we only have headers. A follow-up
            # job can re-fetch the full message from Gmail using the sighting's
            # gmail_message_id. Not done inline to keep this endpoint fast.
        conn.commit()
    finally:
        conn.close()
    handler.send_json({"scanned": scanned, "newly_matched": matched})


# ---------------------------------------------------------------------------- OAuth state store (in-memory)
# For a 5-person CRM the state store doesn't need to be durable: a server
# restart between start and callback is rare and just requires a retry.

_OAUTH_STATE_TTL_SEC = 600  # states older than 10 minutes are rejected
_oauth_states: dict[str, dict] = {}
_oauth_state_lock = __import__("threading").Lock()


def _oauth_state_store(state: str, user_id: str, account_email: str) -> None:
    import time
    with _oauth_state_lock:
        # Prune stale entries (>10 min).
        cutoff = time.time() - _OAUTH_STATE_TTL_SEC
        for k, v in list(_oauth_states.items()):
            if v["created"] < cutoff:
                _oauth_states.pop(k, None)
        _oauth_states[state] = {
            "user_id": user_id,
            "account_email": account_email.lower().strip(),
            "created": time.time(),
        }


def _oauth_state_consume(state: str) -> Optional[dict]:
    import time
    with _oauth_state_lock:
        row = _oauth_states.pop(state, None)
    # Pruning only happens on store, so enforce the TTL here as well;
    # otherwise a stale state stays valid until the next consent flow starts.
    if row and row["created"] < time.time() - _OAUTH_STATE_TTL_SEC:
        return None
    return row
@@ -0,0 +1,143 @@
"""
Background sync scheduler.

Runs as a daemon thread started from server.py main(). One thread; it wakes
every `sync_interval_sec`, processes all accounts serially, sleeps again.

Singleton: start_sync_scheduler() is idempotent, so calling it twice won't
spawn a second thread. stop_sync_scheduler() gracefully signals shutdown (not
strictly needed since it's daemon, but useful for tests).
"""

import logging
import sqlite3
import threading
import time
from typing import Callable, Optional

from . import config as _cfg
from . import credentials as _creds
from . import sync as _sync
from .matcher import InvestorIndex


log = logging.getLogger("email_integration.scheduler")


_state: dict[str, object] = {
    "thread": None,
    "stop": threading.Event(),
    "last_run": 0.0,
    "last_result": None,
    "running_now": False,
}


def _conn_factory_from_env() -> Callable[[], sqlite3.Connection]:
    """Build a get_db() compatible with server.py's pattern.

    We don't import server.py (avoid circular / startup ordering). Instead
    we re-implement the same settings. If server.py's DB path differs from
    the default, CRM_DB_PATH env var should be set (same mechanism).
    """
    import os
    db_path = os.environ.get(
        "CRM_DB_PATH",
        os.path.join(_cfg.CONFIG.data_dir, "crm.db"),
    )

    def get_db() -> sqlite3.Connection:
        conn = sqlite3.connect(db_path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("PRAGMA foreign_keys=ON")
        conn.execute("PRAGMA busy_timeout=5000")
        conn.row_factory = sqlite3.Row
        return conn

    return get_db


def start_sync_scheduler(conn_factory: Optional[Callable] = None) -> None:
    if _state["thread"] is not None:
        return  # already running

    if not _cfg.CONFIG.enabled:
        log.info("email_integration not enabled; scheduler will not start")
        return

    factory = conn_factory or _conn_factory_from_env()

    try:
        provider = _creds.build_provider(factory)
    except Exception as e:
        log.exception("cannot build credential provider: %s", e)
        return

    index = InvestorIndex(own_domain=_cfg.CONFIG.workspace_domain)
    try:
        index.rebuild(factory)
    except Exception:
        log.exception("initial investor-index build failed; scheduler continues")

    stop = threading.Event()
    _state["stop"] = stop

    def _loop():
        log.info("email sync scheduler started; interval=%ss", _cfg.CONFIG.sync_interval_sec)
        # First cycle: short delay to let server finish startup.
        if stop.wait(10):
            return
        while not stop.is_set():
            _state["running_now"] = True
            t0 = time.time()
            try:
                result = _sync.sync_all(factory, provider, index)
                _state["last_result"] = result
            except Exception:
                log.exception("sync loop crashed; will retry next cycle")
            finally:
                _state["running_now"] = False
                _state["last_run"] = t0
            if stop.wait(_cfg.CONFIG.sync_interval_sec):
                return

    t = threading.Thread(target=_loop, name="email-sync", daemon=True)
    t.start()
    _state["thread"] = t
    _state["provider"] = provider
    _state["index"] = index
    _state["factory"] = factory


def stop_sync_scheduler() -> None:
    ev: threading.Event = _state["stop"]  # type: ignore
    ev.set()
    t = _state.get("thread")
    if t:
        try:
            t.join(timeout=5)
        except Exception:
            pass
    _state["thread"] = None


def trigger_run_now() -> dict:
    """Force a single sync pass synchronously (admin 'sync now' endpoint)."""
    if _state.get("running_now"):
        return {"status": "already_running"}
    factory = _state.get("factory")
    provider = _state.get("provider")
    index = _state.get("index")
    if not (factory and provider and index):
        return {"status": "not_initialized"}
    # Mark the pass as running so the status endpoint and a second
    # "sync now" request can see it. The check above is best-effort,
    # not a lock.
    _state["running_now"] = True
    try:
        return _sync.sync_all(factory, provider, index)  # type: ignore
    finally:
        _state["running_now"] = False


def status_snapshot() -> dict:
    return {
        "enabled": _cfg.CONFIG.enabled,
        "running": _state["running_now"],
        "last_run_unix": _state.get("last_run"),
        "last_result": _state.get("last_result"),
        "interval_sec": _cfg.CONFIG.sync_interval_sec,
    }
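The scheduler loop relies on `threading.Event.wait()` doubling as an interruptible sleep. A minimal sketch of that pattern on its own, with a hypothetical `run_periodically` helper and a 10 ms interval chosen only so the demo finishes quickly:

```python
import threading
import time


def run_periodically(stop: threading.Event, interval_sec: float, work) -> None:
    """Call work() repeatedly until stop is set, waking early on stop."""
    while not stop.is_set():
        try:
            work()
        except Exception:
            pass  # a real loop would log here and retry on the next cycle
        # Event.wait() returns True as soon as stop is set, False on timeout.
        if stop.wait(interval_sec):
            return


calls = []
stop = threading.Event()
t = threading.Thread(
    target=run_periodically,
    args=(stop, 0.01, lambda: calls.append(time.time())),
    daemon=True,
)
t.start()
time.sleep(0.1)    # let a few cycles run
stop.set()         # wakes the thread immediately instead of after the interval
t.join(timeout=1)
```

Compared with `time.sleep()`, shutdown does not have to wait out the remainder of a long sync interval.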
@@ -0,0 +1,390 @@
"""
Sync orchestrator.

Top-level entry points:

    sync_account(conn_factory, credential_provider, account_row, matcher)
        Full sync pass for one mailbox. Decides backfill vs. incremental based
        on email_accounts.backfill_complete. Writes a sync_runs row.

    sync_all(conn_factory, credential_provider, matcher)
        Iterates every sync-enabled account sequentially. Called from
        scheduler.py every CRM_GMAIL_SYNC_INTERVAL_MIN minutes.

Design: match-only storage (see architecture doc §7). For each message:
    1. Fetch metadata (cheap, 5 units).
    2. Run matcher against participant addresses.
    3. If matched → fetch full message, parse, persist body + register attachments.
    4. If unmatched → persist header-only row.
    5. In both cases, record the per-account sighting.
"""

import logging
import sqlite3
import traceback
from typing import Optional

from . import attachments as _attach
from . import config as _cfg
from . import db as _db
from . import errors as _errors
from . import gmail_client as _gmail
from . import parser as _parser
from . import threads as _threads
from .matcher import InvestorIndex, InvestorLink


log = logging.getLogger("email_integration.sync")


METADATA_HEADERS = [
    "From", "To", "Cc", "Bcc", "Subject", "Date",
    "Message-ID", "In-Reply-To", "References", "Reply-To",
]


# ---------------------------------------------------------------------------- public

def sync_all(conn_factory, credential_provider, index: InvestorIndex) -> dict:
    """Run one pass across all enabled accounts. Returns summary stats."""
    index.rebuild_if_stale(conn_factory)

    conn = conn_factory()
    try:
        accounts = _db.list_sync_ready_accounts(conn)
    finally:
        conn.close()

    totals = {"accounts": 0, "messages_stored": 0, "errors": 0}
    for acc in accounts:
        totals["accounts"] += 1
        try:
            stats = sync_account(conn_factory, credential_provider, acc, index)
            totals["messages_stored"] += stats.get("messages_stored", 0)
        except Exception:
            totals["errors"] += 1
            log.exception("sync failed for account %s", acc["email_address"])
    return totals


def sync_account(conn_factory, credential_provider, account,
                 index: InvestorIndex) -> dict:
    """Sync a single mailbox. Returns stats dict."""
    email_addr = account["email_address"]
    stats = _gmail.CallStats()
    client = _gmail.GmailClient(credential_provider, email_addr, stats=stats)

    # Mark running
    conn = conn_factory()
    try:
        run_id = _db.start_sync_run(conn,
                                    account_id=account["id"],
                                    kind="backfill" if not account["backfill_complete"] else "incremental")
        _db.set_account_status(conn, account["id"], status="active", error=None)
        conn.commit()
    finally:
        conn.close()

    run_stats = {"messages_seen": 0, "messages_stored": 0, "attachments_saved": 0}
    error_str: Optional[str] = None
    status = "ok"

    try:
        if not account["backfill_complete"]:
            _run_backfill(conn_factory, client, account, index, run_stats)
        else:
            _run_incremental(conn_factory, client, account, index, run_stats)

        # Drain attachments for this account.
        # Limit to a few cycles' worth of attachments per pass.
        # drain_pending takes the factory itself, so no connection is opened here.
        batched = _attach.drain_pending(conn_factory, client, account["id"], limit=100)
        run_stats["attachments_saved"] = batched

    except _errors.AuthError as e:
        error_str = f"auth: {e}"
        status = "error"
    except _errors.HistoryExpiredError:
        # Recover: reset to date-based backfill from last_synced_at.
        error_str = "history expired; fallback to date backfill"
        status = "partial"
        _fallback_date_backfill(conn_factory, client, account, index, run_stats)
    except Exception as e:
        error_str = f"unexpected: {type(e).__name__}: {e}"
        status = "error"
        log.exception("unexpected during sync of %s", email_addr)
    finally:
        run_stats["api_calls"] = stats.api_calls
        run_stats["retries"] = stats.retries
        conn = conn_factory()
        try:
            _db.finish_sync_run(conn, run_id, status=status, stats=run_stats, error=error_str)
            _db.set_account_status(conn, account["id"],
                                   status="active" if status == "ok" else status,
                                   error=error_str)
            _db.set_account_checkpoint(conn, account["id"],
                                       last_synced_at=_db._now_iso())
            conn.commit()
        finally:
            conn.close()

    return run_stats


# ---------------------------------------------------------------------------- backfill

def _run_backfill(conn_factory, client, account, index: InvestorIndex,
                  run_stats: dict) -> None:
    """Initial full-mailbox backfill, resumable via backfill_cursor."""
    page_token = account["backfill_cursor"]
    while True:
        resp = client.list_messages(page_token=page_token,
                                    max_results=_cfg.CONFIG.backfill_page_size)
        messages = resp.get("messages") or []
        for m in messages:
            run_stats["messages_seen"] += 1
            try:
                _process_one_message(conn_factory, client, account, index,
                                     gmail_message_id=m["id"], run_stats=run_stats)
            except _errors.GmailError as e:
                log.warning("skip msg %s on %s: %s", m["id"], account["email_address"], e)
                continue

        page_token = resp.get("nextPageToken")
        conn = conn_factory()
        try:
            _db.set_account_checkpoint(conn, account["id"],
                                       backfill_cursor=page_token,
                                       backfill_complete=(not page_token))
            conn.commit()
        finally:
            conn.close()

        if not page_token:
            # Capture current historyId as checkpoint for future incrementals.
            prof = client.get_profile()
            hid = prof.get("historyId")
            if hid:
                conn = conn_factory()
                try:
                    _db.set_account_checkpoint(conn, account["id"], history_id=str(hid))
                    conn.commit()
                finally:
                    conn.close()
            return


# ---------------------------------------------------------------------------- incremental

def _run_incremental(conn_factory, client, account, index: InvestorIndex,
                     run_stats: dict) -> None:
    start_hid = account["last_history_id"]
    if not start_hid:
        # Safety: if checkpoint is missing, re-enter backfill.
        _run_backfill(conn_factory, client, account, index, run_stats)
        return

    # history_types filter limits bandwidth to what we care about.
    new_hid: Optional[str] = None
    try:
        for h in client.iter_history(
            start_history_id=start_hid,
            history_types=["messageAdded", "messageDeleted", "labelAdded", "labelRemoved"],
        ):
            for ma in h.get("messagesAdded") or []:
                msg = ma.get("message") or {}
                run_stats["messages_seen"] += 1
                try:
                    _process_one_message(conn_factory, client, account, index,
                                         gmail_message_id=msg.get("id"),
                                         run_stats=run_stats)
                except _errors.GmailError as e:
                    log.warning("skip msg %s on %s: %s", msg.get("id"), account["email_address"], e)

            for md in h.get("messagesDeleted") or []:
                msg = md.get("message") or {}
                conn = conn_factory()
                try:
                    _db.tombstone_sighting(
                        conn,
                        account_id=account["id"],
                        gmail_message_id=msg.get("id"),
                    )
                    conn.commit()
                finally:
                    conn.close()

            for la in (h.get("labelsAdded") or []) + (h.get("labelsRemoved") or []):
                msg = la.get("message") or {}
                # labels are the resulting label set in Gmail's payload after
                # the change. We refresh them wholesale.
                labels = msg.get("labelIds") or []
                conn = conn_factory()
                try:
                    _db.update_sighting_labels(
                        conn,
                        account_id=account["id"],
                        gmail_message_id=msg.get("id"),
                        labels=labels,
                    )
                    conn.commit()
                finally:
                    conn.close()
        new_hid = client.last_history_id
    except _errors.HistoryExpiredError:
        # Propagate so sync_account can fall back to a date-based backfill.
        raise

    if new_hid:
        conn = conn_factory()
        try:
            _db.set_account_checkpoint(conn, account["id"], history_id=str(new_hid))
            conn.commit()
        finally:
            conn.close()


def _fallback_date_backfill(conn_factory, client, account, index, run_stats):
    """Used when startHistoryId has been pruned by Gmail.

    Pulls everything since last_synced_at (or 14d if unknown), which will
    hit a large overlap with existing data but upserts are idempotent.
    """
    from datetime import datetime, timedelta, timezone
    since = account["last_synced_at"] or (
        datetime.now(tz=timezone.utc) - timedelta(days=14)
    ).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Gmail's `after:` operator takes a YYYY/MM/DD date. Slice the date
    # prefix so both "T"- and space-separated timestamps are handled.
    q = f"after:{since[:10].replace('-', '/')}"
    for m in client.iter_messages(q=q):
        run_stats["messages_seen"] += 1
        try:
            _process_one_message(conn_factory, client, account, index,
                                 gmail_message_id=m["id"], run_stats=run_stats)
        except _errors.GmailError as e:
            log.warning("skip during date-backfill msg %s: %s", m["id"], e)
    prof = client.get_profile()
    hid = prof.get("historyId")
    if hid:
        conn = conn_factory()
        try:
            _db.set_account_checkpoint(conn, account["id"], history_id=str(hid))
            conn.commit()
        finally:
            conn.close()


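A minimal sketch of how a stored timestamp becomes the date-based Gmail search query used by the fallback. `gmail_after_query` is a hypothetical helper written for illustration; it assumes timestamps begin with a `YYYY-MM-DD` date.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def gmail_after_query(last_synced_at: Optional[str],
                      now: Optional[datetime] = None,
                      default_days: int = 14) -> str:
    """Build a Gmail search query such as 'after:2024/05/01'."""
    if last_synced_at:
        day = last_synced_at[:10]  # 'YYYY-MM-DD' prefix of the timestamp
    else:
        # No checkpoint: look back a fixed window from the current time.
        now = now or datetime.now(tz=timezone.utc)
        day = (now - timedelta(days=default_days)).strftime("%Y-%m-%d")
    return "after:" + day.replace("-", "/")


with_checkpoint = gmail_after_query("2024-05-01T12:30:00Z")
without_checkpoint = gmail_after_query(
    None, now=datetime(2024, 5, 15, tzinfo=timezone.utc)
)
```

Because the query has day granularity, it re-lists messages from earlier on the checkpoint day; the docstring's point about idempotent upserts is what makes that overlap harmless.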
# ---------------------------------------------------------------------------- per-message

def _process_one_message(conn_factory, client, account, index: InvestorIndex,
                         *, gmail_message_id: str, run_stats: dict) -> None:
    """Fetch, match, persist one message. Idempotent."""
    if not gmail_message_id:
        return

    # Skip if we've already sighted this message for this account.
    conn = conn_factory()
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT email_id FROM email_account_messages "
            "WHERE account_id = ? AND gmail_message_id = ?",
            (account["id"], gmail_message_id),
        )
        if cur.fetchone():
            return
    finally:
        conn.close()

    # 1. Metadata fetch (cheap).
    meta = client.get_message(gmail_message_id, format="metadata",
                              metadata_headers=METADATA_HEADERS)
    meta_parsed = _parser.parse(meta, owning_account_address=account["email_address"])

    participants = set()
    if meta_parsed.get("from_email"):
        participants.add(meta_parsed["from_email"])
    for kind in ("to", "cc", "bcc"):
        for a in meta_parsed.get(kind, []):
            if isinstance(a, dict) and a.get("email"):
                participants.add(a["email"])

    # Exclude owning account's own address so we don't try to "match" ourselves.
    own = {account["email_address"].lower()}
    links = index.match(participants, exclude_addresses=own)
    is_matched = bool(links)

    # 2. If matched, fetch full and parse for body + attachments.
    if is_matched:
        full = client.get_message(gmail_message_id, format="full")
        parsed = _parser.parse(full, owning_account_address=account["email_address"])
    else:
        parsed = meta_parsed
        # Strip any body fields (metadata fetch shouldn't have them but be safe).
        parsed["body_text"] = None
        parsed["body_html"] = None
        parsed["attachments"] = []

    # 3. Persist (idempotent on rfc_message_id).
    conn = conn_factory()
    try:
        existing = _db.find_email_by_rfc_id(conn, parsed["rfc_message_id"])
        if existing:
            email_id = existing["id"]
            # If the email was previously unmatched but now matches (e.g. user
            # added the investor after first sight), upgrade the row.
            if is_matched and existing["match_status"] == "unmatched":
                conn.execute(
                    "UPDATE emails SET match_status = 'matched', is_matched = 1, "
                    "body_text = ?, body_html = ?, updated_at = datetime('now') "
                    "WHERE id = ?",
                    (parsed.get("body_text"), parsed.get("body_html"), email_id),
                )
                _attach.register_stubs(conn,
                                       email_id=email_id,
                                       parsed_attachments=parsed.get("attachments") or [])
                for link in links:
                    _db.insert_investor_link(conn, email_id=email_id, link=_flatten_link(link))
        else:
            match_status = "matched" if is_matched else "unmatched"
            email_id = _db.insert_email(conn, parsed=parsed, match_status=match_status)
            thread_id = _threads.resolve_thread_id(conn, parsed)
            _db.set_email_thread(conn, email_id, thread_id)
            if is_matched:
                _attach.register_stubs(conn,
                                       email_id=email_id,
                                       parsed_attachments=parsed.get("attachments") or [])
                for link in links:
                    _db.insert_investor_link(conn, email_id=email_id, link=_flatten_link(link))
            _db.rollup_thread(conn, thread_id)
            run_stats["messages_stored"] += 1

        # Record sighting (always, even if email row was pre-existing).
        _db.upsert_sighting(
            conn,
            email_id=email_id,
            account_id=account["id"],
            gmail_message_id=gmail_message_id,
            gmail_thread_id=parsed.get("gmail_thread_id") or "",
            labels=parsed.get("labels", []),
            is_sent=parsed.get("is_sent", False),
        )
        conn.commit()
    except sqlite3.IntegrityError as e:
        # Concurrent insert race: another writer stored this message first.
        # Nothing is re-read here, so discard our partial transaction
        # explicitly rather than leaving it to close().
        conn.rollback()
        log.warning("integrity race on msg %s for %s: %s",
                    gmail_message_id, account["email_address"], e)
    finally:
        conn.close()


def _flatten_link(link: InvestorLink) -> dict:
    return {
        "matched_address": link.matched_address,
        "match_kind": link.match_kind,
        "match_confidence": link.match_confidence,
        "fundraising_investor_id": link.target.fundraising_investor_id,
        "fundraising_contact_id": link.target.fundraising_contact_id,
        "contact_id": link.target.contact_id,
        "organization_id": link.target.organization_id,
    }
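The match-only storage flow from the module docstring (metadata first, full fetch only on a match) can be sketched in isolation. Everything here is a hypothetical stand-in: `process_message`, the in-memory `mailbox`, and the set-based matcher replace the Gmail client and `InvestorIndex`.

```python
from typing import Callable


def process_message(msg_id: str,
                    fetch_metadata: Callable[[str], dict],
                    fetch_full: Callable[[str], dict],
                    is_investor: Callable[[str], bool]) -> dict:
    """Return the row to persist for one message under match-only storage."""
    meta = fetch_metadata(msg_id)                        # 1. cheap header fetch
    participants = {meta["from"], *meta.get("to", [])}
    matched = any(is_investor(a) for a in participants)  # 2. run the matcher
    if matched:
        full = fetch_full(msg_id)                        # 3. body only when matched
        return {"id": msg_id, "match_status": "matched", "body": full["body"]}
    # 4. unmatched: keep a header-only row and never fetch the body
    return {"id": msg_id, "match_status": "unmatched", "body": None}


# Hypothetical in-memory stand-ins for the Gmail client and investor index.
full_fetches = []
mailbox = {
    "m1": {"from": "lp@fund.example", "to": ["me@corp.example"], "body": "term sheet"},
    "m2": {"from": "news@shop.example", "to": ["me@corp.example"], "body": "sale!"},
}


def fetch_metadata(msg_id: str) -> dict:
    m = mailbox[msg_id]
    return {"from": m["from"], "to": m["to"]}


def fetch_full(msg_id: str) -> dict:
    full_fetches.append(msg_id)  # record which messages cost a full fetch
    return mailbox[msg_id]


investors = {"lp@fund.example"}
rows = [process_message(i, fetch_metadata, fetch_full, investors.__contains__)
        for i in ("m1", "m2")]
```

Only `m1` triggers a full fetch, so the body of unrelated mail is never downloaded or stored.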
@@ -0,0 +1,75 @@
|
||||
"""
|
||||
Threading resolution.
|
||||
|
||||
Given a freshly-inserted emails row (or its about-to-be-inserted parsed dict),
|
||||
figure out which email_threads row it belongs to. If none exists, create one.
|
||||
|
||||
Priority order (see architecture doc §10):
|
||||
1. Existing email in our DB that shares any RFC Message-ID with this one's
|
||||
References/In-Reply-To chain — inherit its thread.
|
||||
2. Existing thread with the same gmail_thread_id.
|
||||
3. Existing thread with the same rfc_thread_root_id.
|
||||
4. Create a new thread.
|
||||
"""
|
||||
|
||||
import re
|
||||
import sqlite3
|
||||
from typing import Optional
|
||||
|
||||
from . import db as _db
|
||||
|
||||
|
||||
SUBJECT_PREFIX_RE = re.compile(r"^\s*(re|fwd?|aw|sv|antw|回复)\s*:\s*", re.IGNORECASE)  # fwd? already covers "fw"
|
||||
|
||||
|
||||
def normalize_subject(s: Optional[str]) -> Optional[str]:
|
||||
if not s:
|
||||
return None
|
||||
out = s
|
||||
# Strip up to 5 nested Re:/Fwd: prefixes.
|
||||
for _ in range(5):
|
||||
new = SUBJECT_PREFIX_RE.sub("", out, count=1)
|
||||
if new == out:
|
||||
break
|
||||
out = new
|
||||
return out.strip().lower()
|
||||
|
||||
|
||||
def resolve_thread_id(conn: sqlite3.Connection, parsed: dict) -> str:
|
||||
"""Returns a thread_id — either an existing one or a newly created one."""
|
||||
# Step 1: RFC cross-link.
|
||||
candidates = list(parsed.get("references") or [])
|
||||
if parsed.get("in_reply_to"):
|
||||
candidates.append(parsed["in_reply_to"])
|
||||
|
||||
if candidates:
|
||||
existing_email_id = _db.find_email_id_by_any_rfc_id(conn, candidates)
|
||||
if existing_email_id:
|
||||
cur = conn.cursor()
|
||||
cur.execute("SELECT thread_id FROM emails WHERE id = ?", (existing_email_id,))
|
||||
row = cur.fetchone()
|
||||
if row and row["thread_id"]:
|
||||
return row["thread_id"]
|
||||
|
||||
# Step 2: gmail_thread_id match.
|
||||
gt = parsed.get("gmail_thread_id")
|
||||
if gt:
|
||||
existing = _db.find_thread_by_gmail_id(conn, gt)
|
||||
if existing:
|
||||
return existing["id"]
|
||||
|
||||
# Step 3: RFC thread-root match.
|
||||
rfc_root = parsed.get("rfc_thread_root_id")
|
||||
if rfc_root:
|
||||
existing = _db.find_thread_by_rfc_root(conn, rfc_root)
|
||||
if existing:
|
||||
return existing["id"]
|
||||
|
||||
# Step 4: create.
|
||||
return _db.create_thread(
|
||||
conn,
|
||||
gmail_thread_id=gt,
|
||||
rfc_thread_root_id=rfc_root,
|
||||
subject_normalized=normalize_subject(parsed.get("subject")),
|
||||
first_message_at=parsed.get("sent_at"),
|
||||
)
|
||||
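The subject normalization above can be tried in isolation. The following is a minimal, self-contained sketch that mirrors `SUBJECT_PREFIX_RE` and `normalize_subject` from the diff; the sample subjects are made up for illustration and do not come from the repository.

```python
import re
from typing import Optional

# Mirrors the pattern in the diff so the example runs on its own.
SUBJECT_PREFIX_RE = re.compile(r"^\s*(re|fwd?|aw|sv|antw|回复|fw)\s*:\s*", re.IGNORECASE)


def normalize_subject(s: Optional[str]) -> Optional[str]:
    """Strip up to five leading reply/forward prefixes, then trim and lowercase."""
    if not s:
        return None
    out = s
    for _ in range(5):
        new = SUBJECT_PREFIX_RE.sub("", out, count=1)
        if new == out:  # no prefix left to strip
            break
        out = new
    return out.strip().lower()


# Hypothetical sample subjects, chosen only to exercise the branches above.
samples = [
    "Re: Fwd: RE: Q3 Update",
    "AW: Termin",
    "Regarding: budget",
    "Re: Re: Re: Re: Re: Re: hi",
    "",
    None,
]

for subject in samples:
    print(repr(subject), "->", repr(normalize_subject(subject)))
```

Two behaviors are worth noting. A word that merely starts with a prefix, such as "Regarding:", is left alone because the pattern requires the colon directly after the prefix (optional whitespace aside). A subject with six nested prefixes keeps its last one, since the loop stops after five passes.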