Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server

Workstream A–C substrate for the Ten31 agentic system:
- A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9
- A2: additive/reversible core migration (canonical_entities, entity_links,
  interaction_log, relationship_edges, soft-delete) + ledgered runner
- B1/B3: chunking + deterministic entity resolution (backend/ingest)
- B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks
- C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools
- docs: redaction/re-hydration, Gmail enablement runbook
- synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db,
  drop legacy files + start9/0.3.5)

Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity
queries). Real backfill runs on Ten31 infra; index holds synthetic data only.
Branch snapshot also captures pre-existing working-tree changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keysat
2026-06-05 08:11:28 -05:00
parent 7027efd777
commit c7ce44d963
99 changed files with 10676 additions and 7817 deletions
+30
@@ -0,0 +1,30 @@
# Keep Docker build context lean and reproducible.
# Used by start9/0.4/Dockerfile; workdir is the repo root.
# Python artifacts — never copy stale bytecode into the image.
**/__pycache__/
**/*.pyc
**/*.pyo
# Local dev / editor noise
.venv/
venv/
.env
.env.beta
.env.beta.example
.idea/
.vscode/
*.swp
.DS_Store
# Start9 package build artifacts (huge; re-generated by `make`).
start9/**/javascript/
start9/**/javascript.old/
start9/**/javascript.tmp.*/
start9/**/node_modules/
start9/**/*.s9pk
start9/**/image.tar
start9/**/seed/
# Runtime CRM data — never leak dev state into a prod image.
data/
+21
@@ -0,0 +1,21 @@
# Ten31 agentic system — environment template.
# Copy to .env (gitignored) and fill in. Secret values NEVER go in .env.example.
# ── Claude (frontier reasoning; Agent SDK uses an API key, not claude.ai login) ──
ANTHROPIC_API_KEY=
# ── Spark Control gateway (local model services; reads + dense embeds) ──
# HTTPS with the Start9 self-signed cert -> clients must skip TLS verification.
SPARK_CONTROL_URL=https://<spark-control-host>:<port>
SPARK_CONTROL_VERIFY_TLS=false
# ── Qdrant (direct, for ingest: create collection + upsert points) ──
# Plain HTTP on the trusted LAN, no auth currently.
QDRANT_URL=http://<spark2-host>:6333
# ── X (Twitter) API for Scout/Analyst enrichment (NOT a CRM key) ──
X_API_KEY=
# ── CRM (ingest opens the SQLite file directly, read-only) ──
CRM_DB_PATH=./data/crm.db
CRM_DEV_DB_PATH=./data/crm_dev.db
+34
@@ -0,0 +1,34 @@
# ── macOS ──
.DS_Store
**/.DS_Store
# ── Python ──
__pycache__/
*.py[cod]
# ── Secrets & env (CLAUDE.md guardrail #7 — never commit secrets) ──
.env
.env.*
!.env.example
!.env.beta.example
data/secrets/
# ── Local databases & SQLite sidecars ──
# CRM is canonical and lives on the Start9 /data volume; never commit live LP
# data into git (guardrail #1). Dev/synthetic DBs are local-only.
data/*.db
data/*.db-wal
data/*.db-shm
data/crm_dev.db*
# ── FUSE / build temp ──
*.fuse_hidden*
start9/**/javascript.tmp.*
node_modules/
# Real CRM data bundled in packaging seed dirs — never commit (guardrail #1)
**/seed/data/
**/crm.db
**/crm.db-wal
**/crm.db-shm
**/fundraising_state_auto_*.json
+89
@@ -0,0 +1,89 @@
# Ten31 Agentic System — Project Memory
This file is the project constitution. Read it first; it states settled decisions and non-negotiable guardrails. Where anything here conflicts with a one-off prompt, this file wins.
## What we're building
Ten31 is an investment platform (bitcoin ecosystem, energy, AI infrastructure, freedom tech) that raises from LPs and deploys into private companies. We are building an in-house system of AI agents to widen the fundraising funnel, sharpen and propagate our investment thesis, and automate marketing/branding. Build agents on the **Claude Agent SDK**, connected to our systems via **MCP**. Frontier reasoning runs on **Claude**; privacy-sensitive and high-volume work runs on **local models on our DGX Sparks**, fronted by **Spark Control**.
Full architecture and rationale: see `@./docs/Ten31_Agentic_Build_Plan.md`.
Current phase tasks and acceptance criteria: see `@./docs/PHASE_0.md`.
Embedding/retrieval API contract + ingest recipe (authoritative): see `@./docs/EMBEDDINGS.md`.
**We are in Phase 0.** Phase 0 builds the data + retrieval substrate. There are NO live, outward-facing agents in Phase 0.
## Settled architecture
- **Reasoning model:** Claude via the Agent SDK / API (API-key auth, not claude.ai login).
- **Local models (Sparks, via Spark Control gateway):**
  - Chat/triage: Qwen3.6 35B-A3B on Spark 1.
  - Embeddings: `BAAI/bge-m3` (dense, 1024-dim, L2-normalized) → `/v1/embeddings` (OpenAI shape).
  - Reranker: `BAAI/bge-reranker-v2-m3` (cross-encoder) → `/v1/rerank` (Cohere shape).
  - Served by **spark-embed**, a small FastAPI server on Spark 2 (NGC PyTorch image — *not* HF TEI, which ships no arm64 CUDA image). Shipped in Spark Control v0.15.0.
  - Audio: transcription + diarization + TTS on Spark 2.
- **Canonical data store:** the self-built CRM on the Start9 server. This is the single source of truth for LP/prospect data.
- **Vector index:** Qdrant v1.16.0 on Spark 2 (ports 6333/6334). Derived and rebuildable from the CRM (~8–15 min full re-embed) — NOT a second source of truth. But it holds the only *live* copy of the index, so it is never auto-restarted; the ingest pipeline must be idempotent so a rebuild is always safe.
- **Retrieval:** one orchestrated call, `POST /api/search` (embed query → Qdrant dense+sparse RRF with payload pre-filter → cross-encoder rerank → top_k). The sparse/BM25 leg is generated **client-side** with FastEmbed (`Qdrant/bm25`) at both ingest and query time, with Qdrant applying IDF over our own corpus — so exact entity/name matching is weighted by our term statistics, not bge-m3's pretrained sparse. Authoritative contract + ingest recipe: `@./docs/EMBEDDINGS.md`.
- **Gateway:** Spark Control (on Start9) fronts all local model services behind one trusted URL with shared TLS, access control, and observability.
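
The fusion step named in the retrieval bullet (dense + sparse RRF) can be illustrated with a small Reciprocal Rank Fusion sketch. This is not Qdrant's or Spark Control's implementation: the function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the chunk IDs are assumptions for illustration only.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["chunk-a", "chunk-b", "chunk-c"]   # hypothetical dense (bge-m3) hits
sparse = ["chunk-c", "chunk-a", "chunk-d"]  # hypothetical BM25 hits
fused = rrf_fuse([dense, sparse])
```

IDs that rank well in both legs rise to the top, which is why an exact name match surfaced by the sparse leg can outrank a purely semantic neighbor.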
## Environment & services
- All local model calls go through **Spark Control**, never directly to a Spark.
- Endpoints: `/v1/chat/completions`, `/v1/embeddings`, `/v1/rerank`, `/api/search` (orchestrated hybrid retrieval), `/v1/audio/transcriptions`, `/v1/audio/speech`.
- **Secrets live in `.env` (gitignored). Never commit secret values.** Required variables (names only):
  - `ANTHROPIC_API_KEY`
  - `SPARK_CONTROL_URL` — gateway for `/v1/embeddings`, `/v1/rerank`, `/api/search` (reads + dense embeds)
  - `QDRANT_URL` — direct Qdrant on Spark 2 (`http://<spark2>:6333`) for collection admin + ingest upserts
  - `X_API_KEY` — the X (Twitter) API key for Scout/Analyst enrichment. **Note:** this is *not* a CRM auth key; the CRM has no service-key/API-key path today (see below).
  - CRM connection vars:
    - `CRM_DB_PATH` — absolute path to the SQLite file (default `<CRM_DATA_DIR>/crm.db`). The CRM has **no network DB protocol** — ingest "connects" by opening this file directly (read-only, `mode=ro`), co-located with the Start9 `/data` volume.
    - `CRM_DATA_DIR` — the `/data` volume root (holds `crm.db`, `backups/`, `secrets/`, `email_attachments/`).
    - `CRM_BASE_URL` — `http://<host>:8080` (env `CRM_HOST`/`CRM_PORT`), for any HTTP access to the running CRM.
    - `CRM_SECRET_KEY` — the CRM's own JWT signing secret (set on the Start9 deployment, persisted at `/data/.crm-secret`); only needed if the MCP server authenticates over HTTP rather than reading SQLite directly.
- A `.env.example` lists the variable names with empty values.
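
A minimal sketch of the read-only open described for `CRM_DB_PATH`, using the standard library's SQLite URI support. The helper name `open_crm_readonly` and the throwaway demo database are assumptions; the real ingest code may differ.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_crm_readonly(db_path):
    """Open a SQLite file read-only via a file: URI with mode=ro."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Demo against a throwaway database (never the real CRM).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "crm_demo.db")
with sqlite3.connect(demo_path) as rw:
    rw.execute("CREATE TABLE contacts (id TEXT PRIMARY KEY, name TEXT)")
    rw.execute("INSERT INTO contacts VALUES ('c1', 'Synthetic LP')")

ro = open_crm_readonly(demo_path)
rows = ro.execute("SELECT name FROM contacts").fetchall()
try:
    ro.execute("INSERT INTO contacts VALUES ('c2', 'Should fail')")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True  # read-only connections reject writes
ro.close()
```

Opening with `mode=ro` makes an accidental write from ingest fail loudly instead of mutating the canonical store.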
## The agents (target roster — built in later phases)
- **Scout** — monitors public sources (X via API, filings, etc.); flags trigger events. (Phase 2)
- **Analyst** — builds LP dossiers, enriches records, maps warm-intro paths. (Phase 2)
- **Architect** — owns/refines the canonical thesis; collaborative copilot. (Phase 1)
- **Scribe** — distributes the thesis as content across channels. (Phase 1)
- **Closer** — drafts outreach, nurture, meeting prep. Humans approve/send everything. (Phase 3)
- **Orchestrator** — schedules and routes work; picks per-agent retrieval modes. (Phase 3)
## Division of labor
- **Spark developer (separate):** embedding and rerank serving (`spark-embed`: BGE-M3 + reranker; not HF TEI) and Qdrant on Spark 2, exposed via Spark Control `/v1/embeddings` + `/v1/rerank`.
- **This repo (Claude Code + the partners):** CRM schema extensions, ingest/sync pipeline, CRM MCP server, retrieval-mode library, and (later phases) the agents.
## Guardrails — NON-NEGOTIABLE
1. **Sovereignty.** Sensitive LP and relationship data stays on our infrastructure (Start9 + Sparks). Send only the minimum necessary, non-sensitive context to the Claude API. Never bulk-export the LP list to any third party.
2. **CRM is canonical.** Qdrant and any other store are derived. Never treat a derived index as the source of truth; never let them silently diverge.
3. **No destructive data ops.** Never hard-delete CRM records or history. Soft-delete/archive only. Migrations must be reversible and reviewed before running.
4. **Human-in-the-loop on anything outbound.** No agent sends email, posts publicly, or contacts an LP/prospect autonomously. Agents draft; a partner approves and sends. (Especially Closer and Scribe.)
5. **Log every agent action** to the interaction log, for compliance and debugging.
6. **Compliance gate before Phase 3.** No cold/outbound capability ships until counsel has defined solicitation posture (e.g. 506(b) vs 506(c)), accreditation/QP verification, and recordkeeping rules.
7. **Secrets never committed.** Use `.env` / a secrets store. No keys, tokens, or credentials in code, configs, or docs.
8. **Enrichment is one-way and public.** Per-prospect public lookups that write INTO the CRM; never push our data outward.
9. **Development data handling — keep real LP data out of Claude during the build.** Claude Code (the engineering partner) runs on the Anthropic API, so anything it reads is sent to a third party. Therefore Claude Code works only on **code, the schema, and synthetic or properly-redacted data** — never the real LP list, live records, or raw note/email prose. The real backfill and ingest **run on Ten31 infrastructure** (Start9 + Sparks) via **local models**; sensitive rows are never pasted into a Claude Code session or sent to the Claude API during development. To produce a realistic test corpus, redact/pseudonymize a copy **on the Sparks** (local) — do not hand-feed real records to Claude to "clean up." This is the same sovereignty boundary as guardrail #1, applied to the engineering workflow itself.
## Conventions
*Filled in from the CRM code (2026-06). Full detail: `@./docs/crm-overview.md`.*
- **Language / runtime:** Python 3.11, standard library only at runtime. The CRM is one file, `backend/server.py` (~4.5k lines): a stdlib `http.server.ThreadingHTTPServer` + hand-written `CRMHandler` with manual path dispatch. **Not** FastAPI — `backend/requirements.txt` lists FastAPI/SQLAlchemy/Alembic/Pydantic but **none are imported** (vestigial). The only non-stdlib runtime deps are optional `bcrypt`/`jwt` and (for the Gmail module) `cryptography`.
- **Storage:** a single SQLite DB (`data/crm.db`), WAL mode, `foreign_keys=ON`, opened per-request via `get_db()`. Two parallel investor models coexist (classic `contacts`/`lp_profiles` + the `fundraising_*` grid) — see `docs/crm-overview.md` §2.3; reconciling them to canonical IDs is the core Phase-0 entity-resolution task.
- **Migrations:** **additive and reversible only.** Core schema uses ordered `backend/migrations/NNNN_*.sql` files applied once at startup by `backend/core_migrations.py`, tracked in a `schema_migrations` ledger; ship a paired `NNNN_*.down.sql` for rollback. (The Gmail module has its own runner under `backend/email_integration/migrations/`.) SQLite's `ALTER TABLE` is narrow (add or rename a column; `DROP COLUMN` exists only from SQLite 3.35), which reinforces the additive guardrail.
- **Run locally:** `./start.sh` (dev defaults, port 8080). `./start_beta.sh` for a Tailscale/production-mode launch (requires `CRM_SECRET_KEY`). No build step.
- **Tests / lint:** none in-repo. Sanity-check edits with `python3 -m py_compile backend/server.py`. Verify migrations against a *copy* of `crm.db`, never production.
- **Production:** Start9 package `ten-database`. **`start9/0.4/` is the live target** (TypeScript SDK manifest under `start9/0.4/startos/`); `start9/0.3.5/` (YAML manifest) is the superseded prior generation. All state on the persistent `/data` volume.
- **Auth:** username/password → HS256 JWT (Bearer header), two roles (`admin`/`member`), no row-level authorization. `X_API_KEY` (in this file's env list) is the *X/Twitter* key — there is **no CRM service-key path in code**; an MCP/ingest client must read SQLite directly or authenticate as a real CRM user.
- Prefer clear, reviewable changes over cleverness. Keep the ingest pipeline and MCP server modular so retrieval modes and sources can be added without rewrites.
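
The ledgered migration pattern from the Migrations bullet can be sketched as follows. This is not the contents of `backend/core_migrations.py`: the `MIGRATIONS` list stands in for the `NNNN_*.sql` files, and the `apply_pending` helper and the ledger's column layout are assumptions.

```python
import sqlite3

# Hypothetical in-memory stand-ins for backend/migrations/NNNN_*.sql files.
MIGRATIONS = [
    ("0001_canonical_entities",
     "CREATE TABLE canonical_entities (id TEXT PRIMARY KEY, kind TEXT NOT NULL);"),
    ("0002_soft_delete",
     "ALTER TABLE canonical_entities ADD COLUMN deleted_at TEXT;"),
]


def apply_pending(conn, migrations):
    """Apply each migration once, in name order, recording it in a ledger."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "name TEXT PRIMARY KEY, applied_at TEXT DEFAULT (datetime('now')))"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    applied = []
    for name, sql in sorted(migrations):
        if name in done:
            continue  # already in the ledger, so skip it
        conn.executescript(sql)
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
        conn.commit()
        applied.append(name)
    return applied


conn = sqlite3.connect(":memory:")
first_run = apply_pending(conn, MIGRATIONS)
second_run = apply_pending(conn, MIGRATIONS)  # no-op: ledger already has both
```

The sketch does not make the script and its ledger insert atomic; a production runner would need to decide how to recover from a crash between the two.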
## First actions for a new session
1. Read `@./docs/PHASE_0.md` and `@./docs/EMBEDDINGS.md` (the latter is the authoritative embedding/retrieval contract and ingest recipe).
2. Read the CRM source in the repo; produce a short written summary of the storage engine, schema, and API surface, and fill in the Conventions section above and the CRM env vars.
3. Confirm Spark Control is reachable and `/v1/embeddings`, `/v1/rerank`, and `/api/search` respond (these shipped in v0.15.0; check `GET /api/endpoints`).
4. Proceed through the Phase 0 workstreams in order. Do not build any outward-facing agent behavior in Phase 0.
+826
@@ -0,0 +1,826 @@
# Gmail Integration — Technical Architecture
**Status:** Design / Phase 1 scaffolding
**Target:** Venture CRM (self-hosted, StartOS 0.4 beta)
**Author:** Grant + Claude
**Last updated:** 2026-04-21
---
## 1. Goals and non-goals
### In scope for Phase 1
1. Authorize Gmail access for all 5 team members.
2. Continuous, incremental sync of sent and received mail.
3. Parse sender, recipients, subject, body, timestamp, attachments.
4. Store locally in the CRM SQLite DB.
5. Match each email to an investor record by sender/recipient email.
6. Thread related messages using Gmail's native `threadId` plus RFC 2822 `Message-ID` / `In-Reply-To` / `References`.
7. Store attachments on local disk alongside email records.
### Non-goals for Phase 1 (future phases)
- Sending email from the CRM.
- Full-text search (we'll store bodies; search UI comes later).
- Real-time push via Pub/Sub watch (we'll poll every few hours).
- Contact auto-creation from unknown senders.
- Shared team inbox features.
### Guiding constraints
- **Do not break the running CRM.** All new code lives in an isolated module. Exactly two small additions are made to the existing `server.py`: a startup call and a route-dispatch hook (both behind a feature flag, both safe to roll back).
- **Match the existing codebase's style.** Pure Python stdlib + sqlite3, no new hard dependencies beyond `cryptography` (for encrypting stored OAuth refresh tokens) and the `google-auth` / `google-api-python-client` libs for the Gmail SDK. Everything else is stdlib.
- **Single-process, single-sqlite-writer discipline.** The sync worker writes through the same connection pattern as the rest of the CRM (WAL mode, short transactions) so we don't introduce lock contention.
---
## 2. Decision summary
| # | Decision | Choice | Rationale |
|---|----------|--------|-----------|
| D1 | Auth method | Domain-wide delegation (DWD) primary, per-user OAuth2 as pluggable fallback | You're Workspace super-admin; one-time setup, no per-user consent flow, centralized revocation. OAuth path preserved behind a common `CredentialProvider` interface so the CRM isn't locked in. |
| D2 | Schema | New dedicated `emails` / `email_*` tables (not reuse `communications`) | Gmail-specific fields (`gmail_message_id`, `thread_id`, `history_id`, labels, RFC headers) don't belong in the generic `communications` table. `communications` can later project email rows via a read-only view if desired. |
| D3 | Sync scope | Entire mailbox, match-only full storage | Fetch cheap metadata for every message. Persist the full body + attachments only for messages that match a known investor email. Minimizes disk use and privacy surface. |
| D4 | Incremental strategy | Gmail `history.list` driven by per-account `historyId` checkpoint | Canonical Gmail-supported mechanism for "what changed since last sync" — cheap, reliable, dedupe-free. `messages.list` only on initial backfill. |
| D5 | Delivery | Architecture doc + scaffolded modules in `backend/email_integration/`, no edits to `server.py` yet | Lets you review the shape before any wiring happens. Activation is a 3-line patch to `server.py` when you're ready. |
---
## 3. Google Cloud / Workspace setup (one-time, ~20 minutes)
This is the runbook you'll follow once the code is ready. It's in the doc so the setup steps and the code that relies on them are never out of sync.
### 3a. Create a GCP project
1. `console.cloud.google.com` → new project, name it `ten31-crm-gmail` (or similar). Note the **Project ID**.
2. Enable the Gmail API: APIs & Services → Library → "Gmail API" → Enable.
### 3b. Create a service account (for DWD)
3. IAM & Admin → Service Accounts → Create service account.
- Name: `crm-gmail-sync`
- Role: leave blank (it doesn't need any GCP IAM roles; Gmail scope comes from DWD).
4. On the service account's detail page → Keys → Add Key → Create new key → JSON. Download the file.
5. Advanced settings → enable **domain-wide delegation**. Copy the **numeric client ID** (a ~21-digit number).
### 3c. Authorize the service account in Google Workspace
6. `admin.google.com` → Security → Access and data control → API controls → Manage domain-wide delegation → Add new.
7. Client ID: paste the numeric client ID from step 5.
8. OAuth scopes (comma-separated, read-only — no send):
```
https://www.googleapis.com/auth/gmail.readonly,
https://www.googleapis.com/auth/gmail.metadata
```
(Phase 1 uses `gmail.readonly` only. `gmail.metadata` is listed for a possible future "metadata-only" sync mode.)
9. Authorize.
### 3d. Install the service account key on the CRM server
10. Copy the JSON key file to `<CRM_ROOT>/data/secrets/gmail-service-account.json`.
11. `chmod 600` the file. `chown` to the CRM process owner.
12. Set env var in `.env.beta`:
```
CRM_GMAIL_SA_KEY_PATH=/path/to/CRM/data/secrets/gmail-service-account.json
CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz
CRM_GMAIL_INTEGRATION_ENABLED=true
CRM_GMAIL_SYNC_INTERVAL_MIN=180 # 3 hours
CRM_GMAIL_SECRET_KEY=<32+ random bytes base64> # for encrypting stored OAuth refresh tokens (unused in DWD mode but required if fallback ever activates)
```
### 3e. Configure per-user mailboxes
13. In the CRM, Admin → Email Accounts → "Enroll all Workspace users via DWD". This calls `POST /api/email/accounts/enroll-all` which for each existing CRM user with an `@ten31.xyz` email creates an `email_accounts` row bound to their address.
14. Each teammate sees a "Gmail sync: active" chip on their profile.
---
## 4. Database schema
All new tables share the `email` prefix (`emails` plus `email_*`) so they're grouped in tooling. Migration script: `backend/email_integration/migrations/0001_email_tables.sql`. It is **idempotent** (uses `CREATE TABLE IF NOT EXISTS` + targeted `ALTER TABLE` guards) so re-running is safe.
### 4a. `email_accounts` — one row per enrolled mailbox
```sql
CREATE TABLE IF NOT EXISTS email_accounts (
id TEXT PRIMARY KEY, -- uuid
user_id TEXT NOT NULL, -- FK users.id
email_address TEXT NOT NULL UNIQUE, -- e.g. grant@ten31.xyz
auth_method TEXT NOT NULL, -- 'dwd' | 'oauth'
oauth_refresh_enc BLOB, -- NULL for DWD; AES-GCM ciphertext for oauth
oauth_token_exp TEXT, -- ISO8601, cached access token expiry
oauth_token_enc BLOB, -- short-lived access token cache (encrypted)
sync_enabled INTEGER NOT NULL DEFAULT 1,
sync_status TEXT NOT NULL DEFAULT 'pending', -- pending|active|paused|error
sync_error TEXT, -- last error message, if any
last_history_id TEXT, -- Gmail historyId checkpoint
last_synced_at TEXT,
backfill_complete INTEGER NOT NULL DEFAULT 0, -- 0 = initial backfill still running
backfill_cursor TEXT, -- nextPageToken during backfill
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(user_id) REFERENCES users(id)
);
CREATE INDEX IF NOT EXISTS idx_email_accounts_user ON email_accounts(user_id);
CREATE INDEX IF NOT EXISTS idx_email_accounts_sync ON email_accounts(sync_enabled, sync_status);
```
### 4b. `emails` — canonical email record (dedup'd across accounts)
One row per distinct RFC `Message-ID`. If two teammates are both on a thread, we store one `emails` row and link both via `email_account_messages`.
```sql
CREATE TABLE IF NOT EXISTS emails (
id TEXT PRIMARY KEY, -- uuid
rfc_message_id TEXT NOT NULL UNIQUE, -- RFC 2822 Message-ID, dedup key
gmail_thread_id TEXT, -- primary Gmail thread id (first account that saw it)
rfc_thread_root_id TEXT, -- root of References chain (for cross-account threading)
subject TEXT,
from_email TEXT NOT NULL,
from_name TEXT,
to_emails_json TEXT NOT NULL DEFAULT '[]', -- JSON array
cc_emails_json TEXT NOT NULL DEFAULT '[]',
bcc_emails_json TEXT NOT NULL DEFAULT '[]',
reply_to TEXT,
sent_at TEXT NOT NULL, -- from Date: header, ISO8601
body_text TEXT, -- plain/text part
body_html TEXT, -- text/html part
snippet TEXT, -- Gmail-provided, useful for unmatched rows
in_reply_to TEXT, -- RFC Message-ID of parent
references_json TEXT DEFAULT '[]', -- parsed References: header
has_attachments INTEGER NOT NULL DEFAULT 0,
size_estimate INTEGER, -- bytes reported by Gmail
is_matched INTEGER NOT NULL DEFAULT 0, -- 1 if any investor link exists
match_status TEXT NOT NULL DEFAULT 'unmatched', -- unmatched|matched|skipped
raw_headers_json TEXT, -- full header dump for debugging/forensics
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_emails_thread ON emails(gmail_thread_id);
CREATE INDEX IF NOT EXISTS idx_emails_rfc_thread ON emails(rfc_thread_root_id);
CREATE INDEX IF NOT EXISTS idx_emails_from ON emails(from_email);
CREATE INDEX IF NOT EXISTS idx_emails_sent_at ON emails(sent_at);
CREATE INDEX IF NOT EXISTS idx_emails_matched ON emails(is_matched, sent_at);
```
> **Match-only storage note:** For unmatched emails we still write the `emails` row, but `body_text` / `body_html` / `raw_headers_json` are left NULL (only the parsed header columns and the snippet are persisted). A nightly job can prune unmatched rows older than 90 days if storage becomes a concern. See §12.
### 4c. `email_recipients` — exploded recipients for fast investor matching
Denormalized so we can index and JOIN directly.
```sql
CREATE TABLE IF NOT EXISTS email_recipients (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
address TEXT NOT NULL, -- lowercased, trimmed
display_name TEXT,
kind TEXT NOT NULL, -- 'from'|'to'|'cc'|'bcc'|'reply_to'
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_email_recipients_addr ON email_recipients(address);
CREATE INDEX IF NOT EXISTS idx_email_recipients_email ON email_recipients(email_id);
```
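
A rough sketch of how raw address headers could be exploded into `email_recipients` rows with lowercased, trimmed addresses. It uses the standard library's `email.utils.getaddresses`; the function name `explode_recipients`, the tuple layout, and the sample addresses are assumptions, not the actual `parser.py`.

```python
from email.utils import getaddresses


def explode_recipients(headers):
    """Turn raw address headers into (kind, address, display_name) rows."""
    kinds = {"From": "from", "To": "to", "Cc": "cc", "Bcc": "bcc", "Reply-To": "reply_to"}
    rows = []
    for header, kind in kinds.items():
        raw = headers.get(header)
        if not raw:
            continue
        for name, addr in getaddresses([raw]):
            addr = addr.strip().lower()  # normalize for indexed lookups
            if addr:
                rows.append((kind, addr, name or None))
    return rows


sample = {
    "From": "Pat Example <Pat@Example.com>",
    "To": "lp.one@fund.example, LP Two <LP.Two@Fund.Example>",
}
rows = explode_recipients(sample)
```

Normalizing at write time keeps the `idx_email_recipients_addr` lookup a plain equality match.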
### 4d. `email_account_messages` — per-account sighting of an email
The same `emails` row may be visible to multiple teammates with different Gmail message IDs (each user's mailbox has its own copy). This join table records each sighting.
```sql
CREATE TABLE IF NOT EXISTS email_account_messages (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
account_id TEXT NOT NULL,
gmail_message_id TEXT NOT NULL, -- per-mailbox unique
gmail_thread_id TEXT NOT NULL,
labels_json TEXT DEFAULT '[]', -- Gmail labels from this account's view
is_sent INTEGER NOT NULL DEFAULT 0, -- 1 if SENT label present
first_seen_at TEXT DEFAULT (datetime('now')),
deleted_at TEXT, -- tombstone set when Gmail reports this sighting deleted (see §7b)
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE,
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE,
UNIQUE(account_id, gmail_message_id)
);
CREATE INDEX IF NOT EXISTS idx_eam_email ON email_account_messages(email_id);
CREATE INDEX IF NOT EXISTS idx_eam_account ON email_account_messages(account_id);
CREATE INDEX IF NOT EXISTS idx_eam_gmail_msg ON email_account_messages(gmail_message_id);
```
### 4e. `email_attachments` — attachments stored on disk
Attachment bytes live at `data/email_attachments/<email_id>/<attachment_id>-<sanitized_filename>`. DB stores metadata + content hash.
```sql
CREATE TABLE IF NOT EXISTS email_attachments (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
gmail_attachment_id TEXT NOT NULL, -- opaque Gmail handle
filename TEXT NOT NULL,
sanitized_filename TEXT NOT NULL, -- what's actually on disk
mime_type TEXT,
size_bytes INTEGER,
sha256_hex TEXT, -- integrity + dedup across emails
storage_path TEXT NOT NULL, -- relative to CRM_DATA_DIR
download_status TEXT NOT NULL DEFAULT 'pending', -- pending|downloaded|failed|skipped
download_attempts INTEGER NOT NULL DEFAULT 0,
download_error TEXT,
downloaded_at TEXT,
created_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_attach_email ON email_attachments(email_id);
CREATE INDEX IF NOT EXISTS idx_attach_sha ON email_attachments(sha256_hex);
```
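
One way the on-disk naming and content hash could be derived, as a sketch. The sanitization rule (a conservative character whitelist), the helper names, and the sample values are assumptions; the actual `attachments.py` may apply different rules.

```python
import hashlib
import re
from pathlib import PurePosixPath


def sanitize_filename(name, max_len=120):
    """Keep a conservative character set and drop any directory components."""
    base = PurePosixPath(name.replace("\\", "/")).name  # strip path parts
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", base).strip("._")
    return (safe or "attachment")[:max_len]


def attachment_record(email_id, attachment_id, filename, payload):
    """Compute the relative storage path and content hash for one attachment."""
    sanitized = sanitize_filename(filename)
    return {
        "sanitized_filename": sanitized,
        # Relative to CRM_DATA_DIR, matching the layout described above.
        "storage_path": f"email_attachments/{email_id}/{attachment_id}-{sanitized}",
        "sha256_hex": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }


rec = attachment_record("e1", "a1", "../../etc/Q3 deck (final).pdf", b"hello")
```

Dropping directory components before whitelisting means a hostile filename cannot escape the per-email directory.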
### 4f. `email_threads` — thread roll-up (materialized for UI)
```sql
CREATE TABLE IF NOT EXISTS email_threads (
id TEXT PRIMARY KEY, -- uuid
gmail_thread_id TEXT, -- NULL if we only have RFC threading
rfc_thread_root_id TEXT, -- fallback / cross-account key
subject_normalized TEXT, -- lowercased + stripped of Re:/Fwd: prefixes
first_message_at TEXT,
last_message_at TEXT,
message_count INTEGER NOT NULL DEFAULT 0,
participant_count INTEGER NOT NULL DEFAULT 0,
participants_json TEXT DEFAULT '[]',
is_matched INTEGER NOT NULL DEFAULT 0,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_threads_gmail ON email_threads(gmail_thread_id) WHERE gmail_thread_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_threads_rfc_root ON email_threads(rfc_thread_root_id);
CREATE INDEX IF NOT EXISTS idx_threads_last_msg ON email_threads(last_message_at);
-- Back-link emails → threads
ALTER TABLE emails ADD COLUMN thread_id TEXT; -- wrapped in PRAGMA-based guard at migration time
CREATE INDEX IF NOT EXISTS idx_emails_thread_fk ON emails(thread_id);
```
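
The two derived thread keys (`subject_normalized` and `rfc_thread_root_id`) might be computed along these lines. This is a sketch under assumptions: the prefix set (`Re:`, `Fw:`, `Fwd:`) and both function names are illustrative, not taken from `threads.py`.

```python
import re

# Leading reply/forward prefixes, possibly repeated ("Re: Fwd: Re: ...").
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Lowercase and strip leading Re:/Fw:/Fwd: prefixes."""
    return _PREFIX.sub("", subject or "").strip().lower()


def thread_root_id(message_id, in_reply_to, references):
    """Pick the root of the References chain, falling back step by step."""
    if references:
        return references[0]          # oldest ancestor is listed first
    return in_reply_to or message_id  # reply without References, or a new thread
```

Using the first `References` entry gives every account the same root for a conversation, even when their Gmail thread IDs differ.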
### 4g. `email_investor_links` — matched investors
```sql
CREATE TABLE IF NOT EXISTS email_investor_links (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
-- Match target: exactly one of the following is non-null
fundraising_investor_id TEXT, -- fundraising_investors.id
fundraising_contact_id TEXT, -- fundraising_contacts.id
contact_id TEXT, -- contacts.id
organization_id TEXT, -- organizations.id (if matched by domain only)
matched_address TEXT NOT NULL, -- which email address produced the match
match_kind TEXT NOT NULL, -- 'exact_email'|'domain_match'|'manual'
match_confidence REAL NOT NULL DEFAULT 1.0, -- 0..1
created_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_eil_email ON email_investor_links(email_id);
CREATE INDEX IF NOT EXISTS idx_eil_investor ON email_investor_links(fundraising_investor_id);
CREATE INDEX IF NOT EXISTS idx_eil_fr_contact ON email_investor_links(fundraising_contact_id);
CREATE INDEX IF NOT EXISTS idx_eil_contact ON email_investor_links(contact_id);
```
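
A hedged sketch of how `match_kind` and `match_confidence` could be assigned for a single address. The lookup dictionaries, the freemail exclusion list, and the 0.5 confidence for domain-only matches are all assumptions made for illustration; `matcher.py` defines the real rules.

```python
# Domains that should never produce an organization-level match (assumed list).
FREEMAIL = {"gmail.com", "outlook.com", "yahoo.com", "proton.me"}


def match_address(address, contacts_by_email, orgs_by_domain):
    """Return (match_kind, target_id, confidence), or None if nothing matches."""
    addr = address.strip().lower()
    if addr in contacts_by_email:
        return ("exact_email", contacts_by_email[addr], 1.0)
    domain = addr.rpartition("@")[2]
    if domain and domain not in FREEMAIL and domain in orgs_by_domain:
        # 0.5 is an illustrative confidence for domain-only matches.
        return ("domain_match", orgs_by_domain[domain], 0.5)
    return None


contacts = {"pat@fund.example": "contact-1"}  # synthetic data
orgs = {"fund.example": "org-1", "gmail.com": "org-bogus"}
exact = match_address(" Pat@Fund.Example ", contacts, orgs)
by_domain = match_address("new.person@fund.example", contacts, orgs)
freemail = match_address("someone@gmail.com", contacts, orgs)
```

Excluding shared mail providers keeps a single personal address from linking unrelated senders to one organization.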
### 4h. `email_sync_runs` — observability
```sql
CREATE TABLE IF NOT EXISTS email_sync_runs (
id TEXT PRIMARY KEY,
account_id TEXT NOT NULL,
kind TEXT NOT NULL, -- 'backfill'|'incremental'
started_at TEXT NOT NULL,
finished_at TEXT,
status TEXT NOT NULL, -- 'running'|'ok'|'error'|'partial'
messages_seen INTEGER NOT NULL DEFAULT 0,
messages_stored INTEGER NOT NULL DEFAULT 0,
attachments_saved INTEGER NOT NULL DEFAULT 0,
api_calls INTEGER NOT NULL DEFAULT 0,
retries INTEGER NOT NULL DEFAULT 0,
error TEXT,
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_sync_runs_account ON email_sync_runs(account_id, started_at);
```
---
## 5. Module layout
```
backend/
├── server.py                    # unchanged in Phase 1 (hooks added at end of Phase 1)
└── email_integration/
    ├── __init__.py              # re-exports public API
    ├── README.md                # setup instructions (copy of §3 in code-local form)
    ├── config.py                # env-var loader + singleton Config()
    ├── credentials.py           # CredentialProvider interface + DWD/OAuth impls
    ├── db.py                    # thin data-access layer; all SQL lives here
    ├── gmail_client.py          # Gmail API wrapper w/ retry + rate-limit
    ├── parser.py                # MIME/payload parser
    ├── matcher.py               # investor matching logic
    ├── threads.py               # thread roll-up logic
    ├── attachments.py           # attachment download + disk storage
    ├── sync.py                  # orchestrator (backfill + incremental)
    ├── scheduler.py             # background thread running periodic sync
    ├── routes.py                # HTTP handlers (plug into CRMHandler)
    ├── errors.py                # exception types + classification
    ├── crypto.py                # AES-GCM wrapper for encrypting OAuth tokens
    └── migrations/
        └── 0001_email_tables.sql
```
---
## 6. Auth flow
### 6a. Credential abstraction
```python
from collections import namedtuple
from typing import Protocol

# Token plus its expiry, as returned by every provider.
AccessToken = namedtuple("AccessToken", ["token", "expires_at"])

class CredentialProvider(Protocol):
    def access_token_for(self, email_address: str) -> AccessToken: ...

class DWDCredentialProvider:
    """Service-account impersonation. No per-user state."""

class OAuthCredentialProvider:
    """Per-user refresh tokens stored encrypted in email_accounts.oauth_refresh_enc."""
```
`sync.py` and `gmail_client.py` only depend on `CredentialProvider`. Switching modes is a config flag.
### 6b. DWD flow (primary)
1. On startup: load service-account JSON from `CRM_GMAIL_SA_KEY_PATH`.
2. For each sync request, build a signed JWT claim set:
```
iss: <service account email>
sub: <target user email> ← impersonation
scope: https://www.googleapis.com/auth/gmail.readonly
aud: https://oauth2.googleapis.com/token
exp: now + 3600
iat: now
```
3. Sign RS256 with the service account's private key.
4. POST to `https://oauth2.googleapis.com/token` with `grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer&assertion=<jwt>`.
5. Cache the returned access token (typ. 1h lifetime) in `email_accounts.oauth_token_enc` + `oauth_token_exp`. If expired/missing, re-mint.
We implement this ourselves with `cryptography` rather than pulling `google-auth` as a hard dependency, which keeps the vendor footprint small. If you want to use `google-auth` instead, swap `_mint_dwd_access_token()` for `google.oauth2.service_account.Credentials.from_service_account_file(...).with_subject(...)`, which is a small, local change.
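
Steps 2 and 3 of the DWD flow can be sketched with the standard library alone if the RS256 signature is delegated to an injected callable. The function names and the `sign` parameter are assumptions; in the real module the signature would come from the service account's private key (RSASSA-PKCS1-v1_5 with SHA-256), which is not shown here.

```python
import base64
import json
import time


def b64url(data):
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_dwd_assertion(sa_email, user_email, sign, now=None):
    """Assemble header.claims.signature; `sign` is a hypothetical RS256 signer."""
    now = int(time.time()) if now is None else now
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {
        "iss": sa_email,
        "sub": user_email,  # the mailbox being impersonated
        "scope": "https://www.googleapis.com/auth/gmail.readonly",
        "aud": "https://oauth2.googleapis.com/token",
        "iat": now,
        "exp": now + 3600,
    }
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    return signing_input + "." + b64url(sign(signing_input.encode("ascii")))
```

The resulting string is what step 4 sends as the `assertion` form field. Injecting the signer keeps the claim-building logic testable without a real key.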
### 6c. OAuth fallback flow
Endpoints scaffolded but inactive by default:
- `GET /api/email/oauth/start?account_email=<email>` → redirects to Google consent screen with `state` CSRF token, `access_type=offline`, `prompt=consent`.
- `GET /api/email/oauth/callback?code=...&state=...` → exchanges code for tokens, encrypts refresh token with AES-GCM using `CRM_GMAIL_SECRET_KEY`, stores in `email_accounts.oauth_refresh_enc`.
- Access tokens are refreshed on-demand just before any Gmail call if `oauth_token_exp` is within 60s of expiring.
Refresh tokens **never** leave the DB in plaintext; decryption happens in memory for the duration of a single API call and is then discarded.
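
The "within 60s of expiring" rule for cached access tokens can be expressed as a small predicate. This is a sketch: the name `needs_refresh` and the choice to treat timezone-less timestamps as UTC are assumptions.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(expires_at_iso, now=None, skew_seconds=60):
    """True when the cached token is missing or expires within the skew window."""
    if not expires_at_iso:
        return True
    expires_at = datetime.fromisoformat(expires_at_iso)
    if expires_at.tzinfo is None:
        expires_at = expires_at.replace(tzinfo=timezone.utc)  # assume UTC
    now = now or datetime.now(timezone.utc)
    return expires_at - now <= timedelta(seconds=skew_seconds)
```

Passing `now` explicitly makes the check deterministic in tests; callers would normally omit it.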
### 6d. Token storage security
- At rest: `oauth_refresh_enc` is AES-256-GCM with a random per-record nonce. Key material is `CRM_GMAIL_SECRET_KEY` (separate from the CRM's JWT `CRM_SECRET_KEY` so you can rotate them independently).
- At rest (DWD): no tokens to store. Only the service account JSON is sensitive, and it lives at `0600` on the host.
- In transit: all Gmail API calls use HTTPS. Our HTTP client requires TLS 1.2 or newer.
- Revocation:
- DWD: remove client ID from Workspace Admin console → revokes all 5 accounts at once (access tokens already minted stay valid until they expire, at most ~1 h).
- OAuth: per-user `DELETE /api/email/accounts/:id/revoke` calls `https://oauth2.googleapis.com/revoke` then zeroes the DB row.
---
## 7. Sync algorithm
### 7a. Initial backfill (runs once per account)
When a new `email_accounts` row is created:
```
while backfill_complete == 0:
    resp = GET /gmail/v1/users/me/messages
               ?pageToken=<backfill_cursor>&maxResults=500&q=
    for msg_id in resp.messages:
        fetch metadata (format=metadata, headers=[From,To,Cc,Subject,Date,
                                                  Message-ID,In-Reply-To,References])
        decide match via matcher.is_match(addresses)
        if matched:
            fetch full (format=full)
            parser.extract(...)
            db.upsert_email(...), db.upsert_links(...)
            enqueue attachment downloads
        else:
            db.upsert_email(headers_only=True)
        db.upsert_account_sighting(account_id, msg_id, thread_id, labels)
    backfill_cursor = resp.nextPageToken
    if not backfill_cursor:
        backfill_complete = 1
        last_history_id = GET /users/me/profile → historyId
```
Batch the metadata fetches using Gmail's batch endpoint (up to 100 per batch) to stay well under quota.
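The cursor handling and the 100-per-batch grouping can be sketched as follows. This is an illustration only: `list_page` is a hypothetical callable standing in for the `messages.list` request, and the matcher and DB calls are left out.

```python
from typing import Callable, Iterator, List, Optional

BATCH_LIMIT = 100  # metadata fetches per batch request, per the text above


def iter_message_ids(list_page: Callable[[Optional[str]], dict],
                     cursor: Optional[str] = None) -> Iterator[str]:
    """Yield every message id, following nextPageToken until it is absent.

    `list_page(page_token)` is assumed to return a dict shaped like a
    messages.list response: {"messages": [{"id": ...}], "nextPageToken": ...}.
    """
    while True:
        resp = list_page(cursor)
        for msg in resp.get("messages", []):
            yield msg["id"]
        cursor = resp.get("nextPageToken")
        if not cursor:
            return


def chunked(ids: List[str], size: int = BATCH_LIMIT) -> Iterator[List[str]]:
    """Split ids into consecutive groups of at most `size`."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

A resumable backfill would also persist `cursor` after each page, which is what `backfill_cursor` is for in the pseudocode.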
### 7b. Incremental sync (runs every `CRM_GMAIL_SYNC_INTERVAL_MIN` minutes)
```
for account where sync_enabled AND backfill_complete:
    try:
        resp = GET /users/me/history?startHistoryId=<last_history_id>
        for record in resp.history:
            for msg_added in record.messagesAdded:
                process_message(msg_added.message.id)   # same as backfill
            for label_change in record.labelsAdded/labelsRemoved:
                update labels on email_account_messages (cheap)
            for msg_deleted in record.messagesDeleted:
                # Gmail delete ≠ CRM delete. We tombstone the sighting, keep email.
                mark email_account_messages.deleted_at
        last_history_id = resp.historyId
    except HistoryExpiredError (HTTP 404):
        # Gmail only keeps history for ~7 days. If we fall behind, trigger
        # a partial re-backfill using q=after:<last_synced_at>.
        reset_to_date_backfill(account)
```
**Idempotency guarantee:** `email_account_messages (account_id, gmail_message_id)` is unique. `emails.rfc_message_id` is unique. A retry can't double-insert. The `INSERT OR IGNORE` patterns in `db.py` make this explicit.
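A small in-memory demonstration of the `INSERT OR IGNORE` pattern described above. The table here is cut down to the two key columns for illustration (the real DDL lives in the migration), and `record_sighting` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE email_account_messages (
        account_id       TEXT NOT NULL,
        gmail_message_id TEXT NOT NULL,
        UNIQUE (account_id, gmail_message_id)
    )
    """
)


def record_sighting(conn: sqlite3.Connection, account_id: str,
                    gmail_message_id: str) -> bool:
    """Insert a sighting; return True only if a new row was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO email_account_messages "
        "(account_id, gmail_message_id) VALUES (?, ?)",
        (account_id, gmail_message_id),
    )
    return cur.rowcount == 1


first = record_sighting(conn, "acct_grant", "18c2f0a")
retry = record_sighting(conn, "acct_grant", "18c2f0a")  # simulated retry
count = conn.execute("SELECT COUNT(*) FROM email_account_messages").fetchone()[0]
```

The second call is silently ignored by the unique constraint, so a replayed sync run leaves the row count unchanged.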
### 7c. Scheduler
A single daemon thread started from `server.py` startup:
```python
# scheduler.py
def run_forever():
    while not _stop.is_set():
        try:
            run_all_accounts()
        except Exception:
            log.exception("sync loop failed")
        _stop.wait(CONFIG.sync_interval_sec)
```
Within `run_all_accounts()`, accounts are processed **sequentially** (not in parallel) because:
- Our sqlite setup has one writer.
- It keeps the per-user-quota-per-second math simple (250 units/user/sec per account, no cross-account contention).
- Serial processing takes ~30s/account × 5 = 2.5 min per cycle, which is well within the 3-hour interval.
Manual trigger: `POST /api/email/sync/run-now` (admin-only) skips the sleep and runs one cycle immediately.
---
## 8. Parsing
Gmail returns messages as a recursive tree of MIME parts. `parser.py` walks the tree and produces a `ParsedEmail` dataclass.
Headers we extract:
- `From` (single address, parsed with `email.utils.getaddresses`)
- `To`, `Cc`, `Bcc`, `Reply-To` (address lists)
- `Subject`
- `Date` (parsed with `email.utils.parsedate_to_datetime`, stored UTC ISO8601)
- `Message-ID`
- `In-Reply-To`
- `References` (space-separated → JSON array)
Body extraction rules:
1. Prefer `text/plain` part; fall back to HTML-stripped text from `text/html` using stdlib `html.parser.HTMLParser`.
2. Keep `text/html` alongside `text/plain` when both exist (both columns stored).
3. Ignore inline images for body purposes; they get stored as attachments with `content_disposition='inline'`.
4. Decode base64url bodies correctly (Gmail uses `-_` alphabet with no padding).
5. Normalize line endings to `\n`.
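Rules 1, 4, and 5 can be sketched against the shape of a Gmail `payload` object. This is an illustrative sketch, not `parser.py` itself: `decode_b64url` and `find_body` are hypothetical names, and the sketch assumes UTF-8 bodies, whereas a real parser would read the charset from the part's `Content-Type` header.

```python
import base64
from typing import Optional


def decode_b64url(data: str) -> bytes:
    """Decode URL-safe base64, restoring any stripped '=' padding."""
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))


def find_body(payload: dict, mime_type: str) -> Optional[str]:
    """Depth-first search of the MIME tree for the first part of `mime_type`.

    Parts that carry a filename are treated as attachments and skipped.
    """
    if (payload.get("mimeType") == mime_type
            and not payload.get("filename")
            and payload.get("body", {}).get("data")):
        raw = decode_b64url(payload["body"]["data"])
        text = raw.decode("utf-8", errors="replace")  # assumption: UTF-8
        return text.replace("\r\n", "\n").replace("\r", "\n")
    for part in payload.get("parts", []):
        found = find_body(part, mime_type)
        if found is not None:
            return found
    return None
```

Calling `find_body(payload, "text/plain")` first and `find_body(payload, "text/html")` second gives the preference order in rule 1.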
Attachments are identified by any MIME part with a non-empty `filename` or `Content-Disposition: attachment`. Attachment bodies are **not** inlined in the message `get` response when it's large — Gmail returns an opaque `attachmentId`, and we fetch them separately via `messages.attachments.get`.
Edge cases explicitly handled:
- Malformed `Date:` headers (fall back to `internalDate` from Gmail response, always present).
- Missing `Message-ID` header (generate synthetic: `synthetic-<gmail_message_id>@ten31.local`).
- Duplicate `Message-ID` across mailboxes (expected — primary dedup path).
- RFC 2047 encoded subjects (`=?UTF-8?B?...?=`) — `email.header.decode_header`.
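Two of these edge cases, RFC 2047 subjects and the `Date` fallback, can be handled with the standard library. A minimal sketch under assumptions: `decode_subject` and `sent_at_utc` are hypothetical helper names, and `internal_date_ms` is Gmail's `internalDate` (epoch milliseconds, delivered as a string).

```python
from datetime import datetime, timezone
from email.header import decode_header, make_header
from email.utils import parsedate_to_datetime
from typing import Optional


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words; plain subjects pass through unchanged."""
    return str(make_header(decode_header(raw)))


def sent_at_utc(date_header: Optional[str], internal_date_ms: str) -> str:
    """Return a UTC ISO 8601 timestamp, preferring the Date header."""
    dt = None
    if date_header:
        try:
            dt = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            dt = None  # malformed header: fall through to internalDate
    if dt is None:
        dt = datetime.fromtimestamp(int(internal_date_ms) / 1000, tz=timezone.utc)
    elif dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # "-0000" dates parse as naive
    return dt.astimezone(timezone.utc).isoformat()
```

Both exception types are caught because older Python versions raise `TypeError` on unparseable dates while newer ones raise `ValueError`.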
---
## 9. Matching
### 9a. Source of truth for investor addresses
At matcher startup and periodically (every 15 min), we build an in-memory index:
```python
INVESTOR_EMAIL_INDEX = {
    "alice@acme.vc": {
        "kind": "fundraising_contact",
        "id": "fc_abc",
        "investor_id": "fi_xyz",
        "investor_name": "Acme Capital"
    },
    ...
}
INVESTOR_DOMAIN_INDEX = {
    "acme.vc": [("fi_xyz", "Acme Capital"), ...]
}
```
Pulled from:
1. `fundraising_contacts.email` (primary LP list)
2. `contacts.email` (general CRM contacts)
3. `organizations.email` and domain of `organizations.website` (weak domain match only)
### 9b. Match algorithm
For each parsed email, collect the set of participant addresses (from + to + cc + bcc), **excluding** the owning account's own address (we don't match our own team).
```python
def match(participants: set[str]) -> list[InvestorLink]:
    links = []
    for addr in participants:
        if addr in INVESTOR_EMAIL_INDEX:
            entry = INVESTOR_EMAIL_INDEX[addr]  # dict shaped as in §9a
            links.append(InvestorLink(
                match_kind="exact_email",
                confidence=1.0,
                matched_address=addr,
                fundraising_investor_id=entry["investor_id"],
            ))
    if not links:  # only fall back to domain if no exact match
        for addr in participants:
            domain = addr.split("@", 1)[-1].lower()
            if domain in INVESTOR_DOMAIN_INDEX and domain not in COMMON_PERSONAL_DOMAINS:
                for inv_id, inv_name in INVESTOR_DOMAIN_INDEX[domain]:
                    links.append(InvestorLink(
                        match_kind="domain_match",
                        confidence=0.6,
                        matched_address=addr,
                        fundraising_investor_id=inv_id,
                    ))
    return dedupe(links)
```
`COMMON_PERSONAL_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com", "icloud.com", ...}` — we don't domain-match on these.
Match status flows:
- 0 links → `emails.match_status = 'unmatched'`, body not stored.
- ≥1 exact_email link → `match_status = 'matched'`, full body + attachments stored.
- Only domain_match links → `match_status = 'matched'` with confidence 0.6, full body stored but flagged for review in UI.
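The three status flows above reduce to a small decision function. This is a sketch only: `classify` is a hypothetical name and the returned tuple is an illustrative shape, not the actual column layout.

```python
from typing import Iterable, Tuple


def classify(match_kinds: Iterable[str]) -> Tuple[str, bool, bool]:
    """Map the link kinds found for one email to
    (match_status, store_body, needs_review)."""
    kinds = set(match_kinds)
    if not kinds:
        return ("unmatched", False, False)   # headers only, no body
    if "exact_email" in kinds:
        return ("matched", True, False)      # full body + attachments
    return ("matched", True, True)           # domain-only: flag for review
```

Because `match()` only falls back to domain matching when there is no exact hit, the mixed case should not normally occur, but the function treats any exact link as decisive.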
### 9c. Re-matching
If a new investor contact is added to the CRM, existing unmatched emails can be retroactively matched. `POST /api/email/rematch?since=<iso>` rebuilds the index and scans emails where `match_status='unmatched'` in the window. Full bodies are fetched from Gmail on demand for the newly matched ones (we still have the `gmail_message_id` via the sightings table).
---
## 10. Threading
Two-level strategy:
1. **Primary: Gmail `threadId`.** Within one mailbox, Gmail's grouping is authoritative. We trust it.
2. **Cross-account reconciliation: RFC `References` / `In-Reply-To`.** If Alice sees thread `thr_A` and Bob sees thread `thr_B` and they share RFC `Message-ID`s, they're really one conversation.
Algorithm:
```python
def resolve_thread(email_row):
    # Step 1: normalize subject (strip "Re:", "Fwd:", whitespace)
    subj_norm = normalize_subject(email_row.subject)

    # Step 2: collect candidate thread keys
    rfc_refs = email_row.references + [email_row.in_reply_to]
    rfc_refs = [r for r in rfc_refs if r]

    # Step 3: find any existing email that shares a References link
    parent = db.find_email_by_any_rfc_id(rfc_refs + [email_row.rfc_message_id])
    if parent and parent.thread_id:
        return parent.thread_id

    # Step 4: fall back to gmail_thread_id (first sighting wins)
    if email_row.gmail_thread_id:
        existing = db.find_thread_by_gmail_id(email_row.gmail_thread_id)
        if existing:
            return existing.id

    # Step 5: create new thread
    return db.create_thread(
        gmail_thread_id=email_row.gmail_thread_id,
        rfc_thread_root_id=rfc_refs[0] if rfc_refs else email_row.rfc_message_id,
        subject_normalized=subj_norm,
        first_message_at=email_row.sent_at
    )
```
Thread roll-up (`email_threads.message_count`, `participants_json`, `last_message_at`) is recomputed opportunistically on each insert using a single `UPDATE ... FROM (SELECT ...)` — cheap at the volumes a 5-person team produces.
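`resolve_thread` relies on a `normalize_subject` helper that the doc does not spell out. One possible sketch follows; lowercasing and the exact set of prefixes are assumptions, not something the design specifies.

```python
import re
from typing import Optional

# One or more leading reply/forward markers such as "Re:", "RE :", "Fwd:", "FW:".
_REPLY_PREFIXES = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: Optional[str]) -> str:
    """Strip reply/forward prefixes, collapse whitespace, and lowercase."""
    stripped = _REPLY_PREFIXES.sub("", subject or "")
    return re.sub(r"\s+", " ", stripped).strip().lower()
```

The prefix pattern requires a colon, so a subject like "Regarding fees" is left intact.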
---
## 11. Attachments
### 11a. Storage layout
```
data/
├── crm.db
├── email_attachments/
│   └── <email_id[0:2]>/          # shard by first 2 chars of email_id to keep dirs small
│       └── <email_id>/
│           ├── <attachment_id>-pitch-deck.pdf
│           └── <attachment_id>-financials.xlsx
```
### 11b. Download flow
Run asynchronously from message parsing (so a giant attachment doesn't block the sync loop):
1. Parser records attachment stubs in `email_attachments` with `download_status='pending'`.
2. An attachment-download worker pulls up to N at a time (default 4), calls `messages.attachments.get`, streams to a temp file, computes SHA-256, renames to final path.
3. On success: `download_status='downloaded'`, `downloaded_at`, `sha256_hex`, `size_bytes` filled.
4. On failure: `download_status='failed'`, `download_attempts` incremented, `download_error` set. Retry with backoff up to 5 attempts.
5. Sanitize filenames aggressively: strip path separators, null bytes, control chars, truncate to 200 chars. Keep extension.
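Step 2 (stream to a temp file, hash, rename) can be sketched as below. `store_stream` is a hypothetical name; the chunks would come from the Gmail client in practice, and the temp file is created beside the destination so the final rename stays on one filesystem.

```python
import hashlib
import os
import tempfile
from typing import Iterable, Tuple


def store_stream(chunks: Iterable[bytes], final_path: str) -> Tuple[str, int]:
    """Write chunks to a temp file next to final_path, then rename into place.

    Returns (sha256_hex, size_bytes). A failed download never leaves a
    half-written file at final_path.
    """
    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)
    digest = hashlib.sha256()
    size = 0
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
                digest.update(chunk)
                size += len(chunk)
        os.replace(tmp_path, final_path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return digest.hexdigest(), size
```

Hashing while writing avoids a second pass over large files, and the returned pair maps directly onto `sha256_hex` and `size_bytes` in step 3.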
### 11c. Dedup and size limits
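- Content-addressed by `sha256_hex`. The hash is only known once the bytes have been fetched, so dedup saves disk rather than bandwidth: if a file with the same hash already exists on disk, we don't store a second copy and instead link the second `email_attachments` row to the same `storage_path`.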
- Default per-file cap: 50 MB. Above that, we record metadata only with `download_status='skipped'` and a note. Configurable via `CRM_GMAIL_MAX_ATTACHMENT_MB`.
- Default total storage cap: 10 GB (logged warning at 80%).
### 11d. Retrieval
- `GET /api/email/attachments/:id/download` — authenticated, streams the file. 404 if status != `downloaded`.
- No direct filesystem exposure.
---
## 12. Errors, rate limits, backoff
### 12a. Error taxonomy (`errors.py`)
```python
class GmailError(Exception): ...
class AuthError(GmailError): ... # 401/403 — broken creds, needs operator attention
class RateLimitError(GmailError): ... # 429 or 403 rate_limit — backoff + retry
class TransientError(GmailError): ... # 500, 502, 503, 504, network — backoff + retry
class NotFoundError(GmailError): ... # 404 — message deleted, history expired
class HistoryExpiredError(NotFoundError): ... # specifically for expired startHistoryId
class PermanentError(GmailError): ... # 400 bad request etc — don't retry, log+skip
```
Every HTTP call is routed through `gmail_client._call()` which maps status codes → exception types → retry decision.
### 12b. Retry policy
Exponential backoff with full jitter:
- Initial delay: 1s
- Max delay: 60s
- Max attempts: 5 for transient/rate-limit, 0 for auth/permanent
- On the 6th rate-limit error in a row for an account, mark account `sync_status='paused'` and alert via audit log.
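The policy above can be sketched as follows. The function names are hypothetical, and "max attempts: 5" is read here as five tries in total; pausing the account after repeated rate-limit errors is left to the caller.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)], attempt 0-based."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn: Callable[[], T],
                    is_retryable: Callable[[Exception], bool],
                    max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying retryable errors with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            out_of_tries = attempt == max_attempts - 1
            if out_of_tries or not is_retryable(exc):
                raise  # auth/permanent errors surface immediately
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be >= 1")
```

With the taxonomy from §12a, `is_retryable` would be something like `lambda e: isinstance(e, (RateLimitError, TransientError))`.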
### 12c. Gmail quota budget
- Per-project: 1,000,000,000 units/day (enormous; we won't get near it).
- Per-user: 250 units/second. We pace to ~150 units/sec/account via a token bucket in `gmail_client.py`. With 5 accounts processed serially, that's naturally below the per-project burst limit of 1,200 units/sec.
Call costs:
- `messages.list` = 5
- `messages.get` = 5
- `messages.attachments.get` = 5
- `history.list` = 2
Typical incremental cycle for one active user (~50 new messages in 3h): ~260 units. Negligible.
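The estimate can be reproduced from the call costs listed above. How the ~260 figure was derived is not stated, so the breakdown below is an assumption: one `history.list` page, a metadata `messages.get` for every new message, and a second full `messages.get` for each matched one.

```python
COST = {
    "messages.list": 5,
    "messages.get": 5,
    "messages.attachments.get": 5,
    "history.list": 2,
}


def incremental_cycle_units(new_messages: int, matched: int,
                            attachments: int = 0, history_pages: int = 1) -> int:
    """Quota units for one incremental sync cycle of one account."""
    return (
        history_pages * COST["history.list"]
        + new_messages * COST["messages.get"]          # metadata fetch for all
        + matched * COST["messages.get"]               # full fetch for matches
        + attachments * COST["messages.attachments.get"]
    )


baseline = incremental_cycle_units(50, matched=0)    # 2 + 250
typical = incremental_cycle_units(50, matched=2)     # 2 + 250 + 10
```

Under these assumptions 50 new messages cost 252 units with no matches and 262 with two, which is in line with the figure quoted; at the paced 150 units/sec that is under two seconds of quota.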
### 12d. Data validation
All DB inserts go through `db.py` helpers that:
- Enforce required fields (from_email, sent_at, rfc_message_id).
- Cap TEXT field lengths (subject 998 bytes, the RFC 5322 line-length limit; body_text 10 MB soft cap with truncation log).
- Coerce timestamps to UTC ISO8601.
- Wrap everything in a single transaction per email.
---
## 13. Security considerations
| Concern | Mitigation |
|---------|-----------|
| Service account key theft | File at 0600, not in git, `.gitignore` covers `data/secrets/`, key usable only for Gmail scope, restricted to Workspace domain by DWD. |
| SQL injection | All SQL uses parameterized queries; strict `?` placeholders. |
| Path traversal in attachment filenames | Filenames are sanitized (`_sanitize_filename`) and written under an ID-derived directory (`<email_id[:2]>/<email_id>/`); the original name is stored in the DB only. |
| Rate-limit DoS on sync endpoint | `POST /api/email/sync/run-now` is admin-only and subject to existing `WRITE_RATE_LIMIT_PER_MIN`. |
| Exfiltration via rendered email HTML | UI must render in a sandboxed iframe with CSP; no remote image/script loads. (Frontend change, not in scope for this scaffolding.) |
| Secret exposure in logs | Access tokens and refresh tokens never logged. Error messages are redacted by a central `redact_secrets()` filter. |
| Backup includes email bodies | Existing backup system writes JSON of `fundraising_state` only — it does **not** include the new `emails_*` tables. Future decision: include encrypted? See §17. |
---
## 14. Integration plan — zero-breakage activation
The scaffolded code is **inert** until three very small changes are made to `server.py`. Each is individually reversible.
### Patch 1: Table creation (append to `init_db()`)
```python
# server.py, near the end of init_db()
try:
    from email_integration.db import apply_migrations
    apply_migrations(cursor)
except ImportError:
    pass  # module not deployed yet, skip silently
```
### Patch 2: Scheduler startup (append to `main()` before `serve_forever()`)
```python
# server.py, in main() after start_backup_scheduler()
if os.environ.get("CRM_GMAIL_INTEGRATION_ENABLED", "").lower() in ("1", "true", "yes"):
    from email_integration.scheduler import start_sync_scheduler
    start_sync_scheduler()
```
### Patch 3: Route dispatch hook (in `CRMHandler.do_GET` / `do_POST`, before the 404)
```python
# server.py, before the 404 fallthrough in CRMHandler
from email_integration.routes import try_handle
if try_handle(self):  # returns True if it claimed the request
    return
```
All three are guarded by the env var `CRM_GMAIL_INTEGRATION_ENABLED`, which defaults to off. Turning it off disables the integration at the next restart. Migrations are idempotent, so there is no "partially-migrated" state.
---
## 15. Testing strategy
### 15a. Unit tests (pure Python, no network)
- `test_parser.py` — fixture Gmail payloads covering: plain-only, html-only, multipart/alternative, multipart/mixed with attachments, RFC 2047 subjects, malformed dates, missing Message-ID.
- `test_matcher.py` — exact match, domain match, common-personal-domain exclusion, team-own-address exclusion, dedup.
- `test_threads.py` — new thread, Gmail-only thread, RFC-only thread, cross-account reconciliation.
- `test_db.py` — migrations run on empty DB, re-run is no-op, all upserts idempotent.
- `test_crypto.py` — encrypt/decrypt roundtrip, wrong key fails, nonce uniqueness.
### 15b. Integration tests (mocked HTTP)
Use `responses` lib or a tiny in-proc mock server to simulate Gmail API responses:
- Full backfill of 200 messages across 5 pages.
- Incremental history with messagesAdded + labelsAdded + messagesDeleted.
- Expired `startHistoryId` triggers date-backfill fallback.
- 429 triggers retry; 6 consecutive 429s pauses account.
- Attachment download happy path + large-file skip.
### 15c. Manual smoke test before production
1. Enable feature flag for a single test account first (`grant@ten31.xyz` only).
2. Run initial backfill, confirm no errors in `email_sync_runs`.
3. Check sample of matched emails in UI against Gmail source.
4. Verify attachments download and hash matches.
5. After 24h of incremental sync, enroll remaining 4 teammates.
---
## 16. Rollout plan
| Week | Milestone |
|------|-----------|
| 1 | Review this doc. Adjust schema / decisions as needed. |
| 2 | Implement scaffolded modules (code is already stubbed — fill in bodies). Unit tests. |
| 3 | Integration tests against a GCP sandbox project. |
| 4 | Apply server.py patches behind flag, enable for Grant's account only. Monitor for 1 week. |
| 5 | Enroll remaining 4 teammates. Document ops runbook. |
| 6+ | Phase 2: UI for threads, search, manual match override. |
---
## 17. Open questions / deferred decisions
1. **Backup inclusion.** Should the existing nightly backup cover `emails_*` tables? Pro: completeness. Con: bodies contain sensitive content; backups currently written unencrypted. Recommend adding a separate encrypted email backup stream rather than merging into `fundraising_backup`.
2. **Retention policy for unmatched emails.** Current plan: keep metadata indefinitely, never store body. Alternative: delete metadata after 90 days to reduce DB size. User preference needed.
3. **Real-time push.** Gmail supports `users.watch` + Pub/Sub to get sub-second notifications. Out of scope for Phase 1 (requires ingress URL, Pub/Sub, harder on a self-hosted box). Revisit if latency becomes an issue.
4. **Contact auto-creation.** Should an email from a new `@foo.vc` address automatically create a `contacts` row? Currently no — matching is passive. Easy to add later.
5. **Sent email enrichment.** When a teammate sends email to a known investor, should it create/update a CRM task (e.g., "awaiting reply")? Phase 3 roadmap item.
6. **Multi-domain Workspace.** If ten31.xyz ever adds a second domain (ten31.io etc), DWD needs updating. Document for future.
---
## Appendix A: Example end-to-end flow
A concrete walkthrough to ground the design.
**Scenario:** Alice from Acme Capital emails Grant with a fundraising question, cc'ing her partner Bob.
1. **T+0.** Alice sends `Subject: "Re: Q2 allocation"` to `grant@ten31.xyz`, cc `bob@acme.vc`.
2. **T+1 min.** Gmail delivers. Historically this is invisible to the CRM.
3. **T+1-180 min.** Sync scheduler wakes up. For Grant's account, `history.list(startHistoryId=<checkpoint>)` returns this new message.
4. **Metadata fetch.** `messages.get(id=..., format=metadata)` returns headers. Participant set = `{alice@acme.vc, grant@ten31.xyz, bob@acme.vc}`. Excluding Grant's own address → `{alice@acme.vc, bob@acme.vc}`.
5. **Match.** `INVESTOR_EMAIL_INDEX[alice@acme.vc]` → Acme Capital (fundraising_investor_id=`fi_xyz`). Exact match, confidence 1.0.
6. **Full fetch.** `messages.get(id=..., format=full)` returns body + attachment stubs.
7. **Parse.** `parser.extract` produces `ParsedEmail(from=alice, to=[grant], cc=[bob], subject="Re: Q2 allocation", body_text="...", attachments=[{name:"term_sheet.pdf", gmail_attachment_id:"..."}])`.
8. **DB writes (one transaction):**
- `INSERT INTO emails` with `rfc_message_id=<CAA...@mail.gmail.com>`, `match_status='matched'`.
- `INSERT INTO email_recipients` × 3.
- `INSERT INTO email_account_messages` with Grant's account + Gmail message id.
- `INSERT INTO email_investor_links` × 1 (to fi_xyz).
- `INSERT INTO email_attachments` with `download_status='pending'`.
- `resolve_thread()` finds existing thread via References, updates `email_threads.last_message_at`.
9. **Attachment download.** Worker picks up the pending attachment, calls `messages.attachments.get`, writes `data/email_attachments/<emailid[:2]>/<emailid>/<attid>-term_sheet.pdf`, updates row with hash + status.
10. **Sightings.** If another enrolled mailbox also received the message (say, a Ten31 teammate on the same thread), that account's sync run picks up its own copy of the same email. `rfc_message_id` already exists → `INSERT INTO email_account_messages` only (no new `emails` row). The copy is recorded as a second sighting.
11. **Result in UI.** Acme Capital's investor page shows a new email in the conversation feed, threaded with prior Q2 allocation messages. Term sheet attachment is one click away, served from local disk.
---
## Appendix B: What gets scaffolded now vs. filled in later
The scaffolded modules (`backend/email_integration/*.py`) include:
- All module-level docstrings explaining purpose.
- Complete type signatures and dataclasses.
- Full SQL for migrations.
- Working implementations for: config loading, credential abstraction, DB migration application, HTTP route dispatch.
- Skeleton implementations with `TODO` markers for: full Gmail API call bodies, MIME tree walking, retry loops.
Lines of code delivered in scaffolding: ~1,500. Remaining to implement: ~1,000 (mostly the Gmail client and parser edge cases). Nothing in `server.py` or `crm.db` is changed.
Binary file not shown.
+90
@@ -0,0 +1,90 @@
"""Core-schema migration runner for the Ten31 CRM.
Phase 0 (Workstream A2) introduces ordered, reviewable SQL migrations for the
*core* schema, generalizing the pattern the Gmail subsystem already uses
(email_integration/db.py). Unlike the Gmail runner, this one keeps a
`schema_migrations` ledger so each file runs exactly once and the applied set is
auditable.
Design rules (CLAUDE.md guardrails):
* Additive and reversible only — migrations add tables / nullable columns;
they never drop or rewrite existing data. Each NNNN_*.sql may ship a paired
NNNN_*.down.sql for manual rollback (the .down files are never auto-applied).
* Idempotent — files use `CREATE TABLE/INDEX IF NOT EXISTS`. For the few
non-idempotent `ALTER TABLE ... ADD COLUMN` statements (SQLite has no
IF NOT EXISTS for columns), a partial-apply is tolerated by skipping
"duplicate column name" errors, matching the existing defensive pattern in
server.py:init_db().
Call `apply_core_migrations(conn)` from init_db() after the base tables exist.
"""
import os
import sqlite3

MIGRATIONS_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "migrations")


def _split_statements(sql: str):
    """Split a SQL script into individual statements on ';' boundaries.

    Sufficient for our migrations, which contain no procedural blocks or
    semicolons inside string literals.
    """
    return [s.strip() for s in sql.split(";") if s.strip()]


def _apply_statementwise(cursor, sql: str) -> None:
    """Execute a migration one statement at a time, tolerating an already-applied
    `ALTER TABLE ... ADD COLUMN` (duplicate column). Used only as a fallback when
    executescript() trips over a partially-applied migration."""
    for stmt in _split_statements(sql):
        try:
            cursor.execute(stmt)
        except sqlite3.OperationalError as exc:
            if "duplicate column name" in str(exc).lower():
                continue
            raise


def apply_core_migrations(conn) -> None:
    """Apply any pending backend/migrations/NNNN_*.sql files once, in order.

    Records each applied file in the `schema_migrations` ledger. `*.down.sql`
    files are ignored (manual rollback only).
    """
    cursor = conn.cursor()
    cursor.execute(
        """
        CREATE TABLE IF NOT EXISTS schema_migrations (
            filename TEXT PRIMARY KEY,
            applied_at TEXT DEFAULT (datetime('now'))
        )
        """
    )
    conn.commit()

    if not os.path.isdir(MIGRATIONS_DIR):
        return

    applied = {row[0] for row in cursor.execute("SELECT filename FROM schema_migrations")}
    pending = sorted(
        fn for fn in os.listdir(MIGRATIONS_DIR)
        if fn.endswith(".sql") and not fn.endswith(".down.sql") and fn not in applied
    )

    for filename in pending:
        path = os.path.join(MIGRATIONS_DIR, filename)
        with open(path, "r", encoding="utf-8") as handle:
            sql = handle.read()
        try:
            cursor.executescript(sql)
        except sqlite3.OperationalError as exc:
            if "duplicate column name" in str(exc).lower():
                _apply_statementwise(cursor, sql)
            else:
                raise
        cursor.execute("INSERT INTO schema_migrations (filename) VALUES (?)", (filename,))
        conn.commit()
        print(f"[migrations] applied {filename}")
+175
@@ -0,0 +1,175 @@
# `email_integration` — Gmail capture for the Venture CRM
Scaffolded Phase 1 of the Gmail integration described in
`GMAIL_INTEGRATION_ARCHITECTURE.md` (repo root). Everything in this module is
isolated from `server.py` until you wire it in explicitly.
## Contents
| File | Purpose |
|------|---------|
| `config.py` | Env-var loader; exposes `CONFIG` singleton. |
| `errors.py` | Exception taxonomy used by the retry loop. |
| `crypto.py` | AES-GCM wrapper for OAuth refresh-token encryption (only used in OAuth mode). |
| `credentials.py` | `CredentialProvider` protocol + `DWDCredentialProvider` / `OAuthCredentialProvider`. |
| `gmail_client.py` | Gmail API HTTP wrapper (rate limit, retry, pagination). |
| `db.py` | All SQL touching `emails_*` tables. Migrations live under `migrations/`. |
| `parser.py` | Gmail payload → canonical dict (headers, body, attachments). |
| `matcher.py` | Investor address index + match logic. |
| `threads.py` | Thread resolution using Gmail threadId + RFC References. |
| `attachments.py` | Stub rows + on-disk storage + download worker. |
| `sync.py` | Orchestrator for backfill + incremental sync of one account. |
| `scheduler.py` | Background thread that runs `sync.sync_all` on an interval. |
| `routes.py` | HTTP handlers under `/api/email/*` compatible with `CRMHandler`. |
| `migrations/0001_email_tables.sql` | Table DDL. |
## Wiring it in
All changes are in `backend/server.py`, all guarded by an env flag. Each is
independently revertible. None run unless `CRM_GMAIL_INTEGRATION_ENABLED=true`.
**Patch 1 — migrations** (append to `init_db()` after all existing
`cursor.executescript(...)` calls, before `conn.commit()`):
```python
try:
    from email_integration.db import apply_migrations
    apply_migrations(cursor)
except ImportError:
    pass
```
**Patch 2 — scheduler** (in `main()`, after `start_backup_scheduler()`):
```python
if os.environ.get("CRM_GMAIL_INTEGRATION_ENABLED", "").lower() in ("1", "true", "yes", "on"):
    from email_integration.scheduler import start_sync_scheduler
    start_sync_scheduler()
```
**Patch 3 — routes** (add near the top of `CRMHandler.do_GET` and
`CRMHandler.do_POST`, after auth/rate-limit pre-checks, before API routing):
```python
try:
    from email_integration.routes import try_handle
    if try_handle(self):
        return
except ImportError:
    pass
```
## Environment variables
```bash
# Master on/off. Default off; scheduler won't start, routes return 503.
CRM_GMAIL_INTEGRATION_ENABLED=true
# Auth method: "dwd" (default, recommended) or "oauth"
CRM_GMAIL_AUTH_METHOD=dwd
# DWD mode
CRM_GMAIL_SA_KEY_PATH=/path/to/CRM/data/secrets/gmail-service-account.json
CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz
# OAuth mode (fallback; not required for DWD)
CRM_GMAIL_OAUTH_CLIENT_ID=...
CRM_GMAIL_OAUTH_CLIENT_SECRET=...
CRM_GMAIL_OAUTH_REDIRECT_URI=https://crm.ten31.xyz/api/email/oauth/callback
CRM_GMAIL_SECRET_KEY=<base64-32-random-bytes> # for encrypting refresh tokens
# Sync
CRM_GMAIL_SYNC_INTERVAL_MIN=180 # default 3h
CRM_GMAIL_BACKFILL_PAGE_SIZE=500
CRM_GMAIL_MAX_ATTACHMENT_MB=50
CRM_GMAIL_ATTACH_CONCURRENCY=4
CRM_GMAIL_RATE_UNITS_SEC=150 # per account, leaves 40% headroom
CRM_GMAIL_RETRY_MAX=5
CRM_GMAIL_HISTORY_STALE_DAYS=5
```
## Google Cloud / Workspace setup (DWD)
See `GMAIL_INTEGRATION_ARCHITECTURE.md` §3 for the full runbook. Short form:
1. Create GCP project, enable Gmail API.
2. Create service account, download JSON key, enable domain-wide delegation.
3. In Google Admin console → Security → API controls → Manage domain-wide
delegation, authorize the service account's client ID with scope
`https://www.googleapis.com/auth/gmail.readonly`.
4. Copy the JSON key to `data/secrets/gmail-service-account.json`, `chmod 600`.
5. Set env vars in `.env.beta`, restart CRM.
6. As admin, POST `/api/email/accounts/enroll-all` to create `email_accounts`
rows for every active user whose email ends in the Workspace domain.
## Adding the crypto dependency
Both auth modes need `cryptography`: OAuth mode uses it for AES-GCM encryption
of refresh tokens, and DWD mode uses it for the RSA signing of the JWT bearer
token. If you enable the integration in either mode, append this to
`backend/requirements.txt`:
```
cryptography==42.0.5
```
## Rollback
To disable instantly: set `CRM_GMAIL_INTEGRATION_ENABLED=false` and restart.
The scheduler won't start, routes return 503, DB tables remain (unused).
To remove completely: drop the env var, delete `data/email_attachments/`, and
drop the `emails` and `email_*` tables. The migration is idempotent and
create-only, so removal needs a separate drop script, which Phase 1 does not
provide.
## Local development
The module has no network dependencies as long as the scheduler is not
started, so you can exercise the parser against a saved fixture from the shell:
```bash
python3 -c "from email_integration.parser import parse; \
import json; \
print(parse(json.load(open('fixture.json'))))"
```
## Testing checklist (before enabling in production)
- [ ] Enable `CRM_GMAIL_INTEGRATION_ENABLED=true` on a staging copy of the DB only.
- [ ] Verify migrations applied: `emails`, `email_accounts`, etc. present.
- [ ] Enroll one account (yours) via `/api/email/accounts/enroll`.
- [ ] Trigger `POST /api/email/sync/run-now`.
- [ ] Check `email_sync_runs` for `status='ok'`.
- [ ] Spot-check `emails` rows against Gmail.
- [ ] Verify an attachment downloaded correctly (hash and size).
- [ ] Let the scheduler run for 24 hours; monitor `/api/email/status`.
- [ ] Enroll remaining 4 teammates.
## What's scaffolded vs. TODO
**Scaffolded and complete:**
- Schema (migration 0001)
- Config and env parsing
- Error taxonomy + retry classifier
- AES-GCM crypto helpers
- DWD JWT minting + access token caching
- OAuth refresh + consent flow endpoints
- Gmail client (list/get/history/attachments/profile) with rate limit + retry
- Full DB data-access layer
- MIME parser including RFC 2047 subjects and HTML→text fallback
- Investor matcher with exact + domain strategies
- Thread resolution (Gmail threadId + RFC References cross-account)
- Attachment storage with SHA-256 dedup
- Sync orchestrator (backfill + incremental with history-expired fallback)
- Scheduler with manual-trigger hook
- HTTP routes (status, accounts, threads, enroll, run-now, rematch, oauth)
**TODO before production (see architecture doc §15):**
- Multipart batch metadata fetch in `gmail_client.batch_get_metadata`
(currently serial fallback).
- Unit tests (fixtures for parser, matcher, threads; integration tests with
responses-style HTTP mock).
- Frontend UI: a thread list + detail pane in `frontend/index.html`.
- Sandboxed HTML rendering for email bodies (out of scope here).
+15
@@ -0,0 +1,15 @@
"""
Gmail Integration for Venture CRM.
Phase 1 scope: OAuth2/DWD authentication, incremental Gmail sync, MIME parsing,
investor matching, threading, attachment storage. All logic isolated to this
module; server.py integration is a 3-line patch guarded by
CRM_GMAIL_INTEGRATION_ENABLED.
See GMAIL_INTEGRATION_ARCHITECTURE.md at the repo root for full design.
"""
from . import config # noqa: F401
from . import errors # noqa: F401
__all__ = ["config", "errors"]
+234
@@ -0,0 +1,234 @@
"""
Attachment download + on-disk storage.
Two usage patterns:
1. During message parsing we call `register_stubs(conn, email_id, parsed)`
which writes pending rows to email_attachments.
2. A separate worker (kicked off by sync after each account completes)
calls `drain_pending()` which fetches attachment bytes from Gmail and
writes them to disk under CONFIG.attachments_dir.
Files are named: <CRM_DATA_DIR>/email_attachments/<email_id[:2]>/<email_id>/<attachment_id>-<sanitized_filename>
Sanitization prevents path traversal and keeps cross-platform-safe names.
"""
import base64
import hashlib
import os
import re
import sqlite3
from typing import Iterable, Optional

from . import config as _cfg
from . import db as _db
from . import errors as _errors
from . import gmail_client as _gmail

_MAX_FILENAME_LEN = 200
_BAD_FILENAME_CHARS = re.compile(r'[/\\\x00-\x1f\x7f:*?"<>|]+')


def _sanitize_filename(name: str) -> str:
    if not name:
        return "unnamed.bin"
    # strip path components first
    name = os.path.basename(name.replace("\\", "/"))
    name = _BAD_FILENAME_CHARS.sub("_", name).strip(" .")
    if not name:
        name = "unnamed.bin"
    if len(name) > _MAX_FILENAME_LEN:
        stem, dot, ext = name.rpartition(".")
        if dot:
            name = stem[: _MAX_FILENAME_LEN - len(ext) - 1] + "." + ext
        else:
            name = name[:_MAX_FILENAME_LEN]
    return name


def _storage_path_for(email_id: str, attachment_id: str, sanitized_filename: str) -> str:
    root = _cfg.CONFIG.attachments_dir
    bucket = email_id[:2] or "_0"
    dir_ = os.path.join(root, bucket, email_id)
    os.makedirs(dir_, exist_ok=True)
    return os.path.join(dir_, f"{attachment_id}-{sanitized_filename}")
# ---------------------------------------------------------------------------- phase 1: register stubs
def register_stubs(conn: sqlite3.Connection, *, email_id: str,
                   parsed_attachments: Iterable[dict]) -> list[str]:
    """Write pending attachment rows from parsed message data.

    Also handles tiny inline attachments whose bytes arrived with the message
    body (body.data present, no separate attachmentId) by writing them
    directly and marking as downloaded.

    Returns list of attachment ids created.
    """
    max_bytes = _cfg.CONFIG.max_attachment_mb * 1024 * 1024
    ids = []
    for att in parsed_attachments:
        filename = att.get("filename") or "unnamed.bin"
        sanitized = _sanitize_filename(filename)
        gmail_att_id = att.get("gmail_attachment_id") or ""
        mime = att.get("mime_type")
        size = att.get("size")

        # Determine storage path (we write the path whether or not the download
        # succeeded; missing files surface via download_status).
        att_row_id = _db.insert_attachment_stub(
            conn,
            email_id=email_id,
            gmail_attachment_id=gmail_att_id,
            filename=filename,
            sanitized_filename=sanitized,
            mime_type=mime,
            size_bytes=size,
            storage_path=_storage_path_for(email_id, gmail_att_id or att_row_id_fallback(), sanitized),
        )
        ids.append(att_row_id)

        # Oversize guard.
        if isinstance(size, int) and size > max_bytes:
            conn.execute(
                "UPDATE email_attachments SET download_status = 'skipped', "
                "download_error = ? WHERE id = ?",
                (f"exceeds max size {_cfg.CONFIG.max_attachment_mb}MB", att_row_id),
            )
            continue

        # Inline data fast-path.
        inline_b64 = att.get("inline_data_b64")
        if inline_b64:
            try:
                raw = base64.urlsafe_b64decode(_pad(inline_b64).encode("ascii"))
                path = _storage_path_for(email_id, att_row_id, sanitized)
                _write_bytes(path, raw)
                sha = hashlib.sha256(raw).hexdigest()
                conn.execute(
                    "UPDATE email_attachments SET storage_path = ? WHERE id = ?",
                    (path, att_row_id),
                )
                _db.mark_attachment_downloaded(
                    conn, att_row_id, sha256_hex=sha, size_bytes=len(raw)
                )
            except Exception as e:
                _db.mark_attachment_failed(conn, att_row_id, error=f"inline decode: {e}")
    return ids
def att_row_id_fallback() -> str:
# Placeholder so the path template always produces something if gmail_att_id
# was missing at stub time; the real path is rewritten when the worker
# picks it up.
import uuid
return uuid.uuid4().hex
# ---------------------------------------------------------------------------- phase 2: worker
def drain_pending(conn_factory, client: _gmail.GmailClient, account_id: str,
*, limit: int = 50) -> int:
"""Download up to `limit` pending attachments for `account_id`.
Returns count of successfully downloaded attachments. Called after each
account's sync completes so large files don't block the sync loop.
"""
conn = conn_factory()
try:
pending = _db.pending_attachments(conn, limit=limit)
finally:
conn.close()
downloaded = 0
for row in pending:
if row["account_id"] != account_id:
continue
conn = conn_factory()
try:
ok = _download_one(conn, client, row)
if ok:
downloaded += 1
conn.commit()
finally:
conn.close()
return downloaded
def _download_one(conn: sqlite3.Connection, client: _gmail.GmailClient, row) -> bool:
try:
resp = client.get_attachment(row["gmail_message_id"], row["gmail_attachment_id"])
except _errors.RETRYABLE as e:
_db.mark_attachment_failed(conn, row["id"], error=f"transient: {type(e).__name__}")
return False
except _errors.GmailError as e:
_db.mark_attachment_failed(conn, row["id"], error=f"{type(e).__name__}: {e}")
return False
data_b64 = resp.get("data")
if not data_b64:
_db.mark_attachment_failed(conn, row["id"], error="empty data in response")
return False
try:
raw = base64.urlsafe_b64decode(_pad(data_b64).encode("ascii"))
except Exception as e:
_db.mark_attachment_failed(conn, row["id"], error=f"decode: {e}")
return False
sha = hashlib.sha256(raw).hexdigest()
# If an existing attachment has the same SHA, re-point storage_path and skip write.
existing = _find_existing_by_sha(conn, sha, exclude_id=row["id"])
if existing:
conn.execute(
"UPDATE email_attachments SET storage_path = ? WHERE id = ?",
(existing["storage_path"], row["id"]),
)
_db.mark_attachment_downloaded(conn, row["id"], sha256_hex=sha, size_bytes=len(raw))
return True
path = _storage_path_for(row["email_id"], row["id"], row["sanitized_filename"])
try:
_write_bytes(path, raw)
except OSError as e:
_db.mark_attachment_failed(conn, row["id"], error=f"disk: {e}")
return False
conn.execute(
"UPDATE email_attachments SET storage_path = ? WHERE id = ?",
(path, row["id"]),
)
_db.mark_attachment_downloaded(conn, row["id"], sha256_hex=sha, size_bytes=len(raw))
return True
def _find_existing_by_sha(conn: sqlite3.Connection, sha: str, *, exclude_id: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute(
"SELECT * FROM email_attachments WHERE sha256_hex = ? AND id != ? "
"AND download_status = 'downloaded' LIMIT 1",
(sha, exclude_id),
)
return cur.fetchone()
# ---------------------------------------------------------------------------- utils
def _pad(b64: str) -> str:
    # Gmail returns base64url data without "=" padding; restore it so that
    # base64.urlsafe_b64decode() accepts the string.
    return b64 + "=" * (-len(b64) % 4)
def _write_bytes(path: str, data: bytes) -> None:
os.makedirs(os.path.dirname(path), exist_ok=True)
tmp = path + ".tmp"
with open(tmp, "wb") as f:
f.write(data)
os.chmod(tmp, 0o600)
os.replace(tmp, path)
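The naming scheme described in the module docstring can be tried in isolation. The following is a minimal sketch, assuming a standalone copy of the sanitization steps; `sanitize` and `storage_path` are illustrative names rather than the module's own functions, and unlike `_storage_path_for` the path helper here does not create directories.

```python
import os
import re

MAX_LEN = 200  # mirrors _MAX_FILENAME_LEN
BAD_CHARS = re.compile(r'[/\\\x00-\x1f\x7f:*?"<>|]+')


def sanitize(name: str) -> str:
    """Standalone copy of the sanitization steps, for experimentation only."""
    if not name:
        return "unnamed.bin"
    # Drop directory parts, treating backslashes as separators too.
    name = os.path.basename(name.replace("\\", "/"))
    # Replace reserved characters, then trim dots and spaces at both ends.
    name = BAD_CHARS.sub("_", name).strip(" .")
    if not name:
        return "unnamed.bin"
    if len(name) > MAX_LEN:
        stem, dot, ext = name.rpartition(".")
        if dot:
            name = stem[: MAX_LEN - len(ext) - 1] + "." + ext
        else:
            name = name[:MAX_LEN]
    return name


def storage_path(root: str, email_id: str, attachment_id: str, filename: str) -> str:
    """Build <root>/<email_id[:2]>/<email_id>/<attachment_id>-<sanitized name>."""
    bucket = email_id[:2] or "_0"
    return os.path.join(root, bucket, email_id, f"{attachment_id}-{sanitize(filename)}")


if __name__ == "__main__":
    for raw in ["../../etc/passwd", "C:\\Users\\x\\inv.pdf", 're:port?.pdf', ".."]:
        print(repr(raw), "->", sanitize(raw))
```

Traversal attempts collapse to their final component, and a name made only of dots falls back to `unnamed.bin`, so the result is always a single path component under the per-email directory.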
+112
View File
@@ -0,0 +1,112 @@
"""
Email integration configuration.
Reads from the same env-var surface as the rest of the CRM (server.py style),
no pydantic/dotenv magic — stdlib only.
"""
import os
from dataclasses import dataclass
from typing import Optional
# Reuse the CRM's data dir so backups and email storage live together.
_PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
_DEFAULT_DATA_DIR = os.path.join(_PROJECT_DIR, "data")
def _bool_env(name: str, default: bool = False) -> bool:
v = os.environ.get(name, "").strip().lower()
if v in ("1", "true", "yes", "on"):
return True
if v in ("0", "false", "no", "off"):
return False
return default
def _int_env(name: str, default: int) -> int:
try:
return int(os.environ.get(name, str(default)))
except (TypeError, ValueError):
return default
@dataclass(frozen=True)
class EmailConfig:
# Master kill switch. When False, scheduler doesn't start and routes
# return 503. Migrations are still applied (so schema is ready).
enabled: bool
# Primary auth path. "dwd" means service account / domain-wide delegation.
# "oauth" means per-user refresh tokens. DWD is default; OAuth is the
# pluggable fallback.
primary_auth: str
# DWD specifics
dwd_key_path: Optional[str]
workspace_domain: Optional[str]
# OAuth specifics (used for fallback + admin UI)
oauth_client_id: Optional[str]
oauth_client_secret: Optional[str]
oauth_redirect_uri: Optional[str]
# Encryption key (base64) for OAuth refresh-token-at-rest encryption.
# Required whenever oauth path is in use. DWD path never persists tokens.
secret_key_b64: Optional[str]
# Sync scheduling
sync_interval_sec: int
backfill_page_size: int
max_attachment_mb: int
max_parallel_attachment_downloads: int
# Storage
data_dir: str
attachments_dir: str
secrets_dir: str
# Rate limit / retry
rate_limit_units_per_sec_per_account: int
retry_max_attempts: int
retry_initial_delay_sec: float
retry_max_delay_sec: float
# Gmail history retention — if we fall this far behind, switch to date
# backfill since Gmail may have pruned history records.
history_stale_days: int
def load() -> EmailConfig:
data_dir = os.environ.get("CRM_DATA_DIR", _DEFAULT_DATA_DIR)
return EmailConfig(
enabled=_bool_env("CRM_GMAIL_INTEGRATION_ENABLED", False),
primary_auth=os.environ.get("CRM_GMAIL_AUTH_METHOD", "dwd").lower(),
dwd_key_path=os.environ.get("CRM_GMAIL_SA_KEY_PATH") or None,
workspace_domain=os.environ.get("CRM_GMAIL_WORKSPACE_DOMAIN") or None,
oauth_client_id=os.environ.get("CRM_GMAIL_OAUTH_CLIENT_ID") or None,
oauth_client_secret=os.environ.get("CRM_GMAIL_OAUTH_CLIENT_SECRET") or None,
oauth_redirect_uri=os.environ.get("CRM_GMAIL_OAUTH_REDIRECT_URI") or None,
secret_key_b64=os.environ.get("CRM_GMAIL_SECRET_KEY") or None,
sync_interval_sec=_int_env("CRM_GMAIL_SYNC_INTERVAL_MIN", 180) * 60,
backfill_page_size=_int_env("CRM_GMAIL_BACKFILL_PAGE_SIZE", 500),
max_attachment_mb=_int_env("CRM_GMAIL_MAX_ATTACHMENT_MB", 50),
max_parallel_attachment_downloads=_int_env("CRM_GMAIL_ATTACH_CONCURRENCY", 4),
data_dir=data_dir,
attachments_dir=os.path.join(data_dir, "email_attachments"),
secrets_dir=os.path.join(data_dir, "secrets"),
rate_limit_units_per_sec_per_account=_int_env("CRM_GMAIL_RATE_UNITS_SEC", 150),
retry_max_attempts=_int_env("CRM_GMAIL_RETRY_MAX", 5),
retry_initial_delay_sec=float(os.environ.get("CRM_GMAIL_RETRY_INITIAL_SEC", "1.0")),
retry_max_delay_sec=float(os.environ.get("CRM_GMAIL_RETRY_MAX_SEC", "60.0")),
history_stale_days=_int_env("CRM_GMAIL_HISTORY_STALE_DAYS", 5),
)
# Singleton. Reload with `reload_config()` if env changes (mostly for tests).
CONFIG = load()
def reload_config() -> EmailConfig:
global CONFIG
CONFIG = load()
return CONFIG
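The env parsing rules above can be exercised without touching the real environment. This is a rough sketch, assuming the same semantics as `_bool_env` and `_int_env` but reading from a plain mapping; `float_env` is a hypothetical helper that does not exist in this module (the module calls `float()` directly for the retry delays, which raises on malformed input).

```python
from typing import Mapping


def bool_env(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Recognize common truthy/falsy spellings; anything else yields the default."""
    v = env.get(name, "").strip().lower()
    if v in ("1", "true", "yes", "on"):
        return True
    if v in ("0", "false", "no", "off"):
        return False
    return default


def int_env(env: Mapping[str, str], name: str, default: int) -> int:
    """Fall back to the default when the value is missing or not an integer."""
    try:
        return int(env.get(name, str(default)))
    except (TypeError, ValueError):
        return default


def float_env(env: Mapping[str, str], name: str, default: float) -> float:
    """Hypothetical float counterpart of int_env (an assumption, not module code)."""
    try:
        return float(env.get(name, str(default)))
    except (TypeError, ValueError):
        return default


if __name__ == "__main__":
    env = {"CRM_GMAIL_INTEGRATION_ENABLED": " Yes ", "CRM_GMAIL_SYNC_INTERVAL_MIN": "abc"}
    print(bool_env(env, "CRM_GMAIL_INTEGRATION_ENABLED"))
    # The interval is configured in minutes and stored in seconds.
    print(int_env(env, "CRM_GMAIL_SYNC_INTERVAL_MIN", 180) * 60)
```

Passing the mapping explicitly keeps the helpers testable; the module itself reads `os.environ` once at import time and exposes `reload_config()` for tests.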
+297
View File
@@ -0,0 +1,297 @@
"""
Credential providers for Gmail API access.
Two implementations behind a common protocol:
- DWDCredentialProvider: signs a JWT with the Workspace-authorized service
account, exchanges for a short-lived access token that impersonates a
specific user. No per-user persistent state.
- OAuthCredentialProvider: uses a per-user refresh token (stored encrypted
in email_accounts.oauth_refresh_enc) to mint access tokens. Supports the
'connect Gmail' UI flow.
Both provide the same interface:
provider.access_token_for(email_address: str) -> AccessToken
"""
import base64
import json
import os
import threading
import time
from dataclasses import dataclass
from typing import Optional, Protocol
import urllib.error
import urllib.parse
import urllib.request
from . import config as _cfg
from . import crypto
from . import errors
GMAIL_READONLY_SCOPE = "https://www.googleapis.com/auth/gmail.readonly"
GOOGLE_TOKEN_URL = "https://oauth2.googleapis.com/token"
@dataclass
class AccessToken:
token: str
expires_at: float # epoch seconds
class CredentialProvider(Protocol):
def access_token_for(self, email_address: str) -> AccessToken: ...
def revoke(self, email_address: str) -> None: ...
# ============================================================================
# Domain-wide delegation
# ============================================================================
class DWDCredentialProvider:
"""Impersonation via service-account JWT bearer grant."""
def __init__(self, key_path: str):
with open(key_path, "r") as f:
self._key = json.load(f)
self._client_email = self._key["client_email"]
self._private_key_pem = self._key["private_key"].encode("utf-8")
self._cache: dict[str, AccessToken] = {}
self._lock = threading.Lock()
def access_token_for(self, email_address: str) -> AccessToken:
with self._lock:
cached = self._cache.get(email_address)
if cached and cached.expires_at - time.time() > 60:
return cached
token = self._mint(email_address)
self._cache[email_address] = token
return token
def revoke(self, email_address: str) -> None:
# DWD tokens expire naturally in <1h. Revocation is via Admin console.
# We just drop the cache so next call mints fresh.
with self._lock:
self._cache.pop(email_address, None)
# ------------------------------------------------------------------ helpers
def _mint(self, subject_email: str) -> AccessToken:
try:
from cryptography.hazmat.primitives import hashes, serialization # type: ignore
from cryptography.hazmat.primitives.asymmetric import padding # type: ignore
except ImportError as e: # pragma: no cover
raise errors.AuthError(
"DWD requires the `cryptography` package. Add to requirements.txt."
) from e
now = int(time.time())
header = {"alg": "RS256", "typ": "JWT"}
claim = {
"iss": self._client_email,
"sub": subject_email,
"scope": GMAIL_READONLY_SCOPE,
"aud": GOOGLE_TOKEN_URL,
"iat": now,
"exp": now + 3600,
}
signing_input = _b64url(_json(header)) + b"." + _b64url(_json(claim))
private_key = serialization.load_pem_private_key(self._private_key_pem, password=None)
signature = private_key.sign(signing_input, padding.PKCS1v15(), hashes.SHA256())
jwt = signing_input + b"." + _b64url(signature)
body = urllib.parse.urlencode({
"grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
"assertion": jwt.decode("ascii"),
}).encode("ascii")
req = urllib.request.Request(
GOOGLE_TOKEN_URL,
data=body,
headers={"Content-Type": "application/x-www-form-urlencoded"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
payload = json.loads(resp.read())
except urllib.error.HTTPError as e:
body = e.read().decode("utf-8", errors="replace")
try:
payload = json.loads(body)
except Exception:
payload = {"raw": body}
raise errors.classify_http(e.code, payload)
if "access_token" not in payload:
raise errors.AuthError("DWD token exchange returned no access_token", payload=payload)
return AccessToken(
token=payload["access_token"],
expires_at=time.time() + float(payload.get("expires_in", 3600)) - 30,
)
# ============================================================================
# Per-user OAuth (fallback)
# ============================================================================
class OAuthCredentialProvider:
"""Refreshes access tokens using a stored encrypted refresh token.
Refresh tokens are obtained via the consent-flow routes in routes.py and
stored in email_accounts.oauth_refresh_enc (AES-GCM ciphertext).
"""
def __init__(self, db_conn_factory, client_id: str, client_secret: str, secret_key_b64: str):
self._db = db_conn_factory
self._client_id = client_id
self._client_secret = client_secret
self._secret_key_b64 = secret_key_b64
self._lock = threading.Lock()
def access_token_for(self, email_address: str) -> AccessToken:
with self._lock:
row = self._load_account(email_address)
if row is None:
raise errors.AuthError(f"no email_accounts row for {email_address}")
# Cached access token still valid?
if row["oauth_token_enc"] and row["oauth_token_exp"]:
try:
exp = float(row["oauth_token_exp"])
except ValueError:
exp = 0.0
if exp - time.time() > 60:
token = crypto.decrypt(row["oauth_token_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
return AccessToken(token=token, expires_at=exp)
# Refresh.
return self._refresh(email_address, row)
def revoke(self, email_address: str) -> None:
row = self._load_account(email_address)
if not row or not row["oauth_refresh_enc"]:
return
refresh = crypto.decrypt(row["oauth_refresh_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
body = urllib.parse.urlencode({"token": refresh}).encode("ascii")
req = urllib.request.Request(
"https://oauth2.googleapis.com/revoke",
data=body,
headers={"Content-Type": "application/x-www-form-urlencoded"},
)
try:
urllib.request.urlopen(req, timeout=10).read()
except Exception:
pass # best effort; we zero locally regardless
self._zero_account(email_address)
# ------------------------------------------------------------------ helpers
def _refresh(self, email_address: str, row) -> AccessToken:
if not row["oauth_refresh_enc"]:
raise errors.AuthError(f"no refresh token stored for {email_address}")
refresh = crypto.decrypt(row["oauth_refresh_enc"], secret_key_b64=self._secret_key_b64).decode("ascii")
body = urllib.parse.urlencode({
"grant_type": "refresh_token",
"refresh_token": refresh,
"client_id": self._client_id,
"client_secret": self._client_secret,
}).encode("ascii")
req = urllib.request.Request(
GOOGLE_TOKEN_URL,
data=body,
headers={"Content-Type": "application/x-www-form-urlencoded"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
payload = json.loads(resp.read())
except urllib.error.HTTPError as e:
body_text = e.read().decode("utf-8", errors="replace")
try:
payload = json.loads(body_text)
except Exception:
payload = {"raw": body_text}
raise errors.classify_http(e.code, payload)
if "access_token" not in payload:
raise errors.AuthError("OAuth refresh returned no access_token", payload=payload)
token_str = payload["access_token"]
exp = time.time() + float(payload.get("expires_in", 3600)) - 30
enc_token = crypto.encrypt(token_str.encode("ascii"), secret_key_b64=self._secret_key_b64)
self._save_token(email_address, enc_token, exp)
return AccessToken(token=token_str, expires_at=exp)
def _load_account(self, email_address: str):
conn = self._db()
try:
cur = conn.cursor()
cur.execute(
"SELECT id, oauth_refresh_enc, oauth_token_enc, oauth_token_exp "
"FROM email_accounts WHERE email_address = ?",
(email_address,),
)
return cur.fetchone()
finally:
conn.close()
def _save_token(self, email_address: str, enc_token: bytes, exp: float):
conn = self._db()
try:
conn.execute(
"UPDATE email_accounts SET oauth_token_enc = ?, oauth_token_exp = ?, "
"updated_at = datetime('now') WHERE email_address = ?",
(enc_token, str(exp), email_address),
)
conn.commit()
finally:
conn.close()
def _zero_account(self, email_address: str):
conn = self._db()
try:
conn.execute(
"UPDATE email_accounts SET oauth_refresh_enc = NULL, oauth_token_enc = NULL, "
"oauth_token_exp = NULL, sync_enabled = 0, sync_status = 'paused', "
"updated_at = datetime('now') WHERE email_address = ?",
(email_address,),
)
conn.commit()
finally:
conn.close()
# ============================================================================
# Factory — resolves CONFIG.primary_auth to a concrete provider
# ============================================================================
def build_provider(db_conn_factory) -> CredentialProvider:
cfg = _cfg.CONFIG
if cfg.primary_auth == "dwd":
if not cfg.dwd_key_path or not os.path.exists(cfg.dwd_key_path):
raise errors.AuthError(
f"CRM_GMAIL_SA_KEY_PATH not found: {cfg.dwd_key_path!r}"
)
return DWDCredentialProvider(cfg.dwd_key_path)
if cfg.primary_auth == "oauth":
if not (cfg.oauth_client_id and cfg.oauth_client_secret and cfg.secret_key_b64):
raise errors.AuthError(
"OAuth mode requires CRM_GMAIL_OAUTH_CLIENT_ID, "
"CRM_GMAIL_OAUTH_CLIENT_SECRET, and CRM_GMAIL_SECRET_KEY."
)
return OAuthCredentialProvider(
db_conn_factory,
cfg.oauth_client_id,
cfg.oauth_client_secret,
cfg.secret_key_b64,
)
raise errors.AuthError(f"unknown primary_auth: {cfg.primary_auth!r}")
# ---------------------------------------------------------------------------- utils
def _b64url(data: bytes) -> bytes:
return base64.urlsafe_b64encode(data).rstrip(b"=")
def _json(obj) -> bytes:
return json.dumps(obj, separators=(",", ":")).encode("utf-8")
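The unsigned part of the service-account JWT built in `_mint` can be inspected without any key material. Below is a small sketch, assuming the same header and claim fields as above; `build_signing_input` and `b64url_decode` are illustrative names, and the RS256 signing step (which needs the `cryptography` package and a private key) is deliberately left out.

```python
import base64
import json
import time
from typing import Optional


def b64url(data: bytes) -> bytes:
    """Base64url-encode without '=' padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")


def b64url_decode(segment: bytes) -> bytes:
    """Inverse of b64url: restore the padding before decoding."""
    return base64.urlsafe_b64decode(segment + b"=" * (-len(segment) % 4))


def compact_json(obj) -> bytes:
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def build_signing_input(service_account_email: str, subject_email: str,
                        scope: str, token_url: str,
                        now: Optional[int] = None) -> bytes:
    """Return b'<header>.<claims>', the bytes that get signed with RS256."""
    now = int(time.time()) if now is None else now
    header = {"alg": "RS256", "typ": "JWT"}
    claim = {
        "iss": service_account_email,  # the service account itself
        "sub": subject_email,          # the Workspace user being impersonated
        "scope": scope,
        "aud": token_url,
        "iat": now,
        "exp": now + 3600,             # one hour lifetime, as in _mint
    }
    return b64url(compact_json(header)) + b"." + b64url(compact_json(claim))


if __name__ == "__main__":
    signing_input = build_signing_input(
        "sa@example.iam.gserviceaccount.com", "user@example.com",
        "https://www.googleapis.com/auth/gmail.readonly",
        "https://oauth2.googleapis.com/token",
    )
    header_seg, claim_seg = signing_input.split(b".")
    print(json.loads(b64url_decode(claim_seg)))
```

Decoding the segments this way is a quick check that `sub` carries the impersonated mailbox and `aud` matches the token endpoint before a signed assertion is ever sent.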
+79
View File
@@ -0,0 +1,79 @@
"""
AES-256-GCM encryption for OAuth tokens at rest (refresh tokens and cached
access tokens).
Key material comes from CONFIG.secret_key_b64 (env: CRM_GMAIL_SECRET_KEY).
It must decode from base64 to at least 32 bytes; only the first 32 bytes are used.
Storage format (as stored in BLOB columns):
version(1 byte) || nonce(12 bytes) || ciphertext+tag(N bytes)
version = 1 for AES-GCM-256.
Uses the `cryptography` library. If not available (optional at scaffold time),
the OAuth fallback path is disabled with a clear error — DWD path is unaffected.
"""
import base64
import os
import secrets
from typing import Optional
try:
from cryptography.hazmat.primitives.ciphers.aead import AESGCM # type: ignore
_AVAILABLE = True
except ImportError: # pragma: no cover
AESGCM = None # type: ignore
_AVAILABLE = False
VERSION = 1
NONCE_LEN = 12
class CryptoUnavailable(RuntimeError):
pass
def _load_key(secret_key_b64: Optional[str]) -> bytes:
if not secret_key_b64:
raise CryptoUnavailable(
"CRM_GMAIL_SECRET_KEY not set; cannot encrypt/decrypt OAuth tokens. "
"DWD auth does not require this."
)
try:
key = base64.b64decode(secret_key_b64)
except Exception as e:
raise CryptoUnavailable(f"CRM_GMAIL_SECRET_KEY not valid base64: {e}") from e
if len(key) < 32:
raise CryptoUnavailable(
f"CRM_GMAIL_SECRET_KEY decodes to {len(key)} bytes; need >= 32."
)
return key[:32] # AES-256
def encrypt(plaintext: bytes, *, secret_key_b64: Optional[str]) -> bytes:
if not _AVAILABLE:
raise CryptoUnavailable("cryptography library not installed")
key = _load_key(secret_key_b64)
nonce = secrets.token_bytes(NONCE_LEN)
ct = AESGCM(key).encrypt(nonce, plaintext, None)
return bytes([VERSION]) + nonce + ct
def decrypt(blob: bytes, *, secret_key_b64: Optional[str]) -> bytes:
if not _AVAILABLE:
raise CryptoUnavailable("cryptography library not installed")
if not blob or len(blob) < 1 + NONCE_LEN + 16:
raise ValueError("ciphertext too short")
version = blob[0]
if version != VERSION:
raise ValueError(f"unsupported crypto version: {version}")
nonce = blob[1:1 + NONCE_LEN]
ct = blob[1 + NONCE_LEN:]
key = _load_key(secret_key_b64)
return AESGCM(key).decrypt(nonce, ct, None)
def generate_secret_key_b64() -> str:
"""Helper for initial setup: returns a fresh base64-encoded 32-byte key you can drop into env."""
return base64.b64encode(os.urandom(32)).decode("ascii")
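The storage format described in the module docstring is a simple byte framing, and it can be illustrated without the `cryptography` package. This is a sketch using only the standard library, assuming the ciphertext and tag are opaque bytes produced elsewhere; `frame`, `unframe`, and `strict_key` are hypothetical names, and `strict_key` is intentionally stricter than `_load_key` because it passes `validate=True` to reject non-base64 characters.

```python
import base64
import binascii

VERSION = 1
NONCE_LEN = 12
TAG_LEN = 16  # AES-GCM appends a 16-byte tag to the ciphertext


def frame(nonce: bytes, ct_and_tag: bytes) -> bytes:
    """Pack version(1) || nonce(12) || ciphertext+tag(N)."""
    if len(nonce) != NONCE_LEN:
        raise ValueError("nonce must be 12 bytes")
    return bytes([VERSION]) + nonce + ct_and_tag


def unframe(blob: bytes) -> tuple[int, bytes, bytes]:
    """Split a stored blob back into (version, nonce, ciphertext+tag)."""
    if len(blob) < 1 + NONCE_LEN + TAG_LEN:
        raise ValueError("ciphertext too short")
    version = blob[0]
    if version != VERSION:
        raise ValueError(f"unsupported crypto version: {version}")
    return version, blob[1:1 + NONCE_LEN], blob[1 + NONCE_LEN:]


def strict_key(secret_key_b64: str) -> bytes:
    """Assumption: a stricter key check than _load_key (rejects stray characters)."""
    try:
        key = base64.b64decode(secret_key_b64.strip(), validate=True)
    except (binascii.Error, ValueError) as e:
        raise ValueError(f"key is not valid base64: {e}") from e
    if len(key) < 32:
        raise ValueError(f"key decodes to {len(key)} bytes; need >= 32")
    return key[:32]


if __name__ == "__main__":
    blob = frame(b"\x00" * NONCE_LEN, b"\xaa" * TAG_LEN)
    print(unframe(blob))
```

Keeping the version byte first leaves room for a future algorithm or key rotation scheme without having to guess the format of existing rows.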
+416
View File
@@ -0,0 +1,416 @@
"""
Data-access layer for the email_integration module.
SQL touching the emails and email_* tables belongs here. Other modules should
call named helpers rather than writing SQL inline. This keeps schema changes contained.
Connection pattern matches server.py get_db():
- WAL mode, foreign keys on, busy_timeout
- sqlite3.Row row_factory
The caller is responsible for committing / closing.
"""
import json
import os
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Iterable, Optional
# ------------------------------------------------------------------ migrations
def apply_migrations(cursor: sqlite3.Cursor) -> None:
"""Apply all .sql migration files in migrations/ in lexicographic order.
Called from server.init_db(). Idempotent. Does not log past migrations in
a table yet — each file is guarded by CREATE ... IF NOT EXISTS etc. If
we ever need more complex migrations, add a schema_migrations table.
"""
here = os.path.dirname(os.path.abspath(__file__))
mdir = os.path.join(here, "migrations")
if not os.path.isdir(mdir):
return
for name in sorted(os.listdir(mdir)):
if not name.endswith(".sql"):
continue
path = os.path.join(mdir, name)
with open(path, "r") as f:
sql = f.read()
cursor.executescript(sql)
# ------------------------------------------------------------------ utils
def _uuid() -> str:
return str(uuid.uuid4())
def _now_iso() -> str:
return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
def _json(v) -> str:
return json.dumps(v, separators=(",", ":"))
# ------------------------------------------------------------------ email_accounts
def list_sync_ready_accounts(conn: sqlite3.Connection) -> list[sqlite3.Row]:
cur = conn.cursor()
cur.execute(
"SELECT * FROM email_accounts "
"WHERE sync_enabled = 1 AND sync_status IN ('pending','active') "
"ORDER BY last_synced_at IS NOT NULL, last_synced_at"
)
return cur.fetchall()
def get_account_by_email(conn: sqlite3.Connection, email_address: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute("SELECT * FROM email_accounts WHERE email_address = ?", (email_address,))
return cur.fetchone()
def upsert_account(conn: sqlite3.Connection, *, user_id: str, email_address: str,
auth_method: str) -> str:
existing = get_account_by_email(conn, email_address)
if existing:
return existing["id"]
account_id = _uuid()
conn.execute(
"INSERT INTO email_accounts (id, user_id, email_address, auth_method) "
"VALUES (?, ?, ?, ?)",
(account_id, user_id, email_address, auth_method),
)
return account_id
def set_account_status(conn: sqlite3.Connection, account_id: str, *,
status: str, error: Optional[str] = None) -> None:
conn.execute(
"UPDATE email_accounts SET sync_status = ?, sync_error = ?, "
"updated_at = datetime('now') WHERE id = ?",
(status, error, account_id),
)
def set_account_checkpoint(conn: sqlite3.Connection, account_id: str, *,
history_id: Optional[str] = None,
backfill_cursor: Optional[str] = None,
backfill_complete: Optional[bool] = None,
last_synced_at: Optional[str] = None) -> None:
sets, params = [], []
if history_id is not None:
sets.append("last_history_id = ?"); params.append(history_id)
if backfill_cursor is not None:
sets.append("backfill_cursor = ?"); params.append(backfill_cursor)
if backfill_complete is not None:
sets.append("backfill_complete = ?"); params.append(1 if backfill_complete else 0)
if last_synced_at is not None:
sets.append("last_synced_at = ?"); params.append(last_synced_at)
if not sets:
return
sets.append("updated_at = datetime('now')")
params.append(account_id)
conn.execute(f"UPDATE email_accounts SET {', '.join(sets)} WHERE id = ?", params)
# ------------------------------------------------------------------ emails
def find_email_by_rfc_id(conn: sqlite3.Connection, rfc_message_id: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute("SELECT * FROM emails WHERE rfc_message_id = ?", (rfc_message_id,))
return cur.fetchone()
def find_email_id_by_any_rfc_id(conn: sqlite3.Connection,
rfc_ids: Iterable[str]) -> Optional[str]:
ids = [r for r in rfc_ids if r]
if not ids:
return None
placeholders = ",".join("?" for _ in ids)
cur = conn.cursor()
cur.execute(
f"SELECT id FROM emails WHERE rfc_message_id IN ({placeholders}) "
"ORDER BY sent_at ASC LIMIT 1",
ids,
)
row = cur.fetchone()
return row["id"] if row else None
def insert_email(conn: sqlite3.Connection, *, parsed: dict, match_status: str) -> str:
    """Insert a fresh emails row plus its email_recipients rows. Returns email_id.

    Caller must ensure no row exists for parsed['rfc_message_id']; use
    find_email_by_rfc_id first.
    """
    email_id = _uuid()
    conn.execute(
        """INSERT INTO emails
           (id, rfc_message_id, gmail_thread_id, rfc_thread_root_id, subject,
            from_email, from_name, to_emails_json, cc_emails_json, bcc_emails_json,
            reply_to, sent_at, body_text, body_html, snippet, in_reply_to,
            references_json, has_attachments, size_estimate, is_matched,
            match_status, raw_headers_json)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            email_id,
            parsed["rfc_message_id"],
            parsed.get("gmail_thread_id"),
            parsed.get("rfc_thread_root_id"),
            parsed.get("subject"),
            parsed["from_email"],
            parsed.get("from_name"),
            _json(parsed.get("to", [])),
            _json(parsed.get("cc", [])),
            _json(parsed.get("bcc", [])),
            parsed.get("reply_to"),
            parsed["sent_at"],
            parsed.get("body_text"),
            parsed.get("body_html"),
            parsed.get("snippet"),
            parsed.get("in_reply_to"),
            _json(parsed.get("references", [])),
            1 if parsed.get("attachments") else 0,
            parsed.get("size_estimate"),
            1 if match_status == "matched" else 0,
            match_status,
            _json(parsed.get("raw_headers", {})) if parsed.get("raw_headers") else None,
        ),
    )
    # Recipients. `from` and `reply_to` are single addresses; `to`/`cc`/`bcc`
    # are lists of plain strings or {"email", "name"} dicts.
    for kind in ("from", "to", "cc", "bcc", "reply_to"):
        addrs = []
        if kind == "from":
            if parsed.get("from_email"):
                addrs = [(parsed["from_email"], parsed.get("from_name"))]
        elif kind == "reply_to":
            if parsed.get("reply_to"):
                addrs = [(parsed["reply_to"], None)]
        else:
            # `or []` tolerates keys that are present but set to None, which
            # would otherwise raise TypeError when iterated.
            for a in parsed.get(kind) or []:
                if isinstance(a, dict):
                    addrs.append((a.get("email"), a.get("name")))
                else:
                    addrs.append((a, None))
        for address, name in addrs:
            if not address:
                continue
            conn.execute(
                "INSERT INTO email_recipients (id, email_id, address, display_name, kind) "
                "VALUES (?, ?, ?, ?, ?)",
                (_uuid(), email_id, address.lower().strip(), name, kind),
            )
    return email_id
def set_email_thread(conn: sqlite3.Connection, email_id: str, thread_id: str) -> None:
conn.execute(
"UPDATE emails SET thread_id = ?, updated_at = datetime('now') WHERE id = ?",
(thread_id, email_id),
)
# ------------------------------------------------------------------ sightings
def upsert_sighting(conn: sqlite3.Connection, *, email_id: str, account_id: str,
gmail_message_id: str, gmail_thread_id: str,
labels: list[str], is_sent: bool) -> None:
conn.execute(
"""INSERT OR IGNORE INTO email_account_messages
(id, email_id, account_id, gmail_message_id, gmail_thread_id,
labels_json, is_sent)
VALUES (?, ?, ?, ?, ?, ?, ?)""",
(_uuid(), email_id, account_id, gmail_message_id, gmail_thread_id,
_json(labels), 1 if is_sent else 0),
)
def update_sighting_labels(conn: sqlite3.Connection, *, account_id: str,
gmail_message_id: str, labels: list[str]) -> None:
conn.execute(
"UPDATE email_account_messages SET labels_json = ? "
"WHERE account_id = ? AND gmail_message_id = ?",
(_json(labels), account_id, gmail_message_id),
)
def tombstone_sighting(conn: sqlite3.Connection, *, account_id: str,
gmail_message_id: str) -> None:
conn.execute(
"UPDATE email_account_messages SET deleted_at = datetime('now') "
"WHERE account_id = ? AND gmail_message_id = ?",
(account_id, gmail_message_id),
)
# ------------------------------------------------------------------ attachments
def insert_attachment_stub(conn: sqlite3.Connection, *, email_id: str,
gmail_attachment_id: str, filename: str,
sanitized_filename: str, mime_type: Optional[str],
size_bytes: Optional[int], storage_path: str) -> str:
att_id = _uuid()
conn.execute(
"""INSERT INTO email_attachments
(id, email_id, gmail_attachment_id, filename, sanitized_filename,
mime_type, size_bytes, storage_path)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
(att_id, email_id, gmail_attachment_id, filename, sanitized_filename,
mime_type, size_bytes, storage_path),
)
return att_id
def mark_attachment_downloaded(conn: sqlite3.Connection, attachment_id: str, *,
sha256_hex: str, size_bytes: int) -> None:
conn.execute(
"UPDATE email_attachments SET download_status = 'downloaded', "
"sha256_hex = ?, size_bytes = ?, downloaded_at = datetime('now') "
"WHERE id = ?",
(sha256_hex, size_bytes, attachment_id),
)
def mark_attachment_failed(conn: sqlite3.Connection, attachment_id: str, *,
error: str) -> None:
conn.execute(
"UPDATE email_attachments SET download_status = 'failed', "
"download_attempts = download_attempts + 1, download_error = ? "
"WHERE id = ?",
(error, attachment_id),
)
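def pending_attachments(conn: sqlite3.Connection, limit: int = 50) -> list[sqlite3.Row]:
    """Return up to `limit` pending attachments joined to their account sightings.

    The join yields one row per (attachment, account sighting), so an email seen
    in several mailboxes appears once per account. Note that
    mark_attachment_failed() moves a row to 'failed', so it leaves this queue
    unless something resets download_status to 'pending'; the download_attempts
    cap only applies to rows that have been reset that way.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT a.*, eam.gmail_message_id, eam.account_id "
        "FROM email_attachments a "
        "JOIN email_account_messages eam ON eam.email_id = a.email_id "
        "WHERE a.download_status = 'pending' AND a.download_attempts < 5 "
        "LIMIT ?",
        (limit,),
    )
    return cur.fetchall()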
# ------------------------------------------------------------------ threads
def find_thread_by_gmail_id(conn: sqlite3.Connection, gmail_thread_id: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute(
"SELECT * FROM email_threads WHERE gmail_thread_id = ?",
(gmail_thread_id,),
)
return cur.fetchone()
def find_thread_by_rfc_root(conn: sqlite3.Connection, rfc_root: str) -> Optional[sqlite3.Row]:
cur = conn.cursor()
cur.execute(
"SELECT * FROM email_threads WHERE rfc_thread_root_id = ?",
(rfc_root,),
)
return cur.fetchone()
def create_thread(conn: sqlite3.Connection, *, gmail_thread_id: Optional[str],
rfc_thread_root_id: Optional[str], subject_normalized: Optional[str],
first_message_at: Optional[str]) -> str:
thread_id = _uuid()
conn.execute(
"""INSERT INTO email_threads
(id, gmail_thread_id, rfc_thread_root_id, subject_normalized,
first_message_at, last_message_at, message_count)
VALUES (?, ?, ?, ?, ?, ?, 0)""",
(thread_id, gmail_thread_id, rfc_thread_root_id, subject_normalized,
first_message_at, first_message_at),
)
return thread_id
def rollup_thread(conn: sqlite3.Connection, thread_id: str) -> None:
"""Recompute count / last_message_at / participants from member emails.
Cheap at 5-person team volumes. For larger deployments swap to triggers.
"""
cur = conn.cursor()
cur.execute(
"SELECT COUNT(*) AS n, MIN(sent_at) AS first, MAX(sent_at) AS last, "
"MAX(is_matched) AS matched FROM emails WHERE thread_id = ?",
(thread_id,),
)
row = cur.fetchone()
if not row or row["n"] == 0:
return
cur.execute(
"SELECT DISTINCT address FROM email_recipients er "
"JOIN emails e ON e.id = er.email_id WHERE e.thread_id = ?",
(thread_id,),
)
participants = [r["address"] for r in cur.fetchall()]
conn.execute(
"UPDATE email_threads SET message_count = ?, first_message_at = ?, "
"last_message_at = ?, participant_count = ?, participants_json = ?, "
"is_matched = ?, updated_at = datetime('now') WHERE id = ?",
(row["n"], row["first"], row["last"], len(participants),
_json(participants), int(row["matched"] or 0), thread_id),
)
# ------------------------------------------------------------------ investor links
def insert_investor_link(conn: sqlite3.Connection, *, email_id: str,
link: dict) -> None:
conn.execute(
"""INSERT INTO email_investor_links
(id, email_id, fundraising_investor_id, fundraising_contact_id,
contact_id, organization_id, matched_address, match_kind,
match_confidence)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
_uuid(),
email_id,
link.get("fundraising_investor_id"),
link.get("fundraising_contact_id"),
link.get("contact_id"),
link.get("organization_id"),
link["matched_address"],
link["match_kind"],
float(link.get("match_confidence", 1.0)),
),
)
# ------------------------------------------------------------------ sync runs
def start_sync_run(conn: sqlite3.Connection, *, account_id: str, kind: str) -> str:
run_id = _uuid()
conn.execute(
"INSERT INTO email_sync_runs (id, account_id, kind, started_at, status) "
"VALUES (?, ?, ?, ?, 'running')",
(run_id, account_id, kind, _now_iso()),
)
return run_id
def finish_sync_run(conn: sqlite3.Connection, run_id: str, *, status: str,
stats: Optional[dict] = None, error: Optional[str] = None) -> None:
stats = stats or {}
conn.execute(
"""UPDATE email_sync_runs
SET finished_at = ?, status = ?, messages_seen = ?, messages_stored = ?,
attachments_saved = ?, api_calls = ?, retries = ?, error = ?
WHERE id = ?""",
(
_now_iso(), status,
int(stats.get("messages_seen", 0)),
int(stats.get("messages_stored", 0)),
int(stats.get("attachments_saved", 0)),
int(stats.get("api_calls", 0)),
int(stats.get("retries", 0)),
error,
run_id,
),
)
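A minimal sketch of how a caller might bracket one sync with the two helpers above, so that a run row is always closed as `ok` or `error`. The wrapper name `tracked_sync_run`, the trimmed-down table, and the in-memory database are assumptions for illustration only; the real module may wire this differently.

```python
import sqlite3
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone


def _now_iso() -> str:
    return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@contextmanager
def tracked_sync_run(conn, *, account_id, kind):
    """Hypothetical wrapper: open a run row, then close it as ok or error."""
    run_id = str(uuid.uuid4())
    stats = {"messages_seen": 0, "messages_stored": 0}
    conn.execute(
        "INSERT INTO email_sync_runs (id, account_id, kind, started_at, status) "
        "VALUES (?, ?, ?, ?, 'running')",
        (run_id, account_id, kind, _now_iso()),
    )
    status, error = "ok", None
    try:
        yield stats  # the caller mutates the counters while syncing
    except Exception as exc:
        status, error = "error", str(exc)
        raise
    finally:
        # Runs on both paths, so no row is left in 'running' after a crash.
        conn.execute(
            "UPDATE email_sync_runs SET finished_at = ?, status = ?, "
            "messages_seen = ?, messages_stored = ?, error = ? WHERE id = ?",
            (_now_iso(), status, stats["messages_seen"],
             stats["messages_stored"], error, run_id),
        )


def demo_sync_runs():
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    # Reduced column set; the real table is defined in the migration.
    conn.execute(
        "CREATE TABLE email_sync_runs (id TEXT PRIMARY KEY, account_id TEXT, "
        "kind TEXT, started_at TEXT, finished_at TEXT, status TEXT, "
        "messages_seen INTEGER DEFAULT 0, messages_stored INTEGER DEFAULT 0, "
        "error TEXT)"
    )
    with tracked_sync_run(conn, account_id="acct-1", kind="incremental") as stats:
        stats["messages_seen"] = 3
        stats["messages_stored"] = 2
    try:
        with tracked_sync_run(conn, account_id="acct-1", kind="backfill"):
            raise RuntimeError("boom")
    except RuntimeError:
        pass
    rows = conn.execute(
        "SELECT kind, status, messages_seen, error FROM email_sync_runs"
    )
    return [dict(r) for r in rows]
```

Closing the row in `finally` is the design choice being illustrated: the original error is re-raised, but the run is still marked `error` first.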
+79
View File
@@ -0,0 +1,79 @@
"""
Exception taxonomy for Gmail integration.
gmail_client._call() maps HTTP status codes to these exception types. The retry
loop in gmail_client._with_retry() inspects the class hierarchy to decide
whether to back off + retry or fail fast.
"""
class GmailError(Exception):
"""Base class for all Gmail-integration errors."""
def __init__(self, message: str = "", *, status: int = 0, payload: object = None):
super().__init__(message)
self.status = status
self.payload = payload
class AuthError(GmailError):
"""401 / 403 that is not a rate-limit. Requires operator intervention
(bad service account key, revoked OAuth, missing DWD scope). Not retried."""
class RateLimitError(GmailError):
    """429, or 403 with reason in {rateLimitExceeded, userRateLimitExceeded,
    quotaExceeded}. Retried with exponential backoff."""
class TransientError(GmailError):
"""5xx or network error. Retried with exponential backoff."""
class NotFoundError(GmailError):
"""404. For messages this usually means 'deleted in Gmail after we saw it';
for history this is HistoryExpiredError."""
class HistoryExpiredError(NotFoundError):
    """404 on history.list with startHistoryId. Gmail retains history only for
    a limited window (typically about a week, occasionally much less).
    Triggers date-based backfill fallback."""
class PermanentError(GmailError):
"""400 or other permanent failure. Skip and log; do not retry."""
def classify_http(status: int, payload: object) -> GmailError:
"""Map a Gmail API response to the appropriate exception type.
`payload` is the decoded JSON body if any; used to distinguish rate-limit
403s from pure auth 403s via the `reason` field Google returns.
"""
reason = ""
if isinstance(payload, dict):
try:
errs = payload.get("error", {}).get("errors") or []
if errs:
reason = str(errs[0].get("reason", ""))
except Exception: # pragma: no cover — defensive
pass
if status == 429:
return RateLimitError(f"rate limited: {reason}", status=status, payload=payload)
if status == 403:
if reason in ("rateLimitExceeded", "userRateLimitExceeded", "quotaExceeded"):
return RateLimitError(f"quota: {reason}", status=status, payload=payload)
return AuthError(f"forbidden: {reason}", status=status, payload=payload)
if status == 401:
return AuthError("unauthorized", status=status, payload=payload)
if status == 404:
return NotFoundError("not found", status=status, payload=payload)
if 500 <= status < 600:
return TransientError(f"server error {status}", status=status, payload=payload)
if 400 <= status < 500:
return PermanentError(f"client error {status}: {reason}", status=status, payload=payload)
return GmailError(f"unexpected status {status}", status=status, payload=payload)
RETRYABLE = (RateLimitError, TransientError)
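The 403 split above depends on the `reason` field inside Google's JSON error body. The sketch below shows that lookup in isolation and how it feeds a retry decision. The helper names `extract_reason` and `is_retryable` and the sample payloads are illustrative assumptions, not part of the module.

```python
RATE_LIMIT_REASONS = {"rateLimitExceeded", "userRateLimitExceeded", "quotaExceeded"}


def extract_reason(payload) -> str:
    """Return errors[0].reason from a Google-style error body, or ''."""
    if not isinstance(payload, dict):
        return ""
    error = payload.get("error")
    if not isinstance(error, dict):
        # OAuth-style bodies carry a plain string here, e.g. {"error": "invalid_grant"}.
        return ""
    errs = error.get("errors") or []
    if errs and isinstance(errs[0], dict):
        return str(errs[0].get("reason", ""))
    return ""


def is_retryable(status: int, payload) -> bool:
    """Mirror the taxonomy: 429, 5xx, and rate-limit 403s are retried."""
    if status == 429 or 500 <= status < 600:
        return True
    return status == 403 and extract_reason(payload) in RATE_LIMIT_REASONS


# Hypothetical payloads in the shape the classifier expects.
RATE_LIMITED = {"error": {"code": 403, "errors": [{"reason": "userRateLimitExceeded"}]}}
FORBIDDEN = {"error": {"code": 403, "errors": [{"reason": "insufficientPermissions"}]}}
```

A 403 with `FORBIDDEN` falls through to the auth path and is not retried, which matches the `AuthError` docstring above.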
+249
View File
@@ -0,0 +1,249 @@
"""
Thin Gmail API wrapper.
Responsibilities:
- HTTPS calls to https://gmail.googleapis.com/gmail/v1/users/me/*
- Per-account access-token injection via CredentialProvider
- Rate limiting via token bucket
- Retry loop with exponential backoff + jitter for RETRYABLE errors
- Batch requests for metadata fetches (multipart/mixed) — sketch provided
- Call-count accounting for observability (plumbed to email_sync_runs)
We call Gmail over raw urllib instead of the google-api-python-client to keep
the dependency surface small. If you prefer the Google SDK, replace _call()
with client calls; everything else is independent.
"""
import json
import random
import threading
import time
import urllib.error
import urllib.parse
import urllib.request
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional
from . import config as _cfg
from . import errors
BASE = "https://gmail.googleapis.com/gmail/v1/users"
# ---------------------------------------------------------------------------- token bucket
class _TokenBucket:
"""Simple per-account rate limiter. Call wait(cost) before each API call."""
def __init__(self, units_per_sec: int, burst: Optional[int] = None):
self._rate = float(units_per_sec)
self._burst = float(burst if burst is not None else units_per_sec)
self._tokens = self._burst
self._last = time.monotonic()
self._lock = threading.Lock()
def wait(self, cost: float) -> None:
while True:
with self._lock:
now = time.monotonic()
self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
self._last = now
if self._tokens >= cost:
self._tokens -= cost
return
needed = cost - self._tokens
sleep_for = needed / self._rate
time.sleep(sleep_for)
# ---------------------------------------------------------------------------- call stats
@dataclass
class CallStats:
api_calls: int = 0
retries: int = 0
bytes_in: int = 0
last_errors: list[str] = field(default_factory=list)
# ---------------------------------------------------------------------------- client
class GmailClient:
"""Per-account Gmail client. Bind one instance per sync run."""
def __init__(self, credential_provider, email_address: str, stats: Optional[CallStats] = None):
self._creds = credential_provider
self._email = email_address
self._bucket = _TokenBucket(units_per_sec=_cfg.CONFIG.rate_limit_units_per_sec_per_account)
self.stats = stats or CallStats()
# -------------------------------------------------------------- messages.*
def list_messages(self, *, q: str = "", page_token: Optional[str] = None,
max_results: int = 500, label_ids: Optional[list[str]] = None) -> dict:
"""https://developers.google.com/gmail/api/reference/rest/v1/users.messages/list"""
params = {"maxResults": str(max_results)}
if q:
params["q"] = q
if page_token:
params["pageToken"] = page_token
        if label_ids:
            # Pass the list through; _call() encodes it with doseq=True as
            # repeated labelIds=... query parameters.
            params["labelIds"] = list(label_ids)
return self._get("/messages", params=params, cost=5)
def get_message(self, message_id: str, *, format: str = "metadata",
metadata_headers: Optional[list[str]] = None) -> dict:
params = {"format": format}
if format == "metadata" and metadata_headers:
params["metadataHeaders"] = metadata_headers
return self._get(f"/messages/{message_id}", params=params, cost=5)
def get_attachment(self, message_id: str, attachment_id: str) -> dict:
return self._get(
f"/messages/{message_id}/attachments/{attachment_id}",
params=None,
cost=5,
)
# -------------------------------------------------------------- history.*
def list_history(self, *, start_history_id: str, page_token: Optional[str] = None,
history_types: Optional[list[str]] = None) -> dict:
params = {"startHistoryId": start_history_id, "maxResults": "500"}
if page_token:
params["pageToken"] = page_token
if history_types:
params["historyTypes"] = history_types
try:
return self._get("/history", params=params, cost=2)
except errors.NotFoundError as e:
# Gmail returns 404 when startHistoryId is too old. Wrap for callers.
raise errors.HistoryExpiredError(
"startHistoryId no longer available", status=404, payload=getattr(e, "payload", None)
) from e
# -------------------------------------------------------------- profile
def get_profile(self) -> dict:
return self._get("/profile", params=None, cost=1)
# -------------------------------------------------------------- iteration helpers
def iter_messages(self, *, q: str = "") -> Iterator[dict]:
page_token: Optional[str] = None
while True:
resp = self.list_messages(q=q, page_token=page_token,
max_results=_cfg.CONFIG.backfill_page_size)
for m in resp.get("messages") or []:
yield m
page_token = resp.get("nextPageToken")
if not page_token:
return
def iter_history(self, *, start_history_id: str,
history_types: Optional[list[str]] = None) -> Iterator[dict]:
page_token: Optional[str] = None
while True:
resp = self.list_history(
start_history_id=start_history_id,
page_token=page_token,
history_types=history_types,
)
for h in resp.get("history") or []:
yield h
page_token = resp.get("nextPageToken")
if not page_token:
# Cache final historyId for caller to checkpoint.
self._last_history_id = resp.get("historyId")
return
@property
def last_history_id(self) -> Optional[str]:
return getattr(self, "_last_history_id", None)
# -------------------------------------------------------------- internals
def _get(self, path: str, *, params: Optional[dict], cost: float) -> dict:
return self._with_retry(lambda: self._call("GET", path, params=params, cost=cost))
def _call(self, method: str, path: str, *, params: Optional[dict] = None,
body: Optional[bytes] = None, cost: float = 1.0) -> dict:
self._bucket.wait(cost)
self.stats.api_calls += 1
qs = ""
if params:
# urllib.parse.urlencode with doseq=True handles repeated params
# like metadataHeaders=Foo&metadataHeaders=Bar correctly.
qs = "?" + urllib.parse.urlencode(params, doseq=True)
url = f"{BASE}/me{path}{qs}"
token = self._creds.access_token_for(self._email)
req = urllib.request.Request(url, method=method, data=body)
req.add_header("Authorization", f"Bearer {token.token}")
req.add_header("Accept", "application/json")
if body:
req.add_header("Content-Type", "application/json")
try:
with urllib.request.urlopen(req, timeout=30) as resp:
raw = resp.read()
self.stats.bytes_in += len(raw)
if not raw:
return {}
return json.loads(raw)
except urllib.error.HTTPError as e:
raw = e.read() or b""
self.stats.bytes_in += len(raw)
try:
payload = json.loads(raw) if raw else {}
except Exception:
payload = {"raw": raw.decode("utf-8", errors="replace")}
err = errors.classify_http(e.code, payload)
            # Keep only the status code and exception class name; the URL,
            # headers, and token never enter this log.
self.stats.last_errors.append(f"{e.code} {type(err).__name__}")
self.stats.last_errors = self.stats.last_errors[-10:]
raise err
except (urllib.error.URLError, TimeoutError) as e:
raise errors.TransientError(f"network error: {e}") from e
def _with_retry(self, fn):
cfg = _cfg.CONFIG
attempts = 0
delay = cfg.retry_initial_delay_sec
while True:
try:
return fn()
except errors.RETRYABLE as e:
attempts += 1
if attempts >= cfg.retry_max_attempts:
raise
self.stats.retries += 1
# Full jitter
sleep_for = random.uniform(0, min(delay, cfg.retry_max_delay_sec))
time.sleep(sleep_for)
delay = min(delay * 2, cfg.retry_max_delay_sec)
# Non-retryable errors propagate immediately.
# ---------------------------------------------------------------------------- batch fetch sketch
def batch_get_metadata(client: GmailClient, message_ids: list[str],
headers: list[str]) -> dict[str, dict]:
"""Fetch metadata for up to ~100 messages.
TODO: implement using Gmail's multipart/mixed batch endpoint at
https://www.googleapis.com/batch/gmail/v1 for efficiency. In the scaffold
we fall back to serial gets so the logic is correct from day 1.
"""
out: dict[str, dict] = {}
for mid in message_ids:
try:
out[mid] = client.get_message(mid, format="metadata", metadata_headers=headers)
except errors.NotFoundError:
# Message deleted between list and get — skip.
continue
return out
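The retry loop in `_with_retry()` uses full jitter: each sleep is drawn uniformly between zero and a cap that doubles until it reaches the maximum delay. The sketch below reproduces only that schedule so the bounds are easy to inspect. The function names and numeric settings are assumptions standing in for the `CONFIG.retry_*` values, which are not shown in this file.

```python
import random


def backoff_caps(initial: float, max_delay: float, max_attempts: int) -> list:
    """Upper bound of each sleep. max_attempts calls allow max_attempts - 1 sleeps."""
    caps, cap = [], initial
    for _ in range(max_attempts - 1):
        caps.append(min(cap, max_delay))
        cap = min(cap * 2, max_delay)
    return caps


def backoff_delays(initial: float, max_delay: float, max_attempts: int, rng=None) -> list:
    """Full jitter: each delay is uniform in [0, cap] for the matching cap."""
    rng = rng or random.Random()
    return [rng.uniform(0, cap) for cap in backoff_caps(initial, max_delay, max_attempts)]
```

With `initial=1.0`, `max_delay=8.0`, and `max_attempts=6`, the caps are 1, 2, 4, 8, 8 seconds, so the worst-case total wait before the final failure is 23 seconds.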
+215
View File
@@ -0,0 +1,215 @@
"""
Investor matching.
Builds an in-memory index of investor email addresses from:
- fundraising_contacts.email
- contacts.email
- organizations.email + organizations.website (domain only)
For each synced email, returns a list of investor links. Exact-email matches
beat domain matches; if any exact match exists, domain matches are suppressed.
The index is rebuilt every `REFRESH_INTERVAL_SEC` or on demand via rebuild().
"""
import re
import threading
import time
from dataclasses import dataclass
from typing import Optional
REFRESH_INTERVAL_SEC = 900 # 15 minutes
# Domains we never domain-match against (personal mailboxes).
COMMON_PERSONAL_DOMAINS = {
"gmail.com", "googlemail.com",
"outlook.com", "hotmail.com", "live.com", "msn.com",
"yahoo.com", "yahoo.co.uk", "ymail.com",
"icloud.com", "me.com", "mac.com",
"aol.com", "proton.me", "protonmail.com",
"pm.me", "fastmail.com", "tuta.io", "hey.com",
"duck.com", "zoho.com",
}
# Also skip matching on the team's own domain (they email each other).
# Populated from CONFIG.workspace_domain at rebuild time.
@dataclass
class MatchTarget:
fundraising_investor_id: Optional[str] = None
fundraising_contact_id: Optional[str] = None
contact_id: Optional[str] = None
organization_id: Optional[str] = None
investor_name: Optional[str] = None
@dataclass
class InvestorLink:
matched_address: str
match_kind: str # exact_email | domain_match | manual
match_confidence: float
target: MatchTarget
class InvestorIndex:
def __init__(self, own_domain: Optional[str] = None):
self._email_index: dict[str, MatchTarget] = {}
self._domain_index: dict[str, list[MatchTarget]] = {}
self._own_domain = (own_domain or "").lower() or None
self._last_built = 0.0
self._lock = threading.Lock()
# ------------------------------------------------------------------ build
def rebuild(self, db_conn_factory) -> None:
with self._lock:
email_idx: dict[str, MatchTarget] = {}
domain_idx: dict[str, list[MatchTarget]] = {}
conn = db_conn_factory()
try:
cur = conn.cursor()
# fundraising_contacts
cur.execute(
"SELECT fc.id, fc.email, fc.investor_id, fi.investor_name "
"FROM fundraising_contacts fc "
"LEFT JOIN fundraising_investors fi ON fi.id = fc.investor_id "
"WHERE fc.email IS NOT NULL AND fc.email != ''"
)
for r in cur.fetchall():
addr = (r["email"] or "").lower().strip()
if not _valid_email(addr):
continue
email_idx[addr] = MatchTarget(
fundraising_contact_id=r["id"],
fundraising_investor_id=r["investor_id"],
investor_name=r["investor_name"],
)
# contacts
cur.execute(
"SELECT id, email, organization_id FROM contacts "
"WHERE email IS NOT NULL AND email != ''"
)
for r in cur.fetchall():
addr = (r["email"] or "").lower().strip()
if not _valid_email(addr):
continue
# Don't overwrite a fundraising_contact match; they're higher signal.
email_idx.setdefault(addr, MatchTarget(
contact_id=r["id"],
organization_id=r["organization_id"],
))
# organizations — domain-only match source
cur.execute(
"SELECT id, name, email, website FROM organizations "
"WHERE (email IS NOT NULL AND email != '') OR (website IS NOT NULL AND website != '')"
)
for r in cur.fetchall():
for d in _domains_for_org(r):
if d in COMMON_PERSONAL_DOMAINS:
continue
if self._own_domain and d == self._own_domain:
continue
domain_idx.setdefault(d, []).append(MatchTarget(
organization_id=r["id"],
investor_name=r["name"],
))
finally:
conn.close()
self._email_index = email_idx
self._domain_index = domain_idx
self._last_built = time.time()
def rebuild_if_stale(self, db_conn_factory) -> None:
if time.time() - self._last_built > REFRESH_INTERVAL_SEC:
self.rebuild(db_conn_factory)
# ------------------------------------------------------------------ query
def match(self, addresses: set[str], *,
exclude_addresses: Optional[set[str]] = None) -> list[InvestorLink]:
        # Normalize exclusions the same way as candidates (lowercase + strip).
        excl = {a.lower().strip() for a in (exclude_addresses or set()) if a}
candidates = {a.lower().strip() for a in addresses if a} - excl
# Exclude own domain addresses (teammates emailing each other).
if self._own_domain:
candidates = {a for a in candidates
if not a.endswith("@" + self._own_domain)}
links: list[InvestorLink] = []
seen_targets: set[tuple] = set()
# Exact email matches first.
for addr in candidates:
t = self._email_index.get(addr)
if t:
key = (t.fundraising_contact_id, t.contact_id)
if key in seen_targets:
continue
seen_targets.add(key)
links.append(InvestorLink(
matched_address=addr,
match_kind="exact_email",
match_confidence=1.0,
target=t,
))
if links: # exact hits short-circuit domain matching
return links
# Domain fallback.
for addr in candidates:
_, _, domain = addr.partition("@")
if not domain or domain in COMMON_PERSONAL_DOMAINS:
continue
for t in self._domain_index.get(domain, []):
key = ("org", t.organization_id)
if key in seen_targets:
continue
seen_targets.add(key)
links.append(InvestorLink(
matched_address=addr,
match_kind="domain_match",
match_confidence=0.6,
target=t,
))
return links
# ---------------------------------------------------------------------------- helpers
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
def _valid_email(s: str) -> bool:
return bool(_EMAIL_RE.match(s))
def _domains_for_org(row) -> list[str]:
out: list[str] = []
if row["email"]:
_, _, d = row["email"].lower().partition("@")
if d:
out.append(d)
if row["website"]:
d = _domain_from_url(row["website"])
if d:
out.append(d)
return list({d for d in out if d})
def _domain_from_url(url: str) -> Optional[str]:
if not url:
return None
m = re.match(r"^\s*(?:https?://)?(?:www\.)?([^/:?#\s]+)", url.strip(), re.IGNORECASE)
if not m:
return None
return m.group(1).lower()
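The precedence rule in `InvestorIndex.match()` (any exact-email hit suppresses all domain matches) can be sketched with plain dictionaries. The function `match_addresses`, the tuple result shape, and the sample addresses are simplifications for illustration; the real method returns `InvestorLink` objects and also filters the team's own domain.

```python
PERSONAL_DOMAINS = {"gmail.com", "outlook.com"}  # abbreviated stand-in list


def match_addresses(addresses, email_index, domain_index):
    """Return (address, kind, confidence, target) tuples; exact beats domain."""
    candidates = sorted({a.lower().strip() for a in addresses if a})

    exact = [(a, "exact_email", 1.0, email_index[a])
             for a in candidates if a in email_index]
    if exact:
        return exact  # exact hits short-circuit the domain fallback

    links, seen = [], set()
    for addr in candidates:
        domain = addr.partition("@")[2]
        if not domain or domain in PERSONAL_DOMAINS:
            continue
        for target in domain_index.get(domain, []):
            if target not in seen:  # one link per organization
                seen.add(target)
                links.append((addr, "domain_match", 0.6, target))
    return links


EMAIL_INDEX = {"ana@fund.example": "contact-1"}
DOMAIN_INDEX = {"fund.example": ["org-1"], "gmail.com": ["org-x"]}
```

A thread that includes both `ana@fund.example` and an unknown colleague at the same domain yields a single exact link; the domain match appears only when no known address is present.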
@@ -0,0 +1,192 @@
-- Gmail Integration — Phase 1 migration
-- Creates all tables for email capture, matching, threading, attachments.
-- This migration is IDEMPOTENT: safe to re-run.
-- Applied by email_integration.db.apply_migrations() on server startup when
-- CRM_GMAIL_INTEGRATION_ENABLED is truthy.
--
-- DO NOT modify this file in place after it ships. Create 0002_*.sql, etc.
-- ============================================================================
-- email_accounts — one row per enrolled team-member mailbox
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_accounts (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL,
email_address TEXT NOT NULL UNIQUE,
auth_method TEXT NOT NULL, -- 'dwd' | 'oauth'
oauth_refresh_enc BLOB,
oauth_token_enc BLOB,
oauth_token_exp TEXT,
sync_enabled INTEGER NOT NULL DEFAULT 1,
sync_status TEXT NOT NULL DEFAULT 'pending',
sync_error TEXT,
last_history_id TEXT,
last_synced_at TEXT,
backfill_complete INTEGER NOT NULL DEFAULT 0,
backfill_cursor TEXT,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(user_id) REFERENCES users(id)
);
CREATE INDEX IF NOT EXISTS idx_email_accounts_user ON email_accounts(user_id);
CREATE INDEX IF NOT EXISTS idx_email_accounts_sync ON email_accounts(sync_enabled, sync_status);
-- ============================================================================
-- emails — canonical email record, dedup'd across accounts by RFC Message-ID
-- ============================================================================
CREATE TABLE IF NOT EXISTS emails (
id TEXT PRIMARY KEY,
rfc_message_id TEXT NOT NULL UNIQUE,
gmail_thread_id TEXT,
rfc_thread_root_id TEXT,
thread_id TEXT, -- FK email_threads.id (populated by threads.py)
subject TEXT,
from_email TEXT NOT NULL,
from_name TEXT,
to_emails_json TEXT NOT NULL DEFAULT '[]',
cc_emails_json TEXT NOT NULL DEFAULT '[]',
bcc_emails_json TEXT NOT NULL DEFAULT '[]',
reply_to TEXT,
sent_at TEXT NOT NULL,
body_text TEXT,
body_html TEXT,
snippet TEXT,
in_reply_to TEXT,
references_json TEXT DEFAULT '[]',
has_attachments INTEGER NOT NULL DEFAULT 0,
size_estimate INTEGER,
is_matched INTEGER NOT NULL DEFAULT 0,
match_status TEXT NOT NULL DEFAULT 'unmatched', -- unmatched|matched|skipped
raw_headers_json TEXT,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_emails_thread ON emails(gmail_thread_id);
CREATE INDEX IF NOT EXISTS idx_emails_rfc_thread ON emails(rfc_thread_root_id);
CREATE INDEX IF NOT EXISTS idx_emails_thread_fk ON emails(thread_id);
CREATE INDEX IF NOT EXISTS idx_emails_from ON emails(from_email);
CREATE INDEX IF NOT EXISTS idx_emails_sent_at ON emails(sent_at);
CREATE INDEX IF NOT EXISTS idx_emails_matched ON emails(is_matched, sent_at);
CREATE INDEX IF NOT EXISTS idx_emails_in_reply_to ON emails(in_reply_to);
-- ============================================================================
-- email_recipients — denormalized for fast address lookups
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_recipients (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
address TEXT NOT NULL,
display_name TEXT,
kind TEXT NOT NULL, -- from|to|cc|bcc|reply_to
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_email_recipients_addr ON email_recipients(address);
CREATE INDEX IF NOT EXISTS idx_email_recipients_email ON email_recipients(email_id);
-- ============================================================================
-- email_account_messages — per-mailbox sighting of an email
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_account_messages (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
account_id TEXT NOT NULL,
gmail_message_id TEXT NOT NULL,
gmail_thread_id TEXT NOT NULL,
labels_json TEXT DEFAULT '[]',
is_sent INTEGER NOT NULL DEFAULT 0,
first_seen_at TEXT DEFAULT (datetime('now')),
deleted_at TEXT,
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE,
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE,
UNIQUE(account_id, gmail_message_id)
);
CREATE INDEX IF NOT EXISTS idx_eam_email ON email_account_messages(email_id);
CREATE INDEX IF NOT EXISTS idx_eam_account ON email_account_messages(account_id);
CREATE INDEX IF NOT EXISTS idx_eam_gmail_msg ON email_account_messages(gmail_message_id);
-- ============================================================================
-- email_attachments — metadata; bytes on disk under data/email_attachments/
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_attachments (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
gmail_attachment_id TEXT NOT NULL,
filename TEXT NOT NULL,
sanitized_filename TEXT NOT NULL,
mime_type TEXT,
size_bytes INTEGER,
sha256_hex TEXT,
storage_path TEXT NOT NULL,
download_status TEXT NOT NULL DEFAULT 'pending', -- pending|downloaded|failed|skipped
download_attempts INTEGER NOT NULL DEFAULT 0,
download_error TEXT,
downloaded_at TEXT,
created_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_attach_email ON email_attachments(email_id);
CREATE INDEX IF NOT EXISTS idx_attach_sha ON email_attachments(sha256_hex);
CREATE INDEX IF NOT EXISTS idx_attach_status ON email_attachments(download_status);
-- ============================================================================
-- email_threads — thread roll-up for UI
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_threads (
id TEXT PRIMARY KEY,
gmail_thread_id TEXT,
rfc_thread_root_id TEXT,
subject_normalized TEXT,
first_message_at TEXT,
last_message_at TEXT,
message_count INTEGER NOT NULL DEFAULT 0,
participant_count INTEGER NOT NULL DEFAULT 0,
participants_json TEXT DEFAULT '[]',
is_matched INTEGER NOT NULL DEFAULT 0,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_threads_gmail_uniq ON email_threads(gmail_thread_id)
WHERE gmail_thread_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_threads_rfc_root ON email_threads(rfc_thread_root_id);
CREATE INDEX IF NOT EXISTS idx_threads_last_msg ON email_threads(last_message_at);
-- ============================================================================
-- email_investor_links — matched investors
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_investor_links (
id TEXT PRIMARY KEY,
email_id TEXT NOT NULL,
fundraising_investor_id TEXT,
fundraising_contact_id TEXT,
contact_id TEXT,
organization_id TEXT,
matched_address TEXT NOT NULL,
match_kind TEXT NOT NULL, -- exact_email|domain_match|manual
match_confidence REAL NOT NULL DEFAULT 1.0,
created_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY(email_id) REFERENCES emails(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_eil_email ON email_investor_links(email_id);
CREATE INDEX IF NOT EXISTS idx_eil_investor ON email_investor_links(fundraising_investor_id);
CREATE INDEX IF NOT EXISTS idx_eil_fr_contact ON email_investor_links(fundraising_contact_id);
CREATE INDEX IF NOT EXISTS idx_eil_contact ON email_investor_links(contact_id);
-- ============================================================================
-- email_sync_runs — per-run observability
-- ============================================================================
CREATE TABLE IF NOT EXISTS email_sync_runs (
id TEXT PRIMARY KEY,
account_id TEXT NOT NULL,
kind TEXT NOT NULL, -- backfill|incremental|manual
started_at TEXT NOT NULL,
finished_at TEXT,
status TEXT NOT NULL, -- running|ok|error|partial
messages_seen INTEGER NOT NULL DEFAULT 0,
messages_stored INTEGER NOT NULL DEFAULT 0,
attachments_saved INTEGER NOT NULL DEFAULT 0,
api_calls INTEGER NOT NULL DEFAULT 0,
retries INTEGER NOT NULL DEFAULT 0,
error TEXT,
FOREIGN KEY(account_id) REFERENCES email_accounts(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_sync_runs_account ON email_sync_runs(account_id, started_at);
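The schema separates the canonical message (`emails`, unique on `rfc_message_id`) from per-mailbox sightings (`email_account_messages`, unique on account plus Gmail message id). A rough sketch of the resulting write path, using an in-memory SQLite database and a reduced column set; `record_sighting` is a hypothetical helper, and the actual insert logic lives in the Python modules rather than in this migration.

```python
import sqlite3
import uuid


def record_sighting(conn, *, rfc_message_id, account_id, gmail_message_id):
    """Insert the canonical email once, then one sighting row per mailbox."""
    conn.execute(
        "INSERT OR IGNORE INTO emails (id, rfc_message_id) VALUES (?, ?)",
        (str(uuid.uuid4()), rfc_message_id),
    )
    email_id = conn.execute(
        "SELECT id FROM emails WHERE rfc_message_id = ?", (rfc_message_id,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO email_account_messages "
        "(id, email_id, account_id, gmail_message_id) VALUES (?, ?, ?, ?)",
        (str(uuid.uuid4()), email_id, account_id, gmail_message_id),
    )
    return email_id


def demo_dedup():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE emails (
            id TEXT PRIMARY KEY,
            rfc_message_id TEXT NOT NULL UNIQUE);
        CREATE TABLE email_account_messages (
            id TEXT PRIMARY KEY,
            email_id TEXT NOT NULL,
            account_id TEXT NOT NULL,
            gmail_message_id TEXT NOT NULL,
            UNIQUE(account_id, gmail_message_id));
    """)
    msg = "m1@mail.example"
    first = record_sighting(conn, rfc_message_id=msg, account_id="acct-a",
                            gmail_message_id="g-1")
    second = record_sighting(conn, rfc_message_id=msg, account_id="acct-b",
                             gmail_message_id="g-9")
    # Re-syncing the same mailbox must not add a second sighting.
    record_sighting(conn, rfc_message_id=msg, account_id="acct-a",
                    gmail_message_id="g-1")
    emails = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
    sightings = conn.execute(
        "SELECT COUNT(*) FROM email_account_messages").fetchone()[0]
    return first == second, emails, sightings
```

Two teammates on the same thread therefore produce one `emails` row and two sighting rows, and repeated syncs leave both counts unchanged.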
+283
View File
@@ -0,0 +1,283 @@
"""
Parse a Gmail `users.messages.get` response (format=full) into a flat dict
ready for db.insert_email().
Input shape (abbreviated):
{
"id": "...", # Gmail message id
"threadId": "...",
"labelIds": ["INBOX","IMPORTANT",...],
"snippet": "...",
"historyId": "...",
"internalDate": "1713657600000", # ms epoch, authoritative
"sizeEstimate": 12345,
"payload": {
"headers": [{"name":"Subject","value":"..."}, ...],
"mimeType": "multipart/mixed",
"parts": [...recursive...],
"body": {"data": "<base64url>", "size": ...}
}
}
"""
import base64
import email.utils
import email.header
import re
from datetime import datetime, timezone
from typing import Any, Iterable, Optional
from html.parser import HTMLParser
# ---------------------------------------------------------------------------- public
def parse(message: dict, *, owning_account_address: Optional[str] = None) -> dict:
"""Parse a Gmail message payload into our canonical dict shape."""
headers = _header_map(message.get("payload", {}).get("headers") or [])
from_name, from_email = _split_addr(headers.get("from", ""))
to_list = _parse_address_list(headers.get("to", ""))
cc_list = _parse_address_list(headers.get("cc", ""))
bcc_list = _parse_address_list(headers.get("bcc", ""))
reply_to = _split_addr(headers.get("reply-to", ""))[1] or None
sent_at = _parse_date_header(headers.get("date"), fallback_ms=message.get("internalDate"))
rfc_mid = headers.get("message-id", "").strip() or f"synthetic-{message.get('id')}@ten31.local"
rfc_mid = _strip_angle_brackets(rfc_mid)
in_reply_to = _strip_angle_brackets(headers.get("in-reply-to", "").strip()) or None
references = _split_references(headers.get("references", ""))
rfc_thread_root_id = references[0] if references else (in_reply_to or rfc_mid)
body_text, body_html, attachments = _walk_payload(message.get("payload", {}))
subject = _decode_rfc2047(headers.get("subject") or "")
labels = message.get("labelIds") or []
is_sent = "SENT" in labels
return {
"gmail_message_id": message.get("id"),
"gmail_thread_id": message.get("threadId"),
"rfc_message_id": rfc_mid,
"rfc_thread_root_id": rfc_thread_root_id,
"in_reply_to": in_reply_to,
"references": references,
"subject": subject,
"from_email": (from_email or "").lower(),
"from_name": from_name,
"to": [{"email": e.lower(), "name": n} for n, e in to_list if e],
"cc": [{"email": e.lower(), "name": n} for n, e in cc_list if e],
"bcc": [{"email": e.lower(), "name": n} for n, e in bcc_list if e],
"reply_to": reply_to.lower() if reply_to else None,
"sent_at": sent_at,
"body_text": _cap_text(body_text),
"body_html": _cap_text(body_html),
"snippet": message.get("snippet"),
"attachments": attachments,
"size_estimate": message.get("sizeEstimate"),
"labels": labels,
"is_sent": is_sent,
"raw_headers": headers,
"owning_account": owning_account_address,
}
# ---------------------------------------------------------------------------- headers
def _header_map(header_list: Iterable[dict]) -> dict[str, str]:
"""Case-insensitive keys. Last-write-wins for duplicates (rare)."""
out: dict[str, str] = {}
for h in header_list:
name = (h.get("name") or "").lower()
out[name] = h.get("value") or ""
return out
def _decode_rfc2047(s: str) -> str:
if not s:
return ""
try:
parts = email.header.decode_header(s)
pieces = []
for text, charset in parts:
if isinstance(text, bytes):
try:
pieces.append(text.decode(charset or "utf-8", errors="replace"))
except LookupError:
pieces.append(text.decode("utf-8", errors="replace"))
else:
pieces.append(text)
return "".join(pieces)
except Exception:
return s
def _split_addr(raw: str) -> tuple[Optional[str], Optional[str]]:
if not raw:
return (None, None)
name, addr = email.utils.parseaddr(raw)
return (_decode_rfc2047(name) or None, addr or None)
def _parse_address_list(raw: str) -> list[tuple[Optional[str], Optional[str]]]:
if not raw:
return []
parsed = email.utils.getaddresses([raw])
return [(_decode_rfc2047(n) or None, a or None) for n, a in parsed if a]
def _parse_date_header(raw: Optional[str], *, fallback_ms: Optional[str]) -> str:
# Prefer RFC Date header, fall back to Gmail internalDate (epoch ms).
if raw:
try:
dt = email.utils.parsedate_to_datetime(raw)
if dt is not None:
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
except (TypeError, ValueError):
pass
if fallback_ms:
try:
dt = datetime.fromtimestamp(int(fallback_ms) / 1000.0, tz=timezone.utc)
return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
except (TypeError, ValueError):
pass
return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
def _split_references(raw: str) -> list[str]:
if not raw:
return []
return [_strip_angle_brackets(p) for p in raw.split() if p.strip()]
def _strip_angle_brackets(s: str) -> str:
s = (s or "").strip()
if s.startswith("<") and s.endswith(">"):
return s[1:-1]
return s
# ---------------------------------------------------------------------------- MIME walk
def _walk_payload(payload: dict) -> tuple[Optional[str], Optional[str], list[dict]]:
"""Returns (body_text, body_html, attachments).
    Depth-first walk. First text/plain wins for body_text; first text/html
wins for body_html. Anything with a filename or attachment disposition
becomes an attachment entry.
"""
text: Optional[str] = None
html_body: Optional[str] = None
attachments: list[dict] = []
def visit(part: dict):
nonlocal text, html_body
mime = (part.get("mimeType") or "").lower()
filename = part.get("filename") or ""
body = part.get("body") or {}
parts = part.get("parts") or []
headers = _header_map(part.get("headers") or [])
disposition = (headers.get("content-disposition") or "").lower()
is_attachment = bool(filename) or disposition.startswith("attachment")
if is_attachment:
attachments.append({
"filename": filename or f"unnamed.{_ext_for(mime)}",
"mime_type": mime or "application/octet-stream",
"size": body.get("size"),
"gmail_attachment_id": body.get("attachmentId"),
# Some tiny attachments come inlined as base64; attachmentId is
# then missing and data is in body.data. sync.py handles both.
"inline_data_b64": body.get("data"),
"content_disposition": "inline" if disposition.startswith("inline") else "attachment",
})
else:
if mime == "text/plain" and text is None:
text = _decode_body(body)
elif mime == "text/html" and html_body is None:
html_body = _decode_body(body)
for child in parts:
visit(child)
visit(payload)
# Derive a plain-text body from HTML if only HTML exists.
if text is None and html_body:
text = _strip_html(html_body)
return text, html_body, attachments
def _decode_body(body: dict) -> Optional[str]:
data = body.get("data")
if not data:
return None
try:
padding = 4 - (len(data) % 4)
if padding != 4:
data = data + ("=" * padding)
raw = base64.urlsafe_b64decode(data.encode("ascii"))
return raw.decode("utf-8", errors="replace").replace("\r\n", "\n")
except Exception:
return None
# ---------------------------------------------------------------------------- HTML stripping
class _HTMLToText(HTMLParser):
def __init__(self):
super().__init__()
self._parts: list[str] = []
self._skip_depth = 0
def handle_starttag(self, tag, attrs):
if tag in ("script", "style"):
self._skip_depth += 1
if tag in ("br", "p", "div", "tr", "li"):
self._parts.append("\n")
def handle_endtag(self, tag):
if tag in ("script", "style"):
self._skip_depth = max(0, self._skip_depth - 1)
if tag in ("p", "div", "tr"):
self._parts.append("\n")
def handle_data(self, data):
if self._skip_depth == 0:
self._parts.append(data)
def text(self) -> str:
raw = "".join(self._parts)
return re.sub(r"\n{3,}", "\n\n", raw).strip()
def _strip_html(html: str) -> str:
p = _HTMLToText()
try:
p.feed(html)
return p.text()
except Exception:
return re.sub(r"<[^>]+>", " ", html)
def _ext_for(mime: str) -> str:
return mime.split("/")[-1] if "/" in mime else "bin"
# ---------------------------------------------------------------------------- caps
# Keep bodies bounded to avoid a pathological 500MB message exploding the DB.
_BODY_CAP_BYTES = 10 * 1024 * 1024 # 10MB
def _cap_text(s: Optional[str]) -> Optional[str]:
if s is None:
return None
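encoded = s.encode("utf-8", errors="ignore")
if len(encoded) <= _BODY_CAP_BYTES:
return s
# Truncate on the encoded bytes so the stored text really fits the cap;
# slicing by characters can still exceed it for multi-byte text.
head = encoded[:_BODY_CAP_BYTES].decode("utf-8", errors="ignore")
return head + "\n\n[TRUNCATED BY CRM — body exceeded 10MB]"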
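A minimal sketch of the depth-first body selection that `_walk_payload` describes, run against a synthetic Gmail-style payload. The helper names (`b64url`, `decode`, `first_bodies`) and the sample payload are illustrative assumptions rather than part of the module, and attachment handling is left out for brevity.

```python
import base64
from typing import Optional


def b64url(s: str) -> str:
    """Encode text as unpadded URL-safe base64, like a Gmail body.data value."""
    return base64.urlsafe_b64encode(s.encode("utf-8")).decode("ascii").rstrip("=")


def decode(data: Optional[str]) -> Optional[str]:
    """Decode URL-safe base64, restoring any missing '=' padding first."""
    if not data:
        return None
    data += "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(data).decode("utf-8", errors="replace")


def first_bodies(payload: dict) -> tuple[Optional[str], Optional[str]]:
    """Pre-order depth-first walk: the first text/plain and first text/html win."""
    text: Optional[str] = None
    html: Optional[str] = None
    stack = [payload]
    while stack:
        part = stack.pop()
        mime = (part.get("mimeType") or "").lower()
        data = (part.get("body") or {}).get("data")
        if mime == "text/plain" and text is None:
            text = decode(data)
        elif mime == "text/html" and html is None:
            html = decode(data)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(part.get("parts") or []))
    return text, html


# Synthetic message: multipart/mixed wrapping a multipart/alternative.
sample = {
    "mimeType": "multipart/mixed",
    "parts": [
        {
            "mimeType": "multipart/alternative",
            "parts": [
                {"mimeType": "text/plain", "body": {"data": b64url("hello")}},
                {"mimeType": "text/html", "body": {"data": b64url("<p>hello</p>")}},
            ],
        },
        {"mimeType": "text/plain", "body": {"data": b64url("later part")}},
    ],
}
bodies = first_bodies(sample)
```

The explicit stack gives the same visiting order as the recursive `visit` closure, so the second `text/plain` part never overrides the first.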
+462
@@ -0,0 +1,462 @@
"""
HTTP route handlers for the Gmail integration.
Designed to plug into server.py's CRMHandler (BaseHTTPRequestHandler) pattern.
The hook is a single function call near the top of do_GET / do_POST that
lets this module claim any /api/email/* request:
# in CRMHandler.do_GET and CRMHandler.do_POST, before the 404 fallthrough:
from email_integration.routes import try_handle
if try_handle(self):
return
`try_handle(handler)` inspects `handler.command` and `handler.get_path()` and
returns True if it handled the request (sent a response).
Every handler respects the same auth / rate-limit model as the rest of server.py
by calling handler.get_user() and handler.rate_limited(...).
"""
import json
import sqlite3
from typing import Optional
from . import config as _cfg
from . import credentials as _creds
from . import crypto as _crypto
from . import db as _db
from . import scheduler as _sched
# ---------------------------------------------------------------------------- dispatch
_GET_ROUTES = {
"/api/email/status": "status",
"/api/email/accounts": "list_accounts",
"/api/email/threads": "list_threads",
"/api/email/oauth/start": "oauth_start",
"/api/email/oauth/callback": "oauth_callback",
}
_POST_ROUTES = {
"/api/email/accounts/enroll-all": "enroll_all",
"/api/email/accounts/enroll": "enroll_one",
"/api/email/sync/run-now": "run_now",
"/api/email/rematch": "rematch",
}
def try_handle(handler) -> bool:
path = handler.get_path()
method = handler.command
table = _GET_ROUTES if method == "GET" else _POST_ROUTES if method == "POST" else {}
name = table.get(path)
if not path.startswith("/api/email/"):
return False
if not name:
# Route is owned by this module but unknown — return a proper 404
# instead of letting the main dispatcher's 404 abuse counter fire.
handler.send_error_json("Not found", 404)
return True
if not _cfg.CONFIG.enabled:
handler.send_error_json("Email integration disabled", 503)
return True
# Attachment streaming lives under a different prefix, so the /api/email/
# prefix check above already keeps it out of this dispatcher.
impl = globals().get(f"_h_{name}")
if impl is None:
handler.send_error_json("Not implemented", 500)
return True
try:
impl(handler)
except Exception as e:
handler.send_error_json(f"Internal error: {e}", 500)
return True
# ---------------------------------------------------------------------------- helpers
def _conn() -> sqlite3.Connection:
import os
db_path = os.environ.get(
"CRM_DB_PATH",
os.path.join(_cfg.CONFIG.data_dir, "crm.db"),
)
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA foreign_keys=ON")
conn.execute("PRAGMA busy_timeout=5000")
conn.row_factory = sqlite3.Row
return conn
def _require_auth(handler) -> Optional[dict]:
user = handler.get_user()
if not user:
handler.send_error_json("Unauthorized", 401)
return None
return user
def _require_admin(handler) -> Optional[dict]:
user = _require_auth(handler)
if user is None:
return None
if user.get("role") != "admin":
handler.send_error_json("Admin required", 403)
return None
return user
# ---------------------------------------------------------------------------- GET handlers
def _h_status(handler):
user = _require_auth(handler)
if not user:
return
snap = _sched.status_snapshot()
conn = _conn()
try:
cur = conn.cursor()
cur.execute(
"SELECT COUNT(*) AS n_accounts, "
"SUM(CASE WHEN sync_status='active' THEN 1 ELSE 0 END) AS n_active, "
"SUM(CASE WHEN sync_status='error' THEN 1 ELSE 0 END) AS n_error "
"FROM email_accounts"
)
counts = dict(cur.fetchone() or {})
cur.execute("SELECT COUNT(*) AS n FROM emails WHERE match_status = 'matched'")
snap["matched_emails"] = cur.fetchone()["n"]
finally:
conn.close()
snap["accounts_summary"] = counts
handler.send_json(snap)
def _h_list_accounts(handler):
user = _require_auth(handler)
if not user:
return
conn = _conn()
try:
cur = conn.cursor()
cur.execute(
"SELECT id, user_id, email_address, auth_method, sync_enabled, "
"sync_status, sync_error, last_synced_at, backfill_complete "
"FROM email_accounts ORDER BY email_address"
)
rows = [dict(r) for r in cur.fetchall()]
finally:
conn.close()
# Non-admins only see their own row
if user.get("role") != "admin":
rows = [r for r in rows if r["user_id"] == user["user_id"]]
handler.send_json({"accounts": rows})
def _h_list_threads(handler):
user = _require_auth(handler)
if not user:
return
q = handler.get_query_params()
investor_id = q.get("investor_id")
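# Clamp to 1..500: a negative LIMIT means "no limit" in SQLite, and a
# non-numeric value should be a 400 rather than an internal error.
try:
limit = max(1, min(int(q.get("limit", 50)), 500))
except (TypeError, ValueError):
return handler.send_error_json("limit must be an integer", 400)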
conn = _conn()
try:
cur = conn.cursor()
if investor_id:
cur.execute(
"""SELECT t.*
FROM email_threads t
JOIN emails e ON e.thread_id = t.id
JOIN email_investor_links l ON l.email_id = e.id
WHERE l.fundraising_investor_id = ?
OR l.fundraising_contact_id IN (
SELECT id FROM fundraising_contacts WHERE investor_id = ?
)
GROUP BY t.id
ORDER BY t.last_message_at DESC
LIMIT ?""",
(investor_id, investor_id, limit),
)
else:
cur.execute(
"SELECT * FROM email_threads WHERE is_matched = 1 "
"ORDER BY last_message_at DESC LIMIT ?",
(limit,),
)
threads = [dict(r) for r in cur.fetchall()]
finally:
conn.close()
handler.send_json({"threads": threads})
def _h_oauth_start(handler):
"""Begin per-user OAuth consent flow (fallback path)."""
user = _require_auth(handler)
if not user:
return
if _cfg.CONFIG.primary_auth != "oauth":
return handler.send_error_json(
"Per-user OAuth disabled (set CRM_GMAIL_AUTH_METHOD=oauth to enable)", 400
)
q = handler.get_query_params()
account_email = q.get("account_email") or ""
if not account_email:
return handler.send_error_json("account_email required", 400)
import secrets
import urllib.parse
state = secrets.token_urlsafe(32)
_oauth_state_store(state, user["user_id"], account_email)
params = {
"client_id": _cfg.CONFIG.oauth_client_id,
"redirect_uri": _cfg.CONFIG.oauth_redirect_uri,
"response_type": "code",
"scope": _creds.GMAIL_READONLY_SCOPE,
"access_type": "offline",
"prompt": "consent",
"state": state,
"login_hint": account_email,
}
url = "https://accounts.google.com/o/oauth2/v2/auth?" + urllib.parse.urlencode(params)
handler.send_json({"redirect_url": url})
def _h_oauth_callback(handler):
"""Exchange code for tokens, encrypt refresh token, store."""
q = handler.get_query_params()
code = q.get("code")
state = q.get("state")
if not code or not state:
return handler.send_error_json("code and state required", 400)
state_row = _oauth_state_consume(state)
if not state_row:
return handler.send_error_json("Invalid state", 400)
import urllib.parse
import urllib.request
body = urllib.parse.urlencode({
"code": code,
"client_id": _cfg.CONFIG.oauth_client_id,
"client_secret": _cfg.CONFIG.oauth_client_secret,
"redirect_uri": _cfg.CONFIG.oauth_redirect_uri,
"grant_type": "authorization_code",
}).encode("ascii")
req = urllib.request.Request(
"https://oauth2.googleapis.com/token",
data=body,
headers={"Content-Type": "application/x-www-form-urlencoded"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
payload = json.loads(resp.read())
except Exception as e:
return handler.send_error_json(f"Token exchange failed: {e}", 500)
refresh = payload.get("refresh_token")
if not refresh:
return handler.send_error_json("No refresh_token returned (user may have previously consented; prompt=consent required)", 400)
enc = _crypto.encrypt(refresh.encode("ascii"), secret_key_b64=_cfg.CONFIG.secret_key_b64)
conn = _conn()
try:
_db.upsert_account(conn, user_id=state_row["user_id"],
email_address=state_row["account_email"],
auth_method="oauth")
conn.execute(
"UPDATE email_accounts SET oauth_refresh_enc = ?, sync_status = 'pending', "
"updated_at = datetime('now') WHERE email_address = ?",
(enc, state_row["account_email"]),
)
conn.commit()
finally:
conn.close()
handler.send_json({"ok": True, "account_email": state_row["account_email"]})
# ---------------------------------------------------------------------------- POST handlers
def _h_enroll_all(handler):
"""Admin: enroll every CRM user whose email is @workspace_domain via DWD."""
user = _require_admin(handler)
if not user:
return
if _cfg.CONFIG.primary_auth != "dwd":
return handler.send_error_json("enroll-all only valid in DWD mode", 400)
domain = _cfg.CONFIG.workspace_domain
if not domain:
return handler.send_error_json("CRM_GMAIL_WORKSPACE_DOMAIN not set", 400)
conn = _conn()
try:
cur = conn.cursor()
cur.execute(
"SELECT id, email FROM users WHERE is_active = 1 AND email LIKE ?",
(f"%@{domain}",),
)
users = cur.fetchall()
created = []
for u in users:
aid = _db.upsert_account(conn, user_id=u["id"],
email_address=u["email"].lower(),
auth_method="dwd")
created.append({"account_id": aid, "email": u["email"]})
conn.commit()
finally:
conn.close()
handler.send_json({"enrolled": created, "count": len(created)})
def _h_enroll_one(handler):
user = _require_admin(handler)
if not user:
return
body = handler.get_body() or {}
# Accept either `email` or `email_address` for ergonomics.
email_address = (body.get("email_address") or body.get("email") or "").lower().strip()
user_id = body.get("user_id")
auth_method = body.get("auth_method") or _cfg.CONFIG.primary_auth
if not email_address:
return handler.send_error_json("email (or email_address) required", 400)
# If the caller didn't specify a CRM user_id, resolve it from the
# users table by matching email. Falls back to the authenticated
# admin's own id (handles the common case of a single admin
# enrolling themselves without having to paste their UUID).
if not user_id:
conn = _conn()
try:
cur = conn.cursor()
cur.execute("SELECT id FROM users WHERE LOWER(email) = ?",
(email_address,))
row = cur.fetchone()
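# The user dict is keyed by "user_id" elsewhere in this module; accept
# "id" too in case get_user() exposes both.
user_id = row["id"] if row else (user.get("user_id") or user.get("id"))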
finally:
conn.close()
if not user_id:
return handler.send_error_json("could not resolve user_id for that email", 400)
conn = _conn()
try:
aid = _db.upsert_account(conn, user_id=user_id,
email_address=email_address,
auth_method=auth_method)
conn.commit()
finally:
conn.close()
handler.send_json({"account_id": aid, "email": email_address, "user_id": user_id})
def _h_run_now(handler):
user = _require_admin(handler)
if not user:
return
# Reuse existing rate limit so admins can't hammer this.
if handler.rate_limited("email-sync-now", 6):
return handler.send_error_json("Too many requests", 429)
result = _sched.trigger_run_now()
handler.send_json(result)
def _h_rematch(handler):
"""Re-evaluate unmatched emails against the current investor index."""
user = _require_admin(handler)
if not user:
return
body = handler.get_body() or {}
since = body.get("since") # optional ISO8601
conn = _conn()
scanned = 0
matched = 0
try:
from .matcher import InvestorIndex
index = InvestorIndex(own_domain=_cfg.CONFIG.workspace_domain)
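# rebuild() takes a connection factory (as in scheduler.py), not an open
# connection, so the _conn function itself is passed here.
index.rebuild(_conn)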
cur = conn.cursor()
sql = ("SELECT id, from_email, to_emails_json, cc_emails_json "
"FROM emails WHERE match_status = 'unmatched'")
params: list = []
if since:
sql += " AND sent_at >= ?"
params.append(since)
sql += " ORDER BY sent_at DESC LIMIT 10000"
cur.execute(sql, params)
for row in cur.fetchall():
scanned += 1
participants = set()
if row["from_email"]:
participants.add(row["from_email"].lower())
for col in ("to_emails_json", "cc_emails_json"):
try:
arr = json.loads(row[col] or "[]")
except Exception:
arr = []
for a in arr:
e = a.get("email") if isinstance(a, dict) else a
if e:
participants.add(e.lower())
links = index.match(participants)
if not links:
continue
matched += 1
conn.execute(
"UPDATE emails SET match_status='matched', is_matched=1, "
"updated_at=datetime('now') WHERE id=?",
(row["id"],),
)
for link in links:
_db.insert_investor_link(conn, email_id=row["id"], link={
"matched_address": link.matched_address,
"match_kind": link.match_kind,
"match_confidence": link.match_confidence,
"fundraising_investor_id": link.target.fundraising_investor_id,
"fundraising_contact_id": link.target.fundraising_contact_id,
"contact_id": link.target.contact_id,
"organization_id": link.target.organization_id,
})
# NOTE: body is still missing — we only have headers. A follow-up
# job can re-fetch the full message from Gmail using the sighting's
# gmail_message_id. Not done inline to keep this endpoint fast.
conn.commit()
finally:
conn.close()
handler.send_json({"scanned": scanned, "newly_matched": matched})
# ---------------------------------------------------------------------------- OAuth state store (in-memory)
# For a 5-person CRM the state store doesn't need to be durable — a server
# restart between start and callback is rare and just requires a retry.
_oauth_states: dict[str, dict] = {}
import threading
_oauth_state_lock = threading.Lock()
def _oauth_state_store(state: str, user_id: str, account_email: str) -> None:
import time
with _oauth_state_lock:
# Prune stale entries (>10 min).
cutoff = time.time() - 600
for k, v in list(_oauth_states.items()):
if v["created"] < cutoff:
_oauth_states.pop(k, None)
_oauth_states[state] = {
"user_id": user_id,
"account_email": account_email.lower().strip(),
"created": time.time(),
}
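def _oauth_state_consume(state: str) -> Optional[dict]:
import time
with _oauth_state_lock:
row = _oauth_states.pop(state, None)
# Stale entries are only pruned when a new state is stored, so enforce
# the 10-minute lifetime on consumption as well.
if row is None or row["created"] < time.time() - 600:
return None
return row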
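A small sketch of a single-use OAuth `state` store with a time-to-live, as one way to read the in-memory store above. The class name `OneTimeStateStore` and the injectable `clock` parameter are assumptions made for illustration and testing; the module itself uses module-level functions and `time.time()`.

```python
import secrets
import threading
import time
from typing import Callable, Optional


class OneTimeStateStore:
    """Hypothetical helper: single-use OAuth state tokens that expire."""

    def __init__(self, ttl_sec: float = 600.0,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl_sec
        self._clock = clock
        self._lock = threading.Lock()
        self._states: dict[str, dict] = {}

    def issue(self, user_id: str, account_email: str) -> str:
        """Create an unguessable token bound to the user and mailbox."""
        state = secrets.token_urlsafe(32)
        with self._lock:
            self._states[state] = {
                "user_id": user_id,
                "account_email": account_email.lower().strip(),
                "created": self._clock(),
            }
        return state

    def consume(self, state: str) -> Optional[dict]:
        """Return the bound row once; None if unknown, reused, or expired."""
        with self._lock:
            row = self._states.pop(state, None)
        if row is None or self._clock() - row["created"] > self._ttl:
            return None
        return row
```

Popping before the expiry check means an expired token is also discarded, so it cannot be replayed later.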
+143
@@ -0,0 +1,143 @@
"""
Background sync scheduler.
Runs as a daemon thread started from server.py main(). One thread; it wakes
every `sync_interval_sec`, processes all accounts serially, sleeps again.
Singleton: start_sync_scheduler() is idempotent — calling twice won't spawn
a second thread. stop_sync_scheduler() gracefully signals shutdown (not
strictly needed since it's daemon, but useful for tests).
"""
import logging
import sqlite3
import threading
import time
from typing import Callable, Optional
from . import config as _cfg
from . import credentials as _creds
from . import sync as _sync
from .matcher import InvestorIndex
log = logging.getLogger("email_integration.scheduler")
_state: dict[str, object] = {
"thread": None,
"stop": threading.Event(),
"last_run": 0.0,
"last_result": None,
"running_now": False,
}
def _conn_factory_from_env() -> Callable[[], sqlite3.Connection]:
"""Build a get_db() compatible with server.py's pattern.
We don't import server.py (avoid circular / startup ordering). Instead
we re-implement the same settings. If server.py's DB path differs from
the default, CRM_DB_PATH env var should be set — same mechanism.
"""
import os
db_path = os.environ.get(
"CRM_DB_PATH",
os.path.join(_cfg.CONFIG.data_dir, "crm.db"),
)
def get_db() -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA foreign_keys=ON")
conn.execute("PRAGMA busy_timeout=5000")
conn.row_factory = sqlite3.Row
return conn
return get_db
def start_sync_scheduler(conn_factory: Optional[Callable] = None) -> None:
if _state["thread"] is not None:
return # already running
if not _cfg.CONFIG.enabled:
log.info("email_integration not enabled; scheduler will not start")
return
factory = conn_factory or _conn_factory_from_env()
try:
provider = _creds.build_provider(factory)
except Exception as e:
log.exception("cannot build credential provider: %s", e)
return
index = InvestorIndex(own_domain=_cfg.CONFIG.workspace_domain)
try:
index.rebuild(factory)
except Exception:
log.exception("initial investor-index build failed; scheduler continues")
stop = threading.Event()
_state["stop"] = stop
def _loop():
log.info("email sync scheduler started; interval=%ss", _cfg.CONFIG.sync_interval_sec)
# First cycle: short delay to let server finish startup.
if stop.wait(10):
return
while not stop.is_set():
_state["running_now"] = True
t0 = time.time()
try:
result = _sync.sync_all(factory, provider, index)
_state["last_result"] = result
except Exception:
log.exception("sync loop crashed; will retry next cycle")
finally:
_state["running_now"] = False
_state["last_run"] = t0
if stop.wait(_cfg.CONFIG.sync_interval_sec):
return
t = threading.Thread(target=_loop, name="email-sync", daemon=True)
t.start()
_state["thread"] = t
_state["provider"] = provider
_state["index"] = index
_state["factory"] = factory
def stop_sync_scheduler() -> None:
ev: threading.Event = _state["stop"] # type: ignore
ev.set()
t = _state.get("thread")
if t:
try:
t.join(timeout=5)
except Exception:
pass
_state["thread"] = None
def trigger_run_now() -> dict:
"""Force a single sync pass synchronously (admin 'sync now' endpoint)."""
if _state.get("running_now"):
return {"status": "already_running"}
factory = _state.get("factory")
provider = _state.get("provider")
index = _state.get("index")
if not (factory and provider and index):
return {"status": "not_initialized"}
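# Flag the pass as running so a second "sync now" request and the status
# endpoint can see it; the check above only helps if manual runs set it.
_state["running_now"] = True
try:
return _sync.sync_all(factory, provider, index) # type: ignore
finally:
_state["running_now"] = False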
def status_snapshot() -> dict:
return {
"enabled": _cfg.CONFIG.enabled,
"running": _state["running_now"],
"last_run_unix": _state.get("last_run"),
"last_result": _state.get("last_result"),
"interval_sec": _cfg.CONFIG.sync_interval_sec,
}
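A minimal sketch of the scheduler pattern used above: one daemon thread, an idempotent start, and `threading.Event.wait()` doubling as both the sleep and the shutdown signal. `PeriodicWorker` is a hypothetical name for illustration; the module keeps the same state in a module-level dict instead of a class.

```python
import threading
from typing import Callable, Optional


class PeriodicWorker:
    """Hypothetical helper: run fn() every interval_sec on one daemon thread."""

    def __init__(self, fn: Callable[[], None], interval_sec: float):
        self._fn = fn
        self._interval = interval_sec
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> bool:
        """Start the loop; returns False if it is already running."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False
            self._stop.clear()
            self._thread = threading.Thread(target=self._loop, daemon=True)
            self._thread.start()
            return True

    def _loop(self) -> None:
        while not self._stop.is_set():
            try:
                self._fn()
            except Exception:
                pass  # a real scheduler would log and retry next cycle
            # wait() returns True as soon as stop() is called, ending the loop.
            if self._stop.wait(self._interval):
                return

    def stop(self, timeout: float = 5.0) -> None:
        """Signal shutdown and wait briefly for the thread to exit."""
        self._stop.set()
        thread = self._thread
        if thread is not None:
            thread.join(timeout)
        self._thread = None
```

Using `Event.wait(interval)` instead of `time.sleep(interval)` lets `stop()` interrupt a long sleep immediately, which is what makes the scheduler practical to stop in tests.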
+390
@@ -0,0 +1,390 @@
"""
Sync orchestrator.
Top-level entry points:
sync_account(conn_factory, credential_provider, account_row, matcher)
Full sync pass for one mailbox. Decides backfill vs. incremental based
on email_accounts.backfill_complete. Writes a sync_runs row.
sync_all(conn_factory, credential_provider, matcher)
Iterates every sync-enabled account sequentially. Called from
scheduler.py every CRM_GMAIL_SYNC_INTERVAL_MIN minutes.
Design: match-only storage (see architecture doc §7). For each message:
1. Fetch metadata (cheap, 5 units).
2. Run matcher against participant addresses.
3. If matched → fetch full message, parse, persist body + register attachments.
4. If unmatched → persist header-only row.
5. In both cases, record the per-account sighting.
"""
import logging
import sqlite3
import traceback
from typing import Optional
from . import attachments as _attach
from . import config as _cfg
from . import db as _db
from . import errors as _errors
from . import gmail_client as _gmail
from . import parser as _parser
from . import threads as _threads
from .matcher import InvestorIndex, InvestorLink
log = logging.getLogger("email_integration.sync")
METADATA_HEADERS = [
"From", "To", "Cc", "Bcc", "Subject", "Date",
"Message-ID", "In-Reply-To", "References", "Reply-To",
]
# ---------------------------------------------------------------------------- public
def sync_all(conn_factory, credential_provider, index: InvestorIndex) -> dict:
"""Run one pass across all enabled accounts. Returns summary stats."""
index.rebuild_if_stale(conn_factory)
conn = conn_factory()
try:
accounts = _db.list_sync_ready_accounts(conn)
finally:
conn.close()
totals = {"accounts": 0, "messages_stored": 0, "errors": 0}
for acc in accounts:
totals["accounts"] += 1
try:
stats = sync_account(conn_factory, credential_provider, acc, index)
totals["messages_stored"] += stats.get("messages_stored", 0)
except Exception:
totals["errors"] += 1
log.exception("sync failed for account %s", acc["email_address"])
return totals
def sync_account(conn_factory, credential_provider, account,
index: InvestorIndex) -> dict:
"""Sync a single mailbox. Returns stats dict."""
email_addr = account["email_address"]
stats = _gmail.CallStats()
client = _gmail.GmailClient(credential_provider, email_addr, stats=stats)
# Mark running
conn = conn_factory()
try:
run_id = _db.start_sync_run(conn,
account_id=account["id"],
kind="backfill" if not account["backfill_complete"] else "incremental")
_db.set_account_status(conn, account["id"], status="active", error=None)
conn.commit()
finally:
conn.close()
run_stats = {"messages_seen": 0, "messages_stored": 0, "attachments_saved": 0}
error_str: Optional[str] = None
status = "ok"
try:
if not account["backfill_complete"]:
_run_backfill(conn_factory, client, account, index, run_stats)
else:
_run_incremental(conn_factory, client, account, index, run_stats)
# Drain attachments for this account. drain_pending receives the
# connection factory, so no connection is opened here.
# Limit to a few cycles' worth of attachments per pass.
batched = _attach.drain_pending(conn_factory, client, account["id"], limit=100)
run_stats["attachments_saved"] = batched
except _errors.AuthError as e:
error_str = f"auth: {e}"
status = "error"
except _errors.HistoryExpiredError:
# Recover: reset to date-based backfill from last_synced_at.
error_str = "history expired; fallback to date backfill"
status = "partial"
_fallback_date_backfill(conn_factory, client, account, index, run_stats)
except Exception as e:
error_str = f"unexpected: {type(e).__name__}: {e}"
status = "error"
log.exception("unexpected during sync of %s", email_addr)
finally:
run_stats["api_calls"] = stats.api_calls
run_stats["retries"] = stats.retries
conn = conn_factory()
try:
_db.finish_sync_run(conn, run_id, status=status, stats=run_stats, error=error_str)
_db.set_account_status(conn, account["id"],
status="active" if status == "ok" else status,
error=error_str)
_db.set_account_checkpoint(conn, account["id"],
last_synced_at=_db._now_iso())
conn.commit()
finally:
conn.close()
return run_stats
# ---------------------------------------------------------------------------- backfill
def _run_backfill(conn_factory, client, account, index: InvestorIndex,
run_stats: dict) -> None:
"""Initial full-mailbox backfill, resumable via backfill_cursor."""
page_token = account["backfill_cursor"]
while True:
resp = client.list_messages(page_token=page_token,
max_results=_cfg.CONFIG.backfill_page_size)
messages = resp.get("messages") or []
for m in messages:
run_stats["messages_seen"] += 1
try:
_process_one_message(conn_factory, client, account, index,
gmail_message_id=m["id"], run_stats=run_stats)
except _errors.GmailError as e:
log.warning("skip msg %s on %s: %s", m["id"], account["email_address"], e)
continue
page_token = resp.get("nextPageToken")
conn = conn_factory()
try:
_db.set_account_checkpoint(conn, account["id"],
backfill_cursor=page_token,
backfill_complete=(not page_token))
conn.commit()
finally:
conn.close()
if not page_token:
# Capture current historyId as checkpoint for future incrementals.
prof = client.get_profile()
hid = prof.get("historyId")
if hid:
conn = conn_factory()
try:
_db.set_account_checkpoint(conn, account["id"], history_id=str(hid))
conn.commit()
finally:
conn.close()
return
# ---------------------------------------------------------------------------- incremental
def _run_incremental(conn_factory, client, account, index: InvestorIndex,
run_stats: dict) -> None:
start_hid = account["last_history_id"]
if not start_hid:
# Safety: if checkpoint is missing, re-enter backfill.
_run_backfill(conn_factory, client, account, index, run_stats)
return
# history_types filter limits bandwidth to what we care about.
new_hid: Optional[str] = None
try:
for h in client.iter_history(
start_history_id=start_hid,
history_types=["messageAdded", "messageDeleted", "labelAdded", "labelRemoved"],
):
for ma in h.get("messagesAdded") or []:
msg = ma.get("message") or {}
run_stats["messages_seen"] += 1
try:
_process_one_message(conn_factory, client, account, index,
gmail_message_id=msg.get("id"),
run_stats=run_stats)
except _errors.GmailError as e:
log.warning("skip msg %s on %s: %s", msg.get("id"), account["email_address"], e)
for md in h.get("messagesDeleted") or []:
msg = md.get("message") or {}
conn = conn_factory()
try:
_db.tombstone_sighting(
conn,
account_id=account["id"],
gmail_message_id=msg.get("id"),
)
conn.commit()
finally:
conn.close()
for la in (h.get("labelsAdded") or []) + (h.get("labelsRemoved") or []):
msg = la.get("message") or {}
# labels are the resulting label set in Gmail's payload after
# the change. We refresh them wholesale.
labels = msg.get("labelIds") or []
conn = conn_factory()
try:
_db.update_sighting_labels(
conn,
account_id=account["id"],
gmail_message_id=msg.get("id"),
labels=labels,
)
conn.commit()
finally:
conn.close()
new_hid = client.last_history_id
except _errors.HistoryExpiredError:
# Propagate unchanged: sync_account catches this and falls back to a
# date-based backfill.
raise
if new_hid:
conn = conn_factory()
try:
_db.set_account_checkpoint(conn, account["id"], history_id=str(new_hid))
conn.commit()
finally:
conn.close()
def _fallback_date_backfill(conn_factory, client, account, index, run_stats):
"""Used when startHistoryId has been pruned by Gmail.
Pulls everything since last_synced_at (or 14d if unknown), which will
hit a large overlap with existing data but upserts are idempotent.
"""
from datetime import datetime, timedelta, timezone
since = account["last_synced_at"] or (
datetime.now(tz=timezone.utc) - timedelta(days=14)
).strftime("%Y-%m-%dT%H:%M:%SZ")
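# Gmail's after: operator takes YYYY/MM/DD. Slice the date part so this
# works whether the stored timestamp uses 'T' or a space separator.
q = f"after:{since[:10].replace('-', '/')}"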
for m in client.iter_messages(q=q):
run_stats["messages_seen"] += 1
try:
_process_one_message(conn_factory, client, account, index,
gmail_message_id=m["id"], run_stats=run_stats)
except _errors.GmailError as e:
log.warning("skip during date-backfill msg %s: %s", m["id"], e)
prof = client.get_profile()
hid = prof.get("historyId")
if hid:
conn = conn_factory()
try:
_db.set_account_checkpoint(conn, account["id"], history_id=str(hid))
conn.commit()
finally:
conn.close()
# ---------------------------------------------------------------------------- per-message
def _process_one_message(conn_factory, client, account, index: InvestorIndex,
*, gmail_message_id: str, run_stats: dict) -> None:
"""Fetch, match, persist one message. Idempotent."""
if not gmail_message_id:
return
# Skip if we've already sighted this message for this account.
conn = conn_factory()
try:
cur = conn.cursor()
cur.execute(
"SELECT email_id FROM email_account_messages "
"WHERE account_id = ? AND gmail_message_id = ?",
(account["id"], gmail_message_id),
)
if cur.fetchone():
return
finally:
conn.close()
# 1. Metadata fetch (cheap).
meta = client.get_message(gmail_message_id, format="metadata",
metadata_headers=METADATA_HEADERS)
meta_parsed = _parser.parse(meta, owning_account_address=account["email_address"])
participants = set()
if meta_parsed.get("from_email"):
participants.add(meta_parsed["from_email"])
for kind in ("to", "cc", "bcc"):
for a in meta_parsed.get(kind, []):
if isinstance(a, dict) and a.get("email"):
participants.add(a["email"])
# Exclude owning account's own address so we don't try to "match" ourselves.
own = {account["email_address"].lower()}
links = index.match(participants, exclude_addresses=own)
is_matched = bool(links)
# 2. If matched, fetch full and parse for body + attachments.
if is_matched:
full = client.get_message(gmail_message_id, format="full")
parsed = _parser.parse(full, owning_account_address=account["email_address"])
else:
parsed = meta_parsed
# Strip any body fields (metadata fetch shouldn't have them but be safe).
parsed["body_text"] = None
parsed["body_html"] = None
parsed["attachments"] = []
# 3. Persist (idempotent on rfc_message_id).
conn = conn_factory()
try:
existing = _db.find_email_by_rfc_id(conn, parsed["rfc_message_id"])
if existing:
email_id = existing["id"]
# If the email was previously unmatched but now matches (e.g. user
# added the investor after first sight), upgrade the row.
if is_matched and existing["match_status"] == "unmatched":
conn.execute(
"UPDATE emails SET match_status = 'matched', is_matched = 1, "
"body_text = ?, body_html = ?, updated_at = datetime('now') "
"WHERE id = ?",
(parsed.get("body_text"), parsed.get("body_html"), email_id),
)
_attach.register_stubs(conn,
email_id=email_id,
parsed_attachments=parsed.get("attachments") or [])
for link in links:
_db.insert_investor_link(conn, email_id=email_id, link=_flatten_link(link))
else:
match_status = "matched" if is_matched else "unmatched"
email_id = _db.insert_email(conn, parsed=parsed, match_status=match_status)
thread_id = _threads.resolve_thread_id(conn, parsed)
_db.set_email_thread(conn, email_id, thread_id)
if is_matched:
_attach.register_stubs(conn,
email_id=email_id,
parsed_attachments=parsed.get("attachments") or [])
for link in links:
_db.insert_investor_link(conn, email_id=email_id, link=_flatten_link(link))
_db.rollup_thread(conn, thread_id)
run_stats["messages_stored"] += 1
# Record sighting (always, even if email row was pre-existing).
_db.upsert_sighting(
conn,
email_id=email_id,
account_id=account["id"],
gmail_message_id=gmail_message_id,
gmail_thread_id=parsed.get("gmail_thread_id") or "",
labels=parsed.get("labels", []),
is_sent=parsed.get("is_sent", False),
)
conn.commit()
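except sqlite3.IntegrityError:
# Concurrent insert race: another writer stored this row first. Roll back
# this attempt explicitly; nothing is re-read or retried here.
conn.rollback()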
finally:
conn.close()
def _flatten_link(link: InvestorLink) -> dict:
return {
"matched_address": link.matched_address,
"match_kind": link.match_kind,
"match_confidence": link.match_confidence,
"fundraising_investor_id": link.target.fundraising_investor_id,
"fundraising_contact_id": link.target.fundraising_contact_id,
"contact_id": link.target.contact_id,
"organization_id": link.target.organization_id,
}
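The match-only storage flow from the module docstring can be sketched with in-memory fakes: fetch metadata, match participants, and pay for the full fetch only on a match. Everything here (`process_message`, `fake_fetch`, `fake_match`, the `MAILBOX` data and its field names) is a hypothetical stand-in and not the real Gmail client or matcher API.

```python
from typing import Callable


def process_message(msg_id: str,
                    fetch: Callable[[str, str], dict],
                    match: Callable[[set], list],
                    store: dict) -> dict:
    """Store body text only for messages whose participants match."""
    meta = fetch(msg_id, "metadata")                      # step 1: cheap fetch
    participants = {a.lower() for a in meta.get("participants", [])}
    links = match(participants)                           # step 2: matcher
    if links:
        full = fetch(msg_id, "full")                      # step 3: full fetch
        row = {"matched": True, "subject": full.get("subject"),
               "body": full.get("body"), "links": list(links)}
    else:
        row = {"matched": False, "subject": meta.get("subject"),
               "body": None, "links": []}                 # step 4: header-only
    store[msg_id] = row
    return row


# Synthetic mailbox and fakes, for illustration only.
MAILBOX = {
    "m1": {"participants": ["LP@fund.example"], "subject": "Q2 update",
           "body": "full text"},
    "m2": {"participants": ["news@shop.example"], "subject": "Sale",
           "body": "promo"},
}
calls = []


def fake_fetch(msg_id: str, fmt: str) -> dict:
    calls.append((msg_id, fmt))
    msg = MAILBOX[msg_id]
    if fmt == "metadata":
        return {k: v for k, v in msg.items() if k != "body"}
    return dict(msg)


def fake_match(participants: set) -> list:
    return sorted(p for p in participants if p.endswith("@fund.example"))


store: dict = {}
process_message("m1", fake_fetch, fake_match, store)
process_message("m2", fake_fetch, fake_match, store)
```

The recorded `calls` list shows the point of the design: the unmatched message costs one metadata request and its body never reaches the store.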
+75
@@ -0,0 +1,75 @@
"""
Threading resolution.
Given a freshly-inserted emails row (or its about-to-be-inserted parsed dict),
figure out which email_threads row it belongs to. If none exists, create one.
Priority order (see architecture doc §10):
1. Existing email in our DB that shares any RFC Message-ID with this one's
References/In-Reply-To chain — inherit its thread.
2. Existing thread with the same gmail_thread_id.
3. Existing thread with the same rfc_thread_root_id.
4. Create a new thread.
"""
import re
import sqlite3
from typing import Optional
from . import db as _db
# "fwd?" already covers both "fw" and "fwd".
SUBJECT_PREFIX_RE = re.compile(r"^\s*(re|fwd?|aw|sv|antw|回复)\s*:\s*", re.IGNORECASE)
def normalize_subject(s: Optional[str]) -> Optional[str]:
if not s:
return None
out = s
# Strip up to 5 nested Re:/Fwd: prefixes.
for _ in range(5):
new = SUBJECT_PREFIX_RE.sub("", out, count=1)
if new == out:
break
out = new
# A subject made only of prefixes (e.g. "Re:") normalizes to None, not "".
return out.strip().lower() or None
def resolve_thread_id(conn: sqlite3.Connection, parsed: dict) -> str:
"""Returns a thread_id — either an existing one or a newly created one."""
# Step 1: RFC cross-link.
candidates = list(parsed.get("references") or [])
if parsed.get("in_reply_to"):
candidates.append(parsed["in_reply_to"])
if candidates:
existing_email_id = _db.find_email_id_by_any_rfc_id(conn, candidates)
if existing_email_id:
cur = conn.cursor()
cur.execute("SELECT thread_id FROM emails WHERE id = ?", (existing_email_id,))
row = cur.fetchone()
if row and row["thread_id"]:
return row["thread_id"]
# Step 2: gmail_thread_id match.
gt = parsed.get("gmail_thread_id")
if gt:
existing = _db.find_thread_by_gmail_id(conn, gt)
if existing:
return existing["id"]
# Step 3: RFC thread-root match.
rfc_root = parsed.get("rfc_thread_root_id")
if rfc_root:
existing = _db.find_thread_by_rfc_root(conn, rfc_root)
if existing:
return existing["id"]
# Step 4: create.
return _db.create_thread(
conn,
gmail_thread_id=gt,
rfc_thread_root_id=rfc_root,
subject_normalized=normalize_subject(parsed.get("subject")),
first_message_at=parsed.get("sent_at"),
)
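The prefix stripping in normalize_subject can be checked in isolation. The sketch below restates the regex and function from this file in self-contained form (the sample subjects are made up for illustration). It shows that prefixes are matched case-insensitively and that at most five nested prefixes are removed.

```python
import re
from typing import Optional

# Same pattern as SUBJECT_PREFIX_RE above: one reply/forward prefix at the start.
PREFIX_RE = re.compile(r"^\s*(re|fwd?|aw|sv|antw|回复|fw)\s*:\s*", re.IGNORECASE)


def normalize_subject(s: Optional[str]) -> Optional[str]:
    if not s:
        return None
    out = s
    for _ in range(5):  # cap: at most five prefixes are stripped
        new = PREFIX_RE.sub("", out, count=1)
        if new == out:
            break
        out = new
    return out.strip().lower()


# Hypothetical subjects, for illustration only.
print(normalize_subject("Re: FWD: Fund III update"))  # fund iii update
print(normalize_subject("AW : Re: Fund III update"))  # fund iii update
print(normalize_subject("Re: " * 6 + "ping"))         # re: ping (sixth prefix survives)
```

Because the normalized subject is only stored on thread creation (step 4), the five-prefix cap affects what is recorded for a new thread, not which thread an email joins.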
+5
@@ -0,0 +1,5 @@
"""Ten31 Phase-0 ingest pipeline (entity resolution, chunking, embed, Qdrant upsert).
All modules are local-only and read the CRM by SQLite file path (CRM is canonical;
the canonical/vector layers are derived). No real data is sent to Claude here.
"""
+64
@@ -0,0 +1,64 @@
#!/usr/bin/env python3
"""Phase-0 Workstream B — backfill the CRM into Qdrant.
Chunk -> dense (bge-m3 via Spark Control) + sparse (BM25 client-side) -> upsert
to Qdrant `crm_chunks` with payload. Idempotent: deterministic point ids mean
re-running upserts in place. Reads the CRM by file path; never sends data to Claude.
python3 backend/ingest/backfill.py --db data/crm_dev.db --recreate
"""
import argparse
import sqlite3
import chunking
import config
import embed
import qdrant_io
import sparse
def run(db, recreate=False, batch=32):
conn = sqlite3.connect(db)
conn.row_factory = sqlite3.Row
chunks = chunking.build_chunks(conn)
conn.close()
print(f"Built {len(chunks)} chunks from {db}")
state = qdrant_io.create_collection(recreate=recreate)
qdrant_io.ensure_indexes()
print(f"Collection '{config.COLLECTION}': {state}")
total = 0
for i in range(0, len(chunks), batch):
group = chunks[i:i + batch]
dense = embed.dense_embed([c["text"] for c in group])
points = []
for c, dv in zip(group, dense):
sv = sparse.encode(c["text"])
points.append({
"id": c["point_id"],
"vector": {"dense": dv, "sparse": {"indices": sv["indices"], "values": sv["values"]}},
"payload": {
"lp_id": c["lp_id"], "lp_name": c["lp_name"], "person_id": c["person_id"],
"doc_type": c["doc_type"], "date_ts": c["date_ts"], "text": c["text"],
"source_model": c["source_model"], "source_id": c["source_id"], "chunk_key": c["chunk_key"],
},
})
qdrant_io.upsert(points)
total += len(points)
print(f" upserted {total}/{len(chunks)}")
print(f"Done. Qdrant '{config.COLLECTION}' now holds {qdrant_io.count()} points.")
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--db", default=config.DEFAULT_DB)
ap.add_argument("--recreate", action="store_true", help="drop & recreate the collection first")
ap.add_argument("--batch", type=int, default=32)
args = ap.parse_args()
run(args.db, recreate=args.recreate, batch=args.batch)
if __name__ == "__main__":
main()
+184
@@ -0,0 +1,184 @@
"""Phase-0 Workstream B1 — chunk the CRM for retrieval.
Maps each CRM record type to one or more chunks per docs/EMBEDDINGS.md:
* one chunk per communications row (doc_type = the comm type)
* one chunk per MATCHED email (doc_type = email; body only when matched)
* one chunk per fundraising_investors notes LINE (the outreach log; split per line)
* one chunk each for free-text fields: contacts.notes, lp_profiles.notes,
opportunities (description + next_step), organizations.description
Each chunk carries a canonical `lp_id` (resolved via entity_links) and a `date_ts`
(epoch of the EVENT time, not created_at) so Qdrant can pre-filter and recency-rank.
Entities/names/dates/types are payload (filterable); only prose is embedded.
A chunk's stable `chunk_key` -> deterministic point id (uuid5), so re-ingest
upserts in place (idempotent).
"""
import sqlite3
import uuid
from datetime import datetime, timezone
_NS = uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8") # uuid5 namespace for chunk ids
def to_epoch(ts: str):
if not ts:
return None
s = ts.strip().replace("Z", "+00:00")
for parse in (datetime.fromisoformat,):
try:
dt = parse(s)
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return int(dt.timestamp())
except Exception:
pass
# date-only fallback
try:
return int(datetime.strptime(s[:10], "%Y-%m-%d").replace(tzinfo=timezone.utc).timestamp())
except Exception:
return None
def _point_id(chunk_key: str) -> str:
return str(uuid.uuid5(_NS, chunk_key))
def _mk(chunk_key, lp_id, lp_name, person_id, doc_type, date_ts, text, source_model, source_id):
text = (text or "").strip()
if not text or not lp_id:
return None
return {
"chunk_key": chunk_key,
"point_id": _point_id(chunk_key),
"lp_id": lp_id,
"lp_name": lp_name,
"person_id": person_id,
"doc_type": doc_type,
"date_ts": date_ts,
"text": text,
"source_model": source_model,
"source_id": source_id,
}
def _canon_maps(conn):
"""Resolution lookups from entity_links / canonical_entities."""
person_canon, org_canon, inv_canon = {}, {}, {}
for r in conn.execute("SELECT source_model, source_id, canonical_id FROM entity_links"):
if r["source_model"] == "contacts":
person_canon[r["source_id"]] = r["canonical_id"]
elif r["source_model"] == "organizations":
org_canon[r["source_id"]] = r["canonical_id"]
elif r["source_model"] == "fundraising_investors":
inv_canon[r["source_id"]] = r["canonical_id"]
name = {r["id"]: r["display_name"] for r in conn.execute("SELECT id, display_name FROM canonical_entities")}
contact_org = {r["id"]: r["organization_id"] for r in conn.execute("SELECT id, organization_id FROM contacts")}
return person_canon, org_canon, inv_canon, name, contact_org
def _contact_lp(cid, person_canon, org_canon, name, contact_org):
"""Best lp_id for a contact-anchored chunk: the firm if known, else the person."""
person = person_canon.get(cid)
firm = org_canon.get(contact_org.get(cid))
lp = firm or person
return lp, name.get(lp), person
def build_chunks(conn):
person_canon, org_canon, inv_canon, name, contact_org = _canon_maps(conn)
chunks = []
# communications
for r in conn.execute("""SELECT id, contact_id, type, subject, body, outcome, next_action, communication_date
FROM communications"""):
lp, lp_name, person = _contact_lp(r["contact_id"], person_canon, org_canon, name, contact_org)
parts = [p for p in (r["subject"], r["body"], r["outcome"], r["next_action"]) if (p or "").strip()]
chunks.append(_mk(f"communications:{r['id']}", lp, lp_name, person,
r["type"] or "note", to_epoch(r["communication_date"]),
"\n".join(parts), "communications", r["id"]))
# contacts.notes
for r in conn.execute("SELECT id, notes, updated_at FROM contacts WHERE notes IS NOT NULL AND notes <> ''"):
lp, lp_name, person = _contact_lp(r["id"], person_canon, org_canon, name, contact_org)
chunks.append(_mk(f"contacts.notes:{r['id']}", lp, lp_name, person,
"contact_note", to_epoch(r["updated_at"]), r["notes"], "contacts", r["id"]))
# lp_profiles.notes
for r in conn.execute("""SELECT lp.id, lp.contact_id, lp.notes, lp.updated_at
FROM lp_profiles lp WHERE lp.notes IS NOT NULL AND lp.notes <> ''"""):
lp, lp_name, person = _contact_lp(r["contact_id"], person_canon, org_canon, name, contact_org)
chunks.append(_mk(f"lp_profiles.notes:{r['id']}", lp, lp_name, person,
"lp_note", to_epoch(r["updated_at"]), r["notes"], "lp_profiles", r["id"]))
# opportunities (description + next_step)
for r in conn.execute("""SELECT id, contact_id, name, description, next_step, updated_at
FROM opportunities"""):
lp, lp_name, person = _contact_lp(r["contact_id"], person_canon, org_canon, name, contact_org)
parts = [p for p in (r["name"], r["description"], r["next_step"]) if (p or "").strip()]
chunks.append(_mk(f"opportunities:{r['id']}", lp, lp_name, person,
"opportunity", to_epoch(r["updated_at"]), "\n".join(parts), "opportunities", r["id"]))
# organizations.description
for r in conn.execute("""SELECT id, description, updated_at FROM organizations
WHERE description IS NOT NULL AND description <> ''"""):
lp = org_canon.get(r["id"])
chunks.append(_mk(f"organizations.description:{r['id']}", lp, name.get(lp), None,
"org_note", to_epoch(r["updated_at"]), r["description"], "organizations", r["id"]))
# fundraising_investors.notes — running outreach log, split per non-empty line
for r in conn.execute("""SELECT id, notes, updated_at FROM fundraising_investors
WHERE notes IS NOT NULL AND notes <> ''"""):
lp = inv_canon.get(r["id"])
for i, line in enumerate(str(r["notes"]).splitlines()):
if line.strip():
chunks.append(_mk(f"fundraising_investors.notes:{r['id']}:{i}", lp, name.get(lp), None,
"outreach_note", to_epoch(r["updated_at"]), line, "fundraising_investors", r["id"]))
# MATCHED emails (only matched rows carry a body; key lp via email_investor_links)
if _has_table(conn, "emails") and _has_table(conn, "email_investor_links"):
for r in conn.execute("""SELECT id, subject, body_text, snippet, sent_at FROM emails WHERE is_matched=1"""):
lp, lp_name = _email_lp(conn, r["id"], inv_canon, org_canon, person_canon, name)
text = "\n".join(p for p in (r["subject"], r["body_text"] or r["snippet"]) if (p or "").strip())
chunks.append(_mk(f"emails:{r['id']}", lp, lp_name, None, "email",
to_epoch(r["sent_at"]), text, "emails", r["id"]))
return [c for c in chunks if c]
def _has_table(conn, name):
return conn.execute("SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (name,)).fetchone() is not None
def _email_lp(conn, email_id, inv_canon, org_canon, person_canon, name):
"""Resolve a matched email's lp_id via email_investor_links, precedence:
fundraising_investor -> contact -> organization."""
row = conn.execute("""SELECT fundraising_investor_id, contact_id, organization_id
FROM email_investor_links WHERE email_id=? ORDER BY match_confidence DESC LIMIT 1""",
(email_id,)).fetchone()
if not row:
return None, None
lp = (inv_canon.get(row["fundraising_investor_id"]) or person_canon.get(row["contact_id"])
or org_canon.get(row["organization_id"]))
return lp, name.get(lp)
if __name__ == "__main__":
import argparse
from collections import Counter
from config import DEFAULT_DB
ap = argparse.ArgumentParser()
ap.add_argument("--db", default=DEFAULT_DB)
args = ap.parse_args()
conn = sqlite3.connect(args.db)
conn.row_factory = sqlite3.Row
chunks = build_chunks(conn)
print(f"{len(chunks)} chunks from {args.db}")
for dt, n in Counter(c["doc_type"] for c in chunks).most_common():
print(f" {dt:<16} {n}")
unresolved = sum(1 for c in chunks if not c["lp_id"])
print(f" (all chunks have an lp_id: {unresolved == 0})")
print("\nSample chunk:")
s = chunks[0]
print({k: (v[:80] + '...' if k == 'text' and v and len(v) > 80 else v) for k, v in s.items()})
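The idempotency claim in the module docstring rests on uuid5 being a pure function of (namespace, chunk_key). A minimal sketch of that property follows; the chunk keys are hypothetical examples in the same format this file produces, and the namespace constant is the same value as the standard library's uuid.NAMESPACE_URL.

```python
import uuid

# Same namespace value as _NS above (identical to uuid.NAMESPACE_URL).
NS = uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")


def point_id(chunk_key: str) -> str:
    # uuid5 is a name-based hash, so the same key always yields the same id.
    return str(uuid.uuid5(NS, chunk_key))


keys = ("communications:42", "fundraising_investors.notes:7:0")  # hypothetical keys
first_run = {k: point_id(k) for k in keys}
second_run = {k: point_id(k) for k in keys}

print(first_run == second_run)  # re-ingest maps each chunk to the same point id
print(point_id("fundraising_investors.notes:7:0")
      != point_id("fundraising_investors.notes:7:1"))  # line index is part of the key
```

One consequence of keying outreach-log chunks by line index: if a line is inserted near the top of a notes field, later lines shift to new keys, and if the notes shrink, points for the dropped indices are not removed by an upsert alone. Running with --recreate rebuilds the collection and sidesteps this.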
+28
@@ -0,0 +1,28 @@
"""Ingest config — loads .env and exposes the Spark/Qdrant/CRM settings."""
import os
_ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
def load_env(path=None):
path = path or os.path.join(_ROOT, ".env")
if not os.path.exists(path):
return
with open(path, "r", encoding="utf-8") as fh:
for line in fh:
line = line.strip()
if not line or line.startswith("#") or "=" not in line:
continue
k, v = line.split("=", 1)
os.environ.setdefault(k.strip(), v.strip())
load_env()
SPARK_CONTROL_URL = os.environ.get("SPARK_CONTROL_URL", "").rstrip("/")
SPARK_VERIFY_TLS = os.environ.get("SPARK_CONTROL_VERIFY_TLS", "false").lower() in ("1", "true", "yes", "on")
QDRANT_URL = os.environ.get("QDRANT_URL", "").rstrip("/")
COLLECTION = os.environ.get("CRM_QDRANT_COLLECTION", "crm_chunks")
EMBED_MODEL = os.environ.get("CRM_EMBED_MODEL", "BAAI/bge-m3")
DENSE_DIM = int(os.environ.get("CRM_EMBED_DIM", "1024"))
DEFAULT_DB = os.environ.get("CRM_DEV_DB_PATH", os.path.join(_ROOT, "data", "crm_dev.db"))
+17
@@ -0,0 +1,17 @@
"""Dense embeddings via Spark Control /v1/embeddings (BAAI/bge-m3, 1024-d)."""
import config
import http_util
def dense_embed(texts, batch=32):
out = []
for i in range(0, len(texts), batch):
group = texts[i:i + batch]
status, data = http_util.request(
"POST", f"{config.SPARK_CONTROL_URL}/v1/embeddings",
{"input": group, "model": config.EMBED_MODEL}, verify=config.SPARK_VERIFY_TLS)
if status != 200:
raise RuntimeError(f"/v1/embeddings -> {status}: {data}")
rows = sorted(data["data"], key=lambda d: d["index"])
out.extend(r["embedding"] for r in rows)
return out
+258
@@ -0,0 +1,258 @@
#!/usr/bin/env python3
"""Phase-0 Workstream B3 / A4 — entity resolution (deterministic tier).
Collapses the CRM's two parallel investor models into the canonical identity
layer created by migration 0001:
organizations ─┐
fundraising_investors ─┴─► canonical_entities (entity_kind = lp | organization)
contacts ─┐
fundraising_contacts ─┴─► canonical_entities (entity_kind = person)
lp_profiles ───► linked to its contact's person entity
Every source row is recorded in `entity_links` so any name variant resolves to
one canonical id. This is the DETERMINISTIC tier — it merges only what we can
prove (exact email; exact normalized name within the same canonical org). The
HARD cases (nicknames like "Jon" vs "Jonathan", typos) are NOT guessed; they are
emitted as *fuzzy candidates* for the local-Qwen tier (Spark Control
/v1/chat/completions) to adjudicate later. Honest separation: we never silently
merge on a guess.
Properties:
* Local-only, read-mostly: reads CRM source tables, writes only the derived
canonical_entities / entity_links and an interaction_log audit row. Never
mutates a CRM source record (guardrail #2/#3).
* Idempotent: canonical ids are deterministic (sha1 of the resolution key), so
re-running upserts in place and keeps ids stable across runs — which keeps
downstream Qdrant point ids valid (no churn on re-embed).
* Logged: writes one interaction_log row per run (guardrail #5).
Usage:
python3 backend/ingest/entity_resolution.py --db data/crm_dev.db
python3 backend/ingest/entity_resolution.py --db data/crm_dev.db --show-candidates
"""
import argparse
import hashlib
import json
import re
import sqlite3
import uuid
from collections import defaultdict
from datetime import datetime, timezone
# ── normalization ─────────────────────────────────────────────────────────────
def norm_text(s: str) -> str:
s = (s or "").strip().lower()
s = re.sub(r"[^\w\s]", " ", s)
return re.sub(r"\s+", " ", s).strip()
def norm_email(s: str) -> str:
return (s or "").strip().lower()
def _eid(prefix: str, key: str) -> str:
"""Deterministic canonical id: stable across runs for the same resolution key."""
return f"{prefix}_{hashlib.sha1(key.encode('utf-8')).hexdigest()[:12]}"
def _now() -> str:
return datetime.now(timezone.utc).isoformat()
def _split_name(full: str):
parts = norm_text(full).split()
if not parts:
return "", ""
return parts[0], parts[-1] if len(parts) > 1 else ""
# ── upsert helpers ────────────────────────────────────────────────────────────
def _upsert_entity(conn, eid, kind, display_name, primary_email):
conn.execute(
"""
INSERT INTO canonical_entities (id, entity_kind, display_name, primary_email, source, created_at, updated_at)
VALUES (?, ?, ?, ?, 'entity_resolution', ?, ?)
ON CONFLICT(id) DO UPDATE SET
display_name = excluded.display_name,
primary_email = COALESCE(excluded.primary_email, canonical_entities.primary_email),
entity_kind = excluded.entity_kind,
updated_at = excluded.updated_at
""",
(eid, kind, display_name, primary_email or None, _now(), _now()),
)
def _link(conn, canonical_id, source_model, source_id, match_value, match_kind, confidence):
conn.execute(
"""
INSERT INTO entity_links (id, canonical_id, source_model, source_id, match_value, match_kind, confidence, created_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(source_model, source_id, match_value) DO UPDATE SET
canonical_id = excluded.canonical_id,
match_kind = excluded.match_kind,
confidence = excluded.confidence
""",
(str(uuid.uuid4()), canonical_id, source_model, source_id, match_value, match_kind, confidence, _now()),
)
# ── resolution passes ─────────────────────────────────────────────────────────
def resolve_organizations(conn):
"""Merge organizations + fundraising_investors by normalized name.
Returns (org_canon_by_orgid, org_canon_by_fundinv) so the people pass can
attach each person to their firm's canonical id.
"""
groups = defaultdict(lambda: {"orgs": [], "investors": [], "name": "", "email": ""})
for r in conn.execute("SELECT id, name, email FROM organizations"):
key = norm_text(r["name"])
if not key:
continue
g = groups[key]
g["orgs"].append(r["id"])
if len(r["name"] or "") > len(g["name"]):
g["name"] = r["name"]
if not g["email"] and (r["email"] or "").strip():
g["email"] = r["email"].strip()
for r in conn.execute("SELECT id, investor_name FROM fundraising_investors"):
key = norm_text(r["investor_name"])
if not key:
continue
g = groups[key]
g["investors"].append(r["id"])
if not g["name"]:
g["name"] = r["investor_name"]
org_canon_by_orgid, org_canon_by_fundinv = {}, {}
for key, g in groups.items():
# An org we are actively raising from (has a fundraising row) is an 'lp';
# otherwise a plain 'organization'.
kind = "lp" if g["investors"] else "organization"
cid = _eid("lp" if kind == "lp" else "org", key)
_upsert_entity(conn, cid, kind, g["name"], g["email"])
for oid in g["orgs"]:
_link(conn, cid, "organizations", oid, key, "exact_name", 1.0)
org_canon_by_orgid[oid] = cid
for iid in g["investors"]:
_link(conn, cid, "fundraising_investors", iid, key, "exact_name", 1.0)
org_canon_by_fundinv[iid] = cid
return org_canon_by_orgid, org_canon_by_fundinv
def resolve_people(conn, org_canon_by_orgid, org_canon_by_fundinv):
"""Merge contacts + fundraising_contacts by exact email, else exact name within
the same canonical org. Returns contact_id -> person canonical id (for lp_profiles)."""
# gather (model, source_id, full_name, email, org_canon)
people = []
for r in conn.execute("SELECT id, first_name, last_name, email, organization_id FROM contacts"):
full = f"{r['first_name'] or ''} {r['last_name'] or ''}".strip()
people.append(("contacts", r["id"], full, norm_email(r["email"]),
org_canon_by_orgid.get(r["organization_id"])))
for r in conn.execute("SELECT id, full_name, email, investor_id FROM fundraising_contacts"):
people.append(("fundraising_contacts", r["id"], r["full_name"] or "", norm_email(r["email"]),
org_canon_by_fundinv.get(r["investor_id"])))
contact_to_person = {}
person_meta = {} # canonical_id -> {"org": org_canon, "last": last_norm, "name": display, "email": email}
for model, sid, full, email, org_canon in people:
name_norm = norm_text(full)
if email:
key = f"e|{email}"
match_kind, conf, match_value = "exact_email", 1.0, email
elif name_norm:
key = f"n|{name_norm}|{org_canon or ''}"
match_kind, conf, match_value = "name_org", 0.8, name_norm
else:
continue
cid = _eid("per", key)
display = full.strip() or email
_upsert_entity(conn, cid, "person", display, email)
_link(conn, cid, model, sid, match_value, match_kind, conf)
if model == "contacts":
contact_to_person[sid] = cid
meta = person_meta.setdefault(cid, {"org": org_canon, "last": _split_name(full)[1],
"name": display, "email": email})
if org_canon and not meta["org"]:
meta["org"] = org_canon
# lp_profiles -> the person entity of its contact
for r in conn.execute("SELECT id, contact_id FROM lp_profiles"):
cid = contact_to_person.get(r["contact_id"])
if cid:
_link(conn, cid, "lp_profiles", r["id"], r["contact_id"], "contact_fk", 1.0)
return person_meta
def find_fuzzy_candidates(person_meta):
"""Distinct person entities sharing the same canonical org AND surname are
likely the same individual under a name variant (e.g. Jon/Jonathan). Emit them
for the local-Qwen tier; do NOT merge here."""
by_org_last = defaultdict(list)
for cid, m in person_meta.items():
if m["org"] and m["last"]:
by_org_last[(m["org"], m["last"])].append((cid, m["name"], m["email"]))
return [{"org": org, "surname": last, "members": members}
for (org, last), members in by_org_last.items() if len(members) > 1]
def run(db_path: str):
conn = sqlite3.connect(db_path)
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA foreign_keys=ON")
org_by_oid, org_by_inv = resolve_organizations(conn)
conn.commit()
person_meta = resolve_people(conn, org_by_oid, org_by_inv)
conn.commit()
candidates = find_fuzzy_candidates(person_meta)
counts = {
"canonical_total": conn.execute("SELECT COUNT(*) FROM canonical_entities").fetchone()[0],
"lp": conn.execute("SELECT COUNT(*) FROM canonical_entities WHERE entity_kind='lp'").fetchone()[0],
"organization": conn.execute("SELECT COUNT(*) FROM canonical_entities WHERE entity_kind='organization'").fetchone()[0],
"person": conn.execute("SELECT COUNT(*) FROM canonical_entities WHERE entity_kind='person'").fetchone()[0],
"links": conn.execute("SELECT COUNT(*) FROM entity_links").fetchone()[0],
"fuzzy_candidates": len(candidates),
}
conn.execute(
"""
INSERT INTO interaction_log (id, ts, actor_type, actor_id, action, target_type, payload, source, created_at)
VALUES (?, ?, 'system', 'entity_resolver', 'entity_resolution.run', 'canonical_entities', ?, 'ingest', ?)
""",
(str(uuid.uuid4()), _now(), json.dumps(counts), _now()),
)
conn.commit()
conn.close()
return counts, candidates
def main():
ap = argparse.ArgumentParser(description="Deterministic entity resolution into the canonical layer.")
ap.add_argument("--db", default="data/crm_dev.db", help="path to the CRM SQLite DB")
ap.add_argument("--show-candidates", action="store_true", help="print fuzzy merge candidates")
args = ap.parse_args()
counts, candidates = run(args.db)
print(f"Entity resolution on {args.db}:")
for k, v in counts.items():
print(f" {k:<18} {v}")
if args.show_candidates and candidates:
print("\nFuzzy candidates (same org + surname, different person — for the local-Qwen tier):")
for c in candidates:
names = ", ".join(f"{n!r}{(' <'+e+'>') if e else ''}" for _, n, e in c["members"])
print(f" [{c['surname']}] {names}")
if __name__ == "__main__":
main()
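To make the deterministic tier concrete, the sketch below restates the key construction from this file in self-contained form and runs it on three made-up rows. The surname "Reyes" and the example.com-style addresses are hypothetical; person_id is an illustrative helper that condenses the key logic in resolve_people, not a function in this module.

```python
import hashlib
import re
from collections import defaultdict


def norm_text(s):
    s = (s or "").strip().lower()
    s = re.sub(r"[^\w\s]", " ", s)
    return re.sub(r"\s+", " ", s).strip()


def eid(prefix, key):
    # Deterministic id: same resolution key -> same canonical id on every run.
    return f"{prefix}_{hashlib.sha1(key.encode('utf-8')).hexdigest()[:12]}"


def person_id(full_name, email, org_id):
    # Exact email wins; otherwise fall back to normalized name within the org.
    email = (email or "").strip().lower()
    if email:
        return eid("per", f"e|{email}")
    return eid("per", f"n|{norm_text(full_name)}|{org_id or ''}")


org_id = eid("lp", norm_text("Cedar Point Capital, LLC"))

# Hypothetical synthetic rows: (full_name, email)
rows = [
    ("Jon Reyes", "jon@cedarpoint.example"),
    ("Jon  Reyes", " JON@cedarpoint.example"),
    ("Jonathan Reyes", "jreyes@cedarpoint.example"),
]
ids = [person_id(name, email, org_id) for name, email in rows]

# Group distinct person ids by (org, surname), as find_fuzzy_candidates does.
by_org_surname = defaultdict(set)
for (name, _), pid in zip(rows, ids):
    by_org_surname[(org_id, norm_text(name).split()[-1])].add(pid)
candidates = {k: v for k, v in by_org_surname.items() if len(v) > 1}

print(ids[0] == ids[1])  # same email after normalization -> merged
print(ids[0] != ids[2])  # different email -> kept apart, not guessed
print(len(candidates))   # one (org, surname) group flagged for later review
```

The first two rows collapse because their emails normalize to the same string. The third stays a separate entity and only surfaces as a fuzzy candidate, which matches the no-silent-merge rule stated in the docstring.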
+35
@@ -0,0 +1,35 @@
"""Tiny stdlib JSON HTTP client (no third-party deps).
Handles the Spark Control self-signed cert (verify=False) and plain-HTTP Qdrant.
"""
import json
import ssl
import urllib.error
import urllib.request
def _ctx(verify: bool):
if verify:
return None
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
return ctx
def request(method: str, url: str, body=None, verify: bool = True, timeout: int = 180):
data = json.dumps(body).encode("utf-8") if body is not None else None
req = urllib.request.Request(url, data=data, method=method,
headers={"Content-Type": "application/json"})
ctx = _ctx(verify) if url.lower().startswith("https") else None
try:
with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
raw = resp.read()
return resp.status, (json.loads(raw) if raw else {})
except urllib.error.HTTPError as exc:
raw = exc.read()
try:
payload = json.loads(raw) if raw else {}
except Exception:
payload = {"raw": raw.decode("utf-8", "replace")}
return exc.code, payload
+50
@@ -0,0 +1,50 @@
"""Minimal Qdrant REST client for the ingest pipeline (direct to QDRANT_URL).
Creates the crm_chunks collection per EMBEDDINGS.md: a named dense vector
(1024, Cosine) + a named sparse vector with modifier:idf, plus payload indexes.
"""
import config
import http_util
Q = config.QDRANT_URL
COL = config.COLLECTION
def _req(method, path, body=None):
return http_util.request(method, f"{Q}{path}", body, verify=False)
def exists() -> bool:
status, _ = _req("GET", f"/collections/{COL}")
return status == 200
def create_collection(recreate=False, dim=config.DENSE_DIM):
if exists():
if not recreate:
return "exists"
_req("DELETE", f"/collections/{COL}")
status, data = _req("PUT", f"/collections/{COL}", {
"vectors": {"dense": {"size": dim, "distance": "Cosine"}},
"sparse_vectors": {"sparse": {"modifier": "idf"}},
})
if status not in (200, 201):
raise RuntimeError(f"create collection -> {status}: {data}")
return "created"
def ensure_indexes():
for field, schema in (("lp_id", "keyword"), ("doc_type", "keyword"), ("date_ts", "integer")):
_req("PUT", f"/collections/{COL}/index", {"field_name": field, "field_schema": schema})
def upsert(points):
status, data = _req("PUT", f"/collections/{COL}/points?wait=true", {"points": points})
if status not in (200, 201):
raise RuntimeError(f"upsert -> {status}: {data}")
return data
def count():
status, data = _req("POST", f"/collections/{COL}/points/count", {"exact": True})
if status != 200:
raise RuntimeError(f"count -> {status}: {data}")
return (data or {}).get("result", {}).get("count")
+109
@@ -0,0 +1,109 @@
#!/usr/bin/env python3
"""Phase-0 retrieval — thin wrappers over Spark Control /api/search.
These are the retrieval modes the CRM MCP server (Workstream C) will expose:
* semantic_search — dense only (omit sparse), high recall
* hybrid_search — dense + BM25 sparse (RRF) + rerank; best for entity queries
* keyword_search — lean on the sparse leg
All support a Qdrant `filter` (e.g. lp_id / date_ts range) to pre-filter.
`--demo` runs an entity-heavy query in dense-only vs hybrid to show the BM25
lexical leg surfacing the right LP. The query's sparse vector uses the SAME
encoder as ingest (sparse.encode).
"""
import argparse
import config
import http_util
import sparse
def _search(query, sparse_vec=None, rerank=False, top_k=5, lp_id=None, retrieve_n=80, filt=None):
body = {"query": query, "collection": config.COLLECTION, "top_k": top_k,
"retrieve_n": retrieve_n, "fusion": "rrf", "text_field": "text",
"with_payload": True, "rerank": rerank}
if sparse_vec is not None:
body["sparse"] = {"indices": sparse_vec["indices"], "values": sparse_vec["values"]}
# An explicit raw Qdrant filter (filt) wins; otherwise build one from lp_id.
if filt is not None:
body["filter"] = filt
elif lp_id:
body["filter"] = {"must": [{"key": "lp_id", "match": {"value": lp_id}}]}
status, data = http_util.request("POST", f"{config.SPARK_CONTROL_URL}/api/search",
body, verify=config.SPARK_VERIFY_TLS)
if status != 200:
raise RuntimeError(f"/api/search -> {status}: {data}")
return data.get("data", [])
def semantic_search(query, **kw):
return _search(query, sparse_vec=None, rerank=kw.pop("rerank", False), **kw)
def hybrid_search(query, **kw):
return _search(query, sparse_vec=sparse.encode(query), rerank=kw.pop("rerank", True), **kw)
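# Note: this currently builds the same request as hybrid_search (dense + sparse, rerank on by default).
def keyword_search(query, **kw):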
return _search(query, sparse_vec=sparse.encode(query), rerank=kw.pop("rerank", True), **kw)
def _row(r):
p = r.get("payload", {}) or {}
text = (r.get("text") or p.get("text") or "").replace("\n", " ")
return f"{p.get('lp_name', '?'):<22} [{p.get('doc_type', '?'):<13}] {text[:58]}"
def _print(title, rows):
print(f"\n {title}")
if not rows:
print(" (no results)")
for i, r in enumerate(rows, 1):
print(f" {i}. score={r.get('score', 0):+.3f} {_row(r)}")
def demo():
target = "Cedar Point Capital"
q = "Fund III diligence and wire timeline for Cedar Point"
print(f"QUERY: {q!r}\nTarget LP: {target}")
dense = semantic_search(q, top_k=5)
hybrid = hybrid_search(q, top_k=5, rerank=False) # rerank off to isolate the BM25 leg
_print("dense-only (semantic):", dense)
_print("hybrid (dense + BM25 RRF):", hybrid)
def first_rank(rows):
for i, r in enumerate(rows, 1):
if (r.get("payload", {}) or {}).get("lp_name") == target:
return i
return None
print(f"\n First '{target}' chunk — dense rank: {first_rank(dense)}, hybrid rank: {first_rank(hybrid)}")
# Pre-filter demo: same query, restricted to one LP's chunks.
lp_id = None
for r in hybrid:
p = r.get("payload", {}) or {}
if p.get("lp_name") == target:
lp_id = p.get("lp_id")
break
if lp_id:
_print(f"hybrid + payload pre-filter (lp_id={lp_id}):",
hybrid_search(q, top_k=5, rerank=True, lp_id=lp_id))
def main():
ap = argparse.ArgumentParser()
ap.add_argument("query", nargs="?")
ap.add_argument("--mode", choices=["semantic", "hybrid", "keyword"], default="hybrid")
ap.add_argument("--top-k", type=int, default=5)
ap.add_argument("--lp-id")
ap.add_argument("--demo", action="store_true")
args = ap.parse_args()
if args.demo or not args.query:
return demo()
fn = {"semantic": semantic_search, "hybrid": hybrid_search, "keyword": keyword_search}[args.mode]
_print(f"{args.mode}: {args.query!r}", fn(args.query, top_k=args.top_k, lp_id=args.lp_id))
if __name__ == "__main__":
main()
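The request body above sets "fusion": "rrf", but the fusion itself happens inside Spark Control and is not part of this diff. For readers unfamiliar with reciprocal rank fusion, here is a sketch of the textbook formulation, assuming the commonly used constant k=60. The function name rrf_fuse and the toy rankings are hypothetical, and the server's actual implementation may differ.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_ranking = ["a", "b", "c"]   # toy result ids from the dense leg
sparse_ranking = ["c", "a", "d"]  # toy result ids from the BM25 leg

print(rrf_fuse([dense_ranking, sparse_ranking]))  # ['a', 'c', 'b', 'd']
```

Items that appear in both lists ("a" and "c") outrank items that appear in only one, which is the effect the --demo comparison is meant to show when an entity name appears verbatim in the query.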
+40
@@ -0,0 +1,40 @@
"""Client-side BM25 sparse vectors.
EMBEDDINGS.md specifies FastEmbed `Qdrant/bm25` so Qdrant applies IDF (via the
sparse vector's `modifier: idf`) over OUR corpus. FastEmbed pulls onnxruntime,
which has no wheel for this Python (3.14) yet, so this module provides a
dependency-free BM25 term-frequency encoder with the same contract:
`encode(text) -> {"indices": [...], "values": [...]}`.
Qdrant computes IDF server-side from the stored sparse vectors regardless of how
indices are assigned, so this is a legitimate corpus-IDF BM25 leg. The ONLY hard
requirement is that ingest and query use the SAME encoder — they both import this
one. For production, swap `encode()` for FastEmbed `Qdrant/bm25` (and re-index, so
ingest and query stay on the same tokenizer).
"""
import hashlib
import math
import re
_TOKEN_RE = re.compile(r"[a-z0-9]+")
def tokenize(text: str):
return _TOKEN_RE.findall((text or "").lower())
def _index(token: str) -> int:
# Stable unsigned 32-bit index for a token (Qdrant sparse indices are u32).
return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")
def encode(text: str):
"""Return a sparse vector {indices, values}. Value is 1 + ln(tf) (sublinear
term frequency); IDF is applied by Qdrant via modifier:idf."""
tf = {}
for tok in tokenize(text):
tf[tok] = tf.get(tok, 0) + 1
idx_val = {}
for tok, count in tf.items():
idx_val[_index(tok)] = 1.0 + math.log(count)
return {"indices": list(idx_val.keys()), "values": list(idx_val.values())}
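The contract that matters here is that ingest and query produce the same index for the same token. The sketch below restates the encoder in compact, self-contained form (index_of mirrors _index; the sample texts are made up) and checks that a query's indices are a subset of a document containing the same words.

```python
import hashlib
import math
import re

TOKEN_RE = re.compile(r"[a-z0-9]+")


def index_of(token):
    # Stable unsigned 32-bit index derived from the token's MD5 digest.
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")


def encode(text):
    tf = {}
    for tok in TOKEN_RE.findall((text or "").lower()):
        tf[tok] = tf.get(tok, 0) + 1
    pairs = {index_of(tok): 1.0 + math.log(n) for tok, n in tf.items()}
    return {"indices": list(pairs), "values": list(pairs.values())}


doc = encode("Wire timeline: wire instructions sent to Cedar Point")
query = encode("cedar point wire")

# Every query term also occurs in the document, so its indices must too.
print(set(query["indices"]) <= set(doc["indices"]))
# A term seen twice gets the sublinear weight 1 + ln(2).
print(encode("wire wire")["values"] == [1.0 + math.log(2)])
```

Swapping in a different tokenizer on only one side would break the subset property silently, which is why the docstring calls for a re-index when moving to FastEmbed.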
+3
@@ -0,0 +1,3 @@
"""Ten31 CRM MCP server (Workstream C) — wraps CRM reads, retrieval modes, and
logged writes for the Claude Agent SDK. No outbound/contact tools (Phase 0 gate).
"""
+200
@@ -0,0 +1,200 @@
"""CRM MCP tool logic (Workstream C) — plain functions, transport-agnostic.
Kept separate from the MCP server wiring so it is unit-testable without the MCP
SDK. Each function returns JSON-serializable dicts. Reads go against the CRM
SQLite DB by path; retrieval wraps Spark Control /api/search; writes go through
the interaction_log (guardrail #5).
Tool surface:
reads get_entity, search_records, get_interaction_history
retrieval semantic_search, hybrid_search, keyword_search
writes log_interaction, set_entity_enrichment
NO outbound/contact tools — that capability is gated to Phase 3.
"""
import json
import os
import sqlite3
import sys
import uuid
from datetime import datetime, timezone
sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "ingest"))
import config # noqa: E402
import search # noqa: E402
def _conn(db=None):
c = sqlite3.connect(db or os.environ.get("CRM_DB_PATH") or config.DEFAULT_DB)
c.row_factory = sqlite3.Row
c.execute("PRAGMA foreign_keys=ON")
return c
def _now():
return datetime.now(timezone.utc).isoformat()
# ── read tools ────────────────────────────────────────────────────────────────
def _contact_ids_for(c, lp_id):
"""All contact ids belonging to a canonical entity: directly linked contacts
plus contacts whose organization resolves to this entity."""
ids = {r["source_id"] for r in c.execute(
"SELECT source_id FROM entity_links WHERE canonical_id=? AND source_model='contacts'", (lp_id,))}
org_src = [r["source_id"] for r in c.execute(
"SELECT source_id FROM entity_links WHERE canonical_id=? AND source_model='organizations'", (lp_id,))]
if org_src:
q = "SELECT id FROM contacts WHERE organization_id IN (%s)" % ",".join("?" * len(org_src))
ids.update(r["id"] for r in c.execute(q, org_src))
return ids
def get_entity(lp_id, db=None):
"""Fetch a canonical entity + its linked source rows and interaction count."""
c = _conn(db)
e = c.execute("SELECT * FROM canonical_entities WHERE id=?", (lp_id,)).fetchone()
if not e:
c.close()
return {"error": "not_found", "lp_id": lp_id}
out = dict(e)
out["links"] = [dict(r) for r in c.execute(
"SELECT source_model, source_id, match_kind, confidence FROM entity_links WHERE canonical_id=?", (lp_id,))]
cids = _contact_ids_for(c, lp_id)
out["interaction_count"] = (c.execute(
"SELECT COUNT(*) FROM communications WHERE contact_id IN (%s)" % ",".join("?" * len(cids)),
list(cids)).fetchone()[0] if cids else 0)
c.close()
return out


def search_records(query=None, entity_kind=None, limit=20, db=None):
    """Structured search over canonical entities (name substring + kind)."""
    c = _conn(db)
    sql = ("SELECT id, entity_kind, display_name, primary_email, segment, warmth_score "
           "FROM canonical_entities WHERE deleted_at IS NULL")
    args = []
    if entity_kind:
        sql += " AND entity_kind=?"
        args.append(entity_kind)
    if query:
        sql += " AND lower(display_name) LIKE ?"
        args.append(f"%{query.lower()}%")
    sql += " ORDER BY display_name LIMIT ?"
    args.append(limit)
    rows = [dict(r) for r in c.execute(sql, args)]
    c.close()
    return {"results": rows, "count": len(rows)}


def get_interaction_history(lp_id, limit=20, db=None):
    """Merged, dated interaction history for an entity: communications + grid notes."""
    c = _conn(db)
    items = []
    cids = _contact_ids_for(c, lp_id)
    if cids:
        q = ("SELECT type, subject, body, communication_date FROM communications "
             "WHERE contact_id IN (%s) ORDER BY communication_date DESC LIMIT ?" % ",".join("?" * len(cids)))
        for r in c.execute(q, [*cids, limit]):
            items.append({"kind": r["type"], "date": r["communication_date"],
                          "subject": r["subject"], "text": (r["body"] or "")[:240]})
    inv_src = [r["source_id"] for r in c.execute(
        "SELECT source_id FROM entity_links WHERE canonical_id=? AND source_model='fundraising_investors'", (lp_id,))]
    if inv_src:
        q = "SELECT notes, updated_at FROM fundraising_investors WHERE id IN (%s)" % ",".join("?" * len(inv_src))
        for r in c.execute(q, inv_src):
            if (r["notes"] or "").strip():
                items.append({"kind": "grid_note", "date": r["updated_at"],
                              "subject": "Fundraising grid notes", "text": r["notes"][:300]})
    c.close()
    # Newest first; rows with no date sort last.
    items.sort(key=lambda x: (x["date"] or ""), reverse=True)
    return {"lp_id": lp_id, "items": items[:limit], "count": len(items)}


# ── retrieval tools (wrap /api/search) ────────────────────────────────────────

def _filter(lp_id=None, doc_type=None, date_from=None, date_to=None):
    must = []
    if lp_id:
        must.append({"key": "lp_id", "match": {"value": lp_id}})
    if doc_type:
        must.append({"key": "doc_type", "match": {"value": doc_type}})
    if date_from is not None or date_to is not None:
        rng = {}
        if date_from is not None:
            rng["gte"] = date_from
        if date_to is not None:
            rng["lte"] = date_to
        must.append({"key": "date_ts", "range": rng})
    return {"must": must} if must else None


def _shape(rows):
    out = []
    for r in rows:
        p = r.get("payload", {}) or {}
        out.append({"score": r.get("score"), "lp_id": p.get("lp_id"), "lp_name": p.get("lp_name"),
                    "doc_type": p.get("doc_type"), "date_ts": p.get("date_ts"),
                    "text": r.get("text") or p.get("text"),
                    "source": f"{p.get('source_model')}:{p.get('source_id')}"})
    return out


def hybrid_search(query, top_k=8, lp_id=None, doc_type=None, date_from=None, date_to=None):
    """Dense + BM25 + rerank. Default mode; best for entity-heavy queries."""
    return {"mode": "hybrid", "query": query,
            "results": _shape(search.hybrid_search(query, top_k=top_k,
                                                   filt=_filter(lp_id, doc_type, date_from, date_to)))}


def semantic_search(query, top_k=8, lp_id=None, doc_type=None, date_from=None, date_to=None):
    """Dense only, high recall."""
    return {"mode": "semantic", "query": query,
            "results": _shape(search.semantic_search(query, top_k=top_k,
                                                     filt=_filter(lp_id, doc_type, date_from, date_to)))}


def keyword_search(query, top_k=8, lp_id=None, doc_type=None, date_from=None, date_to=None):
    """High-precision lexical (sparse leg + rerank)."""
    return {"mode": "keyword", "query": query,
            "results": _shape(search.keyword_search(query, top_k=top_k,
                                                    filt=_filter(lp_id, doc_type, date_from, date_to)))}


# ── write tools (every write logged — guardrail #5) ───────────────────────────

def log_interaction(action, actor_type="agent", actor_id=None, target_id=None,
                    target_type="canonical_entity", payload=None, source="mcp", db=None):
    """Append an entry to the append-only interaction log."""
    c = _conn(db)
    iid = str(uuid.uuid4())
    c.execute("""INSERT INTO interaction_log
                 (id, ts, actor_type, actor_id, action, target_type, target_id, payload, source, created_at)
                 VALUES (?,?,?,?,?,?,?,?,?,?)""",
              (iid, _now(), actor_type, actor_id, action, target_type, target_id,
               json.dumps(payload) if payload is not None else None, source, _now()))
    c.commit()
    c.close()
    return {"id": iid, "logged": True}


_ENRICH_FIELDS = {"thesis_fit", "segment", "accreditation_status", "qp_status",
                  "warmth_score", "source", "owner_id", "last_touch_at", "notes"}


def set_entity_enrichment(lp_id, fields, actor_id="analyst", db=None):
    """One-way enrichment write INTO the canonical entity (guardrail #8). Logged."""
    upd = {k: v for k, v in (fields or {}).items() if k in _ENRICH_FIELDS}
    if not upd:
        return {"error": "no_valid_fields", "allowed": sorted(_ENRICH_FIELDS)}
    c = _conn(db)
    # Column names come only from the _ENRICH_FIELDS allow-list; values stay bound parameters.
    sets = ", ".join(f"{k}=?" for k in upd) + ", updated_at=?"
    cur = c.execute(f"UPDATE canonical_entities SET {sets} WHERE id=?", [*upd.values(), _now(), lp_id])
    if cur.rowcount == 0:
        # Unknown id: nothing was written, so do not log a write that never happened.
        c.close()
        return {"error": "not_found", "lp_id": lp_id}
    iid = str(uuid.uuid4())
    c.execute("""INSERT INTO interaction_log
                 (id, ts, actor_type, actor_id, action, target_type, target_id, payload, source, created_at)
                 VALUES (?,?,?,?,?,?,?,?,?,?)""",
              (iid, _now(), "agent", actor_id, "enrichment.written", "canonical_entity", lp_id,
               json.dumps(upd), "mcp", _now()))
    c.commit()
    c.close()
    return {"lp_id": lp_id, "updated": list(upd.keys()), "log_id": iid}
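The logged-write pattern in set_entity_enrichment can be tried in isolation. Below is a minimal sketch against an in-memory SQLite database. The two-table schema is a cut-down assumption (the real tables have more columns), and `enrich` and `ALLOWED` are hypothetical stand-ins rather than the repo's function and allow-list. The `rowcount` check, which skips the log row when no entity matched, is a safety step of this sketch.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical, cut-down allow-list; the real _ENRICH_FIELDS is longer.
ALLOWED = {"segment", "warmth_score"}


def enrich(conn, lp_id, fields, actor_id="analyst"):
    """Update allow-listed columns and append a log row in one transaction."""
    upd = {k: v for k, v in fields.items() if k in ALLOWED}
    if not upd:
        return {"error": "no_valid_fields"}
    now = datetime.now(timezone.utc).isoformat()
    # Column names come only from ALLOWED; values are bound parameters.
    sets = ", ".join(f"{k}=?" for k in upd)
    cur = conn.execute(f"UPDATE canonical_entities SET {sets} WHERE id=?",
                       [*upd.values(), lp_id])
    if cur.rowcount == 0:
        return {"error": "not_found"}
    log_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO interaction_log (id, ts, actor_id, action, target_id, payload) "
        "VALUES (?,?,?,?,?,?)",
        (log_id, now, actor_id, "enrichment.written", lp_id, json.dumps(upd)))
    conn.commit()  # the update and its log row become visible together
    return {"updated": sorted(upd), "log_id": log_id}


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE canonical_entities (id TEXT PRIMARY KEY, segment TEXT, warmth_score REAL);
    CREATE TABLE interaction_log (id TEXT PRIMARY KEY, ts TEXT, actor_id TEXT,
                                  action TEXT, target_id TEXT, payload TEXT);
    INSERT INTO canonical_entities (id) VALUES ('lp_demo');
""")
result = enrich(conn, "lp_demo", {"segment": "family_office", "id": "ignored"})
missing = enrich(conn, "lp_none", {"segment": "x"})
```

Keys outside the allow-list (here `id`) are dropped silently, which is what keeps the f-string in the UPDATE safe: only known column names ever reach the SQL text.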
+88
View File
@@ -0,0 +1,88 @@
#!/usr/bin/env python3
"""Ten31 CRM MCP server (Workstream C).

Exposes CRM reads, retrieval modes, and logged writes to the Claude Agent SDK
over MCP (stdio). All logic lives in crm_tools.py (tested independently); this
file is the thin transport wrapper.

Run:
    pip install mcp        # one-time (MCP Python SDK)
    CRM_DB_PATH=/data/crm.db python3 backend/mcp/server.py

Register with the Agent SDK / Claude Code as an stdio MCP server pointing at this
script. NO outbound/contact tools are exposed — that capability is gated to
Phase 3 behind the compliance review (CLAUDE.md guardrails #4, #6).
"""
import os
import sys

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
import crm_tools as t  # noqa: E402
from mcp.server.fastmcp import FastMCP  # noqa: E402

mcp = FastMCP("ten31-crm")


# ── reads ──

@mcp.tool()
def get_entity(lp_id: str) -> dict:
    """Fetch a canonical LP/organization/person entity by id, with its linked
    source records and interaction count."""
    return t.get_entity(lp_id)


@mcp.tool()
def search_records(query: str = "", entity_kind: str = "", limit: int = 20) -> dict:
    """Structured search over canonical entities by name substring and kind
    ('lp' | 'organization' | 'person')."""
    return t.search_records(query=query or None, entity_kind=entity_kind or None, limit=limit)


@mcp.tool()
def get_interaction_history(lp_id: str, limit: int = 20) -> dict:
    """Merged, dated interaction history (communications + fundraising grid notes)
    for a canonical entity."""
    return t.get_interaction_history(lp_id, limit=limit)


# ── retrieval modes ──

@mcp.tool()
def hybrid_search(query: str, top_k: int = 8, lp_id: str = "", doc_type: str = "",
                  date_from: int = 0, date_to: int = 0) -> dict:
    """Dense + BM25 + rerank retrieval (default; best for entity-heavy queries).
    Optional filters: lp_id, doc_type, date_from/date_to (epoch seconds)."""
    return t.hybrid_search(query, top_k=top_k, lp_id=lp_id or None, doc_type=doc_type or None,
                           date_from=date_from or None, date_to=date_to or None)


@mcp.tool()
def semantic_search(query: str, top_k: int = 8, lp_id: str = "", doc_type: str = "") -> dict:
    """Dense-only retrieval (high recall)."""
    return t.semantic_search(query, top_k=top_k, lp_id=lp_id or None, doc_type=doc_type or None)


@mcp.tool()
def keyword_search(query: str, top_k: int = 8, lp_id: str = "", doc_type: str = "") -> dict:
    """High-precision lexical retrieval (sparse leg + rerank)."""
    return t.keyword_search(query, top_k=top_k, lp_id=lp_id or None, doc_type=doc_type or None)


# ── writes (logged) ──

@mcp.tool()
def log_interaction(action: str, actor_type: str = "agent", actor_id: str = "",
                    target_id: str = "", payload: dict = None, source: str = "mcp") -> dict:
    """Append an entry to the append-only interaction log (guardrail #5)."""
    return t.log_interaction(action, actor_type=actor_type, actor_id=actor_id or None,
                             target_id=target_id or None, payload=payload, source=source)


@mcp.tool()
def set_entity_enrichment(lp_id: str, fields: dict, actor_id: str = "analyst") -> dict:
    """One-way enrichment write into a canonical entity (thesis_fit, segment,
    warmth_score, accreditation_status, etc.). Logged automatically."""
    return t.set_entity_enrichment(lp_id, fields, actor_id=actor_id)


if __name__ == "__main__":
    mcp.run()
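The tool wrappers in this file take `""` and `0` as "unset" defaults and turn them into `None` with `x or None` before calling crm_tools. The sketch below illustrates that convention and the filter dictionary it leads to. `build_filter` repeats the logic of `_filter` from crm_tools.py only so the example runs standalone, and `tool_args_to_filter` is a hypothetical helper, not part of the server.

```python
def build_filter(lp_id=None, doc_type=None, date_from=None, date_to=None):
    """Copy of the crm_tools._filter logic, repeated so this sketch is self-contained."""
    must = []
    if lp_id:
        must.append({"key": "lp_id", "match": {"value": lp_id}})
    if doc_type:
        must.append({"key": "doc_type", "match": {"value": doc_type}})
    if date_from is not None or date_to is not None:
        rng = {}
        if date_from is not None:
            rng["gte"] = date_from
        if date_to is not None:
            rng["lte"] = date_to
        must.append({"key": "date_ts", "range": rng})
    return {"must": must} if must else None


def tool_args_to_filter(lp_id="", doc_type="", date_from=0, date_to=0):
    """Apply the wrappers' `x or None` convention, then build the filter."""
    return build_filter(lp_id or None, doc_type or None,
                        date_from or None, date_to or None)


no_filter = tool_args_to_filter()
scoped = tool_args_to_filter(lp_id="lp_demo", date_from=1700000000)
```

One consequence of the falsy-means-unset convention: `date_from=0` (the epoch itself) cannot be expressed as a bound, which is unlikely to matter for CRM data but is worth knowing when writing tests.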
@@ -0,0 +1,23 @@
-- Reversal of 0001_phase0_foundation.sql.
--
-- Run manually (it is NOT applied automatically) if the Phase-0 foundation
-- schema needs to be rolled back. Drops the new tables. The added `deleted_at`
-- columns on existing tables are left in place by default because they are
-- inert (nullable, unused unless the DELETE handlers are changed) and dropping
-- columns requires SQLite >= 3.35; uncomment the DROP COLUMN lines to remove
-- them on a modern SQLite.
DROP TABLE IF EXISTS relationship_edges;
DROP TABLE IF EXISTS interaction_log;
DROP TABLE IF EXISTS entity_links;
DROP TABLE IF EXISTS canonical_entities;
-- Requires SQLite >= 3.35.0. Safe to leave the columns if unsure.
-- ALTER TABLE contacts DROP COLUMN deleted_at;
-- ALTER TABLE organizations DROP COLUMN deleted_at;
-- ALTER TABLE opportunities DROP COLUMN deleted_at;
-- ALTER TABLE communications DROP COLUMN deleted_at;
-- ALTER TABLE lp_profiles DROP COLUMN deleted_at;
-- Forget the migration so it can be re-applied:
DELETE FROM schema_migrations WHERE filename = '0001_phase0_foundation.sql';
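Before uncommenting the DROP COLUMN lines, it may help to check which SQLite library the Python process is actually linked against. The following is a rough sketch under that assumption; `can_drop_column` and `drop_soft_delete_columns` are hypothetical helpers, not part of this commit.

```python
import sqlite3


def can_drop_column(version_info=None):
    """True when SQLite supports ALTER TABLE ... DROP COLUMN (added in 3.35.0)."""
    if version_info is None:
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info) >= (3, 35, 0)


def drop_soft_delete_columns(conn, tables):
    """Drop deleted_at from each table when supported; return the tables changed."""
    if not can_drop_column():
        return []
    changed = []
    for table in tables:
        # Table names come from a fixed list in the caller, never from user input.
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        if "deleted_at" in cols:
            conn.execute(f"ALTER TABLE {table} DROP COLUMN deleted_at")
            changed.append(table)
    return changed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id TEXT PRIMARY KEY, deleted_at TEXT)")
changed = drop_soft_delete_columns(conn, ["contacts"])
```

On an older SQLite the helper returns an empty list and leaves the inert columns in place, matching the default behaviour of the down migration.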
@@ -0,0 +1,116 @@
-- Phase 0 — Workstream A2: foundation schema for the agentic system.
--
-- ADDITIVE AND REVERSIBLE ONLY (CLAUDE.md guardrail #3): this migration adds
-- new tables and new nullable columns alongside the existing CRM. It never
-- drops, renames, or rewrites existing data. Its reversal is 0001_phase0_foundation.down.sql.
--
-- Applied once at startup by backend/core_migrations.py, tracked in the
-- schema_migrations ledger. Safe to leave in place; the canonical layer it
-- creates starts EMPTY and is populated later by entity resolution (A4/B3).
-- ============================================================================
-- 1. canonical_entities — the single, model-agnostic identity for an LP /
-- organization / person. Both the classic contacts/lp_profiles model and the
-- fundraising_* grid map INTO this; neither existing model is demoted.
-- IDs are full-length (e.g. 'lp_' + uuid4 hex), NOT the 8-char truncated
-- UUIDs used elsewhere in the CRM, so they are safe as the index/payload key.
-- ============================================================================
CREATE TABLE IF NOT EXISTS canonical_entities (
id TEXT PRIMARY KEY,
entity_kind TEXT NOT NULL, -- 'lp' | 'organization' | 'person'
display_name TEXT NOT NULL,
primary_email TEXT,
-- Phase-0 LP/prospect fields (model-agnostic home):
thesis_fit TEXT,
segment TEXT,
accreditation_status TEXT, -- free-text until counsel defines the vocabulary (guardrail #6)
qp_status TEXT,
warmth_score REAL,
source TEXT,
owner_id TEXT REFERENCES users(id),
last_touch_at TEXT,
notes TEXT,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now')),
deleted_at TEXT -- soft-delete (never hard-delete; guardrail #3)
);
CREATE INDEX IF NOT EXISTS idx_canonical_kind ON canonical_entities(entity_kind);
CREATE INDEX IF NOT EXISTS idx_canonical_email ON canonical_entities(primary_email);
CREATE INDEX IF NOT EXISTS idx_canonical_owner ON canonical_entities(owner_id);
-- ============================================================================
-- 2. entity_links — resolution map. Every source row (a contacts row, a
-- fundraising_investors row, etc.) and every email/name variant points at the
-- canonical entity it resolves to. This is how name variants collapse to one id.
-- ============================================================================
CREATE TABLE IF NOT EXISTS entity_links (
id TEXT PRIMARY KEY,
canonical_id TEXT NOT NULL REFERENCES canonical_entities(id) ON DELETE CASCADE,
source_model TEXT NOT NULL, -- contacts|organizations|lp_profiles|fundraising_investors|fundraising_contacts|email_address|alias
source_id TEXT, -- the local PK in that model (NULL for a bare email/name alias)
match_value TEXT, -- normalized email or name variant
match_kind TEXT NOT NULL, -- exact_email|name_variant|domain|manual
confidence REAL DEFAULT 1.0,
created_at TEXT DEFAULT (datetime('now')),
UNIQUE(source_model, source_id, match_value)
);
CREATE INDEX IF NOT EXISTS idx_entity_links_canonical ON entity_links(canonical_id);
CREATE INDEX IF NOT EXISTS idx_entity_links_match ON entity_links(match_value);
CREATE INDEX IF NOT EXISTS idx_entity_links_source ON entity_links(source_model, source_id);
-- ============================================================================
-- 3. interaction_log — APPEND-ONLY record of every agent action and every human
-- touch (guardrail #5). Distinct from audit_log (which is mutation-diff-only
-- and has no actor/agent dimension). Nothing in this table is ever updated or
-- deleted by convention.
-- ============================================================================
CREATE TABLE IF NOT EXISTS interaction_log (
id TEXT PRIMARY KEY,
ts TEXT NOT NULL DEFAULT (datetime('now')), -- event time
actor_type TEXT NOT NULL, -- human | agent | system
actor_id TEXT, -- users.id, or an agent name (Scout/Analyst/...)
action TEXT NOT NULL, -- e.g. note.created | email.matched | enrichment.written | search.run
target_type TEXT, -- canonical_entity | contact | communication | opportunity | ...
target_id TEXT, -- canonical_entities.id where possible
payload TEXT, -- JSON blob with the action detail
source TEXT, -- crm_ui | mcp | ingest | scout | ...
created_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_interaction_target ON interaction_log(target_type, target_id);
CREATE INDEX IF NOT EXISTS idx_interaction_ts ON interaction_log(ts);
CREATE INDEX IF NOT EXISTS idx_interaction_actor ON interaction_log(actor_type, actor_id);
-- ============================================================================
-- 4. relationship_edges — derived graph of who-knows-whom between canonical
-- entities. Starts EMPTY; seeded later from email_investor_links + calendar +
-- X follower overlap (Analyst, Phase 2).
-- ============================================================================
CREATE TABLE IF NOT EXISTS relationship_edges (
id TEXT PRIMARY KEY,
src_id TEXT NOT NULL REFERENCES canonical_entities(id) ON DELETE CASCADE,
dst_id TEXT NOT NULL REFERENCES canonical_entities(id) ON DELETE CASCADE,
edge_type TEXT NOT NULL, -- email_corr | calendar | x_follow | intro | colleague
source TEXT NOT NULL, -- provenance of this edge
strength REAL DEFAULT 0,
directed INTEGER DEFAULT 0,
evidence TEXT, -- JSON supporting detail
first_seen_at TEXT,
last_seen_at TEXT,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now')),
UNIQUE(src_id, dst_id, edge_type, source)
);
CREATE INDEX IF NOT EXISTS idx_rel_src ON relationship_edges(src_id);
CREATE INDEX IF NOT EXISTS idx_rel_dst ON relationship_edges(dst_id);
-- ============================================================================
-- 5. Soft-delete columns on existing tables. Additive nullable columns; the CRM
-- currently HARD-deletes everywhere (guardrail #3 gap). Adding the column is
-- safe now; switching the DELETE handlers to set it instead of hard-deleting
-- is a separate, reviewed code change.
-- ============================================================================
ALTER TABLE contacts ADD COLUMN deleted_at TEXT;
ALTER TABLE organizations ADD COLUMN deleted_at TEXT;
ALTER TABLE opportunities ADD COLUMN deleted_at TEXT;
ALTER TABLE communications ADD COLUMN deleted_at TEXT;
ALTER TABLE lp_profiles ADD COLUMN deleted_at TEXT;
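Because `ALTER TABLE ... ADD COLUMN` fails with a "duplicate column name" error if run twice, this file relies on the ledger to be applied exactly once. backend/core_migrations.py is not shown in this section, so the following is only a sketch of how a ledgered runner of that kind could work. The function name and the `applied_at` column are assumptions; the `schema_migrations.filename` column is taken from the down migration.

```python
import sqlite3


def apply_migrations(conn, migrations):
    """Apply each (filename, sql) pair once, recording it in schema_migrations."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations ("
                 "filename TEXT PRIMARY KEY, applied_at TEXT DEFAULT (datetime('now')))")
    applied = {row[0] for row in conn.execute("SELECT filename FROM schema_migrations")}
    ran = []
    for filename, sql in sorted(migrations):  # filename order is the apply order
        if filename in applied:
            continue
        conn.executescript(sql)
        conn.execute("INSERT INTO schema_migrations (filename) VALUES (?)", (filename,))
        conn.commit()
        ran.append(filename)
    return ran


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id TEXT PRIMARY KEY)")
migs = [("0001_demo.sql", "ALTER TABLE contacts ADD COLUMN deleted_at TEXT;")]
first = apply_migrations(conn, migs)
second = apply_migrations(conn, migs)  # skipped: already in the ledger
```

This sketch does not wrap the script and the ledger insert in a single transaction; a production runner would probably want that, so a failure partway through cannot leave an applied migration unrecorded.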
+1
View File
@@ -9,3 +9,4 @@ passlib[bcrypt]==1.7.4
python-multipart==0.0.9
aiofiles==23.2.1
httpx==0.27.0
cryptography==42.0.5
+279
View File
@@ -0,0 +1,279 @@
#!/usr/bin/env python3
"""Seed a SYNTHETIC dev database for Ten31 CRM ingest/retrieval testing.

ALL DATA IS FAKE. No real LP/prospect information appears here (CLAUDE.md
guardrail #9: Claude works only on synthetic/redacted data). This produces a
realistic-shaped corpus so the Phase-0 ingest, chunking, and entity-resolution
work can be developed and tested without ever touching the live CRM.

What it builds (into a SEPARATE dev DB, never crm.db):
  * The full real schema, via server.init_db() — which also runs the new
    core migration (backend/migrations/), so the canonical/interaction/graph
    tables exist.
  * A classic-model dataset: organizations, contacts (investors + prospects),
    opportunities across pipeline stages, communications with entity-rich prose
    notes, and lp_profiles.
  * A fundraising grid (fundraising_state.grid_json) populated via the real
    sync_fundraising_relational() code path, so the normalized mirror + the
    grid->classic bridge behave exactly as in production.
  * DELIBERATE entity-resolution test cases: several investors appear in BOTH
    models with NAME VARIANTS (e.g. "Jonathan Reyes" vs grid contact "Jon
    Reyes"), some with matching email (easy merge) and some without (hard case).

Usage:
    python3 backend/scripts/seed_synthetic.py            # -> data/crm_dev.db
    python3 backend/scripts/seed_synthetic.py --db /tmp/x.db
"""
import argparse
import datetime
import json
import os
import random
import sys

HERE = os.path.dirname(os.path.abspath(__file__))
BACKEND_DIR = os.path.dirname(HERE)
PROJECT_DIR = os.path.dirname(BACKEND_DIR)
sys.path.insert(0, BACKEND_DIR)

ORGS = [
    ("Cedar Point Capital", "family_office"),
    ("Harbor & Vine Family Office", "family_office"),
    ("Northwind Endowment", "endowment"),
    ("Granite Peak Partners", "institutional"),
    ("Solano Ventures", "wealth_management"),
    ("Tidewater Holdings", "family_office"),
    ("Brightwater Capital", "institutional"),
    ("Meridian Trust", "foundation"),
]
FIRST = ["Jonathan", "Katherine", "Michael", "William", "Robert", "Elena", "Priya",
         "David", "Sarah", "James", "Maria", "Thomas", "Laura", "Daniel", "Rachel",
         "Steven", "Nicole", "Andrew", "Jessica", "Brian"]
LAST = ["Reyes", "Calder", "Okonkwo", "Brandt", "Sutter", "Vance", "Mehta", "Ellison",
        "Cho", "Whitlock", "Santos", "Aldridge", "Kerr", "Nilsson", "Pope", "Devlin",
        "Frye", "Osei", "Lindqvist", "Marsh"]
NICK = {"Jonathan": "Jon", "Katherine": "Kate", "Michael": "Mike", "William": "Bill",
        "Robert": "Bob", "James": "Jim", "Thomas": "Tom", "Daniel": "Dan",
        "Steven": "Steve", "Jessica": "Jess"}
FUND_COLS = ["fund_i", "fund_ii", "fund_iii", "tactical_fund", "pawn_to_e4",
             "ten31_terahash", "sats_and_stats", "pawn_to_f4", "join_the_fold"]
FUND_LABELS = {"fund_i": "Fund I", "fund_ii": "Fund II", "fund_iii": "Fund III",
               "tactical_fund": "Tactical Fund"}
AMOUNTS = [250_000, 500_000, 1_000_000, 2_500_000, 5_000_000]
LEADS = ["JK", "Grant", "MB", "Parker"]

COMM_TEMPLATES = [
    ("call", "Intro call recap",
     "Spoke with {person} ({org}) for {dur} min about {fund}. Strong interest in the "
     "bitcoin-energy and AI-infrastructure thesis; wants the latest deck and DPI figures. "
     "Flagged accreditation paperwork still outstanding. Next: send one-pager and schedule a partner call."),
    ("email", "Follow-up: {fund} allocation",
     "Sent {person} the {fund} summary and the scarcity/critical-infrastructure memo. "
     "They asked how Ten31 Terahash relates to the energy thesis. Following up next week on commitment size."),
    ("meeting", "Partner meeting notes",
     "Met {person} at {org}. Discussed pacing into {fund} and co-invest appetite. "
     "Concern about lockup; reassured on secondary options. Warm — wants to meet the GP again before committing."),
    ("note", "Diligence status",
     "{person} is mid-diligence on {fund}. Legal reviewing subscription docs; wire expected within 30 days. "
     "Keep warm; send the Q update."),
    ("text", "Quick ping",
     "Texted {person} re: the {fund} close timeline. Said they're 'in for at least a unit' pending IC approval."),
]


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--db", default=os.path.join(PROJECT_DIR, "data", "crm_dev.db"))
    args = ap.parse_args()
    db = os.path.abspath(args.db)
    if os.path.basename(db) == "crm.db":
        sys.exit("Refusing to seed the real crm.db. Use a dev path like data/crm_dev.db.")
    # Start from a clean file: remove the DB and its WAL/SHM sidecars if present.
    for ext in ("", "-wal", "-shm"):
        if os.path.exists(db + ext):
            os.remove(db + ext)
    os.environ["CRM_DB_PATH"] = db

    import server  # noqa: E402  (must follow CRM_DB_PATH assignment)
    server.init_db()
    gen, now = server.generate_id, server.now
    conn = server.get_db()
    random.seed(31)

    def past(days_ago):
        return (datetime.datetime.utcnow() - datetime.timedelta(days=days_ago)).isoformat() + "Z"

    # ── dev user (FK target for created_by/owner_id) ──
    uid = gen()
    conn.execute(
        "INSERT INTO users (id, username, email, password_hash, full_name, role) VALUES (?,?,?,?,?,?)",
        (uid, "dev_admin", "dev@example.invalid", server.hash_password("devpassword"),
         "Dev Admin", "admin"))

    # ── organizations ──
    org_ids = {}
    for name, otype in ORGS:
        oid = gen()
        org_ids[name] = oid
        conn.execute(
            "INSERT INTO organizations (id, name, type, industry, country, description, created_by, updated_at) "
            "VALUES (?,?,?,?,?,?,?,?)",
            (oid, name, otype, "Investment Management", "USA",
             f"{name} — synthetic {otype.replace('_', ' ')} used for ingest testing.", uid, now()))

    # ── classic contacts (investors + prospects) ──
    contacts = []       # (cid, first, last, org_name, contact_type)
    overlap_specs = []  # investors we will also place in the grid, with variants
    used = set()
    for i, (org_name, _) in enumerate(ORGS):
        # one "primary" investor contact per org
        first, last = FIRST[i], LAST[i]
        used.add((first, last))
        cid = gen()
        email = f"{first.lower()}.{last.lower()}@{org_name.split()[0].lower()}.invalid"
        conn.execute(
            "INSERT INTO contacts (id, first_name, last_name, email, title, organization_id, contact_type, "
            "status, source, notes, created_by, updated_at) VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
            (cid, first, last, email, "Managing Partner", org_ids[org_name], "investor", "active",
             "referral", f"Primary relationship at {org_name}. Met via conference intro.", uid, now()))
        contacts.append((cid, first, last, org_name, "investor"))
        # mark 5 of 8 for grid overlap with a NAME VARIANT
        if i < 5:
            variant = NICK.get(first, first[0] + ".")   # nickname or initial
            match_email = email if i % 2 == 0 else ""   # half share email (easy), half don't (hard)
            overlap_specs.append((org_name, f"{variant} {last}", match_email))

    # extra prospect contacts (no org sometimes)
    for j in range(12):
        first = FIRST[(j + 8) % len(FIRST)]
        last = LAST[(j + 8) % len(LAST)]
        if (first, last) in used:
            last = LAST[(j + 11) % len(LAST)]
        used.add((first, last))
        org_name = ORGS[j % len(ORGS)][0] if j % 3 else None
        cid = gen()
        email = f"{first.lower()}{last.lower()}@example.invalid"
        conn.execute(
            "INSERT INTO contacts (id, first_name, last_name, email, title, organization_id, contact_type, "
            "status, source, notes, created_by, updated_at) VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
            (cid, first, last, email, "Principal", org_ids.get(org_name) if org_name else None,
             "prospect", "active", random.choice(["inbound", "referral", "conference", "x"]),
             f"Prospect sourced via {random.choice(['X DM', 'warm intro', 'podcast'])}.", uid, now()))
        contacts.append((cid, first, last, org_name, "prospect"))

    # ── opportunities + lp_profiles + communications ──
    stages = server.PIPELINE_STAGES
    for idx, (cid, first, last, org_name, ctype) in enumerate(contacts):
        person = f"{first} {last}"
        # opportunity for most contacts
        if idx % 5 != 4:
            stage = stages[idx % len(stages)]
            fund_label = random.choice(list(FUND_LABELS.values()))
            amt = random.choice(AMOUNTS)
            conn.execute(
                "INSERT INTO opportunities (id, name, contact_id, organization_id, stage, commitment_amount, "
                "expected_amount, probability, fund_name, description, next_step, owner_id, priority, updated_at) "
                "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
                (gen(), f"{org_name or person} - {fund_label}", cid, org_ids.get(org_name) if org_name else None,
                 stage, amt if stage in ("committed", "funded") else 0, amt,
                 {"lead": 10, "outreach": 25, "meeting": 40, "due_diligence": 60, "committed": 90, "funded": 100}[stage],
                 fund_label, f"Potential {fund_label} allocation for {person}.",
                 random.choice(["Send deck", "Schedule call", "Await IC", "Send subdocs"]),
                 uid, random.choice(["low", "medium", "high"]), now()))
        # lp_profile for ~closed investors
        if ctype == "investor" and idx % 2 == 0:
            amt = random.choice(AMOUNTS)
            conn.execute(
                "INSERT INTO lp_profiles (id, contact_id, commitment_amount, funded_amount, commitment_date, "
                "fund_name, investor_type, accredited, legal_docs_signed, wire_received, k1_sent, notes, updated_at) "
                "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)",
                (gen(), cid, amt, amt if idx % 4 == 0 else 0, past(120),
                 random.choice(list(FUND_LABELS.values())),
                 random.choice(["family_office", "institutional", "endowment", "individual"]),
                 1, 1 if idx % 3 else 0, 1 if idx % 4 == 0 else 0, 0,
                 f"Closed LP. Accreditation on file. Primary contact {person}.", now()))
        # 2-4 communications each, entity-rich prose
        for k in range(random.randint(2, 4)):
            ctype_comm, subj, body = random.choice(COMM_TEMPLATES)
            fund = random.choice(["Fund III", "Tactical Fund", "Ten31 Terahash", "Fund II"])
            conn.execute(
                "INSERT INTO communications (id, contact_id, type, subject, body, communication_date, "
                "duration_minutes, outcome, next_action, created_by, updated_at) VALUES (?,?,?,?,?,?,?,?,?,?,?)",
                (gen(), cid, ctype_comm,
                 subj.format(fund=fund),
                 body.format(person=person, org=org_name or "their firm", fund=fund, dur=random.choice([20, 30, 45])),
                 past(random.randint(1, 200)), random.choice([20, 30, 45, None]),
                 random.choice(["positive", "neutral", "needs follow-up"]),
                 random.choice(["Send deck", "Schedule call", "Send subdocs", None]),
                 uid, now()))

    # ── fundraising grid (authoritative grid_json -> real sync path) ──
    columns = server.DEFAULT_FUNDRAISING_COLUMNS
    views = server.DEFAULT_GRID_VIEWS
    rows = []
    # (a) overlap investors — same org as a classic investor, but a NAME-VARIANT contact
    for org_name, variant_name, match_email in overlap_specs:
        row = {"id": "row-" + gen(), "investor_name": org_name, "lead": random.choice(LEADS),
               "lead_source": random.choice(["Conference", "Warm intro", "X"]),
               "notes": f"[call] {variant_name}: discussed Fund III pacing and co-invest. Warm.\n"
                        f"[email] {variant_name}: sent the energy-thesis memo.",
               "priority": random.random() < 0.4, "follow_up": random.random() < 0.5, "graveyard": False,
               "contacts": [{"name": variant_name,
                             "email": match_email or f"{variant_name.split()[0].lower()}@{org_name.split()[0].lower()}.invalid",
                             "title": "Managing Partner"}]}
        for fc in random.sample(FUND_COLS, k=random.randint(1, 3)):
            row[fc] = random.choice(AMOUNTS)
        rows.append(row)
    # (b) grid-only investors (no classic counterpart) — exercise the create path
    for n in range(7):
        nm = f"{random.choice(['Slate', 'Copper', 'Ridgeline', 'Anchor', 'Falcon', 'Quarry', 'Beacon'])} " \
             f"{random.choice(['Capital', 'Partners', 'Holdings', 'Group'])}"
        row = {"id": "row-" + gen(), "investor_name": nm, "lead": random.choice(LEADS),
               "lead_source": random.choice(["Inbound", "Referral", "Podcast"]),
               "notes": f"[note] First touch with {nm}. Sourced via X. Gauging thesis fit.",
               "priority": False, "follow_up": random.random() < 0.6,
               "graveyard": n >= 5,  # a couple in the graveyard list
               "contacts": [{"name": f"{random.choice(FIRST)} {random.choice(LAST)}",
                             "email": f"contact{n}@{nm.split()[0].lower()}.invalid", "title": "Partner"}]}
        for fc in random.sample(FUND_COLS, k=random.randint(0, 2)):
            row[fc] = random.choice(AMOUNTS)
        rows.append(row)

    grid = {"columns": columns, "rows": rows}
    conn.execute(
        "INSERT INTO fundraising_state (id, grid_json, views_json, version, updated_by, created_at, updated_at) "
        "VALUES ('main', ?, ?, 1, ?, ?, ?)",
        (json.dumps(grid), json.dumps(views), uid, now(), now()))
    server.sync_fundraising_relational(conn, grid, views, actor_user_id=uid)
    conn.commit()

    # ── summary ──
    def count(t):
        return conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]

    print(f"\nSynthetic dev DB written to: {db}")
    print(" Classic model:")
    for t in ("organizations", "contacts", "opportunities", "communications", "lp_profiles"):
        print(f" {t:<24} {count(t)}")
    print(" Fundraising grid (after real sync):")
    for t in ("fundraising_investors", "fundraising_contacts", "fundraising_funds",
              "fundraising_commitments", "fundraising_list_memberships"):
        print(f" {t:<24} {count(t)}")
    print(" Phase-0 foundation tables (from migration, empty until entity resolution):")
    for t in ("canonical_entities", "entity_links", "interaction_log", "relationship_edges"):
        print(f" {t:<24} {count(t)}")
    inv = count("contacts")  # note: the grid bridge may have created extra investor contacts (the variants)
    print(f"\n Entity-resolution test bed: {len(overlap_specs)} investors intentionally appear in BOTH models "
          f"with name variants; total contacts now {inv} (grid bridge added the variant rows).")
    conn.close()


if __name__ == "__main__":
    main()
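The name variants seeded above ("Jon Reyes" against "Jonathan Reyes", or an initial such as "E. Vance") are the cases an entity resolver has to collapse. The resolver itself (backend/ingest) is not shown in this section, so what follows is only an illustrative sketch of nickname-aware name matching. `name_key` and `same_person` are hypothetical helpers, and the nickname table is a subset of the seed script's `NICK`.

```python
# Subset of the seed script's NICK table, inverted to map nickname -> formal name.
NICK = {"Jonathan": "Jon", "Katherine": "Kate", "Michael": "Mike",
        "William": "Bill", "Robert": "Bob"}
FORMAL = {nick.lower(): full.lower() for full, nick in NICK.items()}


def name_key(full_name):
    """Normalize a display name to a (first, last) key, expanding known nicknames."""
    parts = full_name.lower().replace(".", "").split()
    if len(parts) < 2:
        return (parts[0] if parts else "", "")
    first, last = parts[0], parts[-1]
    return (FORMAL.get(first, first), last)


def same_person(a, b):
    """True if two names agree on last name and on first name, nickname, or initial."""
    ka, kb = name_key(a), name_key(b)
    if ka == kb:
        return True
    if ka[1] != kb[1] or not ka[0] or not kb[0]:
        return False
    # Initial-only variant such as "E. Vance": compare the single letter.
    if len(ka[0]) == 1:
        return kb[0].startswith(ka[0])
    if len(kb[0]) == 1:
        return ka[0].startswith(kb[0])
    return False


nickname_match = same_person("Jonathan Reyes", "Jon Reyes")
initial_match = same_person("Elena Vance", "E. Vance")
different = same_person("Jon Reyes", "Jon Calder")
```

A name-only match like this is weak evidence on its own, which is why the seed data separates the "easy" cases (shared email) from the "hard" ones (name variant only). In the schema above, such a match would presumably be stored with `match_kind='name_variant'` and a confidence below 1.0.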
+620 -52
View File
@@ -18,7 +18,7 @@ import re
import base64
import threading
from datetime import datetime, timedelta
from http.server import HTTPServer, BaseHTTPRequestHandler
from http.server import HTTPServer, ThreadingHTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs, unquote
from functools import wraps
@@ -52,6 +52,14 @@ CORS_ORIGIN = os.environ.get("CRM_CORS_ORIGIN", "*")
ENV = os.environ.get("CRM_ENV", "development")
LOGIN_RATE_LIMIT_PER_MIN = int(os.environ.get("CRM_LOGIN_RATE_LIMIT_PER_MIN", "20"))
WRITE_RATE_LIMIT_PER_MIN = int(os.environ.get("CRM_WRITE_RATE_LIMIT_PER_MIN", "300"))
GET_RATE_LIMIT_PER_MIN = int(os.environ.get("CRM_GET_RATE_LIMIT_PER_MIN", "600"))
# Auto-ban any IP that racks up too many 404s in a short window — almost always
# a vulnerability scanner blasting common paths (/.env, /.git/config, /swagger,
# /actuator/env, wp-json, etc.). Banned IPs get instant 429s with no DB or
# filesystem work, so they can't keep the single SQLite writer busy.
ABUSE_404_THRESHOLD = int(os.environ.get("CRM_ABUSE_404_THRESHOLD", "15"))
ABUSE_404_WINDOW_SEC = int(os.environ.get("CRM_ABUSE_404_WINDOW_SEC", "60"))
ABUSE_BAN_SEC = int(os.environ.get("CRM_ABUSE_BAN_SEC", "900")) # 15 minutes
BACKUP_POLICY_SETTING_KEY = "fundraising_backup_policy"
DEFAULT_BACKUP_POLICY = {
"enabled": True,
@@ -127,6 +135,10 @@ def init_db():
tags TEXT DEFAULT '[]',
notes TEXT,
linkedin_url TEXT,
city TEXT,
state TEXT,
country TEXT,
location_query TEXT,
preferred_contact TEXT DEFAULT 'email',
created_by TEXT REFERENCES users(id),
created_at TEXT DEFAULT (datetime('now')),
@@ -258,6 +270,7 @@ def init_db():
investor_name TEXT NOT NULL,
notes TEXT,
lead TEXT,
lead_source TEXT,
priority INTEGER DEFAULT 0,
follow_up INTEGER DEFAULT 0,
graveyard INTEGER DEFAULT 0,
@@ -337,6 +350,30 @@ def init_db():
UNIQUE(investor_id, list_key)
);
CREATE TABLE IF NOT EXISTS fundraising_presence (
user_id TEXT PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
username TEXT NOT NULL,
full_name TEXT,
active_view TEXT,
row_id TEXT,
col_id TEXT,
is_editing INTEGER DEFAULT 0,
cell_key TEXT,
last_seen_at TEXT DEFAULT (datetime('now')),
expires_at_epoch INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS fundraising_cell_locks (
cell_key TEXT PRIMARY KEY,
row_id TEXT NOT NULL,
col_id TEXT NOT NULL,
locked_by_user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
locked_by_username TEXT NOT NULL,
locked_by_full_name TEXT,
last_seen_at TEXT DEFAULT (datetime('now')),
expires_at_epoch INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS app_settings (
key TEXT PRIMARY KEY,
value_json TEXT NOT NULL,
@@ -363,9 +400,42 @@ def init_db():
CREATE INDEX IF NOT EXISTS idx_fr_commitments_fund ON fundraising_commitments(fund_id);
CREATE INDEX IF NOT EXISTS idx_fr_automation_runs_created ON fundraising_automation_runs(created_at);
CREATE INDEX IF NOT EXISTS idx_fr_memberships_list ON fundraising_list_memberships(list_key);
CREATE INDEX IF NOT EXISTS idx_fr_presence_expires ON fundraising_presence(expires_at_epoch);
CREATE INDEX IF NOT EXISTS idx_fr_locks_expires ON fundraising_cell_locks(expires_at_epoch);
""")
    # Lightweight schema migrations for existing databases.
    for stmt in [
        "ALTER TABLE contacts ADD COLUMN city TEXT",
        "ALTER TABLE contacts ADD COLUMN state TEXT",
        "ALTER TABLE contacts ADD COLUMN country TEXT",
        "ALTER TABLE contacts ADD COLUMN location_query TEXT",
        "ALTER TABLE fundraising_investors ADD COLUMN lead_source TEXT",
    ]:
        try:
            conn.execute(stmt)
        except sqlite3.OperationalError:
            # Column already exists on this database; nothing to migrate.
            pass
    # ─── Gmail integration migrations (feature-flag-guarded import) ───
    try:
        from email_integration.db import apply_migrations as _email_apply_migrations
        _email_apply_migrations(cursor)
    except ImportError:
        pass
    except Exception as _e:
        print(f"[email_integration] migration warning: {_e}")
    conn.commit()
    # ─── Core schema migrations (Phase 0+; ordered .sql files w/ ledger) ───
    # Additive/reversible only; tracked in schema_migrations. See core_migrations.py.
    try:
        from core_migrations import apply_core_migrations as _apply_core_migrations
        _apply_core_migrations(conn)
    except Exception as _e:
        print(f"[migrations] core migration warning: {_e}")
    conn.close()
    print(f"Database initialized at {DB_PATH}")
@@ -563,6 +633,16 @@ def _split_full_name(full_name):
def _normalize_text(value):
return str(value or '').strip().lower()
def _parse_location_text(text):
    raw = str(text or '').strip()
    if not raw:
        return '', '', '', ''
    parts = [p.strip() for p in raw.split(',') if p.strip()]
    city = parts[0] if len(parts) >= 1 else ''
    state = parts[1] if len(parts) >= 2 else ''
    country = parts[2] if len(parts) >= 3 else ''
    return city, state, country, raw
def ensure_default_automation_rules(conn):
defaults = [
{
@@ -662,6 +742,11 @@ def _upsert_contact_from_fundraising(conn, investor_name, contact, actor_user_id
full_name = str(contact.get('name') or '').strip()
email = str(contact.get('email') or '').strip()
title = str(contact.get('title') or '').strip()
source = str(contact.get('source') or '').strip()
city = str(contact.get('city') or '').strip()
state = str(contact.get('state') or '').strip()
country = str(contact.get('country') or '').strip()
location_query = str(contact.get('location_query') or '').strip()
if not full_name and not email:
return None
first_name, last_name = _split_full_name(full_name)
@@ -700,20 +785,25 @@ def _upsert_contact_from_fundraising(conn, investor_name, contact, actor_user_id
next_last = last_name if (last_name or full_name) else str(existing['last_name'] or '')
next_email = email or str(existing['email'] or '')
next_title = title or str(existing['title'] or '')
next_source = source or str(existing['source'] or '')
next_city = city or str(existing['city'] or '')
next_state = state or str(existing['state'] or '')
next_country = country or str(existing['country'] or '')
next_location_query = location_query or str(existing['location_query'] or '')
next_org = org_id or existing['organization_id']
conn.execute("""
UPDATE contacts
SET first_name = ?, last_name = ?, email = ?, title = ?,
organization_id = ?, contact_type = 'investor', updated_at = ?
organization_id = ?, source = ?, contact_type = 'investor', city = ?, state = ?, country = ?, location_query = ?, updated_at = ?
WHERE id = ?
""", (next_first, next_last, next_email, next_title, next_org, now(), existing['id']))
""", (next_first, next_last, next_email, next_title, next_org, next_source, next_city, next_state, next_country, next_location_query, now(), existing['id']))
return existing['id']
contact_id = generate_id()
conn.execute("""
INSERT INTO contacts (
id, first_name, last_name, email, title, organization_id, contact_type, status, created_by, updated_at
) VALUES (?, ?, ?, ?, ?, ?, 'investor', 'active', ?, ?)
id, first_name, last_name, email, title, organization_id, source, contact_type, status, city, state, country, location_query, created_by, updated_at
) VALUES (?, ?, ?, ?, ?, ?, ?, 'investor', 'active', ?, ?, ?, ?, ?, ?)
""", (
contact_id,
first_name or 'Unknown',
@@ -721,6 +811,11 @@ def _upsert_contact_from_fundraising(conn, investor_name, contact, actor_user_id
email,
title,
org_id,
source,
city,
state,
country,
location_query,
actor_user_id,
now()
))
@@ -748,6 +843,11 @@ def _sync_contact_to_fundraising_state(conn, contact_row, actor_user_id=None, re
email = str(contact_row.get('email') or '').strip()
full_name = ' '.join([str(contact_row.get('first_name') or '').strip(), str(contact_row.get('last_name') or '').strip()]).strip()
title = str(contact_row.get('title') or '').strip()
source = str(contact_row.get('source') or '').strip()
city = str(contact_row.get('city') or '').strip()
state = str(contact_row.get('state') or '').strip()
country = str(contact_row.get('country') or '').strip()
location_query = str(contact_row.get('location_query') or '').strip()
if not full_name and not email:
return
@@ -814,22 +914,24 @@ def _sync_contact_to_fundraising_state(conn, contact_row, actor_user_id=None, re
"name": full_name or existing.get('name') or '',
"email": email or existing.get('email') or '',
"title": title or existing.get('title') or '',
"city": str(existing.get('city') or ''),
"state": str(existing.get('state') or ''),
"country": str(existing.get('country') or ''),
"location_query": str(existing.get('location_query') or '')
"city": city or str(existing.get('city') or ''),
"state": state or str(existing.get('state') or ''),
"country": country or str(existing.get('country') or ''),
"location_query": location_query or str(existing.get('location_query') or '')
}
else:
next_contacts.append({
"name": full_name,
"email": email,
"title": title,
"city": "",
"state": "",
"country": "",
"location_query": ""
"city": city,
"state": state,
"country": country,
"location_query": location_query
})
inv['contacts'] = next_contacts
if source and not str(inv.get('lead_source') or '').strip():
inv['lead_source'] = source
changed = True
if not changed:
@@ -901,6 +1003,7 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
investor_name = str(row.get('investor_name') or '').strip() or 'Untitled Investor'
notes = str(row.get('notes') or '')
lead = str(row.get('lead') or '')
lead_source = str(row.get('lead_source') or row.get('combined_lead_source') or '').strip()
total_invested = 0.0
for _, col in fund_columns:
total_invested += _to_number(row.get(str(col.get('id'))))
@@ -909,13 +1012,17 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
investor_id = existing['id'] if existing else generate_id()
conn.execute("""
INSERT INTO fundraising_investors (
id, investor_name, notes, lead, priority, follow_up, graveyard,
id, investor_name, notes, lead, lead_source, priority, follow_up, graveyard,
source_row_id, total_invested, updated_at
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(source_row_id) DO UPDATE SET
investor_name = excluded.investor_name,
notes = excluded.notes,
lead = excluded.lead,
lead_source = CASE
WHEN COALESCE(trim(fundraising_investors.lead_source), '') = '' THEN excluded.lead_source
ELSE fundraising_investors.lead_source
END,
priority = excluded.priority,
follow_up = excluded.follow_up,
graveyard = excluded.graveyard,
@@ -926,6 +1033,7 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
investor_name,
notes,
lead,
lead_source,
1 if _to_bool(row.get('priority')) else 0,
1 if _to_bool(row.get('follow_up')) else 0,
1 if _to_bool(row.get('graveyard')) else 0,
@@ -947,7 +1055,10 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
email = str(c.get('email') or '').strip()
if not full_name and not email:
continue
_upsert_contact_from_fundraising(conn, investor_name, c, actor_user_id=actor_user_id)
contact_payload = dict(c)
if lead_source and not str(contact_payload.get('source') or '').strip():
contact_payload['source'] = lead_source
_upsert_contact_from_fundraising(conn, investor_name, contact_payload, actor_user_id=actor_user_id)
conn.execute("""
INSERT INTO fundraising_contacts (
id, investor_id, full_name, email, title, city, state, country, location_query, sort_order, updated_at
@@ -958,7 +1069,7 @@ def sync_fundraising_relational(conn, grid, views, actor_user_id=None):
str(c.get('location_query') or ''), i, now()
))
elif isinstance(contacts, str) and contacts.strip():
_upsert_contact_from_fundraising(conn, investor_name, {"name": contacts.strip(), "email": "", "title": ""}, actor_user_id=actor_user_id)
_upsert_contact_from_fundraising(conn, investor_name, {"name": contacts.strip(), "email": "", "title": "", "source": lead_source}, actor_user_id=actor_user_id)
conn.execute("""
INSERT INTO fundraising_contacts (
id, investor_id, full_name, email, title, city, state, country, location_query, sort_order, updated_at
@@ -1291,6 +1402,7 @@ DEFAULT_FUNDRAISING_COLUMNS = [
{"id": "contacts", "label": "Contacts", "type": "contacts", "width": 260},
{"id": "log_action", "label": "Log", "type": "action", "readOnly": True, "width": 90},
{"id": "notes", "label": "Notes / Communication / Outreach", "type": "longtext", "width": 420},
{"id": "lead_source", "label": "Lead Source", "type": "text", "width": 180},
{"id": "notes_last_modified", "label": "Notes Last Modified", "type": "date", "readOnly": True, "width": 180},
{"id": "last_communication_date", "label": "Last Communication Date", "type": "date", "readOnly": True, "width": 195},
{"id": "priority", "label": "Priority", "type": "checkbox", "width": 110},
@@ -1314,7 +1426,12 @@ DEFAULT_FUNDRAISING_ROWS = []
class CRMHandler(BaseHTTPRequestHandler):
"""Main HTTP request handler for the CRM API."""
_rate_limit_buckets = {}
# Class-level state shared across all handler threads. Protected by
# _abuse_lock; see rate_limited() and record_404() for usage.
_rate_limit_buckets = {} # (scope, ip) -> [timestamps]
_404_buckets = {} # ip -> [timestamps] of recent 404 responses
_banned_ips = {} # ip -> ban_until_epoch
_abuse_lock = threading.Lock()
def log_message(self, format, *args):
"""Override to use cleaner logging."""
@@ -1323,14 +1440,22 @@ class CRMHandler(BaseHTTPRequestHandler):
# ── Request Parsing ──
def get_body(self):
# Cache parsed body on the request handler instance so repeated
# calls don't try to re-read an already-consumed stream. Handler
# instances are one-per-request in ThreadingHTTPServer, so the
# cache is naturally request-scoped.
if hasattr(self, '_cached_body'):
return self._cached_body
content_length = int(self.headers.get('Content-Length', 0))
if content_length == 0:
return {}
self._cached_body = {}
return self._cached_body
body = self.rfile.read(content_length)
try:
return json.loads(body.decode('utf-8'))
self._cached_body = json.loads(body.decode('utf-8'))
except json.JSONDecodeError:
return {}
self._cached_body = {}
return self._cached_body
def get_query_params(self):
parsed = urlparse(self.path)
@@ -1373,15 +1498,57 @@ class CRMHandler(BaseHTTPRequestHandler):
def rate_limited(self, scope, limit_per_minute):
now_ts = time.time()
bucket_key = f"{scope}:{self.get_client_ip()}"
bucket = self._rate_limit_buckets.get(bucket_key, [])
cutoff = now_ts - 60.0
bucket = [t for t in bucket if t >= cutoff]
if len(bucket) >= max(1, int(limit_per_minute)):
with self._abuse_lock:
bucket = self._rate_limit_buckets.get(bucket_key, [])
cutoff = now_ts - 60.0
bucket = [t for t in bucket if t >= cutoff]
if len(bucket) >= max(1, int(limit_per_minute)):
self._rate_limit_buckets[bucket_key] = bucket
return True
bucket.append(now_ts)
self._rate_limit_buckets[bucket_key] = bucket
return False
    def is_banned(self):
        """Return True if the client IP is currently in the abuse blacklist.

        Uses a coarse class-level dict; fine for a handful of scanners hitting
        a small team CRM. Auto-expires entries when their ban window passes.
        """
        ip = self.get_client_ip()
        now_ts = time.time()
        with self._abuse_lock:
            until = self._banned_ips.get(ip)
            if until is None:
                return False
            if now_ts >= until:
                self._banned_ips.pop(ip, None)
                return False
            return True
bucket.append(now_ts)
self._rate_limit_buckets[bucket_key] = bucket
return False
    def record_404(self):
        """Track 404s per IP and auto-ban IPs that exceed the burst threshold.

        Called from send_error_json whenever we send a 404. A scanner probing
        /.env, /.git/config, /swagger, /actuator/env etc. will trip this fast
        and get parked on the blacklist for ABUSE_BAN_SEC seconds.
        """
        ip = self.get_client_ip()
        now_ts = time.time()
        with self._abuse_lock:
            bucket = self._404_buckets.get(ip, [])
            cutoff = now_ts - ABUSE_404_WINDOW_SEC
            bucket = [t for t in bucket if t >= cutoff]
            bucket.append(now_ts)
            if len(bucket) >= ABUSE_404_THRESHOLD:
                self._banned_ips[ip] = now_ts + ABUSE_BAN_SEC
                self._404_buckets.pop(ip, None)
                sys.stderr.write(
                    f"[abuse] Banning {ip} for {ABUSE_BAN_SEC}s after "
                    f"{len(bucket)} 404s in {ABUSE_404_WINDOW_SEC}s\n"
                )
            else:
                self._404_buckets[ip] = bucket
# ── Response Helpers ──
@@ -1395,6 +1562,14 @@ class CRMHandler(BaseHTTPRequestHandler):
self.wfile.write(json.dumps(data, default=str).encode('utf-8'))
    def send_error_json(self, message, status=400):
        # Record 404s for the abuser-detection blacklist before responding.
        # We do this on send (not on routing) so that any code path that 404s
        # contributes to the burst counter, including unknown POST paths.
        if status == 404:
            try:
                self.record_404()
            except Exception:
                pass
        self.send_json({"error": message}, status)
def send_file(self, filepath, content_type='text/html'):
@@ -1412,6 +1587,8 @@ class CRMHandler(BaseHTTPRequestHandler):
# ── Routing ──
def do_OPTIONS(self):
if self.is_banned():
return self.send_error_json("Too many requests", 429)
self.send_response(200)
self.send_header('Access-Control-Allow-Origin', CORS_ORIGIN)
self.send_header('Access-Control-Allow-Methods', 'GET, POST, PUT, PATCH, DELETE, OPTIONS')
@@ -1419,8 +1596,25 @@ class CRMHandler(BaseHTTPRequestHandler):
self.end_headers()
def do_GET(self):
# Short-circuit known abusers before any work, including file reads.
if self.is_banned():
return self.send_error_json("Too many requests", 429)
# Generic per-IP GET rate limit. Generous enough for a normal user
# (page load fans out ~15 GETs + heartbeats every ~6s) but blocks a
# scanner blasting hundreds of paths per second.
if self.rate_limited('get', GET_RATE_LIMIT_PER_MIN):
return self.send_error_json("Too many requests", 429)
path = self.get_path()
# ─── Gmail integration routes (feature-flag-guarded) ─────────
try:
from email_integration.routes import try_handle as _email_try_handle
if _email_try_handle(self):
return
except ImportError:
pass
# Serve frontend
if path == '/' or path == '/index.html':
return self.send_file(os.path.join(FRONTEND_DIR, 'index.html'))
@@ -1510,6 +1704,8 @@ class CRMHandler(BaseHTTPRequestHandler):
# Fundraising grid state
if path == '/api/fundraising/state':
return self.handle_get_fundraising_state(user)
if path == '/api/fundraising/collab/state':
return self.handle_get_fundraising_collab_state(user)
if path == '/api/fundraising/export':
return self.handle_export_fundraising_state(user)
if path == '/api/fundraising/backups':
@@ -1538,12 +1734,23 @@ class CRMHandler(BaseHTTPRequestHandler):
self.send_error_json("Not found", 404)
def do_POST(self):
if self.is_banned():
return self.send_error_json("Too many requests", 429)
path = self.get_path()
body = self.get_body()
if self.rate_limited('write', WRITE_RATE_LIMIT_PER_MIN):
return self.send_error_json("Too many requests", 429)
# ─── Gmail integration routes (feature-flag-guarded) ─────────
try:
from email_integration.routes import try_handle as _email_try_handle
if _email_try_handle(self):
return
except ImportError:
pass
# Auth (no token needed)
if path == '/api/auth/login':
if self.rate_limited('login', LOGIN_RATE_LIMIT_PER_MIN):
@@ -1573,8 +1780,12 @@ class CRMHandler(BaseHTTPRequestHandler):
return self.handle_create_feature_request(user, body)
if path == '/api/fundraising/log-communication':
return self.handle_log_fundraising_communication(user, body)
if path == '/api/fundraising/collab/heartbeat':
return self.handle_fundraising_collab_heartbeat(user, body)
if path == '/api/admin/users':
return self.handle_admin_create_user(user, body)
if path == '/api/admin/reset-all-data':
return self.handle_admin_reset_all_data(user, body)
if path == '/api/fundraising/backup':
return self.handle_backup_fundraising_state(user)
if path == '/api/fundraising/restore-preview':
@@ -1587,6 +1798,8 @@ class CRMHandler(BaseHTTPRequestHandler):
self.send_error_json("Not found", 404)
def do_PUT(self):
if self.is_banned():
return self.send_error_json("Too many requests", 429)
path = self.get_path()
body = self.get_body()
if self.rate_limited('write', WRITE_RATE_LIMIT_PER_MIN):
@@ -1611,6 +1824,8 @@ class CRMHandler(BaseHTTPRequestHandler):
self.send_error_json("Not found", 404)
def do_PATCH(self):
if self.is_banned():
return self.send_error_json("Too many requests", 429)
path = self.get_path()
body = self.get_body()
if self.rate_limited('write', WRITE_RATE_LIMIT_PER_MIN):
@@ -1637,6 +1852,8 @@ class CRMHandler(BaseHTTPRequestHandler):
self.send_error_json("Not found", 404)
def do_DELETE(self):
if self.is_banned():
return self.send_error_json("Too many requests", 429)
path = self.get_path()
if self.rate_limited('write', WRITE_RATE_LIMIT_PER_MIN):
return self.send_error_json("Too many requests", 429)
@@ -1761,8 +1978,8 @@ class CRMHandler(BaseHTTPRequestHandler):
args.append(params['status'])
if params.get('search'):
search = f"%{params['search']}%"
query += " AND (c.first_name LIKE ? OR c.last_name LIKE ? OR c.email LIKE ? OR o.name LIKE ?)"
args.extend([search, search, search, search])
query += " AND (c.first_name LIKE ? OR c.last_name LIKE ? OR c.email LIKE ? OR o.name LIKE ? OR c.source LIKE ?)"
args.extend([search, search, search, search, search])
if params.get('organization_id'):
query += " AND c.organization_id = ?"
args.append(params['organization_id'])
@@ -1772,7 +1989,7 @@ class CRMHandler(BaseHTTPRequestHandler):
sort = params.get('sort', 'updated_at')
order = 'DESC' if params.get('order', 'desc').lower() == 'desc' else 'ASC'
allowed_sorts = ['first_name', 'last_name', 'email', 'created_at', 'updated_at', 'contact_type']
allowed_sorts = ['first_name', 'last_name', 'email', 'created_at', 'updated_at', 'contact_type', 'source']
if sort in allowed_sorts:
query += f" ORDER BY c.{sort} {order}"
else:
@@ -1837,20 +2054,25 @@ class CRMHandler(BaseHTTPRequestHandler):
contact_id = generate_id()
conn = get_db()
organization_id = body.get('organization_id')
if not organization_id and body.get('organization'):
organization_id = _ensure_organization_by_name(conn, body.get('organization'), user['user_id'])
tags = json.dumps(body.get('tags', []))
conn.execute("""
INSERT INTO contacts (id, first_name, last_name, email, phone, mobile, title,
organization_id, contact_type, status, source, tags, notes, linkedin_url,
preferred_contact, created_by)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
city, state, country, location_query, preferred_contact, created_by)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
contact_id, body['first_name'], body['last_name'],
body.get('email'), body.get('phone'), body.get('mobile'),
body.get('title'), body.get('organization_id'),
body.get('title'), organization_id,
body.get('contact_type', 'prospect'), body.get('status', 'active'),
body.get('source'), tags, body.get('notes'),
body.get('linkedin_url'), body.get('preferred_contact', 'email'),
body.get('linkedin_url'), body.get('city'), body.get('state'),
body.get('country'), body.get('location_query'),
body.get('preferred_contact', 'email'),
user['user_id']
))
@@ -1879,13 +2101,17 @@ class CRMHandler(BaseHTTPRequestHandler):
updatable = ['first_name', 'last_name', 'email', 'phone', 'mobile', 'title',
'organization_id', 'contact_type', 'status', 'source', 'notes',
'linkedin_url', 'preferred_contact']
'linkedin_url', 'city', 'state', 'country', 'location_query', 'preferred_contact']
sets = []
args = []
for field in updatable:
if field in body:
sets.append(f"{field} = ?")
args.append(body[field])
if 'organization' in body and 'organization_id' not in body:
org_id = _ensure_organization_by_name(conn, body.get('organization'), user['user_id'])
sets.append("organization_id = ?")
args.append(org_id)
if 'tags' in body:
sets.append("tags = ?")
@@ -2876,12 +3102,17 @@ class CRMHandler(BaseHTTPRequestHandler):
entity_type = body.get('entity_type', 'contacts')
mapping = body.get('mapping', {})
dry_run = body.get('dry_run', False)
update_existing = bool(body.get('update_existing', True))
action_overrides_raw = body.get('action_overrides', {}) or {}
if not csv_data:
return self.send_error_json("No data provided. Send 'data' as array of objects.")
conn = get_db()
results = {"created": 0, "updated": 0, "skipped": 0, "errors": []}
results = {"created": 0, "updated": 0, "skipped": 0, "errors": [], "matches": []}
# Keep in-memory email matches so dry-run mirrors real behavior for
# duplicate emails appearing multiple times in the same CSV batch.
batch_email_matches = {}
try:
for i, row in enumerate(csv_data):
@@ -2911,11 +3142,44 @@ class CRMHandler(BaseHTTPRequestHandler):
continue
email = data.get('email', '').strip()
email_key = email.lower()
linkedin_url = data.get('linkedin_url', data.get('linkedin', '')).strip()
city = data.get('city', '').strip()
state = data.get('state', '').strip()
country = data.get('country', '').strip()
location_query = data.get('location_query', '').strip()
raw_location = data.get('location', data.get('city_location', data.get('city/location', ''))).strip()
if raw_location:
p_city, p_state, p_country, p_query = _parse_location_text(raw_location)
city = city or p_city
state = state or p_state
country = country or p_country
location_query = location_query or p_query
# Check for existing contact by email
existing = None
existing_summary = None
if email:
existing = conn.execute("SELECT id FROM contacts WHERE email = ?", (email,)).fetchone()
if email_key in batch_email_matches:
existing_summary = batch_email_matches[email_key]
existing = {"id": existing_summary.get('id')}
else:
existing = conn.execute("""
SELECT c.id, c.first_name, c.last_name, c.email, o.name as organization_name
FROM contacts c
LEFT JOIN organizations o ON c.organization_id = o.id
WHERE lower(c.email) = lower(?)
ORDER BY c.updated_at DESC
LIMIT 1
""", (email,)).fetchone()
if existing:
existing_summary = {
"id": existing['id'],
"name": f"{str(existing['first_name'] or '').strip()} {str(existing['last_name'] or '').strip()}".strip(),
"email": str(existing['email'] or ''),
"organization": str(existing['organization_name'] or '')
}
batch_email_matches[email_key] = existing_summary
# Handle organization
org_id = None
@@ -2931,32 +3195,149 @@ class CRMHandler(BaseHTTPRequestHandler):
(org_id, org_name, user['user_id'])
)
action_override = None
if isinstance(action_overrides_raw, dict):
action_override = action_overrides_raw.get(str(i + 1)) or action_overrides_raw.get(i + 1)
default_action = 'update' if update_existing else 'skip'
action = action_override if action_override in ('update', 'skip', 'create_duplicate') else default_action
if existing:
incoming_name = f"{first_name} {last_name}".strip()
results['matches'].append({
"row": i + 1,
"incoming_name": incoming_name,
"incoming_email": email,
"incoming_organization": org_name,
"existing_id": existing_summary.get('id') if isinstance(existing_summary, dict) else existing['id'],
"existing_name": existing_summary.get('name') if isinstance(existing_summary, dict) else '',
"existing_email": existing_summary.get('email') if isinstance(existing_summary, dict) else email,
"existing_organization": existing_summary.get('organization') if isinstance(existing_summary, dict) else '',
"default_action": default_action,
"action": action
})
if not dry_run:
if existing:
conn.execute("""
UPDATE contacts SET first_name=?, last_name=?, phone=?, title=?,
organization_id=COALESCE(?, organization_id),
contact_type=COALESCE(?, contact_type), updated_at=?
WHERE id=?
""", (first_name, last_name, data.get('phone'),
data.get('title'), org_id,
data.get('contact_type'), now(), existing['id']))
results['updated'] += 1
if action == 'update':
conn.execute("""
UPDATE contacts SET first_name=?, last_name=?, phone=?, title=?,
organization_id=COALESCE(?, organization_id),
contact_type=COALESCE(?, contact_type),
linkedin_url=COALESCE(?, linkedin_url),
city=COALESCE(?, city),
state=COALESCE(?, state),
country=COALESCE(?, country),
location_query=COALESCE(?, location_query),
updated_at=?
WHERE id=?
""", (first_name, last_name, data.get('phone'),
data.get('title'), org_id,
data.get('contact_type'),
linkedin_url if linkedin_url else None,
city if city else None,
state if state else None,
country if country else None,
location_query if location_query else None,
now(), existing['id']))
if email:
batch_email_matches[email_key] = {
"id": existing['id'],
"name": f"{first_name} {last_name}".strip(),
"email": email,
"organization": org_name
}
updated_contact = row_to_dict(conn.execute("""
SELECT c.*, o.name as organization_name
FROM contacts c LEFT JOIN organizations o ON c.organization_id = o.id
WHERE c.id = ?
""", (existing['id'],)).fetchone())
_sync_contact_to_fundraising_state(conn, updated_contact, actor_user_id=user['user_id'], remove=False)
results['updated'] += 1
elif action == 'create_duplicate':
contact_id = generate_id()
conn.execute("""
INSERT INTO contacts (id, first_name, last_name, email, phone,
title, organization_id, contact_type, status, source,
linkedin_url, city, state, country, location_query, created_by)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, 'active', 'import', ?, ?, ?, ?, ?, ?)
""", (contact_id, first_name, last_name, email,
data.get('phone'), data.get('title'), org_id,
data.get('contact_type', 'prospect'), linkedin_url,
city, state, country, location_query, user['user_id']))
if email:
batch_email_matches[email_key] = {
"id": contact_id,
"name": f"{first_name} {last_name}".strip(),
"email": email,
"organization": org_name
}
created_contact = row_to_dict(conn.execute("""
SELECT c.*, o.name as organization_name
FROM contacts c LEFT JOIN organizations o ON c.organization_id = o.id
WHERE c.id = ?
""", (contact_id,)).fetchone())
_sync_contact_to_fundraising_state(conn, created_contact, actor_user_id=user['user_id'], remove=False)
results['created'] += 1
else:
results['skipped'] += 1
results['errors'].append(f"Row {i+1}: Existing contact matched by email; skipped")
else:
contact_id = generate_id()
conn.execute("""
INSERT INTO contacts (id, first_name, last_name, email, phone,
title, organization_id, contact_type, status, source, created_by)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, 'active', 'import', ?)
title, organization_id, contact_type, status, source,
linkedin_url, city, state, country, location_query, created_by)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, 'active', 'import', ?, ?, ?, ?, ?, ?)
""", (contact_id, first_name, last_name, email,
data.get('phone'), data.get('title'), org_id,
data.get('contact_type', 'prospect'), user['user_id']))
data.get('contact_type', 'prospect'), linkedin_url,
city, state, country, location_query, user['user_id']))
if email:
batch_email_matches[email_key] = {
"id": contact_id,
"name": f"{first_name} {last_name}".strip(),
"email": email,
"organization": org_name
}
created_contact = row_to_dict(conn.execute("""
SELECT c.*, o.name as organization_name
FROM contacts c LEFT JOIN organizations o ON c.organization_id = o.id
WHERE c.id = ?
""", (contact_id,)).fetchone())
_sync_contact_to_fundraising_state(conn, created_contact, actor_user_id=user['user_id'], remove=False)
results['created'] += 1
else:
if existing:
results['updated'] += 1
if action == 'update':
results['updated'] += 1
if email:
batch_email_matches[email_key] = {
"id": existing['id'],
"name": f"{first_name} {last_name}".strip(),
"email": email,
"organization": org_name
}
elif action == 'create_duplicate':
results['created'] += 1
if email:
batch_email_matches[email_key] = {
"id": f"dryrun-{i+1}",
"name": f"{first_name} {last_name}".strip(),
"email": email,
"organization": org_name
}
else:
results['skipped'] += 1
results['errors'].append(f"Row {i+1}: Existing contact matched by email; would be skipped")
else:
results['created'] += 1
if email:
# Simulate that the row now exists for subsequent duplicate-email rows.
batch_email_matches[email_key] = {
"id": f"dryrun-{i+1}",
"name": f"{first_name} {last_name}".strip(),
"email": email,
"organization": org_name
}
except Exception as e:
results['errors'].append(f"Row {i+1}: {str(e)}")
@@ -3141,6 +3522,58 @@ class CRMHandler(BaseHTTPRequestHandler):
conn.close()
return self.send_json({"data": updated})
    def handle_admin_reset_all_data(self, user, body):
        if not require_admin(user):
            return self.send_error_json("Admin only", 403)
        confirm_phrase = str(body.get('confirm_phrase') or '').strip()
        if confirm_phrase != 'RESET ALL DATA':
            return self.send_error_json("Confirmation phrase must be exactly: RESET ALL DATA", 400)
        conn = get_db()
        try:
            ensure_fundraising_state_row(conn)
            state = conn.execute("SELECT * FROM fundraising_state WHERE id = 'main'").fetchone()
            pre_backup = create_fundraising_backup_file(state, kind="pre_restore") if state else None
            conn.execute("DELETE FROM communications")
            conn.execute("DELETE FROM opportunities")
            conn.execute("DELETE FROM lp_profiles")
            conn.execute("DELETE FROM custom_field_values")
            conn.execute("DELETE FROM custom_fields")
            conn.execute("DELETE FROM feature_requests")
            conn.execute("DELETE FROM contacts")
            conn.execute("DELETE FROM organizations")
            default_grid = {
                "columns": deep_copy_json(DEFAULT_FUNDRAISING_COLUMNS),
                "rows": deep_copy_json(DEFAULT_FUNDRAISING_ROWS)
            }
            default_views = sanitize_grid_views(deep_copy_json(DEFAULT_GRID_VIEWS))
            conn.execute("""
                UPDATE fundraising_state
                SET grid_json = ?, views_json = ?, version = COALESCE(version, 1) + 1, updated_by = ?, updated_at = ?
                WHERE id = 'main'
            """, (json.dumps(default_grid), json.dumps(default_views), user['user_id'], now()))
            sync_fundraising_relational(conn, default_grid, default_views, actor_user_id=user['user_id'])
            log_audit(conn, user['user_id'], 'system', 'all-data', 'reset', {
                "pre_backup": pre_backup['filename'] if pre_backup else None
            })
            conn.commit()
        except Exception as exc:
            conn.rollback()
            conn.close()
            return self.send_error_json(f"Failed to reset data: {str(exc)}", 500)
        conn.close()
        return self.send_json({
            "data": {
                "status": "ok",
                "pre_backup": pre_backup
            }
        })
def handle_list_audit_log(self, user, params):
if not require_admin(user):
return self.send_error_json("Admin access required", 403)
@@ -3574,6 +4007,128 @@ class CRMHandler(BaseHTTPRequestHandler):
}
})
    def _cleanup_fundraising_collab(self, conn):
        now_epoch = int(time.time())
        conn.execute("DELETE FROM fundraising_presence WHERE expires_at_epoch <= ?", (now_epoch,))
        conn.execute("DELETE FROM fundraising_cell_locks WHERE expires_at_epoch <= ?", (now_epoch,))
    def _list_fundraising_collab_state(self, conn):
        presence_rows = rows_to_list(conn.execute("""
            SELECT user_id, username, full_name, active_view, row_id, col_id, is_editing, cell_key, last_seen_at
            FROM fundraising_presence
            ORDER BY last_seen_at DESC
        """).fetchall())
        lock_rows = rows_to_list(conn.execute("""
            SELECT cell_key, row_id, col_id, locked_by_user_id, locked_by_username, locked_by_full_name, last_seen_at
            FROM fundraising_cell_locks
            ORDER BY last_seen_at DESC
        """).fetchall())
        for row in presence_rows:
            row['is_editing'] = bool(row.get('is_editing'))
        return {"presence": presence_rows, "locks": lock_rows}
    def handle_get_fundraising_collab_state(self, user):
        conn = get_db()
        self._cleanup_fundraising_collab(conn)
        snapshot = self._list_fundraising_collab_state(conn)
        conn.commit()
        conn.close()
        return self.send_json({"data": snapshot})
    def handle_fundraising_collab_heartbeat(self, user, body):
        active_view = str(body.get('active_view') or '').strip()
        selected = body.get('selected') if isinstance(body.get('selected'), dict) else {}
        editing = body.get('editing') if isinstance(body.get('editing'), dict) else {}
        selected_row_id = str(selected.get('row_id') or '').strip()
        selected_col_id = str(selected.get('col_id') or '').strip()
        editing_row_id = str(editing.get('row_id') or '').strip()
        editing_col_id = str(editing.get('col_id') or '').strip()
        is_editing = bool(editing_row_id and editing_col_id)
        # Clamp the client-supplied TTL to the 10..120 second window.
        ttl_seconds = int(body.get('ttl_seconds') or 25)
        ttl_seconds = max(10, min(120, ttl_seconds))
        now_epoch = int(time.time())
        expires_at_epoch = now_epoch + ttl_seconds
        seen_at = now()
        lock_conflict = None

        conn = get_db()
        self._cleanup_fundraising_collab(conn)
        user_row = conn.execute("SELECT username, full_name FROM users WHERE id = ?", (user['user_id'],)).fetchone()
        username = str(user_row['username']) if user_row and user_row['username'] else str(user.get('username') or '')
        full_name = str(user_row['full_name']) if user_row and user_row['full_name'] else ''
        editing_cell_key = f"{editing_row_id}:{editing_col_id}" if is_editing else None

        if is_editing and editing_cell_key:
            # Another user's unexpired lock on this cell wins; report it as a conflict.
            existing_lock = conn.execute("""
                SELECT cell_key, row_id, col_id, locked_by_user_id, locked_by_username, locked_by_full_name, last_seen_at
                FROM fundraising_cell_locks
                WHERE cell_key = ? AND locked_by_user_id != ? AND expires_at_epoch > ?
                LIMIT 1
            """, (editing_cell_key, user['user_id'], now_epoch)).fetchone()
            if existing_lock:
                lock_conflict = row_to_dict(existing_lock)
                is_editing = False
                editing_cell_key = None
            else:
                # Take (or refresh) the lock, then release any other cells this user held.
                conn.execute("""
                    INSERT INTO fundraising_cell_locks (
                        cell_key, row_id, col_id, locked_by_user_id, locked_by_username, locked_by_full_name, last_seen_at, expires_at_epoch
                    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                    ON CONFLICT(cell_key) DO UPDATE SET
                        row_id = excluded.row_id,
                        col_id = excluded.col_id,
                        locked_by_user_id = excluded.locked_by_user_id,
                        locked_by_username = excluded.locked_by_username,
                        locked_by_full_name = excluded.locked_by_full_name,
                        last_seen_at = excluded.last_seen_at,
                        expires_at_epoch = excluded.expires_at_epoch
                """, (editing_cell_key, editing_row_id, editing_col_id, user['user_id'], username, full_name, seen_at, expires_at_epoch))
                conn.execute("""
                    DELETE FROM fundraising_cell_locks
                    WHERE locked_by_user_id = ? AND cell_key != ?
                """, (user['user_id'], editing_cell_key))
        else:
            # Not editing: release every lock this user still holds.
            conn.execute("DELETE FROM fundraising_cell_locks WHERE locked_by_user_id = ?", (user['user_id'],))

        conn.execute("""
            INSERT INTO fundraising_presence (
                user_id, username, full_name, active_view, row_id, col_id, is_editing, cell_key, last_seen_at, expires_at_epoch
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(user_id) DO UPDATE SET
                username = excluded.username,
                full_name = excluded.full_name,
                active_view = excluded.active_view,
                row_id = excluded.row_id,
                col_id = excluded.col_id,
                is_editing = excluded.is_editing,
                cell_key = excluded.cell_key,
                last_seen_at = excluded.last_seen_at,
                expires_at_epoch = excluded.expires_at_epoch
        """, (
            user['user_id'],
            username,
            full_name,
            active_view,
            selected_row_id or editing_row_id,
            selected_col_id or editing_col_id,
            1 if is_editing else 0,
            editing_cell_key,
            seen_at,
            expires_at_epoch
        ))
        snapshot = self._list_fundraising_collab_state(conn)
        conn.commit()
        conn.close()
        return self.send_json({
            "data": {
                **snapshot,
                "lock_conflict": lock_conflict
            }
        })

    def handle_update_fundraising_state(self, user, body):
        grid = body.get('grid', {})
        views = body.get('views')
@@ -3948,7 +4503,20 @@ def main():
        print("Demo data seeding disabled (set CRM_SEED_DEMO_DATA=1 to enable).")
    start_backup_scheduler()
    server = HTTPServer((HOST, PORT), CRMHandler)
    # ─── Gmail sync scheduler (feature-flag-guarded) ─────────────────
    if os.environ.get("CRM_GMAIL_INTEGRATION_ENABLED", "").lower() in ("1", "true", "yes", "on"):
        try:
            from email_integration.scheduler import start_sync_scheduler
            start_sync_scheduler()
            print("[email_integration] Gmail sync scheduler started")
        except Exception as _e:
            print(f"[email_integration] failed to start scheduler: {_e}")
    # ThreadingHTTPServer lets one slow request (or a wave of scanner probes)
    # not block legit users. SQLite is opened per-request via get_db(), and
    # WAL mode allows concurrent readers + a single writer, so this is safe.
    server = ThreadingHTTPServer((HOST, PORT), CRMHandler)
    server.daemon_threads = True
    print(f"\n{'='*60}")
    print(f" Venture Fund CRM Server")
    print(f" Running at http://{HOST}:{PORT}")
BIN
Binary file not shown.
+202
@@ -0,0 +1,202 @@
# Embeddings + Retrieval through Spark Control (v0.15.0)
Spark Control now fronts a local RAG stack so your agent/CRM system can do
dense embeddings, reranking, and hybrid vector search against one trusted host
— same TLS cert and allowlist as the LLM and audio endpoints.
## What runs where
| Component | Host | Port | Role |
|---|---|---|---|
| **spark-embed** | Spark 2 (GPU) | 8088 | `BAAI/bge-m3` dense embeddings (1024-d) + `BAAI/bge-reranker-v2-m3` cross-encoder rerank |
| **Qdrant** | Spark 2 (CPU) | 6333/6334 | Vector storage, hybrid dense+sparse retrieval, RRF fusion, payload filtering |
| **Spark Control** | Start9 | (your LAN URL) | Proxies all of the above behind one host |
`spark-embed` is a small FastAPI server built **from the NGC PyTorch image**
(the torch we've proven runs on the GB10's sm_121 GPU). We did *not* use HF Text
Embeddings Inference because as of 2026 it ships no arm64 CUDA image. No
torchaudio, no flash-attn — bge-m3 + the reranker are plain XLM-RoBERTa
encoders.
## Endpoints (all on the Spark Control host)
### `POST /v1/embeddings` — dense embeddings (OpenAI-compatible)
```json
// request
{ "input": "text to embed", "model": "BAAI/bge-m3" }
// or { "input": ["batch", "of", "texts"] }
// response (OpenAI shape)
{
"object": "list",
"data": [ { "object": "embedding", "index": 0, "embedding": [0.01, ...1024 floats] } ],
"model": "BAAI/bge-m3",
"usage": { "prompt_tokens": 0, "total_tokens": 0 }
}
```
Vectors are L2-normalized by default (cosine == dot product). Works with the
stock OpenAI Python/JS client by pointing `base_url` at Spark Control.
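The "cosine == dot product" remark can be checked with a small sketch. It uses toy 3-d vectors in place of real 1024-d embeddings and a hand-built response in the OpenAI shape shown above; the helper names (`l2_normalize`, `extract_vectors`) are illustrative assumptions, not part of spark-embed or any client library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, as the endpoint is described as doing by default."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def extract_vectors(response):
    """Pull embeddings out of an OpenAI-shaped response, ordered by `index`."""
    items = sorted(response["data"], key=lambda d: d["index"])
    return [d["embedding"] for d in items]

# Hand-built stand-in for a real response (toy 3-d vectors, deliberately out of order).
fake_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": l2_normalize([0.0, 2.0, 1.0])},
        {"object": "embedding", "index": 0, "embedding": l2_normalize([3.0, 4.0, 0.0])},
    ],
}

a, b = extract_vectors(fake_response)
# For unit-length vectors the denominator of cosine() is 1, so both values agree.
print(abs(dot(a, b) - cosine(a, b)) < 1e-9)
```

In practice this means a plain dot product is enough for similarity once vectors come back normalized.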
### `POST /v1/rerank` — cross-encoder rerank
```json
// request
{ "query": "did Brightwater commit?", "documents": ["chunk a", "chunk b", ...],
"top_n": 5, "return_documents": false }
// response (Cohere-ish)
{ "object": "rerank.result", "model": "BAAI/bge-reranker-v2-m3",
"results": [ { "index": 3, "relevance_score": 5.21 }, { "index": 0, "relevance_score": 1.04 }, ... ] }
```
`relevance_score` is the reranker's raw logit (higher = more relevant; unbounded,
roughly -10..+10). Sort desc. Send a candidate set (≤200), not your whole corpus.
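A minimal sketch of consuming a rerank response: sort by `relevance_score` descending and map each `index` back to the document that was sent. The function name and the optional `min_score` cutoff are assumptions added for illustration; the endpoint itself defines no such threshold.

```python
def top_documents(documents, rerank_response, min_score=None):
    """Return (document, score) pairs, best first.

    `min_score` is a hypothetical client-side cutoff on the raw logit.
    """
    ranked = sorted(
        rerank_response["results"],
        key=lambda r: r["relevance_score"],
        reverse=True,
    )
    out = []
    for result in ranked:
        if min_score is not None and result["relevance_score"] < min_score:
            continue
        out.append((documents[result["index"]], result["relevance_score"]))
    return out

docs = ["chunk a", "chunk b", "chunk c", "chunk d"]
response = {
    "object": "rerank.result",
    "results": [
        {"index": 0, "relevance_score": 1.04},
        {"index": 3, "relevance_score": 5.21},
    ],
}
best = top_documents(docs, response)
```

Because the score is an unbounded logit, any cutoff would need to be tuned on your own queries rather than assumed.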
### `POST /api/search` — orchestrated hybrid retrieval
One call: embeds the query (dense), retrieves from Qdrant (hybrid dense+sparse
with RRF **when you supply a sparse vector**, else dense-only), optionally
cross-encoder reranks, returns top_k.
```json
// request
{
"query": "Did Brightwater commit to the Fund III close in Q1?",
"collection": "crm_chunks",
"top_k": 8,
"retrieve_n": 80, // first-stage candidates (default max(50, top_k*10))
"sparse": { "indices": [12, 904], "values": [0.7, 1.2] }, // optional BM25 vector for hybrid
"fusion": "rrf", // or "dbsf"
"filter": { // raw Qdrant filter (pre-filter, see below)
"must": [ { "key": "lp_id", "match": { "value": "lp_0427" } } ]
},
"rerank": true,
"text_field": "text", // payload field holding the chunk text
"with_payload": true
}
// response
{
"object": "search.result_list",
"model": "BAAI/bge-m3+bge-reranker-v2-m3",
"query": "...",
"collection": "crm_chunks",
"reranked": true,
"data": [
{ "object": "search.result", "index": 0, "id": "chunk_99c1_3",
"score": 5.21, "fused_score": 0.41, "rerank_score": 5.21,
"text": "...Brightwater confirmed verbal commitment...",
"payload": { "lp_id": "lp_0427", "doc_type": "email", "date_ts": 1771027200, ... } }
],
"usage": { "embed_ms": 21, "qdrant_ms": 9, "rerank_ms": 140, "candidates": 80 }
}
```
`score` is the rerank score when reranked, else the fused/dense score.
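The request body above can be assembled with a small helper. This is a sketch only: `build_search_request` is a hypothetical client-side function, and it mirrors the documented `retrieve_n` default (`max(50, top_k*10)`) purely for illustration, since the server applies that default itself when the field is omitted.

```python
def build_search_request(query, collection="crm_chunks", top_k=8,
                         retrieve_n=None, sparse=None, query_filter=None,
                         rerank=True):
    """Build a JSON-serializable body for POST /api/search."""
    if retrieve_n is None:
        # Mirrors the documented default for first-stage candidates.
        retrieve_n = max(50, top_k * 10)
    body = {
        "query": query,
        "collection": collection,
        "top_k": top_k,
        "retrieve_n": retrieve_n,
        "rerank": rerank,
        "text_field": "text",
        "with_payload": True,
    }
    if sparse is not None:
        # Supplying a sparse vector is what switches retrieval to hybrid.
        body["sparse"] = sparse
        body["fusion"] = "rrf"
    if query_filter is not None:
        body["filter"] = query_filter
    return body

dense_only = build_search_request("Did Brightwater commit?", top_k=3)
hybrid = build_search_request(
    "Did Brightwater commit?",
    sparse={"indices": [12, 904], "values": [0.7, 1.2]},
)
```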
## The sparse (BM25) story — important for entity-heavy data
bge-m3's dense vectors carry semantic meaning, but exact entity matches
(fund names, tickers, people) are a **lexical** signal. For that you want hybrid
dense + sparse. Two design facts:
1. **spark-embed serves dense only.** TEI/Infinity can't emit bge-m3's learned
sparse either, and bge-m3's pretrained sparse weights underweight novel
entity tokens (brand-new fund names). So we don't use bge-m3 sparse.
2. **Use Qdrant BM25 with its built-in IDF.** The BM25 term vectors are generated
client-side, and Qdrant applies IDF server-side, computed over *your* corpus,
so novel entity strings get correct weight.
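To see why corpus-level IDF helps with novel entity names, here is a sketch using the standard BM25 IDF form. Treat the exact formula as an assumption: it is the common textbook variant, and Qdrant's internal computation may differ in detail. The document counts are made-up numbers.

```python
import math

def bm25_idf(total_docs, docs_with_term):
    """Standard BM25 inverse document frequency (assumed form)."""
    return math.log(
        (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0
    )

TOTAL_CHUNKS = 100_000  # hypothetical corpus size

# A brand-new fund name appears in a handful of chunks...
rare_entity_idf = bm25_idf(TOTAL_CHUNKS, 3)
# ...while a generic word like "fund" appears in a large share of them.
common_word_idf = bm25_idf(TOTAL_CHUNKS, 40_000)

print(rare_entity_idf > common_word_idf)
```

The rare token's weight comes entirely from its scarcity in the indexed corpus, so it needs no pretraining exposure to score well.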
### Your ingest pipeline (the part you own)
For each chunk, produce and upsert **both** vectors:
- **dense**: call Spark Control `POST /v1/embeddings` → 1024-d vector.
- **sparse**: run [FastEmbed](https://github.com/qdrant/fastembed) BM25 client-side:
```python
from fastembed import SparseTextEmbedding
bm25 = SparseTextEmbedding(model_name="Qdrant/bm25")
sp = next(bm25.embed([chunk_text])) # -> {indices, values}
```
Create the collection with a named dense vector and a named sparse vector that
uses `modifier: idf` (so Qdrant applies IDF server-side):
```json
PUT /collections/crm_chunks
{ "vectors": { "dense": { "size": 1024, "distance": "Cosine" } },
"sparse_vectors": { "sparse": { "modifier": "idf" } } }
```
Add payload indexes for your filter fields so filtered queries stay fast:
```json
PUT /collections/crm_chunks/index { "field_name": "lp_id", "field_schema": "keyword" }
PUT /collections/crm_chunks/index { "field_name": "doc_type", "field_schema": "keyword" }
PUT /collections/crm_chunks/index { "field_name": "date_ts", "field_schema": "integer" }
```
Upsert points with both vectors + payload:
```json
PUT /collections/crm_chunks/points
{ "points": [ { "id": 1,
"vector": { "dense": [...1024...], "sparse": { "indices": [...], "values": [...] } },
"payload": { "lp_id": "lp_0427", "lp_name": "Brightwater Capital",
"doc_type": "email", "date_ts": 1771027200, "text": "..." } } ] }
```
### At query time
Generate the query's BM25 sparse vector with the **same** FastEmbed model, and
pass it to `/api/search` as `sparse`. Spark Control fuses dense+sparse with RRF
inside Qdrant, then reranks. If you omit `sparse`, you get dense + rerank
(still good, just no exact-lexical leg).
> You can talk to Qdrant directly (`http://<spark2>:6333`) for collection
> management and upserts — that's the natural home for ingest. `/api/search` is
> the convenience path for the read side so your agents hit one host. If you'd
> rather Spark Control proxy Qdrant admin too, say the word.
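For intuition about the fusion step, Reciprocal Rank Fusion can be sketched in a few lines. This is not Qdrant's code: the constant `k=60` comes from the original RRF formulation and is an assumption here, and the chunk IDs are invented.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense_hits = ["c1", "c2", "c3"]   # semantic leg, best first
sparse_hits = ["c3", "c1", "c4"]  # BM25 leg, best first

fused = rrf_fuse([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Chunks that appear in both legs (`c1`, `c3`) rise above chunks found by only one, which is the behavior hybrid retrieval relies on for exact entity matches.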
## Chunking guidance (entity-heavy CRM)
- One chunk per email / note / transcript-turn; one chunk per memo *section* —
don't split mid-thought.
- Keep entity + date as **payload fields** (filterable), not buried in embedded text.
- Time-aware: don't merge a 2022 note with a 2026 update in one chunk; store
`date_ts` so you can pre-filter and recency-boost.
- Resolve entity-name variants ("J. Smith" / "Jonathan Smith" / "JS") to one
canonical `lp_id` at ingest, or retrieval fragments across variants.
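The guidance above might translate into something like the following sketch: resolve a name variant to a canonical `lp_id`, then carry entity and date as payload fields next to the chunk text. The alias table, the `lp_0001` ID, and the `Chunk` class are hypothetical; a real alias map would come from the CRM's canonical entity records.

```python
from dataclasses import dataclass

# Hypothetical alias table mapping normalized name variants to one canonical ID.
ALIASES = {
    "j. smith": "lp_0001",
    "jonathan smith": "lp_0001",
    "js": "lp_0001",
}

def normalize_name(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())

def resolve_lp_id(name, aliases=ALIASES):
    """Return the canonical lp_id for a name variant, or None if unknown."""
    return aliases.get(normalize_name(name))

@dataclass
class Chunk:
    text: str
    lp_id: str
    doc_type: str
    date_ts: int

    def payload(self):
        # Entity and date live in filterable payload fields, not only in the text.
        return {
            "lp_id": self.lp_id,
            "doc_type": self.doc_type,
            "date_ts": self.date_ts,
            "text": self.text,
        }

chunk = Chunk(
    text="Call notes: discussed allocation timing.",
    lp_id=resolve_lp_id("J.  Smith"),
    doc_type="note",
    date_ts=1771027200,
)
```

Returning `None` for unknown names keeps unresolved variants visible instead of silently creating a new entity.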
## Pre-filtering
Most agent queries aren't pure semantic — they're "recent emails with Bob about
energy". Pass a Qdrant `filter` to restrict the search space *before* vector
scoring (faster + more precise). Qdrant also supports server-side recency
boosting via Formula/Score-Boosting if you want decay-by-`date_ts` without an
app round-trip — ask and we can expose it through `/api/search`.
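A pre-filter for "recent emails with one LP" could be built as below. The `must` / `match` / `range` structure follows Qdrant's filter syntax and the payload keys are the ones indexed earlier; the helper name and its arguments are illustrative assumptions.

```python
def lp_recent_filter(lp_id, since_ts, doc_type=None):
    """Build a Qdrant filter: one LP, on or after `since_ts`, optional doc type."""
    must = [
        {"key": "lp_id", "match": {"value": lp_id}},
        {"key": "date_ts", "range": {"gte": since_ts}},
    ]
    if doc_type is not None:
        must.append({"key": "doc_type", "match": {"value": doc_type}})
    return {"must": must}

recent_emails = lp_recent_filter("lp_0427", since_ts=1767225600, doc_type="email")
```

The resulting dict can be passed as `filter` in an `/api/search` request body.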
## Service discovery + health
- `GET /api/endpoints` includes `embeddings` and `qdrant` entries (base_url +
ready flag) alongside `vllm`, `parakeet`, `kokoro`.
- The dashboard shows Embeddings + Qdrant health dots and Start/Restart/Stop
controls in the Always-On Services panel.
- Spark Control will auto-restart a wedged **embedding** container (GPU CUDA
wedge recovery, like the audio services) but **never** auto-restarts Qdrant —
it holds your only copy of the index, so a restart is surfaced for manual
action instead.
## Capacity
At your scale (tens of thousands of chunks now → low hundreds of thousands),
this is trivial: ~0.6–1.2 GB of dense vectors at 300k chunks, and spark-embed +
Qdrant together use a few GB of GPU/RAM on Spark 2's 122 GB. Full re-embed of
300k chunks is ~8–15 min, so re-indexing is cheap if you change models.
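The dense-vector footprint follows from simple arithmetic, sketched below. Reading the two ends of the range as fp16 versus fp32 storage is an assumption on my part, and the figures ignore index and payload overhead.

```python
def dense_index_bytes(num_chunks, dim=1024, bytes_per_value=4):
    """Raw size of the dense vectors alone (no HNSW graph, no payload)."""
    return num_chunks * dim * bytes_per_value

CHUNKS = 300_000

fp32_gb = dense_index_bytes(CHUNKS) / 1e9                      # 4 bytes per float
fp16_gb = dense_index_bytes(CHUNKS, bytes_per_value=2) / 1e9   # 2 bytes per float

# Throughput implied by a full re-embed taking 8 to 15 minutes.
slow_rate = CHUNKS / (15 * 60)  # chunks per second at 15 min
fast_rate = CHUNKS / (8 * 60)   # chunks per second at 8 min

print(f"{fp16_gb:.2f}-{fp32_gb:.2f} GB, {slow_rate:.0f}-{fast_rate:.0f} chunks/s")
```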
## Model upgrade path
If dense recall becomes the bottleneck, `Qwen3-Embedding-4B` (Matryoshka-trained,
tops 2026 MTEB) is the A/B candidate — same `/v1/embeddings` contract, swap the
model in spark-embed. bge-m3 is the lower-risk starting point and ships now.
+71
@@ -0,0 +1,71 @@
# Phase 0 — Foundation: Kickoff Brief
**Goal:** stand up the data + retrieval substrate that every later agent depends on. By the end of Phase 0 we can ingest the CRM into a searchable index and query it through a CRM MCP server using multiple retrieval modes. **No outward-facing agent behavior in this phase.**
See `../CLAUDE.md` for settled architecture and non-negotiable guardrails. This brief assumes them.
## Prerequisites / inputs needed before building
- [ ] CRM source code present in this repo.
- [ ] `docs/EMBEDDINGS.md` read — it is the authoritative embedding/retrieval contract and ingest recipe. This brief defers to it for all vector mechanics.
- [ ] `.env` populated from `.env.example`: `ANTHROPIC_API_KEY`, `SPARK_CONTROL_URL` (gateway), `QDRANT_URL` (direct Qdrant on Spark 2 for ingest), `X_API_KEY`, CRM connection vars.
- [ ] Spark Control endpoints live (shipped in v0.15.0): `/v1/embeddings`, `/v1/rerank`, `/api/search`. Verify with `GET /api/endpoints`.
- [ ] `fastembed` installed in the ingest environment (client-side BM25 sparse vectors, model `Qdrant/bm25`).
- [ ] A small set of real example records + documents (a few LPs, a memo, a transcript) to test ingest and retrieval against.
## Workstream A — Understand the CRM, then extend the schema
1. Read the CRM code. Write `docs/crm-overview.md`: storage engine and where it runs, the data model (records, custom fields, and especially how notes/interactions/correspondence are stored), the API/route surface, and auth. Fill in the Conventions section and CRM env vars in `CLAUDE.md`.
2. Design and apply additive, reversible schema extensions:
- LP/prospect fields: thesis fit, segment, accreditation/QP status, warmth score, source, owner, last-touch.
- **Interaction log** table: append-only record of every agent action and every human touch (actor, action, target, timestamp, payload).
- **Relationship graph** table: edges between people/entities with source + strength (seeded later from email/calendar + X follower overlap).
- **Canonical entity IDs**: a canonical record each name variant resolves to (see Workstream B3).
*Guardrail: additive and reversible only; no destructive migrations; review before running.*
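One way the append-only interaction log could look in SQLite is sketched below. The column names follow the fields listed above (actor, action, target, timestamp, payload), but the exact DDL and the trigger-based enforcement are assumptions, not the migration this repo ships. It is additive (only `CREATE ... IF NOT EXISTS`) and reversible by dropping the table and triggers.

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # in-memory DB for illustration only

conn.execute("""
    CREATE TABLE IF NOT EXISTS interaction_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        actor TEXT NOT NULL,
        action TEXT NOT NULL,
        target TEXT NOT NULL,
        created_at INTEGER NOT NULL,
        payload TEXT NOT NULL DEFAULT '{}'
    )
""")
# Assumed enforcement: reject UPDATE and DELETE so rows can only be appended.
conn.execute("""
    CREATE TRIGGER IF NOT EXISTS interaction_log_no_update
    BEFORE UPDATE ON interaction_log
    BEGIN SELECT RAISE(ABORT, 'interaction_log is append-only'); END
""")
conn.execute("""
    CREATE TRIGGER IF NOT EXISTS interaction_log_no_delete
    BEFORE DELETE ON interaction_log
    BEGIN SELECT RAISE(ABORT, 'interaction_log is append-only'); END
""")

def log_interaction(conn, actor, action, target, payload=None):
    """Append one row; payload is stored as a JSON string."""
    conn.execute(
        "INSERT INTO interaction_log (actor, action, target, created_at, payload) "
        "VALUES (?, ?, ?, ?, ?)",
        (actor, action, target, int(time.time()), json.dumps(payload or {})),
    )
    conn.commit()

log_interaction(conn, "agent:analyst", "enrich", "lp_0427", {"field": "segment"})
```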
## Workstream B — Ingest / sync pipeline
**Follow the ingest recipe in `docs/EMBEDDINGS.md` exactly** (collection schema, dual-vector upsert, FastEmbed BM25, payload fields). This brief covers only the CRM-specific decisions layered on top of it.
1. **Chunking** (a bigger quality lever than model choice). Per EMBEDDINGS.md: one chunk per email / note / transcript-turn; one chunk per memo *section*; never split mid-thought; keep entities + `date_ts` as filterable payload, not embedded text. The CRM-specific work is mapping *your* record types to chunks (how a thread, a note, a memo, a call transcript each decompose).
2. **Two vectors per chunk.** Dense via Spark Control `POST /v1/embeddings` (`BAAI/bge-m3`, 1024-d); sparse BM25 via FastEmbed (`Qdrant/bm25`) generated **client-side**. Create the `crm_chunks` collection with a named dense vector (1024, Cosine) + named sparse vector (`modifier: idf`), and payload indexes on `lp_id` (keyword), `doc_type` (keyword), `date_ts` (integer). Upsert both vectors + payload (`lp_id`, `lp_name`, `doc_type`, `date_ts`, `text`). **Ingest talks to Qdrant directly** (`QDRANT_URL`), not through the gateway.
3. **Entity resolution at ingest.** Resolve name variants ("John Smith" / "J. Smith" / "JS") to one canonical `lp_id` before storing, using a lightweight local-Qwen call via Spark Control `/v1/chat/completions`. Without it, retrieval fragments across variants. This is the same canonical-ID work as the canonical entity IDs item in Workstream A, step 2.
4. **Backfill + incremental sync, idempotent.** One-time full backfill, then incremental updates as records change. A full re-embed is ~8–15 min, so the index is cheaply rebuildable, but only if the pipeline is **idempotent and re-runnable** (Qdrant is never auto-restarted and holds the only live index). Prefer simple, correct, repeatable sync over premature optimization.
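One common way to make upserts idempotent is a deterministic point ID derived from the chunk's identity, so a re-run overwrites the same point instead of duplicating it. The sketch below assumes a `record_type:record_id:chunk_index` key scheme and a fixed namespace, both hypothetical; Qdrant accepts UUID strings as point IDs.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "crm_chunks")

def point_id(record_type, record_id, chunk_index):
    """Derive a stable UUID for a chunk from its source record and position."""
    key = f"{record_type}:{record_id}:{chunk_index}"
    return str(uuid.uuid5(CHUNK_NAMESPACE, key))

first_run = point_id("email", 4127, 0)
second_run = point_id("email", 4127, 0)   # same chunk on a later sync
other_chunk = point_id("email", 4127, 1)
```

With stable IDs, backfill and incremental sync can share one upsert path, and a changed record simply replaces its own points.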
## Workstream C — CRM MCP server
Wrap the CRM behind an MCP server that agents (Claude Agent SDK) call. The retrieval modes are thin wrappers over Spark Control `POST /api/search` (which orchestrates embed → Qdrant RRF → rerank); for hybrid/keyword modes, generate the query's BM25 sparse vector with the **same** FastEmbed model used at ingest and pass it as `sparse`. Tools to expose:
- **Reads:** fetch record(s), list/search records by structured filters, fetch interaction history for an entity.
- **Retrieval modes** (all via `/api/search`, varying parameters):
- `semantic_search` — dense only (omit `sparse`), large `top_k`/`retrieve_n`. High recall (Scout).
- `hybrid_search` — pass `sparse` + `rerank: true` + `filter`. Dense+BM25 RRF then rerank. **Default**, best for entity-heavy queries.
- `keyword_search` — high-precision lexical ("did we ever discuss X with this LP?"): lean on the sparse leg + rerank. For a pure-BM25 result, query Qdrant's sparse vector directly via `QDRANT_URL`.
- Always support a Qdrant `filter` (e.g. `lp_id`, `date_ts` range) to pre-filter before scoring.
- **Writes:** create/update records, append to the interaction log, write enrichment results back. (No outbound/contact actions — Phase 3, gated.)
Keep retrieval modes as small, composable functions so the orchestrator can select among them later.
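As a sketch of "small, composable functions", each mode can be a thin builder over one shared request helper. The function names match the modes listed above, but the default `top_k` values and the `retrieve_n` multiplier are assumptions; in this sketch `keyword_search` differs from `hybrid_search` only by a tighter `top_k`, since a pure-BM25 result would go to Qdrant directly.

```python
def _request(query, top_k, query_filter=None, **extra):
    """Shared body for POST /api/search; every mode accepts an optional filter."""
    body = {"query": query, "collection": "crm_chunks", "top_k": top_k}
    if query_filter is not None:
        body["filter"] = query_filter
    body.update(extra)
    return body

def semantic_search(query, top_k=50, query_filter=None):
    """Dense only: no sparse vector, large candidate pool for recall."""
    return _request(query, top_k, query_filter, retrieve_n=top_k * 10)

def hybrid_search(query, sparse, top_k=8, query_filter=None):
    """Dense + BM25 fused with RRF, then reranked."""
    return _request(query, top_k, query_filter,
                    sparse=sparse, fusion="rrf", rerank=True)

def keyword_search(query, sparse, top_k=5, query_filter=None):
    """Precision-oriented: same legs as hybrid, fewer results returned."""
    return _request(query, top_k, query_filter,
                    sparse=sparse, fusion="rrf", rerank=True)

sparse_vec = {"indices": [12, 904], "values": [0.7, 1.2]}
lp_filter = {"must": [{"key": "lp_id", "match": {"value": "lp_0427"}}]}

semantic_body = semantic_search("energy infrastructure thesis")
hybrid_body = hybrid_search("Brightwater Fund III close", sparse_vec, query_filter=lp_filter)
```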
## Workstream D — Compliance groundwork (parallel, non-blocking)
Engage counsel to define solicitation posture (506(b) vs 506(c)), accreditation/QP verification, and recordkeeping rules. This does not block A–C, but it must be resolved before any Phase 3 outbound work begins.
## Acceptance criteria
- [ ] CRM schema extended (additive, reversible) and documented.
- [ ] `crm_chunks` collection created per EMBEDDINGS.md (named dense + sparse vectors, payload indexes).
- [ ] Full CRM corpus backfilled into Qdrant with **both** dense + BM25 sparse vectors and payload; incremental sync working and idempotent.
- [ ] Entity resolution collapses obvious name variants to one canonical `lp_id`.
- [ ] CRM MCP server live; retrieval modes wrap `/api/search`; hybrid (sparse + rerank) visibly beats dense-only on entity-heavy test queries (e.g. exact fund/LP name lookups).
- [ ] Every write goes through the interaction log.
- [ ] No outward-facing behavior exists anywhere in the codebase.
## Explicitly out of scope for Phase 0
- Any agent that sends email, posts, or contacts a prospect.
- Scout monitoring, Closer outreach, Scribe publishing, the Orchestrator.
- The workflow/scheduling engine (revisited at Phase 3).
## Suggested order
A1 → A2 → B1 → B2 → B3 → B4 → C. Start D in parallel.
+133
@@ -0,0 +1,133 @@
# Ten31 — Agentic Capability Build Plan
*Working document. Purpose: a concrete, sequenced plan for building an in-house system of AI agents to widen the top of the fundraising funnel, refine and propagate Ten31's thesis, and automate marketing/branding workflows — built with internal resources using Claude and Claude Code as the engineering partner.*
---
## 1. Approach in one paragraph
Build **six agents** — five workers plus a lightweight orchestrator — on the **Claude Agent SDK**, connected to your systems through **MCP**. Run the *reasoning* on **Claude** (frontier-quality judgment for research, messaging, drafting). **Self-host the data layer and the privacy-sensitive model work** on your existing Start9 server and your **dual DGX Sparks**. **Buy nothing for the core**: your self-built CRM becomes the system of record, and your existing Gmail/Superhuman + calendar connectors supply the relationship data. The real unit of reuse is not the agent count — it is one shared **LP graph** (your CRM) plus a library of **skills** every agent draws from.
---
## 2. Guiding principles
1. **Sovereignty first.** Sensitive LP and relationship data stays on infrastructure you control (Start9 + DGX Sparks). Only the minimum necessary context per call ever reaches a third-party model API.
2. **Frontier reasoning where it is best-in-class; local where privacy or cost dominate.** Claude for hard agentic reasoning and LP-facing output; local open models for embeddings, redaction, triage, transcription, and reasoning over data that must not leave your walls.
3. **Human-in-the-loop on anything outbound or thesis-defining.** Agents draft and prepare; partners approve and send.
4. **Compliant by design.** Log every agent action; gate all outbound; bring counsel in before any cold outreach goes live.
5. **One source of truth.** Every agent reads from and writes to the same LP graph, so research → outreach → nurture → meeting prep compound instead of fragmenting.
---
## 3. The agent roster (6)
| Agent | Job | Cadence | Brain | Human gate |
|---|---|---|---|---|
| **Scout** | Watches sources (X/nostr, filings, treasury announcements, conference rosters, podcast networks); flags trigger events; populates the pipeline. | Continuous / scheduled | Local (triage) + Claude (judgment calls) | None (internal only) |
| **Analyst** | Builds LP dossiers, enriches records, maps shortest warm-intro path through the team's network. | On-demand + triggered | Claude (synthesis); local for RAG/embeddings | None (internal only) |
| **Architect** | **Thesis articulation.** Owns and refines the canonical messaging — the scarcity / critical-infrastructure throughline tying bitcoin to AI infrastructure. The copilot partners sit with to sharpen the narrative. Output = a living "messaging source of truth." | On-demand, collaborative | Claude | Partner sign-off on canonical thesis |
| **Scribe** | **Distribution / amplification.** Takes the Architect's canonical thesis + your content (Bitcoin Alpha, partner shows, memos) and propagates segment-specific cuts across X, nostr, LinkedIn, email. | Scheduled + on-demand | Claude | Review before publish |
| **Closer** | Drafts personalized outreach and nurture sequences, preps partners before LP calls, writes follow-ups, keeps the CRM clean. | Triggered + on-demand | Claude | **Hard gate** — human sends all outbound |
| **Orchestrator** ("Chief of Staff") | Schedules runs, routes work between agents, escalates to a human. | Always on | Claude (light) | n/a |
**Why Architect and Scribe are separate.** Distribution is high-frequency and semi-mechanical; thesis articulation is low-frequency, high-judgment, and collaborative. Keeping them apart lets the Architect own a stable, partner-approved narrative that the Scribe then propagates consistently everywhere.
---
## 4. Architecture and hosting map
### 4.1 Model layer
- **Claude (API)** — the brains for Analyst synthesis, Architect thesis work, Scribe drafting, Closer judgment, and Orchestrator routing. Use a stronger model for Architect/Analyst, a faster one for high-volume Scout/Closer tasks.
- **Local model on the DGX Sparks** — current local model is **Qwen3.6 35B-A3B running on a single Spark**. Used for PII redaction before any data leaves your walls, inbound triage/classification, transcription orchestration, structuring/extraction, and local reasoning over data you choose never to send out.
- The **A3B (~3B active params)** design means only a small slice of the model runs per token, so it largely sidesteps the Spark's memory-bandwidth limit and keeps decode fast despite being a 35B-total model. No need to link both Sparks for a larger model — that earlier ceiling is moot for this workload.
- **Embeddings + reranking (shipped, Spark Control v0.15.0).** Retrieval runs on `BAAI/bge-m3` (dense, 1024-dim, L2-normalized) plus `BAAI/bge-reranker-v2-m3` (cross-encoder), served by **spark-embed** — a small FastAPI server on **Spark 2** built from the NGC PyTorch image (HF TEI was ruled out: no arm64 CUDA image). Exposed through Spark Control as `/v1/embeddings`, `/v1/rerank`, and `/api/search` (orchestrated hybrid retrieval). Combined GPU footprint on Spark 2 is trivial (~3 GB).
- **Spark allocation.** Spark 1 = LLM serving (hot KV cache). Spark 2 = embeddings + reranker + audio + the Qdrant vector index. Both Sparks are treated as always-on production infrastructure.
- **All local model services are fronted by Spark Control** (the self-hosted gateway on Start9): agents hit one trusted URL for chat, embeddings, rerank, transcription, and TTS, with shared TLS, access control, and observability.
- **Auth note:** Agent SDK agents must authenticate with an **API key**, not a claude.ai login.
### 4.2 Data layer — the LP graph (self-hosted)
- **The CRM (self-hosted on Start9) is the canonical system of record.** Extend it to be the LP graph. Add: prospect/LP schema fields (thesis fit, segment, accreditation/QP status, warmth score, source, owner, last-touch), an interaction log (every agent action + every human touch), a derived **relationship graph** table, and **canonical entity IDs** for entity resolution (see ingest pipeline).
- **Vector store: Qdrant on Spark 2 (settled).** Holds the embedded chunks. It is a **rebuildable, derived index**, not a second source of truth — if lost, it re-embeds from the CRM in minutes. Qdrant provides dense search + native BM25 + payload filtering + Reciprocal Rank Fusion in one service.
- **Retrieval pipeline.** One orchestrated call to Spark Control `/api/search`: embed query (BGE-M3) → Qdrant dense + BM25 RRF with payload pre-filter → cross-encoder rerank → top_k. BM25 is generated **client-side** via FastEmbed (`Qdrant/bm25`) at both ingest and query time, with Qdrant applying IDF over *your* corpus — so domain entities (LP names, tickers, portfolio companies) are weighted by your own term statistics rather than BGE-M3's general-web sparse weights.
- **Ingest pipeline (the real Phase 0 work).** CRM record/change → chunk (one chunk per email/note/transcript-turn; one per memo *section*; time-aware; entities + `date_ts` kept as filterable payload, not embedded text) → resolve entities to a canonical `lp_id` (lightweight local-Qwen step) → produce **both** a dense vector (`/v1/embeddings`) and a sparse BM25 vector (FastEmbed) → upsert both + payload to Qdrant **directly** (not via the gateway). One-time backfill + idempotent incremental sync. Full recipe: `docs/EMBEDDINGS.md`.
- **Per-agent retrieval modes.** Don't force one pipeline on all agents. Build a small library the orchestrator picks from: high-recall dense at large K (Scout), high-precision keyword/BM25 (Closer — "did we ever discuss X with this LP?"), long-context + rerank (Architect). The CRM MCP server exposes these as tools.
- **Wrap the CRM in an MCP server** so all agents read/write through one uniform interface, including the retrieval modes above. Because the CRM is self-built, any endpoint the agents need can be added.
### 4.3 Integration layer (MCP fabric)
- MCP servers to stand up / connect:
- **CRM / LP graph** (custom, self-hosted) — primary.
- **Email + calendar** — Gmail/Superhuman connectors are already live; these feed Closer (drafting, follow-ups) and the Analyst's warm-path derivation.
- **Drive / notes** — internal documents and memos.
- **Publishing channels** — X, nostr, LinkedIn, email/newsletter (for Scribe).
- **Public data sources** — filings, web search, and the **X API (official key in hand)** for Scout/Analyst enrichment. X is a primary source here: per-prospect public profile/bio/activity and follower-following overlap for thesis-fit scoring and mutual-connection discovery (Analyst), plus account/list/keyword monitoring and follower-graph signals (Scout). Confirm what your X access *tier* permits (full-archive search, follower-graph pulls, streaming) — that sets the ceiling on heavier monitoring. nostr APIs as a complementary source.
### 4.4 Orchestration / runtime
- Inner loop: **Claude Agent SDK** handles each agent's tool-use loop and context management.
- Outer loop: a thin workflow engine decides *when* and *which* agent runs (Temporal for durable retries, or simpler cron/queue + n8n glue to start).
- **Observability:** structured logging of every agent action, with a simple dashboard. Required for both debugging and compliance.
### 4.5 Enrichment (privacy-preserving)
- Default: **one-way, per-prospect public lookups** that write results *into* the CRM. Never upload the LP list to a third party. The **X API** is the workhorse here — public, per-prospect, ToS-compliant via the official key — and its follower-graph data complements the email/calendar relationship graph for warm-path mapping.
- Optional: a **self-hosted scraper/enrichment pipeline on the Sparks** if you want zero third-party API exposure.
### 4.6 Redaction / re-hydration boundary (Claude-facing reasoning)
- For the steps where an agent must have **Claude reason over LP-specific content** (Analyst dossiers, Closer drafting), a local **scrub → reason → re-hydrate** round-trip keeps identifiers off the third-party API: the Sparks pseudonymize names/orgs/amounts to stable placeholders, Claude reasons over the de-identified prompt, and real values are swapped back locally before a human reviews. The ingest/retrieval path is already fully local and needs none of this.
- This is **designed now, built in Phase 2/3** (it is not needed in Phase 0). Full design: `docs/redaction-rehydration.md`.
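The scrub → reason → re-hydrate round-trip can be illustrated with plain string substitution. This is a toy sketch, not the design in `docs/redaction-rehydration.md`: the function names and placeholder format are assumptions, entity detection is taken as given (a list is passed in), and placeholders are only stable within one call.

```python
def pseudonymize(text, entities):
    """Replace known entity strings with placeholders; return text and mapping."""
    mapping = {}
    # Longest first so a full name is replaced before any shorter substring of it.
    ordered = sorted(entities, key=len, reverse=True)
    for i, entity in enumerate(ordered, start=1):
        placeholder = f"[ENTITY_{i}]"
        if entity in text:
            text = text.replace(entity, placeholder)
            mapping[placeholder] = entity
    return text, mapping

def rehydrate(text, mapping):
    """Swap placeholders back to real values, locally, before human review."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text

original = "Brightwater Capital confirmed a verbal commitment."
scrubbed, mapping = pseudonymize(original, ["Brightwater Capital"])
# `scrubbed` is what would be sent to the third-party model.
restored = rehydrate(scrubbed, mapping)
```

A production version would need a persistent mapping so the same LP gets the same placeholder across calls, which is what "stable placeholders" implies.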
---
## 5. Build sequence
### Phase 0 — Foundation
The substrate: data layer + retrieval, no live-in-the-wild agents yet. Division of labor:
- **Spark developer (their side):** spark-embed serving BGE-M3 + BGE-Reranker-v2-m3 and Qdrant on Spark 2, exposed via Spark Control `/v1/embeddings`, `/v1/rerank`, and `/api/search`.
- **Claude Code + you (this project):**
1. Read the CRM code; document the storage engine, schema, and API surface.
2. Extend the CRM schema (LP/prospect fields, interaction log, relationship graph, canonical entity IDs).
3. Build the ingest/sync pipeline (chunking + entity resolution + metadata payloads; backfill + incremental).
4. Build the CRM MCP server wrapping CRM reads/writes and the per-agent retrieval modes.
5. Bring counsel in to define outbound and recordkeeping rules so the system is compliant from day one.
### Phase 1 — Architect + Scribe
- Stand up the **Architect** first: encode the current thesis, voice, and segment definitions as skills; use it collaboratively to produce the canonical messaging source of truth.
- Then **Scribe**: propagate that thesis into segment-specific content with human review before publish.
- Lowest risk, highest immediate awareness ROI, never touches cold outreach — and it proves the full pattern (SDK + skills + MCP + human review).
### Phase 2 — Scout + Analyst
- **Scout** populates the pipeline from public signals (X monitoring via the API key); **Analyst** builds dossiers and derives warm paths from your own email/calendar graph plus X follower-graph overlap.
- Internal-facing, still no outbound. This is where the Sparks earn their keep (bulk classification, embeddings, RAG).
### Phase 3 — Closer + Orchestrator
- **Closer** drafts outbound, nurture, and meeting prep — with hard human-in-the-loop gates and full logging. Highest-risk and regulated, so it comes last.
- **Orchestrator** added once there are multiple agents to coordinate and schedule.
---
## 6. Team and ownership model
- **Engineering partner:** Claude + Claude Code, supplying Agent SDK and MCP fluency, scaffolding the agents, writing the MCP servers and orchestration, and customizing the Start9 CRM package.
- **Operator:** you (and your partner). You own deployment, secrets/key management, uptime, and the human-review gates. Your prior Start9 CRM build demonstrates this is well within reach.
- **The one real risk is time, not capability.** Removing the part-time data/ops hire means operational ownership lands on the partners. If partner time is scarce, that — not tooling or skill — is the constraint to manage. Mitigations: keep the early phases internal-only (no on-call urgency), automate logging/monitoring, and stage the highest-maintenance agent (Closer) last.
---
## 7. Compliance by design
- Log every agent action and every outbound draft.
- Gate all outbound through human send.
- Resolve solicitation posture (e.g. 506(b) vs 506(c)), accreditation/QP verification, and recordkeeping with counsel **before** the Closer touches cold outreach.
- Start with distribution and inbound nurture, where constraints are lightest.
---
## 8. Open decisions
**Resolved:** local chat/triage model = Qwen3.6 35B-A3B (Spark 1); embedding = `BAAI/bge-m3` dense 1024-dim; reranker = `BAAI/bge-reranker-v2-m3`; vector DB = Qdrant v1.16.0 on Spark 2; serving = **spark-embed** (custom FastAPI on NGC PyTorch image, *not* TEI); BM25 sparse generated client-side via FastEmbed (`Qdrant/bm25`); all fronted by Spark Control (`/v1/embeddings`, `/v1/rerank`, `/api/search`), shipped v0.15.0. Embedding-model A/B upgrade candidate if dense recall lags: `Qwen3-Embedding-4B` (same `/v1/embeddings` contract).
**Still open:**
1. Workflow engine for the outer loop (Phase 3): Temporal vs. cron/queue + n8n to start.
2. Whether any third-party enrichment API is acceptable, or X + fully self-hosted enrichment only.
3. Confirm **X API usage limits** (full-archive search, follower-graph pulls, streaming) to size Scout's monitoring scope. (Current access is pay-as-you-go credits.)
4. Segment definitions for the Architect/Scribe (who are the distinct LP audiences, and what does each one need to hear?).
5. Embedding dimension/quantization left at BGE-M3 native 1024-dim fp16 — no Matryoshka truncation or int8 needed at this corpus scale.
+207
View File
@@ -0,0 +1,207 @@
# CRM Overview — Storage, Data Model, API, Auth
*Workstream A1 deliverable (see `PHASE_0.md`). Read-only documentation of the existing CRM as of 2026-06. Every concrete claim is anchored to `file:line`. This is a description of what exists today, not a proposal — the schema-extension proposal for A2 lives separately.*
---
## 0. TL;DR for Phase 0
- **One Python file, no framework.** The whole backend is `backend/server.py` (~4,530 lines): a stdlib `http.server.ThreadingHTTPServer` with a hand-written `CRMHandler(BaseHTTPRequestHandler)` and manual path dispatch. `requirements.txt` lists FastAPI/SQLAlchemy/Alembic/Pydantic but **none are imported** — they are vestigial.
- **Storage is one SQLite file** (`data/crm.db`), WAL mode, opened fresh per request. Schema is created idempotently in-code at boot. There is no Alembic; "migrations" are `CREATE TABLE IF NOT EXISTS` + best-effort `ALTER TABLE ADD COLUMN`.
- **Two parallel investor data models** coexist with no shared key: (1) the *classic* `contacts / organizations / opportunities / communications / lp_profiles` CRM, and (2) the *newer, actively-used* `fundraising_*` collaborative grid. They are bridged only by fuzzy name/email matching. **This duality is the central entity-resolution problem for Phase 0.**
- **A real Gmail subsystem** (`backend/email_integration/`) stores threaded correspondence in `crm.db` and matches emails to investors — but is **self-disabling** (off unless a service-account key is present).
- **Auth is a single scheme:** username/password → HS256 JWT (Bearer header), re-validated against the `users` table each request; two roles (`admin`/`member`). The `X_API_KEY` named in `CLAUDE.md`/`PHASE_0.md` **does not exist in the code** — it is aspirational.
- **Guardrail flags:** all deletes are **hard deletes** (violates guardrail #3 as written); a destructive `POST /api/admin/reset-all-data` exists; `audit_log` is mutation-only and is *not* the append-only interaction log Phase 0 wants.
---
## 1. Storage engine & where it runs
### 1.1 Runtime
- **Server:** `ThreadingHTTPServer((HOST, PORT), CRMHandler)`, `daemon_threads = True`, `serve_forever()` (`backend/server.py:4509`). Handler class at `backend/server.py:1418`. Pure Python stdlib (`http.server`); **not** FastAPI/uvicorn despite `backend/requirements.txt:1-2`.
- **Concurrency model:** one OS thread per request. Safe because each request opens its own short-lived SQLite connection under WAL (rationale documented inline at `backend/server.py:4506-4508`).
- **Request lifecycle:** banned-IP check → per-IP rate limit → (email module hook) → manual `if path == …` dispatch. Body is read once and cached on the handler instance (`get_body()`, `backend/server.py:1433`). Malformed JSON silently becomes `{}`.
- **Background threads:** a backup scheduler loops every 60 s (`start_backup_scheduler`, `backend/server.py:1367`); an optional Gmail sync scheduler starts only if `CRM_GMAIL_INTEGRATION_ENABLED` is truthy (`backend/server.py:4498`).
- **Default bind:** `0.0.0.0:8080`, plain HTTP — TLS is expected to be terminated upstream (Start9 / Tailscale).
### 1.2 SQLite configuration
`get_db()` (`backend/server.py:77-84`) sets, on every connection:
- `PRAGMA journal_mode=WAL` — concurrent readers + single writer (this is what makes the ingest reader safe against the live writer).
- `PRAGMA foreign_keys=ON` — FKs are enforced at runtime (per-connection in SQLite, so re-set each time).
- `PRAGMA busy_timeout=5000` — 5 s wait on a lock.
- `row_factory = sqlite3.Row`.
The Gmail module re-implements the identical pragmas (`email_integration/scheduler.py:49`, `email_integration/routes.py:89`) rather than import `server.py`, to avoid a circular import.
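As a minimal sketch of the per-connection setup listed above (the helper name `open_crm_connection` is hypothetical and this is not the body of `get_db()` itself), a connection factory applying the same four settings could look like this:

```python
import sqlite3


def open_crm_connection(db_path: str) -> sqlite3.Connection:
    """Open a short-lived connection with the pragmas described in this section.

    Hypothetical helper for illustration; the real code lives in get_db().
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row               # rows behave like read-only mappings
    conn.execute("PRAGMA journal_mode=WAL")      # persistent: stored in the DB file
    conn.execute("PRAGMA foreign_keys=ON")       # per-connection: must be re-set each time
    conn.execute("PRAGMA busy_timeout=5000")     # wait up to 5 s on a locked database
    return conn
```

`journal_mode=WAL` is recorded in the database file, while `foreign_keys` and `busy_timeout` apply only to the connection that sets them, which is why any second code path opening the file has to repeat them.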
### 1.3 Schema bootstrap & "migrations"
- `init_db()` (`backend/server.py:86`) runs once at startup, before binding. One big `executescript` of `CREATE TABLE/INDEX IF NOT EXISTS` (`backend/server.py:91-405`) creates **both** data models plus `app_settings`.
- **Core "migrations":** a hardcoded list of `ALTER TABLE … ADD COLUMN` wrapped in a try/except that swallows `OperationalError` (`backend/server.py:407-418`) — additive-only, idempotent-by-failure. No version table, no down-migrations. (Currently adds `city/state/country/location_query` to `contacts` and `lead_source` to `fundraising_investors`.)
- **The only real migration runner** is in the Gmail module: `email_integration.db.apply_migrations()` (`email_integration/db.py:23`) runs numbered `NNNN_*.sql` files lexicographically. There is one today: `migrations/0001_email_tables.sql`. ⚠️ This is called from `init_db()` guarded only by `ImportError` (`backend/server.py:421-427`), so on any image where the package is importable, **the email tables are created even when Gmail sync is disabled**.
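The two mechanisms above can be sketched side by side. This is an illustrative reconstruction under stated assumptions, not the code in `server.py` or `email_integration/db.py`: the function names are hypothetical, and the `schema_migrations` ledger table is an assumption (the existing runner is only described as running numbered files in lexicographic order).

```python
import sqlite3
from pathlib import Path


def add_column_if_missing(conn: sqlite3.Connection, table: str, column_ddl: str) -> None:
    """Additive and idempotent-by-failure: a duplicate column raises OperationalError."""
    try:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column_ddl}")
    except sqlite3.OperationalError:
        pass  # column already exists (caveat: this also hides unrelated errors)


def apply_numbered_migrations(conn: sqlite3.Connection, migrations_dir: str) -> list:
    """Run NNNN_*.sql files in lexicographic order, skipping ones already recorded.

    The ledger table is an assumption added for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "name TEXT PRIMARY KEY, applied_at TEXT NOT NULL)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    newly_applied = []
    for path in sorted(Path(migrations_dir).glob("[0-9][0-9][0-9][0-9]_*.sql")):
        if path.name in applied:
            continue
        conn.executescript(path.read_text())
        conn.execute(
            "INSERT INTO schema_migrations (name, applied_at) VALUES (?, datetime('now'))",
            (path.name,),
        )
        conn.commit()
        newly_applied.append(path.name)
    return newly_applied
```

The swallow-the-error approach needs no bookkeeping but cannot distinguish "already applied" from "failed for another reason"; a ledger makes that distinction explicit at the cost of one extra table.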
### 1.4 `data/` layout
- `crm.db`, `crm.db-wal`, `crm.db-shm` — the DB + WAL + shared-memory.
- `backups/` — JSON snapshots of the **fundraising grid state only** (not the whole DB), written by the backup scheduler.
- `secrets/` — holds `gmail-service-account.json` (mode 600).
- `email_attachments/` — Gmail attachment bytes on disk (created on the 0.4 image).
- `.crm-secret` — persisted JWT secret, written by the container entrypoint (not the app).
### 1.5 Production deployment (StartOS)
Package id `ten-database` ("Ten31 Database"). Both generations run the **same** app (`python3 /app/backend/server.py`) in a `python:3.11-slim` container; all state on a single persistent volume `main` mounted at `/data`.
- **`start9/0.4/` — current/live target.** Manifest & lifecycle are TypeScript under `start9/0.4/startos/` (`manifest/index.ts`, `main.ts`, `backups.ts`, `interfaces.ts`). Built for `x86_64` + `aarch64`. Whole-volume backups (`sdk.Backups.ofVolumes('main')`). The richer `docker_entrypoint.sh` creates `/data/{backups,secrets,email_attachments}`, persists `CRM_SECRET_KEY` to `/data/.crm-secret`, and **conditionally enables Gmail** iff `/data/secrets/gmail-service-account.json` exists (then exports DWD env: `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`, sync interval 180 min). Version notes record the 0.3.5→0.4 migration is complete and the live `/data` volume is the sole source of truth (no more baked-in seed snapshot).
- **`start9/0.3.5/` — legacy.** Hand-written YAML manifest (`manifest.yaml`), arm64-only, Tor `80→8080` + LAN `443(ssl)→8080`. Superseded by 0.4.
- **Local/dev:** `start.sh` runs `python3 backend/server.py` with dev defaults. `start_beta.sh` is a Tailscale launcher that sources `.env.beta`, forces `CRM_ENV=production`, and requires a ≥24-char `CRM_SECRET_KEY`. In production mode the app refuses to start without `CRM_SECRET_KEY` (`backend/server.py:4487`).
### 1.6 Environment variables (for CLAUDE.md "CRM connection vars")
**Core server (`backend/server.py:42-71`):** `CRM_DATA_DIR`, `CRM_FRONTEND_DIR`, `CRM_DB_PATH` (default `<DATA_DIR>/crm.db`), `CRM_SECRET_KEY` (JWT signing — required in production), `CRM_HOST` (default `0.0.0.0`), `CRM_PORT` (default `8080`), `CRM_CORS_ORIGIN` (default `*`), `CRM_ENV` (default `development`), `CRM_LOGIN_RATE_LIMIT_PER_MIN`, `CRM_WRITE_RATE_LIMIT_PER_MIN`, `CRM_GET_RATE_LIMIT_PER_MIN`, `CRM_ABUSE_404_THRESHOLD`, `CRM_ABUSE_404_WINDOW_SEC`, `CRM_ABUSE_BAN_SEC`, `CRM_SEED_DEMO_DATA`, `CRM_GMAIL_INTEGRATION_ENABLED`.
**Gmail module (`email_integration/config.py:80-101`):** `CRM_GMAIL_AUTH_METHOD` (`dwd`/`oauth`), `CRM_GMAIL_SA_KEY_PATH`, `CRM_GMAIL_WORKSPACE_DOMAIN`, `CRM_GMAIL_OAUTH_CLIENT_ID/SECRET/REDIRECT_URI`, `CRM_GMAIL_SECRET_KEY` (AES key for OAuth-token-at-rest — **separate** from `CRM_SECRET_KEY`), `CRM_GMAIL_SYNC_INTERVAL_MIN`, `CRM_GMAIL_BACKFILL_PAGE_SIZE`, `CRM_GMAIL_MAX_ATTACHMENT_MB`, plus rate/retry knobs.
> **There is no network DB protocol.** "Connecting to the CRM" means either (a) opening the same SQLite file (`CRM_DB_PATH`) — only possible co-located with the `/data` volume — or (b) HTTP at `http://<host>:8080` with a Bearer JWT.
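For option (a), a read-only open might look like the sketch below. It assumes the process runs where the SQLite file is reachable; the function name `open_readonly` is hypothetical. Only standard `sqlite3` URI syntax is used.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path: str) -> sqlite3.Connection:
    """Open the CRM database read-only via a SQLite URI (mode=ro).

    Hypothetical helper: writes through this connection fail with OperationalError.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA busy_timeout=5000")  # match the server's lock wait
    return conn
```

One caveat worth checking on the target box: a read-only connection to a WAL-mode database still needs the `-wal` and `-shm` files to exist or be creatable, so a fully read-only mount of the data directory may not work.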
---
## 2. Data model
PKs are **8-char truncated UUIDs** (`generate_id()` = `str(uuid.uuid4())[:8]`, `backend/server.py:522`). Timestamps are ISO-8601 UTC strings (`now()`, `backend/server.py:525`). JSON-bearing TEXT columns (`tags`, `attendees`, `options`) are `json.dumps`'d on write and auto-parsed on read by `row_to_dict()` (`backend/server.py:506-517`).
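To put a number on what truncation costs: the first 8 characters of a UUID4 string are 8 random hex digits, i.e. 32 bits. The sketch below reproduces the id expression quoted above and applies the standard birthday-bound approximation; `collision_probability` is a hypothetical helper, not part of the codebase.

```python
import math
import uuid


def generate_id() -> str:
    """Same expression as quoted above: first 8 hex chars of a UUID4 (32 random bits)."""
    return str(uuid.uuid4())[:8]


def collision_probability(n: int, bits: int = 32) -> float:
    """Birthday-bound approximation: P(any collision among n ids) = 1 - exp(-n(n-1) / 2^(bits+1))."""
    return 1.0 - math.exp(-n * (n - 1) / 2 ** (bits + 1))


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, round(collision_probability(n), 4))
```

Under this approximation the chance of at least one collision is roughly 1% at 10,000 ids and roughly 69% at 100,000 ids drawn into the same key space, which matters most if ids from several tables end up sharing one index.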
### 2.1 Classic CRM model
| Table | Role | Key columns / notes |
|---|---|---|
| `users` | auth + ownership principal | `username`/`email` UNIQUE, `password_hash`, `role` ∈ {`admin`,`member`}, `is_active`. First user forced `admin`. (`backend/server.py:92`) |
| `organizations` | weak parent of contacts/opps | `name` (not unique), `type` (free-text, default `other`), `tags` JSON, `description`. (`backend/server.py:104`) |
| `contacts` | **the hub** | `first_name`/`last_name` (req), `organization_id` (FK SET NULL), `contact_type` (free-text; load-bearing values `prospect`/`investor`), `status` (default `active`), `source`, `tags` JSON, `notes`, `linkedin_url`. (`backend/server.py:123`) |
| `opportunities` | deal pipeline | `contact_id` (req, FK **CASCADE**), `stage` (allowlist `PIPELINE_STAGES` at `backend/server.py:1380`, enforced **only** on the stage endpoint), `commitment_amount`, `expected_amount`, `fund_name`, `owner_id`, `lost_reason`. (`backend/server.py:148`) |
| `lp_profiles` | closed-LP extension | 1:1 with a contact (`contact_id` UNIQUE, FK CASCADE). Holds `commitment_amount`, `funded_amount`, `accredited` (bare 0/1), `legal_docs_signed`, `wire_received`, `k1_sent`, `investor_type` (free-text). (`backend/server.py:186`) |
| `custom_fields` / `custom_field_values` | EAV custom fields | **Dead**: schema exists but has **no routes/handlers**; only ever wiped by reset. Do not build on this. (`backend/server.py:206`) |
| `tags` | global tag palette | `name` UNIQUE + `color`. Not FK-linked to the per-row `tags` JSON arrays; just an autocomplete source. (`backend/server.py:237`) |
| `audit_log` | mutation diff trail | `user_id`, `entity_type`, `entity_id`, `action`, `changes` JSON. **Mutation-only**, no reads, no actor/agent dimension. (`backend/server.py:227`) |
**How an LP is represented:** a single `contacts` row is the canonical record; `contact_type` carries the funnel stage (`prospect` → `investor`). Promotion to `investor` is a side effect of creating an `lp_profiles` row (`backend/server.py:2834`) or of fundraising-grid sync (`backend/server.py:788`). The contact dossier is assembled by `GET /api/contacts/{id}` (`backend/server.py:2008`): contact + last-20 communications + all opportunities + the one lp_profile. **Note the unreconciled double-modeling of money:** in-flight commitment lives on `opportunities`, closed commitment lives on `lp_profiles`, and the grid has a *third* copy in `fundraising_commitments`.
### 2.2 Fundraising grid model (newer, actively used)
A **real-time collaborative spreadsheet** the partners actually edit: funds are columns, investors are rows, dollar commitments are cells, plus saved views, live presence/cell-locks, and a small automation engine.
- **Authoritative store = one JSON blob:** `fundraising_state.grid_json` + `views_json` (row `id='main'`, `backend/server.py:258`). Reads/exports come straight from this.
- **Normalized tables are a derived mirror**, fully rebuilt from the JSON on **every** write by `sync_fundraising_relational()` (`backend/server.py:945`): `fundraising_investors` (keyed by `source_row_id`), `fundraising_funds` (by `column_id`), `fundraising_commitments` (cells), `fundraising_contacts`, `fundraising_views`. ⚠️ **`fundraising_contacts`/`fundraising_commitments`/`fundraising_views` get fresh UUIDs on every save** (DELETE+reinsert); only `fundraising_investors.id` and `fundraising_funds.id` are stable. Don't persist external references to the volatile ones.
- **Automation engine** (`run_fundraising_automations`, `backend/server.py:668`): currently **ignores** the rules' `condition_json`/`action_json` and uses hard-coded flag logic (`graveyard`→graveyard list, `follow_up`→follow_up list, everyone→`all`), rebuilding `fundraising_list_memberships` and logging changes to `fundraising_automation_runs`. So the rules table is display/config surface, not a live interpreter.
- **Backups:** JSON-only filesystem snapshots of grid state to `data/backups/` (manual/auto/pre_restore), governed by a policy in `app_settings`. Restore overwrites state then re-syncs the mirror.
### 2.3 The two-model bridge (the central problem)
There is **no foreign key** between `fundraising_investors` and the classic `contacts/organizations/opportunities/lp_profiles`. They are joined only by best-effort name/email matching, essentially one-directional **grid → classic**:
1. **Grid → classic (write-through):** every grid save pushes each grid contact into classic `contacts`/`organizations` via `_upsert_contact_from_fundraising` (`backend/server.py:730`), matching by lowercased email else (name + org). No stored key links the resulting `contacts.id` back to `fundraising_investors.id` — it re-matches by name/email each time.
2. **Grid-logged comms → classic `communications`:** `POST /api/fundraising/log-communication` (`backend/server.py:2561`) writes into the classic `communications` table *and* appends a dated line to the grid row's `notes`.
3. **Classic → grid (partial reverse mirror):** `_sync_contact_to_fundraising_state` (`backend/server.py:815`) patches an existing matching grid row but will **not** create a new investor row.
**Net:** the same investor can simultaneously exist as a `fundraising_investors` row, one-or-more `contacts` rows, an `organizations` row, and an `lp_profiles` row, with **no authoritative join key**. The email matcher treats the **fundraising side as higher-signal** (`matcher.py:103`). `lp_profiles` is entirely outside the fundraising subsystem.
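The matching rule described for the grid → classic write-through ("lowercased email, else name + org") can be sketched as follows. This is an illustration only: `find_matching_contact` and the dict field names are hypothetical, and falling through to name + org when an email is present but unmatched is an assumption about what "else" means.

```python
from typing import Optional


def _norm(value) -> str:
    """Lowercase and trim; None becomes an empty string."""
    return (value or "").strip().lower()


def find_matching_contact(contacts, email, first_name, last_name, org_name) -> Optional[dict]:
    """Best-effort match: exact lowercased email first, then (first, last, org)."""
    wanted_email = _norm(email)
    if wanted_email:
        for contact in contacts:
            if _norm(contact.get("email")) == wanted_email:
                return contact
    wanted_name = (_norm(first_name), _norm(last_name), _norm(org_name))
    if not (wanted_name[0] or wanted_name[1]):
        return None  # nothing to match on
    for contact in contacts:
        candidate = (
            _norm(contact.get("first_name")),
            _norm(contact.get("last_name")),
            _norm(contact.get("organization")),
        )
        if candidate == wanted_name:
            return contact
    return None
```

Because nothing persists the pairing, a renamed contact or a changed email address silently produces a second `contacts` row on the next save, which is the failure mode a stored join key would remove.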
### 2.4 Notes, interactions & correspondence (Phase-0 critical)
Three subsystems hold embeddable text:
**(a) `communications`** (`backend/server.py:168`) — the primary human-logged activity store. One row per `note`/`call`/`email`/`meeting`/`text` (type is free-text; UI offers those 5, `frontend/index.html:4220`). Columns: `contact_id` (req, CASCADE), `opportunity_id` (SET NULL), `subject`, `body`, `communication_date` (the **event** timestamp, distinct from `created_at` — this is the `date_ts` source), `outcome`, `next_action`, `attendees` JSON. Written by `POST /api/communications` and by the fundraising log endpoint. **Hard delete** at `backend/server.py:2758`.
**(b) Scattered free-text fields** worth embedding: `contacts.notes`, `lp_profiles.notes`, `fundraising_investors.notes` (a running, newline-appended outreach log mirroring the grid Notes column), `opportunities.description`/`next_step`, `organizations.description`.
**(c) Gmail correspondence** (`backend/email_integration/`, schema in `migrations/0001_email_tables.sql`):
- `emails` — canonical record deduped by RFC `message_id`; `subject`, `from_*`, `to/cc/bcc` JSON, `sent_at`, `body_text`, `body_html`, `snippet`, `is_matched`, `match_status`. ⚠️ **Bodies are stored only for matched emails**; unmatched emails are metadata-only with the body nulled (`sync.py:319`).
- `email_threads` — thread roll-up; threading via RFC `In-Reply-To`/`References` chain then Gmail thread id (`threads.py:38`).
- `email_account_messages` — per-mailbox sighting (dedup across team inboxes).
- `email_attachments` — metadata; bytes on disk, deduped by SHA-256.
- **`email_investor_links`** — the entity linkage. Populates any subset of `fundraising_investor_id` / `fundraising_contact_id` / `contact_id` / `organization_id` (all **soft references, no FK**) with `match_kind` (`exact_email` conf 1.0 / `domain_match` conf 0.6) and confidence. A single email can link to several entities at once. `email_sync_runs` records per-run observability.
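A minimal sketch of the two `match_kind` values and their confidences follows. The confidence numbers come from the description above; the function name, its inputs, and the exact meaning of a domain match (the sender's domain equals a known investor domain) are assumptions, not a reading of `matcher.py`.

```python
from typing import Optional, Tuple


def classify_match(sender_email: str, known_emails: set, known_domains: set) -> Optional[Tuple[str, float]]:
    """Return (match_kind, confidence), or None when nothing matches.

    Hypothetical helper: known_emails and known_domains are lowercased sets.
    """
    address = sender_email.strip().lower()
    if "@" not in address:
        return None
    if address in known_emails:
        return ("exact_email", 1.0)
    domain = address.rpartition("@")[2]
    if domain in known_domains:
        return ("domain_match", 0.6)
    return None
```

Storing the confidence alongside the link lets downstream consumers filter, for example by ignoring `domain_match` links when attributing a thread to a single person.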
**`audit_log` is *not* the interaction log.** It is mutation-only (≈26 write sites), logs no reads, and has no agent/actor dimension. Phase 0's "append-only interaction log of every agent action and every human touch" (Workstream A2) needs a **new** table, not a repurpose.
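One possible shape for such a table is sketched below. The column names and the trigger-based enforcement are assumptions for illustration, not the schema the A2 proposal defines. Triggers that raise on `UPDATE` and `DELETE` make the table append-only at the database level rather than by convention.

```python
import json
import sqlite3

# Hypothetical schema: column names are illustrative, not the A2 design.
INTERACTION_LOG_SCHEMA = """
CREATE TABLE IF NOT EXISTS interaction_log (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    occurred_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now')),
    actor_type  TEXT NOT NULL CHECK (actor_type IN ('human', 'agent')),
    actor_id    TEXT NOT NULL,
    action      TEXT NOT NULL,
    entity_type TEXT,
    entity_id   TEXT,
    detail_json TEXT
);
CREATE TRIGGER IF NOT EXISTS interaction_log_no_update
BEFORE UPDATE ON interaction_log
BEGIN
    SELECT RAISE(ABORT, 'interaction_log is append-only');
END;
CREATE TRIGGER IF NOT EXISTS interaction_log_no_delete
BEFORE DELETE ON interaction_log
BEGIN
    SELECT RAISE(ABORT, 'interaction_log is append-only');
END;
"""


def init_interaction_log(conn: sqlite3.Connection) -> None:
    conn.executescript(INTERACTION_LOG_SCHEMA)


def log_interaction(conn, actor_type, actor_id, action,
                    entity_type=None, entity_id=None, detail=None) -> int:
    """Append one row and return its id; reads and writes are logged the same way."""
    cur = conn.execute(
        "INSERT INTO interaction_log "
        "(actor_type, actor_id, action, entity_type, entity_id, detail_json) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (actor_type, actor_id, action, entity_type, entity_id,
         json.dumps(detail) if detail is not None else None),
    )
    conn.commit()
    return cur.lastrowid
```

The `actor_type` column is what `audit_log` lacks: it lets a single table hold both agent actions and human touches while keeping them distinguishable.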
---
## 3. API / route surface
Full REST verbs exist (mutations are **not** tunneled through POST): `do_GET` (1589), `do_POST` (1727), `do_PUT` (1791), `do_PATCH` (1817), `do_DELETE` (1845), `do_OPTIONS` (1580), all in `backend/server.py`. Routing is a flat `if/elif` ladder: exact string for collections, `re.match(r'^/api/x/[^/]+$')` for items, path params parsed positionally with `path.split('/')`. The Gmail module hooks in at the top of `do_GET`/`do_POST` via `try_handle(self)` (`email_integration/routes.py:49`), claiming any `/api/email/*` path.
**Response envelope:** reads/writes → `{"data": …}` (+`total`/`limit`/`offset` for lists); errors → `{"error": msg}`; create → 201; auth → bare `{"token","user"}`; email handlers use ad-hoc keys. CORS allows all verbs; `Access-Control-Allow-Origin` echoes `CRM_CORS_ORIGIN` (default `*`).
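The routing style and envelope described above can be illustrated with a toy dispatcher. Everything here is a stand-in: the helper names, the in-memory `contacts` dict, and the `limit`/`offset` defaults are hypothetical, and the real handlers write to an HTTP response rather than returning tuples.

```python
import re

CONTACT_ITEM = re.compile(r"^/api/contacts/[^/]+$")


def ok(data, **paging):
    """Success envelope; list endpoints add total/limit/offset."""
    return 200, {"data": data, **paging}


def error(status, message):
    """Error envelope."""
    return status, {"error": message}


def dispatch_get(path, contacts, limit=50, offset=0):
    """Flat if/elif ladder: exact string for the collection, regex for items."""
    if path == "/api/contacts":
        rows = list(contacts.values())
        return ok(rows[offset:offset + limit], total=len(rows), limit=limit, offset=offset)
    elif CONTACT_ITEM.match(path):
        # "/api/contacts/<id>".split("/") -> ["", "api", "contacts", "<id>"]
        contact_id = path.split("/")[3]
        row = contacts.get(contact_id)
        return ok(row) if row is not None else error(404, "Not found")
    return error(404, "Not found")
```

Positional parsing like `path.split("/")[3]` is simple, but it ties each handler to the exact depth of its route, so nested routes need their own regex and their own index.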
**Auth column:** None = public · Bearer = any active user · Admin = `require_admin`.
| Method | Path | Purpose | Auth |
|---|---|---|---|
| GET | `/`, `/index.html`, `/assets/*` | Serve SPA + static | None |
| GET | `/api/health` | Liveness | None |
| GET | `/api/bootstrap/status` | First-run check | None |
| POST | `/api/auth/login` | Login → JWT | None |
| POST | `/api/auth/register` | First-user registration (self-disables) | None |
| GET/POST | `/api/contacts` | List/search · Create | Bearer |
| GET/PUT/DELETE | `/api/contacts/{id}` | Detail (dossier) · Update · **Hard delete** | Bearer |
| GET | `/api/contacts/{id}/communications` | Per-contact interaction history | Bearer |
| GET/POST · GET/PUT/DELETE | `/api/organizations[/{id}]` | Org CRUD | Bearer |
| GET/POST · GET/PUT/DELETE | `/api/opportunities[/{id}]` | Opp CRUD | Bearer |
| PATCH | `/api/opportunities/{id}/stage` | Move pipeline stage (validated) | Bearer |
| GET/POST · GET/PUT/DELETE | `/api/communications[/{id}]` | Comms CRUD | Bearer |
| GET/POST · GET/PUT | `/api/lp-profiles[/{id}]` | LP-profile CRUD (no delete route) | Bearer |
| GET | `/api/reports/{dashboard,pipeline,lp-breakdown,activity}` | Aggregates | Bearer |
| GET | `/api/export/contacts` | Export **all** contacts (returns JSON, not CSV) | Bearer |
| POST | `/api/import/csv` | Bulk import from JSON rows | Bearer |
| GET/POST · PATCH | `/api/feature-requests[/{id}]` | Feature-request tracker | Bearer |
| GET | `/api/users` | List users (no hashes) | Bearer |
| POST · PATCH | `/api/admin/users[/{id}]` | Create / update user | **Admin** |
| POST | `/api/admin/reset-all-data` | ⚠️ Wipe CRM (confirm phrase `RESET ALL DATA`) | **Admin** |
| GET | `/api/audit-log` | Mutation audit trail | **Admin** |
| GET | `/api/security/status` | Config/security status | **Admin** |
| GET/PUT | `/api/fundraising/state` | Get / save grid (optimistic `version`, 409 on conflict) | Bearer |
| GET/POST | `/api/fundraising/collab/{state,heartbeat}` | Presence + cell locks | Bearer |
| POST | `/api/fundraising/log-communication` | Log comm + append grid note | Bearer |
| GET | `/api/fundraising/{export,relational-summary}` | Export / counts | Bearer |
| GET | `/api/fundraising/activity` | Merged audit+automation+backup feed | **Admin** |
| GET/PATCH | `/api/fundraising/automations[/{id}]` | Automation rules | **Admin** |
| GET | `/api/fundraising/automation-runs` | Run history | **Admin** |
| GET/POST | `/api/fundraising/{backups,backup,backup-verify}` | Backup mgmt | **Admin** |
| POST | `/api/fundraising/{restore-preview,restore}` | Restore grid | **Admin** |
| GET/PATCH | `/api/fundraising/backup-policy` | Backup policy | **Admin** |
| GET | `/api/email/{status,accounts,threads}` | Sync status / accounts / matched threads | Bearer + flag |
| GET | `/api/email/oauth/{start,callback}` | Per-user OAuth (callback is state-token gated, no Bearer) | mixed + flag |
| POST | `/api/email/accounts/{enroll-all,enroll}` · `/sync/run-now` · `/rematch` | Enrollment & sync ops | **Admin** + flag |
**Defined but NOT routed:** `handle_list_tags`/`create`/`delete` exist (`backend/server.py:3366-3400`) but no `/api/tags` route is wired; the `custom_fields` tables have no routes at all. Treat both as dead for Phase 0.
---
## 4. Authentication & authorization
- **Login:** `POST /api/auth/login` → `handle_login` (`backend/server.py:1880`). Looks up active user, `verify_password` (bcrypt, PBKDF2-SHA256 fallback, `backend/server.py:444`), issues `create_token` (HS256 JWT via PyJWT, HMAC fallback; claims `user_id/username/role/exp/iat`; 24 h expiry; `backend/server.py:464`).
- **Per-request verification:** `get_user()` (`backend/server.py:1458`) reads `Authorization: Bearer`, `decode_token` (pins `algorithms=["HS256"]` — no `alg:none` downgrade), then **re-loads the user row** and rejects if missing/inactive. Identity (incl. role) comes from the **DB row, not token claims**, so deactivation and role changes take effect immediately.
- **No cookies, no logout, no refresh, no revocation.** The only early kill-switch is `is_active=0`.
- **Bootstrap:** `GET /api/bootstrap/status` (public) reports `setup_required`. `POST /api/auth/register` (public, **self-disables** once any user exists) creates and force-promotes the first user to `admin`. A separate flag-gated `seed_demo_data()` hardcodes `admin/admin123` + `grant/password` and prints them — **dev-only, off by default** (`backend/server.py:4351-4374`).
- **Roles:** only `admin`/`member`. Enforcement is an inline `require_admin(user)` (`backend/server.py:541`) at the top of each admin handler — no middleware. **No row-level authorization:** any active member can read/edit *all* LP and fundraising data; `created_by`/`owner_id` are informational only.
- **`X_API_KEY` does not exist in code.** Repo-wide it appears only in `CLAUDE.md` and `docs/PHASE_0.md`. There is no API-key header path and no service-auth distinct from the user JWT. (The `Bearer` tokens in `email_integration/` are *outbound* Google OAuth tokens, unrelated.)
- **Secrets:** JWT key `CRM_SECRET_KEY` (random per-process default; hard-fails in production if unset, `backend/server.py:4487`). Gmail OAuth tokens are encrypted at rest with **AES-256-GCM** keyed off the *separate* `CRM_GMAIL_SECRET_KEY` (`email_integration/crypto.py:37`) — a genuinely sound scheme. The Google service-account key lives at `data/secrets/gmail-service-account.json` (0600).
- **Hardening present:** per-IP sliding-window rate limits (login 20/min, writes 300/min, GETs 600/min); auto-ban of scanner IPs after a 404 burst (`record_404`, `backend/server.py:1520`). **Absent or weak:** no security headers (HSTS/CSP/X-Frame-Options); CORS defaults to wildcard `*`; `X-Forwarded-For` is trusted (only safe behind a controlled proxy).
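To make the token mechanics concrete, here is a stdlib-only sketch of HS256 signing and verification with the claim set listed above. It is a stand-in for illustration, not the code at `backend/server.py:464`. The function signatures are hypothetical, and a real verifier would still re-load the user row afterwards, as `get_user()` is described as doing.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: str, signing_input: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()


def create_token(secret, user_id, username, role, ttl_sec=24 * 3600, now=None) -> str:
    """Build header.claims.signature with a 24 h default expiry."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"user_id": user_id, "username": username, "role": role,
              "iat": issued, "exp": issued + ttl_sec}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode()) for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def decode_token(secret, token, now=None):
    """Return the claims dict, or None if malformed, wrongly signed, or expired."""
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if header.get("alg") != "HS256":          # pin the algorithm; reject alg:none
            return None
        expected = _sign(secret, f"{header_b64}.{claims_b64}")
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(claims_b64))
    except (ValueError, TypeError, AttributeError):
        return None
    current = int(time.time() if now is None else now)
    if claims.get("exp", 0) <= current:
        return None
    return claims
```

Because the signature covers the claims, a client cannot change `role` in the token. Re-loading the user row on each request additionally makes a stale but validly signed role irrelevant.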
---
## 5. Phase-0 implications (carry into A2/B/C)
1. **Pick a canonical LP identity.** The classic vs fundraising duality (§2.3) means entity resolution (A4/B3) must collapse across **both** models — grid investor + grid contacts + classic contact/org/lp_profile — into one canonical `lp_id`, not just dedupe name variants within one table. The email matcher's preference (`fundraising_contacts` > `contacts`) is a hint that the **grid is the operationally-live LP graph**, but closed financials/accreditation live only in `lp_profiles`.
2. **Canonical ID host.** `contacts.id` is the natural join target (lp_profiles + opportunities FK to it; the dossier is keyed on it), but the **8-char truncated UUID** is a uniqueness risk if it becomes the Qdrant payload key — resolve before indexing.
3. **Interaction log is greenfield.** `audit_log` won't do (mutation-only, no actor/agent dimension). A2 should add a dedicated append-only `interaction_log` table and route all MCP/agent writes through it (guardrail #5).
4. **Soft-delete gap.** Everything is hard-deleted today (guardrail #3 risk) and there's no tombstone for an idempotent re-embed to detect removals. A2 should add a `deleted_at`/archive flag and have ingest treat it as a delete-from-index signal.
5. **Chunk sources** (per `EMBEDDINGS.md`): one chunk per `communications` row; one per **matched** `emails` row (only these have bodies); split the `fundraising_investors.notes` outreach log per dated line; one chunk each for the scattered note/description fields. Keep ids/names/dates/types/confidence as filterable payload; embed only subject/body/note prose. `date_ts` = epoch of the **event** timestamp (`communication_date`/`sent_at`), not `created_at`. Key Qdrant point ids on stable ids (`emails.id`), namespaced to avoid the 8-char collision risk.
6. **Migration mechanism.** No Alembic. Either extend the in-code idempotent pattern (`CREATE … IF NOT EXISTS` + try/except `ALTER ADD COLUMN`) or adopt the email module's numbered-`.sql` runner for the core schema (recommended; add a `schema_migrations` ledger). SQLite's `ALTER TABLE` is narrow (rename table/column, add column, and drop column from SQLite 3.35 onward; no in-place type or constraint changes), which conveniently pushes changes toward the additive/reversible guardrail.
7. **Ingest connection.** No network DB protocol — open `data/crm.db` **read-only** (`sqlite3` URI `mode=ro`) with the same WAL/`busy_timeout` pragmas, co-located with the `/data` volume; WAL guarantees the reader never blocks the live writer. Reserve HTTP+JWT for any write-back. Decide the MCP↔CRM auth boundary explicitly (recommend: read-only direct SQLite for ingest; a constrained service principal for write-back) since `X_API_KEY` is unimplemented.
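Three of the ingest details in the list above (namespaced point ids, `date_ts` from the event timestamp, and splitting the notes log per dated line) can be sketched as follows. The assumptions are loud ones: the namespace string is a placeholder, the helper names are hypothetical, and the notes log is assumed to start each entry with a `YYYY-MM-DD` date, which the source does not specify. Qdrant accepts UUID strings as point ids, which is why a UUIDv5 is used here.

```python
import re
import uuid
from datetime import datetime, timezone

# Placeholder namespace (assumption); any fixed UUID works as long as it never changes.
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/crm_chunks")

# Assumed entry format: a line beginning with YYYY-MM-DD, then optional ":" or "-".
DATED_LINE = re.compile(r"^\s*(\d{4}-\d{2}-\d{2})\b[\s:\-]*(.*)$")


def point_id(source_table: str, row_id: str, part: int = 0) -> str:
    """Deterministic id: the table name namespaces the short row id."""
    return str(uuid.uuid5(POINT_NAMESPACE, f"{source_table}:{row_id}:{part}"))


def to_epoch(iso_ts: str) -> int:
    """Epoch seconds for an ISO-8601 string; naive values are treated as UTC."""
    parsed = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def split_notes_log(notes: str) -> list:
    """One chunk per dated line; undated lines attach to the preceding entry."""
    chunks = []
    for line in notes.splitlines():
        text = line.strip()
        if not text:
            continue
        match = DATED_LINE.match(line)
        if match:
            chunks.append({"date": match.group(1), "text": match.group(2).strip()})
        elif chunks:
            chunks[-1]["text"] += "\n" + text
        else:
            chunks.append({"date": None, "text": text})
    return chunks
```

Deterministic ids make a re-embed idempotent: re-ingesting the same row overwrites the same point instead of creating a duplicate, and two tables that happen to share an 8-char row id still map to different points.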
---
## 6. Open questions for the owner
1. **Which model is canonical for an LP** — the `contacts` row or the `fundraising_investors` grid row? (Determines the ingest spine and the canonical-ID target.)
2. **Is the Gmail integration enabled on the live Start9 box, and has a backfill run?** If not, the Phase-0 corpus is just `communications` + note fields until it is. (The email schema exists regardless.)
3. **Should `custom_fields` (EAV) and the unrouted `tags` CRUD be revived or removed?** Leaving the EAV risks a second divergent custom-data path next to the live fundraising custom columns.
4. **Accreditation today is a single boolean** (`lp_profiles.accredited`) with no QP flag / method / date — where should the 506(b)/506(c) + accreditation/QP fields counsel will require (guardrail #6) live?
5. **MCP↔CRM auth:** build a real `X_API_KEY` service-key path, authenticate the MCP server as a dedicated CRM user (24 h JWT, must re-login), or read SQLite directly? And does the agent principal need a new least-privilege role below `member`?
6. **`fundraising_state.grid_json` vs the normalized mirror** — confirmed authoritative = the JSON blob; is reading the normalized mirror (consistent after each save) acceptable for ingest, treating the JSON as the re-derivable source of truth?
---
*Sources: `backend/server.py`, `backend/email_integration/*`, `backend/email_integration/migrations/0001_email_tables.sql`, `start9/0.3.5/*`, `start9/0.4/*`. Generated from a structured multi-agent read of the codebase, cross-checked against the live `data/crm.db` schema (currently a near-empty seeded instance: 1 user, 9 funds, 4 views, 2 automation rules — the real corpus lives on the Start9 deployment).*
+115
View File
@@ -0,0 +1,115 @@
# Gmail Integration — Enablement Runbook
*How to turn on the (already-built) Gmail correspondence integration on the live Start9 box, validate it with a small observed backfill, then roll out to the domain. Read-only capture; all mail stays on Ten31 infrastructure.*
Code: `backend/email_integration/`. Schema: `migrations/0001_email_tables.sql`. See `docs/crm-overview.md` §2.4 for the data model.
---
## What this does & the sovereignty posture
- Pulls Gmail messages for enrolled `@ten31.xyz` mailboxes into the CRM's own SQLite DB (`emails`, `email_threads`, `email_attachments`, …), **deduped across inboxes**, **threaded**, and **matched** to investors/contacts (`email_investor_links`).
- **Scope is `https://www.googleapis.com/auth/gmail.readonly`** (`credentials.py:34`) — the integration can *read* mail, never send or modify. Lower risk, and it's all the ingest needs.
- **Data path is Google → your Start9 box only.** No new third party, and per guardrail #9 Claude never reads the mail — the correspondence becomes ingest input for *local* embeddings (bge-m3 on the Sparks), not API context. (Contrast with Superhuman's MCP — see §7.)
## 0. Pick the auth method
| Method | When | What you provide |
|---|---|---|
| **DWD (domain-wide delegation)** — recommended | You administer the `ten31.xyz` Google Workspace and want to capture team mailboxes without per-user consent | One service-account JSON key + a Workspace admin authorization |
| **Per-user OAuth** | Capturing a mailbox you don't admin, or avoiding DWD | OAuth client id/secret + each user clicks through `/api/email/oauth/start` |
The Start9 0.4 entrypoint is built around **DWD** (auto-detects the key, sets `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`). The rest of this runbook assumes DWD.
## 1. Google-side setup (one time)
You need Workspace **super-admin** + a GCP project.
1. **GCP project** → enable the **Gmail API** (`APIs & Services → Library → Gmail API → Enable`).
2. **Create a service account** (`IAM & Admin → Service Accounts`). Note its **client ID** (a long number) and its email.
3. **Create a JSON key** for it (`Keys → Add key → JSON`). This file is the secret — handle per guardrail #7.
4. **Authorize domain-wide delegation** in the Workspace **Admin console** (`Security → Access and data control → API controls → Domain-wide delegation → Add new`):
- **Client ID** = the service account's client ID from step 2.
- **OAuth scopes** = `https://www.googleapis.com/auth/gmail.readonly`
- Save. (Without this exact scope authorized, sync returns a non-retryable auth error — see `errors.py:21`.)
## 2. Install the key on Start9
1. Copy the JSON key to the service's data volume at **`/data/secrets/gmail-service-account.json`**.
2. Lock it down: `chmod 600 /data/secrets/gmail-service-account.json` (the entrypoint also `chmod 700`s `/data/secrets`).
3. **Restart the service.** On boot the 0.4 entrypoint detects the key and exports:
`CRM_GMAIL_INTEGRATION_ENABLED=true`, `CRM_GMAIL_AUTH_METHOD=dwd`, `CRM_GMAIL_SA_KEY_PATH=/data/secrets/gmail-service-account.json`, `CRM_GMAIL_WORKSPACE_DOMAIN=ten31.xyz`, `CRM_GMAIL_SYNC_INTERVAL_MIN=180`.
It logs `Gmail integration: ENABLED (key at …)`.
## 3. Smoke test — ONE mailbox first (the "don't rush it" gate)
Do a single-mailbox run before enrolling the whole team, to shake out auth/matching bugs on a small surface. All calls need an **admin Bearer token**:
```bash
CRM=https://<your-start9-crm-host> # the CRM's address
TOKEN=$(curl -sk $CRM/api/auth/login -H 'Content-Type: application/json' \
-d '{"username":"<admin>","password":"<pw>"}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["token"])')
# integration alive?
curl -sk $CRM/api/email/status -H "Authorization: Bearer $TOKEN"
# enroll just yourself
curl -sk $CRM/api/email/accounts/enroll -H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' -d '{"email":"you@ten31.xyz"}'
# trigger a sync now (otherwise it runs every 180 min)
curl -sk $CRM/api/email/sync/run-now -X POST -H "Authorization: Bearer $TOKEN"
```
**Tip:** to keep the first backfill small, set `CRM_GMAIL_BACKFILL_PAGE_SIZE` low (e.g. `50`) before the restart, watch one page land, then raise it.
## 4. Verify (on the box, read-only SQL)
```sql
-- sync ran cleanly?
SELECT kind, status, messages_seen, messages_stored, attachments_saved, error
FROM email_sync_runs ORDER BY started_at DESC LIMIT 3;
-- mail captured + how much got matched to investors/contacts
SELECT COUNT(*) total, SUM(is_matched) matched FROM emails;
-- who did it match, and how confidently?
SELECT match_kind, COUNT(*) FROM email_investor_links GROUP BY match_kind;
```
Or via the API: `GET /api/email/status` (counts) and `GET /api/email/threads?investor_id=<id>` (matched threads for one investor). If matching looks thin, run `POST /api/email/rematch` with `{"since":"<ISO8601>"}` after the investor list is populated.
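The read-only SQL checks above can also be scripted. This is a minimal sketch, assuming the `emails` table and `is_matched` column from the queries in this section; the helper names are illustrative. Opening the database through a `mode=ro` URI makes SQLite itself refuse writes, so a typo cannot modify `crm.db`.

```python
import sqlite3


def open_readonly(db_path):
    """Open a SQLite database so that any write attempt raises OperationalError."""
    # Assumes db_path contains no '?' or '#', which would need URI escaping.
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def match_summary(conn):
    """Total captured emails, how many were matched, and the match rate."""
    total, matched = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(is_matched), 0) FROM emails"
    ).fetchone()
    rate = matched / total if total else 0.0
    return {"total": total, "matched": matched, "match_rate": rate}
```

A low `match_rate` right after the first sync is expected if the investor list is still sparse; the rematch endpoint above is the remedy, not a schema change.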
## 5. Roll out to the domain
Once the single mailbox looks right:
```bash
curl -sk $CRM/api/email/accounts/enroll-all -X POST -H "Authorization: Bearer $TOKEN"
curl -sk $CRM/api/email/sync/run-now -X POST -H "Authorization: Bearer $TOKEN"
```
Incremental sync then runs every `CRM_GMAIL_SYNC_INTERVAL_MIN` minutes (default 180) via the scheduler thread.
## 6. Tuning knobs (env, `config.py`)
`CRM_GMAIL_SYNC_INTERVAL_MIN` (180) · `CRM_GMAIL_BACKFILL_PAGE_SIZE` (500) · `CRM_GMAIL_MAX_ATTACHMENT_MB` (50) · `CRM_GMAIL_ATTACH_CONCURRENCY` (4) · `CRM_GMAIL_RATE_UNITS_SEC` (150) · `CRM_GMAIL_HISTORY_STALE_DAYS` (5, forces a backfill if Gmail pruned history).
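As an illustration of how these knobs resolve, here is a sketch that reads them from the environment with the defaults listed above. It is not the actual `config.py`; the `GmailTuning` class and `load_tuning` function are hypothetical, and only the variable names and default values come from this runbook.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GmailTuning:
    sync_interval_min: int = 180
    backfill_page_size: int = 500
    max_attachment_mb: int = 50
    attach_concurrency: int = 4
    rate_units_sec: int = 150
    history_stale_days: int = 5


_ENV_NAMES = {
    "sync_interval_min": "CRM_GMAIL_SYNC_INTERVAL_MIN",
    "backfill_page_size": "CRM_GMAIL_BACKFILL_PAGE_SIZE",
    "max_attachment_mb": "CRM_GMAIL_MAX_ATTACHMENT_MB",
    "attach_concurrency": "CRM_GMAIL_ATTACH_CONCURRENCY",
    "rate_units_sec": "CRM_GMAIL_RATE_UNITS_SEC",
    "history_stale_days": "CRM_GMAIL_HISTORY_STALE_DAYS",
}


def load_tuning(env=None):
    """Build a GmailTuning from env vars; unset or blank variables keep the default."""
    env = os.environ if env is None else env
    overrides = {}
    for field, name in _ENV_NAMES.items():
        raw = env.get(name, "").strip()
        if raw:
            overrides[field] = int(raw)  # raises ValueError on non-numeric input
    return GmailTuning(**overrides)
```

With this shape, `load_tuning({"CRM_GMAIL_BACKFILL_PAGE_SIZE": "50"})` changes only the page size, which mirrors the small-first-backfill tip in section 3.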
## 7. Where Superhuman fits (and where it doesn't)
You have Superhuman connected to Gmail, and it exposes an MCP server. The two are **complementary, not competing**, and it matters which job each does:
- **Canonical correspondence ingest → use this DWD integration, not Superhuman.** It pulls mail straight into your own `crm.db` on Start9 and feeds the *local* embedding pipeline. Routing bulk ingest through Superhuman's MCP would put your email content through Superhuman's servers and — because an agent/Claude would be driving those calls — through Anthropic, which is exactly what guardrail #1 keeps the corpus away from. DWD keeps the data path Google → your box.
- **Human mail workflow & drafting → Superhuman MCP is great.** Reading/triaging your own inbox, and Closer-style *draft* generation that a human reviews and sends, naturally happen in your real mail client. The `batch-draft-writer` skill already drives the Superhuman MCP for that, and it's usable today — independent of the CRM pipeline.
Net: **DWD = system-of-record correspondence (sovereign, for retrieval). Superhuman MCP = the human's working surface (drafting, triage).** Don't make Superhuman the ingest source of truth.
## 8. Disable / rollback
Remove (or rename) `/data/secrets/gmail-service-account.json` and restart → the entrypoint logs `DISABLED` and routes return 503; captured data remains. To pause one mailbox without disabling the whole integration, set its `email_accounts.sync_enabled = 0`.
## 9. Troubleshooting
- **401/403 from Google on sync** → DWD scope not authorized, wrong client ID, or Gmail API not enabled (§1, steps 1 and 4). This error is non-retryable by design (`errors.py`).
- **`status` says disabled / routes 503** → key not found at `CRM_GMAIL_SA_KEY_PATH`, or `CRM_GMAIL_INTEGRATION_ENABLED` not truthy (the entrypoint only sets it when the key file exists).
- **Mail captured but `matched = 0`** → the investor/contact list was empty or addresses don't match; populate the CRM/grid first, then `POST /api/email/rematch`.
- **Bodies missing on some emails** → by design, unmatched emails are stored metadata-only (no body) until matched (`sync.py`); re-match to backfill.
+88
View File
@@ -0,0 +1,88 @@
# Redaction / Re-hydration Boundary
*Design note for the privacy boundary between Ten31's sovereign data and the Claude API. Written in Phase 0 so it is a designed step, not an afterthought. **It is BUILT in Phase 2/3** (Analyst, Closer) — Phase 0 does not need it (see §1).*
Related: `CLAUDE.md` guardrails #1 (sovereignty), #4 (human-in-the-loop outbound), #9 (dev-time data handling); `docs/Ten31_Agentic_Build_Plan.md` §4.5.
---
## 1. When this applies (and when it doesn't)
The system has two very different data paths, and only one of them needs this:
- **Local-only paths — no redaction needed, because Claude is never in the loop.** All of Phase 0 is here: ingest, chunking, embeddings (bge-m3), the vector index (Qdrant), and entity resolution (local Qwen). Sensitive data flows `CRM → Sparks → Qdrant`, all on Ten31 infrastructure. Nothing reaches Anthropic, so there is nothing to scrub. **Do not add redaction overhead to the ingest/retrieval path.**
- **Claude-facing reasoning steps — this boundary applies.** When an agent asks Claude to *reason over* LP-specific content: **Analyst** (building a dossier from retrieved chunks), **Closer** (drafting outreach/nurture/meeting prep), and any Orchestrator step that forwards record content. These send sensitive context to a third-party API and are the reason this boundary exists.
The guiding rule (guardrail #1) is *"only the minimum necessary, non-sensitive context per call."* Redaction/re-hydration is how we honor that rule **when the task genuinely needs record content** — as opposed to simply sending less.
## 2. The three-tier data classification
Before any agent calls Claude, classify each piece of context:
| Tier | Examples | Treatment |
|---|---|---|
| **1. Never send** | Full LP list/export, bulk relationship graph, raw account numbers, wire details, SSNs/passport, anything covered by a confidentiality obligation | Stays on Ten31 infra. Not even tokenized, just excluded. |
| **2. Tokenize (pseudonymize)** | Person names, org/fund names, emails, phone, physical addresses, exact $ amounts, dates that pin identity | Replaced with stable placeholders before the call; real values swapped back locally after. |
| **3. Send as-is** | The *substance* an agent needs Claude to reason about: thesis discussion, sentiment, objections, generic deal mechanics, the drafted message body (minus identifiers) | Sent in the de-identified prompt. |
The art is the middle tier: keep enough semantic content for Claude to be useful, while every *identifier* is a placeholder.
## 3. The round-trip
```
┌─────────────────────────── Ten31 infrastructure (sovereign) ───────────────────────────┐
│ │
│ 1. SCRUB (local model on the Sparks, via Spark Control) │
│ - Pull the minimal context the task needs (retrieved chunks + record fields). │
│ - NER + rule pass replaces Tier-2 identifiers with stable tokens: │
│ "Jonathan Reyes" -> [PERSON_1] "Cedar Point Capital" -> [ORG_1] │
│ "jon@cedarpoint..." -> [EMAIL_1] "$5,000,000" -> [AMOUNT_1] "Fund III" -> [FUND_1]│
│ - Tokens are STABLE within a task (same entity -> same token) and CONSISTENT across │
│ all chunks in the call, so Claude can reason about relationships. │
│ - The pseudonym map { [PERSON_1] -> "Jonathan Reyes", ... } is held LOCALLY, keyed to │
│ the task/session. It never leaves the box. │
│ - Drop Tier-1 content entirely. Log the scrub to the interaction_log. │
│ │
└──────────────────────────────────────────┬───────────────────────────────────────────────┘
│ de-identified prompt (placeholders only)
2. REASON — Claude API (Agent SDK)
Drafts / synthesizes using [PERSON_1], [ORG_1], [FUND_1] ...
│ response referencing the same placeholders
┌──────────────────────────────────────────┴───────────────────────────────────────────────┐
│ │
│ 3. RE-HYDRATE (local) │
│ - Substitute real values back in using the local pseudonym map. │
│ - A human reviews the re-hydrated draft (guardrail #4) before anything is sent. │
│ - Log the rehydrate + the human decision to the interaction_log. │
│ │
└──────────────────────────────────────────────────────────────────────────────────────────┘
```
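The scrub and re-hydrate steps in the diagram can be sketched as a task-scoped map. This is an illustration under stated assumptions, not the Phase 2/3 design: the `PseudonymMap` class is hypothetical, and the entity list is taken as given (in the real system it would come from the local NER + rule pass).

```python
import re


class PseudonymMap:
    """Task-scoped, in-memory map between real identifiers and placeholder tokens."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}
        self._counters = {}

    def token_for(self, kind, value):
        """Return a stable token: the same (kind, value) always maps to the same token."""
        key = (kind, value)
        if key not in self._token_by_value:
            n = self._counters.get(kind, 0) + 1
            self._counters[kind] = n
            token = f"[{kind}_{n}]"
            self._token_by_value[key] = token
            self._value_by_token[token] = value
        return self._token_by_value[key]

    def scrub(self, text, entities):
        """Replace each (kind, value) entity in text with its token."""
        entities = list(entities)
        for kind, value in entities:  # assign tokens in the order entities were given
            self.token_for(kind, value)
        # Replace longer values first so a short value cannot split a longer one.
        for kind, value in sorted(entities, key=lambda e: len(e[1]), reverse=True):
            text = text.replace(value, self.token_for(kind, value))
        return text

    def rehydrate(self, text):
        """Swap known tokens back to real values; unknown tokens are left untouched."""
        return re.sub(
            r"\[[A-Z]+_\d+\]",
            lambda m: self._value_by_token.get(m.group(0), m.group(0)),
            text,
        )
```

Reusing one `PseudonymMap` for every chunk in a call is what gives the consistency the diagram asks for; discarding it when the task ends keeps the de-anonymization key short-lived.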
## 4. Where it runs
The natural home is **Spark Control** (the gateway that already fronts the local models): add a scrub endpoint and a rehydrate endpoint, or run the pair as middleware inside the Agent SDK tool loop so every outbound Claude call passes through it. Keeping it at the gateway means:
- One enforcement point — agents can't accidentally bypass it.
- The pseudonym map lives next to the local models, never in agent memory that might get logged.
- The scrub uses the same local Qwen already used for entity resolution, so the NER is consistent with how entities were canonicalized at ingest (reuse the `canonical_entities` → token mapping).
## 5. Caveats (why this is a tool, not a magic switch)
- **Free-text leakage is the hard part.** A note that says *"the family that sold the mining company in Texas last year"* re-identifies even with the name tokenized. The scrub model must catch *descriptive* identifiers, not just named entities — and it will not be perfect. For high-sensitivity tasks, prefer sending *less* (summary/thesis only) over trusting the scrub to catch everything.
- **Re-identification by inference.** Enough tokenized-but-specific detail (amounts + dates + sector) can still single out a person. Keep Tier-2 amounts/dates *bucketed* ("~$5M", "Q1") when the exact value isn't needed.
- **Map integrity.** The pseudonym map is sensitive (it's the de-anonymization key) — keep it local, in memory or short-lived, never logged to a third party, never sent in a prompt.
- **It does not replace minimization.** First ask "does Claude need this record content at all?" Often a retrieval summary suffices. Redaction is for when the answer is genuinely yes.
- **Consistency with retrieval.** Retrieval itself is already local, so chunks come back with real values; the scrub is applied at the *prompt-assembly* step, not at ingest.
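The bucketing suggested in the re-identification caveat could look like the sketch below. The precision chosen here (nearest thousand, million, or billion; calendar quarter) is only an assumption for illustration, since the actual policy is an open question in section 7, and both function names are hypothetical.

```python
from datetime import date


def bucket_amount(dollars):
    """Coarsen a dollar amount to a rounded label such as '~$5M'."""
    for threshold, suffix in ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")):
        if dollars >= threshold:
            return f"~${round(dollars / threshold)}{suffix}"
    return "<$1K"


def bucket_date(d, with_year=False):
    """Coarsen a date to its calendar quarter, optionally keeping the year."""
    quarter = (d.month - 1) // 3 + 1
    return f"Q{quarter} {d.year}" if with_year else f"Q{quarter}"
```

Bucketed values are lossy by design, so they are not candidates for re-hydration; if the exact figure must appear in the final draft, tokenize it as a Tier-2 identifier instead.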
## 6. Verification (when we build it)
- A test harness that asserts no Tier-1 string and no real Tier-2 identifier appears in any outbound payload (golden-file diff over recorded prompts).
- A re-identification spot-check: have the local model attempt to re-identify entities from the de-identified prompt alone; flag anything it gets right.
- Every scrub/rehydrate logged to `interaction_log` (actor, task, token-count, what tier was dropped) for audit (guardrail #5).
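The first check in this list can be sketched as a simple verbatim scan over an outbound payload. It assumes the harness is handed the Tier-1 strings and the real values from the pseudonym map; `find_leaks` is a hypothetical name. It catches literal matches only, so it does not address the descriptive-identifier problem from section 5.

```python
def find_leaks(payload, never_send, pseudonym_values):
    """Return every sensitive string that appears verbatim (case-insensitively) in payload."""
    haystack = payload.casefold()
    leaks = []
    for secret in list(never_send) + list(pseudonym_values):
        # Skip empty strings, which would trivially match any payload.
        if secret and secret.casefold() in haystack:
            leaks.append(secret)
    return leaks
```

In a test suite this would run over each recorded prompt with an assertion that the result is empty, failing the build on the first leaked identifier.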
## 7. Open questions (resolve at build time, Phase 2/3)
1. Token granularity — per-task ephemeral maps, or a stable per-entity token space reused across tasks (better for Claude's cross-call memory, worse for re-identification risk)?
2. Do we tokenize the *drafted outbound message itself* (Closer) and re-hydrate, or draft against placeholders and let the human fill specifics? (Affects how much the human edits.)
3. Bucketing policy for amounts/dates — what precision is "non-sensitive enough"?
4. Where exactly in the Agent SDK loop the middleware sits, and how it composes with prompt caching (placeholders must be cache-stable).
Binary file not shown. After: 10 KiB.
Binary file not shown. After: 10 KiB.
+1558 -122
View File
File diff suppressed because it is too large Load Diff
-3571
View File
File diff suppressed because it is too large Load Diff
-3559
View File
File diff suppressed because it is too large Load Diff
-33
View File
@@ -1,33 +0,0 @@
# Deploy on StartOS 0.3.5 (Raspberry Pi)
## 1) Build the package on your Mac
```bash
cd /Users/macpro/Projects/CRM
make -C start9/0.3.5 package
```
This creates:
- `start9/0.3.5/image.tar`
- `start9/0.3.5/ten31-database.s9pk`
## 2) Upload package to StartOS
1. Open StartOS web UI.
2. Go to Services -> Sideload Package (or equivalent 0.3.5 menu).
3. Upload `ten31-database.s9pk`.
4. Install and start the service.
## 3) First run
1. Open the service UI.
2. Create first admin account on the login screen.
3. In Settings, run one manual backup immediately.
## 4) Data persistence contract
- App DB path: `/data/crm.db`
- Backup path: `/data/backups`
Because these are in the persistent service volume, app restarts/upgrades do not erase data.
## 5) Before any upgrade/migration
1. Run manual backup in-app.
2. Export fundraising state in-app.
3. Keep both files off-device as recovery copy.
-25
View File
@@ -1,25 +0,0 @@
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
CRM_ENV=production \
CRM_HOST=0.0.0.0 \
CRM_PORT=8080 \
CRM_DATA_DIR=/data \
CRM_FRONTEND_DIR=/app/frontend
WORKDIR /app
RUN apt-get update \
&& apt-get install -y --no-install-recommends ca-certificates curl \
&& rm -rf /var/lib/apt/lists/*
COPY backend/server.py /app/backend/server.py
COPY frontend /app/frontend
COPY start9/0.3.5/docker_entrypoint.sh /usr/local/bin/docker_entrypoint.sh
COPY start9/0.3.5/healthcheck.sh /usr/local/bin/healthcheck.sh
RUN chmod +x /usr/local/bin/docker_entrypoint.sh /usr/local/bin/healthcheck.sh
EXPOSE 8080
ENTRYPOINT ["/usr/local/bin/docker_entrypoint.sh"]
-23
View File
@@ -1,23 +0,0 @@
PKG_ID := ten-database
PKG_VERSION := 0.1.0.1
REPO_ROOT := $(abspath ../..)
WRAPPER_DIR := $(CURDIR)
IMAGE_NAME := start9/$(PKG_ID)/main:$(PKG_VERSION)
.PHONY: image-arm package verify clean
image-arm:
docker buildx build --platform=linux/arm64 \
-f $(WRAPPER_DIR)/Dockerfile \
-t $(IMAGE_NAME) \
-o type=docker,dest=$(WRAPPER_DIR)/image.tar \
$(REPO_ROOT)
package: image-arm
start-sdk pack
verify:
start-sdk verify s9pk $(PKG_ID).s9pk
clean:
rm -f $(WRAPPER_DIR)/image.tar $(WRAPPER_DIR)/$(PKG_ID).s9pk
-23
View File
@@ -1,23 +0,0 @@
# Start9 Wrapper (0.3.5)
This directory contains the StartOS 0.3.5 package wrapper for Ten31 Database.
## Build prerequisites
- Docker with buildx
- `start-sdk` installed on build machine
## Build package
```bash
cd /Users/macpro/Projects/CRM
make -C start9/0.3.5 package
```
## Verify package
```bash
cd /Users/macpro/Projects/CRM
make -C start9/0.3.5 verify
```
## Outputs
- `start9/0.3.5/image.tar`
- `start9/0.3.5/ten-database.s9pk`
-20
View File
@@ -1,20 +0,0 @@
#!/bin/sh
set -eu
DATA_DIR="${CRM_DATA_DIR:-/data}"
SECRET_FILE="$DATA_DIR/.crm-secret"
mkdir -p "$DATA_DIR" "$DATA_DIR/backups"
if [ -z "${CRM_SECRET_KEY:-}" ]; then
if [ -f "$SECRET_FILE" ]; then
CRM_SECRET_KEY="$(cat "$SECRET_FILE")"
else
CRM_SECRET_KEY="$(head -c 48 /dev/urandom | base64 | tr -d '\n' | tr '/+' 'ab')"
printf '%s' "$CRM_SECRET_KEY" > "$SECRET_FILE"
chmod 600 "$SECRET_FILE"
fi
export CRM_SECRET_KEY
fi
exec python3 /app/backend/server.py
-5
View File
@@ -1,5 +0,0 @@
#!/bin/sh
set -eu
PORT="${CRM_PORT:-8080}"
curl -fsS "http://127.0.0.1:${PORT}/api/health" >/dev/null
Binary file not shown. Before: 112 B.
Binary file not shown.
-24
View File
@@ -1,24 +0,0 @@
# Ten31 Database (StartOS 0.3.5)
## What this package does
- Runs Ten31 Database as a private web app.
- Persists all data under the StartOS service volume (`/data`).
- Exposes web UI/API on internal port `8080`.
## First launch
1. Open the service UI from StartOS.
2. If this is a fresh install, create the first admin account from the login screen.
3. Go to Settings and run a manual backup once.
## Airtable migration
1. Open Settings -> Migration.
2. Choose "Import from Airtable CSV".
3. Confirm row/column mappings before final import.
## Data safety
- Database path in container: `/data/crm.db`.
- Backups path in container: `/data/backups/`.
- Before StartOS or package upgrades, run a backup and export from Settings.
## Upgrade note
This 0.3.5 wrapper keeps app/runtime files separate from data volume so migration to a future 0.4 wrapper can preserve the same data directory layout.
-95
View File
@@ -1,95 +0,0 @@
id: ten-database
title: Ten31 Database
version: 0.1.0.1
release-notes: >-
Initial StartOS 0.3.5 package wrapper for Ten31 Database.
license: MIT
wrapper-repo: https://github.com/ten31/ten31-database-startos
upstream-repo: https://github.com/ten31/ten31-database
support-site: https://github.com/ten31/ten31-database/issues
marketing-site: https://ten31.vc
build: ["make image-arm"]
min-os-version: 0.3.5
description:
short: Self-hosted investor and fundraising database for Ten31.
long: >-
Ten31 Database is an Airtable-style investor CRM with fundraising grid,
communications logging, views, backups, and CSV import. This package stores
all runtime data in the service volume for upgrade-safe persistence.
assets:
license: LICENSE
icon: icon.png
instructions: instructions.md
docker-images: image.tar
main:
type: docker
image: main
entrypoint: docker_entrypoint.sh
args: []
mounts:
main: /data
health-checks:
main:
name: API health
success-message: CRM API is responding.
type: docker
image: main
entrypoint: healthcheck.sh
args: []
inject: true
config: ~
dependencies: {}
volumes:
main:
type: data
interfaces:
main:
name: Web Interface
description: Browser UI and API for Ten31 Database.
tor-config:
port-mapping:
80: "8080"
lan-config:
8080:
ssl: false
internal: 8080
ui: true
protocols: [http]
backup:
create:
type: docker
image: main
system: false
entrypoint: sh
args:
- -c
- |
set -eu
rm -rf /backup/*
cp -a /data/. /backup/
mounts:
main: /data
backup: /backup
restore:
type: docker
image: main
system: false
entrypoint: sh
args:
- -c
- |
set -eu
cp -a /backup/. /data/
mounts:
main: /data
backup: /backup
actions: {}
+6
View File
@@ -0,0 +1,6 @@
node_modules/
javascript/
javascript.old/
.DS_Store
.home/
*.s9pk
+231
View File
@@ -0,0 +1,231 @@
# Ten31 Database — StartOS 0.4 deployment guide
This guide walks through building the `ten-database` 0.4 service package and
sideloading it onto an x86_64 StartOS 0.4 beta machine.
The `start9/0.4/` folder is intentionally self-contained. It does not share
any files with `start9/0.3.5/`, so the legacy package stays intact and can
be rebuilt later if needed.
---
## 0 — How data preservation works
Starting with **0.1.0:40**, this package no longer ships a seed snapshot. The
0.3.5 → 0.4 migration is complete and the live `/data` volume on the StartOS
host is the sole source of truth.
Key facts:
- StartOS preserves the `main` volume across sideloads. Reinstalling a new
`.s9pk` does **not** touch `/data/crm.db`, `/data/backups/`, or
`/data/.crm-secret`. Live edits made between releases are kept.
- Only `Uninstall` from the StartOS UI destroys the `main` volume. As long
as you only `Stop → Sideload new .s9pk → Start`, your data persists.
- Use StartOS-level **Backups → Create Backup** for full volume snapshots,
and the in-app **Settings → Admin → Run Backup** for JSON exports under
`/data/backups/`.
Container paths (unchanged from 0.3.5):
- `/data/crm.db` — primary SQLite DB (WAL journal mode)
- `/data/backups/` — JSON exports
- `/data/.crm-secret` — JWT signing key (kept across restarts so sessions stick)
> Historical note: `0.1.0:39` shipped a baked-in seed snapshot and a
> first-boot copy guarded by `! -f /data/crm.db`. That code path was removed
> in `0.1.0:40`. If you ever need to bootstrap a fresh host again, sideload
> `0.1.0:39` first, let it seed, then upgrade to the latest.
---
## 1 — Build-machine prerequisites
The 0.4 build runs on any machine with:
- Node.js ≥ 20 and npm
- Docker with buildx enabled (Docker Desktop on macOS works; Linux Docker
must have the `buildx` plugin)
- `start-cli` (Start9 SDK) — install per
https://docs.start9.com/packaging/0.4.0.x/environment-setup.html
- `jq`, `make`, `s3cmd` (s3cmd only if you also plan to `make publish`)
Recommended one-time setup:
```sh
# Initialize the Start9 developer key (run once per build machine)
start-cli init-key
# Create ~/.startos/config.yaml so `make install` can sideload:
cat > ~/.startos/config.yaml <<'YAML'
# Replace with the hostname of your 0.4 beta node
host: http://start9.local
YAML
```
---
## 2 — Build the x86_64 .s9pk
From the repo root:
```sh
cd start9/0.4
# One-time dependency install (pulls start-sdk + friends):
npm ci
# Clean build (produces ten-database_x86_64.s9pk):
make clean
make x86
```
Output:
- `ten-database_x86_64.s9pk` in `start9/0.4/`
- Build summary printed by s9pk.mk (title, version, arch, SDK version,
git hash)
> Note: `make` by default builds x86, arm, and riscv. The `Makefile`
> in this folder overrides `ARCHES := x86` so only x86_64 is produced.
> If you later need arm64 too, switch to `ARCHES := x86 arm`.
### If the build fails
Common causes and fixes:
- **`.git/HEAD` or `.git/index` missing** — s9pk.mk requires a real git
repo. It looks at `../../.git` relative to `start9/0.4/` (i.e. the repo
root). Make sure you're building inside the actual repo.
- **`start-cli not found`** — install the Start9 SDK CLI.
- **docker buildx error** — run `docker buildx create --use` once.
- **Permission denied removing `javascript/` between builds** — macOS
extended attributes can make ncc output files immutable. Run
`chmod -R u+w start9/0.4/javascript` and retry, or just delete the folder
(`rm -rf start9/0.4/javascript`, or drag it to the Trash in Finder).
---
## 3 — Sideload onto the StartOS 0.4 beta node
Two options:
### Option 1 — `make install` (uses ~/.startos/config.yaml)
```sh
cd start9/0.4
make install
```
This runs `start-cli package install -s ten-database_x86_64.s9pk` against
whatever host you set in `~/.startos/config.yaml`.
### Option 2 — StartOS web UI
1. Copy `ten-database_x86_64.s9pk` onto a machine that can reach the
StartOS 0.4 UI.
2. In the UI: **System → Sideload Service → pick the .s9pk → Install.**
3. After the install completes, open the service and click **Start**.
### First-boot verification
After upgrading the service:
1. Open the Ten31 Database UI from the Interfaces page.
2. Log in with your existing account — passwords and sessions persist
because `/data/.crm-secret` is preserved.
3. Spot-check a few rows in the fundraising grid against what you saw
before the upgrade.
4. Run one manual backup (Settings → Admin → Run Backup) to confirm the
app's write path works.
---
## 4 — Rollback plan
If a new sideload misbehaves:
1. **Stop** the service in StartOS — do not Uninstall (that deletes the
`main` volume).
2. Sideload the previous `.s9pk` (keep one around) and Start.
3. Investigate by opening the service logs from the StartOS UI.
For full disaster recovery, restore the `main` volume from a StartOS-level
Backup.
---
## 5 — File map (what lives where)
```
start9/0.4/
├── DEPLOY_040.md # this file
├── README.md # short overview
├── Dockerfile # self-contained; refs only start9/0.4/ paths
├── Makefile # thin override: ARCHES := x86
├── s9pk.mk # shared 0.4 build plumbing (do not edit)
├── package.json, -lock.json # start-sdk + build tooling
├── tsconfig.json
├── docker_entrypoint.sh # ensures /data dirs + JWT secret, starts server.py
├── healthcheck.sh # curl /api/health (diagnostics only)
├── icon.svg # service icon
├── LICENSE
├── refresh_seed.sh # (LEGACY) scp helper from 0.3.5; kept for reference
├── assets/
│ └── ABOUT.md # user-facing install description
├── seed/ # (LEGACY) historical seed snapshot, NOT shipped
│ ├── README.md
│ └── data/ # crm.db + backups from initial 0.3.5 → 0.4 cut
└── startos/ # SDK source (manifest, main, interfaces…)
├── index.ts # SDK entry (no edits normally needed)
├── sdk.ts # typed SDK instance
├── utils.ts # shared constants
├── i18n.ts # simple passthrough
├── manifest/
│ ├── index.ts # id, title, images, arches, volumes, alerts
│ └── i18n.ts # localized short/long description
├── versions/
│ ├── index.ts # versionGraph wiring
│ ├── v0.1.0.39.ts # first 0.4 release (with seed)
│ └── v0.1.0.40.ts # current release (seed removed)
├── init/index.ts # setupInit ordering
├── main.ts # daemon + health check
├── interfaces.ts # HTTP interface on port 8080
├── backups.ts # Backups.ofVolumes('main')
├── dependencies.ts # (none)
└── actions/index.ts # (none)
```
The `seed/` directory and `refresh_seed.sh` are no longer referenced by the
build and can be deleted from the repo at any time. They are kept on disk
purely as a historical snapshot of the data that was migrated off the 0.3.5
host on first cutover.
---
## 6 — Things to remember
- Package id stays `ten-database` across both 0.3.5 and 0.4 so there is
exactly one service to manage on each host.
- The service volume id is `main` on both sides and mounts at `/data`
inside the container. This is what makes data preservation trivial.
- The 0.4 release is x86_64 only. If you later deploy to aarch64, change
`ARCHES` in the Makefile and rebuild.
- The built `.s9pk` is not committed — treat it as a build artifact.
`.gitignore` already ignores `*.s9pk` and `javascript/`.
- If you change anything under `startos/`, run `npm run check` (tsc) and
`npm run build` (ncc) before re-packaging.
---
## 7 — Quick cheat sheet
```sh
cd start9/0.4
make clean
make x86
make install
```
After `make install` completes, open the service in the StartOS UI,
hit Start, and verify the app still works.
+52
View File
@@ -0,0 +1,52 @@
# ─────────────────────────────────────────────────────────────────
# Ten31 Database — StartOS 0.4 container image
# ─────────────────────────────────────────────────────────────────
# Build context (from the startos manifest dockerBuild.workdir)
# is the repository root (two levels up from start9/0.4/), so all
# COPY paths below are relative to the repo root.
#
# This image is intentionally self-contained under start9/0.4/:
# no files are pulled from start9/0.3.5/ so the two packages can
# evolve independently.
#
# As of 0.1.0:40 the image NO LONGER ships a seed snapshot. The
# initial migration from 0.3.5 has been completed; from this
# release forward the live /data volume on the StartOS host is
# the sole source of truth and is preserved across sideloads.
# ─────────────────────────────────────────────────────────────────
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
CRM_ENV=production \
CRM_HOST=0.0.0.0 \
CRM_PORT=8080 \
CRM_DATA_DIR=/data \
CRM_FRONTEND_DIR=/app/frontend
WORKDIR /app
RUN apt-get update \
&& apt-get install -y --no-install-recommends ca-certificates curl \
&& rm -rf /var/lib/apt/lists/*
# ── Python dependencies ─────────────────────────────────────────
# Only one hard dep for now: `cryptography` is required by the Gmail
# integration's RS256 JWT signing (DWD bearer tokens). Everything else
# server.py needs is stdlib.
RUN pip install --no-cache-dir cryptography==42.0.5
# ── Application source ──────────────────────────────────────────
COPY backend/server.py /app/backend/server.py
COPY backend/email_integration /app/backend/email_integration
COPY frontend /app/frontend
# ── StartOS wrapper scripts ─────────────────────────────────────
COPY start9/0.4/docker_entrypoint.sh /usr/local/bin/docker_entrypoint.sh
COPY start9/0.4/healthcheck.sh /usr/local/bin/healthcheck.sh
RUN chmod +x /usr/local/bin/docker_entrypoint.sh \
/usr/local/bin/healthcheck.sh
EXPOSE 8080
ENTRYPOINT ["/usr/local/bin/docker_entrypoint.sh"]
+4
View File
@@ -0,0 +1,4 @@
# overrides to s9pk.mk must precede the include statement
ARCHES := x86
include s9pk.mk
+50 -7
View File
@@ -1,9 +1,52 @@
# Start9 Wrapper (0.4 placeholder)
# Ten31 Database — StartOS 0.4 wrapper (x86_64)
This directory is reserved for the StartOS 0.4 package wrapper.
This directory is the self-contained StartOS 0.4 service package for
Ten31 Database. It is the x86_64 successor to the 0.3.5 (aarch64)
wrapper in `../0.3.5/`. Both packages share the same package id
(`ten-database`) and the same `/data` volume layout so data can be
preserved across the migration.
Migration plan from 0.3.5:
1. Keep package id stable (`ten-database`) if StartOS migration path allows.
2. Keep mounted data directory contract unchanged (`/data/crm.db`, `/data/backups`).
3. Rebuild wrapper files against 0.4 packaging spec and verify with current start-sdk.
4. Test upgrade on a staging node using production backup restore before live cutover.
## Start here
**Read `DEPLOY_040.md` first.** It covers:
1. How the image-seed data-preservation mechanism works.
2. How to refresh the seed with live production data from the 0.3.5 host
(via `./refresh_seed.sh` or manual scp).
3. How to install the build prerequisites (Node, Docker, `start-cli`).
4. How to build the x86_64 `.s9pk`.
5. How to sideload onto the StartOS 0.4 beta node.
6. A rollback plan and a post-install verification checklist.
## Quick cheat sheet
```sh
# From this directory:
./refresh_seed.sh embassy@embassy.local # pull live prod data into seed/
make clean
make x86
make install # uses ~/.startos/config.yaml
```
## Data layout (unchanged from 0.3.5)
Inside the container:
- `/data/crm.db` — SQLite database
- `/data/backups/` — app-level JSON exports
- `/data/.crm-secret` — JWT signing key (created on first boot if absent)
The entrypoint seeds an empty volume from the image's baked-in snapshot on
first boot, and is a no-op for every later boot. Existing volumes are
never overwritten.
## Status
- Source scaffold: complete and `tsc --noEmit` clean against
`@start9labs/start-sdk` 0.4.0.
- Dockerfile: self-contained under `start9/0.4/` with no cross-folder
references to `start9/0.3.5/`.
- Seed snapshot: present at `seed/data/` (repo dev DB — replace with live
prod data before building).
- Not yet built into a `.s9pk` here; build on a machine with Docker +
`start-cli` per `DEPLOY_040.md`.
+11
View File
@@ -0,0 +1,11 @@
Ten31 Database is a self-hosted investor CRM and fundraising database.
This StartOS 0.4 package is the x86_64 successor to the 0.3.5 (aarch64) wrapper. It preserves the original runtime data layout inside the service volume:
- `/data/crm.db` — SQLite database (investors, contacts, fundraising grid, views, users, backups, feature requests, app settings)
- `/data/backups/` — app-level JSON snapshot exports
- `/data/.crm-secret` — JWT signing key (generated on first boot if absent)
First boot seeds the service volume from a snapshot baked into the image so the new install comes up with existing data already populated. The seed is skipped if the volume already contains a `crm.db`, so it is safe to reinstall or restore from a future StartOS 0.4 backup without losing data.
The wrapper's only differences from upstream are StartOS container wiring, the private web interface on internal port 8080, and backup integration (the whole `main` volume is included in StartOS backups).
+61
View File
@@ -0,0 +1,61 @@
#!/bin/sh
# ═══════════════════════════════════════════════════════════════
# Ten31 Database container entrypoint (StartOS 0.4 wrapper)
# ═══════════════════════════════════════════════════════════════
#
# Responsibilities:
# 1. Ensure the mounted /data volume directories exist.
# 2. Ensure a persistent CRM_SECRET_KEY exists so issued JWTs
# survive container restarts.
# 3. Launch the Python backend server.
#
# Note: This entrypoint NO LONGER seeds /data from a baked-in
# snapshot. The 0.3.5 → 0.4 migration is complete; from 0.1.0:40
# forward the live /data volume on the StartOS host is the sole
# source of truth. StartOS preserves /data across sideloads, so
# upgrades will not disturb live data.
# ═══════════════════════════════════════════════════════════════
set -eu
DATA_DIR="${CRM_DATA_DIR:-/data}"
SECRET_FILE="$DATA_DIR/.crm-secret"
SECRETS_DIR="$DATA_DIR/secrets"
EMAIL_ATTACHMENTS_DIR="$DATA_DIR/email_attachments"
GMAIL_SA_KEY="$SECRETS_DIR/gmail-service-account.json"
mkdir -p "$DATA_DIR" "$DATA_DIR/backups" "$SECRETS_DIR" "$EMAIL_ATTACHMENTS_DIR"
# /data/secrets holds the Gmail service-account key; lock it down so only
# the container user can read the directory. chmod on the file itself is
# the operator's responsibility when they drop the key in.
chmod 700 "$SECRETS_DIR" 2>/dev/null || true
# ── Persistent JWT secret ───────────────────────────────────────
if [ -z "${CRM_SECRET_KEY:-}" ]; then
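# -s (not -f): an empty secret file is treated as missing and regenerated.
if [ -s "$SECRET_FILE" ]; then
CRM_SECRET_KEY="$(cat "$SECRET_FILE")"
else
CRM_SECRET_KEY="$(head -c 48 /dev/urandom | base64 | tr -d '\n' | tr '/+' 'ab')"
# Write under a restrictive umask so the key is never world-readable,
# not even in the window between creation and chmod.
( umask 077; printf '%s' "$CRM_SECRET_KEY" > "$SECRET_FILE" )
chmod 600 "$SECRET_FILE"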
fi
export CRM_SECRET_KEY
fi
# ── Gmail integration env vars ──────────────────────────────────
# The integration is enabled only if the service-account key file is
# actually present on the /data volume. This makes the package
# self-disabling on fresh installs until an operator drops the key in.
if [ -f "$GMAIL_SA_KEY" ]; then
export CRM_GMAIL_INTEGRATION_ENABLED="${CRM_GMAIL_INTEGRATION_ENABLED:-true}"
export CRM_GMAIL_AUTH_METHOD="${CRM_GMAIL_AUTH_METHOD:-dwd}"
export CRM_GMAIL_SA_KEY_PATH="${CRM_GMAIL_SA_KEY_PATH:-$GMAIL_SA_KEY}"
export CRM_GMAIL_WORKSPACE_DOMAIN="${CRM_GMAIL_WORKSPACE_DOMAIN:-ten31.xyz}"
export CRM_GMAIL_SYNC_INTERVAL_MIN="${CRM_GMAIL_SYNC_INTERVAL_MIN:-180}"
echo "[entrypoint] Gmail integration: ENABLED (key at $GMAIL_SA_KEY)"
else
echo "[entrypoint] Gmail integration: DISABLED (no key at $GMAIL_SA_KEY)"
fi
# ── Launch the app ──────────────────────────────────────────────
exec python3 /app/backend/server.py
+9
View File
@@ -0,0 +1,9 @@
#!/bin/sh
# Container-side health probe for the Ten31 Database service.
# The StartOS 0.4 daemon uses checkPortListening at the platform
# level, but this script is kept for parity with the 0.3.5 wrapper
# and so the same image can be exec'd directly for diagnostics.
set -eu
PORT="${CRM_PORT:-8080}"
curl -fsS "http://127.0.0.1:${PORT}/api/health" >/dev/null
+43
View File
@@ -0,0 +1,43 @@
<?xml version="1.0" encoding="UTF-8"?>
<svg id="Layer_1" data-name="Layer 1" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 722.69 280.85">
<defs>
<style>
.cls-1 {
font-family: LTCGoudyOldstylePro-Bold, 'LTC Goudy Oldstyle Pro';
font-size: 192px;
font-weight: 700;
}
.cls-1, .cls-2, .cls-3 {
fill: #fff;
}
.cls-2, .cls-4 {
stroke-width: 3px;
}
.cls-2, .cls-4, .cls-3 {
stroke: #fff;
stroke-miterlimit: 10;
}
.cls-4 {
fill: none;
}
.cls-5 {
letter-spacing: -.06em;
}
</style>
</defs>
<text class="cls-1" transform="translate(120.54 208.45)"><tspan class="cls-5" x="0" y="0">T</tspan><tspan x="120.96" y="0">en31</tspan></text>
<g>
<polygon class="cls-3" points="95.52 140.42 54.54 154.4 54.54 126.45 95.52 140.42"/>
<line class="cls-2" x1="0" y1="140.42" x2="60.54" y2="140.42"/>
</g>
<rect class="cls-4" x="97.1" y="1.5" width="527.95" height="277.85"/>
<g>
<polygon class="cls-3" points="721.15 140.42 680.16 154.4 680.16 126.45 721.15 140.42"/>
<line class="cls-2" x1="625.62" y1="140.42" x2="686.16" y2="140.42"/>
</g>
</svg>


+330
View File
@@ -0,0 +1,330 @@
{
"name": "ten-database-startos-040",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "ten-database-startos-040",
"dependencies": {
"@start9labs/start-sdk": "^0.4.0-beta.66"
},
"devDependencies": {
"@types/node": "^22.19.0",
"@vercel/ncc": "^0.38.4",
"prettier": "^3.6.2",
"typescript": "^5.9.3"
}
},
"node_modules/@iarna/toml": {
"version": "3.0.0",
"resolved": "https://registry.npmjs.org/@iarna/toml/-/toml-3.0.0.tgz",
"integrity": "sha512-td6ZUkz2oS3VeleBcN+m//Q6HlCFCPrnI0FZhrt/h4XqLEdOyYp2u21nd8MdsR+WJy5r9PTDaHTDDfhf4H4l6Q==",
"license": "ISC"
},
"node_modules/@noble/curves": {
"version": "1.9.7",
"resolved": "https://registry.npmjs.org/@noble/curves/-/curves-1.9.7.tgz",
"integrity": "sha512-gbKGcRUYIjA3/zCCNaWDciTMFI0dCkvou3TL8Zmy5Nc7sJ47a0jtOeZoTaMxkuqRo9cRhjOdZJXegxYE5FN/xw==",
"license": "MIT",
"dependencies": {
"@noble/hashes": "1.8.0"
},
"engines": {
"node": "^14.21.3 || >=16"
},
"funding": {
"url": "https://paulmillr.com/funding/"
}
},
"node_modules/@noble/hashes": {
"version": "1.8.0",
"resolved": "https://registry.npmjs.org/@noble/hashes/-/hashes-1.8.0.tgz",
"integrity": "sha512-jCs9ldd7NwzpgXDIf6P3+NrHh9/sD6CQdxHyjQI+h/6rDNo88ypBxxz45UDuZHz9r3tNz7N/VInSVoVdtXEI4A==",
"license": "MIT",
"engines": {
"node": "^14.21.3 || >=16"
},
"funding": {
"url": "https://paulmillr.com/funding/"
}
},
"node_modules/@start9labs/start-sdk": {
"version": "0.4.0",
"resolved": "https://registry.npmjs.org/@start9labs/start-sdk/-/start-sdk-0.4.0.tgz",
"integrity": "sha512-PFfO7tV9nzQFZL3KXaZyf16C5VZtM+dCDlRhLHpmwssTKtcjyCEhBrB9locuS2yFqu69rj+5kLFzCWDHeRRibg==",
"license": "MIT",
"dependencies": {
"@iarna/toml": "^3.0.0",
"@noble/curves": "^1.8.2",
"@noble/hashes": "^1.7.2",
"@types/ini": "^4.1.1",
"deep-equality-data-structures": "^2.0.0",
"fast-xml-parser": "^5.5.6",
"ini": "^5.0.0",
"isomorphic-fetch": "^3.0.0",
"mime": "^4.0.7",
"yaml": "^2.7.1",
"zod": "^4.3.6",
"zod-deep-partial": "^1.2.0"
}
},
"node_modules/@types/ini": {
"version": "4.1.1",
"resolved": "https://registry.npmjs.org/@types/ini/-/ini-4.1.1.tgz",
"integrity": "sha512-MIyNUZipBTbyUNnhvuXJTY7B6qNI78meck9Jbv3wk0OgNwRyOOVEKDutAkOs1snB/tx0FafyR6/SN4Ps0hZPeg==",
"license": "MIT"
},
"node_modules/@types/node": {
"version": "22.19.17",
"resolved": "https://registry.npmjs.org/@types/node/-/node-22.19.17.tgz",
"integrity": "sha512-wGdMcf+vPYM6jikpS/qhg6WiqSV/OhG+jeeHT/KlVqxYfD40iYJf9/AE1uQxVWFvU7MipKRkRv8NSHiCGgPr8Q==",
"dev": true,
"license": "MIT",
"dependencies": {
"undici-types": "~6.21.0"
}
},
"node_modules/@vercel/ncc": {
"version": "0.38.4",
"resolved": "https://registry.npmjs.org/@vercel/ncc/-/ncc-0.38.4.tgz",
"integrity": "sha512-8LwjnlP39s08C08J5NstzriPvW1SP8Zfpp1BvC2sI35kPeZnHfxVkCwu4/+Wodgnd60UtT1n8K8zw+Mp7J9JmQ==",
"dev": true,
"license": "MIT",
"bin": {
"ncc": "dist/ncc/cli.js"
}
},
"node_modules/deep-equality-data-structures": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/deep-equality-data-structures/-/deep-equality-data-structures-2.0.0.tgz",
"integrity": "sha512-qgrUr7MKXq7VRN+WUpQ48QlXVGL0KdibAoTX8KRg18lgOgqbEKMAW1WZsVCtakY4+XX42pbAJzTz/DlXEFM2Fg==",
"license": "MIT",
"dependencies": {
"object-hash": "^3.0.0"
}
},
"node_modules/fast-xml-builder": {
"version": "1.1.4",
"resolved": "https://registry.npmjs.org/fast-xml-builder/-/fast-xml-builder-1.1.4.tgz",
"integrity": "sha512-f2jhpN4Eccy0/Uz9csxh3Nu6q4ErKxf0XIsasomfOihuSUa3/xw6w8dnOtCDgEItQFJG8KyXPzQXzcODDrrbOg==",
"funding": [
{
"type": "github",
"url": "https://github.com/sponsors/NaturalIntelligence"
}
],
"license": "MIT",
"dependencies": {
"path-expression-matcher": "^1.1.3"
}
},
"node_modules/fast-xml-parser": {
"version": "5.5.12",
"resolved": "https://registry.npmjs.org/fast-xml-parser/-/fast-xml-parser-5.5.12.tgz",
"integrity": "sha512-nUR0q8PPfoA/svPM43Gup7vLOZWppaNrYgGmrVqrAVJa7cOH4hMG6FX9M4mQ8dZA1/ObGZHzES7Ed88hxEBSJg==",
"funding": [
{
"type": "github",
"url": "https://github.com/sponsors/NaturalIntelligence"
}
],
"license": "MIT",
"dependencies": {
"fast-xml-builder": "^1.1.4",
"path-expression-matcher": "^1.5.0",
"strnum": "^2.2.3"
},
"bin": {
"fxparser": "src/cli/cli.js"
}
},
"node_modules/ini": {
"version": "5.0.0",
"resolved": "https://registry.npmjs.org/ini/-/ini-5.0.0.tgz",
"integrity": "sha512-+N0ngpO3e7cRUWOJAS7qw0IZIVc6XPrW4MlFBdD066F2L4k1L6ker3hLqSq7iXxU5tgS4WGkIUElWn5vogAEnw==",
"license": "ISC",
"engines": {
"node": "^18.17.0 || >=20.5.0"
}
},
"node_modules/isomorphic-fetch": {
"version": "3.0.0",
"resolved": "https://registry.npmjs.org/isomorphic-fetch/-/isomorphic-fetch-3.0.0.tgz",
"integrity": "sha512-qvUtwJ3j6qwsF3jLxkZ72qCgjMysPzDfeV240JHiGZsANBYd+EEuu35v7dfrJ9Up0Ak07D7GGSkGhCHTqg/5wA==",
"license": "MIT",
"dependencies": {
"node-fetch": "^2.6.1",
"whatwg-fetch": "^3.4.1"
}
},
"node_modules/mime": {
"version": "4.1.0",
"resolved": "https://registry.npmjs.org/mime/-/mime-4.1.0.tgz",
"integrity": "sha512-X5ju04+cAzsojXKes0B/S4tcYtFAJ6tTMuSPBEn9CPGlrWr8Fiw7qYeLT0XyH80HSoAoqWCaz+MWKh22P7G1cw==",
"funding": [
"https://github.com/sponsors/broofa"
],
"license": "MIT",
"bin": {
"mime": "bin/cli.js"
},
"engines": {
"node": ">=16"
}
},
"node_modules/node-fetch": {
"version": "2.7.0",
"resolved": "https://registry.npmjs.org/node-fetch/-/node-fetch-2.7.0.tgz",
"integrity": "sha512-c4FRfUm/dbcWZ7U+1Wq0AwCyFL+3nt2bEw05wfxSz+DWpWsitgmSgYmy2dQdWyKC1694ELPqMs/YzUSNozLt8A==",
"license": "MIT",
"dependencies": {
"whatwg-url": "^5.0.0"
},
"engines": {
"node": "4.x || >=6.0.0"
},
"peerDependencies": {
"encoding": "^0.1.0"
},
"peerDependenciesMeta": {
"encoding": {
"optional": true
}
}
},
"node_modules/object-hash": {
"version": "3.0.0",
"resolved": "https://registry.npmjs.org/object-hash/-/object-hash-3.0.0.tgz",
"integrity": "sha512-RSn9F68PjH9HqtltsSnqYC1XXoWe9Bju5+213R98cNGttag9q9yAOTzdbsqvIa7aNm5WffBZFpWYr2aWrklWAw==",
"license": "MIT",
"engines": {
"node": ">= 6"
}
},
"node_modules/path-expression-matcher": {
"version": "1.5.0",
"resolved": "https://registry.npmjs.org/path-expression-matcher/-/path-expression-matcher-1.5.0.tgz",
"integrity": "sha512-cbrerZV+6rvdQrrD+iGMcZFEiiSrbv9Tfdkvnusy6y0x0GKBXREFg/Y65GhIfm0tnLntThhzCnfKwp1WRjeCyQ==",
"funding": [
{
"type": "github",
"url": "https://github.com/sponsors/NaturalIntelligence"
}
],
"license": "MIT",
"engines": {
"node": ">=14.0.0"
}
},
"node_modules/prettier": {
"version": "3.8.2",
"resolved": "https://registry.npmjs.org/prettier/-/prettier-3.8.2.tgz",
"integrity": "sha512-8c3mgTe0ASwWAJK+78dpviD+A8EqhndQPUBpNUIPt6+xWlIigCwfN01lWr9MAede4uqXGTEKeQWTvzb3vjia0Q==",
"dev": true,
"license": "MIT",
"bin": {
"prettier": "bin/prettier.cjs"
},
"engines": {
"node": ">=14"
},
"funding": {
"url": "https://github.com/prettier/prettier?sponsor=1"
}
},
"node_modules/strnum": {
"version": "2.2.3",
"resolved": "https://registry.npmjs.org/strnum/-/strnum-2.2.3.tgz",
"integrity": "sha512-oKx6RUCuHfT3oyVjtnrmn19H1SiCqgJSg+54XqURKp5aCMbrXrhLjRN9TjuwMjiYstZ0MzDrHqkGZ5dFTKd+zg==",
"funding": [
{
"type": "github",
"url": "https://github.com/sponsors/NaturalIntelligence"
}
],
"license": "MIT"
},
"node_modules/tr46": {
"version": "0.0.3",
"resolved": "https://registry.npmjs.org/tr46/-/tr46-0.0.3.tgz",
"integrity": "sha512-N3WMsuqV66lT30CrXNbEjx4GEwlow3v6rr4mCcv6prnfwhS01rkgyFdjPNBYd9br7LpXV1+Emh01fHnq2Gdgrw==",
"license": "MIT"
},
"node_modules/typescript": {
"version": "5.9.3",
"resolved": "https://registry.npmjs.org/typescript/-/typescript-5.9.3.tgz",
"integrity": "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw==",
"dev": true,
"license": "Apache-2.0",
"bin": {
"tsc": "bin/tsc",
"tsserver": "bin/tsserver"
},
"engines": {
"node": ">=14.17"
}
},
"node_modules/undici-types": {
"version": "6.21.0",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-6.21.0.tgz",
"integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ==",
"dev": true,
"license": "MIT"
},
"node_modules/webidl-conversions": {
"version": "3.0.1",
"resolved": "https://registry.npmjs.org/webidl-conversions/-/webidl-conversions-3.0.1.tgz",
"integrity": "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ==",
"license": "BSD-2-Clause"
},
"node_modules/whatwg-fetch": {
"version": "3.6.20",
"resolved": "https://registry.npmjs.org/whatwg-fetch/-/whatwg-fetch-3.6.20.tgz",
"integrity": "sha512-EqhiFU6daOA8kpjOWTL0olhVOF3i7OrFzSYiGsEMB8GcXS+RrzauAERX65xMeNWVqxA6HXH2m69Z9LaKKdisfg==",
"license": "MIT"
},
"node_modules/whatwg-url": {
"version": "5.0.0",
"resolved": "https://registry.npmjs.org/whatwg-url/-/whatwg-url-5.0.0.tgz",
"integrity": "sha512-saE57nupxk6v3HY35+jzBwYa0rKSy0XR8JSxZPwgLr7ys0IBzhGviA1/TUGJLmSVqs8pb9AnvICXEuOHLprYTw==",
"license": "MIT",
"dependencies": {
"tr46": "~0.0.3",
"webidl-conversions": "^3.0.0"
}
},
"node_modules/yaml": {
"version": "2.8.3",
"resolved": "https://registry.npmjs.org/yaml/-/yaml-2.8.3.tgz",
"integrity": "sha512-AvbaCLOO2Otw/lW5bmh9d/WEdcDFdQp2Z2ZUH3pX9U2ihyUY0nvLv7J6TrWowklRGPYbB/IuIMfYgxaCPg5Bpg==",
"license": "ISC",
"bin": {
"yaml": "bin.mjs"
},
"engines": {
"node": ">= 14.6"
},
"funding": {
"url": "https://github.com/sponsors/eemeli"
}
},
"node_modules/zod": {
"version": "4.3.6",
"resolved": "https://registry.npmjs.org/zod/-/zod-4.3.6.tgz",
"integrity": "sha512-rftlrkhHZOcjDwkGlnUtZZkvaPHCsDATp4pGpuOOMDaTdDDXF91wuVDJoWoPsKX/3YPQ5fHuF3STjcYyKr+Qhg==",
"license": "MIT",
"funding": {
"url": "https://github.com/sponsors/colinhacks"
}
},
"node_modules/zod-deep-partial": {
"version": "1.4.4",
"resolved": "https://registry.npmjs.org/zod-deep-partial/-/zod-deep-partial-1.4.4.tgz",
"integrity": "sha512-aWkPl7hVStgE01WzbbSxCgX4O+sSpgt8JOjvFUtMTF75VgL6MhWQbiZi+AWGN85SfSTtI9gsOtL1vInoqfDVaA==",
"license": "MIT",
"peerDependencies": {
"zod": "^4.1.13"
}
}
}
}
+23
View File
@@ -0,0 +1,23 @@
{
"name": "ten-database-startos-040",
"scripts": {
"build": "rm -rf ./javascript && ncc build startos/index.ts -o ./javascript",
"prettier": "prettier --write startos",
"check": "tsc --noEmit"
},
"dependencies": {
"@start9labs/start-sdk": "^0.4.0-beta.66"
},
"devDependencies": {
"@types/node": "^22.19.0",
"@vercel/ncc": "^0.38.4",
"prettier": "^3.6.2",
"typescript": "^5.9.3"
},
"prettier": {
"trailingComma": "all",
"tabWidth": 2,
"semi": false,
"singleQuote": true
}
}
+116
View File
@@ -0,0 +1,116 @@
#!/bin/bash
# ═══════════════════════════════════════════════════════════════
# refresh_seed.sh
# Pull the live Ten31 Database data off a StartOS 0.3.5 host
# and stage it as the seed snapshot baked into the 0.4 image.
# ═══════════════════════════════════════════════════════════════
#
# Usage:
# ./refresh_seed.sh <ssh-user@host> [remote-data-dir]
#
# Examples:
# ./refresh_seed.sh start9@192.168.1.50
# ./refresh_seed.sh embassy@embassy.local \
# /embassy-data/package-data/volumes/ten-database/data/main
#
# What it does:
# 1. Finds the remote /data directory for the ten-database service.
# 2. Copies crm.db, backups/, and (optionally) .crm-secret into
# start9/0.4/seed/data/ on this machine.
# 3. Prints a row-count summary so you can verify content.
#
# After it finishes, run:
# make clean && make x86
# from this (start9/0.4/) directory to rebuild the .s9pk.
# ═══════════════════════════════════════════════════════════════
set -eu
if [ $# -lt 1 ]; then
echo "Usage: $0 <ssh-user@host> [remote-data-dir]"
echo ""
echo "Remote data dir defaults (tried in order):"
echo " /embassy-data/package-data/volumes/ten-database/data/main"
echo " /mnt/embassy-os/package-data/volumes/ten-database/data/main"
echo " /var/lib/embassy/services/ten-database/data"
exit 1
fi
REMOTE="$1"
REMOTE_DIR="${2:-}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
SEED_DIR="$SCRIPT_DIR/seed/data"
echo ""
echo " Staging production seed from $REMOTE"
echo " into $SEED_DIR"
echo ""
# Auto-detect remote data dir if not supplied
if [ -z "$REMOTE_DIR" ]; then
echo " Probing for remote data directory..."
for candidate in \
"/embassy-data/package-data/volumes/ten-database/data/main" \
"/mnt/embassy-os/package-data/volumes/ten-database/data/main" \
"/var/lib/embassy/services/ten-database/data"; do
if ssh "$REMOTE" "[ -f \"$candidate/crm.db\" ]" 2>/dev/null; then
REMOTE_DIR="$candidate"
echo " found: $REMOTE_DIR"
break
fi
done
if [ -z "$REMOTE_DIR" ]; then
echo " Could not auto-detect a valid data directory with crm.db on $REMOTE."
echo " Re-run this script and pass the path explicitly as the 2nd argument."
exit 2
fi
fi
mkdir -p "$SEED_DIR/backups"
echo ""
echo " Copying crm.db ..."
scp "$REMOTE:$REMOTE_DIR/crm.db" "$SEED_DIR/crm.db"
echo " Copying backups/ (if present) ..."
if ssh "$REMOTE" "[ -d \"$REMOTE_DIR/backups\" ]" 2>/dev/null; then
scp -r "$REMOTE:$REMOTE_DIR/backups/." "$SEED_DIR/backups/" || true
else
echo " (none found, skipping)"
fi
echo " Copying .crm-secret (optional — keeps existing JWTs valid) ..."
if ssh "$REMOTE" "[ -f \"$REMOTE_DIR/.crm-secret\" ]" 2>/dev/null; then
read -r -p " Include .crm-secret in the baked image? [y/N] " ans
case "$ans" in
[yY]*) scp "$REMOTE:$REMOTE_DIR/.crm-secret" "$SEED_DIR/.crm-secret" ;;
*) echo " skipping .crm-secret; a fresh secret will be generated on first boot" ;;
esac
else
echo " (no .crm-secret on remote)"
fi
echo ""
echo " Summary of staged seed:"
ls -la "$SEED_DIR"
echo ""
if command -v python3 >/dev/null 2>&1 && [ -f "$SEED_DIR/crm.db" ]; then
python3 - "$SEED_DIR/crm.db" <<'PY'
import sqlite3
import sys

# The database path is passed as an argument (and the heredoc is quoted) so
# that quotes or spaces in the path cannot break the embedded Python source.
db = sqlite3.connect(sys.argv[1])
cur = db.cursor()
cur.execute("PRAGMA integrity_check")
print("  integrity_check:", cur.fetchone()[0])
for t in ("users", "fundraising_state", "fundraising_funds", "fundraising_views",
          "contacts", "organizations", "audit_log", "feature_requests", "app_settings"):
    try:
        cur.execute(f"SELECT COUNT(*) FROM {t}")
        print(f"  {t:30s} {cur.fetchone()[0]} rows")
    except Exception as e:
        print(f"  {t}: n/a ({e})")
db.close()
PY
fi
echo ""
echo " Seed refreshed. Next: cd $(dirname "$SCRIPT_DIR")/0.4 && make clean && make x86"
+130
View File
@@ -0,0 +1,130 @@
# ** Plumbing. DO NOT EDIT **.
# This file is imported by ./Makefile. Make edits there.
PACKAGE_ID := $(shell awk -F"'" '/id:/ {print $$2}' startos/manifest/index.ts)
INGREDIENTS := $(shell start-cli s9pk list-ingredients 2>/dev/null)
REPO_GIT_DIR := ../../.git
ARCHES ?= x86 arm riscv
TARGETS ?= arches
ifdef VARIANT
BASE_NAME := $(PACKAGE_ID)_$(VARIANT)
else
BASE_NAME := $(PACKAGE_ID)
endif
.PHONY: all arches aarch64 x86_64 riscv64 arm arm64 x86 riscv arch/* clean install check-deps check-init package ingredients
.DELETE_ON_ERROR:
.SECONDARY:
define SUMMARY
@manifest=$$(start-cli s9pk inspect $(1) manifest); \
size=$$(du -h $(1) | awk '{print $$1}'); \
title=$$(printf '%s' "$$manifest" | jq -r .title); \
version=$$(printf '%s' "$$manifest" | jq -r .version); \
arches=$$(printf '%s' "$$manifest" | jq -r '[.images[].arch // []] | flatten | unique | join(", ")'); \
sdkv=$$(printf '%s' "$$manifest" | jq -r .sdkVersion); \
gitHash=$$(printf '%s' "$$manifest" | jq -r .gitHash | sed -E 's/(.*-modified)$$/\x1b[0;31m\1\x1b[0m/'); \
printf "\n"; \
printf "\033[1;32m✅ Build Complete!\033[0m\n"; \
printf "\n"; \
printf "\033[1;37m $$title\033[0m \033[36mv$$version\033[0m\n"; \
printf "───────────────────────────────\n"; \
printf " \033[1;36mFilename:\033[0m %s\n" "$(1)"; \
printf " \033[1;36mSize:\033[0m %s\n" "$$size"; \
printf " \033[1;36mArch:\033[0m %s\n" "$$arches"; \
printf " \033[1;36mSDK:\033[0m %s\n" "$$sdkv"; \
printf " \033[1;36mGit:\033[0m %s\n" "$$gitHash"; \
echo ""
endef
all: $(TARGETS)
arches: $(ARCHES)
universal: $(BASE_NAME).s9pk
$(call SUMMARY,$<)
arch/%: $(BASE_NAME)_%.s9pk
$(call SUMMARY,$<)
x86 x86_64: arch/x86_64
arm arm64 aarch64: arch/aarch64
riscv riscv64: arch/riscv64
$(BASE_NAME).s9pk: $(INGREDIENTS) $(REPO_GIT_DIR)/HEAD $(REPO_GIT_DIR)/index
@$(MAKE) --no-print-directory ingredients
@echo " Packing '$@'..."
start-cli s9pk pack -o $@
$(BASE_NAME)_%.s9pk: $(INGREDIENTS) $(REPO_GIT_DIR)/HEAD $(REPO_GIT_DIR)/index
@$(MAKE) --no-print-directory ingredients
@echo " Packing '$@'..."
start-cli s9pk pack --arch=$* -o $@
ingredients: $(INGREDIENTS)
@echo " Re-evaluating ingredients..."
install: | check-deps check-init
@HOST=$$(awk -F'/' '/^host:/ {print $$3}' ~/.startos/config.yaml); \
if [ -z "$$HOST" ]; then \
echo "Error: You must define \"host: http://server-name.local\" in ~/.startos/config.yaml"; \
exit 1; \
fi; \
S9PK=$$(ls -t *.s9pk 2>/dev/null | head -1); \
if [ -z "$$S9PK" ]; then \
echo "Error: No .s9pk file found. Run 'make' first."; \
exit 1; \
fi; \
printf "\n Installing %s to %s ...\n" "$$S9PK" "$$HOST"; \
start-cli package install -s "$$S9PK"
publish: | all
@REGISTRY=$$(awk -F'/' '/^registry:/ {print $$3}' ~/.startos/config.yaml); \
if [ -z "$$REGISTRY" ]; then \
echo "Error: You must define \"registry: https://my-registry.tld\" in ~/.startos/config.yaml"; \
exit 1; \
fi; \
S3BASE=$$(awk -F'/' '/^s9pk-s3base:/ {print $$3}' ~/.startos/config.yaml); \
if [ -z "$$S3BASE" ]; then \
echo "Error: You must define \"s9pk-s3base: https://s9pks.my-s3-bucket.tld\" in ~/.startos/config.yaml"; \
exit 1; \
fi; \
command -v s3cmd >/dev/null || \
(echo "Error: s3cmd not found. It must be installed to publish using s3." && exit 1); \
printf "\n Publishing to %s; indexing on %s ...\n" "$$S3BASE" "$$REGISTRY"; \
for s9pk in *.s9pk; do \
age=$$(( $$(date +%s) - $$(stat -f %m "$$s9pk" 2>/dev/null || stat -c %Y "$$s9pk") )); \
if [ "$$age" -gt 3600 ]; then \
printf "\033[1;33m⚠️ %s is %d minutes old. Publish anyway? [y/N] \033[0m" "$$s9pk" "$$((age / 60))"; \
read -r ans; \
case "$$ans" in [yY]*) ;; *) echo "Skipping $$s9pk"; continue ;; esac; \
fi; \
start-cli s9pk publish "$$s9pk"; \
done
check-deps:
@command -v start-cli >/dev/null || \
(echo "Error: start-cli not found. Please see https://docs.start9.com/latest/developer-guide/sdk/installing-the-sdk" && exit 1)
@command -v npm >/dev/null || \
(echo "Error: npm not found. Please install Node.js and npm." && exit 1)
check-init:
@if [ ! -f ~/.startos/developer.key.pem ]; then \
echo "Initializing StartOS developer environment..."; \
start-cli init-key; \
fi
javascript/index.js: $(shell find startos -type f) tsconfig.json node_modules
npm run build
node_modules: package-lock.json
npm ci
package-lock.json: package.json
npm i
clean:
@echo "Cleaning up build artifacts..."
@rm -rf $(PACKAGE_ID).s9pk $(PACKAGE_ID)_x86_64.s9pk $(PACKAGE_ID)_aarch64.s9pk $(PACKAGE_ID)_riscv64.s9pk javascript node_modules
+58
View File
@@ -0,0 +1,58 @@
# Seed Snapshot (baked into the Docker image)
Anything under `seed/data/` is copied into the container image at build time
and placed at `/app/seed/data/`. On first boot, if `/data/crm.db` is not
present on the StartOS service volume, `docker_entrypoint.sh` copies
`/app/seed/data/.` into `/data/.` so the new 0.4 install starts with the
preserved data instead of an empty database.
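
As a rough illustration of the first-boot behaviour described above, the seed
step might look like the following sketch. The function name `seed_if_empty`
and its arguments are hypothetical; the real logic lives in
`docker_entrypoint.sh` and may differ in detail.

```sh
#!/bin/sh
# Hypothetical sketch of the first-boot seed step; not the actual entrypoint code.
set -eu

# seed_if_empty SEED_DIR DATA_DIR
# Copies the baked-in snapshot into the data volume only when no crm.db exists.
seed_if_empty() {
  seed_dir="$1"
  data_dir="$2"
  if [ -f "$data_dir/crm.db" ]; then
    echo "[seed] existing crm.db found in $data_dir; skipping seed"
    return 0
  fi
  if [ ! -d "$seed_dir" ]; then
    echo "[seed] no seed directory at $seed_dir; starting empty"
    return 0
  fi
  mkdir -p "$data_dir"
  # The trailing /. copies the directory contents, including dotfiles.
  cp -R "$seed_dir/." "$data_dir/"
  echo "[seed] seeded $data_dir from $seed_dir"
}
```

With the paths named in this README, the call would be
`seed_if_empty /app/seed/data /data`, made before the backend starts.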
## What's currently baked in
Initial snapshot was taken from the repo-level `data/` directory at build
time (the same DB that the 0.3.5 dev workflow pointed at). Files:
- `seed/data/crm.db` — SQLite database (investors, contacts, fundraising
rows, views, feature_requests, users, app_settings, etc.)
- `seed/data/backups/*.json` — app-level snapshot exports
## Refreshing the seed before a build
If you want the 0.4 deploy to come up with the absolute latest production
state from the 0.3.5 StartOS server, replace the files in `seed/data/`
BEFORE running `make`:
```sh
# 1) On the 0.3.5 StartOS server, take a fresh app backup and/or grab
# the live database file:
# /media/embassy/services/ten-database/data/crm.db (canonical)
# /media/embassy/services/ten-database/data/backups/*.json (optional)
# /media/embassy/services/ten-database/data/.crm-secret (optional)
# Exact path may differ by StartOS 0.3.5 build.
#
# 2) scp them into this folder:
scp embassy@<old-host>:/media/.../ten-database/data/crm.db \
start9/0.4/seed/data/crm.db
# (Optional) include backups + secret:
scp embassy@<old-host>:/media/.../ten-database/data/backups/* \
start9/0.4/seed/data/backups/
scp embassy@<old-host>:/media/.../ten-database/data/.crm-secret \
start9/0.4/seed/data/.crm-secret
#
# 3) Rebuild:
cd start9/0.4 && make clean && make x86
```
## Keeping `.crm-secret` out of the image
By default `seed/data/.crm-secret` is NOT included. The first boot on the
new machine generates a fresh JWT secret. Existing password hashes in
`crm.db` remain valid, so users just log in once on the new host.
If you WANT to preserve the exact secret (so already-issued JWTs remain
valid), drop the file at `seed/data/.crm-secret` and rebuild.
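
For reference, the first-boot secret generation can be sketched as below. It is
modelled on the persistent JWT secret block in `docker_entrypoint.sh`; the
function name `ensure_secret` is an assumption made for illustration, and the
`umask` subshell is a precaution added in this sketch.

```sh
#!/bin/sh
# Hypothetical sketch: reuse the secret file if present, otherwise create it.
set -eu

# ensure_secret SECRET_FILE
# Prints the secret, generating and persisting a new one if the file is absent.
ensure_secret() {
  secret_file="$1"
  if [ ! -f "$secret_file" ]; then
    # Restrictive umask in a subshell so the file is created owner-only.
    (
      umask 077
      head -c 48 /dev/urandom | base64 | tr -d '\n' | tr '/+' 'ab' > "$secret_file"
    )
  fi
  cat "$secret_file"
}
```

If `seed/data/.crm-secret` was baked in and copied to `/data/.crm-secret`, the
first branch is skipped and the existing secret is reused, which is why
already-issued JWTs stay valid in that case.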
## Safety
The entrypoint never overwrites an existing `/data/crm.db`. If the volume
already contains data (StartOS restore, manual SSH pre-seed, prior install)
the seed is skipped and a `.seeded` marker is written.
+3
View File
@@ -0,0 +1,3 @@
import { sdk } from '../sdk'
export const actions = sdk.Actions.of()
+7
View File
@@ -0,0 +1,7 @@
import { sdk } from './sdk'
export const { createBackup, restoreInit } = sdk.setupBackups(async () =>
// Preserve the entire service volume so crm.db, backup JSON files, and the
// persisted JWT secret all remain compatible with the prior package layout.
sdk.Backups.ofVolumes('main'),
)
+5
View File
@@ -0,0 +1,5 @@
import { sdk } from './sdk'
export const setDependencies = sdk.setupDependencies(async () => {
return {}
})
+1
View File
@@ -0,0 +1 @@
export const i18n = (text: string) => text
+13
View File
@@ -0,0 +1,13 @@
/**
* Plumbing. DO NOT EDIT.
*/
export { createBackup } from './backups'
export { main } from './main'
export { init, uninit } from './init'
export { actions } from './actions'
import { buildManifest } from '@start9labs/start-sdk'
import { manifest as sdkManifest } from './manifest'
import { versionGraph } from './versions'
export const manifest = buildManifest(versionGraph, sdkManifest)
+16
View File
@@ -0,0 +1,16 @@
import { sdk } from '../sdk'
import { setDependencies } from '../dependencies'
import { setInterfaces } from '../interfaces'
import { versionGraph } from '../versions'
import { actions } from '../actions'
import { restoreInit } from '../backups'
export const init = sdk.setupInit(
restoreInit,
versionGraph,
setInterfaces,
setDependencies,
actions,
)
export const uninit = sdk.setupUninit(versionGraph)
+25
View File
@@ -0,0 +1,25 @@
import { i18n } from './i18n'
import { sdk } from './sdk'
import { WEB_PORT } from './utils'
export const setInterfaces = sdk.setupInterfaces(async ({ effects }) => {
const uiMulti = sdk.MultiHost.of(effects, 'ui-multi')
const uiMultiOrigin = await uiMulti.bindPort(WEB_PORT, {
protocol: 'http',
})
const ui = sdk.createInterface(effects, {
name: i18n('Web UI'),
id: 'ui',
description: i18n('The web interface of Ten31 Database'),
type: 'ui',
masked: false,
schemeOverride: null,
username: null,
path: '',
query: {},
})
const uiReceipt = await uiMultiOrigin.export([ui])
return [uiReceipt]
})
+33
View File
@@ -0,0 +1,33 @@
import { i18n } from './i18n'
import { sdk } from './sdk'
import { DATA_MOUNT_PATH, IMAGE_ID, WEB_PORT } from './utils'
export const main = sdk.setupMain(async ({ effects }) => {
console.info(i18n('Starting Ten31 Database'))
return sdk.Daemons.of(effects).addDaemon('primary', {
subcontainer: await sdk.SubContainer.of(
effects,
{ imageId: IMAGE_ID },
sdk.Mounts.of().mountVolume({
volumeId: 'main',
subpath: null,
mountpoint: DATA_MOUNT_PATH,
readonly: false,
}),
'ten31-database-main',
),
exec: {
command: ['/usr/local/bin/docker_entrypoint.sh'],
},
ready: {
display: i18n('Web Interface'),
fn: () =>
sdk.healthCheck.checkPortListening(effects, WEB_PORT, {
successMessage: i18n('CRM API is responding.'),
errorMessage: i18n('CRM API is not responding.'),
}),
},
requires: [],
})
})
+13
View File
@@ -0,0 +1,13 @@
export const short = {
en_US: 'Self-hosted investor and fundraising database for Ten31.',
}
export const long = {
en_US:
'Ten31 Database is an Airtable-style investor CRM with fundraising grid, communications logging, views, backups, and CSV import. This StartOS 0.4 wrapper preserves the existing /data layout for upgrade-safe persistence.',
}
export const alertUpdate = {
en_US:
'This 0.4 package is designed to keep using the existing /data/crm.db, /data/backups, and /data/.crm-secret layout from the 0.3.5.1 package.',
}
+35
View File
@@ -0,0 +1,35 @@
import { setupManifest } from '@start9labs/start-sdk'
import { alertUpdate, long, short } from './i18n'
export const manifest = setupManifest({
id: 'ten-database',
title: 'Ten31 Database',
license: 'MIT',
packageRepo: 'https://github.com/ten31/ten31-database-startos',
upstreamRepo: 'https://github.com/ten31/ten31-database',
marketingUrl: 'https://ten31.vc',
donationUrl: null,
docsUrls: ['https://docs.start9.com/packaging/0.4.0.x/'],
description: { short, long },
volumes: ['main'],
images: {
main: {
source: {
dockerBuild: {
dockerfile: './Dockerfile',
workdir: '../..',
},
},
arch: ['x86_64', 'aarch64'],
},
},
alerts: {
install: null,
update: alertUpdate,
uninstall: null,
restore: null,
start: null,
stop: null,
},
dependencies: {},
})
+9
View File
@@ -0,0 +1,9 @@
import { StartSdk } from '@start9labs/start-sdk'
import { manifest } from './manifest'
/**
* Plumbing. DO NOT EDIT.
*
* The exported `sdk` const is used throughout this package codebase.
*/
export const sdk = StartSdk.of().withManifest(manifest).build(true)
+16
View File
@@ -0,0 +1,16 @@
// Informational constants shared across the startos/ modules.
// The authoritative id, title and version for the package come
// from manifest/index.ts (id, title) and versions/ (version).
export const PACKAGE_ID = 'ten-database'
export const PACKAGE_TITLE = 'Ten31 Database'
// ExVer form of the current 0.4 wrapper release (upstream 0.1.0, wrapper rev 43).
// * 0.3.5 wrapper: 0.1.0.38 (legacy, aarch64)
// * First 0.4: 0.1.0:39 (shipped seed snapshot for migration)
// * Cleanup: 0.1.0:40 (seed removed + multi-threaded server + abuser auto-ban)
// * Auth: 0.1.0:41 (frontend persists auth across refreshes)
// * 0.1.0:42 (see versions/v0.1.0.42.ts)
// * Current: 0.1.0:43 (matches `current` in versions/index.ts)
export const PACKAGE_VERSION = '0.1.0:43'
export const DATA_MOUNT_PATH = '/data'
export const WEB_PORT = 8080
export const IMAGE_ID = 'main'
export const VOLUME_ID = 'main'
+11
View File
@@ -0,0 +1,11 @@
import { VersionGraph } from '@start9labs/start-sdk'
import { v_0_1_0_39 } from './v0.1.0.39'
import { v_0_1_0_40 } from './v0.1.0.40'
import { v_0_1_0_41 } from './v0.1.0.41'
import { v_0_1_0_42 } from './v0.1.0.42'
import { v_0_1_0_43 } from './v0.1.0.43'
export const versionGraph = VersionGraph.of({
current: v_0_1_0_43,
other: [v_0_1_0_39, v_0_1_0_40, v_0_1_0_41, v_0_1_0_42],
})
+38
View File
@@ -0,0 +1,38 @@
import { VersionInfo } from '@start9labs/start-sdk'
// First StartOS 0.4 release of Ten31 Database.
//
// Upgrade context:
// * The 0.3.5 wrapper shipped at 0.1.0.38 (legacy, aarch64 only).
// * This 0.4 wrapper is built for x86_64 and is intended for a
// parallel install on a new StartOS 0.4 host.
// * Data continuity is NOT handled by a StartOS-level in-place
// upgrade (that path does not exist across StartOS majors).
// Instead the container image is pre-seeded with a snapshot of
// /data (crm.db, backups/, optional .crm-secret). On first boot
// docker_entrypoint.sh copies that snapshot into the mounted
// `main` volume if it is empty.
//
// Because both "up" and "down" paths are inside the same wrapper
// lineage (and the first 0.4 release has no earlier 0.4 version
// to migrate from), the migration functions are intentionally
// no-ops. Future 0.4.x releases can chain off this node in the
// version graph.
export const v_0_1_0_39 = VersionInfo.of({
version: '0.1.0:39',
releaseNotes: {
en_US: [
'First StartOS 0.4 package for Ten31 Database.',
'Built for x86_64; sideload-only during beta.',
'Container image ships with a baked-in /data snapshot so the',
'service boots with the existing investor and fundraising data,',
'saved views, backups, users, and app settings already in place.',
'No StartOS-level migration is performed from the 0.3.5 package;',
'this package is installed fresh on a 0.4 host.',
].join(' '),
},
migrations: {
up: async () => {},
down: async () => {},
},
})
+57
View File
@@ -0,0 +1,57 @@
import { VersionInfo } from '@start9labs/start-sdk'
// Post-migration cleanup + hardening release.
//
// Context:
// * 0.1.0:39 was the first 0.4 package and shipped a baked-in
// /data snapshot that docker_entrypoint.sh copied into the
// mounted `main` volume on first boot (only if the volume was
// empty). That snapshot did its job and the live host now has
// a populated /data with all real investor + fundraising data.
// * 0.1.0:40 removes the seed snapshot from the image and the
// seeding logic from the entrypoint. The live /data volume is
// the sole source of truth from here on. StartOS preserves the
// volume across sideloads, so this upgrade does not disturb
// any data — it just slims the image and removes a code path
// that should never run again.
// * 0.1.0:40 also hardens the backend HTTP server against the
// vulnerability scanners that find the StartTunnel-exposed
// interface within hours of going live:
// - HTTPServer → ThreadingHTTPServer so one slow request or
// a wave of scanner probes can't block legit users.
// - Per-IP GET rate limit (default 600/min) in addition to
// the existing login/write limits.
// - 404-burst auto-ban: any IP that produces ABUSE_404_THRESHOLD
// 404s within ABUSE_404_WINDOW_SEC (default 15 in 60s) is
// parked on a class-level blacklist for ABUSE_BAN_SEC
// (default 15 minutes). Banned IPs get an instant 429 with
// no DB or filesystem work.
// - All limits stay tunable via env vars
// (CRM_GET_RATE_LIMIT_PER_MIN, CRM_ABUSE_404_THRESHOLD,
// CRM_ABUSE_404_WINDOW_SEC, CRM_ABUSE_BAN_SEC).
//
// No data migration is required: the SQLite schema is unchanged
// and the live DB on /data is left exactly as-is.
export const v_0_1_0_40 = VersionInfo.of({
version: '0.1.0:40',
releaseNotes: {
en_US: [
'Removes the baked-in /data seed snapshot now that the',
'0.3.5 → 0.4 migration is complete. The live /data volume',
'on the StartOS host is the sole source of truth and is',
'preserved across sideloads, so no live data is touched by',
'this upgrade. Image is smaller and the first-boot seeding',
'code path has been removed. Also hardens the backend',
'against vulnerability scanners hitting the public',
'StartTunnel interface: the HTTP server is now multi-threaded',
'so one slow request can no longer block legit users, GET',
'requests are rate-limited per IP, and any IP that bursts',
'too many 404s in a short window is auto-banned for 15',
'minutes with no DB work performed.',
].join(' '),
},
migrations: {
up: async () => {},
down: async () => {},
},
})
+42
View File
@@ -0,0 +1,42 @@
import { VersionInfo } from '@start9labs/start-sdk'
// Frontend convenience release: persist auth across page reloads.
//
// Background: through 0.1.0:40 the auth token + user object were held only
// in React state in memory. Any refresh, tab close, or browser restart
// dropped the token and forced the user back to the login screen. Since
// the JWT is signed with /data/.crm-secret (which already survives sideloads
// and container restarts), the underlying token is still valid for its full
// 24-hour lifetime — we just weren't keeping it anywhere persistent.
//
// 0.1.0:41 stores the JWT and user object in localStorage on login (and
// rehydrates from there on app mount), so refreshes and reopened tabs stay
// signed in until the token expires. The api() helper now also dispatches
// a 'crm:unauthorized' event whenever an authenticated request comes back
// with a 401, and the AuthProvider listens for that event to clear the
// stored auth — so an expired or rejected token immediately bounces the
// user back to the login screen instead of leaving the app in a broken
// "loaded but every request fails" state.
//
// Backend is unchanged: the JWT still carries the user's true role and is
// re-verified on every request, so a tampered localStorage user object
// cannot escalate privileges (the next admin call would just 401/403).
//
// No data migration is required.
export const v_0_1_0_41 = VersionInfo.of({
version: '0.1.0:41',
releaseNotes: {
en_US: [
'Logins now persist across page refreshes and tab closures for',
'the full 24-hour token lifetime. Previously every reload bounced',
'you to the login screen even though the token was still valid.',
'If the server later rejects a stored token (expired, secret key',
'changed, etc.) the app automatically clears it and shows the',
'login screen instead of leaving requests silently failing.',
].join(' '),
},
migrations: {
up: async () => {},
down: async () => {},
},
})
+58
View File
@@ -0,0 +1,58 @@
import { VersionInfo } from '@start9labs/start-sdk'
// Gmail integration — Phase 1.
//
// Background: the CRM previously had no ingestion path for email
// activity. Contacts were logged manually; correspondence history lived
// only in our mailboxes. This release adds a one-way capture pipeline
// that ingests sent and received mail for every Workspace user at
// ten31.xyz, matches messages against existing investor records, and
// records metadata (+ bodies and attachments for matched threads) into
// the CRM database.
//
// Auth model: domain-wide delegation via a Google service account. The
// service-account JSON key is stored on the /data volume at
// /data/secrets/gmail-service-account.json (chmod 600, operator-dropped).
// The integration is self-disabling: if the key file is absent, the
// scheduler doesn't start and /api/email/* routes return 503. No key →
// no behavior change from 0.1.0:41.
//
// When the key IS present, docker_entrypoint.sh auto-enables the
// integration and sets sensible defaults (3-hour sync interval, domain
// ten31.xyz, DWD auth). All defaults can still be overridden via env.
//
// Database: migration 0001 adds eight new tables under the email_
// namespace (emails, email_accounts, email_recipients,
// email_account_messages, email_attachments, email_threads,
// email_investor_links, email_sync_runs). All CREATE TABLE IF NOT EXISTS,
// so the migration is safely idempotent — re-applying is a no-op.
//
// Backend: wholly isolated under backend/email_integration/. Three tiny,
// feature-flag-guarded hooks in server.py (migration call, scheduler
// startup, /api/email/* route dispatch). Removing or disabling the
// integration leaves server behavior identical to 0.1.0:41.
//
// New Python dep: cryptography==42.0.5 (required for RS256 JWT signing
// in DWD bearer token exchange). Now installed in the image.
//
// No data migration code needed — new tables, additive only.
export const v_0_1_0_42 = VersionInfo.of({
version: '0.1.0:42',
releaseNotes: {
en_US: [
'Adds a Gmail capture pipeline. When a Google Workspace',
"service-account key is dropped into the server's /data/secrets",
'folder, the CRM begins pulling sent and received mail for every',
'ten31.xyz user on a 3-hour cycle, matching messages against',
'existing investor records and storing metadata (plus bodies and',
'attachments for matched threads) in the database. With no key',
'present the feature is dormant and this release behaves',
'identically to 0.1.0:41. Eight new email_* tables are added',
'additively; no existing data is touched.',
].join(' '),
},
migrations: {
up: async () => {},
down: async () => {},
},
})
+44
View File
@@ -0,0 +1,44 @@
import { VersionInfo } from '@start9labs/start-sdk'
// Hotfix for 0.1.0:42.
//
// Issue 1 (critical): POST requests to /api/email/* hung indefinitely.
// server.py's do_POST calls get_body() early in dispatch (to support
// /api/auth/login), and that call reads the bytes off the request
// stream. The Gmail integration hook then ran route handlers that
// called get_body() a second time, but the stream was already drained,
// so the second read blocked waiting for bytes that never came. GET
// requests (which don't read a body) were unaffected.
//
// Fix: get_body() now caches the parsed JSON on the handler instance
// on first call. Repeat calls return the cached value. Handler
// instances are per-request in ThreadingHTTPServer, so the cache is
// naturally request-scoped and thread-safe.
//
// Issue 2 (minor): the /api/email/accounts/enroll endpoint required
// both `email_address` and `user_id` in the body, making it painful to
// call for the common single-admin-enrolling-themselves case.
//
// Fix: the endpoint now also accepts `email` as an alias, and if
// user_id isn't supplied it auto-resolves by looking up the email in
// the users table (falling back to the authenticated admin's own id
// if no match).
//
// No schema changes, no data migration.
export const v_0_1_0_43 = VersionInfo.of({
version: '0.1.0:43',
releaseNotes: {
en_US: [
'Hotfix for the Gmail integration in 0.1.0:42. POST requests to',
'/api/email/* endpoints were hanging because the request body was',
'being read twice from a single-shot stream. This release caches',
'the parsed body on the request so subsequent reads are safe, and',
'also relaxes the enroll endpoint to accept just an email and',
'auto-resolve the CRM user.',
].join(' '),
},
migrations: {
up: async () => {},
down: async () => {},
},
})
+11
View File
@@ -0,0 +1,11 @@
{
"include": ["startos/**/*.ts", "node_modules/**/startos"],
"compilerOptions": {
"target": "ES2018",
"module": "CommonJS",
"moduleResolution": "node",
"esModuleInterop": true,
"strict": true,
"skipLibCheck": true
}
}
+115
View File
@@ -0,0 +1,115 @@
# Start9 Packaging Checklist (0.3.5 style)
This checklist is written for the StartOS 0.3.5 packaging flow used in this repo.
Use it as an indicative template for other projects, not a literal one-size-fits-all script.
## 1) Required packaging scaffold (inside `start9/<version>/`)
- `manifest.yaml`
- `Makefile`
- `Dockerfile`
- `docker_entrypoint.sh`
- `healthcheck.sh`
- `instructions.md`
- `icon.png` (and/or `icon.svg` if desired)
Optional:
- `scripts/` for package-specific helper scripts
- prebuilt artifacts (`image.tar`, `.s9pk`) generated by packaging
## 2) Project-specific values to change
In `manifest.yaml`:
- `id`
- `title`
- `version`
- `description`
- `upstream-repo`, `support-site`, `marketing-site`
- `interfaces` (port, protocol, TLS, UI flags)
- `config` (runtime env/config options)
- `backup` mounts/commands
- `actions` (if you expose maintenance actions)
In `Makefile`:
- package id/version variables
- image name/tag
- paths/targets used by `make ... package`
In `Dockerfile`:
- base image
- runtime dependencies
- app copy paths
- entrypoint/cmd
In scripts:
- read config/env from StartOS mount/env conventions
- write data only to mounted persistent directories
## 3) What must exist outside `start9/`
The wrapper is not fully standalone. It builds an image from your app source.
For this CRM package specifically:
- `backend/server.py`
- `frontend/` (all static assets/UI)
These are copied in Docker build steps. In other projects, these paths, filenames, and build inputs can be different.
This document is meant to show the pattern; each project must map to its own app layout.
## 4) Data + persistence checklist
- Persist the DB and any files under the mounted data path, not the container's ephemeral filesystem.
- Confirm backup/restore mounts and commands match the manifest volume names exactly.
- Verify restore can start app cleanly and preserve schema/data.
## 5) Network/interface checklist
- Confirm service listens on the internal container port expected by `manifest.yaml`.
- Confirm LAN interface protocol settings match actual service behavior (HTTP vs HTTPS/TCP).
- Confirm UI launches from StartOS Interfaces page without cert/protocol mismatch.
## 6) Build + install flow
1. Bump version in:
- `start9/<version>/manifest.yaml`
- `start9/<version>/Makefile`
2. Build package:
- `make -C start9/<version> package`
3. Install resulting `.s9pk` in StartOS.
4. Start service and check:
- health/logs
- UI launch
- persistence after restart
- backup/restore smoke test
## 7) Reusable vs non-reusable parts
Reusable:
- overall folder structure and file roles in `start9/<version>/`
- packaging workflow (`manifest` + `Makefile` + `Dockerfile` + scripts)
Non-reusable without edits:
- app copy paths in Dockerfile
- app-specific env/config keys
- ports/interfaces/protocol values
- backup/restore commands tied to app data layout
## 8) Planned migration path to StartOS 0.4
When 0.4 is ready for your deployment, use this approach:
1. Keep 0.3.5 package stable as the production branch.
2. Create a parallel package folder for 0.4 (for example `start9/0.4/`).
3. Port wrapper files (`manifest`, `Makefile`, Docker packaging scripts) to the 0.4 schema/tooling.
4. Update interface/config/backup definitions to 0.4 expectations.
5. Build and install 0.4 package in a test server first.
6. Restore a real backup into 0.4 and validate:
- app starts
- UI works
- data integrity is preserved
- backup/restore still works
7. Only after successful validation, promote 0.4 package for primary use.
Notes:
- Keep database path and backup format stable where possible to make migration low-risk.
- If schema changes are required, add explicit migration steps and rollback steps before production cutover.
-258
View File
@@ -1,258 +0,0 @@
# Venture Fund CRM — Project Context for Claude
You are continuing development on a self-hosted CRM system for a venture fund. Below is everything you need to know about what has been built, how it works, and what comes next.
---
## Business Context
- **Fund:** ~$200M AUM, currently fundraising for Fund II
- **Users:** Team of 5 people, accessing via browser on local network or remotely via Tailscale VPN
- **Current LPs:** 150 investors
- **Prospects:** 250+ being tracked
- **Migrating from:** Airtable (CSV exports available)
- **Core goals:**
1. Eliminate sensitive LP/prospect data from third-party servers (Airtable, CRMs)
2. Stop paying monthly subscription costs
3. Purpose-built tool for fundraising workflow: managing existing investors, tracking new prospects, raising capital
- **User:** Grant (grant@ten31.xyz)
---
## What Has Been Built (Sprint 1 — Complete)
A fully functional prototype with backend API, frontend UI, demo data, and utility scripts. Everything runs locally with zero external dependencies beyond two Python packages.
### Tech Stack (Actual — differs from original plan)
The original plan called for FastAPI + SQLAlchemy + separate React build, but the build environment lacked pip/npm access. The stack was adapted to:
- **Backend:** Python 3 stdlib HTTP server + `sqlite3` + `bcrypt` + `PyJWT` — single file, no framework
- **Database:** SQLite with WAL mode (concurrent reads, serialized writes — fine for 5 users)
- **Frontend:** Single self-contained HTML file loading React 18 + Babel from CDN (unpkg)
- **Deployment:** Run `python3 backend/server.py` — serves both API and frontend on port 8080
- **Remote access:** Tailscale mesh VPN (each device gets a private IP, peer-to-peer encrypted)
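
As a rough illustration of the WAL setup mentioned above, the sketch below opens a SQLite file and switches it to write-ahead logging. The helper name `open_crm_db` and the 5-second busy timeout are assumptions made for this example, not taken from `server.py`.

```python
import os
import sqlite3
import tempfile


def open_crm_db(path: str) -> sqlite3.Connection:
    """Open a SQLite database file and switch it to write-ahead logging."""
    # timeout: how long a connection waits on a locked database (assumed value).
    conn = sqlite3.connect(path, timeout=5.0)
    # For file-backed databases the journal mode persists across connections;
    # the PRAGMA returns the mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"could not enable WAL, got {mode!r}")
    return conn


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_crm_db(os.path.join(tmp, "crm.db"))
        print(conn.execute("PRAGMA journal_mode").fetchone()[0])
        conn.close()
```

In WAL mode readers do not block the writer and the writer does not block readers, which is why a single serialized writer is acceptable for a small team.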
### Project Structure
```
venture-crm/
├── backend/
│ ├── server.py # Complete API server (1,873 lines)
│ └── requirements.txt # bcrypt, PyJWT (for reference)
├── frontend/
│ └── index.html # Complete React SPA (2,982 lines)
├── data/
│ └── crm.db # SQLite database (created on first run)
├── scripts/
│ ├── create_user.py # CLI tool to add users
│ ├── reset_password.py # CLI tool to reset passwords
│ └── backup.sh # Database backup with 30-day retention
└── start.sh # Launch script
```
### Database Schema
All tables use TEXT primary keys (8-character IDs taken from the start of a UUID4). The database is at `data/crm.db`.
**Tables:**
- `users` — id, username, email, password_hash, full_name, role (admin/manager/member), is_active
- `contacts` — id, first_name, last_name, email, phone, mobile, title, organization_id (FK), contact_type (investor/prospect/advisor/other), status, source, tags (JSON), notes, linkedin_url, preferred_contact, created_by (FK)
- `organizations` — id, name, type, industry, website, phone, email, address, city, state, country, description, tags (JSON), created_by (FK)
- `opportunities` — id, name, contact_id (FK), organization_id (FK), stage (lead/outreach/meeting/due_diligence/committed/funded), commitment_amount, expected_amount, probability, expected_close_date, fund_name, description, next_step, owner_id (FK), priority (low/medium/high), lost_reason
- `communications` — id, contact_id (FK), opportunity_id (FK), type (email/call/meeting/note/text), subject, body, communication_date, duration_minutes, outcome, next_action, next_action_date, attendees (JSON), created_by (FK)
- `lp_profiles` — id, contact_id (FK, unique), commitment_amount, funded_amount, commitment_date, fund_name, investor_type, accredited, legal_docs_signed, signed_date, wire_received, wire_date, k1_sent, preferred_communication, notes
- `custom_fields` — id, name, entity_type, field_type, options (JSON), required, display_order
- `custom_field_values` — id, custom_field_id (FK), entity_id, entity_type, value
- `audit_log` — id, user_id (FK), entity_type, entity_id, action, changes (JSON), created_at
- `tags` — id, name (unique), color
**Key indexes:** contacts(contact_type, status, organization_id), opportunities(stage, owner_id, contact_id), communications(contact_id, communication_date), audit_log(entity_type, entity_id), lp_profiles(contact_id)
### API Endpoints
All endpoints except auth require an `Authorization: Bearer <jwt_token>` header. The server binds to `0.0.0.0:8080`, so it is reachable at `http://localhost:8080` on the host or at the host's IP from other devices.
**Auth:**
- `POST /api/auth/login` — body: {username, password} → {token, user}
- `POST /api/auth/register` — body: {username, password, email, full_name} → {token, user}
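
A minimal client-side sketch of the login and bearer-token flow, using only the standard library. `BASE_URL`, the helper names, and the `Content-Type` header on login are assumptions for illustration; the request and response shapes follow the endpoint list above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: server running locally


def login_request(username: str, password: str) -> urllib.request.Request:
    """Build the POST /api/auth/login request (body: {username, password})."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed; not confirmed by the server code
        method="POST",
    )


def authed_request(path: str, token: str) -> urllib.request.Request:
    """Build a GET request carrying the JWT as a bearer token."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_json(req: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage against a running server (not executed here):
#   token = fetch_json(login_request("grant", "password"))["token"]
#   contacts = fetch_json(authed_request("/api/contacts?limit=10", token))
```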
**Contacts:**
- `GET /api/contacts?type=&status=&search=&sort=&order=&limit=&offset=&organization_id=&tag=` → {data[], total, limit, offset}
- `GET /api/contacts/:id` → {data: {contact + communications[], opportunities[], lp_profile}}
- `POST /api/contacts` — full CRUD
- `PUT /api/contacts/:id`
- `DELETE /api/contacts/:id`
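
The list endpoint's `{data[], total, limit, offset}` envelope could be produced along the lines below. This is a sketch only: the function name, the default page size of 50, the cap of 200, and the three searched columns are assumptions, not the behaviour of `server.py`.

```python
import sqlite3

COLUMNS = ("id", "first_name", "last_name", "email")


def list_contacts(conn: sqlite3.Connection, search: str = "",
                  limit: int = 50, offset: int = 0) -> dict:
    """Return one page of contacts plus the total match count."""
    limit = max(1, min(int(limit), 200))  # clamp page size (assumed bounds)
    offset = max(0, int(offset))
    where, params = "", []
    if search:
        where = "WHERE first_name LIKE ? OR last_name LIKE ? OR email LIKE ?"
        like = f"%{search}%"
        params = [like, like, like]
    # Count first so the client can render pagination controls.
    total = conn.execute(
        f"SELECT COUNT(*) FROM contacts {where}", params
    ).fetchone()[0]
    rows = conn.execute(
        f"SELECT {', '.join(COLUMNS)} FROM contacts {where} "
        "ORDER BY last_name, first_name LIMIT ? OFFSET ?",
        params + [limit, offset],
    ).fetchall()
    data = [dict(zip(COLUMNS, row)) for row in rows]
    return {"data": data, "total": total, "limit": limit, "offset": offset}
```

User input only ever reaches the query through `?` placeholders; the interpolated pieces are fixed strings defined in the function.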
**Organizations:**
- `GET /api/organizations?search=&type=&limit=&offset=` → {data[], total}
- `GET /api/organizations/:id` → {data: {org + contacts[], opportunities[]}}
- `POST /api/organizations` — full CRUD
- `PUT /api/organizations/:id`
- `DELETE /api/organizations/:id`
**Opportunities (Pipeline):**
- `GET /api/opportunities?stage=&owner_id=&search=&priority=&fund_name=&limit=&offset=` → {data[], total}
- `GET /api/opportunities/:id` → {data: {opp + communications[], stage_history[]}}
- `POST /api/opportunities`
- `PUT /api/opportunities/:id`
- `PATCH /api/opportunities/:id/stage` — body: {stage} (logs stage change in audit)
- `DELETE /api/opportunities/:id`
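
To illustrate what "logs stage change in audit" might involve, here is a sketch that updates an opportunity's stage and writes an `audit_log` row in one transaction. The function name, the `stage_change` action label, and the shape of the `changes` JSON are assumptions; the column names come from the schema section.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

PIPELINE_STAGES = ["lead", "outreach", "meeting", "due_diligence", "committed", "funded"]


def change_stage(conn: sqlite3.Connection, opp_id: str,
                 new_stage: str, user_id: str) -> str:
    """Move an opportunity to new_stage, record the change, return the old stage."""
    if new_stage not in PIPELINE_STAGES:
        raise ValueError(f"unknown stage: {new_stage!r}")
    row = conn.execute(
        "SELECT stage FROM opportunities WHERE id = ?", (opp_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no opportunity with id {opp_id!r}")
    old_stage = row[0]
    changes = json.dumps({"stage": {"from": old_stage, "to": new_stage}})
    # "with conn" commits both statements together, or rolls both back on error.
    with conn:
        conn.execute(
            "UPDATE opportunities SET stage = ? WHERE id = ?", (new_stage, opp_id)
        )
        conn.execute(
            "INSERT INTO audit_log "
            "(id, user_id, entity_type, entity_id, action, changes, created_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (str(uuid.uuid4())[:8], user_id, "opportunity", opp_id,
             "stage_change", changes, datetime.now(timezone.utc).isoformat()),
        )
    return old_stage
```

Keeping the update and the audit insert in one transaction means the log cannot drift from the actual stage.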
**Communications:**
- `GET /api/communications?contact_id=&type=&search=&limit=&offset=` → {data[], total}
- `GET /api/contacts/:id/communications` → same as above, scoped to contact
- `POST /api/communications`
- `PUT /api/communications/:id`
- `DELETE /api/communications/:id`
**LP Profiles:**
- `GET /api/lp-profiles?fund_name=&search=` → {data[], total}
- `GET /api/lp-profiles/:id` → {data}
- `POST /api/lp-profiles` — also sets contact type to 'investor'
- `PUT /api/lp-profiles/:id`
**Reports:**
- `GET /api/reports/dashboard` → {metrics, pipeline_stages[], recent_communications[], upcoming_actions[], recent_stage_changes[]}
- `GET /api/reports/pipeline` → {by_stage[], by_owner[], by_priority[]}
- `GET /api/reports/lp-breakdown` → {lps[], summary, by_type[]}
- `GET /api/reports/activity?days=30` → {by_user[], by_day[]}
**Import/Export:**
- `POST /api/import/csv` — body: {data: [...objects], entity_type, mapping: {csv_col: crm_field}, dry_run: bool}
- `GET /api/export/contacts` → {data[]}
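
The import body described above can be assembled from raw CSV text roughly as follows. The helper names and the `"contacts"` value for `entity_type` are assumptions for illustration; `preview_mapped` only mimics locally what a `{csv_col: crm_field}` mapping would produce and is not the server's logic.

```python
import csv
import io


def build_import_payload(csv_text: str, mapping: dict,
                         entity_type: str = "contacts",
                         dry_run: bool = True) -> dict:
    """Parse CSV text into row objects and wrap them in the import body."""
    rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
    return {
        "data": rows,
        "entity_type": entity_type,  # assumed value; check the server's accepted types
        "mapping": mapping,          # {csv_col: crm_field}
        "dry_run": dry_run,          # validate first, write only when False
    }


def preview_mapped(payload: dict) -> list:
    """Apply the column mapping locally to preview the resulting records."""
    return [
        {crm_field: (row.get(csv_col) or "").strip()
         for csv_col, crm_field in payload["mapping"].items()}
        for row in payload["data"]
    ]
```

Sending the payload with `dry_run` set to true first, then again with it set to false once the preview looks right, matches the dry-run validation step listed for the Import page.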
**Other:**
- `GET /api/tags` / `POST /api/tags`
- `GET /api/users`
- `GET /api/audit-log?entity_type=&entity_id=`
- `GET /api/health`
### Frontend Pages
The frontend is a single HTML file with inline CSS (dark theme) and React via CDN. Pages:
1. **Login** — username/password form, registration option
2. **Dashboard** — KPI cards (Total LPs, Committed $, Pipeline Value, Active Opportunities, Prospects, Monthly Comms), pipeline stage visualization, recent communications, upcoming actions, recent stage changes
3. **Contacts** — tabbed (All/Investors/Prospects), searchable sortable table, slide-over detail panel with communications timeline and opportunities, add/edit modal
4. **Pipeline** — Kanban-style board (Lead → Outreach → Meeting → DD → Committed → Funded), stage summary bar with $ per stage, opportunity cards with stage selector, add/edit modal
5. **Communications** — chronological list, filter by type/contact, log new communication form
6. **LP Tracker** — summary cards (Total Committed, Funded, Avg Check, LP Count), table with status indicators (checkmarks) for docs/wire/K1
7. **Import** — CSV paste/upload, preview table, field mapping interface, dry-run validation, execute import
8. **Settings** — user profile, tag management
### Demo Data (Seeded Automatically)
On first run, the server seeds:
- 2 users: `admin`/`admin123` (admin role), `grant`/`password` (admin role)
- 8 organizations (Sovereign Wealth Holdings, Pacific Capital Partners, Northeast Pension Fund, Redwood Endowment, Atlas Family Office, Summit Insurance Group, Cascade Wealth Management, Blue Harbor Foundation)
- 12 contacts (6 investors, 6 prospects)
- 6 LP profiles totaling $83M committed (all Fund I, all fully funded)
- 6 pipeline opportunities totaling $40M expected (Fund II prospects at various stages)
- 8 communication records (emails, calls, meetings)
- 6 tags (High Priority, Fund I LP, Fund II Prospect, Family Office, Institutional, Re-up Target)
### How to Run
```bash
pip3 install bcrypt PyJWT
cd venture-crm
python3 backend/server.py
# Open http://localhost:8080
# Login: grant / password
```
### What Has Been Tested
All API endpoints have been verified via curl:
- Auth (login, register)
- Contact CRUD + search
- Organization CRUD
- Opportunity CRUD + stage changes
- Communication CRUD
- LP profile CRUD
- Dashboard, pipeline, LP breakdown reports
- CSV import with dry-run and field mapping
- Frontend serves correctly from the backend
---
## What Has NOT Been Built Yet (Remaining Sprints)
### Sprint 2 items still needed:
- Custom fields UI (backend schema exists but not wired to frontend forms)
- Drag-and-drop on pipeline board (currently uses dropdown stage selector)
### Sprint 3: Airtable Migration + Custom Fields
- Custom field definition admin UI
- Display custom fields on contact/opportunity forms
- Actual Airtable data migration (import wizard exists but hasn't been used with real data)
### Sprint 4: Reporting + Polish
- Pipeline analytics (deal velocity, conversion rates between stages)
- User activity report page
- CSV export buttons on all reports
- Bulk actions on contact list (tag multiple, assign, bulk export)
- Automated daily backup via cron
- Team setup documentation
### Future Enhancements (discussed but not planned):
- Email integration (auto-log emails via IMAP)
- Calendar sync
- Task assignments linked to opportunities
- Bulk email with templates
- Two-factor authentication
- Advanced saved search filters
- Audit trail UI page
---
## Architecture Decisions & Constraints
1. **Single-file backend:** The Python server is one file (`server.py`) using stdlib `http.server`. No framework. This keeps deployment dead simple but means no middleware pattern, no auto-docs, no async. If the codebase grows significantly, consider migrating to FastAPI.
2. **Single-file frontend:** The React app is one HTML file loading from CDN. No build step. This means no TypeScript, no tree-shaking, no code splitting. Babel compiles JSX in the browser. If the UI grows significantly, consider splitting into a proper Vite/React project.
3. **SQLite WAL mode:** Lets multiple readers run concurrently alongside a single writer; writes are serialized. Fine for a 5-person team. If the team grows past 10-15, migrate to PostgreSQL.
4. **No localStorage:** JWT token stored in React state only (memory). Page refresh = re-login. This is intentional for security.
5. **8-char UUIDs:** Generated via `str(uuid.uuid4())[:8]` (a `UUID` object cannot be sliced directly). That leaves 32 bits of randomness, so collision probability is negligible at this data scale.
6. **Tailscale for remote access:** Server binds to 0.0.0.0. Tailscale gives each device a 100.x.x.x IP. No port forwarding, no public exposure.
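
A small sketch of the short-ID scheme and a birthday-bound estimate of its collision risk, assuming IDs are the first eight hex characters of a UUID4 string. Both function names are hypothetical.

```python
import math
import uuid


def short_id() -> str:
    """Return the first 8 hex characters of a random UUID4."""
    # A UUID object is not subscriptable, so convert to str before slicing.
    return str(uuid.uuid4())[:8]


def collision_probability(n_rows: int, hex_chars: int = 8) -> float:
    """Birthday-bound estimate that n_rows random IDs contain a duplicate."""
    space = 16 ** hex_chars  # 16**8 == 2**32 possible IDs
    return -math.expm1(-n_rows * (n_rows - 1) / (2 * space))


if __name__ == "__main__":
    print(short_id())
    print(collision_probability(400))
```

For a table of about 400 rows the estimate is on the order of 2 in 100,000, but it grows quadratically with row count, so a primary-key constraint plus a retry on insert is a sensible safety net for larger tables.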
---
## Key Files to Read
When making changes, these are the files:
- **`backend/server.py`** (1,873 lines) — ALL backend logic: database schema, auth, every API endpoint, seed data, server startup. Search for handler method names like `handle_list_contacts`, `handle_create_opportunity`, etc.
- **`frontend/index.html`** (2,982 lines) — ALL frontend logic: CSS styles, React components, API client, every page. Search for component names like `Dashboard`, `ContactsPage`, `PipelinePage`, etc.
- **`scripts/create_user.py`** — CLI to add team members
- **`scripts/backup.sh`** — Database backup with rotation
- **`start.sh`** — Launch script
---
## Common Modification Patterns
**Adding a new field to contacts:**
1. In `server.py`: add column to CREATE TABLE, add to INSERT/UPDATE in handler methods
2. In `index.html`: add field to the contact form component and detail view
3. Delete `data/crm.db` to recreate the schema (this discards all data, so only do it with demo data), or add the column in place with `ALTER TABLE` via sqlite3
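
For the in-place route, a column can be added without recreating the database. The sketch below checks `PRAGMA table_info` first so it is safe to run more than once; the helper name and the `twitter_handle` column in the usage comment are hypothetical.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str,
                          column: str, decl: str) -> bool:
    """Add a column unless it already exists; return True if it was added."""
    # table, column and decl are interpolated into SQL, so they must be
    # trusted constants from code, never user input.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True


# Usage (hypothetical column):
#   conn = sqlite3.connect("data/crm.db")
#   add_column_if_missing(conn, "contacts", "twitter_handle", "TEXT")
```

Existing rows get NULL (or the declared default) for the new column, so no data is lost.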
**Adding a new pipeline stage:**
1. In `server.py`: add to `PIPELINE_STAGES` list
2. In `index.html`: add to the stages array in the Pipeline component
**Changing the color scheme:**
1. In `index.html`: modify the CSS variables in the `<style>` tag (search for hex colors like `#0f172a`, `#1e293b`, `#6366f1`)
**Adding a new API endpoint:**
1. In `server.py`: add route matching in `do_GET`/`do_POST`/etc., then add handler method
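
One way route matching plus a handler method can look with stdlib `http.server` is sketched below. It is a table-driven variant written for illustration; the actual dispatch code in `server.py` may be structured differently, and the `/api/widgets/:id` route and all names here are hypothetical.

```python
import json
import re
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlparse

# (method, compiled path pattern, handler method name)
ROUTES = [
    ("GET", re.compile(r"^/api/health$"), "handle_health"),
    ("GET", re.compile(r"^/api/widgets/(?P<widget_id>[^/]+)$"), "handle_get_widget"),
]


def match_route(method: str, raw_path: str):
    """Return (handler_name, path_params) for the first matching route, else None."""
    path = urlparse(raw_path).path  # drop the query string before matching
    for route_method, pattern, handler_name in ROUTES:
        if route_method == method:
            found = pattern.match(path)
            if found:
                return handler_name, found.groupdict()
    return None


class CRMHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        matched = match_route("GET", self.path)
        if matched is None:
            self.send_json(404, {"error": "not found"})
            return
        handler_name, params = matched
        getattr(self, handler_name)(**params)

    def handle_health(self):
        self.send_json(200, {"status": "ok"})

    def handle_get_widget(self, widget_id):
        self.send_json(200, {"data": {"id": widget_id}})

    def send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Keeping `match_route` as a plain function makes the routing table testable without starting a server.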
**Adding a new page:**
1. In `index.html`: create a new component, add it to the navigation sidebar and the page router