Matrix intake: fuzzy investor matching + conversational in-thread edits (v0.1.0:86)

Close the two locked post-deploy enhancements for the Matrix intake bot.

Fuzzy matching (server-side, ships in the s9pk): new find_intake_candidates in
server.py returns ranked deterministic near-matches (difflib name similarity +
token-set Jaccard, legal-suffix-aware, + email Levenshtein <= 2); GET
/api/intake/match now returns {match, candidates}. The bot surfaces a numbered
shortlist so a near-duplicate (Charlie/Charles, Acme Capital vs Acme Capital LLC,
a one-char email typo) is confirmed by a human instead of silently creating a
second investor. Exact match still auto-attaches; fuzzy candidates are never
auto-attached. The optional LLM-judge re-rank is deferred.

Conversational edits (bot-side, ships on the Spark): any in-thread reply that
isn't yes/no/edit field=value is treated as a natural-language revision and
re-run through local Qwen (parse.revise). Email integrity is preserved -- a
changed address must literally appear in the instruction; the model's email
field is structurally unreachable. No-op revisions re-prompt.

Docs/current-state brought current; 27/27 backend tests green.
Keysat
2026-06-17 18:50:58 -05:00
parent fa6c9da0e6
commit 0b893295e1
15 changed files with 734 additions and 41 deletions
+5 -4
@@ -103,16 +103,17 @@ Subsystem rules live in `docs/guides/` and lazy-load in Claude Code via `.claude
## Current state
-_Phase 0 substrate + Phase 1 thesis/outreach are built; **box and repo at v0.1.0:83** (deployed & verified live 2026-06-16). v83 (latest): **email search/query + windowed digest preview** — Communications tab gains a fixed/typed investor dropdown, a date-range filter, a full-body view, and a semantic "Search content" mode; the Daily Digest gains an in-app windowed preview before send. Prior v82: front-end libs vendored + SRI-pinned + jsdom render-smoke build gate. **Decision (2026-06-16): the fundraising grid + email capture is the canonical system of record** — vestigial classic-CRM surfaces get pruned or repurposed (see `ROADMAP.md` → "Consolidate on the fundraising grid as canonical"). Longer-term backlog: `ROADMAP.md`._
+_Phase 0 substrate + Phase 1 thesis/outreach built; **box and repo at v0.1.0:85** (deployed & verified live 2026-06-17). **The fundraising grid + email capture is the canonical system of record** (decision 2026-06-16) — vestigial classic-CRM surfaces get pruned/repurposed. Longer-term backlog: `ROADMAP.md`._
-- **Built & reviewed, not yet deployed — Matrix intake bot (M1+M2), `backend/matrix_intake/`:** a separate-process bot (its `matrix-nio` dep isolated from the stdlib CRM) that turns a typed message in a dedicated Matrix room into a proposed fundraising-grid add/edit and writes only after **in-thread human approval** (`yes`/`edit field=value`/`no`). Parse = local Qwen via Spark Control (reuses `ingest/llm.py`; no Claude, no scrub needed — local path like the digest). Writes reuse the CRM's own `POST /api/fundraising/log-communication` (create-if-missing + contact upsert + note + relational sync + audit), tagged `source="matrix_intake"`; the one new CRM surface is read-only `GET /api/intake/match` (`find_intake_match`) returning the **grid row id** so an approved note lands on the matched investor (no duplicate). v1 is **text-only** — business-card photo (M3) is deferred (Spark Control has no vision model). Reviewer-passed (double-approve race fixed — `handle_reply` pops before the commit await; edit-grammar fix). **Code-complete, compiles, 26/26 tests green; a live Matrix smoke needs creds + `matrix-nio` on the Spark (can't run in CI).** Guide: `docs/guides/matrix-intake.md` (incl. the `settings.py`-not-`config.py` collision + email-integrity gotchas).
+- **Matrix intake bot — DEPLOYED & LIVE (2026-06-17), `backend/matrix_intake/`:** a separate-process bot (its `matrix-nio` dep isolated from the stdlib CRM) turning a typed Matrix-room message into a proposed fundraising-grid add/edit, written only after **in-thread human approval** (`yes`/`edit field=value`/`no`). Parse = local Qwen via Spark Control (no Claude/scrub, like the digest); writes reuse the CRM's own `POST /api/fundraising/log-communication` tagged `source="matrix_intake"`; new-vs-existing via read-only `GET /api/intake/match` (returns the grid row id → no duplicate). **Runs on the Spark** (`modelo32`, nohup+venv; pid `/tmp/intake-bot.pid`, log `/tmp/intake-bot.log`) — **not a systemd service yet** (won't survive a reboot). **Live-smoked end-to-end** (new-investor create + existing-investor note matched & appended, no dup). Server side shipped to the box as **v0.1.0:84** (`/api/intake/match` + `source` provenance — these were missing on v83, so the bot 404'd until v84); then UX adds: main-timeline nudge pointer, top-level-`yes`→thread redirect, clearer commit wording, note text in the grid line (v85 dropped the `[note]` tag). M3 (business-card photo) deferred (no Spark vision model). Guide: `docs/guides/matrix-intake.md`.
+- **Matrix intake — fuzzy-match + conversational-edit pass — BUILT 2026-06-17, NOT yet deployed/live-smoked (repo at v0.1.0:86; box still v85).** Closes the two locked post-deploy enhancements (ROADMAP). **(a) Fuzzy matching (server-side, ships in the s9pk):** `find_intake_candidates` in `server.py` (deterministic — stdlib `difflib` name similarity + token-set Jaccard, legal-suffix-aware via `_strip_legal_suffix`, + email Levenshtein ≤ 2; ranked, ≥0.62, top 5); `GET /api/intake/match` now returns `{match, candidates}`. The bot surfaces a numbered shortlist (`_stage="disambiguate"`) so a near-duplicate ("Charlie"/"Charles", "Acme Capital"/"Acme Capital LLC", a one-char email typo) is **confirmed by a human** instead of silently creating a second investor — never auto-attached. **The optional LLM-judge re-rank was deferred** (deterministic filter already surfaces the cases; LLM is the right shortlist *pruner* if noise proves real). **(b) Conversational edits (bot-side, ships on the Spark):** any in-thread reply that isn't `yes`/`no`/`edit field=value` → `parse.revise` re-runs `{proposal + instruction}` through local Qwen and re-renders the card; **email integrity preserved** (a changed address must literally appear in the instruction; the model's email field is never trusted); no-op revisions re-prompt (`same_fields`). **Deploy is split:** the `candidates` need an **s9pk build+install** (v86); the bot's disambiguation+revise need a **Spark `git pull` + restart** — a bot restart alone won't deliver `candidates` (box returns `[]`, bot safely proposes new). Tests green; **needs a Matrix live-smoke** (grammar + Qwen `revise` leg). Guide updated.
- **Working (all draft-only):** CRM + ingest (chunk→embed→Qdrant + retrieval) + redaction boundary; Gmail capture (DWD) + email-activity propose→approve; Thesis Workshop + Architect (Claude) with dual-approval gate; Outreach Draft Assistant + follow-up radar + per-user voice + Tier-B in-thread Gmail draft creation.
- **Deployed & verified live: v0.1.0:83** (box `$START9_BOX_HOST`/immense-voyage.local; `installed-version` = `0.1.0:83`, migration chain `…82→83` clean, server up on `:8080`, Gmail + ingest + digest schedulers all started; render-smoke gated the build) — **email search/query + windowed digest preview** (code-only, migrations no-op). Communications tab (`CommunicationsPage` + `email_integration/db.query_email_activity`): **fixed the investor dropdown** — the facet now mirrors the list with the digest's precedence (grid → org → contact → address) and **typed keys** (`fund:`/`org:`/`contact:`), so email matched only to a classic contact or org domain (no grid id — the common case, since `fundraising_contacts.email` is sparsely populated) now resolves to a real name and is selectable, instead of the dropdown being empty; added a **date-range filter** (`since`/`until`), and a **click-to-expand full-body view** (`GET /api/email/detail?id=` → `query_email_detail`, admin, soft-delete-gated, renders `body_text` escaped — never raw HTML). New **semantic content search**: a "Search content" toggle → `GET /api/email/search?q=` (`routes._h_search`) wrapping `ingest/search.py:hybrid_search` filtered to `doc_type='email'` (lazy import; **503** if Spark/Qdrant unreachable), **hydrated + soft-delete-filtered against SQLite** (`db.search_hit_emails` — never trust the derived index). **Daily Digest:** Settings → Admin now builds a digest over a chosen window (last 24h or since a date) as an **in-app preview** before sending (`POST /api/admin/digest/preview`); manual send uses the same window (`send-now` + `digest_scheduler.send_digest_window`); window resolved by `digest_builder.resolve_digest_window` (cap 92d). Both run the **real local-Spark summarizer** and **never touch the daily cursor**. Verified: 22/22 backend tests, `py_compile` clean, render-smoke pass. **Grant validated both live on the box 2026-06-16** — the digest windowed preview renders real Spark narratives over real activity, and the Communications dropdown / date filter / full-body view / content-search all work. Detail: `docs/guides/email.md`.
- **Deployed & verified live: v0.1.0:82** (box `$START9_BOX_HOST`/immense-voyage.local; `installed-version` = `0.1.0:82`, migration chain `…81→82` clean, server up on `:8080`, schedulers + Gmail integration up). **v82 vendored React 18.3.1 / ReactDOM 18.3.1 / @babel/standalone 7.29.7 into `frontend/assets/vendor/`**, served same-origin with `sha384` SRI (no CDN, no outbound-internet dependency to render the UI), and added **`start9/0.4/render-smoke.mjs`** — a jsdom check (shipped-Babel transform asserts classic/non-module + parseable; real mount asserts the login UI renders) wired into the default `make` goal (`verified-build`), so every build is gated on the frontend actually rendering. Closes the v78 (blank screen) + v79 (Babel-8 ESM-import) class structurally. Detail: `docs/guides/packaging.md`. **Prior shipped & live:** v81 Communications-tab matched-only (`query_email_activity` gates on `EXISTS(email_investor_links)`; unmatched email captured but never shown; `docs/guides/email.md`); v80 admin-only email-activity panel (`GET /api/email/activity`); v78 retired `lp_profiles`/LP Tracker + repointed Dashboard "Total Committed" onto the grid (graveyard-excluded). **Digest fully live:** capture (DWD) → propose→approve; Gmail-DWD→SMTP transport; daily Phase-B digest (`digest_builder.py` + always-on `digest_scheduler.py` reading a DB policy + `send-now`); **daily auto-send is now ENABLED** (Grant turned it on in Settings → Admin, 2026-06-16). Detail: `docs/guides/email.md`.
- **Live since v74 (2026-06-13):** login works; `/assets/` traversal 404s (plain + URL-encoded), root health 200. On boot, `ensure_thesis_v2_promoted` makes the v2.0 reserve-asset spine the working *approved* spine (node-level, reversible). Security/privacy hardening (path-traversal close, outreach NER backstop, get-by-id soft-delete) shipped in v74 — detail in `EVALUATION.md`.
-- **Tests (2026-06-16):** **26/26 backend tests green** via `python3 backend/run_tests.py`, `py_compile` clean. (+4 this session for the Matrix intake bot: `matrix_intake/test_parse.py`, `test_proposals.py`, `test_crm_client.py`, and `test_intake_endpoints.py` — the last boots the real server against a temp DB and covers `/api/intake/match`, the create→match no-duplicate contract, and `source="matrix_intake"` provenance.) `test_email_activity_panel.py` now covers the **typed facet + org/contact resolution** (the dropdown fix), the **date-range filter**, the **detail view** (full body / recipients / attachments / soft-delete), and the **content-search route** (hydrate / drop-tombstoned / 503 / admin) with retrieval stubbed; `test_digest_builder.py` adds the **window resolver** + **`send_digest_window`** (no-cursor-touch) cases. Frontend **render smoke check** (`cd start9/0.4 && make render-smoke`) still gates the default `make` build. The 2 stale thesis tests stay fixed (seed structure in `docs/guides/thesis.md`).
+- **Tests (2026-06-17):** **27/27 backend tests green** via `python3 backend/run_tests.py`, `py_compile` clean. (+4 last session for the Matrix intake bot: `matrix_intake/test_parse.py`, `test_proposals.py`, `test_crm_client.py`, and `test_intake_endpoints.py` — the last boots the real server against a temp DB and covers `/api/intake/match`, the create→match no-duplicate contract, and `source="matrix_intake"` provenance.) **This session (v86 fuzzy + conversational pass) added cases to those same files**: `test_intake_endpoints.py`: fuzzy `candidates` (near-spelling, legal-suffix-at-1.0, one-char email typo, exact→no-candidates, nothing-close→empty); `test_proposals.py`: the disambiguation grammar + `attach_to_candidate`/`promote_to_new`/`same_fields`; `test_parse.py`: `revise` merge + email-integrity-from-instruction + match-id preservation; `test_crm_client.py`: the `{match, candidates}` shape + no-query-skips-network. `test_email_activity_panel.py` now covers the **typed facet + org/contact resolution** (the dropdown fix), the **date-range filter**, the **detail view** (full body / recipients / attachments / soft-delete), and the **content-search route** (hydrate / drop-tombstoned / 503 / admin) with retrieval stubbed; `test_digest_builder.py` adds the **window resolver** + **`send_digest_window`** (no-cursor-touch) cases. Frontend **render smoke check** (`cd start9/0.4 && make render-smoke`) still gates the default `make` build. The 2 stale thesis tests stay fixed (seed structure in `docs/guides/thesis.md`).
- **Decided, not yet built (detail in `ROADMAP.md`):** Pipeline adoption + a grid flag that auto-loads flagged investors as opportunities; **NL→safe-query** feature (search item 3 — the larger, separate build); CRM as canonical thesis backbone with the signal-engine reading from it (reconciliation unwired); reply-all for Tier-B drafts (currently reply to the LP only). *(Done this session, v83: email search item 1 [activity query/panel gaps — typed facet fix + date range + full-body view] and item 2 [semantic content search] both shipped; daily-digest windowed preview→send.)*
- **Known debt (P2, not deploy-blocking):** **reports-subsystem soft-delete sweep**: `handle_pipeline_report` + remaining report/aggregate queries over opportunities/communications still count soft-deleted rows (v78 shrank this surface: the `lp_profiles`/lp-breakdown aggregates are gone and the dashboard "Total Committed" is now grid-sourced); needs a pass + report-endpoint tests. Also `?limit=abc` crashes the request thread (authenticated list path); scrub-gateway TLS verify off; `cryptography==42.0.5`; stale user-visible `start9/0.4/assets/ABOUT.md`; hardcoded Spark/Qdrant IPs in the s9pk; **StartOS package icon oversized/zoomed** (research the Start9 icon spec, source a base ten31 logo, produce a correctly sized icon **before the next s9pk upload**); the 5.4k-line `server.py` monolith. P3 batch + full list in `EVALUATION.md`. *(Resolved v82: front-end CDN/SRI risk — libs vendored + SRI-pinned — and the render smoke check is now scripted into the build.)*
- **Doc drift to reconcile:** `crm-overview.md` + `EVALUATION.md` still describe `lp_profiles` as a live model in places — a doc-auditor pass should align them to "grid canonical, `lp_profiles` retired."
- **Other gaps:** the v2.0 spine is the *working* spine but **not a canonical `thesis_version`** (needs Grant + Jonathan dual sign-off); Appendix-A conviction/exposure (incl. ~40% Strike) stay Grant's working read, not canonical, not fed to the engine. Live infra now exercised on the box (Gmail capture + schedulers up; local-Spark summarization confirmed via the digest preview; Qdrant via Communications content-search); **Claude/Architect path still unverified live on the box.**
-- **Next:** 1) **deploy + live-smoke the Matrix intake bot** (`pip install matrix-nio` + `MATRIX_*`/`CRM_BOT_*` in `.env` on the Spark, create the CRM bot user, `python3 backend/matrix_intake/bot.py`, post a test message); 2) **Pipeline adoption** — grid flag → auto-load opportunities (the agreed next major build); 3) add an **auth regression test** asserting the 3 v79-gated GET endpoints (`/api/users`, `/api/email/status`, `/api/email/accounts`) reject members; 4) **reports-subsystem soft-delete sweep** + report-endpoint tests; 5) `?limit=abc` crash; 6) **email-capture tab error on email sync status** (likely `/api/email/status`); 7) **NL→safe-query** (search item 3 — separate, larger); 8) Grant + Jonathan freeze v2.0 canonical; 9) reply-all for Tier-B drafts. *(Logged to ROADMAP: a build step that pre-compiles JSX to drop runtime Babel entirely — bigger, contradicts the "no build step" convention.)*
+- **Next:** 1) **Pipeline adoption** — grid flag → auto-create/sync an `opportunities` row so flagged investors load into the Pipeline board (the agreed next major build; design the grid↔pipeline link first — see ROADMAP "Adopt the Pipeline"); 2) **make the intake bot a managed service** (systemd / restart-on-boot — it's a nohup process today); 3) **deploy + Matrix-smoke the v86 intake pass** — s9pk build+install (carries `find_intake_candidates`) + Spark `git pull`+restart (carries disambiguation + `revise`), then live-smoke the shortlist grammar and the Qwen revise leg (built this session, ROADMAP updated); 4) **reports-subsystem soft-delete sweep** + report-endpoint tests; 5) `?limit=abc` crash; 6) **auth regression test** for the 3 v79-gated GET endpoints (`/api/users`, `/api/email/status`, `/api/email/accounts`); 7) **NL→safe-query** (search item 3 — separate, larger); 8) Grant + Jonathan freeze v2.0 canonical; 9) reply-all for Tier-B drafts.
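The split deploy in the fuzzy-match bullet above depends on the bot handling three response shapes from `GET /api/intake/match`: an exact `match`, a non-empty `candidates` list, or neither (which is also what a pre-v86 box returns). The following is only a sketch of that routing under stated assumptions, not the code in `bot.py`: the helper names `route` and `interpret_choice` and the `_candidates` key are hypothetical, while `_stage`, `_match_id`, and the `number / new / no` replies come from the notes above.

```python
from typing import Optional, Tuple


def route(proposal: dict, response: Optional[dict]) -> dict:
    """Choose the next stage from a {match, candidates}-shaped response.

    Hypothetical helper: a missing or empty response degrades to "propose new".
    """
    response = response or {}
    match = response.get("match")
    candidates = response.get("candidates") or []
    routed = dict(proposal)
    if match:
        # Exact match: attach the note to the existing investor.
        routed.update(intent="meeting_note", _match_id=match["id"], _stage="approval")
    elif candidates:
        # Near-matches are never auto-attached; a human picks from a shortlist.
        routed.update(_candidates=candidates, _stage="disambiguate")
    else:
        # Nothing close (or an older server without candidates): propose as new.
        routed.update(_stage="approval")
    return routed


def interpret_choice(reply: str, candidates: list) -> Tuple[str, Optional[dict]]:
    """Interpret a disambiguation reply: a shortlist number, `new`, or `no`."""
    text = reply.strip().lower()
    if text == "no":
        return ("cancel", None)
    if text == "new":
        return ("new", None)
    if text.isdecimal() and 1 <= int(text) <= len(candidates):
        return ("attach", candidates[int(text) - 1])
    return ("reprompt", None)
```

Because an empty `candidates` list falls through to the same branch as "nothing close", restarting the bot against a v85 box keeps today's behaviour rather than failing.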
+4 -2
@@ -100,9 +100,11 @@ Use the **matrix-bridge** repo's pattern to listen on a dedicated ten31-database
- **CRM-side:** `POST /api/intake/investor` (service-auth) creates a new investor+contact **through the existing grid-save path** (so relational sync + audit + backup-on-write happen as with a UI edit; bot never does whole-blob RMW) or appends a meeting note to the interaction log for an existing investor; `GET /api/intake/match?q=` fuzzy-matches via the existing entity-resolution/email-matcher. New investor needs no fund at intake.
- **Phases:** M1 = scaffold + parse + in-thread propose, **no writes** (proves Matrix↔Spark). M2 = intake endpoint + match + write-on-approve + tests. M3 (deferred) = business-card photo.
-**Post-deploy enhancement — fuzzy match + in-thread confirm (Grant, 2026-06-17).** Today `find_intake_match` is **exact-after-normalization** (`_normalize_text` = lowercase+strip), so near-misses — "Charlie" vs "Charles" (same last name), "Acme Capital" vs "Acme Capital LLC", a one-character email typo — return no match and the bot proposes a **new** investor, risking a duplicate the human approves without realizing a near-match exists. The existing in-thread approval gate is useless against this because the human is never *shown* the near-match. Fix: matcher returns **ranked fuzzy candidates** (deterministic pre-filter: normalized name similarity / token overlap + email edit-distance ≤ ~2), surfaced in-thread for the human to confirm or pick, with the **local Spark LLM optionally re-ranking/judging the shortlist** (good at Charlie/Charles + legal-suffix equivalence; fed only the shortlist, never the whole LP list). Keeps the approval gate but makes it effective against duplicates. Land **after** the live smoke — net-new logic + reply grammar + tests; the current exact match is safe and its failure mode (a duplicate) is recoverable via the existing entity-merge subsystem (`backend/entity_*.py`).
+**Post-deploy enhancement — fuzzy match + in-thread confirm (Grant, 2026-06-17). BUILT 2026-06-17 (v0.1.0:86), not yet deployed / live-smoked.** Today `find_intake_match` is **exact-after-normalization** (`_normalize_text` = lowercase+strip), so near-misses — "Charlie" vs "Charles" (same last name), "Acme Capital" vs "Acme Capital LLC", a one-character email typo — return no match and the bot proposes a **new** investor, risking a duplicate the human approves without realizing a near-match exists. The existing in-thread approval gate is useless against this because the human is never *shown* the near-match. Fix: matcher returns **ranked fuzzy candidates** (deterministic pre-filter: normalized name similarity / token overlap + email edit-distance ≤ ~2), surfaced in-thread for the human to confirm or pick, with the **local Spark LLM optionally re-ranking/judging the shortlist** (good at Charlie/Charles + legal-suffix equivalence; fed only the shortlist, never the whole LP list). Keeps the approval gate but makes it effective against duplicates. Land **after** the live smoke — net-new logic + reply grammar + tests; the current exact match is safe and its failure mode (a duplicate) is recoverable via the existing entity-merge subsystem (`backend/entity_*.py`).
+- **As built:** `find_intake_candidates` in `server.py` (deterministic — stdlib `difflib` name similarity + token-set Jaccard, legal-suffix-aware via `_strip_legal_suffix`, + email Levenshtein ≤ 2; ranked, ≥0.62, top 5). `GET /api/intake/match` now returns `{match, candidates}`. Bot: a new `_stage="disambiguate"` shortlist (`proposals.render_disambiguation` / `interpret_disambiguation` / `attach_to_candidate` / `promote_to_new`) — human picks a number / `new` / `no`. **The optional LLM-judge re-rank was deliberately deferred** (the deterministic filter already surfaces the named cases; an LLM judge is the right *pruner* for shortlist noise — build if the deterministic ranking proves too noisy in practice). Tests: `test_intake_endpoints.py` (server fuzzy cases), `matrix_intake/test_proposals.py` (disambiguation grammar), `matrix_intake/test_crm_client.py` (candidate shape).
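The deterministic ranking described in the "As built" note can be sketched with the standard library alone. This is an illustration under assumptions, not the body of `find_intake_candidates`: the suffix list, the way the two name signals are combined (`max`), and the fixed 0.9 score given to an email near-match are all guesses; only the ingredients (`difflib` similarity, token-set Jaccard, suffix stripping, Levenshtein ≤ 2, the 0.62 floor, top 5) come from the text.

```python
import difflib
import re

# Hypothetical suffix list; the real _strip_legal_suffix may cover more forms.
LEGAL_SUFFIXES = {"llc", "lp", "llp", "inc", "ltd", "corp", "co"}
THRESHOLD = 0.62  # floor quoted in the notes above
TOP_N = 5         # shortlist size quoted in the notes above


def normalize(name: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and drop trailing legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return tokens


def name_score(a: str, b: str) -> float:
    """Best of character-level similarity and token-set Jaccard (assumed mix)."""
    ta, tb = normalize(a), normalize(b)
    if not ta or not tb:
        return 0.0
    ratio = difflib.SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()
    sa, sb = set(ta), set(tb)
    jaccard = len(sa & sb) / len(sa | sb)
    return max(ratio, jaccard)


def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def rank_candidates(query_name, query_email, rows):
    """Return up to TOP_N rows scoring at or above THRESHOLD, best first."""
    scored = []
    for row in rows:
        score = name_score(query_name or "", row.get("name") or "")
        email = row.get("email")
        if query_email and email:
            if levenshtein(query_email.lower(), email.lower()) <= 2:
                score = max(score, 0.9)  # assumed weight for an email near-match
        if score >= THRESHOLD:
            scored.append((score, row))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [dict(row, score=round(score, 3)) for score, row in scored[:TOP_N]]
```

With suffix stripping, "Acme Capital" and "Acme Capital LLC" normalize to the same tokens and score 1.0, which is the "legal-suffix-at-1.0" case named in the test list. The caller would still need to exclude an exact match from the shortlist, since exact matches auto-attach.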
-**Post-deploy enhancement — conversational (LLM-mediated) edits (Grant, 2026-06-17).** Today an in-thread correction uses a rigid grammar (`edit field=value`). Let a free-form reply that isn't `yes`/`no`/a literal `edit …` be treated as a natural-language revision instruction: send {current proposal + the instruction} back through local Qwen (`spark.py`, the same parse leg — no Claude, no scrub) and re-render the revised proposal card for approval (e.g. "add that we met on June 14" → updated Note). Keeps the draft→human-approve gate (the human still confirms the LLM's revision) and subsumes `edit field=value` as a deterministic fast path. Thread the instruction text into `normalize`'s source so the email-integrity rule still holds (a revised email must appear in the original message or the instruction). Pairs naturally with the fuzzy-match item above — build both as one conversational-UX pass after the smoke. (Parsing of free-form *intake* messages already works today via the Qwen parse leg; this item is specifically about the *edit/refine* turn.)
+**Post-deploy enhancement — conversational (LLM-mediated) edits (Grant, 2026-06-17). BUILT 2026-06-17 (bot-side, ships on the Spark), not yet deployed / live-smoked.** Today an in-thread correction uses a rigid grammar (`edit field=value`). Let a free-form reply that isn't `yes`/`no`/a literal `edit …` be treated as a natural-language revision instruction: send {current proposal + the instruction} back through local Qwen (`spark.py`, the same parse leg — no Claude, no scrub) and re-render the revised proposal card for approval (e.g. "add that we met on June 14" → updated Note). Keeps the draft→human-approve gate (the human still confirms the LLM's revision) and subsumes `edit field=value` as a deterministic fast path. Thread the instruction text into `normalize`'s source so the email-integrity rule still holds (a revised email must appear in the original message or the instruction). Pairs naturally with the fuzzy-match item above — build both as one conversational-UX pass after the smoke. (Parsing of free-form *intake* messages already works today via the Qwen parse leg; this item is specifically about the *edit/refine* turn.)
+- **As built:** `parse.revise` + `_apply_revision` (offline-testable; the approval-stage `else` branch in `bot.py` routes any non-yes/no/edit reply here). `parse_message` now stashes `_source_text` so revise can re-check email integrity against {instruction + original}; the model's email field is never trusted. No-op revisions are caught via `proposals.same_fields` (re-prompt, not a false "Updated"). **Known v1 limit:** revise edits fields but does not re-run the matcher on a mid-thread firm rename. Tests: `matrix_intake/test_parse.py` (revise merge + email integrity + match-id preservation).
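The email-integrity rule for revisions can be illustrated in a few lines. This is a sketch under assumptions rather than the real `_apply_revision`: the function names, the regex, and the convention that keys starting with `_` are internal are hypothetical. What it keeps from the notes above is the behaviour: the model's `email` field is ignored, an address changes only if it literally appears in the instruction, internal keys such as `_match_id` survive, and a no-op revision is detectable.

```python
import re

# Deliberately simple address pattern, good enough for an illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails_in(text: str) -> list[str]:
    """All email-looking substrings of the text, lowercased."""
    return [m.lower() for m in EMAIL_RE.findall(text or "")]


def apply_revision(current: dict, model_fields: dict, instruction: str) -> dict:
    """Merge model-proposed fields, but never accept an email from the model."""
    revised = dict(current)
    for key, value in model_fields.items():
        if key == "email" or key.startswith("_"):
            continue  # email and internal keys are not model-writable
        revised[key] = value
    found = emails_in(instruction)
    if found:
        # The address is copied from the human's own words, not the model output.
        revised["email"] = found[0]
    return revised


def same_fields(a: dict, b: dict) -> bool:
    """True when the user-visible fields are unchanged (a no-op revision)."""
    def visible(d: dict) -> dict:
        return {k: v for k, v in d.items() if not k.startswith("_")}
    return visible(a) == visible(b)
```

A caller would re-prompt when `same_fields(current, revised)` is true instead of announcing an update, matching the no-op handling described above.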
### Scoped service-credential auth path for automated CRM writers
*Surfaced 2026-06-17 while deploying the Matrix intake bot. **Decision: defer — the bot uses a dedicated member username/password for now.** The CRM has no API-key/service-token path; its only auth is username+password → JWT. A dedicated **member** login is appropriately scoped against what matters operationally (no admin: can't manage users, reset data, or change settings) and unblocks the live smoke today.*
+70 -12
@@ -46,30 +46,49 @@ async def main():
         try:
             proposal = await asyncio.to_thread(parse.parse_message, text)
         except Exception as exc:  # Spark/Qwen unreachable or bad response
-            await say(room_id, f"⚠️ couldn't reach the local parser: {exc}", root)
+            await say(room_id, f"⚠️ couldn't reach the local parser: {str(exc)[:200]}", root)
             return
         if proposal["intent"] == "unclear":
             await say(room_id, UNCLEAR_HELP, root)
             return
-        # Confirm new-vs-existing against the CRM matcher (read-only). Degrade gracefully if
-        # the CRM is unreachable — still propose, just without the "looks like existing" hint.
-        hint = ""
+        # Resolve new-vs-existing against the CRM matcher (read-only). Degrade gracefully if the
+        # CRM is unreachable — still propose as new, just without match/candidate hints.
+        match, candidates = None, []
         try:
-            match = await asyncio.to_thread(crm_client.match, proposal)
-            if match:
-                proposal["intent"] = "meeting_note"
-                proposal["_match_id"] = match["id"]
-                hint = f"\n\n🔎 Looks like an existing investor: **{match['name']}** — this will append a note to them."
+            res = await asyncio.to_thread(crm_client.match, proposal)
+            match = res.get("match")
+            candidates = res.get("candidates") or []
         except Exception:
             pass
+        if match:
+            # Confident exact match → auto-attach the note to that investor (no disambiguation).
+            proposal["intent"] = "meeting_note"
+            proposal["_match_id"] = match["id"]
+            proposal["_stage"] = "approval"
             store.put(root, proposal)
+            hint = (f"\n\n🔎 Looks like an existing investor: **{match['name']}** — "
+                    "this will append a note to them.")
             await say(room_id, proposals.render(proposal) + hint, root)
+            await nudge(room_id, proposals.summary_line(proposal), root)
+            return
+        if candidates:
+            # No exact match but near-misses exist → make the human pick one or confirm "new",
+            # so a typo'd/near-duplicate name can't silently create a second investor.
+            proposal["_stage"] = "disambiguate"
+            proposal["_candidates"] = candidates
+            store.put(root, proposal)
+            await say(room_id, proposals.render_disambiguation(proposal), root)
+            await nudge(room_id, proposals.disambiguation_nudge(proposal), root)
+            return
+        # Genuinely new — straight to the new-investor approval card.
+        proposal["_stage"] = "approval"
+        store.put(root, proposal)
+        await say(room_id, proposals.render(proposal), root)
         # Also drop a brief, un-threaded reply in the main timeline so the proposal isn't
         # easy to miss inside a thread (the full card + yes/edit/no stay in the thread).
         await nudge(room_id, proposals.summary_line(proposal), root)

     async def handle_reply(room_id, root, text):
-        action, payload = proposals.interpret_reply(text)
         # Claim the proposal synchronously — BEFORE any await — so a second reply that
         # arrives while a commit is in flight can't double-process it. asyncio is
         # cooperative: nothing else runs between here and the first await below, so the
@@ -77,6 +96,11 @@ async def main():
         proposal = store.pop(root)
         if proposal is None:
             return
+        if proposal.get("_stage") == "disambiguate":
+            await handle_disambiguation(room_id, root, text, proposal)
+            return
+        action, payload = proposals.interpret_reply(text)
         if action == "approve":
             try:
                 summary = await asyncio.to_thread(crm_client.commit, proposal)
@@ -92,9 +116,43 @@ async def main():
             proposal = proposals.apply_edit(proposal, field, value)
             store.put(root, proposal)  # keep it pending (edited) for the next reply
             await say(room_id, "✏️ Updated:\n\n" + proposals.render(proposal), root)
-        else:  # unrecognized reply — leave the proposal pending
+        else:
+            # Not yes/no/edit-grammar → treat it as a natural-language revision instruction and
+            # re-run it through local Qwen (no Claude, no scrub). The human still approves the
+            # revised card, so the draft→approve gate holds.
+            try:
+                revised = await asyncio.to_thread(parse.revise, proposal, text)
+            except Exception as exc:
                 store.put(root, proposal)
-            await say(room_id, "Reply **yes** to commit, **edit field=value**, or **no**.", root)
+                await say(room_id, f"⚠️ couldn't apply that change ({str(exc)[:200]}).\n\nReply **yes** "
+                          "to commit, **no** to discard, **edit field=value**, or rephrase.", root)
+                return
+            if proposals.same_fields(proposal, revised):
+                store.put(root, proposal)
+                await say(room_id, "I didn't catch a change there. Reply **yes** to commit, **no** "
+                          "to discard, **edit field=value**, or tell me what to change.", root)
+                return
+            store.put(root, revised)
+            await say(room_id, "✏️ Updated:\n\n" + proposals.render(revised), root)
+
+    async def handle_disambiguation(room_id, root, text, proposal):
+        cands = proposal.get("_candidates") or []
+        action, payload = proposals.interpret_disambiguation(text, len(cands))
+        if action == "pick":
+            updated = proposals.attach_to_candidate(proposal, cands[payload])
+            store.put(root, updated)
+            await say(room_id, "✏️ Will log against the existing investor:\n\n"
+                      + proposals.render(updated), root)
+        elif action == "new":
+            updated = proposals.promote_to_new(proposal)
+            store.put(root, updated)
+            await say(room_id, " OK — adding as a new investor:\n\n"
+                      + proposals.render(updated), root)
+        elif action == "reject":
+            await say(room_id, "🗑️ Discarded — nothing written.", root)
+        else:  # unrecognized — re-show the shortlist
+            store.put(root, proposal)
+            await say(room_id, "I didn't catch that.\n\n" + proposals.render_disambiguation(proposal), root)

     async def on_message(room: MatrixRoom, event: RoomMessageText):
         if event.sender == mx["user_id"]:
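
The three-way routing that this file's intake handler performs after calling the matcher can be sketched as a pure function. This is an illustrative reduction, not code from the commit: `route_intake` is a hypothetical name, and it returns only the control keys the handler would set on the proposal.

```python
def route_intake(match, candidates):
    """Return the control keys for a freshly parsed proposal.

    Hypothetical helper: the commit inlines this logic in the bot's intake handler.
    """
    if match:
        # Exact match: auto-attach as a note on the existing investor.
        return {"_stage": "approval", "intent": "meeting_note", "_match_id": match["id"]}
    if candidates:
        # Near-misses only: a human must pick one or confirm "new" first.
        return {"_stage": "disambiguate", "_candidates": list(candidates)}
    # Nothing similar: go straight to the new-investor approval card.
    return {"_stage": "approval"}
```

Checking `match` before `candidates` is what keeps fuzzy candidates from ever being auto-attached: they can only lead to the `disambiguate` stage.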
+19 -6
@@ -70,19 +70,32 @@ def _authed(method, path, body=None):
 def match(proposal):
-    """Return {'id', 'name'} for an existing investor matching this proposal, else None."""
+    """Resolve new-vs-existing for this proposal against the CRM matcher.
+
+    Returns {'match': {...}|None, 'candidates': [...]}:
+      - `match` is a confident EXACT existing investor — {'id', 'name'} — that the bot
+        auto-attaches a note to (no human disambiguation needed).
+      - `candidates` is a ranked list of fuzzy NEAR-matches — each {'id', 'name', 'score',
+        'matched_on'} — surfaced in-thread for the human to pick from (or confirm "new")
+        when there is no exact match, so a typo'd/near-duplicate name doesn't silently
+        create a second investor."""
     q = proposal.get("investor_name") or proposal.get("contact_name") or ""
     email = proposal.get("contact_email") or ""
     if not q and not email:
-        return None
+        return {"match": None, "candidates": []}
     qs = urlencode({"q": q, "email": email})
     status, data = _authed("GET", f"/api/intake/match?{qs}")
     if status != 200:
         raise RuntimeError(f"intake match failed ({status}): {data.get('error') or data}")
-    m = (data.get("data") or {}).get("match")
-    if not m:
-        return None
-    return {"id": m["id"], "name": m.get("investor_name") or q}
+    payload = data.get("data") or {}
+    m = payload.get("match")
+    match_out = {"id": m["id"], "name": m.get("investor_name") or q} if m else None
+    candidates = [
+        {"id": c["id"], "name": c.get("investor_name") or "?",
+         "score": c.get("score"), "matched_on": c.get("matched_on")}
+        for c in (payload.get("candidates") or []) if c.get("id")
+    ]
+    return {"match": match_out, "candidates": candidates}


 def build_commit_payload(proposal):
+57 -1
@@ -2,7 +2,13 @@
 The model only EXTRACTS structure; it never decides to write anything. New-vs-existing is
 finalized in M2 against the CRM matcher — here `intent` is the model's first read.
+
+`revise()` is the conversational-edit leg: a free-form correction the human types in the
+proposal thread (e.g. "add that we met June 14") is applied to the pending proposal via the
+same local Qwen — no Claude, no scrub. Email integrity is preserved: a changed address must
+literally appear in the instruction (or the original message); the model can never mint one.
 """
+import json
 import re

 import spark
@@ -60,4 +66,54 @@ def parse_message(text, parse_fn=spark.parse_json):
"""Parse one intake message. `parse_fn` is injectable for tests (defaults to Spark/Qwen). """Parse one intake message. `parse_fn` is injectable for tests (defaults to Spark/Qwen).
     Returns a normalized proposal dict. On a model/transport failure, raises (caller decides)."""
     raw = parse_fn(text, system=SYSTEM, max_tokens=400)
-    return normalize(raw, source_text=text)
+    proposal = normalize(raw, source_text=text)
# Stash the original message so a later revise() can re-check email integrity against it.
proposal["_source_text"] = text
return proposal
REVISE_SYSTEM = (
"You revise a structured investor-intake proposal from a short correction a venture-fund "
"team member typed. You are given the CURRENT proposal as JSON and an INSTRUCTION. Apply "
"the instruction and reply with ONLY the full revised JSON object, these keys:\n"
' "investor_name", "contact_name", "contact_email", "contact_title", "note".\n'
"Change ONLY what the instruction asks; copy every other field through unchanged. Use null "
"for a field the instruction clears or that is genuinely absent. Never invent an email "
"address. Output JSON only."
)
_REVISABLE = ("investor_name", "contact_name", "contact_title", "note")
def _apply_revision(proposal, model_out, instruction):
"""Merge the model's revised fields onto the proposal. Pure + offline-testable.
Preserves control keys (_match_id / _stage / intent / _source_text). Enforces email
integrity: a revised address is taken only if it literally appears in the INSTRUCTION the
human typed; otherwise the existing (already integrity-checked) address is kept. The model's
own email field is never trusted — it must not mint an address."""
model_out = model_out or {}
out = dict(proposal)
for k in _REVISABLE:
if k in model_out:
out[k] = _clean(model_out.get(k))
m = _EMAIL_RE.search(instruction or "")
if m:
out["contact_email"] = m.group(0).rstrip(".,;:!?)]}>\"'")
# else: keep proposal's current contact_email (untouched above; control key copied by dict())
# Don't let a revision strip the proposal down to nothing actionable.
if not out.get("investor_name") and not out.get("contact_name"):
out["investor_name"] = proposal.get("investor_name")
out["contact_name"] = proposal.get("contact_name")
return out
def revise(proposal, instruction, parse_fn=spark.parse_json):
"""Apply a natural-language correction to a pending proposal via local Qwen; return the
revised proposal dict. `parse_fn` is injectable for tests (defaults to Spark/Qwen)."""
current = {k: proposal.get(k) for k in
("investor_name", "contact_name", "contact_email", "contact_title", "note")}
prompt = ("CURRENT:\n" + json.dumps(current, ensure_ascii=False)
+ "\n\nINSTRUCTION:\n" + (instruction or "").strip())
raw = parse_fn(prompt, system=REVISE_SYSTEM, max_tokens=400)
return _apply_revision(proposal, raw, instruction)
+78
@@ -5,7 +5,12 @@ Matrix thread root (the bot's proposal lives in a thread rooted at the user's me
 the user replies inside that thread). In-memory and ephemeral by design — a restart drops
 pending proposals (the user just re-sends), matching matrix-bridge's stateless-by-default
 ethos. Nothing here writes to the CRM; the bot calls the CRM client only after `approve`.
+
+A proposal carries a `_stage`: "approval" (the normal yes/edit/no card) or "disambiguate"
+(a fuzzy-match shortlist the human must resolve — pick a number / "new" / "no" — before it
+becomes an approval-stage proposal). The shortlist itself rides on `_candidates`.
 """
+import re

 # field aliases accepted in `edit <field>=<value>`
 _EDIT_ALIASES = {
@@ -18,6 +23,10 @@ _EDIT_ALIASES = {
_YES = {"yes", "y", "approve", "approved", "ok", "confirm", "go", "👍", ""}
_NO = {"no", "n", "cancel", "discard", "reject", "stop", "👎", ""}
# "create a new investor anyway" replies to a disambiguation shortlist
_NEW = {"new", "none", "new investor", "none of these", "create", "create new", "add new", "neither"}
_CONTENT_FIELDS = ("intent", "investor_name", "contact_name", "contact_email", "contact_title", "note")
class ProposalStore:
@@ -84,6 +93,75 @@ def apply_edit(proposal, field, value):
return updated
def same_fields(a, b):
"""True if two proposals carry identical content (used to detect a no-op NL revision so we
don't tell the human 'Updated' when nothing changed)."""
return all((a or {}).get(k) == (b or {}).get(k) for k in _CONTENT_FIELDS)
def interpret_disambiguation(text, n_candidates):
"""Classify a reply to a fuzzy-match shortlist.
Returns ("pick", index) | ("new", None) | ("reject", None) | ("unknown", None). A bare
number selects that candidate; "new"/"none" creates a new investor; "no"/"cancel" discards."""
t = (text or "").strip().lower()
if not t:
return ("unknown", None)
if t in _NO:
return ("reject", None)
if t in _NEW:
return ("new", None)
m = re.fullmatch(r"#?\s*(\d{1,2})", t)
if m:
idx = int(m.group(1)) - 1
if 0 <= idx < n_candidates:
return ("pick", idx)
return ("unknown", None)
def attach_to_candidate(proposal, candidate):
"""Promote a disambiguation pick into an approval-stage meeting note on the chosen investor.
The note will target that existing grid row (via _match_id); the firm name is shown for
accuracy. Drops the shortlist."""
updated = dict(proposal)
updated.pop("_candidates", None)
updated["_stage"] = "approval"
updated["_match_id"] = candidate["id"]
updated["intent"] = "meeting_note"
if candidate.get("name"):
updated["investor_name"] = candidate["name"]
return updated
def promote_to_new(proposal):
"""Disambiguation 'new' — discard the shortlist and proceed as a new-investor proposal."""
updated = dict(proposal)
updated.pop("_candidates", None)
updated.pop("_match_id", None)
updated["_stage"] = "approval"
return updated
def render_disambiguation(proposal):
"""Render the fuzzy-match shortlist a human resolves before we create a new investor."""
name = proposal.get("investor_name") or proposal.get("contact_name") or "?"
cands = proposal.get("_candidates") or []
lines = [f"🔎 Before adding **{name}** as new — these existing investors look similar:"]
for i, c in enumerate(cands, 1):
lines.append(f" **{i}.** {c.get('name') or '?'}")
lines.append("")
lines.append("Reply a **number** to log this against that investor, **new** to add it as a "
"new investor, or **no** to discard.")
return "\n".join(lines)
def disambiguation_nudge(proposal):
"""Brief main-timeline pointer for a disambiguation proposal (the shortlist is in the thread)."""
name = proposal.get("investor_name") or proposal.get("contact_name") or "?"
return (f"🔎 **{name}** may match an existing investor — open the **thread** to pick one "
"or confirm it's new.")
def render(proposal):
"""Render a proposal as the in-thread message a human approves."""
if proposal.get("intent") == "meeting_note":
+55
@@ -58,6 +58,61 @@ def test_subject_blank_when_note_present_else_provenance_label():
assert no_note["subject"] == "Intake (Matrix)"
def _with_stub_authed(reply, capture=None):
"""Swap crm_client._authed for a canned (status, data); return a restorer."""
orig = crm_client._authed
def fake(method, path, body=None):
if capture is not None:
capture["path"] = path
return reply
crm_client._authed = fake
return orig
def test_match_parses_exact_match():
cap = {}
orig = _with_stub_authed((200, {"data": {
"match": {"id": "rowAcme", "investor_name": "Acme Capital", "matched_on": "name"},
"candidates": [],
}}), cap)
try:
res = crm_client.match({"investor_name": "Acme Capital", "contact_email": ""})
finally:
crm_client._authed = orig
assert res["match"] == {"id": "rowAcme", "name": "Acme Capital"}
assert res["candidates"] == []
assert "q=Acme" in cap["path"] # the query was forwarded
def test_match_returns_ranked_candidates_when_no_exact():
orig = _with_stub_authed((200, {"data": {"match": None, "candidates": [
{"id": "rowCharlie", "investor_name": "Charlie Brown", "score": 0.92, "matched_on": "name"},
{"id": "rowBeta", "investor_name": "Beta Capital LLC", "score": 0.86, "matched_on": "name"},
]}}))
try:
res = crm_client.match({"investor_name": "Charles Brown"})
finally:
crm_client._authed = orig
assert res["match"] is None
assert [c["id"] for c in res["candidates"]] == ["rowCharlie", "rowBeta"]
assert res["candidates"][0]["name"] == "Charlie Brown"
assert res["candidates"][0]["matched_on"] == "name"
def test_match_no_query_skips_network():
def boom(*a, **k):
raise AssertionError("should not hit the network when there's nothing to match on")
orig = crm_client._authed
crm_client._authed = boom
try:
res = crm_client.match({"investor_name": None, "contact_name": None, "contact_email": None})
finally:
crm_client._authed = orig
assert res == {"match": None, "candidates": []}
if __name__ == "__main__":
fns = [v for k, v in sorted(globals().items()) if k.startswith("test_") and callable(v)]
for fn in fns:
+59
@@ -102,6 +102,65 @@ def test_none_model_reply_is_unclear():
assert p["intent"] == "unclear"
def test_parse_message_stashes_source_text():
p = parse.parse_message("Acme Capital, Jane jane@acme.com",
parse_fn=_stub({"intent": "new_investor", "investor_name": "Acme Capital",
"contact_name": "Jane", "contact_email": "jane@acme.com"}))
assert p["_source_text"] == "Acme Capital, Jane jane@acme.com"
def test_revise_applies_note_change_and_preserves_control_keys():
proposal = parse.parse_message(
"New investor Acme Capital, Jane Doe jane@acme.com",
parse_fn=_stub({"intent": "new_investor", "investor_name": "Acme Capital",
"contact_name": "Jane Doe", "contact_email": "jane@acme.com",
"contact_title": None, "note": None}))
revised = parse.revise(
proposal, "add that we met on June 14",
parse_fn=_stub({"investor_name": "Acme Capital", "contact_name": "Jane Doe",
"contact_email": "jane@acme.com", "contact_title": None,
"note": "met on June 14"}))
assert revised["note"] == "met on June 14"
assert revised["investor_name"] == "Acme Capital"
assert revised["intent"] == "new_investor" # control key preserved
assert revised["_source_text"] == proposal["_source_text"] # preserved for email integrity
def test_revise_email_taken_only_from_instruction():
proposal = {"intent": "new_investor", "investor_name": "Acme", "contact_name": "Jane",
"contact_email": "jane@acme.com", "contact_title": None, "note": None,
"_source_text": "Acme, Jane jane@acme.com"}
# instruction literally carries the new address → accepted
r1 = parse.revise(proposal, "her email is jane@newfirm.com",
parse_fn=_stub({"contact_email": "jane@newfirm.com"}))
assert r1["contact_email"] == "jane@newfirm.com"
# model tries to change the email but the instruction has no address → keep the existing one
r2 = parse.revise(proposal, "set her title to GP",
parse_fn=_stub({"contact_email": "totally@madeup.test", "contact_title": "GP"}))
assert r2["contact_email"] == "jane@acme.com" # model's email ignored (not in instruction)
assert r2["contact_title"] == "GP"
def test_revise_preserves_match_id():
proposal = {"intent": "meeting_note", "investor_name": "Acme", "contact_name": None,
"contact_email": None, "contact_title": None, "note": "old",
"_match_id": "rowAcme", "_stage": "approval", "_source_text": "note for Acme: old"}
revised = parse.revise(proposal, "change the note to: sent the deck",
parse_fn=_stub({"note": "sent the deck"}))
assert revised["note"] == "sent the deck"
assert revised["_match_id"] == "rowAcme"
assert revised["intent"] == "meeting_note"
def test_revise_cannot_empty_the_proposal():
proposal = {"intent": "new_investor", "investor_name": "Acme", "contact_name": "Jane",
"contact_email": None, "contact_title": None, "note": "x", "_source_text": "Acme Jane"}
revised = parse.revise(proposal, "clear it",
parse_fn=_stub({"investor_name": None, "contact_name": None,
"contact_title": None, "note": None}))
assert revised["investor_name"] == "Acme" and revised["contact_name"] == "Jane"
if __name__ == "__main__":
fns = [v for k, v in sorted(globals().items()) if k.startswith("test_") and callable(v)]
for fn in fns:
+74
@@ -1,4 +1,5 @@
"""Tests for the proposal store + approval state machine (pure logic, no network).""" """Tests for the proposal store + approval state machine (pure logic, no network)."""
import copy
import os
import sys
@@ -105,6 +106,79 @@ def test_summary_line_new_vs_note():
assert "thread" in new_line.lower()
# --- fuzzy-match disambiguation + conversational-revision helpers ---
DISAMBIG = {"intent": "new_investor", "investor_name": "Charles Brown",
"contact_name": "Charles Brown", "contact_email": None, "contact_title": None,
"note": "met at conf", "_stage": "disambiguate",
"_candidates": [{"id": "rowCharlie", "name": "Charlie Brown", "score": 0.92, "matched_on": "name"},
{"id": "rowBeta", "name": "Beta Capital LLC", "score": 0.7, "matched_on": "name"}]}
def test_interpret_disambiguation_pick_number():
assert proposals.interpret_disambiguation("1", 2) == ("pick", 0)
assert proposals.interpret_disambiguation(" 2 ", 2) == ("pick", 1)
assert proposals.interpret_disambiguation("#1", 2) == ("pick", 0)
def test_interpret_disambiguation_out_of_range_is_unknown():
assert proposals.interpret_disambiguation("3", 2)[0] == "unknown"
assert proposals.interpret_disambiguation("0", 2)[0] == "unknown"
def test_interpret_disambiguation_new_and_no():
assert proposals.interpret_disambiguation("new", 2)[0] == "new"
assert proposals.interpret_disambiguation("none of these", 2)[0] == "new"
assert proposals.interpret_disambiguation("no", 2)[0] == "reject"
def test_interpret_disambiguation_freeform_is_unknown():
# a free-form reply in the shortlist stage isn't guessed at — re-prompt instead
assert proposals.interpret_disambiguation("the first one", 2)[0] == "unknown"
def test_attach_to_candidate_promotes_to_meeting_note():
out = proposals.attach_to_candidate(DISAMBIG, DISAMBIG["_candidates"][0])
assert out["_match_id"] == "rowCharlie"
assert out["intent"] == "meeting_note"
assert out["_stage"] == "approval"
assert out["investor_name"] == "Charlie Brown" # canonical existing name shown
assert "_candidates" not in out
assert "_candidates" in DISAMBIG # original untouched
def test_promote_to_new_clears_shortlist_and_match():
out = proposals.promote_to_new(dict(DISAMBIG, _match_id="rowX"))
assert out["_stage"] == "approval"
assert "_candidates" not in out
assert "_match_id" not in out
def test_disambiguation_pick_then_yes_reaches_approval():
# Closes the seam between the two state machines: a shortlist pick promotes the proposal to
# approval stage carrying the chosen investor's row id, and a following 'yes' classifies as
# approve (the normal commit path) — so pick -> yes lands the note on the existing investor.
picked = proposals.attach_to_candidate(copy.deepcopy(DISAMBIG), DISAMBIG["_candidates"][0])
assert picked["_stage"] == "approval"
assert picked["_match_id"] == "rowCharlie"
assert picked["intent"] == "meeting_note"
assert proposals.interpret_reply("yes") == ("approve", None)
def test_render_disambiguation_lists_numbered_candidates():
text = proposals.render_disambiguation(DISAMBIG)
assert "Charlie Brown" in text and "Beta Capital LLC" in text
assert "1." in text and "2." in text
assert "new" in text.lower() and "no" in text.lower()
def test_same_fields_ignores_control_keys():
a = dict(SAMPLE)
assert proposals.same_fields(a, dict(a))
assert not proposals.same_fields(a, dict(a, note="different"))
assert proposals.same_fields(a, dict(a, _match_id="r1", _stage="approval"))
if __name__ == "__main__":
fns = [v for k, v in sorted(globals().items()) if k.startswith("test_") and callable(v)]
for fn in fns:
+127 -2
@@ -15,6 +15,7 @@ import uuid
 import csv
 import io
 import re
+import difflib
 import base64
 import threading
 from datetime import datetime, timedelta
@@ -1254,6 +1255,124 @@ def find_intake_match(conn, q, email=None):
return email_hit
def _email_edit_distance(a, b):
"""Levenshtein distance between two short strings (emails). Stdlib-only DP; used to flag
near-miss emails (a one- or two-character typo) for the intake fuzzy matcher."""
a = (a or '').strip().lower()
b = (b or '').strip().lower()
if a == b:
return 0
if not a or not b:
return max(len(a), len(b))
prev = list(range(len(b) + 1))
for i, ca in enumerate(a, 1):
cur = [i]
for j, cb in enumerate(b, 1):
cost = 0 if ca == cb else 1
cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
prev = cur
return prev[-1]
# Legal-entity suffixes stripped before name comparison so 'Acme Capital' ~ 'Acme Capital LLC'.
# Deliberately tight: only true entity types, NOT name-type words (Capital/Ventures/Partners),
# which are distinctive enough to keep. Intentionally EXCLUDES 'sa'/'ag' (Société Anonyme /
# Aktiengesellschaft) — niche for this portfolio and ambiguous enough as coincidental trailing
# tokens that stripping them inflates false 1.0 matches across distinct firms.
_LEGAL_SUFFIX = {"llc", "lp", "llp", "inc", "incorporated", "ltd", "limited", "co", "corp",
"corporation", "company", "plc", "gmbh", "pte"}
def _strip_legal_suffix(normalized_name):
"""Drop trailing legal-entity suffix tokens (llc/lp/inc/…) from an already-normalized name."""
toks = re.findall(r"[a-z0-9]+", normalized_name)
while toks and toks[-1] in _LEGAL_SUFFIX:
toks.pop()
return " ".join(toks)
def _name_similarity(a, b):
"""0..1 fuzzy similarity between two investor names: the max of difflib's sequence ratio
(catches near-spellings — 'Charlie'/'Charles') and token-set Jaccard overlap (catches
word-order differences). Legal-entity suffixes are stripped first, so two names differing
only by 'LLC'/'LP'/'Inc' score 1.0 (a near-certain duplicate to surface — find_intake_match
won't have caught it, since it compares the full string). Favors recall: a shared common
name-word ('… Capital') can lift unrelated firms into the 0.6-0.8 band — acceptable noise in
a ranked, human-confirmed shortlist; semantic pruning is the deferred LLM-judge's job."""
a = _normalize_text(a)
b = _normalize_text(b)
if not a or not b:
return 0.0
if a == b:
return 1.0
sa = _strip_legal_suffix(a) or a
sb = _strip_legal_suffix(b) or b
if sa == sb:
return 1.0
ratio = difflib.SequenceMatcher(None, sa, sb).ratio()
ta = set(re.findall(r"[a-z0-9]+", sa))
tb = set(re.findall(r"[a-z0-9]+", sb))
jaccard = len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0
return max(ratio, jaccard)
def find_intake_candidates(conn, q, email=None, limit=5, min_score=0.62, max_email_distance=2):
"""Ranked fuzzy near-matches for the intake bot's disambiguation prompt.
Complements find_intake_match (which is exact-after-normalization): when the exact matcher
misses, this returns the closest existing grid investors so the bot can surface them
in-thread and the human can attach to one — instead of unknowingly creating a duplicate.
Deterministic (stdlib difflib + token overlap + email edit distance), no LLM. Scans the same
canonical grid blob as find_intake_match, so candidate ids are grid row ids the write targets.
EXCLUDES exact matches (score 1.0 — those belong to find_intake_match) and ranks by score."""
row = conn.execute("SELECT grid_json FROM fundraising_state WHERE id = 'main'").fetchone()
if not row or not row['grid_json']:
return []
try:
grid = json.loads(row['grid_json'])
except Exception:
return []
rows = grid.get('rows', []) if isinstance(grid, dict) else []
wanted_name = _normalize_text(q) if q else ''
wanted_email = (email or '').strip().lower()
scored = {}
for r in rows:
if not isinstance(r, dict):
continue
rid = str(r.get('id') or '').strip()
if not rid:
continue
name = str(r.get('investor_name') or '').strip()
# An exact name match belongs to find_intake_match — never echo it back as a candidate.
if wanted_name and _normalize_text(name) == wanted_name:
continue
name_score = _name_similarity(wanted_name, name) if (wanted_name and name) else 0.0
email_score = 0.0
if wanted_email:
contacts = r.get('contacts')
if isinstance(contacts, list):
for c in contacts:
if not isinstance(c, dict):
continue
ce = str(c.get('email') or '').strip().lower()
if not ce:
continue
dist = _email_edit_distance(wanted_email, ce)
# dist 0 is an exact email (find_intake_match's); 1→0.9, 2→0.8 are near-misses
if 0 < dist <= max_email_distance:
email_score = max(email_score, 1.0 - 0.1 * dist)
score = max(name_score, email_score)
if score < min_score: # too weak to be a useful suggestion
continue
matched_on = 'email' if email_score >= name_score else 'name'
# a row can match on both name and email — keep its highest-scoring read
if rid not in scored or score > scored[rid]['score']:
scored[rid] = {"id": rid, "investor_name": name,
"score": round(score, 3), "matched_on": matched_on}
out = sorted(scored.values(), key=lambda x: x['score'], reverse=True)
return out[:limit]
def ensure_fundraising_state_row(conn):
existing = conn.execute("SELECT * FROM fundraising_state WHERE id = 'main'").fetchone()
if not existing:
@@ -2950,7 +3069,12 @@ class CRMHandler(BaseHTTPRequestHandler):
     def handle_intake_match(self, user, params):
         """Read-only: does an investor matching this intake already exist? Used by the
         Matrix intake bot to label its in-thread proposal new-vs-existing. Returns the
-        grid row id so an approved note lands on exactly that investor."""
+        grid row id so an approved note lands on exactly that investor.
+
+        `match` is the confident exact match (auto-attached by the bot). When there is no
+        exact match, `candidates` carries ranked fuzzy near-matches so the bot can surface
+        a disambiguation shortlist in-thread (the human picks one or creates new) — closing
+        the duplicate-investor hole the exact-only matcher leaves open."""
         q = str(params.get('q') or '').strip()
         email = str(params.get('email') or '').strip()
         if not q and not email:
@@ -2958,9 +3082,10 @@ class CRMHandler(BaseHTTPRequestHandler):
        conn = get_db()
        try:
            match = find_intake_match(conn, q, email)
            candidates = find_intake_candidates(conn, q, email) if match is None else []
        finally:
            conn.close()
-        return self.send_json({"data": {"match": match}})
+        return self.send_json({"data": {"match": match, "candidates": candidates}})

    def handle_update_communication(self, user, comm_id, body):
        conn = get_db()
+59
@@ -71,6 +71,10 @@ GRID = {
    "rows": [
        {"id": "rowAcme", "investor_name": "Acme Capital", "notes": "",
         "contacts": [{"name": "Jane Doe", "email": "jane@acme.com", "title": "GP"}]},
        {"id": "rowCharlie", "investor_name": "Charlie Brown", "notes": "",
         "contacts": [{"name": "Charlie Brown", "email": "cb@brown.fund", "title": ""}]},
        {"id": "rowBeta", "investor_name": "Beta Capital LLC", "notes": "",
         "contacts": [{"name": "Pat Roe", "email": "pat@beta.com", "title": ""}]},
    ],
}
@@ -119,6 +123,61 @@ def main():
    check(st == 200 and (d or {}).get("data", {}).get("match") is None,
          f"no match -> null (got {st}, {d})")

    print("\n[fuzzy: exact match returns no candidates (bot auto-attaches)]")
    st, d = _req(port, "GET", "/api/intake/match?q=Acme%20Capital", token)
    data = (d or {}).get("data", {})
    check(st == 200 and data.get("match") and data.get("candidates") == [],
          f"exact match -> match set, candidates empty (got {data})")

    print("\n[fuzzy: near-spelling surfaces a candidate (Charles Brown ~ Charlie Brown)]")
    st, d = _req(port, "GET", "/api/intake/match?q=Charles%20Brown", token)
    data = (d or {}).get("data", {})
    cids = [c["id"] for c in data.get("candidates", [])]
    check(data.get("match") is None and "rowCharlie" in cids,
          f"near-spelling -> candidate rowCharlie, no exact (got {data})")

    print("\n[fuzzy: legal-suffix difference surfaces a candidate (Beta Capital ~ Beta Capital LLC)]")
    st, d = _req(port, "GET", "/api/intake/match?q=Beta%20Capital", token)
    data = (d or {}).get("data", {})
    cids = [c["id"] for c in data.get("candidates", [])]
    check(data.get("match") is None and "rowBeta" in cids,
          f"legal-suffix -> candidate rowBeta, no exact (got {data})")

    print("\n[fuzzy: legal-suffix-only difference ranks as a top candidate (Acme Capital LLC ~ Acme Capital)]")
    st, d = _req(port, "GET", "/api/intake/match?q=Acme%20Capital%20LLC", token)
    data = (d or {}).get("data", {})
    top = (data.get("candidates") or [None])[0]
    check(data.get("match") is None and top and top["id"] == "rowAcme" and top["score"] == 1.0,
          f"legal-suffix-only -> rowAcme top candidate @1.0, no exact (got {data})")

    print("\n[fuzzy: one-character email typo surfaces a candidate by email]")
    st, d = _req(port, "GET", "/api/intake/match?email=jhane@acme.com", token)
    data = (d or {}).get("data", {})
    cands = data.get("candidates", [])
    hit = next((c for c in cands if c["id"] == "rowAcme"), None)
    check(data.get("match") is None and hit and hit["matched_on"] == "email",
          f"email typo -> candidate rowAcme matched_on email (got {data})")

    print("\n[fuzzy: two-character email typo (distance 2) still surfaces]")
    st, d = _req(port, "GET", "/api/intake/match?email=jane@acne.con", token)  # acme->acne, com->con
    data = (d or {}).get("data", {})
    hit = next((c for c in data.get("candidates", []) if c["id"] == "rowAcme"), None)
    check(data.get("match") is None and hit and hit["matched_on"] == "email" and hit["score"] == 0.8,
          f"dist-2 email -> rowAcme @0.8 (got {data})")

    print("\n[fuzzy: a row matching on BOTH name and email appears once (deduped)]")
    st, d = _req(port, "GET", "/api/intake/match?q=Acme%20Capitol&email=jhane@acme.com", token)
    data = (d or {}).get("data", {})
    acme_hits = [c for c in data.get("candidates", []) if c["id"] == "rowAcme"]
    check(data.get("match") is None and len(acme_hits) == 1,
          f"name+email both match rowAcme -> single deduped entry (got {data})")

    print("\n[fuzzy: nothing close -> empty candidates]")
    st, d = _req(port, "GET", "/api/intake/match?q=Zphq%20Nobody%20LP", token)
    data = (d or {}).get("data", {})
    check(st == 200 and data.get("match") is None and data.get("candidates") == [],
          f"unrelated query -> no match, no candidates (got {data})")

    print("\n[match: missing q and email -> 400]")
    st, _ = _req(port, "GET", "/api/intake/match", token)
    check(st == 400, f"no params -> 400 (got {st})")
+100 -9
@@ -7,8 +7,15 @@ paths:
Read this before editing `backend/matrix_intake/`. The bot turns a typed message in a
dedicated Matrix room into a proposed fundraising-grid add/edit, gated on **in-thread human
-approval** before any write. Phase status: **M1 + M2 built** (text intake + approval + write);
-**M3 (business-card photo) deferred** — Spark Control has no vision model yet.
+approval** before any write. Phase status: **M1 + M2 deployed & live** (text intake + approval + write; bot on the Spark,
+CRM endpoints on the box at **v0.1.0:85**; live-smoked 2026-06-17). **M3 (business-card photo)
+deferred** — Spark Control has no vision model yet.
**Post-deploy UX pass — BUILT, not yet deployed (2026-06-17):** fuzzy investor matching
(server-side, **v0.1.0:86** — needs s9pk build+install) + in-thread disambiguation and
conversational natural-language edits (bot-side — needs a Spark `git pull` + restart). See
*Fuzzy matching* below. Tests green (27/27 backend + the offline bot suite); **not yet
live-smoked** — the disambiguation grammar and the Qwen `revise` leg need a Matrix smoke.
## What it is (and isn't)
@@ -27,13 +34,56 @@ approval** before any write. Phase status: **M1 + M2 built** (text intake + appr
1. Top-level message in the intake room → `parse.parse_message` → local **Qwen via Spark
Control** (`spark.py` reuses `backend/ingest/llm.py`; temp 0, JSON only) extracts
-`{intent, investor_name, contact_name, contact_email, contact_title, note}`.
-2. `crm_client.match` (`GET /api/intake/match`) checks new-vs-existing and returns the **grid
-row id** so an approved note lands on exactly that investor (no duplicate).
-3. The proposal is posted **in a thread** rooted at the user's message; the pending proposal is
-held in memory keyed by that thread root (`proposals.ProposalStore`).
-4. User replies in-thread: `yes` / `edit field=value` / `no`. On `yes`, `crm_client.commit`
-POSTs to `log-communication` tagged `source="matrix_intake"` (provenance in the audit log).
+`{intent, investor_name, contact_name, contact_email, contact_title, note}`. The original
+message text is stashed on the proposal as `_source_text` (needed later for `revise`'s
+email-integrity check).
+2. `crm_client.match` (`GET /api/intake/match`) resolves new-vs-existing. It returns **both** an
+exact `match` (returns the **grid row id** so an approved note lands on exactly that investor,
+no duplicate) **and**, when there's no exact match, a ranked list of fuzzy `candidates` (see
+*Fuzzy matching* below).
3. Three outcomes drive what gets posted, all **in a thread** rooted at the user's message, plus a
brief **main-timeline nudge** (a plain reply — `matrix_io.make_reply`) so it isn't missed:
- **Exact match** → auto-attach: proposal flips to `meeting_note` with `_match_id` set, rendered
as the normal approval card.
- **Fuzzy candidates, no exact** → a **disambiguation** card (`proposals.render_disambiguation`):
the proposal is held at `_stage="disambiguate"` with `_candidates`, and the human must pick a
**number** / `new` / `no` before it becomes an approval-stage proposal.
- **Neither** → the new-investor approval card.
The nudge is a **pointer only, not a reply target** — you need the thread to act. The pending
proposal is held in memory keyed by the thread root (`proposals.ProposalStore`).
4. User replies **in the thread**. `handle_reply` branches on `_stage`:
- **disambiguate** (`handle_disambiguation`): a number attaches to that candidate (→ `meeting_note`
+ `_match_id`, re-rendered for approval); `new` proceeds as a new investor; `no` discards.
- **approval**: `yes` commits; `no` discards; `edit field=value` is the deterministic fast-path
edit; **anything else is treated as a natural-language revision**: `parse.revise` sends
`{current proposal + instruction}` back through local Qwen and re-renders the revised card (a
no-op revision is detected via `proposals.same_fields` and re-prompts instead of saying
"Updated"). On `yes`, `crm_client.commit` POSTs to `log-communication` tagged
`source="matrix_intake"` (provenance in the audit log).
A bare `yes`/`no` typed **top-level** (not in the thread) while a proposal is pending gets a
"reply in the thread" redirect (`store.any_pending()` guard in `handle_intake`), not a
misparsed new intake.
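
The branching in steps 3 and 4 can be sketched roughly as below. This is an illustrative sketch, not the bot's code: `classify_lookup`, `route_reply`, the action labels, and the `reprompt` fallback for an unrecognised disambiguation reply are all assumptions; only the `{match, candidates}` payload shape and the `yes` / `no` / `new` / number / `edit field=value` grammar come from the text above.

```python
import re

def classify_lookup(resp: dict) -> str:
    """Map a {match, candidates} lookup payload to one of the three outcomes."""
    data = resp.get("data") or {}
    if data.get("match"):
        return "auto_attach"    # exact match: attach without asking
    if data.get("candidates"):
        return "disambiguate"   # fuzzy near-matches: the human must pick
    return "new_investor"       # nothing close: propose a new investor

# Hypothetical pattern for the deterministic fast-path edit, e.g. "edit email=a@b.com".
_EDIT_RE = re.compile(r"^edit\s+(\w+)\s*=\s*(.+)$", re.IGNORECASE | re.DOTALL)

def route_reply(stage: str, text: str, n_candidates: int = 0):
    """Return (action, payload) for an in-thread reply, branching on the stage."""
    stripped = text.strip()
    low = stripped.lower()
    if stage == "disambiguate":
        if low == "no":
            return ("discard", None)
        if low == "new":
            return ("new", None)
        if low.isascii() and low.isdigit() and 1 <= int(low) <= n_candidates:
            return ("pick", int(low) - 1)   # 1-based shortlist -> 0-based index
        return ("reprompt", None)           # assumed fallback, not from the source
    # approval stage
    if low == "yes":
        return ("commit", None)
    if low == "no":
        return ("discard", None)
    m = _EDIT_RE.match(stripped)
    if m:
        return ("edit", (m.group(1).lower(), m.group(2).strip()))
    return ("revise", stripped)             # anything else: natural-language revision

if __name__ == "__main__":
    print(classify_lookup({"data": {"match": None, "candidates": [{"id": "rowCharlie"}]}}))
    print(route_reply("approval", "actually she is a Partner, not GP"))
```

Keeping the routing a pure function of `(stage, text)` would make the reply grammar testable offline, without Matrix or Spark in the loop.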
## Fuzzy matching (server-side, ships in the s9pk)
`GET /api/intake/match` returns `{match, candidates}`. `find_intake_match` is unchanged —
**exact-after-normalization**, and an exact match still auto-attaches without disambiguation.
`find_intake_candidates` (new) is the fuzzy layer, **deterministic, no LLM**: it scans the same
canonical grid blob and scores each row by `max(name similarity, email near-match)`, keeping
rows ≥ `min_score` (0.62), ranked, capped at 5:
- **Name** (`_name_similarity`): max of stdlib `difflib` sequence ratio (near-spellings —
"Charlie"/"Charles") and token-set Jaccard (word-order). **Legal-entity suffixes**
(LLC/LP/Inc/… via `_strip_legal_suffix`) are stripped first, so "Acme Capital" ~ "Acme Capital
LLC" scores 1.0 (a near-certain duplicate `find_intake_match` misses because it compares the
full string) — and is surfaced as a candidate, **never auto-attached** (the human still confirms).
- **Email** (`_email_edit_distance`): Levenshtein ≤ 2 against each contact email (dist 1→0.9,
2→0.8). Distance 0 is an exact email — that's `find_intake_match`'s job, skipped here.
- **Recall-favoring by design:** a shared common name-word ("… Capital") can lift an unrelated firm
into the 0.6 to 0.8 band. Acceptable — it's a *ranked, human-confirmed* shortlist, and the cost of an
occasional stray suggestion is far lower than missing a real near-duplicate. **Semantic pruning of
the shortlist (the "Charlie really is Charles" judgment) is a deferred LLM-judge re-rank** — fed
only the shortlist, never the whole LP list — intentionally NOT built in this pass, because the
deterministic filter already surfaces every duplicate the human then resolves.
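
A rough, self-contained sketch of the two scoring legs described above. It is not the code in `server.py`: the function names, the tokenizer, and the three-entry suffix set are assumptions made for illustration (the real `_strip_legal_suffix` list is longer and is not reproduced here). Only the ingredients come from the text: `difflib` ratio, token-set Jaccard, suffix stripping, and the `1.0 - 0.1 * distance` email score for distances 1 and 2.

```python
import re
from difflib import SequenceMatcher

# Assumed, deliberately short suffix set; the source only says "LLC/LP/Inc/…".
_LEGAL_SUFFIXES = {"llc", "lp", "inc"}

def _tokens(name: str) -> list:
    """Lowercase word tokens with trailing legal-entity suffixes removed."""
    toks = re.findall(r"[a-z0-9]+", name.lower())
    while toks and toks[-1] in _LEGAL_SUFFIXES:
        toks.pop()
    return toks

def name_similarity(a: str, b: str) -> float:
    """max(difflib sequence ratio, token-set Jaccard) on suffix-stripped names."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    ratio = SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()
    sa, sb = set(ta), set(tb)
    jaccard = len(sa & sb) / len(sa | sb)
    return max(ratio, jaccard)

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def email_score(wanted: str, existing: str, max_distance: int = 2) -> float:
    """Distance 1 or 2 is a near-miss; distance 0 is left to the exact matcher."""
    d = edit_distance(wanted.strip().lower(), existing.strip().lower())
    return 1.0 - 0.1 * d if 0 < d <= max_distance else 0.0

if __name__ == "__main__":
    print(name_similarity("Acme Capital", "Acme Capital LLC"))
    print(name_similarity("Charles Brown", "Charlie Brown"))
    print(email_score("jhane@acme.com", "jane@acme.com"))
```

Taking the maximum of the two name measures is what makes the filter recall-favoring: the sequence ratio catches near-spellings, while Jaccard catches reordered words that the ratio would penalise.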
## Rules / gotchas
@@ -47,9 +97,27 @@ approval** before any write. Phase status: **M1 + M2 built** (text intake + appr
could attach the wrong one; the human sees it in the proposal and can `edit email=…` before
approving. Cross-referencing multiple addresses to the named contact is a deliberate non-goal
for v1.
- **Conversational revise keeps the email rule:** `parse.revise` re-runs a free-form correction
through Qwen but **never trusts the model's email field**. A changed address is accepted only if
it literally appears in the *instruction text* (searched first), else the existing
integrity-checked address is kept (`_apply_revision`). The model can edit name/contact/title/note
freely but cannot mint an email. A revision that nulls both investor and contact is rejected (the
proposal can't be emptied to something unactionable). Revise edits fields on the current proposal;
it does **not** re-run the matcher if you rename the firm mid-thread (a known v1 limit — the human
still approves).
- **Deploy is split across two surfaces** (mind which one carries a change): the fuzzy
**`candidates`** come from `server.py` → ship in the **s9pk** (build + install, version-bumped).
The bot's **disambiguation flow + `revise`** live in `backend/matrix_intake/` → ship on the
**Spark** via `git pull` + restart. A bot restart alone won't deliver `candidates` (the box would
return an empty list and the bot just proposes new — safe, but no fuzzy surfacing until the s9pk
is installed). Same lesson as the v83→v84 `/api/intake/match` 404.
- **Double-approve guard:** `handle_reply` pops the pending proposal from the store *before*
awaiting the commit, so a second `yes` arriving mid-write is a no-op (asyncio is cooperative;
the pop is atomic w.r.t. other events). On commit failure the proposal is restored for retry.
*Known minor:* in the **disambiguate** stage the pick re-stores an approval-stage proposal
before its `await say`, so a rapidly-repeated `1` can have the second one fall through to the
NL-revise path (a wasted Spark round-trip that re-prompts) — harmless, nothing commits, not
guarded (low likelihood on a ~5-person team).
- **Local-only parse:** intake text is real LP substance but goes ONLY to local Qwen via Spark
Control, never Claude — so no scrub boundary applies (same basis as the digest). Never call a
Spark directly; always go through `SPARK_CONTROL_URL`.
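
The email rule for conversational revisions in the rules above can be illustrated with a small sketch. This is an assumption-laden illustration, not `_apply_revision` itself: `resolve_email` and the regular expression are hypothetical, and the sketch searches only the instruction text. The point it shows is that the model's proposed address is never consulted.

```python
import re

# Hypothetical, deliberately simple address pattern for the illustration.
_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def resolve_email(current_email, instruction, model_email=None):
    """Pick the email for a revised proposal.

    `model_email` is accepted only to make the rule visible: it is ignored.
    An address changes only if one literally appears in the human's
    instruction text; otherwise the already-checked address is kept.
    """
    found = _EMAIL_RE.findall(instruction)
    if found:
        return found[0]
    return current_email

if __name__ == "__main__":
    # The model "suggests" an address the human never typed: it is dropped.
    print(resolve_email("jane@acme.com", "make her title Partner",
                        model_email="jane.doe@acme.io"))
    # The human typed a new address: it is taken from the instruction itself.
    print(resolve_email("jane@acme.com", "her email is actually jdoe@acme.com."))
```

Because the return value is derived only from the instruction and the existing address, a hallucinated address has no path into the proposal, which is the sense in which the model's email field is unreachable.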
@@ -59,6 +127,29 @@ approval** before any write. Phase status: **M1 + M2 built** (text intake + appr
network; `backend/test_intake_endpoints.py` boots the real server against a temp DB and
covers `/api/intake/match` + the create→match (no-duplicate) contract + provenance. A **live
Matrix smoke** needs creds + `matrix-nio` installed on the Spark — it can't run in CI.
- **Grid note line:** the bot sends a **blank `subject`** when there's a note so the CRM's
one-line note summary shows the note text (the CRM renders subject-or-body); a provenance
label is sent only when there's no note. v0.1.0:85 also dropped the redundant `[note]` type
tag from that server-side line (informative types like `[call]` keep theirs).
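
The double-approve guard from the rules above (pop before awaiting, restore on failure) can be demonstrated in isolation. The sketch below is a stand-in under stated assumptions: this `ProposalStore` is a minimal hypothetical class, not `proposals.ProposalStore`, and `approve`, the thread-root key, and the commit callables are invented for the demo.

```python
import asyncio

class ProposalStore:
    """Minimal hypothetical stand-in: pending proposals keyed by thread root."""
    def __init__(self):
        self._pending = {}

    def put(self, root, proposal):
        self._pending[root] = proposal

    def pop(self, root):
        return self._pending.pop(root, None)

async def approve(store, root, commit):
    # Pop BEFORE the first await: no other coroutine can run between the
    # lookup and the removal, so a second "yes" finds nothing to commit.
    proposal = store.pop(root)
    if proposal is None:
        return "ignored"
    try:
        await commit(proposal)
    except Exception:
        store.put(root, proposal)   # restore so the human can retry
        return "failed"
    return "committed"

async def demo():
    store = ProposalStore()
    store.put("$root", {"investor_name": "Acme Capital"})
    written = []

    async def slow_commit(proposal):
        await asyncio.sleep(0.01)   # simulates the HTTP round-trip
        written.append(proposal)

    # Two approvals race for the same pending proposal.
    results = await asyncio.gather(
        approve(store, "$root", slow_commit),
        approve(store, "$root", slow_commit),
    )
    return results, written

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The guard relies on asyncio's cooperative scheduling: a task is only suspended at an `await`, so the synchronous `pop` cannot be interleaved with another reply handler.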
## Deployment & ops
- **Runs on the Spark** (SSH alias `modelo32`, host `spark-32d0`): repo at
`/home/modelo/ten31-database`, deps in a venv (`.venv`; only `matrix-nio`). Launched detached:
`nohup ./.venv/bin/python backend/matrix_intake/bot.py >/tmp/intake-bot.log 2>&1 &`, pid in
`/tmp/intake-bot.pid`; startup logs `listening as … in room …`.
- **Restart after a `git pull` of bot code:** `kill $(cat /tmp/intake-bot.pid)`, relaunch as
above, re-write the pid. A restart **drops in-memory pending proposals** (re-send to recover).
- **NOT a managed service yet** — won't survive a Spark reboot; restart-on-boot (systemd) is an
open TODO.
- **Server-side endpoints ship in the s9pk, not the bot.** `GET /api/intake/match` and the
`source` provenance on `log-communication` live in `backend/server.py`, so they reach the box
only via an **s9pk build + install** — a bot restart won't deliver them. (Missed in v83: the
box 404'd `/api/intake/match` until **v0.1.0:84**.)
- **`CRM_API_BASE` is the box over the LAN, not localhost** (bot on the Spark, CRM on the box).
`https://immense-voyage.local` (443) is the **StartOS dashboard**, not the CRM — the CRM has
its own interface address (the URL you open in a browser); container port 8080 isn't
LAN-reachable.
## Config
+3 -2
@@ -50,8 +50,9 @@ export const PACKAGE_TITLE = 'Ten31 Database'
// * 0.1.0:82 (vendor + SRI-pin the front-end libs: React/ReactDOM/Babel now ship in the s9pk and load same-origin from /assets/vendor/ with integrity hashes, so a CDN can never swap prod deps [the v78/v79 blank-screen class] and the box needs no outbound internet to render; plus a committed jsdom render smoke check [start9/0.4/render-smoke.mjs] gating the default `make` build)
// * 0.1.0:83 (email search/query + windowed digest preview, code-only: Communications investor dropdown now mirrors the list with typed keys [fund:/org:/contact:] so classic-contact/org-domain matches show + are pickable [fixes the empty-dropdown bug], plus a date-range filter, a click-to-expand full-body view [GET /api/email/detail], and a semantic "Search content" mode over indexed email bodies [GET /api/email/search -> ingest hybrid_search, soft-delete-filtered, 503 if Spark/Qdrant down]; Daily Digest gains an in-app windowed preview before send [POST /api/admin/digest/preview, send-now takes the same window] that exercises the real Spark summarizer without touching the daily cursor)
// * 0.1.0:84 (Matrix intake bot CRM support — ships the server side of commit 7ad0ee7, which was never packaged: new read-only GET /api/intake/match [new-vs-existing lookup against the canonical fundraising grid blob; returns the grid row id so an approved note lands on the matched investor, no duplicate] + source provenance on POST /api/fundraising/log-communication [audit records source, default "fundraising_grid"]; code-only, no schema change)
-// * Current: 0.1.0:85 (cosmetic: drop the redundant "[note]" tag from the fundraising-grid note line — now "YYYY-MM-DD Contact: summary"; informative comm types [call, meeting, …] keep their "[type]" tag; shared by the Matrix intake bot + grid-UI logging; no schema change)
+// * 0.1.0:85 (cosmetic: drop the redundant "[note]" tag from the fundraising-grid note line — now "YYYY-MM-DD Contact: summary"; informative comm types [call, meeting, …] keep their "[type]" tag; shared by the Matrix intake bot + grid-UI logging; no schema change)
+// * Current: 0.1.0:86 (Matrix intake fuzzy matching: GET /api/intake/match now returns ranked `candidates` [fuzzy near-matches — deterministic difflib name similarity + token overlap + email edit-distance ≤ 2, legal-suffix-aware] alongside the exact `match`, so the bot can surface near-duplicates ["Charlie"/"Charles", "Acme Capital"/"Acme Capital LLC", a one-char email typo] for human confirmation instead of silently creating a second investor; the bot-side disambiguation + conversational-edit UX ships on the Spark, not the s9pk; code-only, no schema change)
-export const PACKAGE_VERSION = '0.1.0:85'
+export const PACKAGE_VERSION = '0.1.0:86'
export const DATA_MOUNT_PATH = '/data'
export const WEB_PORT = 8080
+3 -2
@@ -46,8 +46,9 @@ import { v_0_1_0_82 } from './v0.1.0.82'
import { v_0_1_0_83 } from './v0.1.0.83'
import { v_0_1_0_84 } from './v0.1.0.84'
import { v_0_1_0_85 } from './v0.1.0.85'
import { v_0_1_0_86 } from './v0.1.0.86'

export const versionGraph = VersionGraph.of({
-  current: v_0_1_0_85,
+  current: v_0_1_0_86,
-  other: [v_0_1_0_39, v_0_1_0_40, v_0_1_0_41, v_0_1_0_42, v_0_1_0_43, v_0_1_0_44, v_0_1_0_45, v_0_1_0_46, v_0_1_0_47, v_0_1_0_48, v_0_1_0_49, v_0_1_0_50, v_0_1_0_51, v_0_1_0_52, v_0_1_0_53, v_0_1_0_54, v_0_1_0_55, v_0_1_0_56, v_0_1_0_57, v_0_1_0_58, v_0_1_0_59, v_0_1_0_60, v_0_1_0_61, v_0_1_0_62, v_0_1_0_63, v_0_1_0_64, v_0_1_0_65, v_0_1_0_66, v_0_1_0_67, v_0_1_0_68, v_0_1_0_69, v_0_1_0_70, v_0_1_0_71, v_0_1_0_72, v_0_1_0_73, v_0_1_0_74, v_0_1_0_75, v_0_1_0_76, v_0_1_0_77, v_0_1_0_78, v_0_1_0_79, v_0_1_0_80, v_0_1_0_81, v_0_1_0_82, v_0_1_0_83, v_0_1_0_84],
+  other: [v_0_1_0_39, v_0_1_0_40, v_0_1_0_41, v_0_1_0_42, v_0_1_0_43, v_0_1_0_44, v_0_1_0_45, v_0_1_0_46, v_0_1_0_47, v_0_1_0_48, v_0_1_0_49, v_0_1_0_50, v_0_1_0_51, v_0_1_0_52, v_0_1_0_53, v_0_1_0_54, v_0_1_0_55, v_0_1_0_56, v_0_1_0_57, v_0_1_0_58, v_0_1_0_59, v_0_1_0_60, v_0_1_0_61, v_0_1_0_62, v_0_1_0_63, v_0_1_0_64, v_0_1_0_65, v_0_1_0_66, v_0_1_0_67, v_0_1_0_68, v_0_1_0_69, v_0_1_0_70, v_0_1_0_71, v_0_1_0_72, v_0_1_0_73, v_0_1_0_74, v_0_1_0_75, v_0_1_0_76, v_0_1_0_77, v_0_1_0_78, v_0_1_0_79, v_0_1_0_80, v_0_1_0_81, v_0_1_0_82, v_0_1_0_83, v_0_1_0_84, v_0_1_0_85],
})
+20
@@ -0,0 +1,20 @@
import { VersionInfo } from '@start9labs/start-sdk'

// Matrix intake — fuzzy investor matching. GET /api/intake/match now returns, alongside the
// exact `match`, a ranked list of `candidates`: fuzzy near-matches (deterministic difflib name
// similarity + token overlap + email edit-distance ≤ 2, legal-suffix-aware) the intake bot can
// surface in-thread for the human to pick from — so a near-duplicate name ("Charlie"/"Charles",
// "Acme Capital"/"Acme Capital LLC", a one-char email typo) no longer silently creates a second
// investor. Server-side only (the bot's disambiguation + conversational-edit UX ships on the
// Spark, not in the s9pk). Code-only, no schema change.
export const v_0_1_0_86 = VersionInfo.of({
  version: '0.1.0:86',
  releaseNotes: {
    en_US: [
      'Matrix intake: the new-vs-existing lookup now also returns ranked fuzzy near-matches,',
      'so a mistyped or near-duplicate investor name is surfaced for confirmation instead of',
      'silently creating a duplicate. No data changes.',
    ].join(' '),
  },
  migrations: { up: async () => {}, down: async () => {} },
})