Fix people double-count + duplicate-queue explosion (v0.1.0:51)
Root cause: grid contacts (fundraising_contacts) are the SAME people as the contacts table (the app syncs them by name/email), but resolution matched grid rows by (name + investor-canon), where the two sides derive the investor key from different tables that rarely line up. So nearly every grid contact minted a duplicate person (715 + ~692 ≈ 1406), and the duplicate finder then flagged each twin against its real self (~676 candidates).

Fix (entity_resolution.py):
- Grid pass matches a grid contact to its existing contacts-table person by PROVABLE keys only (exact email, else exact name within the same investor) and records membership; on a miss it MINTS NOTHING (the old else-branch mint was the double-count source, and guessing by name across firms risks binding two different same-named people).
- Targeted, audited cleanup soft-deletes leftover grid-only "twins" (person rows with no 'contacts' link) and superseded pre-:48 'lp'/'organization' rows, guarded so any row carrying enrichment/human data is never dropped (guardrail #3); the tombstoned ids are logged to interaction_log (guardrail #5).
- _upsert_entity clears deleted_at on conflict so a re-emitted id is un-tombstoned (no permanent burial); fuzzy-merge losers stay buried via _redirect.

entity_merge.py / server.py: the duplicate queue + pending count now filter to candidates whose both sides are still live, so self-healed twins drop out.

Verified: offline reproduction test (backend/ingest/test_entity_resolution.py, 10/10) reproduces the 1406-style doubling and proves it collapses; no regression on the synthetic dev set; two adversarial review passes.

Known pre-existing identity-key weaknesses (same name+firm+no email collision; shared role inbox over-link) are unchanged by this fix and will be resolved structurally by the contact_id link in the grid/contacts unification.

Run "Build search index" after upgrading to recompute the canonical layer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
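For readers without entity_resolution.py at hand, the provable-key rule described in the message can be sketched roughly as below. This is an illustration only, not the code in this commit: the function name `match_grid_contact`, the dict fields (`id`, `name`, `email`, `investor_id`), the case/whitespace normalization, and the refusal of ambiguous hits are all assumptions made for the sketch.

```python
from __future__ import annotations

from typing import Optional


def _norm_email(value: Optional[str]) -> str:
    # Assumed normalization: trim and lowercase before comparing.
    return (value or "").strip().lower()


def _norm_name(value: Optional[str]) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join((value or "").lower().split())


def match_grid_contact(grid_row: dict, people: list[dict]) -> Optional[str]:
    """Return the id of the existing person this grid row provably is, else None.

    Hypothetical helper. On a miss it returns None so the caller mints nothing.
    """
    # Key 1: exact email.
    email = _norm_email(grid_row.get("email"))
    if email:
        hits = [p["id"] for p in people if _norm_email(p.get("email")) == email]
        if len(hits) == 1:  # assumption: an ambiguous hit is treated as a miss
            return hits[0]

    # Key 2: exact name, but only within the same investor.
    name = _norm_name(grid_row.get("name"))
    investor = grid_row.get("investor_id")
    if name and investor is not None:
        hits = [
            p["id"]
            for p in people
            if p.get("investor_id") == investor and _norm_name(p.get("name")) == name
        ]
        if len(hits) == 1:
            return hits[0]

    # No provable key: never guess by name across firms.
    return None


people = [
    {"id": "p1", "name": "Ada Lovelace", "email": "ada@example.com", "investor_id": 1},
    {"id": "p2", "name": "Ada Lovelace", "email": None, "investor_id": 2},
]
by_email = match_grid_contact(
    {"name": "A. Lovelace", "email": "ADA@example.com ", "investor_id": 9}, people
)
by_name = match_grid_contact(
    {"name": "ada  lovelace", "email": None, "investor_id": 2}, people
)
miss = match_grid_contact(
    {"name": "Ada Lovelace", "email": None, "investor_id": 3}, people
)
```

The third call is the case the old else-branch got wrong: a same-named contact at a different firm returns None rather than being bound to either existing person or minted as a new one.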
@@ -31,8 +31,16 @@ def _log(c, actor_id, action, target_id, payload):
 
 def list_candidates(db, status="pending"):
     c = _conn(db)
+    # Only surface candidates whose BOTH sides are still live canonical entities.
+    # When entity resolution self-heals (a grid "twin" matched back, so its
+    # duplicate entity is soft-deleted), the candidate that paired them becomes
+    # moot — hide it instead of asking a human to adjudicate a tombstone.
     rows = [dict(r) for r in c.execute(
-        "SELECT * FROM entity_merge_candidates WHERE status=? ORDER BY confidence DESC, created_at DESC", (status,))]
+        """SELECT mc.* FROM entity_merge_candidates mc
+           JOIN canonical_entities a ON a.id = mc.entity_a AND a.deleted_at IS NULL
+           JOIN canonical_entities b ON b.id = mc.entity_b AND b.deleted_at IS NULL
+           WHERE mc.status=?
+           ORDER BY mc.confidence DESC, mc.created_at DESC""", (status,))]
     c.close()
     return {"candidates": rows, "count": len(rows)}
 
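The effect of the new query, together with the un-tombstone behaviour the message attributes to _upsert_entity, can be tried against an in-memory SQLite database. This is a minimal sketch under stated assumptions: the table layouts below are hypothetical (only `id`, `deleted_at`, `entity_a`, `entity_b`, `status`, `confidence` and `created_at` appear in the diff), the sample rows are invented, and the upsert statement is an assumed shape, not the project's actual _upsert_entity.

```python
import sqlite3

# Query copied from the hunk above.
LIVE_CANDIDATES_SQL = """
SELECT mc.* FROM entity_merge_candidates mc
JOIN canonical_entities a ON a.id = mc.entity_a AND a.deleted_at IS NULL
JOIN canonical_entities b ON b.id = mc.entity_b AND b.deleted_at IS NULL
WHERE mc.status=?
ORDER BY mc.confidence DESC, mc.created_at DESC
"""


def build_demo_db() -> sqlite3.Connection:
    # Hypothetical minimal schema and invented rows, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(
        """
        CREATE TABLE canonical_entities (
            id TEXT PRIMARY KEY, name TEXT, deleted_at TEXT);
        CREATE TABLE entity_merge_candidates (
            id INTEGER PRIMARY KEY, entity_a TEXT, entity_b TEXT,
            status TEXT, confidence REAL, created_at TEXT);
        INSERT INTO canonical_entities VALUES
            ('p1', 'Ada Lovelace', NULL),
            ('p1-twin', 'Ada Lovelace', NULL),
            ('p2', 'Grace Hopper', NULL),
            ('p3', 'Grace M. Hopper', NULL);
        INSERT INTO entity_merge_candidates VALUES
            (1, 'p1', 'p1-twin', 'pending', 0.99, '2024-01-01'),
            (2, 'p2', 'p3', 'pending', 0.80, '2024-01-02');
        """
    )
    return conn


def live_candidate_ids(conn: sqlite3.Connection, status: str = "pending") -> list:
    return [row["id"] for row in conn.execute(LIVE_CANDIDATES_SQL, (status,))]


conn = build_demo_db()
before = live_candidate_ids(conn)

# Soft-delete the twin: its candidate should drop out of the queue.
conn.execute(
    "UPDATE canonical_entities SET deleted_at = '2024-02-01' WHERE id = 'p1-twin'"
)
after_tombstone = live_candidate_ids(conn)

# Assumed upsert shape: re-emitting the same id clears deleted_at on conflict.
conn.execute(
    "INSERT INTO canonical_entities (id, name, deleted_at) VALUES (?, ?, NULL) "
    "ON CONFLICT(id) DO UPDATE SET name = excluded.name, deleted_at = NULL",
    ("p1-twin", "Ada Lovelace"),
)
after_reemit = live_candidate_ids(conn)
conn.close()
```

Because the liveness check sits in the JOIN conditions, a candidate row is never modified when one side is tombstoned; it is simply not returned, and it reappears if the entity is later re-emitted.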