Move subsystem mechanics (migrations, thesis gate, redaction, ingest, email, packaging) out of AGENTS.md into docs/guides/<topic>.md, each scoped by paths: frontmatter and symlinked from .claude/rules/ so Claude Code lazy-loads them. AGENTS.md keeps whole-repo facts and universal guardrails plus a one-line index per guide. Fix the inaccurate ".claude/ is gitignored" note — it is tracked.
Ingest, retrieval & Spark/Qdrant
Read this before editing the ingest pipeline or retrieval modes.
Pipeline
- backend/ingest/ is chunk → embed → Qdrant plus retrieval modes (search.py, embed.py, qdrant_io.py, sparse.py, entity_resolution.py).
- Local models (bge-m3 embeddings, bge-reranker-v2-m3, /api/search) always run via Spark Control, never against a Spark directly (SPARK_CONTROL_URL). The retrieval/embeddings contract is docs/EMBEDDINGS.md; honor it.
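As a rough illustration of the chunk → embed → index flow, here is a minimal sketch. It is not the code in backend/ingest/: the function names, the character-window chunking, the hash-based `fake_embed` stand-in, and the plain dict used in place of a Qdrant collection are all assumptions made for the example. The only detail taken from this guide is that the endpoint comes from SPARK_CONTROL_URL rather than from a Spark address.

```python
import hashlib
import os

Vector = list[float]


def spark_control_url() -> str:
    """Resolve the Spark Control endpoint; never fall back to a Spark directly."""
    url = os.environ.get("SPARK_CONTROL_URL")
    if not url:
        raise RuntimeError("SPARK_CONTROL_URL is not set; refusing to guess a Spark address")
    return url.rstrip("/")


def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical strategy)."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def fake_embed(chunks: list[str]) -> list[Vector]:
    """Stand-in embedder for the sketch; NOT bge-m3 and makes no network call."""
    vectors = []
    for piece in chunks:
        digest = hashlib.sha256(piece.encode("utf-8")).digest()
        vectors.append([b / 255 for b in digest[:4]])
    return vectors


def ingest(doc_id: str, text: str, embed, index: dict) -> int:
    """Chunk, embed, and upsert into `index` (a dict standing in for Qdrant)."""
    pieces = chunk(text)
    vectors = embed(pieces)
    for n, (piece, vec) in enumerate(zip(pieces, vectors)):
        index[f"{doc_id}:{n}"] = {"vector": vec, "text": piece}
    return len(pieces)
```

In a real pipeline the `embed` argument would presumably be a client that posts to the URL returned by `spark_control_url()`; the request shape is defined by docs/EMBEDDINGS.md and is deliberately not guessed at here.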
Hard rule
- Never treat Qdrant (or any derived index) as source of truth. The CRM / SQLite is canonical and the index is rebuildable from it. Code may drop and rebuild the Qdrant collection; it must never read a fact from Qdrant that isn't recoverable from SQLite.
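One way to picture the rule: the index only ever yields candidate IDs, and every fact returned to a caller is re-read from SQLite. The sketch below assumes a hypothetical `contacts(id, name)` table and uses a plain dict in place of the Qdrant collection; neither reflects the real schema or client code.

```python
import sqlite3


def rebuild_index(conn: sqlite3.Connection, index: dict) -> int:
    """Drop the derived index and rebuild it entirely from SQLite."""
    index.clear()  # safe: nothing in the index is canonical
    for row_id, name in conn.execute("SELECT id, name FROM contacts ORDER BY id"):
        index[row_id] = {"name": name}
    return len(index)


def search(conn: sqlite3.Connection, index: dict, term: str) -> list[tuple[int, str]]:
    """Use the index to find candidate IDs, then read the facts from SQLite."""
    hits = [i for i, payload in index.items() if term.lower() in payload["name"].lower()]
    if not hits:
        return []
    marks = ",".join("?" * len(hits))
    rows = conn.execute(
        f"SELECT id, name FROM contacts WHERE id IN ({marks}) ORDER BY id", hits
    )
    return rows.fetchall()
```

Because `search` returns rows from SQLite rather than index payloads, a stale index can at worst miss or over-select candidates; it cannot surface a fact that SQLite does not hold.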
Entity resolution
The two-investor-model reconciliation (classic contacts/lp_profiles vs the fundraising_* grid → canonical IDs) is the core entity-resolution task. See backend/entity_*.py and docs/crm-overview.md.
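To make the shape of that task concrete, here is a deliberately simple sketch that maps rows from two sources onto shared canonical IDs by exact match on a normalized name. The row fields (`id`, `name`), the suffix list, and the matching strategy are assumptions for illustration; the actual logic lives in backend/entity_*.py and may work quite differently.

```python
import re

# Hypothetical legal-suffix stoplist used only for this sketch.
SUFFIXES = {"llc", "lp", "inc", "ltd"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in SUFFIXES)


def reconcile(classic: list[dict], grid: list[dict]) -> dict[tuple[str, int], int]:
    """Map (source, source_id) to a canonical ID shared across both models."""
    canonical: dict[str, int] = {}
    mapping: dict[tuple[str, int], int] = {}
    for source, rows in (("classic", classic), ("grid", grid)):
        for row in rows:
            key = normalize(row["name"])
            # First sighting of a key allocates the next canonical ID.
            cid = canonical.setdefault(key, len(canonical) + 1)
            mapping[(source, row["id"])] = cid
    return mapping
```

Exact matching on a normalized key is the crudest workable option; it merges "Acme Capital, LLC" with "ACME Capital" but would not catch typos or renamed entities.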