090416f05e
Move subsystem mechanics (migrations, thesis gate, redaction, ingest, email, packaging) out of AGENTS.md into docs/guides/<topic>.md, each scoped by paths: frontmatter and symlinked from .claude/rules/ so Claude Code lazy-loads them. AGENTS.md keeps whole-repo facts and universal guardrails plus a one-line index per guide. Fix the inaccurate ".claude/ is gitignored" note — it is tracked.
22 lines
1.0 KiB
Markdown
---
paths:
- backend/ingest/**
---
# Ingest, retrieval & Spark/Qdrant
Read this before editing the ingest pipeline or retrieval modes.
## Pipeline
- `backend/ingest/` implements the chunk → embed → Qdrant pipeline plus the retrieval modes (`search.py`, `embed.py`, `qdrant_io.py`, `sparse.py`, `entity_resolution.py`).
- Local model calls (bge-m3 embeddings, bge-reranker-v2-m3, `/api/search`) **always go through Spark Control** (`SPARK_CONTROL_URL`), never to a Spark directly. The retrieval/embeddings contract is `docs/EMBEDDINGS.md`; honor it.
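
As a minimal sketch of the Spark Control rule above, the helper below builds every model or search URL from `SPARK_CONTROL_URL` and fails loudly when the variable is missing, so no caller can fall back to a hard-coded Spark host. The function name `spark_control_endpoint` is hypothetical and is not taken from `backend/ingest/`; request and response shapes are deliberately left out because `docs/EMBEDDINGS.md` defines them.

```python
import os
from urllib.parse import urljoin, urlparse


def spark_control_endpoint(path: str) -> str:
    """Build a URL under SPARK_CONTROL_URL (hypothetical helper, not repo code)."""
    base = os.environ.get("SPARK_CONTROL_URL")
    if not base:
        # Refuse to guess: a silent default could point straight at a Spark.
        raise RuntimeError("SPARK_CONTROL_URL is not set")
    if urlparse(base).scheme not in ("http", "https"):
        raise RuntimeError(f"SPARK_CONTROL_URL must be an http(s) URL, got {base!r}")
    # Normalise slashes so the path is appended to the base instead of replacing it.
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))
```

A caller would then request `spark_control_endpoint("/api/search")` rather than assembling a host name itself.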
## Hard rule
- **Never treat Qdrant (or any derived index) as source of truth.** The CRM / SQLite is canonical and the index is rebuildable from it. Code may drop and rebuild the Qdrant collection; it must never read a fact from Qdrant that isn't recoverable from SQLite.
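
The rule above can be illustrated with a drop-and-rebuild routine, sketched here under stated assumptions: `InMemoryIndex` is a stand-in for a Qdrant collection (it is not the `qdrant_io.py` API), the `notes` table and its columns are hypothetical, and embedding is omitted. The point is only that every payload value is copied from SQLite, so discarding the index loses nothing.

```python
import sqlite3
from typing import Dict


class InMemoryIndex:
    """Stand-in for a derived index such as a Qdrant collection (hypothetical)."""

    def __init__(self) -> None:
        self.points: Dict[int, dict] = {}

    def drop(self) -> None:
        self.points.clear()

    def upsert(self, point_id: int, payload: dict) -> None:
        self.points[point_id] = payload


def rebuild_index(db: sqlite3.Connection, index: InMemoryIndex) -> int:
    """Drop the derived index and repopulate it from canonical SQLite rows."""
    index.drop()
    # `notes(id, body)` is an assumed table used only for illustration.
    rows = db.execute("SELECT id, body FROM notes ORDER BY id").fetchall()
    for row_id, body in rows:
        # The payload holds only values read from SQLite, never index-only facts.
        index.upsert(row_id, {"body": body})
    return len(rows)
```

Because the rebuild starts with `drop()`, any stale or index-only point disappears, which is exactly the behavior the rule requires code to tolerate.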
## Entity resolution
The two-investor-model reconciliation (classic `contacts`/`lp_profiles` vs the `fundraising_*` grid → canonical IDs) is the core entity-resolution task. See `backend/entity_*.py` and `docs/crm-overview.md`.
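
To make the reconciliation concrete, here is a toy sketch of mapping records from two source models onto shared canonical IDs. Everything in it is an assumption for illustration: the `SourceRecord` type, the source labels, and above all the matching key (a normalized name). The actual logic in `backend/entity_*.py` may match on different fields entirely.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class SourceRecord:
    source: str      # illustrative label, e.g. "contacts" or "fundraising"
    source_id: int
    name: str


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.casefold().split())


def assign_canonical_ids(records: Iterable[SourceRecord]) -> Dict[Tuple[str, int], int]:
    """Map (source, source_id) to a canonical ID shared by records with the same key."""
    canonical_by_key: Dict[str, int] = {}
    mapping: Dict[Tuple[str, int], int] = {}
    for rec in records:
        key = normalize(rec.name)
        # The first sighting of a key allocates the next canonical ID (1-based).
        canonical = canonical_by_key.setdefault(key, len(canonical_by_key) + 1)
        mapping[(rec.source, rec.source_id)] = canonical
    return mapping
```

With this sketch, a `contacts` row named "Acme  Capital" and a `fundraising` row named "acme capital" would receive the same canonical ID, while an unrelated name would get a new one.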