ten31-database/docs/guides/spark-ingest.md
Keysat 090416f05e docs: extract subsystem guides; keep AGENTS.md to whole-repo facts
Move subsystem mechanics (migrations, thesis gate, redaction, ingest,
email, packaging) out of AGENTS.md into docs/guides/<topic>.md, each
scoped by paths: frontmatter and symlinked from .claude/rules/ so Claude
Code lazy-loads them. AGENTS.md keeps whole-repo facts and universal
guardrails plus a one-line index per guide. Fix the inaccurate
".claude/ is gitignored" note — it is tracked.
2026-06-12 16:46:49 -05:00


---
paths:
- backend/ingest/**
---
# Ingest, retrieval & Spark/Qdrant
Read this before editing the ingest pipeline or retrieval modes.
## Pipeline
- `backend/ingest/` implements the chunk → embed → Qdrant pipeline plus the retrieval modes (`search.py`, `embed.py`, `qdrant_io.py`, `sparse.py`, `entity_resolution.py`).
- Local model calls (bge-m3 embeddings, bge-reranker-v2-m3, `/api/search`) **always go through Spark Control** at `SPARK_CONTROL_URL`, never directly to a Spark. The retrieval/embeddings contract is `docs/EMBEDDINGS.md`; honor it.
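As a minimal sketch of the routing rule above, a caller can build every model URL from `SPARK_CONTROL_URL` and fail loudly when it is unset, rather than falling back to a Spark address. The helper name `spark_control_endpoint` is hypothetical and not taken from the repo; request and response shapes are defined by `docs/EMBEDDINGS.md` and are not shown here.

```python
import os


def spark_control_endpoint(path: str) -> str:
    """Return a URL under SPARK_CONTROL_URL (hypothetical helper, not repo code)."""
    base = os.environ.get("SPARK_CONTROL_URL")
    if not base:
        # No fallback: guessing a direct Spark address would break the rule.
        raise RuntimeError("SPARK_CONTROL_URL is not set")
    # Join with exactly one slash between base and path.
    return base.rstrip("/") + "/" + path.lstrip("/")
```

Centralizing URL construction in one place like this means a direct Spark host never needs to appear in ingest or retrieval code.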
## Hard rule
- **Never treat Qdrant (or any derived index) as source of truth.** The CRM / SQLite is canonical and the index is rebuildable from it. Code may drop and rebuild the Qdrant collection; it must never read a fact from Qdrant that isn't recoverable from SQLite.
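The rule above can be illustrated with a small drop-and-rebuild sketch. It makes several assumptions: the `documents` table and its columns are hypothetical, and a plain `dict` stands in for the Qdrant collection so the example runs without a Qdrant client. The point is only that the index is a pure function of what SQLite holds.

```python
import sqlite3


def rebuild_index(conn: sqlite3.Connection) -> dict[int, str]:
    """Build the derived index from SQLite rows only (dict stands in for Qdrant)."""
    index: dict[int, str] = {}
    # Hypothetical table and columns; the real schema lives in the CRM database.
    for row_id, body in conn.execute("SELECT id, body FROM documents ORDER BY id"):
        index[row_id] = body
    return index


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents (id, body) VALUES (?, ?)",
    [(1, "alpha"), (2, "beta")],
)

index = rebuild_index(conn)
index.clear()                # simulate dropping the derived index
index = rebuild_index(conn)  # fully recovered from the canonical store
```

Because nothing is ever written to the index that did not come from SQLite, dropping it loses no facts.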
## Entity resolution
The two-investor-model reconciliation (classic `contacts`/`lp_profiles` vs the `fundraising_*` grid → canonical IDs) is the core entity-resolution task. See `backend/entity_*.py` and `docs/crm-overview.md`.