Keysat 090416f05e docs: extract subsystem guides; keep AGENTS.md to whole-repo facts
Move subsystem mechanics (migrations, thesis gate, redaction, ingest,
email, packaging) out of AGENTS.md into docs/guides/<topic>.md, each
scoped by paths: frontmatter and symlinked from .claude/rules/ so Claude
Code lazy-loads them. AGENTS.md keeps whole-repo facts and universal
guardrails plus a one-line index per guide. Fix the inaccurate
".claude/ is gitignored" note — it is tracked.
2026-06-12 16:46:49 -05:00

paths: backend/ingest/**

Ingest, retrieval & Spark/Qdrant

Read this before editing the ingest pipeline or retrieval modes.

Pipeline

  • backend/ingest/ implements the chunk → embed → Qdrant pipeline plus the retrieval modes (search.py, embed.py, qdrant_io.py, sparse.py, entity_resolution.py).
  • Local models (bge-m3 embeddings, bge-reranker-v2-m3, /api/search) always run via Spark Control at SPARK_CONTROL_URL, never directly against a Spark. The retrieval/embeddings contract is docs/EMBEDDINGS.md; honor it.
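
The two bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in backend/ingest/: the helper names (`spark_control_url`, `chunk`, `ingest`), the fixed-size chunking, and the injected `embed`/`upsert` callables are all assumptions made for the example. The only names taken from this guide are SPARK_CONTROL_URL and /api/search; the real request and response shapes are whatever docs/EMBEDDINGS.md specifies.

```python
import os
import uuid
from typing import Callable
from urllib.parse import urljoin


def spark_control_url(path: str) -> str:
    """Build a URL under SPARK_CONTROL_URL (hypothetical helper).

    Failing loudly when the variable is unset keeps callers from
    falling back to a hard-coded Spark address.
    """
    base = os.environ.get("SPARK_CONTROL_URL")
    if not base:
        raise RuntimeError("SPARK_CONTROL_URL is not set")
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))


def chunk(text: str, size: int = 200) -> list[str]:
    """Split text into fixed-size pieces (assumed strategy, for illustration)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def ingest(
    doc_id: str,
    text: str,
    embed: Callable[[list[str]], list[list[float]]],
    upsert: Callable[[list[dict]], None],
    size: int = 200,
) -> int:
    """chunk -> embed -> upsert, with the model and the index injected."""
    chunks = chunk(text, size=size)
    vectors = embed(chunks)  # in the real pipeline this call goes through Spark Control
    points = [
        {
            # Qdrant point IDs must be unsigned integers or UUIDs; a UUIDv5
            # derived from doc_id and position stays stable across rebuilds.
            "id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}:{i}")),
            "vector": vec,
            "payload": {"doc_id": doc_id, "text": piece},
        }
        for i, (piece, vec) in enumerate(zip(chunks, vectors))
    ]
    upsert(points)
    return len(points)
```

Injecting `embed` and `upsert` keeps the sketch runnable without a network, and it makes the routing rule easy to check: the only place an address is ever constructed is `spark_control_url`.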

Hard rule

  • Never treat Qdrant (or any derived index) as source of truth. The CRM / SQLite is canonical and the index is rebuildable from it. Code may drop and rebuild the Qdrant collection; it must never read a fact from Qdrant that isn't recoverable from SQLite.
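
One way to picture the hard rule is the sketch below, under loud assumptions: `InMemoryIndex` is a stand-in for a Qdrant collection, and the `contacts(id, name)` schema is invented for the example (the real schema lives in the CRM). The index is dropped and rebuilt entirely from SQLite, and a lookup uses the index only to find which row is relevant before reading the fact itself from SQLite.

```python
import sqlite3
from typing import Optional


class InMemoryIndex:
    """Stand-in for a Qdrant collection (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.points: dict[int, dict] = {}

    def drop(self) -> None:
        self.points.clear()

    def upsert(self, point_id: int, payload: dict) -> None:
        self.points[point_id] = payload


def rebuild_index(conn: sqlite3.Connection, index: InMemoryIndex) -> int:
    """Drop the derived index and repopulate it from SQLite alone."""
    index.drop()
    count = 0
    # Assumed schema: contacts(id INTEGER PRIMARY KEY, name TEXT).
    for row_id, name in conn.execute("SELECT id, name FROM contacts ORDER BY id"):
        index.upsert(row_id, {"contact_id": row_id, "name": name})
        count += 1
    return count


def lookup_name(
    conn: sqlite3.Connection, index: InMemoryIndex, point_id: int
) -> Optional[str]:
    """Use the index as a pointer only; the fact comes from SQLite."""
    payload = index.points.get(point_id)
    if payload is None:
        return None
    row = conn.execute(
        "SELECT name FROM contacts WHERE id = ?", (payload["contact_id"],)
    ).fetchone()
    return row[0] if row else None
```

Because `lookup_name` never returns the payload's copy of the name, a stale index can at worst point at the wrong row; it cannot surface a fact that SQLite no longer holds.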

Entity resolution

Reconciling the two investor models (classic contacts/lp_profiles vs. the fundraising_* grid) into canonical IDs is the core entity-resolution task. See backend/entity_*.py and docs/crm-overview.md.
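
As a rough illustration of what mapping two models onto canonical IDs can look like, here is a small sketch. Everything in it is hypothetical: the `InvestorRecord` fields, the matching rule (normalized email first, normalized name otherwise), and the `inv-N` ID format are assumptions for the example, not the logic in backend/entity_*.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InvestorRecord:
    source: str            # e.g. "contacts" or "fundraising" (labels assumed)
    source_id: int
    name: str
    email: Optional[str] = None


def match_key(rec: InvestorRecord) -> str:
    """Prefer a normalized email; fall back to a whitespace-collapsed name."""
    if rec.email:
        return "email:" + rec.email.strip().lower()
    return "name:" + " ".join(rec.name.lower().split())


def resolve(records: list[InvestorRecord]) -> dict[tuple[str, int], str]:
    """Map (source, source_id) to a canonical ID shared by matching records."""
    canonical: dict[str, str] = {}
    mapping: dict[tuple[str, int], str] = {}
    for rec in records:
        key = match_key(rec)
        # First record seen for a key mints the canonical ID; later ones reuse it.
        cid = canonical.setdefault(key, f"inv-{len(canonical) + 1}")
        mapping[(rec.source, rec.source_id)] = cid
    return mapping
```

A single exact key like this is deliberately naive: a record that has an email and one for the same person that has only a name will not be merged, which is the kind of case a real reconciliation pass has to handle.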