---
paths:
  - backend/ingest/**
---

# Ingest, retrieval & Spark/Qdrant

Read this before editing the ingest pipeline or retrieval modes.

## Pipeline

- `backend/ingest/` implements the chunk → embed → Qdrant pipeline plus the retrieval modes (`search.py`, `embed.py`, `qdrant_io.py`, `sparse.py`, `entity_resolution.py`).
- Local model calls (bge-m3 embeddings, bge-reranker-v2-m3, `/api/search`) **always go through Spark Control** (`SPARK_CONTROL_URL`), never to a Spark directly. The retrieval/embeddings contract is `docs/EMBEDDINGS.md`; honor it.
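
A minimal sketch of the routing rule above, assuming `SPARK_CONTROL_URL` is read from the environment. The helper name `spark_control_url` is hypothetical and not part of the repo; check `docs/EMBEDDINGS.md` for the actual contract.

```python
import os
from urllib.parse import urljoin


def spark_control_url(path: str) -> str:
    """Build a request URL rooted at Spark Control (hypothetical helper)."""
    base = os.environ.get("SPARK_CONTROL_URL")
    if not base:
        # Fail loudly instead of falling back to a direct Spark address.
        raise RuntimeError("SPARK_CONTROL_URL is not set")
    # Normalize slashes so any path prefix on the base URL is preserved.
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))
```

Raising when the variable is unset, rather than defaulting to some host, keeps a misconfigured environment from silently bypassing Spark Control.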

## Hard rule

- **Never treat Qdrant (or any derived index) as source of truth.** The CRM / SQLite is canonical and the index is rebuildable from it. Code may drop and rebuild the Qdrant collection; it must never read a fact from Qdrant that isn't recoverable from SQLite.
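
The rule can be illustrated with a drop-and-rebuild sketch. Everything here is an assumption for illustration: `InMemoryIndex` is a stand-in for a Qdrant collection (not the `qdrant_io.py` API), and the `contacts` columns `id` and `name` are a hypothetical schema.

```python
import sqlite3


class InMemoryIndex:
    """Stand-in for a derived index such as a Qdrant collection (hypothetical)."""

    def __init__(self) -> None:
        self.points: dict[int, dict] = {}

    def drop(self) -> None:
        self.points.clear()

    def upsert(self, point_id: int, payload: dict) -> None:
        self.points[point_id] = payload


def rebuild_index(conn: sqlite3.Connection, index: InMemoryIndex) -> int:
    """Drop the derived index and repopulate it from SQLite alone."""
    index.drop()
    rows = conn.execute("SELECT id, name FROM contacts ORDER BY id").fetchall()
    for row_id, name in rows:
        # The payload carries only what SQLite already holds, so dropping
        # the index again loses nothing.
        index.upsert(row_id, {"name": name})
    return len(rows)
```

If a rebuild like this cannot reproduce a value the code later reads from the index, that value is being treated as canonical and belongs in SQLite instead.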

## Entity resolution

Reconciling the two investor models (classic `contacts`/`lp_profiles` vs. the `fundraising_*` grid) into canonical IDs is the core entity-resolution task. See `backend/entity_*.py` and `docs/crm-overview.md`.