Phase 0 foundation: canonical schema, ingest pipeline, CRM MCP server
Workstream A–C substrate for the Ten31 agentic system: - A1: docs/crm-overview.md; CLAUDE.md conventions + guardrail #9 - A2: additive/reversible core migration (canonical_entities, entity_links, interaction_log, relationship_edges, soft-delete) + ledgered runner - B1/B3: chunking + deterministic entity resolution (backend/ingest) - B2: dense (bge-m3) + BM25 sparse ingest to Qdrant crm_chunks - C: CRM MCP server (reads, retrieval modes, logged writes) — no outbound tools - docs: redaction/re-hydration, Gmail enablement runbook - synthetic test data; .env.example; housekeeping (.gitignore, untrack crm.db, drop legacy files + start9/0.3.5) Verified end-to-end on synthetic data + live Sparks (hybrid > dense on entity queries). Real backfill runs on Ten31 infra; index holds synthetic data only. Branch snapshot also captures pre-existing working-tree changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,202 @@
|
||||
# Embeddings + Retrieval through Spark Control (v0.15.0)
|
||||
|
||||
Spark Control now fronts a local RAG stack so your agent/CRM system can do
|
||||
dense embeddings, reranking, and hybrid vector search against one trusted host
|
||||
— same TLS cert and allowlist as the LLM and audio endpoints.
|
||||
|
||||
## What runs where
|
||||
|
||||
| Component | Host | Port | Role |
|
||||
|---|---|---|---|
|
||||
| **spark-embed** | Spark 2 (GPU) | 8088 | `BAAI/bge-m3` dense embeddings (1024-d) + `BAAI/bge-reranker-v2-m3` cross-encoder rerank |
|
||||
| **Qdrant** | Spark 2 (CPU) | 6333/6334 | Vector storage, hybrid dense+sparse retrieval, RRF fusion, payload filtering |
|
||||
| **Spark Control** | Start9 | (your LAN URL) | Proxies all of the above behind one host |
|
||||
|
||||
`spark-embed` is a small FastAPI server built **from the NGC PyTorch image**
|
||||
(the torch we've proven runs on the GB10's sm_121 GPU). We did *not* use HF Text
|
||||
Embeddings Inference because as of 2026 it ships no arm64 CUDA image. No
|
||||
torchaudio, no flash-attn — bge-m3 + the reranker are plain XLM-RoBERTa
|
||||
encoders.
|
||||
|
||||
## Endpoints (all on the Spark Control host)
|
||||
|
||||
### `POST /v1/embeddings` — dense embeddings (OpenAI-compatible)
|
||||
|
||||
```json
|
||||
// request
|
||||
{ "input": "text to embed", "model": "BAAI/bge-m3" }
|
||||
// or { "input": ["batch", "of", "texts"] }
|
||||
|
||||
// response (OpenAI shape)
|
||||
{
|
||||
"object": "list",
|
||||
"data": [ { "object": "embedding", "index": 0, "embedding": [0.01, ...1024 floats] } ],
|
||||
"model": "BAAI/bge-m3",
|
||||
"usage": { "prompt_tokens": 0, "total_tokens": 0 }
|
||||
}
|
||||
```
|
||||
|
||||
Vectors are L2-normalized by default (cosine == dot product). Works with the
|
||||
stock OpenAI Python/JS client by pointing `base_url` at Spark Control.
|
||||
|
||||
### `POST /v1/rerank` — cross-encoder rerank
|
||||
|
||||
```json
|
||||
// request
|
||||
{ "query": "did Brightwater commit?", "documents": ["chunk a", "chunk b", ...],
|
||||
"top_n": 5, "return_documents": false }
|
||||
|
||||
// response (Cohere-ish)
|
||||
{ "object": "rerank.result", "model": "BAAI/bge-reranker-v2-m3",
|
||||
"results": [ { "index": 3, "relevance_score": 5.21 }, { "index": 0, "relevance_score": 1.04 }, ... ] }
|
||||
```
|
||||
|
||||
`relevance_score` is the reranker's raw logit (higher = more relevant; unbounded,
|
||||
roughly −10..+10). Sort desc. Send a candidate set (≤200), not your whole corpus.
|
||||
|
||||
### `POST /api/search` — orchestrated hybrid retrieval
|
||||
|
||||
One call: embeds the query (dense), retrieves from Qdrant (hybrid dense+sparse
|
||||
with RRF **when you supply a sparse vector**, else dense-only), optionally
|
||||
cross-encoder reranks, returns top_k.
|
||||
|
||||
```json
|
||||
// request
|
||||
{
|
||||
"query": "Did Brightwater commit to the Fund III close in Q1?",
|
||||
"collection": "crm_chunks",
|
||||
"top_k": 8,
|
||||
"retrieve_n": 80, // first-stage candidates (default max(50, top_k*10))
|
||||
"sparse": { "indices": [12, 904], "values": [0.7, 1.2] }, // optional BM25 vector for hybrid
|
||||
"fusion": "rrf", // or "dbsf"
|
||||
"filter": { // raw Qdrant filter (pre-filter, see below)
|
||||
"must": [ { "key": "lp_id", "match": { "value": "lp_0427" } } ]
|
||||
},
|
||||
"rerank": true,
|
||||
"text_field": "text", // payload field holding the chunk text
|
||||
"with_payload": true
|
||||
}
|
||||
|
||||
// response
|
||||
{
|
||||
"object": "search.result_list",
|
||||
"model": "BAAI/bge-m3+bge-reranker-v2-m3",
|
||||
"query": "...",
|
||||
"collection": "crm_chunks",
|
||||
"reranked": true,
|
||||
"data": [
|
||||
{ "object": "search.result", "index": 0, "id": "chunk_99c1_3",
|
||||
"score": 5.21, "fused_score": 0.41, "rerank_score": 5.21,
|
||||
"text": "...Brightwater confirmed verbal commitment...",
|
||||
"payload": { "lp_id": "lp_0427", "doc_type": "email", "date_ts": 1771027200, ... } }
|
||||
],
|
||||
"usage": { "embed_ms": 21, "qdrant_ms": 9, "rerank_ms": 140, "candidates": 80 }
|
||||
}
|
||||
```
|
||||
|
||||
`score` is the rerank score when reranked, else the fused/dense score.
|
||||
|
||||
## The sparse (BM25) story — important for entity-heavy data
|
||||
|
||||
bge-m3's dense vectors carry semantic meaning, but exact entity matches
|
||||
(fund names, tickers, people) are a **lexical** signal. For that you want hybrid
|
||||
dense + sparse. Two design facts:
|
||||
|
||||
1. **spark-embed serves dense only.** TEI/Infinity can't emit bge-m3's learned
|
||||
sparse either, and bge-m3's pretrained sparse weights underweight novel
|
||||
entity tokens (brand-new fund names). So we don't use bge-m3 sparse.
|
||||
2. **Use Qdrant BM25 with its built-in IDF**, generated client-side. This learns
|
||||
IDF over *your* corpus, so novel entity strings get correct weight.
|
||||
|
||||
### Your ingest pipeline (the part you own)
|
||||
|
||||
For each chunk, produce and upsert **both** vectors:
|
||||
|
||||
- **dense**: call Spark Control `POST /v1/embeddings` → 1024-d vector.
|
||||
- **sparse**: run [FastEmbed](https://github.com/qdrant/fastembed) BM25 client-side:
|
||||
```python
|
||||
from fastembed import SparseTextEmbedding
|
||||
bm25 = SparseTextEmbedding(model_name="Qdrant/bm25")
|
||||
sp = next(bm25.embed([chunk_text])) # -> {indices, values}
|
||||
```
|
||||
|
||||
Create the collection with a named dense vector and a named sparse vector that
|
||||
uses `modifier: idf` (so Qdrant applies IDF server-side):
|
||||
|
||||
```json
|
||||
PUT /collections/crm_chunks
|
||||
{ "vectors": { "dense": { "size": 1024, "distance": "Cosine" } },
|
||||
"sparse_vectors": { "sparse": { "modifier": "idf" } } }
|
||||
```
|
||||
|
||||
Add payload indexes for your filter fields so filtered queries stay fast:
|
||||
|
||||
```json
|
||||
PUT /collections/crm_chunks/index { "field_name": "lp_id", "field_schema": "keyword" }
|
||||
PUT /collections/crm_chunks/index { "field_name": "doc_type", "field_schema": "keyword" }
|
||||
PUT /collections/crm_chunks/index { "field_name": "date_ts", "field_schema": "integer" }
|
||||
```
|
||||
|
||||
Upsert points with both vectors + payload:
|
||||
|
||||
```json
|
||||
PUT /collections/crm_chunks/points
|
||||
{ "points": [ { "id": 1,
|
||||
"vector": { "dense": [...1024...], "sparse": { "indices": [...], "values": [...] } },
|
||||
"payload": { "lp_id": "lp_0427", "lp_name": "Brightwater Capital",
|
||||
"doc_type": "email", "date_ts": 1771027200, "text": "..." } } ] }
|
||||
```
|
||||
|
||||
### At query time
|
||||
|
||||
Generate the query's BM25 sparse vector with the **same** FastEmbed model, and
|
||||
pass it to `/api/search` as `sparse`. Spark Control fuses dense+sparse with RRF
|
||||
inside Qdrant, then reranks. If you omit `sparse`, you get dense + rerank
|
||||
(still good, just no exact-lexical leg).
|
||||
|
||||
> You can talk to Qdrant directly (`http://<spark2>:6333`) for collection
|
||||
> management and upserts — that's the natural home for ingest. `/api/search` is
|
||||
> the convenience path for the read side so your agents hit one host. If you'd
|
||||
> rather Spark Control proxy Qdrant admin too, say the word.
|
||||
|
||||
## Chunking guidance (entity-heavy CRM)
|
||||
|
||||
- One chunk per email / note / transcript-turn; one chunk per memo *section* —
|
||||
don't split mid-thought.
|
||||
- Keep entity + date as **payload fields** (filterable), not buried in embedded text.
|
||||
- Time-aware: don't merge a 2022 note with a 2026 update in one chunk; store
|
||||
`date_ts` so you can pre-filter and recency-boost.
|
||||
- Resolve entity-name variants ("J. Smith" / "Jonathan Smith" / "JS") to one
|
||||
canonical `lp_id` at ingest, or retrieval fragments across variants.
|
||||
|
||||
## Pre-filtering
|
||||
|
||||
Most agent queries aren't pure semantic — they're "recent emails with Bob about
|
||||
energy". Pass a Qdrant `filter` to restrict the search space *before* vector
|
||||
scoring (faster + more precise). Qdrant also supports server-side recency
|
||||
boosting via Formula/Score-Boosting if you want decay-by-`date_ts` without an
|
||||
app round-trip — ask and we can expose it through `/api/search`.
|
||||
|
||||
## Service discovery + health
|
||||
|
||||
- `GET /api/endpoints` includes `embeddings` and `qdrant` entries (base_url +
|
||||
ready flag) alongside `vllm`, `parakeet`, `kokoro`.
|
||||
- The dashboard shows Embeddings + Qdrant health dots and Start/Restart/Stop
|
||||
controls in the Always-On Services panel.
|
||||
- Spark Control will auto-restart a wedged **embedding** container (GPU CUDA
|
||||
wedge recovery, like the audio services) but **never** auto-restarts Qdrant —
|
||||
it holds your only copy of the index, so a restart is surfaced for manual
|
||||
action instead.
|
||||
|
||||
## Capacity
|
||||
|
||||
At your scale (tens of thousands of chunks now → low hundreds of thousands),
|
||||
this is trivial: ~0.6–1.2 GB of dense vectors at 300k chunks, spark-embed +
|
||||
Qdrant together use a few GB of GPU/RAM on Spark 2's 122 GB. Full re-embed of
|
||||
300k chunks is ~8–15 min, so re-indexing is cheap if you change models.
|
||||
|
||||
## Model upgrade path
|
||||
|
||||
If dense recall becomes the bottleneck, `Qwen3-Embedding-4B` (Matryoshka-trained,
|
||||
tops 2026 MTEB) is the A/B candidate — same `/v1/embeddings` contract, swap the
|
||||
model in spark-embed. bge-m3 is the lower-risk starting point and ships now.
|
||||
Reference in New Issue
Block a user