Files
Keysat 8d839e3714 v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API
- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:57 -05:00

203 lines
8.2 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Embeddings + Retrieval through Spark Control (v0.15.0)
Spark Control now fronts a local RAG stack so your agent/CRM system can do
dense embeddings, reranking, and hybrid vector search against one trusted host
— same TLS cert and allowlist as the LLM and audio endpoints.
## What runs where
| Component | Host | Port | Role |
|---|---|---|---|
| **spark-embed** | Spark 2 (GPU) | 8088 | `BAAI/bge-m3` dense embeddings (1024-d) + `BAAI/bge-reranker-v2-m3` cross-encoder rerank |
| **Qdrant** | Spark 2 (CPU) | 6333/6334 | Vector storage, hybrid dense+sparse retrieval, RRF fusion, payload filtering |
| **Spark Control** | Start9 | (your LAN URL) | Proxies all of the above behind one host |
`spark-embed` is a small FastAPI server built **from the NGC PyTorch image**
(the torch we've proven runs on the GB10's sm_121 GPU). We did *not* use HF Text
Embeddings Inference because as of 2026 it ships no arm64 CUDA image. No
torchaudio, no flash-attn — bge-m3 + the reranker are plain XLM-RoBERTa
encoders.
## Endpoints (all on the Spark Control host)
### `POST /v1/embeddings` — dense embeddings (OpenAI-compatible)
```json
// request
{ "input": "text to embed", "model": "BAAI/bge-m3" }
// or { "input": ["batch", "of", "texts"] }
// response (OpenAI shape)
{
"object": "list",
"data": [ { "object": "embedding", "index": 0, "embedding": [0.01, ...1024 floats] } ],
"model": "BAAI/bge-m3",
"usage": { "prompt_tokens": 0, "total_tokens": 0 }
}
```
Vectors are L2-normalized by default (cosine == dot product). Works with the
stock OpenAI Python/JS client by pointing `base_url` at Spark Control.
### `POST /v1/rerank` — cross-encoder rerank
```json
// request
{ "query": "did Brightwater commit?", "documents": ["chunk a", "chunk b", ...],
"top_n": 5, "return_documents": false }
// response (Cohere-ish)
{ "object": "rerank.result", "model": "BAAI/bge-reranker-v2-m3",
"results": [ { "index": 3, "relevance_score": 5.21 }, { "index": 0, "relevance_score": 1.04 }, ... ] }
```
`relevance_score` is the reranker's raw logit (higher = more relevant; unbounded,
roughly 10..+10). Sort desc. Send a candidate set (≤200), not your whole corpus.
### `POST /api/search` — orchestrated hybrid retrieval
One call: embeds the query (dense), retrieves from Qdrant (hybrid dense+sparse
with RRF **when you supply a sparse vector**, else dense-only), optionally
cross-encoder reranks, returns top_k.
```json
// request
{
"query": "Did Brightwater commit to the Fund III close in Q1?",
"collection": "crm_chunks",
"top_k": 8,
"retrieve_n": 80, // first-stage candidates (default max(50, top_k*10))
"sparse": { "indices": [12, 904], "values": [0.7, 1.2] }, // optional BM25 vector for hybrid
"fusion": "rrf", // or "dbsf"
"filter": { // raw Qdrant filter (pre-filter, see below)
"must": [ { "key": "lp_id", "match": { "value": "lp_0427" } } ]
},
"rerank": true,
"text_field": "text", // payload field holding the chunk text
"with_payload": true
}
// response
{
"object": "search.result_list",
"model": "BAAI/bge-m3+bge-reranker-v2-m3",
"query": "...",
"collection": "crm_chunks",
"reranked": true,
"data": [
{ "object": "search.result", "index": 0, "id": "chunk_99c1_3",
"score": 5.21, "fused_score": 0.41, "rerank_score": 5.21,
"text": "...Brightwater confirmed verbal commitment...",
"payload": { "lp_id": "lp_0427", "doc_type": "email", "date_ts": 1771027200, ... } }
],
"usage": { "embed_ms": 21, "qdrant_ms": 9, "rerank_ms": 140, "candidates": 80 }
}
```
`score` is the rerank score when reranked, else the fused/dense score.
## The sparse (BM25) story — important for entity-heavy data
bge-m3's dense vectors carry semantic meaning, but exact entity matches
(fund names, tickers, people) are a **lexical** signal. For that you want hybrid
dense + sparse. Two design facts:
1. **spark-embed serves dense only.** TEI/Infinity can't emit bge-m3's learned
sparse either, and bge-m3's pretrained sparse weights underweight novel
entity tokens (brand-new fund names). So we don't use bge-m3 sparse.
2. **Use Qdrant BM25 with its built-in IDF**, generated client-side. This learns
IDF over *your* corpus, so novel entity strings get correct weight.
### Your ingest pipeline (the part you own)
For each chunk, produce and upsert **both** vectors:
- **dense**: call Spark Control `POST /v1/embeddings` → 1024-d vector.
- **sparse**: run [FastEmbed](https://github.com/qdrant/fastembed) BM25 client-side:
```python
from fastembed import SparseTextEmbedding
bm25 = SparseTextEmbedding(model_name="Qdrant/bm25")
sp = next(bm25.embed([chunk_text])) # -> {indices, values}
```
Create the collection with a named dense vector and a named sparse vector that
uses `modifier: idf` (so Qdrant applies IDF server-side):
```json
PUT /collections/crm_chunks
{ "vectors": { "dense": { "size": 1024, "distance": "Cosine" } },
"sparse_vectors": { "sparse": { "modifier": "idf" } } }
```
Add payload indexes for your filter fields so filtered queries stay fast:
```json
PUT /collections/crm_chunks/index { "field_name": "lp_id", "field_schema": "keyword" }
PUT /collections/crm_chunks/index { "field_name": "doc_type", "field_schema": "keyword" }
PUT /collections/crm_chunks/index { "field_name": "date_ts", "field_schema": "integer" }
```
Upsert points with both vectors + payload:
```json
PUT /collections/crm_chunks/points
{ "points": [ { "id": 1,
"vector": { "dense": [...1024...], "sparse": { "indices": [...], "values": [...] } },
"payload": { "lp_id": "lp_0427", "lp_name": "Brightwater Capital",
"doc_type": "email", "date_ts": 1771027200, "text": "..." } } ] }
```
### At query time
Generate the query's BM25 sparse vector with the **same** FastEmbed model, and
pass it to `/api/search` as `sparse`. Spark Control fuses dense+sparse with RRF
inside Qdrant, then reranks. If you omit `sparse`, you get dense + rerank
(still good, just no exact-lexical leg).
> You can talk to Qdrant directly (`http://<spark2>:6333`) for collection
> management and upserts — that's the natural home for ingest. `/api/search` is
> the convenience path for the read side so your agents hit one host. If you'd
> rather Spark Control proxy Qdrant admin too, say the word.
## Chunking guidance (entity-heavy CRM)
- One chunk per email / note / transcript-turn; one chunk per memo *section* —
don't split mid-thought.
- Keep entity + date as **payload fields** (filterable), not buried in embedded text.
- Time-aware: don't merge a 2022 note with a 2026 update in one chunk; store
`date_ts` so you can pre-filter and recency-boost.
- Resolve entity-name variants ("J. Smith" / "Jonathan Smith" / "JS") to one
canonical `lp_id` at ingest, or retrieval fragments across variants.
## Pre-filtering
Most agent queries aren't pure semantic — they're "recent emails with Bob about
energy". Pass a Qdrant `filter` to restrict the search space *before* vector
scoring (faster + more precise). Qdrant also supports server-side recency
boosting via Formula/Score-Boosting if you want decay-by-`date_ts` without an
app round-trip — ask and we can expose it through `/api/search`.
## Service discovery + health
- `GET /api/endpoints` includes `embeddings` and `qdrant` entries (base_url +
ready flag) alongside `vllm`, `parakeet`, `kokoro`.
- The dashboard shows Embeddings + Qdrant health dots and Start/Restart/Stop
controls in the Always-On Services panel.
- Spark Control will auto-restart a wedged **embedding** container (GPU CUDA
wedge recovery, like the audio services) but **never** auto-restarts Qdrant —
it holds your only copy of the index, so a restart is surfaced for manual
action instead.
## Capacity
At your scale (tens of thousands of chunks now → low hundreds of thousands),
this is trivial: ~0.61.2 GB of dense vectors at 300k chunks, spark-embed +
Qdrant together use a few GB of GPU/RAM on Spark 2's 122 GB. Full re-embed of
300k chunks is ~815 min, so re-indexing is cheap if you change models.
## Model upgrade path
If dense recall becomes the bottleneck, `Qwen3-Embedding-4B` (Matryoshka-trained,
tops 2026 MTEB) is the A/B candidate — same `/v1/embeddings` contract, swap the
model in spark-embed. bge-m3 is the lower-risk starting point and ships now.