spark-control/docs/EMBEDDINGS.md
Keysat 8d839e3714 v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API
- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:57 -05:00


Embeddings + Retrieval through Spark Control (v0.15.0)

Spark Control now fronts a local RAG stack so your agent/CRM system can do dense embeddings, reranking, and hybrid vector search against one trusted host — same TLS cert and allowlist as the LLM and audio endpoints.

What runs where

| Component | Host | Port | Role |
|---|---|---|---|
| spark-embed | Spark 2 (GPU) | 8088 | BAAI/bge-m3 dense embeddings (1024-d) + BAAI/bge-reranker-v2-m3 cross-encoder rerank |
| Qdrant | Spark 2 (CPU) | 6333/6334 | Vector storage, hybrid dense+sparse retrieval, RRF fusion, payload filtering |
| Spark Control | Start9 | (your LAN URL) | Proxies all of the above behind one host |

spark-embed is a small FastAPI server built from the NGC PyTorch image (the torch we've proven runs on the GB10's sm_121 GPU). We did not use HF Text Embeddings Inference because as of 2026 it ships no arm64 CUDA image. No torchaudio, no flash-attn — bge-m3 + the reranker are plain XLM-RoBERTa encoders.

Endpoints (all on the Spark Control host)

POST /v1/embeddings — dense embeddings (OpenAI-compatible)

// request
{ "input": "text to embed", "model": "BAAI/bge-m3" }
// or  { "input": ["batch", "of", "texts"] }

// response (OpenAI shape)
{
  "object": "list",
  "data": [ { "object": "embedding", "index": 0, "embedding": [0.01, ...1024 floats] } ],
  "model": "BAAI/bge-m3",
  "usage": { "prompt_tokens": 0, "total_tokens": 0 }
}

Vectors are L2-normalized by default (cosine == dot product). Works with the stock OpenAI Python/JS client by pointing base_url at Spark Control.
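A minimal standard-library sketch of calling this endpoint. It assumes a hypothetical base URL `https://spark-control.local` (substitute your LAN URL) and no extra auth headers. The helper names `build_embeddings_request`, `parse_embeddings`, and `dot` are illustrative and not part of Spark Control.

```python
import json
import urllib.request

BASE_URL = "https://spark-control.local"  # hypothetical; use your Spark Control LAN URL


def build_embeddings_request(texts, model="BAAI/bge-m3", base_url=BASE_URL):
    """Build (but do not send) a POST /v1/embeddings request for a batch of texts."""
    body = json.dumps({"input": list(texts), "model": model}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_json):
    """Return vectors ordered by their 'index' field, i.e. in input order."""
    rows = sorted(response_json["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


# To actually send (needs network access to Spark Control):
#   with urllib.request.urlopen(build_embeddings_request(["hello"])) as resp:
#       vectors = parse_embeddings(json.load(resp))
```

Because the vectors come back normalized, `dot` is enough for similarity; no separate cosine step is needed.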

POST /v1/rerank — cross-encoder rerank

// request
{ "query": "did Brightwater commit?", "documents": ["chunk a", "chunk b", ...],
  "top_n": 5, "return_documents": false }

// response (Cohere-ish)
{ "object": "rerank.result", "model": "BAAI/bge-reranker-v2-m3",
  "results": [ { "index": 3, "relevance_score": 5.21 }, { "index": 0, "relevance_score": 1.04 }, ... ] }

relevance_score is the reranker's raw logit (higher = more relevant; unbounded, roughly -10..+10). Sort desc. Send a candidate set (≤200), not your whole corpus.
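A small sketch of consuming the rerank response: map each result back to its source document and sort by score. The sigmoid helper is an optional assumption on our side for callers who want a bounded 0..1 value; the endpoint itself returns raw logits. Both function names are illustrative.

```python
import math


def rank_documents(documents, rerank_response):
    """Pair each result with its source document, highest relevance first."""
    results = sorted(
        rerank_response["results"],
        key=lambda r: r["relevance_score"],
        reverse=True,
    )
    return [(documents[r["index"]], r["relevance_score"]) for r in results]


def logit_to_probability(logit):
    """Optional client-side squash of a raw logit into (0, 1).

    Written in the numerically stable form so large negative logits
    do not overflow math.exp.
    """
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)
```

Keep ranking on the raw logit; the squashed value is only useful for display or thresholding.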

POST /api/search — orchestrated hybrid retrieval

One call: embeds the query (dense), retrieves from Qdrant (hybrid dense+sparse with RRF when you supply a sparse vector, else dense-only), optionally cross-encoder reranks, returns top_k.

// request
{
  "query": "Did Brightwater commit to the Fund III close in Q1?",
  "collection": "crm_chunks",
  "top_k": 8,
  "retrieve_n": 80,                 // first-stage candidates (default max(50, top_k*10))
  "sparse": { "indices": [12, 904], "values": [0.7, 1.2] },  // optional BM25 vector for hybrid
  "fusion": "rrf",                  // or "dbsf"
  "filter": {                       // raw Qdrant filter (pre-filter, see below)
    "must": [ { "key": "lp_id", "match": { "value": "lp_0427" } } ]
  },
  "rerank": true,
  "text_field": "text",             // payload field holding the chunk text
  "with_payload": true
}

// response
{
  "object": "search.result_list",
  "model": "BAAI/bge-m3+bge-reranker-v2-m3",
  "query": "...",
  "collection": "crm_chunks",
  "reranked": true,
  "data": [
    { "object": "search.result", "index": 0, "id": "chunk_99c1_3",
      "score": 5.21, "fused_score": 0.41, "rerank_score": 5.21,
      "text": "...Brightwater confirmed verbal commitment...",
      "payload": { "lp_id": "lp_0427", "doc_type": "email", "date_ts": 1771027200, ... } }
  ],
  "usage": { "embed_ms": 21, "qdrant_ms": 9, "rerank_ms": 140, "candidates": 80 }
}

score is the rerank score when reranked, else the fused/dense score.
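The request above can be assembled programmatically. This is a sketch, assuming the field names shown in the example request; `build_search_request` and `top_texts` are illustrative helpers, not part of Spark Control.

```python
def build_search_request(query, collection, top_k=8, retrieve_n=None,
                         sparse=None, query_filter=None, rerank=True):
    """Assemble a /api/search body; optional fields are omitted when unset."""
    body = {
        "query": query,
        "collection": collection,
        "top_k": top_k,
        # Mirrors the documented default: max(50, top_k * 10).
        "retrieve_n": retrieve_n if retrieve_n is not None else max(50, top_k * 10),
        "rerank": rerank,
        "text_field": "text",
        "with_payload": True,
    }
    if sparse is not None:
        # Supplying a sparse vector switches retrieval to hybrid dense+sparse.
        body["sparse"] = sparse
        body["fusion"] = "rrf"
    if query_filter is not None:
        body["filter"] = query_filter
    return body


def top_texts(search_response):
    """Extract (id, score, text) tuples from a /api/search response."""
    return [(hit["id"], hit["score"], hit.get("text")) for hit in search_response["data"]]
```

Leaving `sparse` out yields a dense-only request, which matches the dense + rerank fallback described in this section.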

The sparse (BM25) story — important for entity-heavy data

bge-m3's dense vectors carry semantic meaning, but exact entity matches (fund names, tickers, people) are a lexical signal. For that you want hybrid dense + sparse. Two design facts:

  1. spark-embed serves dense only. TEI/Infinity can't emit bge-m3's learned sparse vectors either, and bge-m3's pretrained sparse weights underweight novel entity tokens (brand-new fund names). So we don't use bge-m3 sparse.
  2. Use BM25 sparse vectors instead: generate them client-side with FastEmbed and let Qdrant apply IDF server-side via modifier: idf. Qdrant computes the IDF over your own corpus, so novel entity strings get correct weight.
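To see why corpus-derived IDF favors novel entity names, here is the standard BM25 IDF formula, idf = ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of chunks and n is how many contain the term. The counts below are made up for illustration, and this is a sketch of the general formula rather than a trace of Qdrant's internals.

```python
import math


def bm25_idf(total_docs, docs_with_term):
    """Standard BM25 inverse document frequency for one term."""
    return math.log(1.0 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


corpus_size = 100_000                     # hypothetical number of chunks
rare = bm25_idf(corpus_size, 12)          # e.g. a brand-new fund name seen in 12 chunks
common = bm25_idf(corpus_size, 40_000)    # e.g. a word like "fund" seen in 40% of chunks

print(f"rare term idf   = {rare:.2f}")
print(f"common term idf = {common:.2f}")
```

With these counts the rare term weighs roughly ten times more than the common one, which is the lexical signal the dense vector alone does not provide.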

Your ingest pipeline (the part you own)

For each chunk, produce and upsert both vectors:

  • dense: call Spark Control POST /v1/embeddings → 1024-d vector.
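  • sparse: run FastEmbed BM25 client-side:
    from fastembed import SparseTextEmbedding
    bm25 = SparseTextEmbedding(model_name="Qdrant/bm25")
    sp = next(iter(bm25.embed([chunk_text])))   # SparseEmbedding with .indices / .values arrays
    sparse = {"indices": sp.indices.tolist(), "values": sp.values.tolist()}   # JSON-ready for the upsert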

Create the collection with a named dense vector and a named sparse vector that uses modifier: idf (so Qdrant applies IDF server-side):

PUT /collections/crm_chunks
{ "vectors": { "dense": { "size": 1024, "distance": "Cosine" } },
  "sparse_vectors": { "sparse": { "modifier": "idf" } } }

Add payload indexes for your filter fields so filtered queries stay fast:

PUT /collections/crm_chunks/index   { "field_name": "lp_id",     "field_schema": "keyword" }
PUT /collections/crm_chunks/index   { "field_name": "doc_type",  "field_schema": "keyword" }
PUT /collections/crm_chunks/index   { "field_name": "date_ts",   "field_schema": "integer" }
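The collection and index calls above can be scripted. This sketch builds them as (method, path, body) tuples and includes an optional sender; `collection_setup_calls` and `send_call` are illustrative names, and the Qdrant base URL passed to `send_call` is whatever your Spark 2 host resolves to.

```python
import json
import urllib.request


def collection_setup_calls(collection="crm_chunks", dense_size=1024):
    """Return (method, path, body) tuples: create the collection, then index filter fields."""
    calls = [(
        "PUT",
        f"/collections/{collection}",
        {
            "vectors": {"dense": {"size": dense_size, "distance": "Cosine"}},
            "sparse_vectors": {"sparse": {"modifier": "idf"}},
        },
    )]
    for field, schema in [("lp_id", "keyword"), ("doc_type", "keyword"), ("date_ts", "integer")]:
        calls.append((
            "PUT",
            f"/collections/{collection}/index",
            {"field_name": field, "field_schema": schema},
        ))
    return calls


def send_call(qdrant_url, call):
    """Send one call to Qdrant and return the decoded JSON reply (needs network access)."""
    method, path, body = call
    req = urllib.request.Request(
        qdrant_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Running the calls in order matters only in that the collection must exist before its payload indexes are created.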

Upsert points with both vectors + payload:

PUT /collections/crm_chunks/points
{ "points": [ { "id": 1,
    "vector": { "dense": [...1024...], "sparse": { "indices": [...], "values": [...] } },
    "payload": { "lp_id": "lp_0427", "lp_name": "Brightwater Capital",
                 "doc_type": "email", "date_ts": 1771027200, "text": "..." } } ] }

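A sketch of building the upsert body from a dense vector, a BM25 sparse vector, and a payload. The validation rules (1024 dimensions, matching index/value lengths) follow the collection definition in this section; `make_point` and `upsert_body` are illustrative names.

```python
def make_point(point_id, dense, sparse_indices, sparse_values, payload, dense_size=1024):
    """Build one Qdrant point carrying the named 'dense' and 'sparse' vectors."""
    if len(dense) != dense_size:
        raise ValueError(f"dense vector has {len(dense)} dims, expected {dense_size}")
    if len(sparse_indices) != len(sparse_values):
        raise ValueError("sparse indices and values must have the same length")
    return {
        "id": point_id,
        "vector": {
            "dense": [float(x) for x in dense],
            "sparse": {
                # Cast to plain Python types so numpy scalars serialize as JSON.
                "indices": [int(i) for i in sparse_indices],
                "values": [float(v) for v in sparse_values],
            },
        },
        "payload": dict(payload),
    }


def upsert_body(points):
    """Wrap points in the body expected by PUT /collections/<name>/points."""
    return {"points": list(points)}
```

Failing early on a dimension mismatch is a design choice here: it surfaces a wrong model or truncated vector before it reaches the index.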
At query time

Generate the query's BM25 sparse vector with the same FastEmbed model, and pass it to /api/search as sparse. Spark Control fuses dense+sparse with RRF inside Qdrant, then reranks. If you omit sparse, you get dense + rerank (still good, just no exact-lexical leg).
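For intuition about the fusion step, here is a generic Reciprocal Rank Fusion sketch: each document scores the sum of 1 / (k + rank) across the dense and sparse rankings. The constant k = 60 comes from the original RRF paper and is an assumption here; the constant Qdrant uses internally may differ, and in practice Qdrant performs this step for you.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists (best first) with Reciprocal Rank Fusion.

    Returns (doc_id, fused_score) pairs sorted from best to worst.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["a", "b", "c"]    # hypothetical ids from the dense leg
sparse_ranking = ["c", "a", "d"]   # hypothetical ids from the BM25 leg
fused = rrf_fuse([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])
```

Documents that appear near the top of both legs rise to the front, which is why an exact entity match from the sparse leg can promote a chunk the dense leg ranked lower.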

You can talk to Qdrant directly (http://<spark2>:6333) for collection management and upserts — that's the natural home for ingest. /api/search is the convenience path for the read side so your agents hit one host. If you'd rather Spark Control proxy Qdrant admin too, say the word.

Chunking guidance (entity-heavy CRM)

  • One chunk per email / note / transcript-turn; one chunk per memo section — don't split mid-thought.
  • Keep entity + date as payload fields (filterable), not buried in embedded text.
  • Time-aware: don't merge a 2022 note with a 2026 update in one chunk; store date_ts so you can pre-filter and recency-boost.
  • Resolve entity-name variants ("J. Smith" / "Jonathan Smith" / "JS") to one canonical lp_id at ingest, or retrieval fragments across variants.
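A sketch of the last two bullets: resolve name variants to one canonical lp_id at ingest and keep entity and date as payload fields. The alias table and the id `lp_0001` are hypothetical; a real pipeline would load aliases from the CRM.

```python
import re

# Hypothetical alias table mapping normalized name variants to one canonical id.
ALIASES = {
    "j. smith": "lp_0001",
    "jonathan smith": "lp_0001",
    "js": "lp_0001",
}


def normalize_name(name):
    """Lowercase and collapse whitespace so variants compare consistently."""
    return re.sub(r"\s+", " ", name).strip().lower()


def canonical_lp_id(name, aliases=ALIASES):
    """Return the canonical lp_id for a name variant, or None if unknown."""
    return aliases.get(normalize_name(name))


def chunk_payload(text, lp_name, doc_type, date_ts, aliases=ALIASES):
    """Build a payload with filterable entity and date fields alongside the text."""
    lp_id = canonical_lp_id(lp_name, aliases)
    if lp_id is None:
        raise KeyError(f"no canonical id for {lp_name!r}; add it to the alias table")
    return {
        "lp_id": lp_id,
        "lp_name": lp_name,
        "doc_type": doc_type,
        "date_ts": int(date_ts),
        "text": text,
    }
```

Raising on an unknown name, rather than inventing an id, keeps unresolved entities visible at ingest time.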

Pre-filtering

Most agent queries aren't pure semantic — they're "recent emails with Bob about energy". Pass a Qdrant filter to restrict the search space before vector scoring (faster + more precise). Qdrant also supports server-side recency boosting via Formula/Score-Boosting if you want decay-by-date_ts without an app round-trip — ask and we can expose it through /api/search.
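A sketch of building such a filter for a query like "recent emails with Bob", using Qdrant's match and range conditions on the indexed payload fields. `build_filter` and `since_days` are illustrative helpers; the 90-day window in the usage line is an arbitrary example.

```python
import time


def since_days(days, now_ts=None):
    """Unix timestamp for 'days' ago, relative to now_ts (defaults to the current time)."""
    now_ts = int(time.time()) if now_ts is None else int(now_ts)
    return now_ts - days * 86_400


def build_filter(lp_id=None, doc_type=None, since_ts=None):
    """Build a Qdrant filter that requires every supplied condition (a 'must' list)."""
    must = []
    if lp_id is not None:
        must.append({"key": "lp_id", "match": {"value": lp_id}})
    if doc_type is not None:
        must.append({"key": "doc_type", "match": {"value": doc_type}})
    if since_ts is not None:
        must.append({"key": "date_ts", "range": {"gte": int(since_ts)}})
    return {"must": must} if must else None


# Usage: emails for one LP from the last 90 days.
recent_emails = build_filter(lp_id="lp_0427", doc_type="email", since_ts=since_days(90))
```

The result goes in the `filter` field of the /api/search request; returning None when no condition is set lets the caller omit the field entirely.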

Service discovery + health

  • GET /api/endpoints includes embeddings and qdrant entries (base_url + ready flag) alongside vllm, parakeet, kokoro.
  • The dashboard shows Embeddings + Qdrant health dots and Start/Restart/Stop controls in the Always-On Services panel.
  • Spark Control will auto-restart a wedged embedding container (GPU CUDA wedge recovery, like the audio services) but never auto-restarts Qdrant — it holds your only copy of the index, so a restart is surfaced for manual action instead.

Capacity

At your scale (tens of thousands of chunks now → low hundreds of thousands), this is trivial: ~0.6–1.2 GB of dense vectors at 300k chunks, and spark-embed + Qdrant together use a few GB of GPU/RAM on Spark 2's 122 GB. A full re-embed of 300k chunks takes ~8–15 min, so re-indexing is cheap if you change models.
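The dense-vector footprint can be checked with quick arithmetic: chunks × dimensions × bytes per value. The sketch assumes 4-byte float32 storage for the upper figure and 2-byte float16 for the lower one; which precision you actually store is a deployment choice this section does not specify.

```python
def dense_bytes(num_chunks, dims=1024, bytes_per_value=4):
    """Raw size of the dense vectors, excluding index and payload overhead."""
    return num_chunks * dims * bytes_per_value


fp32_gb = dense_bytes(300_000) / 1e9                      # float32 storage
fp16_gb = dense_bytes(300_000, bytes_per_value=2) / 1e9   # float16 storage

print(f"float32: {fp32_gb:.2f} GB, float16: {fp16_gb:.2f} GB")
```

Index structures and payloads add overhead on top of the raw vectors, so treat these numbers as a floor rather than a total.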

Model upgrade path

If dense recall becomes the bottleneck, Qwen3-Embedding-4B (Matryoshka-trained, tops 2026 MTEB) is the A/B candidate — same /v1/embeddings contract, swap the model in spark-embed. bge-m3 is the lower-risk starting point and ships now.