# spark-embed: dense embeddings (bge-m3) + reranker (bge-reranker-v2-m3)
# Built FROM the NGC PyTorch image that is already proven to run on the DGX
# Spark's GB10 (sm_121) GPU, the same base behind our vLLM and Kokoro work.
#
# Why not HF Text Embeddings Inference (TEI)? As of 2026 TEI ships no arm64
# CUDA image (all *-cuda tags are amd64-only), so it won't run on the Spark.
# Building on NGC torch sidesteps that AND avoids torchaudio (the dependency
# that sank the WhisperX attempt). bge-m3 + the reranker are XLM-RoBERTa
# encoders: no flash-attn, no torchaudio, just SDPA attention on torch.
FROM nvcr.io/nvidia/pytorch:25.11-py3

WORKDIR /app

# Hard-pin the NGC torch version in a constraints file so pip CANNOT replace it
# while resolving sentence-transformers. NGC's torch carries a local version
# string (…nv25.11) not on PyPI; pinning it makes pip treat the already-installed
# build as satisfying the requirement instead of pulling a PyPI wheel that
# wouldn't have sm_121 kernels. (Same technique as the v0.12.0 torch-ABI work.)
# transformers is NOT preinstalled in this NGC base, so it installs fresh from
# PyPI; we cap it (<5) so a future major can't silently change loading behavior.
RUN python -c "import torch; \
open('/tmp/constraints.txt','w').write('torch==%s\n' % torch.__version__)" \
 && cat /tmp/constraints.txt \
 && pip install --no-cache-dir -c /tmp/constraints.txt \
    "sentence-transformers>=3.0" "transformers<5" "fastapi>=0.115" "uvicorn[standard]>=0.30"

COPY main.py /app/main.py

# Persist HuggingFace model downloads (bge-m3 ~2.3GB + reranker ~2.3GB) on a
# mounted volume so container recreates don't re-download.
ENV HF_HOME=/data/hf
ENV DENSE_MODEL=BAAI/bge-m3
ENV RERANK_MODEL=BAAI/bge-reranker-v2-m3

EXPOSE 8088
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8088"]
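
# Usage sketch (comments only, not part of the build). Assumptions: the image
# tag "spark-embed" and the host path /srv/spark-embed-data are hypothetical
# placeholders, and the host is assumed to have the NVIDIA Container Toolkit
# so that --gpus works. Mounting a host directory at /data is one way to keep
# HF_HOME=/data/hf, and therefore the model downloads, across container
# recreates.
#
#   docker build -t spark-embed .
#   docker run --gpus all -p 8088:8088 \
#     -v /srv/spark-embed-data:/data spark-embed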