NGC PyTorch (the only base with working torch on Spark's ARM64 + sm_120 Blackwell) doesn't ship torchaudio. Stock pip wheels are amd64-only AND ABI-incompatible with NGC's custom torch 2.10.0a anyway. Pip install just fails or crashes at runtime.
Real fix:
- apt install git cmake build-essential ninja-build
- pip install git+https://github.com/pytorch/audio.git@v2.5.1 with TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0" (sm_120 for Blackwell GB10)
- this compiles torchaudio against the torch already in the image, so ABI matches by construction
Then constraints.txt locks torch + torchvision + torchaudio so the later `pip install whisperx` can't swap any of them.
Cost: +3-5 min to the first install. Docker layer cache reuses the built torchaudio on every subsequent rebuild.
Torchaudio v2.5.1 is the last tag that builds cleanly against torch 2.5-2.10; main branch is too volatile against NGC's alpha torch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WhisperX container for Spark 2
Replaces the custom Parakeet wrapper + Sortformer overlay (v0.10/v0.11) with a single mainline pipeline:
- faster-whisper (CTranslate2-optimized) for STT
- pyannote.audio 3.1 for speaker diarization (sliding-window — handles long files in bounded memory, fixes the Sortformer OOM on 90-min audio)
- wav2vec2 forced alignment for word-level timestamps
Exposes the same API surface spark-control already proxies to, so the cutover is a one-URL change in the audio proxy:
- GET /health: readiness probe
- GET /v1/models: model list
- POST /v1/audio/transcriptions: OpenAI-shaped STT
- POST /v1/audio/transcribe-with-speakers: merged diarized transcript (matches spark-control's response shape exactly)
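As a rough illustration of what "OpenAI-shaped" means for a client, the sketch below builds (but does not send) a multipart upload for POST /v1/audio/transcriptions using only the standard library. The form field names `file` and `model` follow OpenAI's transcription API and are an assumption about this server; the base URL assumes the port used in the deploy example, and `build_transcription_request` is a hypothetical helper, not part of this repo.

```python
import urllib.request
import uuid

BASE_URL = "http://localhost:8002"  # assumption: port from the docker run example


def build_transcription_request(audio, filename, model="medium", base_url=BASE_URL):
    """Build (not send) a multipart POST for /v1/audio/transcriptions.

    Field names "model" and "file" mirror OpenAI's API; this server may
    expect something different, so treat them as assumptions.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/transcriptions",
        data=head + audio + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The diarized endpoint could presumably be reached the same way by swapping the path, though its request fields are not documented here.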
Deploy to Spark 2
# 1. Copy this directory to Spark 2
rsync -av --delete image/whisperx_container/ <spark-user>@<spark-2-ip>:~/whisperx-build/
# 2. SSH in and build
ssh <spark-user>@<spark-2-ip>
cd ~/whisperx-build
docker build -t whisperx-asr:latest .
# 3. Run alongside the existing parakeet-asr (which stays on 8000 for now)
docker run -d --restart unless-stopped --name whisperx-asr \
--gpus all --memory=40g \
-p 8002:8002 \
-v whisperx-models:/root/.cache/huggingface \
-e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
-e WHISPER_MODEL=medium \
whisperx-asr:latest
# 4. Watch first-start logs (model load + first health check)
docker logs -f whisperx-asr
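Instead of watching the logs by hand, the first health check can be scripted. This is a minimal sketch that polls GET /health until it answers; it assumes the endpoint returns HTTP 200 once the models are loaded (the response body is not inspected), and the default URL, timeout, and interval are illustrative values. `wait_until_ready` is a hypothetical helper.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:8002", timeout_s=600.0,
                     interval_s=5.0, fetch=None, sleep=time.sleep,
                     clock=time.monotonic):
    """Poll GET /health until it returns HTTP 200 or the timeout expires.

    `fetch`, `sleep`, and `clock` are injectable so the loop can be
    exercised without a running container.
    """
    def default_fetch(url):
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status

    fetch = fetch or default_fetch
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if fetch(f"{base_url}/health") == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # container still starting, or models still loading
        sleep(interval_s)
    return False
```

Run from the Spark itself, `wait_until_ready()` returns True once the service answers and False if the timeout passes first.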
Model size knobs
Set via the WHISPER_MODEL env var. Defaults to medium. Options:
| Model | Parameters | Speed (GB10, x real time) | Quality |
|---|---|---|---|
| tiny | ~39M | ~120x | low |
| base | ~74M | ~80x | ok |
| small | ~244M | ~50x | good |
| medium | ~769M | ~30x | excellent (default) |
| large-v3 | ~1.5B | ~15x | best |
For a 90-min file, medium takes ~3 min STT + ~9 min diarize ≈ ~12 min total.
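The same arithmetic can be applied to other models and durations. The sketch below copies the approximate real-time factors from the table above and assumes that diarization time scales linearly with audio length at the quoted ~9 min per 90 min of audio. That linear scaling is an assumption, not a measurement, and `estimate_minutes` is a hypothetical helper.

```python
# Approximate real-time factors on GB10, copied from the table above.
SPEED_X_REALTIME = {"tiny": 120, "base": 80, "small": 50, "medium": 30, "large-v3": 15}

# Assumption: diarization cost is linear in audio length, at the quoted
# ~9 min of compute per 90 min of audio.
DIARIZE_MIN_PER_AUDIO_MIN = 9 / 90


def estimate_minutes(audio_min, model="medium"):
    """Return (stt, diarize, total) wall-clock estimates in minutes."""
    stt = audio_min / SPEED_X_REALTIME[model]
    diarize = audio_min * DIARIZE_MIN_PER_AUDIO_MIN
    return stt, diarize, stt + diarize


if __name__ == "__main__":
    for name in SPEED_X_REALTIME:
        stt, diarize, total = estimate_minutes(90, name)
        print(f"{name:9s} stt={stt:5.2f}  diarize={diarize:5.2f}  total={total:5.2f}")
```

Under these assumptions diarization dominates the total for every model except the largest, so a smaller Whisper model buys relatively little end-to-end time on long files.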
Memory budget
The --memory=40g cap is intentional. Spark 2 has 122 GB of unified memory, of which
~35 GB is consumed by parakeet-asr + magpie-tts. The 40 GB cap leaves
comfortable headroom for both the model weights (~5 GB) and pyannote's
in-memory features (~5–15 GB for a 90-min file). If WhisperX hits a
pathological input, it gets OOM-killed cleanly instead of swap-thrashing the
whole Spark, which is the symptom we hit with the unbounded Sortformer container.
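The budget can be spelled out as a small calculation. All figures below are the approximate numbers quoted in this section; the names are illustrative only, and real usage will vary with the input.

```python
TOTAL_GB = 122               # unified memory on Spark 2
OTHER_SERVICES_GB = 35       # parakeet-asr + magpie-tts (approximate)
CAP_GB = 40                  # docker run --memory=40g
WEIGHTS_GB = 5               # model weights (approximate)
FEATURES_GB_RANGE = (5, 15)  # pyannote in-memory features, ~90 min of audio


def budget():
    """Summarize headroom using the upper end of the feature estimate."""
    expected_peak = WEIGHTS_GB + FEATURES_GB_RANGE[1]
    return {
        "expected_peak_gb": expected_peak,
        "headroom_under_cap_gb": CAP_GB - expected_peak,
        "left_outside_cap_gb": TOTAL_GB - OTHER_SERVICES_GB - CAP_GB,
    }


if __name__ == "__main__":
    for key, value in budget().items():
        print(f"{key}: {value}")
```

With these figures the expected peak uses about half of the cap, and memory remains outside the cap for the host and the other containers even if WhisperX runs up to its limit.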
Rollback to Parakeet+Sortformer
docker stop whisperx-asr && docker rm whisperx-asr
The parakeet-asr container stays running throughout — spark-control's proxy URL switch is reversible via config or version downgrade.