Keysat 98aeef8779 v0.12.0:2 - hotfix: pin NGC's torch versions so pip can't break the ABI

WhisperX docker build was crashing at the model-prewarm step:
  OSError: undefined symbol: torch_library_impl

Root cause: the NGC PyTorch base ships custom builds of torch +
torchaudio + torchvision matched together for Blackwell (sm_120). When
pip installed whisperx, it pulled the latest stock torchaudio wheel as
a transitive dep, which was compiled against a different libtorch and
won't load against NGC's.

Fix: at build time, capture NGC's actual torch/torchaudio/torchvision
versions into /tmp/torch-constraints.txt, then `pip install -c` that
constraint for all subsequent installs. pip can't swap torch out, so
the ABI stays consistent. whisperx and pyannote are happy with
torch>=2.0 — NGC's 2.10.0a0 satisfies that easily.

The pinned versions print to the build log so you can see them.
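The capture step described above could be sketched roughly as follows. This is an illustration, not the Dockerfile's actual contents: the helper names `installed_versions`, `constraint_lines`, and `write_constraints` are hypothetical, and only the three package names and the `/tmp/torch-constraints.txt` path come from this commit message.

```python
from importlib import metadata

# Packages whose preinstalled builds must not be replaced by pip.
PINNED = ("torch", "torchaudio", "torchvision")


def installed_versions(names=PINNED):
    """Return {package: version} for each named package that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Skip packages the base image does not ship.
            continue
    return found


def constraint_lines(versions):
    """Render exact '==' pins, one per line, in pip constraints-file format."""
    return "".join(
        f"{name}=={version}\n" for name, version in sorted(versions.items())
    )


def write_constraints(path, names=PINNED):
    """Write the pins to `path` and echo them so they show up in the build log."""
    text = constraint_lines(installed_versions(names))
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    print(text, end="")
    return text
```

Under these assumptions, a build step would call `write_constraints("/tmp/torch-constraints.txt")` once, and every later install would pass the file along, e.g. `pip install -c /tmp/torch-constraints.txt whisperx`.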

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 21:26:08 -05:00

app

v0.12.0:0 - WhisperX as a one-click dashboard install + managed service

2026-05-18 21:02:26 -05:00

Dockerfile

v0.12.0:2 - hotfix: pin NGC's torch versions so pip can't break the ABI

2026-05-18 21:26:08 -05:00

README.md

v0.12.0:0 - WhisperX as a one-click dashboard install + managed service

2026-05-18 21:02:26 -05:00

requirements.txt

v0.12.0:0 - WhisperX as a one-click dashboard install + managed service

2026-05-18 21:02:26 -05:00

README.md

WhisperX container for Spark 2

Replaces the custom Parakeet wrapper + Sortformer overlay (v0.10/v0.11) with a single mainline pipeline:

faster-whisper (CTranslate2-optimized) for STT
pyannote.audio 3.1 for speaker diarization (sliding-window — handles long files in bounded memory, fixes the Sortformer OOM on 90-min audio)
wav2vec2 forced alignment for word-level timestamps

Exposes the same API surface spark-control already proxies to, so the cutover is a one-URL change in the audio proxy:

GET /health — readiness probe
GET /v1/models — model list
POST /v1/audio/transcriptions — OpenAI-shaped STT
POST /v1/audio/transcribe-with-speakers — merged diarized transcript (matches spark-control's response shape exactly)
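As a rough illustration of the OpenAI-shaped STT endpoint, the sketch below builds (but does not send) a multipart upload using only the standard library. The form field names `file` and `model` follow the OpenAI transcription convention and are an assumption about this server; the function name and the `model` value are hypothetical, and the real model ID should come from `GET /v1/models`.

```python
import uuid
import urllib.request


def build_transcription_request(base_url, filename, audio_bytes, model="medium"):
    """Build a POST /v1/audio/transcriptions request with a multipart body.

    Assumes the server accepts OpenAI-style form fields `model` and `file`.
    """
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()
    body = model_part + file_header + audio_bytes + b"\r\n" + closing
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Sending it would be a matter of passing the request to `urllib.request.urlopen(req)` and parsing the JSON response. The same builder could target `/v1/audio/transcribe-with-speakers` by swapping the path, assuming that endpoint takes the same form fields.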

Deploy to Spark 2

# 1. Copy this directory to Spark 2
rsync -av --delete image/whisperx_container/ <spark-user>@<spark-2-ip>:~/whisperx-build/

# 2. SSH in and build
ssh <spark-user>@<spark-2-ip>
cd ~/whisperx-build
docker build -t whisperx-asr:latest .

# 3. Run alongside the existing parakeet-asr (which stays on 8000 for now)
docker run -d --restart unless-stopped --name whisperx-asr \
  --gpus all --memory=40g \
  -p 8002:8002 \
  -v whisperx-models:/root/.cache/huggingface \
  -e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
  -e WHISPER_MODEL=medium \
  whisperx-asr:latest

# 4. Watch first-start logs (model load + first health check)
docker logs -f whisperx-asr
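Instead of watching the logs by hand, readiness could also be polled through `GET /health`. The following is a minimal sketch, assuming the endpoint answers HTTP 200 once the models are loaded (the response body is not specified here); `http_probe` and `wait_until_ready` are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if `url` answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Call `probe` until it returns True.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

With the port mapping from step 3, usage might look like `wait_until_ready(lambda: http_probe("http://<spark-2-ip>:8002/health"))`.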

Model size knobs

WHISPER_MODEL env var. Defaults to medium. Options:

Model	Params	Speed (GB10)	Quality
`tiny`	~39M	~120x rt	low
`base`	~74M	~80x rt	ok
`small`	~244M	~50x rt	good
`medium`	~769M	~30x rt	excellent (default)
`large-v3`	~1.5B	~15x rt	best

For a 90-min file, medium takes ~3 min STT + ~9 min diarize ≈ ~12 min total.
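The arithmetic behind that estimate can be written out as a small helper. It is a back-of-the-envelope sketch: the real-time factors come from the table above, while the diarization rate is extrapolated from the single 90-minute data point and assumed to scale linearly with duration and to be independent of the Whisper model. Both assumptions are unverified.

```python
# Approximate STT speed on GB10, as multiples of real time (from the table).
REALTIME_FACTOR = {
    "tiny": 120,
    "base": 80,
    "small": 50,
    "medium": 30,
    "large-v3": 15,
}

# Single data point from this README: ~9 min of diarization for 90 min of audio.
DIARIZE_MIN_PER_90_MIN_AUDIO = 9


def estimate_minutes(audio_min, model="medium"):
    """Return (stt, diarize, total) wall-clock estimates in minutes."""
    stt = audio_min / REALTIME_FACTOR[model]
    diarize = audio_min * DIARIZE_MIN_PER_90_MIN_AUDIO / 90
    return stt, diarize, stt + diarize
```

For example, `estimate_minutes(90, "medium")` reproduces the ~3 + ~9 ≈ ~12 minute figure, and the same file on `large-v3` would come out at roughly 6 + 9 = 15 minutes under these assumptions.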

Memory budget

The --memory=40g cap is intentional. Spark 2 has 122 GB of unified memory, of which ~35 GB is consumed by parakeet-asr + magpie-tts. The 40 GB cap leaves comfortable headroom for both the model weights (~5 GB) and pyannote's in-memory features (~5–15 GB for a 90-min file). If WhisperX hits a pathological input, it gets OOM-killed cleanly instead of swap-thrashing the whole Spark, which is the symptom we hit with the unbounded Sortformer container.
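The budget can be sanity-checked with a few lines of arithmetic. This sketch only restates the approximate figures quoted in this section; the function name and dictionary keys are hypothetical, and the numbers are estimates rather than measurements.

```python
def memory_budget(total_gb=122, other_gb=35, cap_gb=40,
                  weights_gb=5, features_gb=(5, 15)):
    """Summarise the container's memory budget in GB.

    `features_gb` is the (low, high) range for pyannote's in-memory
    features on a 90-min file; the high end is used as the expected peak.
    """
    peak = weights_gb + features_gb[1]
    return {
        "expected_peak_gb": peak,                          # weights + worst-case features
        "headroom_inside_cap_gb": cap_gb - peak,           # slack before the OOM kill
        "left_for_host_gb": total_gb - other_gb - cap_gb,  # untouched by all three services
    }
```

With the defaults, the expected peak is about 20 GB, leaving about 20 GB of slack inside the cap and about 47 GB for everything else on the machine.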

Rollback to Parakeet+Sortformer

docker stop whisperx-asr && docker rm whisperx-asr

The parakeet-asr container stays running throughout, so nothing needs to be restarted on that side. spark-control's proxy URL switch is reversible via config or a version downgrade.
