Files
spark-control/image/whisperx_container/Dockerfile
T
Keysat 09a1d3590d v0.12.0:3 - hotfix: build torchaudio from source against NGC's torch
NGC PyTorch (the only base with working torch on Spark's ARM64 + sm_120
Blackwell) doesn't ship torchaudio. Stock pip wheels are amd64-only AND
ABI-incompatible with NGC's custom torch 2.10.0a anyway. Pip install
just fails or crashes at runtime.

Real fix:
  - apt install git cmake build-essential ninja-build
  - pip install git+https://github.com/pytorch/audio.git@v2.5.1
    with TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0" (sm_120 for Blackwell GB10)
  - this compiles torchaudio against the torch already in the image, so
    ABI matches by construction

Then constraints.txt locks torch + torchvision + torchaudio so the later
`pip install whisperx` can't swap any of them.

Cost: +3-5 min to the first install. Docker layer cache reuses the
built torchaudio on every subsequent rebuild.

Torchaudio v2.5.1 is the last tag that builds cleanly against
torch 2.5-2.10 — main branch is too volatile against NGC's alpha torch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:40:50 -05:00

84 lines
3.9 KiB
Docker
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# WhisperX ASR + diarization container for Spark 2 (Blackwell GB10, sm_120).
#
# Replaces the custom Parakeet wrapper + Sortformer overlay with a single
# mainline pipeline: faster-whisper for transcription + pyannote.audio 3.1
# for diarization + wav2vec2 forced alignment for word-level timestamps.
#
# Build (on Spark 2, where Blackwell + nvcr.io credentials are available):
# docker build -t whisperx-asr:latest .
#
# Run:
# docker run -d --restart unless-stopped --name whisperx-asr \
# --gpus all --memory=40g \
# -p 8002:8002 \
# -v whisperx-models:/root/.cache/huggingface \
# -e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
# -e WHISPER_MODEL=medium \
# whisperx-asr:latest
#
# The memory cap is intentional: even if WhisperX hits a pathological input,
# it gets OOM-killed cleanly instead of swap-thrashing the whole Spark.
FROM nvcr.io/nvidia/pytorch:25.11-py3
# WhisperX runs ffmpeg under the hood for audio decoding.
# git + cmake + build-essential are needed to build torchaudio from source
# (see below); we remove them at the end of the next layer to keep the image
# from growing unnecessarily.
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
ffmpeg git cmake build-essential ninja-build \
&& rm -rf /var/lib/apt/lists/*
# Pin torch + torchvision to whatever NGC actually shipped so pip can't swap
# them out when it satisfies whisperx/pyannote deps. (NGC's torch is a custom
# build with a non-standard local version like "2.10.0a0+b558c986e8.nv25.11"
# — stock pip wheels would clobber it and break the ABI.)
RUN python3 -c "import torch, torchvision; \
import sys; \
sys.stdout.write(f'torch=={torch.__version__}\ntorchvision=={torchvision.__version__}\n')" \
> /tmp/torch-constraints.txt \
&& echo '── pinned torch versions ──' && cat /tmp/torch-constraints.txt
# NGC PyTorch images don't include torchaudio (NVIDIA optimizes for
# vision/text workloads). Stock torchaudio wheels are ABI-incompatible with
# NGC's custom torch 2.10a, so the only working option is building from
# source against the NGC torch already in the image. Pinning to v2.5.1 — the
# last torchaudio tag that builds cleanly against torch 2.52.10 and is a
# proven compatibility target.
ENV USE_CUDA=1 BUILD_SOX=0 TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"
RUN pip install --break-system-packages --no-cache-dir \
git+https://github.com/pytorch/audio.git@v2.5.1 \
&& python3 -c "import torchaudio; print('torchaudio built:', torchaudio.__version__)"
# Append torchaudio to constraints so pip can't replace it later.
RUN python3 -c "import torchaudio; print(f'torchaudio=={torchaudio.__version__}')" \
>> /tmp/torch-constraints.txt \
&& echo '── final pinned versions ──' && cat /tmp/torch-constraints.txt
# Install whisperx + the FastAPI wrapper deps under the torch+torchaudio
# constraint. pip will satisfy whisperx/pyannote without swapping any of the
# pytorch-family packages.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --break-system-packages --no-cache-dir \
-c /tmp/torch-constraints.txt -r /tmp/requirements.txt
# Pre-warm the default Whisper + alignment models at build time so first-call
# latency on a fresh container is small. (~3 GB cached into the image; if you
# want a smaller image, comment this out and accept the first-call download.)
ARG WHISPER_MODEL=medium
ENV WHISPER_MODEL=${WHISPER_MODEL}
RUN python3 -c "import whisperx; whisperx.load_model('${WHISPER_MODEL}', 'cpu', compute_type='int8')" \
&& python3 -c "import whisperx; whisperx.load_align_model(language_code='en', device='cpu')"
WORKDIR /opt/whisperx
COPY app /opt/whisperx/app
# Expose for spark-control's proxy on Spark 2
EXPOSE 8002
HEALTHCHECK --interval=30s --timeout=10s --start-period=180s \
CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8002/health')" || exit 1
CMD ["python3", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8002", "--workers", "1"]