# WhisperX container for Spark 2

Replaces the custom Parakeet wrapper + Sortformer overlay (v0.10/v0.11) with a single mainline pipeline:

- **faster-whisper** (CTranslate2-optimized) for STT
- **pyannote.audio 3.1** for speaker diarization (sliding-window, so it handles long files in bounded memory and fixes the Sortformer OOM on 90-min audio)
- **wav2vec2 forced alignment** for word-level timestamps

Exposes the same API surface spark-control already proxies to, so the cutover is a one-URL change in the audio proxy:

- `GET /health`: readiness probe
- `GET /v1/models`: model list
- `POST /v1/audio/transcriptions`: OpenAI-shaped STT
- `POST /v1/audio/transcribe-with-speakers`: merged diarized transcript (matches spark-control's response shape exactly)

## Deploy to Spark 2

```bash
# 1. Copy this directory to Spark 2
rsync -av --delete image/whisperx_container/ @:~/whisperx-build/

# 2. SSH in and build
ssh @
cd ~/whisperx-build
docker build -t whisperx-asr:latest .

# 3. Run alongside the existing parakeet-asr (which stays on 8000 for now)
docker run -d --restart unless-stopped --name whisperx-asr \
  --gpus all --memory=40g \
  -p 8002:8002 \
  -v whisperx-models:/root/.cache/huggingface \
  -e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
  -e WHISPER_MODEL=medium \
  whisperx-asr:latest

# 4. Watch first-start logs (model load + first health check)
docker logs -f whisperx-asr
```

## Model size knobs

Set the model with the `WHISPER_MODEL` env var. It defaults to `medium`. Options (speed is a multiple of real time, "rt"):

| Model | Parameters | Speed (GB10) | Quality |
|---|---|---|---|
| `tiny` | ~39M | ~120x rt | low |
| `base` | ~74M | ~80x rt | ok |
| `small` | ~244M | ~50x rt | good |
| `medium` | ~769M | ~30x rt | excellent (**default**) |
| `large-v3` | ~1.5B | ~15x rt | best |

For a 90-min file, medium takes ~3 min STT + ~9 min diarize ≈ 12 min total.

## Memory budget

The `--memory=40g` cap is intentional. Spark 2 has 122 GB of unified memory, of which ~35 GB is consumed by parakeet-asr + magpie-tts.
The 40 GB cap leaves comfortable headroom for both the model weights (~5 GB) and pyannote's in-memory features (~5–15 GB for a 90-min file). If WhisperX hits a pathological input, it gets OOM-killed cleanly instead of swap-thrashing the whole Spark, which is the symptom we hit with the unbounded Sortformer container.

## Rollback to Parakeet+Sortformer

```bash
docker stop whisperx-asr && docker rm whisperx-asr
```

The parakeet-asr container stays running throughout, and spark-control's proxy URL switch is reversible via config or version downgrade.
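
After a deploy or a rollback, it can help to confirm which service is actually answering before touching the proxy URL. The sketch below polls the `GET /health` readiness probe listed above. The function name `wait_for_health`, its default of 30 attempts 10 s apart, and the assumption that the endpoint returns HTTP 2xx once models are loaded are all illustrative and not taken from the container's code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a health endpoint until it answers or attempts run out.
# Assumption: the service returns HTTP 2xx on GET /health once it is ready.
wait_for_health() {
  local url="$1"
  local attempts="${2:-30}"   # illustrative default: 30 tries
  local delay="${3:-10}"      # illustrative default: 10 s between tries
  local i
  for (( i = 1; i <= attempts; i++ )); do
    # -f: treat HTTP >= 400 as failure; -sS: quiet, but still print errors;
    # --max-time: cap each try at 5 s; -o /dev/null: discard the body.
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy after ${i} attempt(s): ${url}"
      return 0
    fi
    # Sleep only between tries, not after the last one.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  echo "not healthy after ${attempts} attempt(s): ${url}" >&2
  return 1
}
```

With the port mapping from the deploy step, `wait_for_health http://localhost:8002/health` would check the WhisperX container. After a rollback, the same call against port 8000 would check parakeet-asr, assuming it serves the same `/health` route.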