spark-control/image/whisperx_container/README.md

# WhisperX container for Spark 2

Replaces the custom Parakeet wrapper + Sortformer overlay (v0.10/v0.11) with a
single mainline pipeline:

- **faster-whisper** (CTranslate2-optimized) for STT
- **pyannote.audio 3.1** for speaker diarization (sliding-window inference, which
  handles long files in bounded memory and fixes the Sortformer OOM on 90-min audio)
- **wav2vec2 forced alignment** for word-level timestamps

Exposes the same API surface spark-control already proxies to, so the cutover
is a one-URL change in the audio proxy:

- `GET  /health` — readiness probe
- `GET  /v1/models` — model list
- `POST /v1/audio/transcriptions` — OpenAI-shaped STT
- `POST /v1/audio/transcribe-with-speakers` — merged diarized transcript
  (matches spark-control's response shape exactly)
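
A quick smoke test of the endpoints listed above could look like the sketch below. The base URL is an assumption (placeholder host, port 8002), and the multipart field name `file` is borrowed from the OpenAI transcription API rather than confirmed for this container.

```shell
# Base URL is an assumption: the host is a placeholder and port 8002 is the
# port this container is published on in the deploy example.
WHISPERX_URL="${WHISPERX_URL:-http://localhost:8002}"

# Join the base URL (minus any trailing slash) with an endpoint path.
endpoint() {
  printf '%s%s\n' "${WHISPERX_URL%/}" "$1"
}

# Readiness probe; -f makes curl exit non-zero on HTTP error responses.
check_health() {
  curl -fsS "$(endpoint /health)"
}

# List the models the service reports.
list_models() {
  curl -fsS "$(endpoint /v1/models)"
}

# Request a diarized transcript for one audio file.
# ASSUMPTION: the upload field is named "file", as in OpenAI-shaped STT APIs.
transcribe_with_speakers() {
  curl -fsS "$(endpoint /v1/audio/transcribe-with-speakers)" -F "file=@$1"
}
```

After sourcing these functions, `check_health && transcribe_with_speakers meeting.wav` exercises the readiness probe and the diarized endpoint in one go (`meeting.wav` is a hypothetical local file).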

## Deploy to Spark 2

```bash
# 1. Copy this directory to Spark 2
rsync -av --delete image/whisperx_container/ <spark-user>@<spark-2-ip>:~/whisperx-build/

# 2. SSH in and build
ssh <spark-user>@<spark-2-ip>
cd ~/whisperx-build
docker build -t whisperx-asr:latest .

# 3. Run alongside the existing parakeet-asr (which stays on 8000 for now)
docker run -d --restart unless-stopped --name whisperx-asr \
  --gpus all --memory=40g \
  -p 8002:8002 \
  -v whisperx-models:/root/.cache/huggingface \
  -e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
  -e WHISPER_MODEL=medium \
  whisperx-asr:latest

# 4. Watch first-start logs (model load + first health check)
docker logs -f whisperx-asr
```
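
Instead of watching the logs by hand, the first start can be gated on the readiness probe. The helper below is a minimal sketch: `wait_until` is a hypothetical name, and the attempt count and delay in the usage note are illustrative values, not measured start-up times.

```shell
# Run a probe command repeatedly until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  wu_attempts="$1"
  wu_delay="$2"
  shift 2
  wu_try=0
  while [ "$wu_try" -lt "$wu_attempts" ]; do
    # Discard probe output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    wu_try=$((wu_try + 1))
    sleep "$wu_delay"
  done
  return 1
}
```

For example, `wait_until 60 5 curl -fsS http://localhost:8002/health` polls `/health` every 5 seconds for up to 5 minutes and returns non-zero if the service never becomes ready.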

## Model size knobs

Set the model size with the `WHISPER_MODEL` environment variable (default: `medium`). Options:

| Model | Parameters | Speed on GB10 (x real time) | Quality |
|---|---|---|---|
| `tiny` | ~39M | ~120x | low |
| `base` | ~74M | ~80x | ok |
| `small` | ~244M | ~50x | good |
| `medium` | ~769M | ~30x | excellent (**default**) |
| `large-v3` | ~1.5B | ~15x | best |

For a 90-min file, `medium` takes ~3 min for STT (90 min at ~30x real time) plus ~9 min for diarization, or ~12 min total.
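
The estimate above generalizes to other models and durations. The sketch below assumes a diarization speed of ~10x real time, which is inferred from the ~9 min figure for 90 minutes of audio, and assumes both stages scale linearly with duration; `estimate_minutes` is a hypothetical helper, not part of the container.

```shell
# Estimate wall-clock minutes for STT plus diarization.
# Usage: estimate_minutes AUDIO_MINUTES STT_SPEED DIARIZE_SPEED
# Speeds are multiples of real time (e.g. 30 means 30x faster than playback).
estimate_minutes() {
  awk -v audio="$1" -v stt="$2" -v diar="$3" \
    'BEGIN { printf "%.1f\n", audio / stt + audio / diar }'
}

estimate_minutes 90 30 10   # medium:   3 + 9 -> prints 12.0
estimate_minutes 90 15 10   # large-v3: 6 + 9 -> prints 15.0
```

Under these assumptions diarization dominates the total, so switching to a smaller Whisper model saves comparatively little on long files.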

## Memory budget

The `--memory=40g` cap is intentional. Spark 2 has 122 GB of unified memory, of
which ~35 GB is consumed by parakeet-asr + magpie-tts. The expected peak is
~10–20 GB: ~5 GB of model weights plus ~5–15 GB of pyannote in-memory features
for a 90-min file, so the 40 GB cap leaves comfortable headroom. If WhisperX
hits a pathological input, it gets OOM-killed cleanly instead of swap-thrashing
the whole Spark, which is the symptom we hit with the unbounded Sortformer container.
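
To confirm whether the cap was actually hit, Docker's own state can be queried. This is a sketch using standard `docker inspect` and `docker stats` format fields; the function names are hypothetical. Note that the flag describes the container's most recent state, so with `--restart unless-stopped` it may already be cleared after an automatic restart.

```shell
# Succeeds if Docker recorded an OOM kill for the named container.
was_oom_killed() {
  [ "$(docker inspect -f '{{.State.OOMKilled}}' "$1")" = "true" ]
}

# One-shot snapshot of current memory use against the configured limit.
memory_snapshot() {
  docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}' "$1"
}
```

For example, `was_oom_killed whisperx-asr && echo "hit the 40g cap"` flags a kill, and `memory_snapshot whisperx-asr` shows usage during a long transcription.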

## Rollback to Parakeet+Sortformer

```bash
docker stop whisperx-asr && docker rm whisperx-asr
```

The parakeet-asr container stays running throughout, so rolling back only
requires pointing spark-control's audio proxy back at it. That URL switch is
reversible via config or version downgrade.