spark-control/image/whisperx_container
Keysat 5a0bfba6a3 v0.12.0:0 - WhisperX as a one-click dashboard install + managed service
Replaces the manual rsync+build+run with a proper spark-control feature.
First feature in the audio path that doesn't require shell access on Spark 2.

What's in the box
─────────────────
* image/whisperx_container/   - the build context (Dockerfile, requirements,
  app/main.py FastAPI wrapper). Mainline pipeline: faster-whisper for STT +
  pyannote 3.1 for diarization + wav2vec2 forced alignment. Single endpoint
  /v1/audio/transcribe-with-speakers returns the exact same shape spark-
  control's existing endpoint does, so the recap-relay PR spec needs no
  changes when we cut over.

* image/app/whisperx_install.py - install manager. Ships the build context to
  Spark 2 over SSH, runs `docker build`, then runs `docker run` with a 40 GB
  memory cap (the Sortformer container ran uncapped and thrashed Spark 2 on a
  90-min file), and polls /health until both Whisper and pyannote report loaded.

* Audio proxy: /api/audio/transcribe-with-speakers now prefers WhisperX
  when its /health reports diarizer_loaded=true, falls back to the legacy
  Parakeet + Sortformer path otherwise. Same response shape either way.
  Clean cutover, easy rollback (`docker rm whisperx-asr`).

* Dashboard (Audio / Speech tab):
  - "Add WhisperX" banner appears when not installed, with a primary
    "Install WhisperX" button. One click triggers the install.
  - Build progress dialog with phase + elapsed timer + live build log via
    SSE (`/api/whisperx/install/{job_id}/stream`).
  - After install, WhisperX auto-registers as a managed service alongside
    Parakeet and Magpie (Start/Restart/Stop, deep-check, auto-restart).
  - Banner self-hides once /api/whisperx/status reports healthy.
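
The proxy's prefer-then-fall-back rule described above can be sketched as a
small pure function. This is a minimal illustration, not the actual
spark-control code: the function name `pick_backend` and the "whisperx" /
"legacy" labels are hypothetical; only the `diarizer_loaded` health field comes
from the description above.

```python
from typing import Optional


def pick_backend(health: Optional[dict]) -> str:
    """Choose the transcription backend from a WhisperX /health payload.

    `health` is the parsed JSON body, or None if the probe failed
    (container missing, connection refused, timeout, ...).
    """
    # Only an explicit diarizer_loaded=true selects WhisperX; anything
    # else (no container, still loading, malformed body) falls back.
    if health is not None and health.get("diarizer_loaded") is True:
        return "whisperx"
    return "legacy"  # Parakeet + Sortformer path
```

Treating every non-ready state as "legacy" is what would make
`docker rm whisperx-asr` work as a rollback: once the probe fails, requests go
back to the old path without a config change.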

New endpoints
─────────────
  GET  /api/whisperx/status
  POST /api/whisperx/install
  GET  /api/whisperx/install/{job_id}
  GET  /api/whisperx/install/{job_id}/stream  (SSE phase + log)
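
A client of the stream endpoint needs to split the SSE byte stream into events.
The sketch below follows the generic Server-Sent Events framing (fields
separated by ":", events terminated by a blank line). The event names "phase"
and "log" used in the usage example are assumptions; the actual payload format
of this endpoint is not documented here.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the event; events without data are dropped.
            if data:
                event["data"] = "\n".join(data)
                yield event
            event, data = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data":
            data.append(value)  # multiple data lines are joined with "\n"
        else:
            event[field] = value


# Usage with a made-up stream (event names are hypothetical):
sample = [
    "event: phase\n", "data: building\n", "\n",
    ": keep-alive\n",
    "event: log\n", "data: Step 1/9\n", "data: Step 2/9\n", "\n",
]
events = list(parse_sse(sample))
```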

Config additions (env)
──────────────────────
  WHISPERX_HOST       (defaults to spark2_host)
  WHISPERX_USER       (defaults to spark2_user)
  WHISPERX_CONTAINER  (default: whisperx-asr)
  WHISPERX_PORT       (default: 8002)
  WHISPERX_MODEL      (default: medium; tiny/base/small/medium/large-v3)
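
How these defaults resolve can be sketched as follows. This is an illustration
under assumptions, not the spark-control settings code: the `WhisperXConfig`
class and `load_config` function are hypothetical names, and rejecting unknown
model names is an added assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Mapping

ALLOWED_MODELS = ("tiny", "base", "small", "medium", "large-v3")


@dataclass(frozen=True)
class WhisperXConfig:
    host: str
    user: str
    container: str
    port: int
    model: str


def load_config(env: Mapping[str, str], spark2_host: str, spark2_user: str) -> WhisperXConfig:
    """Resolve WHISPERX_* variables, falling back to the listed defaults."""
    model = env.get("WHISPERX_MODEL", "medium")
    if model not in ALLOWED_MODELS:
        # Assumption: fail fast on a typo instead of passing it to the container.
        raise ValueError(f"unsupported WHISPERX_MODEL: {model!r}")
    return WhisperXConfig(
        host=env.get("WHISPERX_HOST", spark2_host),
        user=env.get("WHISPERX_USER", spark2_user),
        container=env.get("WHISPERX_CONTAINER", "whisperx-asr"),
        port=int(env.get("WHISPERX_PORT", "8002")),
        model=model,
    )
```

In practice `env` would be `os.environ`; passing it in as a mapping keeps the
function easy to test.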

Dockerfile
──────────
Added COPY whisperx_container /app/whisperx_container so the runtime
install manager can read the build context from inside the spark-control
image and ship it over SSH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:02:26 -05:00

WhisperX container for Spark 2

Replaces the custom Parakeet wrapper + Sortformer overlay (v0.10/v0.11) with a single mainline pipeline:

  • faster-whisper (CTranslate2-optimized) for STT
  • pyannote.audio 3.1 for speaker diarization (sliding-window — handles long files in bounded memory, fixes the Sortformer OOM on 90-min audio)
  • wav2vec2 forced alignment for word-level timestamps

Exposes the same API surface spark-control already proxies to, so the cutover is a one-URL change in the audio proxy:

  • GET /health — readiness probe
  • GET /v1/models — model list
  • POST /v1/audio/transcriptions — OpenAI-shaped STT
  • POST /v1/audio/transcribe-with-speakers — merged diarized transcript (matches spark-control's response shape exactly)

Deploy to Spark 2

# 1. Copy this directory to Spark 2
rsync -av --delete image/whisperx_container/ <spark-user>@<spark-2-ip>:~/whisperx-build/

# 2. SSH in and build
ssh <spark-user>@<spark-2-ip>
cd ~/whisperx-build
docker build -t whisperx-asr:latest .

# 3. Run alongside the existing parakeet-asr (which stays on 8000 for now)
docker run -d --restart unless-stopped --name whisperx-asr \
  --gpus all --memory=40g \
  -p 8002:8002 \
  -v whisperx-models:/root/.cache/huggingface \
  -e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
  -e WHISPER_MODEL=medium \
  whisperx-asr:latest

# 4. Watch first-start logs (model load + first health check)
docker logs -f whisperx-asr
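
As an alternative to tailing logs, readiness can be polled from a script. The
sketch below is a minimal polling loop under assumptions: `fetch_health` is a
caller-supplied function that returns the parsed /health JSON (or None on a
connection error), and the field name `model_loaded` is hypothetical;
`diarizer_loaded` is the field the spark-control audio proxy checks.

```python
import time
from typing import Callable, Optional


def wait_until_ready(
    fetch_health: Callable[[], Optional[dict]],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until both the STT model and the diarizer report loaded."""
    deadline = clock() + timeout_s
    while True:
        health = fetch_health()
        # "model_loaded" is an assumed field name; adjust to the real payload.
        if health and health.get("model_loaded") and health.get("diarizer_loaded"):
            return True
        if clock() >= deadline:
            return False  # gave up; the caller decides how to report it
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; a
real `fetch_health` could wrap `urllib.request.urlopen` on port 8002 and
return None on `OSError`.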

Model size knobs

Set the WHISPER_MODEL env var to pick a model; it defaults to medium. Options:

Model      Params   Speed (GB10, x real time)   Quality
tiny       ~39M     ~120x                       low
base       ~74M     ~80x                        ok
small      ~244M    ~50x                        good
medium     ~769M    ~30x                        excellent (default)
large-v3   ~1.5B    ~15x                        best

For a 90-min file, medium takes ~3 min STT + ~9 min diarize ≈ ~12 min total.
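
The same arithmetic can be applied to other file lengths and models. This is a
rough estimator, not a benchmark: it assumes STT time scales with the real-time
factors listed above and that diarization scales linearly from the single
quoted data point (~9 min for 90 min of audio), which may not hold for very
short or very long files.

```python
from typing import Tuple

# Approximate real-time factors on GB10, taken from the table above.
REALTIME_FACTOR = {"tiny": 120, "base": 80, "small": 50, "medium": 30, "large-v3": 15}

# Assumption: diarization cost is linear in audio length (9 min per 90 min).
DIARIZE_RATIO = 9 / 90


def estimate_minutes(audio_min: float, model: str = "medium") -> Tuple[float, float, float]:
    """Return (stt_min, diarize_min, total_min) for a file of `audio_min` minutes."""
    stt = audio_min / REALTIME_FACTOR[model]
    diarize = audio_min * DIARIZE_RATIO
    return stt, diarize, stt + diarize


stt, diarize, total = estimate_minutes(90, "medium")
print(f"STT {stt:.1f} min + diarize {diarize:.1f} min = {total:.1f} min")
```

Under these assumptions diarization dominates the total for every model except
large-v3, so switching to a smaller Whisper model saves little wall-clock time.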

Memory budget

The --memory=40g cap is intentional. Spark 2 has 122 GB of unified memory, of which ~35 GB is consumed by parakeet-asr + magpie-tts. The 40 GB cap leaves comfortable headroom for both the model weights (~5 GB) and pyannote's in-memory features (~5-15 GB for a 90-min file). If WhisperX hits a pathological input, it gets OOM-killed cleanly instead of swap-thrashing the whole Spark, which is the symptom we hit with the unbounded Sortformer container.
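
The budget can be checked with a few lines of arithmetic. The sketch below uses
the totals quoted above as defaults and takes the pyannote feature footprint as
a parameter; the 15 GB default is an assumed upper end for a 90-min file, and
the function name is hypothetical.

```python
def memory_headroom_gb(
    total: float = 122,          # unified memory on Spark 2
    other_services: float = 35,  # parakeet-asr + magpie-tts
    cap: float = 40,             # docker --memory limit for whisperx-asr
    weights: float = 5,          # model weights
    features: float = 15,        # assumed upper end of pyannote features
) -> dict:
    """Return the headroom figures implied by the memory budget, in GB."""
    return {
        # Memory not used by the existing services.
        "free_before_whisperx": total - other_services,
        # What remains on the host even if WhisperX fills its whole cap.
        "left_on_host_at_cap": total - other_services - cap,
        # Slack inside the container before the OOM killer fires.
        "headroom_inside_cap": cap - (weights + features),
    }


budget = memory_headroom_gb()
```

With the defaults this gives 87 GB free before WhisperX, 47 GB left on the host
at the cap, and 20 GB of slack inside the container.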

Rollback to Parakeet+Sortformer

docker stop whisperx-asr && docker rm whisperx-asr

The parakeet-asr container stays running throughout — spark-control's proxy URL switch is reversible via config or version downgrade.