spark-control/image/whisperx_container/README.md
Keysat 5a0bfba6a3 v0.12.0:0 - WhisperX as a one-click dashboard install + managed service
Replaces the manual rsync+build+run with a proper spark-control feature.
This is the first component in the audio path that doesn't require shell
access on Spark 2.

What's in the box
─────────────────
* image/whisperx_container/   - the build context (Dockerfile, requirements,
  app/main.py FastAPI wrapper). Mainline pipeline: faster-whisper for STT +
  pyannote 3.1 for diarization + wav2vec2 forced alignment. Single endpoint
  /v1/audio/transcribe-with-speakers returns the exact same shape spark-
  control's existing endpoint does, so the recap-relay PR spec needs no
  changes when we cut over.

* image/app/whisperx_install.py - install manager. Ships the build context to
  Spark 2 over SSH, runs `docker build`, runs `docker run` with a 40 GB
  memory cap (Sortformer's container was unbounded and thrashed Spark 2 on a
  90-min file), then polls /health until both Whisper and pyannote report
  loaded.

* Audio proxy: /api/audio/transcribe-with-speakers now prefers WhisperX
  when its /health reports diarizer_loaded=true, falls back to the legacy
  Parakeet + Sortformer path otherwise. Same response shape either way.
  Clean cutover, easy rollback (`docker rm whisperx-asr`).

* Dashboard (Audio / Speech tab):
  - "Add WhisperX" banner appears when not installed, with a primary
    "Install WhisperX" button. One click triggers the install.
  - Build progress dialog with phase + elapsed timer + live build log via
    SSE (`/api/whisperx/install/{job_id}/stream`).
  - After install, WhisperX auto-registers as a managed service alongside
    Parakeet and Magpie (Start/Restart/Stop, deep-check, auto-restart).
  - Banner self-hides once /api/whisperx/status reports healthy.

New endpoints
─────────────
  GET  /api/whisperx/status
  POST /api/whisperx/install
  GET  /api/whisperx/install/{job_id}
  GET  /api/whisperx/install/{job_id}/stream  (SSE phase + log)

Config additions (env)
──────────────────────
  WHISPERX_HOST       (defaults to spark2_host)
  WHISPERX_USER       (defaults to spark2_user)
  WHISPERX_CONTAINER  (default: whisperx-asr)
  WHISPERX_PORT       (default: 8002)
  WHISPERX_MODEL      (default: medium; tiny/base/small/medium/large-v3)

Dockerfile
──────────
Added COPY whisperx_container /app/whisperx_container so the runtime
install manager can read the build context from inside the spark-control
image and ship it over SSH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:02:26 -05:00

# WhisperX container for Spark 2
Replaces the custom Parakeet wrapper + Sortformer overlay (v0.10/v0.11) with a
single mainline pipeline:
- **faster-whisper** (CTranslate2-optimized) for STT
- **pyannote.audio 3.1** for speaker diarization (sliding-window — handles
long files in bounded memory, fixes the Sortformer OOM on 90-min audio)
- **wav2vec2 forced alignment** for word-level timestamps

Exposes the same API surface spark-control already proxies to, so the cutover
is a one-URL change in the audio proxy:
- `GET /health` — readiness probe
- `GET /v1/models` — model list
- `POST /v1/audio/transcriptions` — OpenAI-shaped STT
- `POST /v1/audio/transcribe-with-speakers` — merged diarized transcript
(matches spark-control's response shape exactly)
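
A caller can treat the service as ready only once both models report loaded. The sketch below shows one way to interpret a `/health` response body. `diarizer_loaded` is the flag the audio proxy checks; `whisper_loaded` is a hypothetical name for the STT flag, and the overall payload shape is an assumption, since the exact response is not documented here.

```python
import json
from typing import Optional


def is_ready(health_body: Optional[str]) -> bool:
    """Return True only when the /health payload reports both models loaded.

    Assumed payload shape (not confirmed by this README):
        {"whisper_loaded": true, "diarizer_loaded": true}
    """
    if not health_body:
        return False  # no response body: container down or still starting
    try:
        payload = json.loads(health_body)
    except json.JSONDecodeError:
        return False  # a non-JSON body is treated as "not ready"
    if not isinstance(payload, dict):
        return False
    # Require an explicit True for both flags; missing keys count as not loaded.
    return (
        payload.get("whisper_loaded") is True
        and payload.get("diarizer_loaded") is True
    )
```

Requiring an explicit `True` keeps a partially started container (Whisper loaded, pyannote still downloading) from receiving traffic.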
## Deploy to Spark 2
```bash
# 1. Copy this directory to Spark 2
rsync -av --delete image/whisperx_container/ <spark-user>@<spark-2-ip>:~/whisperx-build/
# 2. SSH in and build
ssh <spark-user>@<spark-2-ip>
cd ~/whisperx-build
docker build -t whisperx-asr:latest .
# 3. Run alongside the existing parakeet-asr (which stays on 8000 for now)
docker run -d --restart unless-stopped --name whisperx-asr \
--gpus all --memory=40g \
-p 8002:8002 \
-v whisperx-models:/root/.cache/huggingface \
-e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
-e WHISPER_MODEL=medium \
whisperx-asr:latest
# 4. Watch first-start logs (model load + first health check)
docker logs -f whisperx-asr
```
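
If the `docker run` in step 3 is driven from Python rather than typed by hand, the same flags can be assembled as an argument list. This is a minimal sketch, not the install manager's actual code: `docker_run_args` and its parameters are hypothetical names, and the defaults simply mirror the command above.

```python
import shlex
from typing import List


def docker_run_args(
    image: str = "whisperx-asr:latest",
    name: str = "whisperx-asr",
    port: int = 8002,
    memory_gb: int = 40,
    model: str = "medium",
    hf_token: str = "",
) -> List[str]:
    """Build the argv for the `docker run` in step 3 (hypothetical helper)."""
    return [
        "docker", "run", "-d",
        "--restart", "unless-stopped",
        "--name", name,
        "--gpus", "all",
        f"--memory={memory_gb}g",
        "-p", f"{port}:{port}",
        "-v", "whisperx-models:/root/.cache/huggingface",
        "-e", f"HF_TOKEN={hf_token}",
        "-e", f"WHISPER_MODEL={model}",
        image,
    ]


if __name__ == "__main__":
    # shlex.join quotes each argument so the printed line is safe to paste.
    print(shlex.join(docker_run_args(hf_token="<your-token>")))
```

Passing the list to `subprocess.run` without `shell=True` avoids quoting problems with the token value.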
## Model size knobs
Set via the `WHISPER_MODEL` env var. Defaults to `medium`. Options:

| Model | Parameters | Speed (GB10) | Quality |
|---|---|---|---|
| `tiny` | ~39M | ~120x rt | low |
| `base` | ~74M | ~80x rt | ok |
| `small` | ~244M | ~50x rt | good |
| `medium` | ~769M | ~30x rt | excellent (**default**) |
| `large-v3` | ~1.5B | ~15x rt | best |

For a 90-min file, `medium` takes ~3 min for STT plus ~9 min for diarization, about 12 min total.
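
The same arithmetic can be applied to other durations and models. The sketch below uses the real-time factors from the table and assumes diarization scales linearly at roughly 10x real time (9 min for a 90-min file); that linear scaling is an assumption, not a measured result, and `estimate_minutes` is a hypothetical helper.

```python
# Real-time factors from the table ("30x rt" = 30 min of audio per minute).
STT_RT_FACTOR = {
    "tiny": 120.0,
    "base": 80.0,
    "small": 50.0,
    "medium": 30.0,
    "large-v3": 15.0,
}

# Assumption: ~9 min of diarization per 90 min of audio, i.e. about 10x
# real time, scaling linearly with duration.
DIARIZE_RT_FACTOR = 10.0


def estimate_minutes(audio_minutes: float, model: str = "medium") -> float:
    """Rough wall-clock estimate: STT time plus diarization time, in minutes."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    stt = audio_minutes / STT_RT_FACTOR[model]
    diarize = audio_minutes / DIARIZE_RT_FACTOR
    return stt + diarize
```

Because diarization dominates, switching from `medium` to `large-v3` adds only the extra STT time under these assumptions.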
## Memory budget
The `--memory=40g` cap is intentional. Spark 2 has 122 GB of unified memory,
of which ~35 GB is consumed by parakeet-asr + magpie-tts. The 40 GB cap leaves
comfortable headroom for both the model weights (~5 GB) and pyannote's
in-memory features (~5-15 GB for a 90-min audio file). If WhisperX hits a
pathological input, it gets OOM-killed cleanly instead of swap-thrashing the
whole Spark, which is the symptom we hit with the unbounded Sortformer container.
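
The budget can be checked with simple arithmetic. This sketch uses only the approximate totals quoted in this section (122 GB unified, ~35 GB for the other services, the 40 GB cap, ~5 GB of weights); the function names are hypothetical.

```python
# Figures taken from this section; all are approximate.
TOTAL_GB = 122          # unified memory on Spark 2
OTHER_SERVICES_GB = 35  # parakeet-asr + magpie-tts
WHISPERX_CAP_GB = 40    # the --memory=40g limit
MODEL_WEIGHTS_GB = 5    # WhisperX model weights


def host_headroom_gb(
    total: int = TOTAL_GB,
    others: int = OTHER_SERVICES_GB,
    cap: int = WHISPERX_CAP_GB,
) -> int:
    """Memory left for the OS and anything else if WhisperX uses its full cap."""
    return total - others - cap


def working_budget_gb(
    cap: int = WHISPERX_CAP_GB,
    weights: int = MODEL_WEIGHTS_GB,
) -> int:
    """Memory inside the cap left for activations and diarization features."""
    return cap - weights


if __name__ == "__main__":
    print(f"host headroom: ~{host_headroom_gb()} GB")
    print(f"working budget inside the cap: ~{working_budget_gb()} GB")
```

Even with WhisperX at its cap, roughly 47 GB remains on the host, so an OOM kill stays contained to the one container.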
## Rollback to Parakeet+Sortformer
```bash
docker stop whisperx-asr && docker rm whisperx-asr
```
The parakeet-asr container stays running throughout — spark-control's proxy
URL switch is reversible via config or version downgrade.
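
Per the release notes for this change, the audio proxy prefers WhisperX when its `/health` reports `diarizer_loaded=true` and otherwise falls back to the legacy Parakeet + Sortformer path, which is why removing the container is enough to roll back. A minimal sketch of that selection rule follows; the function and constant names are hypothetical, and the real proxy code may differ.

```python
from typing import Mapping, Optional

WHISPERX = "whisperx"
LEGACY = "parakeet+sortformer"


def choose_backend(whisperx_health: Optional[Mapping[str, object]]) -> str:
    """Pick the transcription backend from WhisperX's parsed /health payload.

    Pass None when the health request failed (container stopped or removed).
    """
    if whisperx_health is not None and whisperx_health.get("diarizer_loaded") is True:
        return WHISPERX
    # Unreachable, still loading, or diarizer missing: use the legacy path.
    return LEGACY
```

After `docker rm whisperx-asr`, the health request fails, `choose_backend(None)` returns the legacy path, and the response shape seen by clients stays the same.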