v0.10.0:0 - speaker diarization via Sortformer + merged transcribe-with-speakers

Adds a new pipeline for diarized transcription that any client (recap-relay, ad-hoc curl, future Mac-side tools) can call. Pure data pipeline, no LLM or UI included; name resolution and analysis happen downstream, where prompts and rendering are configurable.

Architecture:

  Spark 2 / parakeet-asr container:
    + /opt/parakeet/app/diarizer.py (new: SortformerDiarizer class)
    + /opt/parakeet/app/main.py (patched: loads diarizer, adds /v1/audio/diarize endpoint)
    Model: nvidia/diar_sortformer_4spk-v1 (~150 MB, ungated, NeMo native)

  Spark Control:
    + POST /api/audio/transcribe-with-speakers
        Body: multipart file
        Returns: { duration, language, speakers_detected, segments: [{start_ms, end_ms, speaker, text}, ...], models: {transcription, diarization} }
      Runs Parakeet ASR + Sortformer in parallel, merges words to speaker turns by timestamp, groups into speaker-change blocks (breaks also on >1.5s silence gaps).
    + If Parakeet 500s mid-pipeline, kicks deep-health probe and returns 503/Retry-After: 60 (same wedge-recovery pattern as v0.9.0:2).

Apply Sortformer patches to the running Parakeet container with:

  bash image/parakeet_patches/apply.sh <spark2-host> <ssh-user>

Patches are reversible: apply.sh backs up the original main.py inside the container at main.py.pre-sortformer before overwriting. Restore by copying that file back and removing diarizer.py, then docker restart.

v0.11 follow-up: dashboard "Speech Models" panel to swap/update model versions from the UI instead of needing to re-run apply.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
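
Since name resolution is left to the caller, a downstream client might turn the response into readable lines roughly as follows. This is only a sketch: render_transcript, format_ms, and the names mapping are hypothetical helpers that are not part of this commit, and the sample payload simply mirrors the documented response shape.

```python
from typing import Optional


def format_ms(ms: int) -> str:
    """Render a millisecond offset as MM:SS (hypothetical helper)."""
    total_s = ms // 1000
    return f"{total_s // 60:02d}:{total_s % 60:02d}"


def render_transcript(response: dict, names: Optional[dict] = None) -> list:
    """Turn a transcribe-with-speakers response into display lines.

    `names` maps anonymous labels (e.g. "Speaker_0") to real names; any
    label without a mapping is shown unchanged.
    """
    names = names or {}
    lines = []
    for seg in response.get("segments", []):
        label = names.get(seg["speaker"], seg["speaker"])
        lines.append(f"[{format_ms(seg['start_ms'])}] {label}: {seg['text']}")
    return lines


# Sample payload in the documented shape (values are illustrative only).
sample = {
    "duration": 90.5,
    "language": "en",
    "speakers_detected": ["Speaker_0", "Speaker_1"],
    "segments": [
        {"start_ms": 39308, "end_ms": 51000,
         "speaker": "Speaker_0", "text": "good morning i think we can start"},
        {"start_ms": 51200, "end_ms": 53000,
         "speaker": "Speaker_1", "text": "sounds good"},
    ],
}

for line in render_transcript(sample, names={"Speaker_0": "Alice"}):
    print(line)
```

Keeping the mapping on the client side means the same stored response can be re-rendered whenever a speaker's name is corrected, without re-running ASR or diarization.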
2026-05-18 15:14:48 -05:00
parent 197655a62b
commit 713cd09cc2
6 changed files with 659 additions and 2 deletions
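
A client of the new endpoint may want to honor the 503 / Retry-After contract described in the commit message. The sketch below shows one way to do that under stated assumptions: call_with_retry and the send callable are hypothetical and not part of this commit, and the transport is abstracted away so the example runs without a network.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, dict, bytes]],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    default_wait_s: float = 60.0,
) -> Tuple[int, bytes]:
    """Call `send` until it stops returning 503 or attempts run out.

    `send` is a hypothetical callable that performs one HTTP request and
    returns (status_code, headers, body). On a 503, wait for the number of
    seconds given in Retry-After (falling back to `default_wait_s` when the
    header is missing or not a plain number), then try again.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts:
            return status, body
        try:
            wait_s = float(headers.get("Retry-After", default_wait_s))
        except ValueError:
            # Retry-After may also be an HTTP date; this sketch ignores that form.
            wait_s = default_wait_s
        sleep(wait_s)
    raise AssertionError("unreachable")


# Usage with a fake transport: first call is a wedge (503), second succeeds.
responses = iter([
    (503, {"Retry-After": "60"}, b"wedged"),
    (200, {}, b"ok"),
])
waits = []
result = call_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Injecting sleep keeps the helper testable; a real client would pass a function that performs the multipart POST and leave sleep at its default.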
@@ -209,4 +209,180 @@ def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
            raise HTTPException(r.status_code, r.text[:500])
        return Response(content=r.content, media_type=r.headers.get("content-type", "application/json"))
    # ---- /api/audio/transcribe-with-speakers (STT + diarization, merged) ----
    @router.post("/api/audio/transcribe-with-speakers")
    async def transcribe_with_speakers(
        file: UploadFile = File(...),
    ) -> dict:
        """Diarized transcription: run Parakeet ASR and Sortformer diarization on
        the same audio in parallel, then merge by timestamp.
        Response shape (designed for downstream UIs like recap-relay):
            {
              "duration": 90.5,
              "language": "en",
              "speakers_detected": ["Speaker_0", "Speaker_1"],
              "segments": [
                {"start_ms": 39308, "end_ms": 51000,
                 "speaker": "Speaker_0", "text": "good morning i think..."},
                ...
              ],
              "models": {
                "transcription": "parakeet-tdt-0.6b-v3",
                "diarization":   "nvidia/diar_sortformer_4spk-v1"
              }
            }
        Each segment is a block of consecutive words by the same speaker. Speaker
        labels are anonymous (Speaker_0, Speaker_1, ...) — name resolution is the
        caller's responsibility (LLM analysis with optional participant hints,
        or manual mapping UI).
        """
        body = await file.read()
        if not body:
            raise HTTPException(400, "Empty file")
        filename = file.filename or "audio.wav"
        content_type = file.content_type or "application/octet-stream"
        async def _call_transcribe(client: httpx.AsyncClient) -> dict:
            files = {"file": (filename, body, content_type)}
            data = {"response_format": "verbose_json"}
            r = await client.post(
                f"{_parakeet_base()}/v1/audio/transcriptions",
                files=files, data=data,
            )
            r.raise_for_status()
            return r.json()
        async def _call_diarize(client: httpx.AsyncClient) -> dict:
            files = {"file": (filename, body, content_type)}
            r = await client.post(
                f"{_parakeet_base()}/v1/audio/diarize",
                files=files,
            )
            r.raise_for_status()
            return r.json()
        # Run both in parallel against the same Parakeet container — Sortformer
        # and Parakeet ASR are independent forward passes that share the GPU.
        try:
            async with httpx.AsyncClient(timeout=600.0) as client:
                stt, diar = await asyncio.gather(
                    _call_transcribe(client),
                    _call_diarize(client),
                )
        except httpx.HTTPStatusError as e:
            # Surface upstream errors. If transcribe wedged, kick deep-health.
            if e.response.status_code == 500 and deep_health is not None:
                try:
                    asyncio.create_task(deep_health.run_one("parakeet"))
                except Exception:
                    pass
                raise HTTPException(
                    status_code=503,
                    detail="Parakeet transient error (likely CUDA wedge). Auto-restart triggered; retry in ~60s.",
                    headers={"Retry-After": "60"},
                )
            raise HTTPException(e.response.status_code, e.response.text[:500])
        except httpx.HTTPError as e:
            raise HTTPException(502, f"parakeet unreachable: {e}")
        merged = _merge_words_with_speakers(
            words=stt.get("words", []),
            diar_turns=diar.get("segments", []),
        )
        return {
            "duration": stt.get("duration") or diar.get("duration") or 0.0,
            "language": stt.get("language", "en"),
            "speakers_detected": diar.get("speakers_detected", []),
            "segments": merged,
            "models": {
                "transcription": stt.get("model") if isinstance(stt.get("model"), str) else "parakeet",
                "diarization": diar.get("model", "sortformer"),
            },
        }
    return router
 # ---- Merge helper: assign speaker to each word, then group into blocks ----
 def _assign_speaker_to_word(word_start_s: float, word_end_s: float, diar_turns: list[dict]) -> str:
    """Find the diarization turn that contains this word, or has the most
    overlap with it. Returns the speaker label, or 'Speaker_unknown' if no
    turn overlaps at all."""
    word_mid = (word_start_s + word_end_s) / 2.0
    # Fast path: find the turn containing the midpoint
    for t in diar_turns:
        if t["start_s"] <= word_mid <= t["end_s"]:
            return t["speaker"]
    # Slow path: pick the turn with max overlap with the word's span
    best_speaker = "Speaker_unknown"
    best_overlap = 0.0
    for t in diar_turns:
        overlap = max(0.0, min(word_end_s, t["end_s"]) - max(word_start_s, t["start_s"]))
        if overlap > best_overlap:
            best_overlap = overlap
            best_speaker = t["speaker"]
    return best_speaker
 def _merge_words_with_speakers(words: list[dict], diar_turns: list[dict]) -> list[dict]:
    """Group consecutive same-speaker words into blocks.
    Each input word: {"start": float_s, "end": float_s, "text": str}  (Parakeet
    verbose_json format; values are seconds).
    Each input turn: {"start_s": float, "end_s": float, "speaker": str}.
    Output: [{"start_ms": int, "end_ms": int, "speaker": str, "text": str}, ...]
    Also breaks a block on a long silence gap (>1.5 s) even within the same
    speaker — keeps blocks readable in UI rendering.
    """
    if not words:
        return []
    SILENCE_BREAK_S = 1.5
    blocks: list[dict] = []
    cur_words: list[str] = []
    cur_speaker: Optional[str] = None
    cur_start_s: Optional[float] = None
    cur_end_s: Optional[float] = None
    for w in words:
        ws = float(w.get("start", 0.0))
        we = float(w.get("end", ws))
        wt = str(w.get("text", ""))
        spk = _assign_speaker_to_word(ws, we, diar_turns)
        is_new_block = (
            cur_speaker is None
            or spk != cur_speaker
            or (cur_end_s is not None and ws - cur_end_s > SILENCE_BREAK_S)
        )
        if is_new_block:
            if cur_speaker is not None:
                blocks.append({
                    "start_ms": int(cur_start_s * 1000),
                    "end_ms": int(cur_end_s * 1000),
                    "speaker": cur_speaker,
                    "text": " ".join(w.strip() for w in cur_words if w.strip()),
                })
            cur_words = [wt]
            cur_speaker = spk
            cur_start_s = ws
            cur_end_s = we
        else:
            cur_words.append(wt)
            cur_end_s = we
    if cur_speaker is not None and cur_words:
        blocks.append({
            "start_ms": int(cur_start_s * 1000),
            "end_ms": int(cur_end_s * 1000),
            "speaker": cur_speaker,
            "text": " ".join(w.strip() for w in cur_words if w.strip()),
        })
    return blocks
@@ -0,0 +1,54 @@
 #!/bin/bash
 # Apply Sortformer diarization patches to a running parakeet-asr container.
 #
 # Run from the spark-control repo root on the laptop:
 #   bash image/parakeet_patches/apply.sh <spark2-host> <ssh-user>
 #
 # What it does:
 #   1. Backs up the current /opt/parakeet/app/main.py inside the container
 #      (writable layer; survives docker restart but NOT docker rm).
 #   2. Copies the patched main.py + new diarizer.py into the container.
 #   3. Restarts the container so the new code + Sortformer model load.
 #
 # Reversibility:
 #   - The backup of main.py is at /opt/parakeet/app/main.py.pre-sortformer
 #     inside the container. Restore with:
 #       docker exec parakeet-asr cp /opt/parakeet/app/main.py.pre-sortformer /opt/parakeet/app/main.py
 #       docker exec parakeet-asr rm -f /opt/parakeet/app/diarizer.py
 #       docker restart parakeet-asr
 #   - If the container is ever `docker rm`'d (volume rebuild), re-run this
 #     script. We will eventually fold this into spark-control as an action.
 set -e
 HOST="${1:?usage: apply.sh <spark2-host> <ssh-user>}"
 USER="${2:?usage: apply.sh <spark2-host> <ssh-user>}"
 CONTAINER="${CONTAINER:-parakeet-asr}"
 REPO_DIR="$(cd "$(dirname "$0")" && pwd)"
 echo "→ Backing up current main.py inside ${CONTAINER}..."
 ssh "${USER}@${HOST}" "docker exec ${CONTAINER} sh -c \
  'test -f /opt/parakeet/app/main.py.pre-sortformer || cp /opt/parakeet/app/main.py /opt/parakeet/app/main.py.pre-sortformer'"
 echo "→ Copying diarizer.py into container..."
 ssh "${USER}@${HOST}" "docker exec -i ${CONTAINER} sh -c \
  'cat > /opt/parakeet/app/diarizer.py'" < "${REPO_DIR}/diarizer.py"
 echo "→ Copying patched main.py into container..."
 ssh "${USER}@${HOST}" "docker exec -i ${CONTAINER} sh -c \
  'cat > /opt/parakeet/app/main.py'" < "${REPO_DIR}/main.py"
 echo "→ Verifying syntax inside container..."
 ssh "${USER}@${HOST}" "docker exec ${CONTAINER} python3 -c \
  'import ast; ast.parse(open(\"/opt/parakeet/app/diarizer.py\").read()); ast.parse(open(\"/opt/parakeet/app/main.py\").read()); print(\"py OK\")'"
 echo "→ Restarting ${CONTAINER}..."
 ssh "${USER}@${HOST}" "docker restart ${CONTAINER}"
 echo
 echo "✔ Patches applied. Sortformer model (~150 MB) will download on first load — wait ~30s before testing."
 echo
 echo "Test once it's ready:"
 echo "  curl -sS http://${HOST}:8000/health"
 echo "  curl -sS -X POST http://${HOST}:8000/v1/audio/diarize -F file=@some-audio.mp3 | head -c 500"
@@ -0,0 +1,164 @@
 """Speaker diarization via NVIDIA NeMo Sortformer.
 This module is dropped into the Parakeet container at /opt/parakeet/app/diarizer.py
 and loaded alongside the existing ASR model. The Sortformer model identifies who
 is speaking when in an audio file, output as a list of {start_s, end_s, speaker}
 turns. It does NOT transcribe — pair its output with Parakeet's word-level
 timestamps to produce a diarized transcript.
 Model: nvidia/diar_sortformer_4spk-v1 (~150 MB, NeMo ecosystem, ungated)
 Memory: adds ~200 MB to the running container. Same GPU as Parakeet (Spark 2
 unified GB10). No interference with Parakeet inference because they're called
 on separate code paths and CUDA handles concurrent kernels.
 """
 import io
 import os
 import logging
 import tempfile
 import subprocess
 from pathlib import Path
 from typing import Optional
 import torch
 import soundfile as sf
 import numpy as np
 logger = logging.getLogger(__name__)
 DIARIZER_MODEL = os.getenv("DIARIZER_MODEL", "nvidia/diar_sortformer_4spk-v1")
 TARGET_SAMPLE_RATE = 16000
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 def _convert_to_wav_16k_mono(audio_bytes: bytes, original_filename: str) -> str:
    """Same conversion as transcriber.py — keeps a uniform input format
    for the diarizer regardless of upload mime type."""
    suffix = Path(original_filename).suffix.lower() if original_filename else ".wav"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp_in:
        tmp_in.write(audio_bytes)
        tmp_in_path = tmp_in.name
    tmp_out_path = tmp_in_path + ".converted.wav"
    try:
        cmd = ["ffmpeg", "-y", "-i", tmp_in_path, "-ac", "1", "-ar", "16000",
               "-sample_fmt", "s16", "-f", "wav", tmp_out_path]
        result = subprocess.run(cmd, capture_output=True, timeout=300)
        if result.returncode != 0:
            raise RuntimeError(f"ffmpeg failed: {result.stderr.decode()[:500]}")
        return tmp_out_path
    finally:
        try: os.unlink(tmp_in_path)
        except OSError: pass
 def _parse_sortformer_segments(raw_output) -> list[dict]:
    """Sortformer.diarize() returns List[List[str]] where each inner list is
    per-file results: each entry is a space-separated 'start_s end_s speaker_label'
    triplet (e.g., '0.00 4.50 speaker_0'). Normalize to our canonical format."""
    if not raw_output:
        return []
    # Single-file invocation → take first inner list
    entries = raw_output[0] if isinstance(raw_output, list) and raw_output and isinstance(raw_output[0], list) else raw_output
    segments = []
    for entry in entries:
        if not entry:
            continue
        if isinstance(entry, str):
            parts = entry.strip().split()
            if len(parts) >= 3:
                try:
                    start = float(parts[0])
                    end = float(parts[1])
                    speaker_raw = parts[2]
                    # Normalize "speaker_0" / "spk_0" / "0" → "Speaker_0"
                    if speaker_raw.lower().startswith(("speaker_", "spk_")):
                        idx = speaker_raw.split("_", 1)[1]
                    else:
                        # Bare index ("0") or an unrecognized label: use it as-is.
                        idx = speaker_raw
                    segments.append({
                        "start_s": start,
                        "end_s": end,
                        "speaker": f"Speaker_{idx}",
                    })
                except (ValueError, IndexError) as e:
                    logger.warning(f"unparsable sortformer entry: {entry!r} ({e})")
                    continue
    return segments
 class SortformerDiarizer:
    def __init__(self):
        self.model = None
        self._loaded = False
    def load_model(self):
        if self._loaded:
            return
        logger.info(f"Loading diarizer {DIARIZER_MODEL} on {DEVICE}...")
        from nemo.collections.asr.models import SortformerEncLabelModel
        self.model = SortformerEncLabelModel.from_pretrained(DIARIZER_MODEL)
        self.model.eval()
        if DEVICE == "cuda":
            self.model = self.model.cuda()
        self._loaded = True
        logger.info(f"Diarizer loaded on {DEVICE}")
    def diarize(self, audio_bytes: bytes, filename: str = "audio.wav") -> dict:
        """Run diarization on a single audio file.
        Returns:
            {
              "segments": [{"start_s": float, "end_s": float, "speaker": str}, ...],
              "speakers_detected": ["Speaker_0", "Speaker_1", ...],
              "duration": float,
              "model": str,
              "device": str,
            }
        Speaker labels are zero-indexed strings like "Speaker_0", "Speaker_1",
        etc. They are NOT real names — that mapping happens downstream via LLM
        analysis or manual UI correction.
        """
        if not self._loaded:
            self.load_model()
        if not audio_bytes:
            raise ValueError("empty audio")
        wav_path = None
        try:
            wav_path = _convert_to_wav_16k_mono(audio_bytes, filename)
            # Read only the file header for the duration; the samples are not needed here.
            duration = sf.info(wav_path).duration
            logger.info(f"Diarizing {duration:.1f}s of audio ({filename})")
            with torch.no_grad():
                raw = self.model.diarize(
                    audio=[wav_path],
                    batch_size=1,
                    verbose=False,
                )
            segments = _parse_sortformer_segments(raw)
            speakers = sorted({s["speaker"] for s in segments})
            logger.info(f"Detected {len(speakers)} speakers across {len(segments)} turns")
            if DEVICE == "cuda":
                torch.cuda.empty_cache()
            return {
                "segments": segments,
                "speakers_detected": speakers,
                "duration": round(duration, 3),
                "model": DIARIZER_MODEL,
                "device": DEVICE,
            }
        finally:
            if wav_path:
                try: os.unlink(wav_path)
                except OSError: pass
 diarizer = SortformerDiarizer()
@@ -0,0 +1,158 @@
 import os
 import time
 import logging
 from contextlib import asynccontextmanager
 from typing import Optional
 import torch
 from fastapi import FastAPI, File, Form, UploadFile, HTTPException
 from fastapi.responses import JSONResponse
 from fastapi.middleware.cors import CORSMiddleware
 from app.transcriber import transcriber, MODEL_NAME, DEVICE
 from app.diarizer import diarizer, DIARIZER_MODEL
 logging.basicConfig(level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
 logger = logging.getLogger("parakeet-api")
@asynccontextmanager
 async def lifespan(app: FastAPI):
    logger.info(f"Loading ASR model {MODEL_NAME} on {DEVICE}")
    transcriber.load_model()
    logger.info("ASR model ready")
    logger.info(f"Loading diarizer {DIARIZER_MODEL} on {DEVICE}")
    diarizer.load_model()
    logger.info("Diarizer ready")
    yield
 app = FastAPI(title="Parakeet ASR + Sortformer Diarization API", version="1.2.0", lifespan=lifespan)
 app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_credentials=True,
                   allow_methods=["*"], allow_headers=["*"])
@app.get("/")
 async def root():
    return {"service": "parakeet-asr", "model": MODEL_NAME, "diarizer": DIARIZER_MODEL, "device": DEVICE,
            "endpoints": {"transcribe": "/v1/audio/transcriptions",
                         "diarize": "/v1/audio/diarize",
                         "models": "/v1/models", "health": "/health"}}
@app.get("/health")
 async def health():
    return {"status": "ready" if (transcriber._loaded and diarizer._loaded) else "loading",
            "asr_loaded": transcriber._loaded,
            "diarizer_loaded": diarizer._loaded,
            "model": MODEL_NAME,
            "diarizer_model": DIARIZER_MODEL,
            "device": DEVICE}
@app.get("/v1/models")
 async def list_models():
    return {"object": "list", "data": [
        {"id": "parakeet-tdt-0.6b-v3", "object": "model", "owned_by": "nvidia", "kind": "stt"},
        {"id": "whisper-1", "object": "model", "owned_by": "nvidia", "kind": "stt"},
        {"id": DIARIZER_MODEL.split("/")[-1], "object": "model", "owned_by": "nvidia", "kind": "diarization"}]}
@app.post("/v1/audio/transcriptions")
 async def transcribe(
    file: UploadFile = File(...),
    model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
    language: Optional[str] = Form(default=None),
    response_format: Optional[str] = Form(default="json"),
    temperature: Optional[float] = Form(default=0.0),
    prompt: Optional[str] = Form(default=None),
 ):
    if not transcriber._loaded:
        raise HTTPException(status_code=503, detail="Model loading")
    audio_bytes = await file.read()
    if len(audio_bytes) == 0:
        raise HTTPException(status_code=400, detail="Empty file")
    max_size = int(os.getenv("MAX_UPLOAD_MB", "200")) * 1024 * 1024
    if len(audio_bytes) > max_size:
        raise HTTPException(status_code=413, detail="File too large")
    want_timestamps = response_format == "verbose_json"
    start_time = time.time()
    try:
        result = transcriber.transcribe(
            audio_bytes, file.filename, language, timestamps=want_timestamps
        )
    except Exception as e:
        logger.exception("Transcription failed")
        raise HTTPException(status_code=500, detail=f"Failed: {e}")
    elapsed = time.time() - start_time
    duration = result.get("duration", 0)
    rtfx = duration / elapsed if elapsed > 0 else 0
    logger.info(f"Done: {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt)")
    if response_format == "text":
        return JSONResponse(content=result["text"], media_type="text/plain")
    if response_format == "verbose_json":
        return {
            "task": "transcribe",
            "language": language or "en",
            "duration": duration,
            "text": result["text"],
            "segments": result.get("segments", []),
            "words": result.get("words", []),
        }
    return {"text": result["text"]}
@app.post("/v1/audio/translations")
 async def translate(file: UploadFile = File(...),
    model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
    language: Optional[str] = Form(default=None),
    response_format: Optional[str] = Form(default="json")):
    return await transcribe(file=file, model=model, language=language,
                            response_format=response_format)
@app.post("/v1/audio/diarize")
 async def diarize(
    file: UploadFile = File(...),
 ):
    """Speaker diarization via Sortformer.
    Returns who-spoke-when as a list of turns. Does NOT transcribe — pair this
    output with /v1/audio/transcriptions (verbose_json) and merge by timestamp
    to produce a diarized transcript.
    Response shape:
        {
          "segments": [{"start_s": 0.00, "end_s": 4.50, "speaker": "Speaker_0"}, ...],
          "speakers_detected": ["Speaker_0", "Speaker_1"],
          "duration": 90.5,
          "model": "nvidia/diar_sortformer_4spk-v1",
          "device": "cuda"
        }
    """
    if not diarizer._loaded:
        raise HTTPException(status_code=503, detail="Diarizer loading")
    audio_bytes = await file.read()
    if len(audio_bytes) == 0:
        raise HTTPException(status_code=400, detail="Empty file")
    max_size = int(os.getenv("MAX_UPLOAD_MB", "200")) * 1024 * 1024
    if len(audio_bytes) > max_size:
        raise HTTPException(status_code=413, detail="File too large")
    start_time = time.time()
    try:
        result = diarizer.diarize(audio_bytes, file.filename or "audio.wav")
    except Exception as e:
        logger.exception("Diarization failed")
        raise HTTPException(status_code=500, detail=f"Failed: {e}")
    elapsed = time.time() - start_time
    duration = result.get("duration", 0)
    rtfx = duration / elapsed if elapsed > 0 else 0
    logger.info(f"Diarized {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt), "
                f"{len(result['speakers_detected'])} speakers, {len(result['segments'])} turns")
    return result
@@ -0,0 +1,105 @@
 import os
 import time
 import logging
 from contextlib import asynccontextmanager
 from typing import Optional
 import torch
 from fastapi import FastAPI, File, Form, UploadFile, HTTPException
 from fastapi.responses import JSONResponse
 from fastapi.middleware.cors import CORSMiddleware
 from app.transcriber import transcriber, MODEL_NAME, DEVICE
 logging.basicConfig(level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
 logger = logging.getLogger("parakeet-api")
@asynccontextmanager
 async def lifespan(app: FastAPI):
    logger.info(f"Loading model {MODEL_NAME} on {DEVICE}")
    transcriber.load_model()
    logger.info("Model ready")
    yield
 app = FastAPI(title="Parakeet ASR API", version="1.1.0", lifespan=lifespan)
 app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_credentials=True,
                   allow_methods=["*"], allow_headers=["*"])
@app.get("/")
 async def root():
    return {"service": "parakeet-asr", "model": MODEL_NAME, "device": DEVICE,
            "endpoints": {"transcribe": "/v1/audio/transcriptions",
                         "models": "/v1/models", "health": "/health"}}
@app.get("/health")
 async def health():
    return {"status": "ready" if transcriber._loaded else "loading",
            "model": MODEL_NAME, "device": DEVICE}
@app.get("/v1/models")
 async def list_models():
    return {"object": "list", "data": [
        {"id": "parakeet-tdt-0.6b-v3", "object": "model", "owned_by": "nvidia"},
        {"id": "whisper-1", "object": "model", "owned_by": "nvidia"}]}
@app.post("/v1/audio/transcriptions")
 async def transcribe(
    file: UploadFile = File(...),
    model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
    language: Optional[str] = Form(default=None),
    response_format: Optional[str] = Form(default="json"),
    temperature: Optional[float] = Form(default=0.0),
    prompt: Optional[str] = Form(default=None),
 ):
    if not transcriber._loaded:
        raise HTTPException(status_code=503, detail="Model loading")
    audio_bytes = await file.read()
    if len(audio_bytes) == 0:
        raise HTTPException(status_code=400, detail="Empty file")
    max_size = int(os.getenv("MAX_UPLOAD_MB", "200")) * 1024 * 1024
    if len(audio_bytes) > max_size:
        raise HTTPException(status_code=413, detail=f"File too large")
    want_timestamps = response_format == "verbose_json"
    start_time = time.time()
    try:
        result = transcriber.transcribe(
            audio_bytes, file.filename, language, timestamps=want_timestamps
        )
    except Exception as e:
        logger.exception("Transcription failed")
        raise HTTPException(status_code=500, detail=f"Failed: {e}")
    elapsed = time.time() - start_time
    duration = result.get("duration", 0)
    rtfx = duration / elapsed if elapsed > 0 else 0
    logger.info(f"Done: {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt)")
    if response_format == "text":
        return JSONResponse(content=result["text"], media_type="text/plain")
    if response_format == "verbose_json":
        return {
            "task": "transcribe",
            "language": language or "en",
            "duration": duration,
            "text": result["text"],
            "segments": result.get("segments", []),
            "words": result.get("words", []),
        }
    return {"text": result["text"]}
@app.post("/v1/audio/translations")
 async def translate(file: UploadFile = File(...),
    model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
    language: Optional[str] = Form(default=None),
    response_format: Optional[str] = Form(default="json")):
    return await transcribe(file=file, model=model, language=language,
                            response_format=response_format)
@@ -1,10 +1,10 @@
 import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
 export const v0_1_0 = VersionInfo.of({
-  version: '0.9.0:2',
+  version: '0.10.0:0',
  releaseNotes: {
    en_US:
-      'v0.9.0:2 — Open WebUI voice mode UX fix. Parakeet has a recurring CUDA wedge (CUBLAS_STATUS_*_ERROR mid-attention) that fires reliably on Open WebUI\'s WebM/Opus→MP3 audio. Previously the proxy just relayed the upstream 500, Open WebUI showed "Server connection error", and you had to wait up to 5 min for the periodic deep-health probe to detect+restart Parakeet. Now: when Parakeet returns 500, the proxy fires deep-health\'s probe immediately in the background (which contains the same wedge-detect + rate-limited auto-restart logic) and returns 503 with Retry-After: 60 instead. The client gets a clear retry signal and the auto-restart kicks in within seconds. Retrying ~60s later should succeed reliably.',
+      'v0.10.0 — Speaker diarization. Spark Control now offers a merged transcription + diarization endpoint at POST /api/audio/transcribe-with-speakers. Returns the spoken text broken into blocks with anonymous speaker labels (Speaker_0, Speaker_1, ...) and millisecond timestamps — designed as input for downstream apps (recap-relay, custom UIs) that handle speaker→name mapping and LLM analysis with their own configurable prompts. Diarization runs via NVIDIA NeMo Sortformer (nvidia/diar_sortformer_4spk-v1), loaded alongside Parakeet ASR inside the existing parakeet-asr container on Spark 2 — no new infrastructure, ~150 MB model addition. A new /v1/audio/diarize endpoint is also exposed on Parakeet directly for clients that just want speaker turns. Apply Sortformer patches via image/parakeet_patches/apply.sh after install. v0.11 will add a Speech Models dashboard panel for in-UI model swap/update.',
  },
  migrations: {
    up: async ({ effects }) => {},