v0.13.0:1 - per-chunk diarization worker with TitaNet voice fingerprints
Spark Control now exposes a per-chunk worker designed for Recap Relay
to orchestrate against. Recap Relay does the chunking + global speaker
clustering (consistent with how it already handles the Gemini path);
Spark Control handles the GPU-bound per-chunk work.
Parakeet container:
- diarizer.py: now also loads NVIDIA TitaNet speaker-verification model
(~25 MB, NeMo-native, no torchaudio). New diarize_chunk() method
runs Sortformer + extracts one 192-dim voice fingerprint per detected
local speaker (concatenating each speaker's audio across the chunk
and running TitaNet's get_embedding).
- main.py: new POST /v1/audio/diarize-chunk endpoint that returns
segments + speakers_detected + fingerprints + models in one shot.
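The per-speaker concatenation described for diarize_chunk() can be pictured with a small sketch. It is illustrative only, not the code in diarizer.py: `concat_speaker_audio` is a hypothetical helper, the toy sample rate is chosen so the slices are easy to read, and the TitaNet get_embedding call that would consume each speaker's audio is left out.

```python
from collections import defaultdict


def concat_speaker_audio(samples, sample_rate, segments):
    """Group one chunk's mono samples by LOCAL speaker label.

    segments use chunk-relative seconds: start_s, end_s, speaker.
    """
    per_speaker = defaultdict(list)
    for seg in sorted(segments, key=lambda s: s["start_s"]):
        start = max(0, int(seg["start_s"] * sample_rate))
        end = min(len(samples), int(seg["end_s"] * sample_rate))
        if end > start:
            # Append this turn to whatever the speaker already said in the chunk.
            per_speaker[seg["speaker"]].extend(samples[start:end])
    return dict(per_speaker)


# Toy chunk: 2 seconds at 4 Hz, so sample values double as sample indices.
samples = list(range(8))
segments = [
    {"start_s": 0.0, "end_s": 0.5, "speaker": "Speaker_0"},
    {"start_s": 0.5, "end_s": 1.5, "speaker": "Speaker_1"},
    {"start_s": 1.5, "end_s": 2.0, "speaker": "Speaker_0"},
]
audio_by_speaker = concat_speaker_audio(samples, 4, segments)
```

Each value in `audio_by_speaker` is what a single embedding call per local speaker would receive.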
Spark Control:
- new POST /api/audio/diarize-chunk that proxies to parakeet's new
endpoint. Same CUDA-wedge recovery as the other audio endpoints: an
upstream 500 triggers a deep-health probe and is returned to the caller
as 503 with Retry-After: 60. Otherwise returns the upstream JSON
unchanged; Recap Relay is the consumer, so no merging is needed.
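On the caller side, the 503 + Retry-After behaviour suggests a small retry loop. The sketch below is a stdlib-only assumption about how a client might handle it: `post_with_retry` and its `send` callable are hypothetical names, and the transport is faked so the example runs without a server.

```python
import time


def post_with_retry(send, max_attempts=3, default_wait_s=60.0, sleep=time.sleep):
    """Call send() until it returns a non-503 status or attempts run out.

    send: hypothetical zero-argument callable returning (status, headers, body).
    headers is a plain dict here; a real HTTP client exposes case-insensitive headers.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts:
            return status, body
        # Honor the server's Retry-After hint (seconds); fall back if malformed.
        try:
            wait_s = float(headers.get("Retry-After", default_wait_s))
        except ValueError:
            wait_s = default_wait_s
        sleep(wait_s)


# Fake transport: the first call hits the wedge, the second succeeds.
responses = iter([
    (503, {"Retry-After": "60"}, "retry later"),
    (200, {}, '{"duration": 300.0}'),
])
waits = []
status, body = post_with_retry(lambda: next(responses), sleep=waits.append)
```

Injecting `sleep` keeps the loop testable; in production the default `time.sleep` would apply the 60-second wait.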
Response shape Recap Relay receives per chunk:
{
  "duration": 300.0,
  "segments": [{"start_s","end_s","speaker"}, ...],   # LOCAL labels
  "speakers_detected": ["Speaker_0","Speaker_1",...],
  "fingerprints": {"Speaker_0":[192 floats], ...},
  "models": {"diarization":"...","embedding":"..."}
}
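A consumer may want to sanity-check each chunk response before clustering. The following is a minimal sketch under assumptions: `validate_chunk_response` is a hypothetical helper, the sample payload is made up, and the check that every detected speaker has a fingerprint is inferred from the "one fingerprint per detected local speaker" description rather than confirmed by the endpoint.

```python
EMBEDDING_DIM = 192  # fingerprint size stated in the response shape


def validate_chunk_response(payload):
    """Return a list of problems found in one diarize-chunk response."""
    problems = []
    speakers = payload.get("speakers_detected", [])
    fingerprints = payload.get("fingerprints", {})
    for seg in payload.get("segments", []):
        if seg["speaker"] not in speakers:
            problems.append(f"segment speaker {seg['speaker']} not in speakers_detected")
        if seg["start_s"] > seg["end_s"]:
            problems.append(f"segment starts at {seg['start_s']} after its end {seg['end_s']}")
    for speaker in speakers:
        vec = fingerprints.get(speaker)
        if vec is None:
            problems.append(f"no fingerprint for {speaker}")
        elif len(vec) != EMBEDDING_DIM:
            problems.append(f"{speaker} fingerprint has {len(vec)} dims")
    return problems


# Made-up payload with one deliberately short fingerprint.
sample = {
    "duration": 300.0,
    "segments": [
        {"start_s": 0.0, "end_s": 4.2, "speaker": "Speaker_0"},
        {"start_s": 4.2, "end_s": 9.0, "speaker": "Speaker_1"},
    ],
    "speakers_detected": ["Speaker_0", "Speaker_1"],
    "fingerprints": {"Speaker_0": [0.0] * 192, "Speaker_1": [0.0] * 191},
    "models": {"diarization": "...", "embedding": "..."},
}
problems = validate_chunk_response(sample)
```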
Recap Relay's job:
1. Chunk audio (existing chunking infrastructure)
2. POST each chunk to /api/audio/diarize-chunk in parallel
3. Collect all fingerprints from all chunks
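4. sklearn AgglomerativeClustering(n_clusters=None, distance_threshold=0.7,
metric="cosine") with a non-ward linkage such as "average" (the default
ward linkage only accepts euclidean distances)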
5. Re-label segments with global cluster IDs
6. Concatenate transcripts (from a separate parallel call to
/v1/audio/transcriptions) with timestamp offsets and merge with
re-labeled diar segments
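Steps 3 to 5 above can be sketched without dependencies. This is not Recap Relay's code: it swaps in a pure-Python average-linkage agglomerative clustering over cosine distance as a stand-in for sklearn's AgglomerativeClustering, and the linkage choice, the `Global_N` label format, the per-chunk `offsets` list, and the 3-dim toy fingerprints (real ones are 192-dim) are all assumptions for illustration.

```python
import math


def cosine_distance(a, b):
    # Assumes non-zero vectors; a zero fingerprint would divide by zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_fingerprints(keys, vectors, distance_threshold=0.7):
    """Average-linkage agglomerative clustering; returns {key: global_id}."""
    clusters = [[i] for i in range(len(keys))]

    def linkage(c1, c2):
        dists = [cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d >= distance_threshold:  # closest pair is too far apart: stop merging
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    return {keys[i]: gid for gid, members in enumerate(clusters) for i in members}


def relabel(chunks, offsets, distance_threshold=0.7):
    """chunks: diarize-chunk responses in order; offsets: chunk start times (s)."""
    keys, vectors = [], []
    for idx, chunk in enumerate(chunks):
        for speaker, vec in chunk["fingerprints"].items():
            keys.append((idx, speaker))
            vectors.append(vec)
    global_ids = cluster_fingerprints(keys, vectors, distance_threshold)
    merged = []
    for idx, chunk in enumerate(chunks):
        for seg in chunk["segments"]:
            merged.append({
                "start_s": seg["start_s"] + offsets[idx],
                "end_s": seg["end_s"] + offsets[idx],
                "speaker": f"Global_{global_ids[(idx, seg['speaker'])]}",
            })
    return merged


# Two toy chunks in which the LOCAL labels are swapped between chunks.
chunks = [
    {"segments": [{"start_s": 0.0, "end_s": 5.0, "speaker": "Speaker_0"},
                  {"start_s": 5.0, "end_s": 10.0, "speaker": "Speaker_1"}],
     "fingerprints": {"Speaker_0": [1.0, 0.0, 0.0], "Speaker_1": [0.0, 1.0, 0.0]}},
    {"segments": [{"start_s": 0.0, "end_s": 4.0, "speaker": "Speaker_0"},
                  {"start_s": 4.0, "end_s": 10.0, "speaker": "Speaker_1"}],
     "fingerprints": {"Speaker_0": [0.0, 0.9, 0.1], "Speaker_1": [0.9, 0.1, 0.0]}},
]
merged = relabel(chunks, offsets=[0.0, 10.0])
```

Keying fingerprints by (chunk index, local label) is what lets the same "Speaker_0" string map to different global speakers in different chunks.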
After installing v0.13.0:1, click "Reapply patches" on the Speech Models
card to push the updated diarizer.py + main.py into the parakeet
container — TitaNet will download (~25 MB) on first call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -209,6 +209,60 @@ def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
            raise HTTPException(r.status_code, r.text[:500])
        return Response(content=r.content, media_type=r.headers.get("content-type", "application/json"))

    # ---- /api/audio/diarize-chunk (per-chunk worker for Recap Relay) ----
    @router.post("/api/audio/diarize-chunk")
    async def diarize_chunk(file: UploadFile = File(...)) -> dict:
        """Per-chunk worker designed for orchestrators (Recap Relay) that
        handle chunking + cross-chunk speaker clustering themselves.

        Given ONE audio chunk, returns diarization segments (with LOCAL
        speaker labels — Speaker_0/1/... reset per chunk) AND a 192-dim
        TitaNet voice fingerprint per detected speaker. The caller is
        expected to:
          1. Collect fingerprints from every chunk
          2. Run cosine-similarity clustering across all of them (e.g.,
             sklearn AgglomerativeClustering, distance_threshold=0.7)
          3. Re-label segments using the resulting global cluster IDs

        Pair with a SEPARATE call to /v1/audio/transcriptions on the same
        chunk to get the text. (Kept separate because the caller may want
        to cache transcription independently of diarization, or run them
        on different parts of the pipeline.)

        Response shape:
          {
            "duration": 300.0,
            "segments": [{"start_s", "end_s", "speaker"}, ...],
            "speakers_detected": ["Speaker_0", "Speaker_1", ...],
            "fingerprints": {"Speaker_0": [192 floats], "Speaker_1": [...]},
            "models": {"diarization": "...", "embedding": "..."}
          }
        """
        body = await file.read()
        if not body:
            raise HTTPException(400, "Empty file")
        files = {"file": (file.filename or "audio.wav", body, file.content_type or "application/octet-stream")}
        try:
            async with httpx.AsyncClient(timeout=600.0) as client:
                r = await client.post(f"{_parakeet_base()}/v1/audio/diarize-chunk", files=files)
        except httpx.HTTPError as e:
            raise HTTPException(502, f"parakeet unreachable: {e}")

        if r.status_code == 500 and deep_health is not None:
            # Same CUDA-wedge recovery as the other endpoints
            try:
                asyncio.create_task(deep_health.run_one("parakeet"))
            except Exception:
                pass
            raise HTTPException(
                status_code=503,
                detail="Parakeet returned a transient error (likely CUDA wedge). Auto-restart triggered; retry in ~60s.",
                headers={"Retry-After": "60"},
            )
        if r.status_code != 200:
            raise HTTPException(r.status_code, r.text[:500])
        return r.json()

    # ---- /api/audio/transcribe-with-speakers (STT + diarization, merged) ----
    @router.post("/api/audio/transcribe-with-speakers")
    async def transcribe_with_speakers(