v0.13.0:2 - per-segment confidence in diarize-chunk response

Recap Relay dev asked: can the diarization output include a confidence level per segment so the UI can render "Speaker_0?" for uncertain assignments rather than confidently mislabeling?

Answer: yes. Sortformer's diarize() with include_tensor_outputs=True returns the per-frame per-speaker sigmoid scores (shape [B, T, 4spk], ~12.6 fps frame rate). The current code argmaxes those into segment strings and throws the raw scores away. Now: for each output segment, compute mean probability of the assigned speaker across the segment's frames → confidence in [0, 1].

Implementation:
- diarizer.py: diarize_chunk() now calls diarize() with include_tensor_outputs=True, and a new _attach_confidence() helper derives the per-segment mean probability after parsing the segment strings. The frame rate is computed from tensor shape vs audio duration (no need to hard-code the model's stride).
- All failure paths return confidence=None gracefully; Recap Relay can treat None as "no info" or fall back to a default threshold.

Endpoint shape change: segments[] now have an optional `confidence` field in [0, 1] (or None). All other fields unchanged. Existing callers that ignore the field aren't affected.

Verified with a 5s test signal that the tensor has shape [1, 63, 4] (63 frames / 5s = 12.6 fps) and values in [0, 1] (sigmoid outputs, independent per speaker so overlap detection works). Real speech values will be much higher than the near-zero values of the pure-tone test signal.

Reapply patches on the Speech Models card after installing v0.13.0:2 to pick up the updated diarizer.py + main.py in the parakeet container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
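
The per-segment confidence described in this message can be sketched roughly as follows. This is a minimal illustration, not the actual _attach_confidence() helper: the function name segment_confidence and all of its parameters are hypothetical, and a plain nested list of shape [T, num_speakers] stands in for one batch item of the Sortformer score tensor. The frame rate is derived from the number of frames and the audio duration, as the message describes, and every failure path returns None.

```python
from typing import Optional, Sequence


def segment_confidence(
    probs: Sequence[Sequence[float]],  # hypothetical: [T, num_speakers] sigmoid scores
    start_s: float,                    # segment start, seconds
    end_s: float,                      # segment end, seconds
    speaker: int,                      # index of the speaker assigned to the segment
    duration_s: float,                 # duration of the audio chunk, seconds
) -> Optional[float]:
    """Mean probability of `speaker` over the frames covered by the segment."""
    num_frames = len(probs)
    if num_frames == 0 or duration_s <= 0 or end_s <= start_s:
        return None
    if not 0 <= speaker < len(probs[0]):
        return None

    # Frame rate inferred from tensor length vs. audio duration,
    # so the model's stride does not need to be hard-coded.
    fps = num_frames / duration_s

    first = max(0, int(start_s * fps))
    if first >= num_frames:
        return None
    # Always cover at least one frame, and never run past the tensor.
    last = min(num_frames, max(first + 1, int(round(end_s * fps))))

    frames = probs[first:last]
    return sum(frame[speaker] for frame in frames) / len(frames)
```

A consumer such as Recap Relay could then compare the returned value against a threshold of its own choosing and fall back to a default label when the value is None.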
2026-05-19 12:36:25 -05:00
parent e775906caa
commit c7f94381e7
3 changed files with 82 additions and 6 deletions
@@ -1,10 +1,10 @@
 import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'

 export const v0_1_0 = VersionInfo.of({
-  version: '0.13.0:1',
+  version: '0.13.0:2',
  releaseNotes: {
    en_US:
-      'v0.13.0:1 — per-chunk diarization worker with voice fingerprints. Adds POST /api/audio/diarize-chunk to Spark Control: given one audio chunk, returns Sortformer diarization segments (with LOCAL speaker labels) PLUS a 192-dim TitaNet voice fingerprint per detected speaker. Designed for Recap Relay to call per-chunk and then cluster fingerprints across chunks via cosine similarity for globally consistent speaker IDs. Parakeet container also gets a new /v1/audio/diarize-chunk endpoint and loads NVIDIA TitaNet (nvidia/speakerverification_en_titanet_large, ~25 MB, NeMo-native, no torchaudio drama). Click Reapply patches on the Speech Models card after install to pick up the diarizer.py + main.py updates. Sortformer + Parakeet + Magpie unchanged.',
+      'v0.13.0:2 — per-segment confidence in diarize-chunk. Sortformer outputs per-frame per-speaker sigmoid probabilities (~12.6 fps) that we previously discarded. Now: for each diarization segment, compute mean probability of the assigned speaker across the segment\'s frames → confidence in [0, 1]. Recap Relay (and other consumers) can threshold this to render uncertain segments as "Speaker_0?" with a question mark, or to skip them entirely. Endpoint shape is otherwise unchanged — segments[].confidence is a new field, value may be None on derivation failure. Click Reapply patches on the Speech Models card after install to pick up the updated diarizer.py + main.py.',
  },
  migrations: {
    up: async ({ effects }) => {},