v0.13.0:2 - per-segment confidence in diarize-chunk response

Recap Relay dev asked: can the diarization output include a confidence level per segment so the UI can render "Speaker_0?" for uncertain assignments rather than confidently mislabeling?

Answer: yes. Sortformer's diarize() with include_tensor_outputs=True returns the per-frame per-speaker sigmoid scores (shape [B, T, 4spk], ~12.6 fps frame rate). The current code argmaxes those into segment strings and throws the raw scores away.

Now: for each output segment, compute mean probability of the assigned speaker across the segment's frames → confidence in [0, 1].

Implementation:
- diarizer.py: diarize_chunk() now calls diarize() with include_tensor_outputs=True, and a new _attach_confidence() helper derives the per-segment mean probability after parsing the segment strings. The frame-rate is computed from tensor shape vs audio duration (no need to hard-code the model's stride).
- All failure paths return confidence=None gracefully; Recap Relay can treat None as "no info" or fall back to a default threshold.

Endpoint shape change: segments[] now have an optional `confidence` field in [0, 1] (or None). All other fields unchanged. Existing callers that ignore the field aren't affected.

Verified with a 5s test signal that the tensor has shape [1, 63, 4] (63 frames / 5s = 12.6 fps) and values in [0, 1] (sigmoid outputs, independent per speaker so overlap detection works). Real speech values will be much higher than the near-zero values of the pure-tone test signal.

Reapply patches on the Speech Models card after installing v0.13.0:2 to pick up the updated diarizer.py + main.py in the parakeet container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
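
For illustration only, the per-segment mean described in the message could be sketched roughly as below. This is not the actual _attach_confidence() helper, which this page does not show. The names segment_confidence and display_label, the plain-list input for one chunk's [T, num_speakers] scores, and the 0.5 display threshold are all assumptions made for the sketch.

```python
from typing import Optional, Sequence


def segment_confidence(
    probs: Sequence[Sequence[float]],  # [T][num_speakers] sigmoid scores, one chunk
    duration_s: float,
    start_s: float,
    end_s: float,
    speaker_idx: int,
) -> Optional[float]:
    """Mean probability of the assigned speaker over the segment's frames.

    Hypothetical helper: returns None whenever the value cannot be derived.
    """
    try:
        num_frames = len(probs)
        if num_frames == 0 or duration_s <= 0 or speaker_idx < 0:
            return None
        # Frame rate comes from tensor length vs audio duration,
        # so the model's stride is never hard-coded.
        fps = num_frames / duration_s
        first = max(0, int(start_s * fps))
        if first >= num_frames:
            return None
        # Always cover at least one frame, and never run past the tensor.
        last = min(num_frames, max(first + 1, int(round(end_s * fps))))
        frames = [probs[t][speaker_idx] for t in range(first, last)]
        mean = sum(frames) / len(frames)
        return min(1.0, max(0.0, mean))
    except (IndexError, TypeError, ValueError, ZeroDivisionError):
        return None


def display_label(
    speaker: str,
    confidence: Optional[float],
    threshold: float = 0.5,  # assumed cutoff, not a value taken from the commit
) -> str:
    """Caller-side rendering: append '?' only when confidence is known and low."""
    if confidence is not None and confidence < threshold:
        return speaker + "?"
    return speaker
```

In this sketch a confidence of None is treated as "no info" and the label is rendered unchanged; a caller could instead apply a default threshold, as the message notes.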
2026-05-19 12:36:25 -05:00
parent e775906caa
commit c7f94381e7
3 changed files with 82 additions and 6 deletions
@@ -175,7 +175,10 @@ async def diarize_chunk(
    Response shape:
        {
          "duration": 300.0,
-          "segments": [{"start_s": 1.2, "end_s": 4.8, "speaker": "Speaker_0"}, ...],
+          "segments": [
+            {"start_s": 1.2, "end_s": 4.8, "speaker": "Speaker_0", "confidence": 0.78},
+            ...
+          ],
          "speakers_detected": ["Speaker_0", "Speaker_1", "Speaker_2"],
          "fingerprints": {
            "Speaker_0": [0.123, -0.045, ..., 0.211],   # 192-dim TitaNet embedding
@@ -188,6 +191,13 @@ async def diarize_chunk(
          }
        }

+    confidence per segment: mean probability that the assigned speaker was
+    active across the segment's frames (Sortformer's raw per-frame per-
+    speaker sigmoid outputs). Range [0, 1], higher = more confident.
+    Clean speech typically >0.5; ambiguous regions (overlap, weak signal)
+    fall lower. None on derivation failure. Recap Relay can threshold
+    this to render uncertain segments as "Speaker_0?" in the UI.
+
    Speaker labels are LOCAL to this chunk. Run cosine-similarity clustering
    across the fingerprints from all chunks to merge `chunkA.Speaker_0` with
    `chunkB.Speaker_2` when they're the same voice. Recommended threshold: