Chunk overlap + overlap-aware stitching

Chunks were contiguous (start = prev end) and stitched by naive offset concatenation, with no overlap. That cut sentences at boundaries, denied the diarizer context at chunk edges, and let one voice split across chunks (the MH/Unknown_0 problem).

Now each ~150s body is sliced with a 15s margin on both sides ([bodyStart-15, bodyEnd+15]). The stitcher keeps a segment only in the chunk whose body region contains its MIDPOINT and drops it from the neighbour's margin, so boundary-spanning speech is seen whole by the backend and kept exactly once.

- SessionPackager.PlannedChunk gains bodyStart/bodyEnd; planChunks adds overlapSeconds.
- TranscriptAssembler.ChunkResult carries body bounds (the defaults keep every segment, so no-overlap behaviour is preserved for existing callers); assemble dedups by midpoint-in-body.
- TranscriptPipeline passes body bounds through.

Complements (doesn't replace) the fragment-smoothing + reconciliation safety nets; this is the upstream fix. Cost: ~+20% backend audio per interior chunk (2 x 15s of margin on a ~150s body). 63/63 XCTest pass (new: overlap window layout + boundary-segment dedup).
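For readers without the full diff, here is a minimal sketch of the overlap layout and the midpoint-in-body rule described above. It is not the code from this commit. The `PlannedChunk` and `Segment` structs, the function signatures, and the clamping of slices to the recording are hypothetical stand-ins. The half-open body interval (so a midpoint exactly on a boundary has one owner) is also an assumption, since the message does not say how ties are resolved.

```swift
import Foundation

// Hypothetical stand-ins; the real SessionPackager / TranscriptAssembler
// types are not reproduced here.
struct PlannedChunk {
    let start: Double      // slice start, leading margin included
    let end: Double        // slice end, trailing margin included
    let bodyStart: Double  // start of the region this chunk owns
    let bodyEnd: Double    // end of the region this chunk owns
}

struct Segment {
    let start: Double      // seconds
    let end: Double
    let text: String
}

/// Lays out contiguous bodies, then widens each slice by `overlapSeconds`,
/// clamped to the recording (the clamping is an assumption).
func planChunks(duration: Double,
                bodySeconds: Double = 150,
                overlapSeconds: Double = 15) -> [PlannedChunk] {
    var chunks: [PlannedChunk] = []
    var bodyStart = 0.0
    while bodyStart < duration {
        let bodyEnd = min(bodyStart + bodySeconds, duration)
        chunks.append(PlannedChunk(start: max(0, bodyStart - overlapSeconds),
                                   end: min(duration, bodyEnd + overlapSeconds),
                                   bodyStart: bodyStart,
                                   bodyEnd: bodyEnd))
        bodyStart = bodyEnd
    }
    return chunks
}

/// Keeps each segment only in the chunk whose body contains its midpoint.
/// Input segment times are relative to the chunk's slice start.
func assemble(_ results: [(chunk: PlannedChunk, segments: [Segment])]) -> [Segment] {
    var out: [Segment] = []
    for (chunk, segments) in results {
        for seg in segments {
            let absStart = chunk.start + seg.start
            let absEnd = chunk.start + seg.end
            let mid = (absStart + absEnd) / 2
            // Half-open interval: a midpoint on a boundary belongs to the later chunk.
            if mid >= chunk.bodyStart && mid < chunk.bodyEnd {
                out.append(Segment(start: absStart, end: absEnd, text: seg.text))
            }
        }
    }
    return out.sorted { $0.start < $1.start }
}

// A 300s recording gives two chunks: slices 0...165 and 135...300.
let plan = planChunks(duration: 300)

// One utterance at absolute 146s...152s is returned by both chunks,
// because it falls inside the shared 135s...165s overlap.
let stitched = assemble([
    (chunk: plan[0], segments: [Segment(start: 146, end: 152, text: "hello")]),
    (chunk: plan[1], segments: [Segment(start: 11, end: 17, text: "hello")]),
])
print(stitched.count) // prints 1
```

The utterance's midpoint is 149s, which lies in the first chunk's body (0s to 150s), so the copy from the second chunk's leading margin is dropped. With contiguous slices, the same utterance would have been cut at 150s and transcribed as two fragments.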
2026-06-08 13:03:56 -05:00
parent 1c133c8970
commit ab910cf742
4 changed files with 60 additions and 18 deletions
@@ -91,7 +91,8 @@ final class TranscriptPipeline {
                known[name] = fp
            }
            voiceprints.update(with: response)
-            results.append(.init(chunkStart: chunk.start, response: response))
+            results.append(.init(chunkStart: chunk.start, response: response,
+                                 bodyStart: chunk.bodyStart, bodyEnd: chunk.bodyEnd))
        }
        await progress?(plan.count, plan.count)