Chunk overlap + overlap-aware stitching

Previously chunks were contiguous (start = prev end) and stitched by naïve offset
concatenation, with no overlap. That cut sentences at boundaries, denied the diarizer
context at chunk edges, and let one voice split across chunks (the MH/Unknown_0
problem). Now each ~150s body is sliced with a 15s margin on both sides
([bodyStart-15, bodyEnd+15]); the stitcher keeps a segment only in the chunk whose
body region contains the segment's MIDPOINT and drops it from the neighbour's margin,
so boundary-spanning speech is seen whole by the backend and kept exactly once.
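
The window layout described above can be sketched roughly as follows. This is an
illustration, not the code in this commit: `PlannedWindow` and `planWindows` are
hypothetical stand-ins for SessionPackager.PlannedChunk and planChunks, and clamping
the margins to the recording bounds is an assumption.

```swift
import Foundation

// Hypothetical stand-in for SessionPackager.PlannedChunk. Only bodyStart/bodyEnd
// are named in the commit message; the other fields are assumed.
struct PlannedWindow: Equatable {
    let start: Double      // slice start, including the leading margin
    let end: Double        // slice end, including the trailing margin
    let bodyStart: Double  // start of the region this chunk owns
    let bodyEnd: Double    // end of the region this chunk owns
}

// Splits [0, duration] into contiguous bodies, then widens each slice by
// `overlapSeconds` on both sides. Clamping to [0, duration] is an assumption.
func planWindows(duration: Double,
                 bodySeconds: Double = 150,
                 overlapSeconds: Double = 15) -> [PlannedWindow] {
    precondition(bodySeconds > 0 && overlapSeconds >= 0)
    var windows: [PlannedWindow] = []
    var bodyStart = 0.0
    while bodyStart < duration {
        let bodyEnd = min(bodyStart + bodySeconds, duration)
        windows.append(PlannedWindow(
            start: max(0, bodyStart - overlapSeconds),
            end: min(duration, bodyEnd + overlapSeconds),
            bodyStart: bodyStart,
            bodyEnd: bodyEnd))
        bodyStart = bodyEnd  // bodies stay contiguous; only the slices overlap
    }
    return windows
}
```

With these defaults a 400s recording would yield three slices, and only the interior
one carries a full margin on both sides.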

- SessionPackager.PlannedChunk gains bodyStart/bodyEnd; planChunks adds overlapSeconds.
- TranscriptAssembler.ChunkResult carries body bounds (defaults keep-all → no-overlap
  behaviour preserved for existing callers); assemble dedups by midpoint-in-body.
- TranscriptPipeline passes body bounds through.
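
A minimal sketch of the midpoint-in-body dedup, under stated assumptions:
`Segment`, `ChunkOutput`, and `stitch` are hypothetical names, segment times are
assumed to be relative to the slice start, and the half-open body interval is one
possible tie-break rather than necessarily what TranscriptAssembler does.

```swift
import Foundation

// Hypothetical segment type; times are seconds relative to the slice start.
struct Segment: Equatable {
    let start: Double
    let end: Double
    let text: String
}

// Hypothetical stand-in for TranscriptAssembler.ChunkResult. The default body
// bounds accept every midpoint, which mirrors the keep-all default.
struct ChunkOutput {
    let chunkStart: Double  // absolute start of the slice, margin included
    let segments: [Segment]
    var bodyStart: Double = -Double.infinity
    var bodyEnd: Double = Double.infinity
}

// Shifts segments to absolute time and keeps each one only in the chunk whose
// body contains its midpoint.
func stitch(_ chunks: [ChunkOutput]) -> [Segment] {
    var kept: [Segment] = []
    for chunk in chunks {
        for seg in chunk.segments {
            let absStart = chunk.chunkStart + seg.start
            let absEnd = chunk.chunkStart + seg.end
            let midpoint = (absStart + absEnd) / 2
            // Half-open [bodyStart, bodyEnd): a midpoint sitting exactly on a
            // boundary belongs to the later chunk only (assumed tie-break).
            guard midpoint >= chunk.bodyStart && midpoint < chunk.bodyEnd else {
                continue
            }
            kept.append(Segment(start: absStart, end: absEnd, text: seg.text))
        }
    }
    return kept.sorted { $0.start < $1.start }
}
```

A segment that straddles a boundary shows up in both neighbouring slices, but its
midpoint lies in exactly one body, so only one copy survives.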

Complements (doesn't replace) the fragment-smoothing + reconciliation safety nets;
this is the upstream fix. Costs about 20% more backend audio per interior chunk
(2 x 15s of margin on a ~150s body). 63/63 XCTest pass (new: overlap window layout +
boundary-segment dedup).
Grant Gilliam
2026-06-08 13:03:56 -05:00
parent 1c133c8970
commit ab910cf742
4 changed files with 60 additions and 18 deletions
@@ -91,7 +91,8 @@ final class TranscriptPipeline {
known[name] = fp
}
voiceprints.update(with: response)
-results.append(.init(chunkStart: chunk.start, response: response))
+results.append(.init(chunkStart: chunk.start, response: response,
+                     bodyStart: chunk.bodyStart, bodyEnd: chunk.bodyEnd))
}
await progress?(plan.count, plan.count)