Fix transcript chunker context overflow; full-coverage extraction defaults

chunk_text split only on "\n\n", but ASR transcripts have none (speaker turns are joined by a single "\n"), so whole 2-3h episodes (~250K chars) went to the extractor in one call and 400'd on context overflow. Fall through paragraph -> line -> sentence -> word -> hard char-slice so no chunk exceeds the cap regardless of punctuation; guard max_chars < 1.
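
The fall-through described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the committed code: only the `chunk_text` name, the `max_chars` guard, and the paragraph -> line -> sentence -> word -> char-slice order come from this message. The `_split` helper, the regexes, the greedy packing strategy, and raising `ValueError` for the guard are all hypothetical.

```python
import re

# (splitter, joiner) pairs tried in order: paragraph, line, sentence, word.
# These patterns are illustrative assumptions, not the project's own.
_LEVELS = [
    (re.compile(r"\n\n+"), "\n\n"),
    (re.compile(r"\n"), "\n"),
    (re.compile(r"(?<=[.!?])\s+"), " "),
    (re.compile(r"\s+"), " "),
]


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars characters each."""
    if max_chars < 1:
        # Guard: a non-positive cap could never be satisfied.
        raise ValueError("max_chars must be >= 1")
    return _split(text, max_chars, 0)


def _split(text: str, max_chars: int, level: int) -> list[str]:
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if level >= len(_LEVELS):
        # Last resort: no separator left, so slice on raw character offsets.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    splitter, joiner = _LEVELS[level]
    pieces = [p for p in splitter.split(text) if p.strip()]
    if len(pieces) <= 1:
        # This separator does not occur; fall through to the next finer one.
        return _split(text, max_chars, level + 1)

    chunks: list[str] = []
    buf = ""
    for piece in pieces:
        if len(piece) > max_chars:
            # Oversized piece: flush the buffer, then split the piece finer.
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.extend(_split(piece, max_chars, level + 1))
        elif not buf:
            buf = piece
        elif len(buf) + len(joiner) + len(piece) <= max_chars:
            # Greedily pack adjacent pieces while they fit under the cap.
            buf = buf + joiner + piece
        else:
            chunks.append(buf)
            buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```

With a transcript whose speaker turns are joined by a single "\n", the paragraph level finds nothing to split on, so the line level does the work; a single unbroken token longer than the cap still ends up hard-sliced.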

Default extraction to recall-first full coverage (chunk_chars 12K, max_chunks 999) and expose both as run-extract --chunk-chars / --max-chunks.
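
For orientation, the two flags might be wired up along these lines. This is a sketch only: the message does not say which CLI framework the project uses, so `argparse`, the `pipeline` program name, the `build_parser` helper, and the help strings are assumptions. The flag names and defaults (12K taken to mean 12,000 characters) come from the message above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "pipeline" is a placeholder program name, not the project's real one.
    parser = argparse.ArgumentParser(prog="pipeline")
    sub = parser.add_subparsers(dest="command", required=True)

    extract = sub.add_parser("run-extract", help="run extraction over the corpus")
    extract.add_argument(
        "--chunk-chars",
        type=int,
        default=12_000,  # assumed reading of "12K"
        help="maximum characters per extraction chunk",
    )
    extract.add_argument(
        "--max-chunks",
        type=int,
        default=999,  # effectively full coverage for a single document
        help="maximum chunks extracted per document",
    )
    return parser


# Overriding one flag leaves the other at its recall-first default.
args = build_parser().parse_args(["run-extract", "--chunk-chars", "8000"])
```
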
Keysat
2026-06-15 22:28:12 -05:00
parent cabb8a3d6c
commit 5deffddb17
4 changed files with 50 additions and 16 deletions
@@ -33,6 +33,10 @@ falsification hypotheses (H1–H6) are in `DESIGN_v2.md`.
- **Episode-pipelining** in `transcribe_worker` — download/chunk the next episode while transcribing the
current one, to close the inter-episode GPU idle gap (the per-chunk 2-in-flight path is already done).
- **Corpus-management UI** — add to the corpus over time and see the full corpus selection.
- **Expose pipeline tunables in the UI (with the UI topic).** Extraction chunk size + per-doc chunk cap,
audio chunk length, audio concurrency, etc. are currently hardcoded defaults (now also CLI flags on
`run-extract`: `--chunk-chars`, `--max-chunks`). Surface them in the UI so they're visible/adjustable,
not black-box assumptions we forget about. Tie to the corpus-management UI work.
- **Forward live operation** — the only real test: scoring un-pre-selected signals as they arrive, with
the dual-evaluation ledger as arbiter.