Fix transcript chunker context overflow; full-coverage extraction defaults

chunk_text split only on "\n\n", but ASR transcripts contain none (speaker turns are joined by a single "\n"), so whole 2-3h episodes (~250K chars) went to the extractor in a single call and failed with HTTP 400 on context overflow.

chunk_text now falls through paragraph -> line -> sentence -> word -> hard char-slice, so no chunk exceeds the cap regardless of punctuation, and it guards against max_chars < 1.

Extraction now defaults to recall-first full coverage (chunk_chars 12K, max_chunks 999), and both settings are exposed as run-extract --chunk-chars / --max-chunks.
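
For illustration only, the fall-through order described above could be sketched roughly as follows. This is not the code from this commit: the names chunk_text_sketch, _SPLITTERS and _split are hypothetical, raising ValueError for max_chars < 1 is an assumption about what the guard does, and re-joining pieces with "\n" is a simplification that does not preserve the original separators.

```python
import re

# Separators tried in order, coarsest first (hypothetical; the real
# chunk_text may use different rules at each level).
_SPLITTERS = [
    re.compile(r"\n\n+"),          # paragraphs
    re.compile(r"\n"),             # lines (speaker turns in ASR output)
    re.compile(r"(?<=[.!?])\s+"),  # sentences
    re.compile(r"\s+"),            # words
]


def _split(text: str, max_chars: int, level: int) -> list[str]:
    """Break text into pieces of at most max_chars, using the coarsest
    separator that works and falling through to finer ones."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if level >= len(_SPLITTERS):
        # Last resort: no separator left, slice by character count.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    pieces: list[str] = []
    for part in _SPLITTERS[level].split(text):
        pieces.extend(_split(part, max_chars, level + 1))
    return pieces


def chunk_text_sketch(text: str, max_chars: int) -> list[str]:
    """Greedily pack pieces into chunks that never exceed max_chars."""
    if max_chars < 1:
        # Assumed behaviour of the guard; the commit may clamp instead.
        raise ValueError("max_chars must be >= 1")
    chunks: list[str] = []
    current = ""
    for piece in _split(text, max_chars, 0):
        if not current:
            current = piece
        elif len(current) + 1 + len(piece) <= max_chars:
            current += "\n" + piece  # simplification: always join with "\n"
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

With a transcript that has no blank lines, the paragraph level yields a single oversized piece, so the sketch drops to the line level and packs speaker turns up to the cap. An unbroken run of characters with no whitespace at all still ends up in slices of at most max_chars.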
2026-06-15 22:28:12 -05:00
parent cabb8a3d6c
commit 5deffddb17
4 changed files with 50 additions and 16 deletions
@@ -28,8 +28,8 @@ def _document_text(doc, *, user_agent: str) -> str:
    raise ValueError(f"no text source for {doc['doc_id']} (kind={doc['kind']}, url={doc['url']})")


-def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 4,
-                chunk_chars: int = 18_000, lease_seconds: int = 900,
+def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 999,
+                chunk_chars: int = 12_000, lease_seconds: int = 900,
                worker_id: str = "extract-1") -> dict:
    from .backends import from_config as backend_from_config
    backend = backend_from_config(cfg, sc)