Handoff: durable chunker/Gemini rules; Strike extraction in progress
Record two recurrence-prone gotchas in Key operational rules (transcript chunking must cap every chunk; Gemini backend must disable thinking) and rewrite Current state for the in-progress Gemini extraction batch and the gating Strike test.
This commit is contained in:
@@ -89,6 +89,15 @@ ASR + diarizer, **bge-m3** embeddings, **Qdrant** on `:87`. The gateway is the o
|
||||
independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call).
|
||||
- **Gemini quota is a rolling ~24h window** (~291 hour-long episodes / ~51M tokens), not a calendar-day
|
||||
reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window.
|
||||
- **Transcript chunking is recall-first and MUST cap every chunk.** ASR transcripts have NO blank-line
|
||||
paragraphs (speaker turns joined by a single `\n`), so `extract.claims.chunk_text` falls through
|
||||
`\n\n`→`\n`→sentence→word→hard-slice; splitting only on `\n\n` (the old bug) sent whole 2–3 h episodes
|
||||
in ONE call → context-overflow 400s. Extraction defaults to full coverage at 12K chars/chunk
|
||||
(`run-extract --chunk-chars/--max-chunks`); bigger chunks risk lost-in-the-middle recall loss.
|
||||
- **Gemini extraction backend disables thinking.** `gemini-2.5-flash` thinks by default and burns the
|
||||
output-token budget on reasoning → MAX_TOKENS → truncated JSON → 0 claims; the backend sets
|
||||
`thinking_budget=0` (mirrors the local path's `enable_thinking=False`). Gemini = overflow for PUBLIC
|
||||
data only; keep `EXTRACTION_BACKEND=local` in `.env`, flip it inline per-run when overflowing.
|
||||
- **Scoring-brain internals are scoped to a guide.** Before editing `signal_engine/signals/`, read
|
||||
**`docs/guides/scoring-brain.md`** — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type
|
||||
hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the
|
||||
@@ -105,17 +114,19 @@ IP appears only as an env-var default.
|
||||
|
||||
## Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`)
|
||||
|
||||
- **Battery adversarial test: PASSES.** Corpus built (23 docs via the `docs` fetcher); after the three
|
||||
scoring fixes the engine reads demand-net rising (+3.9) while **supply stays flat at 0.0** — correctly
|
||||
rejecting Cantor's *announced* $2B and borrower-side collateral claims as not-realized-supply.
|
||||
- **Strike adversarial test: STALLED — needs a manual resume; no result yet.** The independent leg (What
|
||||
Bitcoin Did, Stephan Livera, Kevin Rooke, Anita Posch, Cafe Bitcoin, + River research — all
|
||||
independent) is ~586/671 transcribed. The `run_strike_pipeline.sh` watcher proceeded on that partial
|
||||
corpus, but its extraction worker **died (2026-06-11)** after only 17/635 podcast docs; the stale lease
|
||||
is cleared, so **608 bitcoin-podcast extract jobs are pending**. Spark vLLM is healthy → **RESUME:**
|
||||
`run-extract --limit 700 --max-chunks 4` → `embed-claims` → `two-sided --conviction STRIKE2022
|
||||
--modes live,test` (PASS = quiet in live, fires in test). The Spark **audio fix** (semaphore-of-2 +
|
||||
retry-backoff) is committed and validated (~2.5× faster, zero episode aborts).
|
||||
- **§7.1 power-infra backtest:** qualified YES (corpus-gated; runway/precision caveats in `DESIGN_v2.md`).
|
||||
- Corpus now spans bitcoin podcasts, SEC/FMP company filings (incl. 6 major banks + Robinhood, a new
|
||||
`banks` cluster), the Battery text corpus, and River research. EISC edges seeded for the bitcoin cluster.
|
||||
- **Strike adversarial test: extraction running — this is the gating step.** Root-caused & FIXED the
|
||||
long-form 400s (the old `\n\n`-only chunker sent whole 2–3 h episodes in one call → context overflow).
|
||||
Now recall-first full coverage (12K chars). Draining the ~700-doc / ~5.7k-chunk extract backlog through
|
||||
the **Gemini backend** (one-time PUBLIC overflow, `EXTRACTION_BACKEND=gemini` inline) to finish faster
|
||||
(~6–7 h serial); validated live — dense (~7.5 claims/chunk), zero failures; 27 prior 400-failures
|
||||
requeued. **NEXT when it finishes:** `embed-claims` → `two-sided --conviction STRIKE2022 --modes
|
||||
live,test` (PASS = quiet in live, fires in test).
|
||||
- **Battery adversarial test: PASSES** (unchanged — demand-net rises, supply stays flat at 0.0).
|
||||
**§7.1 power-infra backtest:** qualified YES (corpus-gated; caveats in `DESIGN_v2.md`).
|
||||
- **2 commits made, UNPUSHED** — push to `main` was blocked by the permission classifier (enforcing the
|
||||
old no-push-to-main rule); awaiting approval (`git push origin main`). Commits: chunker fix +
|
||||
recall-first defaults; Gemini thinking-budget fix.
|
||||
- **Open decisions for Grant:** (a) push the 2 commits; (b) speed-up approach — recommended real-time
|
||||
concurrency over the async Batch API (serial Gemini runs as the fallback meanwhile).
|
||||
- Corpus spans bitcoin podcasts, SEC/FMP filings (+`banks` cluster), the Battery corpus, River research;
|
||||
EISC edges seeded for the bitcoin cluster.
|
||||
|
||||
Reference in New Issue
Block a user