Record Strike test conditional pass; track reflexivity follow-up

Full STRIKE2022 pipeline ran clean: 63 filing extracts (3,330 claims, 56,008 total), embed-claims indexed all into Qdrant, two-sided live/test. Engine refuses the false positive (net=+0.25, capped single bitcoin cluster, far below the 2.0 firing bar) but the own_network-drop reflexivity demo is unexercised (own_net=0, live==test) because RHR/CD/Bitcoin.Review were deferred at transcription 2026-06-08. Accepted as a conditional pass; un-defer follow-up tracked in ROADMAP.
Handoff: record Gemini timeout lesson; Strike extraction near complete
2026-06-16 12:39:13 -05:00 · 2026-06-16 08:45:12 -05:00 · 2026-06-16 08:45:12 -05:00 · 2026-06-15 22:47:20 -05:00 · 2026-06-15 22:28:12 -05:00 · 2026-06-15 22:28:12 -05:00
6 changed files with 120 additions and 43 deletions
@@ -89,6 +89,17 @@ ASR + diarizer, **bge-m3** embeddings, **Qdrant** on `:87`. The gateway is the o
  independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call).
 - **Gemini quota is a rolling ~24h window** (~291 hour-long episodes / ~51M tokens), not a calendar-day
  reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window.
 - **Transcript chunking is recall-first and MUST cap every chunk.** ASR transcripts have NO blank-line
  paragraphs (speaker turns joined by a single `\n`), so `extract.claims.chunk_text` falls through
  `\n\n`→`\n`→sentence→word→hard-slice; splitting only on `\n\n` (the old bug) sent whole 2–3 h episodes
  in ONE call → context-overflow 400s. Extraction defaults to full coverage at 12K chars/chunk
  (`run-extract --chunk-chars/--max-chunks`); bigger chunks risk lost-in-the-middle recall loss.
 - **Gemini extraction backend: disable thinking AND set a timeout.** `gemini-2.5-flash` thinks by
  default and burns the output-token budget on reasoning → MAX_TOKENS → truncated JSON → 0 claims, so
  the backend sets `thinking_budget=0` (mirrors local `enable_thinking=False`). It also sets an HTTP
  **timeout (120 s) + 4 retries** — a timeout-less call once hung the single-threaded worker ~50 min
  (transient 504/read-timeouts then self-heal). Gemini = overflow for PUBLIC data only; keep
  `EXTRACTION_BACKEND=local` in `.env`, flip it inline per-run when overflowing.
 - **Scoring-brain internals are scoped to a guide.** Before editing `signal_engine/signals/`, read
  **`docs/guides/scoring-brain.md`** — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type
  hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the
@@ -105,17 +116,25 @@ IP appears only as an env-var default.
 ## Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`)
- **Battery adversarial test: PASSES.** Corpus built (23 docs via the `docs` fetcher); after the three
+- **Strike adversarial test: CONDITIONAL PASS (2026-06-16).** Full pipeline ran: extraction drained the
-  scoring fixes the engine reads demand-net rising (+3.9) while **supply stays flat at 0.0** — correctly
+  final 63 filing jobs via the Gemini backend (3,330 claims, **0 failures/timeouts/429s**) → **56,008
-  rejecting Cantor's *announced* $2B and borrower-side collateral claims as not-realized-supply.
+  claims** total → `embed-claims` indexed all 56,008 into Qdrant (points == claims, hybrid dense+BM25) →
- **Strike adversarial test: STALLED — needs a manual resume; no result yet.** The independent leg (What
+  `two-sided --conviction STRIKE2022 --modes live,test`. **Result:** the engine correctly **refuses the
-  Bitcoin Did, Stephan Livera, Kevin Rooke, Anita Posch, Cafe Bitcoin, + River research — all
+  false positive** — the two scoring nodes (`lightning-retail-acceptance`, `merchant-lightning-integration`)
-  independent) is ~586/671 transcribed. The `run_strike_pipeline.sh` watcher proceeded on that partial
+  sit at `net=+0.25`, the *capped single-bitcoin-cluster* value, far below the `EISC_FLOOR=2.0` firing bar
-  corpus, but its extraction worker **died (2026-06-11)** after only 17/635 podcast docs; the stale lease
+  (`signals/bar.py`); the §3 "Bitcoin is one capped cluster" guardrail holds. **But the reflexivity
-  is cleared, so **608 bitcoin-podcast extract jobs are pending**. Spark vLLM is healthy → **RESUME:**
+  *demonstration* (live < test via own_network drop) is NOT exercised:** `own_net=0` and live==test because
-  `run-extract --limit 700 --max-chunks 4` → `embed-claims` → `two-sided --conviction STRIKE2022
+  the own_network shows that carry the reflexive Lightning chatter — **RHR (80), CD (77), Bitcoin.Review (12)
-  --modes live,test` (PASS = quiet in live, fires in test). The Spark **audio fix** (semaphore-of-2 +
+  = 169 eps — were deferred at transcription 2026-06-08** ("focus on WBD/Livera/Rooke/Anita"; no audio
-  retry-backoff) is committed and validated (~2.5× faster, zero episode aborts).
+  downloaded), so they have 0 claims. (TFTC partially transcribed, 19/80 → 329 claims; the current +0.25
- **§7.1 power-infra backtest:** qualified YES (corpus-gated; runway/precision caveats in `DESIGN_v2.md`).
+  comes from *independent* bitcoin-cluster shows.) **Operator call 2026-06-16: accept the conditional pass,
- Corpus now spans bitcoin podcasts, SEC/FMP company filings (incl. 6 major banks + Robinhood, a new
+  no audio-GPU spend now.** To fully demonstrate reflexivity later: un-defer + transcribe the RHR/CD
-  `banks` cluster), the Battery text corpus, and River research. EISC edges seeded for the bitcoin cluster.
+  2022–23 Lightning-retail window → re-extract → re-run `two-sided` (then test mode should fire while live
  stays quiet). Tracked in `ROADMAP.md`.
 - **Battery test PASSES; §7.1 power-infra qualified YES** (both unchanged).
 - **5 commits ahead of `origin/main`, UNPUSHED** (Gitea `immense-voyage.local`). The `backends.py`
  timeout/retry change the prior handoff flagged as uncommitted was in fact committed (`87b6b05`); tree is
  otherwise clean apart from this Current-state edit. Run `git push origin main`.
 - Corpus: bitcoin podcasts (own_network: TFTC partial; RHR/CD/Bitcoin.Review deferred), SEC/FMP filings
  (+`banks` cluster, now extracted: Robinhood 2216, Morgan Stanley 644, JPMorgan 382… + power-infra names
  Oklo/NuScale/Cipher/TeraWulf), Battery corpus, River research; EISC edges seeded for the bitcoin cluster.
@@ -18,6 +18,13 @@ falsification hypotheses (H1–H6) are in `DESIGN_v2.md`.
 - **MD&A targeting** for filings — extract Item 7, not front-matter boilerplate.
 ## Corpus & independence
 - **Complete the Strike reflexivity demonstration (deferred 2026-06-16).** The STRIKE2022 adversarial test
  is a *conditional* pass: the engine refuses the false positive via the capped single-bitcoin-cluster
  guard (`net=+0.25` ≪ `EISC_FLOOR=2.0`), but the own_network-drop mechanism (live quiet < test fires) is
  unexercised because RHR (80) / CD (77) / Bitcoin.Review (12) were deferred at transcription on 2026-06-08
  (`own_net=0`, live==test). To finish: un-defer + transcribe the **RHR/CD 2022–23 Lightning-retail window**
  → `run-extract` → re-run `two-sided --conviction STRIKE2022 --modes live,test`; PASS = test fires while
  live stays quiet. Costs constrained audio-GPU time, hence deferred.
 - **Confirm materiality** of the remaining `own_network`-flagged sources with Grant: Unchained, Debifi,
  Coinkite (Bitcoin.Review). Immaterial → flip to independent (the River/Swan precedent).
 - **BTC Sessions (Ben Perrin)** — strongest still-missing independent high-Strike merchant/wallet-adoption
@@ -33,6 +40,10 @@ falsification hypotheses (H1–H6) are in `DESIGN_v2.md`.
 - **Episode-pipelining** in `transcribe_worker` — download/chunk the next episode while transcribing the
  current one, to close the inter-episode GPU idle gap (the per-chunk 2-in-flight path is already done).
 - **Corpus-management UI** — add to the corpus over time and see the full corpus selection.
 - **Expose pipeline tunables in the UI (with the UI topic).** Extraction chunk size + per-doc chunk cap,
  audio chunk length, audio concurrency, etc. are currently hardcoded defaults (now also CLI flags on
  `run-extract`: `--chunk-chars`, `--max-chunks`). Surface them in the UI so they're visible/adjustable,
  not black-box assumptions we forget about. Tie to the corpus-management UI work.
 - **Forward live operation** — the only real test: scoring un-pre-selected signals as they arrive, with
  the dual-evaluation ledger as arbiter.
@@ -254,7 +254,8 @@ def cmd_run_extract(args: argparse.Namespace) -> int:
    conn = db.connect(cfg.db_path)
    db.init_db(conn)
    sc = from_config(cfg)
-    result = run_extract(conn, sc, cfg, limit=args.limit, max_chunks_per_doc=args.max_chunks)
+    result = run_extract(conn, sc, cfg, limit=args.limit, max_chunks_per_doc=args.max_chunks,
                         chunk_chars=args.chunk_chars)
    print(f"extraction: {result['jobs_processed']} jobs, {result['claims_written']} claims written")
    return 0
@@ -581,7 +582,10 @@ def build_parser() -> argparse.ArgumentParser:
    re = sub.add_parser("run-extract", help="Drain 'extract' jobs → claims via the local LLM (§4.2)")
    re.add_argument("--limit", type=int, default=5, help="max jobs to process this run")
-    re.add_argument("--max-chunks", type=int, default=4, help="max chunks per document")
+    re.add_argument("--max-chunks", type=int, default=999,
                    help="max chunks per document (default: full coverage (999))")
    re.add_argument("--chunk-chars", type=int, default=12_000,
                    help="chars per extraction chunk; smaller = better recall, more LLM calls")
    re.set_defaults(func=cmd_run_extract)
    sub.add_parser("queue-status", help="Backfill queue counts by type/state").set_defaults(func=cmd_queue_status)
@@ -10,6 +10,7 @@ A backend exposes: complete_json(messages, max_tokens) -> str  (a JSON object st
 from __future__ import annotations
 import logging
 import time
 log = logging.getLogger(__name__)
@@ -32,27 +33,43 @@ class GeminiBackend:
    API is the eventual scale path; this synchronous form is the drop-in fallback."""
    name = "gemini"
-    def __init__(self, api_key: str, model: str = "gemini-2.5-flash") -> None:
+    def __init__(self, api_key: str, model: str = "gemini-2.5-flash", *,
                 timeout_s: float = 120.0, retries: int = 4) -> None:
        from google import genai  # guarded import; pip install google-genai
        from google.genai import types
        self._genai = genai
-        self.client = genai.Client(api_key=api_key)
+        self._types = types
        # http_options.timeout is in MILLISECONDS — without it a stalled request hangs the (single-
        # threaded) worker forever; one such hang froze a whole batch for ~50 min before this fix.
        self.client = genai.Client(api_key=api_key,
                                   http_options=types.HttpOptions(timeout=int(timeout_s * 1000)))
        self.model = model
        self.retries = retries
    def complete_json(self, messages: list[dict], *, max_tokens: int = 4000) -> str:
-        from google.genai import types
+        types = self._types
        system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
        user = "\n\n".join(m["content"] for m in messages if m["role"] != "system")
-        resp = self.client.models.generate_content(
+        cfg = types.GenerateContentConfig(
            model=self.model,
            contents=user,
            config=types.GenerateContentConfig(
            system_instruction=system or None,
            temperature=0,
            max_output_tokens=max_tokens,
            response_mime_type="application/json",
-            ),
+            # Gemini 2.5 thinks by default and spends the output budget on reasoning tokens —
            # it hit MAX_TOKENS with ~3.8k thoughts and a truncated JSON body (0 claims parsed).
            # Extraction is deterministic, no-CoT (mirrors the local path's enable_thinking=False).
            thinking_config=types.ThinkingConfig(thinking_budget=0),
        )
        for attempt in range(self.retries + 1):
            try:
                resp = self.client.models.generate_content(model=self.model, contents=user, config=cfg)
                return resp.text or "{}"
            except Exception as e:  # noqa: BLE001 — timeout/5xx/429/network: back off and retry
                if attempt >= self.retries:
                    raise
                sleep = 2.0 * (2 ** attempt)
                log.warning("Gemini call failed (%s); retry %d/%d in %.0fs", e, attempt + 1, self.retries, sleep)
                time.sleep(sleep)
 def from_config(cfg, sc) -> "LocalQwenBackend | GeminiBackend":
@@ -30,25 +30,51 @@ def register_seed_topics(conn: sqlite3.Connection) -> None:
    conn.commit()
 # Coarse→fine split boundaries. Transcripts arrive as `Speaker: turn` lines joined by a SINGLE
 # newline (ASR output has no blank-line paragraphs), filings as paragraph text — so splitting on
 # "\n\n" alone never fires on a transcript and the whole episode would go in one call. "" is the
 # per-character hard cap that guarantees termination regardless of punctuation.
 _SEPARATORS = ["\n\n", "\n", ". ", " ", ""]
 def chunk_text(text: str, max_chars: int) -> list[str]:
-    """Split on paragraph boundaries into windows that fit the model context alongside the prompt."""
+    """Pack text into windows that each fit the model context alongside the prompt.
    Falls through paragraph → line → sentence → word → hard char-slice, so NO chunk ever exceeds
    max_chars however the source is punctuated, while keeping speaker turns intact when they fit.
    """
    if max_chars < 1:  # else _pack recurses past the last separator → IndexError
        raise ValueError(f"max_chars must be >= 1, got {max_chars}")
    text = text.strip()
    if not text:
        return []
    return _pack(text, max_chars, _SEPARATORS)
 def _pack(text: str, max_chars: int, seps: list[str]) -> list[str]:
    """Recursively pack `text` on the coarsest separator in `seps` that keeps chunks within
    max_chars, descending to a finer one only for a part that is itself still too big."""
    if len(text) <= max_chars:
        return [text]
-    chunks: list[str] = []
+    sep, rest = seps[0], seps[1:]
-    cur: list[str] = []
+    parts = list(text) if sep == "" else text.split(sep)
-    size = 0
+    out: list[str] = []
-    for para in text.split("\n\n"):
+    cur = ""
-        if size + len(para) > max_chars and cur:
+    for p in parts:
-            chunks.append("\n\n".join(cur))
+        candidate = p if not cur else cur + sep + p
-            cur, size = [], 0
+        if len(candidate) <= max_chars:
-        cur.append(para)
+            cur = candidate
-        size += len(para) + 2
+            continue
        if cur:
-        chunks.append("\n\n".join(cur))
+            out.append(cur)
-    return chunks
+        if len(p) <= max_chars:
            cur = p
        else:  # a single part still too big → split it on the next-finer boundary
            out.extend(_pack(p, max_chars, rest))
            cur = ""
    if cur:
        out.append(cur)
    return out
 def _parse_claims(content: str) -> list[dict]:
@@ -28,8 +28,8 @@ def _document_text(doc, *, user_agent: str) -> str:
    raise ValueError(f"no text source for {doc['doc_id']} (kind={doc['kind']}, url={doc['url']})")
-def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 4,
+def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 999,
-                chunk_chars: int = 18_000, lease_seconds: int = 900,
+                chunk_chars: int = 12_000, lease_seconds: int = 900,
                worker_id: str = "extract-1") -> dict:
    from .backends import from_config as backend_from_config
    backend = backend_from_config(cfg, sc)
Author	SHA1	Message	Date
Keysat	0cba2626d3	Record Strike test conditional pass; track reflexivity follow-up Full STRIKE2022 pipeline ran clean: 63 filing extracts (3,330 claims, 56,008 total), embed-claims indexed all into Qdrant, two-sided live/test. Engine refuses the false positive (net=+0.25, capped single bitcoin cluster, far below the 2.0 firing bar) but the own_network-drop reflexivity demo is unexercised (own_net=0, live==test) because RHR/CD/Bitcoin.Review were deferred at transcription 2026-06-08. Accepted as a conditional pass; un-defer follow-up tracked in ROADMAP.	2026-06-16 12:39:13 -05:00
Keysat	0b001b49d5	Handoff: record Gemini timeout lesson; Strike extraction near complete Fold the hang lesson into the Gemini operational rule (disable thinking AND set a timeout) and refresh Current state for the in-progress Gemini extraction (~68 docs left, 52.7k claims) and the gating Strike test.	2026-06-16 08:45:12 -05:00
Keysat	87b6b05d67	Add request timeout and retry to Gemini extraction backend A timeout-less generate_content call hung the single-threaded extract worker for ~50 min mid-batch. Set an HTTP timeout (120s) plus 4 retries with backoff, mirroring SparkControl._post; transient 504/read-timeouts now self-heal instead of freezing the run.	2026-06-16 08:45:12 -05:00
Keysat	6f4698a98c	Handoff: durable chunker/Gemini rules; Strike extraction in progress Record two recurrence-prone gotchas in Key operational rules (transcript chunking must cap every chunk; Gemini backend must disable thinking) and rewrite Current state for the in-progress Gemini extraction batch and the gating Strike test.	2026-06-15 22:47:20 -05:00
Keysat	e8d50efdf4	Disable Gemini thinking budget in extraction backend gemini-2.5-flash thinks by default and spent ~3.8k of the 4k output budget on reasoning, hitting MAX_TOKENS with a truncated JSON body -> 0 claims parsed. Set thinking_budget=0 so the full budget goes to the answer (mirrors the local path's enable_thinking=False). On the validation chunk this went from 0 -> 11 claims.	2026-06-15 22:28:12 -05:00
Keysat	5deffddb17	Fix transcript chunker context overflow; full-coverage extraction defaults chunk_text split only on "\n\n", but ASR transcripts have none (speaker turns are joined by a single "\n"), so whole 2-3h episodes (~250K chars) went to the extractor in one call and 400'd on context overflow. Fall through paragraph -> line -> sentence -> word -> hard char-slice so no chunk exceeds the cap regardless of punctuation; guard max_chars < 1. Default extraction to recall-first full coverage (chunk_chars 12K, max_chunks 999) and expose both as run-extract --chunk-chars / --max-chunks.	2026-06-15 22:28:12 -05:00