Compare commits

...

6 Commits

Author SHA1 Message Date
Keysat 0cba2626d3 Record Strike test conditional pass; track reflexivity follow-up
Full STRIKE2022 pipeline ran clean: 63 filing extracts (3,330 claims,
56,008 total), embed-claims indexed all into Qdrant, two-sided live/test.
Engine refuses the false positive (net=+0.25, capped single bitcoin
cluster, far below the 2.0 firing bar) but the own_network-drop
reflexivity demo is unexercised (own_net=0, live==test) because
RHR/CD/Bitcoin.Review were deferred at transcription 2026-06-08.
Accepted as a conditional pass; un-defer follow-up tracked in ROADMAP.
2026-06-16 12:39:13 -05:00
Keysat 0b001b49d5 Handoff: record Gemini timeout lesson; Strike extraction near complete
Fold the hang lesson into the Gemini operational rule (disable thinking AND set a timeout) and refresh Current state for the in-progress Gemini extraction (~68 docs left, 52.7k claims) and the gating Strike test.
2026-06-16 08:45:12 -05:00
Keysat 87b6b05d67 Add request timeout and retry to Gemini extraction backend
A timeout-less generate_content call hung the single-threaded extract worker for ~50 min mid-batch. Set an HTTP timeout (120s) plus 4 retries with backoff, mirroring SparkControl._post; transient 504/read-timeouts now self-heal instead of freezing the run.
2026-06-16 08:45:12 -05:00
Keysat 6f4698a98c Handoff: durable chunker/Gemini rules; Strike extraction in progress
Record two recurrence-prone gotchas in Key operational rules (transcript chunking must cap every chunk; Gemini backend must disable thinking) and rewrite Current state for the in-progress Gemini extraction batch and the gating Strike test.
2026-06-15 22:47:20 -05:00
Keysat e8d50efdf4 Disable Gemini thinking budget in extraction backend
gemini-2.5-flash thinks by default and spent ~3.8k of the 4k output budget on reasoning, hitting MAX_TOKENS with a truncated JSON body -> 0 claims parsed. Set thinking_budget=0 so the full budget goes to the answer (mirrors the local path's enable_thinking=False). On the validation chunk this went from 0 -> 11 claims.
2026-06-15 22:28:12 -05:00
Keysat 5deffddb17 Fix transcript chunker context overflow; full-coverage extraction defaults
chunk_text split only on "\n\n", but ASR transcripts have none (speaker turns are joined by a single "\n"), so whole 2-3h episodes (~250K chars) went to the extractor in one call and 400'd on context overflow. Fall through paragraph -> line -> sentence -> word -> hard char-slice so no chunk exceeds the cap regardless of punctuation; guard max_chars < 1.

Default extraction to recall-first full coverage (chunk_chars 12K, max_chunks 999) and expose both as run-extract --chunk-chars / --max-chunks.
2026-06-15 22:28:12 -05:00
6 changed files with 120 additions and 43 deletions
+33 -14
View File
@@ -89,6 +89,17 @@ ASR + diarizer, **bge-m3** embeddings, **Qdrant** on `:87`. The gateway is the o
independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call). independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call).
- **Gemini quota is a rolling ~24h window** (~291 hour-long episodes / ~51M tokens), not a calendar-day - **Gemini quota is a rolling ~24h window** (~291 hour-long episodes / ~51M tokens), not a calendar-day
reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window. reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window.
- **Transcript chunking is recall-first and MUST cap every chunk.** ASR transcripts have NO blank-line
paragraphs (speaker turns joined by a single `\n`), so `extract.claims.chunk_text` falls through
`\n\n``\n`→sentence→word→hard-slice; splitting only on `\n\n` (the old bug) sent whole 23 h episodes
in ONE call → context-overflow 400s. Extraction defaults to full coverage at 12K chars/chunk
(`run-extract --chunk-chars/--max-chunks`); bigger chunks risk lost-in-the-middle recall loss.
- **Gemini extraction backend: disable thinking AND set a timeout.** `gemini-2.5-flash` thinks by
default and burns the output-token budget on reasoning → MAX_TOKENS → truncated JSON → 0 claims, so
the backend sets `thinking_budget=0` (mirrors local `enable_thinking=False`). It also sets an HTTP
**timeout (120 s) + 4 retries** — a timeout-less call once hung the single-threaded worker ~50 min
(transient 504/read-timeouts then self-heal). Gemini = overflow for PUBLIC data only; keep
`EXTRACTION_BACKEND=local` in `.env`, flip it inline per-run when overflowing.
- **Scoring-brain internals are scoped to a guide.** Before editing `signal_engine/signals/`, read - **Scoring-brain internals are scoped to a guide.** Before editing `signal_engine/signals/`, read
**`docs/guides/scoring-brain.md`** — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type **`docs/guides/scoring-brain.md`** — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type
hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the
@@ -105,17 +116,25 @@ IP appears only as an env-var default.
## Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`) ## Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`)
- **Battery adversarial test: PASSES.** Corpus built (23 docs via the `docs` fetcher); after the three - **Strike adversarial test: CONDITIONAL PASS (2026-06-16).** Full pipeline ran: extraction drained the
scoring fixes the engine reads demand-net rising (+3.9) while **supply stays flat at 0.0** — correctly final 63 filing jobs via the Gemini backend (3,330 claims, **0 failures/timeouts/429s**) → **56,008
rejecting Cantor's *announced* $2B and borrower-side collateral claims as not-realized-supply. claims** total → `embed-claims` indexed all 56,008 into Qdrant (points == claims, hybrid dense+BM25) →
- **Strike adversarial test: STALLED — needs a manual resume; no result yet.** The independent leg (What `two-sided --conviction STRIKE2022 --modes live,test`. **Result:** the engine correctly **refuses the
Bitcoin Did, Stephan Livera, Kevin Rooke, Anita Posch, Cafe Bitcoin, + River research — all false positive** — the two scoring nodes (`lightning-retail-acceptance`, `merchant-lightning-integration`)
independent) is ~586/671 transcribed. The `run_strike_pipeline.sh` watcher proceeded on that partial sit at `net=+0.25`, the *capped single-bitcoin-cluster* value, far below the `EISC_FLOOR=2.0` firing bar
corpus, but its extraction worker **died (2026-06-11)** after only 17/635 podcast docs; the stale lease (`signals/bar.py`); the §3 "Bitcoin is one capped cluster" guardrail holds. **But the reflexivity
is cleared, so **608 bitcoin-podcast extract jobs are pending**. Spark vLLM is healthy → **RESUME:** *demonstration* (live < test via own_network drop) is NOT exercised:** `own_net=0` and live==test because
`run-extract --limit 700 --max-chunks 4``embed-claims``two-sided --conviction STRIKE2022 the own_network shows that carry the reflexive Lightning chatter — **RHR (80), CD (77), Bitcoin.Review (12)
--modes live,test` (PASS = quiet in live, fires in test). The Spark **audio fix** (semaphore-of-2 + = 169 eps — were deferred at transcription 2026-06-08** ("focus on WBD/Livera/Rooke/Anita"; no audio
retry-backoff) is committed and validated (~2.5× faster, zero episode aborts). downloaded), so they have 0 claims. (TFTC partially transcribed, 19/80 → 329 claims; the current +0.25
- **§7.1 power-infra backtest:** qualified YES (corpus-gated; runway/precision caveats in `DESIGN_v2.md`). comes from *independent* bitcoin-cluster shows.) **Operator call 2026-06-16: accept the conditional pass,
- Corpus now spans bitcoin podcasts, SEC/FMP company filings (incl. 6 major banks + Robinhood, a new no audio-GPU spend now.** To fully demonstrate reflexivity later: un-defer + transcribe the RHR/CD
`banks` cluster), the Battery text corpus, and River research. EISC edges seeded for the bitcoin cluster. 202223 Lightning-retail window → re-extract → re-run `two-sided` (then test mode should fire while live
stays quiet). Tracked in `ROADMAP.md`.
- **Battery test PASSES; §7.1 power-infra qualified YES** (both unchanged).
- **5 commits ahead of `origin/main`, UNPUSHED** (Gitea `immense-voyage.local`). The `backends.py`
timeout/retry change the prior handoff flagged as uncommitted was in fact committed (`87b6b05`); tree is
otherwise clean apart from this Current-state edit. Run `git push origin main`.
- Corpus: bitcoin podcasts (own_network: TFTC partial; RHR/CD/Bitcoin.Review deferred), SEC/FMP filings
(+`banks` cluster, now extracted: Robinhood 2216, Morgan Stanley 644, JPMorgan 382… + power-infra names
Oklo/NuScale/Cipher/TeraWulf), Battery corpus, River research; EISC edges seeded for the bitcoin cluster.
+11
View File
@@ -18,6 +18,13 @@ falsification hypotheses (H1H6) are in `DESIGN_v2.md`.
- **MD&A targeting** for filings — extract Item 7, not front-matter boilerplate. - **MD&A targeting** for filings — extract Item 7, not front-matter boilerplate.
## Corpus & independence ## Corpus & independence
- **Complete the Strike reflexivity demonstration (deferred 2026-06-16).** The STRIKE2022 adversarial test
is a *conditional* pass: the engine refuses the false positive via the capped single-bitcoin-cluster
guard (`net=+0.25``EISC_FLOOR=2.0`), but the own_network-drop mechanism (live quiet < test fires) is
unexercised because RHR (80) / CD (77) / Bitcoin.Review (12) were deferred at transcription on 2026-06-08
(`own_net=0`, live==test). To finish: un-defer + transcribe the **RHR/CD 202223 Lightning-retail window**
`run-extract` → re-run `two-sided --conviction STRIKE2022 --modes live,test`; PASS = test fires while
live stays quiet. Costs constrained audio-GPU time, hence deferred.
- **Confirm materiality** of the remaining `own_network`-flagged sources with Grant: Unchained, Debifi, - **Confirm materiality** of the remaining `own_network`-flagged sources with Grant: Unchained, Debifi,
Coinkite (Bitcoin.Review). Immaterial → flip to independent (the River/Swan precedent). Coinkite (Bitcoin.Review). Immaterial → flip to independent (the River/Swan precedent).
- **BTC Sessions (Ben Perrin)** — strongest still-missing independent high-Strike merchant/wallet-adoption - **BTC Sessions (Ben Perrin)** — strongest still-missing independent high-Strike merchant/wallet-adoption
@@ -33,6 +40,10 @@ falsification hypotheses (H1H6) are in `DESIGN_v2.md`.
- **Episode-pipelining** in `transcribe_worker` — download/chunk the next episode while transcribing the - **Episode-pipelining** in `transcribe_worker` — download/chunk the next episode while transcribing the
current one, to close the inter-episode GPU idle gap (the per-chunk 2-in-flight path is already done). current one, to close the inter-episode GPU idle gap (the per-chunk 2-in-flight path is already done).
- **Corpus-management UI** — add to the corpus over time and see the full corpus selection. - **Corpus-management UI** — add to the corpus over time and see the full corpus selection.
- **Expose pipeline tunables in the UI (with the UI topic).** Extraction chunk size + per-doc chunk cap,
audio chunk length, audio concurrency, etc. are currently hardcoded defaults (now also CLI flags on
`run-extract`: `--chunk-chars`, `--max-chunks`). Surface them in the UI so they're visible/adjustable,
not black-box assumptions we forget about. Tie to the corpus-management UI work.
- **Forward live operation** — the only real test: scoring un-pre-selected signals as they arrive, with - **Forward live operation** — the only real test: scoring un-pre-selected signals as they arrive, with
the dual-evaluation ledger as arbiter. the dual-evaluation ledger as arbiter.
+6 -2
View File
@@ -254,7 +254,8 @@ def cmd_run_extract(args: argparse.Namespace) -> int:
conn = db.connect(cfg.db_path) conn = db.connect(cfg.db_path)
db.init_db(conn) db.init_db(conn)
sc = from_config(cfg) sc = from_config(cfg)
result = run_extract(conn, sc, cfg, limit=args.limit, max_chunks_per_doc=args.max_chunks) result = run_extract(conn, sc, cfg, limit=args.limit, max_chunks_per_doc=args.max_chunks,
chunk_chars=args.chunk_chars)
print(f"extraction: {result['jobs_processed']} jobs, {result['claims_written']} claims written") print(f"extraction: {result['jobs_processed']} jobs, {result['claims_written']} claims written")
return 0 return 0
@@ -581,7 +582,10 @@ def build_parser() -> argparse.ArgumentParser:
re = sub.add_parser("run-extract", help="Drain 'extract' jobs → claims via the local LLM (§4.2)") re = sub.add_parser("run-extract", help="Drain 'extract' jobs → claims via the local LLM (§4.2)")
re.add_argument("--limit", type=int, default=5, help="max jobs to process this run") re.add_argument("--limit", type=int, default=5, help="max jobs to process this run")
re.add_argument("--max-chunks", type=int, default=4, help="max chunks per document") re.add_argument("--max-chunks", type=int, default=999,
help="max chunks per document (default: full coverage (999))")
re.add_argument("--chunk-chars", type=int, default=12_000,
help="chars per extraction chunk; smaller = better recall, more LLM calls")
re.set_defaults(func=cmd_run_extract) re.set_defaults(func=cmd_run_extract)
sub.add_parser("queue-status", help="Backfill queue counts by type/state").set_defaults(func=cmd_queue_status) sub.add_parser("queue-status", help="Backfill queue counts by type/state").set_defaults(func=cmd_queue_status)
+25 -8
View File
@@ -10,6 +10,7 @@ A backend exposes: complete_json(messages, max_tokens) -> str (a JSON object st
from __future__ import annotations from __future__ import annotations
import logging import logging
import time
log = logging.getLogger(__name__) log = logging.getLogger(__name__)
@@ -32,27 +33,43 @@ class GeminiBackend:
API is the eventual scale path; this synchronous form is the drop-in fallback.""" API is the eventual scale path; this synchronous form is the drop-in fallback."""
name = "gemini" name = "gemini"
def __init__(self, api_key: str, model: str = "gemini-2.5-flash") -> None: def __init__(self, api_key: str, model: str = "gemini-2.5-flash", *,
timeout_s: float = 120.0, retries: int = 4) -> None:
from google import genai # guarded import; pip install google-genai from google import genai # guarded import; pip install google-genai
from google.genai import types
self._genai = genai self._genai = genai
self.client = genai.Client(api_key=api_key) self._types = types
# http_options.timeout is in MILLISECONDS — without it a stalled request hangs the (single-
# threaded) worker forever; one such hang froze a whole batch for ~50 min before this fix.
self.client = genai.Client(api_key=api_key,
http_options=types.HttpOptions(timeout=int(timeout_s * 1000)))
self.model = model self.model = model
self.retries = retries
def complete_json(self, messages: list[dict], *, max_tokens: int = 4000) -> str: def complete_json(self, messages: list[dict], *, max_tokens: int = 4000) -> str:
from google.genai import types types = self._types
system = "\n\n".join(m["content"] for m in messages if m["role"] == "system") system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
user = "\n\n".join(m["content"] for m in messages if m["role"] != "system") user = "\n\n".join(m["content"] for m in messages if m["role"] != "system")
resp = self.client.models.generate_content( cfg = types.GenerateContentConfig(
model=self.model,
contents=user,
config=types.GenerateContentConfig(
system_instruction=system or None, system_instruction=system or None,
temperature=0, temperature=0,
max_output_tokens=max_tokens, max_output_tokens=max_tokens,
response_mime_type="application/json", response_mime_type="application/json",
), # Gemini 2.5 thinks by default and spends the output budget on reasoning tokens —
# it hit MAX_TOKENS with ~3.8k thoughts and a truncated JSON body (0 claims parsed).
# Extraction is deterministic, no-CoT (mirrors the local path's enable_thinking=False).
thinking_config=types.ThinkingConfig(thinking_budget=0),
) )
for attempt in range(self.retries + 1):
try:
resp = self.client.models.generate_content(model=self.model, contents=user, config=cfg)
return resp.text or "{}" return resp.text or "{}"
except Exception as e: # noqa: BLE001 — timeout/5xx/429/network: back off and retry
if attempt >= self.retries:
raise
sleep = 2.0 * (2 ** attempt)
log.warning("Gemini call failed (%s); retry %d/%d in %.0fs", e, attempt + 1, self.retries, sleep)
time.sleep(sleep)
def from_config(cfg, sc) -> "LocalQwenBackend | GeminiBackend": def from_config(cfg, sc) -> "LocalQwenBackend | GeminiBackend":
+38 -12
View File
@@ -30,25 +30,51 @@ def register_seed_topics(conn: sqlite3.Connection) -> None:
conn.commit() conn.commit()
# Coarse→fine split boundaries. Transcripts arrive as `Speaker: turn` lines joined by a SINGLE
# newline (ASR output has no blank-line paragraphs), filings as paragraph text — so splitting on
# "\n\n" alone never fires on a transcript and the whole episode would go in one call. "" is the
# per-character hard cap that guarantees termination regardless of punctuation.
_SEPARATORS = ["\n\n", "\n", ". ", " ", ""]
def chunk_text(text: str, max_chars: int) -> list[str]: def chunk_text(text: str, max_chars: int) -> list[str]:
"""Split on paragraph boundaries into windows that fit the model context alongside the prompt.""" """Pack text into windows that each fit the model context alongside the prompt.
Falls through paragraph → line → sentence → word → hard char-slice, so NO chunk ever exceeds
max_chars however the source is punctuated, while keeping speaker turns intact when they fit.
"""
if max_chars < 1: # else _pack recurses past the last separator → IndexError
raise ValueError(f"max_chars must be >= 1, got {max_chars}")
text = text.strip() text = text.strip()
if not text: if not text:
return [] return []
return _pack(text, max_chars, _SEPARATORS)
def _pack(text: str, max_chars: int, seps: list[str]) -> list[str]:
"""Recursively pack `text` on the coarsest separator in `seps` that keeps chunks within
max_chars, descending to a finer one only for a part that is itself still too big."""
if len(text) <= max_chars: if len(text) <= max_chars:
return [text] return [text]
chunks: list[str] = [] sep, rest = seps[0], seps[1:]
cur: list[str] = [] parts = list(text) if sep == "" else text.split(sep)
size = 0 out: list[str] = []
for para in text.split("\n\n"): cur = ""
if size + len(para) > max_chars and cur: for p in parts:
chunks.append("\n\n".join(cur)) candidate = p if not cur else cur + sep + p
cur, size = [], 0 if len(candidate) <= max_chars:
cur.append(para) cur = candidate
size += len(para) + 2 continue
if cur: if cur:
chunks.append("\n\n".join(cur)) out.append(cur)
return chunks if len(p) <= max_chars:
cur = p
else: # a single part still too big → split it on the next-finer boundary
out.extend(_pack(p, max_chars, rest))
cur = ""
if cur:
out.append(cur)
return out
def _parse_claims(content: str) -> list[dict]: def _parse_claims(content: str) -> list[dict]:
+2 -2
View File
@@ -28,8 +28,8 @@ def _document_text(doc, *, user_agent: str) -> str:
raise ValueError(f"no text source for {doc['doc_id']} (kind={doc['kind']}, url={doc['url']})") raise ValueError(f"no text source for {doc['doc_id']} (kind={doc['kind']}, url={doc['url']})")
def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 4, def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 999,
chunk_chars: int = 18_000, lease_seconds: int = 900, chunk_chars: int = 12_000, lease_seconds: int = 900,
worker_id: str = "extract-1") -> dict: worker_id: str = "extract-1") -> dict:
from .backends import from_config as backend_from_config from .backends import from_config as backend_from_config
backend = backend_from_config(cfg, sc) backend = backend_from_config(cfg, sc)