Compare commits
8 Commits
cabb8a3d6c
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| 6d2c3fbf08 | |||
| 29bca8b387 | |||
| 0cba2626d3 | |||
| 0b001b49d5 | |||
| 87b6b05d67 | |||
| 6f4698a98c | |||
| e8d50efdf4 | |||
| 5deffddb17 |
@@ -89,6 +89,17 @@ ASR + diarizer, **bge-m3** embeddings, **Qdrant** on `:87`. The gateway is the o
|
||||
independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call).
|
||||
- **Gemini quota is a rolling ~24h window** (~291 hour-long episodes / ~51M tokens), not a calendar-day
|
||||
reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window.
|
||||
- **Transcript chunking is recall-first and MUST cap every chunk.** ASR transcripts have NO blank-line
|
||||
paragraphs (speaker turns joined by a single `\n`), so `extract.claims.chunk_text` falls through
|
||||
`\n\n`→`\n`→sentence→word→hard-slice; splitting only on `\n\n` (the old bug) sent whole 2–3 h episodes
|
||||
in ONE call → context-overflow 400s. Extraction defaults to full coverage at 12K chars/chunk
|
||||
(`run-extract --chunk-chars/--max-chunks`); bigger chunks risk lost-in-the-middle recall loss.
|
||||
- **Gemini extraction backend: disable thinking AND set a timeout.** `gemini-2.5-flash` thinks by
|
||||
default and burns the output-token budget on reasoning → MAX_TOKENS → truncated JSON → 0 claims, so
|
||||
the backend sets `thinking_budget=0` (mirrors local `enable_thinking=False`). It also sets an HTTP
|
||||
**timeout (120 s) + 4 retries** — a timeout-less call once hung the single-threaded worker ~50 min
|
||||
(transient 504/read-timeouts then self-heal). Gemini = overflow for PUBLIC data only; keep
|
||||
`EXTRACTION_BACKEND=local` in `.env`, flip it inline per-run when overflowing.
|
||||
- **Scoring-brain internals are scoped to a guide.** Before editing `signal_engine/signals/`, read
|
||||
**`docs/guides/scoring-brain.md`** — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type
|
||||
hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the
|
||||
@@ -105,17 +116,19 @@ IP appears only as an env-var default.
|
||||
|
||||
## Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`)
|
||||
|
||||
- **Battery adversarial test: PASSES.** Corpus built (23 docs via the `docs` fetcher); after the three
|
||||
scoring fixes the engine reads demand-net rising (+3.9) while **supply stays flat at 0.0** — correctly
|
||||
rejecting Cantor's *announced* $2B and borrower-side collateral claims as not-realized-supply.
|
||||
- **Strike adversarial test: STALLED — needs a manual resume; no result yet.** The independent leg (What
|
||||
Bitcoin Did, Stephan Livera, Kevin Rooke, Anita Posch, Cafe Bitcoin, + River research — all
|
||||
independent) is ~586/671 transcribed. The `run_strike_pipeline.sh` watcher proceeded on that partial
|
||||
corpus, but its extraction worker **died (2026-06-11)** after only 17/635 podcast docs; the stale lease
|
||||
is cleared, so **608 bitcoin-podcast extract jobs are pending**. Spark vLLM is healthy → **RESUME:**
|
||||
`run-extract --limit 700 --max-chunks 4` → `embed-claims` → `two-sided --conviction STRIKE2022
|
||||
--modes live,test` (PASS = quiet in live, fires in test). The Spark **audio fix** (semaphore-of-2 +
|
||||
retry-backoff) is committed and validated (~2.5× faster, zero episode aborts).
|
||||
- **§7.1 power-infra backtest:** qualified YES (corpus-gated; runway/precision caveats in `DESIGN_v2.md`).
|
||||
- Corpus now spans bitcoin podcasts, SEC/FMP company filings (incl. 6 major banks + Robinhood, a new
|
||||
`banks` cluster), the Battery text corpus, and River research. EISC edges seeded for the bitcoin cluster.
|
||||
- **Strike adversarial test: CONDITIONAL PASS (2026-06-16).** Pipeline complete end-to-end — extraction
|
||||
drained the last 63 filing jobs (Gemini, 3,330 claims, 0 failures) → **56,008 claims** all embedded in
|
||||
Qdrant → `two-sided --conviction STRIKE2022 --modes live,test`. The engine **refuses the false positive**:
|
||||
the Lightning-retail nodes net `+0.25` (capped single bitcoin cluster, ≪ `EISC_FLOOR=2.0`), §3 guardrail
|
||||
holding. The own_network-drop *reflexivity demo* is unexercised (`own_net=0`, live==test) because
|
||||
RHR/CD/Bitcoin.Review (169 eps) were deferred at transcription 2026-06-08; operator accepted the
|
||||
conditional pass, no audio-GPU spend now. How to read the net value + the full follow-up:
|
||||
`docs/guides/scoring-brain.md` (STRIKE2022) and `ROADMAP.md`.
|
||||
- **Battery test PASSES; §7.1 power-infra qualified YES** (both unchanged).
|
||||
- **Corpus:** bitcoin podcasts (own_network: TFTC partial 19/80; RHR/CD/Bitcoin.Review deferred), SEC/FMP
|
||||
filings (`banks` cluster now extracted + power-infra names Oklo/NuScale/Cipher/TeraWulf), Battery corpus,
|
||||
River research; EISC edges seeded for the bitcoin cluster.
|
||||
- **Repo:** clean, in sync with `origin/main` (Gitea). No automated test suite (on ROADMAP).
|
||||
- **NEXT (priority order, all fresh scopes — confirm direction first):** (1) frontier-fan-out test H6, the
|
||||
untested half of the §1.1 validation; (2) complete the Strike reflexivity demo when audio-GPU budget
|
||||
allows (un-defer RHR/CD 2022–23 → re-extract → re-run); (3) Job A discovery scorers for the forward pilot.
|
||||
|
||||
+22
@@ -18,6 +18,13 @@ falsification hypotheses (H1–H6) are in `DESIGN_v2.md`.
|
||||
- **MD&A targeting** for filings — extract Item 7, not front-matter boilerplate.
|
||||
|
||||
## Corpus & independence
|
||||
- **Complete the Strike reflexivity demonstration (deferred 2026-06-16).** The STRIKE2022 adversarial test
|
||||
is a *conditional* pass: the engine refuses the false positive via the capped single-bitcoin-cluster
|
||||
guard (`net=+0.25` ≪ `EISC_FLOOR=2.0`), but the own_network-drop mechanism (live quiet < test fires) is
|
||||
unexercised because RHR (80) / CD (77) / Bitcoin.Review (12) were deferred at transcription on 2026-06-08
|
||||
(`own_net=0`, live==test). To finish: un-defer + transcribe the **RHR/CD 2022–23 Lightning-retail window**
|
||||
→ `run-extract` → re-run `two-sided --conviction STRIKE2022 --modes live,test`; PASS = test fires while
|
||||
live stays quiet. Costs constrained audio-GPU time, hence deferred.
|
||||
- **Confirm materiality** of the remaining `own_network`-flagged sources with Grant: Unchained, Debifi,
|
||||
Coinkite (Bitcoin.Review). Immaterial → flip to independent (the River/Swan precedent).
|
||||
- **BTC Sessions (Ben Perrin)** — strongest still-missing independent high-Strike merchant/wallet-adoption
|
||||
@@ -33,6 +40,21 @@ falsification hypotheses (H1–H6) are in `DESIGN_v2.md`.
|
||||
- **Episode-pipelining** in `transcribe_worker` — download/chunk the next episode while transcribing the
|
||||
current one, to close the inter-episode GPU idle gap (the per-chunk 2-in-flight path is already done).
|
||||
- **Corpus-management UI** — add to the corpus over time and see the full corpus selection.
|
||||
- **Expose pipeline tunables in the UI (with the UI topic).** Extraction chunk size + per-doc chunk cap,
|
||||
audio chunk length, audio concurrency, etc. are currently hardcoded defaults (now also CLI flags on
|
||||
`run-extract`: `--chunk-chars`, `--max-chunks`). Surface them in the UI so they're visible/adjustable,
|
||||
not black-box assumptions we forget about. Tie to the corpus-management UI work.
|
||||
- **Daily activity digest email.** A `daily-digest` CLI command rendering a "last 24h" report —
|
||||
corpus throughput (`documents.ingested_at`/`processed_at`, `claims.extracted_at` by kind/cluster),
|
||||
queue health (`backfill.queue.stats()` — surface **failed/stuck** jobs), Qdrant index lag, infra
|
||||
(`spark-status`), and a **key-findings** section (new `ledger` rows by `date_logged`; `candidate_scores`
|
||||
that `cleared_evidence_bar`). All timestamps already default to `datetime('now')`, so the window is a
|
||||
one-liner; the activity half is buildable today. Deliver via **SMTP** (stdlib `smtplib`+`email`, no new
|
||||
dep, configurable per service — `SMTP_HOST/PORT/USER/PASS`, `DIGEST_TO`); ship a `--stdout` dry-run
|
||||
mode; schedule via launchd on the Mac. Two dependencies: (1) the findings section is only real once the
|
||||
**Job A discovery scorers** run on a schedule — until then it's stubbed/echoes manual adversarial runs;
|
||||
(2) sovereignty (guardrail #7) — SMTP through your own/ten31 server keeps it inside the boundary; do NOT
|
||||
route through a third-party email API if findings ever carry Battery/Strike or positioning substance.
|
||||
- **Forward live operation** — the only real test: scoring un-pre-selected signals as they arrive, with
|
||||
the dual-evaluation ledger as arbiter.
|
||||
|
||||
|
||||
@@ -70,6 +70,11 @@ Pre-registered failed convictions used to test the engine against its target fai
|
||||
stays quiet in `live` (own_network dropped) while it would fire in `test`** — the engine refusing the
|
||||
intra-cluster echo. Run `two-sided --conviction STRIKE2022 --modes live,test`. The REALIZED-ONLY rule
|
||||
is load-bearing here (speculative "Lightning will revolutionize payments" is `predictive`, not signal).
|
||||
**Reading the output:** a single capped bitcoin cluster nets `eisc≈0.25` — already sub-bar vs
|
||||
`EISC_FLOOR=2.0`, so a `+0.25` "quiet in live" can be the *cluster cap* refusing the false positive,
|
||||
NOT the own_network drop. Check `own_net`: if it's 0, live==test and the reflexivity mechanism is
|
||||
unexercised (the affirmers are independent), so a quiet `live` does not by itself prove the echo-drop —
|
||||
you need own_network affirms present (`own_net>0`) for `test` to fire above `live`.
|
||||
|
||||
**Standing rule S1:** derivatives resolve on OUTCOME (scaled substance), never milestones or enablers.
|
||||
An announced program / a regulatory unblock / a single bank's toe-in is CONTEXT, not corroboration.
|
||||
|
||||
@@ -254,7 +254,8 @@ def cmd_run_extract(args: argparse.Namespace) -> int:
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
sc = from_config(cfg)
|
||||
result = run_extract(conn, sc, cfg, limit=args.limit, max_chunks_per_doc=args.max_chunks)
|
||||
result = run_extract(conn, sc, cfg, limit=args.limit, max_chunks_per_doc=args.max_chunks,
|
||||
chunk_chars=args.chunk_chars)
|
||||
print(f"extraction: {result['jobs_processed']} jobs, {result['claims_written']} claims written")
|
||||
return 0
|
||||
|
||||
@@ -581,7 +582,10 @@ def build_parser() -> argparse.ArgumentParser:
|
||||
|
||||
re = sub.add_parser("run-extract", help="Drain 'extract' jobs → claims via the local LLM (§4.2)")
|
||||
re.add_argument("--limit", type=int, default=5, help="max jobs to process this run")
|
||||
re.add_argument("--max-chunks", type=int, default=4, help="max chunks per document")
|
||||
re.add_argument("--max-chunks", type=int, default=999,
|
||||
help="max chunks per document (default: full coverage (999))")
|
||||
re.add_argument("--chunk-chars", type=int, default=12_000,
|
||||
help="chars per extraction chunk; smaller = better recall, more LLM calls")
|
||||
re.set_defaults(func=cmd_run_extract)
|
||||
|
||||
sub.add_parser("queue-status", help="Backfill queue counts by type/state").set_defaults(func=cmd_queue_status)
|
||||
|
||||
@@ -10,6 +10,7 @@ A backend exposes: complete_json(messages, max_tokens) -> str (a JSON object st
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import time
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
@@ -32,27 +33,43 @@ class GeminiBackend:
|
||||
API is the eventual scale path; this synchronous form is the drop-in fallback."""
|
||||
name = "gemini"
|
||||
|
||||
def __init__(self, api_key: str, model: str = "gemini-2.5-flash") -> None:
|
||||
def __init__(self, api_key: str, model: str = "gemini-2.5-flash", *,
|
||||
timeout_s: float = 120.0, retries: int = 4) -> None:
|
||||
from google import genai # guarded import; pip install google-genai
|
||||
from google.genai import types
|
||||
self._genai = genai
|
||||
self.client = genai.Client(api_key=api_key)
|
||||
self._types = types
|
||||
# http_options.timeout is in MILLISECONDS — without it a stalled request hangs the (single-
|
||||
# threaded) worker forever; one such hang froze a whole batch for ~50 min before this fix.
|
||||
self.client = genai.Client(api_key=api_key,
|
||||
http_options=types.HttpOptions(timeout=int(timeout_s * 1000)))
|
||||
self.model = model
|
||||
self.retries = retries
|
||||
|
||||
def complete_json(self, messages: list[dict], *, max_tokens: int = 4000) -> str:
|
||||
from google.genai import types
|
||||
types = self._types
|
||||
system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
|
||||
user = "\n\n".join(m["content"] for m in messages if m["role"] != "system")
|
||||
resp = self.client.models.generate_content(
|
||||
model=self.model,
|
||||
contents=user,
|
||||
config=types.GenerateContentConfig(
|
||||
system_instruction=system or None,
|
||||
temperature=0,
|
||||
max_output_tokens=max_tokens,
|
||||
response_mime_type="application/json",
|
||||
),
|
||||
cfg = types.GenerateContentConfig(
|
||||
system_instruction=system or None,
|
||||
temperature=0,
|
||||
max_output_tokens=max_tokens,
|
||||
response_mime_type="application/json",
|
||||
# Gemini 2.5 thinks by default and spends the output budget on reasoning tokens —
|
||||
# it hit MAX_TOKENS with ~3.8k thoughts and a truncated JSON body (0 claims parsed).
|
||||
# Extraction is deterministic, no-CoT (mirrors the local path's enable_thinking=False).
|
||||
thinking_config=types.ThinkingConfig(thinking_budget=0),
|
||||
)
|
||||
return resp.text or "{}"
|
||||
for attempt in range(self.retries + 1):
|
||||
try:
|
||||
resp = self.client.models.generate_content(model=self.model, contents=user, config=cfg)
|
||||
return resp.text or "{}"
|
||||
except Exception as e: # noqa: BLE001 — timeout/5xx/429/network: back off and retry
|
||||
if attempt >= self.retries:
|
||||
raise
|
||||
sleep = 2.0 * (2 ** attempt)
|
||||
log.warning("Gemini call failed (%s); retry %d/%d in %.0fs", e, attempt + 1, self.retries, sleep)
|
||||
time.sleep(sleep)
|
||||
|
||||
|
||||
def from_config(cfg, sc) -> "LocalQwenBackend | GeminiBackend":
|
||||
|
||||
@@ -30,25 +30,51 @@ def register_seed_topics(conn: sqlite3.Connection) -> None:
|
||||
conn.commit()
|
||||
|
||||
|
||||
# Coarse→fine split boundaries. Transcripts arrive as `Speaker: turn` lines joined by a SINGLE
|
||||
# newline (ASR output has no blank-line paragraphs), filings as paragraph text — so splitting on
|
||||
# "\n\n" alone never fires on a transcript and the whole episode would go in one call. "" is the
|
||||
# per-character hard cap that guarantees termination regardless of punctuation.
|
||||
_SEPARATORS = ["\n\n", "\n", ". ", " ", ""]
|
||||
|
||||
|
||||
def chunk_text(text: str, max_chars: int) -> list[str]:
|
||||
"""Split on paragraph boundaries into windows that fit the model context alongside the prompt."""
|
||||
"""Pack text into windows that each fit the model context alongside the prompt.
|
||||
|
||||
Falls through paragraph → line → sentence → word → hard char-slice, so NO chunk ever exceeds
|
||||
max_chars however the source is punctuated, while keeping speaker turns intact when they fit.
|
||||
"""
|
||||
if max_chars < 1: # else _pack recurses past the last separator → IndexError
|
||||
raise ValueError(f"max_chars must be >= 1, got {max_chars}")
|
||||
text = text.strip()
|
||||
if not text:
|
||||
return []
|
||||
return _pack(text, max_chars, _SEPARATORS)
|
||||
|
||||
|
||||
def _pack(text: str, max_chars: int, seps: list[str]) -> list[str]:
|
||||
"""Recursively pack `text` on the coarsest separator in `seps` that keeps chunks within
|
||||
max_chars, descending to a finer one only for a part that is itself still too big."""
|
||||
if len(text) <= max_chars:
|
||||
return [text]
|
||||
chunks: list[str] = []
|
||||
cur: list[str] = []
|
||||
size = 0
|
||||
for para in text.split("\n\n"):
|
||||
if size + len(para) > max_chars and cur:
|
||||
chunks.append("\n\n".join(cur))
|
||||
cur, size = [], 0
|
||||
cur.append(para)
|
||||
size += len(para) + 2
|
||||
sep, rest = seps[0], seps[1:]
|
||||
parts = list(text) if sep == "" else text.split(sep)
|
||||
out: list[str] = []
|
||||
cur = ""
|
||||
for p in parts:
|
||||
candidate = p if not cur else cur + sep + p
|
||||
if len(candidate) <= max_chars:
|
||||
cur = candidate
|
||||
continue
|
||||
if cur:
|
||||
out.append(cur)
|
||||
if len(p) <= max_chars:
|
||||
cur = p
|
||||
else: # a single part still too big → split it on the next-finer boundary
|
||||
out.extend(_pack(p, max_chars, rest))
|
||||
cur = ""
|
||||
if cur:
|
||||
chunks.append("\n\n".join(cur))
|
||||
return chunks
|
||||
out.append(cur)
|
||||
return out
|
||||
|
||||
|
||||
def _parse_claims(content: str) -> list[dict]:
|
||||
|
||||
@@ -28,8 +28,8 @@ def _document_text(doc, *, user_agent: str) -> str:
|
||||
raise ValueError(f"no text source for {doc['doc_id']} (kind={doc['kind']}, url={doc['url']})")
|
||||
|
||||
|
||||
def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 4,
|
||||
chunk_chars: int = 18_000, lease_seconds: int = 900,
|
||||
def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 999,
|
||||
chunk_chars: int = 12_000, lease_seconds: int = 900,
|
||||
worker_id: str = "extract-1") -> dict:
|
||||
from .backends import from_config as backend_from_config
|
||||
backend = backend_from_config(cfg, sc)
|
||||
|
||||
Reference in New Issue
Block a user