Compare commits
8 Commits
cabb8a3d6c
..
main
| Author | SHA1 | Date | |
|---|---|---|---|
| 6d2c3fbf08 | |||
| 29bca8b387 | |||
| 0cba2626d3 | |||
| 0b001b49d5 | |||
| 87b6b05d67 | |||
| 6f4698a98c | |||
| e8d50efdf4 | |||
| 5deffddb17 |
@@ -89,6 +89,17 @@ ASR + diarizer, **bge-m3** embeddings, **Qdrant** on `:87`. The gateway is the o
|
|||||||
independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call).
|
independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call).
|
||||||
- **Gemini quota is a rolling ~24h window** (~291 hour-long episodes / ~51M tokens), not a calendar-day
|
- **Gemini quota is a rolling ~24h window** (~291 hour-long episodes / ~51M tokens), not a calendar-day
|
||||||
reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window.
|
reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window.
|
||||||
|
- **Transcript chunking is recall-first and MUST cap every chunk.** ASR transcripts have NO blank-line
|
||||||
|
paragraphs (speaker turns joined by a single `\n`), so `extract.claims.chunk_text` falls through
|
||||||
|
`\n\n`→`\n`→sentence→word→hard-slice; splitting only on `\n\n` (the old bug) sent whole 2–3 h episodes
|
||||||
|
in ONE call → context-overflow 400s. Extraction defaults to full coverage at 12K chars/chunk
|
||||||
|
(`run-extract --chunk-chars/--max-chunks`); bigger chunks risk lost-in-the-middle recall loss.
|
||||||
|
- **Gemini extraction backend: disable thinking AND set a timeout.** `gemini-2.5-flash` thinks by
|
||||||
|
default and burns the output-token budget on reasoning → MAX_TOKENS → truncated JSON → 0 claims, so
|
||||||
|
the backend sets `thinking_budget=0` (mirrors local `enable_thinking=False`). It also sets an HTTP
|
||||||
|
**timeout (120 s) + 4 retries** — a timeout-less call once hung the single-threaded worker ~50 min
|
||||||
|
(transient 504/read-timeouts then self-heal). Gemini = overflow for PUBLIC data only; keep
|
||||||
|
`EXTRACTION_BACKEND=local` in `.env`, flip it inline per-run when overflowing.
|
||||||
- **Scoring-brain internals are scoped to a guide.** Before editing `signal_engine/signals/`, read
|
- **Scoring-brain internals are scoped to a guide.** Before editing `signal_engine/signals/`, read
|
||||||
**`docs/guides/scoring-brain.md`** — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type
|
**`docs/guides/scoring-brain.md`** — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type
|
||||||
hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the
|
hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the
|
||||||
@@ -105,17 +116,19 @@ IP appears only as an env-var default.
|
|||||||
|
|
||||||
## Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`)
|
## Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`)
|
||||||
|
|
||||||
- **Battery adversarial test: PASSES.** Corpus built (23 docs via the `docs` fetcher); after the three
|
- **Strike adversarial test: CONDITIONAL PASS (2026-06-16).** Pipeline complete end-to-end — extraction
|
||||||
scoring fixes the engine reads demand-net rising (+3.9) while **supply stays flat at 0.0** — correctly
|
drained the last 63 filing jobs (Gemini, 3,330 claims, 0 failures) → **56,008 claims** all embedded in
|
||||||
rejecting Cantor's *announced* $2B and borrower-side collateral claims as not-realized-supply.
|
Qdrant → `two-sided --conviction STRIKE2022 --modes live,test`. The engine **refuses the false positive**:
|
||||||
- **Strike adversarial test: STALLED — needs a manual resume; no result yet.** The independent leg (What
|
the Lightning-retail nodes net `+0.25` (capped single bitcoin cluster, ≪ `EISC_FLOOR=2.0`), §3 guardrail
|
||||||
Bitcoin Did, Stephan Livera, Kevin Rooke, Anita Posch, Cafe Bitcoin, + River research — all
|
holding. The own_network-drop *reflexivity demo* is unexercised (`own_net=0`, live==test) because
|
||||||
independent) is ~586/671 transcribed. The `run_strike_pipeline.sh` watcher proceeded on that partial
|
RHR/CD/Bitcoin.Review (169 eps) were deferred at transcription 2026-06-08; operator accepted the
|
||||||
corpus, but its extraction worker **died (2026-06-11)** after only 17/635 podcast docs; the stale lease
|
conditional pass, no audio-GPU spend now. How to read the net value + the full follow-up:
|
||||||
is cleared, so **608 bitcoin-podcast extract jobs are pending**. Spark vLLM is healthy → **RESUME:**
|
`docs/guides/scoring-brain.md` (STRIKE2022) and `ROADMAP.md`.
|
||||||
`run-extract --limit 700 --max-chunks 4` → `embed-claims` → `two-sided --conviction STRIKE2022
|
- **Battery test PASSES; §7.1 power-infra qualified YES** (both unchanged).
|
||||||
--modes live,test` (PASS = quiet in live, fires in test). The Spark **audio fix** (semaphore-of-2 +
|
- **Corpus:** bitcoin podcasts (own_network: TFTC partial 19/80; RHR/CD/Bitcoin.Review deferred), SEC/FMP
|
||||||
retry-backoff) is committed and validated (~2.5× faster, zero episode aborts).
|
filings (`banks` cluster now extracted + power-infra names Oklo/NuScale/Cipher/TeraWulf), Battery corpus,
|
||||||
- **§7.1 power-infra backtest:** qualified YES (corpus-gated; runway/precision caveats in `DESIGN_v2.md`).
|
River research; EISC edges seeded for the bitcoin cluster.
|
||||||
- Corpus now spans bitcoin podcasts, SEC/FMP company filings (incl. 6 major banks + Robinhood, a new
|
- **Repo:** clean, in sync with `origin/main` (Gitea). No automated test suite (on ROADMAP).
|
||||||
`banks` cluster), the Battery text corpus, and River research. EISC edges seeded for the bitcoin cluster.
|
- **NEXT (priority order, all fresh scopes — confirm direction first):** (1) frontier-fan-out test H6, the
|
||||||
|
untested half of the §1.1 validation; (2) complete the Strike reflexivity demo when audio-GPU budget
|
||||||
|
allows (un-defer RHR/CD 2022–23 → re-extract → re-run); (3) Job A discovery scorers for the forward pilot.
|
||||||
|
|||||||
+22
@@ -18,6 +18,13 @@ falsification hypotheses (H1–H6) are in `DESIGN_v2.md`.
|
|||||||
- **MD&A targeting** for filings — extract Item 7, not front-matter boilerplate.
|
- **MD&A targeting** for filings — extract Item 7, not front-matter boilerplate.
|
||||||
|
|
||||||
## Corpus & independence
|
## Corpus & independence
|
||||||
|
- **Complete the Strike reflexivity demonstration (deferred 2026-06-16).** The STRIKE2022 adversarial test
|
||||||
|
is a *conditional* pass: the engine refuses the false positive via the capped single-bitcoin-cluster
|
||||||
|
guard (`net=+0.25` ≪ `EISC_FLOOR=2.0`), but the own_network-drop mechanism (live quiet < test fires) is
|
||||||
|
unexercised because RHR (80) / CD (77) / Bitcoin.Review (12) were deferred at transcription on 2026-06-08
|
||||||
|
(`own_net=0`, live==test). To finish: un-defer + transcribe the **RHR/CD 2022–23 Lightning-retail window**
|
||||||
|
→ `run-extract` → re-run `two-sided --conviction STRIKE2022 --modes live,test`; PASS = test fires while
|
||||||
|
live stays quiet. Costs constrained audio-GPU time, hence deferred.
|
||||||
- **Confirm materiality** of the remaining `own_network`-flagged sources with Grant: Unchained, Debifi,
|
- **Confirm materiality** of the remaining `own_network`-flagged sources with Grant: Unchained, Debifi,
|
||||||
Coinkite (Bitcoin.Review). Immaterial → flip to independent (the River/Swan precedent).
|
Coinkite (Bitcoin.Review). Immaterial → flip to independent (the River/Swan precedent).
|
||||||
- **BTC Sessions (Ben Perrin)** — strongest still-missing independent high-Strike merchant/wallet-adoption
|
- **BTC Sessions (Ben Perrin)** — strongest still-missing independent high-Strike merchant/wallet-adoption
|
||||||
@@ -33,6 +40,21 @@ falsification hypotheses (H1–H6) are in `DESIGN_v2.md`.
|
|||||||
- **Episode-pipelining** in `transcribe_worker` — download/chunk the next episode while transcribing the
|
- **Episode-pipelining** in `transcribe_worker` — download/chunk the next episode while transcribing the
|
||||||
current one, to close the inter-episode GPU idle gap (the per-chunk 2-in-flight path is already done).
|
current one, to close the inter-episode GPU idle gap (the per-chunk 2-in-flight path is already done).
|
||||||
- **Corpus-management UI** — add to the corpus over time and see the full corpus selection.
|
- **Corpus-management UI** — add to the corpus over time and see the full corpus selection.
|
||||||
|
- **Expose pipeline tunables in the UI (with the UI topic).** Extraction chunk size + per-doc chunk cap,
|
||||||
|
audio chunk length, audio concurrency, etc. are currently hardcoded defaults (now also CLI flags on
|
||||||
|
`run-extract`: `--chunk-chars`, `--max-chunks`). Surface them in the UI so they're visible/adjustable,
|
||||||
|
not black-box assumptions we forget about. Tie to the corpus-management UI work.
|
||||||
|
- **Daily activity digest email.** A `daily-digest` CLI command rendering a "last 24h" report —
|
||||||
|
corpus throughput (`documents.ingested_at`/`processed_at`, `claims.extracted_at` by kind/cluster),
|
||||||
|
queue health (`backfill.queue.stats()` — surface **failed/stuck** jobs), Qdrant index lag, infra
|
||||||
|
(`spark-status`), and a **key-findings** section (new `ledger` rows by `date_logged`; `candidate_scores`
|
||||||
|
that `cleared_evidence_bar`). All timestamps already default to `datetime('now')`, so the window is a
|
||||||
|
one-liner; the activity half is buildable today. Deliver via **SMTP** (stdlib `smtplib`+`email`, no new
|
||||||
|
dep, configurable per service — `SMTP_HOST/PORT/USER/PASS`, `DIGEST_TO`); ship a `--stdout` dry-run
|
||||||
|
mode; schedule via launchd on the Mac. Two dependencies: (1) the findings section is only real once the
|
||||||
|
**Job A discovery scorers** run on a schedule — until then it's stubbed/echoes manual adversarial runs;
|
||||||
|
(2) sovereignty (guardrail #7) — SMTP through your own/ten31 server keeps it inside the boundary; do NOT
|
||||||
|
route through a third-party email API if findings ever carry Battery/Strike or positioning substance.
|
||||||
- **Forward live operation** — the only real test: scoring un-pre-selected signals as they arrive, with
|
- **Forward live operation** — the only real test: scoring un-pre-selected signals as they arrive, with
|
||||||
the dual-evaluation ledger as arbiter.
|
the dual-evaluation ledger as arbiter.
|
||||||
|
|
||||||
|
|||||||
@@ -70,6 +70,11 @@ Pre-registered failed convictions used to test the engine against its target fai
|
|||||||
stays quiet in `live` (own_network dropped) while it would fire in `test`** — the engine refusing the
|
stays quiet in `live` (own_network dropped) while it would fire in `test`** — the engine refusing the
|
||||||
intra-cluster echo. Run `two-sided --conviction STRIKE2022 --modes live,test`. The REALIZED-ONLY rule
|
intra-cluster echo. Run `two-sided --conviction STRIKE2022 --modes live,test`. The REALIZED-ONLY rule
|
||||||
is load-bearing here (speculative "Lightning will revolutionize payments" is `predictive`, not signal).
|
is load-bearing here (speculative "Lightning will revolutionize payments" is `predictive`, not signal).
|
||||||
|
**Reading the output:** a single capped bitcoin cluster nets `eisc≈0.25` — already sub-bar vs
|
||||||
|
`EISC_FLOOR=2.0`, so a `+0.25` "quiet in live" can be the *cluster cap* refusing the false positive,
|
||||||
|
NOT the own_network drop. Check `own_net`: if it's 0, live==test and the reflexivity mechanism is
|
||||||
|
unexercised (the affirmers are independent), so a quiet `live` does not by itself prove the echo-drop —
|
||||||
|
you need own_network affirms present (`own_net>0`) for `test` to fire above `live`.
|
||||||
|
|
||||||
**Standing rule S1:** derivatives resolve on OUTCOME (scaled substance), never milestones or enablers.
|
**Standing rule S1:** derivatives resolve on OUTCOME (scaled substance), never milestones or enablers.
|
||||||
An announced program / a regulatory unblock / a single bank's toe-in is CONTEXT, not corroboration.
|
An announced program / a regulatory unblock / a single bank's toe-in is CONTEXT, not corroboration.
|
||||||
|
|||||||
@@ -254,7 +254,8 @@ def cmd_run_extract(args: argparse.Namespace) -> int:
|
|||||||
conn = db.connect(cfg.db_path)
|
conn = db.connect(cfg.db_path)
|
||||||
db.init_db(conn)
|
db.init_db(conn)
|
||||||
sc = from_config(cfg)
|
sc = from_config(cfg)
|
||||||
result = run_extract(conn, sc, cfg, limit=args.limit, max_chunks_per_doc=args.max_chunks)
|
result = run_extract(conn, sc, cfg, limit=args.limit, max_chunks_per_doc=args.max_chunks,
|
||||||
|
chunk_chars=args.chunk_chars)
|
||||||
print(f"extraction: {result['jobs_processed']} jobs, {result['claims_written']} claims written")
|
print(f"extraction: {result['jobs_processed']} jobs, {result['claims_written']} claims written")
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
@@ -581,7 +582,10 @@ def build_parser() -> argparse.ArgumentParser:
|
|||||||
|
|
||||||
re = sub.add_parser("run-extract", help="Drain 'extract' jobs → claims via the local LLM (§4.2)")
|
re = sub.add_parser("run-extract", help="Drain 'extract' jobs → claims via the local LLM (§4.2)")
|
||||||
re.add_argument("--limit", type=int, default=5, help="max jobs to process this run")
|
re.add_argument("--limit", type=int, default=5, help="max jobs to process this run")
|
||||||
re.add_argument("--max-chunks", type=int, default=4, help="max chunks per document")
|
re.add_argument("--max-chunks", type=int, default=999,
|
||||||
|
help="max chunks per document (default: full coverage (999))")
|
||||||
|
re.add_argument("--chunk-chars", type=int, default=12_000,
|
||||||
|
help="chars per extraction chunk; smaller = better recall, more LLM calls")
|
||||||
re.set_defaults(func=cmd_run_extract)
|
re.set_defaults(func=cmd_run_extract)
|
||||||
|
|
||||||
sub.add_parser("queue-status", help="Backfill queue counts by type/state").set_defaults(func=cmd_queue_status)
|
sub.add_parser("queue-status", help="Backfill queue counts by type/state").set_defaults(func=cmd_queue_status)
|
||||||
|
|||||||
@@ -10,6 +10,7 @@ A backend exposes: complete_json(messages, max_tokens) -> str (a JSON object st
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
|
import time
|
||||||
|
|
||||||
log = logging.getLogger(__name__)
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -32,27 +33,43 @@ class GeminiBackend:
|
|||||||
API is the eventual scale path; this synchronous form is the drop-in fallback."""
|
API is the eventual scale path; this synchronous form is the drop-in fallback."""
|
||||||
name = "gemini"
|
name = "gemini"
|
||||||
|
|
||||||
def __init__(self, api_key: str, model: str = "gemini-2.5-flash") -> None:
|
def __init__(self, api_key: str, model: str = "gemini-2.5-flash", *,
|
||||||
|
timeout_s: float = 120.0, retries: int = 4) -> None:
|
||||||
from google import genai # guarded import; pip install google-genai
|
from google import genai # guarded import; pip install google-genai
|
||||||
|
from google.genai import types
|
||||||
self._genai = genai
|
self._genai = genai
|
||||||
self.client = genai.Client(api_key=api_key)
|
self._types = types
|
||||||
|
# http_options.timeout is in MILLISECONDS — without it a stalled request hangs the (single-
|
||||||
|
# threaded) worker forever; one such hang froze a whole batch for ~50 min before this fix.
|
||||||
|
self.client = genai.Client(api_key=api_key,
|
||||||
|
http_options=types.HttpOptions(timeout=int(timeout_s * 1000)))
|
||||||
self.model = model
|
self.model = model
|
||||||
|
self.retries = retries
|
||||||
|
|
||||||
def complete_json(self, messages: list[dict], *, max_tokens: int = 4000) -> str:
|
def complete_json(self, messages: list[dict], *, max_tokens: int = 4000) -> str:
|
||||||
from google.genai import types
|
types = self._types
|
||||||
system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
|
system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
|
||||||
user = "\n\n".join(m["content"] for m in messages if m["role"] != "system")
|
user = "\n\n".join(m["content"] for m in messages if m["role"] != "system")
|
||||||
resp = self.client.models.generate_content(
|
cfg = types.GenerateContentConfig(
|
||||||
model=self.model,
|
|
||||||
contents=user,
|
|
||||||
config=types.GenerateContentConfig(
|
|
||||||
system_instruction=system or None,
|
system_instruction=system or None,
|
||||||
temperature=0,
|
temperature=0,
|
||||||
max_output_tokens=max_tokens,
|
max_output_tokens=max_tokens,
|
||||||
response_mime_type="application/json",
|
response_mime_type="application/json",
|
||||||
),
|
# Gemini 2.5 thinks by default and spends the output budget on reasoning tokens —
|
||||||
|
# it hit MAX_TOKENS with ~3.8k thoughts and a truncated JSON body (0 claims parsed).
|
||||||
|
# Extraction is deterministic, no-CoT (mirrors the local path's enable_thinking=False).
|
||||||
|
thinking_config=types.ThinkingConfig(thinking_budget=0),
|
||||||
)
|
)
|
||||||
|
for attempt in range(self.retries + 1):
|
||||||
|
try:
|
||||||
|
resp = self.client.models.generate_content(model=self.model, contents=user, config=cfg)
|
||||||
return resp.text or "{}"
|
return resp.text or "{}"
|
||||||
|
except Exception as e: # noqa: BLE001 — timeout/5xx/429/network: back off and retry
|
||||||
|
if attempt >= self.retries:
|
||||||
|
raise
|
||||||
|
sleep = 2.0 * (2 ** attempt)
|
||||||
|
log.warning("Gemini call failed (%s); retry %d/%d in %.0fs", e, attempt + 1, self.retries, sleep)
|
||||||
|
time.sleep(sleep)
|
||||||
|
|
||||||
|
|
||||||
def from_config(cfg, sc) -> "LocalQwenBackend | GeminiBackend":
|
def from_config(cfg, sc) -> "LocalQwenBackend | GeminiBackend":
|
||||||
|
|||||||
@@ -30,25 +30,51 @@ def register_seed_topics(conn: sqlite3.Connection) -> None:
|
|||||||
conn.commit()
|
conn.commit()
|
||||||
|
|
||||||
|
|
||||||
|
# Coarse→fine split boundaries. Transcripts arrive as `Speaker: turn` lines joined by a SINGLE
|
||||||
|
# newline (ASR output has no blank-line paragraphs), filings as paragraph text — so splitting on
|
||||||
|
# "\n\n" alone never fires on a transcript and the whole episode would go in one call. "" is the
|
||||||
|
# per-character hard cap that guarantees termination regardless of punctuation.
|
||||||
|
_SEPARATORS = ["\n\n", "\n", ". ", " ", ""]
|
||||||
|
|
||||||
|
|
||||||
def chunk_text(text: str, max_chars: int) -> list[str]:
|
def chunk_text(text: str, max_chars: int) -> list[str]:
|
||||||
"""Split on paragraph boundaries into windows that fit the model context alongside the prompt."""
|
"""Pack text into windows that each fit the model context alongside the prompt.
|
||||||
|
|
||||||
|
Falls through paragraph → line → sentence → word → hard char-slice, so NO chunk ever exceeds
|
||||||
|
max_chars however the source is punctuated, while keeping speaker turns intact when they fit.
|
||||||
|
"""
|
||||||
|
if max_chars < 1: # else _pack recurses past the last separator → IndexError
|
||||||
|
raise ValueError(f"max_chars must be >= 1, got {max_chars}")
|
||||||
text = text.strip()
|
text = text.strip()
|
||||||
if not text:
|
if not text:
|
||||||
return []
|
return []
|
||||||
|
return _pack(text, max_chars, _SEPARATORS)
|
||||||
|
|
||||||
|
|
||||||
|
def _pack(text: str, max_chars: int, seps: list[str]) -> list[str]:
|
||||||
|
"""Recursively pack `text` on the coarsest separator in `seps` that keeps chunks within
|
||||||
|
max_chars, descending to a finer one only for a part that is itself still too big."""
|
||||||
if len(text) <= max_chars:
|
if len(text) <= max_chars:
|
||||||
return [text]
|
return [text]
|
||||||
chunks: list[str] = []
|
sep, rest = seps[0], seps[1:]
|
||||||
cur: list[str] = []
|
parts = list(text) if sep == "" else text.split(sep)
|
||||||
size = 0
|
out: list[str] = []
|
||||||
for para in text.split("\n\n"):
|
cur = ""
|
||||||
if size + len(para) > max_chars and cur:
|
for p in parts:
|
||||||
chunks.append("\n\n".join(cur))
|
candidate = p if not cur else cur + sep + p
|
||||||
cur, size = [], 0
|
if len(candidate) <= max_chars:
|
||||||
cur.append(para)
|
cur = candidate
|
||||||
size += len(para) + 2
|
continue
|
||||||
if cur:
|
if cur:
|
||||||
chunks.append("\n\n".join(cur))
|
out.append(cur)
|
||||||
return chunks
|
if len(p) <= max_chars:
|
||||||
|
cur = p
|
||||||
|
else: # a single part still too big → split it on the next-finer boundary
|
||||||
|
out.extend(_pack(p, max_chars, rest))
|
||||||
|
cur = ""
|
||||||
|
if cur:
|
||||||
|
out.append(cur)
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
def _parse_claims(content: str) -> list[dict]:
|
def _parse_claims(content: str) -> list[dict]:
|
||||||
|
|||||||
@@ -28,8 +28,8 @@ def _document_text(doc, *, user_agent: str) -> str:
|
|||||||
raise ValueError(f"no text source for {doc['doc_id']} (kind={doc['kind']}, url={doc['url']})")
|
raise ValueError(f"no text source for {doc['doc_id']} (kind={doc['kind']}, url={doc['url']})")
|
||||||
|
|
||||||
|
|
||||||
def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 4,
|
def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 999,
|
||||||
chunk_chars: int = 18_000, lease_seconds: int = 900,
|
chunk_chars: int = 12_000, lease_seconds: int = 900,
|
||||||
worker_id: str = "extract-1") -> dict:
|
worker_id: str = "extract-1") -> dict:
|
||||||
from .backends import from_config as backend_from_config
|
from .backends import from_config as backend_from_config
|
||||||
backend = backend_from_config(cfg, sc)
|
backend = backend_from_config(cfg, sc)
|
||||||
|
|||||||
Reference in New Issue
Block a user