Keysat 6f4698a98c Handoff: durable chunker/Gemini rules; Strike extraction in progress

Record two recurrence-prone gotchas in Key operational rules (transcript chunking must cap every chunk; Gemini backend must disable thinking) and rewrite Current state for the in-progress Gemini extraction batch and the gating Strike test.

2026-06-15 22:47:20 -05:00

Ten31 Signal Engine — AGENTS.md

Inbox check: At session start, if ~/Projects/standards/INBOX.md exists, scan it for items tagged (ten31-signal-engine) and surface them before proposing next steps; triage with /triage.

A recurring pipeline that ingests a growing corpus of audio (podcasts, YouTube) and text (SEC filings, earnings calls, policy/lender/research docs), extracts structured propositions ("claims"), and surfaces signal over time through Ten31's investment thesis as a relevance lens — logging every surfaced signal as a falsifiable prediction scored against reality.

Source of truth (in order): ten31-signal-engine-handoff.md (the spec — wins on any conflict; §refs point into it) › DESIGN_v2.md (the living decision/falsification log — read before changing scoring) › this file's Current state. README.md is the user-facing intro.

The spine — NON-NEGOTIABLE guardrails (never violate)

Nominate-then-judge. Statistics & graph structure NOMINATE candidates; the frontier model only JUDGES / FANS OUT a pre-filtered shortlist. The frontier never nominates from the raw corpus.
Propositions, not vibes. Extract atomic claims; separate topic from stance.
Discount convergence by connectedness. Independence is earned, not counted — the EISC graph (source edges + voiceprints) downweights echo. Bitcoin is one capped cluster: within-cluster agreement can NOT masquerade as independent corroboration; cross-cluster earns the multiplier.
Thesis is a LENS, not a gate on truth. The engine must surface signals against Ten31's thesis, not just for it.
Dual-evaluation ledger from day one — precision AND recall; every signal is a logged prediction.
~95% local compute via Spark Control. Call the gateway's HTTP endpoints; do NOT stand up your own vLLM / Whisper / Qdrant. Gemini is an explicit overflow lever for PUBLIC data only.
Sovereignty boundary (hard). Exposure/positioning/conviction data and the Strike/Battery investment memos NEVER go to the frontier. Route sensitive frontier calls through /scrub → frontier → /rehydrate (scrub identities, not substance). Read the memos LOCALLY only.
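The scrub → frontier → rehydrate round trip in the sovereignty rule can be pictured with a minimal local sketch. This is only an illustration of "scrub identities, not substance": the real path goes through the gateway's /scrub and /rehydrate endpoints via spark/client.py, whose request and response shapes are not described here. The function names, the placeholder format, and the example names below are all assumptions.

```python
def scrub(text, identities):
    """Swap each known identity for a placeholder; return the text and the mapping.

    Hypothetical stand-in for the gateway's /scrub step.
    """
    mapping = {}
    # Longest names first, so "Acme Holdings" is replaced before "Acme".
    for i, name in enumerate(sorted(identities, key=len, reverse=True)):
        if name in text:
            token = f"[ENTITY_{i}]"
            text = text.replace(name, token)
            mapping[token] = name
    return text, mapping


def rehydrate(text, mapping):
    """Restore the original identities in a frontier response.

    Hypothetical stand-in for the gateway's /rehydrate step.
    """
    for token, name in mapping.items():
        text = text.replace(token, name)
    return text


original = "Acme Holdings raised its stake; Acme expects demand to double."
scrubbed, mapping = scrub(original, ["Acme", "Acme Holdings"])
# The substance ("expects demand to double") survives; the names do not.
restored = rehydrate(scrubbed, mapping)
```

The mapping never leaves the local side, which is the point of the pattern. The sketch does not change the harder rule above: the memos and exposure data are not sent to the frontier at all, scrubbed or not.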

Two jobs: A — Discovery (emergent themes via independent cross-cluster convergence scored on acceleration; contrarian stances; their intersection). B — Conviction-action gap (fan held convictions to 2nd/3rd-order derivatives, catch early corroboration — the countermeasure to the 2023 "power is the binding constraint on AI/compute" miss: right on the root, late to the derivatives).

Architecture

signal_engine/ (Python package, run as python -m signal_engine <cmd>):

config.py — env-driven Config (+ .env loader). spark/client.py — the SINGLE gateway chokepoint (no other module knows the gateway URL); scrub/rehydrate live here.
ingest/ — edgar (SEC), earnings (FMP REST), feeds+podcasts (RSS), download, chunker, transcribe_worker (local Parakeet), gemini_transcribe (bulk overflow), docs (HTML/PDF/RSS text fetcher for policy/lender/research), identify, speaker_stitch.
extract/ — claims+worker (proposition extraction), backends (LocalQwen | Gemini), prompt, html_text. embedstore/ — embedder + qdrant_store (hybrid dense+BM25).
signals/ (the scoring brain) — independence (EISC), asof (look-ahead guard), windows, under_acted (Job B), bar (two-tier gate), two_sided (affirms−denies net-corroboration), llm_helpers (derivative_relevance), confusion (precision/recall), external (price/outcome fetcher), ledger_writer (§6.6 prediction ledger), resolver (stub), run.
store/ — db (SQLite + idempotent migrations), schema.sql, seed, sources. backfill/queue.py (the job queue). ui/app.py (FastAPI corpus/eval UI). util.py.
Data lands in data/ (gitignored): signal.db, transcripts/, docs/, audio-cache/.

Flow: seed sources/convictions/fanout → ingest (→ documents + transcribe/extract jobs) → run-transcribe / run-extract drain the queue → claims → embed-claims (Qdrant) → scorers (backtest, two-sided) read the proposition store as-of a date.
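As a rough illustration of what "read the proposition store as-of a date" means for the scorers, the sketch below combines a look-ahead guard with an unweighted affirms-minus-denies count. The Claim shape and both function names are hypothetical, and the independence weighting that the real signals/ modules apply (EISC, the cluster cap) is deliberately left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    # Hypothetical minimal shape; the real proposition store carries more fields.
    published: date
    stance: str  # "affirms" or "denies"


def as_of(claims, cutoff):
    """Look-ahead guard: keep only claims published on or before the cutoff."""
    return [c for c in claims if c.published <= cutoff]


def net_corroboration(claims, cutoff):
    """Unweighted affirms minus denies, computed only from claims visible as-of cutoff."""
    visible = as_of(claims, cutoff)
    affirms = sum(c.stance == "affirms" for c in visible)
    denies = sum(c.stance == "denies" for c in visible)
    return affirms - denies


claims = [
    Claim(date(2023, 1, 10), "affirms"),
    Claim(date(2023, 2, 1), "denies"),
    Claim(date(2023, 3, 5), "affirms"),
    Claim(date(2023, 6, 1), "affirms"),  # invisible to a March backtest
]
march_score = net_corroboration(claims, date(2023, 3, 31))
```

Applying the cutoff before any counting is what keeps a backtest from seeing evidence that did not exist yet on the scoring date.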

Build / run

Setup: virtualenv at .venv (Python 3.14). .venv/bin/pip install -r requirements.txt.
Invoke: .venv/bin/python -m signal_engine <cmd>. --help is authoritative; the rest is a map: init-db; seeding seed-sources/seed-convictions/seed-fanout/seed-edges/load-feeds; ingest ingest-edgar/ingest-earnings/ingest-podcast/ingest-doc/ingest-doc-manifest/ingest-feed-text; queue drain run-transcribe/run-transcribe-gemini/run-extract; index embed-claims/search; score backtest/two-sided/confusion-matrix; inspect queue-status/spark-status/feed-peek/provenance/db-tables; serve (UI).
DB: python -m signal_engine init-db (idempotent — re-creates schema + runs additive migrations).
Tests: ⚠️ no automated test suite yet (no tests/, no pytest). Verification is by running commands against the live gateway. Adding a test harness is on the ROADMAP.
Lint/format: none configured. Match the surrounding style (dense, §-referenced docstrings).
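For readers unfamiliar with the "idempotent init + additive migrations" pattern that init-db is described as following, here is a minimal sketch using only the standard library. The table, the column, and the ADDITIVE_COLUMNS list are hypothetical and are not the contents of store/schema.sql or store/db.py.

```python
import sqlite3

# Hypothetical schema and migration list, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id  INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE
);
"""
ADDITIVE_COLUMNS = [("documents", "published_at", "TEXT")]


def init_db(conn):
    """Create missing tables, then add any missing columns. Safe to re-run."""
    conn.executescript(SCHEMA)
    for table, column, decl in ADDITIVE_COLUMNS:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        if column not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()


conn = sqlite3.connect(":memory:")
init_db(conn)
conn.execute("INSERT INTO documents (url) VALUES ('https://example.com/a')")
init_db(conn)  # second run changes nothing and keeps existing rows
```

Because every step either uses IF NOT EXISTS or checks before altering, running the command twice is a no-op rather than an error.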

Spark Control infra (`SPARK_CONTROL_URL`, self-signed TLS → `SPARK_VERIFY_TLS=false`)

One gateway fronts two DGX Sparks: vLLM RedHatAI/Qwen3.6-35B-A3B-NVFP4 on :103; Parakeet ASR + diarizer, bge-m3 embeddings, Qdrant on :87. The gateway is the only URL anything calls.

AUDIO concurrency (learned 2026-06-09): single serial GPU shared with the operator's production meeting app. Cap 2 in-flight (ceiling 3), GLOBAL across both audio endpoints — a process-wide BoundedSemaphore (AUDIO_CONCURRENCY env, default 2). Going wider buys zero throughput. Transient 1–4s "busy blips" (broken-pipe/503/timeout) are NOT failures → short retry-backoff. The transcribe_worker runs a 2-wide chunk pool; the old size-1 lock was ~2.5× slower.

Key operational rules (learned this build — easy to get wrong)

own_network quarantine is MATERIALITY-driven, not "any investment." Quarantine (drop in live scoring, keep in test) only for MATERIAL ties where the source is part of Ten31's voice: the partners' own shows (TFTC, Citadel Dispatch, Rabbit Hole Recap), the Battery partnership, material portfolio leads. Immaterial passive stakes → INDEPENDENT (River and Swan/Cafe Bitcoin were corrected to independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call).
Gemini quota is a rolling ~24h window (~291 hour-long episodes / ~51M tokens), not a calendar-day reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window.
Transcript chunking is recall-first and MUST cap every chunk. ASR transcripts have NO blank-line paragraphs (speaker turns joined by a single \n), so extract.claims.chunk_text falls through \n\n→\n→sentence→word→hard-slice; splitting only on \n\n (the old bug) sent whole 2–3 h episodes in ONE call → context-overflow 400s. Extraction defaults to full coverage at 12K chars/chunk (run-extract --chunk-chars/--max-chunks); bigger chunks risk lost-in-the-middle recall loss.
Gemini extraction backend disables thinking. gemini-2.5-flash thinks by default and burns the output-token budget on reasoning → MAX_TOKENS → truncated JSON → 0 claims; the backend sets thinking_budget=0 (mirrors the local path's enable_thinking=False). Gemini = overflow for PUBLIC data only; keep EXTRACTION_BACKEND=local in .env, flip it inline per-run when overflowing.
Scoring-brain internals are scoped to a guide. Before editing signal_engine/signals/, read docs/guides/scoring-brain.md — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the Battery/Strike adversarial-test PASS criteria. Don't regress those invariants (they're what make Battery pass). Full decision log: DESIGN_v2.md.
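To make the chunking rule concrete, here is a minimal sketch of a splitter that falls through \n\n → \n → sentence → word → hard-slice and guarantees that no chunk exceeds the cap. It is not the code in extract.claims.chunk_text; the function names and the greedy packing strategy are assumptions, and only the fallback order and the 12K default come from the rule above.

```python
import re

# Separators tried in order; the final fallback is a hard character slice.
_SPLITTERS = (
    lambda s: s.split("\n\n"),                 # paragraphs
    lambda s: s.split("\n"),                   # speaker turns / lines
    lambda s: re.split(r"(?<=[.!?])\s+", s),   # sentences
    lambda s: s.split(" "),                    # words
)


def _pieces(text, max_chars, level=0):
    """Yield pieces of text, each guaranteed to be at most max_chars long."""
    if len(text) <= max_chars:
        if text.strip():
            yield text
        return
    if level == len(_SPLITTERS):
        # Nothing left to split on: hard-slice so the cap always holds.
        for i in range(0, len(text), max_chars):
            yield text[i:i + max_chars]
        return
    for part in _SPLITTERS[level](text):
        yield from _pieces(part, max_chars, level + 1)


def chunk_text_sketch(text, max_chars=12_000):
    """Greedily pack capped pieces into chunks that never exceed max_chars."""
    chunks, current = [], ""
    for piece in _pieces(text, max_chars):
        # Pieces are rejoined with a single newline; original spacing is not preserved.
        candidate = f"{current}\n{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

An ASR-style transcript with no blank lines exercises the second level of the cascade, which is exactly the case the old \n\n-only splitter missed.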

Secrets / env

Real values live in .env (gitignored). .env.example lists the names. Keys used: SPARK_CONTROL_URL, SPARK_VERIFY_TLS, LOCAL_LLM_MODEL, EMBED_MODEL, TRANSCRIBE_MODEL, AUDIO_CONCURRENCY, EXTRACTION_BACKEND, GEMINI_API_KEY, GEMINI_MODEL, ANTHROPIC_API_KEY, FMP_API_KEY, EDGAR_USER_AGENT, DATA_DIR, UI_PORT, LOG_LEVEL. Never commit key values; the private LAN gateway IP appears only as an env-var default.
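A minimal sketch of an env-driven config reader over a few of the keys listed above. ConfigSketch and load_config are hypothetical names, not the contents of config.py. The fallbacks for AUDIO_CONCURRENCY (2) and EXTRACTION_BACKEND (local) follow this file; the SPARK_VERIFY_TLS fallback and the choice to require SPARK_CONTROL_URL are assumptions of the sketch (the real module is described as carrying a default for the gateway URL).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigSketch:
    spark_control_url: str
    spark_verify_tls: bool
    audio_concurrency: int
    extraction_backend: str


def _as_bool(value):
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_config(env=None):
    """Build a config from a mapping of environment variables (os.environ by default)."""
    env = os.environ if env is None else env
    backend = env.get("EXTRACTION_BACKEND", "local")
    if backend not in {"local", "gemini"}:
        raise ValueError(f"unknown EXTRACTION_BACKEND: {backend!r}")
    return ConfigSketch(
        spark_control_url=env["SPARK_CONTROL_URL"],  # required in this sketch
        spark_verify_tls=_as_bool(env.get("SPARK_VERIFY_TLS", "true")),  # assumed fallback
        audio_concurrency=int(env.get("AUDIO_CONCURRENCY", "2")),
        extraction_backend=backend,
    )


cfg = load_config({
    "SPARK_CONTROL_URL": "https://gateway.example",  # placeholder, not a real address
    "SPARK_VERIFY_TLS": "false",
})
```

Passing the mapping in explicitly keeps the reader easy to exercise without touching the real environment or any key values.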

Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`)

Strike adversarial test: extraction running — this is the gating step. Root-caused & FIXED the long-form 400s (the old \n\n-only chunker sent whole 2–3 h episodes in one call → context overflow). Now recall-first full coverage (12K chars). Draining the ~700-doc / ~5.7k-chunk extract backlog through the Gemini backend (one-time PUBLIC overflow, EXTRACTION_BACKEND=gemini inline) to finish faster (~6–7 h serial); validated live — dense (~7.5 claims/chunk), zero failures; 27 prior 400-failures requeued. NEXT when it finishes: embed-claims → two-sided --conviction STRIKE2022 --modes live,test (PASS = quiet in live, fires in test).
Battery adversarial test: PASSES (unchanged — demand-net rises, supply stays flat at 0.0). §7.1 power-infra backtest: qualified YES (corpus-gated; caveats in DESIGN_v2.md).
2 commits made, UNPUSHED — push to main was blocked by the permission classifier (enforcing the old no-push-to-main rule); awaiting approval (git push origin main). Commits: chunker fix + recall-first defaults; Gemini thinking-budget fix.
Open decisions for Grant: (a) push the 2 commits; (b) speed-up approach — recommended real-time concurrency over the async Batch API (serial Gemini runs as the fallback meanwhile).
Corpus spans bitcoin podcasts, SEC/FMP filings (+banks cluster), the Battery corpus, River research; EISC edges seeded for the bitcoin cluster.
