0b001b49d5
Fold the hang lesson into the Gemini operational rule (disable thinking AND set a timeout) and refresh Current state for the in-progress Gemini extraction (~68 docs left, 52.7k claims) and the gating Strike test.
133 lines
10 KiB
Markdown
133 lines
10 KiB
Markdown
# Ten31 Signal Engine — AGENTS.md
|
||
|
||
> **Inbox check:** At session start, if `~/Projects/standards/INBOX.md` exists, scan it for items
|
||
> tagged `(ten31-signal-engine)` and surface them before proposing next steps; triage with `/triage`.
|
||
|
||
A recurring pipeline that ingests a growing corpus of **audio** (podcasts, YouTube) and **text** (SEC
|
||
filings, earnings calls, policy/lender/research docs), extracts structured **propositions ("claims")**,
|
||
and surfaces **signal over time** through Ten31's investment thesis as a *relevance lens* — logging every
|
||
surfaced signal as a **falsifiable prediction** scored against reality.
|
||
|
||
**Source of truth (in order):** `ten31-signal-engine-handoff.md` (the spec — wins on any conflict; §refs
|
||
point into it) › `DESIGN_v2.md` (the living decision/falsification log — read before changing scoring) ›
|
||
this file's **Current state**. `README.md` is the user-facing intro.
|
||
|
||
## The spine — NON-NEGOTIABLE guardrails (never violate)
|
||
|
||
1. **Nominate-then-judge.** Statistics & graph structure NOMINATE candidates; the frontier model only
|
||
JUDGES / FANS OUT a pre-filtered shortlist. The frontier never nominates from the raw corpus.
|
||
2. **Propositions, not vibes.** Extract atomic claims; separate **topic** from **stance**.
|
||
3. **Discount convergence by connectedness.** Independence is earned, not counted — the EISC graph
|
||
(source edges + voiceprints) downweights echo. Bitcoin is one *capped* cluster: within-cluster
|
||
agreement can NOT masquerade as independent corroboration; cross-cluster earns the multiplier.
|
||
4. **Thesis is a LENS, not a gate on truth.** The engine must surface signals *against* Ten31's thesis,
|
||
not just for it.
|
||
5. **Dual-evaluation ledger from day one** — precision AND recall; every signal is a logged prediction.
|
||
6. **~95% local compute via Spark Control.** Call the gateway's HTTP endpoints; do NOT stand up your own
|
||
vLLM / Whisper / Qdrant. Gemini is an explicit *overflow* lever for PUBLIC data only.
|
||
7. **Sovereignty boundary (hard).** Exposure/positioning/conviction data and the Strike/Battery
|
||
investment memos NEVER go to the frontier. Route sensitive frontier calls through
|
||
`/scrub → frontier → /rehydrate` (scrub identities, not substance). Read the memos LOCALLY only.
|
||
|
||
**Two jobs:** **A — Discovery** (emergent themes via independent cross-cluster *convergence* scored on
|
||
acceleration; contrarian stances; their intersection). **B — Conviction-action gap** (fan held
|
||
convictions to 2nd/3rd-order derivatives, catch early corroboration — the countermeasure to the 2023
|
||
"power is the binding constraint on AI/compute" miss: right on the root, late to the derivatives).
|
||
|
||
## Architecture
|
||
|
||
`signal_engine/` (Python package, run as `python -m signal_engine <cmd>`):
|
||
- `config.py` — env-driven `Config` (+ `.env` loader). `spark/client.py` — the SINGLE gateway chokepoint
|
||
(no other module knows the gateway URL); scrub/rehydrate live here.
|
||
- `ingest/` — `edgar` (SEC), `earnings` (FMP REST), `feeds`+`podcasts` (RSS), `download`, `chunker`,
|
||
`transcribe_worker` (local Parakeet), `gemini_transcribe` (bulk overflow), `docs` (HTML/PDF/RSS text
|
||
fetcher for policy/lender/research), `identify`, `speaker_stitch`.
|
||
- `extract/` — `claims`+`worker` (proposition extraction), `backends` (LocalQwen | Gemini), `prompt`,
|
||
`html_text`. `embedstore/` — `embedder` + `qdrant_store` (hybrid dense+BM25).
|
||
- `signals/` (the scoring brain) — `independence` (EISC), `asof` (look-ahead guard), `windows`,
|
||
`under_acted` (Job B), `bar` (two-tier gate), `two_sided` (affirms−denies net-corroboration),
|
||
`llm_helpers` (`derivative_relevance`), `confusion` (precision/recall), `external` (price/outcome
|
||
fetcher), `ledger_writer` (§6.6 prediction ledger), `resolver` (stub), `run`.
|
||
- `store/` — `db` (SQLite + idempotent migrations), `schema.sql`, `seed`, `sources`. `backfill/queue.py`
|
||
(the job queue). `ui/app.py` (FastAPI corpus/eval UI). `util.py`.
|
||
- Data lands in `data/` (gitignored): `signal.db`, `transcripts/`, `docs/`, `audio-cache/`.
|
||
|
||
**Flow:** seed sources/convictions/fanout → ingest (→ `documents` + `transcribe`/`extract` jobs) →
|
||
`run-transcribe` / `run-extract` drain the queue → `claims` → `embed-claims` (Qdrant) → scorers
|
||
(`backtest`, `two-sided`) read the proposition store as-of a date.
|
||
|
||
## Build / run
|
||
|
||
- **Setup:** virtualenv at `.venv` (Python 3.14). `.venv/bin/pip install -r requirements.txt`.
|
||
- **Invoke:** `.venv/bin/python -m signal_engine <cmd>`. **`--help` is authoritative**; the rest is a
|
||
map: `init-db`; seeding `seed-sources`/`seed-convictions`/`seed-fanout`/`seed-edges`/`load-feeds`;
|
||
ingest `ingest-edgar`/`ingest-earnings`/`ingest-podcast`/`ingest-doc`/`ingest-doc-manifest`/
|
||
`ingest-feed-text`; queue drain `run-transcribe`/`run-transcribe-gemini`/`run-extract`; index
|
||
`embed-claims`/`search`; score `backtest`/`two-sided`/`confusion-matrix`; inspect `queue-status`/
|
||
`spark-status`/`feed-peek`/`provenance`/`db-tables`; `serve` (UI).
|
||
- **DB:** `python -m signal_engine init-db` (idempotent — re-creates schema + runs additive migrations).
|
||
- **Tests:** ⚠️ **no automated test suite yet** (no `tests/`, no pytest). Verification is by running
|
||
commands against the live gateway. Adding a test harness is on the ROADMAP.
|
||
- **Lint/format:** none configured. Match the surrounding style (dense, §-referenced docstrings).
|
||
|
||
## Spark Control infra (`SPARK_CONTROL_URL`, self-signed TLS → `SPARK_VERIFY_TLS=false`)
|
||
|
||
One gateway fronts two DGX Sparks: **vLLM** `RedHatAI/Qwen3.6-35B-A3B-NVFP4` on `:103`; **Parakeet**
|
||
ASR + diarizer, **bge-m3** embeddings, **Qdrant** on `:87`. The gateway is the only URL anything calls.
|
||
- **AUDIO concurrency (learned 2026-06-09):** single serial GPU shared with the operator's production
|
||
meeting app. Cap **2 in-flight (ceiling 3), GLOBAL across both audio endpoints** — a process-wide
|
||
`BoundedSemaphore` (`AUDIO_CONCURRENCY` env, default 2). Going wider buys zero throughput. Transient
|
||
1–4s "busy blips" (broken-pipe/503/timeout) are NOT failures → short retry-backoff. The
|
||
`transcribe_worker` runs a 2-wide chunk pool; the old size-1 lock was ~2.5× slower.
|
||
|
||
## Key operational rules (learned this build — easy to get wrong)
|
||
|
||
- **`own_network` quarantine is MATERIALITY-driven, not "any investment."** Quarantine (drop in live
|
||
scoring, keep in test) only for MATERIAL ties where the source is part of Ten31's voice: the partners'
|
||
own shows (TFTC, Citadel Dispatch, Rabbit Hole Recap), the Battery *partnership*, material portfolio
|
||
leads. **Immaterial passive stakes → INDEPENDENT** (River and Swan/Cafe Bitcoin were corrected to
|
||
independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call).
|
||
- **Gemini quota is a rolling ~24h window** (~291 hour-long episodes / ~51M tokens), not a calendar-day
|
||
reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window.
|
||
- **Transcript chunking is recall-first and MUST cap every chunk.** ASR transcripts have NO blank-line
|
||
paragraphs (speaker turns joined by a single `\n`), so `extract.claims.chunk_text` falls through
|
||
`\n\n`→`\n`→sentence→word→hard-slice; splitting only on `\n\n` (the old bug) sent whole 2–3 h episodes
|
||
in ONE call → context-overflow 400s. Extraction defaults to full coverage at 12K chars/chunk
|
||
(`run-extract --chunk-chars/--max-chunks`); bigger chunks risk lost-in-the-middle recall loss.
|
||
- **Gemini extraction backend: disable thinking AND set a timeout.** `gemini-2.5-flash` thinks by
|
||
default and burns the output-token budget on reasoning → MAX_TOKENS → truncated JSON → 0 claims, so
|
||
the backend sets `thinking_budget=0` (mirrors local `enable_thinking=False`). It also sets an HTTP
|
||
**timeout (120 s) + 4 retries** — a timeout-less call once hung the single-threaded worker ~50 min
|
||
(transient 504/read-timeouts then self-heal). Gemini = overflow for PUBLIC data only; keep
|
||
`EXTRACTION_BACKEND=local` in `.env`, flip it inline per-run when overflowing.
|
||
- **Scoring-brain internals are scoped to a guide.** Before editing `signal_engine/signals/`, read
|
||
**`docs/guides/scoring-brain.md`** — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type
|
||
hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the
|
||
Battery/Strike adversarial-test PASS criteria. Don't regress those invariants (they're what make
|
||
Battery pass). Full decision log: `DESIGN_v2.md`.
|
||
|
||
## Secrets / env
|
||
|
||
Real values live in **`.env`** (gitignored). `.env.example` lists the names. Keys used: `SPARK_CONTROL_URL`,
|
||
`SPARK_VERIFY_TLS`, `LOCAL_LLM_MODEL`, `EMBED_MODEL`, `TRANSCRIBE_MODEL`, `AUDIO_CONCURRENCY`,
|
||
`EXTRACTION_BACKEND`, `GEMINI_API_KEY`, `GEMINI_MODEL`, `ANTHROPIC_API_KEY`, `FMP_API_KEY`,
|
||
`EDGAR_USER_AGENT`, `DATA_DIR`, `UI_PORT`, `LOG_LEVEL`. Never commit key values; the private LAN gateway
|
||
IP appears only as an env-var default.
|
||
|
||
## Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`)
|
||
|
||
- **Strike adversarial test: extraction ~complete — this is the gating step.** Chunker 400 bug FIXED
|
||
(old `\n\n`-only split sent whole 2–3 h episodes in one call → context overflow); extraction is now
|
||
recall-first full coverage (12K chars). Drained the ~700-doc backlog via the **Gemini backend**
|
||
(one-time PUBLIC overflow), hardened with thinking-off + 120 s timeout/4 retries after a timeout-less
|
||
call hung the worker ~50 min. At session end **~68 docs left** (the SEC/FMP filings tail); **52.7k
|
||
claims** in the DB, 0 failures, 3 transient timeouts. **NEXT once extraction finishes:** `embed-claims`
|
||
→ `two-sided --conviction STRIKE2022 --modes live,test` (PASS = quiet in live, fires in test).
|
||
- **If extraction stopped early** (session ended): resume with `EXTRACTION_BACKEND=gemini run-extract
|
||
--limit 800` (drop the env var for the local Qwen path). Pending claims still need `embed-claims` after.
|
||
- **Battery test PASSES; §7.1 power-infra qualified YES** (both unchanged).
|
||
- **3 commits local + `backends.py` uncommitted (Gemini timeout/retry) — ALL UNPUSHED.** Push to `main`
|
||
is blocked by the permission classifier; run `git push origin main` (your updated rule allows main).
|
||
- Corpus: bitcoin podcasts, SEC/FMP filings (+`banks` cluster), Battery corpus, River research; EISC
|
||
edges seeded for the bitcoin cluster.
|