Files
ten31-signal-engine/AGENTS.md
T
Keysat 6f4698a98c Handoff: durable chunker/Gemini rules; Strike extraction in progress
Record two recurrence-prone gotchas in Key operational rules (transcript chunking must cap every chunk; Gemini backend must disable thinking) and rewrite Current state for the in-progress Gemini extraction batch and the gating Strike test.
2026-06-15 22:47:20 -05:00

133 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Ten31 Signal Engine — AGENTS.md
> **Inbox check:** At session start, if `~/Projects/standards/INBOX.md` exists, scan it for items
> tagged `(ten31-signal-engine)` and surface them before proposing next steps; triage with `/triage`.
A recurring pipeline that ingests a growing corpus of **audio** (podcasts, YouTube) and **text** (SEC
filings, earnings calls, policy/lender/research docs), extracts structured **propositions ("claims")**,
and surfaces **signal over time** through Ten31's investment thesis as a *relevance lens* — logging every
surfaced signal as a **falsifiable prediction** scored against reality.
**Source of truth (in order):** `ten31-signal-engine-handoff.md` (the spec — wins on any conflict; §refs
point into it) `DESIGN_v2.md` (the living decision/falsification log — read before changing scoring)
this file's **Current state**. `README.md` is the user-facing intro.
## The spine — NON-NEGOTIABLE guardrails (never violate)
1. **Nominate-then-judge.** Statistics & graph structure NOMINATE candidates; the frontier model only
JUDGES / FANS OUT a pre-filtered shortlist. The frontier never nominates from the raw corpus.
2. **Propositions, not vibes.** Extract atomic claims; separate **topic** from **stance**.
3. **Discount convergence by connectedness.** Independence is earned, not counted — the EISC graph
(source edges + voiceprints) downweights echo. Bitcoin is one *capped* cluster: within-cluster
agreement can NOT masquerade as independent corroboration; cross-cluster earns the multiplier.
4. **Thesis is a LENS, not a gate on truth.** The engine must surface signals *against* Ten31's thesis,
not just for it.
5. **Dual-evaluation ledger from day one** — precision AND recall; every signal is a logged prediction.
6. **~95% local compute via Spark Control.** Call the gateway's HTTP endpoints; do NOT stand up your own
vLLM / Whisper / Qdrant. Gemini is an explicit *overflow* lever for PUBLIC data only.
7. **Sovereignty boundary (hard).** Exposure/positioning/conviction data and the Strike/Battery
investment memos NEVER go to the frontier. Route sensitive frontier calls through
`/scrub → frontier → /rehydrate` (scrub identities, not substance). Read the memos LOCALLY only.
**Two jobs:** **A — Discovery** (emergent themes via independent cross-cluster *convergence* scored on
acceleration; contrarian stances; their intersection). **B — Conviction-action gap** (fan held
convictions to 2nd/3rd-order derivatives, catch early corroboration — the countermeasure to the 2023
"power is the binding constraint on AI/compute" miss: right on the root, late to the derivatives).
## Architecture
`signal_engine/` (Python package, run as `python -m signal_engine <cmd>`):
- `config.py` — env-driven `Config` (+ `.env` loader). `spark/client.py` — the SINGLE gateway chokepoint
(no other module knows the gateway URL); scrub/rehydrate live here.
- `ingest/``edgar` (SEC), `earnings` (FMP REST), `feeds`+`podcasts` (RSS), `download`, `chunker`,
`transcribe_worker` (local Parakeet), `gemini_transcribe` (bulk overflow), `docs` (HTML/PDF/RSS text
fetcher for policy/lender/research), `identify`, `speaker_stitch`.
- `extract/``claims`+`worker` (proposition extraction), `backends` (LocalQwen | Gemini), `prompt`,
`html_text`. `embedstore/``embedder` + `qdrant_store` (hybrid dense+BM25).
- `signals/` (the scoring brain) — `independence` (EISC), `asof` (look-ahead guard), `windows`,
`under_acted` (Job B), `bar` (two-tier gate), `two_sided` (affirmsdenies net-corroboration),
`llm_helpers` (`derivative_relevance`), `confusion` (precision/recall), `external` (price/outcome
fetcher), `ledger_writer` (§6.6 prediction ledger), `resolver` (stub), `run`.
- `store/``db` (SQLite + idempotent migrations), `schema.sql`, `seed`, `sources`. `backfill/queue.py`
(the job queue). `ui/app.py` (FastAPI corpus/eval UI). `util.py`.
- Data lands in `data/` (gitignored): `signal.db`, `transcripts/`, `docs/`, `audio-cache/`.
**Flow:** seed sources/convictions/fanout → ingest (→ `documents` + `transcribe`/`extract` jobs) →
`run-transcribe` / `run-extract` drain the queue → `claims``embed-claims` (Qdrant) → scorers
(`backtest`, `two-sided`) read the proposition store as-of a date.
## Build / run
- **Setup:** virtualenv at `.venv` (Python 3.14). `.venv/bin/pip install -r requirements.txt`.
- **Invoke:** `.venv/bin/python -m signal_engine <cmd>`. **`--help` is authoritative**; the rest is a
map: `init-db`; seeding `seed-sources`/`seed-convictions`/`seed-fanout`/`seed-edges`/`load-feeds`;
ingest `ingest-edgar`/`ingest-earnings`/`ingest-podcast`/`ingest-doc`/`ingest-doc-manifest`/
`ingest-feed-text`; queue drain `run-transcribe`/`run-transcribe-gemini`/`run-extract`; index
`embed-claims`/`search`; score `backtest`/`two-sided`/`confusion-matrix`; inspect `queue-status`/
`spark-status`/`feed-peek`/`provenance`/`db-tables`; `serve` (UI).
- **DB:** `python -m signal_engine init-db` (idempotent — re-creates schema + runs additive migrations).
- **Tests:** ⚠️ **no automated test suite yet** (no `tests/`, no pytest). Verification is by running
commands against the live gateway. Adding a test harness is on the ROADMAP.
- **Lint/format:** none configured. Match the surrounding style (dense, §-referenced docstrings).
## Spark Control infra (`SPARK_CONTROL_URL`, self-signed TLS → `SPARK_VERIFY_TLS=false`)
One gateway fronts two DGX Sparks: **vLLM** `RedHatAI/Qwen3.6-35B-A3B-NVFP4` on `:103`; **Parakeet**
ASR + diarizer, **bge-m3** embeddings, **Qdrant** on `:87`. The gateway is the only URL anything calls.
- **AUDIO concurrency (learned 2026-06-09):** single serial GPU shared with the operator's production
meeting app. Cap **2 in-flight (ceiling 3), GLOBAL across both audio endpoints** — a process-wide
`BoundedSemaphore` (`AUDIO_CONCURRENCY` env, default 2). Going wider buys zero throughput. Transient
14s "busy blips" (broken-pipe/503/timeout) are NOT failures → short retry-backoff. The
`transcribe_worker` runs a 2-wide chunk pool; the old size-1 lock was ~2.5× slower.
## Key operational rules (learned this build — easy to get wrong)
- **`own_network` quarantine is MATERIALITY-driven, not "any investment."** Quarantine (drop in live
scoring, keep in test) only for MATERIAL ties where the source is part of Ten31's voice: the partners'
own shows (TFTC, Citadel Dispatch, Rabbit Hole Recap), the Battery *partnership*, material portfolio
leads. **Immaterial passive stakes → INDEPENDENT** (River and Swan/Cafe Bitcoin were corrected to
independent). Unconfirmed: Unchained, Debifi, Coinkite (held quarantined pending Grant's materiality call).
- **Gemini quota is a rolling ~24h window** (~291 hour-long episodes / ~51M tokens), not a calendar-day
reset. Bulk transcription overflows there; expect 429 RESOURCE_EXHAUSTED past the window.
- **Transcript chunking is recall-first and MUST cap every chunk.** ASR transcripts have NO blank-line
paragraphs (speaker turns joined by a single `\n`), so `extract.claims.chunk_text` falls through
`\n\n``\n`→sentence→word→hard-slice; splitting only on `\n\n` (the old bug) sent whole 23 h episodes
in ONE call → context-overflow 400s. Extraction defaults to full coverage at 12K chars/chunk
(`run-extract --chunk-chars/--max-chunks`); bigger chunks risk lost-in-the-middle recall loss.
- **Gemini extraction backend disables thinking.** `gemini-2.5-flash` thinks by default and burns the
output-token budget on reasoning → MAX_TOKENS → truncated JSON → 0 claims; the backend sets
`thinking_budget=0` (mirrors the local path's `enable_thinking=False`). Gemini = overflow for PUBLIC
data only; keep `EXTRACTION_BACKEND=local` in `.env`, flip it inline per-run when overflowing.
- **Scoring-brain internals are scoped to a guide.** Before editing `signal_engine/signals/`, read
**`docs/guides/scoring-brain.md`** — the classifier invariants (REALIZED-ONLY, ROLE-MATCH, claim_type
hard-evidence guard, max_tokens budget, claim_id bracket-strip), the EISC cluster-cap, and the
Battery/Strike adversarial-test PASS criteria. Don't regress those invariants (they're what make
Battery pass). Full decision log: `DESIGN_v2.md`.
## Secrets / env
Real values live in **`.env`** (gitignored). `.env.example` lists the names. Keys used: `SPARK_CONTROL_URL`,
`SPARK_VERIFY_TLS`, `LOCAL_LLM_MODEL`, `EMBED_MODEL`, `TRANSCRIBE_MODEL`, `AUDIO_CONCURRENCY`,
`EXTRACTION_BACKEND`, `GEMINI_API_KEY`, `GEMINI_MODEL`, `ANTHROPIC_API_KEY`, `FMP_API_KEY`,
`EDGAR_USER_AGENT`, `DATA_DIR`, `UI_PORT`, `LOG_LEVEL`. Never commit key values; the private LAN gateway
IP appears only as an env-var default.
## Current state (snapshot — overwrite each session; longer-term backlog → `ROADMAP.md`)
- **Strike adversarial test: extraction running — this is the gating step.** Root-caused & FIXED the
long-form 400s (the old `\n\n`-only chunker sent whole 23 h episodes in one call → context overflow).
Now recall-first full coverage (12K chars). Draining the ~700-doc / ~5.7k-chunk extract backlog through
the **Gemini backend** (one-time PUBLIC overflow, `EXTRACTION_BACKEND=gemini` inline) to finish faster
(~67 h serial); validated live — dense (~7.5 claims/chunk), zero failures; 27 prior 400-failures
requeued. **NEXT when it finishes:** `embed-claims``two-sided --conviction STRIKE2022 --modes
live,test` (PASS = quiet in live, fires in test).
- **Battery adversarial test: PASSES** (unchanged — demand-net rises, supply stays flat at 0.0).
**§7.1 power-infra backtest:** qualified YES (corpus-gated; caveats in `DESIGN_v2.md`).
- **2 commits made, UNPUSHED** — push to `main` was blocked by the permission classifier (enforcing the
old no-push-to-main rule); awaiting approval (`git push origin main`). Commits: chunker fix +
recall-first defaults; Gemini thinking-budget fix.
- **Open decisions for Grant:** (a) push the 2 commits; (b) speed-up approach — recommended real-time
concurrency over the async Batch API (serial Gemini runs as the fallback meanwhile).
- Corpus spans bitcoin podcasts, SEC/FMP filings (+`banks` cluster), the Battery corpus, River research;
EISC edges seeded for the bitcoin cluster.