# Ten31 Signal Engine (pilot)
A recurring pipeline that ingests a growing corpus of audio (podcasts, YouTube) and text
(SEC filings, earnings-call transcripts), extracts structured **propositions**, and surfaces
**signal over time**, filtered through Ten31's thesis as a *relevance lens* — with every surfaced
signal logged as a **falsifiable prediction** so the system is scored against reality.

**Spec / source of truth:** [`ten31-signal-engine-handoff.md`](ten31-signal-engine-handoff.md). It wins
over this README on any conflict. Section refs below (§) point into it.

---
## The spine (do not violate — §2, §5)
> **Statistics & graph structure NOMINATE candidates; the frontier model only JUDGES and FANS OUT a
> pre-filtered shortlist. The prediction ledger is the final arbiter.**
- **Job A — Discovery (§1):** surface what Grant doesn't yet see — emergent themes (independent
cross-cluster *convergence*, scored on **acceleration** not size), credible contrarian minority
stances, and the prize: their **intersection** (a consensus about to flip).
- **Job B — Conviction-action gap (§1.1):** fan held convictions to 2nd/3rd-order derivatives and fire
`conviction(high) × exposure(low) × rising independent corroboration`. Countermeasure to the 2023
"power is the binding constraint" call where the seed was held but derivatives under-acted.
- **Nominate-then-judge boundary:** geometry/stats emit a shortlist that already cleared a quantitative
bar; the frontier model judges/synthesizes/fans-out only. A fanned-out derivative is a **hypothesis,
not a signal**, until independent corpus corroboration confirms it (§2.3, §4.6).
- **The lens tags relevance; it must NOT gate truth (§5.7).** The engine must be able to surface an
accelerating signal *against* Ten31's thesis (the B1–B3 breakers). `thesis_seam` is a tag, never a filter.
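
The Job B trigger and the tag-not-filter rule above can be pictured with a minimal sketch. Everything here is illustrative rather than the engine's actual code: the `Candidate` shape, the "high" conviction label, the choice to treat the `none` and `lt2` bands as "low exposure", and the simple last-window-beats-previous test for "rising" are all assumptions. The point it shows is that selection depends only on the quantitative trigger, while `thesis_seam` is attached afterwards and never removes a candidate.

```python
from dataclasses import dataclass, field

# Assumption: the two smallest NAV bands count as "low exposure".
LOW_EXPOSURE_BANDS = {"none", "lt2"}


@dataclass
class Candidate:
    """A fanned-out derivative nominated upstream (hypothetical shape)."""
    name: str
    conviction: str             # assumed scale, e.g. "high" / "medium" / "low"
    exposure_band: str          # none / lt2 / 2to10 / gt10
    corroboration: list[float]  # independent corroboration per window, oldest first
    tags: set[str] = field(default_factory=set)


def is_rising(series: list[float]) -> bool:
    """Simplest possible test: the latest window exceeds the previous one."""
    return len(series) >= 2 and series[-1] > series[-2]


def job_b_fires(c: Candidate) -> bool:
    """conviction(high) x exposure(low) x rising independent corroboration."""
    return (
        c.conviction == "high"
        and c.exposure_band in LOW_EXPOSURE_BANDS
        and is_rising(c.corroboration)
    )


def shortlist(candidates, on_thesis):
    """Select on the trigger only; the lens adds a tag and never drops anything."""
    selected = []
    for c in candidates:
        if not job_b_fires(c):
            continue
        if on_thesis(c):
            c.tags.add("thesis_seam")
        selected.append(c)  # kept whether or not it agrees with the thesis
    return selected
```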
---
## Locked pilot decisions
| Topic | Decision | Ref |
|---|---|---|
| **Packaging** | Ship as a **StartOS 0.4.0 s9pk** (operator call). Code is package-ready; TS wrapper generated once the daemon/UI/dependency contract stabilizes. | §10, §13 |
| **Earnings transcripts** | **FMP Ultimate** transcript API as backbone (audio isn't reliably fetchable: no uniform feed, ~30–90d replay expiry). EDGAR filings are the durable core. Self-transcribed audio deferred. | §4.1, §12 |
| **Conviction exposure** | Coarse **NAV bands**: `none / lt2 / 2to10 / gt10`. Grant sole editor, monthly review. | §3.1 |
| **Topic vocabulary** | **Hybrid**: seeded controlled list + emergent topics batch-merged on a schedule. | §4.2 |
| **Frontier model** | Default `claude-opus-4-8` (override via `FRONTIER_MODEL`). Scrubbed-vs-unscrubbed quality A/B is a pilot deliverable. | §4.6 |
---
## Architecture (modules map 1:1 to spec layers)
```
ingest/ §4.1 RSS/YouTube/EDGAR/FMP fetch + scheduler; long-audio chunking; speaker stitch
extract/ §4.2 local LLM → 0..N structured claim units per chunk (willing to emit ZERO)
embedstore/ §4.3 embed DISTILLED propositions + client-side BM25 → Qdrant hybrid
signals/ §4.4 cluster acceleration · stance distributions · bridge edges · scoring
independence/ §4.5 source graph + voiceprint library → convergence discounting
frontier/ §4.6 judge · synthesis · conviction fan-out (scrub→frontier→rehydrate; SHORTLIST/SEEDS only)
ledger/ §4.7 dual-evaluation ledger + conviction log + earned credibility
spark/ §13 THE single chokepoint for all Spark Control HTTP (no other module knows the URL)
store/ SQLite schema + seed loaders
backfill/ §13.4 client-side GPU-hours queue (extraction = heavier serial load; audio sequential)
```
**Two invariants enforced in exactly one place each:**
- All gateway HTTP funnels through `spark/`: `spark/client.py` holds the base URL, the self-signed TLS
skip, 503 retry/backoff, and a process-wide **audio lock** (sequential audio, §4.1).
- The "model never sees Grant's rating first" rule (§6.7) is **structural**: ratings live in a separate
`human_evaluations` table; the model-facing code reads `ledger`, which has no `grant_rating` column.
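
As a rough illustration of the first invariant, the sketch below shows one way a single client module could combine 503 retry with exponential backoff and a process-wide audio lock. It is not the contents of `spark/client.py`: the function names, the retry limits, and the `send` callable (which stands in for the real HTTP call and returns a status code plus body) are all hypothetical.

```python
import threading
import time
from typing import Callable, Tuple

# Process-wide lock: at most one audio request in flight at any time.
AUDIO_LOCK = threading.Lock()


def call_with_retry(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,          # assumed limit
    base_delay: float = 0.5,        # assumed first backoff, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call send() until it stops returning 503, doubling the wait each time."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            if status >= 400:
                raise RuntimeError(f"gateway returned HTTP {status}")
            return body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gateway still busy after {max_attempts} attempts")


def transcribe(send_audio: Callable[[], Tuple[int, bytes]]) -> bytes:
    """Audio calls take the lock first, so they can never overlap."""
    with AUDIO_LOCK:
        return call_with_retry(send_audio)
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real waiting; a caller could supply `sleeps.append` and inspect the recorded delays.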
### Data stores
- **SQLite** (`store/schema.sql`): sources, documents, claims, topics, source_edges, voiceprints,
conviction_log, fanout_nodes, ledger, human_evaluations, backfill_jobs. The whole system state is a SELECT.
- **Qdrant** (via Spark Control): one `propositions` collection, **hybrid** dense `bge-m3` (1024-d) +
sparse BM25 (`Qdrant/bm25`, `modifier: idf`). Points are distilled propositions, never raw chunks.
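
The structural separation of ratings from the model-facing ledger can be sketched with the standard-library `sqlite3` module. Only the table names `ledger` and `human_evaluations` and the `grant_rating` column come from this README; the other columns and the helper names are placeholders, and the real definitions live in `store/schema.sql`.

```python
import sqlite3

# Hypothetical minimal schema: the rating column exists only in human_evaluations.
SCHEMA = """
CREATE TABLE ledger (
    id          INTEGER PRIMARY KEY,
    prediction  TEXT NOT NULL,
    model_score REAL
);
CREATE TABLE human_evaluations (
    ledger_id    INTEGER NOT NULL REFERENCES ledger(id),
    grant_rating INTEGER NOT NULL
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create a database with the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def ledger_columns(conn: sqlite3.Connection) -> list[str]:
    """Column names of the model-facing table (index 1 of each PRAGMA row)."""
    return [row[1] for row in conn.execute("PRAGMA table_info(ledger)")]


def model_facing_rows(conn: sqlite3.Connection) -> list[tuple]:
    """What model-side code reads: ledger only, so no rating can leak in."""
    return conn.execute(
        "SELECT id, prediction, model_score FROM ledger ORDER BY id"
    ).fetchall()
```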
---
## Backfill queue (§13.4) — measured in GPU-hours, not real-time
Scheduler jobs are **producers** (fetch/dedup/enqueue); a **single worker** drains the GPU queue one job
at a time → no parallel audio → no 503 by construction. Jobs are leased (crash-safe resume), idempotent on
`hash(content + prompt-version)`. Extraction is the binding load. Rough pilot estimate (~300 episodes +
25 companies): transcription ~6 GPU-h, **extraction ~90 GPU-h**, wall-clock ~4–6 days (audio on Spark 2 and
extraction on Spark 1 run in parallel). The queue self-calibrates from measured `gpu_seconds`.
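
A minimal sketch of the queue mechanics described above, assuming a SQLite-backed table: producers enqueue idempotently on a hash of content plus prompt version, and the single worker leases one job at a time so that a crashed job becomes claimable again once its lease expires. The table layout, function names, and the 600-second lease are assumptions for illustration, not the project's actual `backfill_jobs` schema.

```python
import hashlib
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS backfill_jobs (
    job_key       TEXT PRIMARY KEY,                -- hash(content + prompt-version)
    payload       TEXT NOT NULL,
    status        TEXT NOT NULL DEFAULT 'queued',  -- queued / leased / done
    lease_expires REAL,
    gpu_seconds   REAL                             -- measured cost, for calibration
);
"""


def job_key(content: str, prompt_version: str) -> str:
    """Same content under the same prompt version always maps to the same key."""
    return hashlib.sha256(f"{prompt_version}\x00{content}".encode()).hexdigest()


def enqueue(conn, content: str, prompt_version: str) -> bool:
    """Producer side: returns False when an identical job already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO backfill_jobs (job_key, payload) VALUES (?, ?)",
        (job_key(content, prompt_version), content),
    )
    conn.commit()
    return cur.rowcount == 1


def lease_next(conn, lease_seconds: float = 600.0, now=None):
    """Worker side: claim one queued job, or one whose lease has expired."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT job_key, payload FROM backfill_jobs "
        "WHERE status = 'queued' OR (status = 'leased' AND lease_expires < ?) "
        "ORDER BY rowid LIMIT 1",
        (now,),
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE backfill_jobs SET status = 'leased', lease_expires = ? WHERE job_key = ?",
        (now + lease_seconds, row[0]),
    )
    conn.commit()
    return row


def complete(conn, key: str, gpu_seconds: float) -> None:
    """Mark a job done and record its measured GPU time."""
    conn.execute(
        "UPDATE backfill_jobs SET status = 'done', gpu_seconds = ? WHERE job_key = ?",
        (gpu_seconds, key),
    )
    conn.commit()
```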

---
## Build order (§11) & status
1. **Foundation** — config, `spark/` client, SQLite schema, conviction-log seed, **ledger scaffold live day one**.
2. **Ingestion + backfill queue** — backfill queue; EDGAR + FMP earnings; **audio path live-proven** (download → Parakeet transcribe + Sortformer diarize → align → voiceprint-stitch → speaker-attributed transcript); 20 podcast feeds resolved (`seeds/podcast_feeds.resolved.yaml`). Broad corpus ingested: **785 company docs (379 filings + 406 earnings), 802 extract jobs**. ⬜ Remaining: podcast audio backfill (RSS-full + dated YouTube pulls) + audio-cache cleanup; foreign-filer forms (20-F/6-K for CCJ/TSM/IREN).
3. **Extraction worker + §4.2 prompt**: **live** on text (earnings: 26 claims, insight-heavy) AND audio (Dwarkesh: 5 claims, correct speaker attribution). Backfill draining in background (`logs/extract-backfill.log`). ⬜ Refinement: target Item 7 MD&A for filings.
4. 🟡 **Embedding + storage** — ✅ Qdrant `propositions` hybrid collection (bge-m3 + BM25) **live**; hybrid search+rerank verified surfacing the power-infra theme. ⬜ embed the growing claim set; clustering.
5. 🟡 **Scoring brain** (`signal_engine/signals/`) — ✅ EISC independence primitive (verified), as-of harness, windowed acceleration, **under-acted-conviction (Job B)**, the quantitative bar, ledger writer, resolver stub, orchestrator. Speaker-name independence edges live. ⬜ Job A scorers (emergence/stance/intersection) deferred per blueprint.
6. **Source-independence graph** — voiceprint cosine + speaker-name edges feed EISC; "one guest doing the rounds" collapses to ~1 voice (verified).
7. 🟡 **§7.1 backtest** — runs end-to-end; **corpus-gated finding**: power-infra corroboration is currently single-cluster (energy companies), so it correctly won't clear the cross-cluster bar (k_eff≥2). Needs the cross-cluster podcast corpus (transcription backfill running). The scoring is disciplined, not broken.
8. ⬜ Frontier judge + synthesis + live fan-out (Anthropic key live; routed scrub→frontier→rehydrate) — deferred; v1 backtest uses a hand-written fan-out.
9. 🟡 **Web UI** (`signal_engine/ui/`, `serve`) — ✅ corpus management (dashboard, add/view sources, per-source claim inspection). ⬜ human-eval rating interface (§6.7) on the same app.
10. **Run the §7.1 backtest FIRST**, then the forward pilot; disagreement analysis; scaling decision.

**Scaling lever (post-pilot):** bulk extraction can offload to the Gemini batch API (operator can provide a key) while local Qwen stays default — public corpus only, never conviction/exposure data (§4).
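
The independence ideas in steps 6 and 7 (a guest "doing the rounds" collapsing to about one voice, and single-cluster corroboration failing the k_eff≥2 bar) can be illustrated with a simple effective-count sketch. This is not the EISC primitive, whose definition lives in the spec. It is a stand-in that assumes sources are dependent when they share a cluster or are linked by a same-voice edge, and counts the resulting connected components with union-find. The function name and the input shapes are hypothetical.

```python
def effective_count(source_cluster: dict[str, str], same_voice_edges) -> int:
    """Count independent groups after merging same-cluster and same-voice sources.

    source_cluster maps each source id to its cluster label; same_voice_edges is
    an iterable of (source_a, source_b) pairs, both of which must be known sources.
    """
    parent = {s: s for s in source_cluster}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    # Sources in the same cluster are treated as one dependent group.
    first_in_cluster: dict[str, str] = {}
    for source, cluster in source_cluster.items():
        if cluster in first_in_cluster:
            union(source, first_in_cluster[cluster])
        else:
            first_in_cluster[cluster] = source

    # A shared voice (voiceprint or speaker-name match) also merges groups.
    for a, b in same_voice_edges:
        union(a, b)

    return len({find(s) for s in source_cluster})


def clears_cross_cluster_bar(source_cluster, same_voice_edges, k_min: int = 2) -> bool:
    """True when the effective count reaches the k_eff >= k_min bar."""
    return effective_count(source_cluster, same_voice_edges) >= k_min
```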

---
## Running the foundation
```bash
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
DATA_DIR=./data .venv/bin/python -m signal_engine init-db
DATA_DIR=./data .venv/bin/python -m signal_engine seed-convictions
DATA_DIR=./data .venv/bin/python -m signal_engine seed-convictions --file seeds/conviction_log.backtest-2023.seed.yaml
# When the gateway URL is provided:
SPARK_CONTROL_URL=https://<lan-addr> .venv/bin/python -m signal_engine spark-status
```
**Pipeline + UI commands** (config from `.env`):
```bash
.venv/bin/python -m signal_engine seed-sources && .venv/bin/python -m signal_engine load-feeds
.venv/bin/python -m signal_engine ingest-edgar --ticker CEG --since 2023-01-01 --until 2023-12-31
.venv/bin/python -m signal_engine ingest-earnings --ticker NVDA --since 2023-01-01 --until 2023-12-31
.venv/bin/python -m signal_engine run-extract --limit 4 # local-LLM claim extraction
.venv/bin/python -m signal_engine embed-claims # → Qdrant hybrid collection
.venv/bin/python -m signal_engine search --query "power is the binding constraint on AI"
.venv/bin/python -m signal_engine serve # corpus UI → http://localhost:8000
```
Config is all env-driven (`signal_engine/config.py`) so the same code runs as a plain process now and as a
StartOS daemon later (injected via the s9pk `store.json` FileModel).
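
For illustration, an env-driven settings loader along these lines might look like the sketch below. It is not the contents of `signal_engine/config.py`: the `Settings` class, the `load_settings` helper, and the fallback values for `DATA_DIR` and `SPARK_VERIFY_TLS` are assumptions. The variable names themselves and the `claude-opus-4-8` default are the ones mentioned in this README.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names mirror the env variables."""
    data_dir: Path
    spark_control_url: Optional[str]
    spark_verify_tls: bool
    frontier_model: str
    anthropic_api_key: Optional[str]
    fmp_api_key: Optional[str]


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read every setting from the environment (or any mapping, for tests)."""
    return Settings(
        data_dir=Path(env.get("DATA_DIR", "./data")),                    # assumed fallback
        spark_control_url=env.get("SPARK_CONTROL_URL"),
        spark_verify_tls=_as_bool(env.get("SPARK_VERIFY_TLS", "true")),  # assumed fallback
        frontier_model=env.get("FRONTIER_MODEL", "claude-opus-4-8"),
        anthropic_api_key=env.get("ANTHROPIC_API_KEY"),
        fmp_api_key=env.get("FMP_API_KEY"),
    )
```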

---
## Operator-provided items (block runtime, not the foundation)
- **Spark Control LAN base URL + TLS cert** (same-LAN self-signed → `SPARK_VERIFY_TLS=false`).
- **Confirm the exact §13.2 model strings** (currently defaulted: `qwen3.6-35b-a3b-nvfp4`, `bge-m3`, `parakeet-tdt-0.6b`).
- **`ANTHROPIC_API_KEY`** (frontier step) and **`FMP_API_KEY`** (earnings transcripts) — or defer FMP and start filings-only.
- **§7.1 backtest corpus:** which 2023-era podcast archives + filings we can actually fetch (the gating risk).
- **StartOS host CPU arch** (drives which `.s9pk` to install — the host, not the ARM Sparks) and Spark Control's
health-check IDs + outbound-gateway = clearnet (not Tor) — verified against a live box at packaging time.