136 lines
9.8 KiB
Markdown
136 lines
9.8 KiB
Markdown
# Ten31 Signal Engine (pilot)
|
||
|
||
A recurring pipeline that ingests a growing corpus of audio (podcasts, YouTube) and text
|
||
(SEC filings, earnings-call transcripts), extracts structured **propositions**, and surfaces
|
||
**signal over time**, filtered through Ten31's thesis as a *relevance lens* — with every surfaced
|
||
signal logged as a **falsifiable prediction** so the system is scored against reality.
|
||
|
||
**Spec / source of truth:** [`ten31-signal-engine-handoff.md`](ten31-signal-engine-handoff.md). It wins
|
||
over this README on any conflict. Section refs below (§) point into it.
|
||
|
||
---
|
||
|
||
## The spine (do not violate — §2, §5)
|
||
|
||
> **Statistics & graph structure NOMINATE candidates; the frontier model only JUDGES and FANS OUT a
|
||
> pre-filtered shortlist. The prediction ledger is the final arbiter.**
|
||
|
||
- **Job A — Discovery (§1):** surface what Grant doesn't yet see — emergent themes (independent
|
||
cross-cluster *convergence*, scored on **acceleration** not size), credible contrarian minority
|
||
stances, and the prize: their **intersection** (a consensus about to flip).
|
||
- **Job B — Conviction-action gap (§1.1):** fan held convictions to 2nd/3rd-order derivatives and fire
|
||
`conviction(high) × exposure(low) × rising independent corroboration`. Countermeasure to the 2023
|
||
"power is the binding constraint" call where the seed was held but derivatives under-acted.
|
||
- **Nominate-then-judge boundary:** geometry/stats emit a shortlist that already cleared a quantitative
|
||
bar; the frontier model judges/synthesizes/fans-out only. A fanned-out derivative is a **hypothesis,
|
||
not a signal**, until independent corpus corroboration confirms it (§2.3, §4.6).
|
||
- **The lens tags relevance; it must NOT gate truth (§5.7).** The engine must be able to surface an
|
||
accelerating signal *against* Ten31's thesis (the B1–B3 breakers). `thesis_seam` is a tag, never a filter.
|
||
|
||
---
|
||
|
||
## Locked pilot decisions
|
||
|
||
| Topic | Decision | Ref |
|
||
|---|---|---|
|
||
| **Packaging** | Ship as a **StartOS 0.4.0 s9pk** (operator call). Code is package-ready; TS wrapper generated once the daemon/UI/dependency contract stabilizes. | §10, §13 |
|
||
| **Earnings transcripts** | **FMP Ultimate** transcript API as backbone (audio isn't reliably fetchable: no uniform feed, ~30–90d replay expiry). EDGAR filings are the durable core. Self-transcribed audio deferred. | §4.1, §12 |
|
||
| **Conviction exposure** | Coarse **NAV bands**: `none / lt2 / 2to10 / gt10`. Grant sole editor, monthly review. | §3.1 |
|
||
| **Topic vocabulary** | **Hybrid**: seeded controlled list + emergent topics batch-merged on a schedule. | §4.2 |
|
||
| **Frontier model** | Default `claude-opus-4-8` (override via `FRONTIER_MODEL`). Scrubbed-vs-unscrubbed quality A/B is a pilot deliverable. | §4.6 |
|
||
|
||
---
|
||
|
||
## Architecture (modules map 1:1 to spec layers)
|
||
|
||
```
|
||
ingest/ §4.1 RSS/YouTube/EDGAR/FMP fetch + scheduler; long-audio chunking; speaker stitch
|
||
extract/ §4.2 local LLM → 0..N structured claim units per chunk (willing to emit ZERO)
|
||
embedstore/ §4.3 embed DISTILLED propositions + client-side BM25 → Qdrant hybrid
|
||
signals/ §4.4 cluster acceleration · stance distributions · bridge edges · scoring
|
||
independence/ §4.5 source graph + voiceprint library → convergence discounting
|
||
frontier/ §4.6 judge · synthesis · conviction fan-out (scrub→frontier→rehydrate; SHORTLIST/SEEDS only)
|
||
ledger/ §4.7 dual-evaluation ledger + conviction log + earned credibility
|
||
spark/ §13 THE single chokepoint for all Spark Control HTTP (no other module knows the URL)
|
||
store/ SQLite schema + seed loaders
|
||
backfill/ §13.4 client-side GPU-hours queue (extraction = heavier serial load; audio sequential)
|
||
```
|
||
|
||
**Two invariants enforced in exactly one place each:**
|
||
- All gateway HTTP funnels through `spark/` — `spark/client.py` holds the base URL, the self-signed TLS
|
||
skip, 503 retry/backoff, and a process-wide **audio lock** (sequential audio, §4.1).
|
||
- The "model never sees Grant's rating first" rule (§6.7) is **structural**: ratings live in a separate
|
||
`human_evaluations` table; the model-facing code reads `ledger`, which has no `grant_rating` column.
|
||
|
||
### Data stores
|
||
- **SQLite** (`store/schema.sql`): sources, documents, claims, topics, source_edges, voiceprints,
|
||
conviction_log, fanout_nodes, ledger, human_evaluations, backfill_jobs. The whole system state is a SELECT.
|
||
- **Qdrant** (via Spark Control): one `propositions` collection, **hybrid** dense `bge-m3` (1024-d) +
|
||
sparse BM25 (`Qdrant/bm25`, `modifier: idf`). Points are distilled propositions, never raw chunks.
|
||
|
||
---
|
||
|
||
## Backfill queue (§13.4) — measured in GPU-hours, not real-time
|
||
|
||
Scheduler jobs are **producers** (fetch/dedup/enqueue); a **single worker** drains the GPU queue one job
|
||
at a time → no parallel audio → no 503 by construction. Jobs are leased (crash-safe resume), idempotent on
|
||
`hash(content + prompt-version)`. Extraction is the binding load. Rough pilot estimate (~300 episodes +
|
||
25 companies): transcription ~6 GPU-h, **extraction ~90 GPU-h**, wall-clock ~4–6 days (audio on Spark 2 and
|
||
extraction on Spark 1 run in parallel). The queue self-calibrates from measured `gpu_seconds`.
|
||
|
||
---
|
||
|
||
## Build order (§11) & status
|
||
|
||
1. ✅ **Foundation** — config, `spark/` client, SQLite schema, conviction-log seed, **ledger scaffold live day one**.
|
||
2. ✅ **Ingestion + backfill queue** — backfill queue; EDGAR + FMP earnings; **audio path live-proven** (download → Parakeet transcribe + Sortformer diarize → align → voiceprint-stitch → speaker-attributed transcript); 20 podcast feeds resolved (`seeds/podcast_feeds.resolved.yaml`). Broad corpus ingested: **785 company docs (379 filings + 406 earnings), 802 extract jobs**. ⬜ Remaining: podcast audio backfill (RSS-full + dated YouTube pulls) + audio-cache cleanup; foreign-filer forms (20-F/6-K for CCJ/TSM/IREN).
|
||
3. ✅ **Extraction worker + §4.2 prompt** — **live** on text (earnings: 26 claims, insight-heavy) AND audio (Dwarkesh: 5 claims, correct speaker attribution). Backfill draining in background (`logs/extract-backfill.log`). ⬜ Refinement: target Item 7 MD&A for filings.
|
||
4. 🟡 **Embedding + storage** — ✅ Qdrant `propositions` hybrid collection (bge-m3 + BM25) **live**; hybrid search+rerank verified surfacing the power-infra theme. ⬜ embed the growing claim set; clustering.
|
||
5. 🟡 **Scoring brain** (`signal_engine/signals/`) — ✅ EISC independence primitive (verified), as-of harness, windowed acceleration, **under-acted-conviction (Job B)**, the quantitative bar, ledger writer, resolver stub, orchestrator. Speaker-name independence edges live. ⬜ Job A scorers (emergence/stance/intersection) deferred per blueprint.
|
||
6. ✅ **Source-independence graph** — voiceprint cosine + speaker-name edges feed EISC; "one guest doing the rounds" collapses to ~1 voice (verified).
|
||
7. 🟡 **§7.1 backtest** — runs end-to-end; **corpus-gated finding**: power-infra corroboration is currently single-cluster (energy companies), so it correctly won't clear the cross-cluster bar (k_eff≥2). Needs the cross-cluster podcast corpus (transcription backfill running). The scoring is disciplined, not broken.
|
||
8. ⬜ Frontier judge + synthesis + live fan-out (Anthropic key live; routed scrub→frontier→rehydrate) — deferred; v1 backtest uses a hand-written fan-out.
|
||
9. 🟡 **Web UI** (`signal_engine/ui/`, `serve`) — ✅ corpus management (dashboard, add/view sources, per-source claim inspection). ⬜ human-eval rating interface (§6.7) on the same app.
|
||
10. ⬜ **Run the §7.1 backtest FIRST**, then the forward pilot; disagreement analysis; scaling decision.
|
||
|
||
**Scaling lever (post-pilot):** bulk extraction can offload to the Gemini batch API (operator can provide a key) while local Qwen stays default — public corpus only, never conviction/exposure data (§4).
|
||
|
||
---
|
||
|
||
## Running the foundation
|
||
|
||
```bash
|
||
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
|
||
DATA_DIR=./data .venv/bin/python -m signal_engine init-db
|
||
DATA_DIR=./data .venv/bin/python -m signal_engine seed-convictions
|
||
DATA_DIR=./data .venv/bin/python -m signal_engine seed-convictions --file seeds/conviction_log.backtest-2023.seed.yaml
|
||
# When the gateway URL is provided:
|
||
SPARK_CONTROL_URL=https://<lan-addr> .venv/bin/python -m signal_engine spark-status
|
||
```
|
||
|
||
**Pipeline + UI commands** (config from `.env`):
|
||
```bash
|
||
.venv/bin/python -m signal_engine seed-sources && python -m signal_engine load-feeds
|
||
.venv/bin/python -m signal_engine ingest-edgar --ticker CEG --since 2023-01-01 --until 2023-12-31
|
||
.venv/bin/python -m signal_engine ingest-earnings --ticker NVDA --since 2023-01-01 --until 2023-12-31
|
||
.venv/bin/python -m signal_engine run-extract --limit 4 # local-LLM claim extraction
|
||
.venv/bin/python -m signal_engine embed-claims # → Qdrant hybrid collection
|
||
.venv/bin/python -m signal_engine search --query "power is the binding constraint on AI"
|
||
.venv/bin/python -m signal_engine serve # corpus UI → http://localhost:8000
|
||
```
|
||
|
||
Config is all env-driven (`signal_engine/config.py`) so the same code runs as a plain process now and as a
|
||
StartOS daemon later (injected via the s9pk `store.json` FileModel).
|
||
|
||
---
|
||
|
||
## Operator-provided items (block runtime, not the foundation)
|
||
|
||
- **Spark Control LAN base URL + TLS cert** (same-LAN self-signed → `SPARK_VERIFY_TLS=false`).
|
||
- **Confirm the exact §13.2 model strings** (currently defaulted: `qwen3.6-35b-a3b-nvfp4`, `bge-m3`, `parakeet-tdt-0.6b`).
|
||
- **`ANTHROPIC_API_KEY`** (frontier step) and **`FMP_API_KEY`** (earnings transcripts) — or defer FMP and start filings-only.
|
||
- **§7.1 backtest corpus:** which 2023-era podcast archives + filings we can actually fetch (the gating risk).
|
||
- **StartOS host CPU arch** (drives which `.s9pk` to install — the host, not the ARM Sparks) and Spark Control's
|
||
health-check IDs + outbound-gateway = clearnet (not Tor) — verified against a live box at packaging time.
|
||
```
|