Initial commit: Ten31 Signal Engine (ingest, scoring brain, corpus seeds)
@@ -0,0 +1,40 @@
# Python
__pycache__/
*.py[cod]
.venv/
venv/
*.egg-info/
.pytest_cache/

# Data / runtime
data/
logs/
*.db
*.db-wal
*.db-shm
audio-cache/

# StartOS s9pk build artifacts
*.s9pk
node_modules/
javascript/
docker-images/

# Secrets & local env
.env
.env.*
!.env.example
*.local

# Claude Code — deny by default, allow-list shared wiring.
# .claude/ also accumulates worktrees, editor configs, and OS cruft; commit
# only the shared parts so new local scratch (or a stray secret) stays out.
.claude/*
!.claude/rules/
!.claude/agents/
!.claude/commands/
!.claude/skills/
!.claude/settings.json

# OS cruft
.DS_Store
@@ -0,0 +1,36 @@
# Battery corpus scope + the regulatory-timing finding (DESIGN_v2 §3-step-6, lineage-aware)

Scoped via lineage-tagged research (workflow `w4pt3n0wq`). This is the first installment of the
lineage-aware corpus expansion — and it produced a sharper resolution of the Battery thesis than the
adoption-evidence labels alone.

## The finding (CORRECTED by principal ground-truth, 2026-06-08): the supply failure was capital-provider WILLINGNESS, not regulation
> **Retraction.** An earlier version of this note (and the scoping agent) claimed the supply failure was "structurally regulatory" (SAB-121 blocked it). **That was wrong — a plausible-narrative over-fit, the exact failure mode the engine exists to catch — and is retracted.**

The institutional-supply leg of D1 failed because **capital providers — legally free to fund Battery — were unwilling to provide dollars** for a novel product from a new company in an emerging structure. **Nothing *restricted* them.**
- Regulatory uncertainty plausibly dampened institutional risk appetite for the space *broadly*, but it was **not Battery's gate**. **Proof:** the 2025 regulatory unblock (SAB-122 Jan 2025, FDIC/OCC reversals) did **NOT** produce institutional dollars for Battery — **they remain unraised as of 2026.** If regulation had been the binding constraint, removing it would have moved the outcome. It didn't.
- So the binding constraint is **capital-provider willingness / risk-appetite for the product**, not regulatory permissibility.

**Why this matters for the engine (the sharpened S1 lesson):** an engine fanning the REGULATORY/enabler axis would have generated a FALSE "supply is unblocking in 2025" signal — a plausible-but-wrong causal model producing a *false positive*. The TRUE resolution is **actual committed/deployed capital to fund the loans at scale**, which stayed NO throughout (the frozen pre-registered criterion was right; only the causal gloss drifted). Enablers (regulation, a single bank's toe-in) are CONTEXT, never the resolver — they can resolve YES while the conviction fails.

A correct two-sided read across 2022–2024+: **demand-net rising** (named lenders, real borrower appetite) while the **supply-net stays flat** measured as *actual capital committed to fund the loans* — and critically, NOT mistaking the 2025 regulatory enabler for supply arriving.

## What resolves which axis (set this BEFORE listing the corpus, so policy can't creep back into the resolver)
- **SUPPLY axis (the binding constraint) resolves ONLY on actual committed/deployed dollars** funding BTC-collateralized loans at scale — Battery's own raise status, named lenders' *funding sources* (whose balance sheet, drawn vs. announced), and dollars actually deployed. This stays **NO** through the window and **NO post-2025-unblock** (Battery remains unraised as of 2026).
- **DEMAND axis resolves on borrower appetite + originated volume** — named originators, cumulative origination. This goes **YES**.
- **Policy / regulatory documents are CONTEXT, never the supply resolver.** They explain ambient institutional risk-appetite for the space; they do **not** measure whether capital reached Battery. The 2025 unblock is the proof: it flipped the policy axis YES while supply stayed NO. An engine that wired policy into the supply resolver would have printed a false "supply unblocking" signal. Tag policy `axis=context`, weight 0 into the supply score.
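
A minimal sketch of the weighting rule above, under stated assumptions: the `Evidence` record, the `SUPPLY_WEIGHTS` table, and the `threshold_usd` parameter (standing in for the unspecified "at scale" dollar level) are hypothetical names, not the engine's schema. It illustrates that anything tagged `axis=context` adds zero to the supply score, so a policy-YES cannot print as supply arriving.

```python
from dataclasses import dataclass

# Hypothetical weights: only deployed dollars on the supply axis count.
SUPPLY_WEIGHTS = {"supply": 1.0, "demand": 0.0, "context": 0.0}


@dataclass(frozen=True)
class Evidence:
    source: str
    axis: str            # "supply" | "demand" | "context"
    deployed_usd: float  # dollars actually deployed; 0.0 for announcements


def supply_score(items):
    """Axis-weighted sum of deployed dollars; context items contribute nothing."""
    return sum(SUPPLY_WEIGHTS[e.axis] * e.deployed_usd for e in items)


def supply_resolved(items, threshold_usd):
    """True only when weighted deployed dollars reach the (assumed) threshold."""
    return supply_score(items) >= threshold_usd


evidence = [
    Evidence("SAB-122 (policy)", "context", 0.0),
    Evidence("lender blog: origination volume", "demand", 0.0),
    Evidence("announced program, undrawn", "supply", 0.0),
]
print(supply_score(evidence))
```

With only policy, demand, and announced-but-undrawn items in hand, the supply score stays at zero, which mirrors the policy-YES / supply-NO divergence described above.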

## The corpus, by lineage (tag at ingestion; the downstream tier is the same-event echo)
- **PRIMARY (original disclosures/events — ingest as dated documents):** **the supply-axis primaries that actually resolve it** — Battery's own raise/funding disclosures, named lenders' funding-source disclosures (Cantor/Maple/FalconX *who-funds-it*, Unchained/Ledn balance-sheet vs. flow), and deployed-dollar reporting; then SEC SABs, OCC Interpretive Letters, FDIC FILs, Fed SR letters **(policy — `axis=context`, NOT supply)**; GlobalCapital, Bloomberg, Galaxy Research crypto-lending reports, S&P/Moody's structured-finance (trade); Unchained/Ledn/Strike/SALT/debifi blogs (demand); Cantor PR (supply *announcement* — distinguish announced from funded); JPMorgan/Goldman/MS earnings + filings (incumbent-entry — **we already ingest these via EDGAR + FMP**).
- **INDEPENDENT (separate vantage, different incentives):** CSBS + GAO (opposing regulators/oversight — leading indicators); Deloitte/KPMG (audit-liability, real-time read of what the rule does to the books); top-tier bank-regulatory law firms (represent the banks deciding); 9fin, PitchBook LCD, Private Debt Investor, Asset Securitization Report (independent credit data); bank 10-K/10-Q.
- **DOWNSTREAM (same-event echo — tag + near-zero independence, like own_network):** the long tail of law-firm client alerts + Mondaq; Creditflux, SCI; crypto trade press (Ledger Insights, Banking Dive, The Block, CoinDesk, Decrypt). **Cantor's $2B program is flagged "THE ECHO-TRAP CASE"** — every outlet on it = ONE event, not corroboration.
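
One way the same-event echo could be kept from reading as corroboration, sketched under assumptions: the `LINEAGE_WEIGHT` values and the `(event_id, lineage)` tuple shape are illustrative (the text only says downstream gets near-zero independence), and this is not the EISC implementation. Each event gets a single vote at its best lineage weight, so twenty outlets covering one announcement still count as one event.

```python
from collections import defaultdict

# Hypothetical priors; the source only specifies "near-zero" for downstream.
LINEAGE_WEIGHT = {"primary": 1.0, "independent": 1.0, "downstream": 0.05}


def effective_count(docs):
    """Independence-weighted count where all docs on one event share one vote.

    docs is an iterable of (event_id, lineage) pairs. For each event only the
    highest lineage weight is kept, so echoes never add to a primary.
    """
    best = defaultdict(float)
    for event_id, lineage in docs:
        best[event_id] = max(best[event_id], LINEAGE_WEIGHT[lineage])
    return sum(best.values())


# One primary announcement plus twenty downstream write-ups of the same event.
echo = [("cantor-2b", "primary")] + [("cantor-2b", "downstream")] * 20
print(effective_count(echo))  # 1.0
```

Adding a genuinely separate event (for example a lender's own funding-source disclosure) is the only way the count rises in this sketch.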

## Ingestion plan (the build to actually run Battery) — ordered by what resolves the supply axis
1. **Committed-capital evidence (the supply resolver — highest priority)** = Battery's own raise/funding disclosures + named lenders' *funding-source* disclosures (whose dollars, drawn vs. announced, deployed at scale). This is what the supply-net is measured on. Pull from company disclosures, PitchBook/9fin/credit-data, and the announced-vs-funded distinction on Cantor/Maple/FalconX. **Low volume, decisive.**
2. **Lender + borrower disclosures** = a handful of company blogs (RSS where available) for the demand-net (origination volume, named originators).
3. **Bank treatment** = already available (EDGAR filings + FMP earnings calls for JPM/GS/MS — extend the existing company corpus); read for *deployed* exposure, not stated intent.
4. **Policy primaries (`axis=context`, NOT supply)** = a small set of structured, dated, RSS-accessible document feeds (SEC/OCC/FDIC/Fed) — ingest as `filing`-like text docs (clean, no transcription). Cheap to pull; **weight 0 into the supply score** — present only to explain ambient risk-appetite and to demonstrate the 2025 policy-YES / supply-NO divergence.
5. **Independent credit-data + Big-4 + law-firm** = sample, lineage-tagged independent.
6. **Downstream tier** = ingest but tag `lineage=downstream` → near-zero independence in EISC (same mechanism as `own_network`). The Cantor echo must not read as corroboration.

This needs a small **policy-document + blog fetcher** (different from RSS-audio and EDGAR) — the next build for the Battery leg. The Strike leg (bitcoin podcasts) is already transcribing.
@@ -0,0 +1,126 @@
# Ten31 Signal Engine — Design v2 (hypotheses + falsification protocol)

**Status:** This is **not** a settled-decisions document. It is a set of design *hypotheses* with *falsification criteria*, plus the *evaluation protocol* that is supposed to prove them wrong. It exists because the dev, the implementing agent, and the principal converged in argument — and convergence among three reasoning processes that share priors is **not** evidence the design is right. The discipline from here is empirical: stop improving the design by argument; subject it to data that can falsify it.

**The single most important correction from the v1 review:** all evaluation so far was **recall-only** ("given something we know happened, did the engine catch it?"). The question a fund actually cares about is **precision** ("of everything the engine fires on, what fraction pans out?"). You cannot adjudicate a precision/recall tradeoff — which is what the whole gate debate *is* — while measuring one side. Therefore the evaluation protocol (§1) is the first-class artifact; the design hypotheses (§2) are secondary and only matter once §1 can grade them.

---

## 1. Evaluation protocol (the first-class section)

### 1.1 Pre-registration (committed before any outcome data is pulled)
To avoid the same hindsight leakage that the hand-written fan-out had (we picked the derivatives we knew were right, phrased to match the corpus), the resolution criteria below are **fixed before** fetching any price/external data. A **single uniform rule** is applied to all derivatives — no per-derivative threshold tuning.

**Resolution rule (uniform, pre-registered):**
- Each derivative maps to a **proxy basket** of liquid equities most directly exposed to it (committed in `seeds/resolution.K2023.yaml`).
- Build an **equal-weight, start-normalized basket index** over 2023-01-01 → 2025-06-30.
- **"Confirmed real"** iff the index first reaches **≥ +40% vs. its 2023-01 baseline** on some date `D`, **and** is still **≥ +25%** 90 days after `D` (anti-transient-spike). `D` = the **repricing date**.
- Otherwise **"not confirmed"** in-window.
- These thresholds (40% / 25% / 90d) and the baskets are the pre-registration. They are deliberately coarse and uniform so they cannot be gamed per-derivative.
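
The uniform rule can be sketched as follows. This is an illustrative reading, not the resolver's code: the function names, the toy tickers `AAA`/`BBB`, and the prices are made up, and two interpretations are assumed: only the first crossing of +40% is tested, and the hold check uses the first observation at least 90 days after `D`.

```python
from datetime import date, timedelta


def basket_index(prices):
    """Equal-weight, start-normalized index.

    prices maps ticker -> list of prices, all aligned to the same dates.
    """
    series = list(prices.values())
    n_obs = len(series[0])
    return [sum(s[t] / s[0] for s in series) / len(series) for t in range(n_obs)]


def resolve(dates, index, up=1.40, hold=1.25, hold_days=90):
    """Return (confirmed, repricing_date) for one basket index."""
    for i, level in enumerate(index):
        if level >= up:
            d = dates[i]  # first date at or above +40% vs. baseline
            check_from = d + timedelta(days=hold_days)
            for j in range(i, len(dates)):
                if dates[j] >= check_from:  # first observation 90+ days after D
                    held = index[j] >= hold
                    return held, (d if held else None)
            return False, None  # window ends before the 90-day check
    return False, None


# Toy data on a 30-day grid (not real prices).
dates = [date(2023, 1, 1) + timedelta(days=30 * k) for k in range(8)]
prices = {
    "AAA": [10, 11, 16, 17, 15, 15, 15, 15],
    "BBB": [20, 22, 26, 30, 27, 28, 28, 28],
}
confirmed, repricing_date = resolve(dates, basket_index(prices))
print(confirmed, repricing_date)
```

Keeping the thresholds as defaulted arguments makes the pre-registration visible in one place; changing them would be the logged event that §4 describes, not a quiet tune.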

### 1.2 The engine's two signal dates (to price what the gate costs)
For each derivative we record **both**:
- **cleared_date** — earliest as_of where it cleared the evidence bar (the engine's actual fire).
- **whisper_date** — earliest as_of where the engine saw *accelerating, multi-claim* corroboration **regardless of the independence floor** (`n_confirmed ≥ 4 AND a_corrob > 0`). This is the signal the gate may have suppressed.

The gap between `whisper_date` and `cleared_date` is **the lead time the gate costs us**, measured, not argued.
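
A sketch of how the two dates and the gap between them might be derived from a march of as-of snapshots. The `Snapshot` record is hypothetical, the cleared test is reduced to the whisper condition plus the `EISC ≥ 2.0` floor discussed in H1 (the real evidence bar may include more), and the two snapshots are illustrative values, not results from the run.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Snapshot:
    as_of: date
    n_confirmed: int
    a_corrob: float  # acceleration of corroboration
    eisc: float      # EISC value at this as_of


def is_whisper(s):
    """Multi-claim, accelerating corroboration, ignoring the independence floor."""
    return s.n_confirmed >= 4 and s.a_corrob > 0


def is_cleared(s, eisc_floor=2.0):
    """Assumed simplification: whisper condition plus the EISC floor."""
    return is_whisper(s) and s.eisc >= eisc_floor


def first_date(snaps, pred):
    hits = [s.as_of for s in snaps if pred(s)]
    return min(hits) if hits else None


def gate_cost_days(snaps):
    """Days between whisper_date and cleared_date, or None if either is missing."""
    whisper = first_date(snaps, is_whisper)
    cleared = first_date(snaps, is_cleared)
    if whisper is None or cleared is None:
        return None
    return (cleared - whisper).days


snaps = [
    Snapshot(date(2023, 11, 1), n_confirmed=5, a_corrob=0.4, eisc=1.6),
    Snapshot(date(2024, 5, 1), n_confirmed=9, a_corrob=0.2, eisc=2.1),
]
print(gate_cost_days(snaps))  # 182
```

A derivative that whispers but never clears returns `None` here; that case is the FN candidate the confusion matrix is meant to count.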

### 1.3 The confusion matrix (precision AND recall)
Per derivative, classify:
- **TP** = cleared **and** confirmed-real (report lead time = repricing_date − cleared_date).
- **FP** = cleared **and** not-confirmed (false alarm — the precision killer).
- **FN** = had a whisper / signal but never cleared, **and** confirmed-real (the gate ate a real signal — the uranium hypothesis).
- **TN** = no signal **and** not-confirmed.

Report **precision = TP/(TP+FP)** and **recall = TP/(TP+FN)** — and the **whisper-level** matrix too (re-classify using whisper_date), so we can see what precision/recall would be *without* the independence floor. The delta between the two matrices is the empirical answer to the gate debate.
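
A minimal sketch of the two matrices, assuming each derivative is a dict with boolean `cleared`, `whisper`, and `confirmed` fields (hypothetical names). A confirmed-real derivative that never signaled at the chosen level is counted as FN here, which is an assumption slightly broader than the definition above. The rows are toy data, not backtest output.

```python
from collections import Counter


def classify(signaled, confirmed):
    """Map one derivative to a confusion-matrix cell."""
    if signaled and confirmed:
        return "TP"
    if signaled:
        return "FP"
    if confirmed:
        return "FN"
    return "TN"


def precision_recall(rows, level):
    """level is 'cleared' (with the gate) or 'whisper' (without the floor)."""
    counts = Counter(classify(r[level], r["confirmed"]) for r in rows)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


rows = [
    {"cleared": True, "whisper": True, "confirmed": True},
    {"cleared": False, "whisper": True, "confirmed": True},
    {"cleared": True, "whisper": True, "confirmed": False},
    {"cleared": False, "whisper": True, "confirmed": False},
    {"cleared": False, "whisper": False, "confirmed": False},
]
print(precision_recall(rows, "cleared"))  # (0.5, 0.5)
print(precision_recall(rows, "whisper"))  # (0.5, 1.0)
```

In this toy set the gate costs recall without buying precision; whether that holds on real derivatives is exactly what the protocol is meant to measure.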

### 1.4 Held-out precision case (pending the principal)
The 9-derivative matrix measures precision on one (known-winner) conviction. For a clean false-positive rate we need an **adversarial case: a conviction Ten31 held that *did not* pan out**, seeded and run the same way. *Open — Grant to provide a candidate failed/abandoned conviction.* Without it, precision is measured on a favorable sample and should be reported as such.

### 1.5 The deeper thing the resolver tests
The lead-time **distribution** (not any single signal) tests the engine's *foundational assumption* — that discourse leads capital with usable, consistent lead time (§6.2). If the distribution is wide, near-zero, or sometimes negative, the leading-indicator thesis partially collapses. The resolver is therefore not just a grader of signals; it is a test of the premise. This is why it is built first.

---

## 2. Design hypotheses (each with a falsification criterion)

Adopted from the v1 review; **none are settled** until §1 grades them.

- **S1 (STANDING RULE) — resolve on the OUTCOME (scaled substance), never on milestones or ENABLERS.** Two Battery lessons:
  - *Milestone vs. substance:* the same reality resolved YES on "≥1 major institution enters" (Goldman, one loan, 2022-04) and NO on "≥$X capital *deployed* at scale" (none).
  - *Enabler vs. outcome (the sharper one):* the regulatory ENABLER improved (SAB-121 → SAB-122, Jan 2025) yet the OUTCOME didn't move — Battery *still* could not raise institutional dollars post-unblock. **An engine fanning the enabler ("regulation is unblocking") would have falsely "confirmed" supply on the 2025 deregulation.** Battery's binding constraint was never regulatory *permissibility* (nothing legally barred a capital provider from funding it) — it was capital-provider *willingness* to fund a novel product from a new company, which persisted as NO even after the backdrop improved.
  So: phrase derivatives as deployed/at-scale substance ("≥$X capital deployed", "sustained volume > Y"), never as checkbox milestones ("a bank entered") **or enablers** ("regulation permits it now"). Enablers/catalysts are CONTEXT, not resolution — they can resolve YES while the conviction fails.

- **H1 — The contested gate is the EISC independence FLOOR (+ same-cluster discount), not cross-cluster.** v1 internal diagnosis: of 9 derivatives, the binding gate is overwhelmingly `EISC < 2.0` (uranium: 15 confirmations, 5 sources, highest acceleration in the run, suppressed by EISC=1.6; picks-and-shovels' early 2023-11 whisper suppressed by EISC=1.6, delayed 6 months). **Falsify:** if the confusion matrix shows the EISC-suppressed derivatives did **not** reprice early, the floor was right and v1's relaxation instinct is wrong.
- **Corrected gate framing.** Cross-cluster was always a *crude proxy for independence of information* — job-invariant, not a Job-A-only rule. Keep the **two-tier** resolution (log the single-cluster whisper to start the clock; act only on broader/independent confirmation) but as *independence-tiering*, not job-splitting.
- **H2 — Weight independence by `claim_type`.** Operational/descriptive disclosure from a self-interested source (order book, capex, interconnect queue) is more trustworthy than its predictive/opinion claims. **Falsify:** if down-weighting predictive self-interested claims does not improve precision on the held-out case.
- **H3 — Independence is lineage, not cluster** — an *edge-population* problem on the existing EISC graph (`source_edges.edge_type` already supports citation/community), not a rebuild. Cheap interim: **source-type independence priors** + `claim_type` (H2). **NOT temporal-precedence as a standalone discount** — the dev's catch: "later near-identical = discount" misclassifies independent re-derivation as echo (3 operators reporting queues = the corroboration we want). Precedence may only set the *direction* of an already-detected dependency edge, never stand alone.
- **H4 — Replace the 2nd-derivative with a persistence / level-crossing test** on the corroboration arrival rate ("crossed N and stayed non-decreasing across k windows; shrinkage at low n"), with **per-source-type window cadence** (filings quarterly, podcasts weekly). **Falsify:** re-run the backtest with the persistence estimator; if it does not reduce the sign-flip variance visible in Appendix A, the 2nd-derivative wasn't the problem.
- **H5 — Two-sided Job B (a product feature, not just a defense).** Track **net independence-weighted corroboration (affirms − denies)**; *accelerating contradiction + over-exposure* = an "eroding-conviction / reduce" signal of equal value to under-acted-conviction. Also the structural reflexivity defense (Job B is confirmation-seeking by construction — fan out from beliefs, filter to affirms-only).
- **H6 — The frontier fan-out is the UNTESTED half, and it is the half that maps to the actual miss (§1.1).** The backtest hand-wrote the derivative tree (hindsight). Design the test: seed the 2023 conviction, give the model **2023-only** context, let it **propose** derivatives, score that tree's precision/recall vs. what actually repriced, and compare frontier-proposed vs. hand-written. **Grade on ALTITUDE, not just topical correctness (per S1):** "does it propose the derivative at the right altitude — scaled substance vs. first-instance milestone" — because that choice alone flips the verdict (Goldman vs. Cantor). A model that proposes topically-right but milestone-altitude derivatives will look prescient on headlines while missing the real conviction state.
- **Extraction fixes (low-risk):** Item-7 (MD&A) targeting for filings; relations populated at a **linking stage** over the embedded store, not by the extractor (a spec clarification the dev owns).
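
For H4, a level-crossing persistence test might look like the sketch below. The function name and signature are hypothetical, the low-n shrinkage and per-source-type cadence are left out, and "stayed non-decreasing across k windows" is read as the crossing window plus the next k − 1 windows. The two rate series are toy values.

```python
def persistent_crossing(rates, level, k):
    """Index of the first window at or above `level` whose k-window run
    (that window and the next k - 1) is non-decreasing, else None."""
    for i in range(len(rates) - k + 1):
        run = rates[i:i + k]
        if run[0] >= level and all(a <= b for a, b in zip(run, run[1:])):
            return i
    return None


# A spike that falls back does not fire; the later sustained climb does.
steady = [1, 4, 2, 3, 4, 5, 5]
# An earnings-season sawtooth never persists, so it never fires.
sawtooth = [1, 4, 2, 4, 2, 4]

print(persistent_crossing(steady, level=3, k=3))    # 3
print(persistent_crossing(sawtooth, level=3, k=3))  # None
```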

---

## 3. Build sequence
1. **Resolver + the pre-registered confusion matrix on the 9 derivatives already run** (this measures precision, settles H1 empirically, and tests the lead-time premise). ← *building now.*
2. **Independence/estimator rework** (H2 claim_type + source-type priors; H4 persistence estimator) — in parallel; makes the *next* backtest trustworthy.
3. **Frontier-fan-out test** (H6) — right behind the estimator; cheap, strategically central.
4. **Two-sided Job B** (H5).
5. **Held-out failed-conviction case** (§1.4) — once provided.
6. **Broad corpus expansion — last, and lineage-aware** (toward independent vantage points: operator/expert-network, supply-chain OEMs/EPCs/ISOs, FERC/interconnection filings, policy — NOT more correlated sell-side/trade-press).

## 4. Standing discipline
Every additional round of elegant consensus past this point has decreasing value and increasing risk of mistaking coherence for correctness. Changes to scoring parameters (the EISC floor especially) must be justified by the confusion matrix / held-out precision, **not** by making a known case clear. The resolution criteria in §1.1 are frozen; if they change, that is a pre-registration event to be logged and justified, not a quiet tune.

### 4.1 Pre-registration change log (post-hoc normalizations — logged per §4)
Any judgment call made *after* seeing outcome data is a logged pre-registration event, with reasoning. Three on the adversarial cases (all made post-outcome; the first two are normalizations judged correct and the third is a retraction; logging them is what protects the labels from "we graded it how we wanted"):
1. **STRIKE-card-rail-disruption polarity normalization.** The verifier agent returned `verdict=confirmed`, but its evidence was *zero* bitcoin/Lightning erosion attribution across all five Visa/MC 10-Ks. The `confirm_iff` criterion was "erosion IS attributed," so the criterion was NOT met. Normalized to `played_out: no` (thesis failed). Reasoning: the agent used "confirmed" to mean "confirmed the no-erosion finding"; the pre-registered criterion is about whether disruption occurred, which it did not. Polarity, not substance.
2. **BATTERY-incumbent-entry milestone-vs-substance split.** The criterion as written ("≥1 major institution publicly entered") resolves YES (Goldman, one loan, 2022-04). Recorded as `played_out: token` with the explicit note that the *substance* (institutional capital at scale, the separate BATTERY-institutional-supply check) resolved NO. Reasoning: the milestone phrasing is a weak proxy for the thesis; logging both verdicts on the same reality is the evidence for standing rule S1. **The criterion phrasing itself was the flaw**, which is now fixed forward by S1 — but the original label is preserved, not silently rewritten.
3. **BATTERY causal misattribution — CORRECTION (principal ground-truth, 2026-06-08).** The agent's corpus-scoping narrative claimed the institutional-supply failure was "structurally regulatory" (SAB-121 blocked it). **This was wrong and is retracted.** Principal correction: nothing legally restricted capital providers from funding Battery; the binding constraint was capital-provider *willingness* to fund a novel product / new company. Proof it was NOT regulatory: the 2025 regulatory unblock did NOT produce institutional dollars for Battery (still unraised as of 2026). The frozen pre-registered criterion (BATTERY-institutional-supply = "$X *deployed* at scale", `played_out: no`) was correct and unchanged; only the post-hoc causal gloss was wrong — a textbook instance of the plausible-narrative failure mode this engine exists to catch, logged here rather than buried. Policy/regulatory sources are demoted to CONTEXT for the Battery supply axis, not its resolver (see BATTERY_corpus_scope.md, corrected).

---

## v2.1 amendments (post-confusion-matrix; two protocol corrections + two adversarial cases)

The first confusion matrix (§5 below / the power-infra run) and the dev's review forced two protocol corrections *before this doc ossifies*, plus the two failed-conviction cases the precision axis was missing.

### Correction A — measure RUNWAY, not lead-time-vs-first-tick
Lead-time (days before the first price tick) is a *trading* metric. Ten31 is a long-duration, often-private holder. The right metric is **remaining durable runway at signal** = the fraction of the eventual durable move still ahead when the signal fired:
`runway = (durable_peak − index_at_signal) / (durable_peak − baseline)`.
- High runway = the signal was actionable for a holder even if "late" by a trading clock.
- **Do not penalize a modestly-late public read** — public comps lead the private rounds we actually enter, so a signal a few weeks/a month after the first public tick still left a real window. This is a *lens on the public number*, not a private-markets dataset to source. Public markets remain the workhorse instrument.
- *Why it matters here:* the power-infra cells with "negative lead-time" (uranium −46d, picks −116d) had most of a +225%/+392% move still ahead at signal → high runway → valuable to a holder. Lead-time mis-scored them; runway corrects it.
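
The runway formula above, as a small function. The uranium-like example assumes a start-normalized baseline of 1.0 and a durable peak of 3.25 (a +225% move); the `index_at_signal` of 1.45 is a made-up value for illustration, not a figure from the run.

```python
def runway(durable_peak, index_at_signal, baseline):
    """Fraction of the eventual durable move still ahead when the signal fired."""
    span = durable_peak - baseline
    if span <= 0:
        raise ValueError("durable_peak must exceed baseline")
    return (durable_peak - index_at_signal) / span


# Signal fires after the first public tick, yet most of the move is still ahead.
print(round(runway(durable_peak=3.25, index_at_signal=1.45, baseline=1.0), 2))  # 0.8
```

A signal that fires at the baseline has runway 1.0, and one that fires at the durable peak has runway 0.0, so the metric reads directly as "how much was left for a holder".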

### Correction B — tag derivatives by DISTANCE-FROM-EDGE, never filter on it
The engine's job is to surface derivatives — *including* ones outside the current mandate (the original AI/compute miss *was* a mandate-expansion failure; an engine that pre-filters to in-mandate reproduces exactly that blindness). So every derivative carries `distance_from_edge ∈ {in_mandate, one_hop, two_hop}` for **triage only**. (Uranium = two_hop — the most dramatic power-infra cell but the least decision-relevant; the engine should still surface it, tagged, and let judgment decide.) Schema: `fanout_nodes.distance_from_edge`. **No scorer may filter on it.**
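
A sketch of tag-for-triage-only, assuming a hypothetical `FanoutNode` record (only the `distance_from_edge` field name and its three values come from the schema note above; the node names and scores are invented). Triage reorders for human review and never drops a node, which is the property Correction B asks for.

```python
from dataclasses import dataclass

# Review order only; this mapping must never be used as a filter.
TRIAGE_ORDER = {"in_mandate": 0, "one_hop": 1, "two_hop": 2}


@dataclass(frozen=True)
class FanoutNode:
    name: str
    score: float
    distance_from_edge: str

    def __post_init__(self):
        if self.distance_from_edge not in TRIAGE_ORDER:
            raise ValueError(f"unknown distance_from_edge: {self.distance_from_edge}")


def triage(nodes):
    """Sort by distance, then by descending score; the output has every input node."""
    return sorted(nodes, key=lambda n: (TRIAGE_ORDER[n.distance_from_edge], -n.score))


nodes = [
    FanoutNode("uranium", 0.9, "two_hop"),
    FanoutNode("transformers", 0.6, "one_hop"),
    FanoutNode("power-infra picks-and-shovels", 0.7, "in_mandate"),
]
for node in triage(nodes):
    print(node.distance_from_edge, node.name)
```

The highest-scoring cell (uranium) lands last in review order but is still surfaced, tagged, for judgment to decide.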

### The two adversarial failed-conviction cases (the negatives — §1.4 satisfied)
Both are convictions Ten31 still holds where the *falsifiable thesis* mis-fired. Pre-registered criteria committed in `seeds/resolution.STRIKE2022.yaml` / `resolution.BATTERY2022.yaml` *before* outcome-labeling; seeds in `conviction_log.adversarial.seed.yaml` + `fanout.STRIKE2022/BATTERY2022.seed.yaml`.

- **STRIKE2022 (the single most valuable negative).** Failed thesis = Lightning-as-retail-payments-network displacing card rails. It was **narrative-driven** — the bitcoin-podcast cluster told this story loudly in 2022 = the exact single-cluster, reflexive, talk-our-own-book corroboration the relaxed gate would wave through. **The test:** if the engine *clears* the Lightning-payments derivatives on bitcoin-cluster chatter that never showed up in real retail-payment volume, that's the **false positive** that proves the cross-cluster/lineage discipline. Also a two-sided test: did *non-bitcoin* sources ever corroborate, or only the book-talkers?
- **BATTERY2022 (the two-sided / timing case).** Thesis = bitcoin-as-collateral credit goes mainstream (D1). Right on **demand**, early/wrong on **supply** (institutional lending capital didn't arrive at scale). **The test:** would the two-sided scorer (net independence-weighted affirms − denies) have surfaced that the *supply-side* corroboration was flat while *demand-side* rose — a genuine, non-hindsight "timing is early" disconfirmation?

### Corpus implication (flagged, needed to RUN these — not yet ingested)
- The STRIKE false-positive test **requires the bitcoin-cluster Lightning-payments discourse to be in the corpus** — which is the orbit §7.4 *deliberately excluded* (TFTC / Bitcoin Alpha / the Odell-Bent network, as "Ten31's own priors"). **The test of the echo-chamber detector needs the echo chamber in the corpus.** That exclusion was right for Job A discovery and is exactly wrong for this precision test.
- Both cases resolve on **adoption-evidence indicators** (Lightning retail volume; institutional BTC-credit capital / SAB-121 / bank entry), **not equity-price baskets** — a different resolver leg than the power-infra prices. Gathering that evidence (pre-registered) + ingesting the 2022–2024 bitcoin/institutional corpus is the next build to actually run them.

---

## v2.2 (post-bitcoin-cases) — three conditions BUILT, outcomes verified, scope set

### The three conditions (all built + unit-verified)
1. **own_network quarantine — a PERMANENT live-scoring rule, not backtest-only.** `sources.own_network` + `effective_independent_N(mode=...)` default to `mode='live'` everywhere, which DROPS own_network sources (the Odell/Bent partner orbit = Ten31 listening to itself). The only path that keeps them is `two_sided.trajectory(mode='test')` — the explicit reflexivity fixture. Confirmed standing: every forward bitcoin-thesis signal automatically excludes the own_network orbit.
2. **Resolution criteria tightened** to named/countable/dated (hostile-checker standard).
3. **Two-sided net-corroboration** (`signals/two_sided.py`) — independence-weighted affirms − denies — is the instrument for the adversarial cases (NOT runway).
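
Conditions 1 and 3 combine roughly as sketched below. The `Claim` record and `net_corroboration` function are hypothetical stand-ins, not the signatures of `effective_independent_N` or `two_sided.trajectory`, and the weights are illustrative. In `live` mode own_network claims are dropped before netting; `test` mode keeps them, which is the reflexivity fixture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    source: str
    stance: str          # "affirms" | "denies"
    weight: float        # independence weight (illustrative)
    own_network: bool = False


def net_corroboration(claims, mode="live"):
    """Independence-weighted affirms minus denies.

    mode='live' drops own_network sources; mode='test' keeps them.
    """
    if mode not in ("live", "test"):
        raise ValueError(f"unknown mode: {mode}")
    kept = [c for c in claims if mode == "test" or not c.own_network]
    affirms = sum(c.weight for c in kept if c.stance == "affirms")
    denies = sum(c.weight for c in kept if c.stance == "denies")
    return affirms - denies


claims = [
    Claim("own show A", "affirms", 1.0, own_network=True),
    Claim("own show B", "affirms", 1.0, own_network=True),
    Claim("own show C", "affirms", 1.0, own_network=True),
    Claim("independent show", "affirms", 1.0),
    Claim("card-network 10-K", "denies", 1.0),
    Claim("payments trade press", "denies", 1.0),
]
print(net_corroboration(claims, mode="test"))  # 2.0
print(net_corroboration(claims, mode="live"))  # -1.0
```

The sign flip between the two modes is the contrast the STRIKE fixture is built to show: positive when the book-talkers are audible, negative once they are quarantined.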

### Adversarial outcomes (hostile-verified, frozen — `seeds/resolution_outcomes.adversarial.yaml`)
- **STRIKE2022 = thesis DEAD** on all 3 (merchant integration <2-of-3 at scale; no material retail Lightning volume; zero card erosion in 5 Visa/MC 10-Ks). The clean false-positive test.
- **BATTERY2022** = demand REAL (≥4 named lenders); institutional supply FAILED in-window (Cantor $2B announced 2024-07, deployed 2025-05); incumbent-entry TOKEN (Goldman one loan 2022-04 — see S1); SAB-121 rescinded 2025-01-23 (edge).

### Scope (the two cases want OPPOSITE things — do not conflate)
- **STRIKE: narrow and FROZEN.** 3 own_network + 3 independent shows, exactly. Purpose = prove the quarantine: clear in test mode (own_network visible), silent in live (own_network dropped). Widening muddies the contrast — narrow is the feature.
- **BATTERY: scoped SEPARATELY and broader, lineage discipline ON.** Resolution lives in institutional-credit / policy discourse (SAB-121 commentary, bank earnings-call digital-asset-lending treatment, private-credit trade press, named lenders' disclosures) — NOT bitcoin podcasts. Doubles as the first installment of the §3-step-6 lineage-aware expansion. TRAP: most institutional-credit press is downstream of the same few announcements (every outlet on the Cantor $2B headline = ONE event, not corroboration) — tag downstream/same-event at ingestion.

### Standing line (unchanged)
These two cases prove the failure-mode machinery (quarantine catches the echo; two-sided catches the timing split). They are hindsight-known. The precision number that decides whether this is worth running comes only from FORWARD operation on signals nobody pre-selected. Once these run clean, the next move is LIVE, not another backtest.
@@ -0,0 +1,288 @@
|
||||
# Ten31 Signal Engine — Pilot Backtest Write-up
|
||||
|
||||
**Author:** Claude (Claude Code), implementing dev
|
||||
**For:** Grant + the dev who authored the handoff/scoping document
|
||||
**Date:** 2026-06-08
|
||||
**Status:** Pilot build complete; §7.1 backtest executed end-to-end with a *qualified* result. This document is the honest assessment, the judgment calls I made, and the open questions for a second opinion.
|
||||
|
||||
> **Read this as a peer review request, not a victory lap.** The engine works end-to-end and surfaced the right thesis, but the *signal quality* on the current corpus is coarse, and several design tensions in the handoff doc only became visible once there was real data flowing through. Those tensions — especially the cross-cluster gating question Grant raised — are the point of this write-up.
|
||||
|
||||
---
|
||||
|
||||
## 1. Executive summary
|
||||
|
||||
I built the full pilot per the handoff: ingestion (audio + text) → local claim extraction → hybrid vector store → the "scoring brain" (independence-discounted, as-of-disciplined nomination) → the §7.1 backtest → a dual-evaluation ledger. It runs against the operator's real local-compute stack (Spark Control) and a real ~6,600-claim corpus drawn from ~25 companies and a handful of podcasts.
|
||||
|
||||
**The §7.1 backtest verdict is a qualified YES.** Seeded with the 2023 Kirkwood "power is the binding constraint" conviction and marched as-of across 2023–2024, the under-acted-conviction scorer:
|
||||
|
||||
- **surfaced the root thesis cross-cluster in May 2023** (energy *and* AI sources, independent), and
|
||||
- **surfaced the headline derivative ("size up the power-infra picks-and-shovels") in May 2024**, along with transformers and utilities-repriced.
|
||||
|
||||
So the mechanism the project exists to build — *fan a held conviction to its derivatives and catch the world starting to corroborate them* — demonstrably works on real history.
|
||||
|
||||
**But three honest caveats keep it from being a clean win**, and they drive the open questions:
|
||||
|
||||
1. The signal is **noisy** (the acceleration metric swings between earnings seasons; there's visible run-to-run variance).
|
||||
2. The cross-cluster breadth shows up at the **root** level, not the **derivative** level — the specific power-infra derivatives stay energy-cluster-corroborated.
|
||||
3. The derivatives only clear because I **relaxed a cross-cluster gate for Job B** — a judgment call (§7 below) that is exactly what Grant wants to debate.
|
||||
|
||||
The most important open question, in Grant's words: *is strict cross-cluster gating limiting our ability to pick up signal early — and is the real fix to dramatically broaden the cluster taxonomy and the corpus?* I think the answer is largely yes, and I lay out why in §8.
|
||||
|
||||
---
|
||||
|
||||
## 2. What was built (architecture as implemented)
|
||||
|
||||
3,347 lines of Python, 44 modules. Everything local-compute runs through the operator's existing **Spark Control** gateway (we call HTTP endpoints; we did not stand up vLLM/Whisper/Qdrant). The one external call is the bounded frontier step (not exercised in the backtest — see §7, deferred).
|
||||
|
||||
| Layer (handoff §) | What's built | Notes |
|---|---|---|
| **Ingestion — text (§4.1)** | SEC EDGAR (10-K/10-Q/20-F/40-F), FMP earnings-call transcripts | Earnings-call *audio* proved unfetchable (no uniform feed, ~30–90d replay expiry) → FMP transcript API, per §12. Filings dedup on accession; earnings on symbol+quarter. |
| **Ingestion — audio (§4.1, §4.5)** | RSS + YouTube fetch, long-audio chunking (~2.5 min), **Parakeet transcribe + Sortformer diarize + 192-d TitaNet voiceprints**, cross-chunk speaker stitching, a persisted voiceprint library | Verified live: a real podcast → speaker-attributed transcript → claims. |
| **Speaker identity (§4.5)** | Voiceprint cosine matching across episodes/shows **+ LLM speaker-naming** (host/guest from the intro) → name-based independence edges | Grant's idea: name-based overlap is robust to voiceprint drift across shows. Both edge types feed the independence graph. |
| **Extraction (§4.2)** | Local Qwen, the finalized claim schema, JSON-mode, temp 0, "willing to emit zero" | Pluggable backend: **local Qwen (default) or Gemini batch** (validated, for overflow/scale; public corpus only). |
| **Embedding + store (§4.3)** | bge-m3 dense + BM25 sparse → Qdrant hybrid collection; retrieval + rerank via the gateway | Embeds distilled propositions, not raw chunks. |
| **Scoring brain (§4.4, §4.5)** | EISC independence primitive; as-of harness; windowed acceleration; **under-acted-conviction (Job B) scorer**; the quantitative bar; ledger writer; resolver (stub) | See §3. Job A scorers (emergence/stance/intersection) and the frontier judge/fan-out are **deferred** per the blueprint build-order — the backtest is Job B only. |
| **Backfill queue (§13.4)** | Client-side GPU-hours queue: idempotent, leased/crash-safe, prioritized | Extraction ran ~900 docs on one GPU as a serial job. Transcription on the other GPU in parallel. |
| **Provenance / dedup** | Layered: stable item-id (robust pre-GPU guard) + normalized title/date (cross-mirror) + content-hash (audit only) | Corrected after Grant flagged that a transcript hash is a brittle dedup key. |
| **Ledger (§4.7, §6)** | SQLite dual-evaluation ledger; logs every bar-clearer; resolution columns separated from scoring (look-ahead guard) | Live with its first entries. |
| **UI** | FastAPI corpus-management app (dashboard, add/view sources, inspect per-source claims) | The "menu" to grow and audit the corpus over time. |
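
The layered dedup in the Provenance row can be illustrated with a minimal sketch. It assumes a simple in-memory index; the class name `Deduper`, the helper `normalized_key`, and the exact normalization rule are hypothetical and not taken from the repository. The point it shows is the ordering: the stable item id and the normalized title/date decide whether an item is new, while the content hash is recorded for audit only and never used as a dedup key.

```python
import hashlib
import re


def normalized_key(title, published):
    """Lowercase the title, collapse punctuation/whitespace, and append the date."""
    slug = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return f"{slug}|{published}"


class Deduper:
    """Hypothetical layered dedup index (illustrative, not the project's code)."""

    def __init__(self):
        self.item_ids = set()      # layer 1: stable item id (cheap, checked before any GPU work)
        self.title_dates = set()   # layer 2: normalized title + date (catches cross-mirror copies)
        self.audit = {}            # layer 3: content hash, stored for audit only

    def is_new(self, item_id, title, published, text=None):
        key = normalized_key(title, published)
        if item_id in self.item_ids or key in self.title_dates:
            return False
        self.item_ids.add(item_id)
        self.title_dates.add(key)
        if text is not None:
            # The hash is recorded but never consulted for dedup decisions.
            self.audit[item_id] = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return True


dedup = Deduper()
print(dedup.is_new("ep-1", "Power Is the Constraint!", "2023-05-01", "transcript a"))
print(dedup.is_new("yt-9", "power is the constraint", "2023-05-01", "transcript b"))
```

In this sketch the second call is rejected even though the item id and the transcript text both differ, which is the cross-mirror case a transcript hash alone would miss.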
|
||||
|
||||
**Corpus the backtest ran on (snapshot):** 6,569 claims (5,129 embedded at backtest time), from 411 filings + 410 earnings transcripts + 82 podcast episodes (4 RSS-full shows for 2022–2023: Dwarkesh, Hidden Forces, All-In, Invest Like the Best; plus a partial Catalyst slice). Claim types: 2,780 predictive / 1,447 interpretive / 2,267 descriptive / 75 reactive. Clusters: **energy 3,135 · ai_tech 2,329 · bitcoin 765 · vc_consensus 139 · macro 103 · generalist 98.** 90 voiceprints (35 named), 10 shared-guest edges.
|
||||
|
||||
**Note the cluster imbalance** — it's central to §8. The corpus is overwhelmingly company filings/earnings (two clusters, energy + ai_tech) with a thin podcast layer. That is not a balanced cross-cluster corpus.
|
||||
|
||||
---
|
||||
|
||||
## 3. The scoring brain (how nomination works)
|
||||
|
||||
This is the part where the handoff's hard constraints (§5) had to become concrete code. Design was done via a 3-way design panel (statistical / graph / pragmatic lenses) synthesized into one blueprint; I then built it.
|
||||
|
||||
- **EISC — Effective Independent Source Count (the §4.5 differentiator).** Given the sources converging on a topic, discount by connectedness using a noisy-OR connectedness matrix + inverse-row-sum. Verified on synthetic cases: 5 identical clones → ~1.0 voice; 5 cross-cluster independents → ~5.0; all-bitcoin → floored ~0.4; "one guest doing the rounds" across many shows → ~1.0. (I improved the cross-cluster multiplier over the blueprint so a single guest spanning many clusters can't fake the gold-tier bonus.) **Every count that feeds a score routes through EISC — never a raw source count.**
|
||||
- **As-of harness (§6.6).** Every scorer reads an `as_of`-filtered view; nothing reads the raw claims table. At nomination time only claims dated ≤ as_of are visible. This is what makes the backtest honest (no look-ahead).
|
||||
- **Windowed acceleration (§4.4).** The signal is the discrete 2nd derivative of the EISC-weighted claim flow per topic — *not* raw size. Window length must match corpus cadence (90 days for quarterly filings; 28 for weekly podcasts).
|
||||
- **Under-acted-conviction / Job B (§4.4).** `conviction_weight × exposure_gap × rising_independent_corroboration`. Corroboration = retrieve (hybrid search) → LLM filter to affirms-only → independence-weighted acceleration over the confirmed set. **Exposure is joined locally and never crosses the frontier boundary** (§4.6).
|
||||
- **The quantitative bar (§5.1).** Two tiers: an *evidence bar* (clears hard gates → log a ledger row, the denominator) and a *promotion bar* (also clears a score threshold → would go to the frontier judge). Stats nominate; the model would only judge a pre-filtered shortlist.
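
To make the EISC bullet concrete, here is a minimal sketch of a noisy-OR connectedness matrix with an inverse-row-sum count. It is an illustration under assumptions, not the project's implementation: the function names and the edge representation are hypothetical, and the cluster floor and cross-cluster multiplier are left out. The coupling values 0.85 (shared guest) and 0.60 (community) are the κ values listed in the appendix.

```python
from itertools import combinations


def noisy_or(couplings):
    """Combine several coupling strengths (each in [0, 1]) into one connectedness value."""
    p_independent = 1.0
    for k in couplings:
        p_independent *= 1.0 - k
    return 1.0 - p_independent


def eisc(n_sources, edges):
    """Effective independent source count.

    edges maps an (i, j) pair of source indices to the list of coupling
    strengths linking them. Each source contributes 1 / (its row sum).
    """
    conn = [[1.0 if i == j else 0.0 for j in range(n_sources)] for i in range(n_sources)]
    for (i, j), couplings in edges.items():
        conn[i][j] = conn[j][i] = noisy_or(couplings)
    return sum(1.0 / sum(row) for row in conn)


pairs = list(combinations(range(5), 2))
clones = {p: [1.0] for p in pairs}             # five copies of one voice
independents = {}                              # no edges at all
one_guest = {p: [0.85, 0.60] for p in pairs}   # shared guest AND same community

print(round(eisc(5, clones), 3))        # collapses to about one voice
print(round(eisc(5, independents), 3))  # stays at five voices
print(round(eisc(5, one_guest), 3))     # a little above one voice
```

The two limiting cases reproduce the synthetic checks described above (clones collapse to about 1, independents stay at 5). Which edge types apply to a given pair is an assumption in the third case.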
|
||||
|
||||
---
|
||||
|
||||
## 4. The §7.1 backtest — methodology
|
||||
|
||||
Per the handoff (§7.1 is the headline pilot test), I ran it **before** any forward pilot.
|
||||
|
||||
- **Seed:** the 2023 Kirkwood conviction `K2023` ("compute will ~1000x; energy becomes the binding constraint; interruptible load is the edge"), logged in the human-owned conviction log with high conviction / low exposure (`lt2`).
|
||||
- **Fan-out (v1, hand-written):** Per the blueprint's build order, I **hand-wrote** the 2nd/3rd-order derivative tree (grid interconnect, transformers, substations, cooling, gas turbines, nuclear, uranium, utilities repriced, and the headline "size up power-infra picks-and-shovels"). *Why hand-written:* it removes the frontier from the first backtest and isolates the real question — **does the scoring surface the derivative once it exists?** — from the separate question of whether the frontier can *propose* the right derivatives. (That second question is untested; see §6.)
|
||||
- **Run:** marched a quarterly `as_of` from 2023-03 to 2024-09 (7–9 points), 90-day windows. At each as_of, for each derivative: retrieve corroboration from the corpus, LLM-filter to genuine affirmations, compute independence-weighted acceleration, apply the bar, log every clearer to the ledger.
|
||||
- **Look-ahead control:** all retrieval/scoring at as_of only sees claims dated ≤ as_of. The resolver (forward leg) is a separate, isolated pass (a stub for now — see §6).
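
The march and the look-ahead control can be sketched together. This is a minimal illustration assuming claims are plain dicts with a `date` field; `as_of_march` and `visible_claims` are hypothetical names, and the real harness reads a filtered view of the claims store rather than a list. With a 90-day step from 2023-03-01 to 2024-09-01 it produces the seven quarterly dates that appear in the detail tables of Appendix A.

```python
from datetime import date, timedelta


def as_of_march(start, end, step_days):
    """Yield evaluation dates from start to end (inclusive) in fixed steps."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=step_days)


def visible_claims(claims, as_of):
    """Return only the claims dated on or before as_of (the look-ahead guard)."""
    return [c for c in claims if c["date"] <= as_of]


claims = [
    {"id": "c1", "date": date(2023, 4, 12)},
    {"id": "c2", "date": date(2023, 5, 30)},
    {"id": "c3", "date": date(2023, 6, 1)},
]

for as_of in as_of_march(date(2023, 3, 1), date(2024, 9, 1), 90):
    ids = [c["id"] for c in visible_claims(claims, as_of)]
    print(as_of.isoformat(), ids)
```

A scorer that only ever receives the output of `visible_claims` cannot see `c3` at the 2023-05-30 step, which is the property the backtest depends on.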
|
||||
|
||||
---
|
||||
|
||||
## 5. The §7.1 backtest — results
|
||||
|
||||
I ran it twice: once on the company-only corpus (~4,500 claims), then a "sharpened" re-run after the cross-cluster podcast claims landed (~5,100 embedded). **Presenting both is deliberate — the differences between them are themselves a finding (run-to-run variance / noise).**
|
||||
|
||||
### Run 1 — company corpus (~4,500 claims)
|
||||
| Derivative | First cleared evidence bar | Evidence at clear |
|---|---|---|
| **Root: "power is the binding constraint"** | **2023-05-30** | EISC 3.0, 4 sources, **k_eff=2 (cross-cluster: energy+AI)**, accel +1.0 |
| **Headline: "picks-and-shovels"** | 2024-05-24 | EISC 2.0, 5 sources, k_eff=1, score 2.56 |
| Utilities repriced | 2024-05-24 | EISC 2.5, **8 sources**, k_eff=1, built steadily from 2023 (src 1→2→4→8) |
| nuclear / transformers / gas / uranium / cooling | peaked but did **not** clear | EISC or acceleration fell short in the cleared window |
|
||||
|
||||
### Run 2 — + cross-cluster podcast claims (~5,100 embedded)
|
||||
| Derivative | First cleared | Note |
|---|---|---|
| **Root** | **2023-05-30** | unchanged (cross-cluster) |
| **Headline: "picks-and-shovels"** | **2024-05-24** | peak 3.33; notably it *scored* 3.33 back at 2023-11 but EISC was 1.6, just under the 2.0 floor, so it logged-but-didn't-clear then |
| **Transformers** | **2024-05-24** | newly cleared (peak 4.80) |
| Uranium | did not clear | peak 7.04 (!) but never simultaneously cleared all gates |
| **Utilities repriced** | did **not** clear | cleared in Run 1, *not* in Run 2 — **this is the run-to-run variance / noise, exhibited directly** |
|
||||
|
||||
**What the numbers say, honestly:**
|
||||
|
||||
- The **root thesis is a genuinely clean result** — it cleared cross-cluster (k_eff=2) in May 2023 in both runs, *independent of the contested design call*. The system would have flagged "the world is starting to corroborate that power is the binding constraint, and Ten31 is under-exposed" in mid-2023.
|
||||
- The **derivatives surface, but messily.** They clear mid-2024, mostly single-cluster, and *which* ones clear shifts between runs. The acceleration (2nd derivative) flips sign between earnings seasons (`+2.6 → −2.2 → +1.6 → −1.0`), so a derivative clears in whatever window the curvature happens to be positive. That is fragile.
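
The sign-flipping can be reproduced with a small sketch of a three-window second difference. It assumes each corroborating claim arrives as a `(date, weight)` pair, where the weight stands in for its independence-weighted contribution; the function names and the toy events are hypothetical. The example shows how a burst of claims in one earnings season yields positive curvature at one as-of and negative curvature at the next.

```python
from datetime import date


def window_flows(events, as_of, window_days, n_windows=3):
    """Sum weights into n_windows consecutive windows ending at as_of, oldest first."""
    flows = [0.0] * n_windows
    for when, weight in events:
        age = (as_of - when).days
        # Ignore anything after as_of (age < 0) or older than the lookback.
        if 0 <= age < window_days * n_windows:
            flows[n_windows - 1 - age // window_days] += weight
    return flows


def acceleration(flows):
    """Discrete second difference of the three most recent windows."""
    oldest, middle, newest = flows[-3:]
    return newest - 2.0 * middle + oldest


# Toy data: one claim per quarter, plus a burst of three in May 2024.
events = [
    (date(2023, 10, 30), 1.0),
    (date(2024, 2, 1), 1.0),
    (date(2024, 5, 1), 1.0),
    (date(2024, 5, 10), 1.0),
    (date(2024, 5, 15), 1.0),
    (date(2024, 8, 1), 1.0),
]

for as_of in (date(2024, 5, 24), date(2024, 8, 22)):
    flows = window_flows(events, as_of, window_days=90)
    print(as_of.isoformat(), flows, acceleration(flows))
```

The burst lands in the newest window at the first date and in the middle window at the second, so the second difference goes from +2.0 to -4.0 without the underlying topic having reversed.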
|
||||
|
||||
---
|
||||
|
||||
## 6. Honest assessment
|
||||
|
||||
### What worked well
|
||||
1. **The end-to-end machine is real and disciplined.** Ingest (text *and* audio) → local extraction → hybrid store → independence-discounted nomination → as-of-honest backtest → ledger. It runs on the operator's actual stack, on a real multi-thousand-claim corpus.
|
||||
2. **The EISC independence primitive does its job.** "Five shows, one guest" collapses to ~1 voice; the bitcoin cluster is structurally floored; cross-cluster gets the bonus. This is the heart of §4.5 and it behaves correctly and auditably (every score is reconstructable from its inputs).
|
||||
3. **Extraction discipline holds.** The extractor emits *zero* on boilerplate (8-Ks, 10-K front-matter) and rich, well-typed claims on earnings Q&A (~82% interpretive/predictive vs. descriptive). Earnings calls massively out-yield filings for signal — a concrete finding that confirms a §4.1 hypothesis.
|
||||
4. **The root-thesis result is the real validation.** The single most important thing §7.1 asked — would the engine have surfaced this in time — is *yes* for the root conviction, cross-cluster, in 2023.
|
||||
5. **The as-of discipline + the ledger are correct by construction.** Resolution is structurally separated from scoring; the denominator started day one; the model never sees a human rating before logging. The anti-self-deception machinery is in place.
|
||||
|
||||
### Limitations & open questions (the important half)
|
||||
1. **Noise on sparse, quarterly, single-domain data.** The 2nd-derivative acceleration is fragile when claims cluster in earnings seasons. The blueprint *deliberately deferred* the statistical smoothing (weighted-quadratic fits, significance gates, shrinkage) as premature at small n. **Open question:** with a bigger corpus, is raw 2nd-difference enough, or do we need that smoothing now? The run-to-run variance suggests we need *something*.
|
||||
2. **Cross-cluster breadth is at the root, not the derivatives.** The diagnosis was concrete: in 2022–2023, AI-company *earnings* barely mentioned electricity as a constraint (that narrative hit 2024–25). So the niche power-infra derivatives are corroborated almost entirely by the *energy* cluster. The cross-domain early discussion lived in *specialist* discourse (energy/macro podcasts), which we under-sampled. **This is the crux — see §8.**
|
||||
3. **The frontier fan-out is untested.** The backtest used a *hand-written* derivative tree. We have **not** validated whether the frontier model, given the seed conviction, would *propose* the right derivatives (grid/transformers/nuclear/…). That's a separate and important test (it's the other half of Job B). It's deferred, not done.
|
||||
4. **No lead-time measured yet.** The resolver (external-confirmation leg) is a stub. We can say the engine *surfaced* the derivatives at specific dates, but we have not yet measured earliness against the *actual* repricing of power infrastructure (the alpha measurement, §6.3). That needs price/event data and forward time.
|
||||
5. **Filing extraction targets the wrong thing.** It reads filings front-to-back; 10-K front-matter and risk-factors are low-yield. It should target Item 7 (MD&A). This skews filing claims toward boilerplate and likely costs us signal.
|
||||
6. **Stance/relation extraction is thin.** The local extractor sees one chunk at a time, so it rarely wires the cross-document `relation` links the §4.2 schema assumes. The Job A contrarian scorer therefore needs a separate LLM stance-folding pass (designed, not built). **Worth flagging to the handoff author:** the schema implies relation-linking that is hard to populate at extraction time.
|
||||
|
||||
---
|
||||
|
||||
## 7. Judgment calls I made (please scrutinize all of these)
|
||||
|
||||
Every place I made a decision the handoff didn't fully specify, or where I diverged:
|
||||
|
||||
1. **[BIGGEST] Relaxed the cross-cluster gate for Job B.** The design blueprint applied the §4.5 cross-cluster rule (`k_eff ≥ 2`) as a *universal* hard gate. I removed it as a *hard gate for the under-acted-conviction (Job B) scorer* — keeping EISC ≥ 2.0 (genuine independence) and a ≥2-source requirement, and letting cross-cluster *boost the score* instead of gating it. **Rationale:** the handoff §4.4 defines Job B as *"rising independent corroboration,"* whereas §4.5's cross-cluster-is-gold framing is about Job A *discovery* (avoiding echo chambers). N independent energy companies confirming a power thesis is corroboration, not an echo. **This is the difference between the derivatives clearing or not** — with the strict gate, *only the root clears* (cross-cluster, 2023). This is the #1 thing to debate (§8).
|
||||
2. **Window length = 90 days for the backtest** (blueprint default was 28). 28-day windows are degenerate on quarterly filings/earnings (most windows empty). Made it configurable; 90d for filing-cadence corpora, 28d for weekly podcasts. *Open question: mixed-cadence corpora (filings + podcasts) want different windows simultaneously — currently one global value.*
|
||||
3. **Improved the EISC cross-cluster multiplier.** Blueprint counted "distinct non-capped clusters present." I changed it to count only clusters that contribute ≥ 0.5 of an independent voice — so "one guest spanning 4 clusters" can't earn the gold multiplier. (A correctness fix, not a divergence in intent.)
|
||||
4. **Hand-wrote the fan-out for v1** (per blueprint build-order). The derivative *phrasings* are mine, and the LLM relevance filter judges corroboration against those phrasings — so wording matters. A frontier-generated tree might phrase them to match the corpus better (or worse). Untested.
|
||||
5. **Deferred the statistical-significance machinery** (Design 1's fitted curves / bootstraps / z-gates) as premature at pilot n — kept the hard minimum-evidence gates, not the smoothing. This is *why* the signal is noisy. Reconsider as the corpus grows (§6.1).
|
||||
6. **Build order: Job B first; Job A (emergence/stance/intersection) and the frontier judge/fan-out deferred.** So the backtest tested Job B only, with no frontier in the loop. Faithful to the blueprint, but it means large parts of the §4 design are designed-not-built.
|
||||
7. **Filings = 10-K/10-Q/20-F/40-F only** (skipped 8-K/6-K as low-yield current-reports). Earnings via FMP. Podcasts = the 4 RSS-full shows + a partial Catalyst slice. **I did not get full 2022–2023 coverage of the specialist energy/macro podcasts** (Catalyst/Columbia Energy/Macro Voices/Odd Lots); they're YouTube-only with slow date-windowed enumeration. This under-samples exactly the cluster breadth the derivatives needed.
|
||||
8. **Local Qwen for all extraction + scoring LLM helpers.** Gemini validated as an overflow backend but not used in the backtest.
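
The multiplier change in item 3 can be sketched as follows. This is an assumption-laden illustration: it attributes each source's inverse-row-sum share to that source's cluster and counts a cluster only when its shares add up to at least 0.5 of a voice. The function names are hypothetical, and handling of capped clusters such as bitcoin is omitted.

```python
from collections import defaultdict


def k_eff_blueprint(clusters):
    """Blueprint-style count: distinct clusters present among the sources."""
    return len(set(clusters))


def k_eff_weighted(conn, clusters, min_voice=0.5):
    """Count only clusters contributing at least min_voice of an independent voice.

    conn is the symmetric connectedness matrix (1.0 on the diagonal);
    clusters[i] is the cluster label of source i.
    """
    per_cluster = defaultdict(float)
    for row, cluster in zip(conn, clusters):
        per_cluster[cluster] += 1.0 / sum(row)
    return sum(1 for voice in per_cluster.values() if voice >= min_voice)


# One guest appearing on four shows, each show in a different cluster.
k = 0.85  # shared-guest coupling
one_guest = [[1.0 if i == j else k for j in range(4)] for i in range(4)]
labels = ["energy", "ai_tech", "macro", "generalist"]

print(k_eff_blueprint(labels))            # counts four clusters
print(k_eff_weighted(one_guest, labels))  # no cluster reaches half a voice

# Two unconnected sources in two clusters still count as cross-cluster.
two_independent = [[1.0, 0.0], [0.0, 1.0]]
print(k_eff_weighted(two_independent, ["energy", "ai_tech"]))
```

Under these assumptions the single guest contributes roughly 0.28 of a voice per cluster, so none of the four clusters counts, while two unconnected sources in two clusters still count as two.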
|
||||
|
||||
---
|
||||
|
||||
## 8. The central debate: cross-cluster gating vs. corpus breadth
|
||||
|
||||
This is the section to take into the brainstorm. Grant's framing (paraphrased): *strict cross-cluster gating may limit our ability to pick up signal early; perhaps the real fix is that the cluster list is too small and there isn't enough breadth within each cluster, so the corpus needs to be dramatically increased.* I think this is the right instinct, and here's the structured case.
|
||||
|
||||
### The tension, precisely
|
||||
- §4.5 is unambiguous and correct *for Job A discovery*: cross-cluster convergence is gold, within-cluster is near-noise (five bitcoin shows agreeing = the prior, not signal).
|
||||
- But **Job B (derivatives / fan-out) has the opposite early-signal dynamic.** A niche derivative's *earliest* corroboration almost always comes from the single most-relevant cluster — the people closest to it. Power-infra repricing showed up *first* in energy-company earnings and energy-specialist discourse, and only *later* spread to AI companies and generalist macro. **Requiring cross-cluster corroboration means you only fire once the signal has already spread — which is precisely when you've lost the lead time.** The backtest demonstrates this exactly: the cross-cluster version of the signal (the root) is real but broad; the *actionable derivative* corroboration is single-cluster and earlier.
|
||||
|
||||
This is, I think, a genuine gap in the handoff: §4.5's "within-cluster is near-noise" was written with discovery in mind and is in tension with §4.4's "rising independent corroboration" for Job B. The implementation had to pick; I picked "relax for Job B." **The dev who wrote the spec should weigh in on whether that's the intended reading.**
|
||||
|
||||
### Why this points at corpus breadth (Grant's hypothesis), and I agree
|
||||
The reason single-cluster corroboration feels uncomfortable is the fear of an echo chamber (energy companies talking their book). **The principled fix isn't to demand cross-cluster — it's to make "independent within a domain" *mean something*, which requires breadth.** Right now:
|
||||
- We have **6 coarse clusters**, and the corpus is dominated by **two** of them (energy, ai_tech), almost entirely **company filings/earnings**. Within "energy," CEG/VST/TLN/NEE are independent issuers but they're all *sell-side-of-their-own-demand* — partly correlated by construction.
|
||||
- A handful of podcasts (4 shows) provide the only non-company voices, and the *specialist* energy/macro podcasts that would carry the early cross-domain signal weren't ingested for the backtest window.
|
||||
|
||||
So the corpus is both **too narrow** (few clusters, two dominant) and **too shallow within clusters** (few genuinely independent voice-types per cluster). Two complementary directions:
|
||||
|
||||
1. **Finer cluster taxonomy.** "Energy" → {power utilities, grid/equipment, nuclear/uranium, gas, energy-specialist media}. "AI/tech" → {chips, hyperscalers, data-center REITs, AI-specialist media}. Add clusters the pilot omitted entirely: **sell-side research, trade press / industry newsletters, expert-network transcripts, specialist substacks, conference/earnings-adjacent commentary, policy/regulatory.** With a finer taxonomy, *cross-sub-cluster* convergence (e.g., a nuclear operator **and** a grid-equipment maker **and** an energy-trade newsletter) becomes a meaningful *early* signal — and the strict cross-cluster gate becomes defensible again because the clusters are now granular enough to converge early.
|
||||
2. **Dramatically more breadth within each cluster.** More issuers, far more podcasts/media, and crucially the *specialist* sources where derivatives are discussed first. This is the difference between "4 energy companies" (correlated) and "20 independent energy-ecosystem voices of different types" (genuinely independent).
|
||||
|
||||
### My recommendation for the debate (not a decision — a starting position)
|
||||
- **Short term:** keep Job B's gate at *independence* (EISC ≥ 2, ≥2 sources) for the **evidence/logging tier** — so we *catch and log* early single-cluster corroboration and start the lead-time clock — and use **cross-cluster as the promotion/confidence tier** (the thing we'd actually act on). This preserves earliness *and* honesty: we log the early single-cluster whisper, but we don't treat it as high-confidence until it's broadened.
|
||||
- **Medium term (the real fix, Grant's point):** broaden the cluster taxonomy and dramatically expand the corpus — especially the specialist/media sources and finer sub-clusters. This likely does more for signal quality than any scoring tweak, and it would let us *re-tighten* the cross-cluster requirement without losing earliness, because convergence would happen earlier across a richer cluster space.
|
||||
- **Either way:** build the **resolver / lead-time** measurement next, because *"did it clear the bar"* is far less interesting than *"how early did it clear vs. the actual repricing"* — and that number is what tells us whether the relaxed gate is finding alpha or just noise.
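
The short-term proposal amounts to a two-tier classifier, sketched below. It is a starting-position illustration only: the `Evidence` dataclass and `tier` function are hypothetical names, the thresholds are the ones quoted in this write-up (EISC ≥ 2, ≥2 sources, `k_eff ≥ 2`), and the score floor is deliberately left out. The three example rows are taken from the detail tables in Appendix A.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    eisc: float    # independence-weighted source count
    sources: int   # distinct sources
    k_eff: int     # distinct independent clusters


def tier(e, eisc_floor=2.0, min_sources=2, min_clusters=2):
    """Classify a candidate as 'none', 'evidence' (log it), or 'promotion' (act on it)."""
    if e.eisc < eisc_floor or e.sources < min_sources:
        return "none"
    if e.k_eff >= min_clusters:
        return "promotion"
    return "evidence"


root_2023_05 = Evidence(eisc=3.0, sources=4, k_eff=2)
picks_2024_05 = Evidence(eisc=2.0, sources=5, k_eff=1)
picks_2023_11 = Evidence(eisc=1.6, sources=3, k_eff=1)

print(tier(root_2023_05))   # promotion
print(tier(picks_2024_05))  # evidence
print(tier(picks_2023_11))  # none
```

Under this split the root would be promoted in May 2023, the headline derivative would be logged (and start its lead-time clock) in May 2024, and the November 2023 reading would still fall short on EISC.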
|
||||
|
||||
---
|
||||
|
||||
## 9. Suggested agenda for the brainstorm with the handoff author
|
||||
|
||||
1. **The §4.4-vs-§4.5 tension for Job B.** Is "rising independent corroboration" meant to allow single-cluster (independent-within-domain) corroboration, with cross-cluster as a confidence multiplier? Or is cross-cluster a hard requirement even for derivatives (accepting later signal)? *This is the load-bearing question.*
|
||||
2. **Cluster taxonomy + corpus breadth.** How far to broaden clusters and sources? Which new source *types* matter most (sell-side, trade press, expert networks, specialist media)? What's the target corpus size for the cross-cluster signal to be early *and* honest?
|
||||
3. **The temporal statistic.** Is raw 2nd-difference acceleration the right signal, or do we adopt the deferred smoothing now? The run-to-run variance argues for the latter.
|
||||
4. **Frontier fan-out validation.** Design a test for whether the frontier *proposes* the right derivatives from a seed conviction (the untested half of Job B).
|
||||
5. **Lead-time / resolution.** What external-confirmation data (price, signed deals, policy) feeds the resolver, and how do we grade earliness?
|
||||
6. **Filing extraction → MD&A targeting**, and the relation/stance extraction gap (does the §4.2 schema's relation-linking need a dedicated pass?).
|
||||
|
||||
---
|
||||
|
||||
## 10. Appendix
|
||||
|
||||
**Corpus at backtest time:** 6,569 claims (5,129 embedded) · 411 filings + 410 earnings + 82 podcasts + 3 youtube · 47 sources · 90 voiceprints (35 named) · 10 shared-guest edges · 4 ledger rows · 81 candidate-score rows.
|
||||
|
||||
**Key parameters:** windows 90d × 3 (84/270-day lookback); EISC floor 2.0; under-acted score floor 0.3; coupling κ {shared_guest 0.85, citation 0.45, community 0.60}; cluster coupling {bitcoin 0.55, vc_consensus 0.35, other-same 0.25}; bitcoin/capped contribution ≤ 0.25.
|
||||
|
||||
**The contested gate, in code:** `signal_engine/signals/bar.py::_under_acted` — the `k_eff ≥ 2` requirement is commented out with the rationale; re-adding it reverts to "only the root clears."
|
||||
|
||||
**Reproduce:** `python -m signal_engine backtest --conviction K2023 --start 2023-03-01 --end 2024-09-01 --step-days 90 --window-days 90`. Trajectories print per-derivative with the evidence at each as_of.
|
||||
|
||||
**Module map:** `ingest/` (fetch + transcribe + diarize + identify), `extract/` (claims + backends), `embedstore/` (Qdrant hybrid), `signals/` (the scoring brain: independence, asof, windows, under_acted, bar, ledger_writer, resolver, run), `frontier/` (designed, deferred), `spark/` (the single gateway client), `store/` (schema + seeds), `ui/` (corpus app).
|
||||
|
||||
---
|
||||
|
||||
*Bottom line for the brainstorm: the engine is built, disciplined, and it surfaced the right thesis on real history. The honest gap is signal quality, and the highest-leverage fix is almost certainly corpus breadth + a finer cluster taxonomy (Grant's instinct), which would also let us resolve the cross-cluster gating debate from a position of strength rather than scarcity.*
|
||||
|
||||
---
|
||||
|
||||
> **Note on dates:** the quarterly as-of march is 2023-03, -05, -08, -11, 2024-02, -05, -08. The **2023-12 and 2024-03** columns are two ad-hoc single-date smoke runs (off the quarterly grid) that happen to be stored in the same table — included for completeness. The score for the SAME node at adjacent dates (e.g. 2023-11 vs 2023-12) swinging from 3.3 to 0 is itself a vivid illustration of the cadence-sensitivity problem.
|
||||
|
||||
## Appendix A — Full score trajectories (the noise, concretely)
|
||||
|
||||
Every under-acted-conviction node × every as-of date that was scored. `★` = cleared the evidence bar. The point of showing this: watch the score and the acceleration `a` swing between adjacent quarters — that is the noise the write-up (§6.1) describes.
|
||||
|
||||
| derivative | 2023-03 | 2023-05 | 2023-08 | 2023-11 | 2023-12 | 2024-02 | 2024-03 | 2024-05 | 2024-08 |
|---|---|---|---|---|---|---|---|---|---|
| K2023 | 0.0 | 2.4★ | 0.0 | 0.0 | 1.6 | 0.8 | 0.8 | 0.0 | 0.0 |
| K2023-cooling | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 1.6 | 1.6 | 0.0 | 0.0 |
| K2023-gas-turbines | 0.0 | 0.0 | 0.0 | 0.8 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 |
| K2023-grid-interconnect | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| K2023-nuclear | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.8 | 0.0 | 0.0 | 0.0 |
| K2023-picks-and-shovels | 0.0 | 0.0 | 0.0 | 3.3 | 0.0 | 0.0 | 0.0 | 2.6★ | 0.0 |
| K2023-transformers | 0.0 | 0.0 | 0.8 | 0.5 | 0.0 | 0.0 | 0.0 | 4.8★ | 0.0 |
| K2023-uranium | 0.0 | 0.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| K2023-utilities-repriced | 0.8 | 0.0 | 0.8 | 0.0 | 0.8 | 0.0 | 1.6 | 0.0 | 0.0 |
|
||||
|
||||
### Detail — the acceleration sign-flips (why it's noisy)
|
||||
|
||||
For the headline derivative and the root, the raw inputs at each as-of (conf=confirmed corroborating claims, src=distinct sources, eisc=independence-weighted count, a=acceleration/2nd-derivative, k_eff=distinct independent clusters):
|
||||
|
||||
|
||||
**K2023**
|
||||
|
||||
| as_of | score | cleared | conf | src | eisc | a | k_eff |
|---|---|---|---|---|---|---|---|
| 2023-03-01 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2023-05-30 | 2.40 | YES | 6 | 4 | 3.0 | 1.0 | 2 |
| 2023-08-28 | 0.00 | — | 6 | 4 | 0.0 | -5.0 | 0 |
| 2023-11-26 | 0.00 | — | 6 | 4 | 0.0 | 3.0 | 0 |
| 2023-12-01 | 1.60 | — | 6 | 1 | 1.0 | 2.0 | 1 |
| 2024-02-24 | 0.80 | — | 7 | 4 | 1.0 | 1.0 | 1 |
| 2024-03-01 | 0.80 | — | 6 | 4 | 1.0 | 1.0 | 1 |
| 2024-05-24 | 0.00 | — | 9 | 6 | 1.6 | -0.4 | 1 |
| 2024-08-22 | 0.00 | — | 10 | 7 | 1.0 | -1.2 | 1 |
|
||||
|
||||
**K2023-picks-and-shovels**
|
||||
|
||||
| as_of | score | cleared | conf | src | eisc | a | k_eff |
|---|---|---|---|---|---|---|---|
| 2023-03-01 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2023-05-30 | 0.00 | — | 2 | 2 | 1.0 | -1.0 | 1 |
| 2023-08-28 | 0.00 | — | 2 | 2 | 0.0 | -1.0 | 0 |
| 2023-11-26 | 3.33 | — | 4 | 3 | 1.6 | 2.6 | 1 |
| 2023-12-01 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2024-02-24 | 0.00 | — | 5 | 3 | 1.0 | -2.2 | 1 |
| 2024-03-01 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2024-05-24 | 2.56 | YES | 10 | 5 | 2.0 | 1.6 | 1 |
| 2024-08-22 | 0.00 | — | 5 | 3 | 0.0 | -1.0 | 0 |
|
||||
|
||||
**K2023-utilities-repriced**
|
||||
|
||||
| as_of | score | cleared | conf | src | eisc | a | k_eff |
|---|---|---|---|---|---|---|---|
| 2023-03-01 | 0.80 | — | 1 | 1 | 1.0 | 1.0 | 1 |
| 2023-05-30 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2023-08-28 | 0.80 | — | 1 | 1 | 1.0 | 1.0 | 1 |
| 2023-11-26 | 0.00 | — | 3 | 2 | 1.0 | -1.0 | 1 |
| 2023-12-01 | 0.77 | — | 4 | 2 | 1.6 | 0.6 | 1 |
| 2024-02-24 | 0.00 | — | 4 | 3 | 1.0 | 0.0 | 1 |
| 2024-03-01 | 1.60 | — | 7 | 4 | 2.0 | 1.0 | 1 |
| 2024-05-24 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2024-08-22 | 0.00 | — | 16 | 7 | 2.286 | -1.714 | 1 |
|
||||
|
||||
**K2023-nuclear**
|
||||
|
||||
| as_of | score | cleared | conf | src | eisc | a | k_eff |
|---|---|---|---|---|---|---|---|
| 2023-03-01 | 0.00 | — | 6 | 4 | 1.0 | 0.0 | 1 |
| 2023-05-30 | 2.05 | — | 5 | 3 | 1.6 | 1.6 | 1 |
| 2023-08-28 | 0.00 | — | 10 | 7 | 1.0 | -7.0 | 1 |
| 2023-11-26 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2023-12-01 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2024-02-24 | 0.80 | — | 6 | 4 | 1.0 | 1.0 | 1 |
| 2024-03-01 | 0.00 | — | 2 | 2 | 0.0 | 0.0 | 0 |
| 2024-05-24 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2024-08-22 | 0.00 | — | 12 | 4 | 1.0 | -2.0 | 1 |
|
||||
|
||||
**K2023-transformers**
|
||||
|
||||
| as_of | score | cleared | conf | src | eisc | a | k_eff |
|---|---|---|---|---|---|---|---|
| 2023-03-01 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2023-05-30 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2023-08-28 | 0.80 | — | 1 | 1 | 1.0 | 1.0 | 1 |
| 2023-11-26 | 0.48 | — | 4 | 2 | 1.0 | 0.6 | 1 |
| 2023-12-01 | 0.00 | — | 0 | 0 | 0.0 | 0.0 | 0 |
| 2024-02-24 | 0.00 | — | 4 | 2 | 0.0 | -1.0 | 0 |
| 2024-03-01 | 0.00 | — | 6 | 4 | 0.0 | -1.0 | 0 |
| 2024-05-24 | 4.80 | YES | 8 | 5 | 2.0 | 3.0 | 1 |
| 2024-08-22 | 0.00 | — | 8 | 5 | 1.6 | -1.6 | 1 |
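
As a cross-check on the detail tables, the reported scores in the rows sampled below are consistent with `score = 0.8 × max(a, 0) × eisc`. This is an observation inferred from the tabulated numbers, not a formula stated anywhere in this write-up; in particular, reading the constant 0.8 as `conviction_weight × exposure_gap` for a high-conviction, `lt2` seed is a guess. The sketch also makes the fragility visible: any window with negative acceleration zeroes the score regardless of how much evidence has accumulated.

```python
def implied_score(a, eisc, factor=0.8):
    """Score implied by the tables: factor x positive acceleration x EISC.

    The factor 0.8 is inferred from the data, not taken from the code.
    """
    return factor * max(a, 0.0) * eisc


# (node, as_of, a, eisc, reported score) copied from the detail tables.
rows = [
    ("K2023", "2023-05-30", 1.0, 3.0, 2.40),
    ("K2023-picks-and-shovels", "2023-11-26", 2.6, 1.6, 3.33),
    ("K2023-picks-and-shovels", "2024-02-24", -2.2, 1.0, 0.00),
    ("K2023-picks-and-shovels", "2024-05-24", 1.6, 2.0, 2.56),
    ("K2023-utilities-repriced", "2023-12-01", 0.6, 1.6, 0.77),
    ("K2023-nuclear", "2023-05-30", 1.6, 1.6, 2.05),
    ("K2023-transformers", "2024-05-24", 3.0, 2.0, 4.80),
]

for node, as_of, a, eisc, reported in rows:
    implied = round(implied_score(a, eisc), 2)
    print(f"{node:26s} {as_of}  implied={implied:.2f}  reported={reported:.2f}")
```

Note that a high score does not imply a cleared bar: the 2023-11-26 picks-and-shovels row scores 3.33 but sits below the EISC floor of 2.0.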
|
||||
@@ -0,0 +1,135 @@
|
||||
# Ten31 Signal Engine (pilot)
|
||||
|
||||
A recurring pipeline that ingests a growing corpus of audio (podcasts, YouTube) and text
(SEC filings, earnings-call transcripts), extracts structured **propositions**, and surfaces
**signal over time**, filtered through Ten31's thesis as a *relevance lens* — with every surfaced
signal logged as a **falsifiable prediction** so the system is scored against reality.
|
||||
|
||||
**Spec / source of truth:** [`ten31-signal-engine-handoff.md`](ten31-signal-engine-handoff.md). It wins
over this README on any conflict. Section refs below (§) point into it.
|
||||
|
||||
---
|
||||
|
||||
## The spine (do not violate — §2, §5)
|
||||
|
||||
> **Statistics & graph structure NOMINATE candidates; the frontier model only JUDGES and FANS OUT a
> pre-filtered shortlist. The prediction ledger is the final arbiter.**
|
||||
|
||||
- **Job A — Discovery (§1):** surface what Grant doesn't yet see — emergent themes (independent
  cross-cluster *convergence*, scored on **acceleration** not size), credible contrarian minority
  stances, and the prize: their **intersection** (a consensus about to flip).
- **Job B — Conviction-action gap (§1.1):** fan held convictions to 2nd/3rd-order derivatives and fire
  `conviction(high) × exposure(low) × rising independent corroboration`. Countermeasure to the 2023
  "power is the binding constraint" call where the seed was held but derivatives under-acted.
- **Nominate-then-judge boundary:** geometry/stats emit a shortlist that already cleared a quantitative
  bar; the frontier model judges/synthesizes/fans-out only. A fanned-out derivative is a **hypothesis,
  not a signal**, until independent corpus corroboration confirms it (§2.3, §4.6).
- **The lens tags relevance; it must NOT gate truth (§5.7).** The engine must be able to surface an
  accelerating signal *against* Ten31's thesis (the B1–B3 breakers). `thesis_seam` is a tag, never a filter.

---

## Locked pilot decisions

| Topic | Decision | Ref |
|---|---|---|
| **Packaging** | Ship as a **StartOS 0.4.0 s9pk** (operator call). Code is package-ready; TS wrapper generated once the daemon/UI/dependency contract stabilizes. | §10, §13 |
| **Earnings transcripts** | **FMP Ultimate** transcript API as backbone (audio isn't reliably fetchable: no uniform feed, ~30–90d replay expiry). EDGAR filings are the durable core. Self-transcribed audio deferred. | §4.1, §12 |
| **Conviction exposure** | Coarse **NAV bands**: `none / lt2 / 2to10 / gt10`. Grant sole editor, monthly review. | §3.1 |
| **Topic vocabulary** | **Hybrid**: seeded controlled list + emergent topics batch-merged on a schedule. | §4.2 |
| **Frontier model** | Default `claude-opus-4-8` (override via `FRONTIER_MODEL`). Scrubbed-vs-unscrubbed quality A/B is a pilot deliverable. | §4.6 |
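
As an illustration of the coarse NAV bands in the table, the sketch below maps a position's share of NAV to a band label. It assumes the band names denote percent of NAV (`lt2` = under 2%, `2to10` = 2% to 10% inclusive, `gt10` = over 10%); the boundary handling and the `nav_band` helper are hypothetical, since the bands are edited by hand rather than computed.

```python
from typing import Optional


def nav_band(nav_fraction: Optional[float]) -> str:
    """Map a position's share of NAV (0.0 to 1.0) to a coarse exposure band.

    Assumed boundaries: lt2 is (0, 2%), 2to10 is [2%, 10%], gt10 is above 10%.
    """
    if nav_fraction is None:
        return "unset"      # band not yet finalized by the editor
    if nav_fraction < 0:
        raise ValueError("NAV fraction cannot be negative")
    if nav_fraction == 0:
        return "none"
    if nav_fraction < 0.02:
        return "lt2"
    if nav_fraction <= 0.10:
        return "2to10"
    return "gt10"
```

Coarse bands keep the exposure field stable between monthly reviews: small NAV moves do not change the band, so the scorer's inputs do not churn.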

---

## Architecture (modules map 1:1 to spec layers)

```
ingest/        §4.1   RSS/YouTube/EDGAR/FMP fetch + scheduler; long-audio chunking; speaker stitch
extract/       §4.2   local LLM → 0..N structured claim units per chunk (willing to emit ZERO)
embedstore/    §4.3   embed DISTILLED propositions + client-side BM25 → Qdrant hybrid
signals/       §4.4   cluster acceleration · stance distributions · bridge edges · scoring
independence/  §4.5   source graph + voiceprint library → convergence discounting
frontier/      §4.6   judge · synthesis · conviction fan-out (scrub→frontier→rehydrate; SHORTLIST/SEEDS only)
ledger/        §4.7   dual-evaluation ledger + conviction log + earned credibility
spark/         §13    THE single chokepoint for all Spark Control HTTP (no other module knows the URL)
store/                SQLite schema + seed loaders
backfill/      §13.4  client-side GPU-hours queue (extraction = heavier serial load; audio sequential)
```

**Two invariants enforced in exactly one place each:**
- All gateway HTTP funnels through `spark/` — `spark/client.py` holds the base URL, the self-signed TLS
  skip, 503 retry/backoff, and a process-wide **audio lock** (sequential audio, §4.1).
- The "model never sees Grant's rating first" rule (§6.7) is **structural**: ratings live in a separate
  `human_evaluations` table; the model-facing code reads `ledger`, which has no `grant_rating` column.
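
The first invariant (503 retry/backoff plus a process-wide audio lock) might look roughly like the sketch below. It is not the contents of `spark/client.py`: `call_with_retry`, the injectable `send` callable, the exponential schedule, and the default attempt count are all assumptions made so the pattern runs without a gateway.

```python
import threading
import time
from typing import Callable, Tuple

# One lock for the whole process: at most one audio request is in flight at a time.
AUDIO_LOCK = threading.Lock()

Response = Tuple[int, bytes]  # (HTTP status, body)


def _retry_on_503(send: Callable[[], Response], max_attempts: int,
                  base_delay: float, sleep: Callable[[float], None]) -> Response:
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 503:
            break
        # Assumed schedule: exponential backoff (base, 2*base, 4*base, ...).
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body


def call_with_retry(send: Callable[[], Response], *, audio: bool = False,
                    max_attempts: int = 5, base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Send a request, retrying on 503; audio calls are serialized by AUDIO_LOCK."""
    if audio:
        with AUDIO_LOCK:
            return _retry_on_503(send, max_attempts, base_delay, sleep)
    return _retry_on_503(send, max_attempts, base_delay, sleep)
```

Keeping the lock and the retry loop in one function is what makes "no other module knows the URL" enforceable: callers hand over a request and never reimplement backoff themselves.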

### Data stores
- **SQLite** (`store/schema.sql`): sources, documents, claims, topics, source_edges, voiceprints,
  conviction_log, fanout_nodes, ledger, human_evaluations, backfill_jobs. The whole system state is a SELECT.
- **Qdrant** (via Spark Control): one `propositions` collection, **hybrid** dense `bge-m3` (1024-d) +
  sparse BM25 (`Qdrant/bm25`, `modifier: idf`). Points are distilled propositions, never raw chunks.
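
To make the `ledger` / `human_evaluations` split concrete, here is a minimal sketch with Python's built-in `sqlite3`. Only the two table names and the `grant_rating` column come from this README; every other column, and the helper functions, are hypothetical stand-ins for whatever `store/schema.sql` actually defines.

```python
import sqlite3

# Hypothetical minimal columns; the real store/schema.sql is not reproduced here.
SCHEMA = """
CREATE TABLE ledger (
    prediction_id INTEGER PRIMARY KEY,
    claim         TEXT NOT NULL,
    model_score   REAL
);
CREATE TABLE human_evaluations (
    prediction_id INTEGER NOT NULL REFERENCES ledger(prediction_id),
    grant_rating  INTEGER NOT NULL
);
"""


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with the two-table split."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def ledger_columns(conn: sqlite3.Connection) -> list:
    """Column names of the model-facing table (row[1] of PRAGMA table_info is the name)."""
    return [row[1] for row in conn.execute("PRAGMA table_info(ledger)")]


def model_facing_rows(conn: sqlite3.Connection) -> list:
    """What model-facing code reads: ledger only, never human_evaluations."""
    return conn.execute(
        "SELECT prediction_id, claim, model_score FROM ledger"
    ).fetchall()
```

Because the rating lives in a different table, a model-facing query cannot leak it by accident; it would take an explicit join to see it.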

---

## Backfill queue (§13.4) — measured in GPU-hours, not real-time

Scheduler jobs are **producers** (fetch/dedup/enqueue); a **single worker** drains the GPU queue one job
at a time → no parallel audio → no 503 by construction. Jobs are leased (crash-safe resume), idempotent on
`hash(content + prompt-version)`. Extraction is the binding load. Rough pilot estimate (~300 episodes +
25 companies): transcription ~6 GPU-h, **extraction ~90 GPU-h**, wall-clock ~4–6 days (audio on Spark 2 and
extraction on Spark 1 run in parallel). The queue self-calibrates from measured `gpu_seconds`.
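
The idempotency and leasing described above can be sketched as follows. This is an assumption-laden illustration, not the `backfill/` module: the `jobs` table, its columns, the SHA-256 choice, and the 600-second lease are all hypothetical.

```python
import hashlib
import sqlite3
from typing import Optional


def job_key(content: bytes, prompt_version: str) -> str:
    """Idempotency key: hash(content + prompt-version)."""
    digest = hashlib.sha256()
    digest.update(content)
    digest.update(b"\x00")  # separator so the content/version boundary cannot blur
    digest.update(prompt_version.encode("utf-8"))
    return digest.hexdigest()


def make_queue() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (job_key TEXT PRIMARY KEY, job_type TEXT NOT NULL, "
        "state TEXT NOT NULL DEFAULT 'pending', lease_expires REAL)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, job_type: str, content: bytes,
            prompt_version: str) -> bool:
    """Producer side: returns False when the same work was already enqueued."""
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO jobs (job_key, job_type) VALUES (?, ?)",
        (job_key(content, prompt_version), job_type),
    )
    return conn.total_changes > before


def lease_next(conn: sqlite3.Connection, now: float,
               lease_seconds: float = 600.0) -> Optional[str]:
    """Worker side: take the oldest pending job, or one whose lease has expired."""
    row = conn.execute(
        "SELECT job_key FROM jobs WHERE state = 'pending' "
        "OR (state = 'leased' AND lease_expires <= ?) ORDER BY rowid LIMIT 1",
        (now,),
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE jobs SET state = 'leased', lease_expires = ? WHERE job_key = ?",
        (now + lease_seconds, row[0]),
    )
    return row[0]
```

An expired lease is what makes a crash resumable: a job held by a dead worker becomes eligible again without manual cleanup. On the estimate itself, 90 GPU-h of serial extraction is 3.75 days of continuous compute, which is consistent with the ~4–6 day wall-clock figure once idle time is allowed for.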

---

## Build order (§11) & status

1. ✅ **Foundation** — config, `spark/` client, SQLite schema, conviction-log seed, **ledger scaffold live day one**.
2. ✅ **Ingestion + backfill queue** — backfill queue; EDGAR + FMP earnings; **audio path live-proven** (download → Parakeet transcribe + Sortformer diarize → align → voiceprint-stitch → speaker-attributed transcript); 20 podcast feeds resolved (`seeds/podcast_feeds.resolved.yaml`). Broad corpus ingested: **785 company docs (379 filings + 406 earnings), 802 extract jobs**. ⬜ Remaining: podcast audio backfill (RSS-full + dated YouTube pulls) + audio-cache cleanup; foreign-filer forms (20-F/6-K for CCJ/TSM/IREN).
3. ✅ **Extraction worker + §4.2 prompt** — **live** on text (earnings: 26 claims, insight-heavy) AND audio (Dwarkesh: 5 claims, correct speaker attribution). Backfill draining in background (`logs/extract-backfill.log`). ⬜ Refinement: target Item 7 MD&A for filings.
4. 🟡 **Embedding + storage** — ✅ Qdrant `propositions` hybrid collection (bge-m3 + BM25) **live**; hybrid search+rerank verified surfacing the power-infra theme. ⬜ embed the growing claim set; clustering.
5. 🟡 **Scoring brain** (`signal_engine/signals/`) — ✅ EISC independence primitive (verified), as-of harness, windowed acceleration, **under-acted-conviction (Job B)**, the quantitative bar, ledger writer, resolver stub, orchestrator. Speaker-name independence edges live. ⬜ Job A scorers (emergence/stance/intersection) deferred per blueprint.
6. ✅ **Source-independence graph** — voiceprint cosine + speaker-name edges feed EISC; "one guest doing the rounds" collapses to ~1 voice (verified).
7. 🟡 **§7.1 backtest** — runs end-to-end; **corpus-gated finding**: power-infra corroboration is currently single-cluster (energy companies), so it correctly won't clear the cross-cluster bar (k_eff≥2). Needs the cross-cluster podcast corpus (transcription backfill running). The scoring is disciplined, not broken.
8. ⬜ Frontier judge + synthesis + live fan-out (Anthropic key live; routed scrub→frontier→rehydrate) — deferred; v1 backtest uses a hand-written fan-out.
9. 🟡 **Web UI** (`signal_engine/ui/`, `serve`) — ✅ corpus management (dashboard, add/view sources, per-source claim inspection). ⬜ human-eval rating interface (§6.7) on the same app.
10. ⬜ **Run the §7.1 backtest FIRST**, then the forward pilot; disagreement analysis; scaling decision.
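
Steps 6 and 7 rely on the idea that linked sources count as one voice. The sketch below is a deliberately simplified stand-in, not the EISC primitive: it treats every voiceprint or speaker-name edge as a hard "same voice" link and counts connected components with union-find, whereas the real discounting may well be soft (hence "~1 voice"). The function name and inputs are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def effective_voices(sources: Iterable[str],
                     same_voice_edges: Iterable[Tuple[str, str]]) -> int:
    """Count independent voices: sources joined by an edge collapse into one."""
    parent: Dict[str, str] = {s: s for s in sources}

    def find(x: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_voice_edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    return len({find(s) for s in parent})
```

Under this simplification, three podcasts that all host the same guest count as one voice, so they cannot clear a k_eff≥2 bar on their own; an unlinked source from another cluster is what moves the count to two.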

**Scaling lever (post-pilot):** bulk extraction can offload to the Gemini batch API (operator can provide a key) while local Qwen stays default — public corpus only, never conviction/exposure data (§4).

---

## Running the foundation

```bash
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
DATA_DIR=./data .venv/bin/python -m signal_engine init-db
DATA_DIR=./data .venv/bin/python -m signal_engine seed-convictions
DATA_DIR=./data .venv/bin/python -m signal_engine seed-convictions --file seeds/conviction_log.backtest-2023.seed.yaml
# When the gateway URL is provided:
SPARK_CONTROL_URL=https://<lan-addr> .venv/bin/python -m signal_engine spark-status
```

**Pipeline + UI commands** (config from `.env`):
```bash
.venv/bin/python -m signal_engine seed-sources && .venv/bin/python -m signal_engine load-feeds
.venv/bin/python -m signal_engine ingest-edgar --ticker CEG --since 2023-01-01 --until 2023-12-31
.venv/bin/python -m signal_engine ingest-earnings --ticker NVDA --since 2023-01-01 --until 2023-12-31
.venv/bin/python -m signal_engine run-extract --limit 4   # local-LLM claim extraction
.venv/bin/python -m signal_engine embed-claims            # → Qdrant hybrid collection
.venv/bin/python -m signal_engine search --query "power is the binding constraint on AI"
.venv/bin/python -m signal_engine serve                   # corpus UI → http://localhost:8000
```

Config is all env-driven (`signal_engine/config.py`) so the same code runs as a plain process now and as a
StartOS daemon later (injected via the s9pk `store.json` FileModel).
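
A minimal sketch of what env-driven config can look like, assuming nothing about the real `signal_engine/config.py` beyond the variable names that appear in this README (`DATA_DIR`, `SPARK_CONTROL_URL`, `SPARK_VERIFY_TLS`, `FRONTIER_MODEL`). The `Settings` class, the `./data` default, and the default of verifying TLS are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_FALSY = {"0", "false", "no", "off"}


@dataclass(frozen=True)
class Settings:
    data_dir: str
    spark_control_url: Optional[str]
    spark_verify_tls: bool
    frontier_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from an environment mapping (injectable for tests)."""
    verify_raw = env.get("SPARK_VERIFY_TLS", "true").strip().lower()
    return Settings(
        data_dir=env.get("DATA_DIR", "./data"),            # assumed default
        spark_control_url=env.get("SPARK_CONTROL_URL"),    # None until the operator provides it
        spark_verify_tls=verify_raw not in _FALSY,
        frontier_model=env.get("FRONTIER_MODEL", "claude-opus-4-8"),
    )
```

Reading everything from a mapping means a plain shell, a `.env` loader, or a daemon wrapper can supply the same keys without any code change.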

---

## Operator-provided items (block runtime, not the foundation)

- **Spark Control LAN base URL + TLS cert** (same-LAN self-signed → `SPARK_VERIFY_TLS=false`).
- **Confirm the exact §13.2 model strings** (currently defaulted: `qwen3.6-35b-a3b-nvfp4`, `bge-m3`, `parakeet-tdt-0.6b`).
- **`ANTHROPIC_API_KEY`** (frontier step) and **`FMP_API_KEY`** (earnings transcripts) — or defer FMP and start filings-only.
- **§7.1 backtest corpus:** which 2023-era podcast archives + filings we can actually fetch (the gating risk).
- **StartOS host CPU arch** (drives which `.s9pk` to install — the host, not the ARM Sparks) and Spark Control's
  health-check IDs + outbound-gateway = clearnet (not Tor) — verified against a live box at packaging time.
|
||||
```
|
||||
@@ -0,0 +1,27 @@
# Ten31 Signal Engine — pilot dependencies, grouped by pipeline layer.
# Foundation (init-db, seed, spark client) needs only: requests, PyYAML, numpy.

# --- core / foundation ---
requests>=2.31
PyYAML>=6.0
numpy>=1.26

# --- ingestion (§4.1) ---
feedparser>=6.0      # podcast RSS (conditional GET via etag/modified)
yt-dlp>=2025.1       # YouTube audio (needs a PO-token provider sidecar; see README)
edgartools>=3.0      # SEC EDGAR filings (sets UA, throttles ≤10 rps)
# FMP earnings transcripts are plain REST via `requests` (no SDK)

# --- scheduling + queue (§13.4) ---
APScheduler>=3.10

# --- embeddings/vectors (§4.3) — vectors live in Qdrant behind Spark Control ---
qdrant-client>=1.12
fastembed>=0.4       # client-side BM25 sparse vectors (Qdrant/bm25, modifier: idf)

# --- frontier (§4.6), bounded final step ---
anthropic>=0.40

# --- eval UI (§4.7 / human eval) ---
fastapi>=0.110
uvicorn>=0.29
Executable
+47
@@ -0,0 +1,47 @@
#!/bin/bash
# Queued Strike pipeline: waits for the 4-show transcription to finish, then extracts → embeds →
# runs the STRIKE2022 two-sided reflexivity test (live vs test). Robust: proceeds with whatever is
# transcribed if the worker dies, and the 24h cap is a backstop. All work persists in the DB, so a
# crash mid-run is resumable by re-running run-extract / embed-claims / two-sided by hand.
set -u
# Abort if the project directory is missing, so sqlite3 never creates an empty DB somewhere else.
cd /Users/macpro/Projects/ten31-signal-engine || exit 1
PY=.venv/bin/python
LOG=data/strike_pipeline.log
SHOWS="('pod-whatbitcoindid','pod-stephanlivera','pod-kevinrooke','pod-anitaposch')"
say(){ echo "[$(date '+%H:%M:%S')] $*" | tee -a "$LOG"; }

say "Strike pipeline QUEUED — waiting for transcription of the 4 independent shows to finish."

# 1) Wait for transcription completion (remaining=0) OR the transcribe worker dying. 24h backstop.
#    480 polls × 180 s = 86,400 s = 24 h.
for i in $(seq 1 480); do
  R=$(sqlite3 data/signal.db "SELECT COUNT(*) FROM backfill_jobs j JOIN documents d ON j.target_id=d.doc_id WHERE j.job_type='transcribe' AND j.state IN ('pending','running','leased') AND d.source_id IN $SHOWS;")
  D=$(sqlite3 data/signal.db "SELECT COUNT(*) FROM backfill_jobs j JOIN documents d ON j.target_id=d.doc_id WHERE j.job_type='transcribe' AND j.state='done' AND d.source_id IN $SHOWS;")
  ALIVE=$(ps aux | grep -cE "[r]un-transcribe")   # matches run-transcribe AND run-transcribe-gemini
  say "transcribe: remaining=$R done=$D worker_alive=$ALIVE (poll $i)"
  if [ "$R" = "0" ]; then say "transcription COMPLETE."; break; fi
  if [ "$ALIVE" = "0" ]; then say "transcribe worker not alive and work remains ($R) — proceeding with partial corpus."; break; fi
  sleep 180
done

# 2) Prioritize bitcoin-cluster podcast extract jobs (the independent legs + the 19 TFTC for test-mode contrast)
say "prioritizing + extracting bitcoin-podcast claims (local Qwen on the now-free Spark)..."
sqlite3 data/signal.db "UPDATE backfill_jobs SET priority=8 WHERE job_type='extract' AND state='pending' AND parent_doc_id IN (SELECT d.doc_id FROM documents d JOIN sources s ON d.source_id=s.source_id WHERE s.source_cluster='bitcoin' AND s.kind='podcast');"

# 3) Extract (priority-8 podcasts drain first). Loop in batches so a transient gateway hiccup doesn't end it.
for pass in 1 2 3 4 5 6; do
  PEND=$(sqlite3 data/signal.db "SELECT COUNT(*) FROM backfill_jobs j JOIN documents d ON j.target_id=d.doc_id JOIN sources s ON d.source_id=s.source_id WHERE j.job_type='extract' AND j.state='pending' AND s.source_cluster='bitcoin' AND s.kind='podcast';")
  say "extract pass $pass: $PEND bitcoin-podcast extract jobs pending"
  [ "$PEND" = "0" ] && break
  $PY -m signal_engine run-extract --limit 250 --max-chunks 4 2>&1 | grep -vE "httpx" | tail -3 | tee -a "$LOG"
done

# 4) Embed all pending claims → Qdrant
say "embedding claims..."
$PY -m signal_engine embed-claims 2>&1 | grep -vE "httpx|HF_TOKEN|huggingface|show_warning|Fetching|files:" | tail -3 | tee -a "$LOG"

# 5) STRIKE2022 two-sided: live (own_network TFTC/CD/RHR dropped) vs test (kept) — the reflexivity contrast
say "=== STRIKE2022 TWO-SIDED RESULT (live vs test) ==="
$PY -m signal_engine two-sided --conviction STRIKE2022 --modes live,test \
  --dates 2022-12-31,2023-06-30,2023-12-31 --window-days 180 2>&1 \
  | grep -vE "httpx|HF_TOKEN|huggingface|show_warning|Fetching|files:" | tee -a "$LOG"
say "Strike pipeline DONE."
@@ -0,0 +1,34 @@
"""Autonomous sharpening pass: wait for the cross-cluster podcast claims to extract, re-embed, then
re-run the §7.1 backtest. Run in the background; writes logs/backtest2.log for review."""
import sqlite3
import subprocess
import sys
import time
from contextlib import closing

DB = "data/signal.db"
PY = ".venv/bin/python"


def pending_podcast_extract() -> int:
    # closing() releases the connection after every poll instead of leaking one per iteration.
    with closing(sqlite3.connect(DB)) as conn:
        return conn.execute(
            "SELECT COUNT(*) FROM backfill_jobs WHERE job_type='extract' AND state='pending' "
            "AND target_id LIKE 'pod:%'"
        ).fetchone()[0]


for i in range(60):  # up to ~2h (60 polls × 120 s)
    p = pending_podcast_extract()
    print(f"[sharpen] iter {i}: podcast extract pending={p}", flush=True)
    if p <= 2:
        break
    time.sleep(120)

print("[sharpen] embedding accumulated claims...", flush=True)
subprocess.run([PY, "-m", "signal_engine", "embed-claims"], stdout=sys.stdout, stderr=subprocess.STDOUT)

print("[sharpen] re-running backtest...", flush=True)
with open("logs/backtest2.log", "w") as f:
    subprocess.run([PY, "-m", "signal_engine", "backtest", "--conviction", "K2023",
                    "--start", "2023-03-01", "--end", "2024-09-01", "--step-days", "90",
                    "--window-days", "90"], stdout=f, stderr=subprocess.STDOUT)
print("[sharpen] DONE — see logs/backtest2.log", flush=True)
@@ -0,0 +1,34 @@
# Battery corpus documents to fetch (verified FREE+FETCHABLE, manifest workflow w7559rp1x).
# Loaded by: signal_engine ingest-doc-manifest --file seeds/battery_docs.manifest.yaml
# Each → documents row (kind=filing) + extract job. method auto-detects html/pdf; override where needed.
docs:
  # ===== SUPPLY: in-window deployed-dollar disclosures (the only true supply measures) =====
  - {source: bat-galaxy, method: html, date: "2024-01-01", title: "Galaxy Research — The State of Crypto Lending", url: "https://www.galaxy.com/insights/research/the-state-of-crypto-lending"}
  - {source: bat-galaxy, method: pdf, date: "2024-01-01", title: "Galaxy Research — State of Crypto Lending (full PDF)", url: "https://assets.ctfassets.net/h62aj7eo1csj/4vkA9567QmK4pyYoPBtrQa/fb039fd97d657d8151dcf4d3e969e481/The_State_of_Crypto_Lending_-_Galaxy_Research.pdf"}
  - {source: co-coin, method: html, date: "2024-02-15", title: "Coinbase Q4'23 Shareholder Letter ($399M loans / $546M credit)", url: "https://www.sec.gov/Archives/edgar/data/1679788/000167978824000019/shareholderletterq42023.htm"}
  - {source: bat-galaxy, method: html, date: "2024-11-08", title: "Galaxy Q3'24 results (avg loan book $863M)", url: "https://www.newswire.ca/news-releases/galaxy-announces-third-quarter-2024-financial-results-837407031.html"}
  - {source: bat-9fin, method: html, date: "2024-01-01", title: "9fin — Bitcoin-backed direct lending brings crypto to private credit", url: "https://www.9fin.com/insights/bitcoin--backed-direct-lending-brings-crypto-to-private"}
  - {source: bat-coinspeaker, method: html, date: "2023-09-01", title: "Coinbase to Launch Crypto Lending for Institutions (>$57M, Reg D)", url: "https://www.coinspeaker.com/coinbase-crypto-lending-institutional/"}
  # ===== SUPPLY: announced-not-funded boundary anchors (the binding negative) =====
  - {source: bat-cantor, method: html, date: "2024-07-27", title: "Cantor Fitzgerald to Launch Bitcoin Financing Business ($2B ANNOUNCED)", url: "https://www.prnewswire.com/news-releases/cantor-fitzgerald-to-launch-bitcoin-financing-business-302208093.html"}
  - {source: bat-falconx, method: html, date: "2025-05-27", title: "FalconX Closes First Bitcoin-Backed Financing from Cantor (FUNDED, out-of-window)", url: "https://www.falconx.io/newsroom/falconx-closes-first-bitcoin-backed-financing-from-cantor"}
  - {source: bat-maple, method: html, date: "2025-05-27", title: "Maple Closes First Cantor Tranche (FUNDED, out-of-window)", url: "https://maple.finance/insights/maple-cantor"}
  - {source: bat-coindesk, method: html, date: "2025-05-27", title: "CoinDesk — Cantor Debuts Bitcoin Lending, First Tranches to FalconX/Maple", url: "https://www.coindesk.com/business/2025/05/27/wall-street-giant-cantor-debuts-bitcoin-lending-business-with-first-tranches-to-falconx-maple"}
  - {source: bat-twoprime, method: html, date: "2023-11-08", title: "Two Prime sees $2B demand for bitcoin-backed loans (zero deployed disclosed)", url: "https://www.coindesk.com/markets/2023/11/08/investment-advisor-two-prime-sees-2b-in-demand-for-bitcoin-backed-loans"}
  - {source: bat-battery, method: html, date: "2024-11-25", title: "Newmarket Launches Battery Finance (1 funded deal; no pooled committed capital)", url: "https://www.newswire.com/news/newmarket-launches-battery-finance-a-pioneering-asset-management-and-22469517"}
  - {source: bat-bitcoinmag, method: html, date: "2024-11-25", title: "Bitcoin Magazine — Newmarket Launches Battery Finance (interview)", url: "https://bitcoinmagazine.com/business/newmarket-capital-launches-battery-finance-bitcoin-collateralized-loan-strategy-"}
  # ===== DEMAND: originator disclosures =====
  - {source: bat-unchained, method: html, date: "2024-01-01", title: "Unchained — Announcing Institutional Lending (>$500M cumulative)", url: "https://www.unchained.com/blog/announcing-institutional-lending"}
  - {source: bat-ledn, method: html, date: "2023-12-31", title: "Ledn Open Book — Dec 2023", url: "https://www.ledn.io/post/lednopenbook-dec23"}
  # ===== MARKET DATA (independent) =====
  - {source: bat-asr, method: html, date: "2022-04-01", title: "Asset Securitization Report — Milo crypto-mortgage (BTC collateral)", url: "https://asreport.americanbanker.com/news/crypto-mortgage-product-allows-borrowers-to-post-bitcoin-as-collateral"}
  # ===== POLICY CONTEXT (axis=context — weight 0 into supply) =====
  - {source: bat-sec, method: html, date: "2022-04-11", title: "SEC SAB-121 (govinfo full text)", url: "https://www.govinfo.gov/content/pkg/FR-2022-04-11/html/2022-07196.htm"}
  - {source: bat-fed, method: html, date: "2022-08-16", title: "Fed SR 22-6 / CA 22-6 — crypto-asset activities (press release)", url: "https://www.federalreserve.gov/newsevents/pressreleases/bcreg20220816a.htm"}
  - {source: bat-fed, method: pdf, date: "2022-08-16", title: "Fed SR 22-6 — letter PDF (loans collateralized by crypto-assets)", url: "https://www.federalreserve.gov/newsevents/pressreleases/files/bcreg20250424a3.pdf"}
  - {source: bat-fed, method: pdf, date: "2023-08-08", title: "Fed SR 23-7 — Novel Activities Supervision Program (PDF)", url: "https://www.federalreserve.gov/newsevents/pressreleases/files/bcreg20250815a1.pdf"}
  - {source: bat-fdic, method: html, date: "2022-04-07", title: "FDIC FIL-16-2022 — crypto prior-notification", url: "https://www.fdic.gov/news/inactive-financial-institution-letters/2022/fil22016.html"}
  - {source: bat-occ, method: pdf, date: "2021-11-23", title: "OCC Interpretive Letter 1179 — supervisory non-objection gate", url: "https://www.occ.gov/topics/charters-and-licensing/interpretations-and-decisions/2021/int1179.pdf"}
  - {source: bat-occ, method: pdf, date: "2020-07-22", title: "OCC Interpretive Letter 1170 — national-bank crypto custody", url: "https://www.occ.gov/topics/charters-and-licensing/interpretations-and-decisions/2020/int1170.pdf"}
  # ===== Terminal-bracket policy (2025, out-of-window CONTEXT: the unblock that produced no capital) =====
  - {source: bat-sec, method: html, date: "2025-01-23", title: "SEC SAB-122 (rescission of SAB-121)", url: "https://www.federalregister.gov/documents/2025/01/30/2025-01864/staff-accounting-bulletin-no-122"}
@@ -0,0 +1,23 @@
# ADVERSARIAL failed-conviction cases (DESIGN_v2 §1.4) — the NEGATIVES the eval was missing.
# Both are convictions Ten31 ACTUALLY held where the FALSIFIABLE THESIS mis-fired (on mechanism /
# timing) even though Ten31 remains bullish on the companies. That is the most honest kind of negative
# — and the kind we are most tempted to grade leniently, which is why the resolution criteria are
# pre-registered (seeds/resolution.*.yaml) before any outcome-labeling. Let the criteria be ugly.
convictions:
  - id: STRIKE2022
    seam: debasement_bitcoin
    conviction_level: high
    current_exposure: lt2   # TEST parameterization (so the scorer can fire) — NOT Strike's real exposure (~40% of book)
    exposure_note: "test param; real Strike exposure is HIGH (largest position). The 2022 PAYMENTS thesis is the falsifiable negative."
    thematic_proposition: "Bitcoin + the Lightning network becomes a retail payments network that materially disrupts the card rails (Visa/Mastercard/Amex); Lightning retail acceptance and merchant-payment volume scale."
    team_conviction_note: "Ten31 led Strike's Series B (2022) on this thesis; STILL high conviction in Strike the company. But the 2022 payments-network thesis FAILED — Strike succeeded by becoming a bitcoin financial-services platform (exchange + BTC-collateralized lending), which the memo barely names. The failed thesis was NARRATIVE-driven (the whole bitcoin-podcast cluster told 'Lightning eats retail payments' loudly in 2022) — the exact single-cluster, reflexive, talk-our-own-book corroboration the relaxed gate is most likely to wave through."
    disconfirming_signal: "Retail Lightning payments fail to materialize at scale; card-network volumes show no erosion; Strike's growth comes from exchange/lending not merchant payments."

  - id: BATTERY2022
    seam: debasement_bitcoin
    conviction_level: high
    current_exposure: lt2
    exposure_note: "Battery Finance position"
    thematic_proposition: "Bitcoin-as-collateral credit goes mainstream: institutional/incumbent capital funds bitcoin-collateralized lending at scale within 24-36 months; >=1 major traditional institution enters. (= conviction D1 + R3.)"
    team_conviction_note: "Ten31 invested in Battery Finance (2022); STILL high conviction. The thesis was RIGHT on the demand side (borrower appetite for BTC-collateralized credit) and EARLY/wrong on the SUPPLY side (Battery could not raise institutional lending capital at scale). A direct, datable instance of D1/R3 not arriving on schedule — the instructive two-sided test (demand rising, supply flat)."
    disconfirming_signal: "Institutional/bank capital for BTC-collateralized credit stays scarce; no major incumbent entry; the supply side stalls while demand rises."
@@ -0,0 +1,21 @@
# Backtest seed — handoff §7.1 (the HEADLINE pilot validation, run BEFORE the forward pilot).
#
# Seed ONLY the ~2023 Kirkwood conviction, run the pipeline over a period-correct ~2023 corpus, and
# check whether the under-acted-conviction signal surfaces the derivative:
#   "size up the power-infrastructure picks-and-shovels of the buildout"
# (grid interconnect, transformers, substations, cooling, gas turbines, nuclear, uranium, public
# picks-and-shovels). A clear yes/no on that derivative is the strongest validation the system
# does the job Ten31 actually needs (§1.1, §7.1).
#
# CRITICAL (§6.6 look-ahead guard): consensus, embeddings, and corroboration must be computed
# AS-OF the log date — never with knowledge that the theme was real in absolute terms.

convictions:
  - id: K2023
    seam: energy_compute
    conviction_level: high
    current_exposure: lt2   # the historical reality: seed conviction held, derivatives under-acted
    exposure_note: "Seed conviction held in ~2023; derivative branches NOT systematically sized into — the failure §1.1 exists to prevent."
    thematic_proposition: "Bitcoin mining and AI are both 'distributed compute'; interruptible/flexible load is the differentiator; the world will need to ~1000x rack space over the decade, so power becomes the binding constraint."
    team_conviction_note: "Jonathan Kirkwood, publicly articulated ~2023. Root call correct and early; the MISS was the derivative tree, not the prediction."
    disconfirming_signal: "Compute demand growth stalls; power clears without becoming the binding constraint."
@@ -0,0 +1,147 @@
# Conviction log seed — handoff §3.1. HUMAN-OWNED: Grant edits this file; `seed-convictions` upserts it.
#
# Structural rule (§3.1): `thematic_proposition` is the TRACKABLE half the corpus can corroborate and
# that gets fanned out and scored. `team_conviction_note` is context ONLY — the engine must never
# present theme corroboration as validation of the team bet beneath it.
#
# `current_exposure` is a coarse NAV band (operator decision): none | lt2 | 2to10 | gt10 | unset.
# The v1 levels below are the §3.1 draft; exposure is left `unset` with the original prose preserved in
# `exposure_note` — Grant to FINALIZE the NAV bands (§12 governance item).

convictions:
  # ---------- ROOT (the forcing function) ----------
  - id: R1
    seam: root
    conviction_level: high
    current_exposure: unset
    exposure_note: "pervasive"
    thematic_proposition: "Sovereign debt keeps being monetized not repaid; fiat debasement persists; bitcoin is adopted as the neutral non-debasable reserve capital migrates to."
    disconfirming_signal: "Durable fiscal surpluses + falling debt/GDP + no reserve diversification."

  - id: R2
    seam: root
    conviction_level: high
    current_exposure: unset
    exposure_note: "thesis-wide"
    thematic_proposition: "AI drives the marginal cost of the reproducible toward zero; value accrues to the scarce/verifiable; bitcoin gains relative share as the 'strongest horse'; pricing-in-bitcoin grows."
    disconfirming_signal: "Scarce/verifiable assets earn no premium as AI content saturates."

  - id: R3
    seam: root
    conviction_level: med-high
    current_exposure: unset
    exposure_note: "pervasive (esp. custody/credit names)"
    thematic_proposition: "Strategic bitcoin reserves (US/nation-states), SAB-121 repeal enabling bank custody, and ETF/treasury inflows create a price-inelastic bid and invert allocator career risk."
    disconfirming_signal: "Reserve plans stall or reverse; banks stay out; policy turns adversarial."

  # ---------- ENERGY <-> COMPUTE ----------
  - id: E1
    seam: energy_compute
    conviction_level: high
    current_exposure: unset
    exposure_note: "MED-HIGH (Giga, Satoshi Energy)"
    thematic_proposition: "Power, not chips, is the binding constraint on AI buildout through ~2027-28; the seam picks-and-shovels are under-priced."
    team_conviction_note: "Giga, Satoshi Energy."
    disconfirming_signal: "Chips/capital remain the bottleneck; interconnect clears fast."

  - id: E2
    seam: energy_compute
    conviction_level: high
    current_exposure: unset
    exposure_note: "MED (Giga power-market optimization, Satoshi)"
    thematic_proposition: "The miner flexible-load playbook (demand response, behind-the-meter) goes mainstream for AI data centers and grids; mining fluency is a transferable underwriting edge."
    disconfirming_signal: "Data centers reject flexible load; the fluency proves non-transferable."

  - id: E3
    seam: energy_compute
    conviction_level: med
    current_exposure: unset
    exposure_note: "Giga (straddle) vs Upstream (mining-only)"
    thematic_proposition: "Mining-native operators that pivot into / straddle AI/HPC capture the convergence; mining-only underperforms."
    team_conviction_note: "Deliberately low-conviction seed — engine should help resolve."
    disconfirming_signal: "Pure-play mining outperforms straddlers."

  # ---------- DEBASEMENT <-> BITCOIN ----------
  - id: D1
    seam: debasement_bitcoin
    conviction_level: high
    current_exposure: unset
    exposure_note: "HIGH (Strike; Battery, Unchained, debifi, AnchorWatch)"
    thematic_proposition: "Bitcoin-as-collateral goes mainstream: new BTC-collateralized credit products proliferate, spreads compress, and >=1 major traditional institution enters within 24-36 months. As products mature, holders borrow rather than sell, shrinking marginal supply."
    disconfirming_signal: "Stays a crypto-native niche; no incumbent entry; spreads hold."

  - id: D2
    seam: debasement_bitcoin
    conviction_level: high
    current_exposure: unset
    exposure_note: "portfolio-wide"
    thematic_proposition: "Incumbents buy, not build: legacy finance/tech acquires bitcoin-natives rather than building in-house (the published exit thesis)."
    disconfirming_signal: "Incumbents build in-house or via crypto-generalists; no strategic M&A."

  - id: D3
    seam: debasement_bitcoin
    conviction_level: med-high
    current_exposure: unset
    exposure_note: "enablers (Fold, AnchorWatch, Giga/Upstream)"
    thematic_proposition: "Bitcoin commercialization of legacy operating businesses: compressed-multiple firms become structurally advantaged when rearchitected around bitcoin (treasury, settlement, self-hosted infra, stranded energy)."
    disconfirming_signal: "Legacy adoption stalls; no margin advantage."

  - id: D4
    seam: debasement_bitcoin
    conviction_level: high
    current_exposure: unset
    exposure_note: "HIGH (largest position, ~40%)"
    thematic_proposition: "Strike re-rates as a bitcoin bank, not payments: market values it as exchange + major retail BTC-collateralized lender + global access (70+ jurisdictions), not legacy payments."
    team_conviction_note: "Team conviction high; tracked SEPARATELY from the thematic re-rating."
    disconfirming_signal: "Stays valued/stuck as payments; lending/exchange don't scale."

  # ---------- AI <-> DATA-OWNERSHIP (PRIME under-acted-conviction target) ----------
  - id: A1
    seam: ai_data_ownership
    conviction_level: high
    current_exposure: unset
    exposure_note: "LOW (Start9, OpenSecret/Maple, maybe Primal; small checks)"
    thematic_proposition: "Owned judgment is the last margin: AI commoditizes competence and profit on undifferentiated output erodes toward zero, so durable margin needs owned/protected proprietary data + judgment; demand grows for sovereign-root + confidential-inference infra."
    disconfirming_signal: "Enterprises cede data/inference with no margin penalty."

  - id: A2
    seam: ai_data_ownership
    conviction_level: med
    current_exposure: unset
    exposure_note: "LOW"
    thematic_proposition: "The segment that can't cede (regulated, IP-sensitive, adversarial jurisdictions) adopts owned infra + confidential inference even as the majority cedes to convenience."
    disconfirming_signal: "Even the IP-sensitive segment fully cedes."

  - id: A3
    seam: ai_data_ownership
    conviction_level: low
    current_exposure: unset
    exposure_note: "LOW"
    thematic_proposition: "Start9 broadens beyond the bitcoiner niche (SaaS -> on-prem reversion)."
    team_conviction_note: "Explicitly uncertain — team high, theme unproven ('maybe drinking our own koolaid, tbd'). Low-conviction seed the engine should help resolve."
    disconfirming_signal: "Stays bitcoiner-niche."
|
||||
|
||||
# ---------- MONITORED THESIS-BREAKERS (engine must surface these AGAINST the thesis, §5.7) ----------
|
||||
- id: B1
|
||||
seam: root
|
||||
is_thesis_breaker: true
|
||||
conviction_level: low
|
||||
current_exposure: unset
|
||||
thematic_proposition: "Quantum acceleration compresses CRQC timelines inside NIST 2035 before mitigations deploy (bitcoin-leg breaker)."
|
||||
disconfirming_signal: "n/a — this is a breaker the engine monitors FOR, not against."
|
||||
|
||||
- id: B2
|
||||
seam: energy_compute
|
||||
is_thesis_breaker: true
|
||||
conviction_level: low
|
||||
current_exposure: unset
|
||||
thematic_proposition: "AI permanently outbids mining for power, pushing mining to only truly-stranded margin (energy-leg breaker)."
|
||||
disconfirming_signal: "n/a — breaker the engine monitors FOR."
|
||||
|
||||
- id: B3
|
||||
seam: debasement_bitcoin
|
||||
is_thesis_breaker: true
|
||||
conviction_level: low
|
||||
current_exposure: unset
|
||||
thematic_proposition: "Stablecoins/CBDCs capture the neutral-reserve role, or bitcoin fails as the exit (tests the complementary-stablecoin view)."
|
||||
disconfirming_signal: "n/a — breaker the engine monitors FOR."
|
||||
@@ -0,0 +1,17 @@
# Hand-written fan-out for the BATTERY 2022 bitcoin-as-collateral-credit thesis (D1) — two-sided test.
# The instructive split: DEMAND derivatives expected RISING (confirming), SUPPLY derivatives expected
# FLAT (the disconfirmation / timing-early signal the two-sided scorer should surface).
parent_conviction_id: BATTERY2022
nodes:
  - node_id: BATTERY-demand-borrower-appetite
    distance_from_edge: in_mandate
    derivative_proposition: "Borrower demand for bitcoin-collateralized credit is rising (new products, origination growth, fund formations)."
  - node_id: BATTERY-institutional-supply
    distance_from_edge: in_mandate
    derivative_proposition: "Named institutional capital PROVIDERS have actually DEPLOYED significant dollars (not merely announced or made-available) to FUND third-party bitcoin-collateralized loans at scale. (Capital-provider side only — a firm posting bitcoin as collateral to RECEIVE a loan is borrower-side demand, not supply; an announced/planned program is not deployed capital.)"
  - node_id: BATTERY-incumbent-entry
    distance_from_edge: in_mandate
    derivative_proposition: "A major traditional financial institution has entered bitcoin-collateralized lending (D1's explicit milestone)."
  - node_id: BATTERY-custody-policy-enablement
    distance_from_edge: one_hop
    derivative_proposition: "Regulatory and custody developments (SAB-121 repeal, bank custody clearance) are enabling institutional bitcoin credit."
@@ -0,0 +1,22 @@
# Hand-written fan-out for the §7.1 backtest (build-order step 4). These are the 2nd/3rd-order
# derivatives the frontier WOULD have generated from the 2023 Kirkwood conviction (K2023). Hand-writing
# them for v1 removes the frontier dependency from the FIRST backtest and isolates the real question:
# does the SCORING surface the derivative once it exists? (The §1.1 derivative tree, verbatim intent.)
parent_conviction_id: K2023
nodes:
  - node_id: K2023-grid-interconnect
    derivative_proposition: "Electrical grid interconnection capacity becomes the gating constraint on new data-center and compute load; interconnect queue times and grid upgrade costs blow out."
  - node_id: K2023-transformers
    derivative_proposition: "Demand for large power transformers and electrical equipment outstrips supply; lead times extend dramatically and pricing rises."
  - node_id: K2023-nuclear
    derivative_proposition: "Nuclear power — existing plants, long-term PPAs, SMRs — is repriced upward as firm clean baseload to power AI data centers."
  - node_id: K2023-gas-turbines
    derivative_proposition: "Natural-gas turbines and on-site/behind-the-meter generation see surging demand to power data centers where the grid cannot deliver in time."
  - node_id: K2023-cooling
    derivative_proposition: "Data-center cooling and thermal management (liquid cooling) demand accelerates as compute power density rises."
  - node_id: K2023-uranium
    derivative_proposition: "Uranium and nuclear-fuel supply tightens as nuclear demand to power compute rises."
  - node_id: K2023-utilities-repriced
    derivative_proposition: "Independent power producers and utilities with firm generation are repriced upward as data-center electricity demand surges."
  - node_id: K2023-picks-and-shovels
    derivative_proposition: "The public picks-and-shovels of the AI power buildout — power infrastructure, equipment, contracts, and generation serving data centers — are systematically under-priced and should be sized up."
@@ -0,0 +1,15 @@
# Hand-written fan-out for the STRIKE 2022 payments-network thesis (adversarial negative).
# distance_from_edge: tag for TRIAGE only (DESIGN_v2.1) — NEVER a filter. These are all in_mandate
# (core bitcoin). The test: does the engine CLEAR any of these on bitcoin-cluster reflexive chatter
# when retail Lightning payments never materialized? Each clear = a candidate FALSE POSITIVE.
parent_conviction_id: STRIKE2022
nodes:
  - node_id: STRIKE-lightning-retail-acceptance
    distance_from_edge: in_mandate
    derivative_proposition: "Lightning-based retail payment acceptance is accelerating across major merchants and point-of-sale systems."
  - node_id: STRIKE-card-rail-disruption
    distance_from_edge: in_mandate
    derivative_proposition: "Card-network payment volume and interchange economics are being disrupted by bitcoin/Lightning payment rails."
  - node_id: STRIKE-merchant-lightning-integration
    distance_from_edge: in_mandate
    derivative_proposition: "Major merchants and processors are integrating Lightning for retail payments at scale (NCR / Shopify / Blackhawk-class), beyond pilots."
@@ -0,0 +1,33 @@
# Resolved + VERIFIED podcast feeds (background research, 2026-06-07). Loaded via `load-feeds`.
# backtest_2022_2023 = how to reach the §7.1 window (2022-2023) for this show:
#   rss_full — the RSS feed itself carries the full back-catalog into 2022-2023
#   rss_2023_only — show launched in 2023; RSS is full from launch (no 2022 by design)
#   youtube_only — RSS is a truncated rolling window; 2022-2023 must come from YouTube/site archive
#   launched_later — show didn't exist in 2022-2023 (no backtest contribution)
#   unavailable — no owned audio feed at all
#
# KEY FINDING: most podcast RSS feeds are rolling windows (~15-25 eps). For the backtest, the clean
# RSS-back-catalog shows (Hidden Forces, Dwarkesh, What Bitcoin Did, All-In, Invest Like the Best)
# anchor the 2022-2023 podcast leg; truncated shows need yt-dlp channel pulls with --dateafter/--datebefore.

feeds:
  - {id: pod-oddlots, rss_url: "https://www.omnycontent.com/d/playlist/e73c998e-6e60-432f-8610-ae210140c5b1/8a94442e-5a74-4fa2-8b8d-ae27003a8d6b/982f5071-765c-403d-969d-ae27003a8d83/podcast.rss", youtube_channel_url: "https://www.youtube.com/playlist?list=PLe4PRejZgr0MuA6M0zkZyy-99-qc87wKV", backtest_2022_2023: youtube_only, note: "Omny RSS truncated to ~15 recent; show since 2015. 2022-2023 via YouTube/Apple/Spotify or Bloomberg YT."}
  - {id: pod-forwardguidance, rss_url: "https://feeds.megaphone.fm/forwardguidance", youtube_channel_url: "https://www.youtube.com/@ForwardGuidanceBW", backtest_2022_2023: youtube_only, note: "RSS truncated. Hosts the 2022-2023 Jack Farley macro content — KEY backtest source via YouTube @ForwardGuidanceBW."}
  - {id: pod-macrovoices, rss_url: "https://feed.podbean.com/macrovoices/feed.xml", youtube_channel_url: "https://www.youtube.com/@macrovoices7508", backtest_2022_2023: youtube_only, note: "RSS ~25 recent; weekly since 2016 (energy-heavy). 2022-2023 via macrovoices.com archive (29 pages) or YouTube."}
  - {id: pod-grantwilliams, rss_url: "https://feed.podbean.com/ttmygh/feed.xml", youtube_channel_url: "https://www.youtube.com/@GWTTMYGH", backtest_2022_2023: youtube_only, note: "RSS ~20 recent; much back-catalog paywalled (Copper/Silver). 2022-2023 via YouTube @GWTTMYGH."}
  - {id: pod-monetarymatters, rss_url: "https://feeds.megaphone.fm/EWWMN1909747317", youtube_channel_url: "https://www.youtube.com/@Monetary-Matters", backtest_2022_2023: launched_later, note: "Launched Sep 2024 — NO 2022-2023. For 2022-2023 Jack Farley use Forward Guidance instead."}
  - {id: pod-hiddenforces, rss_url: "https://hiddenforces.libsyn.com/rss", youtube_channel_url: "https://www.youtube.com/channel/UC8URhgYos5fjHqFSO4RSIEg", backtest_2022_2023: rss_full, note: "FULL via RSS: 507 eps to 2017 (libsyn). 67 eps in 2022, 55 in 2023. Anchor backtest source."}
  - {id: pod-dwarkesh, rss_url: "https://apple.dwarkesh-podcast.workers.dev/feed.rss", youtube_channel_url: "https://www.youtube.com/c/DwarkeshPatel", backtest_2022_2023: rss_full, note: "FULL via RSS: 128 eps to 2020; 25 in 2022, 20 in 2023. NB do NOT use api.substack.com/feed/podcast/69345.rss (stale/truncated)."}
  - {id: pod-nopriors, rss_url: "https://feeds.megaphone.fm/nopriors", youtube_channel_url: "https://www.youtube.com/@NoPriorsPodcast", backtest_2022_2023: rss_2023_only, note: "FULL from Feb 2023 launch (165 eps); no 2022 by design."}
  - {id: pod-latentspace, rss_url: "https://api.substack.com/feed/podcast/1084089.rss", youtube_channel_url: "https://www.youtube.com/@LatentSpacePod", backtest_2022_2023: rss_2023_only, note: "FULL from Feb 2023 launch (207 eps, 50 in 2023); no 2022 by design."}
  - {id: pod-cognitiverev, rss_url: "https://feeds.megaphone.fm/RINTP3108857801", youtube_channel_url: "https://www.youtube.com/@CognitiveRevolutionPodcast", backtest_2022_2023: rss_2023_only, note: "FULL from Feb 2023 launch (348 eps, 91 in 2023); no 2022 by design."}
  - {id: pod-bg2, rss_url: "https://anchor.fm/s/f06c2370/podcast/rss", youtube_channel_url: "https://www.youtube.com/@Bg2Pod", backtest_2022_2023: launched_later, note: "Launched Jan 2024 — NO 2022-2023."}
  - {id: pod-a16z, rss_url: "https://feeds.simplecast.com/JGE3yC0V", youtube_channel_url: "https://www.youtube.com/@a16z", backtest_2022_2023: youtube_only, note: "RSS truncated to ~15 recent; show since 2014 (~1000 eps). 2022-2023 via a16z.com/podcasts or YouTube."}
  - {id: pod-catalyst, rss_url: "https://feeds.megaphone.fm/catalyst", youtube_channel_url: "https://www.youtube.com/channel/UC1dCBgJnwO5fgNVEn2BgQbg", backtest_2022_2023: youtube_only, note: "RSS ~15 recent; 258 eps since 2021 (energy). 2022-2023 via Latitude Media site or YouTube Catalyst playlist."}
  - {id: pod-columbiaenergy, rss_url: "https://columbiaenergyexchange.libsyn.com/rss", youtube_channel_url: "https://www.youtube.com/channel/UC5vAhRqHufSZNB9coZG5t6Q", backtest_2022_2023: youtube_only, note: "RSS ~28 recent; long-running (CGEP). 2022-2023 via energypolicy.columbia.edu archive (has transcripts!) or YouTube."}
  - {id: pod-doomberg, rss_url: null, youtube_channel_url: null, backtest_2022_2023: unavailable, note: "NO owned audio feed — Doomberg is a Substack newsletter; only guest appearances on other shows. Consider ingesting the written Substack (text) or drop from the audio set."}
  - {id: pod-bitcoinlayer, rss_url: "https://feeds.simplecast.com/Y2219Riv", youtube_channel_url: "https://www.youtube.com/@TheBitcoinLayer", backtest_2022_2023: youtube_only, note: "RSS ~25 recent (partial); show since 2022. 2022-2023 via YouTube @TheBitcoinLayer. cluster_capped_low."}
  - {id: pod-whatbitcoindid, rss_url: "https://feeds.acast.com/public/shows/69d4f193b76468caacc5068f", youtube_channel_url: "https://www.youtube.com/@WhatBitcoinDid", backtest_2022_2023: rss_full, note: "FULL via RSS (Acast): 1060 eps to 2017; 155 in 2022, 156 in 2023. cluster_capped_low. (Peter McCormack show, ID 1317356120 — not the Danny Knowles spinoff.)"}
  - {id: pod-allin, rss_url: "https://rss.libsyn.com/shows/254861/destinations/1928300.xml", youtube_channel_url: "https://www.youtube.com/@allin", backtest_2022_2023: rss_full, note: "FULL via RSS: 384 eps to 2020; 61 in 2022, 50 in 2023. Consensus barometer."}
  - {id: pod-iltb, rss_url: "https://feeds.megaphone.fm/investlikethebest", youtube_channel_url: "https://www.youtube.com/@ILTB_Podcast", backtest_2022_2023: rss_full, note: "FULL via RSS: 582 eps to 2016; 57 in 2022, 53 in 2023."}
  - {id: pod-lex, rss_url: "https://lexfridman.com/feed/podcast/", youtube_channel_url: "https://www.youtube.com/@lexfridman", backtest_2022_2023: youtube_only, note: "RSS truncated to ~19 recent; show since 2018. 2022-2023 via YouTube @lexfridman or lexlib.io index."}
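The key finding above says truncated feeds need yt-dlp channel pulls bounded by `--dateafter`/`--datebefore`. A minimal sketch of how a loader might turn a `youtube_only` feed entry into such a command. The function name `ytdlp_backfill_cmd` and the `audio-cache/<feed id>/` output layout are assumptions for illustration, not taken from the repo; the sketch only builds the argument list and does not run yt-dlp.

```python
from datetime import date
from typing import List, Optional


def ytdlp_backfill_cmd(feed: dict, start: date, end: date) -> Optional[List[str]]:
    """Build a yt-dlp argv for a feed tagged youtube_only; return None otherwise.

    Hypothetical helper: the repo's real loader may look different.
    """
    if feed.get("backtest_2022_2023") != "youtube_only":
        return None  # rss_full / rss_2023_only / launched_later / unavailable
    url = feed.get("youtube_channel_url")
    if not url:
        return None
    return [
        "yt-dlp",
        "--dateafter", start.strftime("%Y%m%d"),   # yt-dlp expects YYYYMMDD
        "--datebefore", end.strftime("%Y%m%d"),
        "-x",                                      # extract audio only
        # Output template; the directory layout is an assumption.
        "-o", f"audio-cache/{feed['id']}/%(upload_date)s-%(id)s.%(ext)s",
        url,
    ]


# Usage with one entry from the manifest above.
lex = {
    "id": "pod-lex",
    "youtube_channel_url": "https://www.youtube.com/@lexfridman",
    "backtest_2022_2023": "youtube_only",
}
cmd = ytdlp_backfill_cmd(lex, date(2022, 1, 1), date(2023, 12, 31))
```

Feeds tagged `rss_full` or `rss_2023_only` return `None` here because their back-catalog is already reachable through the RSS URL.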
@@ -0,0 +1,26 @@
# PRE-REGISTERED resolution — BATTERY 2022 bitcoin-as-collateral-credit thesis (D1). Two-sided.
# DESIGN_v2.1 condition 2: hostile-checker-proof — named institutions, dollar figures, dated events.
# Instrument = the two-sided net-corroboration (affirms − denies) trajectory per derivative
# (condition 3): the engine should show DEMAND rising while SUPPLY stays flat — the "half-confirmed,
# load-bearing half not moving" disconfirmation, NOT clear the supply derivatives early.
thesis: "Bitcoin-as-collateral credit goes mainstream; institutional/incumbent capital funds it at scale within 24-36 months (>=1 major institution enters)."
window: {start: "2022-01-01", end: "2024-12-31"}
metric_type: adoption_evidence
criteria:
  BATTERY-demand-borrower-appetite:
    expected: rising
    confirm_iff: ">=3 NAMED active BTC-collateralized credit products/originators operating by 2024-12-31 (e.g. Unchained, Ledn, Strike lending, Battery, Salt, debifi) — countable, datable."
    checks: ["count of named BTC-collateralized lenders active 2024 with origination", "any public origination-volume figures"]
  BATTERY-institutional-supply:
    expected: flat
    confirm_iff: "A NAMED institution/bank committed a stated DOLLAR figure of lending capital to BTC-collateralized credit AT SCALE (>$100M) by 2024-12-31."
    checks: ["named institutional capital provider + committed $ to BTC-collateralized lending (yes/no + figure)", "BTC-credit fund AUM from institutional LPs — figure"]
  BATTERY-incumbent-entry:
    expected: not_yet
    confirm_iff: "A top-50 (by assets) traditional bank/financial institution PUBLICLY entered BTC-collateralized lending by 2024-12-31 (named institution + dated announcement)."
    checks: ["named top-50 bank offering/funding BTC-collateralized loans — yes/no + date"]
  BATTERY-custody-policy-enablement:
    expected: late_edge
    confirm_iff: "SAB-121 repealed/rescinded AND bank BTC-custody cleared WITHIN window (<=2024-12-31)."
    checks: ["SAB-121 status + exact date (pre-registered fact: issued 2022-03-31; rescinded by SAB-122 on 2025-01-23 — JUST PAST the window, so this resolves 'arrived at the edge, not in window')"]
prior_expectation: "EARLY on the SUPPLY axis: demand rose (multiple named lenders), institutional supply stalled through 2024 (no named top-50 incumbent at scale), the policy catalyst (SAB-121 repeal) landed Jan 2025 — just past window. The two-sided scorer should surface the supply-side DISCONFIRMATION."
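The instrument named in the header comments is a two-sided net-corroboration trajectory: affirming claims minus denying claims, tracked per derivative over time. A minimal sketch of that bookkeeping follows. It assumes each claim is a `(period, node_id, stance)` tuple with stance `"affirms"` or `"denies"`, which is a hypothetical record shape, and it applies no source weighting or independence discount. The sample claims are toy data invented to show the expected "demand rising, supply flat" shape; they are not engine output.

```python
from collections import defaultdict
from itertools import accumulate
from typing import Dict, Iterable, List, Tuple

Claim = Tuple[str, str, str]  # (period, node_id, stance), a hypothetical shape


def net_corroboration(claims: Iterable[Claim]) -> Dict[str, List[Tuple[str, int]]]:
    """Return, per node, the cumulative (affirms - denies) count by period."""
    per_node: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for period, node_id, stance in claims:
        if stance not in ("affirms", "denies"):
            raise ValueError(f"unknown stance: {stance!r}")
        per_node[node_id][period] += 1 if stance == "affirms" else -1
    trajectories = {}
    for node_id, by_period in per_node.items():
        periods = sorted(by_period)  # period labels must sort chronologically
        running = accumulate(by_period[p] for p in periods)
        trajectories[node_id] = list(zip(periods, running))
    return trajectories


# Toy claims (invented for illustration only).
toy_claims = [
    ("2022-H1", "BATTERY-demand-borrower-appetite", "affirms"),
    ("2022-H2", "BATTERY-demand-borrower-appetite", "affirms"),
    ("2022-H2", "BATTERY-demand-borrower-appetite", "affirms"),
    ("2023-H1", "BATTERY-demand-borrower-appetite", "affirms"),
    ("2022-H2", "BATTERY-institutional-supply", "affirms"),
    ("2022-H2", "BATTERY-institutional-supply", "denies"),
    ("2023-H1", "BATTERY-institutional-supply", "denies"),
    ("2023-H1", "BATTERY-institutional-supply", "affirms"),
]
trajectories = net_corroboration(toy_claims)
```

On this toy input the demand trajectory climbs period over period while the supply trajectory stays at zero, which is the pattern the pre-registration expects the scorer to surface rather than clear.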
@@ -0,0 +1,20 @@
# PRE-REGISTERED resolution criteria for the §7.1 backtest confusion matrix (DESIGN_v2 §1.1).
# Committed BEFORE pulling any price data. Uniform rule for ALL derivatives:
#   - equal-weight, start-normalized basket index over 2023-01-01 .. 2025-06-30
#   - "confirmed real" iff index first hits >= +40% vs 2023-01 baseline on date D AND still >= +25% at D+90d
#   - repricing_date = D
# Baskets are best-judgment liquid proxies for each derivative's real-world exposure, drawn from the
# §7.3 source universe. They are NOT tuned to outcomes. Grant/dev may revise the baskets — but doing so
# is a logged pre-registration change (DESIGN_v2 §4), not a quiet tune.
window: {start: "2023-01-01", end: "2025-06-30"}
rule: {threshold_pct: 40, hold_pct: 25, hold_days: 90}
baskets:
  K2023: [CEG, VST, TLN, NEE, GEV, VRT, PWR, CCJ] # root: broad power-infra build
  K2023-picks-and-shovels: [CEG, VST, TLN, NEE, GEV, VRT, PWR, CCJ] # broad power-infra basket
  K2023-uranium: [CCJ] # Cameco = the liquid uranium proxy in-corpus
  K2023-nuclear: [CEG, TLN, VST] # nuclear-heavy IPPs
  K2023-utilities-repriced: [CEG, VST, NEE, TLN] # IPPs / utilities with firm generation
  K2023-grid-interconnect: [PWR, GEV] # grid build / interconnection (Quanta, GE Vernova)
  K2023-transformers: [GEV, VRT] # electrical equipment
  K2023-gas-turbines: [GEV] # GE Vernova gas turbines
  K2023-cooling: [VRT] # Vertiv = data-center thermal/cooling proxy
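The uniform rule above (equal-weight, start-normalized index; first hit at +40% on date D; still at +25% at D+90d) can be sketched in a few lines. This is one literal reading under stated assumptions, not the repo's implementation: the index uses only dates present for every ticker, only the first crossing date D is tested, and "at D+90d" is taken as the first observation on or after D plus 90 calendar days. The function names, tickers `AAA`/`BBB`, and prices are made up for illustration.

```python
from datetime import date, timedelta
from typing import Dict, Optional

Series = Dict[date, float]


def basket_index(prices: Dict[str, Series]) -> Series:
    """Equal-weight, start-normalized index: mean of price / first price per ticker."""
    common = sorted(set.intersection(*(set(s) for s in prices.values())))
    start = common[0]
    return {
        d: sum(s[d] / s[start] for s in prices.values()) / len(prices)
        for d in common
    }


def repricing_date(index: Series, threshold_pct: float = 40,
                   hold_pct: float = 25, hold_days: int = 90) -> Optional[date]:
    """Return D if the derivative is 'confirmed real' under the rule, else None."""
    days = sorted(index)
    base = index[days[0]]
    # D = first date the index is at or above the threshold gain.
    hit = next((d for d in days if index[d] >= base * (1 + threshold_pct / 100)), None)
    if hit is None:
        return None
    # Assumption: use the first observation on or after D + hold_days.
    later = [d for d in days if d >= hit + timedelta(days=hold_days)]
    if not later:
        return None  # no data at D + hold_days, so the hold cannot be confirmed
    held = index[later[0]] >= base * (1 + hold_pct / 100)
    return hit if held else None


# Toy prices (invented): both tickers gain 50%, then settle at +30%.
d0, d1, d2 = date(2023, 1, 3), date(2023, 6, 1), date(2023, 9, 1)
toy = {"AAA": {d0: 100.0, d1: 150.0, d2: 130.0},
       "BBB": {d0: 50.0, d1: 75.0, d2: 65.0}}
confirmed_on = repricing_date(basket_index(toy))
```

Testing only the first crossing keeps the rule mechanical, which matches the "not a quiet tune" intent; whether a later crossing should be allowed to confirm is a design choice the source does not spell out.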
@@ -0,0 +1,22 @@
# PRE-REGISTERED resolution — STRIKE 2022 payments-network thesis (adversarial NEGATIVE).
# DESIGN_v2.1 condition 2: indicators must survive a HOSTILE checker — named, countable, datable
# things a disinterested third party could look up and get the SAME yes/no. No "material scale" hand-waving.
# Committed before evidence-gathering. The instrument for the engine output is NOT runway — it is the
# two-sided net-corroboration (affirms − denies) trajectory (DESIGN_v2.1 condition 3): a PASS for Strike
# looks like the engine STAYING QUIET (no clear) or the disconfirming side accumulating, NOT clearing early.
thesis: "Bitcoin/Lightning becomes a retail payments network that materially disrupts the card rails."
window: {start: "2022-01-01", end: "2024-12-31"}
metric_type: adoption_evidence
criteria:
  STRIKE-merchant-lightning-integration:
    # The 2022 memo promised specific integrations. Datable: did each ship Lightning RETAIL acceptance
    # at general availability (not a pilot/press release) by 2024-12-31? Three named yes/no checks.
    confirm_iff: ">=2 of {NCR/Aloha PoS, Shopify, Blackhawk Network} shipped Lightning retail payment acceptance at GENERAL AVAILABILITY with sustained merchant usage by 2024-12-31."
    checks: ["NCR (now NCR Voyix/Aloha) Lightning retail acceptance GA? yes/no", "Shopify native Lightning checkout GA at scale? yes/no", "Blackhawk Network Lightning retail acceptance GA? yes/no"]
  STRIKE-lightning-retail-acceptance:
    confirm_iff: "A NAMED, citable figure shows Lightning-settled RETAIL payment volume > $1B/year by 2024, OR Strike publicly reports merchant-acceptance/payments as a material (>10% of revenue) line."
    checks: ["Lightning Network public retail payment-volume estimate (bitcoinvisuals / River Lightning report) — figure + date", "Strike public revenue/product mix disclosure — merchant payments a named line? yes/no"]
  STRIKE-card-rail-disruption:
    confirm_iff: "Any Visa/Mastercard 10-K/earnings disclosure attributing SUSTAINED US retail payment-volume erosion to bitcoin/Lightning by 2024-12-31."
    checks: ["Visa FY22-24 10-K: bitcoin/Lightning named as a volume-erosion factor? yes/no", "Mastercard FY22-24 10-K: same? yes/no"]
prior_expectation: "FAILED. Strike's growth came from exchange + BTC-collateralized lending, not retail payments. The PRECISION TEST: if the engine CLEARS any of these on bitcoin-cluster (own_network-tagged) chatter while the named checks are NO, that is the reflexive false positive the discipline must catch."
@@ -0,0 +1,37 @@
# HOSTILE-VERIFIED outcome labels for the adversarial cases (gathered AFTER the criteria in
# resolution.{STRIKE,BATTERY}2022.yaml were frozen — correct pre-registration order). Each was
# researched then independently re-verified by a skeptic agent to the "survive a hostile checker"
# standard. `played_out` is normalized to the THESIS reality (yes = the derivative came true), not the
# agent's verdict word (which was polarity-inconsistent on the card-erosion check).
STRIKE2022: # payments thesis — verified DEAD on all three (the clean negative we needed)
  STRIKE-merchant-lightning-integration:
    played_out: no
    evidence: "Only Shopify reached GA (Strike app, 8 lifetime reviews in 3 yrs = trivial). Blackhawk: Strike CEO's own 2022-12-31 update = 'final testing, planned early-2023 launch at 50 locations' — never confirmed live. NCR/Aloha: no ship, no timeline. <2 of 3 at scale."
    cite: ["jimmymow.medium.com/strike-commerce-update", "apps.shopify.com/strike", "fortune.com/crypto/2024/04/24 (demand was trading/custody, not POS)"]
  STRIKE-lightning-retail-acceptance:
    played_out: no
    evidence: "No named retail Lightning volume >$1B/yr in 2024. Strike's only disclosure ($6B 2024 'payments volume', Apr-2025) is unsegmented and dominated by brokerage buy/sell + remittance. River: total Lightning (all uses) only hit ~$1B/MONTH in late 2025."
    cite: ["news.bitcoin.com/bitcoin-payments-firm-strike-grew-600-in-2024", "River Lightning report"]
  STRIKE-card-rail-disruption:
    played_out: no
    evidence: "ZERO erosion attribution. Full-text search of all 5 as-filed 10-Ks (Visa FY22-24, MC FY23-24): 'bitcoin' and 'lightning' appear 0 times; crypto only in generic forward-looking risk lists. Card networks grew."
    cite: ["SEC EDGAR Visa/Mastercard 10-Ks"]
BATTERY2022: # D1 — demand REAL, supply FAILED, + an instructive milestone-vs-substance wrinkle
  BATTERY-demand-borrower-appetite:
    played_out: yes
    evidence: ">=4 named BTC-collateralized originators active through 2024 (Unchained ~$1B cumulative by 2025; Ledn ~$392M FY24 origination; plus Strike lending, Salt, debifi, AnchorWatch)."
    cite: ["thebloc / Ledn", "Unchained disclosures"]
  BATTERY-institutional-supply:
    played_out: no
    binding_constraint: capital_provider_willingness # NOT regulation — see custody-policy note below
    evidence: "No named institution DEPLOYED >$100M of BTC-collateralized lending capital at scale IN 2024. Cantor Fitzgerald's $2B program announced 2024-07-27 but first loans (FalconX, Maple) closed 2025-05-27 — zero deployed in window. CRITICAL CAUSAL NOTE: the constraint was capital-provider WILLINGNESS, not regulatory permissibility — nothing legally restricted dollar holders from funding these loans. Proof: the 2025 regulatory unblock (SAB-122) did NOT produce institutional dollars; Battery remains unraised as of 2026. Measure supply as actual committed/deployed capital, never as the regulatory enabler."
    cite: ["Cantor BTC-lending announcements", "Battery raise status thru 2026 (unraised)"]
  BATTERY-incumbent-entry:
    played_out: token # IMPORTANT WRINKLE: milestone technically met EARLY, but token — NOT the at-scale thesis
    evidence: "Goldman Sachs (5th-largest US bank) executed ONE bitcoin-collateralized loan ~late-Apr-2022 (on-record spokeswoman, CoinDesk 2022-04-28). A one-off facility, not scaled BTC-lending entry. The D1 milestone phrasing ('>=1 major institution enters') resolves YES on this; the SUBSTANCE ('institutional capital at scale') resolves NO. Same reality, opposite verdicts by phrasing."
    cite: ["coindesk.com 2022-04-28 (Goldman BTC-collateralized loan)"]
  BATTERY-custody-policy-enablement:
    played_out: edge
    axis: context # ENABLER/CONTEXT, not the supply resolver — weight 0 into the supply score
    evidence: "SAB-121 issued 2022-03-31; rescinded by SAB-122 on 2025-01-23 — just PAST the 2024-12-31 window. But this axis is CONTEXT, not supply: the post-window unblock did NOT produce capital for Battery (still unraised in 2026), demonstrating regulation was never the binding constraint on Battery's supply leg. Retained here only to show the policy-YES / supply-NO divergence — an engine must NOT read this enabler as supply arriving (that's the false-positive S1 exists to catch)."
    cite: ["sec.gov SAB-121 / SAB-122", "Battery still unraised post-SAB-122 (2026)"]
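These labels are what the precision test scores against: a node the engine clears while its `played_out` label is `no` is a candidate false positive. A minimal sketch of that comparison follows. The `PLAYED_OUT` values are copied from the labels above; the `cleared` set standing in for engine output and the bucket names are hypothetical, and routing `token` and `edge` to a `needs_review` bucket is an assumption, since those labels are deliberately not a clean yes/no.

```python
from typing import Dict, List, Set

# played_out values as recorded in the labels file above.
PLAYED_OUT: Dict[str, str] = {
    "STRIKE-merchant-lightning-integration": "no",
    "STRIKE-lightning-retail-acceptance": "no",
    "STRIKE-card-rail-disruption": "no",
    "BATTERY-demand-borrower-appetite": "yes",
    "BATTERY-institutional-supply": "no",
    "BATTERY-incumbent-entry": "token",
    "BATTERY-custody-policy-enablement": "edge",
}


def classify_clears(cleared: Set[str],
                    labels: Dict[str, str] = PLAYED_OUT) -> Dict[str, List[str]]:
    """Bucket each labeled node by (engine cleared?) x (played_out label)."""
    buckets: Dict[str, List[str]] = {
        "true_positive": [], "false_positive": [],
        "missed": [], "true_negative": [], "needs_review": [],
    }
    for node_id, outcome in sorted(labels.items()):
        was_cleared = node_id in cleared
        if outcome == "yes":
            key = "true_positive" if was_cleared else "missed"
        elif outcome == "no":
            key = "false_positive" if was_cleared else "true_negative"
        else:
            # "token" / "edge": assumption, flag for a human instead of scoring.
            key = "needs_review"
        buckets[key].append(node_id)
    return buckets


# Hypothetical engine output: one correct clear, one reflexive false positive.
result = classify_clears({"BATTERY-demand-borrower-appetite",
                          "STRIKE-card-rail-disruption"})
```

Keeping `token` and `edge` out of the yes/no cells avoids forcing the Goldman one-off loan or the post-window SAB-122 rescission into a verdict the labels themselves decline to give.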
@@ -0,0 +1,13 @@
# River research docs to ingest (verified fetchable + TEXT-extractable, workflow wxlh2oinb).
# Image-based PDFs (river-lightning-report.pdf 2022, river-bitcoin-adoption-report-2025/2026.pdf) are
# OMITTED — zero text layer (pypdf extracts nothing); revisit with OCR if needed.
# Loaded by: signal_engine ingest-doc-manifest --file seeds/river_docs.manifest.yaml
docs:
  # --- LIGHTNING / PAYMENTS (most Strike-thesis-relevant, in-window) ---
  - {source: src-river, method: html, date: "2023-10-10", title: "River Lightning Report 2023 — The Lightning Network Grew 1212% in 2 Years", url: "https://river.com/content/the-lightning-network-in-2023"}
  - {source: src-river, method: pdf, date: "2023-06-14", title: "River Payments Report — Bitcoin vs the $156 Trillion Global Payments Industry", url: "https://river.com/learn/files/river-payments-report.pdf"}
  # --- ADOPTION / THESIS research ---
  - {source: src-river, method: html, date: "2024-09-04", title: "River — Why Business Bitcoin Adoption Grew by 30% in 1 Year", url: "https://river.com/content/business-bitcoin-adoption-2024"}
  - {source: src-river, method: pdf, date: "2024-11-14", title: "River — Entering The Dual Money Era", url: "https://river.com/learn/files/river-dual-money-era-2024.pdf"}
  - {source: src-river, method: html, date: "2023-11-21", title: "River — Is There a Best Time and Day to DCA Bitcoin?", url: "https://river.com/content/best-time-and-day-to-dca-bitcoin"}
  - {source: src-river, method: html, date: "2022-12-08", title: "River — What Could Bitcoin Mining Look Like at One Zettahash?", url: "https://river.com/content/what-could-bitcoin-mining-look-like-at-one-zettahash"}
@@ -0,0 +1,44 @@
|
||||
# EISC connectedness edges for the bitcoin cluster (workflow wd2a9zb9e, 2026-06-08).
|
||||
# These are PRIORS so the independence model discounts shared-guest/citation/community overlap BEFORE
|
||||
# transcription auto-detects it. The transcribe_worker upserts onto the SAME PK (it stores sorted([a,b]),
|
||||
# weight += 1.0 on conflict), so seeding in sorted order means real detections accumulate — no reversed dup.
|
||||
# Math is undirected (frozenset); kappa is applied in-code {shared_guest 0.85, citation 0.45, community 0.60}.
|
||||
# weight = per-edge strength multiplier (1.0 = one strong overlap; clamp 0.95). Loaded idempotently (DO NOTHING).
|
||||
# RULE: an own_network host (Ten31 portfolio/partner) is QUARANTINED in live mode regardless of edges;
|
||||
# edges still matter in TEST mode and for independent<->independent discounting.
|
||||
edges:
|
||||
# --- host-identity / co-host core (the Ten31 own_network orbit; near-total redundancy) ---
|
||||
- {a: pod-tftc, b: pod-rabbitholerecap, type: shared_guest, weight: 1.0, evidence: "Marty Bent hosts both (host identity)"}
|
||||
- {a: pod-citadeldispatch, b: pod-rabbitholerecap, type: shared_guest, weight: 1.0, evidence: "Matt Odell hosts both (host identity)"}
|
||||
- {a: pod-tftc, b: pod-citadeldispatch, type: shared_guest, weight: 1.0, evidence: "Bent+Odell co-host RHR weekly + constant guest-swap"}
|
||||
- {a: pod-tftc, b: pod-rabbitholerecap, type: citation, weight: 1.0, evidence: "RHR is the weekly recap/companion within the TFTC network; re-surfaces TFTC segments"}
|
||||
# --- What Bitcoin Did <-> Ten31-orbit core (own_network=false but heavy host-level overlap) ---
|
||||
- {a: pod-whatbitcoindid, b: pod-citadeldispatch, type: shared_guest, weight: 1.0, evidence: "McCormack/Odell repeat cross-guests + conference panels 2022-2023 (Odell on WBD518)"}
|
||||
- {a: pod-whatbitcoindid, b: pod-tftc, type: shared_guest, weight: 1.0, evidence: "McCormack/Bent mutual guests + co-panelists 2022-2023"}
|
||||
- {a: pod-whatbitcoindid, b: pod-rabbitholerecap, type: shared_guest, weight: 1.0, evidence: "McCormack alongside Odell+Bent repeatedly; densest guest-swap triangle"}
|
||||
# --- Stephan Livera <-> core + WBD (connective tissue of the independent leg) ---
|
||||
- {a: pod-stephanlivera, b: pod-citadeldispatch, type: shared_guest, weight: 1.0, evidence: "Livera/Odell frequent mutual guests; shared Lightning/self-custody circuit"}
|
||||
- {a: pod-stephanlivera, b: pod-tftc, type: shared_guest, weight: 1.0, evidence: "Livera/Bent repeat cross-guests + co-panelists 2022-2023"}
|
||||
- {a: pod-stephanlivera, b: pod-rabbitholerecap, type: shared_guest, weight: 1.0, evidence: "Livera overlaps RHR hosts via constant mutual guesting"}
|
||||
- {a: pod-stephanlivera, b: pod-whatbitcoindid, type: shared_guest, weight: 1.0, evidence: "Top-tier interview hosts sharing the same revolving guest roster"}
|
||||
# --- Bitcoin Audible + Anita Posch peripheral (independent leg internal coupling) ---
- {a: pod-bitcoinaudible, b: pod-stephanlivera, type: shared_guest, weight: 1.0, evidence: "Guy Swann recurring on SLP / Lightning-essay circuit"}
- {a: pod-bitcoinaudible, b: pod-whatbitcoindid, type: community, weight: 1.0, evidence: "Same English-language BTC podcast sub-scene; overlapping audience/guest pool"}
- {a: pod-anitaposch, b: pod-stephanlivera, type: community, weight: 1.0, evidence: "Same advocacy sub-scene; peripheral (Global-South focus, separate roster)"}
- {a: pod-anitaposch, b: pod-bitcoinaudible, type: community, weight: 1.0, evidence: "Educational/advocacy BTC sub-scene overlap"}
# --- The Bitcoin Layer (macro node; edges-only, RSS dropped) ---
- {a: pod-bitcoinlayer, b: pod-whatbitcoindid, type: shared_guest, weight: 1.0, evidence: "Nik Bhatia recurring macro/rates guest on WBD 2022-2023"}
- {a: pod-bitcoinlayer, b: pod-stephanlivera, type: shared_guest, weight: 1.0, evidence: "Bhatia macro/Fed guest in SLP orbit 2022-2023"}
- {a: pod-bitcoinlayer, b: pod-stephanlivera, type: community, weight: 1.0, evidence: "Shared macro-leaning corner (rates/liquidity framing)"}
# --- NEW independent legs <-> core (the 'missing' edges: they share the same guest roster) ---
- {a: pod-citadeldispatch, b: pod-coinstories, type: shared_guest, weight: 1.0, evidence: "Odell guested Coin Stories; Brunell on shared interview circuit"}
- {a: pod-coinstories, b: pod-tftc, type: shared_guest, weight: 1.0, evidence: "Bent co-paneled w/ Brunell (Pomp panel etc.)"}
- {a: pod-coinstories, b: pod-whatbitcoindid, type: community, weight: 1.0, evidence: "Same high-reach BTC interview circuit; overlapping Mallers/Saylor/Marcus roster"}
- {a: pod-bitcoinstandard, b: pod-citadeldispatch, type: shared_guest, weight: 1.0, evidence: "Odell guested Bitcoin Standard ep #126 'Stacking Sats with Matt Odell' (Aug 2022)"}
- {a: pod-kevinrooke, b: pod-stephanlivera, type: community, weight: 1.0, evidence: "Shared Lightning-operator guest pool (Breez/Voltage/Amboss/River)"}
- {a: pod-kevinrooke, b: pod-tftc, type: community, weight: 1.0, evidence: "Shared Lightning-operator guest pool / scene overlap"}
- {a: pod-bitcoinmagazine, b: pod-citadeldispatch, type: community, weight: 1.0, evidence: "Odell historical BM contributor; shared contributor/guest pool"}
- {a: pod-simplybitcoin, b: pod-tftc, type: shared_guest, weight: 1.0, evidence: "Bent guested 'WE WILL WIN | Simply Bitcoin IRL'"}
# --- Quarantine fixtures <-> core (matter only in TEST mode; live-dropped as own_network) ---
- {a: pod-bitcoinreview, b: pod-citadeldispatch, type: shared_guest, weight: 1.0, evidence: "Odell recurring co-host of Bitcoin.Review (BR001-011 'ft. Odell')"}
- {a: pod-cafebitcoin, b: pod-citadeldispatch, type: shared_guest, weight: 1.0, evidence: "Odell (Swan advisor) ran 'Bitcoin Lightning Summit' on Cafe Bitcoin"}
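The "live-dropped as own_network" rule on the quarantine edges can be sketched as a filter over the edge list. This is a minimal sketch, not the engine's loader: the `Edge` class and `active_edges` helper are hypothetical names, and the assumption is that live mode drops any edge touching an `own_network` node while test mode keeps every edge. The quarantined ids mirror the `own_network: true` rows in the bitcoin-cluster source registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    a: str
    b: str
    type: str
    weight: float


# Ids flagged own_network: true in the bitcoin-cluster source registry.
OWN_NETWORK = {"pod-tftc", "pod-citadeldispatch", "pod-rabbitholerecap", "pod-bitcoinreview"}

# A small subset of the edges in the seed, for illustration only.
EDGES = [
    Edge("pod-bitcoinaudible", "pod-stephanlivera", "shared_guest", 1.0),
    Edge("pod-coinstories", "pod-tftc", "shared_guest", 1.0),
    Edge("pod-bitcoinreview", "pod-citadeldispatch", "shared_guest", 1.0),
    Edge("pod-kevinrooke", "pod-stephanlivera", "community", 1.0),
]


def active_edges(edges, own_network, *, live):
    """ASSUMED rule: live mode drops every edge touching a quarantined node; test mode keeps all."""
    if not live:
        return list(edges)
    return [e for e in edges if e.a not in own_network and e.b not in own_network]


live = active_edges(EDGES, OWN_NETWORK, live=True)
test = active_edges(EDGES, OWN_NETWORK, live=False)
print([f"{e.a}<->{e.b}" for e in live])
print(len(test))
```

Under this reading, an edge between an independent show and a quarantined one (for example `pod-coinstories` to `pod-tftc`) only carries weight in test mode, which is why those edges are recorded even though one endpoint is dropped in live runs.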
@@ -0,0 +1,29 @@
# Battery (bitcoin-collateralized lending) text-corpus sources — from verified manifest (workflow w7559rp1x, 2026-06-08).
# kind=filing (text docs, no transcription). cluster: 'credit' = the crypto-credit ecosystem; 'macro' = policy regulators.
# own_network=1 → Ten31-affiliated (quarantined in live: Ten31 hearing its own ecosystem). lineage captured in notes.
# AXIS DISCIPLINE: policy sources are CONTEXT — their claims map to the custody-policy-enablement fan-out node, never the
# institutional-supply node. SUPPLY resolves only on committed/deployed-capital claims (Galaxy/Coinbase/Cantor-funded).
sources:
# --- SUPPLY: committed/deployed capital + market data (the resolver tier) ---
- {id: bat-galaxy, name: "Galaxy Research", kind: filing, cluster: credit, role: IND, own_network: false, notes: "independent — State of Crypto Lending dataset + quarterly loan-book ($863M Q3'24). THE in-window deployed-dollar supply measure."}
- {id: bat-9fin, name: "9fin (credit data)", kind: filing, cluster: credit, role: IND, own_network: false, notes: "independent credit-data — names BTC-backed private-credit funders."}
- {id: bat-cantor, name: "Cantor Fitzgerald (PR)", kind: filing, cluster: credit, role: IND, own_network: false, notes: "primary — $2B program ANNOUNCED Jul-2024 (future tense, no funding source named). Announced≠funded anchor."}
- {id: bat-falconx, name: "FalconX (newsroom)", kind: filing, cluster: credit, role: IND, own_network: false, notes: "primary — closed FIRST Cantor financing May-2025 (out-of-window): proves in-window Cantor was announce-only."}
- {id: bat-maple, name: "Maple Finance (insights)", kind: filing, cluster: credit, role: IND, own_network: false, notes: "primary — first Cantor tranche May-2025 (out-of-window). Funding source = Cantor balance sheet."}
- {id: bat-twoprime, name: "Two Prime", kind: filing, cluster: credit, role: IND, own_network: false, notes: "primary/independent — sees $2B DEMAND, discloses ZERO deployed (the supply absence), Nov-2023 in-window."}
- {id: bat-battery, name: "Battery Finance / Newmarket (PR)", kind: filing, cluster: credit, role: IND, own_network: true, notes: "OWN_NETWORK — Battery is a Ten31 PARTNERSHIP. Launch Nov-2024: ONE funded deal, NO pooled committed-capital figure, NO named funding source. The absence is the signal."}
- {id: bat-bitcoinmag, name: "Bitcoin Magazine (business desk)", kind: filing, cluster: credit, role: IND, own_network: false, notes: "independent-but-aligned (bitcoin-advocacy media) — original interview re Battery launch; mechanics, no fund size."}
# --- DEMAND: originator disclosures ---
- {id: bat-unchained, name: "Unchained (blog)", kind: filing, cluster: credit, role: IND, own_network: true, notes: "OWN_NETWORK — Unchained is a Ten31 PORTFOLIO co. Demand: >$500M cumulative origination."}
- {id: bat-ledn, name: "Ledn (Open Book)", kind: filing, cluster: credit, role: IND, own_network: false, notes: "primary — monthly Open Book origination disclosures (>$10B cumulative)."}
- {id: bat-salt, name: "SALT Lending (blog)", kind: filing, cluster: credit, role: IND, own_network: false, notes: "primary — paginated blog archive (v2: crawl /blog/page/N/)."}
- {id: bat-debifi, name: "Debifi", kind: filing, cluster: credit, role: IND, own_network: true, notes: "OWN_NETWORK — Ten31 portfolio. Demand-side lender (v2: Medium RSS crawl)."}
# --- DOWNSTREAM echo (near-zero independence — event detection / corroboration only) ---
- {id: bat-coindesk, name: "CoinDesk", kind: filing, cluster: credit, role: DX, own_network: false, notes: "downstream — independent trade press; Cantor-funded + Two Prime coverage."}
- {id: bat-coinspeaker, name: "Coinspeaker", kind: filing, cluster: credit, role: DX, own_network: false, notes: "downstream — Coinbase institutional-lending (>$57M committed, Reg D) Sept-2023."}
- {id: bat-asr, name: "Asset Securitization Report", kind: filing, cluster: credit, role: IND, own_network: false, notes: "independent structured-credit — Milo crypto-mortgage 2022 (soft paywall)."}
# --- POLICY CONTEXT (axis=context — weight 0 into supply; maps to custody-policy fan-out node only) ---
- {id: bat-sec, name: "SEC (policy primaries)", kind: filing, cluster: macro, role: IND, own_network: false, notes: "axis=CONTEXT — SAB-121 (2022) / SAB-122 (2025). NOT a supply input."}
- {id: bat-fed, name: "Federal Reserve (policy)", kind: filing, cluster: macro, role: IND, own_network: false, notes: "axis=CONTEXT — SR 22-6 (crypto-collateralized loans), SR 23-7 Novel Activities."}
- {id: bat-fdic, name: "FDIC (policy)", kind: filing, cluster: macro, role: IND, own_network: false, notes: "axis=CONTEXT — FIL-16-2022 crypto prior-notification."}
- {id: bat-occ, name: "OCC (policy)", kind: filing, cluster: macro, role: IND, own_network: false, notes: "axis=CONTEXT — IL 1170/1179 (custody + supervisory-non-objection gate)."}
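The AXIS DISCIPLINE comment in this seed can be read as a routing rule. A minimal sketch follows, assuming policy sources are identified by `cluster: macro` and that claims carry a kind label. The node names come from the seed comments; the `route` helper, the claim-kind labels and the illustrative weight of 1.0 are hypothetical and not taken from the engine.

```python
from typing import Optional, Tuple

# Node names come from the seed comments; claim-kind labels are hypothetical.
SUPPLY_NODE = "institutional-supply"
POLICY_NODE = "custody-policy-enablement"
SUPPLY_CLAIM_KINDS = {"committed", "deployed"}


def route(cluster: str, claim_kind: str) -> Tuple[Optional[str], float]:
    """Return (fan-out node, weight into the supply axis) for one claim."""
    if cluster == "macro":
        # Policy sources are context only: they never feed the supply node.
        return POLICY_NODE, 0.0
    if claim_kind in SUPPLY_CLAIM_KINDS:
        # Only committed or deployed capital resolves supply (weight 1.0 is illustrative).
        return SUPPLY_NODE, 1.0
    # Announced-but-unfunded programs and demand disclosures do not resolve supply.
    return None, 0.0


print(route("credit", "deployed"))
print(route("credit", "announced"))
print(route("macro", "deployed"))
```

The second call is the "Announced≠funded" case from the Cantor row: an announcement alone contributes nothing to the supply node in this sketch.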
@@ -0,0 +1,31 @@
# Bitcoin-cluster sources for the STRIKE reflexivity backtest (DESIGN_v2.1). Resolved + verified.
# own_network=1 → the Ten31 orbit (Odell/Bent partners) = QUARANTINED (dropped in live EISC, test fixture).
# The independent leg (own_network: false) is the non-self-referential corroboration for the Strike test.
# All bitcoin cluster → cluster_capped_low (§4.5). rss_url verified to parse; FULL = reaches 2022-2023.
sources:
# --- OWN-NETWORK ORBIT (quarantined) ---
- {id: pod-tftc, name: "TFTC (Marty Bent)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: true, rss_url: "https://feeds.fountain.fm/ZwwaDULvAj0yZvJ5kdB9", channel_url: "https://www.youtube.com/@TFTC", backtest_2022_2023: rss_full, notes: "FULL: 112 eps 2022, 88 in 2023. Ten31 partner — own_network."}
- {id: pod-citadeldispatch, name: "Citadel Dispatch (Matt Odell)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: true, rss_url: "https://serve.podhome.fm/rss/c90e609a-df1e-596a-bd5e-57bcc8aad6cc", channel_url: "https://www.youtube.com/channel/UCoA72saVAuQ8hYCnBO0Lymw", backtest_2022_2023: rss_full, notes: "FULL: 53 eps 2022, 37 in 2023. Ten31 partner — own_network."}
- {id: pod-rabbitholerecap, name: "Rabbit Hole Recap (Odell + Bent)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: true, rss_url: "https://feeds.fountain.fm/0EAzqUaM4qqanDr1qNuK", channel_url: "https://www.youtube.com/@TFTC", backtest_2022_2023: rss_full, notes: "FULL: 230 eps 2022 (launched Apr 2022), 51 in 2023. Both Ten31 partners — own_network."}
# --- INDEPENDENT bitcoin / Lightning-payments discourse (the non-self-referential leg) ---
- {id: pod-stephanlivera, name: "Stephan Livera Podcast", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://anchor.fm/s/7d083a4/podcast/rss", channel_url: "https://www.youtube.com/channel/UCDqPIrJSzHyyJpmH6wnxVxA", backtest_2022_2023: rss_full, notes: "VERIFIED RSS FULL (2026-06-08): anchor.fm feed has 739 eps 2018-2026, 203 in 2022-2023 window — earlier 'truncated/youtube_only' note was wrong. Strong independent Lightning-payments discourse."}
- {id: pod-bitcoinaudible, name: "Bitcoin Audible (Guy Swann)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://feeds.castos.com/mj96z", channel_url: "https://www.youtube.com/channel/UClG-wqz-OuXfzbpqwJd3fVA", backtest_2022_2023: rss_full, notes: "FULL: 1375 eps since 2018, deep archive. Reads/narrates Lightning retail-payments essays. Independent."}
- {id: pod-anitaposch, name: "The Anita Posch Show", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://www.vodio.fr/rssmedias.php?valeur=1673", channel_url: "https://www.youtube.com/AnitaPosch", backtest_2022_2023: rss_partial, notes: "RESOLVED 2026-06-08: original bitcoinundco feed 302s to homepage (0 items); vodio re-host has 183 eps 2018-2025, 30 in 2022-2023 window (op3.dev->mp3 enclosures). Modest but real independent leg (Global South / payments focus)."}
# === EXPANSION (workflow wd2a9zb9e, 2026-06-08): feeds curl-verified + own_network cross-checked vs Ten31 portfolio/team by an adversarial pass. ===
# --- INDEPENDENT, HIGH Strike relevance (Lightning/retail-payments) — primary corroboration legs ---
- {id: pod-kevinrooke, name: "The Kevin Rooke Show", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://anchor.fm/s/71a8cc78/podcast/rss", channel_url: "https://www.youtube.com/@KevinRooke", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 145 eps, 127 in 2022-2023. MOST Lightning-laser-focused independent show (TLS series: Breez/BTCPay/OpenNode/Muun/Voltage/River). Highest signal-per-ep for payments thesis. No Ten31 tie."}
- {id: pod-whatbitcoindid, name: "What Bitcoin Did (Peter McCormack)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://feeds.acast.com/public/shows/69d4f193b76468caacc5068f", channel_url: "https://www.youtube.com/@WhatBitcoinDid", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 1060 eps, 311 in 2022-2023 (now branded 'The Peter McCormack Show', WBD### back-catalog intact). own_network=FALSE — McCormack NOT a Ten31 partner/portfolio (external capital is Winklevoss/Gemini); heavy guest-overlap w/ Odell/Bent core captured as shared_guest EDGES, not quarantine. Ran Lightning series + Mallers/Strike interviews."}
- {id: pod-bitcoinmagazine, name: "Bitcoin Magazine Podcast", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://anchor.fm/s/cefa18a0/podcast/rss", channel_url: "https://www.youtube.com/@BitcoinMagazine", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 788 eps, 304 in 2022-2023. BTC Inc (Nakamoto/Bailey orbit, NOT Ten31). Hosts 2022-23 Keroles/Lindner. Heavy Lightning/Mallers/merchant coverage. own_network=false."}
- {id: pod-coinstories, name: "Coin Stories (Natalie Brunell)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://rss.libsyn.com/shows/344543/destinations/2813255.xml", channel_url: "https://www.youtube.com/@CoinStoriesNatalieBrunell", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 500 eps, 155 in 2022-2023. Independent journalist; 2x Jack Mallers (Strike CEO), Marcus/Saylor on Lightning, El-Salvador/merchant eps. own_network=false (Odell/Bent only as guests)."}
- {id: pod-bitcoinstandard, name: "The Bitcoin Standard Podcast (Saifedean Ammous)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://rss.buzzsprout.com/1849151.rss", channel_url: "https://www.youtube.com/@saifedean", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 365 eps, 103 in 2022-2023. CORRECTION: in-window Strike relevance is MEDIUM not high (cited Mallers ep #89 is Nov-2021, out-of-window; in-window core is Austrian econ/Fiat Food). Independent. own_network=false."}
- {id: pod-simplybitcoin, name: "Simply Bitcoin", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://anchor.fm/s/717a2198/podcast/rss", channel_url: "https://www.youtube.com/@SimplyBitcoinTV", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 1689 eps, 574 in 2022-2023. Independent daily-news (Nico+Opti, not Swan). Recurring Mallers/Strike/El-Salvador/Lightning but daily-news frame → strike_relevance MEDIUM. own_network=false (Bent only as guest)."}
# --- Ten31 PORTFOLIO hosts (highest Strike density). pod-bitcoinreview stays own_network: QUARANTINE fixture, dropped in live EISC. pod-cafebitcoin RECLASSIFIED independent 2026-06-08 (immaterial Swan stake), no longer quarantined. ---
- {id: pod-cafebitcoin, name: "The Cafe Bitcoin Podcast (Swan)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://feeds.simplecast.com/H9Jmx_ko", channel_url: "https://www.youtube.com/@SwanBitcoin", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 648 eps, 495 in 2022-2023. own_network=FALSE — RECLASSIFIED 2026-06-08 per Grant: Ten31's Swan investment is IMMATERIAL → NOT a conflict → INDEPENDENT. Highest Lightning-payments density → strong INDEPENDENT leg for Strike."}
- {id: pod-bitcoinreview, name: "Bitcoin.Review (NVK / Coinkite)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: true, rss_url: "https://serve.podhome.fm/rss/7cd0202b-463c-5b2e-b252-d4845cb71466", channel_url: "https://www.youtube.com/@BitcoinReview", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 97 eps, 57 in 2022-2023 (launched Jun 2022). own_network=TRUE — host NVK is CEO of Coinkite, a Ten31 PORTFOLIO co (Ten31-led Series A); Odell recurring co-host. Dense Lightning/ecash/wallet content → QUARANTINE fixture."}
# --- INDEPENDENT, lower Strike relevance — BROAD bitcoin corpus (source rows recorded; NOT auto-queued for transcription yet) ---
- {id: pod-whatismoney, name: "The What is Money? Show (Robert Breedlove)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://feeds.simplecast.com/MLdpYXYI", channel_url: "https://www.youtube.com/@RobertBreedlove", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 600 eps, 314 in 2022-2023. Money-philosophy/macro dominant; intermittent Lightning → strike med-low. Independent (no Ten31 tie). BROAD corpus."}
- {id: pod-unchained-shin, name: "Unchained (Laura Shin)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://feeds.megaphone.fm/LSHML4761942757", channel_url: "https://www.youtube.com/@Unchained_pod", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 1174 eps, 303 in 2022-2023. NAME-COLLISION GUARD: Laura Shin's podcast is NOT Ten31's portfolio co 'Unchained Capital' (custody firm) — distinct entities. own_network=false. Broad multi-chain/crypto, Eth-heavy → strike medium. BROAD corpus."}
- {id: pod-bitcoinfundamentals, name: "Bitcoin Fundamentals (Preston Pysh)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://feeds.megaphone.fm/PPLLC8974708240", channel_url: "https://www.youtube.com/@PrestonPysh", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: combined We-Study-Billionaires feed; Bitcoin leg = BTC###-prefixed eps, 105 in 2022-2023. INGEST NOTE: needs a BTC### title filter (not yet supported by ingest-podcast) → defer. Pysh is GP at Ego Death Capital (rival BTC VC) → independent of Ten31. BROAD corpus."}
- {id: pod-pomp, name: "The Pomp Podcast (Anthony Pompliano)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://anchor.fm/s/b4841110/podcast/rss", channel_url: "https://www.youtube.com/@AnthonyPompliano", backtest_2022_2023: rss_full, notes: "VERIFIED rss_full: 1741 eps, 519 in 2022-2023. Broad business/crypto, light on Lightning retail-payments → strike low. Consensus barometer. Independent. BROAD corpus."}
# --- KEEP but DROP from RSS ingest: pod-bitcoinlayer 2022-2023 audio is YouTube-only (Soundwise/Simplecast feeds start 2024). Edge-node only unless YouTube path added. ---
- {id: pod-bitcoinlayer, name: "The Bitcoin Layer (Nik Bhatia)", kind: podcast, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, rss_url: "https://app.mysoundwise.com/rss/1665708396190s", channel_url: "https://www.youtube.com/c/thebitcoinlayer", backtest_2022_2023: youtube_only, notes: "youtube_only: all audio RSS feeds (Soundwise/Simplecast) start 2024; 2022-2023 back-catalog only on YouTube (chan UCDo6-SUypaXlTmH6AyrYBZA, launched Jun 2022). Macro/rates focus → strike low. own_network=false. NOT auto-queued; serves as shared_guest edge node into WBD/SLP."}
@@ -0,0 +1,8 @@
# River Research (river.com) — independent Bitcoin/Lightning research (workflow wxlh2oinb, 2026-06-08).
# own_network=FALSE per Grant: River IS a Ten31 portfolio co, but the stake is IMMATERIAL → not a
# conflict → INDEPENDENT (same materiality rule as Swan). Real Lightning transaction/usage DATA — a
# strong independent leg for the Strike payments thesis (tests "Lightning grew" vs "retail payments at scale").
# kind=filing (text docs). cluster=bitcoin. NOTE: River's report PDFs are largely IMAGE-BASED (no text
# layer) → ingest the HTML landing pages + the text-layer PDFs only; image-PDFs deferred (need OCR).
sources:
- {id: src-river, name: "River Research", kind: filing, cluster: bitcoin, role: IND, cluster_capped_low: true, own_network: false, channel_url: "https://river.com/research/", notes: "Independent (immaterial Ten31 stake). Bitcoin/Lightning usage research incl. annual Lightning Report (real tx data). Image-PDF reports deferred for OCR; HTML + text-PDFs ingested."}
@@ -0,0 +1,74 @@
# Source registry seed — handoff §7.3 (companies) + §7.4 (podcasts). VERIFY tickers/feeds at ingestion.
# `notes` flag backtest-era (2022-2023) coverage for §7.1, since some entities didn't exist yet.
# Podcast rss_url is left null where the real feed must still be resolved (some feeds truncate the
# back-catalog — resolution may need the show's archive feed or a YouTube back-catalog).
# Bitcoin-cluster podcasts are cluster_capped_low (§4.5 — most correlated with Ten31's own priors).

sources:
# ============================ COMPANIES (§7.3) ============================
# --- AI compute & hyperscalers → ai_tech ---
- {id: co-nvda, name: NVIDIA, kind: filing, cluster: ai_tech, ticker: NVDA}
- {id: co-googl, name: Alphabet, kind: filing, cluster: ai_tech, ticker: GOOGL}
- {id: co-msft, name: Microsoft, kind: filing, cluster: ai_tech, ticker: MSFT}
- {id: co-amzn, name: Amazon, kind: filing, cluster: ai_tech, ticker: AMZN}
- {id: co-meta, name: Meta, kind: filing, cluster: ai_tech, ticker: META}
- {id: co-avgo, name: Broadcom, kind: filing, cluster: ai_tech, ticker: AVGO}
- {id: co-tsm, name: TSMC, kind: filing, cluster: ai_tech, ticker: TSM, notes: "Foreign filer — files 20-F/6-K, NOT 10-K/10-Q; override forms at ingest"}
- {id: co-crwv, name: CoreWeave, kind: filing, cluster: ai_tech, ticker: CRWV, notes: "IPO 2025 — NO 2022-23 filings (backtest: absent, expected)"}
- {id: co-orcl, name: Oracle, kind: filing, cluster: ai_tech, ticker: ORCL}
# --- Energy & power (binding constraint) → energy ---
- {id: co-ceg, name: Constellation Energy, kind: filing, cluster: energy, ticker: CEG, notes: "Spun from Exelon Feb 2022 — backtest-era coverage from 2022"}
- {id: co-vst, name: Vistra, kind: filing, cluster: energy, ticker: VST}
- {id: co-tln, name: Talen Energy, kind: filing, cluster: energy, ticker: TLN, notes: "Relisted 2024 post-restructuring — thin pre-2024 public filings"}
- {id: co-gev, name: GE Vernova, kind: filing, cluster: energy, ticker: GEV, notes: "Spun from GE Apr 2024 — pre-2024 power data is inside GE filings"}
- {id: co-nee, name: NextEra Energy, kind: filing, cluster: energy, ticker: NEE}
- {id: co-ccj, name: Cameco, kind: filing, cluster: energy, ticker: CCJ, notes: "Foreign filer (Canada) — 40-F/6-K"}
- {id: co-vrt, name: Vertiv, kind: filing, cluster: energy, ticker: VRT}
- {id: co-pwr, name: Quanta Services, kind: filing, cluster: energy, ticker: PWR, notes: "watch — grid/interconnect picks-and-shovels"}
- {id: co-oklo, name: Oklo, kind: filing, cluster: energy, ticker: OKLO, notes: "watch; public 2024"}
- {id: co-smr, name: NuScale Power, kind: filing, cluster: energy, ticker: SMR, notes: "watch"}
# --- Mining <-> AI/HPC (energy-compute seam) → energy ---
- {id: co-corz, name: Core Scientific, kind: filing, cluster: energy, ticker: CORZ, notes: "Ch.11 Dec 2022, relisted Jan 2024 — messy 2022-23 filings"}
- {id: co-iren, name: IREN (Iris Energy), kind: filing, cluster: energy, ticker: IREN, notes: "IPO Nov 2021 — backtest-era OK; foreign filer 20-F"}
- {id: co-wulf, name: TeraWulf, kind: filing, cluster: energy, ticker: WULF, notes: "Public Dec 2021 — backtest-era OK"}
- {id: co-cifr, name: Cipher Mining, kind: filing, cluster: energy, ticker: CIFR, notes: "Public Aug 2021 — backtest-era OK"}
# --- Debasement <-> bitcoin (treasury/custody) → bitcoin ---
- {id: co-mstr, name: Strategy (MicroStrategy), kind: filing, cluster: bitcoin, ticker: MSTR}
- {id: co-coin, name: Coinbase, kind: filing, cluster: bitcoin, ticker: COIN}
- {id: co-xyz, name: Block, kind: filing, cluster: bitcoin, ticker: XYZ, notes: "Ticker SQ→XYZ (2025) — verify"}
- {id: co-hood, name: Robinhood Markets, kind: filing, cluster: bitcoin, ticker: HOOD, notes: "Crypto/fintech broker — alongside COIN/XYZ. Public Jul 2021, backtest-era OK."}
- {id: co-xxi, name: Twenty One, kind: filing, cluster: bitcoin, ticker: XXI, notes: "Formed 2025 — does not exist in backtest era"}
# --- Major banks (general corpus + Battery incumbent-entry/supply axis). New 'banks' cluster = genuine cross-cluster independence vs bitcoin. ---
- {id: co-jpm, name: JPMorgan Chase, kind: filing, cluster: banks, ticker: JPM}
- {id: co-bac, name: Bank of America, kind: filing, cluster: banks, ticker: BAC}
- {id: co-c, name: Citigroup, kind: filing, cluster: banks, ticker: C}
- {id: co-wfc, name: Wells Fargo, kind: filing, cluster: banks, ticker: WFC}
- {id: co-gs, name: Goldman Sachs, kind: filing, cluster: banks, ticker: GS, notes: "Battery incumbent-entry: executed ONE BTC-collateralized loan Apr-2022 (token, not at-scale)."}
- {id: co-ms, name: Morgan Stanley, kind: filing, cluster: banks, ticker: MS}

# ============================ PODCASTS (§7.4) ============================
# Macro/monetary
- {id: pod-oddlots, name: Odd Lots, kind: podcast, cluster: macro, role: IND, notes: "highest-independence cross-domain"}
- {id: pod-forwardguidance, name: Forward Guidance, kind: podcast, cluster: macro, role: DX}
- {id: pod-macrovoices, name: Macro Voices, kind: podcast, cluster: macro, role: DX, notes: "energy-heavy"}
- {id: pod-grantwilliams, name: The Grant Williams Podcast, kind: podcast, cluster: macro, role: IND}
- {id: pod-monetarymatters, name: Monetary Matters, kind: podcast, cluster: macro, role: DX, notes: "launched ~2024 — thin backtest-era coverage"}
- {id: pod-hiddenforces, name: Hidden Forces, kind: podcast, cluster: macro, role: IND, notes: "highest-independence cross-domain"}
# AI/tech
- {id: pod-dwarkesh, name: Dwarkesh Podcast, kind: podcast, cluster: ai_tech, role: DX, notes: "launched ~2023 — partial backtest-era coverage; highest-independence"}
- {id: pod-nopriors, name: No Priors, kind: podcast, cluster: ai_tech, role: DX, notes: "VC-consensus cluster — discount internal convergence"}
- {id: pod-latentspace, name: Latent Space, kind: podcast, cluster: ai_tech, role: DX, notes: "technical; launched ~2023"}
- {id: pod-cognitiverev, name: Cognitive Revolution, kind: podcast, cluster: ai_tech, role: DX}
- {id: pod-bg2, name: BG2, kind: podcast, cluster: vc_consensus, role: DX, notes: "launched ~2023; mild Ten31 correlation; VC-consensus"}
- {id: pod-a16z, name: a16z Podcast, kind: podcast, cluster: vc_consensus, role: DX, notes: "crypto correlation; VC-consensus"}
# Energy
- {id: pod-catalyst, name: Catalyst w/ Shayle Kann, kind: podcast, cluster: energy, role: DX}
- {id: pod-columbiaenergy, name: Columbia Energy Exchange, kind: podcast, cluster: energy, role: DX}
- {id: pod-doomberg, name: Doomberg, kind: podcast, cluster: energy, role: IND}
# Bitcoin (limited, capped low — §4.5)
- {id: pod-bitcoinlayer, name: The Bitcoin Layer, kind: podcast, cluster: bitcoin, role: DX, cluster_capped_low: true, notes: "macro-literate"}
- {id: pod-whatbitcoindid, name: What Bitcoin Did, kind: podcast, cluster: bitcoin, role: none, cluster_capped_low: true}
# Generalist
- {id: pod-allin, name: All-In, kind: podcast, cluster: vc_consensus, role: CB, notes: "consensus barometer"}
- {id: pod-iltb, name: Invest Like the Best, kind: podcast, cluster: generalist, role: DX, notes: "cross-domain"}
- {id: pod-lex, name: Lex Fridman, kind: podcast, cluster: generalist, role: none, notes: "wide reach, variable"}
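The foreign-filer and backtest-era notes on the company rows of this seed suggest two small lookups at ingest time. The sketch below is one possible reading, not the engine's ingest code: `forms_for` and `expected_in_backtest` are hypothetical helpers, the 10-K/10-Q default is inferred from the TSMC note, and the override and absence tables only restate what the `notes` fields say.

```python
# Default annual/quarterly forms for domestic filers (inferred from the TSMC note).
DEFAULT_FORMS = ("10-K", "10-Q")

# Overrides restated from the seed's `notes` fields.
FORM_OVERRIDES = {
    "TSM": ("20-F", "6-K"),   # foreign filer
    "CCJ": ("40-F", "6-K"),   # foreign filer (Canada)
    "IREN": ("20-F",),        # foreign filer per the seed note
}

# Tickers the seed marks as having no 2022-2023 filings at all.
ABSENT_IN_BACKTEST = {"CRWV", "XXI"}


def forms_for(ticker: str) -> tuple:
    """Return the filing forms to request for a ticker (hypothetical helper)."""
    return FORM_OVERRIDES.get(ticker.upper(), DEFAULT_FORMS)


def expected_in_backtest(ticker: str) -> bool:
    """False when an empty 2022-2023 result is expected rather than an ingest failure."""
    return ticker.upper() not in ABSENT_IN_BACKTEST


for t in ("NVDA", "TSM", "CCJ", "CRWV"):
    print(t, forms_for(t), expected_in_backtest(t))
```

Keeping the "expected absent" set separate from the form overrides lets a backtest distinguish a missing filing that signals a broken fetch from one that simply predates the company.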
@@ -0,0 +1,11 @@
"""Ten31 Signal Engine — pilot.

A recurring pipeline that ingests audio + text, extracts structured propositions
locally, and surfaces signal over time. The discipline that separates signal from
plausible-sounding noise (handoff §5): statistics & graph structure NOMINATE
candidates; the frontier model only JUDGES and FANS OUT a pre-filtered shortlist.

See README.md for the architecture and ten31-signal-engine-handoff.md for the spec.
"""

__version__ = "0.1.0"
@@ -0,0 +1,4 @@
from .cli import main

if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1 @@
"""Client-side backfill queue (§13.4). Producers enqueue; ONE worker drains sequentially."""
@@ -0,0 +1,123 @@
"""Backfill job queue over the `backfill_jobs` table (§13.4).

Model the corpus backfill as a managed GPU-hours queue, not a real-time fan-out. Producers
(ingestion) enqueue lightweight job descriptors; a SINGLE worker leases and drains them one at a
time so audio never goes parallel (→ 503). Jobs are:
  - idempotent: UNIQUE(job_type, input_hash); re-enqueue of seen content is a no-op.
  - crash-safe: leases expire, so a dead worker's job returns to the pool automatically.
  - prioritized: lower `priority` runs first (backtest corpus + filings jump ahead).

This is plain SQLite, so the whole queue can be inspected with `SELECT * FROM backfill_jobs`.
"""
from __future__ import annotations

import sqlite3
from typing import Any, Optional, Sequence

LEASE_SECONDS_DEFAULT = 600


def enqueue(
    conn: sqlite3.Connection,
    *,
    job_type: str,
    target_id: str,
    input_hash: str,
    parent_doc_id: str | None = None,
    priority: int = 100,
    max_attempts: int = 5,
) -> Optional[int]:
    """Insert a job. Returns job_id, or None if this (job_type, input_hash) is already queued/done
    (idempotent skip — §13.4)."""
    cur = conn.execute(
        """INSERT OR IGNORE INTO backfill_jobs
           (job_type, target_id, parent_doc_id, priority, max_attempts, input_hash, state)
           VALUES (?,?,?,?,?,?, 'pending')""",
        (job_type, target_id, parent_doc_id, priority, max_attempts, input_hash),
    )
    conn.commit()
    return cur.lastrowid if cur.rowcount else None


def lease_next(
    conn: sqlite3.Connection,
    *,
    worker_id: str,
    job_types: Sequence[str] | None = None,
    lease_seconds: int = LEASE_SECONDS_DEFAULT,
) -> Optional[sqlite3.Row]:
    """Claim the highest-priority eligible job. Eligible = pending, OR a running/leased
    job whose lease has expired (crash recovery). Increments `attempts`.

    The SELECT and UPDATE are separate statements, so the claim is race-free only under the
    single-worker model described in the module docstring."""
    params: list[Any] = []
    type_filter = ""
    if job_types:
        type_filter = f" AND job_type IN ({','.join('?' * len(job_types))})"
        params.extend(job_types)
    row = conn.execute(
        f"""SELECT job_id FROM backfill_jobs
            WHERE (state = 'pending'
                   OR (state IN ('leased','running')
                       AND lease_expires_at IS NOT NULL
                       AND lease_expires_at < datetime('now')))
            {type_filter}
            ORDER BY priority ASC, job_id ASC
            LIMIT 1""",
        params,
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        """UPDATE backfill_jobs
           SET state='running', lease_owner=?, lease_expires_at=datetime('now', ?),
               attempts=attempts+1, updated_at=datetime('now')
           WHERE job_id=?""",
        (worker_id, f"+{int(lease_seconds)} seconds", row["job_id"]),
    )
    conn.commit()
    return conn.execute("SELECT * FROM backfill_jobs WHERE job_id=?", (row["job_id"],)).fetchone()


def complete(conn: sqlite3.Connection, job_id: int, *, output_ref: str | None = None,
             gpu_seconds: float | None = None) -> None:
    conn.execute(
        """UPDATE backfill_jobs SET state='done', output_ref=?, gpu_seconds=?, error=NULL,
           updated_at=datetime('now') WHERE job_id=?""",
        (output_ref, gpu_seconds, job_id),
    )
    conn.commit()


def fail(conn: sqlite3.Connection, job_id: int, error: Any) -> str:
    """Retry (→ pending) if attempts remain, else dead-letter (→ failed). Returns the new state."""
    row = conn.execute(
        "SELECT attempts, max_attempts FROM backfill_jobs WHERE job_id=?", (job_id,)
    ).fetchone()
    exhausted = bool(row) and row["attempts"] >= row["max_attempts"]
    new_state = "failed" if exhausted else "pending"
    conn.execute(
        """UPDATE backfill_jobs SET state=?, error=?, lease_owner=NULL, lease_expires_at=NULL,
           updated_at=datetime('now') WHERE job_id=?""",
        (new_state, str(error)[:2000], job_id),
    )
    conn.commit()
    return new_state


def skip(conn: sqlite3.Connection, job_id: int, reason: str | None = None) -> None:
    """Terminal non-error skip (e.g. a chunk that produced zero claims is still 'done', but an
    intentionally dropped job is 'skipped')."""
    conn.execute(
        "UPDATE backfill_jobs SET state='skipped', error=?, updated_at=datetime('now') WHERE job_id=?",
        (reason, job_id),
    )
    conn.commit()


def stats(conn: sqlite3.Connection) -> dict[str, dict[str, int]]:
    rows = conn.execute(
        "SELECT job_type, state, COUNT(*) AS n FROM backfill_jobs GROUP BY job_type, state"
    ).fetchall()
    out: dict[str, dict[str, int]] = {}
    for r in rows:
        out.setdefault(r["job_type"], {})[r["state"]] = r["n"]
    return out
@@ -0,0 +1,619 @@
|
||||
"""Pilot CLI. Subcommands map to the build order in handoff §11.
|
||||
|
||||
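Implemented so far: setup (init-db, seed-*, load-feeds), ingest (ingest-*), workers (run-transcribe,
run-transcribe-gemini, run-extract), embed-claims/search, signals (backtest, two-sided, confusion-matrix),
and ops (queue-status, feed-peek, provenance, serve, db-tables, spark-status).
Later stages are added as they're built.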
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
import json
|
||||
import logging
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
from .config import load_config
|
||||
from .store import db
|
||||
from .store.seed import load_convictions, load_fanout
|
||||
from .store.sources import load_source_edges, load_sources, update_feeds
|
||||
|
||||
DEFAULT_CONVICTION_SEED = Path("seeds/conviction_log.seed.yaml")
|
||||
DEFAULT_SOURCES_SEED = Path("seeds/sources.seed.yaml")
|
||||
DEFAULT_FEEDS_SEED = Path("seeds/podcast_feeds.resolved.yaml")
|
||||
|
||||
|
||||
def _setup_logging(level: str) -> None:
|
||||
logging.basicConfig(level=getattr(logging, level.upper(), logging.INFO),
|
||||
format="%(asctime)s %(levelname)s %(name)s: %(message)s")
|
||||
|
||||
|
||||
def cmd_init_db(args: argparse.Namespace) -> int:
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
print(f"Initialized DB at {cfg.db_path}")
|
||||
print("Tables/views:", ", ".join(db.table_names(conn)))
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_seed_convictions(args: argparse.Namespace) -> int:
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn) # ensure schema exists
|
||||
path = Path(args.file)
|
||||
n = load_convictions(conn, path)
|
||||
print(f"Upserted {n} convictions from {path}")
|
||||
breakers = conn.execute(
|
||||
"SELECT conviction_id, thematic_proposition FROM conviction_log WHERE is_thesis_breaker = 1"
|
||||
).fetchall()
|
||||
if breakers:
|
||||
print("Thesis-breakers loaded (engine must surface these AGAINST the thesis, §5.7):")
|
||||
for b in breakers:
|
||||
print(f" {b['conviction_id']}: {b['thematic_proposition'][:80]}...")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_seed_sources(args: argparse.Namespace) -> int:
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
n = load_sources(conn, Path(args.file))
|
||||
by_kind = conn.execute(
|
||||
"SELECT kind, COUNT(*) n FROM sources GROUP BY kind ORDER BY kind"
|
||||
).fetchall()
|
||||
print(f"Upserted {n} sources from {args.file}")
|
||||
for r in by_kind:
|
||||
print(f" {r['kind']}: {r['n']}")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_seed_edges(args: argparse.Namespace) -> int:
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
n = load_source_edges(conn, Path(args.file))
|
||||
total = conn.execute("SELECT COUNT(*) FROM source_edges").fetchone()[0]
|
||||
print(f"Inserted {n} new edges from {args.file} ({total} edges total)")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_load_feeds(args: argparse.Namespace) -> int:
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
n = update_feeds(conn, Path(args.file))
|
||||
print(f"updated {n} podcast feeds")
|
||||
rows = conn.execute(
|
||||
"SELECT backtest_2022_2023, COUNT(*) c FROM sources WHERE kind='podcast' "
|
||||
"GROUP BY backtest_2022_2023 ORDER BY c DESC"
|
||||
).fetchall()
|
||||
print("backtest 2022-2023 reach:")
|
||||
for r in rows:
|
||||
print(f" {r['backtest_2022_2023'] or 'unset'}: {r['c']}")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_ingest_edgar(args: argparse.Namespace) -> int:
|
||||
from .ingest.edgar import EdgarClient, ingest_filings
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
client = EdgarClient(cfg.edgar_user_agent)
|
||||
forms = tuple(f.strip() for f in args.forms.split(",")) if args.forms else ("10-K", "10-Q", "8-K")
|
||||
|
||||
# resolve source_id from ticker (creates a lightweight source row if not seeded)
source_id = _resolve_source_id(conn, args.ticker)
|
||||
|
||||
n_docs, n_jobs = ingest_filings(conn, client, source_id=source_id, ticker=args.ticker,
|
||||
since=args.since, until=args.until, forms=forms)
|
||||
print(f"{args.ticker}: +{n_docs} filing documents, +{n_jobs} extract jobs queued "
|
||||
f"(forms={','.join(forms)}, since={args.since}, until={args.until})")
|
||||
return 0
|
||||
|
||||
|
||||
def _resolve_source_id(conn, ticker: str, kind: str = "filing") -> str:
|
||||
row = conn.execute("SELECT source_id FROM sources WHERE upper(ticker)=upper(?)", (ticker,)).fetchone()
|
||||
if row:
|
||||
return row["source_id"]
|
||||
source_id = f"co-{ticker.lower()}"
|
||||
conn.execute("INSERT OR IGNORE INTO sources (source_id, name, kind, ticker) VALUES (?,?,?,?)",
|
||||
(source_id, ticker.upper(), kind, ticker.upper()))
|
||||
conn.commit()
|
||||
return source_id
|
||||
|
||||
|
||||
def cmd_ingest_doc(args: argparse.Namespace) -> int:
|
||||
from .ingest.docs import ingest_one
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
doc_id = ingest_one(conn, cfg, source_id=args.source, url=args.url,
|
||||
title=args.title or args.url, date=args.date, method=args.method)
|
||||
print(f"ingested: {doc_id}" if doc_id else "no new doc (duplicate / too short / fetch failed)")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_ingest_feed_text(args: argparse.Namespace) -> int:
|
||||
from .ingest.docs import ingest_feed_text
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
n = ingest_feed_text(conn, cfg, source_id=args.source, rss_url=args.url,
|
||||
since=args.since, until=args.until, limit=args.limit)
|
||||
print(f"ingested {n} article docs from feed for {args.source}")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_ingest_doc_manifest(args: argparse.Namespace) -> int:
|
||||
from .ingest.docs import ingest_manifest
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
r = ingest_manifest(conn, cfg, Path(args.file))
|
||||
print(f"manifest: ingested={r['ingested']} skipped={r['skipped']} missing_source={r['missing_source']}")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_ingest_earnings(args: argparse.Namespace) -> int:
|
||||
from .ingest.earnings import FMPClient, ingest_for_ticker
|
||||
cfg = load_config()
|
||||
if not cfg.fmp_api_key:
|
||||
print("FMP_API_KEY not set", file=sys.stderr)
|
||||
return 1
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
fmp = FMPClient(cfg.fmp_api_key)
|
||||
source_id = _resolve_source_id(conn, args.ticker)
|
||||
n_docs, n_jobs = ingest_for_ticker(conn, fmp, source_id=source_id, symbol=args.ticker.upper(),
|
||||
data_dir=cfg.data_dir, since=args.since, until=args.until, limit=args.limit)
|
||||
print(f"{args.ticker}: +{n_docs} earnings transcripts, +{n_jobs} extract jobs (since={args.since}, until={args.until})")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_embed_claims(args: argparse.Namespace) -> int:
|
||||
from .spark import from_config
|
||||
from .embedstore.qdrant_store import get_client, ensure_collection, upsert_pending
|
||||
from .embedstore.embedder import SparseEmbedder
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
sc = from_config(cfg)
|
||||
client = get_client(args.qdrant_url)
|
||||
created = ensure_collection(client)
|
||||
print(f"collection {'created' if created else 'exists'}")
|
||||
sparse = SparseEmbedder() if not args.no_sparse else None
|
||||
n = upsert_pending(conn, sc, client, sparse)
|
||||
print(f"embedded + upserted {n} propositions (sparse={'on' if sparse and sparse.available else 'off'})")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_search(args: argparse.Namespace) -> int:
|
||||
from .spark import from_config
|
||||
cfg = load_config()
|
||||
sc = from_config(cfg)
|
||||
res = sc.search(args.query, collection="propositions", top_k=args.top_k, rerank=not args.no_rerank)
|
||||
hits = res.get("results") or res.get("hits") or res
|
||||
print(json.dumps(hits, indent=2)[:2500])
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_ingest_podcast(args: argparse.Namespace) -> int:
|
||||
from .ingest.podcasts import ingest_rss, ingest_youtube
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
src = conn.execute("SELECT * FROM sources WHERE source_id=?", (args.source,)).fetchone()
|
||||
if not src:
|
||||
print(f"unknown source {args.source}", file=sys.stderr)
|
||||
return 1
|
||||
via = args.via
|
||||
if via == "auto":
|
||||
via = "youtube" if (src["backtest_2022_2023"] == "youtube_only" and args.since) else "rss"
|
||||
fn = ingest_youtube if via == "youtube" else ingest_rss
|
||||
n_docs, n_jobs = fn(conn, src, since=args.since, until=args.until, limit=args.limit)
|
||||
print(f"{src['name']} via {via}: +{n_docs} episodes, +{n_jobs} transcribe jobs")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_run_transcribe(args: argparse.Namespace) -> int:
|
||||
from .spark import from_config
|
||||
from .ingest.transcribe_worker import run_transcribe
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
sc = from_config(cfg)
|
||||
result = run_transcribe(conn, sc, cfg, limit=args.limit, max_chunks=args.max_chunks)
|
||||
print(f"transcription: {result['jobs_processed']} jobs processed")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_run_transcribe_gemini(args: argparse.Namespace) -> int:
|
||||
from .ingest.gemini_transcribe import run_transcribe_gemini
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
r = run_transcribe_gemini(conn, cfg, limit=args.limit, concurrency=args.concurrency)
|
||||
tok_in, tok_out = r["prompt_tokens"], r["output_tokens"]
|
||||
# Gemini 2.5 Flash list price: ~$0.30/1M text-in, ~$1.00/1M audio-in, $2.50/1M out.
# Audio dominates the input, so every input token is priced at the audio rate (upper bound on input cost).
|
||||
est = tok_in / 1_000_000 * 1.00 + tok_out / 1_000_000 * 2.50
|
||||
print(f"gemini transcribe: done={r['done']} failed={r['failed']} | "
|
||||
f"tokens in={tok_in:,} out={tok_out:,} | ~${est:.2f} this run (≈${est/max(r['done'],1):.3f}/ep)")
|
||||
return 0
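The cost line printed above is plain per-million-token arithmetic. A small sketch of the same estimate follows; the function name and the sample token counts are illustrative assumptions, and the rates are the ones quoted in the comment above rather than anything fetched from a price list.

```python
def estimate_cost_usd(prompt_tokens, output_tokens, in_rate=1.00, out_rate=2.50):
    """Estimate spend in USD; rates are dollars per one million tokens."""
    return prompt_tokens / 1_000_000 * in_rate + output_tokens / 1_000_000 * out_rate


# Hypothetical run: 2M input tokens (mostly audio) and 400k output tokens over 10 episodes.
est = estimate_cost_usd(2_000_000, 400_000)
per_episode = est / max(10, 1)  # max(..., 1) avoids dividing by zero when nothing finished
print(f"~${est:.2f} this run (~${per_episode:.3f}/ep)")
```

Pricing all input at the audio rate slightly overstates cost when a prompt also carries text, which is the safer direction for a budget estimate.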
|
||||
|
||||
|
||||
def cmd_run_extract(args: argparse.Namespace) -> int:
|
||||
from .spark import from_config
|
||||
from .extract.worker import run_extract
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
sc = from_config(cfg)
|
||||
result = run_extract(conn, sc, cfg, limit=args.limit, max_chunks_per_doc=args.max_chunks)
|
||||
print(f"extraction: {result['jobs_processed']} jobs, {result['claims_written']} claims written")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_queue_status(args: argparse.Namespace) -> int:
|
||||
from .backfill import queue
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
s = queue.stats(conn)
|
||||
if not s:
|
||||
print("queue empty")
|
||||
return 0
|
||||
for job_type, states in sorted(s.items()):
|
||||
parts = ", ".join(f"{st}={n}" for st, n in sorted(states.items()))
|
||||
print(f" {job_type}: {parts}")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_feed_peek(args: argparse.Namespace) -> int:
|
||||
from .ingest.feeds import fetch_feed, episode_records
|
||||
parsed = fetch_feed(args.url)
|
||||
status = getattr(parsed, "status", None)
|
||||
recs = episode_records(parsed)
|
||||
print(f"status={status} bozo={getattr(parsed, 'bozo', None)} episodes_with_audio={len(recs)}")
|
||||
for r in recs[: args.limit]:
|
||||
print(f" [{r['published']}] {str(r['title'])[:70]}")
|
||||
if recs:
|
||||
print(f"oldest in feed: {recs[-1]['published']} newest: {recs[0]['published']}")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_serve(args: argparse.Namespace) -> int:
|
||||
import uvicorn
|
||||
from .ui.app import create_app
|
||||
cfg = load_config()
|
||||
port = args.port or cfg.ui_port
|
||||
print(f"serving corpus UI on http://0.0.0.0:{port}")
|
||||
uvicorn.run(create_app(), host="0.0.0.0", port=port)
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_seed_fanout(args: argparse.Namespace) -> int:
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
n = load_fanout(conn, Path(args.file))
|
||||
print(f"seeded {n} fan-out derivative nodes")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_backtest(args: argparse.Namespace) -> int:
|
||||
from .spark import from_config
|
||||
from .signals.run import run_backtest
|
||||
from datetime import datetime, timedelta
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
sc = from_config(cfg)
|
||||
# as_of march: one point every --step-days (default 30, i.e. roughly monthly)
|
||||
start = datetime.strptime(args.start, "%Y-%m-%d")
|
||||
end = datetime.strptime(args.end, "%Y-%m-%d")
|
||||
dates, d = [], start
|
||||
while d <= end:
|
||||
dates.append(d.strftime("%Y-%m-%d"))
|
||||
d = d + timedelta(days=args.step_days)
|
||||
print(f"§7.1 backtest: conviction={args.conviction}, as_of march {args.start}→{args.end} ({len(dates)} points)")
|
||||
timeline = run_backtest(conn, sc, cfg, conviction_id=args.conviction, dates=dates, window_days=args.window_days)
|
||||
|
||||
# report: per-node first-clear date + score trajectory; highlight the headline derivative
|
||||
print("\n=== node trajectories (score by as_of; ★=cleared evidence bar) ===")
|
||||
nodes = {}
|
||||
for as_of, res in timeline:
|
||||
for r in res:
|
||||
key = r["node"]["node_id"] or r["node"]["conviction_id"]
|
||||
nodes.setdefault(key, []).append((as_of, r["result"]["score"], r["evidence"], r["promotion"], r["result"]["inputs"]))
|
||||
for key, traj in sorted(nodes.items()):
|
||||
first = next((t for t in traj if t[2]), None)
|
||||
peak = max(traj, key=lambda t: t[1])
|
||||
mark = f"first-cleared {first[0]}" if first else "never cleared"
|
||||
print(f" {key:28} peak={peak[1]:.2f} {mark}")
|
||||
head = nodes.get(args.headline)
|
||||
if head:
|
||||
print(f"\n=== HEADLINE derivative: {args.headline} ===")
|
||||
for as_of, score, ev, pr, inp in head:
|
||||
star = "★" if ev else ("·" if score > 0 else " ")
|
||||
print(f" {as_of} {star} score={score:.2f} corrob={inp.get('corroboration',0)} "
|
||||
f"n_conf={inp.get('n_confirmed',0)} eisc={inp.get('eisc_corrob',0)} "
|
||||
f"a={inp.get('a_corrob',0)} k_eff={inp.get('k_eff0',0)}")
|
||||
firstclear = next((t for t in head if t[2]), None)
|
||||
print(f"\n VERDICT: headline power-infra derivative "
|
||||
f"{'SURFACED at ' + firstclear[0] if firstclear else 'did NOT surface'} "
|
||||
f"(bar = under_acted ≥ {0.3})")
|
||||
return 0
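The as-of march in `cmd_backtest` is a fixed-step walk from `--start` to `--end`. The sketch below isolates that loop; `as_of_dates` is a hypothetical helper name, and the arguments mirror the parser defaults.

```python
from datetime import datetime, timedelta


def as_of_dates(start, end, step_days):
    """Return ISO dates from start to end (inclusive) in fixed steps of step_days."""
    d = datetime.strptime(start, "%Y-%m-%d")
    stop = datetime.strptime(end, "%Y-%m-%d")
    out = []
    while d <= stop:
        out.append(d.strftime("%Y-%m-%d"))
        d += timedelta(days=step_days)
    return out


dates = as_of_dates("2023-01-01", "2024-06-01", 30)
print(len(dates), dates[:3], dates[-1])
```

A 30-day step drifts away from calendar month boundaries (the second point is 2023-01-31, not 2023-02-01), so the last point can land several days before `--end`.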
|
||||
|
||||
|
||||
def cmd_two_sided(args: argparse.Namespace) -> int:
|
||||
"""Two-sided net-corroboration trajectory (DESIGN_v2.1 H5) for the adversarial cases.
|
||||
BATTERY: demand-net should rise while supply-net stays flat. STRIKE: net stays quiet in live, fires in test."""
|
||||
from .spark import from_config as spark_from_config
|
||||
from .extract.backends import from_config as backend_from_config
|
||||
from .signals.two_sided import trajectory
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
sc = spark_from_config(cfg)
|
||||
backend = backend_from_config(cfg, sc)
|
||||
nodes = conn.execute(
|
||||
"SELECT node_id, derivative_proposition FROM fanout_nodes WHERE parent_conviction_id=? ORDER BY node_id",
|
||||
(args.conviction,),
|
||||
).fetchall()
|
||||
dates = [d.strip() for d in args.dates.split(",")]
|
||||
filt = [s for s in args.nodes.split(",") if s] if args.nodes else []
|
||||
for r in nodes:
|
||||
if filt and not any(k.lower() in r["node_id"].lower() for k in filt):
|
||||
continue
|
||||
for mode in [m.strip() for m in args.modes.split(",")]:
|
||||
traj = trajectory(conn, sc, backend, r["derivative_proposition"], dates,
|
||||
window_days=args.window_days, mode=mode)
|
||||
print(f"\n### {r['node_id']} [mode={mode}, window={args.window_days}d] ###")
|
||||
for pt in traj:
|
||||
print(f" {pt['as_of']}: net={pt['net']:+.2f} "
|
||||
f"affirm(eisc={pt['affirms_eisc']}, hard_src={pt.get('hard_affirm_src','?')}, "
|
||||
f"n_claims={pt['n_affirm']}, soft_dropped={pt.get('soft_affirm_src_dropped','?')}) "
|
||||
f"deny(eisc={pt['denies_eisc']}, n={pt['n_deny']}) "
|
||||
f"own_net={pt['own_network_affirm_src']}")
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_confusion(args: argparse.Namespace) -> int:
|
||||
from .signals.confusion import run_confusion
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
out = run_confusion(conn, cfg, args.spec)
|
||||
classify = out["classify"]
|
||||
print("=== PRE-REGISTERED confusion matrix (DESIGN_v2 §1) — precision AND recall; RUNWAY = frac of move still ahead at signal ===")
|
||||
print(f"{'derivative':26} {'reprice?':8} {'peak%':>6} {'whisper':>9} {'run_wh':>6} {'cleared':>9} {'run_cl':>6} cl/wh")
|
||||
for r in out["rows"]:
|
||||
cl, wh = classify(r, "cleared"), classify(r, "whisper")
|
||||
miss = f" (no px:{','.join(r['missing'])})" if r["missing"] else ""
|
||||
print(f"{r['node']:26} {('REAL' if r['confirmed'] else 'no'):8} {str(r['peak_pct']):>6} "
|
||||
f"{str(r['whisper_date'] or '-'):>9} {str(r['runway_whisper'] if r['runway_whisper'] is not None else '-'):>6} "
|
||||
f"{str(r['cleared_date'] or '-'):>9} {str(r['runway_cleared'] if r['runway_cleared'] is not None else '-'):>6} "
|
||||
f"{cl}/{wh}{miss}")
|
||||
for level in ("cleared", "whisper"):
|
||||
c, p, rec = out[level]
|
||||
print(f"\n{level.upper()} level: TP={c['TP']} FP={c['FP']} FN={c['FN']} TN={c['TN']} | "
|
||||
f"precision={p if p is None else round(p,2)} recall={rec if rec is None else round(rec,2)}")
|
||||
print("\nlead_* = days the repricing came AFTER the signal (positive = engine was early).")
|
||||
print("The cleared→whisper delta = what the independence floor cost in lead time / recall.")
|
||||
return 0
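The per-level summary prints precision and recall that may be `None`. One plausible way to derive them from the confusion counts is sketched here; `run_confusion` itself is not shown in this file, so the helper below is an assumption about its behaviour, not a copy of it.

```python
def precision_recall(counts):
    """Compute (precision, recall) from TP/FP/FN counts; None when a ratio is undefined."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if (tp + fp) else None  # no signals fired
    recall = tp / (tp + fn) if (tp + fn) else None     # no real repricings to find
    return precision, recall


p, rec = precision_recall({"TP": 3, "FP": 1, "FN": 2, "TN": 4})
empty = precision_recall({"TP": 0, "FP": 0, "FN": 0, "TN": 5})
print(p, rec, empty)
```

Returning `None` instead of 0.0 keeps "nothing to measure" distinct from "measured and got zero", which matches the `p if p is None else round(p,2)` guard in the print statement above.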
|
||||
|
||||
|
||||
def cmd_provenance(args: argparse.Namespace) -> int:
|
||||
"""The processing log — what's been ingested/processed, so we never reprocess silently."""
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
db.init_db(conn)
|
||||
print("processed documents (the durable log):")
|
||||
for r in conn.execute(
|
||||
"SELECT kind, COUNT(*) total, SUM(CASE WHEN processed_at IS NOT NULL THEN 1 ELSE 0 END) proc "
|
||||
"FROM documents GROUP BY kind ORDER BY kind"
|
||||
):
|
||||
print(f" {r['kind']:14} {r['proc']}/{r['total']} processed")
|
||||
print("dedup model: (1) UNIQUE(source_id, external_id) = robust pre-GPU guard; "
|
||||
"(2) dedup_key = cross-mirror (title+date); content_hash = audit only.")
|
||||
dups = conn.execute(
|
||||
"SELECT dedup_key, COUNT(*) c FROM documents WHERE dedup_key IS NOT NULL "
|
||||
"GROUP BY dedup_key HAVING c > 1"
|
||||
).fetchall()
|
||||
print(f"cross-mirror dedup_key groups (same episode via >1 feed): {len(dups)}")
|
||||
miss = conn.execute("SELECT COUNT(*) FROM documents WHERE dedup_key IS NULL").fetchone()[0]
|
||||
if miss:
|
||||
print(f" ({miss} docs missing dedup_key — run `provenance --backfill-hashes`)")
|
||||
if args.backfill_hashes:
|
||||
import hashlib
|
||||
import os
|
||||
from .util import audio_dedup_key
|
||||
ndk = nch = 0
|
||||
for r in conn.execute("SELECT doc_id, kind, title, date, external_id, transcript_path, dedup_key, content_hash FROM documents"):
|
||||
updates: dict = {}
|
||||
if not r["dedup_key"]:
|
||||
updates["dedup_key"] = (audio_dedup_key(r["title"], r["date"])
|
||||
if r["kind"] in ("podcast", "youtube") else r["external_id"])
|
||||
ndk += 1
|
||||
if not r["content_hash"] and r["transcript_path"] and os.path.exists(r["transcript_path"]):
|
||||
updates["content_hash"] = hashlib.sha256(Path(r["transcript_path"]).read_bytes()).hexdigest()
|
||||
nch += 1
|
||||
if updates:
|
||||
sets = ", ".join(f"{k}=?" for k in updates)
|
||||
conn.execute(f"UPDATE documents SET {sets} WHERE doc_id=?", (*updates.values(), r["doc_id"]))
|
||||
conn.commit()
|
||||
print(f"backfilled {ndk} dedup_keys, {nch} content hashes (audit)")
|
||||
return 0
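The audit-only `content_hash` is a SHA-256 over the transcript bytes. For large transcripts a streamed variant avoids loading the whole file at once; the sketch below assumes a hypothetical helper name `sha256_file` and uses a temporary file in place of a real transcript.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:  # context manager closes the handle promptly
        for block in iter(lambda: fh.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


payload = b"speaker_0: hello\n"
with tempfile.TemporaryDirectory() as tmp:
    transcript = Path(tmp) / "episode.txt"
    transcript.write_bytes(payload)
    digest = sha256_file(transcript)
print(digest)
```

The digest is identical to hashing the full byte string in one call, so switching to the streamed form would not invalidate hashes already stored.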
|
||||
|
||||
|
||||
def cmd_db_tables(args: argparse.Namespace) -> int:
|
||||
cfg = load_config()
|
||||
conn = db.connect(cfg.db_path)
|
||||
for t in db.table_names(conn):
|
||||
print(t)
|
||||
return 0
|
||||
|
||||
|
||||
def cmd_spark_status(args: argparse.Namespace) -> int:
|
||||
from .spark import from_config
|
||||
cfg = load_config()
|
||||
sc = from_config(cfg)
|
||||
try:
|
||||
print("status:", sc.status())
|
||||
print("endpoints:", sc.endpoints())
|
||||
return 0
|
||||
except Exception as e: # noqa: BLE001 — health probe; surface, don't crash
|
||||
print(f"Spark Control unreachable at {cfg.spark_control_url}: {e}", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
p = argparse.ArgumentParser(prog="signal_engine", description="Ten31 Signal Engine (pilot)")
|
||||
sub = p.add_subparsers(dest="command", required=True)
|
||||
|
||||
sub.add_parser("init-db", help="Create the SQLite schema").set_defaults(func=cmd_init_db)
|
||||
|
||||
sp = sub.add_parser("seed-convictions", help="Load the conviction log (§3.1)")
|
||||
sp.add_argument("--file", default=str(DEFAULT_CONVICTION_SEED))
|
||||
sp.set_defaults(func=cmd_seed_convictions)
|
||||
|
||||
ss = sub.add_parser("seed-sources", help="Load the source registry (§7.3/§7.4)")
|
||||
ss.add_argument("--file", default=str(DEFAULT_SOURCES_SEED))
|
||||
ss.set_defaults(func=cmd_seed_sources)
|
||||
|
||||
sde = sub.add_parser("seed-edges", help="Seed EISC connectedness edges (priors) idempotently")
|
||||
sde.add_argument("--file", default="seeds/source_edges.bitcoin.seed.yaml")
|
||||
sde.set_defaults(func=cmd_seed_edges)
|
||||
|
||||
lf = sub.add_parser("load-feeds", help="Apply resolved/verified podcast feed URLs + backtest reach")
|
||||
lf.add_argument("--file", default=str(DEFAULT_FEEDS_SEED))
|
||||
lf.set_defaults(func=cmd_load_feeds)
|
||||
|
||||
sf = sub.add_parser("seed-fanout", help="Load the hand-written fan-out tree (§7.1 backtest)")
|
||||
sf.add_argument("--file", default="seeds/fanout.K2023.seed.yaml")
|
||||
sf.set_defaults(func=cmd_seed_fanout)
|
||||
|
||||
bt = sub.add_parser("backtest", help="Run the §7.1 under-acted-conviction backtest (as-of march)")
|
||||
bt.add_argument("--conviction", default="K2023")
|
||||
bt.add_argument("--start", default="2023-01-01")
|
||||
bt.add_argument("--end", default="2024-06-01")
|
||||
bt.add_argument("--step-days", type=int, default=30)
|
||||
bt.add_argument("--window-days", type=int, default=90, help="~quarterly for filings/earnings cadence")
|
||||
bt.add_argument("--headline", default="K2023-picks-and-shovels")
|
||||
bt.set_defaults(func=cmd_backtest)
|
||||
|
||||
ie = sub.add_parser("ingest-edgar", help="Fetch SEC filings for a ticker → documents + extract jobs")
|
||||
ie.add_argument("--ticker", required=True)
|
||||
ie.add_argument("--since", help="ISO date lower bound, e.g. 2022-01-01")
|
||||
ie.add_argument("--until", help="ISO date upper bound, e.g. 2023-12-31")
|
||||
ie.add_argument("--forms", help="comma list, default 10-K,10-Q,8-K")
|
||||
ie.set_defaults(func=cmd_ingest_edgar)
|
||||
|
||||
idoc = sub.add_parser("ingest-doc", help="Fetch one text doc (HTML/PDF) → document + extract job (Battery corpus)")
|
||||
idoc.add_argument("--source", required=True, help="source_id (must exist)")
|
||||
idoc.add_argument("--url", required=True)
|
||||
idoc.add_argument("--title")
|
||||
idoc.add_argument("--date", help="ISO date of the document")
|
||||
idoc.add_argument("--method", choices=["auto", "html", "pdf"], default="auto")
|
||||
idoc.set_defaults(func=cmd_ingest_doc)
|
||||
|
||||
idm = sub.add_parser("ingest-doc-manifest", help="Batch-ingest a YAML doc manifest (Battery corpus)")
|
||||
idm.add_argument("--file", default="seeds/battery_docs.manifest.yaml")
|
||||
idm.set_defaults(func=cmd_ingest_doc_manifest)
|
||||
|
||||
ift = sub.add_parser("ingest-feed-text", help="Ingest article bodies behind a text RSS feed (blog/press)")
|
||||
ift.add_argument("--source", required=True)
|
||||
ift.add_argument("--url", required=True, help="RSS feed URL")
|
||||
ift.add_argument("--since")
|
||||
ift.add_argument("--until")
|
||||
ift.add_argument("--limit", type=int, default=50)
|
||||
ift.set_defaults(func=cmd_ingest_feed_text)
|
||||
|
||||
ge = sub.add_parser("ingest-earnings", help="Fetch FMP earnings transcripts → documents + extract jobs")
|
||||
ge.add_argument("--ticker", required=True)
|
||||
ge.add_argument("--since", help="ISO date lower bound (uses transcript date)")
|
||||
ge.add_argument("--until", help="ISO date upper bound")
|
||||
ge.add_argument("--limit", type=int, default=8)
|
||||
ge.set_defaults(func=cmd_ingest_earnings)
|
||||
|
||||
ts = sub.add_parser("two-sided", help="Two-sided net-corroboration trajectory (Strike/Battery adversarial cases)")
|
||||
ts.add_argument("--conviction", default="BATTERY2022")
|
||||
ts.add_argument("--nodes", default="", help="comma substrings to filter fan-out nodes, e.g. demand,supply")
|
||||
ts.add_argument("--dates", default="2022-12-31,2023-06-30,2023-12-31,2024-06-30,2024-12-31")
|
||||
ts.add_argument("--modes", default="live", help="comma list: live,test")
|
||||
ts.add_argument("--window-days", type=int, default=365)
|
||||
ts.set_defaults(func=cmd_two_sided)
|
||||
|
||||
ec = sub.add_parser("embed-claims", help="Embed pending propositions → Qdrant hybrid collection (§4.3)")
|
||||
ec.add_argument("--qdrant-url", default="http://192.168.1.87:6333")
|
||||
ec.add_argument("--no-sparse", action="store_true", help="dense-only (skip BM25)")
|
||||
ec.set_defaults(func=cmd_embed_claims)
|
||||
|
||||
se = sub.add_parser("search", help="Hybrid search the proposition store via the gateway")
|
||||
se.add_argument("--query", required=True)
|
||||
se.add_argument("--top-k", type=int, default=8)
|
||||
se.add_argument("--no-rerank", action="store_true")
|
||||
se.set_defaults(func=cmd_search)
|
||||
|
||||
ip = sub.add_parser("ingest-podcast", help="Register podcast episodes → transcribe jobs (RSS or YouTube)")
|
||||
ip.add_argument("--source", required=True, help="source_id, e.g. pod-dwarkesh")
|
||||
ip.add_argument("--via", choices=["auto", "rss", "youtube"], default="auto")
|
||||
ip.add_argument("--since")
|
||||
ip.add_argument("--until")
|
||||
ip.add_argument("--limit", type=int, default=20)
|
||||
ip.set_defaults(func=cmd_ingest_podcast)
|
||||
|
||||
rt = sub.add_parser("run-transcribe", help="Drain 'transcribe' jobs → speaker-attributed transcripts + voiceprints")
|
||||
rt.add_argument("--limit", type=int, default=5)
|
||||
rt.add_argument("--max-chunks", type=int, default=999)
|
||||
rt.set_defaults(func=cmd_run_transcribe)
|
||||
|
||||
rtg = sub.add_parser("run-transcribe-gemini",
|
||||
help="One-time backfill: drain 'transcribe' jobs via Gemini (off the Spark GPU)")
|
||||
rtg.add_argument("--limit", type=int, default=5)
|
||||
rtg.add_argument("--concurrency", type=int, default=4)
|
||||
rtg.set_defaults(func=cmd_run_transcribe_gemini)
|
||||
|
||||
rx = sub.add_parser("run-extract", help="Drain 'extract' jobs → claims via the local LLM (§4.2)")
rx.add_argument("--limit", type=int, default=5, help="max jobs to process this run")
rx.add_argument("--max-chunks", type=int, default=4, help="max chunks per document")
rx.set_defaults(func=cmd_run_extract)
|
||||
|
||||
sub.add_parser("queue-status", help="Backfill queue counts by type/state").set_defaults(func=cmd_queue_status)
|
||||
|
||||
fp = sub.add_parser("feed-peek", help="Parse an RSS feed and show episode coverage")
|
||||
fp.add_argument("--url", required=True)
|
||||
fp.add_argument("--limit", type=int, default=5)
|
||||
fp.set_defaults(func=cmd_feed_peek)
|
||||
|
||||
sv = sub.add_parser("serve", help="Run the corpus-management web UI (FastAPI)")
|
||||
sv.add_argument("--port", type=int, default=None)
|
||||
sv.set_defaults(func=cmd_serve)
|
||||
|
||||
cm = sub.add_parser("confusion-matrix", help="Pre-registered precision/recall on the §7.1 derivatives (resolver)")
|
||||
cm.add_argument("--spec", default="seeds/resolution.K2023.yaml")
|
||||
cm.set_defaults(func=cmd_confusion)
|
||||
|
||||
pv = sub.add_parser("provenance", help="Processing log: what's ingested/processed (dedup-safe)")
|
||||
pv.add_argument("--backfill-hashes", action="store_true", help="backfill missing dedup_key and content_hash on older documents")
|
||||
pv.set_defaults(func=cmd_provenance)
|
||||
|
||||
sub.add_parser("db-tables", help="List tables/views").set_defaults(func=cmd_db_tables)
|
||||
sub.add_parser("spark-status", help="Probe Spark Control health").set_defaults(func=cmd_spark_status)
|
||||
return p
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
args = build_parser().parse_args(argv)
|
||||
cfg = load_config()
|
||||
_setup_logging(cfg.log_level)
|
||||
return args.func(args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
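The CLI wires each subcommand to its handler with `set_defaults(func=...)` and dispatches through `args.func(args)`. A stripped-down sketch of that pattern follows; the `hello` subcommand and its `--name` flag are made up for illustration and are not part of the Signal Engine CLI.

```python
import argparse


def cmd_hello(args):
    """Handler for the hypothetical 'hello' subcommand; returns a process exit code."""
    print(f"hello, {args.name}")
    return 0


def build_parser():
    p = argparse.ArgumentParser(prog="demo")
    sub = p.add_subparsers(dest="command", required=True)
    sp = sub.add_parser("hello", help="Print a greeting")
    sp.add_argument("--name", default="world")
    sp.set_defaults(func=cmd_hello)  # the chosen subparser stores its handler on the namespace
    return p


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.func(args)


rc = main(["hello", "--name", "ten31"])
```

Passing `argv` explicitly, as `main()` above allows, makes each subcommand callable from tests without touching `sys.argv`.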
|
||||
@@ -0,0 +1,101 @@
|
||||
"""Environment-driven configuration (handoff §10, §13).
|
||||
|
||||
All config flows through env vars so the SAME code runs as a plain process now and, later, as a
|
||||
StartOS s9pk daemon (which injects these via the daemon's `exec.env` from a `store.json` FileModel).
|
||||
A local `.env` (gitignored) is loaded for convenience during the pilot.
|
||||
|
||||
Live values confirmed against the operator's gateway 2026-06-07 (GET /api/status,/api/endpoints):
|
||||
gateway = https://192.168.1.72:62419 (self-signed → SPARK_VERIFY_TLS=false)
|
||||
LLM = RedHatAI/Qwen3.6-35B-A3B-NVFP4
|
||||
embed = BAAI/bge-m3 (1024-d) rerank = BAAI/bge-reranker-v2-m3
|
||||
ASR = nvidia/parakeet-tdt-0.6b-v3 diarizer = nvidia/diar_sortformer_4spk-v1
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def _load_dotenv(path: str = ".env") -> None:
|
||||
"""Minimal .env loader (no dependency): KEY=VALUE lines populate os.environ if not already set."""
|
||||
p = Path(path)
|
||||
if not p.exists():
|
||||
return
|
||||
for line in p.read_text().splitlines():
|
||||
line = line.strip()
|
||||
if not line or line.startswith("#") or "=" not in line:
|
||||
continue
|
||||
key, _, val = line.partition("=")
|
||||
os.environ.setdefault(key.strip(), val.strip().strip('"').strip("'"))
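Because the loader uses `os.environ.setdefault`, a value already present in the process environment wins over the `.env` file. The sketch below reproduces the same parsing rules under a hypothetical name (`load_dotenv`) and with made-up `DEMO_*` keys to show that precedence.

```python
import os
import tempfile
from pathlib import Path


def load_dotenv(path):
    """Populate os.environ from KEY=VALUE lines, never overriding existing variables."""
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, val = line.partition("=")
        os.environ.setdefault(key.strip(), val.strip().strip('"').strip("'"))


os.environ["DEMO_LOG_LEVEL"] = "DEBUG"  # already set by the "real" environment
os.environ.pop("DEMO_UI_PORT", None)    # not set, so the file value applies

with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text('# local overrides\nDEMO_LOG_LEVEL=INFO\nDEMO_UI_PORT="8000"\n')
    load_dotenv(env_file)

print(os.environ["DEMO_LOG_LEVEL"], os.environ["DEMO_UI_PORT"])
```

This ordering is what lets a daemon-injected environment override a stray local `.env` without any code change.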
|
||||
|
||||
|
||||
def _env(key: str, default: str | None = None) -> str | None:
|
||||
return os.environ.get(key, default)
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class Config:
|
||||
spark_control_url: str
|
||||
spark_verify_tls: bool
|
||||
spark_timeout_s: float
|
||||
audio_concurrency: int # global in-flight cap across BOTH parakeet audio endpoints (sit at 2, ceiling 3)
|
||||
|
||||
local_llm_model: str
|
||||
embed_model: str
|
||||
transcribe_model: str
|
||||
|
||||
anthropic_api_key: str | None
|
||||
frontier_model: str
|
||||
|
||||
# Extraction backend: 'local' (Qwen via Spark Control, default) | 'gemini' (batch overflow/fallback, §scaling)
|
||||
extraction_backend: str
|
||||
gemini_api_key: str | None
|
||||
gemini_model: str
|
||||
|
||||
fmp_api_key: str | None
|
||||
edgar_user_agent: str
|
||||
|
||||
data_dir: Path
|
||||
database_url: str
|
||||
audio_cache_dir: Path
|
||||
|
||||
ui_port: int
|
||||
log_level: str
|
||||
|
||||
@classmethod
|
||||
def from_env(cls) -> "Config":
|
||||
_load_dotenv()
|
||||
data_dir = Path(_env("DATA_DIR", "./data") or "./data")
|
||||
return cls(
|
||||
spark_control_url=_env("SPARK_CONTROL_URL", "https://192.168.1.72:62419") or "",
|
||||
spark_verify_tls=(_env("SPARK_VERIFY_TLS", "false") or "false").lower() == "true",
|
||||
spark_timeout_s=float(_env("SPARK_TIMEOUT_S", "180") or "180"),
|
||||
audio_concurrency=min(3, max(1, int(_env("AUDIO_CONCURRENCY", "2") or "2"))),
|
||||
local_llm_model=_env("LOCAL_LLM_MODEL", "RedHatAI/Qwen3.6-35B-A3B-NVFP4") or "",
|
||||
embed_model=_env("EMBED_MODEL", "BAAI/bge-m3") or "",
|
||||
transcribe_model=_env("TRANSCRIBE_MODEL", "nvidia/parakeet-tdt-0.6b-v3") or "",
|
||||
anthropic_api_key=_env("ANTHROPIC_API_KEY"),
|
||||
frontier_model=_env("FRONTIER_MODEL", "claude-opus-4-8") or "",
|
||||
extraction_backend=_env("EXTRACTION_BACKEND", "local") or "local",
|
||||
gemini_api_key=_env("GEMINI_API_KEY"),
|
||||
gemini_model=_env("GEMINI_MODEL", "gemini-2.5-flash") or "",
|
||||
fmp_api_key=_env("FMP_API_KEY"),
|
||||
edgar_user_agent=_env("EDGAR_USER_AGENT", "Ten31 Research grant@ten31.xyz") or "",
|
||||
data_dir=data_dir,
|
||||
database_url=_env("DATABASE_URL", "") or "",
|
||||
audio_cache_dir=Path(_env("AUDIO_CACHE_DIR", str(data_dir / "audio-cache")) or "audio-cache"),
|
||||
ui_port=int(_env("UI_PORT", "8000") or "8000"),
|
||||
log_level=_env("LOG_LEVEL", "INFO") or "INFO",
|
||||
)
|
||||
|
||||
@property
|
||||
def db_path(self) -> Path:
|
||||
prefix = "sqlite:///"
|
||||
if self.database_url.startswith(prefix):
|
||||
return Path(self.database_url[len(prefix):])
|
||||
return self.data_dir / "signal.db"
|
||||
|
||||
|
||||
def load_config() -> Config:
|
||||
return Config.from_env()
|
||||
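The `.env` precedence rule in the loader (a value already present in the environment wins, because of `os.environ.setdefault`) may be easier to see in isolation. The following is a minimal sketch that mirrors the parsing logic rather than importing it; `parse_dotenv_lines` and the `DEMO_*` keys are hypothetical names used only for illustration.

```python
import os


def parse_dotenv_lines(lines: list[str]) -> dict[str, str]:
    """Parse KEY=VALUE lines with the same rules as the loader (hypothetical helper)."""
    out: dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without '='
        key, _, val = line.partition("=")
        out[key.strip()] = val.strip().strip('"').strip("'")
    return out


parsed = parse_dotenv_lines(
    ["# comment", 'DEMO_URL="https://example.invalid"', "DEMO_PORT = 9000", "junk"]
)

os.environ["DEMO_PORT"] = "8000"  # pretend the shell already exported this value
for k, v in parsed.items():
    os.environ.setdefault(k, v)  # existing environment values are not overwritten
```

After this runs, `DEMO_PORT` keeps the exported value `8000`, while `DEMO_URL` is filled in from the parsed lines.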
@@ -0,0 +1,6 @@
"""Embedding + vector storage (§4.3).

Embed DISTILLED PROPOSITIONS (not raw chunks) into a Qdrant HYBRID collection: dense bge-m3
(via the gateway) + BM25 sparse (client-side), so entity-heavy propositions (MSTR/Strategy/
Microstrategy) match on the lexical leg too. Retrieval goes through the gateway's /api/search.
"""
@@ -0,0 +1,36 @@
"""Proposition embedding: dense (bge-m3 via gateway) + optional BM25 sparse (client-side)."""
from __future__ import annotations

import logging

log = logging.getLogger(__name__)


def dense_embed(sc, texts: list[str]) -> list[list[float]]:
    """Dense bge-m3 (1024-d) via the gateway /v1/embeddings (§4.3)."""
    resp = sc.embed(texts)
    data = sorted(resp["data"], key=lambda d: d.get("index", 0))
    return [d["embedding"] for d in data]


class SparseEmbedder:
    """BM25 sparse vectors via FastEmbed `Qdrant/bm25` (the operator's CRM uses this exact model,
    with the collection's `modifier: idf`). Degrades gracefully to dense-only if fastembed is absent."""

    def __init__(self, model_name: str = "Qdrant/bm25") -> None:
        self.available = False
        self._model = None
        try:
            from fastembed import SparseTextEmbedding
            self._model = SparseTextEmbedding(model_name=model_name)
            self.available = True
        except Exception as e:  # noqa: BLE001
            log.warning("fastembed sparse unavailable (%s) — upserting dense-only; add sparse later", e)

    def embed(self, texts: list[str]) -> list[dict | None]:
        if not self.available or self._model is None:
            return [None] * len(texts)
        out: list[dict | None] = []
        for emb in self._model.embed(texts):
            out.append({"indices": emb.indices.tolist(), "values": emb.values.tolist()})
        return out
@@ -0,0 +1,79 @@
"""Qdrant hybrid collection: create + upsert distilled propositions (§4.3).

Collection mgmt + upserts go DIRECT to Qdrant (§13.2 "(Qdrant direct) :6333"); retrieval goes
through the gateway's /api/search. Named dense vector `bge_m3` (1024-d cosine) + sparse `bm25`
(modifier IDF). Point id is a deterministic UUID5 of claim_id, so re-upsert is idempotent.
"""
from __future__ import annotations

import logging
import sqlite3
import uuid

from qdrant_client import QdrantClient, models

from .embedder import SparseEmbedder, dense_embed

log = logging.getLogger(__name__)

COLLECTION = "propositions"
DENSE = "bge_m3"
SPARSE = "bm25"
_NS = uuid.UUID("5f9b7e10-0000-4000-8000-000000000001")

# Filterable payload (§4.3): stance/topic/cluster/date for stance distributions, time-windowed
# consensus, corroboration lookups. NEVER infer stance from vector distance (§2.2/§5.3).
_PAYLOAD_FIELDS = (
    "claim_id", "doc_id", "source_id", "source_cluster", "topic_canonical", "date",
    "claim_type", "time_horizon", "confidence", "rel_polarity", "engages_consensus",
    "counters_position", "thesis_seam", "salience", "claimant", "proposition",
)


def get_client(qdrant_url: str) -> QdrantClient:
    return QdrantClient(url=qdrant_url, prefer_grpc=False, timeout=60)


def ensure_collection(client: QdrantClient, *, dim: int = 1024) -> bool:
    names = [c.name for c in client.get_collections().collections]
    if COLLECTION in names:
        return False
    client.create_collection(
        collection_name=COLLECTION,
        vectors_config={DENSE: models.VectorParams(size=dim, distance=models.Distance.COSINE)},
        sparse_vectors_config={SPARSE: models.SparseVectorParams(modifier=models.Modifier.IDF)},
    )
    log.info("created Qdrant collection %r (dense %s %dd + sparse %s/idf)", COLLECTION, DENSE, dim, SPARSE)
    return True


def _point_id(claim_id: str) -> str:
    return str(uuid.uuid5(_NS, claim_id))


def upsert_pending(conn: sqlite3.Connection, sc, client: QdrantClient,
                   sparse: SparseEmbedder | None = None, *, batch: int = 64) -> int:
    """Embed + upsert every claim that has no qdrant_point_id yet; back-link the id into SQLite."""
    rows = conn.execute("SELECT * FROM claims WHERE qdrant_point_id IS NULL").fetchall()
    if not rows:
        return 0
    total = 0
    for i in range(0, len(rows), batch):
        chunk = rows[i:i + batch]
        texts = [r["proposition"] for r in chunk]
        dvecs = dense_embed(sc, texts)
        svecs = sparse.embed(texts) if sparse else [None] * len(texts)
        points = []
        for r, dv, sv in zip(chunk, dvecs, svecs):
            vectors: dict = {DENSE: dv}
            if sv is not None:
                vectors[SPARSE] = models.SparseVector(indices=sv["indices"], values=sv["values"])
            payload = {f: r[f] for f in _PAYLOAD_FIELDS}
            points.append(models.PointStruct(id=_point_id(r["claim_id"]), vector=vectors, payload=payload))
        client.upsert(collection_name=COLLECTION, points=points)
        for r in chunk:
            conn.execute("UPDATE claims SET qdrant_point_id=? WHERE claim_id=?",
                         (_point_id(r["claim_id"]), r["claim_id"]))
        conn.commit()
        total += len(chunk)
    return total
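The idempotency claim in the module docstring rests on UUID5 being a pure function of namespace and name. A small standalone sketch, using the same namespace constant as the module and made-up claim ids, shows the property that makes a re-upsert overwrite the same point instead of creating a duplicate.

```python
import uuid

# Same namespace constant as the module's _NS.
NS = uuid.UUID("5f9b7e10-0000-4000-8000-000000000001")


def point_id(claim_id: str) -> str:
    """Deterministic point id: the same claim_id always maps to the same UUID."""
    return str(uuid.uuid5(NS, claim_id))


a = point_id("doc:demo:abc123:0:0")  # hypothetical claim id
b = point_id("doc:demo:abc123:0:0")  # same claim, computed again
c = point_id("doc:demo:abc123:0:1")  # a different claim in the same chunk
```

Here `a` and `b` are equal, so a second upsert of the same claim targets the existing point, while `c` gets its own id.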
@@ -0,0 +1,6 @@
"""Extraction (§4.2) — local LLM → structured claim units. The cost & quality center.

Emits at the level of the PROPOSITION: a passage may yield 0..N claims, and MOST passages yield
zero. An extractor that dutifully emits a claim per chunk reintroduces exactly the noise the rest
of the system is designed to remove.
"""
@@ -0,0 +1,64 @@
"""Pluggable extraction backends (§scaling).

The §4.2 extractor calls a backend that turns chat messages into a JSON string. Default is the
LOCAL Qwen via Spark Control (the ~95%-local design). The Gemini backend is the documented
overflow/fallback for bulk back-cataloging at scale, or if the Sparks are unavailable — used for
the PUBLIC corpus only, never conviction/exposure data (sovereignty boundary, §4.6).

A backend exposes: complete_json(messages, max_tokens) -> str (a JSON object string).
"""
from __future__ import annotations

import logging

log = logging.getLogger(__name__)


class LocalQwenBackend:
    name = "local"

    def __init__(self, sc) -> None:
        self.sc = sc

    def complete_json(self, messages: list[dict], *, max_tokens: int = 4000) -> str:
        resp = self.sc.chat(messages, json_object=True, temperature=0,
                            enable_thinking=False, max_tokens=max_tokens)
        return resp["choices"][0]["message"]["content"]


class GeminiBackend:
    """Gemini fallback/overflow. Implemented against the `google-genai` SDK. NOTE: untested until a
    key is provided — validate end-to-end before relying on it for a real backfill. The async BATCH
    API is the eventual scale path; this synchronous form is the drop-in fallback."""
    name = "gemini"

    def __init__(self, api_key: str, model: str = "gemini-2.5-flash") -> None:
        from google import genai  # guarded import; pip install google-genai
        self._genai = genai
        self.client = genai.Client(api_key=api_key)
        self.model = model

    def complete_json(self, messages: list[dict], *, max_tokens: int = 4000) -> str:
        from google.genai import types
        system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
        user = "\n\n".join(m["content"] for m in messages if m["role"] != "system")
        resp = self.client.models.generate_content(
            model=self.model,
            contents=user,
            config=types.GenerateContentConfig(
                system_instruction=system or None,
                temperature=0,
                max_output_tokens=max_tokens,
                response_mime_type="application/json",
            ),
        )
        return resp.text or "{}"


def from_config(cfg, sc) -> "LocalQwenBackend | GeminiBackend":
    if cfg.extraction_backend == "gemini":
        if not cfg.gemini_api_key:
            log.warning("EXTRACTION_BACKEND=gemini but GEMINI_API_KEY missing — falling back to local")
        else:
            return GeminiBackend(cfg.gemini_api_key, cfg.gemini_model)
    return LocalQwenBackend(sc)
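Because the backend contract is only `complete_json(messages, max_tokens) -> str`, any object with that method can stand in for the local or Gemini backend. The sketch below is a hypothetical test double, not part of the commit: `CannedBackend` and its sample proposition are invented here to show the shape of the contract without a GPU or an API key.

```python
import json


class CannedBackend:
    """Hypothetical test double: returns a fixed JSON object string, ignoring the messages."""
    name = "canned"

    def __init__(self, claims: list[dict]) -> None:
        self._payload = json.dumps({"claims": claims})

    def complete_json(self, messages: list[dict], *, max_tokens: int = 4000) -> str:
        # Same signature as the real backends; the return value is always a JSON string.
        return self._payload


backend = CannedBackend([{"proposition": "Demo Corp expects power demand to double by 2030."}])
raw = backend.complete_json([{"role": "user", "content": "PASSAGE: ..."}], max_tokens=100)
claims = json.loads(raw)["claims"]
```

A double like this could be passed wherever the extractor expects a backend, which keeps parsing and persistence testable offline.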
@@ -0,0 +1,117 @@
"""Claim extraction: text → 0..N claim units → SQLite (§4.2)."""
from __future__ import annotations

import json
import logging
import sqlite3
from typing import Any

from .prompt import SEED_TOPICS, build_messages

log = logging.getLogger(__name__)

_ENUMS = {
    "claim_type": {"interpretive", "predictive", "descriptive", "reactive"},
    "time_horizon": {"near", "medium", "long", "unspecified"},
    "confidence": {"low", "med", "high"},
    "thesis_seam": {"energy_compute", "debasement_bitcoin", "ai_data_ownership", "none"},
    "salience": {"central", "secondary", "aside"},
}


def register_seed_topics(conn: sqlite3.Connection) -> None:
    """Pre-load the controlled half of the hybrid topic vocabulary (§4.2)."""
    for t in SEED_TOPICS:
        conn.execute(
            "INSERT INTO topics (topic_canonical, status) VALUES (?, 'controlled') "
            "ON CONFLICT(topic_canonical) DO UPDATE SET status='controlled'",
            (t,),
        )
    conn.commit()


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split on paragraph boundaries into windows that fit the model context alongside the prompt."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    chunks: list[str] = []
    cur: list[str] = []
    size = 0
    for para in text.split("\n\n"):
        if size + len(para) > max_chars and cur:
            chunks.append("\n\n".join(cur))
            cur, size = [], 0
        cur.append(para)
        size += len(para) + 2
    if cur:
        chunks.append("\n\n".join(cur))
    return chunks


def _parse_claims(content: str) -> list[dict]:
    try:
        obj = json.loads(content)
    except Exception:
        i, j = content.find("{"), content.rfind("}")
        if i < 0 or j < 0:
            return []
        try:
            obj = json.loads(content[i:j + 1])
        except Exception:
            return []
    claims = obj.get("claims", []) if isinstance(obj, dict) else []
    return [c for c in claims if isinstance(c, dict) and c.get("proposition")]


def extract_claims_from_text(backend, text: str, *, source_name: str, source_cluster: str | None,
                             date: str | None, kind: str) -> list[dict]:
    """`backend` is any object with .complete_json(messages, max_tokens) -> str
    (see extract.backends: LocalQwenBackend | GeminiBackend)."""
    messages = build_messages(text, source_name=source_name, source_cluster=source_cluster,
                              date=date, kind=kind)
    content = backend.complete_json(messages, max_tokens=4000)
    return _parse_claims(content)


def _enum(c: dict, field: str, default: str) -> str:
    v = c.get(field)
    return v if v in _ENUMS[field] else default


def persist_claims(conn: sqlite3.Connection, *, doc: sqlite3.Row, source: sqlite3.Row | None,
                   claims: list[dict], chunk_idx: int) -> int:
    n = 0
    cluster = source["source_cluster"] if source else None
    for i, c in enumerate(claims):
        seam = _enum(c, "thesis_seam", "none")
        topic = c.get("topic_canonical") or None
        if topic:
            # register emergent topics BEFORE the claim (claims.topic_canonical is a FK → topics)
            conn.execute(
                "INSERT OR IGNORE INTO topics (topic_canonical, status, seam) VALUES (?, 'emergent', ?)",
                (topic, seam),
            )
        claim_id = f"{doc['doc_id']}:{chunk_idx}:{i}"
        conn.execute(
            """INSERT OR IGNORE INTO claims
               (claim_id, doc_id, source_id, proposition, topic_canonical, topic_raw, claimant,
                source_cluster, date, claim_type, time_horizon, confidence, rel_polarity,
                engages_consensus, counters_position, thesis_seam, salience)
               VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
            (
                claim_id, doc["doc_id"], doc["source_id"], str(c["proposition"])[:1000],
                topic, c.get("topic_raw"),
                c.get("claimant") or (source["name"] if source else None),
                cluster, doc["date"],
                _enum(c, "claim_type", "descriptive"), _enum(c, "time_horizon", "unspecified"),
                _enum(c, "confidence", "med"), "none",
                1 if c.get("engages_consensus") else 0, c.get("counters_position"),
                seam, _enum(c, "salience", "secondary"),
            ),
        )
        n += 1
    conn.commit()
    return n
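The paragraph-window behaviour of `chunk_text` has one edge case worth seeing concretely: paragraphs are packed greedily up to `max_chars`, but a single paragraph longer than the limit is passed through whole rather than split. The sketch below restates the same algorithm under the hypothetical name `chunk_paragraphs` so it can run on its own; the sample strings are made up.

```python
def chunk_paragraphs(text: str, max_chars: int) -> list[str]:
    """Greedy paragraph packing, mirroring chunk_text (standalone restatement)."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    chunks: list[str] = []
    cur: list[str] = []
    size = 0
    for para in text.split("\n\n"):
        # Flush the current window before adding a paragraph that would overflow it.
        if size + len(para) > max_chars and cur:
            chunks.append("\n\n".join(cur))
            cur, size = [], 0
        cur.append(para)
        size += len(para) + 2  # +2 accounts for the "\n\n" separator
    if cur:
        chunks.append("\n\n".join(cur))
    return chunks


# Three 40-char paragraphs with a 100-char limit: two fit, the third starts a new window.
windows = chunk_paragraphs("\n\n".join(["a" * 40, "b" * 40, "c" * 40]), max_chars=100)

# A 250-char paragraph exceeds the limit but is kept intact as its own window.
oversize = chunk_paragraphs("x" * 250 + "\n\n" + "y" * 10, max_chars=100)
```

If hard limits matter for a given model context, a caller would need an additional split step for oversized paragraphs; the function as written does not provide one.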
@@ -0,0 +1,47 @@
"""SEC filing HTML → plain text. Stdlib only (boring, inspectable).

Drops script/style/head and inline-XBRL hidden blocks (10-Ks embed a huge <ix:hidden> section of
numeric facts that would otherwise swamp the extractor), and collapses whitespace.
"""
from __future__ import annotations

import re
from html.parser import HTMLParser

_SKIP_TAGS = {"script", "style", "head"}
_SKIP_PREFIXES = ("ix:hidden",)  # inline-XBRL hidden fact dump
_BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3", "h4", "h5", "h6", "table"}


class _Stripper(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag: str, attrs) -> None:
        if tag in _SKIP_TAGS or tag.startswith(_SKIP_PREFIXES):
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag: str) -> None:
        if tag in _SKIP_TAGS or tag.startswith(_SKIP_PREFIXES):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in _BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data: str) -> None:
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data)


def html_to_text(html: str, *, max_chars: int = 300_000) -> str:
    p = _Stripper()
    p.feed(html)
    p.close()  # flush text the parser may still be buffering (input that does not end in a tag)
    text = "".join(p._parts)
    text = re.sub(r"[ \t\xa0]+", " ", text)  # collapse runs of spaces, tabs, and non-breaking spaces
    text = re.sub(r"\n\s*\n\s*\n+", "\n\n", text)
    text = "\n".join(line.strip() for line in text.splitlines())
    text = text.strip()
    return text[:max_chars]
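To make the skip-depth idea concrete, here is a reduced sketch of the same stdlib `HTMLParser` approach run on a tiny made-up filing fragment. `MiniStripper` and the sample HTML are illustrative only and use a shortened block-tag list; the point is that text inside `head`, `script`, and `ix:hidden` never reaches the output.

```python
import re
from html.parser import HTMLParser

SKIP = {"script", "style", "head"}
BLOCK = {"p", "div", "br", "tr", "li", "table"}  # shortened list for the sketch


class MiniStripper(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP or tag.startswith("ix:hidden"):
            self.skip_depth += 1
        elif tag in BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP or tag.startswith("ix:hidden"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data)


sample = (
    "<html><head><title>10-K</title></head><body>"
    "<ix:hidden><span>123456</span></ix:hidden>"
    "<p>Revenue grew 12%.</p><script>var x = 1;</script><p>Outlook is stable.</p>"
    "</body></html>"
)
parser = MiniStripper()
parser.feed(sample)
parser.close()
text = re.sub(r"\n\s*\n+", "\n", "".join(parser.parts)).strip()
```

Only the two paragraph sentences survive in `text`; the title, the hidden XBRL number, and the script body are dropped.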
@@ -0,0 +1,72 @@
"""The §4.2 claim-extraction prompt. Prompt engineering is ours (§13.3); the schema is finalized.

Discipline encoded here (the whole point of the system, §2/§4.2):
- Extract at the level of the PROPOSITION; emit ZERO when there is no substantive claim.
- Separate topic from stance: capture stance-vs-consensus explicitly, never as a bull/bear label.
- thesis_seam is a TAG, not a filter — off-thesis and anti-thesis claims are still extracted.
"""
from __future__ import annotations

# Hybrid topic vocabulary (§4.2): a small SEEDED controlled list. The model reuses one when it
# fits and proposes a concise snake_case topic otherwise; emergent topics are merged on a schedule.
SEED_TOPICS = [
    # energy <-> compute
    "ai_compute_demand", "ai_power_constraint", "datacenter_buildout", "grid_interconnect",
    "transformers_equipment", "nuclear_power", "natural_gas_power", "uranium_supply",
    "cooling_infrastructure", "miner_flexible_load", "mining_ai_pivot",
    # debasement <-> bitcoin
    "bitcoin_reserve_asset", "bitcoin_collateral_credit", "bitcoin_treasury_strategy",
    "btc_custody_regulation", "sovereign_bitcoin_adoption",
    # ai <-> data ownership
    "ai_data_ownership", "confidential_inference", "ai_commoditization",
    # macro
    "fed_policy", "fiscal_debasement", "stablecoins_cbdc",
]

_SYSTEM = """You are the claim-extraction component of an investment signal engine. You read a passage \
(an SEC filing excerpt or a podcast/earnings-call transcript) and extract structured CLAIM UNITS.

A CLAIM UNIT is a single normalized proposition that someone asserts — a forward-looking prediction, \
an interpretive or causal judgment, or a stance taken against a prevailing view. It must be specific \
enough to later be checked against the world.

CRITICAL DISCIPLINE — be willing to extract NOTHING:
- Most passages contain ZERO claim units. Boilerplate, legal disclaimers, ad reads, pleasantries, \
generic descriptions, routine financial line-items, and recitations of well-known news are NOT claims.
- Do NOT invent claims. Do NOT emit one claim per paragraph to seem thorough. If the passage has no \
substantive proposition, return {"claims": []}. A precise empty answer is the correct, valued output.
- Extract at the level of the PROPOSITION: one normalized subject-assertion-object sentence each. A \
single rich passage may yield several; a long dull one yields none.

For EACH claim unit, output these fields:
- "proposition": one normalized sentence (subject-assertion-object), self-contained.
- "topic_canonical": a concise snake_case topic for clustering. REUSE one of the provided seed topics \
when it fits; otherwise propose a new concise snake_case label. Normalize synonyms (Fed/FOMC/rates → fed_policy).
- "topic_raw": the topic as actually phrased in the passage.
- "claimant": who asserts it (speaker name or the filing company). Use "unknown" if unclear.
- "claim_type": one of interpretive | predictive | descriptive | reactive. (interpretive/predictive = \
insight; descriptive/reactive = news echo — extract those only if clearly salient.)
- "time_horizon": one of near | medium | long | unspecified (for predictive claims especially).
- "confidence": the claimant's apparent conviction — one of low | med | high.
- "engages_consensus": true ONLY if the claim explicitly argues against a stated mainstream view.
- "counters_position": the mainstream position it argues against, or null.
- "thesis_seam": one of energy_compute | debasement_bitcoin | ai_data_ownership | none. This is a TAG \
for relevance only — tag off-thesis claims "none" and STILL extract them.
- "salience": central | secondary | aside (how central the claim is to the passage).

Return ONLY a JSON object: {"claims": [ {...}, ... ]}. No prose, no markdown."""


def build_messages(text: str, *, source_name: str, source_cluster: str | None,
                   date: str | None, kind: str) -> list[dict[str, str]]:
    seed = ", ".join(SEED_TOPICS)
    context = (
        f"Source: {source_name or 'unknown'} (cluster: {source_cluster or 'n/a'}, type: {kind}, "
        f"date: {date or 'n/a'}).\n"
        f"Seed topics to reuse when they fit: {seed}.\n\n"
        f"PASSAGE:\n{text}"
    )
    return [
        {"role": "system", "content": _SYSTEM},
        {"role": "user", "content": context},
    ]
@@ -0,0 +1,69 @@
"""Extraction worker — drains 'extract' jobs from the backfill queue (§4.2, §13.4).

Single sequential worker by design: extraction is the heavier serial load on the one LLM GPU.
For each job: load the document, get its text (fetch+strip filing HTML, or read a stored transcript),
chunk it, run the §4.2 extractor per chunk, persist 0..N claims, complete the job.
"""
from __future__ import annotations

import logging
from pathlib import Path

import requests

from ..backfill import queue
from . import claims as claims_mod
from .html_text import html_to_text

log = logging.getLogger(__name__)


def _document_text(doc, *, user_agent: str) -> str:
    if doc["transcript_path"]:
        return Path(doc["transcript_path"]).read_text()
    if doc["kind"] == "filing" and doc["url"]:
        r = requests.get(doc["url"], headers={"User-Agent": user_agent}, timeout=90)
        r.raise_for_status()
        return html_to_text(r.text)
    raise ValueError(f"no text source for {doc['doc_id']} (kind={doc['kind']}, url={doc['url']})")


def run_extract(conn, sc, cfg, *, limit: int = 10, max_chunks_per_doc: int = 4,
                chunk_chars: int = 18_000, lease_seconds: int = 900,
                worker_id: str = "extract-1") -> dict:
    from .backends import from_config as backend_from_config
    backend = backend_from_config(cfg, sc)
    log.info("extraction backend: %s", backend.name)
    claims_mod.register_seed_topics(conn)
    processed = total_claims = 0
    while processed < limit:
        job = queue.lease_next(conn, worker_id=worker_id, job_types=["extract"], lease_seconds=lease_seconds)
        if job is None:
            break
        processed += 1
        doc = conn.execute("SELECT * FROM documents WHERE doc_id=?", (job["target_id"],)).fetchone()
        if doc is None:
            queue.skip(conn, job["job_id"], "document missing")
            continue
        src = conn.execute("SELECT * FROM sources WHERE source_id=?", (doc["source_id"],)).fetchone()
        try:
            text = _document_text(doc, user_agent=cfg.edgar_user_agent)
            chunks = claims_mod.chunk_text(text, chunk_chars)[:max_chunks_per_doc]
            doc_claims = 0
            for idx, chunk in enumerate(chunks):
                cl = claims_mod.extract_claims_from_text(
                    backend, chunk,
                    source_name=src["name"] if src else "",
                    source_cluster=src["source_cluster"] if src else None,
                    date=doc["date"], kind=doc["kind"],
                )
                doc_claims += claims_mod.persist_claims(conn, doc=doc, source=src, claims=cl, chunk_idx=idx)
            conn.execute("UPDATE documents SET processed_at=datetime('now') WHERE doc_id=?", (doc["doc_id"],))
            conn.commit()
            queue.complete(conn, job["job_id"], output_ref=f"{doc_claims} claims / {len(chunks)} chunks")
            total_claims += doc_claims
            log.info("extracted %d claims from %s (%d chunks)", doc_claims, doc["doc_id"], len(chunks))
        except Exception as e:  # noqa: BLE001
            state = queue.fail(conn, job["job_id"], e)
            log.warning("extract failed for %s: %s (→ %s)", job["target_id"], e, state)
    return {"jobs_processed": processed, "claims_written": total_claims}
@@ -0,0 +1,5 @@
"""Ingestion layer (§4.1) — the biggest greenfield piece.

Spark Control transcribes audio you hand it; it does NOT fetch. Everything here is fetch/schedule:
RSS + YouTube + EDGAR + FMP earnings, long-audio chunking, and cross-chunk speaker stitching.
"""
@@ -0,0 +1,36 @@
"""Long-audio chunking (§4.1, §13.4).

Podcasts run 1–3 h; the diarizer caps at 4 speakers/chunk and Spark 2 is a single GPU, so we cut
long audio into ~2–3 min pieces sent SEQUENTIALLY (parallel audio → 503 FFT race). Each chunk is
diarized independently and re-stitched across chunks by voiceprint (see speaker_stitch.py).
Requires ffmpeg/ffprobe.
"""
from __future__ import annotations

import subprocess
from pathlib import Path

CHUNK_SECONDS_DEFAULT = 150  # 2.5 min, within the ~2–3 min guidance


def duration_seconds(src: str | Path) -> float:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(src)],
        check=True, capture_output=True, text=True,
    )
    return float(out.stdout.strip())


def chunk_audio(src: str | Path, out_dir: str | Path, *, chunk_seconds: int = CHUNK_SECONDS_DEFAULT) -> list[Path]:
    """Split into fixed-length WAV chunks using ffmpeg's segment muxer (no re-encode of timing).
    Returns chunk paths in order. Order matters: the queue sends them sequentially."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pattern = str(out_dir / "chunk_%04d.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-f", "segment", "-segment_time", str(chunk_seconds),
         "-ar", "16000", "-ac", "1", "-reset_timestamps", "1", pattern],
        check=True, capture_output=True,
    )
    return sorted(out_dir.glob("chunk_*.wav"))
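For planning queue depth it can help to estimate how many chunks a recording will produce before running ffmpeg. The arithmetic is just duration divided by `chunk_seconds`, rounded up; the helper below is a hypothetical addition, and its result should be treated as approximate because the actual file count comes from whatever the segment muxer writes.

```python
import math


def expected_chunks(duration_s: float, chunk_seconds: int = 150) -> int:
    """Rough count of fixed-length segments for a recording (hypothetical helper)."""
    return max(1, math.ceil(duration_s / chunk_seconds))


one_hour = expected_chunks(3600)         # 3600 / 150 = 24 segments
three_hours = expected_chunks(3 * 3600)  # 10800 / 150 = 72 segments
```

With the 150-second default, the 1 to 3 hour podcasts mentioned in the docstring correspond to roughly 24 to 72 sequential transcription requests per episode.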
@@ -0,0 +1,159 @@
|
||||
"""Text-document fetcher for the Battery (bitcoin-collateralized lending) corpus and any non-filing,
|
||||
non-audio source: policy primaries (SEC SABs, OCC/FDIC/Fed), lender/issuer blogs, credit-market data.
|
||||
|
||||
Unlike EDGAR (CIK-driven) and the podcast path (audio→transcribe), these are dated HTML pages, PDFs, or
|
||||
article RSS feeds. We fetch ONCE, extract clean text (HTML via html_to_text, PDF via pypdf), save it, and
|
||||
point documents.transcript_path at the saved text so the extract worker reads it directly (it already
|
||||
supports transcript_path) — this also lets PDFs work, which the worker's on-demand html_to_text fetch can't.
|
||||
|
||||
A source row must exist first (FK). Lineage/axis live on the source's cluster/notes (set in the seed);
|
||||
policy sources are axis=context and must NOT feed the supply resolver (weight 0) — enforced downstream.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import io
|
||||
import logging
|
||||
import sqlite3
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
|
||||
from ..backfill import queue
|
||||
from ..extract.html_text import html_to_text
|
||||
from .feeds import fetch_feed
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
DEFAULT_UA = "ten31-signal-engine/1.0 (research; contact ops@ten31.xyz)"
|
||||
|
||||
|
||||
def _pdf_to_text(data: bytes, *, max_chars: int) -> str:
|
||||
import pypdf
|
||||
reader = pypdf.PdfReader(io.BytesIO(data))
|
||||
parts: list[str] = []
|
||||
total = 0
|
||||
for page in reader.pages:
|
||||
t = page.extract_text() or ""
|
||||
parts.append(t)
|
||||
total += len(t)
|
||||
if total > max_chars:
|
||||
break
|
||||
return "\n".join(parts)[:max_chars]
|
||||
|
||||
|
||||
def fetch_clean_text(url: str, *, method: str = "auto", ua: str = DEFAULT_UA,
|
||||
timeout: int = 90, max_chars: int = 300_000) -> str:
|
||||
"""Fetch a URL once and return clean text. Auto-detects PDF vs HTML by content-type + magic bytes."""
|
||||
r = requests.get(url, headers={"User-Agent": ua}, timeout=timeout)
|
||||
r.raise_for_status()
|
||||
ctype = r.headers.get("Content-Type", "").lower()
|
||||
is_pdf = method == "pdf" or "application/pdf" in ctype or r.content[:5] == b"%PDF-"
|
||||
if is_pdf:
|
||||
return _pdf_to_text(r.content, max_chars=max_chars)
|
||||
return html_to_text(r.text, max_chars=max_chars)
|
||||
|
||||
|
||||
_BLOCK_MARKERS = (
|
||||
"aggressive automated scraping", "request access", "access denied", "are you a robot",
|
||||
"enable javascript", "captcha", "verify you are human", "rate limit exceeded",
|
||||
"403 forbidden", "unusual traffic", "checking your browser",
|
||||
)
|
||||
|
||||
|
||||
def _looks_blocked(text: str) -> bool:
|
||||
"""Anti-scraping interstitials return 200 + a short access-denied body. Detect so we don't ingest
|
||||
a block page as if it were the document (a real policy/blog doc is long and has no such markers)."""
|
||||
low = text[:2500].lower()
|
||||
return any(m in low for m in _BLOCK_MARKERS)
|
||||
|
||||
|
||||
def _doc_id(source_id: str, url: str) -> str:
|
||||
return f"doc:{source_id}:{hashlib.sha256(url.encode()).hexdigest()[:12]}"
|
||||
|
||||
|
||||
def ingest_one(conn: sqlite3.Connection, cfg, *, source_id: str, url: str, title: str,
|
||||
date: str | None, method: str = "auto", prompt_version: str = "extract-v0",
|
||||
min_chars: int = 400) -> str | None:
|
||||
"""Fetch+store one text document and enqueue extraction. Idempotent on (source_id, url).
|
||||
Returns doc_id if newly ingested, else None (duplicate, too-short, or fetch error → logged)."""
|
||||
doc_id = _doc_id(source_id, url)
|
||||
if conn.execute("SELECT 1 FROM documents WHERE doc_id=?", (doc_id,)).fetchone():
|
||||
return None
|
||||
ua = getattr(cfg, "user_agent", None) or DEFAULT_UA
|
||||
try:
|
||||
text = fetch_clean_text(url, method=method, ua=ua)
|
||||
except Exception as e: # noqa: BLE001
|
||||
log.warning("doc fetch failed %s: %s", url, e)
|
||||
return None
|
||||
if not text or len(text) < min_chars:
|
||||
log.warning("doc too short (%d chars), skipping %s", len(text or ""), url)
|
||||
return None
|
||||
if _looks_blocked(text):
|
||||
log.warning("blocked/anti-scrape page detected, skipping %s", url)
|
||||
return None
|
||||
safe = doc_id.replace(":", "_")
|
||||
tpath = Path(cfg.data_dir) / "docs" / f"{safe}.txt"
|
||||
tpath.parent.mkdir(parents=True, exist_ok=True)
|
||||
tpath.write_text(text)
|
||||
content_hash = hashlib.sha256(text.encode()).hexdigest()
|
||||
conn.execute(
|
||||
"""INSERT OR IGNORE INTO documents
|
||||
(doc_id, source_id, kind, external_id, url, title, date, transcript_path, content_hash, processed_at)
|
||||
VALUES (?,?,?,?,?,?,?,?,?,datetime('now'))""",
|
||||
(doc_id, source_id, "filing", url, url, title[:300] if title else url, date, str(tpath), content_hash),
|
||||
)
|
||||
conn.commit()
|
||||
h = hashlib.sha256(f"{doc_id}|{prompt_version}".encode()).hexdigest()
|
||||
queue.enqueue(conn, job_type="extract", target_id=doc_id, input_hash=h,
|
||||
parent_doc_id=doc_id, priority=50)
|
||||
conn.commit()
|
||||
log.info("ingested doc %s (%d chars) for %s", doc_id, len(text), source_id)
|
||||
return doc_id
|
||||
|
||||
|
||||
def ingest_manifest(conn: sqlite3.Connection, cfg, path) -> dict:
|
||||
"""Batch-ingest the docs listed in a YAML manifest ({docs:[{source,url,title,date,method}]}).
|
||||
Returns {ingested, skipped, missing_source}. Each source must already exist (FK)."""
|
||||
import yaml
|
||||
from pathlib import Path as _Path
|
||||
data = yaml.safe_load(_Path(path).read_text()) or {}
|
||||
docs = data.get("docs", [])
|
||||
ingested = skipped = missing = 0
|
||||
for d in docs:
|
||||
src = d.get("source")
|
||||
if not conn.execute("SELECT 1 FROM sources WHERE source_id=?", (src,)).fetchone():
|
||||
log.warning("manifest doc references missing source %r — skipping %s", src, d.get("url"))
|
||||
missing += 1
|
||||
continue
|
||||
doc_id = ingest_one(conn, cfg, source_id=src, url=d["url"], title=d.get("title", d["url"]),
|
||||
date=d.get("date"), method=d.get("method", "auto"))
|
||||
if doc_id:
|
||||
ingested += 1
|
||||
else:
|
||||
skipped += 1
|
||||
return {"ingested": ingested, "skipped": skipped, "missing_source": missing}
|
||||
|
||||
|
||||
def ingest_feed_text(conn: sqlite3.Connection, cfg, *, source_id: str, rss_url: str,
|
||||
since: str | None = None, until: str | None = None, limit: int = 50) -> int:
|
||||
"""Ingest the ARTICLE bodies behind a text RSS feed (blog/press feed). Each item's link is fetched
|
||||
and stored as a dated text document. Returns count of newly-ingested docs."""
|
||||
from .feeds import _published_iso
|
||||
parsed = fetch_feed(rss_url, user_agent=getattr(cfg, "user_agent", None) or DEFAULT_UA)
|
||||
n = 0
|
||||
for entry in parsed.entries:
|
||||
if n >= limit:
|
||||
break
|
||||
link = entry.get("link")
|
||||
if not link:
|
||||
continue
|
||||
date = _published_iso(entry)
|
||||
if since and date and date < since:
|
||||
continue
|
||||
if until and date and date > until:
|
||||
continue
|
||||
if ingest_one(conn, cfg, source_id=source_id, url=link,
|
||||
title=entry.get("title", link), date=date):
|
||||
n += 1
|
||||
return n
|
||||
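The `since`/`until` checks in `ingest_feed_text` compare ISO `YYYY-MM-DD` strings lexicographically, and an entry with no parseable date passes both checks. A minimal sketch of that behaviour, assuming zero-padded ISO dates; `in_window` is a hypothetical helper name, not part of the module:

```python
from __future__ import annotations


def in_window(date: str | None, since: str | None = None, until: str | None = None) -> bool:
    """Hypothetical helper mirroring the inline since/until checks above."""
    # Zero-padded ISO dates sort the same way as strings and as calendar dates.
    if since and date and date < since:
        return False
    if until and date and date > until:
        return False
    # Undated entries fall through: neither bound can reject a missing date.
    return True


entries = [("a", "2022-12-31"), ("b", "2023-05-01"), ("c", None), ("d", "2024-01-01")]
kept = [name for name, d in entries if in_window(d, "2023-01-01", "2023-12-31")]
```

Both bounds are inclusive, so an item dated exactly `since` or `until` is kept.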
@@ -0,0 +1,61 @@
|
||||
"""Audio acquisition (§4.1). Spark Control transcribes audio you fetch — this fetches it.
|
||||
|
||||
- Podcast enclosures: a plain streaming download that follows the Podtrac/Megaphone redirects to the
|
||||
final signed CDN object (download immediately; resolved URLs carry short-lived params).
|
||||
- YouTube: yt-dlp (audio-only → 16 kHz mono WAV). NOTE: 2026 YouTube enforces PO Tokens broadly — run
|
||||
the `bgutil-ytdlp-pot-provider` sidecar or pulls will 403. yt-dlp is treated as a LAST resort; prefer
|
||||
the RSS enclosure where a show publishes both (ToS: downloading YT audio violates YouTube ToS).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
|
||||
DEFAULT_UA = "Ten31SignalEngine/0.1 (+https://ten31.xyz)"
|
||||
|
||||
|
||||
def download_enclosure(url: str, dest: str | Path, *, user_agent: str = DEFAULT_UA, timeout: int = 120) -> Path:
|
||||
dest = Path(dest)
|
||||
dest.parent.mkdir(parents=True, exist_ok=True)
|
||||
with requests.get(url, stream=True, allow_redirects=True,
|
||||
headers={"User-Agent": user_agent}, timeout=timeout) as r:
|
||||
r.raise_for_status()
|
||||
with open(dest, "wb") as f:
|
||||
for chunk in r.iter_content(chunk_size=1 << 16):
|
||||
f.write(chunk)
|
||||
return dest
|
||||
|
||||
|
||||
def to_wav_16k_mono(src: str | Path, dst: str | Path) -> Path:
|
||||
"""Normalize any audio to 16 kHz mono PCM WAV (what the ASR endpoint wants). Requires ffmpeg."""
|
||||
dst = Path(dst)
|
||||
dst.parent.mkdir(parents=True, exist_ok=True)
|
||||
subprocess.run(
|
||||
["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", "-f", "wav", str(dst)],
|
||||
check=True, capture_output=True,
|
||||
)
|
||||
return dst
|
||||
|
||||
|
||||
def download_youtube_audio(url: str, out_dir: str | Path, *, archive_file: str | Path | None = None) -> Path:
|
||||
"""Audio-only via yt-dlp → 16 kHz mono WAV. `archive_file` (yt-dlp --download-archive) is the
|
||||
canonical 'only-new' dedup for channel/playlist back-catalog pulls."""
|
||||
out_dir = Path(out_dir)
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
cmd = [
|
||||
"yt-dlp", "-f", "bestaudio/best", "-x", "--audio-format", "wav",
|
||||
"--postprocessor-args", "ffmpeg:-ar 16000 -ac 1",
|
||||
"-o", str(out_dir / "%(id)s.%(ext)s"),
|
||||
"--no-progress",
|
||||
]
|
||||
if archive_file:
|
||||
cmd += ["--download-archive", str(archive_file)]
|
||||
cmd.append(url)
|
||||
subprocess.run(cmd, check=True, capture_output=True)
|
||||
# yt-dlp names the file by video id; return the newest wav
|
||||
wavs = sorted(out_dir.glob("*.wav"), key=lambda p: p.stat().st_mtime)
|
||||
if not wavs:
|
||||
raise RuntimeError("yt-dlp produced no wav (PO-token/cookies issue? see module docstring)")
|
||||
return wavs[-1]
|
||||
@@ -0,0 +1,127 @@
|
||||
"""Earnings-call transcripts via Financial Modeling Prep (§4.1, §12 — decision: FMP).
|
||||
|
||||
Audio isn't reliably fetchable for large-caps (no uniform feed; ~30–90d replay expiry breaks
|
||||
backfill), so FMP's transcript API is the backbone and EDGAR filings remain the durable core. FMP
|
||||
also exposes an earnings *calendar* to trigger ingestion on the day a call drops.
|
||||
|
||||
Endpoint paths/params are marked TODO(contract): confirm against the FMP 'stable' docs for the
|
||||
account tier at integration. Needs config.fmp_api_key.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import sqlite3
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import requests
|
||||
|
||||
FMP_BASE = "https://financialmodelingprep.com/stable"
|
||||
|
||||
|
||||
class FMPClient:
|
||||
def __init__(self, api_key: str, *, base: str = FMP_BASE, timeout: int = 30) -> None:
|
||||
if not api_key:
|
||||
raise ValueError("FMP_API_KEY is required for earnings-call transcripts")
|
||||
self.api_key = api_key
|
||||
self.base = base
|
||||
self.timeout = timeout
|
||||
self.s = requests.Session()
|
||||
|
||||
def _get(self, path: str, **params: Any) -> Any:
|
||||
params["apikey"] = self.api_key
|
||||
r = self.s.get(f"{self.base}/{path}", params=params, timeout=self.timeout)
|
||||
r.raise_for_status()
|
||||
return r.json()
|
||||
|
||||
# Confirmed against FMP 'stable' 2026-06-07 (v3 is legacy/403). Note singular "earning".
|
||||
def transcript_dates(self, symbol: str) -> Any:
|
||||
"""List available transcripts: [{quarter, fiscalYear, date}, ...]."""
|
||||
return self._get("earning-call-transcript-dates", symbol=symbol)
|
||||
|
||||
def transcript(self, symbol: str, *, year: int, quarter: int) -> Any:
|
||||
"""One transcript: [{symbol, period, year, date, content}]. Use the `date` field as the
|
||||
document date — FMP's year/quarter labels are fiscal and can be offset from the call date."""
|
||||
return self._get("earning-call-transcript", symbol=symbol, year=year, quarter=quarter)
|
||||
|
||||
def earnings_calendar(self, *, from_date: str, to_date: str) -> Any:
|
||||
"""Earnings calendar (ingestion trigger): [{symbol, date, epsActual, ...}, ...]."""
|
||||
return self._get("earnings-calendar", **{"from": from_date, "to": to_date})
|
||||
|
||||
|
||||
def ingest_transcript(
|
||||
conn: sqlite3.Connection,
|
||||
*,
|
||||
source_id: str,
|
||||
symbol: str,
|
||||
year: int,
|
||||
quarter: int,
|
||||
content: str,
|
||||
date: str | None,
|
||||
data_dir: Path,
|
||||
prompt_version: str = "extract-v0",
|
||||
) -> tuple[bool, bool]:
|
||||
"""Store one transcript (content written to disk → transcript_path) and enqueue an 'extract'
|
||||
job. Idempotent. Returns (new_document, new_job)."""
|
||||
from ..backfill import queue
|
||||
|
||||
external_id = f"{symbol}-{year}Q{quarter}"
|
||||
doc_id = f"earnings:{external_id}"
|
||||
tdir = Path(data_dir) / "transcripts"
|
||||
tdir.mkdir(parents=True, exist_ok=True)
|
||||
tpath = tdir / f"{external_id}.txt"
|
||||
tpath.write_text(content, encoding="utf-8")  # match the UTF-8 bytes hashed below
|
||||
content_hash = hashlib.sha256(content.encode()).hexdigest()
|
||||
cur = conn.execute(
|
||||
"""INSERT OR IGNORE INTO documents
|
||||
(doc_id, source_id, kind, external_id, title, date, transcript_path, content_hash, processed_at)
|
||||
VALUES (?,?,?,?,?,?,?,?, datetime('now'))""",
|
||||
(doc_id, source_id, "earnings_call", external_id, f"{symbol} {year} Q{quarter} call",
|
||||
date, str(tpath), content_hash),
|
||||
)
|
||||
conn.commit()
|
||||
if not cur.rowcount:
|
||||
return (False, False)
|
||||
# earnings-call Q&A is the highest-yield text source (§4.1) → priority 40, ahead of filings (50).
|
||||
h = hashlib.sha256(f"{doc_id}|{prompt_version}".encode()).hexdigest()
|
||||
new_job = queue.enqueue(conn, job_type="extract", target_id=doc_id, input_hash=h,
|
||||
parent_doc_id=doc_id, priority=40) is not None
|
||||
return (True, new_job)
|
||||
|
||||
|
||||
def ingest_for_ticker(
|
||||
conn: sqlite3.Connection,
|
||||
fmp: FMPClient,
|
||||
*,
|
||||
source_id: str,
|
||||
symbol: str,
|
||||
data_dir: Path,
|
||||
since: str | None = None,
|
||||
until: str | None = None,
|
||||
limit: int = 8,
|
||||
) -> tuple[int, int]:
|
||||
"""Enumerate available transcripts via the dates index, fetch those in [since, until], and
|
||||
ingest. Uses each transcript's own `date` (FMP fiscal labels can be offset from the call date). Returns (docs, jobs)."""
|
||||
dates = fmp.transcript_dates(symbol)
|
||||
picked = []
|
||||
for d in dates if isinstance(dates, list) else []:
|
||||
dt = d.get("date")
|
||||
if since and dt and dt < since:
|
||||
continue
|
||||
if until and dt and dt > until:
|
||||
continue
|
||||
picked.append(d)
|
||||
n_docs = n_jobs = 0
|
||||
for d in picked[:limit]:
|
||||
tr = fmp.transcript(symbol, year=d["fiscalYear"], quarter=d["quarter"])
|
||||
item = (tr[0] if isinstance(tr, list) and tr else tr) or {}
|
||||
content = item.get("content") or ""
|
||||
if not content:
|
||||
continue
|
||||
nd, nj = ingest_transcript(
|
||||
conn, source_id=source_id, symbol=symbol, year=d["fiscalYear"], quarter=d["quarter"],
|
||||
content=content, date=item.get("date") or d.get("date"), data_dir=data_dir,
|
||||
)
|
||||
n_docs += int(nd)
|
||||
n_jobs += int(nj)
|
||||
return n_docs, n_jobs
|
||||
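The identifiers that `ingest_transcript` derives are deterministic, which is what makes a re-run a no-op. A minimal sketch of the derivation, using the same string formats as the code above; `earnings_ids` is a hypothetical helper, and how the queue deduplicates on `input_hash` is an assumption not shown in this file:

```python
import hashlib


def earnings_ids(symbol: str, year: int, quarter: int, prompt_version: str = "extract-v0"):
    """Hypothetical helper: rebuild the ids used for one earnings-call transcript."""
    external_id = f"{symbol}-{year}Q{quarter}"
    doc_id = f"earnings:{external_id}"
    # The job hash covers the prompt version, so a new prompt yields a new hash.
    input_hash = hashlib.sha256(f"{doc_id}|{prompt_version}".encode()).hexdigest()
    return external_id, doc_id, input_hash


ext_id, doc_id, h0 = earnings_ids("AAPL", 2023, 4)
_, _, h0_again = earnings_ids("AAPL", 2023, 4)
_, _, h1 = earnings_ids("AAPL", 2023, 4, prompt_version="extract-v1")
```

Because the year and quarter come from FMP's fiscal labels, the id identifies the fiscal period, while the stored `date` carries the actual call date.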
@@ -0,0 +1,148 @@
|
||||
"""SEC EDGAR ingestion (§4.1).
|
||||
|
||||
Hits the official data.sec.gov / www.sec.gov APIs directly (free, keyless, full history).
|
||||
Two hard requirements:
|
||||
- a descriptive User-Agent (SEC 403s requests without one) — from config.edgar_user_agent.
|
||||
- ≤10 requests/sec aggregate — enforced by a min-interval throttle here.
|
||||
|
||||
Supports an explicit date range AND historical shards (filings.files[]), so the §7.1 backtest can
|
||||
reach 2022–2023 filings, not just the most-recent ~1000.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import sqlite3
|
||||
import time
|
||||
from typing import Iterator
|
||||
|
||||
import requests
|
||||
|
||||
_FILING_COLS = ("accessionNumber", "form", "filingDate", "primaryDocument", "primaryDocDescription")
|
||||
|
||||
|
||||
class EdgarClient:
|
||||
BASE_DATA = "https://data.sec.gov"
|
||||
BASE_WWW = "https://www.sec.gov"
|
||||
|
||||
def __init__(self, user_agent: str, *, min_interval: float = 0.12) -> None:
|
||||
if not user_agent or "@" not in user_agent:
|
||||
raise ValueError("EDGAR requires a descriptive User-Agent with contact email (config.edgar_user_agent)")
|
||||
self.s = requests.Session()
|
||||
self.s.headers.update({"User-Agent": user_agent, "Accept-Encoding": "gzip, deflate"})
|
||||
self.min_interval = min_interval
|
||||
self._last = 0.0
|
||||
self._tickers: dict[str, int] | None = None
|
||||
|
||||
def _throttle(self) -> None:
|
||||
dt = time.monotonic() - self._last
|
||||
if dt < self.min_interval:
|
||||
time.sleep(self.min_interval - dt)
|
||||
self._last = time.monotonic()
|
||||
|
||||
def _get(self, url: str) -> requests.Response:
|
||||
self._throttle()
|
||||
r = self.s.get(url, timeout=30)
|
||||
r.raise_for_status()
|
||||
return r
|
||||
|
||||
# ---- ticker → CIK ----
|
||||
def ticker_map(self) -> dict[str, int]:
|
||||
if self._tickers is None:
|
||||
data = self._get(f"{self.BASE_WWW}/files/company_tickers.json").json()
|
||||
self._tickers = {row["ticker"].upper(): int(row["cik_str"]) for row in data.values()}
|
||||
return self._tickers
|
||||
|
||||
def cik_for(self, ticker: str) -> int | None:
|
||||
return self.ticker_map().get(ticker.upper())
|
||||
|
||||
# ---- filings ----
|
||||
def _iter_array(self, block: dict, forms, since, until) -> Iterator[dict]:
|
||||
arrays = [block.get(c, []) for c in _FILING_COLS]
|
||||
for acc, form, fdate, pdoc, pdesc in zip(*arrays):
|
||||
if forms and form not in forms:
|
||||
continue
|
||||
if since and fdate < since:
|
||||
continue
|
||||
if until and fdate > until:
|
||||
continue
|
||||
yield {"accession": acc, "form": form, "filing_date": fdate,
|
||||
"primary_document": pdoc, "description": pdesc}
|
||||
|
||||
def iter_filings(
|
||||
self,
|
||||
cik: int,
|
||||
*,
|
||||
forms: tuple[str, ...] = ("10-K", "10-Q", "8-K"),
|
||||
since: str | None = None,
|
||||
until: str | None = None,
|
||||
) -> Iterator[dict]:
|
||||
"""Yield filing descriptors. Pulls the inline 'recent' block AND any historical shards whose
|
||||
date window overlaps [since, until] — required to reach the backtest era for active filers."""
|
||||
sub = self._get(f"{self.BASE_DATA}/submissions/CIK{cik:010d}.json").json()
|
||||
recent = sub.get("filings", {}).get("recent", {})
|
||||
for f in self._iter_array(recent, forms, since, until):
|
||||
yield self._with_url(cik, f)
|
||||
for shard in sub.get("filings", {}).get("files", []):
|
||||
# shard has filingFrom / filingTo; skip shards entirely outside the window.
|
||||
if until and shard.get("filingFrom", "") > until:
|
||||
continue
|
||||
if since and shard.get("filingTo", "9999") < since:
|
||||
continue
|
||||
block = self._get(f"{self.BASE_DATA}/submissions/{shard['name']}").json()
|
||||
for f in self._iter_array(block, forms, since, until):
|
||||
yield self._with_url(cik, f)
|
||||
|
||||
def _with_url(self, cik: int, f: dict) -> dict:
|
||||
acc_nodash = f["accession"].replace("-", "")
|
||||
f["cik"] = cik
|
||||
f["url"] = f"{self.BASE_WWW}/Archives/edgar/data/{cik}/{acc_nodash}/{f['primary_document']}"
|
||||
return f
|
||||
|
||||
def fetch_html(self, filing: dict) -> str:
|
||||
return self._get(filing["url"]).text
|
||||
|
||||
|
||||
# Domestic annual/quarterly + foreign-private-issuer equivalents. 20-F (foreign annual, e.g. TSM/IREN),
|
||||
# 40-F (Canadian annual, e.g. CCJ). 8-K/6-K (current reports) excluded by default — low claim yield.
|
||||
HIGH_YIELD_FORMS = ("10-K", "10-Q", "20-F", "40-F")
|
||||
|
||||
|
||||
def ingest_filings(
|
||||
conn: sqlite3.Connection,
|
||||
client: EdgarClient,
|
||||
*,
|
||||
source_id: str,
|
||||
ticker: str,
|
||||
since: str | None = None,
|
||||
until: str | None = None,
|
||||
forms: tuple[str, ...] = HIGH_YIELD_FORMS,
|
||||
prompt_version: str = "extract-v0",
|
||||
) -> tuple[int, int]:
|
||||
"""Insert filing documents and enqueue 'extract' jobs. Filings are text → no transcription;
|
||||
they go straight to extraction (the extract worker fetches + strips the HTML later). Default
|
||||
forms cover both domestic (10-K/10-Q) and foreign-private-issuer (20-F/40-F) filers.
|
||||
Returns (new_documents, new_jobs). Idempotent on (source_id, accession)."""
|
||||
from ..backfill import queue
|
||||
|
||||
cik = client.cik_for(ticker)
|
||||
if cik is None:
|
||||
raise ValueError(f"No CIK found for ticker {ticker!r}")
|
||||
n_docs = n_jobs = 0
|
||||
for f in client.iter_filings(cik, forms=forms, since=since, until=until):
|
||||
doc_id = f"edgar:{f['accession']}"
|
||||
cur = conn.execute(
|
||||
"""INSERT OR IGNORE INTO documents (doc_id, source_id, kind, external_id, url, title, date)
|
||||
VALUES (?,?,?,?,?,?,?)""",
|
||||
(doc_id, source_id, "filing", f["accession"], f["url"],
|
||||
f"{ticker} {f['form']} {f['filing_date']}", f["filing_date"]),
|
||||
)
|
||||
conn.commit()
|
||||
if not cur.rowcount:
|
||||
continue
|
||||
n_docs += 1
|
||||
h = hashlib.sha256(f"{doc_id}|{prompt_version}".encode()).hexdigest()
|
||||
# priority 50: filings are high-info-density (§4.1) → ahead of podcasts (100)
|
||||
if queue.enqueue(conn, job_type="extract", target_id=doc_id, input_hash=h,
|
||||
parent_doc_id=doc_id, priority=50) is not None:
|
||||
n_jobs += 1
|
||||
return n_docs, n_jobs
|
||||
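Two small pieces of `EdgarClient` are easy to check in isolation: the URL formats and the shard pruning in `iter_filings`. The sketch below restates both as standalone functions; the function names and the CIK, accession, and document values are made up for illustration, and only the string formats and comparisons are taken from the code above:

```python
from __future__ import annotations

BASE_DATA = "https://data.sec.gov"
BASE_WWW = "https://www.sec.gov"


def submissions_url(cik: int) -> str:
    # The submissions endpoint expects the CIK zero-padded to 10 digits.
    return f"{BASE_DATA}/submissions/CIK{cik:010d}.json"


def filing_url(cik: int, accession: str, primary_document: str) -> str:
    # The archive path uses the unpadded CIK and the accession without dashes.
    return f"{BASE_WWW}/Archives/edgar/data/{cik}/{accession.replace('-', '')}/{primary_document}"


def shard_overlaps(shard: dict, since: str | None, until: str | None) -> bool:
    """True when the shard's [filingFrom, filingTo] range can intersect [since, until]."""
    if until and shard.get("filingFrom", "") > until:
        return False
    if since and shard.get("filingTo", "9999") < since:
        return False
    return True


# Illustrative values only, not a real filer.
sub = submissions_url(1234567)
doc = filing_url(1234567, "0001234567-23-000042", "example-10k.htm")
old_shard = {"name": "CIK0001234567-submissions-001.json",
             "filingFrom": "2015-01-01", "filingTo": "2019-12-31"}
```

Pruning on the shard's own date range avoids one throttled request per shard that cannot contain filings in the requested window.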
@@ -0,0 +1,65 @@
|
||||
"""Podcast RSS ingestion (§4.1).
|
||||
|
||||
feedparser + conditional GET (ETag/Last-Modified) for efficient incremental polling, with a
|
||||
composite (feed_url, guid) dedup discipline. Many podcast CDNs send no validators and some feeds
|
||||
truncate to recent episodes — for the §7.1 backtest, older episodes may need the show's full
|
||||
archive feed (some hosts expose `?limit=` / a separate archive URL) or a YouTube back-catalog.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import time
|
||||
from typing import Any
|
||||
|
||||
import feedparser
|
||||
|
||||
DEFAULT_UA = "Ten31SignalEngine/0.1 (+https://ten31.xyz)"
|
||||
|
||||
|
||||
def fetch_feed(url: str, *, etag: str | None = None, modified: str | None = None,
|
||||
user_agent: str = DEFAULT_UA) -> feedparser.FeedParserDict:
|
||||
"""Conditional GET. On HTTP 304 the result has .status == 304 and .entries == [] → skip."""
|
||||
return feedparser.parse(url, etag=etag, modified=modified, agent=user_agent)
|
||||
|
||||
|
||||
def _published_iso(entry: Any) -> str | None:
|
||||
t = entry.get("published_parsed") or entry.get("updated_parsed")
|
||||
if not t:
|
||||
return None
|
||||
return time.strftime("%Y-%m-%d", t)
|
||||
|
||||
|
||||
def _enclosure_audio_url(entry: Any) -> str | None:
|
||||
for enc in entry.get("enclosures", []) or []:
|
||||
if str(enc.get("type", "")).startswith("audio"):
|
||||
return enc.get("href") or enc.get("url")
|
||||
# some feeds put audio only in links rel=enclosure
|
||||
for link in entry.get("links", []) or []:
|
||||
if link.get("rel") == "enclosure" and str(link.get("type", "")).startswith("audio"):
|
||||
return link.get("href")
|
||||
return None
|
||||
|
||||
|
||||
def _guid(entry: Any) -> str:
|
||||
g = entry.get("id") or entry.get("link")
|
||||
if g:
|
||||
return str(g)
|
||||
basis = f"{entry.get('title','')}|{entry.get('published','')}"
|
||||
return "sha1:" + hashlib.sha1(basis.encode()).hexdigest()
|
||||
|
||||
|
||||
def episode_records(parsed: feedparser.FeedParserDict) -> list[dict]:
|
||||
"""Normalize feed entries to episode records. Skips entries with no audio enclosure."""
|
||||
out: list[dict] = []
|
||||
for e in parsed.entries:
|
||||
audio = _enclosure_audio_url(e)
|
||||
if not audio:
|
||||
continue
|
||||
out.append({
|
||||
"guid": _guid(e),
|
||||
"title": e.get("title"),
|
||||
"audio_url": audio,
|
||||
"link": e.get("link"),
|
||||
"published": _published_iso(e),
|
||||
})
|
||||
return out
|
||||
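The `_guid` fallback can be illustrated with plain dicts in place of feedparser entries. This is a sketch under that assumption; `stable_guid` is a hypothetical name and the entries are invented:

```python
import hashlib


def stable_guid(entry: dict) -> str:
    """Hypothetical restatement of _guid for plain dict entries."""
    g = entry.get("id") or entry.get("link")
    if g:
        return str(g)
    # No id and no link: hash title + published so repeated polls agree.
    basis = f"{entry.get('title', '')}|{entry.get('published', '')}"
    return "sha1:" + hashlib.sha1(basis.encode()).hexdigest()


with_id = stable_guid({"id": "ep-101", "link": "https://example.com/ep-101"})
with_link = stable_guid({"link": "https://example.com/ep-102"})
hashed = stable_guid({"title": "Episode 103", "published": "Mon, 01 May 2023 10:00:00 GMT"})
hashed_again = stable_guid({"title": "Episode 103", "published": "Mon, 01 May 2023 10:00:00 GMT"})
```

The hashed form is only as stable as the feed's title and published strings; if a host edits either one, the same episode gets a new GUID.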
@@ -0,0 +1,195 @@
|
||||
"""One-time backfill path: transcribe podcast episodes via the Gemini multimodal API instead of the
|
||||
local Spark Parakeet+diarizer pipeline. Used to take a bulk backfill OFF the shared Spark GPU (which
|
||||
contends with production) — it is NOT the steady-state transcriber (local Parakeet remains the default).
|
||||
|
||||
Scope/guardrail: podcast audio is PUBLIC data, so sending it to the frontier does NOT trip the
|
||||
exposure/positioning-data rule (that guardrail is about Ten31's conviction/exposure data, never public
|
||||
audio). Output is written in the SAME 'Speaker: text' transcript format the extractor consumes, so the
|
||||
downstream extract→embed stages are agnostic to which transcriber produced the file.
|
||||
|
||||
Tradeoff vs local: Gemini yields speaker-LABELED text, not voiceprint fingerprints — so no voiceprint
|
||||
auto-edges. We rely on the hand-seeded EISC edges + name-based attribution instead (acceptable for a
|
||||
bounded backfill).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import logging
|
||||
import re
|
||||
import time
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from pathlib import Path
|
||||
|
||||
from ..backfill import queue
|
||||
from .download import download_enclosure
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
_PROMPT = (
|
||||
"You are a precise podcast transcriptionist. Transcribe this audio VERBATIM as a speaker-diarized "
|
||||
"transcript.\n"
|
||||
"RULES:\n"
|
||||
"- One line per speaker turn, formatted exactly as `Name: spoken text` (a colon and one space).\n"
|
||||
"- The host of this show is {host} — label every host turn with exactly `{host}` (the person's "
|
||||
"name, never the show's name).\n"
|
||||
"- When the host introduces a guest by name (e.g. 'welcome X to the show', 'I'm joined by X'), use "
|
||||
"that real first name (or full name) as the guest's label for the WHOLE transcript. Only fall back "
|
||||
"to `Guest` (or `Guest 2`, `Guest 3`) if a name is never stated. Do not invent names.\n"
|
||||
"- Do NOT include timestamps, ad-read markers, summaries, headings, markdown, or any commentary. "
|
||||
"Only the transcript lines.\n"
|
||||
"- Transcribe the entire episode from start to finish. Do not stop early or summarize.\n"
|
||||
)
|
||||
|
||||
|
||||
def _host_person(source_name: str) -> str:
|
||||
"""Derive the host's PERSON name from a source/show name so claimant attribution isn't the show.
|
||||
'What Bitcoin Did (Peter McCormack)' -> 'Peter McCormack'; 'Stephan Livera Podcast' -> 'Stephan
|
||||
Livera'; 'The Kevin Rooke Show' -> 'Kevin Rooke'; 'The Anita Posch Show' -> 'Anita Posch'."""
|
||||
m = re.search(r"\(([^)]+)\)", source_name or "")
|
||||
if m:
|
||||
return m.group(1).strip()
|
||||
s = re.sub(r"^The\s+", "", source_name or "").strip()
|
||||
s = re.sub(r"\s+(Podcast|Show)$", "", s, flags=re.I).strip()
|
||||
return s
|
||||
|
||||
|
||||
def _sniff_audio_mime(path: Path) -> str:
|
||||
"""Determine audio MIME from the file header — the downloaded enclosure has a generic `.src`
|
||||
extension, so the Files API can't infer it and rejects the upload without an explicit mime_type."""
|
||||
with open(path, "rb") as fh:
|
||||
head = fh.read(16)
|
||||
if head[:3] == b"ID3" or (len(head) > 1 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0):
|
||||
return "audio/mpeg"
|
||||
if head[4:8] == b"ftyp":
|
||||
return "audio/mp4" # m4a/aac
|
||||
if head[:4] == b"OggS":
|
||||
return "audio/ogg"
|
||||
if head[:4] == b"RIFF":
|
||||
return "audio/wav"
|
||||
if head[:4] == b"fLaC":
|
||||
return "audio/flac"
|
||||
return "audio/mpeg" # podcast default
|
||||
|
||||
|
||||
def _upload_and_wait(client, audio_path: Path, *, poll_s: float = 2.0, timeout_s: float = 300.0):
|
||||
"""Upload to the Files API and wait until the file is ACTIVE (audio is processed server-side)."""
|
||||
from google.genai import types
|
||||
mime = _sniff_audio_mime(audio_path)
|
||||
f = client.files.upload(file=str(audio_path), config=types.UploadFileConfig(mime_type=mime))
|
||||
waited = 0.0
|
||||
while getattr(f.state, "name", str(f.state)) == "PROCESSING" and waited < timeout_s:
|
||||
time.sleep(poll_s)
|
||||
waited += poll_s
|
||||
f = client.files.get(name=f.name)
|
||||
state = getattr(f.state, "name", str(f.state))
|
||||
if state != "ACTIVE":
|
||||
raise RuntimeError(f"Gemini file not ACTIVE (state={state}) for {audio_path.name}")
|
||||
return f
|
||||
|
||||
|
||||
def transcribe_one(client, model: str, audio_path: Path, host_name: str, *,
|
||||
max_output_tokens: int = 65536) -> tuple[str, dict]:
|
||||
"""Transcribe a single audio file → (transcript_text, usage_dict). Network/CPU only; no DB."""
|
||||
from google.genai import types
|
||||
f = _upload_and_wait(client, audio_path)
|
||||
try:
|
||||
resp = client.models.generate_content(
|
||||
model=model,
|
||||
contents=[f, _PROMPT.format(host=host_name or "the host")],
|
||||
config=types.GenerateContentConfig(temperature=0, max_output_tokens=max_output_tokens),
|
||||
)
|
||||
text = (resp.text or "").strip()
|
||||
um = getattr(resp, "usage_metadata", None)
|
||||
usage = {
|
||||
"prompt_tokens": getattr(um, "prompt_token_count", 0) or 0,
|
||||
"output_tokens": getattr(um, "candidates_token_count", 0) or 0,
|
||||
"finish_reason": str(getattr(resp.candidates[0], "finish_reason", "")) if resp.candidates else "",
|
||||
}
|
||||
return text, usage
|
||||
finally:
|
||||
try:
|
||||
client.files.delete(name=f.name)
|
||||
except Exception as e: # noqa: BLE001 — best-effort cleanup
|
||||
log.debug("file cleanup failed for %s: %s", f.name, e)
|
||||
|
||||
|
||||
def _fetch_and_transcribe(client, model: str, cfg, doc, host_name: str) -> dict:
|
||||
"""Worker-thread unit: download enclosure → Gemini transcribe → write transcript file. No DB writes."""
|
||||
cache = Path(cfg.audio_cache_dir)
|
||||
cache.mkdir(parents=True, exist_ok=True)
|
||||
safe = doc["doc_id"].replace(":", "_")
|
||||
src = cache / f"{safe}.src"
|
||||
audio = download_enclosure(doc["url"], src)
|
||||
try:
|
||||
text, usage = transcribe_one(client, model, audio, host_name)
|
||||
if not text or len(text) < 40:
|
||||
raise RuntimeError(f"empty/short transcript ({len(text)} chars)")
|
||||
tpath = Path(cfg.data_dir) / "transcripts" / f"{safe}.txt"
|
||||
tpath.parent.mkdir(parents=True, exist_ok=True)
|
||||
tpath.write_text(text, encoding="utf-8")  # match the UTF-8 bytes hashed for content_hash
|
||||
return {
|
||||
"doc_id": doc["doc_id"], "ok": True, "transcript_path": str(tpath),
|
||||
"n_lines": text.count("\n") + 1, "content_hash": hashlib.sha256(text.encode()).hexdigest(),
|
||||
"usage": usage,
|
||||
}
|
||||
finally:
|
||||
try:
|
||||
if audio.exists():
|
||||
audio.unlink()
|
||||
except Exception: # noqa: BLE001
|
||||
pass
|
||||
|
||||
|
||||
def run_transcribe_gemini(conn, cfg, *, limit: int = 5, concurrency: int = 4,
|
||||
lease_seconds: int = 7200, worker_id: str = "gemini-transcribe") -> dict:
|
||||
"""Lease pending transcribe jobs and transcribe them via Gemini in parallel. DB writes stay on the
|
||||
main thread; only download+API run in the pool. Reports token usage for cost accounting."""
|
||||
from google import genai
|
||||
if not cfg.gemini_api_key:
|
||||
raise RuntimeError("GEMINI_API_KEY not configured")
|
||||
client = genai.Client(api_key=cfg.gemini_api_key)
|
||||
model = cfg.gemini_model or "gemini-2.5-flash"
|
||||
|
||||
# Lease the batch up front (main thread); resolve docs + host names.
|
||||
leased: list[tuple] = []
|
||||
while len(leased) < limit:
|
||||
job = queue.lease_next(conn, worker_id=worker_id, job_types=["transcribe"], lease_seconds=lease_seconds)
|
||||
if job is None:
|
||||
break
|
||||
doc = conn.execute("SELECT * FROM documents WHERE doc_id=?", (job["target_id"],)).fetchone()
|
||||
if doc is None:
|
||||
queue.skip(conn, job["job_id"], "document missing")
|
||||
continue
|
||||
host = conn.execute("SELECT name FROM sources WHERE source_id=?", (doc["source_id"],)).fetchone()
|
||||
leased.append((job, doc, _host_person(host["name"]) if host else ""))
|
||||
|
||||
done = failed = prompt_tok = out_tok = 0
|
||||
with ThreadPoolExecutor(max_workers=concurrency) as pool:
|
||||
futs = {pool.submit(_fetch_and_transcribe, client, model, cfg, doc, host): (job, doc)
|
||||
for (job, doc, host) in leased}
|
||||
for fut in as_completed(futs):
|
||||
job, doc = futs[fut]
|
||||
try:
|
||||
r = fut.result()
|
||||
conn.execute(
|
||||
"UPDATE documents SET transcript_path=?, content_hash=?, processed_at=datetime('now') "
|
||||
"WHERE doc_id=?", (r["transcript_path"], r["content_hash"], doc["doc_id"]),
|
||||
)
|
||||
h = hashlib.sha256(f"{doc['doc_id']}|extract-v0".encode()).hexdigest()
|
||||
queue.enqueue(conn, job_type="extract", target_id=doc["doc_id"], input_hash=h,
|
||||
parent_doc_id=doc["doc_id"], priority=100)
|
||||
queue.complete(conn, job["job_id"], output_ref=f"gemini {r['n_lines']} lines")
|
||||
conn.commit()
|
||||
done += 1
|
||||
prompt_tok += r["usage"]["prompt_tokens"]
|
||||
out_tok += r["usage"]["output_tokens"]
|
||||
fr = r["usage"]["finish_reason"]
|
||||
log.info("gemini transcribed %s (%d lines, %d in/%d out tok%s)", doc["doc_id"],
|
||||
r["n_lines"], r["usage"]["prompt_tokens"], r["usage"]["output_tokens"],
|
||||
", TRUNCATED" if "MAX_TOKENS" in fr else "")
|
||||
except Exception as e: # noqa: BLE001
|
||||
state = queue.fail(conn, job["job_id"], e)
|
||||
conn.commit()
|
||||
failed += 1
|
||||
log.warning("gemini transcribe failed for %s: %s (→ %s)", doc["doc_id"], e, state)
|
||||
return {"done": done, "failed": failed, "prompt_tokens": prompt_tok, "output_tokens": out_tok}
|
||||
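The header checks in `_sniff_audio_mime` can be exercised without touching disk. The sketch below applies the same magic-byte tests to an in-memory header; `sniff_mime_from_header` is a hypothetical name, and the sample headers are hand-built rather than taken from real files:

```python
def sniff_mime_from_header(head: bytes) -> str:
    """Hypothetical variant of _sniff_audio_mime that takes the first bytes directly."""
    # ID3 tag, or an MPEG frame sync (11 set bits: 0xFF then top three bits of byte 2).
    if head[:3] == b"ID3" or (len(head) > 1 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0):
        return "audio/mpeg"
    if head[4:8] == b"ftyp":   # ISO base media container (m4a/aac)
        return "audio/mp4"
    if head[:4] == b"OggS":
        return "audio/ogg"
    if head[:4] == b"RIFF":
        return "audio/wav"
    if head[:4] == b"fLaC":
        return "audio/flac"
    return "audio/mpeg"        # same podcast default as the original


samples = {
    "id3": b"ID3\x03\x00\x00\x00\x00\x00\x00",
    "frame": b"\xff\xfb\x90\x00",
    "m4a": b"\x00\x00\x00\x20ftypM4A ",
    "ogg": b"OggS\x00\x02",
    "wav": b"RIFF\x24\x00\x00\x00WAVE",
    "flac": b"fLaC\x00\x00\x00\x22",
    "empty": b"",
}
detected = {name: sniff_mime_from_header(h) for name, h in samples.items()}
```

Unrecognized or empty headers fall back to `audio/mpeg`, so a wrong guess would surface as a Files API or decoding error rather than here.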
@@ -0,0 +1,45 @@
|
||||
"""Speaker-name identification (§4.5 enhancement).
|
||||
|
||||
In a 1-on-1 interview the host introduces the guest by name at the top. Reading the transcript head
|
||||
with the LLM, we attach a real NAME to each diarized speaker → voiceprints.person_label. This gives
|
||||
the independence graph a SECOND, orthogonal overlap signal: the same NAMED guest across two shows is
|
||||
a shared_guest edge even when the voiceprints don't cluster (different mic/codec/room). It complements
|
||||
voiceprint cosine matching and is robust to fingerprint drift — exactly the case the operator flagged.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
_SYS = (
|
||||
'You identify the speakers in a podcast/interview transcript. Each line is "LABEL: text". '
|
||||
"Using the introduction and context, determine each LABEL's real full name and role. In an "
|
||||
"interview the host normally introduces themselves and the guest within the first minute. Only "
|
||||
"assert a name you can actually support from the text — if you cannot tell, use null. "
|
||||
'Return ONLY JSON: {"speakers": {"<LABEL>": {"name": "Full Name" or null, '
|
||||
'"role": "host"|"guest"|"panelist"|"unknown", "confidence": "low"|"med"|"high"}}}.'
|
||||
)
|
||||
|
||||
|
||||
def identify_speakers(backend, transcript_head: str, *, source_name: str, host_hint: str | None = None) -> dict:
|
||||
"""Returns {label: {name, role, confidence}}. `backend` is any extract.backends backend."""
|
||||
ctx = f"Show: {source_name}."
|
||||
if host_hint:
|
||||
ctx += f" The show's usual host is {host_hint}."
|
||||
ctx += "\n\nTRANSCRIPT (beginning):\n" + transcript_head
|
||||
messages = [{"role": "system", "content": _SYS}, {"role": "user", "content": ctx}]
|
||||
raw = backend.complete_json(messages, max_tokens=600)
|
||||
try:
|
||||
obj = json.loads(raw)
|
||||
except Exception:
|
||||
i, j = raw.find("{"), raw.rfind("}")
|
||||
if i < 0 or j < 0:
|
||||
return {}
|
||||
try:
|
||||
obj = json.loads(raw[i:j + 1])
|
||||
except Exception:
|
||||
return {}
|
||||
spk = obj.get("speakers", {}) if isinstance(obj, dict) else {}
|
||||
return spk if isinstance(spk, dict) else {}
|
||||
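`identify_speakers` tolerates model output that wraps the JSON object in extra prose by retrying on the outermost `{...}` span. A minimal standalone sketch of that salvage step, assuming the backend returns a string; `parse_json_lenient` is a hypothetical name, and the extra `j < i` guard is an addition of this sketch:

```python
import json


def parse_json_lenient(raw: str) -> dict:
    """Hypothetical helper: parse a JSON object, tolerating surrounding chatter."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        # Retry on the span from the first "{" to the last "}".
        i, j = raw.find("{"), raw.rfind("}")
        if i < 0 or j < i:
            return {}
        try:
            obj = json.loads(raw[i:j + 1])
        except json.JSONDecodeError:
            return {}
    # Anything that is not a JSON object (list, number, string) is treated as empty.
    return obj if isinstance(obj, dict) else {}


clean = parse_json_lenient('{"speakers": {"SPEAKER_00": {"name": null, "role": "host"}}}')
wrapped = parse_json_lenient('Sure, here it is: {"speakers": {}} Hope that helps.')
garbage = parse_json_lenient("no json here")
not_object = parse_json_lenient("[1, 2, 3]")
```

Returning an empty dict on every failure keeps the caller's contract simple: no speaker labels means no name-based edges are added for that episode.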
@@ -0,0 +1,111 @@
|
||||
"""Podcast ingestion → documents + 'transcribe' jobs (§4.1).
|
||||
|
||||
RSS path: parse the feed, take episodes in [since, until], register documents pointing at the audio
|
||||
enclosure. YouTube path: enumerate a channel's videos in the date window via yt-dlp (the back-catalog
|
||||
route for the ~9 shows whose RSS is a truncated rolling window — see seeds/podcast_feeds.resolved.yaml).
|
||||
The transcribe worker downloads + processes either kind identically.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import json
|
||||
import logging
|
||||
import sqlite3
|
||||
import subprocess
|
||||
|
||||
from ..backfill import queue
|
||||
from ..util import audio_dedup_key
|
||||
from .feeds import episode_records, fetch_feed
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def _enqueue_doc(conn, *, source_id, kind, external_id, url, title, date) -> tuple[int, int]:
|
||||
doc_id = f"pod:{source_id}:{hashlib.sha1(external_id.encode()).hexdigest()[:12]}"
|
||||
dkey = audio_dedup_key(title, date)
|
||||
# Cross-mirror dedup (pre-GPU): if this same episode was already processed (any source/feed),
|
||||
# record the sighting for provenance but DON'T re-transcribe. (external_id UNIQUE already covers
|
||||
# same-feed re-ingest; this covers the same episode via a different feed/YouTube mirror.)
|
||||
dup = conn.execute(
|
||||
"SELECT doc_id FROM documents WHERE dedup_key=? AND processed_at IS NOT NULL LIMIT 1", (dkey,)
|
||||
).fetchone()
|
||||
cur = conn.execute(
|
||||
"""INSERT OR IGNORE INTO documents (doc_id, source_id, kind, external_id, url, title, date, dedup_key)
|
||||
VALUES (?,?,?,?,?,?,?,?)""",
|
||||
(doc_id, source_id, kind, external_id, url, title, date, dkey),
|
||||
)
|
||||
conn.commit()
|
||||
if not cur.rowcount:
|
||||
return (0, 0) # same (source_id, external_id) already known
|
||||
if dup:
|
||||
conn.execute(
|
||||
"UPDATE documents SET processed_at=datetime('now'), raw_path=? WHERE doc_id=?",
|
||||
(f"dup_of:{dup['doc_id']}", doc_id),
|
||||
)
|
||||
conn.commit()
|
||||
log.info("skip transcribe for %s — duplicate content of %s", doc_id, dup["doc_id"])
|
||||
return (1, 0)
|
||||
h = hashlib.sha256(f"{doc_id}|audio-v0".encode()).hexdigest()
|
||||
job = queue.enqueue(conn, job_type="transcribe", target_id=doc_id, input_hash=h,
|
||||
parent_doc_id=doc_id, priority=100)
|
||||
return (1, 1 if job is not None else 0)
|
||||
|
||||
|
||||
def ingest_rss(conn: sqlite3.Connection, source: sqlite3.Row, *, since=None, until=None, limit=20):
|
||||
if not source["rss_url"]:
|
||||
raise ValueError(f"{source['source_id']} has no rss_url")
|
||||
recs = episode_records(fetch_feed(source["rss_url"]))
|
||||
n_docs = n_jobs = count = 0
|
||||
for r in recs:
|
||||
d = r["published"]
|
||||
if since and d and d < since:
|
||||
continue
|
||||
if until and d and d > until:
|
||||
continue
|
||||
if count >= limit:
|
||||
break
|
||||
count += 1
|
||||
nd, nj = _enqueue_doc(conn, source_id=source["source_id"], kind="podcast",
|
||||
external_id=r["guid"], url=r["audio_url"], title=r["title"], date=d)
|
||||
n_docs += nd
|
||||
n_jobs += nj
|
||||
return n_docs, n_jobs
|
||||
|
||||
|
||||
def ingest_youtube(conn: sqlite3.Connection, source: sqlite3.Row, *, since=None, until=None,
|
||||
limit=20, max_scan=800):
|
||||
"""Enumerate channel videos in the date window via yt-dlp (NON-flat, so upload_date is populated —
|
||||
flat mode returns NA). Videos come newest-first, so we use --dateafter/--datebefore to select the
|
||||
window and --break-match-filters to STOP scanning once we drop below `since` (avoids walking the
|
||||
entire channel history). The transcribe worker downloads audio on demand."""
|
||||
if not source["channel_url"]:
|
||||
raise ValueError(f"{source['source_id']} has no channel_url")
|
||||
url = source["channel_url"].rstrip("/")
|
||||
if "/playlist" not in url and not url.endswith("/videos"):
|
||||
url = url + "/videos"
|
||||
cmd = ["yt-dlp", "--no-warnings", "--ignore-errors", "--skip-download",
|
||||
"--print", "%(id)s\t%(upload_date)s\t%(title)s", "--playlist-end", str(max_scan)]
|
||||
if since:
|
||||
s = since.replace("-", "")
|
||||
cmd += ["--dateafter", s, "--break-match-filters", f"upload_date>={s}"]
|
||||
if until:
|
||||
cmd += ["--datebefore", until.replace("-", "")]
|
||||
cmd.append(url)
|
||||
out = subprocess.run(cmd, capture_output=True, text=True, timeout=900)
|
||||
n_docs = n_jobs = count = 0
|
||||
for line in out.stdout.splitlines():
|
||||
parts = line.split("\t")
|
||||
if len(parts) < 2 or not parts[0] or parts[1] in ("NA", ""):
|
||||
continue
|
||||
vid, upd = parts[0], parts[1]
|
||||
title = parts[2] if len(parts) > 2 else vid
|
||||
date = f"{upd[:4]}-{upd[4:6]}-{upd[6:8]}" if len(upd) == 8 else None
|
||||
if count >= limit:
|
||||
break
|
||||
count += 1
|
||||
nd, nj = _enqueue_doc(conn, source_id=source["source_id"], kind="youtube",
|
||||
external_id=vid, url=f"https://www.youtube.com/watch?v={vid}",
|
||||
title=title, date=date)
|
||||
n_docs += nd
|
||||
n_jobs += nj
|
||||
return n_docs, n_jobs
|
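The per-line parsing in `ingest_youtube` can be checked without invoking yt-dlp. A minimal sketch, assuming the `--print "%(id)s\t%(upload_date)s\t%(title)s"` layout used above; `parse_ytdlp_line` is a hypothetical helper, and splitting with `maxsplit=2` is a small deviation from the loop above so that a title containing a tab stays whole.

```python
def parse_ytdlp_line(line: str):
    """Return (video_id, iso_date, title), or None when the row has no usable upload date."""
    parts = line.split("\t", 2)  # at most 3 fields: id, upload_date, title
    if len(parts) < 2 or not parts[0] or parts[1] in ("NA", ""):
        return None
    vid, upd = parts[0], parts[1]
    title = parts[2] if len(parts) > 2 else vid
    # yt-dlp reports upload_date as YYYYMMDD; anything else is left undated.
    date = f"{upd[:4]}-{upd[4:6]}-{upd[6:8]}" if len(upd) == 8 else None
    return vid, date, title


row = parse_ytdlp_line("abc123\t20240115\tSome episode title")
```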
||||
@@ -0,0 +1,60 @@
|
||||
"""Cross-chunk speaker stitching + the voiceprint library (§4.1, §4.5).
|
||||
|
||||
diarize-chunk returns a 192-d TitaNet voiceprint per speaker per chunk. Because each chunk is
|
||||
diarized independently, "Speaker 1" in chunk 3 is not the same label as "Speaker 1" in chunk 7 —
|
||||
we re-cluster by cosine distance (threshold ~0.7) so one person gets one identity across
|
||||
the whole episode. The SAME library then matches a guest ACROSS shows by voice (the independence
|
||||
graph's hardest edge, §4.5).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import numpy as np
|
||||
|
||||
DISTANCE_THRESHOLD = 0.7 # cosine DISTANCE (1 - cosine similarity); §4.1
|
||||
|
||||
|
||||
def _unit(v: np.ndarray) -> np.ndarray:
|
||||
n = np.linalg.norm(v)
|
||||
return v / n if n else v
|
||||
|
||||
|
||||
def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
|
||||
return float(1.0 - np.dot(_unit(np.asarray(a, dtype=float)), _unit(np.asarray(b, dtype=float))))
|
||||
|
||||
|
||||
def stitch_chunks(chunk_voiceprints: list[np.ndarray], *, threshold: float = DISTANCE_THRESHOLD) -> list[int]:
|
||||
"""Greedy online clustering of per-(chunk,speaker) voiceprints into stable speaker ids.
|
||||
|
||||
Input: a flat list of voiceprint vectors (one per chunk-speaker, in encounter order).
|
||||
Output: a parallel list of cluster ids. A vector joins the nearest existing cluster if its
|
||||
distance to that cluster's centroid < threshold, else it starts a new cluster.
|
||||
"""
|
||||
centroids: list[np.ndarray] = []
|
||||
counts: list[int] = []
|
||||
labels: list[int] = []
|
||||
for vp in chunk_voiceprints:
|
||||
vp = np.asarray(vp, dtype=float)
|
||||
if centroids:
|
||||
dists = [cosine_distance(vp, c) for c in centroids]
|
||||
j = int(np.argmin(dists))
|
||||
if dists[j] < threshold:
|
||||
centroids[j] = (centroids[j] * counts[j] + vp) / (counts[j] + 1)
|
||||
counts[j] += 1
|
||||
labels.append(j)
|
||||
continue
|
||||
centroids.append(vp.copy())
|
||||
counts.append(1)
|
||||
labels.append(len(centroids) - 1)
|
||||
return labels
|
||||
|
||||
|
||||
def match_library(vp: np.ndarray, library: list[tuple[str, np.ndarray]], *,
|
||||
threshold: float = DISTANCE_THRESHOLD) -> str | None:
|
||||
"""Return the voiceprint_id of the closest library entry within threshold, else None
|
||||
(a new speaker → caller mints a new library id)."""
|
||||
best_id, best_d = None, threshold
|
||||
for vid, lib_vec in library:
|
||||
d = cosine_distance(vp, lib_vec)
|
||||
if d < best_d:
|
||||
best_id, best_d = vid, d
|
||||
return best_id
|
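A small worked example of the greedy online clustering in `stitch_chunks`. This is a sketch that restates the same logic on toy 2-d vectors (real voiceprints are 192-d); the sample vectors are made up for illustration.

```python
import numpy as np


def _unit(v):
    n = np.linalg.norm(v)
    return v / n if n else v


def cosine_distance(a, b):
    return float(1.0 - np.dot(_unit(np.asarray(a, dtype=float)), _unit(np.asarray(b, dtype=float))))


def stitch(voiceprints, threshold=0.7):
    """Join the nearest centroid if it is closer than `threshold`, else open a new cluster."""
    centroids, counts, labels = [], [], []
    for vp in voiceprints:
        vp = np.asarray(vp, dtype=float)
        if centroids:
            dists = [cosine_distance(vp, c) for c in centroids]
            j = int(np.argmin(dists))
            if dists[j] < threshold:
                # Running mean keeps the centroid in step with every member seen so far.
                centroids[j] = (centroids[j] * counts[j] + vp) / (counts[j] + 1)
                counts[j] += 1
                labels.append(j)
                continue
        centroids.append(vp.copy())
        counts.append(1)
        labels.append(len(centroids) - 1)
    return labels


# Two "voices" seen in alternating chunks (toy data).
prints = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels = stitch(prints)  # [0, 1, 0, 1]
```

Because the clustering is greedy and order-dependent, the first vector seen for a speaker seeds that speaker's centroid; a noisy first chunk can therefore shift later assignments.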
||||
@@ -0,0 +1,308 @@
|
||||
"""Audio → speaker-attributed transcript + voiceprint library (§4.1, §4.5).
|
||||
|
||||
Per chunk (sequential — audio lock): diarize-chunk (192-d TitaNet fingerprints + timed speaker
|
||||
segments) + transcribe (word timestamps). Align words to speakers by time, stitch speakers ACROSS
|
||||
chunks by fingerprint cosine, then match the persisted voiceprint library so the SAME guest is
|
||||
recognized ACROSS shows by voice — the highest-leverage input to the source-independence graph.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import time
|
||||
import uuid
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
|
||||
from ..backfill import queue
|
||||
from .chunker import chunk_audio
|
||||
from .download import download_enclosure, download_youtube_audio, to_wav_16k_mono
|
||||
from .speaker_stitch import DISTANCE_THRESHOLD, match_library, stitch_chunks
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# ---------- alignment ----------
|
||||
def _speaker_at(segments: list[dict], t: float) -> str:
|
||||
for s in segments:
|
||||
if s["start_s"] <= t <= s["end_s"]:
|
||||
return s["speaker"]
|
||||
if not segments:
|
||||
return "Speaker_0"
|
||||
return min(segments, key=lambda s: min(abs(s["start_s"] - t), abs(s["end_s"] - t)))["speaker"]
|
||||
|
||||
|
||||
def align_words(words: list[dict], segments: list[dict]) -> list[dict]:
|
||||
"""Group word-level transcription into speaker turns using the diarization segments."""
|
||||
turns: list[dict] = []
|
||||
cur: dict | None = None
|
||||
for w in words:
|
||||
mid = (w["start"] + w["end"]) / 2
|
||||
spk = _speaker_at(segments, mid)
|
||||
if cur and cur["speaker"] == spk:
|
||||
cur["text"] += " " + w["text"]
|
||||
cur["end"] = w["end"]
|
||||
else:
|
||||
if cur:
|
||||
turns.append(cur)
|
||||
cur = {"speaker": spk, "start": w["start"], "end": w["end"], "text": w["text"]}
|
||||
if cur:
|
||||
turns.append(cur)
|
||||
return turns
|
||||
|
||||
|
||||
# ---------- per-document audio processing ----------
|
||||
def diarize_transcribe_chunks(sc, chunks: list[Path], *, concurrency: int = 2):
|
||||
"""Returns (chunk_turns, chunk_speakers): turns per chunk + (chunk_idx, local_spk, fingerprint).
|
||||
|
||||
Drives up to `concurrency` chunks in flight — the client's global audio SEMAPHORE is the hard cap
|
||||
across both parakeet endpoints (sit at 2: keeps the single serial GPU continuously fed = full
|
||||
throughput, no idle gap). A single chunk's failure is non-fatal (skip; the client already busy-
|
||||
retries transient blips), but if roughly half or more of the chunks fail (never fewer than 3) the whole job raises so it retries later
|
||||
(rather than emitting a half-empty transcript). Results are reassembled in chunk order."""
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
|
||||
def _one(idx: int, ch: Path):
|
||||
dia = sc.diarize_chunk(str(ch))
|
||||
tr = sc.transcribe(str(ch))
|
||||
turns = align_words(tr.get("words", []), dia.get("segments", []))
|
||||
spks = [(idx, spk, np.asarray(vec, dtype=np.float32))
|
||||
for spk, vec in (dia.get("fingerprints") or {}).items()]
|
||||
return idx, turns, spks
|
||||
|
||||
results: dict[int, tuple] = {}
|
||||
failed = 0
|
||||
with ThreadPoolExecutor(max_workers=max(1, concurrency)) as pool:
|
||||
futs = {pool.submit(_one, i, ch): i for i, ch in enumerate(chunks)}
|
||||
for fut in as_completed(futs):
|
||||
try:
|
||||
idx, turns, spks = fut.result()
|
||||
results[idx] = (turns, spks)
|
||||
except Exception as e: # noqa: BLE001 — one contended chunk shouldn't kill the episode
|
||||
failed += 1
|
||||
log.warning("chunk %d/%d failed (%s) — skipping", futs[fut], len(chunks), str(e)[:90])
|
||||
if chunks and failed >= max(3, len(chunks) // 2):
|
||||
raise RuntimeError(f"{failed}/{len(chunks)} chunks failed — backend contended; will retry later")
|
||||
chunk_turns = [(idx, results[idx][0]) for idx in sorted(results)]
|
||||
chunk_speakers = [s for idx in sorted(results) for s in results[idx][1]]
|
||||
return chunk_turns, chunk_speakers
|
||||
|
||||
|
||||
def stitch_and_centroids(chunk_speakers, *, threshold: float = DISTANCE_THRESHOLD):
|
||||
"""Cluster all (chunk,speaker) fingerprints into within-episode global speakers."""
|
||||
if not chunk_speakers:
|
||||
return {}, {}
|
||||
vecs = [v for (_, _, v) in chunk_speakers]
|
||||
labels = stitch_chunks(vecs, threshold=threshold)
|
||||
keymap: dict[tuple[int, str], int] = {}
|
||||
groups: dict[int, list[np.ndarray]] = {}
|
||||
for (idx, spk, vec), lab in zip(chunk_speakers, labels):
|
||||
keymap[(idx, spk)] = lab
|
||||
groups.setdefault(lab, []).append(vec)
|
||||
centroids = {lab: np.mean(v, axis=0) for lab, v in groups.items()}
|
||||
return keymap, centroids
|
||||
|
||||
|
||||
def _load_library(conn) -> list[tuple[str, np.ndarray]]:
|
||||
rows = conn.execute("SELECT voiceprint_id, vector FROM voiceprints").fetchall()
|
||||
return [(r["voiceprint_id"], np.frombuffer(r["vector"], dtype=np.float32)) for r in rows]
|
||||
|
||||
|
||||
def _label_for(conn, vpid: str) -> str:
|
||||
r = conn.execute("SELECT person_label FROM voiceprints WHERE voiceprint_id=?", (vpid,)).fetchone()
|
||||
return (r["person_label"] if r and r["person_label"] else f"SPK:{vpid[:8]}")
|
||||
|
||||
|
||||
def resolve_voiceprints(conn, doc, centroids: dict[int, np.ndarray], *, threshold: float = DISTANCE_THRESHOLD):
|
||||
"""Match each within-episode speaker to the persisted library (cross-show identity) or mint a new
|
||||
one; record observations; add shared_guest edges when the voice also appears in ANOTHER source."""
|
||||
library = _load_library(conn)
|
||||
cluster_to_vpid: dict[int, str] = {}
|
||||
for lab, cen in centroids.items():
|
||||
vpid = match_library(cen, library, threshold=threshold)
|
||||
if vpid is None:
|
||||
vpid = "vp_" + uuid.uuid4().hex[:16]
|
||||
conn.execute(
|
||||
"INSERT INTO voiceprints (voiceprint_id, vector, first_doc_id) VALUES (?,?,?)",
|
||||
(vpid, cen.astype(np.float32).tobytes(), doc["doc_id"]),
|
||||
)
|
||||
library.append((vpid, cen))
|
||||
conn.execute(
|
||||
"INSERT INTO voiceprint_observations (voiceprint_id, doc_id, chunk_idx) VALUES (?,?,?)",
|
||||
(vpid, doc["doc_id"], None),
|
||||
)
|
||||
cluster_to_vpid[lab] = vpid
|
||||
conn.commit()
|
||||
# independence graph (§4.5): if this voice appears in a DIFFERENT source, that's a shared guest.
|
||||
for vpid in set(cluster_to_vpid.values()):
|
||||
others = conn.execute(
|
||||
"""SELECT DISTINCT d.source_id FROM voiceprint_observations o
|
||||
JOIN documents d ON d.doc_id = o.doc_id
|
||||
WHERE o.voiceprint_id=? AND d.source_id != ?""",
|
||||
(vpid, doc["source_id"]),
|
||||
).fetchall()
|
||||
for o in others:
|
||||
a, b = sorted([doc["source_id"], o["source_id"]])
|
||||
conn.execute(
|
||||
"""INSERT INTO source_edges (src_a, src_b, edge_type, weight, evidence)
|
||||
VALUES (?,?,'shared_guest',1.0,?)
|
||||
ON CONFLICT(src_a, src_b, edge_type)
|
||||
DO UPDATE SET weight = weight + 1.0, evidence = excluded.evidence""",
|
||||
(a, b, vpid),
|
||||
)
|
||||
conn.commit()
|
||||
return cluster_to_vpid
|
||||
|
||||
|
||||
def _labeled(chunk_turns, keymap, label_by_cluster: dict) -> str:
|
||||
lines: list[str] = []
|
||||
for idx, turns in chunk_turns:
|
||||
for t in turns:
|
||||
lab = keymap.get((idx, t["speaker"]))
|
||||
label = label_by_cluster.get(lab, t["speaker"])
|
||||
lines.append(f"{label}: {t['text']}")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def build_transcript(conn, chunk_turns, keymap, cluster_to_vpid) -> str:
|
||||
labels = {lab: _label_for(conn, vpid) for lab, vpid in cluster_to_vpid.items()}
|
||||
return _labeled(chunk_turns, keymap, labels)
|
||||
|
||||
|
||||
def apply_names(conn, cluster_to_vpid: dict, idmap: dict) -> dict:
|
||||
"""Attach confident names to the voiceprint library (person_label). Returns {cluster: name}."""
|
||||
named: dict[int, str] = {}
|
||||
for lab, vpid in cluster_to_vpid.items():
|
||||
info = idmap.get(f"Speaker {lab + 1}") or idmap.get(str(lab + 1)) or {}
|
||||
name = (info.get("name") or "").strip() if isinstance(info, dict) else ""
|
||||
if name and info.get("confidence") in ("med", "high"):
|
||||
conn.execute("UPDATE voiceprints SET person_label=? WHERE voiceprint_id=?", (name, vpid))
|
||||
named[lab] = name
|
||||
conn.commit()
|
||||
return named
|
||||
|
||||
|
||||
def add_name_edges(conn, doc, cluster_to_vpid: dict) -> int:
|
||||
"""Name-based shared_guest edges: same person_label seen in a DIFFERENT source → independence edge,
|
||||
even if the voiceprints didn't cluster (drift-robust complement to voiceprint matching, §4.5)."""
|
||||
n = 0
|
||||
for vpid in set(cluster_to_vpid.values()):
|
||||
r = conn.execute("SELECT person_label FROM voiceprints WHERE voiceprint_id=?", (vpid,)).fetchone()
|
||||
name = r["person_label"] if r else None
|
||||
if not name:
|
||||
continue
|
||||
others = conn.execute(
|
||||
"""SELECT DISTINCT d.source_id FROM voiceprints v
|
||||
JOIN voiceprint_observations o ON o.voiceprint_id = v.voiceprint_id
|
||||
JOIN documents d ON d.doc_id = o.doc_id
|
||||
WHERE v.person_label = ? AND d.source_id != ?""",
|
||||
(name, doc["source_id"]),
|
||||
).fetchall()
|
||||
for o in others:
|
||||
a, b = sorted([doc["source_id"], o["source_id"]])
|
||||
conn.execute(
|
||||
"""INSERT INTO source_edges (src_a, src_b, edge_type, weight, evidence)
|
||||
VALUES (?,?,'shared_guest',1.0,?)
|
||||
ON CONFLICT(src_a, src_b, edge_type)
|
||||
DO UPDATE SET weight = weight + 1.0, evidence = excluded.evidence""",
|
||||
(a, b, f"name:{name}"),
|
||||
)
|
||||
n += 1
|
||||
conn.commit()
|
||||
return n
|
||||
|
||||
|
||||
def _download_audio(doc, cfg) -> Path:
|
||||
cache = Path(cfg.audio_cache_dir)
|
||||
cache.mkdir(parents=True, exist_ok=True)
|
||||
wav = cache / f"{doc['doc_id'].replace(':', '_')}.wav"
|
||||
if wav.exists():
|
||||
return wav
|
||||
url = doc["url"]
|
||||
if doc["kind"] == "youtube" or (url and ("youtube.com" in url or "youtu.be" in url)):
|
||||
return download_youtube_audio(url, cache, archive_file=cache / "yt-archive.txt")
|
||||
raw = download_enclosure(url, cache / f"{doc['doc_id'].replace(':', '_')}.src")
|
||||
return to_wav_16k_mono(raw, wav)
|
||||
|
||||
|
||||
def process_document(conn, sc, cfg, doc, *, max_chunks: int, chunk_seconds: int = 150,
|
||||
keep_audio: bool = False) -> int:
|
||||
audio = _download_audio(doc, cfg)
|
||||
chunkdir = Path(cfg.audio_cache_dir) / f"chunks_{doc['doc_id'].replace(':', '_')}"
|
||||
chunks = chunk_audio(audio, chunkdir, chunk_seconds=chunk_seconds)[:max_chunks]
|
||||
chunk_turns, chunk_speakers = diarize_transcribe_chunks(
|
||||
sc, chunks, concurrency=getattr(cfg, "audio_concurrency", 2))
|
||||
keymap, centroids = stitch_and_centroids(chunk_speakers)
|
||||
cluster_to_vpid = resolve_voiceprints(conn, doc, centroids)
|
||||
|
||||
# Name the speakers (§4.5): host introduces guest in 1-on-1 → attach person_label, then a
|
||||
# name-based shared_guest edge that survives voiceprint drift across shows.
|
||||
src = conn.execute("SELECT name FROM sources WHERE source_id=?", (doc["source_id"],)).fetchone()
|
||||
try:
|
||||
from ..extract.backends import from_config as backend_from_config
|
||||
from .identify import identify_speakers
|
||||
backend = backend_from_config(cfg, sc)
|
||||
draft = _labeled(chunk_turns, keymap, {lab: f"Speaker {lab + 1}" for lab in cluster_to_vpid})
|
||||
idmap = identify_speakers(backend, draft[:6000], source_name=src["name"] if src else "")
|
||||
named = apply_names(conn, cluster_to_vpid, idmap)
|
||||
if named:
|
||||
log.info("named speakers in %s: %s", doc["doc_id"], ", ".join(named.values()))
|
||||
except Exception as e: # noqa: BLE001 — naming is best-effort enrichment
|
||||
log.warning("speaker identification failed for %s: %s", doc["doc_id"], e)
|
||||
add_name_edges(conn, doc, cluster_to_vpid)
|
||||
|
||||
transcript = build_transcript(conn, chunk_turns, keymap, cluster_to_vpid)
|
||||
tpath = Path(cfg.data_dir) / "transcripts" / f"{doc['doc_id'].replace(':', '_')}.txt"
|
||||
tpath.parent.mkdir(parents=True, exist_ok=True)
|
||||
tpath.write_text(transcript, encoding="utf-8")  # same UTF-8 bytes that content_hash is computed over
|
||||
import hashlib
|
||||
content_hash = hashlib.sha256(transcript.encode()).hexdigest()
|
||||
conn.execute(
|
||||
"UPDATE documents SET transcript_path=?, duration_sec=?, content_hash=?, processed_at=datetime('now') WHERE doc_id=?",
|
||||
(str(tpath), len(chunks) * chunk_seconds, content_hash, doc["doc_id"]),
|
||||
)
|
||||
conn.commit()
|
||||
h = hashlib.sha256(f"{doc['doc_id']}|extract-v0".encode()).hexdigest()
|
||||
queue.enqueue(conn, job_type="extract", target_id=doc["doc_id"], input_hash=h,
|
||||
parent_doc_id=doc["doc_id"], priority=100)
|
||||
if not keep_audio:
|
||||
_cleanup_audio(audio, chunkdir)
|
||||
return len(chunk_turns)
|
||||
|
||||
|
||||
def _cleanup_audio(audio: Path, chunkdir: Path) -> None:
|
||||
"""Audio files are large and disposable once transcribed — reclaim the disk (the transcript +
|
||||
voiceprints are what we keep). Backfilling hundreds of 1-3 hr episodes would otherwise be tens of GB."""
|
||||
import shutil
|
||||
try:
|
||||
if audio.exists():
|
||||
audio.unlink()
|
||||
src = audio.with_suffix(".src")
|
||||
if src.exists():
|
||||
src.unlink()
|
||||
if chunkdir.exists():
|
||||
shutil.rmtree(chunkdir, ignore_errors=True)
|
||||
except Exception as e: # noqa: BLE001
|
||||
log.warning("audio cleanup failed for %s: %s", audio, e)
|
||||
|
||||
|
||||
def run_transcribe(conn, sc, cfg, *, limit: int = 5, max_chunks: int = 999,
|
||||
lease_seconds: int = 3600, worker_id: str = "transcribe-1") -> dict:
|
||||
processed = 0
|
||||
while processed < limit:
|
||||
job = queue.lease_next(conn, worker_id=worker_id, job_types=["transcribe"], lease_seconds=lease_seconds)
|
||||
if job is None:
|
||||
break
|
||||
processed += 1
|
||||
doc = conn.execute("SELECT * FROM documents WHERE doc_id=?", (job["target_id"],)).fetchone()
|
||||
if doc is None:
|
||||
queue.skip(conn, job["job_id"], "document missing")
|
||||
continue
|
||||
try:
|
||||
n = process_document(conn, sc, cfg, doc, max_chunks=max_chunks)
|
||||
queue.complete(conn, job["job_id"], output_ref=f"{n} chunks")
|
||||
log.info("transcribed %s (%d chunks)", doc["doc_id"], n)
|
||||
except Exception as e: # noqa: BLE001
|
||||
state = queue.fail(conn, job["job_id"], e)
|
||||
log.warning("transcribe failed for %s: %s (→ %s)", job["target_id"], e, state)
|
||||
return {"jobs_processed": processed}
|
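The word-to-speaker alignment in `align_words` is easiest to see on a tiny example. A minimal sketch that restates the midpoint rule from `_speaker_at`: a word belongs to the segment containing its midpoint, or to the segment with the nearest boundary when it falls in a gap. The segments and words below are invented for illustration.

```python
def speaker_at(segments, t):
    for s in segments:
        if s["start_s"] <= t <= s["end_s"]:
            return s["speaker"]
    if not segments:
        return "Speaker_0"
    # Midpoint fell in a gap: pick the segment whose nearest edge is closest.
    return min(segments, key=lambda s: min(abs(s["start_s"] - t), abs(s["end_s"] - t)))["speaker"]


def align(words, segments):
    turns = []
    for w in words:
        spk = speaker_at(segments, (w["start"] + w["end"]) / 2)
        if turns and turns[-1]["speaker"] == spk:
            turns[-1]["text"] += " " + w["text"]
            turns[-1]["end"] = w["end"]
        else:
            turns.append({"speaker": spk, "start": w["start"], "end": w["end"], "text": w["text"]})
    return turns


segments = [
    {"speaker": "A", "start_s": 0.0, "end_s": 2.0},
    {"speaker": "B", "start_s": 2.5, "end_s": 4.0},
]
words = [
    {"text": "hello", "start": 0.2, "end": 0.6},
    {"text": "there", "start": 0.7, "end": 1.1},
    {"text": "hi", "start": 2.1, "end": 2.3},      # in the gap, nearer to A's end
    {"text": "thanks", "start": 2.6, "end": 3.0},
]
turns = align(words, segments)
```

Here the gap word "hi" is attributed to speaker A, so the result is two turns: A says "hello there hi", then B says "thanks".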
||||
@@ -0,0 +1,6 @@
|
||||
"""The scoring brain (build blueprint).
|
||||
|
||||
Stats/geometry NOMINATE candidates; the frontier model only judges/expands a pre-filtered shortlist
|
||||
(§5.1). Every count that feeds a score routes through the independence primitive (EISC), never a raw
|
||||
source count (§4.5). Every scorer reads `visible_claims` (as-of filtered), never `claims` directly.
|
||||
"""
|
||||
@@ -0,0 +1,43 @@
|
||||
"""As-of harness (§6.6 look-ahead guard).
|
||||
|
||||
Every scorer reads the `visible_claims` TEMP VIEW, never `claims` directly: at nomination time only
|
||||
claims dated <= as_of are visible, so the backtest can't reward noticing what already happened. The
|
||||
view also resolves merged canonical topics (topics.status='merged') to a stable `topic_id`.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import sqlite3
|
||||
|
||||
|
||||
class Scorer:
|
||||
"""Context manager that binds a run to an as_of date and exposes `visible_claims`.
|
||||
|
||||
mode='backtest' enforces strict as-of discipline; 'forward' is the live pilot. as_of is a
|
||||
caller-controlled ISO date (YYYY-MM-DD), inlined unescaped into the view DDL (views can't take params), so never pass untrusted input."""
|
||||
|
||||
def __init__(self, conn: sqlite3.Connection, as_of: str, *, mode: str = "backtest") -> None:
|
||||
self.conn = conn
|
||||
self.as_of = as_of
|
||||
self.mode = mode
|
||||
|
||||
def __enter__(self) -> "Scorer":
|
||||
self.conn.executescript(
|
||||
f"""
|
||||
DROP VIEW IF EXISTS visible_claims;
|
||||
CREATE TEMP VIEW visible_claims AS
|
||||
SELECT c.*,
|
||||
COALESCE((SELECT t.merged_into FROM topics t
|
||||
WHERE t.topic_canonical = c.topic_canonical AND t.status='merged'),
|
||||
c.topic_canonical) AS topic_id
|
||||
FROM claims c
|
||||
JOIN documents d ON d.doc_id = c.doc_id
|
||||
WHERE c.date IS NOT NULL AND c.date <= '{self.as_of}';
|
||||
"""
|
||||
)
|
||||
return self
|
||||
|
||||
def __exit__(self, *exc) -> None:
|
||||
self.conn.execute("DROP VIEW IF EXISTS visible_claims")
|
||||
|
||||
def count_visible(self) -> int:
|
||||
return self.conn.execute("SELECT COUNT(*) FROM visible_claims").fetchone()[0]
|
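Because `as_of` is inlined into the view DDL, one way to make the "controlled ISO date" assumption explicit is to normalize it before use. This is a sketch only, not the module's behavior: `visible_claims_view` and the two-column `claims` table are hypothetical simplifications of the schema above.

```python
import sqlite3
from datetime import date


def visible_claims_view(conn: sqlite3.Connection, as_of: str) -> None:
    """(Re)create a TEMP VIEW exposing only claims dated on or before `as_of`."""
    # fromisoformat raises ValueError on non-dates; isoformat() re-emits strict YYYY-MM-DD.
    as_of = date.fromisoformat(as_of).isoformat()
    conn.executescript(
        f"""
        DROP VIEW IF EXISTS visible_claims;
        CREATE TEMP VIEW visible_claims AS
        SELECT * FROM claims WHERE date IS NOT NULL AND date <= '{as_of}';
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER PRIMARY KEY, date TEXT)")
conn.executemany("INSERT INTO claims (date) VALUES (?)",
                 [("2023-01-10",), ("2023-06-01",), (None,)])
visible_claims_view(conn, "2023-03-31")
n_visible = conn.execute("SELECT COUNT(*) FROM visible_claims").fetchone()[0]

# A malformed as_of is rejected before it can reach the DDL.
try:
    visible_claims_view(conn, "2023-03-31' OR '1'='1")
    rejected = False
except ValueError:
    rejected = True
```

Note that `executescript` commits any open transaction first, so a run should not enter the harness with uncommitted writes it expects to roll back.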
||||
@@ -0,0 +1,49 @@
|
||||
"""The quantitative bar (§5.1, §6.6) — the single gate between nomination and the frontier judge.
|
||||
|
||||
Two tiers:
|
||||
- evidence bar → clears hard gates → WRITE A LEDGER ROW (the denominator, §6.6), even if never judged.
|
||||
- promotion bar → also clears the score threshold → goes to the frontier judge.
|
||||
|
||||
THE GLOBAL META-RULE (applied to every scorer): no candidate clears on a single source or single
|
||||
cluster — EISC_adj >= 2.0 AND K_eff >= 2. This is the §2.1 anti-lonely-outlier law, enforced once.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
EISC_FLOOR = 2.0
|
||||
KEFF_FLOOR = 2
|
||||
|
||||
# Defaults; overridable via the score_thresholds table (so the backtest can sweep without code edits).
|
||||
DEFAULT_MIN_SCORE = {"under_acted": 0.3, "emergence": 2.0, "contrarian": 1.5,
|
||||
"convergence": 2.5, "intersection": 2.0}
|
||||
|
||||
|
||||
def _min_score(conn, scorer: str) -> float:
|
||||
if conn is not None:
|
||||
row = conn.execute("SELECT min_score FROM score_thresholds WHERE scorer=?", (scorer,)).fetchone()
|
||||
if row and row[0] is not None:
|
||||
return float(row[0])
|
||||
return DEFAULT_MIN_SCORE.get(scorer, 0.0)
|
||||
|
||||
|
||||
def evaluate(scorer: str, result: dict, *, conn=None) -> tuple[bool, bool]:
|
||||
"""Returns (cleared_evidence_bar, cleared_promotion_bar)."""
|
||||
if scorer == "under_acted":
|
||||
return _under_acted(result, _min_score(conn, scorer))
|
||||
return (False, False)  # Job A scorers are not wired yet (they land with the forward pilot)
|
||||
|
||||
|
||||
def _under_acted(result: dict, min_score: float) -> tuple[bool, bool]:
|
||||
i = result["inputs"]
|
||||
breaker = bool(i.get("is_breaker"))
|
||||
# §4.4 Job B = "rising INDEPENDENT corroboration". EISC>=2.0 enforces independence (shared-guest +
|
||||
# same-cluster discounting), so this is NOT an isolated point or one-guest echo (§2.1). Cross-cluster
|
||||
# (k_eff>=2) is the §4.5 GOLD for Job A DISCOVERY — NOT a hard gate for Job B corroboration: N
|
||||
# independent energy companies confirming a power thesis is real corroboration. Cross-cluster still
|
||||
# BOOSTS the score (eisc_corrob = eisc_adj includes the xcluster_mult) so cross-cluster ranks first.
|
||||
corroborated = (i.get("n_confirmed", 0) >= 4 and i.get("n_src", 0) >= 2
|
||||
and i.get("eisc_corrob", 0.0) >= EISC_FLOOR and i.get("a_corrob", 0.0) > 0)
|
||||
conv_ok = breaker or i.get("conviction_weight", 0.0) >= 0.7 # med-high / high
|
||||
expo_ok = breaker or i.get("exposure") in ("none", "lt2") # genuine exposure gap
|
||||
evidence = corroborated and conv_ok and expo_ok
|
||||
promotion = evidence and result["score"] >= min_score
|
||||
return evidence, promotion
|
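The two tiers are easiest to tell apart on concrete inputs. A minimal sketch that restates the `_under_acted` gates as a standalone function; `under_acted_bar` is a hypothetical name and the input values are invented, with the 0.3 default taken from `DEFAULT_MIN_SCORE["under_acted"]`.

```python
EISC_FLOOR = 2.0


def under_acted_bar(inputs: dict, score: float, min_score: float = 0.3):
    """Return (cleared_evidence_bar, cleared_promotion_bar) for one candidate."""
    breaker = bool(inputs.get("is_breaker"))
    corroborated = (inputs.get("n_confirmed", 0) >= 4 and inputs.get("n_src", 0) >= 2
                    and inputs.get("eisc_corrob", 0.0) >= EISC_FLOOR
                    and inputs.get("a_corrob", 0.0) > 0)
    conv_ok = breaker or inputs.get("conviction_weight", 0.0) >= 0.7
    expo_ok = breaker or inputs.get("exposure") in ("none", "lt2")
    evidence = corroborated and conv_ok and expo_ok
    return evidence, evidence and score >= min_score


strong = {"n_confirmed": 5, "n_src": 3, "eisc_corrob": 2.4, "a_corrob": 0.2,
          "conviction_weight": 0.8, "exposure": "none"}

ledger_only = under_acted_bar(strong, score=0.2)               # evidence yes, promotion no
promoted = under_acted_bar(strong, score=0.5)                  # clears both tiers
lonely = under_acted_bar({**strong, "n_src": 1}, score=0.9)    # single source never clears
```

The first case still writes a ledger row (it counts in the denominator) even though it never reaches the frontier judge.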
||||
@@ -0,0 +1,86 @@
|
||||
"""Pre-registered confusion matrix on the §7.1 derivatives (DESIGN_v2 §1.3).
|
||||
|
||||
Measures PRECISION and RECALL, not recall alone. Uses the engine's already-stored candidate_scores
|
||||
(cleared_date + whisper_date) × the pre-registered external repricing (resolution.K2023.yaml). Reports
|
||||
the matrix at BOTH the cleared level (what the engine fired) and the whisper level (what it saw before
|
||||
the independence floor) — the delta is the empirical answer to the gate debate.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from datetime import datetime
|
||||
|
||||
import yaml
|
||||
|
||||
from .external import basket_index, fetch_eod, resolve_reprice, runway_at_signal
|
||||
|
||||
|
||||
def _engine_dates(conn) -> dict[str, dict]:
|
||||
"""For each under_acted node: earliest cleared as_of and earliest whisper as_of (n_conf>=4, a>0)."""
|
||||
rows = conn.execute(
|
||||
"SELECT node_id, conviction_id, as_of, cleared_evidence_bar ev, inputs_json "
|
||||
"FROM candidate_scores WHERE scorer='under_acted'"
|
||||
).fetchall()
|
||||
out: dict[str, dict] = {}
|
||||
for r in rows:
|
||||
k = r["node_id"] or r["conviction_id"]
|
||||
i = json.loads(r["inputs_json"])
|
||||
d = out.setdefault(k, {"cleared": None, "whisper": None})
|
||||
if r["ev"] and (d["cleared"] is None or r["as_of"] < d["cleared"]):
|
||||
d["cleared"] = r["as_of"]
|
||||
if i.get("n_confirmed", 0) >= 4 and i.get("a_corrob", 0) > 0:
|
||||
if d["whisper"] is None or r["as_of"] < d["whisper"]:
|
||||
d["whisper"] = r["as_of"]
|
||||
return out
|
||||
|
||||
|
||||
def _lead_days(repricing_date: str, signal_date: str | None) -> int | None:
|
||||
if not signal_date or not repricing_date:
|
||||
return None
|
||||
return (datetime.strptime(repricing_date, "%Y-%m-%d") - datetime.strptime(signal_date, "%Y-%m-%d")).days
|
||||
|
||||
|
||||
def run_confusion(conn, cfg, spec_path: str) -> dict:
|
||||
with open(spec_path, encoding="utf-8") as fh:  # close the spec file promptly
    spec = yaml.safe_load(fh)
|
||||
w, rule = spec["window"], spec["rule"]
|
||||
engine = _engine_dates(conn)
|
||||
price_cache: dict[str, list] = {}
|
||||
|
||||
rows = []
|
||||
for node, basket in spec["baskets"].items():
|
||||
prices = {}
|
||||
for sym in basket:
|
||||
if sym not in price_cache:
|
||||
price_cache[sym] = fetch_eod(cfg.fmp_api_key, sym, w["start"], w["end"])
|
||||
prices[sym] = price_cache[sym]
|
||||
missing = [s for s in basket if not prices[s]]
|
||||
idx = basket_index(prices)
|
||||
res = resolve_reprice(idx, threshold_pct=rule["threshold_pct"], hold_pct=rule["hold_pct"],
|
||||
hold_days=rule["hold_days"])
|
||||
ed = engine.get(node, {"cleared": None, "whisper": None})
|
||||
rows.append({
|
||||
"node": node, "basket": basket, "missing": missing,
|
||||
"confirmed": res["confirmed"], "repricing_date": res["repricing_date"], "peak_pct": res["peak_pct"],
|
||||
"cleared_date": ed["cleared"], "whisper_date": ed["whisper"],
|
||||
"lead_cleared": _lead_days(res["repricing_date"], ed["cleared"]) if res["confirmed"] else None,
|
||||
"lead_whisper": _lead_days(res["repricing_date"], ed["whisper"]) if res["confirmed"] else None,
|
||||
# DESIGN_v2.1 Correction A: runway = fraction of the durable move still ahead at signal
|
||||
"runway_cleared": runway_at_signal(idx, ed["cleared"]) if res["confirmed"] else None,
|
||||
"runway_whisper": runway_at_signal(idx, ed["whisper"]) if res["confirmed"] else None,
|
||||
})
|
||||
|
||||
def classify(r, level):
|
||||
fired = bool(r[f"{level}_date"])
|
||||
real = r["confirmed"]
|
||||
return "TP" if (fired and real) else "FP" if (fired and not real) else "FN" if real else "TN"
|
||||
|
||||
def matrix(level):
|
||||
c = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
|
||||
for r in rows:
|
||||
c[classify(r, level)] += 1
|
||||
p = c["TP"] / (c["TP"] + c["FP"]) if (c["TP"] + c["FP"]) else None
|
||||
rec = c["TP"] / (c["TP"] + c["FN"]) if (c["TP"] + c["FN"]) else None
|
||||
return c, p, rec
|
||||
|
||||
return {"rows": rows, "cleared": matrix("cleared"), "whisper": matrix("whisper"),
|
||||
"classify": classify}
|
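The precision/recall arithmetic in `matrix` can be checked by hand on a few rows. A minimal sketch using the same TP/FP/FN/TN rule as `classify`; the `(fired, real)` pairs are made-up stand-ins for "the engine has a date at this level" and "the external repricing was confirmed".

```python
def classify(fired: bool, real: bool) -> str:
    return "TP" if (fired and real) else "FP" if (fired and not real) else "FN" if real else "TN"


def matrix(pairs):
    """Return (counts, precision, recall); a ratio is None when its denominator is zero."""
    c = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for fired, real in pairs:
        c[classify(fired, real)] += 1
    precision = c["TP"] / (c["TP"] + c["FP"]) if (c["TP"] + c["FP"]) else None
    recall = c["TP"] / (c["TP"] + c["FN"]) if (c["TP"] + c["FN"]) else None
    return c, precision, recall


# Hypothetical outcomes for five nodes: (engine fired, repricing confirmed).
pairs = [(True, True), (True, False), (False, True), (False, False), (True, True)]
counts, precision, recall = matrix(pairs)
```

With these five rows the counts are TP=2, FP=1, FN=1, TN=1, so precision and recall are both 2/3.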
||||
@@ -0,0 +1,96 @@
"""External-confirmation data for the resolver (DESIGN_v2 §1). Price series via FMP (already paid for).

This is the *resolving* leg (§6.2): real-world repricing, not discourse. Kept deliberately simple and
transparent: the resolution rule is pre-registered, so the code here only fetches + applies it.
"""
from __future__ import annotations

import requests

_FMP = "https://financialmodelingprep.com"


def fetch_eod(api_key: str, symbol: str, start: str, end: str) -> list[tuple[str, float]]:
    """Daily (date, close) for a symbol. Tries the FMP 'stable' then legacy 'v3' price endpoints."""
    s = requests.Session()
    attempts = [
        (f"{_FMP}/stable/historical-price-eod/full", {"symbol": symbol, "from": start, "to": end}),
        (f"{_FMP}/api/v3/historical-price-full/{symbol}", {"from": start, "to": end}),
    ]
    for url, params in attempts:
        try:
            r = s.get(url, params={**params, "apikey": api_key}, timeout=40)
            if r.status_code != 200:
                continue
            j = r.json()
        except Exception:  # noqa: BLE001
            continue
        rows = j.get("historical") if isinstance(j, dict) else j
        if not rows:
            continue
        out = [(x["date"][:10], x.get("close") or x.get("adjClose")) for x in rows
               if x.get("date") and (x.get("close") or x.get("adjClose"))]
        if out:
            return sorted(out)
    return []


def basket_index(prices_by_symbol: dict[str, list[tuple[str, float]]]) -> list[tuple[str, float]]:
    """Equal-weight, each-symbol-normalized-to-its-own-first-close index, averaged over dates where
    data exists. (Symbols that IPO'd mid-window enter at 1.0 when they start; the caller flags them.)"""
    norm = {}
    for sym, series in prices_by_symbol.items():
        if series:
            base = series[0][1]
            norm[sym] = {d: c / base for d, c in series if base}
    dates = sorted({d for n in norm.values() for d in n})
    idx = []
    for d in dates:
        vals = [n[d] for n in norm.values() if d in n]
        if vals:
            idx.append((d, sum(vals) / len(vals)))
    return idx


def index_value_at(index: list[tuple[str, float]], date: str | None) -> float | None:
    """Latest index value on or before `date` (baseline if the signal predates the data)."""
    if not index or not date:
        return None
    vals = [v for d, v in index if d <= date]
    return vals[-1] if vals else index[0][1]


def runway_at_signal(index: list[tuple[str, float]], signal_date: str | None) -> float | None:
    """Fraction of the durable move STILL AHEAD at the signal date (DESIGN_v2.1 Correction A).
    1.0 = whole move ahead (signal before it); 0.0 = signal at the peak. The right metric for a
    long-duration holder: a modestly-late signal with most of the move ahead is still actionable."""
    if not index or not signal_date:
        return None
    base = index[0][1]
    peak = max(v for _, v in index)
    val = index_value_at(index, signal_date)
    if peak <= base or val is None:
        return None
    return round(max(0.0, (peak - val) / (peak - base)), 2)


def resolve_reprice(index: list[tuple[str, float]], *, threshold_pct: float, hold_pct: float,
                    hold_days: int) -> dict:
    """Apply the pre-registered rule: first date the index is ≥ +threshold% vs baseline AND still
    ≥ +hold% `hold_days` later. Returns {confirmed, repricing_date, peak_pct}."""
    from datetime import datetime, timedelta
    if not index:
        return {"confirmed": False, "repricing_date": None, "peak_pct": None}
    base = index[0][1]
    thr = 1.0 + threshold_pct / 100.0
    hold = 1.0 + hold_pct / 100.0
    by_date = dict(index)
    dates = [d for d, _ in index]
    peak = max(v for _, v in index)
    for d, v in index:
        if v / base >= thr:
            target = (datetime.strptime(d, "%Y-%m-%d") + timedelta(days=hold_days)).strftime("%Y-%m-%d")
            later = [vv for dd, vv in index if dd >= target]
            if later and (later[0] / base) >= hold:
                return {"confirmed": True, "repricing_date": d, "peak_pct": round((peak / base - 1) * 100, 1)}
    return {"confirmed": False, "repricing_date": None, "peak_pct": round((peak / base - 1) * 100, 1)}
|
||||
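A minimal sketch of how the index and runway arithmetic above behave on toy data. The symbols `AAA` and `BBB` and every price are made up for illustration, and the steps are re-implemented inline (not imported) because the module path is not shown here.

```python
# Hypothetical daily closes for two made-up symbols; not real market data.
prices = {
    "AAA": [("2023-01-02", 10.0), ("2023-01-03", 11.0), ("2023-01-04", 15.0)],
    "BBB": [("2023-01-03", 50.0), ("2023-01-04", 55.0)],  # starts one day late
}

# Normalize each series to its own first close, then average per date.
norm = {sym: {d: c / s[0][1] for d, c in s} for sym, s in prices.items() if s}
dates = sorted({d for n in norm.values() for d in n})
index = []
for d in dates:
    vals = [n[d] for n in norm.values() if d in n]
    index.append((d, sum(vals) / len(vals)))

# Runway: share of the base-to-peak move still ahead at the signal date.
base = index[0][1]
peak = max(v for _, v in index)
signal_value = [v for d, v in index if d <= "2023-01-03"][-1]
runway = round(max(0.0, (peak - signal_value) / (peak - base)), 2)

print(index)
print(runway)
```

Note how the late entrant joins at 1.0 and pulls the average down on its first day, which is the effect the `basket_index` docstring says the caller should flag.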
@@ -0,0 +1,113 @@
"""Effective Independent Source Count (EISC): the system's differentiator (§4.5).

Discount convergence by source connectedness. Five shows that "independently converge" but share one
guest must count as ~one voice; three shows across macro/energy/ai with no shared guests are gold.

Method (resolved in the design panel): noisy-OR connectedness matrix + inverse-row-sum EISC.
  - symmetric & order-independent (unlike a sequential pairwise-penalty walk)
  - each source's contribution is individually explainable ("counts 0.31 because connected to 3 others")
  - collapses correctly: 5 clones -> ~1.0 ; 5 cross-cluster independents -> ~5.0 (raw)
  - no eigensolve (unstable at n=2..4, our common case)
"""
from __future__ import annotations

from collections import defaultdict

import numpy as np

# Coupling per edge type: a voiceprint-confirmed shared guest is near-total redundancy on a topic.
KAPPA = {"shared_guest": 0.85, "citation": 0.45, "community": 0.60}
# Same-cluster baseline correlation (sources in the same world are partly redundant even w/o an edge).
CLUSTER_COUPLING = {"bitcoin": 0.55, "vc_consensus": 0.35}
SAME_CLUSTER_DEFAULT = 0.25
EDGE_CLAMP = 0.95  # cap kappa*weight so a heavily-weighted edge can't exceed near-total
CAP_VALUE = 0.25  # §4.5: bitcoin / capped sources contribute at most 0.25 of a voice
CLUSTER_MIN_CONTRIB = 0.5  # a cluster must add >= half an independent voice to count toward K_eff


def effective_independent_N(srcs: list[tuple], edges: list[tuple], *, mode: str = "live") -> dict:
    """srcs: [(source_id, source_cluster, cluster_capped_low[, own_network])]; edges: [(a,b,type,weight)].
    mode='live' (default) DROPS own_network sources (Ten31's own orbit, i.e. listening to ourselves, §v2.1);
    mode='test' keeps them (the reflexivity test fixture). Returns {eisc_adj, eisc_raw, k_eff, ...}."""
    if mode == "live":
        srcs = [s for s in srcs if not (len(s) > 3 and s[3])]
    ids = [s[0] for s in srcs]
    n = len(ids)
    if n == 0:
        return {"eisc_adj": 0.0, "eisc_raw": 0.0, "k_eff": 0, "xcluster_mult": 1.0, "per_source_contrib": {}}
    idx = {sid: i for i, sid in enumerate(ids)}
    cluster = {s[0]: s[1] for s in srcs}
    capped = {s[0]: (bool(s[2]) or s[1] == "bitcoin") for s in srcs}

    # edge channel: combine all edges between a pair by noisy-OR product of (1 - kappa*weight)
    pair_factor: dict = defaultdict(lambda: 1.0)
    for a, b, etype, w in edges:
        if a in idx and b in idx and a != b:
            term = min(EDGE_CLAMP, KAPPA.get(etype, 0.0) * (w if w is not None else 1.0))
            pair_factor[frozenset((a, b))] *= (1.0 - term)

    C = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = ids[i], ids[j]
            e = 1.0 - pair_factor[frozenset((a, b))]  # 0 if no edge
            ci, cj = cluster[a], cluster[b]
            clust = (CLUSTER_COUPLING.get(ci, SAME_CLUSTER_DEFAULT)
                     if (ci is not None and ci == cj) else 0.0)
            c = 1.0 - (1.0 - e) * (1.0 - clust)
            C[i, j] = C[j, i] = c

    rowsum = C.sum(axis=1)  # includes the diagonal 1.0
    contrib, eisc_raw = {}, 0.0
    cluster_mass: dict = defaultdict(float)
    for i, sid in enumerate(ids):
        cap = CAP_VALUE if capped[sid] else 1.0
        contrib[sid] = cap * (1.0 / rowsum[i])
        eisc_raw += contrib[sid]
        if not capped[sid] and cluster[sid]:
            cluster_mass[cluster[sid]] += contrib[sid]

    # cross-cluster bonus: count NON-capped clusters that genuinely contribute an independent voice
    # (summed contribution >= half a voice). This stops "one guest across many clusters" from earning
    # the gold multiplier: the raw EISC already collapses that guest to ~1, and k_eff must agree.
    k_eff = sum(1 for m in cluster_mass.values() if m >= CLUSTER_MIN_CONTRIB)
    xmult = max(1.0, 1.0 + 0.5 * (k_eff - 1))  # 1clu->1.0, 2->1.5, 3->2.0 (gold)
    return {
        "eisc_adj": xmult * eisc_raw,
        "eisc_raw": eisc_raw,
        "k_eff": k_eff,
        "xcluster_mult": xmult,
        "per_source_contrib": {k: round(v, 4) for k, v in contrib.items()},
    }


# --- DB helpers (the brain only READS the graph; edges are produced upstream by the voiceprint lib) ---
def load_source_meta(conn, ids: list[str]) -> list[tuple]:
    ids = list(dict.fromkeys(ids))
    if not ids:
        return []
    ph = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT source_id, source_cluster, cluster_capped_low, COALESCE(own_network,0) "
        f"FROM sources WHERE source_id IN ({ph})", ids
    ).fetchall()
    return [(r[0], r[1], r[2], r[3]) for r in rows]


def load_edges(conn, ids: list[str]) -> list[tuple]:
    ids = list(dict.fromkeys(ids))
    if not ids:
        return []
    ph = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT src_a, src_b, edge_type, weight FROM source_edges WHERE src_a IN ({ph}) AND src_b IN ({ph})",
        ids + ids,
    ).fetchall()
    return [(r[0], r[1], r[2], r[3]) for r in rows]


def eisc_for(conn, source_ids: list[str], *, mode: str = "live") -> dict:
    """Convenience: EISC for a set of source_ids, loading cluster/cap/own_network + edges from SQLite.
    mode='live' drops own_network sources; mode='test' keeps them (§v2.1 condition 1)."""
    ids = list(dict.fromkeys(source_ids))
    return effective_independent_N(load_source_meta(conn, ids), load_edges(conn, ids), mode=mode)
|
||||
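A small worked sketch of the two ideas in the module above, the noisy-OR combination of edges and the inverse-row-sum count. It uses plain Python in place of NumPy and ignores caps and same-cluster coupling for brevity. The show ids are hypothetical; the 0.85 and 0.45 couplings are the `KAPPA` values for `shared_guest` and `citation`.

```python
from itertools import combinations

KAPPA_SHARED_GUEST = 0.85  # copied from KAPPA["shared_guest"] above


def eisc_raw(ids, coupling):
    """Sum over sources of 1 / (1 + total coupling to the other sources)."""
    total = 0.0
    for a in ids:
        rowsum = 1.0 + sum(coupling.get(frozenset((a, b)), 0.0) for b in ids if b != a)
        total += 1.0 / rowsum
    return total


shows = ["show_a", "show_b", "show_c", "show_d", "show_e"]  # hypothetical ids

# Case 1: every pair shares a guest, so every pair is coupled at 0.85.
clones = {frozenset(p): KAPPA_SHARED_GUEST for p in combinations(shows, 2)}
# Case 2: no edges at all, so every source is a full independent voice.
independent = {}

# Noisy-OR: two edges on the same pair combine as 1 - (1 - k1) * (1 - k2).
two_edges = 1.0 - (1.0 - 0.85) * (1.0 - 0.45)  # shared_guest + citation

print(round(eisc_raw(shows, clones), 3))
print(round(eisc_raw(shows, independent), 3))
print(round(two_edges, 4))
```

Five fully connected shows come out a little above one voice (5 / 4.4), and five unconnected shows come out at five, which matches the "collapses correctly" bullet in the module docstring.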
@@ -0,0 +1,49 @@
"""Ledger + candidate_scores writers. Log EVERY bar-clearer from day one (§6.6 denominator).

date_logged = as_of (backtest rows carry historical dates so lead-time math is correct). The
discourse_metric JSON is FROZEN here at log time; the resolver (separate forward pass) never edits it.
Grant's rating lives in human_evaluations; the model never reads it pre-log (§6.7).
"""
from __future__ import annotations

import hashlib
import json


def _sig_id(scorer: str, key: str, as_of: str) -> str:
    return "sig_" + hashlib.sha1(f"{scorer}|{key}|{as_of}".encode()).hexdigest()[:16]


def _score_id(scorer: str, key: str, as_of: str) -> str:
    return hashlib.sha1(f"cs|{scorer}|{key}|{as_of}".encode()).hexdigest()


def record_candidate_score(conn, result: dict, as_of: str, evidence: bool, promotion: bool) -> None:
    key = result.get("node_id") or result.get("conviction_id") or result.get("topic_canonical") or ""
    conn.execute(
        """INSERT OR REPLACE INTO candidate_scores
           (score_id, scorer, as_of, topic_canonical, node_id, conviction_id, score,
            cleared_evidence_bar, cleared_promotion_bar, inputs_json)
           VALUES (?,?,?,?,?,?,?,?,?,?)""",
        (_score_id(result["scorer"], key, as_of), result["scorer"], as_of,
         result.get("topic_canonical"), result.get("node_id"), result.get("conviction_id"),
         result["score"], int(evidence), int(promotion), json.dumps(result["inputs"])[:8000]),
    )
    conn.commit()


def log_candidate(conn, *, scorer: str, as_of: str, ledger_type: str, proposition: str,
                  discourse_metric: dict, origin_conviction_id=None, origin_node_id=None) -> str:
    key = origin_node_id or origin_conviction_id or proposition
    signal_id = _sig_id(scorer, key, as_of)
    dm = {**discourse_metric, "scorer": scorer}
    conn.execute(
        """INSERT OR IGNORE INTO ledger
           (signal_id, type, proposition, date_logged, discourse_metric, model_confidence,
            origin_conviction_id, origin_node_id)
           VALUES (?,?,?,?,?,?,?,?)""",
        (signal_id, ledger_type, proposition[:1000], as_of, json.dumps(dm)[:8000], None,
         origin_conviction_id, origin_node_id),
    )
    conn.commit()
    return signal_id
|
||||
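A minimal sketch of why the deterministic id plus `INSERT OR IGNORE` keeps the first logged metric frozen. The two-column table is a hypothetical stand-in for the real `ledger` schema (which is not shown here), and the scorer, key, and date values are illustrative.

```python
import hashlib
import sqlite3


def sig_id(scorer: str, key: str, as_of: str) -> str:
    # Same construction as _sig_id above: a short, deterministic content hash.
    return "sig_" + hashlib.sha1(f"{scorer}|{key}|{as_of}".encode()).hexdigest()[:16]


conn = sqlite3.connect(":memory:")
# Hypothetical two-column stand-in for the real ledger table.
conn.execute("CREATE TABLE ledger (signal_id TEXT PRIMARY KEY, discourse_metric TEXT)")

sid = sig_id("under_acted", "node_1", "2023-06-30")
conn.execute("INSERT OR IGNORE INTO ledger VALUES (?, ?)", (sid, '{"n_src": 3}'))
# Re-logging the same (scorer, key, as_of) is a no-op, so the first metric stays frozen.
conn.execute("INSERT OR IGNORE INTO ledger VALUES (?, ?)", (sid, '{"n_src": 99}'))

rows = conn.execute("SELECT signal_id, discourse_metric FROM ledger").fetchall()
print(rows)
```

By contrast, `candidate_scores` uses `INSERT OR REPLACE`, so re-scoring the same key and date overwrites the score row while the ledger row is left alone.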
@@ -0,0 +1,80 @@
"""Local-LLM scoring helpers (§4.4). Bounded labeling passes over PRE-FILTERED candidates only,
never nomination from the raw corpus (§5.1). JSON mode, temp 0, no thinking → deterministic.

Helper #2 (derivative-relevance) is built first: it's the one the §7.1 backtest needs. Helper #1
(stance-folding for Job A contrarian) comes with the forward pilot.
"""
from __future__ import annotations

import json
import logging

log = logging.getLogger(__name__)

_REL_SYS = (
    "You assess whether claims corroborate a specific investment hypothesis (a 2nd/3rd-order "
    "derivative of a thesis). For EACH claim decide: does it provide real-world evidence that the "
    "hypothesis is PLAYING OUT (corroborates), and the direction. 'affirms' = supports the hypothesis; "
    "'contradicts' = is evidence against it; 'tangential' = same topic words but not actually about the "
    "hypothesis (e.g. 'transformers' the ML architecture vs the electrical-grid kind). Be strict: a "
    "passing mention is tangential, not corroboration. "
    "TWO HARD RULES (these are the difference between catching a real signal and being fooled):\n"
    "1) REALIZED-ONLY. The hypothesis must be PLAYING OUT in fact. Announcements, plans, intentions, "
    "forecasts, targets, and 'may/will/expects/poised-to/aims-to/up-to' language are NOT corroboration; "
    "they are 'tangential' unless the claim states the thing has ACTUALLY HAPPENED / been DEPLOYED / "
    "closed. A $2B program 'announced' or capital 'made available' is NOT capital deployed. A company "
    "that 'may consider' or 'expects' something has not done it.\n"
    "2) ROLE-MATCH. The actor in the claim must occupy the role the hypothesis is about. If the "
    "hypothesis is that capital PROVIDERS are funding/supplying something, then a BORROWER or USER on the "
    "demand side (e.g. a firm posting an asset AS collateral to RECEIVE a loan) is the wrong side of the "
    "transaction → 'tangential' to that hypothesis, not 'affirms'. "
    'Return ONLY JSON: {"results":[{"claim_id":"...","corroborates":true|false,'
    '"direction":"affirms"|"contradicts"|"tangential"}]}.'
)


def _parse(raw: str) -> list[dict]:
    try:
        obj = json.loads(raw)
    except Exception:
        i, j = raw.find("{"), raw.rfind("}")
        if i < 0 or j < 0:
            return []
        try:
            obj = json.loads(raw[i:j + 1])
        except Exception:
            return []
    res = obj.get("results", []) if isinstance(obj, dict) else []
    return [r for r in res if isinstance(r, dict) and r.get("claim_id")]


def derivative_relevance(backend, derivative: str, claims: list[dict]) -> dict[str, dict]:
    """claims: [{claim_id, proposition}]. Returns {claim_id: {corroborates, direction}}.
    Filters retrieval near-misses; it cannot ADD claims search didn't return (not a nominator)."""
    if not claims:
        return {}
    listing = "\n".join(f"- [{c['claim_id']}] {c['proposition']}" for c in claims)
    user = (f"HYPOTHESIS (derivative): {derivative}\n\nCLAIMS:\n{listing}\n\n"
            f"Judge each claim id.")
    messages = [{"role": "system", "content": _REL_SYS}, {"role": "user", "content": user}]
    # Output is ~one JSON record per claim (claim_id + corroborates + direction ≈ 70-100 tokens). At
    # top_k=60 that's ~5k tokens, so a fixed 3000 budget truncated mid-array → empty parse → a node
    # silently zeroed (the source of the unstable 5-affirm/0-affirm flip). Size the budget to the batch.
    budget = max(3000, 120 * len(claims) + 500)
    parsed = []
    for attempt in range(2):  # one retry: a gateway-under-load truncation shouldn't zero out a node
        raw = backend.complete_json(messages, max_tokens=budget)
        parsed = _parse(raw)
        if parsed:
            break
        log.warning("derivative_relevance empty parse (attempt %d) for %r; raw[:160]=%r",
                    attempt + 1, derivative[:50], raw[:160])
    # The listing presents ids as `- [{claim_id}] ...`; the model INCONSISTENTLY echoes the id back with
    # the surrounding brackets ("[edgar:...]"), which then misses the bracket-less lookup key and the
    # whole node reads as 0/(missing). Normalize the brackets+whitespace so matching is robust either way.
    out = {}
    for r in parsed:
        cid = str(r["claim_id"]).strip().strip("[]").strip()
        out[cid] = {"corroborates": bool(r.get("corroborates")),
                    "direction": r.get("direction", "tangential")}
    return out
|
||||
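A short sketch of the two defensive steps described in the comments above: sizing the output budget to the batch, and normalizing claim ids that come back wrapped in brackets. The model reply and the `edgar:123` id are invented for illustration; the function names are hypothetical and not part of the module.

```python
import json


def output_budget(n_claims: int) -> int:
    # Same sizing rule as above: roughly 120 tokens per claim plus headroom, never below 3000.
    return max(3000, 120 * n_claims + 500)


def normalize_claim_id(raw_id) -> str:
    # Strip whitespace and any echoed square brackets around the id.
    return str(raw_id).strip().strip("[]").strip()


# Hypothetical model reply with prose around the JSON and one bracketed id.
reply = 'Sure: {"results": [{"claim_id": "[edgar:123]", "direction": "affirms"}]} done'
start, end = reply.find("{"), reply.rfind("}")
results = json.loads(reply[start:end + 1])["results"]
ids = [normalize_claim_id(r["claim_id"]) for r in results]

print(output_budget(10), output_budget(60))
print(ids)
```

With 60 claims the budget rises to 7700 tokens, comfortably above the roughly 5k the comment estimates for a full batch.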
@@ -0,0 +1,27 @@
"""Resolver: the SEPARATE forward pass that closes the loop (§6.2, §6.3).

ARCHITECTURALLY ISOLATED from the scorers: it has no shared write path with them. Scorers write
candidate_scores + ledger rows with outcome columns NULL and a FROZEN discourse_metric. The resolver
runs later (larger as_of), reads ledger rows whose date_logged < as_of_now, and writes ONLY
resolution_date / discourse_outcome / external_outcome / lead_time_days. It is FORBIDDEN from touching
discourse_metric; that is the structural reason the ledger can't reward noticing what already happened.

Implementation note: real resolutions need forward time (the clock can't be backfilled). For the
backtest, the discourse leg can be resolved by re-running the discourse metric forward from date_logged;
the external leg (price/filings/human check, §6.5) is filled as that evidence arrives. Stubbed now to
lock the architecture; to be filled out for the forward pilot.
"""
from __future__ import annotations


def resolve_discourse_leg(conn, sc, cfg, *, as_of_now: str) -> int:
    """For each ledger row logged before as_of_now without a resolution, re-measure discourse forward
    and set discourse_outcome + lead_time. (Forward-only; never reads/edits discourse_metric.)
    Returns count resolved. STUB: currently resolves nothing and returns 0 until the forward pilot."""
    rows = conn.execute(
        "SELECT signal_id, date_logged FROM ledger WHERE resolution_date IS NULL AND date_logged < ?",
        (as_of_now,),
    ).fetchall()
    # TODO(forward-pilot): re-run windowed independence from date_logged→as_of_now for each row's
    # origin derivative; set discourse_outcome in {up_cross_cluster,up_single_cluster,flat,down}.
    return 0
|
||||
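One way the write isolation described above could look once the stub is filled in, shown as a sketch only. The table here is a hypothetical cut-down ledger and `resolve_row` is an invented helper; the point is that the `UPDATE` names only outcome columns and skips rows that are already resolved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal ledger; the real schema is not shown here.
conn.execute(
    "CREATE TABLE ledger (signal_id TEXT PRIMARY KEY, date_logged TEXT, discourse_metric TEXT, "
    "resolution_date TEXT, discourse_outcome TEXT, lead_time_days INTEGER)"
)
conn.execute(
    "INSERT INTO ledger (signal_id, date_logged, discourse_metric) VALUES (?, ?, ?)",
    ("sig_1", "2023-03-01", '{"eisc": 2.1}'),
)


def resolve_row(conn, signal_id, *, as_of_now, outcome, lead_time_days):
    # Only outcome columns appear in SET; discourse_metric is never named.
    cur = conn.execute(
        "UPDATE ledger SET resolution_date = ?, discourse_outcome = ?, lead_time_days = ? "
        "WHERE signal_id = ? AND resolution_date IS NULL AND date_logged < ?",
        (as_of_now, outcome, lead_time_days, signal_id, as_of_now),
    )
    return cur.rowcount


first = resolve_row(conn, "sig_1", as_of_now="2023-09-01", outcome="flat", lead_time_days=184)
second = resolve_row(conn, "sig_1", as_of_now="2023-10-01", outcome="down", lead_time_days=214)
row = conn.execute("SELECT discourse_metric, discourse_outcome FROM ledger").fetchone()
print(first, second, row)
```

The second call changes nothing because the row already has a `resolution_date`, and the frozen metric is read back unchanged.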
@@ -0,0 +1,81 @@
"""Scoring orchestrator. For Job B / the §7.1 backtest: march as_of dates, score every conviction +
fan-out derivative, gate, log the denominator, promote nodes.
"""
from __future__ import annotations

import logging

from ..extract.backends import from_config as backend_from_config
from . import bar, under_acted
from .asof import Scorer
from .ledger_writer import log_candidate, record_candidate_score

log = logging.getLogger(__name__)


def _nodes_for(conn, as_of, mode, conviction_ids):
    nodes = []
    where, params = "", []
    if conviction_ids:
        ph = ",".join("?" * len(conviction_ids))
        where = f" WHERE conviction_id IN ({ph})"
        params = list(conviction_ids)
    for c in conn.execute(
        f"SELECT conviction_id, thematic_proposition, conviction_level, current_exposure, is_thesis_breaker "
        f"FROM conviction_log{where}", params,
    ):
        nodes.append({"conviction_id": c[0], "node_id": None, "derivative": c[1],
                      "level": c[2], "exposure": c[3], "breaker": bool(c[4])})
    fq = ("SELECT f.node_id, f.parent_conviction_id, f.derivative_proposition, c.conviction_level, "
          "c.current_exposure, c.is_thesis_breaker FROM fanout_nodes f "
          "JOIN conviction_log c ON c.conviction_id = f.parent_conviction_id")
    conds, fparams = [], []
    if conviction_ids:
        conds.append(f"f.parent_conviction_id IN ({','.join('?' * len(conviction_ids))})")
        fparams += list(conviction_ids)
    if mode == "forward":  # backtest uses the seeded tree as the as-of-2023 hypothesis (no created_at leak)
        conds.append("f.created_at <= ?")
        fparams.append(as_of)
    if conds:
        fq += " WHERE " + " AND ".join(conds)
    for f in conn.execute(fq, fparams):
        nodes.append({"conviction_id": f[1], "node_id": f[0], "derivative": f[2],
                      "level": f[3], "exposure": f[4], "breaker": bool(f[5])})
    return nodes


def run_under_acted(conn, sc, cfg, *, as_of, mode="backtest", conviction_ids=None, window_days=28) -> list[dict]:
    backend = backend_from_config(cfg, sc)
    out = []
    with Scorer(conn, as_of, mode=mode):
        for nd in _nodes_for(conn, as_of, mode, conviction_ids):
            r = under_acted.score_node(
                conn, sc, backend, as_of=as_of, derivative=nd["derivative"],
                conviction_id=nd["conviction_id"], node_id=nd["node_id"],
                conviction_level=nd["level"], exposure=nd["exposure"], is_breaker=nd["breaker"],
                window_days=window_days,
            )
            ev, pr = bar.evaluate("under_acted", r, conn=conn)
            record_candidate_score(conn, r, as_of, ev, pr)
            if ev:
                log_candidate(conn, scorer="under_acted", as_of=as_of,
                              ledger_type="under_acted_conviction", proposition=nd["derivative"],
                              discourse_metric=r["inputs"], origin_conviction_id=nd["conviction_id"],
                              origin_node_id=nd["node_id"])
                if nd["node_id"]:
                    conn.execute("UPDATE fanout_nodes SET status=? WHERE node_id=?",
                                 ("signal" if pr else "corroborated", nd["node_id"]))
                    conn.commit()
            out.append({"node": nd, "result": r, "evidence": ev, "promotion": pr})
    return out


def run_backtest(conn, sc, cfg, *, conviction_id, dates, window_days=90) -> list[tuple]:
    timeline = []
    for as_of in dates:
        res = run_under_acted(conn, sc, cfg, as_of=as_of, mode="backtest",
                              conviction_ids=[conviction_id], window_days=window_days)
        timeline.append((as_of, res))
        fired = [r for r in res if r["evidence"]]
        log.info("as_of %s: %d/%d nodes cleared evidence bar", as_of, len(fired), len(res))
    return timeline
|
||||
@@ -0,0 +1,105 @@
"""Two-sided net-corroboration (DESIGN_v2.1 H5 + condition 3): the instrument for the adversarial cases.

For a derivative, track the INDEPENDENCE-WEIGHTED affirms MINUS denies over time. This is the right
output for Strike/Battery (where the question is "did the engine distinguish real adoption from
narrative, and catch the contradiction?"), not runway:
  - STRIKE (reflexivity): a PASS = net stays low/quiet in LIVE mode (own_network dropped) while it
    would have fired in TEST mode (own_network kept) → the engine refuses the intra-cluster echo.
  - BATTERY (timing): the DEMAND derivative's net rises while the SUPPLY derivative's net stays flat →
    "half-confirmed, the load-bearing half isn't moving" = the eroding-conviction signal.
Reuses the §4.6 relevance helper, which already returns direction affirms|contradicts|tangential.
"""
from __future__ import annotations

from .independence import eisc_for
from .llm_helpers import derivative_relevance
from .windows import window_bounds


def classify_corpus(sc, backend, derivative: str, as_of: str, *, top_k: int = 60) -> list[dict]:
    """Retrieve (as-of filtered) + LLM-classify each claim's direction toward the derivative.
    Returns affirms/contradicts claims with source_id + date (tangential dropped)."""
    res = sc.search(derivative, collection="propositions", top_k=top_k, rerank=True)
    hits = res.get("data", []) if isinstance(res, dict) else []
    cand = []
    for h in hits:
        pl = (h.get("payload") or {})
        d = pl.get("date")
        if not pl.get("claim_id") or not d or d[:10] > as_of:
            continue
        cand.append({"claim_id": pl["claim_id"], "proposition": pl.get("proposition", ""),
                     "date": d[:10], "source_id": pl.get("source_id")})
    if not cand:
        return []
    rel = derivative_relevance(backend, derivative,
                               [{"claim_id": c["claim_id"], "proposition": c["proposition"]} for c in cand])
    out = []
    for c in cand:
        direction = rel.get(c["claim_id"], {}).get("direction", "tangential")
        if direction in ("affirms", "contradicts"):
            out.append({**c, "direction": direction})
    return out


# DESIGN_v2 ADOPT #1 (claim-type weighting): a node "resolves" on REALIZED, descriptive disclosure,
# not on forecasts/intent. A source counts toward the net only if it carries a HARD (realized-fact)
# claim on this side; predictive/interpretive claims (forecasts, opinion, 'may consider', 'expects')
# are the exact material that fooled the supply axis on Battery, so they don't qualify a source alone.
_HARD_CLAIM_TYPES = ("descriptive", "reactive")


def _hard_sources(conn, claim_ids: list[str]) -> set:
    """Sources that contributed at least one realized-fact (descriptive/reactive) claim among claim_ids."""
    if not claim_ids:
        return set()
    ph = ",".join("?" * len(claim_ids))
    qph = ",".join("?" * len(_HARD_CLAIM_TYPES))
    rows = conn.execute(
        f"SELECT DISTINCT source_id FROM claims WHERE claim_id IN ({ph}) AND claim_type IN ({qph})",
        list(claim_ids) + list(_HARD_CLAIM_TYPES),
    ).fetchall()
    return {r[0] for r in rows}


def net_at(conn, classified: list[dict], as_of: str, *, window_days: int = 90, mode: str = "live",
           require_hard_evidence: bool = True) -> dict:
    """Net independence-weighted corroboration in the trailing window ending at as_of. With
    require_hard_evidence (default), a source only counts on a side if it carries a realized-fact claim
    there; forecasts/intent alone don't qualify it (the announced-vs-deployed / opinion-vs-fact guard)."""
    _, start, end = window_bounds(as_of, n=1, days=window_days)[0]
    win = [c for c in classified if start < c["date"] <= end]
    aff = [c for c in win if c["direction"] == "affirms"]
    den = [c for c in win if c["direction"] == "contradicts"]
    aff_src_all = {c["source_id"] for c in aff}
    den_src_all = {c["source_id"] for c in den}
    if require_hard_evidence:
        hard_aff = _hard_sources(conn, [c["claim_id"] for c in aff])
        hard_den = _hard_sources(conn, [c["claim_id"] for c in den])
        aff_src = list(aff_src_all & hard_aff)
        den_src = list(den_src_all & hard_den)
    else:
        aff_src, den_src = list(aff_src_all), list(den_src_all)
    aff_e = eisc_for(conn, aff_src, mode=mode)["eisc_adj"] if aff_src else 0.0
    den_e = eisc_for(conn, den_src, mode=mode)["eisc_adj"] if den_src else 0.0
    own = 0
    if aff_src:
        ph = ",".join("?" * len(aff_src))
        own = conn.execute(
            f"SELECT COUNT(*) FROM sources WHERE source_id IN ({ph}) AND COALESCE(own_network,0)=1", aff_src
        ).fetchone()[0]
    return {"as_of": as_of, "affirms_eisc": round(aff_e, 2), "denies_eisc": round(den_e, 2),
            "net": round(aff_e - den_e, 2),
            "n_affirm": len(aff), "n_deny": len(den),
            "hard_affirm_src": len(aff_src), "soft_affirm_src_dropped": len(aff_src_all) - len(aff_src),
            "own_network_affirm_src": own}


def trajectory(conn, sc, backend, derivative: str, as_of_dates: list[str], *,
               window_days: int = 90, mode: str = "live", top_k: int = 60) -> list[dict]:
    """The net-corroboration curve over as_of_dates. Run twice (mode='live' vs 'test') to see what the
    own_network quarantine removes (the reflexivity measurement)."""
    out = []
    for as_of in as_of_dates:
        classified = classify_corpus(sc, backend, derivative, as_of, top_k=top_k)
        out.append(net_at(conn, classified, as_of, window_days=window_days, mode=mode))
    return out
|
||||
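A toy sketch of the hard-evidence guard in `net_at`, on made-up claims. As a simplification (an assumption of this sketch, not the module's behaviour), each qualifying source counts as one voice, whereas the real code weights the source set by EISC. The claim type `predictive` stands for the forecast-style claims the comment above says should not qualify a source alone.

```python
HARD_CLAIM_TYPES = ("descriptive", "reactive")  # mirrors _HARD_CLAIM_TYPES above

# Hypothetical classified claims; claim_type would come from the claims table.
claims = [
    {"source_id": "s1", "direction": "affirms", "claim_type": "descriptive"},
    {"source_id": "s2", "direction": "affirms", "claim_type": "predictive"},
    {"source_id": "s2", "direction": "affirms", "claim_type": "predictive"},
    {"source_id": "s3", "direction": "contradicts", "claim_type": "reactive"},
]


def qualifying_sources(claims, direction, require_hard=True):
    """Sources on one side; with require_hard, only those carrying a realized-fact claim."""
    side = [c for c in claims if c["direction"] == direction]
    if not require_hard:
        return {c["source_id"] for c in side}
    return {c["source_id"] for c in side if c["claim_type"] in HARD_CLAIM_TYPES}


# Simplification: each source counts as one voice; the real code weights by EISC.
net_guarded = (len(qualifying_sources(claims, "affirms"))
               - len(qualifying_sources(claims, "contradicts")))
net_unguarded = (len(qualifying_sources(claims, "affirms", False))
                 - len(qualifying_sources(claims, "contradicts", False)))

print(net_guarded, net_unguarded)
```

Source `s2` affirms twice but only with forecasts, so it is dropped under the guard and the net falls from 1 to 0.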
@@ -0,0 +1,75 @@
"""Under-acted-conviction scorer: Job B, the §7.1 backtest target.

    score = conviction_weight x exposure_gap x rising_independent_corroboration

Fires when Ten31 believes something (high conviction), has little/no position (exposure gap), and the
world is beginning to corroborate it or a derivative of it, independently and with acceleration. This
is the signal that should have flagged "size up power-infra picks-and-shovels" in 2023.

Exposure is joined LOCALLY (never crosses the frontier boundary, §4.6). Corroboration is RETRIEVED
(stats nominate), then an LLM helper only FILTERS retrieval near-misses (§5.1); it cannot add claims.
"""
from __future__ import annotations

from .llm_helpers import derivative_relevance
from .windows import windowed_independence

CONVICTION_WEIGHT = {"low": 0.15, "med": 0.4, "med-high": 0.7, "high": 1.0}
EXPOSURE_GAP = {"none": 1.0, "lt2": 0.8, "2to10": 0.4, "gt10": 0.1, "unset": 0.6}


def score_node(conn, sc, backend, *, as_of: str, derivative: str, conviction_id: str,
               node_id: str | None, conviction_level: str, exposure: str,
               is_breaker: bool = False, top_k: int = 40, window_days: int = 28) -> dict:
    cw = CONVICTION_WEIGHT.get(conviction_level, 0.4)
    eg = EXPOSURE_GAP.get(exposure, 0.6)

    # 1. RETRIEVE (stats nominate): hybrid search over embedded propositions; as-of post-filter.
    try:
        res = sc.search(derivative, collection="propositions", top_k=top_k, rerank=True)
    except Exception as e:  # noqa: BLE001
        return _result(conviction_id, node_id, 0.0, {"reason": f"search_failed:{str(e)[:60]}"},
                       cw, eg, exposure, is_breaker)
    hits = res.get("data", []) if isinstance(res, dict) else []
    cand = []
    for h in hits:
        pl = (h.get("payload") or {}) if isinstance(h, dict) else {}
        d = pl.get("date")
        if not pl.get("claim_id") or not d or d[:10] > as_of:  # Qdrant can't date-filter; do it here
            continue
        cand.append({"claim_id": pl["claim_id"], "proposition": pl.get("proposition", ""),
                     "date": d, "source_id": pl.get("source_id")})
    if not cand:
        return _result(conviction_id, node_id, 0.0, {"reason": "no_retrieval", "n_retrieved": 0},
                       cw, eg, exposure, is_breaker)

    # 2. FILTER near-misses with the LLM (affirms-only). Not a nominator: it can't add claims.
    rel = derivative_relevance(backend, derivative,
                               [{"claim_id": c["claim_id"], "proposition": c["proposition"]} for c in cand])
    confirmed = [c for c in cand
                 if rel.get(c["claim_id"], {}).get("corroborates")
                 and rel[c["claim_id"]].get("direction") == "affirms"]
    n_src = len({c["source_id"] for c in confirmed})

    # 3. CORROBORATION = independence-weighted acceleration over the confirmed set (treat as a topic).
    # window_days matches corpus cadence: ~90d for quarterly filings/earnings, ~28d for weekly podcasts.
    wi = windowed_independence(conn, [(c["date"], c["source_id"]) for c in confirmed], as_of, days=window_days)
    a_corrob = wi["acceleration"]
    eisc_corrob = wi["eisc0"]
    corroboration = max(0.0, a_corrob) * eisc_corrob

    score = corroboration if is_breaker else cw * eg * corroboration
    inputs = {
        "as_of": as_of, "derivative": derivative, "n_retrieved": len(cand), "n_confirmed": len(confirmed),
        "n_src": n_src, "a_corrob": a_corrob, "eisc_corrob": eisc_corrob, "k_eff0": wi["k_eff0"],
        "window_counts": wi["counts"], "window_eisc": wi["eisc"], "corroboration": round(corroboration, 3),
        "confirmed_claim_ids": [c["claim_id"] for c in confirmed][:50],
    }
    return _result(conviction_id, node_id, score, inputs, cw, eg, exposure, is_breaker)


def _result(conviction_id, node_id, score, inputs, cw, eg, exposure, is_breaker) -> dict:
    inputs = {**inputs, "conviction_weight": cw, "exposure_gap": eg, "exposure": exposure,
              "is_breaker": is_breaker}
    return {"scorer": "under_acted", "conviction_id": conviction_id, "node_id": node_id,
            "score": round(float(score), 4), "inputs": inputs}
|
||||
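A worked sketch of the score arithmetic above, with the two lookup tables copied from the module. The acceleration and EISC inputs are invented numbers, and `under_acted_score` is a hypothetical helper that isolates the final multiplication from retrieval and filtering.

```python
CONVICTION_WEIGHT = {"low": 0.15, "med": 0.4, "med-high": 0.7, "high": 1.0}
EXPOSURE_GAP = {"none": 1.0, "lt2": 0.8, "2to10": 0.4, "gt10": 0.1, "unset": 0.6}


def under_acted_score(level, exposure, acceleration, eisc0, is_breaker=False):
    # Negative acceleration is floored at zero, so fading corroboration never scores.
    corroboration = max(0.0, acceleration) * eisc0
    if is_breaker:
        # Thesis-breakers skip the conviction and exposure weights.
        return round(corroboration, 4)
    cw = CONVICTION_WEIGHT.get(level, 0.4)
    eg = EXPOSURE_GAP.get(exposure, 0.6)
    return round(cw * eg * corroboration, 4)


small_position = under_acted_score("high", "lt2", 1.5, 2.0)
large_position = under_acted_score("high", "gt10", 1.5, 2.0)
decelerating = under_acted_score("high", "none", -0.5, 2.0)

print(small_position, large_position, decelerating)
```

The same corroboration scores 2.4 with a sub-2% position but only 0.3 when exposure is already above 10%, which is the "under-acted" part of the design.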
@@ -0,0 +1,53 @@
"""Temporal windows + windowed independence (the single temporal layer, §4.4).

28-day non-overlapping windows anchored at as_of (W0 ends at as_of, then back). Non-overlapping
avoids autocorrelation faking significance. The signal is the discrete 2nd derivative of the
INDEPENDENCE-WEIGHTED flow (EISC per window), never the raw count, so a topic that "accelerates"
only because one show booked the same guest three times has flat N(W).
"""
from __future__ import annotations

from datetime import datetime, timedelta

from .independence import eisc_for

WINDOW_DAYS = 28
N_WINDOWS = 3


def _d(s: str) -> datetime:
    return datetime.strptime(s[:10], "%Y-%m-%d")


def window_bounds(as_of: str, *, n: int = N_WINDOWS, days: int = WINDOW_DAYS) -> list[tuple[int, str, str]]:
    """Returns [(idx, start_iso, end_iso)] with W0 ending at as_of, extending backward only."""
    end = _d(as_of)
    out = []
    for idx in range(n):
        w_end = end - timedelta(days=idx * days)
        w_start = end - timedelta(days=(idx + 1) * days)
        out.append((idx, w_start.strftime("%Y-%m-%d"), w_end.strftime("%Y-%m-%d")))
    return out


def windowed_independence(conn, rows: list[tuple], as_of: str, *, n: int = N_WINDOWS,
                          days: int = WINDOW_DAYS) -> dict:
    """rows: [(date_iso, source_id)]. For each window compute raw count + EISC_adj of its sources.
    Returns {counts:[c0..], eisc:[N0..], k_eff:[...], acceleration, eisc0, sources0}.
    acceleration = N0 - 2*N1 + N2 (independence-weighted 2nd derivative)."""
    bounds = window_bounds(as_of, n=n, days=days)
    counts, eiscs, keffs, src_sets = [], [], [], []
    for _idx, start, end in bounds:
        win = [r for r in rows if r[0] and start < r[0][:10] <= end]
        srcs = list({r[1] for r in win})
        e = eisc_for(conn, srcs) if srcs else {"eisc_adj": 0.0, "k_eff": 0}
        counts.append(len(win))
        eiscs.append(e["eisc_adj"])
        keffs.append(e["k_eff"])
        src_sets.append(srcs)
    accel = eiscs[0] - 2 * eiscs[1] + eiscs[2] if n >= 3 else 0.0
    return {
        "counts": counts, "eisc": [round(x, 3) for x in eiscs], "k_eff": keffs,
        "acceleration": round(accel, 3), "eisc0": round(eiscs[0], 3), "k_eff0": keffs[0],
        "sources0": src_sets[0], "n_total": sum(counts),
    }
|
||||
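To make the "one show booked the same guest three times" point concrete, the sketch below compares the second difference of raw counts with that of an independence-weighted count. It is self-contained and only an approximation of the module above: `toy_eisc` is a hypothetical stand-in for `eisc_for` (it simply counts distinct sources), and the dates and source ids are made up.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def window_bounds(as_of: str, n: int = 3, days: int = 28) -> list[tuple[str, str]]:
    """(start, end] ISO bounds, W0 ending at as_of and walking backward."""
    fmt = "%Y-%m-%d"
    end = datetime.strptime(as_of[:10], fmt)
    return [((end - timedelta(days=(i + 1) * days)).strftime(fmt),
             (end - timedelta(days=i * days)).strftime(fmt)) for i in range(n)]


def toy_eisc(sources: set[str]) -> float:
    """HYPOTHETICAL stand-in for eisc_for: every distinct source counts as 1.0."""
    return float(len(sources))


def second_difference(rows: list[tuple[str, str]], as_of: str) -> tuple[int, float]:
    """Return (raw-count acceleration, independence-weighted acceleration)."""
    raw, weighted = [], []
    for start, end in window_bounds(as_of):
        win = [r for r in rows if start < r[0][:10] <= end]
        raw.append(len(win))
        weighted.append(toy_eisc({src for _date, src in win}))
    return raw[0] - 2 * raw[1] + raw[2], weighted[0] - 2 * weighted[1] + weighted[2]


baseline = [("2024-02-20", "show_a"), ("2024-01-20", "show_a")]
# Same show three times in W0: raw count jumps, independent-source count does not.
one_show = baseline + [("2024-03-10", "show_a"), ("2024-03-17", "show_a"), ("2024-03-24", "show_a")]
# Three different shows in W0: both measures accelerate.
broad = baseline + [("2024-03-10", "show_a"), ("2024-03-17", "show_b"), ("2024-03-24", "show_c")]

one_show_accel = second_difference(one_show, "2024-03-31")
broad_accel = second_difference(broad, "2024-03-31")
```

Both inputs have raw counts of 3, 1, 1 across W0..W2, so the raw second difference cannot tell them apart; only the weighted series separates repetition from genuine spread.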
@@ -0,0 +1,9 @@
"""Spark Control gateway client: the SINGLE chokepoint for all gateway HTTP (§13).

No other module in the engine knows the gateway URL. Everything local-compute
(transcription, diarization, the local LLM, embeddings, rerank, hybrid search, and the
scrub/rehydrate sovereignty boundary) goes through here.
"""
from .client import SparkControl, SparkControlError, from_config

__all__ = ["SparkControl", "SparkControlError", "from_config"]
@@ -0,0 +1,242 @@
"""Spark Control HTTP client (handoff §13.2 endpoint table).

Enforces the two operational invariants from §4.1 / §13.4 (revised per infra guidance 2026-06-09):
  1. AUDIO concurrency is CAPPED at 2 in-flight (hard ceiling 3), GLOBAL across both parakeet
     endpoints (/v1/audio/transcriptions + /api/audio/diarize*); they share ONE serial GPU. A
     process-wide BoundedSemaphore enforces it. Going wider buys ZERO throughput (requests queue and
     hold the GPU); 2 just keeps the GPU continuously fed with no idle gap = full throughput.
  2. Transient unresponsiveness is NORMAL, not failure: when the GPU stays continuously busy the
     /health and in-flight requests can briefly (1-4s) stop responding. Timeouts / 503s /
     connection-resets are "busy, retry", handled by short exponential backoff, never treated as work loss.

NOTE: request/response *shapes* for the non-OpenAI endpoints (/api/audio/*, /scrub,
/rehydrate, /api/search) are provisional and marked TODO(contract): confirm against the
live gateway's /api/endpoints. The OpenAI-compatible routes (/v1/*) follow the standard.
"""
from __future__ import annotations

import logging
import threading
import time
from pathlib import Path
from typing import Any

import requests

log = logging.getLogger(__name__)

# Process-wide AUDIO in-flight cap, GLOBAL across both parakeet endpoints. Single serial GPU shared
# with the operator's production app → concurrency only deepens the queue + lengthens transient
# busy-blips; sit at 2 (full throughput, ~2-3s busy windows), hard ceiling 3.
_AUDIO_MAX = 3
_AUDIO_SEM = threading.BoundedSemaphore(2)


def _set_audio_concurrency(n: int) -> None:
    """Resize the global audio semaphore (clamped to [1, _AUDIO_MAX]). Called at client init from config;
    set before any worker threads start, so the rebind is not racing in-flight acquirers."""
    global _AUDIO_SEM
    _AUDIO_SEM = threading.BoundedSemaphore(min(_AUDIO_MAX, max(1, int(n))))


class SparkControlError(RuntimeError):
    pass


class SparkControl:
    def __init__(
        self,
        base_url: str,
        *,
        verify_tls: bool = False,
        timeout: float = 120.0,
        llm_model: str = "",
        embed_model: str = "",
        transcribe_model: str = "",
        audio_concurrency: int = 2,
    ) -> None:
        self.base = base_url.rstrip("/")
        self.verify = verify_tls
        self.timeout = timeout
        self.llm_model = llm_model
        self.embed_model = embed_model
        self.transcribe_model = transcribe_model
        _set_audio_concurrency(audio_concurrency)
        self._session = requests.Session()
        if not verify_tls:
            # same-LAN self-signed cert (§13): suppress the per-request InsecureRequestWarning noise.
            import urllib3
            urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    # ---------- low-level ----------
    def _post(
        self,
        path: str,
        *,
        json: Any = None,
        files: Any = None,
        data: Any = None,
        retries: int = 4,
        backoff: float = 5.0,
    ) -> Any:
        url = f"{self.base}{path}"
        for attempt in range(retries + 1):
            try:
                # An earlier attempt has already read any upload handle to EOF; rewind so a
                # retry re-sends the whole file instead of an empty body.
                if isinstance(files, dict):
                    for fobj in files.values():
                        if hasattr(fobj, "seek"):
                            fobj.seek(0)
                r = self._session.post(
                    url, json=json, files=files, data=data,
                    timeout=self.timeout, verify=self.verify,
                )
                if r.status_code == 503:
                    raise SparkControlError("503 from Spark Control (GPU busy / cold start)")
                r.raise_for_status()
                return r.json()
            except (requests.RequestException, SparkControlError) as e:
                if attempt < retries:
                    sleep = backoff * (2 ** attempt)
                    log.warning("Spark Control POST %s failed (%s); retry %d/%d in %.0fs",
                                path, e, attempt + 1, retries, sleep)
                    time.sleep(sleep)
                else:
                    raise SparkControlError(f"POST {path} failed after {retries} retries: {e}") from e

    def _get(self, path: str) -> Any:
        r = self._session.get(f"{self.base}{path}", timeout=self.timeout, verify=self.verify)
        r.raise_for_status()
        return r.json()

    # ---------- health / discovery (§13.2) ----------
    def status(self) -> Any:
        return self._get("/api/status")

    def endpoints(self) -> Any:
        return self._get("/api/endpoints")

    # ---------- local LLM: extraction + scoring helpers (§4.2) ----------
    def chat(
        self,
        messages: list[dict[str, str]],
        *,
        json_object: bool = True,
        temperature: float = 0.0,
        enable_thinking: bool = False,
        max_tokens: int | None = None,
    ) -> Any:
        """Deterministic, no-chain-of-thought extraction per §4.2 (temp 0, thinking off,
        JSON mode for guaranteed-valid JSON)."""
        body: dict[str, Any] = {
            "model": self.llm_model,
            "messages": messages,
            "temperature": temperature,
            "chat_template_kwargs": {"enable_thinking": enable_thinking},
        }
        if json_object:
            body["response_format"] = {"type": "json_object"}
        if max_tokens:
            body["max_tokens"] = max_tokens
        return self._post("/v1/chat/completions", json=body)

    # ---------- embeddings / rerank / hybrid search (§4.3) ----------
    def embed(self, inputs: list[str]) -> Any:
        """Embed DISTILLED PROPOSITIONS, not raw chunks (§4.3)."""
        return self._post("/v1/embeddings", json={"model": self.embed_model, "input": inputs})

    def rerank(self, query: str, documents: list[str], *, top_n: int | None = None) -> Any:
        body: dict[str, Any] = {"query": query, "documents": documents}
        if top_n:
            body["top_n"] = top_n
        return self._post("/v1/rerank", json=body)

    def search(
        self,
        query: str,
        *,
        collection: str,
        top_k: int = 10,
        retrieve_n: int | None = None,
        rerank: bool = True,
        filter: dict[str, Any] | None = None,
        with_payload: bool = True,
        min_score: float | None = None,
        dense_vector_name: str = "bge_m3",
        sparse_vector_name: str = "bm25",
        text_field: str = "proposition",
    ) -> Any:
        """Hybrid dense+sparse retrieval (RRF) + optional rerank over a Qdrant collection (§4.3).
        The gateway defaults vector names to 'dense'/'sparse'; our `propositions` collection uses
        named vectors bge_m3/bm25, so they must be passed explicitly (confirmed live)."""
        body: dict[str, Any] = {
            "query": query, "collection": collection, "top_k": top_k,
            "rerank": rerank, "with_payload": with_payload,
            "dense_vector_name": dense_vector_name,
            "sparse_vector_name": sparse_vector_name,
            "text_field": text_field,
        }
        if retrieve_n is not None:
            body["retrieve_n"] = retrieve_n
        if filter is not None:
            body["filter"] = filter
        if min_score is not None:
            body["min_score"] = min_score
        return self._post("/api/search", json=body)

    # ---------- audio: capped at 2 in-flight GLOBAL (semaphore), short busy-retry ----------
    # backoff=1.5 → ~1.5/3/6/12/24s: tuned to ride out the 1-4s busy-blips, not the old 5-40s.
    def transcribe(self, audio_path: str | Path, *, response_format: str = "verbose_json") -> Any:
        with _AUDIO_SEM, open(audio_path, "rb") as f:
            return self._post(
                "/v1/audio/transcriptions",
                files={"file": f},
                data={"model": self.transcribe_model, "response_format": response_format},
                retries=5, backoff=1.5,
            )

    def diarize_chunk(self, audio_path: str | Path) -> Any:
        # TODO(contract): confirm /api/audio/diarize-chunk response shape (segments + 192-d voiceprint).
        with _AUDIO_SEM, open(audio_path, "rb") as f:
            return self._post("/api/audio/diarize-chunk", files={"file": f}, retries=5, backoff=1.5)

    def transcribe_with_speakers(self, audio_path: str | Path) -> Any:
        with _AUDIO_SEM, open(audio_path, "rb") as f:
            return self._post("/api/audio/transcribe-with-speakers", files={"file": f}, retries=5, backoff=1.5)

    # ---------- frontier sovereignty boundary (§4.6) ----------
    # Confirmed contract (gateway /openapi.json):
    #   /scrub:     task_id*, items*, known_entities, actor, tier1_action, bucket, ner, map_handle
    #   /rehydrate: task_id*, map_handle*, items*, actor, strict
    # De-identifies IDENTITIES into stable placeholders; the de-anon map stays on the box and is
    # referenced by `map_handle`. Exposure/position data must NEVER be sent here at all (§4.6).
    def scrub(
        self,
        items: list[Any],
        *,
        task_id: str,
        known_entities: dict[str, str] | None = None,
        actor: str | None = None,
        ner: bool = True,
    ) -> Any:
        """Returns the scrubbed items + a `map_handle` to pass to rehydrate. `known_entities` is the
        caller-supplied dictionary (Strike→[FUND_1]); `ner` toggles the local-Qwen NER backstop."""
        body: dict[str, Any] = {"task_id": task_id, "items": items, "ner": ner}
        if known_entities is not None:
            body["known_entities"] = known_entities
        if actor is not None:
            body["actor"] = actor
        return self._post("/scrub", json=body)

    def rehydrate(self, items: list[Any], *, task_id: str, map_handle: str, strict: bool = False) -> Any:
        """Restore real identities in the frontier's output locally, using the scrub `map_handle`."""
        return self._post("/rehydrate", json={
            "task_id": task_id, "map_handle": map_handle, "items": items, "strict": strict,
        })


def from_config(cfg: Any) -> SparkControl:
    return SparkControl(
        cfg.spark_control_url,
        verify_tls=cfg.spark_verify_tls,
        timeout=cfg.spark_timeout_s,
        llm_model=cfg.local_llm_model,
        embed_model=cfg.embed_model,
        transcribe_model=cfg.transcribe_model,
        audio_concurrency=getattr(cfg, "audio_concurrency", 2),
    )
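The two client invariants (short exponential backoff, and a process-wide cap on in-flight audio calls) can be checked in isolation without a gateway. The sketch below is illustrative only: `backoff_schedule` and `run_capped` are hypothetical helper names, and the `time.sleep(0.01)` is a stand-in for the real GPU-bound request.

```python
import threading
import time


def backoff_schedule(retries: int, backoff: float) -> list[float]:
    """Sleep durations between attempts, using the same backoff * 2**attempt rule as _post."""
    return [backoff * (2 ** attempt) for attempt in range(retries)]


def run_capped(n_jobs: int, cap: int) -> int:
    """Start n_jobs threads behind a BoundedSemaphore(cap); return the peak number in flight."""
    sem = threading.BoundedSemaphore(cap)
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def job() -> None:
        nonlocal in_flight, peak
        with sem:  # at most `cap` threads get past this line at once
            with lock:
                in_flight += 1
                peak = max(peak, in_flight)
            time.sleep(0.01)  # stand-in for the audio request
            with lock:
                in_flight -= 1

    threads = [threading.Thread(target=job) for _ in range(n_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


audio_sleeps = backoff_schedule(5, 1.5)    # the audio methods: retries=5, backoff=1.5
default_sleeps = backoff_schedule(4, 5.0)  # the _post defaults: retries=4, backoff=5.0
peak_in_flight = run_capped(6, 2)
```

The audio schedule starts at 1.5s, which fits the 1-4s busy-blips described in the module docstring, whereas the default schedule starts at 5s. Six queued jobs never exceed two concurrent holders of the semaphore.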
@@ -0,0 +1,4 @@
"""Persistence layer: SQLite (metadata, ledger, conviction log, graph, queue).

Qdrant (vectors) is reached via the Spark Control gateway; see signal_engine.spark.
"""
@@ -0,0 +1,81 @@
"""SQLite connection + schema initialization. Boring and inspectable (§5)."""
from __future__ import annotations

import sqlite3
from pathlib import Path

SCHEMA_FILE = Path(__file__).with_name("schema.sql")


def connect(db_path: Path) -> sqlite3.Connection:
    db_path = Path(db_path)
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path), timeout=30)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("PRAGMA busy_timeout = 30000")  # wait, don't fail, under concurrent backfill writers
    return conn


# Additive migrations for DBs created before a column existed (CREATE IF NOT EXISTS won't add columns).
_MIGRATIONS = {
    "documents": {"content_hash": "TEXT", "processed_at": "TEXT", "dedup_key": "TEXT"},
    # DESIGN_v2.1 condition 1: own_network = the Ten31 orbit (Odell/Bent partners etc.): listening to
    # ourselves. Quarantined: a TEST FIXTURE for the reflexivity case, DROPPED in live EISC scoring.
    "sources": {"backtest_2022_2023": "TEXT", "own_network": "INTEGER"},
    # DESIGN_v2.1: tag derivatives by distance-from-edge for TRIAGE: surfaced, NEVER used as a filter
    # (an engine that pre-filters to in-mandate reproduces the AI/compute mandate-expansion miss).
    "fanout_nodes": {"distance_from_edge": "TEXT"},
}


def _widen_cluster_check(conn: sqlite3.Connection) -> None:
    """Add 'banks'/'credit'/'fintech' to sources.source_cluster's CHECK. SQLite can't ALTER a CHECK, so
    rebuild the (tiny) table via the standard table-swap. Idempotent: no-op once already widened. Toggles
    foreign_keys OFF around the swap (DROP would otherwise fail on inbound FKs); data copied by value so
    referential integrity holds. busy_timeout (set in connect) lets it wait out concurrent backfill writers."""
    import re
    row = conn.execute("SELECT sql FROM sqlite_master WHERE type='table' AND name='sources'").fetchone()
    if not row or "'banks'" in row[0]:
        return
    new_list = ("('macro','ai_tech','energy','bitcoin','vc_consensus','generalist',"
                "'banks','credit','fintech')")
    new_ddl = re.sub(r"source_cluster IN\s*\([^)]*\)", f"source_cluster IN {new_list}", row[0], count=1)
    new_ddl = new_ddl.replace("CREATE TABLE sources", "CREATE TABLE sources_new", 1)
    conn.commit()  # close any implicit txn before toggling FK pragma
    conn.execute("PRAGMA foreign_keys=OFF")
    try:
        conn.execute(new_ddl)
        conn.execute("INSERT INTO sources_new SELECT * FROM sources")
        conn.execute("DROP TABLE sources")
        conn.execute("ALTER TABLE sources_new RENAME TO sources")
        conn.commit()
    finally:
        conn.execute("PRAGMA foreign_keys=ON")


def _migrate(conn: sqlite3.Connection) -> None:
    for table, cols in _MIGRATIONS.items():
        existing = {r[1] for r in conn.execute(f"PRAGMA table_info({table})")}
        for col, typ in cols.items():
            if col not in existing:
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} {typ}")
    # indexes on migrated columns (created here so they work on DBs predating the column)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_documents_content_hash ON documents(content_hash)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_documents_dedup_key ON documents(dedup_key)")
    conn.commit()
    _widen_cluster_check(conn)


def init_db(conn: sqlite3.Connection) -> None:
    """Idempotent: CREATE ... IF NOT EXISTS + additive column migrations."""
    conn.executescript(SCHEMA_FILE.read_text())
    conn.commit()
    _migrate(conn)


def table_names(conn: sqlite3.Connection) -> list[str]:
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table','view') ORDER BY name"
    ).fetchall()
    return [r[0] for r in rows]
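The additive-migration pattern above (inspect `PRAGMA table_info`, add only the missing columns) can be exercised against an in-memory database. This is a minimal sketch under stated assumptions: the one-column `documents` table and the trimmed `MIGRATIONS` map are made up for the demonstration, and `migrate` returns the added column names purely so the idempotency is observable.

```python
import sqlite3

# Trimmed, illustrative migration map (the real one covers more tables and columns).
MIGRATIONS = {"documents": {"content_hash": "TEXT", "dedup_key": "TEXT"}}


def migrate(conn: sqlite3.Connection) -> list[str]:
    """Add any column listed in MIGRATIONS that the table lacks; return the names added."""
    added = []
    for table, cols in MIGRATIONS.items():
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        for col, typ in cols.items():
            if col not in existing:
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} {typ}")
                added.append(col)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
# Simulate a database created before the two columns existed.
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY)")

first_run = migrate(conn)   # adds both columns
second_run = migrate(conn)  # nothing left to add
columns = [row[1] for row in conn.execute("PRAGMA table_info(documents)")]
```

Because the check reads the live table definition rather than a stored schema version, re-running the migration is a no-op, which is what lets `init_db` be called on every start.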
@@ -0,0 +1,280 @@
-- Ten31 Signal Engine: SQLite schema (pilot)
-- Source of truth: ten31-signal-engine-handoff.md §4 (pipeline layers), §6.7 (ledger),
-- §3.1 (conviction log), §13.4 (backfill queue).
-- Design principle (§5, §10): boring, inspectable tables. The whole system state is a SELECT away.

PRAGMA journal_mode = WAL;
PRAGMA foreign_keys = ON;

-- ============================================================================
-- CANONICAL TOPIC VOCABULARY (§4.2), HYBRID (operator decision):
-- seeded controlled list + emergent topics merged in on a schedule.
-- ============================================================================
CREATE TABLE IF NOT EXISTS topics (
    topic_canonical TEXT PRIMARY KEY,
    status          TEXT CHECK (status IN ('controlled','emergent','merged')) DEFAULT 'emergent',
    merged_into     TEXT REFERENCES topics(topic_canonical),
    seam            TEXT,
    created_at      TEXT DEFAULT (datetime('now'))
);

-- ============================================================================
-- SOURCES & DOCUMENTS (§4.1)
-- ============================================================================
CREATE TABLE IF NOT EXISTS sources (
    source_id          TEXT PRIMARY KEY,
    name               TEXT NOT NULL,
    kind               TEXT NOT NULL CHECK (kind IN ('podcast','youtube','filing','earnings_call')),
    source_cluster     TEXT CHECK (source_cluster IN
                           ('macro','ai_tech','energy','bitcoin','vc_consensus','generalist','banks','credit','fintech')),
    role               TEXT CHECK (role IN ('CB','IND','DX','none')) DEFAULT 'none',  -- §7.4
    rss_url            TEXT,
    channel_url        TEXT,
    ticker             TEXT,
    -- §8 credibility: neutral prior that DECAYS in favor of earned track record from the ledger.
    bootstrap_prior    REAL DEFAULT 1.0,
    earned_credibility REAL,
    cluster_capped_low INTEGER DEFAULT 0,  -- §4.5 bitcoin cluster deliberately under-weighted
    backtest_2022_2023 TEXT,  -- §7.1 reach: rss_full | rss_2023_only | youtube_only | launched_later | unavailable
    notes              TEXT,
    created_at         TEXT DEFAULT (datetime('now'))
);

CREATE TABLE IF NOT EXISTS documents (
    doc_id          TEXT PRIMARY KEY,
    source_id       TEXT NOT NULL REFERENCES sources(source_id),
    kind            TEXT NOT NULL,  -- podcast|youtube|filing|earnings_call
    external_id     TEXT,           -- rss guid / yt video id / EDGAR accession / transcript id
    url             TEXT,
    title           TEXT,
    date            TEXT,           -- ISO publication/filing date
    duration_sec    REAL,
    raw_path        TEXT,           -- downloaded audio / raw filing
    transcript_path TEXT,
    -- DEDUP MODEL (layered):
    --   (1) UNIQUE(source_id, external_id) below = the ROBUST guard. external_id is the stable item id
    --       (RSS GUID / YouTube video id / EDGAR accession). Checked at ingest, BEFORE any GPU work.
    --   (2) dedup_key = normalized title+date → catches the SAME episode arriving via a different
    --       feed/mirror (different external_id). Computed pre-transcription. NOT from the transcript.
    --   content_hash is ONLY an audit fingerprint of the transcript (did a re-run change?); it is NOT
    --   a dedup key (ASR is non-deterministic, so one differing word flips the hash).
    dedup_key       TEXT,
    content_hash    TEXT,
    processed_at    TEXT,           -- set when transcription/extraction completes
    ingested_at     TEXT DEFAULT (datetime('now')),
    UNIQUE (source_id, external_id)  -- idempotent ingest (§13.4 dedup)
);
-- indexes for dedup_key / content_hash are created in db._migrate (after columns exist on older DBs).

-- ============================================================================
-- CLAIMS / PROPOSITIONS (§4.2): the atomic unit of the whole system.
-- One passage emits 0..N claims; MOST of a podcast hour is 0 (§4.2). The
-- extractor must be willing to find nothing.
-- NOTE: thesis_seam is a TAG, never a hard filter (§5.7); off-thesis &
-- anti-thesis claims MUST survive.
-- ============================================================================
CREATE TABLE IF NOT EXISTS claims (
    claim_id            TEXT PRIMARY KEY,
    doc_id              TEXT NOT NULL REFERENCES documents(doc_id),
    source_id           TEXT NOT NULL REFERENCES sources(source_id),
    proposition         TEXT NOT NULL,  -- normalized subject-assertion-object
    topic_canonical     TEXT REFERENCES topics(topic_canonical),
    topic_raw           TEXT,
    claimant            TEXT,
    source_cluster      TEXT,
    date                TEXT,
    claim_type          TEXT CHECK (claim_type IN ('interpretive','predictive','descriptive','reactive')),
    time_horizon        TEXT CHECK (time_horizon IN ('near','medium','long','unspecified')),
    confidence          TEXT CHECK (confidence IN ('low','med','high')),
    -- §4.2 relation: stance is EXTRACTED, never inferred from vector distance (§2.2/§5.3).
    rel_target_claim_id TEXT REFERENCES claims(claim_id),
    rel_polarity        TEXT CHECK (rel_polarity IN ('affirms','denies','qualifies','none')) DEFAULT 'none',
    engages_consensus   INTEGER DEFAULT 0,
    counters_position   TEXT,
    thesis_seam         TEXT CHECK (thesis_seam IN
                            ('energy_compute','debasement_bitcoin','ai_data_ownership','none')) DEFAULT 'none',
    salience            TEXT CHECK (salience IN ('central','secondary','aside')) DEFAULT 'secondary',
    qdrant_point_id     TEXT,  -- link to the embedded proposition vector (§4.3)
    extracted_at        TEXT DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_claims_topic ON claims(topic_canonical);
CREATE INDEX IF NOT EXISTS idx_claims_date ON claims(date);
CREATE INDEX IF NOT EXISTS idx_claims_seam ON claims(thesis_seam);
CREATE INDEX IF NOT EXISTS idx_claims_type ON claims(claim_type);

-- ============================================================================
-- SOURCE-INDEPENDENCE GRAPH (§4.5): discount convergence by connectedness.
-- Cross-cluster convergence = gold; within-cluster = near-noise.
-- ============================================================================
CREATE TABLE IF NOT EXISTS source_edges (
    src_a      TEXT NOT NULL REFERENCES sources(source_id),
    src_b      TEXT NOT NULL REFERENCES sources(source_id),
    edge_type  TEXT NOT NULL CHECK (edge_type IN ('shared_guest','citation','community')),
    weight     REAL DEFAULT 1.0,
    evidence   TEXT,  -- voiceprint_id / show-note ref / url
    updated_at TEXT DEFAULT (datetime('now')),
    PRIMARY KEY (src_a, src_b, edge_type)
);

-- ============================================================================
-- VOICEPRINT LIBRARY (§4.5, §4.1): same-guest-across-shows BY VOICE.
-- 192-dim TitaNet voiceprints; cosine ~0.7 distance threshold for same speaker.
-- This is the highest-leverage automated input to the independence graph.
-- ============================================================================
CREATE TABLE IF NOT EXISTS voiceprints (
    voiceprint_id TEXT PRIMARY KEY,
    vector        BLOB NOT NULL,  -- 192 x float32
    person_label  TEXT,           -- resolved name if known
    first_doc_id  TEXT REFERENCES documents(doc_id),
    first_seen    TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS voiceprint_observations (
    obs_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    voiceprint_id TEXT NOT NULL REFERENCES voiceprints(voiceprint_id),
    doc_id        TEXT NOT NULL REFERENCES documents(doc_id),
    chunk_idx     INTEGER,
    segment_start REAL,
    segment_end   REAL
);

-- ============================================================================
-- CONVICTION LOG (§3.1): human-owned seed nodes for Job B.
-- Structural rule (§3.1): separate the TRACKABLE thematic proposition (corpus
-- can corroborate) from TEAM conviction (context only). The engine must NEVER
-- present theme corroboration as validation of the team bet beneath it.
-- Exposure scored as coarse NAV bands (operator decision): none | lt2 | 2to10 | gt10 | unset.
-- ============================================================================
CREATE TABLE IF NOT EXISTS conviction_log (
    conviction_id        TEXT PRIMARY KEY,  -- R1, E1, A1, B1 ...
    seam                 TEXT,              -- root|energy_compute|debasement_bitcoin|ai_data_ownership
    thematic_proposition TEXT NOT NULL,     -- the TRACKABLE half
    team_conviction_note TEXT,              -- context ONLY, never scored as theme validation
    conviction_level     TEXT CHECK (conviction_level IN ('low','med','med-high','high')),
    current_exposure     TEXT CHECK (current_exposure IN ('none','lt2','2to10','gt10','unset')) DEFAULT 'unset',
    exposure_note        TEXT,              -- original §3.1 prose ("pervasive", "MED-HIGH") pending NAV-band finalization
    disconfirming_signal TEXT,
    is_thesis_breaker    INTEGER DEFAULT 0, -- §3.1 B1-B3: engine must surface these AGAINST the thesis (§5.7)
    updated_at           TEXT DEFAULT (datetime('now'))
);

-- Conviction fan-out tree (§4.6). A derivative is a HYPOTHESIS until independent
-- corpus corroboration AND the exposure gap both clear the bar; then 'signal'.
CREATE TABLE IF NOT EXISTS fanout_nodes (
    node_id                TEXT PRIMARY KEY,
    parent_conviction_id   TEXT REFERENCES conviction_log(conviction_id),
    parent_node_id         TEXT REFERENCES fanout_nodes(node_id),
    derivative_proposition TEXT NOT NULL,
    depth                  INTEGER DEFAULT 1,
    status                 TEXT CHECK (status IN ('hypothesis','corroborated','signal')) DEFAULT 'hypothesis',
    created_at             TEXT DEFAULT (datetime('now'))
);

-- ============================================================================
-- DUAL-EVALUATION LEDGER (§4.7, §6): START DAY ONE; the clock can't be backfilled.
-- Log EVERY candidate that clears the quantitative bar (§6.6: you need a denominator).
-- ============================================================================
CREATE TABLE IF NOT EXISTS ledger (
    signal_id            TEXT PRIMARY KEY,
    type                 TEXT NOT NULL CHECK (type IN ('theme','event','under_acted_conviction')),
    proposition          TEXT NOT NULL,
    date_logged          TEXT NOT NULL DEFAULT (datetime('now')),
    discourse_metric     TEXT,  -- JSON: acceleration, cross-cluster source set, independence-discounted count
    external_check       TEXT,  -- JSON: resolution spec / nested clean events the model proposed (§6.5)
    resolution_date      TEXT,
    discourse_outcome    TEXT CHECK (discourse_outcome IN
                             ('up_cross_cluster','up_single_cluster','flat','down')),
    external_outcome     TEXT CHECK (external_outcome IN
                             ('correct','partial','wrong','unresolved_expired','too_early')),
    lead_time_days       INTEGER,  -- §6.3 THE alpha measurement (to the DERIVATIVE node for Job B)
    model_confidence     REAL,     -- §6.7 logged ONLY to measure its uselessness; NEVER fed into scoring
    origin_conviction_id TEXT REFERENCES conviction_log(conviction_id),  -- Job B traceability
    origin_node_id       TEXT REFERENCES fanout_nodes(node_id)
);
CREATE INDEX IF NOT EXISTS idx_ledger_type ON ledger(type);
CREATE INDEX IF NOT EXISTS idx_ledger_logged ON ledger(date_logged);

-- Human eval on a SEPARATE write path (§6.7): "keep them in separate columns and do not let the
-- model see Grant's rating before it logs its prediction." The model-facing code reads `ledger`;
-- ONLY the eval UI writes here. A separate table makes that separation structural, not a convention.
CREATE TABLE IF NOT EXISTS human_evaluations (
    signal_id    TEXT PRIMARY KEY REFERENCES ledger(signal_id),
    grant_rating INTEGER,  -- "non-obvious and relevant to me?" (e.g. 1-5)
    non_obvious  INTEGER,  -- 0/1
    notes        TEXT,
    rated_at     TEXT DEFAULT (datetime('now'))
);

-- Reporting view: the valuable cell is DISAGREEMENT (§6.7). Used for analysis, NOT by the model path.
CREATE VIEW IF NOT EXISTS v_ledger_eval AS
SELECT l.*, h.grant_rating, h.non_obvious, h.notes AS grant_notes, h.rated_at
FROM ledger l LEFT JOIN human_evaluations h ON h.signal_id = l.signal_id;

-- ============================================================================
-- BACKFILL QUEUE (§13.4): client-side, measured in GPU-HOURS.
-- Extraction (one LLM pass per chunk over the whole corpus) is the HEAVIER serial load.
-- Audio is capped at 2 in-flight, global (one serial GPU; going wider only queues or draws 503s).
-- Leases give crash-safe resumability.
-- ============================================================================
CREATE TABLE IF NOT EXISTS backfill_jobs (
    job_id           INTEGER PRIMARY KEY AUTOINCREMENT,
    job_type         TEXT NOT NULL CHECK (job_type IN ('transcribe','diarize','extract','embed')),
    target_id        TEXT NOT NULL,        -- doc_id or chunk id
    parent_doc_id    TEXT,
    state            TEXT NOT NULL CHECK (state IN
                         ('pending','leased','running','done','failed','skipped')) DEFAULT 'pending',
    priority         INTEGER DEFAULT 100,  -- lower = sooner (backtest corpus jumps the queue, §7.1)
    attempts         INTEGER DEFAULT 0,
    max_attempts     INTEGER DEFAULT 5,
    lease_owner      TEXT,
    lease_expires_at TEXT,
    input_hash       TEXT NOT NULL,        -- hash(content + model/prompt version), for idempotency
    output_ref       TEXT,
    gpu_seconds      REAL,                 -- measured per job → self-calibrating GPU-hours estimate
    error            TEXT,
    created_at       TEXT DEFAULT (datetime('now')),
    updated_at       TEXT DEFAULT (datetime('now')),
    UNIQUE (job_type, input_hash)
);
CREATE INDEX IF NOT EXISTS idx_jobs_state_priority ON backfill_jobs(state, priority, job_id);

-- ============================================================================
|
||||
-- SCORING BRAIN state (the "brain", build blueprint). Candidate state lands here +
|
||||
-- ledger + fanout_nodes.status; existing tables unchanged.
|
||||
-- ============================================================================
|
||||
|
||||
-- Temporal layer: one row per (topic, as_of, window). 28d non-overlapping windows.
|
||||
CREATE TABLE IF NOT EXISTS topic_window_stats (
|
||||
topic_canonical TEXT NOT NULL,
|
||||
as_of TEXT NOT NULL,
|
||||
window_idx INTEGER NOT NULL, -- 0 = window ending at as_of, 1 = prior, 2 = baseline
|
||||
window_start TEXT NOT NULL,
|
||||
window_end TEXT NOT NULL,
|
||||
n_interp_pred INTEGER NOT NULL DEFAULT 0,
|
||||
n_descr_react INTEGER NOT NULL DEFAULT 0,
|
||||
n_distinct_src INTEGER NOT NULL DEFAULT 0,
|
||||
n_distinct_clu INTEGER NOT NULL DEFAULT 0,
|
||||
PRIMARY KEY (topic_canonical, as_of, window_idx)
|
||||
);
|
||||
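To make the window layout concrete, here is a small sketch of how the three 28-day windows for one `as_of` date could be derived. It assumes the baseline window (`window_idx = 2`) is also 28 days and that adjacent windows share a boundary date, so a real query would have to treat one side as exclusive; `window_bounds` is a hypothetical name.

```python
from datetime import date, timedelta

WINDOW_DAYS = 28  # per the schema comment: 28d non-overlapping windows


def window_bounds(as_of: str, n_windows: int = 3):
    """Return (window_idx, window_start, window_end) tuples, newest window first."""
    end = date.fromisoformat(as_of)
    out = []
    for idx in range(n_windows):
        start = end - timedelta(days=WINDOW_DAYS)
        out.append((idx, start.isoformat(), end.isoformat()))
        end = start  # the next (older) window ends where this one starts
    return out
```

For `as_of = "2023-03-01"` this yields windows ending 2023-03-01, 2023-02-01 and 2023-01-04, each starting 28 days earlier.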
|
||||
-- Audit trail: one row per (scorer, key, as_of). Deterministic score_id → re-run reproduces.
|
||||
CREATE TABLE IF NOT EXISTS candidate_scores (
|
||||
score_id TEXT PRIMARY KEY,
|
||||
scorer TEXT NOT NULL, -- emergence|contrarian|intersection|convergence|under_acted
|
||||
as_of TEXT NOT NULL,
|
||||
topic_canonical TEXT,
|
||||
node_id TEXT,
|
||||
conviction_id TEXT,
|
||||
score REAL NOT NULL,
|
||||
cleared_evidence_bar INTEGER NOT NULL DEFAULT 0, -- tier 1: logged to ledger (the denominator)
|
||||
cleared_promotion_bar INTEGER NOT NULL DEFAULT 0, -- tier 2: sent to frontier judge
|
||||
inputs_json TEXT NOT NULL, -- every term that produced the score (full audit)
|
||||
computed_at TEXT DEFAULT (datetime('now'))
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_cs_asof ON candidate_scores(scorer, as_of, cleared_promotion_bar);
|
||||
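The "deterministic score_id" property can be obtained by hashing the identifying fields of a score row. The sketch below is one possible construction, not necessarily the one the scorers use: the field order, the `|` separator, and the 32-character truncation are all assumptions, and `make_score_id` is a hypothetical name.

```python
import hashlib


def make_score_id(scorer, as_of, topic_canonical=None, node_id=None, conviction_id=None):
    """Derive a stable id from (scorer, key, as_of) so a re-run lands on the same row."""
    key = "|".join([scorer, as_of, topic_canonical or "", node_id or "", conviction_id or ""])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]
```

Because the id depends only on its inputs, re-running a scorer for the same `as_of` can upsert on `score_id` instead of appending a duplicate audit row.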
|
||||
-- Tunable bar config so the backtest can sweep thresholds without code edits.
|
||||
CREATE TABLE IF NOT EXISTS score_thresholds (
|
||||
scorer TEXT PRIMARY KEY,
|
||||
min_score REAL,
|
||||
gates_json TEXT,
|
||||
version TEXT
|
||||
);
|
||||
@@ -0,0 +1,74 @@
|
"""Load human-owned seed data (conviction log, §3.1) into SQLite.

The conviction log is the highest-leverage Job B input (§3.1) and is HUMAN-OWNED:
Grant edits the YAML seed files; this loader upserts them. Re-running is idempotent.
"""
from __future__ import annotations

import sqlite3
from pathlib import Path
from typing import Any

import yaml

_CONVICTION_COLS = (
    "conviction_id",
    "seam",
    "thematic_proposition",
    "team_conviction_note",
    "conviction_level",
    "current_exposure",
    "exposure_note",
    "disconfirming_signal",
    "is_thesis_breaker",
)


def _row(c: dict[str, Any]) -> dict[str, Any]:
    """Map one YAML conviction entry onto the conviction_log columns."""
    return {
        "conviction_id": c["id"],
        "seam": c.get("seam"),
        "thematic_proposition": c["thematic_proposition"],
        "team_conviction_note": c.get("team_conviction_note"),
        "conviction_level": c.get("conviction_level"),
        "current_exposure": c.get("current_exposure", "unset"),
        "exposure_note": c.get("exposure_note"),
        "disconfirming_signal": c.get("disconfirming_signal"),
        "is_thesis_breaker": 1 if c.get("is_thesis_breaker") else 0,
    }


def load_fanout(conn: sqlite3.Connection, path: Path) -> int:
    """Load a hand-written fan-out tree (§7.1 backtest). Idempotent on node_id."""
    data = yaml.safe_load(Path(path).read_text(encoding="utf-8")) or {}
    parent = data["parent_conviction_id"]
    nodes = data.get("nodes", [])
    for n in nodes:
        # status is not overwritten on conflict, so any state written there later survives a
        # re-seed; depth is refreshed so an edited tree does not keep a stale depth.
        conn.execute(
            """INSERT INTO fanout_nodes
               (node_id, parent_conviction_id, derivative_proposition, depth, status, distance_from_edge)
               VALUES (?,?,?,?, 'hypothesis', ?)
               ON CONFLICT(node_id) DO UPDATE SET derivative_proposition=excluded.derivative_proposition,
                 parent_conviction_id=excluded.parent_conviction_id,
                 depth=excluded.depth,
                 distance_from_edge=excluded.distance_from_edge""",
            (n["node_id"], parent, n["derivative_proposition"], n.get("depth", 1), n.get("distance_from_edge")),
        )
    conn.commit()
    return len(nodes)


def load_convictions(conn: sqlite3.Connection, path: Path) -> int:
    """Upsert every entry under `convictions` in the YAML seed file; returns the number processed."""
    data = yaml.safe_load(Path(path).read_text(encoding="utf-8")) or {}
    rows = data.get("convictions", [])
    cols = ", ".join(_CONVICTION_COLS)
    placeholders = ", ".join(f":{c}" for c in _CONVICTION_COLS)
    updates = ", ".join(f"{c}=excluded.{c}" for c in _CONVICTION_COLS if c != "conviction_id")
    sql = (
        f"INSERT INTO conviction_log ({cols}, updated_at) "
        f"VALUES ({placeholders}, datetime('now')) "
        f"ON CONFLICT(conviction_id) DO UPDATE SET {updates}, updated_at=datetime('now')"
    )
    for c in rows:
        conn.execute(sql, _row(c))
    conn.commit()
    return len(rows)
|
||||
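The idempotency claim for the conviction loader rests on SQLite's `INSERT ... ON CONFLICT DO UPDATE` upsert. The sketch below exercises that statement shape against an in-memory database. The `conviction_log` table here is a hypothetical four-column stand-in (the real schema is not shown in this file), and `upsert_sql` is an illustrative helper rather than part of the loader.

```python
import sqlite3

COLS = ("conviction_id", "thematic_proposition", "conviction_level", "current_exposure")

# Hypothetical minimal conviction_log; the real table carries more columns.
SCHEMA = """
CREATE TABLE conviction_log (
  conviction_id        TEXT PRIMARY KEY,
  thematic_proposition TEXT NOT NULL,
  conviction_level     TEXT,
  current_exposure     TEXT,
  updated_at           TEXT
);
"""


def upsert_sql(table, cols, key):
    """Build a named-parameter upsert that overwrites every non-key column on conflict."""
    names = ", ".join(cols)
    params = ", ".join(f":{c}" for c in cols)
    updates = ", ".join(f"{c}=excluded.{c}" for c in cols if c != key)
    return (
        f"INSERT INTO {table} ({names}, updated_at) VALUES ({params}, datetime('now')) "
        f"ON CONFLICT({key}) DO UPDATE SET {updates}, updated_at=datetime('now')"
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sql = upsert_sql("conviction_log", COLS, "conviction_id")
seed = {
    "conviction_id": "E1",
    "thematic_proposition": "Power, not chips, is the binding constraint",
    "conviction_level": "high",
    "current_exposure": "med",
}
conn.execute(sql, seed)                                   # first load inserts
conn.execute(sql, {**seed, "current_exposure": "high"})   # edited seed, re-run updates in place
conn.commit()
```

After both statements the table still holds a single `E1` row, now carrying the edited exposure, which is the behaviour a human-owned seed file needs.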
@@ -0,0 +1,90 @@
|
||||
"""Load the source registry (companies + podcasts, §7.3/§7.4) into SQLite. Idempotent upsert."""
|
||||
from __future__ import annotations
|
||||
|
||||
import sqlite3
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import yaml
|
||||
|
||||
_COLS = ("source_id", "name", "kind", "source_cluster", "role", "rss_url",
|
||||
"channel_url", "ticker", "cluster_capped_low", "own_network", "backtest_2022_2023", "notes")
|
||||
|
||||
|
||||
def _row(s: dict[str, Any]) -> dict[str, Any]:
|
||||
return {
|
||||
"source_id": s["id"],
|
||||
"name": s["name"],
|
||||
"kind": s["kind"],
|
||||
"source_cluster": s.get("cluster"),
|
||||
"role": s.get("role", "none"),
|
||||
"rss_url": s.get("rss_url"),
|
||||
"channel_url": s.get("channel_url"),
|
||||
"ticker": s.get("ticker"),
|
||||
"cluster_capped_low": 1 if s.get("cluster_capped_low") else 0,
|
||||
"own_network": 1 if s.get("own_network") else 0,
|
||||
"backtest_2022_2023": s.get("backtest_2022_2023"),
|
||||
"notes": s.get("notes"),
|
||||
}
|
||||
|
||||
|
||||
def update_feeds(conn: sqlite3.Connection, path: Path) -> int:
|
||||
"""Apply resolved/verified podcast feed URLs + backtest-reach to existing source rows."""
|
||||
try:
|
||||
conn.execute("ALTER TABLE sources ADD COLUMN backtest_2022_2023 TEXT")
|
||||
conn.commit()
|
||||
except sqlite3.OperationalError:
|
||||
pass # column already exists
|
||||
data = yaml.safe_load(Path(path).read_text()) or {}
|
||||
rows = data.get("feeds", [])
|
||||
for f in rows:
|
||||
conn.execute(
|
||||
"""UPDATE sources
|
||||
SET rss_url=:rss_url, channel_url=:youtube_channel_url,
|
||||
backtest_2022_2023=:backtest_2022_2023, notes=COALESCE(:note, notes)
|
||||
WHERE source_id=:id""",
|
||||
{
|
||||
"id": f["id"], "rss_url": f.get("rss_url"),
|
||||
"youtube_channel_url": f.get("youtube_channel_url"),
|
||||
"backtest_2022_2023": f.get("backtest_2022_2023"), "note": f.get("note"),
|
||||
},
|
||||
)
|
||||
conn.commit()
|
||||
return len(rows)
|
||||
|
||||
|
||||
def load_source_edges(conn: sqlite3.Connection, path: Path) -> int:
|
||||
"""Seed EISC connectedness edges (priors) idempotently. Stores src_a,src_b in sorted order to
|
||||
match the transcribe_worker's convention (sorted([a,b]) + ON CONFLICT weight+=1) so real detections
|
||||
accumulate on the same PK instead of creating a reversed duplicate. DO NOTHING on conflict → a
|
||||
re-run won't inflate, and won't clobber a stronger auto-detected weight."""
|
||||
data = yaml.safe_load(Path(path).read_text()) or {}
|
||||
rows = data.get("edges", [])
|
||||
applied = 0
|
||||
for e in rows:
|
||||
a, b = sorted([e["a"], e["b"]])
|
||||
cur = conn.execute(
|
||||
"""INSERT INTO source_edges (src_a, src_b, edge_type, weight, evidence)
|
||||
VALUES (?,?,?,?,?)
|
||||
ON CONFLICT(src_a, src_b, edge_type) DO NOTHING""",
|
||||
(a, b, e["type"], float(e.get("weight", 1.0)), e.get("evidence")),
|
||||
)
|
||||
applied += cur.rowcount
|
||||
conn.commit()
|
||||
return applied
|
||||
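The docstring's point about sorted endpoints is easiest to see with both writers side by side. This sketch assumes `source_edges` has the primary key `(src_a, src_b, edge_type)` implied by the `ON CONFLICT` target; `record_detection` stands in for the transcribe worker's upsert as the docstring describes it, and the `shared_guest` edge type is made up for illustration.

```python
import sqlite3

# Assumed shape of source_edges, inferred from the ON CONFLICT target above.
SCHEMA = """
CREATE TABLE source_edges (
  src_a TEXT NOT NULL, src_b TEXT NOT NULL, edge_type TEXT NOT NULL,
  weight REAL NOT NULL DEFAULT 1.0, evidence TEXT,
  PRIMARY KEY (src_a, src_b, edge_type)
);
"""


def seed_edge(conn, a, b, edge_type, weight=1.0):
    """Seed a prior edge; returns 1 if inserted, 0 if the edge already existed."""
    a, b = sorted([a, b])  # canonical order, so (a, b) and (b, a) share one row
    cur = conn.execute(
        "INSERT INTO source_edges (src_a, src_b, edge_type, weight) VALUES (?,?,?,?) "
        "ON CONFLICT(src_a, src_b, edge_type) DO NOTHING",
        (a, b, edge_type, weight),
    )
    return cur.rowcount


def record_detection(conn, a, b, edge_type):
    """Stand-in for the worker side: each real detection bumps the weight by one."""
    a, b = sorted([a, b])
    conn.execute(
        "INSERT INTO source_edges (src_a, src_b, edge_type, weight) VALUES (?,?,?,1.0) "
        "ON CONFLICT(src_a, src_b, edge_type) DO UPDATE SET weight = weight + 1",
        (a, b, edge_type),
    )
```

Seeding `("pod-b", "pod-a")` with weight 2.0, re-seeding, and then recording one detection leaves a single `pod-a`/`pod-b` row with weight 3.0: the re-seed neither inflates nor resets it.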
|
||||
|
||||
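def load_sources(conn: sqlite3.Connection, path: Path) -> int:
    """Upsert every entry under `sources` in the YAML registry; returns the number of rows processed."""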
|
||||
data = yaml.safe_load(Path(path).read_text()) or {}
|
||||
rows = data.get("sources", [])
|
||||
cols = ", ".join(_COLS)
|
||||
placeholders = ", ".join(f":{c}" for c in _COLS)
|
||||
updates = ", ".join(f"{c}=excluded.{c}" for c in _COLS if c != "source_id")
|
||||
sql = (
|
||||
f"INSERT INTO sources ({cols}, created_at) VALUES ({placeholders}, datetime('now')) "
|
||||
f"ON CONFLICT(source_id) DO UPDATE SET {updates}"
|
||||
)
|
||||
for s in rows:
|
||||
conn.execute(sql, _row(s))
|
||||
conn.commit()
|
||||
return len(rows)
|
||||
@@ -0,0 +1,5 @@
|
||||
"""Web UI (FastAPI) — corpus management + (later) the human-eval rating interface (§4.7/§6.7).
|
||||
|
||||
This is the app the StartOS s9pk exposes on its `ui` interface. Server-rendered HTML, no template
|
||||
engine / JS framework — boring and inspectable, like the rest of the system.
|
||||
"""
|
||||
@@ -0,0 +1,179 @@
|
||||
"""Corpus-management web UI (FastAPI).
|
||||
|
||||
Pages:
|
||||
/ dashboard — corpus + pipeline counts at a glance
|
||||
/corpus full source selection (companies + podcasts) + "add source" form
|
||||
/corpus/add POST handler (manual urlencoded parse → no python-multipart dependency)
|
||||
/source/{id} per-source detail: documents + extracted claims (inspect the signal)
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import html
|
||||
import re
|
||||
import sqlite3
|
||||
from urllib.parse import parse_qs
|
||||
|
||||
from fastapi import FastAPI, Request
|
||||
from fastapi.responses import HTMLResponse, RedirectResponse
|
||||
|
||||
from ..config import load_config
|
||||
from ..store import db
|
||||
|
||||
_CSS = """
|
||||
body{font:14px/1.5 -apple-system,Segoe UI,Roboto,sans-serif;margin:0;background:#0f1115;color:#e6e6e6}
|
||||
header{background:#161a22;padding:12px 20px;border-bottom:1px solid #2a2f3a}
|
||||
header a{color:#7aa2f7;text-decoration:none;margin-right:18px;font-weight:600}
|
||||
main{padding:20px;max-width:1100px;margin:0 auto}
|
||||
h1{font-size:20px}h2{font-size:16px;margin-top:28px;color:#9aa5b1}
|
||||
table{border-collapse:collapse;width:100%;margin:10px 0}
|
||||
th,td{text-align:left;padding:6px 10px;border-bottom:1px solid #232833;font-size:13px}
|
||||
th{color:#9aa5b1;font-weight:600}
|
||||
tr:hover td{background:#161a22}
|
||||
.tag{display:inline-block;padding:1px 7px;border-radius:10px;background:#232833;font-size:11px;color:#aab}
|
||||
.cards{display:flex;gap:14px;flex-wrap:wrap}
|
||||
.card{background:#161a22;border:1px solid #2a2f3a;border-radius:8px;padding:14px 18px;min-width:130px}
|
||||
.card .n{font-size:24px;font-weight:700;color:#7aa2f7}.card .l{color:#9aa5b1;font-size:12px}
|
||||
form{background:#161a22;border:1px solid #2a2f3a;border-radius:8px;padding:16px;margin:14px 0}
|
||||
label{display:block;margin:8px 0 2px;color:#9aa5b1;font-size:12px}
|
||||
input,select{background:#0f1115;border:1px solid #2a2f3a;color:#e6e6e6;border-radius:5px;padding:6px 8px;width:240px}
|
||||
button{background:#7aa2f7;color:#0f1115;border:0;border-radius:6px;padding:8px 16px;font-weight:700;cursor:pointer;margin-top:12px}
|
||||
a{color:#7aa2f7}.muted{color:#6b7280;font-size:12px}
|
||||
"""
|
||||
|
||||
_CLUSTERS = ["macro", "ai_tech", "energy", "bitcoin", "vc_consensus", "generalist"]
|
||||
_KINDS = ["podcast", "youtube", "filing", "earnings_call"]
|
||||
_ROLES = ["none", "CB", "IND", "DX"]
|
||||
|
||||
|
||||
def _page(title: str, body: str) -> HTMLResponse:
|
||||
nav = ('<header><a href="/">Dashboard</a><a href="/corpus">Corpus</a>'
|
||||
'<span class="muted">Ten31 Signal Engine</span></header>')
|
||||
doc = f"<!doctype html><html><head><meta charset=utf-8><title>{html.escape(title)}</title>" \
|
||||
f"<style>{_CSS}</style></head><body>{nav}<main>{body}</main></body></html>"
|
||||
return HTMLResponse(doc)
|
||||
|
||||
|
||||
def _slug(s: str) -> str:
|
||||
return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")[:40] or "src"
|
||||
|
||||
|
||||
def create_app() -> FastAPI:
|
||||
cfg = load_config()
|
||||
app = FastAPI(title="Ten31 Signal Engine")
|
||||
|
||||
def conn() -> sqlite3.Connection:
|
||||
c = db.connect(cfg.db_path)
|
||||
db.init_db(c)
|
||||
return c
|
||||
|
||||
@app.get("/", response_class=HTMLResponse)
|
||||
def dashboard() -> HTMLResponse:
|
||||
c = conn()
|
||||
def scalar(q, *a):
|
||||
r = c.execute(q, a).fetchone()
|
||||
return r[0] if r else 0
|
||||
cards = {
|
||||
"Sources": scalar("SELECT COUNT(*) FROM sources"),
|
||||
"Documents": scalar("SELECT COUNT(*) FROM documents"),
|
||||
"Claims": scalar("SELECT COUNT(*) FROM claims"),
|
||||
"Embedded": scalar("SELECT COUNT(*) FROM claims WHERE qdrant_point_id IS NOT NULL"),
|
||||
"Convictions": scalar("SELECT COUNT(*) FROM conviction_log"),
|
||||
"Ledger": scalar("SELECT COUNT(*) FROM ledger"),
|
||||
}
|
||||
cards_html = "".join(f'<div class="card"><div class="n">{v}</div><div class="l">{k}</div></div>'
|
||||
for k, v in cards.items())
|
||||
# breakdowns
|
||||
def rows(q):
|
||||
return "".join(f"<tr><td>{html.escape(str(a))}</td><td>{b}</td></tr>" for a, b in c.execute(q))
|
||||
claims_by_type = rows("SELECT claim_type, COUNT(*) FROM claims GROUP BY claim_type ORDER BY 2 DESC")
|
||||
claims_by_seam = rows("SELECT thesis_seam, COUNT(*) FROM claims GROUP BY thesis_seam ORDER BY 2 DESC")
|
||||
queue = rows("SELECT job_type||' / '||state, COUNT(*) FROM backfill_jobs GROUP BY 1 ORDER BY 1")
|
||||
c.close()
|
||||
body = f"""<h1>Dashboard</h1><div class="cards">{cards_html}</div>
|
||||
<h2>Claims by type</h2><table><tr><th>type</th><th>n</th></tr>{claims_by_type or '<tr><td class=muted colspan=2>none yet</td></tr>'}</table>
|
||||
<h2>Claims by thesis seam</h2><table><tr><th>seam</th><th>n</th></tr>{claims_by_seam or '<tr><td class=muted colspan=2>none yet</td></tr>'}</table>
|
||||
<h2>Backfill queue</h2><table><tr><th>type / state</th><th>n</th></tr>{queue or '<tr><td class=muted colspan=2>empty</td></tr>'}</table>"""
|
||||
return _page("Dashboard", body)
|
||||
|
||||
@app.get("/corpus", response_class=HTMLResponse)
|
||||
def corpus() -> HTMLResponse:
|
||||
c = conn()
|
||||
srcs = c.execute("""
|
||||
SELECT s.*,
|
||||
(SELECT COUNT(*) FROM documents d WHERE d.source_id=s.source_id) docs,
|
||||
(SELECT COUNT(*) FROM claims cl WHERE cl.source_id=s.source_id) claims
|
||||
FROM sources s ORDER BY s.kind, s.source_id""").fetchall()
|
||||
c.close()
|
||||
|
||||
def row(s):
|
||||
extra = s["ticker"] or s["backtest_2022_2023"] or ""
|
||||
return (f"<tr><td><a href='/source/{html.escape(s['source_id'])}'>{html.escape(s['name'])}</a></td>"
|
||||
                    f"<td><span class=tag>{html.escape(s['kind'] or '')}</span></td><td>{html.escape(s['source_cluster'] or '')}</td>"
|
||||
                    f"<td>{html.escape(s['role'] or '')}</td><td>{html.escape(str(extra))}</td>"
|
||||
f"<td>{s['docs']}</td><td>{s['claims']}</td></tr>")
|
||||
table = "".join(row(s) for s in srcs)
|
||||
opt = lambda xs: "".join(f"<option>{x}</option>" for x in xs)
|
||||
form = f"""<form method=post action="/corpus/add">
|
||||
<strong>Add to corpus</strong>
|
||||
<label>Name</label><input name=name required placeholder="NVIDIA / Odd Lots">
|
||||
<label>Kind</label><select name=kind>{opt(_KINDS)}</select>
|
||||
<label>Cluster</label><select name=cluster>{opt(_CLUSTERS)}</select>
|
||||
<label>Role</label><select name=role>{opt(_ROLES)}</select>
|
||||
<label>Ticker (companies)</label><input name=ticker placeholder="NVDA">
|
||||
<label>RSS URL (podcasts)</label><input name=rss_url placeholder="https://...">
|
||||
<label>YouTube channel</label><input name=channel_url placeholder="https://youtube.com/@...">
|
||||
<button type=submit>Add source</button>
|
||||
</form>"""
|
||||
body = f"""<h1>Corpus ({len(srcs)} sources)</h1>{form}
|
||||
<table><tr><th>name</th><th>kind</th><th>cluster</th><th>role</th><th>ticker / backtest</th><th>docs</th><th>claims</th></tr>{table}</table>"""
|
||||
return _page("Corpus", body)
|
||||
|
||||
@app.post("/corpus/add")
|
||||
async def corpus_add(request: Request):
|
||||
raw = (await request.body()).decode()
|
||||
f = {k: v[0].strip() for k, v in parse_qs(raw).items() if v and v[0].strip()}
|
||||
name = f.get("name")
|
||||
if not name:
|
||||
return RedirectResponse("/corpus", status_code=303)
|
||||
kind = f.get("kind", "podcast")
|
||||
ticker = f.get("ticker")
|
||||
        # Slug the ticker too, so a stray space or slash cannot end up in the /source/{id} path.
        sid = f"co-{_slug(ticker)}" if ticker else f"{'pod' if kind in ('podcast','youtube') else kind}-{_slug(name)}"
|
||||
c = conn()
|
||||
c.execute("""INSERT OR IGNORE INTO sources
|
||||
(source_id, name, kind, source_cluster, role, ticker, rss_url, channel_url)
|
||||
VALUES (?,?,?,?,?,?,?,?)""",
|
||||
(sid, name, kind, f.get("cluster"), f.get("role", "none"),
|
||||
ticker.upper() if ticker else None, f.get("rss_url"), f.get("channel_url")))
|
||||
c.commit()
|
||||
c.close()
|
||||
return RedirectResponse("/corpus", status_code=303)
|
||||
|
||||
@app.get("/source/{source_id}", response_class=HTMLResponse)
|
||||
def source_detail(source_id: str) -> HTMLResponse:
|
||||
c = conn()
|
||||
s = c.execute("SELECT * FROM sources WHERE source_id=?", (source_id,)).fetchone()
|
||||
if not s:
|
||||
c.close()
|
||||
return _page("Not found", "<h1>Source not found</h1>")
|
||||
claims = c.execute("""SELECT proposition, claim_type, time_horizon, thesis_seam, topic_canonical,
|
||||
engages_consensus, date FROM claims WHERE source_id=?
|
||||
ORDER BY date DESC LIMIT 200""", (source_id,)).fetchall()
|
||||
c.close()
|
||||
def crow(cl):
|
||||
star = " ⚔" if cl["engages_consensus"] else ""
|
||||
            return (f"<tr><td>{html.escape(str(cl['date'] or ''))}</td><td><span class=tag>{html.escape(cl['claim_type'] or '')}</span></td>"
|
||||
                    f"<td>{html.escape(cl['thesis_seam'] or '')}</td><td>{html.escape(cl['topic_canonical'] or '')}</td>"
|
||||
f"<td>{html.escape(cl['proposition'])}{star}</td></tr>")
|
||||
rows = "".join(crow(cl) for cl in claims) or '<tr><td class=muted colspan=5>no claims extracted yet</td></tr>'
|
||||
        meta = f"<span class=tag>{html.escape(s['kind'] or '')}</span> cluster={html.escape(s['source_cluster'] or '-')} role={html.escape(s['role'] or '-')}"
|
||||
if s["ticker"]:
|
||||
            meta += f" ticker={html.escape(str(s['ticker']))}"
|
||||
if s["backtest_2022_2023"]:
|
||||
            meta += f" · backtest={html.escape(str(s['backtest_2022_2023']))}"
|
||||
body = f"""<h1>{html.escape(s['name'])}</h1><p>{meta}</p>
|
||||
<p class=muted>{html.escape(s['notes'] or '')}</p>
|
||||
<h2>Claims ({len(claims)}) <span class=muted>⚔ = engages consensus</span></h2>
|
||||
<table><tr><th>date</th><th>type</th><th>seam</th><th>topic</th><th>proposition</th></tr>{rows}</table>"""
|
||||
return _page(s["name"], body)
|
||||
|
||||
return app
|
||||
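The `/corpus/add` handler avoids a multipart dependency by parsing the urlencoded body itself, and every page is built by string interpolation, so escaping matters. The following sketch isolates those two ideas using only the standard library. `parse_form`, `source_id_for` and `cell` are hypothetical helper names, and this version slugs the ticker as well as the name, which is a small hardening assumption rather than a description of the handler.

```python
import html
import re
from urllib.parse import parse_qs


def slug(s: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")[:40] or "src"


def parse_form(raw: str) -> dict:
    """Flatten an application/x-www-form-urlencoded body, dropping blank fields."""
    return {k: v[0].strip() for k, v in parse_qs(raw).items() if v and v[0].strip()}


def source_id_for(form: dict) -> str:
    """Companies key on ticker; podcasts and YouTube channels share the 'pod-' prefix."""
    ticker = form.get("ticker")
    if ticker:
        return f"co-{slug(ticker)}"
    kind = form.get("kind", "podcast")
    prefix = "pod" if kind in ("podcast", "youtube") else kind
    return f"{prefix}-{slug(form['name'])}"


def cell(value) -> str:
    """Render one table cell, escaping anything that came from the database or a form."""
    return f"<td>{html.escape(str(value if value is not None else ''))}</td>"
```

`parse_qs` decodes `+` and percent-escapes and omits blank values by default, so an empty `ticker=` field never reaches the id logic.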
@@ -0,0 +1,28 @@
|
"""Small shared utilities (normalization, dedup keys)."""
from __future__ import annotations

import re

_SHOW_SUFFIX = re.compile(r"\s*[|\-–—]\s*[^|\-–—]*(podcast|show|ep(isode)?\s*\d+).*$", re.I)
_EP_PREFIX = re.compile(r"^\s*(ep(isode)?\.?\s*\d+\s*[:\-–]|#\s*\d+\s*[:\-–]|\d+\s*[:\-–])\s*", re.I)
_NONALNUM = re.compile(r"[^a-z0-9]+")


def slugify(s: str, *, maxlen: int = 60) -> str:
    return _NONALNUM.sub("-", (s or "").lower()).strip("-")[:maxlen] or "x"


def normalize_title(title: str) -> str:
    """Normalize an episode title so the SAME episode matches across feeds/mirrors despite cosmetic
    differences ('Ep 42: Foo' vs 'Foo | The Show'). Best-effort — a safety net, not the primary key."""
    t = title or ""
    t = _SHOW_SUFFIX.sub("", t)
    t = _EP_PREFIX.sub("", t)
    return _NONALNUM.sub(" ", t.lower()).strip()


def audio_dedup_key(title: str | None, date: str | None) -> str:
    """Cross-mirror dedup key for audio: normalized title + date. Computed BEFORE transcription so a
    duplicate episode (same content via a different feed/mirror) is skipped without spending GPU.
    NOT derived from the transcript (ASR is non-deterministic — a transcript hash would be brittle)."""
    return f"{normalize_title(title or '')}|{date or ''}"
|
||||
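As a usage illustration, the dedup key can act as a pre-transcription filter over incoming episodes. The patterns and the two functions are restated from the utilities above so the block runs on its own (dash characters are written as `\u` escapes). `new_episodes` and the episode dict shape are hypothetical and not part of the module.

```python
import re

_SHOW_SUFFIX = re.compile(
    r"\s*[|\-\u2013\u2014]\s*[^|\-\u2013\u2014]*(podcast|show|ep(isode)?\s*\d+).*$", re.I)
_EP_PREFIX = re.compile(
    r"^\s*(ep(isode)?\.?\s*\d+\s*[:\-\u2013]|#\s*\d+\s*[:\-\u2013]|\d+\s*[:\-\u2013])\s*", re.I)
_NONALNUM = re.compile(r"[^a-z0-9]+")


def normalize_title(title):
    t = _SHOW_SUFFIX.sub("", title or "")   # drop a trailing "| The Show" style suffix
    t = _EP_PREFIX.sub("", t)               # drop a leading "Ep 42:" style prefix
    return _NONALNUM.sub(" ", t.lower()).strip()


def audio_dedup_key(title, date):
    return f"{normalize_title(title or '')}|{date or ''}"


def new_episodes(episodes):
    """Yield only episodes whose dedup key has not been seen yet (hypothetical filter)."""
    seen = set()
    for ep in episodes:
        key = audio_dedup_key(ep.get("title"), ep.get("date"))
        if key in seen:
            continue  # same episode via another feed/mirror: skip before spending GPU
        seen.add(key)
        yield ep


feed = [
    {"title": "Ep 42: Foo", "date": "2023-01-05", "via": "rss"},
    {"title": "Foo | The Show", "date": "2023-01-05", "via": "mirror"},
    {"title": "Foo | The Show", "date": "2023-01-12", "via": "rss"},
]
kept = list(new_episodes(feed))
```

Here the mirror copy from 2023-01-05 is dropped, while the 2023-01-12 item is kept because the date is part of the key.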
@@ -0,0 +1,384 @@
|
||||
# Ten31 Signal Engine — Build Handoff (Project 2 of 2)
|
||||
|
||||
**Audience:** Claude Code
|
||||
**Owner:** Grant
|
||||
**Status:** v2 spec. Pilot-first. Several items flagged DESIGN-FIRST must be resolved before scaling.
|
||||
**Companion doc:** Project 1 (voice/writing assistant) is a separate, independent build with a different scope and hardware footprint. This doc covers Project 2 only.
|
||||
|
||||
---
|
||||
|
||||
## 1. What we're building (in one paragraph)
|
||||
|
||||
A recurring pipeline that ingests a large, growing corpus of audio (podcasts, YouTube) and text (SEC filings, earnings-call transcripts), extracts structured *claims* from it, and surfaces signal over time, all filtered through Ten31's investment thesis as a relevance lens (Section 3). Every surfaced signal is logged as a falsifiable prediction so the system can be scored against reality over time.
|
||||
|
||||
The system serves **two distinct jobs**, and the design must do both:
|
||||
|
||||
- **Job A — Discovery.** Surface emerging themes (including non-obvious 2nd/3rd-order themes) and contrarian signals that Grant does not yet see. Themes are detected as *convergence* of independent sources; contrarian as a credible minority position actively arguing against an established consensus; and the prize is their intersection — a credible minority view that is also *accelerating* (a consensus about to flip).
|
||||
- **Job B — Closing the conviction-action gap.** Take convictions Ten31 *already holds*, fan them out to their 2nd/3rd-order consequences, and flag the derivative branches where the world is beginning to corroborate something Ten31 has little or no exposure to. This is designed against a specific, expensive, real failure (see Section 1.1): being *right about a root theme and late to its derivatives.*
|
||||
|
||||
This is **not** a summarizer, and it is **not** a free-associating "find me something interesting" engine. The discipline that separates signal from plausible-sounding noise is the hard constraint in Section 5: **statistics and graph structure nominate candidates; the frontier model only judges (and fans out from) a pre-filtered shortlist.**
|
||||
|
||||
### 1.1 The failure mode Job B exists to prevent (the AI/compute miss)
|
||||
|
||||
Three years ago Ten31 co-founder Jonathan Kirkwood publicly articulated, with conviction, that bitcoin mining and AI were both "distributed compute," that interruptible/flexible load was the differentiator, and that the world would need to ~1000x rack space over the decade. **The root call was correct and early.** What was missed was not the prediction — it was the *derivative tree*: if compute explodes and power becomes the binding constraint, then the investable consequences are grid interconnect, transformers, substations, cooling, gas turbines, nuclear, uranium, and the public picks-and-shovels of the buildout — much of which was extraordinarily profitable 2023–2025. Ten31 held the seed conviction but did not systematically fan it out and size up behind the branches; in at least one case (Giga) the AI tailwind arrived by accident rather than as the thesis. **Job B is the institutional countermeasure: never again be right about a root theme and under-act on its derivatives.** (Note: this fan-out logic is Ten31's own — the Platform essay frames the strategy as Sequoia's "aircraft carrier" approach of backing a thesis *and its second-order beneficiaries*: microprocessor → Atari → Apple → Oracle/Cisco. Job B operationalizes exactly that.)
|
||||
|
||||
---
|
||||
|
||||
## 2. The core conceptual model (read this before coding)
|
||||
|
||||
Three things people get wrong about this kind of system. The architecture exists to avoid all three:
|
||||
|
||||
1. **An isolated point in embedding space is almost never "early insight" — it's almost always noise** (ad reads, tangents, mistranscriptions, one-off anecdotes). Do not hunt for lonely outliers. Real signal is *independent convergence*: the same non-obvious idea appearing across sources that share no guests, community, or topic.
|
||||
|
||||
2. **Embeddings are bad at stance and negation.** "Rate cuts cause a recession" and "rate cuts do NOT cause a recession" embed almost on top of each other. Therefore we never rely on raw-chunk vector distance to separate positions. We extract `stance` as a structured field *before* embedding.
|
||||
|
||||
3. **A frontier model asked "what's the non-obvious connection here?" will ALWAYS produce one**, whether or not it's real. It never returns "nothing." So the model is never allowed to nominate candidates from the raw corpus, nor to have its fan-out branches trusted on their own. It only ever assesses candidates that already passed a quantitative bar, and a fanned-out derivative only becomes a signal once independent corpus corroboration confirms it.
|
||||
|
||||
Geometry and stats nominate. The model judges and expands. The world (the prediction ledger) is the final arbiter.
|
||||
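A minimal sketch of "geometry and stats nominate", under stated assumptions: claims already carry a structured `stance` field (point 2), convergence is approximated by counting distinct source clusters per (topic, stance) pair (point 1), and only pairs that clear a bar would ever be shown to the model (point 3). The `Claim` shape, the `nominate` function and the threshold of 3 clusters are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    topic: str
    stance: str           # extracted as a field, so opposite positions never merge
    source_id: str
    source_cluster: str   # e.g. "macro", "energy", "ai_tech"


MIN_CLUSTERS = 3  # hypothetical evidence bar; a real bar would be tuned in the backtest


def nominate(claims, min_clusters=MIN_CLUSTERS):
    """Return (topic, stance) pairs corroborated by enough independent clusters."""
    clusters = defaultdict(set)
    for c in claims:
        clusters[(c.topic, c.stance)].add(c.source_cluster)
    return sorted(key for key, seen in clusters.items() if len(seen) >= min_clusters)


claims = [
    Claim("power-constraint", "supports", "pod-1", "macro"),
    Claim("power-constraint", "supports", "pod-2", "energy"),
    Claim("power-constraint", "supports", "co-x", "ai_tech"),
    Claim("power-constraint", "opposes", "pod-3", "macro"),
    Claim("lonely-outlier", "supports", "pod-4", "generalist"),
]
shortlist = nominate(claims)
```

Only the convergent position makes the shortlist; the single-source outlier and the lone opposing stance do not, so the model is never asked to find meaning in them.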
|
||||
---
|
||||
|
||||
## 3. The relevance lens AND the seed convictions — Ten31's thesis
|
||||
|
||||
This is the lens that decides what's *relevant* (Job A) and the set of seed convictions the system fans out (Job B). It must NOT decide what's *true*: the system must remain able to surface signals that cut against this thesis (see Section 5). Use the operator's exact framing below.
|
||||
|
||||
### Root (the forcing function)
|
||||
Debasement is the macro forcing function. The cost of everything *reproducible* — cognition, content, code, media — collapses toward zero, so value migrates to the few things that are scarce, verifiable, or owned. **AI is the abundance engine. Bitcoin is the scarcity anchor.** These are not two trends Ten31 straddles; they are two faces of one **megatrend**. Ten31 does not invest in AI because it is growing — it invests where AI's abundance *increases the value of the scarce and the verifiable*, which is the same thing Ten31 has invested in for a decade.
|
||||
|
||||
Framing discipline: **Ten31 is not "adding AI."** Ten31 has been making AI-adjacent investments for years (see below); they were simply filed under bitcoin. The thesis already spanned AI by construction.
|
||||
|
||||
### Bitcoin
|
||||
The apex non-debasable reserve asset that capital progressively converges on to preserve wealth. (NOT framed as a settlement/medium-of-exchange layer — see weak forms, Section 9.)
|
||||
|
||||
### The three investable seams (picks-and-shovels at the convergence)
|
||||
Ten31 backs the **indispensable enabling infrastructure** at the point where two megatrends meet — companies whose demand is *structurally redundant across both trends*, so the investment does not depend on which trend wins. This is NOT rent-extraction / "toll roads"; it is providing the critical infrastructure the buildout cannot happen without, positioned at the seam.
|
||||
|
||||
1. **Energy <-> Compute.** Power infrastructure, equipment, contracts, and software serving both bitcoin mining and AI data centers (portfolio: Giga Energy, Satoshi Energy, Upstream Data). Differentiator: **bitcoin-mining fluency is the underwriting lens for the AI energy buildout.** Miners were the first large-scale flexible, interruptible, behind-the-meter, location-agnostic electrical load in history; that exact playbook is what AI data centers now need. Mining isn't the bet — it's the training ground. Almost no generalist energy or AI fund has this fluency.
|
||||
2. **Debasement <-> Bitcoin.** Bitcoin as **pristine collateral** (the best collateral ever created: liquid, 24/7, divisible, verifiable, un-debasable) plus the **picks-and-shovels infrastructure that makes it easier to access, hold, leverage, and utilize** (portfolio: River, Unchained, Strike, Battery, debifi, AnchorWatch).
|
||||
3. **AI <-> Data-Ownership.** Keep proprietary data and inference under your own roof instead of feeding it to a third party that trains on it, monetizes it, or can cut you off (portfolio: Start9, OpenSecret/Maple). The indispensable option for those who cannot cede control — regulated industries, trade secrets, adversarial jurisdictions, the targeted.
|
||||
|
||||
### Connective logic (NOT a sub-sector to fund)
|
||||
AI belongs in the mandate because the scarcity thesis is *incomplete without the abundance force creating it* — not because AI is adjacent or hot. The narrow playground: **only where AI collides with scarcity, energy, and data-ownership.** Not foundation models, not generic AI apps. Right to win: **when AI hits bitcoin, energy, and freedom tech, Ten31 sees it first and underwrites it best** — a decade of relationships and operating fluency a generalist AI fund lacks and a bitcoin fund without AI understanding lacks.
|
||||
|
||||
### Already-AI track record (proof, not forecast)
|
||||
The mandate is demonstrated by existing investments, several years old, each landing on a named seam:
|
||||
- Energy<->compute: Giga Energy, Satoshi Energy
|
||||
- AI<->data-ownership: Start9, OpenSecret/Maple
|
||||
- AI<->scarcity/verification: StatMuse, Stakwork, Vida
|
||||
(Confirm exact dates and seam assignment per company when finalizing the pitch.)
|
||||
|
||||
### Free optionality (NOT load-bearing)
|
||||
Censorship-resistant settlement for machine-to-machine payments — bitcoin/lightning wins the permissionless, un-freezable, cross-border margin that stablecoins structurally can't serve. Carried as upside, not underwritten.
|
||||
|
||||
### 3.1 The conviction log (seed nodes for Job B)
|
||||
A maintained, human-owned list of beliefs Ten31 holds, each with **conviction** (low/med/high), **current exposure** (none/low/med/high), and a **disconfirming signal**. These are the seed nodes the system fans out (4.6) and the basis for the under-acted-conviction signal (4.4). Highest-leverage input to Job B; keep editable.
|
||||
|
||||
**Critical structural rule — conviction = team x thesis, but the engine can only track the thesis half.** Each entry separates the *trackable thematic proposition* (what the corpus/world can corroborate, what gets fanned out and scored) from *team conviction* (logged as context only — no podcast can resolve whether a given founder out-executes). The engine must never present corroboration of a theme as validation of the team bet beneath it.
|
||||
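One way to picture a conviction-log entry and the gap it is meant to expose is an ordinal comparison of conviction against exposure. This is only an illustration: the actual under-acted-conviction signal is specified in 4.4, the `Conviction` dataclass and `exposure_gap` function are hypothetical, and the four-step scale ignores the mixed labels (such as MED-HIGH or "pervasive") that appear in the draft entries.

```python
from dataclasses import dataclass

LEVELS = {"none": 0, "low": 1, "med": 2, "high": 3}


@dataclass
class Conviction:
    conviction_id: str
    thematic_proposition: str        # the trackable half: what the corpus can corroborate
    conviction_level: str            # low / med / high
    current_exposure: str            # none / low / med / high
    disconfirming_signal: str
    team_conviction_note: str = ""   # context only; deliberately never enters any score


def exposure_gap(c: Conviction) -> int:
    """Positive when belief outruns positioning; uses only the thematic fields."""
    return LEVELS[c.conviction_level] - LEVELS[c.current_exposure]


log = [
    Conviction("D1", "Bitcoin-as-collateral goes mainstream", "high", "high",
               "stays crypto-native niche"),
    Conviction("A1", "Owned judgment is the last margin", "high", "low",
               "enterprises cede data/inference with no margin penalty"),
]
ranked = sorted(log, key=exposure_gap, reverse=True)
```

Keeping `team_conviction_note` out of `exposure_gap` mirrors the structural rule: corroboration of a theme is scored, the team bet beneath it is not.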
|
||||
Initial draft (v1 — operator to finalize levels):
|
||||
|
||||
**Root**
|
||||
- **R1 Debasement / neutral reserve** — sovereign debt keeps being monetized not repaid; fiat debasement persists; bitcoin adopted as the neutral non-debasable reserve capital migrates to. Conviction HIGH / Exposure pervasive. Disconfirm: durable fiscal surpluses + falling debt/GDP + no reserve diversification.
|
||||
- **R2 Abundance/scarcity** — AI drives marginal cost of the reproducible toward zero; value accrues to the scarce/verifiable; bitcoin is the "strongest horse," gains relative share, pricing-in-bitcoin grows. Conviction HIGH / Exposure thesis-wide. Disconfirm: scarce/verifiable assets earn no premium as AI content saturates.
|
||||
- **R3 Sovereign + institutional adoption catalyst** — strategic bitcoin reserves (US / nation-states), SAB-121 repeal enabling bank custody, and ETF/treasury inflows provide a price-inelastic bid and invert allocator career risk (ignoring bitcoin becomes the risk). Conviction MED-HIGH / Exposure pervasive (esp. custody/credit names). Disconfirm: reserve plans stall or reverse; banks stay out; policy turns adversarial.

**Energy <-> Compute**
- **E1 Power not chips is the binding constraint** on AI buildout through ~2027-28; seam picks-and-shovels under-priced. Conviction HIGH / Exposure MED-HIGH (Giga, Satoshi Energy). Disconfirm: chips/capital remain bottleneck; interconnect clears fast.
- **E2 Miner flexible-load playbook goes mainstream** (demand response, behind-the-meter) for AI data centers + grids; mining fluency = transferable underwriting edge. Conviction HIGH / Exposure MED (Giga power-market optimization, Satoshi). Disconfirm: data centers reject flexible-load; fluency non-transferable.
- **E3 Straddlers beat pure-plays** — mining-native operators that pivot into/straddle AI/HPC capture the convergence; mining-only underperforms. Conviction MED / Exposure Giga (straddle) vs Upstream (mining-only, "whiffing on AI"). Disconfirm: pure-play mining outperforms straddlers.

**Debasement <-> Bitcoin (pristine collateral + picks-and-shovels)**
- **D1 Bitcoin-as-collateral goes mainstream** — new BTC-collateralized credit products proliferate; spreads compress; >=1 major traditional institution enters within 24-36 mo. Conviction HIGH / Exposure HIGH (Strike; Battery, Unchained, debifi, AnchorWatch). Disconfirm: stays crypto-native niche; no incumbent entry; spreads hold. *Scarcity-amplification mechanism:* as credit / dual-collateral / insurance products mature, holders borrow rather than sell, shrinking marginal supply (the "scarcer than you think" dynamic), pairing BTC with the ~$46T US credit / ~$4T real estate / ~$1T insurance markets. Battery = dual-collateral real-asset loans (effectively a cheap long-dated BTC call option for borrowers; thesis strong but execution lagging per operator). AnchorWatch = full BTC insurance in multisig (Miniscript) = the fiduciary unlock.
- **D2 Incumbents buy not build** — legacy finance/tech acquires bitcoin-natives rather than building (the published exit thesis). Conviction HIGH / Exposure portfolio-wide. Disconfirm: incumbents build in-house or via crypto-generalists; no strategic M&A.
- **D3 Bitcoin commercialization of legacy operating businesses** — compressed-multiple firms become structurally advantaged once rearchitected around bitcoin (treasury, settlement, self-hosted infra, stranded energy). Conviction MED-HIGH / Exposure enablers (Fold, AnchorWatch, Giga/Upstream). Disconfirm: legacy adoption stalls; no margin advantage.
- **D4 Strike re-rates as a bitcoin bank, not payments** — market values it as exchange + major retail BTC-collateralized lender + global access (70+ jurisdictions), not legacy payments. Conviction HIGH (largest position, ~40%) / Exposure HIGH / team conviction high (separate). Disconfirm: stays valued/stuck as payments; lending/exchange don't scale.

**AI <-> Data-Ownership (PRIME under-acted-conviction target — mirrors the 2023 AI/compute miss)**
- **A1 Coherence: owned judgment is the last margin** — AI commoditizes competence, profit on undifferentiated output erodes toward zero; durable margin needs owned/protected proprietary data + judgment; demand grows for sovereign-root + confidential-inference infra. Conviction HIGH (thematic) / Exposure LOW (Start9, OpenSecret/Maple, maybe Primal; small checks). Disconfirm: enterprises cede data/inference with no margin penalty.
- **A2 Sovereign option for the segment that can't cede** (regulated, IP-sensitive, adversarial jurisdictions) adopts owned infra + confidential inference even as the majority cedes to convenience. Conviction MED / Exposure LOW. Disconfirm: even IP-sensitive segment fully cedes.
- **A3 Start9 broadens beyond the niche** (SaaS->on-prem reversion). Conviction LOW / explicitly uncertain (team high, theme unproven — "maybe drinking our own koolaid, tbd") / Exposure LOW. Disconfirm: stays bitcoiner-niche.

**Monitored thesis-breakers (engine must surface these against the thesis)**
- **B1 Quantum acceleration** compresses CRQC timelines inside NIST 2035 before mitigations deploy (bitcoin-leg breaker).
- **B2 AI permanently outbids mining for power**, pushing mining to only truly-stranded margin (energy-leg breaker).
- **B3 Stablecoins/CBDCs capture the neutral-reserve role** or bitcoin fails as the exit (tests the complementary-stablecoin view).

Note: the highest-value early use of the engine is pressuring **A1/A2** — high published conviction, low exposure, world beginning to corroborate — exactly the shape of the prior miss. A3 and E3 are deliberately low-conviction seeds the engine should help resolve.

---
## 4. Architecture — pipeline layers

Design target: **~95% of compute runs locally** (Grant operates dual DGX Spark running Qwen3 via vLLM). Frontier API is used ONLY at the final synthesis/judgment/fan-out step on a small shortlist. This is both a cost decision (bulk extraction at frontier prices would be an order of magnitude more expensive) and a data-sovereignty decision.

```
[Ingestion] -> [Extraction (LOCAL)] -> [Embedding+Store] -> [Cluster + Temporal + Graph]
  -> [Candidate scoring: emergence / contrarian / intersection / under-acted-conviction]
  -> [Frontier judge + conviction fan-out (SHORTLIST / SEEDS ONLY)]
  -> [Dual-evaluation ledger: human ratings + falsifiable predictions w/ lead time]
```
### 4.1 Ingestion
- Podcasts/YouTube: pull via RSS feeds and YouTube; download audio; **transcribe locally through the operator's existing Spark Control gateway, NOT a Whisper deployment you stand up.** Transcription is a live OpenAI-compatible endpoint: `POST /v1/audio/transcriptions` backed by **NVIDIA Parakeet TDT 0.6B** (`response_format=verbose_json` gives word- and segment-level timestamps). It runs ~60x real-time on the operator's GPU, so transcription is not the bottleneck. See §10 and §13 for the full endpoint list and exactly what you must build vs. call.
- **Speaker labels are available — and they're more useful than the doc's "where possible" implies.** Spark Control also exposes **diarization + 192-dim voice fingerprints** (NVIDIA Sortformer + TitaNet): `POST /api/audio/diarize-chunk` returns per-speaker segments **plus a voiceprint per speaker**, and `POST /api/audio/transcribe-with-speakers` returns speaker-attributed transcript blocks in one call. The voiceprints are the important part for this project: they let you **identify the same guest by voice across different shows even when unlabeled** — a direct, automated input to the source-independence graph (§4.5, "shared guests") that you'd otherwise have to infer from show notes.
- **Long-audio handling (important operational note).** Podcasts run 1–3 hours; the diarizer (Sortformer) caps at 4 speakers per chunk and the operator's Spark 2 is a single GPU. So you **chunk long audio into ~2–3 minute pieces and send them sequentially** (parallel audio requests can trip a GPU FFT race → 503/retry). `diarize-chunk` is purpose-built for this: it returns a voiceprint per chunk so you can re-cluster the same speaker across chunks (cosine similarity, ~0.7 distance threshold). This chunking + cross-chunk speaker stitching is **your code**, on top of the per-chunk endpoint.
- Companies: pull SEC filings (EDGAR) and earnings-call transcripts on a schedule. (These are text — no transcription needed; the earnings-call transcript *source* is still TBD, §12.)
- Store raw + transcript with metadata: source, source_cluster, date, speakers, speaker_voiceprints, url.
- NOTE on source-quality asymmetry: filings/earnings calls are high-information-density and a likelier source of differentiated signal; podcasts are low-density and more prone to echo. Weight accordingly downstream; do not treat all sources as equal-value just because they're in one corpus.
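The chunking and cross-chunk speaker stitching described above could be sketched roughly as follows. This is a minimal illustration, not the operator's implementation: the function names are hypothetical, the 150-second chunk length is just one value inside the stated ~2–3 minute range, and the code reads the "~0.7" figure as a minimum cosine similarity (the text leaves similarity vs. distance ambiguous). Real voiceprints are 192-dim; the usage note uses 3-dim toy vectors.

```python
import math


def chunk_spans(total_seconds, chunk_seconds=150.0):
    """Split an episode into consecutive (start, end) spans, to be sent one at a time."""
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def stitch_speakers(chunk_voiceprints, min_similarity=0.7):
    """Map per-chunk speakers onto episode-level speaker ids.

    chunk_voiceprints: one list per chunk, each holding that chunk's voiceprints.
    Greedy assignment (an assumption): each voiceprint joins the most similar known
    speaker if similarity >= min_similarity, otherwise it starts a new speaker.
    """
    references = []  # first voiceprint seen for each episode-level speaker
    labels = []
    for prints in chunk_voiceprints:
        chunk_labels = []
        for voiceprint in prints:
            best_id, best_sim = None, min_similarity
            for speaker_id, reference in enumerate(references):
                sim = cosine_similarity(voiceprint, reference)
                if sim >= best_sim:
                    best_id, best_sim = speaker_id, sim
            if best_id is None:
                references.append(voiceprint)
                best_id = len(references) - 1
            chunk_labels.append(best_id)
        labels.append(chunk_labels)
    return labels
```

For example, `stitch_speakers([[[1, 0, 0], [0, 1, 0]], [[0, 0.9, 0.1], [0.95, 0.05, 0]]])` labels the two speakers in the second chunk as the same two people from the first chunk, in swapped order. The same lookup, run against a persisted voiceprint library, is what §4.5 relies on for cross-show guest identity.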
### 4.2 Extraction (LOCAL model — this is the cost & quality center) — schema FINALIZED for pilot
Run each transcript through a local model to extract structured **claim units**. Extract at the level of the **proposition**; let one passage emit *multiple* claims or *zero*. Most of a podcast hour is zero — the extractor must be willing to find nothing. An extractor that dutifully emits a claim per chunk reintroduces exactly the noise everything else is designed to remove.

```json
{
  "claim_id": "...",
  "proposition": "normalized one-sentence proposition: subject-assertion-object",
  "topic_canonical": "normalized topic for clustering / stance distributions",
  "topic_raw": "what was actually said (preserved)",
  "claimant": "who said it",
  "source": "...",
  "source_cluster": "macro | ai_tech | energy | bitcoin | vc_consensus | generalist",
  "date": "ISO date",
  "claim_type": "interpretive | predictive | descriptive | reactive",
  "time_horizon": "near | medium | long | unspecified",
  "confidence": "low | med | high",
  "relation": { "target_proposition_id": "...|null", "polarity": "affirms | denies | qualifies | none" },
  "engages_consensus": true,
  "counters_position": "the mainstream position it argues against, if any",
  "thesis_seam": "energy_compute | debasement_bitcoin | ai_data_ownership | none",
  "salience": "central | secondary | aside"
}
```
Design rationale:
- **`proposition` is the atomic unit** of the whole system: a normalized claim with an owner and a date. It is what makes "two sources, same stance" detectable, and it is what later becomes a falsifiable prediction. Do NOT collapse stance into a bull/bear label — too lossy.
- **`topic_canonical` vs `topic_raw`** — without normalization, "Fed policy"/"interest rates"/"the FOMC" scatter and clustering fails.
- **`relation` (affirms/denies/qualifies a prior proposition)** is how a real stance distribution gets built ("11 affirm, 3 deny, 2 qualify") and is the negation fix.
- **`claim_type`** separates insight (interpretive/predictive) from news echo (descriptive/reactive); these look identical on a raw frequency chart and completely different once separated.
- **`time_horizon`** — a predictive claim is useless to the ledger without one; `unspecified` predictions are lower value.
- **`confidence` is low/med/high only** — a 0-1 number from a local model is false precision.
- **No `falsifiable` flag** — falsifiability is implied by structure (predictive claim_type + a resolvable proposition), not a model judgment.
- **`engages_consensus`/`counters_position`** — distinguishes a real counter-argument (signal) from ignorance that happens to disagree (noise).
- **`thesis_seam`** is a tag, NOT a hard filter (off-thesis-but-important signals must survive).
- **`salience`** cheaply downweights throwaway lines.

**Serving (no new infra to build):** run extraction against the operator's local LLM through Spark Control — `POST /v1/chat/completions` (OpenAI-compatible), currently serving **Qwen3.6-35B-A3B-NVFP4 (64K context)** on Spark 1 via vLLM. Two things that make the structured schema above reliable: (1) pass `response_format={"type":"json_object"}` for guaranteed-valid JSON (the operator already uses this exact pattern in production for a strict-JSON extraction task — it works), and (2) `chat_template_kwargs={"enable_thinking": false}`, `temperature: 0` for deterministic, no-chain-of-thought extraction. The 64K context comfortably holds a full transcript chunk plus the schema instructions. The model is hot-swappable from the Spark Control dashboard if you want a different local model for extraction, but one model serves at a time on Spark 1 (see §13 capacity notes).
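A minimal sketch of how the serving settings and the schema's closed vocabularies might fit together on the client side. The request body uses only the parameters named above; the function names, the `model` argument, and the idea of validating each claim after parsing are assumptions for illustration, not part of Spark Control. JSON-object mode guarantees parseable JSON, not schema conformance, so a cheap local check like this is one way to reject malformed claim units before they reach the store.

```python
# Closed vocabularies copied from the claim-unit schema in 4.2.
ALLOWED_VALUES = {
    "source_cluster": {"macro", "ai_tech", "energy", "bitcoin", "vc_consensus", "generalist"},
    "claim_type": {"interpretive", "predictive", "descriptive", "reactive"},
    "time_horizon": {"near", "medium", "long", "unspecified"},
    "confidence": {"low", "med", "high"},
    "thesis_seam": {"energy_compute", "debasement_bitcoin", "ai_data_ownership", "none"},
    "salience": {"central", "secondary", "aside"},
}
POLARITIES = {"affirms", "denies", "qualifies", "none"}


def build_extraction_request(transcript_chunk, schema_instructions, model):
    """Body for POST /v1/chat/completions with the deterministic, strict-JSON settings."""
    return {
        "model": model,  # hypothetical: whatever name the gateway serves under
        "messages": [
            {"role": "system", "content": schema_instructions},
            {"role": "user", "content": transcript_chunk},
        ],
        "temperature": 0,
        "response_format": {"type": "json_object"},
        "chat_template_kwargs": {"enable_thinking": False},
    }


def validate_claim(claim):
    """Return a list of problems with one parsed claim unit (empty list = acceptable)."""
    problems = []
    if not str(claim.get("proposition") or "").strip():
        problems.append("proposition is empty")
    for field, allowed in ALLOWED_VALUES.items():
        value = claim.get(field)
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{field}={value!r} is not one of {sorted(allowed)}")
    relation = claim.get("relation")
    polarity = relation.get("polarity") if isinstance(relation, dict) else None
    if polarity not in POLARITIES:
        problems.append(f"relation.polarity={polarity!r} is not one of {sorted(POLARITIES)}")
    if not isinstance(claim.get("engages_consensus"), bool):
        problems.append("engages_consensus must be true or false")
    return problems
```

A claim that arrives with `"confidence": 0.8` fails validation here, which enforces the "low/med/high only" rule mechanically rather than trusting the prompt.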
### 4.3 Embedding + storage
- Embed the **distilled propositions**, NOT raw chunks. (Grant's stack: Qdrant + bge-m3 + SQLite — all already running.)
- **Embeddings endpoint:** `POST /v1/embeddings` (OpenAI-compatible) → **bge-m3, 1024-dim dense**, on the operator's GPU. Also live: `POST /v1/rerank` (bge-reranker-v2-m3 cross-encoder) and `POST /api/search` (hybrid dense+sparse retrieval with RRF fusion + optional rerank) against Qdrant.
- **Use hybrid, not dense-only.** Propositions are entity-heavy (tickers, company names, fund names, people). bge-m3 dense captures meaning; pair it with Qdrant's BM25 sparse leg (the operator's CRM does exactly this) so "MSTR" / "Strategy" / "Microstrategy" match on the lexical leg too, not just the fuzzy semantic one. `/api/search` already orchestrates dense+sparse+rerank; the BM25 sparse vectors are generated client-side at ingest (FastEmbed `Qdrant/bm25`) and the collection uses `modifier: idf`.
- **Reranker is a free quality lever for the judge.** Before the frontier judge (§4.6) sees a shortlist, `/v1/rerank` can re-order the quantitatively-nominated candidates by relevance to the theme query — cheap, local, +precision.
- Clustering now means something: propositions, with topic and stance already separated.
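For intuition on why the hybrid leg helps entity-heavy propositions, here is a small standalone sketch of reciprocal rank fusion (RRF), the fusion method named above. It is illustrative only: how `/api/search` implements fusion internally is not described here, the function name and proposition ids are hypothetical, and `k=60` is the conventional default constant rather than a value taken from the operator's stack.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical ids: the dense leg ranks by meaning, the sparse leg by lexical match.
dense_ranking = ["p_strategy", "p_mstr", "p_saylor"]
sparse_ranking = ["p_mstr", "p_microstrategy", "p_strategy"]
fused = rrf_fuse([dense_ranking, sparse_ranking])
```

In this toy case `p_mstr` ends up first because both legs rank it highly, while ids that only one leg found still survive lower in the list. That is the behavior the dense-only setup lacks.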
### 4.4 Clustering + temporal + graph + candidate scoring
- **Emergence (theme):** track cluster size over time; the signal is the *second derivative* (acceleration), not size. A big static cluster is just a popular topic.
- **2nd/3rd-order themes** won't appear as one growing cluster (no one states them outright). Detect two ways: (a) *bridge formation* — new edges in the co-mention graph between previously-unconnected clusters; (b) *top-down synthesis pass* (Section 4.6) that NAMES the higher-order theme. Themes are generated by synthesis, not discovered by geometry.
- **Contrarian:** build, per topic, the actual **stance distribution** (possible only because stance/relation were extracted). Needs ALL of: minority position + genuine majority consensus + credible source + `engages_consensus = true`.
- **Intersection (consensus-flip):** a minority stance that is ALSO accelerating. Self-correcting against the lonely-crank problem (a crank stays lonely; a correct-early contrarian pulls independent sources in, which velocity catches).
- **Under-acted conviction (Job B signal type):** for each seed conviction and its fanned-out derivatives (Section 4.6), score = **conviction (high) x current exposure (low) x rising independent corroboration in the corpus.** Fires when Ten31 believes something, has little/no position, and the world is starting to corroborate it or a derivative of it. This is the signal that should have flagged "size up power-infrastructure picks-and-shovels" in 2023.
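Two of the scores above reduce to short arithmetic, sketched here under stated assumptions. The acceleration is taken as a discrete second difference over equally spaced time windows. For the under-acted-conviction score, the mapping of conviction, exposure, and corroboration onto 0-1 numbers and the use of `1 - exposure` for "low exposure" are illustrative choices; the text specifies only the three-way product, not its scaling.

```python
def acceleration(cluster_sizes):
    """Discrete second difference of cluster size per time window.

    A large but static cluster yields zeros; a cluster whose growth is
    itself growing yields positive values.
    """
    return [
        cluster_sizes[i] - 2 * cluster_sizes[i - 1] + cluster_sizes[i - 2]
        for i in range(2, len(cluster_sizes))
    ]


def under_acted_score(conviction, exposure, corroboration_trend):
    """conviction x (low) exposure x rising independent corroboration.

    All inputs are assumed to be pre-normalized to the 0-1 range; a flat or
    falling corroboration trend contributes zero, so the signal cannot fire.
    """
    return conviction * (1.0 - exposure) * max(corroboration_trend, 0.0)
```

With these definitions, `acceleration([10, 10, 10, 10])` is `[0, 0]` (a popular but static topic) while `acceleration([1, 2, 4, 8])` is `[1, 2]`. A high-conviction, low-exposure seed scores roughly nine times higher than the same seed already held at high exposure (0.405 vs 0.045 for conviction 0.9 and trend 0.5).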
### 4.5 Source-independence graph (build this even in the pilot)
Source independence is mostly an illusion: podcasters share guests, quote each other, move in cliques. Five shows "independently converging" may be one guest doing the rounds. Build a graph of sources (shared guests, citations, community overlap) and **discount convergence by how connected the sources are.** Cross-cluster convergence (a macro show + an energy show + an AI show, no shared guests, landing on the same on-thesis idea) is the gold; within-cluster convergence is near-noise. Deliberately under-weight the bitcoin cluster: it's the most correlated with Ten31's own priors.

**Capability you already have for this: voiceprint-based guest identity.** The "shared guests" edge is the hardest part of this graph to build from metadata (show notes are inconsistent, a guest's name is spelled three ways, many appearances aren't announced). The operator's transcription stack returns a **192-dim TitaNet voiceprint per speaker** (§4.1). Persist a voiceprint library and you can detect **the same person speaking across two shows by voice**, automatically, even when neither show labels them — turning "did these five shows actually share a guest?" from a manual annotation task into a cosine-similarity lookup. This is the single highest-leverage use of the diarization capability for *this* project, and it directly powers the convergence-discounting that separates real cross-cluster signal from one guest doing the rounds.
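One simple way to turn "discount convergence by how connected the sources are" into a number is to count connected components in the source graph, so that five shows linked by one touring guest count as a single independent voice. This is a sketch of one possible discount, not a prescribed method: the function name and the show names in the example are hypothetical, and a production version would likely weight edges (shared guests vs. citations) rather than treat every link as full dependence.

```python
def effective_independent_sources(sources, dependence_edges):
    """Count groups of sources that share no dependence link (union-find).

    sources: iterable of source names that converged on the same proposition.
    dependence_edges: (source_a, source_b) pairs, e.g. a shared guest detected
    by voiceprint match or a citation between the two shows.
    """
    parent = {source: source for source in sources}

    def find(node):
        # Walk to the component root, compressing the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in dependence_edges:
        if a in parent and b in parent:  # ignore edges to sources not in this set
            parent[find(a)] = find(b)

    return len({find(source) for source in parent})


converging = ["show_a", "show_b", "show_c", "show_d", "show_e"]
guest_tour = [("show_a", "show_b"), ("show_b", "show_c"),
              ("show_c", "show_d"), ("show_d", "show_e")]
```

Here `effective_independent_sources(converging, guest_tour)` returns 1, while the same five shows with no edges return 5. Using that count in place of the raw source count is the discount.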
### 4.6 Frontier synthesis / judge / fan-out (SHORTLIST + SEEDS ONLY)
Frontier API used in two bounded roles, never on the raw corpus:
- **Judge (Job A):** receives ONLY candidates that passed the quantitative bar — "this minority stance, these N independent sources, this acceleration, this consensus baseline" — assesses genuine vs artifact, and emits the resolution spec (Section 6). Must NOT generate candidates from scratch.
- **Synthesis (Job A):** receives cluster centroids + newly-strengthened edges and NAMES higher-order themes.
- **Conviction fan-out (Job B):** receives the seed convictions (Section 3.1) and generates their 2nd/3rd-order derivative nodes. These derivatives are then matched against the corpus; a derivative becomes an under-acted-conviction signal ONLY when independent corroboration AND the exposure gap both clear the bar. The fan-out proposes the tree; the world and the book decide which branch matters. (Confabulation guard: fan-out branches are hypotheses, never signals on their own.)

**Sovereignty at the frontier boundary — use the operator's redaction gateway.** The public corpus (podcasts, filings) needs no protection. But the inputs to this frontier step are the *most* sensitive thing in the system: Ten31's **conviction log** (§3.1) — actual positions, conviction levels, and explicit **exposure gaps** (where Ten31 believes something and is under-invested). Sending that raw to an external frontier API leaks Ten31's playbook and its blind spots. The operator already runs a **scrub/rehydrate gateway** on Spark Control for exactly this: `POST /scrub` de-identifies the proprietary entities/positions into stable placeholders (`[FUND_1]`, `[POSITION_2]`, `[AMOUNT_3]`) before the call, the frontier model reasons over placeholders, and `POST /rehydrate` restores the real values locally — the de-anonymization map never leaves the box. **Route the conviction-fan-out and judge prompts through scrub → frontier → rehydrate** so the engine can use a frontier model without exposing Ten31's exposure map. (Caller supplies the entity dictionary per request; the gateway also runs a local-Qwen NER backstop for anything the dictionary misses. See §13.)

**Refinement — keep exposure off the frontier entirely; scrub identities, not substance.** The cheapest and least-blunting split is architectural, not redactional. The frontier's two jobs here — fan out derivatives from a seed *thematic* conviction, and judge a candidate shortlist — do **not** require Ten31's exposure data. So do not send it: compute the conviction x exposure gap (the under-acted-conviction score, §4.4) **locally, after** the frontier returns its thematic output. The crown-jewel data (position sizes, exposure levels, the prioritized strategic map) then never leaves the box, at **zero capability cost** — it isn't redacted, the model simply never needed it. For what *does* go (the relevant thematic slice + public corpus candidates), scrub *entities/identifiers* into placeholders but never redact the *substance* the model must reason over — de-identification is not content redaction (tokenizing `Strike`→`[FUND_1]` preserves reasoning; deleting the claim the model must weigh blunts it). And send only the thematic slice relevant to the current judgment, not the whole conviction map: the individual theses are mostly published in Ten31's essays (low sensitivity); the *combination and prioritization* is the proprietary part. Validate in the pilot: run a sample of judge/fan-out prompts scrubbed vs. unscrubbed and compare output quality, so any reasoning cost of scrubbing is **measured, not assumed**. (Threat-model note: the realistic risk on a bounded commercial frontier call is data-at-rest / breach / subpoena, not the model training on inputs — commercial frontier APIs do not train on inputs by default — so the conservatism is reasonable, and this split buys sovereignty on the part that matters without taxing capability.)
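To make "scrub identities, not substance" concrete, here is a toy, purely local illustration of the placeholder round trip. It is not the gateway: the real `POST /scrub` and `POST /rehydrate` request and response shapes are not specified in this section, so nothing below should be read as their API. The function names and the entity dictionary format (`name -> kind`) are assumptions for the example, which handles plain word-like names only.

```python
import re


def scrub(text, entities):
    """Replace known entity names with stable placeholders.

    entities: dict mapping a real name to a placeholder kind, e.g. {"Strike": "FUND"}.
    Returns (scrubbed_text, mapping) where mapping is placeholder -> real name.
    The mapping stays local; only scrubbed_text would cross the frontier boundary.
    """
    mapping = {}
    counter = 0
    # Longest names first so a longer name is not partially replaced by a shorter one.
    for name in sorted(entities, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        if not pattern.search(text):
            continue
        counter += 1
        placeholder = f"[{entities[name]}_{counter}]"
        mapping[placeholder] = name
        text = pattern.sub(placeholder, text)
    return text, mapping


def rehydrate(text, mapping):
    """Restore real names in the model's output using the locally held mapping."""
    for placeholder, name in mapping.items():
        text = text.replace(placeholder, name)
    return text


prompt = "Strike is the largest position; Battery is lagging on execution."
scrubbed, local_map = scrub(prompt, {"Strike": "FUND", "Battery": "FUND"})
```

The scrubbed prompt keeps the full claim ("is the largest position", "is lagging on execution") and hides only who it is about, so the model can still reason over the substance. `rehydrate(scrubbed, local_map)` returns the original sentence.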
### 4.7 Dual-evaluation ledger (start day one — see Section 6)

---

## 5. Hard constraints / anti-patterns (do not violate)

1. **Stats/geometry nominate; the model judges/expands a pre-filtered shortlist or seed set.** The model never nominates from the raw corpus, and fanned-out derivatives are never trusted without corpus corroboration.
2. **Extract structured propositions first; embed distilled propositions, not raw chunks.**
3. **Separate topic from stance** (the `relation` field); never infer stance from vector distance.
4. **Discount convergence by source connectedness** (independence graph).
5. **Consensus is a moving baseline** — recompute per time-window. Yesterday's contrarian is today's mainstream.
6. **Filter reactive/descriptive claims against a news baseline; weight interpretive/predictive.**
7. **The lens tags relevance and seeds fan-out; it must not gate truth.** The system must be able to surface a credible, accelerating signal that argues *against* Ten31's own thesis. Concrete bear case it must be able to voice: if AI compute permanently outbids mining for power, mining gets pushed to only truly-stranded margin and the "mining underwrites the grid" leg weakens.

---
## 6. Evaluation & resolution — themes as the unit, events as the evidence

### 6.1 The corrected model (themes are primary; clean events are nested evidence)
Earlier framing treated "Tier 1 (clean/falsifiable)" and "Tier 2 (thematic/directional)" as two parallel tracks. That was wrong. They are **two altitudes of one signal**, and they nest:

- A **theme** is the unit Ten31 actually cares about and acts on ("power, not chips, becomes the gating constraint on AI and gets repriced"). Ten31 is an allocator betting on secular direction, so **themes are primary.**
- A theme is *made of* **clean events** — specific, observable, dated confirmations (a multi-GW nuclear PPA signed; transformer lead times blow out; a BTC-collateralized lending product ships). **These clean events are the external-confirmation evidence that grades the theme**, not a competing category.
- The higher-order / more thesis-core a theme is, the sparser and fuzzier its clean-event rungs (the abundance/scarcity theme barely has clean confirmations). That is not a defect — those are exactly the themes where Ten31 has the most edge if right and the most risk of self-deception. The ledger's job is to make that tradeoff visible, not hide it.

### 6.2 Theme resolution = two legs (both required)
1. **Discourse leg (leading, partly causal):** did independent, cross-cluster discourse on the theme keep accelerating from the log date forward? Treat discourse acceleration as a *causal leading indicator*, not merely an echo to discount — in Ten31's domains, narrative partly drives the outcome (capital follows story, price follows capital).
2. **External-confirmation leg (resolving):** did the bundle of nested clean events actually occur — real-world validation (capital flows, policy, adoption, price, signed deals), not just sustained chatter? Discourse alone resolving a theme would be circular (predicting that people keep talking), so this leg is mandatory for themes Ten31 acts on.
### 6.3 Lead time is a first-class logged field
For every theme signal, record the gap between when the system flagged it (discourse leg) and when external confirmation arrived. **This lead time IS the alpha measurement** — it separates "early to something real" (edge) from "articulate trend-follower" (worthless even if accurate). For Job B derivatives, measure earliness to the *derivative* node, not the root theme (the AI/compute miss was late derivatives, not a wrong root).

### 6.4 Reflexivity outcome taxonomy
- **Discourse up (cross-cluster, independent) + external confirmation follows + positive lead time** -> real, early, edge. The prize.
- **Discourse up + no external confirmation by horizon** -> narrative bubble that didn't cash out; record as a faked-out signal (this is how the system learns reflexive-and-real vs reflexive-and-hollow).
- **External moves with little prior discourse** -> blind spot; the system missed it; log it.
- **Discourse up but only within one cluster** -> echo; discount.
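The lead-time field and the four-way taxonomy above can be expressed as a small classifier. This is a sketch under assumptions: the function names and outcome labels are hypothetical, lead time is measured in whole days, and the echo check is applied first so that single-cluster discourse is discounted before confirmation is considered. Cases the taxonomy does not name (for example, confirmation with zero or negative lead time) fall through to `"unclassified"` rather than being forced into a bucket.

```python
from datetime import date


def lead_time_days(flagged_on, confirmed_on):
    """Days between the system's flag and external confirmation (None if unconfirmed)."""
    if confirmed_on is None:
        return None
    return (confirmed_on - flagged_on).days


def classify_outcome(discourse_up, cross_cluster, externally_confirmed, lead_days):
    """Map one resolved theme signal onto the reflexivity taxonomy."""
    if discourse_up and not cross_cluster:
        return "echo"  # only one cluster talking: discount
    if discourse_up and externally_confirmed and lead_days is not None and lead_days > 0:
        return "edge"  # real, early: the prize
    if discourse_up and not externally_confirmed:
        return "faked_out"  # narrative bubble that did not cash out by the horizon
    if externally_confirmed and not discourse_up:
        return "blind_spot"  # the world moved with little prior discourse
    return "unclassified"


flagged = date(2023, 1, 1)
confirmed = date(2023, 1, 31)
lead = lead_time_days(flagged, confirmed)
outcome = classify_outcome(True, True, True, lead)
```

Here `lead` is 30 days and `outcome` is `"edge"`. Logging the raw inputs alongside the label keeps the ledger auditable if the thresholds are later revised.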
### 6.5 Clean-event scoring (the nested evidence)
Clean events resolve `correct / partial / wrong / unresolved-expired / too-early`, automatically where data exists (price, filings) or via a quick human check. They keep the system honest: if it is systematically wrong on the falsifiable rungs, its themes shouldn't be trusted either. Keep a deliberate minority of pure clean-event predictions as a calibration backbone.

### 6.6 Two failure modes to design against now
- **Survivorship/cherry-picking:** log EVERY candidate that passes the quantitative bar, including boring ones, or there's no denominator and no hit rate.
- **Look-ahead / "already priced in":** theme resolution must care about acceleration and confirmation *from the log date forward*, never whether the theme was real in absolute terms — otherwise the ledger rewards noticing things that already happened (the expensive-summarizer failure wearing a success badge).
### 6.7 Mechanics
Dual track, started day one (predictions need time to resolve; the clock can't be backfilled). Human eval (Grant) answers "non-obvious and relevant to me?"; the ledger answers "was it correct, and how early?" Keep them in separate columns and **do not let the model see Grant's rating before it logs its prediction.** The valuable cell is disagreement (boring-but-right -> lens too narrow; brilliant-but-wrong -> seduced by plausibility). The model may *propose* resolution criteria; resolution comes from the world, never from model self-confidence (log `model_confidence` only to measure its uselessness, never to score).

Minimal ledger (SQLite):
`signal_id | type(theme|event|under_acted_conviction) | proposition | date_logged | discourse_metric | external_check | resolution_date | discourse_outcome | external_outcome | lead_time | grant_rating | model_confidence`

---
## 7. Pilot scope (do this before the 500-source build)

Run the full pipeline end-to-end on a bounded, diverse, deliberately *non-correlated* source set. ~20 podcasts already span hundreds of episodes quickly; widen companies to ~25 across categories.

### 7.1 Headline pilot test — backtest against Ten31's own history (Job B validation)
Seed the conviction log with the ~2023 Kirkwood conviction ("compute will ~1000x; energy becomes the binding constraint; interruptible load is the edge") and run the pipeline over a corpus from that period. **Does the under-acted-conviction signal surface the derivative "size up the power-infrastructure picks-and-shovels of the buildout"?** If yes, that is the strongest possible validation that the system does the job Ten31 actually needs. If no, we learn exactly what's missing before building the big version. This backtest is more convincing than any forward-looking hit and should be run first.

### 7.2 Forward success test (Job A)
Does the system surface anything Grant, a domain expert, finds genuinely non-obvious and didn't already know? Even one or two true hits validates scaling. If it only reproduces what normal reading yields, that's a cheap, early "no."
### 7.3 Source list — companies (v1 draft; VERIFY tickers/status at ingestion — this space moves fast)

| Category (seam) | Companies |
|---|---|
| AI compute & hyperscalers | NVIDIA, Alphabet, Microsoft, Amazon, Meta, Broadcom, TSMC, CoreWeave, Oracle |
| Energy & power (binding constraint) | Constellation, Vistra, Talen, GE Vernova, NextEra, Cameco, Vertiv; (watch: Quanta, Oklo, NuScale) |
| Mining <-> AI/HPC (energy-compute seam) | Core Scientific, IREN, TeraWulf, Cipher; (watch: Riot, MARA, Bitdeer) |
| Debasement <-> bitcoin (treasury/custody) | Strategy (MSTR), Coinbase, Block, Twenty One (XXI) |
### 7.4 Source list — podcasts / YouTube (v1 draft; VERIFY feeds/hosts/status — some may have changed)

Roles: CB = consensus barometer, IND = independent/contrarian, DX = domain expert.

| Cluster | Source | Role |
|---|---|---|
| Macro/monetary | Odd Lots | IND / cross-domain |
| Macro/monetary | Forward Guidance | DX |
| Macro/monetary | Macro Voices (energy-heavy) | DX |
| Macro/monetary | The Grant Williams Podcast | IND |
| Macro/monetary | Monetary Matters | DX |
| Macro/monetary | Hidden Forces | IND / cross-domain |
| AI/tech | Dwarkesh Podcast | DX / IND |
| AI/tech | No Priors | DX |
| AI/tech | Latent Space | DX (technical) |
| AI/tech | Cognitive Revolution | DX |
| AI/tech | BG2 | DX (mild Ten31 correlation) |
| AI/tech | a16z Podcast | DX (crypto correlation) |
| Energy | Catalyst w/ Shayle Kann | DX |
| Energy | Columbia Energy Exchange | DX |
| Energy | Doomberg | IND |
| Bitcoin (limited) | The Bitcoin Layer | DX (macro-literate) |
| Bitcoin (limited) | What Bitcoin Did | - |
| Generalist | All-In | CB |
| Generalist | Invest Like the Best | DX / cross-domain |
| Generalist | Lex Fridman | - (wide reach, variable) |

Independence notes for the graph:
- **Deliberately limited bitcoin cluster**, and TFTC / Bitcoin Alpha / the Odell-Bent orbit are excluded: most correlated with Ten31's own priors; convergence there ~ confirming the prior. (Confirmed in the 2021 vision essay: Matt Odell and Marty Bent are Ten31 partners — this cluster is literally Ten31's own network, so convergence there is the prior, not signal.)
- **VC-consensus cluster** (All-In, a16z, BG2, No Priors): shared guests/worldview -> discount internal convergence; All-In retained primarily as a *consensus barometer*.
- **Highest-independence cross-domain:** Odd Lots, Dwarkesh, Hidden Forces, Invest Like the Best.
- Target signal: cross-cluster convergence among sources with no shared guests.

---
## 8. Source credibility over time (DESIGN-FIRST; reuses the ledger)
Do not hand-assign static credibility. Start every source at a neutral prior, then **earn credibility from the prediction ledger**: a source's weight rises when claims it made early later resolve correct — and rises most when a contrarian call it made *against consensus* resolves right. Credibility becomes a learned track record, not an opinion, running on the same ledger from Section 6. Cold-start caveat: early weights are weak until enough predictions resolve, so treat credibility as provisional during/after the pilot. A light bootstrap prior (domain relevance + reach) is acceptable as a placeholder but should decay in favor of earned track record.

---

## 9. Conceded weak forms (keep on hand; not headline claims)
Rebuttals to assumptions, not main points:
- NOT a bet on bitcoin as a medium of exchange / "buy coffee with bitcoin." It's bitcoin as the neutral, non-debasable reserve asset.
- NOT a bet that everyone self-hosts. Freedom tech / AI-data-ownership needs to be the durable, high-value *indispensable option* for those who can't cede control — even as the majority cedes to convenience for frontier LLMs. (Cost of the sovereign option keeps falling; centralization mints its own dissenters with every breach/deplatforming, so the margin is durable and arguably growing.)
- Every thesis leg has a strong form and a conceded weak form; always claim the strong one.

---
## 10. Tech / infra notes — corrected to the operator's ACTUAL running stack
|
||||
All local model serving is already live behind one host, **Spark Control** (a StartOS gateway on the operator's Start9 server that fronts the two DGX Sparks). You do not stand up vLLM/Whisper/Qdrant yourself — you call HTTP endpoints. Full inventory + what's yours-to-build is in **§13**.
|
||||
- **Local LLM (extraction + clustering + most scoring):** `POST /v1/chat/completions` → **Qwen3.6-35B-A3B-NVFP4, 64K ctx**, vLLM on Spark 1. Hot-swappable to other models from the dashboard (one at a time).
|
||||
- **Transcription:** `POST /v1/audio/transcriptions` → **NVIDIA Parakeet TDT 0.6B** (~60x real-time). NOT Whisper.
|
||||
- **Diarization + voiceprints:** `POST /api/audio/diarize-chunk` (Sortformer 4-spk + 192-dim TitaNet voiceprints), `POST /api/audio/transcribe-with-speakers` (merged). Voiceprints → cross-show guest identity (§4.5).
|
||||
- **Embeddings / rerank / search:** `POST /v1/embeddings` (**bge-m3**, 1024-d), `POST /v1/rerank` (bge-reranker-v2-m3), `POST /api/search` (Qdrant hybrid dense+sparse + RRF).
|
||||
- **Vector/store:** Qdrant (hybrid-configured) + SQLite (your ledger/metadata).
|
||||
- **Frontier-boundary sovereignty:** `POST /scrub` + `POST /rehydrate` (redaction gateway) — wrap the proprietary conviction/judge prompts (§4.6).
|
||||
- **Health/discovery:** `GET /api/endpoints`, `GET /api/status`.
|
||||
- **Auth:** none on the LAN today (behind the Start9's TLS + access control). Same-LAN clients use a self-signed-cert skip; the operator can add auth if you run off-LAN.
|
||||
- **Ingestion (yours to build):** EDGAR (filings), earnings-call transcript source (TBD), RSS + YouTube pulling/scheduling/download. Spark Control transcribes audio you fetch; it does not fetch.
|
||||
- **Self-hosted ethos (Start9 ecosystem):** private/proprietary data stays local end-to-end; the only external call is the bounded frontier step, and even that is scrubbed.
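
As a minimal sketch of what "you call HTTP endpoints" means in practice, the snippet below builds requests for the chat and embeddings endpoints using only the Python standard library. The base URL, the model identifier strings, and the helper names are assumptions for illustration (the operator provides the real LAN address, and `GET /api/endpoints` is where to confirm what is actually served). The payload shapes follow the OpenAI-compatible convention, and the JSON-mode field assumes the gateway accepts the OpenAI-style `response_format`.

```python
import json
import ssl
import urllib.request

# Hypothetical base URL; the operator provides the real LAN address.
BASE_URL = "https://spark-control.local"


def lan_ssl_context() -> ssl.SSLContext:
    """TLS context that skips verification of the self-signed cert (same-LAN use only)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before relaxing verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_request(prompt: str, model: str = "local-llm") -> urllib.request.Request:
    """Chat completion request; the model name is a placeholder, not a confirmed id."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: JSON mode is requested the OpenAI-compatible way.
        "response_format": {"type": "json_object"},
    }
    return build_post("/v1/chat/completions", payload)


def embeddings_request(texts: list[str], model: str = "bge-m3") -> urllib.request.Request:
    """Embedding request for a batch of proposition strings."""
    return build_post("/v1/embeddings", {"model": model, "input": texts})


def send(req: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send a prepared request to the gateway and decode the JSON response."""
    with urllib.request.urlopen(req, context=lan_ssl_context(), timeout=timeout) as resp:
        return json.load(resp)
```

Keeping request construction separate from `send` lets the payloads be unit-tested without touching the gateway.
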
|
||||
|
||||
## 11. Build order (suggested)
|
||||
1. Ingestion + local transcription for the pilot source set.
|
||||
2. Extraction (schema in 4.2 is finalized for pilot) — local.
|
||||
3. Embedding + storage; basic clustering.
|
||||
4. Prediction ledger scaffold (turn on immediately, even before scoring is good).
|
||||
5. Conviction log (3.1) + seed the ~2023 Kirkwood conviction for the backtest.
|
||||
6. Temporal scoring (emergence acceleration) + stance distributions (contrarian).
|
||||
7. Source-independence graph + convergence discounting.
|
||||
8. Intersection scoring (consensus-flip) + under-acted-conviction scoring.
|
||||
9. Frontier judge + synthesis + conviction fan-out (shortlist/seeds only).
|
||||
10. Human-eval interface (Grant's ratings, kept independent of model).
|
||||
11. Run the backtest (7.1) FIRST, then the forward pilot (7.2); do the disagreement analysis; decide on scaling.
|
||||
|
||||
## 12. Open DESIGN-FIRST items (resolve before scaling past pilot)
|
||||
- Conviction log format/governance (3.1) — who maintains, how exposure is scored.
|
||||
- Earnings-call transcript data source (4.1).
|
||||
- Credibility cold-start bootstrap weighting (8).
|
||||
- Canonical-topic vocabulary management (4.2) — controlled vs emergent.
|
||||
|
||||
---
|
||||
|
||||
## 13. The existing platform (Spark Control) — what's built, what's yours, where the gaps are
|
||||
|
||||
*Added for the implementing dev, who may not have context on the operator's existing infrastructure. The local-compute backbone this project needs already exists and is in production serving two other apps. Treat this section as ground truth; it supersedes any "stand up vLLM/Whisper/Qdrant" assumption elsewhere.*
|
||||
|
||||
### 13.1 What Spark Control is
|
||||
Spark Control is a StartOS package running on the operator's Start9 server. It is a **single trusted HTTP gateway** in front of two NVIDIA DGX Sparks (GB10 Grace-Blackwell, 128 GB unified memory each, ARM64): **Spark 1** runs the LLM (vLLM); **Spark 2** runs the audio models + the embedding server + Qdrant. Everything below is served from one base URL (the operator provides the LAN address) under one TLS cert, and is OpenAI-compatible where possible. It already powers a fundraising-CRM agent system and a meeting-transcription app, so it's a stable platform, not a prototype.
|
||||
|
||||
### 13.2 Endpoint reference (all live)
|
||||

| Method + path | Backed by | Use in this project |
|---|---|---|
| `POST /v1/chat/completions` | Qwen3.6-35B-A3B-NVFP4, 64K ctx (vLLM) | **Extraction (§4.2)**, clustering helpers, most local scoring. JSON-mode supported. |
| `POST /v1/embeddings` | bge-m3, 1024-d | **Embed propositions (§4.3).** |
| `POST /v1/rerank` | bge-reranker-v2-m3 | Rerank candidate shortlists before the judge (§4.3/4.6). |
| `POST /api/search` | Qdrant hybrid (dense+sparse, RRF) + rerank | Retrieval over stored propositions; corpus-corroboration lookups for Job B. |
| `POST /v1/audio/transcriptions` | Parakeet TDT 0.6B | **Transcribe podcast/YouTube audio (§4.1).** |
| `POST /api/audio/diarize-chunk` | Sortformer + TitaNet | Speaker turns **+ 192-d voiceprints** → guest identity for the independence graph (§4.5). |
| `POST /api/audio/transcribe-with-speakers` | Parakeet + Sortformer | Speaker-attributed transcript in one call. |
| `POST /scrub` + `POST /rehydrate` | redaction gateway + local-Qwen NER | Wrap the proprietary conviction/judge prompts to the frontier model (§4.6). |
| `GET /api/endpoints`, `GET /api/status` | n/a | Service discovery + health. |
| (Qdrant direct) `:6333` | Qdrant v1.16 | Collection mgmt + upserts (ingest side); hybrid named dense+sparse vectors. |

|
||||
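
For readers unfamiliar with the "RRF" in the `/api/search` row, the sketch below shows reciprocal rank fusion in its textbook form: each document scores the sum of `1 / (k + rank)` over the ranked lists it appears in. This is only an illustration of the technique; the gateway performs the fusion server-side, and its actual constant and tie-breaking are not documented here. The `k = 60` default and the function name are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each list is ordered best-first. A document's fused score is the sum of
    1 / (k + rank) across every list that contains it (rank starts at 1).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a dense and a sparse retriever.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = rrf_fuse([dense_hits, sparse_hits])
```

Documents that both retrievers return (`a` and `b` here) rise above documents only one of them found, which is the property that makes hybrid search robust to either retriever missing a match.
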
|
||||
### 13.3 Build vs. provided
|
||||
**Provided (call it, don't build it):** local LLM serving, transcription, diarization, voiceprints, embeddings, reranking, hybrid vector search + storage, and the scrub/rehydrate sovereignty boundary for the frontier step.
|
||||
|
||||
**Yours to build (the actual project):** all ingestion + scheduling (RSS/YouTube pulling, audio download, EDGAR/earnings fetch); the long-audio chunking + cross-chunk speaker stitching; the voiceprint library + guest-matching logic; the claim-extraction prompts (the schema is finalized, the prompt engineering is yours); all clustering/temporal/graph/scoring logic; the source-independence graph; the prediction ledger + conviction log; the frontier orchestration; and the human-eval interface. **Spark Control gives you the model primitives; the Signal Engine's intelligence is entirely your layer on top.**
|
||||
|
||||
### 13.4 Gaps — what the operator's stack does NOT yet serve (flag these to plan around)
|
||||
1. **No ingestion/scheduler.** Spark Control transcribes audio you hand it; it does not fetch RSS, download YouTube, or pull EDGAR. The entire ingestion layer (feeds, downloaders, cron) is greenfield. *(This is the biggest "build," and the doc's §4.1 already owns it — just don't expect any of it from the gateway.)*
|
||||
2. **No batch/queue orchestration → throughput is the real constraint, not capability.** Transcription (Spark 2) and extraction (Spark 1) each run on a **single GPU**, and audio requests must be sent **sequentially** (a parallel-request GPU race returns 503). Transcribing is fast per item (~60x real-time), but backfilling "hundreds of episodes" + a 500-source corpus is a **serial job measured in GPU-hours**, and extraction (one LLM call per chunk over the whole corpus) is the heavier of the two. Plan the backfill as a managed queue with patience, not a real-time fan-out. If backfill latency becomes painful, the levers are: a dedicated transcription window, a second model instance, or accepting the wall-clock time. *(There is no server-side job queue today; you build the queue client-side.)*
|
||||
3. **No earnings-call transcript source wired up** (already flagged §12). EDGAR (filings) is a clean public API; earnings-call *transcripts* need a chosen provider.
|
||||
4. **Embeddings are dense bge-m3 (1024-d), not Matryoshka-truncatable.** Fine at this corpus scale (low hundreds of thousands of propositions is trivial for Qdrant); just don't design around dimension truncation. If proposition-retrieval recall ever becomes the bottleneck, Qwen3-Embedding is the documented A/B upgrade — same `/v1/embeddings` contract.
|
||||
5. **No auth on the LAN endpoints, and Qdrant has no auth/backups yet.** Acceptable for a LAN pilot; if this corpus + ledger becomes long-lived and valuable, the operator would add a Qdrant API key + snapshots (a known, small hardening task) before it becomes the system of record.
|
||||
6. **One local LLM loads at a time on Spark 1.** If you ever want a *different* local model for extraction vs. a local synthesis step, that's a (slow) hot-swap, not concurrent. For the pilot, Qwen3.6 for all local steps is the assumption.
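
Gap 2 above says the queue has to live client-side and that audio requests must never overlap. A minimal sketch of that pattern follows: jobs run strictly one at a time, and a 503 triggers exponential backoff before the same job is retried. The `GatewayBusy` exception, the function names, and the retry settings are illustrative assumptions; the transport is injected so the logic can be exercised without the gateway. A small helper also turns the ~60x real-time figure into a rough transcription estimate.

```python
import time
from collections import deque
from typing import Callable, Iterable


class GatewayBusy(Exception):
    """Raised by the (caller-supplied) transport when the gateway answers 503."""


def run_serial(jobs: Iterable[str], send: Callable[[str], object],
               max_retries: int = 5, base_delay: float = 2.0,
               sleep: Callable[[float], None] = time.sleep):
    """Process jobs one at a time; back off and retry a job when the gateway is busy.

    Returns (results, failed): results maps job id to the transport's return
    value, failed lists the job ids that were still busy after max_retries.
    """
    queue = deque(jobs)
    results: dict[str, object] = {}
    failed: list[str] = []
    while queue:
        job = queue.popleft()
        for attempt in range(max_retries + 1):
            try:
                results[job] = send(job)  # never more than one request in flight
                break
            except GatewayBusy:
                if attempt == max_retries:
                    failed.append(job)
                else:
                    sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    return results, failed


def transcription_gpu_hours(audio_hours: float, realtime_factor: float = 60.0) -> float:
    """Rough transcription time, given audio length and the speed-up over real time."""
    return audio_hours / realtime_factor
```

As a purely hypothetical sizing example, 300 hours of backlog audio at 60x real-time works out to about 5 GPU-hours of transcription, before retries and before the heavier extraction pass.
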
|
||||
|
||||
### 13.5 Net for the dev
|
||||
The "~95% local compute" design goal in §4 is not aspirational — it's already the operating reality, because every local model this pipeline needs is a live endpoint on hardware the operator runs. Your job is the ingestion, the extraction prompts, and the entire stats/graph/ledger intelligence layer. Wire the model calls to Spark Control; build the brain on top.
|
||||