Ten31 Signal Engine — Design v2 (hypotheses + falsification protocol)

Status: This is not a settled-decisions document. It is a set of design hypotheses with falsification criteria, plus the evaluation protocol that is supposed to prove them wrong. It exists because the dev, the implementing agent, and the principal converged in argument — and convergence among three reasoning processes that share priors is not evidence the design is right. The discipline from here is empirical: stop improving the design by argument; subject it to data that can falsify it.

The single most important correction from the v1 review: all evaluation so far was recall-only ("given something we know happened, did the engine catch it?"). The question a fund actually cares about is precision ("of everything the engine fires on, what fraction pans out?"). You cannot adjudicate a precision/recall tradeoff — which is what the whole gate debate is — while measuring one side. Therefore the evaluation protocol (§1) is the first-class artifact; the design hypotheses (§2) are secondary and only matter once §1 can grade them.


1. Evaluation protocol (the first-class section)

1.1 Pre-registration (committed before any outcome data is pulled)

To avoid the same hindsight leakage that the hand-written fan-out had (we picked the derivatives we knew were right, phrased to match the corpus), the resolution criteria below are fixed before fetching any price/external data. A single uniform rule is applied to all derivatives — no per-derivative threshold tuning.

Resolution rule (uniform, pre-registered):

  • Each derivative maps to a proxy basket of liquid equities most directly exposed to it (committed in seeds/resolution.K2023.yaml).
  • Build an equal-weight, start-normalized basket index over 2023-01-01 → 2025-06-30.
  • "Confirmed real" iff the index first reaches ≥ +40% vs. its 2023-01 baseline on some date D, and is still ≥ +25% 90 days after D (anti-transient-spike). D = the repricing date.
  • Otherwise "not confirmed" in-window.
  • These thresholds (40% / 25% / 90d) and the baskets are the pre-registration. They are deliberately coarse and uniform so they cannot be gamed per-derivative.
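
The resolution rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the resolver itself. The function names, the `{ticker: {date: close}}` input shape, and two readings of the rule are assumptions: D is taken as the first crossing only (a later re-crossing is not retried), and "still ≥ +25% 90 days after D" is checked at the first observation on or after D + 90 days.

```python
from datetime import date, timedelta

def basket_index(prices):
    """Equal-weight, start-normalized index.

    prices: {ticker: {date: close}} (assumed shape; all tickers share the same dates).
    Returns a date-sorted list of (date, level) whose first level is 1.0.
    """
    dates = sorted(next(iter(prices.values())))
    start = dates[0]
    return [
        (d, sum(series[d] / series[start] for series in prices.values()) / len(prices))
        for d in dates
    ]

def resolve(index, up=0.40, hold=0.25, hold_days=90):
    """Apply the uniform rule; return (confirmed_real, repricing_date or None)."""
    baseline = index[0][1]
    for d, level in index:
        if level / baseline - 1.0 >= up:
            # D = first crossing of +40%; check the level once the hold period has elapsed.
            check_from = d + timedelta(days=hold_days)
            later = [lv for dd, lv in index if dd >= check_from]
            if not later:
                return False, None  # window ended before the hold check could run
            confirmed = later[0] / baseline - 1.0 >= hold
            return confirmed, (d if confirmed else None)
    return False, None  # never reached +40% in-window
```

Because the thresholds are keyword defaults rather than per-derivative arguments in a config, any call that overrides them is visible in code review, which fits the "no per-derivative tuning" intent.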

1.2 The engine's two signal dates (to price what the gate costs)

For each derivative we record both:

  • cleared_date — earliest as_of where it cleared the evidence bar (the engine's actual fire).
  • whisper_date — earliest as_of where the engine saw accelerating, multi-claim corroboration regardless of the independence floor (n_confirmed ≥ 4 AND a_corrob > 0). This is the signal the gate may have suppressed.

The gap between whisper_date and cleared_date is the lead time the gate costs us, measured, not argued.
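
A sketch of how the two dates and the gap between them might be derived from per-as_of scoring snapshots. The `Snapshot` record and both function names are hypothetical. The `cleared` flag stands in for whatever the engine's actual evidence bar computes, and `a_corrob` is read here as the corroboration acceleration term.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Snapshot:
    as_of: date
    n_confirmed: int
    a_corrob: float  # corroboration acceleration at this as_of (assumed meaning)
    cleared: bool    # True if the derivative cleared the full evidence bar

def signal_dates(snapshots):
    """Return (whisper_date, cleared_date); either may be None."""
    ordered = sorted(snapshots, key=lambda s: s.as_of)
    # Whisper ignores the independence floor: n_confirmed >= 4 AND a_corrob > 0.
    whisper = next((s.as_of for s in ordered
                    if s.n_confirmed >= 4 and s.a_corrob > 0), None)
    cleared = next((s.as_of for s in ordered if s.cleared), None)
    return whisper, cleared

def gate_cost_days(whisper, cleared):
    """Lead time the gate cost, in days; None if either date is missing."""
    if whisper is None or cleared is None:
        return None
    return (cleared - whisper).days
```

A derivative with a whisper but no cleared date returns `None` for the gap. Those are the candidates for the FN cell in §1.3, not zero-cost cases.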

1.3 The confusion matrix (precision AND recall)

Per derivative, classify:

  • TP = cleared and confirmed-real (report lead time = repricing_date − cleared_date).
  • FP = cleared and not-confirmed (false alarm — the precision killer).
  • FN = had a whisper / signal but never cleared, and confirmed-real (the gate ate a real signal — the uranium hypothesis).
  • TN = no signal and not-confirmed.

Report precision = TP/(TP+FP) and recall = TP/(TP+FN) — and the whisper-level matrix too (re-classify using whisper_date), so we can see what precision/recall would be without the independence floor. The delta between the two matrices is the empirical answer to the gate debate.
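
The two matrices can be computed from the same rows by switching which signal counts as a fire. The sketch below is illustrative only. The row keys are hypothetical, and it makes one assumption the four cells above leave open: any derivative that did not fire at the chosen level counts as FN if confirmed-real and TN otherwise.

```python
from collections import Counter

def classify(fired, confirmed_real):
    """Map one derivative to a confusion-matrix cell at the chosen signal level."""
    if fired:
        return "TP" if confirmed_real else "FP"
    return "FN" if confirmed_real else "TN"

def matrix(rows, level):
    """rows: dicts with boolean 'cleared', 'whisper', 'confirmed_real' (assumed keys).

    level: 'cleared' for the engine's actual fires, 'whisper' for the
    no-independence-floor counterfactual.
    """
    return Counter(classify(r[level], r["confirmed_real"]) for r in rows)

def precision_recall(m):
    """Return (precision, recall); None where the denominator is zero."""
    tp, fp, fn = m["TP"], m["FP"], m["FN"]
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall
```

Calling `precision_recall(matrix(rows, "cleared"))` and `precision_recall(matrix(rows, "whisper"))` on the same rows gives the two pairs whose difference is the measured cost or benefit of the floor.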

1.4 Held-out precision case (pending the principal)

The 9-derivative matrix measures precision on one (known-winner) conviction. For a clean false-positive rate we need an adversarial case: a conviction Ten31 held that did not pan out, seeded and run the same way. Open — Grant to provide a candidate failed/abandoned conviction. Without it, precision is measured on a favorable sample and should be reported as such.

1.5 The deeper thing the resolver tests

The lead-time distribution (not any single signal) tests the engine's foundational assumption — that discourse leads capital with usable, consistent lead time (§6.2). If the distribution is wide, near-zero, or sometimes negative, the leading-indicator thesis partially collapses. The resolver is therefore not just a grader of signals; it is a test of the premise. This is why it is built first.


2. Design hypotheses (each with a falsification criterion)

Adopted from the v1 review; none are settled until §1 grades them.

  • S1 (STANDING RULE) — resolve on the OUTCOME (scaled substance), never on milestones or ENABLERS. Two Battery lessons:

    • Milestone vs. substance: the same reality resolved YES on "≥1 major institution enters" (Goldman, one loan, 2022-04) and NO on "≥$X capital deployed at scale" (none).
    • Enabler vs. outcome (the sharper one): the regulatory ENABLER improved (SAB-121 → SAB-122, Jan 2025) yet the OUTCOME didn't move — Battery still could not raise institutional dollars post-unblock. An engine fanning the enabler ("regulation is unblocking") would have falsely "confirmed" supply on the 2025 deregulation. Battery's binding constraint was never regulatory permissibility (nothing legally barred a capital provider from funding it) — it was capital-provider willingness to fund a novel product from a new company, which persisted as NO even after the backdrop improved. So: phrase derivatives as deployed/at-scale substance ("≥$X capital deployed", "sustained volume > Y"), never as checkbox milestones ("a bank entered") or enablers ("regulation permits it now"). Enablers/catalysts are CONTEXT, not resolution — they can resolve YES while the conviction fails.
  • H1 — The contested gate is the EISC independence FLOOR (+ same-cluster discount), not cross-cluster. v1 internal diagnosis: of 9 derivatives, the binding gate is overwhelmingly EISC < 2.0 (uranium: 15 confirmations, 5 sources, highest acceleration in the run, suppressed by EISC=1.6; picks-and-shovels' early 2023-11 whisper suppressed by EISC=1.6, delayed 6 months). Falsify: if the confusion matrix shows the EISC-suppressed derivatives did not reprice early, the floor was right and v1's relaxation instinct is wrong.

  • Corrected gate framing. Cross-cluster was always a crude proxy for independence of information — job-invariant, not a Job-A-only rule. Keep the two-tier resolution (log the single-cluster whisper to start the clock; act only on broader/independent confirmation) but as independence-tiering, not job-splitting.

  • H2 — Weight independence by claim_type. Operational/descriptive disclosure from a self-interested source (order book, capex, interconnect queue) is more trustworthy than its predictive/opinion claims. Falsify: if down-weighting predictive self-interested claims does not improve precision on the held-out case.

  • H3 — Independence is lineage, not cluster: an edge-population problem on the existing EISC graph (source_edges.edge_type already supports citation/community), not a rebuild. Cheap interim: source-type independence priors + claim_type (H2). NOT temporal-precedence as a standalone discount. The dev's catch: "later near-identical = discount" misclassifies independent re-derivation as echo (3 operators reporting queues = the corroboration we want). Precedence may only assign direction to an already-detected dependency edge, never stand alone.

  • H4 — Replace the 2nd-derivative with a persistence / level-crossing test on the corroboration arrival rate ("crossed N and stayed non-decreasing across k windows; shrinkage at low n"), with per-source-type window cadence (filings quarterly, podcasts weekly). Falsify: re-run the backtest with the persistence estimator; if it does not reduce the sign-flip variance visible in Appendix A, the 2nd-derivative wasn't the problem.

  • H5 — Two-sided Job B (a product feature, not just a defense). Track net independence-weighted corroboration (affirms − denies); accelerating contradiction + over-exposure = an "eroding-conviction / reduce" signal of equal value to under-acted-conviction. Also the structural reflexivity defense (Job B is confirmation-seeking by construction: fan out from beliefs, filter to affirms-only).

  • H6 — The frontier fan-out is the UNTESTED half, and it is the half that maps to the actual miss (§1.1). The backtest hand-wrote the derivative tree (hindsight). Design the test: seed the 2023 conviction, give the model 2023-only context, let it propose derivatives, score that tree's precision/recall vs. what actually repriced, and compare frontier-proposed vs. hand-written. Grade on ALTITUDE, not just topical correctness (per S1): "does it propose the derivative at the right altitude — scaled substance vs. first-instance milestone" — because that choice alone flips the verdict (Goldman vs. Cantor). A model that proposes topically-right but milestone-altitude derivatives will look prescient on headlines while missing the real conviction state.

  • Extraction fixes (low-risk): Item-7 (MD&A) targeting for filings; relations populated at a linking stage over the embedded store, not by the extractor (a spec clarification the dev owns).
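
For H4, the level-crossing half of the persistence test can be sketched as follows. This is one possible reading of "crossed N and stayed non-decreasing across k windows". The function name and the list-of-rates input are assumptions, and both shrinkage at low n and per-source-type window cadence are left out of the sketch.

```python
def persistent_crossing(rates, level, k):
    """Index of the first window whose rate is >= level and whose run of k
    windows (starting at that window) never decreases; None if no such run.

    rates: corroboration arrival rate per window, oldest first (assumed input).
    """
    for i in range(len(rates) - k + 1):
        run = rates[i:i + k]
        if run[0] >= level and all(b >= a for a, b in zip(run, run[1:])):
            return i
    return None
```

Unlike a second-derivative estimate, a single noisy window cannot flip the sign of this test. A one-window spike that falls back fails the non-decreasing check and is ignored.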


3. Build sequence

  1. Resolver + the pre-registered confusion matrix on the 9 derivatives already run (this measures precision, settles H1 empirically, and tests the lead-time premise). ← building now.
  2. Independence/estimator rework (H2 claim_type + source-type priors; H4 persistence estimator) — in parallel; makes the next backtest trustworthy.
  3. Frontier-fan-out test (H6) — right behind the estimator; cheap, strategically central.
  4. Two-sided Job B (H5).
  5. Held-out failed-conviction case (§1.4) — once provided.
  6. Broad corpus expansion — last, and lineage-aware (toward independent vantage points: operator/expert-network, supply-chain OEMs/EPCs/ISOs, FERC/interconnection filings, policy — NOT more correlated sell-side/trade-press).

4. Standing discipline

Every additional round of elegant consensus past this point has decreasing value and increasing risk of mistaking coherence for correctness. Changes to scoring parameters (the EISC floor especially) must be justified by the confusion matrix / held-out precision, not by making a known case clear. The resolution criteria in §1.1 are frozen; if they change, that is a pre-registration event to be logged and justified, not a quiet tune.

4.1 Pre-registration change log (post-hoc normalizations — logged per §4)

Any judgment call made after seeing outcome data is a logged pre-registration event, with reasoning. Two normalizations on the adversarial cases (both correct, both post-outcome; logging them is what protects the labels from "we graded it how we wanted"), followed by one retracted causal gloss:

  1. STRIKE-card-rail-disruption polarity normalization. The verifier agent returned verdict=confirmed, but its evidence was zero bitcoin/Lightning erosion attribution across all five Visa/MC 10-Ks. The confirm_iff criterion was "erosion IS attributed," so the criterion was NOT met. Normalized to played_out: no (thesis failed). Reasoning: the agent used "confirmed" to mean "confirmed the no-erosion finding"; the pre-registered criterion is about whether disruption occurred, which it did not. Polarity, not substance.
  2. BATTERY-incumbent-entry milestone-vs-substance split. The criterion as written ("≥1 major institution publicly entered") resolves YES (Goldman, one loan, 2022-04). Recorded as played_out: token with the explicit note that the substance (institutional capital at scale, the separate BATTERY-institutional-supply check) resolved NO. Reasoning: the milestone phrasing is a weak proxy for the thesis; logging both verdicts on the same reality is the evidence for standing rule S1. The criterion phrasing itself was the flaw, which is now fixed forward by S1 — but the original label is preserved, not silently rewritten.
  3. BATTERY causal misattribution — CORRECTION (principal ground-truth, 2026-06-08). The agent's corpus-scoping narrative claimed the institutional-supply failure was "structurally regulatory" (SAB-121 blocked it). This was wrong and is retracted. Principal correction: nothing legally restricted capital providers from funding Battery; the binding constraint was capital-provider willingness to fund a novel product / new company. Proof it was NOT regulatory: the 2025 regulatory unblock did NOT produce institutional dollars for Battery (still unraised as of 2026). The frozen pre-registered criterion (BATTERY-institutional-supply = "$X deployed at scale", played_out: no) was correct and unchanged; only the post-hoc causal gloss was wrong — a textbook instance of the plausible-narrative failure mode this engine exists to catch, logged here rather than buried. Policy/regulatory sources are demoted to CONTEXT for the Battery supply axis, not its resolver (see BATTERY_corpus_scope.md, corrected).

v2.1 amendments (post-confusion-matrix; two protocol corrections + two adversarial cases)

The first confusion matrix (§5 below / the power-infra run) and the dev's review forced two protocol corrections before this doc ossifies, plus the two failed-conviction cases the precision axis was missing.

Correction A — measure RUNWAY, not lead-time-vs-first-tick

Lead-time (days before the first price tick) is a trading metric. Ten31 is a long-duration, often-private holder. The right metric is remaining durable runway at signal = the fraction of the eventual durable move still ahead when the signal fired: runway = (durable_peak − index_at_signal) / (durable_peak − baseline).

  • High runway = the signal was actionable for a holder even if "late" by a trading clock.
  • Do not penalize a modestly-late public read — public comps lead the private rounds we actually enter, so a signal a few weeks/a month after the first public tick still left a real window. This is a lens on the public number, not a private-markets dataset to source. Public markets remain the workhorse instrument.
  • Why it matters here: the power-infra cells with "negative lead-time" (uranium −46d, picks −116d) had most of a +225%/+392% move still ahead at signal → high runway → valuable to a holder. Lead-time mis-scored them; runway corrects it.
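
The runway formula is small enough to state directly in code. A sketch, assuming all three inputs are levels of the same start-normalized index (baseline 1.0 by default); the function name is hypothetical.

```python
def runway(durable_peak, index_at_signal, baseline=1.0):
    """Fraction of the eventual durable move still ahead when the signal fired.

    1.0 means the signal fired at baseline; 0.0 means it fired at the peak.
    """
    move = durable_peak - baseline
    if move <= 0:
        raise ValueError("no durable move above baseline; runway is undefined")
    return (durable_peak - index_at_signal) / move
```

As a purely illustrative case: on a +225% durable move (peak 3.25), a signal that fired with the index already at a hypothetical 1.45 gives (3.25 − 1.45) / (3.25 − 1.0) = 0.8, so 80% of the move was still ahead even though the first price tick had passed.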

Correction B — tag derivatives by DISTANCE-FROM-EDGE, never filter on it

The engine's job is to surface derivatives — including ones outside the current mandate (the original AI/compute miss was a mandate-expansion failure; an engine that pre-filters to in-mandate reproduces exactly that blindness). So every derivative carries distance_from_edge ∈ {in_mandate, one_hop, two_hop} for triage only. (Uranium = two_hop — the most dramatic power-infra cell but the least decision-relevant; the engine should still surface it, tagged, and let judgment decide.) Schema: fanout_nodes.distance_from_edge. No scorer may filter on it.
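
A sketch of the tag used for triage ordering only. The enum mirrors the three values above; the `(name, distance, score)` tuple shape and the `triage_order` helper are hypothetical. The point is that the helper sorts and never drops a node.

```python
from enum import IntEnum

class DistanceFromEdge(IntEnum):
    in_mandate = 0
    one_hop = 1
    two_hop = 2

def triage_order(nodes):
    """Order (name, distance, score) tuples for review: nearest to the mandate
    first, then highest score first. Every node is kept; this is a sort, not a filter.
    """
    return sorted(nodes, key=lambda n: (n[1], -n[2]))
```

Keeping the tag out of the scorer's inputs entirely, and applying it only at this presentation step, is one way to make the "no scorer may filter on it" rule hard to violate by accident.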

The two adversarial failed-conviction cases (the negatives — §1.4 satisfied)

Both are convictions Ten31 still holds where the falsifiable thesis mis-fired. Pre-registered criteria committed in seeds/resolution.STRIKE2022.yaml / resolution.BATTERY2022.yaml before outcome-labeling; seeds in conviction_log.adversarial.seed.yaml + fanout.STRIKE2022/BATTERY2022.seed.yaml.

  • STRIKE2022 (the single most valuable negative). Failed thesis = Lightning-as-retail-payments-network displacing card rails. It was narrative-driven — the bitcoin-podcast cluster told this story loudly in 2022 = the exact single-cluster, reflexive, talk-our-own-book corroboration the relaxed gate would wave through. The test: if the engine clears the Lightning-payments derivatives on bitcoin-cluster chatter that never showed up in real retail-payment volume, that's the false positive that proves the cross-cluster/lineage discipline. Also a two-sided test: did non-bitcoin sources ever corroborate, or only the book-talkers?
  • BATTERY2022 (the two-sided / timing case). Thesis = bitcoin-as-collateral credit goes mainstream (D1). Right on demand, early/wrong on supply (institutional lending capital didn't arrive at scale). The test: would the two-sided scorer (net independence-weighted affirms − denies) have surfaced that the supply-side corroboration was flat while demand-side rose, which would be a genuine, non-hindsight "timing is early" disconfirmation?

Corpus implication (flagged, needed to RUN these — not yet ingested)

  • The STRIKE false-positive test requires the bitcoin-cluster Lightning-payments discourse to be in the corpus — which is the orbit §7.4 deliberately excluded (TFTC / Bitcoin Alpha / the Odell-Bent network, as "Ten31's own priors"). The test of the echo-chamber detector needs the echo chamber in the corpus. That exclusion was right for Job A discovery and is exactly wrong for this precision test.
  • Both cases resolve on adoption-evidence indicators (Lightning retail volume; institutional BTC-credit capital / SAB-121 / bank entry), not equity-price baskets, which is a different resolver leg than the power-infra prices. Gathering that evidence (pre-registered) + ingesting the 2022–2024 bitcoin/institutional corpus is the next build to actually run them.

v2.2 (post-bitcoin-cases) — three conditions BUILT, outcomes verified, scope set

The three conditions (all built + unit-verified)

  1. own_network quarantine — a PERMANENT live-scoring rule, not backtest-only. sources.own_network + effective_independent_N(mode=...) default to mode='live' everywhere, which DROPS own_network sources (the Odell/Bent partner orbit = Ten31 listening to itself). The only path that keeps them is two_sided.trajectory(mode='test') — the explicit reflexivity fixture. Confirmed standing: every forward bitcoin-thesis signal automatically excludes the own_network orbit.
  2. Resolution criteria tightened to named/countable/dated (hostile-checker standard).
  3. Two-sided net-corroboration (signals/two_sided.py), independence-weighted affirms − denies, is the instrument for the adversarial cases (NOT runway).
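
How conditions 1 and 3 interact can be shown with a standalone sketch. It does not reproduce the signatures in signals/two_sided.py or effective_independent_N. The `Claim` record, `visible`, and `net_corroboration` are hypothetical stand-ins that assume each claim carries a stance, an independence weight, and an own_network flag.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    source: str
    own_network: bool
    stance: int          # +1 affirms, -1 denies
    independence: float  # independence weight in [0, 1] (assumed scale)

def visible(claims, mode="live"):
    """Live mode drops own_network sources; test mode keeps them."""
    if mode not in ("live", "test"):
        raise ValueError(f"unknown mode: {mode!r}")
    return [c for c in claims if mode == "test" or not c.own_network]

def net_corroboration(claims, mode="live"):
    """Independence-weighted affirms minus denies over the visible claims."""
    return sum(c.stance * c.independence for c in visible(claims, mode))
```

Defaulting `mode` to `"live"` means a caller has to opt in explicitly to see the own_network orbit, which matches the intent that only the reflexivity fixture ever does so.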

Adversarial outcomes (hostile-verified, frozen — seeds/resolution_outcomes.adversarial.yaml)

  • STRIKE2022 = thesis DEAD on all 3 (merchant integration <2-of-3 at scale; no material retail Lightning volume; zero card erosion in 5 Visa/MC 10-Ks). The clean false-positive test.
  • BATTERY2022 = demand REAL (≥4 named lenders); institutional supply FAILED in-window (Cantor $2B announced 2024-07, deployed 2025-05); incumbent-entry TOKEN (Goldman one loan 2022-04 — see S1); SAB-121 rescinded 2025-01-23 (edge).

Scope (the two cases want OPPOSITE things — do not conflate)

  • STRIKE: narrow and FROZEN. 3 own_network + 3 independent shows, exactly. Purpose = prove the quarantine: clear in test mode (own_network visible), silent in live (own_network dropped). Widening muddies the contrast — narrow is the feature.
  • BATTERY: scoped SEPARATELY and broader, lineage discipline ON. Resolution lives in institutional-credit / policy discourse (SAB-121 commentary, bank earnings-call digital-asset-lending treatment, private-credit trade press, named lenders' disclosures) — NOT bitcoin podcasts. Doubles as the first installment of the §3-step-6 lineage-aware expansion. TRAP: most institutional-credit press is downstream of the same few announcements (every outlet on the Cantor $2B headline = ONE event, not corroboration) — tag downstream/same-event at ingestion.
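
The same-event trap in the BATTERY scope can be illustrated with a small grouping sketch. It assumes each ingested item has already been given an `event_key` identifying the underlying announcement. That key, the `(outlet, event_key)` tuple shape, and both function names are hypothetical; how the key gets assigned at ingestion is not specified here.

```python
from collections import defaultdict

def group_by_event(items):
    """items: iterable of (outlet, event_key). Returns {event_key: [outlets]}."""
    by_event = defaultdict(list)
    for outlet, event_key in items:
        by_event[event_key].append(outlet)
    return dict(by_event)

def corroboration_count(items):
    """Count distinct underlying events, not articles."""
    return len(group_by_event(items))
```

Under this counting, any number of outlets repeating one headline contributes a single unit of corroboration, and only a separately sourced disclosure adds a second.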

Standing line (unchanged)

These two cases prove the failure-mode machinery (quarantine catches the echo; two-sided catches the timing split). They are hindsight-known. The precision number that decides whether this is worth running comes only from FORWARD operation on signals nobody pre-selected. Once these run clean, the next move is LIVE, not another backtest.