Triage evaluation findings into AGENTS.md Current state

Add independent evaluation report (EVALUATION.md)
2026-06-13 14:39:41 -05:00 · 2026-06-13 14:39:40 -05:00
2 changed files with 65 additions and 3 deletions
@@ -82,9 +82,14 @@ open /Applications/Ten31Transcripts.app
 - Never commit to `main` or force-push a shared branch; branch first and ask.
 ## Current state
-Present tense; overwritten each session. 69 tests pass; `/Applications/Ten31Transcripts.app` matches HEAD and runs; working tree clean and pushed to `origin`/`main`.
+Present tense; overwritten each session. 69 tests pass; `/Applications/Ten31Transcripts.app` matches HEAD and runs; working tree clean and pushed to `origin`/`main`. A full independent evaluation ran 2026-06-13 → `EVALUATION.md` (committed at repo root; overwritten + re-committed each run for a reviewable diff); its findings are triaged into the lists below.
 - **Working:** call detection (Meet/Zoom/Teams/Signal), dual-track capture, dual-channel + chunked backend hand-off, speaker reconciliation, recap (`transcript.md` + recap-relay-styled `recap.html`), speaker editor, configurable chunk length, standalone Settings window.
 - **In progress:** the Meet visual fix (reject solid camera-off tiles) is unverified end-to-end — no clean run exists yet; the saved Meet session's `visual_timeline.json` predates the fix.
- **Decided but not implemented:** none open (deferred items live in `ROADMAP.md`).
+- **Work queue (P1 — do first):** the TLS-trust override is global and on by default — it returns `URLCredential(trust:)` for *any* host (`InsecureTrustDelegate.swift:22`; default-on at `AppSettings.swift:109`), so the full mic+system audio, visual timeline, and voiceprint upload is MITM-able by anyone on the LAN. Scope the override to the configured backend host and pin the Start9 root CA (or the leaf SPKI hash); default skip-TLS to off. This gates trusting any later backend-integration test.
 - **Known debt (P2 — fix before wider use):**
  - `RecapAnalyzer.mmss()` fatally crashes on NaN/∞ (reproduced 2×); a malformed/MITM'd backend `duration` (e.g. `1e400` → `Double.infinity`) aborts the app at recap-render time — add a finite-guard fallback (`RecapAnalyzer.swift:137`).
  - README is stale by six phases — still says "Phase 0 (scaffold) / no audio capture, detection, or backend hand-off yet" for a shipped Phase-6 app; same lie in source comment `AppSettings.swift:7`. Rewrite both to match reality.
  - `SessionController` (670 lines, the most concurrency-dense file) has zero unit tests — cover `pendingAutoStop` (auto-start-then-immediate-call-end) and the visual-adoption generation guard before any refactor.
 - **Deferred (P3 — later decision or bulk cleanup; full evidence in `EVALUATION.md`):** `docs/` specs drifted from the dual-channel API + recap phase; `docs/01` §7 lists already-resolved open items; `docs/02` §2.10 claims MenuBarUI features that don't exist; AGENTS.md Layout listings under `Audio/`/`Detection/` are incomplete; the `manifest.json` sha256 contract is specced but never written; env-var precedence footgun (saved URL shadows `SPARK_BACKEND_URL`); `SessionController` owns three jobs (extract the open-panel UI); unused `NSAppleEventsUsageDescription`; unauthenticated LAN backend (consider a shared bearer token).
 - **Known bugs:** Meet speaking-detection is sparse (faint blue border); the mic channel emits some sub-second junk "self" fragments; the same person on desktop-mic vs phone-speakerphone does not unify by voiceprint.
- **Next:** (1) re-process the saved Meet session in the app, then read its `speakers.json` + `cluster_fingerprints.json` to confirm ~4 speakers recover; (2) confirm Settings → Your name = "Grant"; (3) record a fresh Meet call to validate the fix on a clean capture.
+- **Next (product validation — no agent could reach the live backend, so this stays manual):** (1) re-process the saved Meet session in the app, then read its `speakers.json` + `cluster_fingerprints.json` to confirm ~4 speakers recover; (2) record a fresh Meet call to validate the visual fix on a clean capture. (The old "confirm Your name = Grant" item is moot — the committed default is the generic `"Me"`; "Grant" only ever lives in local UserDefaults.)
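
The P1 work-queue item above (scope the override to the configured backend host and pin a root CA) could look roughly like the following. This is a minimal sketch, not the app's actual `InsecureTrustDelegate`: the class name `PinnedBackendTrustDelegate`, its initializer parameters, and the idea of shipping the CA as DER data are all assumptions made for illustration. Only Foundation and Security framework calls are used.

```swift
import Foundation
import Security

/// Hypothetical replacement for a trust-everything delegate.
/// Accepts a server only when (a) the challenge is for the configured backend host
/// and (b) the presented chain validates against one pinned root CA.
final class PinnedBackendTrustDelegate: NSObject, URLSessionDelegate {
    private let backendHost: String        // assumed: host taken from the saved backend URL
    private let pinnedRootCA: SecCertificate

    /// `rootCADER` is the DER-encoded root CA (assumed to be bundled with the app).
    /// Returns nil when the data does not parse as a certificate.
    init?(backendHost: String, rootCADER: Data) {
        guard let cert = SecCertificateCreateWithData(nil, rootCADER as CFData) else {
            return nil
        }
        self.backendHost = backendHost.lowercased()
        self.pinnedRootCA = cert
        super.init()
    }

    func urlSession(_ session: URLSession,
                    didReceive challenge: URLAuthenticationChallenge,
                    completionHandler: @escaping (URLSession.AuthChallengeDisposition,
                                                  URLCredential?) -> Void) {
        let space = challenge.protectionSpace
        // Any other host, or any non-server-trust challenge, gets normal system validation.
        guard space.authenticationMethod == NSURLAuthenticationMethodServerTrust,
              space.host.lowercased() == backendHost,
              let trust = space.serverTrust else {
            completionHandler(.performDefaultHandling, nil)
            return
        }
        // Restrict the anchor set to the pinned CA so system roots are not consulted.
        guard SecTrustSetAnchorCertificates(trust, [pinnedRootCA] as CFArray) == errSecSuccess,
              SecTrustSetAnchorCertificatesOnly(trust, true) == errSecSuccess else {
            completionHandler(.cancelAuthenticationChallenge, nil)
            return
        }
        var error: CFError?
        if SecTrustEvaluateWithError(trust, &error) {
            completionHandler(.useCredential, URLCredential(trust: trust))
        } else {
            // Chain does not lead to the pinned CA: refuse rather than fall back.
            completionHandler(.cancelAuthenticationChallenge, nil)
        }
    }
}
```

Pinning the CA rather than the leaf keeps the pin valid across leaf-certificate renewals; pinning the leaf SPKI hash, the alternative named in the work-queue item, would need a different comparison step and is not shown here.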
@@ -0,0 +1,57 @@
 # Evaluation — ten31-transcripts — 2026-06-13
 Intent: A native macOS menu-bar app (Swift/SwiftUI/AppKit, macOS 13+, generated by XcodeGen) that auto-detects conference calls (Meet/Zoom/Teams/Signal), records dual-track audio while watching the call window via ScreenCaptureKit for active-speaker cues, and hands audio + a visual speaker timeline to a self-hosted SparkControl backend that performs transcription/diarization/speaker-naming — producing named transcripts and recaps.
 Agents run: evaluator, security-auditor, exerciser, doc-auditor. Skipped: start9-spec-checker (no StartOS-wrapper markers found), reviewer (working tree clean — no diff to review).
 ## Verdict
 This is a genuinely well-engineered personal tool: it builds cleanly with the documented `DEVELOPER_DIR` toolchain, all 69 tests pass in ~1s (claim verified empirically), the architecture is disciplined (the app records/watches/packages/reconciles and correctly delegates all ML to the backend), and secrets hygiene is verifiably clean — the documented 2026-06-13 history scrub survives a full-ref grep with zero leaked hosts or IPs. The headline risk is the TLS trust model: certificate validation is bypassed **globally for any host, on by default**, so anyone on the LAN can MITM the full upload of call audio, the visual timeline, and stored voiceprints — and the same bypass makes a reproducible recap-time crash (`mmss()` on a malformed `Double`) attacker-reachable. The second-largest issue is documentation: the README still describes "Phase 0 (scaffold)" for an app that has shipped through Phase 6, and the `docs/` specs have diverged from the dual-channel API and the recap phase. Code-wise this is close to ready for its single-user, LAN-only purpose; the fixes are well-scoped and mostly small. Fix the TLS model first — it gates the safety of every backend-integration test that follows.
 ## Cross-referenced findings
 - **TLS bypass scope — contradiction resolved against the evaluator.** The evaluator rated Security 4 partly on the basis that the TLS-skip is "intentional/scoped" (`InsecureTrustDelegate.swift:9`). The security-auditor read the implementation and found it is **not** scoped: `URLCredential(trust:)` is returned for any host without a host/fingerprint/CA check (`InsecureTrustDelegate.swift:22`), and it is default-on (`AppSettings.swift:109`). The auditor's direct evidence wins; the Security lens is adjusted down accordingly (see Scorecard).
 - **One attack chain, two agents.** The exerciser independently reproduced (twice) a fatal crash in `RecapAnalyzer.mmss()` on `Double.nan`/`Double.infinity` (`RecapAnalyzer.swift:137`), reachable when the backend returns e.g. `"duration": 1e400`. The security-auditor's P1 global TLS bypass is exactly what lets an on-LAN attacker *be* that backend. These are not two unrelated findings — the P1 bypass converts the P2 crash from "trust the backend" to "any LAN attacker can crash the app at recap time." Listed once each below, but they share an exploit path.
 - **README staleness — corroborated by two agents.** Both the evaluator (P2) and the doc-auditor (multiple lines) independently flagged that `README.md` describes Phase 0 while the code is at Phase 6+, and both flagged the matching stale source comment at `AppSettings.swift:7`. Merged into one finding; the doc-auditor adds that the drift extends into the `docs/` design specs.
 - **Test count — claim verified, not just asserted.** The evaluator and exerciser both built and ran the suite; "69 tests pass" (AGENTS.md) is confirmed by execution, not by counting `func test` declarations.
 ## Priority queue
 - [P1] Global, unscoped TLS bypass trusts any certificate from any host (default-on) — anyone on the LAN can ARP/DNS-spoof the unauthenticated `.local` mDNS name and receive the full mic+system audio, visual timeline, and voiceprints, then return attacker-chosen transcripts — `InsecureTrustDelegate.swift:22`, wired at `SparkControlClient.swift:85`/`GatewayLLMClient.swift:36`/`SparkControlHealth.swift:35` — security-auditor
 - [P2] Skip-TLS defaults to ON, so the P1 MITM window is open from first launch before any user choice — `AppSettings.swift:109` (`... as? Bool ?? true`) — security-auditor
 - [P2] `RecapAnalyzer.mmss()` fatally crashes on NaN/±Infinity (reproduced twice); a malformed/MITM'd backend `duration` decodes to `Double.infinity` and aborts the app at recap-render time — `RecapAnalyzer.swift:137` (`Int(sec.rounded())`) — exerciser (exploit path opened by the P1 finding)
 - [P2] README is stale by six phases — claims "Phase 0 (scaffold)… no audio capture, call detection, screen reading, or backend hand-off yet" for an app that has all of it; the same lie is in source comment `AppSettings.swift:7` — `README.md:7,49,51,56-66` vs. `Ten31Transcripts/{Audio,Detection,Visual,Session,Recap}/` — evaluator + doc-auditor
 - [P2] `SessionController` (670 lines, the most concurrency-dense file: generations, in-flight task adoption, pending-auto-stop) has zero unit tests, while comparable pure logic is well covered — `SessionController.swift:256-282` — evaluator
 - [P3] `docs/` design specs drifted from the implemented backend path: the dual-channel fields (`mic_file`/`system_file`/`self_name`/`self_vad`) are undocumented and the recap/LLM phase is absent — `docs/03_DATA_CONTRACTS.md:109-116`, `docs/02_ARCHITECTURE.md:51,197`, `docs/01_PROJECT_BRIEF.md:31,83,94`, `docs/04_BUILD_PLAN.md` (no recap phase) vs. `SparkControlClient.swift:106-130` / `RecapAnalyzer.swift:8-12` — doc-auditor
 - [P3] `docs/01_PROJECT_BRIEF.md:142-153` §7 lists open items 2–5 (send trigger, retention, voiceprint-update policy, signing) that are already resolved in code — `AppSettings.swift:46`, `VoiceprintStore.swift:25`, `Config/Signing.xcconfig` — doc-auditor
 - [P3] `docs/02_ARCHITECTURE.md:214-216` §2.10 claims MenuBarUI features (recent-sessions list with resend/delete, voiceprint manager) that are absent from the actual UI (`MenuBarView` surfaces only the single last session) — doc-auditor
 - [P3] AGENTS.md Layout listings are incomplete: `Audio/` omits `AudioMixer`/`MonoTrackWriter`/`Resampler`, `Detection/` omits `AudioInputProcesses`/`MicActivityMonitor` — `AGENTS.md:50,53` — doc-auditor
 - [P3] The `manifest.json` per-file `sha256` integrity contract is specified but never written by the pipeline — spec-vs-reality gap — `docs/03_DATA_CONTRACTS.md:61-63` — evaluator
 - [P3] Env-var precedence footgun: a saved UserDefaults backend URL permanently shadows `SPARK_BACKEND_URL`, so the env var silently has no effect once Settings is touched (already noted in ROADMAP) — `AppSettings.swift:105-107`, `ROADMAP.md:23` — evaluator
 - [P3] `SessionController` owns three jobs — recording state machine, backend-processing orchestration, and the saved-session/NSOpenPanel UI flow; extract the open/reprocess UI before the file grows — `SessionController.swift:467-535` — evaluator
 - [P3] Unused, scary-looking `NSAppleEventsUsageDescription` usage-description string in `Info.plist` ("reads the active browser tab's URL") with no AppleEvents code path (Meet detection uses `CGWindowListCopyWindowInfo` titles only) — drop it — `Info.plist:33` — security-auditor
 - [P3] Backend is unauthenticated by design — any LAN device that reaches it can drive transcription; consider a shared bearer token even on LAN — `docs/03_DATA_CONTRACTS.md:89` — security-auditor
 - [P3] App Sandbox OFF + Hardened Runtime OFF (intentional, required for cross-app observation) leaves the app unconfined; keep the zero-dependency posture as a deliberate compensating control and document it as such — `project.yml:38` + entitlements — security-auditor
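
Two of the settings items in the queue above (skip-TLS defaulting to on, and the saved URL shadowing `SPARK_BACKEND_URL`) come down to how defaults are resolved. The sketch below shows one possible resolution order, assuming the environment variable should win when it is set; the report does not prescribe that choice. `BackendSettings`, `resolveBackendSettings`, and the UserDefaults keys `backendURL` and `skipTLSVerification` are hypothetical names, not the ones in `AppSettings.swift`.

```swift
import Foundation

/// Hypothetical value type holding the two resolved settings.
struct BackendSettings {
    let backendURL: URL?
    let skipTLSVerification: Bool
}

/// Resolves settings from the environment and UserDefaults.
/// Both inputs are injectable so the precedence rules can be unit-tested.
func resolveBackendSettings(
    environment: [String: String] = ProcessInfo.processInfo.environment,
    defaults: UserDefaults = .standard
) -> BackendSettings {
    // Assumed policy: a non-empty SPARK_BACKEND_URL takes precedence over the saved value.
    let envURL = environment["SPARK_BACKEND_URL"]
        .flatMap { $0.isEmpty ? nil : URL(string: $0) }
    // "backendURL" is an illustrative key name.
    let savedURL = defaults.string(forKey: "backendURL")
        .flatMap { URL(string: $0) }
    // A missing key falls back to false, so certificate validation stays on by default.
    let skipTLS = defaults.object(forKey: "skipTLSVerification") as? Bool ?? false
    return BackendSettings(backendURL: envURL ?? savedURL,
                           skipTLSVerification: skipTLS)
}
```

The only change to the skip-TLS default relative to the expression quoted in the finding (`... as? Bool ?? true`) is the fallback value.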
 ## Scorecard
 The evaluator's six-lens table, with two lenses adjusted where another agent's evidence contradicts the evaluator's stated basis (adjustments noted):
 | Lens | Score /5 | Notes |
 |---|---|---|
 | Architecture | 5 | Clean layering; ML delegated to backend per intent; pure/testable seams split from I/O. The single 670-line `SessionController` is the only concentration (P3 to extract). |
 | Security | **3** (was 4) | **Adjusted down.** The evaluator's "TLS-skip is intentional/scoped" basis is contradicted by the security-auditor's read: the bypass is global/any-host (`InsecureTrustDelegate.swift:22`) and default-on. Otherwise strong — zero deps, no shell-out, verified-clean secrets, the "never write frames" privacy claim holds in code. |
 | Performance | 5 | Idles near-zero; frames released immediately; grid-sampled vision with reused `CIContext`; sequential backend calls honor the single-GPU constraint. |
 | Testing | 4 | 69 tests pass (verified by execution); they target the real load-bearing logic. Gap: the `SessionController` concurrency state machine is untested. |
 | Code quality | 5 | Consistent style, comments explain *why*, zero warnings, no `try!`. One latent robustness ding: the `mmss()` NaN/∞ fatal (P2). |
 | Documentation | **3** (was 4) | **Adjusted down.** The evaluator scored 4 calling `docs/` "excellent and true," but the doc-auditor's claim-by-claim pass found drift well beyond the README — the dual-channel API and the entire recap phase are undocumented across `docs/01-04`, and the build plan never mentions recap. |
 ## Disagreements & gaps
 - **TLS scope (resolved).** Evaluator said "scoped" and scored Security 4; security-auditor read `InsecureTrustDelegate.swift:22` and found it global + default-on (P1). Resolved in favor of the auditor's direct evidence; Security adjusted to 3.
 - **Documentation breadth (resolved).** Evaluator sampled `docs/` and judged them accurate (lens 4); doc-auditor did a claim-by-claim pass and found material drift in the specs, not just the README. Resolved in favor of the doc-auditor for the lens; adjusted to 3.
 - **Shared blind spot (all runtime-capable agents).** None could exercise live end-to-end behavior — the SparkControl `.local` backend is unreachable from any of these environments by design, and the real on-call visual-cue accuracy needs the gitignored `example-screenshots/`. The Meet visual fix (reject solid camera-off tiles) therefore remains **unverified end-to-end**, which AGENTS.md "Current state" itself acknowledges. No agent could close this; it requires a real call on the user's machine.
 ## Suggested order of work
 1. **Fix the TLS trust model first** — scope the override to the configured backend host and pin the Start9 root CA (or the leaf SPKI hash); default skip-TLS to `false`. This is the P1, and it is the precondition that makes any later backend-integration test trustworthy (it currently gates the P2 crash's reachability).
 2. **Harden `Double`→`Int` conversions on backend-decoded values** — give `mmss()` a finite-guard fallback and audit sibling call sites; closes the recap-time crash chain that step 1 also narrows.
 3. **Rewrite `README.md` to match the shipped app** and fix the `AppSettings.swift:7` "Phase 0" comment — the single highest-leverage doc change (first thing any newcomer reads).
 4. **Reconcile the `docs/` specs** — document the dual-channel fields in `docs/03` §4 and `docs/02`, add the recap phase to `docs/01/02/04`, and close the already-resolved §7 open items.
 5. **Add `SessionController` state-machine tests** (auto-start-then-immediate-call-end via `pendingAutoStop`; the visual-adoption generation guard) — do this *before* the next refactor so it has a safety net.
 6. **Then extract the saved-session/open-panel UI** out of `SessionController` into a small coordinator.
 7. **Run one real call end-to-end** on the user's machine to validate the unverified Meet visual fix and confirm `speakers.json` + `transcript.md` + `recap.html` are written correctly — only meaningful after step 1 makes that path safe.
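
Step 2 above (a finite-guard fallback for `mmss()`) can be sketched as follows. The real signature of `RecapAnalyzer.mmss()` is not shown in the report, so this free function, its `fallback` parameter, and the upper clamp value are assumptions. The relevant language fact is that Swift's `Int(_:)` traps on NaN, on infinities, and on finite values outside `Int`'s range, so a finiteness check alone is not enough.

```swift
/// Formats a duration in seconds as "mm:ss".
/// Hypothetical stand-in for `RecapAnalyzer.mmss()`.
func mmss(_ sec: Double, fallback: String = "--:--") -> String {
    // NaN and ±infinity cannot be converted to Int; return a placeholder instead of trapping.
    guard sec.isFinite else { return fallback }
    // Clamp to a non-negative range that always fits in Int.
    // The cap (359_999 s) is an arbitrary illustrative ceiling.
    let clamped = min(max(sec.rounded(), 0), 359_999)
    let total = Int(clamped)
    let minutes = total / 60
    let seconds = total % 60
    // Zero-pad each field to at least two digits.
    let mm = minutes < 10 ? "0\(minutes)" : "\(minutes)"
    let ss = seconds < 10 ? "0\(seconds)" : "\(seconds)"
    return "\(mm):\(ss)"
}
```

With this guard, `mmss(125)` yields `"02:05"`, while `mmss(.infinity)` and `mmss(.nan)` yield the fallback string. The same guard-then-clamp pattern would apply to any sibling call site that converts a backend-decoded `Double` to `Int`.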
Author	SHA1	Message	Date
Grant Gilliam	13a8972abb	Triage evaluation findings into AGENTS.md Current state	2026-06-13 14:39:41 -05:00
Grant Gilliam	5a51578a95	Add independent evaluation report (EVALUATION.md)	2026-06-13 14:39:40 -05:00