Add independent evaluation report (EVALUATION.md)
This commit is contained in:
@@ -0,0 +1,57 @@
|
||||
# Evaluation — ten31-transcripts — 2026-06-13
|
||||
|
||||
Intent: A native macOS menu-bar app (Swift/SwiftUI/AppKit, macOS 13+, generated by XcodeGen) that auto-detects conference calls (Meet/Zoom/Teams/Signal), records dual-track audio while watching the call window via ScreenCaptureKit for active-speaker cues, and hands audio + a visual speaker timeline to a self-hosted SparkControl backend that performs transcription/diarization/speaker-naming — producing named transcripts and recaps.
|
||||
|
||||
Agents run: evaluator, security-auditor, exerciser, doc-auditor. Skipped: start9-spec-checker (no StartOS-wrapper markers found), reviewer (working tree clean — no diff to review).
|
||||
|
||||
## Verdict
|
||||
This is a genuinely well-engineered personal tool: it builds cleanly with the documented `DEVELOPER_DIR` toolchain, all 69 tests pass in ~1s (claim verified empirically), the architecture is disciplined (the app records/watches/packages/reconciles and correctly delegates all ML to the backend), and secrets hygiene is verifiably clean — the documented 2026-06-13 history scrub survives a full-ref grep with zero leaked hosts or IPs. The headline risk is the TLS trust model: certificate validation is bypassed **globally for any host, on by default**, so anyone on the LAN can MITM the full upload of call audio, the visual timeline, and stored voiceprints — and the same bypass makes a reproducible recap-time crash (`mmss()` on a malformed `Double`) attacker-reachable. The second-largest issue is documentation: the README still describes "Phase 0 (scaffold)" for an app that has shipped through Phase 6, and the `docs/` specs have diverged from the dual-channel API and the recap phase. Code-wise this is close to ready for its single-user, LAN-only purpose; the fixes are well-scoped and mostly small. Fix the TLS model first — it gates the safety of every backend-integration test that follows.
|
||||
|
||||
## Cross-referenced findings
|
||||
- **TLS bypass scope — contradiction resolved against the evaluator.** The evaluator rated Security 4 partly on the basis that the TLS-skip is "intentional/scoped" (`InsecureTrustDelegate.swift:9`). The security-auditor read the implementation and found it is **not** scoped: `URLCredential(trust:)` is returned for any host without a host/fingerprint/CA check (`InsecureTrustDelegate.swift:22`), and it is default-on (`AppSettings.swift:109`). The auditor's direct evidence wins; the Security lens is adjusted down accordingly (see Scorecard).
|
||||
- **One attack chain, two agents.** The exerciser independently reproduced (twice) a fatal crash in `RecapAnalyzer.mmss()` on `Double.nan`/`Double.infinity` (`RecapAnalyzer.swift:137`), reachable when the backend returns e.g. `"duration": 1e400`. The security-auditor's P1 global TLS bypass is exactly what lets an on-LAN attacker *be* that backend. These are not two unrelated findings — the P1 bypass converts the P2 crash from "trust the backend" to "any LAN attacker can crash the app at recap time." Listed once each below, but they share an exploit path.
|
||||
- **README staleness — corroborated by two agents.** Both the evaluator (P2) and the doc-auditor (multiple lines) independently flagged that `README.md` describes Phase 0 while the code is at Phase 6+, and both flagged the matching stale source comment at `AppSettings.swift:7`. Merged into one finding; the doc-auditor adds that the drift extends into the `docs/` design specs.
|
||||
- **Test count — claim verified, not just asserted.** The evaluator and exerciser both built and ran the suite; "69 tests pass" (AGENTS.md) is confirmed by execution, not by counting `func test` declarations.
|
||||
|
||||
## Priority queue
|
||||
- [P1] Global, unscoped TLS bypass trusts any certificate from any host (default-on) — anyone on the LAN can ARP/DNS-spoof the unauthenticated `.local` mDNS name and receive the full mic+system audio, visual timeline, and voiceprints, then return attacker-chosen transcripts — `InsecureTrustDelegate.swift:22`, wired at `SparkControlClient.swift:85`/`GatewayLLMClient.swift:36`/`SparkControlHealth.swift:35` — security-auditor
|
||||
- [P2] Skip-TLS defaults to ON, so the P1 MITM window is open from first launch before any user choice — `AppSettings.swift:109` (`... as? Bool ?? true`) — security-auditor
|
||||
- [P2] `RecapAnalyzer.mmss()` fatally crashes on NaN/±Infinity (reproduced twice); a malformed/MITM'd backend `duration` decodes to `Double.infinity` and aborts the app at recap-render time — `RecapAnalyzer.swift:137` (`Int(sec.rounded())`) — exerciser (exploit path opened by the P1 finding)
|
||||
- [P2] README is stale by six phases — claims "Phase 0 (scaffold)… no audio capture, call detection, screen reading, or backend hand-off yet" for an app that has all of it; the same lie is in source comment `AppSettings.swift:7` — `README.md:7,49,51,56-66` vs. `Ten31Transcripts/{Audio,Detection,Visual,Session,Recap}/` — evaluator + doc-auditor
|
||||
- [P2] `SessionController` (670 lines, the most concurrency-dense file: generations, in-flight task adoption, pending-auto-stop) has zero unit tests, while comparable pure logic is well covered — `SessionController.swift:256-282` — evaluator
|
||||
- [P3] `docs/` design specs drifted from the implemented backend path: the dual-channel fields (`mic_file`/`system_file`/`self_name`/`self_vad`) are undocumented and the recap/LLM phase is absent — `docs/03_DATA_CONTRACTS.md:109-116`, `docs/02_ARCHITECTURE.md:51,197`, `docs/01_PROJECT_BRIEF.md:31,83,94`, `docs/04_BUILD_PLAN.md` (no recap phase) vs. `SparkControlClient.swift:106-130` / `RecapAnalyzer.swift:8-12` — doc-auditor
|
||||
- [P3] `docs/01_PROJECT_BRIEF.md:142-153` §7 lists open items 2–5 (send trigger, retention, voiceprint-update policy, signing) that are already resolved in code — `AppSettings.swift:46`, `VoiceprintStore.swift:25`, `Config/Signing.xcconfig` — doc-auditor
|
||||
- [P3] `docs/02_ARCHITECTURE.md:214-216` §2.10 claims MenuBarUI features (recent-sessions list with resend/delete, voiceprint manager) that are absent from the actual UI (`MenuBarView` surfaces only the single last session) — doc-auditor
|
||||
- [P3] AGENTS.md Layout listings are incomplete: `Audio/` omits `AudioMixer`/`MonoTrackWriter`/`Resampler`, `Detection/` omits `AudioInputProcesses`/`MicActivityMonitor` — `AGENTS.md:50,53` — doc-auditor
|
||||
- [P3] The `manifest.json` per-file `sha256` integrity contract is specified but never written by the pipeline — spec-vs-reality gap — `docs/03_DATA_CONTRACTS.md:61-63` — evaluator
|
||||
- [P3] Env-var precedence footgun: a saved UserDefaults backend URL permanently shadows `SPARK_BACKEND_URL`, so the env var silently has no effect once Settings is touched (already noted in ROADMAP) — `AppSettings.swift:105-107`, `ROADMAP.md:23` — evaluator
|
||||
- [P3] `SessionController` owns three jobs — recording state machine, backend-processing orchestration, and the saved-session/NSOpenPanel UI flow; extract the open/reprocess UI before the file grows — `SessionController.swift:467-535` — evaluator
|
||||
- [P3] Unused, scary-looking `NSAppleEventsUsageDescription` entitlement string ("reads the active browser tab's URL") with no AppleEvents code path (Meet detection uses `CGWindowListCopyWindowInfo` titles only) — drop it — `Info.plist:33` — security-auditor
|
||||
- [P3] Backend is unauthenticated by design — any LAN device that reaches it can drive transcription; consider a shared bearer token even on LAN — `docs/03_DATA_CONTRACTS.md:89` — security-auditor
|
||||
- [P3] App Sandbox OFF + Hardened Runtime OFF (intentional, required for cross-app observation) leaves the app unconfined; keep the zero-dependency posture as a deliberate compensating control and document it as such — `project.yml:38` + entitlements — security-auditor
|
||||
|
||||
## Scorecard
|
||||
The evaluator's six-lens table, with two lenses adjusted where another agent's evidence contradicts the evaluator's stated basis (adjustments noted):
|
||||
|
||||
| Lens | Score /5 | Notes |
|
||||
|---|---|---|
|
||||
| Architecture | 5 | Clean layering; ML delegated to backend per intent; pure/testable seams split from I/O. The single 670-line `SessionController` is the only concentration (P3 to extract). |
|
||||
| Security | **3** (was 4) | **Adjusted down.** The evaluator's "TLS-skip is intentional/scoped" basis is contradicted by the security-auditor's read: the bypass is global/any-host (`InsecureTrustDelegate.swift:22`) and default-on. Otherwise strong — zero deps, no shell-out, verified-clean secrets, the "never write frames" privacy claim holds in code. |
|
||||
| Performance | 5 | Idles near-zero; frames released immediately; grid-sampled vision with reused `CIContext`; sequential backend calls honor the single-GPU constraint. |
|
||||
| Testing | 4 | 69 tests pass (verified by execution); they target the real load-bearing logic. Gap: the `SessionController` concurrency state machine is untested. |
|
||||
| Code quality | 5 | Consistent style, comments explain *why*, zero warnings, no `try!`. One latent robustness ding: the `mmss()` NaN/∞ fatal (P2). |
|
||||
| Documentation | **3** (was 4) | **Adjusted down.** The evaluator scored 4 calling `docs/` "excellent and true," but the doc-auditor's claim-by-claim pass found drift well beyond the README — the dual-channel API and the entire recap phase are undocumented across `docs/01-04`, and the build plan never mentions recap. |
|
||||
|
||||
## Disagreements & gaps
|
||||
- **TLS scope (resolved).** Evaluator said "scoped" and scored Security 4; security-auditor read `InsecureTrustDelegate.swift:22` and found it global + default-on (P1). Resolved in favor of the auditor's direct evidence; Security adjusted to 3.
|
||||
- **Documentation breadth (resolved).** Evaluator sampled `docs/` and judged them accurate (lens 4); doc-auditor did a claim-by-claim pass and found material drift in the specs, not just the README. Resolved in favor of the doc-auditor for the lens; adjusted to 3.
|
||||
- **Shared blind spot (all runtime-capable agents).** None could exercise live end-to-end behavior — the SparkControl `.local` backend is unreachable from any of these environments by design, and the real on-call visual-cue accuracy needs the gitignored `example-screenshots/`. The Meet visual fix (reject solid camera-off tiles) therefore remains **unverified end-to-end**, which AGENTS.md "Current state" itself acknowledges. No agent could close this; it requires a real call on the user's machine.
|
||||
|
||||
## Suggested order of work
|
||||
1. **Fix the TLS trust model first** — scope the override to the configured backend host and pin the Start9 root CA (or the leaf SPKI hash); default skip-TLS to `false`. This is the P1, and it is the precondition that makes any later backend-integration test trustworthy (it currently gates the P2 crash's reachability).
|
||||
2. **Harden `Double`→`Int` conversions on backend-decoded values** — give `mmss()` a finite-guard fallback and audit sibling call sites; closes the recap-time crash chain that step 1 also narrows.
|
||||
3. **Rewrite `README.md` to match the shipped app** and fix the `AppSettings.swift:7` "Phase 0" comment — the single highest-leverage doc change (first thing any newcomer reads).
|
||||
4. **Reconcile the `docs/` specs** — document the dual-channel fields in `docs/03` §4 and `docs/02`, add the recap phase to `docs/01/02/04`, and close the already-resolved §7 open items.
|
||||
5. **Add `SessionController` state-machine tests** (auto-start-then-immediate-call-end via `pendingAutoStop`; the visual-adoption generation guard) — do this *before* the next refactor so it has a safety net.
|
||||
6. **Then extract the saved-session/open-panel UI** out of `SessionController` into a small coordinator.
|
||||
7. **Run one real call end-to-end** on the user's machine to validate the unverified Meet visual fix and confirm `speakers.json` + `transcript.md` + `recap.html` are written correctly — only meaningful after step 1 makes that path safe.
|
||||
Reference in New Issue
Block a user