Phase 0: menu-bar scaffold, permissions, backend health check

Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml via XcodeGen. Includes: - PermissionsManager (Microphone / Screen Recording / Accessibility) + UI - SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate) - AppSettings persistence (host, TLS-skip, output folder, adapter toggles) - Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00
commit b2ae3a62b9
19 changed files with 1448 additions and 0 deletions
@@ -0,0 +1,248 @@
+# Architecture — Ten31 Transcripts
+
+Companion to `01_PROJECT_BRIEF.md`. Module layout, data flow, the per-app adapter
+pattern, the macOS APIs, and the SparkControl integration (now fully specified).
+
+---
+
+## 1. High-level data flow
+
+```
+                    ┌─────────────────────────────────────────┐
+                    │              CallDetector                 │
+                    │  CoreAudio "mic running somewhere"        │
+                    │  + known-app / Meet-tab heuristic         │
+                    └───────────────┬───────────────────────────┘
+                                    │ callStarted(app, window)
+                                    ▼
+        ┌──────────────────────────────────────────────────────────┐
+        │                    SessionController                       │
+        │   owns one Session; shared t0; start/stop; on end package  │
+        └───────┬───────────────────────────┬──────────────────────┘
+                │                            │
+                ▼                            ▼
+   ┌────────────────────────┐   ┌───────────────────────────────────┐
+   │     AudioRecorder       │   │         VisualObserver            │
+   │  SCStream system audio  │   │  SCStream window frames @2–4fps    │
+   │  AVAudioEngine mic       │   │     │ (frames released, never saved)│
+   │  → mic.wav, system.wav   │   │     ▼                              │
+   │  → mixed_mono_16k.wav    │   │  AppAdapter (per app)              │
+   │  + mic VAD → self spans   │   │   OCR names + active-speaker cue   │
+   └────────────┬───────────┘   │   → SpeakerObservation              │
+                │               └──────────────┬────────────────────┘
+                │                              ▼
+                │                  ┌───────────────────────────┐
+                │   self spans ───▶│      TimelineBuilder        │
+                │                  │  debounce/coalesce + merge   │
+                │                  │  mic-VAD self spans          │
+                │                  │  → visual_timeline.json      │
+                │                  └──────────────┬────────────┘
+                │                                 │
+                ▼                                 ▼
+        ┌──────────────────────────────────────────────────────────┐
+        │                       SessionPackager                      │
+        │   mixed_mono_16k.wav + visual_timeline.json + manifest      │
+        │   + chunk plan (if call > ~3 min)                           │
+        └───────────────────────────┬──────────────────────────────┘
+                                     ▼
+                   ┌──────────────────────────────────────┐
+                   │          SparkControlClient            │
+                   │  per chunk, SEQUENTIAL:                 │
+                   │   POST /api/audio/label-merge           │
+                   │     file=chunk.wav                      │
+                   │     timeline=<chunk-local segments>     │
+                   │     known_voiceprints=<from store>      │
+                   │     transcribe=true                     │
+                   │  → named segments + per-speaker prints   │
+                   └───────────────┬───────────┬────────────┘
+                                   │           │
+                       offset+stitch│           │ fingerprints (keyed by name)
+                                   ▼           ▼
+                       ┌────────────────┐  ┌────────────────────┐
+                       │  speakers.json  │  │   VoiceprintStore   │
+                       │ (named, global) │  │  persist/replay     │
+                       └────────────────┘  └────────────────────┘
+```
+
+## 2. Modules
+
+### 2.1 `CallDetector`
+Fire `callStarted(app:window:)` / `callEnded()`.
+- **Mic active system-wide:** CoreAudio `kAudioDevicePropertyDeviceIsRunningSomewhere`
+  on the default input device (listener, not poll).
+- **App present/active:** `NSWorkspace` running/frontmost vs the bundle-ID table.
+- **Meet (browser):** when a browser is frontmost + mic live, read the active-tab
+  URL (AppleScript/Accessibility); confirm `meet.google.com`.
+- **Heuristic:** `mic_running` AND (`known_native_app_active` OR `browser+meet_tab`).
+  Debounce ~2 s open; end when mic quiet > N s and the app/tab leaves foreground
+  or quits.
+- Output: app id + the call window (`SCWindow`) for the `VisualObserver`.
+
+### 2.2 `AudioRecorder`
+- **System audio:** `SCStream` with `capturesAudio = true` (mixer-level; works with
+  headphones; no BlackHole). macOS 13+.
+- **Mic:** `AVAudioEngine` input tap.
+- Outputs: `mic.wav`, `system.wav`, and the backend deliverable
+  **`mixed_mono_16k.wav`** (mic + system summed → mono → 16 kHz PCM WAV).
+- **Shared `t0`** (`CACurrentMediaTime`) stamped once; every audio sample and
+  visual observation is relative to it. Non-negotiable — the merge depends on it.
+- **Mic VAD:** run lightweight energy/VAD on the mic track to emit "the user is
+  speaking" spans. These feed `TimelineBuilder` as pre-seeded **self** segments
+  (high confidence) so the backend names the user even when their own tile isn't
+  read — and so the user's voiceprint enrolls on call one.
+
+### 2.3 `VisualObserver`
+- `SCStream` scoped (via `SCContentFilter`) to the **specific call window**.
+- Throttle to adapter fps (default 3). Hand each frame to the adapter; **release
+  immediately — never persist a frame.**
+- **Window visibility / focus is NOT required.** SCK captures a window's own
+  rendered content even when it's in the background, occluded by other apps, or
+  on another Space. The user can work in other apps during the call and visual
+  capture continues normally. (This is a key reason for window capture over
+  display capture — also more private.)
+- **Capture liveness — the one real failure mode.** Two states stop fresh frames:
+  1. **Minimized to the Dock** — macOS may freeze the window's backing buffer, so
+     SCK delivers stale/duplicate frames. Detect minimization
+     (`SCWindow.isOnScreen == false` / window state) and **pause visual analysis +
+     flag a `visual_gap` for that span** rather than emitting bogus observations.
+  2. **Browser tab switched away (Meet only)** — see §2.4 Meet note.
+  In both cases **audio keeps recording**, and the backend voiceprint fallback
+  still names previously-heard speakers — so a gap only costs naming precision for
+  *new, never-seen* speakers during that exact window. Record gaps in
+  `visual_timeline.json` (a `visual_gaps: [{start,end,reason}]` array) so the
+  cause is auditable; `TimelineBuilder` must not interpolate across a gap.
+
+### 2.4 `AppAdapter` (protocol) + four implementations
+```swift
+protocol AppAdapter {
+    static var bundleIDs: [String] { get }
+    var preferredFPS: Int { get }
+    func analyze(frame: CVPixelBuffer, at t: TimeInterval) -> [SpeakerObservation]
+    func namesFromAccessibility() -> [String]?   // optional
+}
+struct SpeakerObservation {
+    let name: String          // OCR'd / a11y name; "" if unknown
+    let speaking: Bool        // active-speaker cue detected
+    let bbox: CGRect
+    let confidence: Double     // 0..1
+    let t: TimeInterval        // relative to session t0
+}
+```
+Per-adapter cues:
+- **Zoom** (`us.zoom.xos`): colored tile border = active speaker; OCR the tile
+  name label; handle speaker + gallery layouts.
+- **Teams** (`com.microsoft.teams2`): colored ring/border; labeled; like Zoom.
+- **Signal** (`org.whispersystems.signal-desktop`): ring around avatar/initials;
+  try `namesFromAccessibility()` first (Electron a11y tree), OCR fallback.
+- **Meet** (browser): **hybrid** — names via Accessibility/AppleScript (DOM text),
+  speaking cue via Vision (canvas/WebGL animated bars / tile highlight), fused by
+  tile position. Most likely to need iteration.
+  - **Tab-switch caveat (Meet-specific):** if Meet is a browser *tab* and the user
+    switches to a different tab **in the same window**, the browser stops rendering
+    the Meet tab → SCK captures a frozen last-frame (a `visual_gap`). Switching to a
+    different *app* is fine; switching tabs is not. Mitigations, in order: (1)
+    detect the active-tab URL leaving `meet.google.com` and flag a `visual_gap`
+    (don't emit stale observations); (2) prefer capturing Meet in a **dedicated
+    browser window / PWA / standalone window** so tab-switching can't blank it
+    — surface this as a one-time setup tip in the UI; (3) names still come from the
+    a11y/DOM tree where available, and audio + voiceprint fallback carry identity
+    through the gap regardless.
+
+Each adapter is **testable offline** against PNG/JPEG frame fixtures.
+
+### 2.5 `TimelineBuilder`
+Turn noisy per-frame observations into clean `(start, end, name, confidence)`
+segments.
+- Group by name; open a segment after K consecutive speaking frames (e.g. 2),
+  close after M quiet frames (e.g. 2) — hysteresis rides out UI-cue lag/flicker.
+- **Allow overlaps** (crosstalk). Do not force one speaker per instant.
+- Merge in the mic-VAD **self** spans (the user) with high confidence.
+- Normalize OCR name variants ("Sarah J" → "Sarah Jones") via a per-session
+  alias table.
+- Emit `visual_timeline.json` (schema in `03_DATA_CONTRACTS.md`). The flat
+  `segments` array maps directly onto the `timeline` field `label-merge` wants.
+
+### 2.6 `SessionPackager`
+Write the session folder and, if the call is longer than ~3 min, produce a
+**chunk plan**: ~2–3 min windows on `mixed_mono_16k.wav`, each with its
+**timeline slice rebased to chunk-local seconds**.
+```
+~/Ten31Transcripts/sessions/2026-06-05T14-03_zoom/
+  mic.wav  system.wav  mixed_mono_16k.wav
+  visual_timeline.json
+  manifest.json
+  (chunks/ produced transiently if chunking)
+  speakers.json            # written after backend hand-off
+```
+
+### 2.7 `SparkControlClient`
+Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`** with
+`file`, `timeline`, `known_voiceprints`, `transcribe=true`.
+- **Sequential only** — one audio request in flight (parallel ⇒ `503 + Retry-After`).
+- **Self-signed TLS** — skip verification (`URLSession` delegate trusting the
+  Start9 cert) or trust the Root CA. **No auth on the LAN.**
+- **Per chunk:** call `label-merge` with that chunk's audio + rebased timeline +
+  the **accumulated** voiceprints; offset returned timestamps back to global and
+  append. Names unify across chunks because the same names/voiceprints are passed
+  each time; new voiceprints accumulate into the store.
+- Retry on `503` after `Retry-After`; on hard failure keep the session folder and
+  surface "Resend" in the UI.
+- Limits to respect: **200 MB/request** (`413`), transcription timeout ~300 s,
+  diarization ~600 s. Chunking keeps requests well under these.
+- See `03_DATA_CONTRACTS.md §4` for exact fields and a real response.
+
+### 2.8 result assembly → `speakers.json`
+Concatenate the per-chunk `label-merge` results into one global, named,
+speaker-attributed transcript (timestamps offset to session time). This is the
+seam to the user's existing summarizer. The app does not analyze past this.
+
+### 2.9 `VoiceprintStore`
+Local persistence of named voiceprints — the compounding-identity layer.
+- File: `~/Ten31Transcripts/voiceprints.json` →
+  `{ "<name>": { "vector": [192 floats], "updated": <iso>, "calls": <int> } }`.
+- **On send:** load all entries → pass as `known_voiceprints` to `label-merge`.
+- **On response:** for each speaker resolved by **visual** (or a high-similarity
+  **voiceprint** match), store/refresh that name's vector. **Never** store
+  `Unknown_N`.
+- **Update policy (open, start simple):** overwrite with the latest
+  high-confidence vector, or keep a running mean per name. v1 = store latest with
+  `overlap_confidence ≥ ~0.8`; refine to averaging later (`01 §7.4`).
+- Editable/clearable from the menu-bar UI (rename, delete a person, reset).
+
+### 2.10 `MenuBarUI` (SwiftUI, `LSUIElement`)
+Status (idle / detected / recording / uploading), manual start/stop, recent
+sessions (open folder, resend, delete), adapter toggles, **backend host + a
+health check** (`GET /api/status`), output folder, voiceprint manager, and a
+permissions checklist (Screen Recording, Microphone, Accessibility).
+
+## 3. macOS frameworks & permissions
+
+| Need | Framework | Permission |
+|---|---|---|
+| System audio + window frames | ScreenCaptureKit | Screen Recording |
+| Microphone | AVFoundation / CoreAudio | Microphone |
+| Meet/Signal names, tab URL | Accessibility (AXUIElement) / AppleScript | Accessibility + Automation |
+| OCR + cue analysis | Vision (`VNRecognizeTextRequest`) | none |
+| App/tab detection | AppKit (`NSWorkspace`) | none |
+
+Stable signing identity avoids permission re-prompts on rebuild.
+
+## 4. Performance
+Window-scoped capture + 3 fps + Vision-on-Neural-Engine is light; audio is cheap;
+frames are released immediately so memory stays flat. The app idles near-zero
+until a call starts. Backend requests are sequential and chunked, so they never
+saturate the GPU.
+
+## 5. The merge — now done by the backend
+The app no longer implements the overlap vote. `label-merge` resolves each
+anonymous cluster in order:
+1. **visual** — timeline name with the most temporal overlap (`source: "visual"`,
+   `overlap_confidence`);
+2. **voiceprint** — closest `known_voiceprints` match above `voiceprint_threshold`
+   (`source: "voiceprint"`, `match_similarity`);
+3. **`Unknown_N`** (`source: "unmatched"`) — never guessed/mislabeled.
+The app's contribution is a good timeline (incl. mic-VAD self spans) and an
+ever-growing voiceprint library. `min_overlap` and `voiceprint_threshold` are
+tunable request params if precision needs adjusting.
+```
+```