Phase 0: menu-bar scaffold, permissions, backend health check
Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
@@ -0,0 +1,248 @@
# Architecture — Ten31 Transcripts

Companion to `01_PROJECT_BRIEF.md`. Module layout, data flow, the per-app adapter
pattern, the macOS APIs, and the SparkControl integration (now fully specified).

---

## 1. High-level data flow

```
          ┌──────────────────────────────────────────┐
          │               CallDetector               │
          │   CoreAudio "mic running somewhere"      │
          │   + known-app / Meet-tab heuristic       │
          └────────────────────┬─────────────────────┘
                               │ callStarted(app, window)
                               ▼
  ┌──────────────────────────────────────────────────────────┐
  │                    SessionController                     │
  │  owns one Session; shared t0; start/stop; on end package │
  └───────┬───────────────────────────┬──────────────────────┘
          │                           │
          ▼                           ▼
┌────────────────────────┐   ┌───────────────────────────────────┐
│     AudioRecorder      │   │          VisualObserver           │
│ SCStream system audio  │   │  SCStream window frames @2–4fps   │
│ AVAudioEngine mic      │   │  │ (frames released, never saved) │
│ → mic.wav, system.wav  │   │  ▼                                │
│ → mixed_mono_16k.wav   │   │  AppAdapter (per app)             │
│ + mic VAD → self spans │   │  OCR names + active-speaker cue   │
└────────────┬───────────┘   │  → SpeakerObservation             │
             │               └──────────────┬────────────────────┘
             │                              ▼
             │                  ┌───────────────────────────┐
             │  self spans ────▶│      TimelineBuilder      │
             │                  │ debounce/coalesce + merge │
             │                  │ mic-VAD self spans        │
             │                  │ → visual_timeline.json    │
             │                  └──────────────┬────────────┘
             │                                 │
             ▼                                 ▼
  ┌──────────────────────────────────────────────────────────┐
  │                     SessionPackager                      │
  │  mixed_mono_16k.wav + visual_timeline.json + manifest    │
  │  + chunk plan (if call > ~3 min)                         │
  └───────────────────────────┬──────────────────────────────┘
                              ▼
           ┌────────────────────────────────────────┐
           │           SparkControlClient           │
           │  per chunk, SEQUENTIAL:                │
           │    POST /api/audio/label-merge         │
           │      file=chunk.wav                    │
           │      timeline=<chunk-local segments>   │
           │      known_voiceprints=<from store>    │
           │      transcribe=true                   │
           │  → named segments + per-speaker prints │
           └───────────────┬───────────┬────────────┘
                           │           │
              offset+stitch│           │ fingerprints (keyed by name)
                           ▼           ▼
          ┌──────────────────┐   ┌────────────────────┐
          │ speakers.json    │   │  VoiceprintStore   │
          │ (named, global)  │   │  persist/replay    │
          └──────────────────┘   └────────────────────┘
```

## 2. Modules

### 2.1 `CallDetector`
Fire `callStarted(app:window:)` / `callEnded()`.
- **Mic active system-wide:** CoreAudio `kAudioDevicePropertyDeviceIsRunningSomewhere`
  on the default input device (listener, not poll).
- **App present/active:** `NSWorkspace` running/frontmost vs the bundle-ID table.
- **Meet (browser):** when a browser is frontmost + mic live, read the active-tab
  URL (AppleScript/Accessibility); confirm `meet.google.com`.
- **Heuristic:** `mic_running` AND (`known_native_app_active` OR `browser+meet_tab`).
  Debounce ~2 s open; end when mic quiet > N s and the app/tab leaves foreground
  or quits.
- Output: app id + the call window (`SCWindow`) for the `VisualObserver`.
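
The heuristic and its debounce can be sketched as a small state machine. This is a minimal illustration, not the app's implementation: `CallSignals`, `CallEvent`, and `CallDebouncer` are hypothetical names, the 10 s close delay stands in for the unspecified "N s", and the end condition is one reading of "mic quiet and the app/tab gone".

```swift
import Foundation

/// Hypothetical snapshot of the three signals named in the heuristic.
struct CallSignals {
    var micRunning: Bool
    var knownNativeAppActive: Bool
    var browserWithMeetTab: Bool

    var appOrTabPresent: Bool { knownNativeAppActive || browserWithMeetTab }
    /// mic_running AND (known_native_app_active OR browser+meet_tab)
    var looksLikeCall: Bool { micRunning && appOrTabPresent }
}

enum CallEvent: Equatable { case callStarted, callEnded }

struct CallDebouncer {
    let openDelay: TimeInterval
    let closeDelay: TimeInterval   // the "N s" in the text; 10 is a placeholder
    private(set) var inCall = false
    private var pendingSince: TimeInterval?

    init(openDelay: TimeInterval = 2, closeDelay: TimeInterval = 10) {
        self.openDelay = openDelay
        self.closeDelay = closeDelay
    }

    /// Feed the latest signals; returns an event only when the state flips.
    mutating func update(_ s: CallSignals, at now: TimeInterval) -> CallEvent? {
        // While in a call, stay in it unless the mic is idle AND the app/tab is gone.
        let target = inCall ? (s.micRunning || s.appOrTabPresent) : s.looksLikeCall
        guard target != inCall else {
            pendingSince = nil          // condition reverted: cancel the pending flip
            return nil
        }
        let since = pendingSince ?? now
        pendingSince = since
        guard now - since >= (target ? openDelay : closeDelay) else { return nil }
        inCall = target
        pendingSince = nil
        return target ? .callStarted : .callEnded
    }
}
```

In the real detector the signals would arrive from the CoreAudio listener and `NSWorkspace` notifications; here they are plain values so the logic can be exercised without any system APIs.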

### 2.2 `AudioRecorder`
- **System audio:** `SCStream` with `capturesAudio = true` (mixer-level; works with
  headphones; no BlackHole). macOS 13+.
- **Mic:** `AVAudioEngine` input tap.
- Outputs: `mic.wav`, `system.wav`, and the backend deliverable
  **`mixed_mono_16k.wav`** (mic + system summed → mono → 16 kHz PCM WAV).
- **Shared `t0`** (`CACurrentMediaTime`) stamped once; every audio sample and
  visual observation is relative to it. Non-negotiable — the merge depends on it.
- **Mic VAD:** run lightweight energy/VAD on the mic track to emit "the user is
  speaking" spans. These feed `TimelineBuilder` as pre-seeded **self** segments
  (high confidence) so the backend names the user even when their own tile isn't
  read — and so the user's voiceprint enrolls on call one.
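
As an illustration of the mic-VAD bullet, the sketch below computes per-frame RMS energy and emits speaking spans with a short hangover. It is a simplified stand-in, not the app's VAD: the function name `selfSpans`, the 20 ms frame, the 0.02 RMS threshold, and the 10-frame hangover are all assumptions. Times are relative to the first sample, which is taken to coincide with the shared `t0`.

```swift
import Foundation

struct Span: Equatable {
    var start: Double   // seconds from the first sample
    var end: Double
}

/// Energy-based VAD over mono Float samples in [-1, 1].
func selfSpans(samples: [Float],
               sampleRate: Double,
               frameSeconds: Double = 0.02,
               rmsThreshold: Float = 0.02,
               hangoverFrames: Int = 10) -> [Span] {
    let frameLen = max(1, Int((sampleRate * frameSeconds).rounded()))
    var spans: [Span] = []
    var openStart: Double? = nil
    var lastVoicedEnd = 0.0
    var quietFrames = 0
    var i = 0
    while i < samples.count {
        let end = min(i + frameLen, samples.count)
        var sumSquares: Float = 0
        for s in samples[i..<end] { sumSquares += s * s }
        let rms = (sumSquares / Float(end - i)).squareRoot()
        if rms >= rmsThreshold {
            if openStart == nil { openStart = Double(i) / sampleRate }
            lastVoicedEnd = Double(end) / sampleRate
            quietFrames = 0
        } else if let start = openStart {
            quietFrames += 1
            // Close only after enough consecutive quiet frames, so short pauses
            // inside a sentence stay within one span.
            if quietFrames > hangoverFrames {
                spans.append(Span(start: start, end: lastVoicedEnd))
                openStart = nil
            }
        }
        i = end
    }
    if let start = openStart { spans.append(Span(start: start, end: lastVoicedEnd)) }
    return spans
}
```

Each returned span would become a pre-seeded self segment for `TimelineBuilder`.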

### 2.3 `VisualObserver`
- `SCStream` scoped (via `SCContentFilter`) to the **specific call window**.
- Throttle to adapter fps (default 3). Hand each frame to the adapter; **release
  immediately — never persist a frame.**
- **Window visibility / focus is NOT required.** SCK captures a window's own
  rendered content even when it's in the background, occluded by other apps, or
  on another Space. The user can work in other apps during the call and visual
  capture continues normally. (This is a key reason for window capture over
  display capture — also more private.)
- **Capture liveness — the one real failure mode.** Two states stop fresh frames:
  1. **Minimized to the Dock** — macOS may freeze the window's backing buffer, so
     SCK delivers stale/duplicate frames. Detect minimization
     (`SCWindow.isOnScreen == false` / window state) and **pause visual analysis +
     flag a `visual_gap` for that span** rather than emitting bogus observations.
  2. **Browser tab switched away (Meet only)** — see §2.4 Meet note.

  In both cases **audio keeps recording**, and the backend voiceprint fallback
  still names previously-heard speakers — so a gap only costs naming precision for
  *new, never-seen* speakers during that exact window. Record gaps in
  `visual_timeline.json` (a `visual_gaps: [{start,end,reason}]` array) so the
  cause is auditable; `TimelineBuilder` must not interpolate across a gap.
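
One way to produce the `visual_gaps` array is to track liveness on every throttled frame tick. The sketch below assumes the caller has already decided whether capture is live; `GapTracker` and the reason strings (`"minimized"`, `"tab_switched"`) are hypothetical, and only the `{start, end, reason}` shape comes from the text above.

```swift
import Foundation

struct VisualGap: Codable, Equatable {
    let start: Double    // seconds relative to session t0
    let end: Double
    let reason: String
}

struct GapTracker {
    private(set) var gaps: [VisualGap] = []
    private var open: VisualGap?

    init() {}

    /// Call once per frame tick. `reason` is nil while capture is live.
    mutating func tick(at t: Double, reason: String?) {
        // Close the open gap when capture recovers or the cause changes.
        if let g = open, g.reason != reason {
            gaps.append(VisualGap(start: g.start, end: t, reason: g.reason))
            open = nil
        }
        if open == nil, let r = reason {
            open = VisualGap(start: t, end: t, reason: r)
        }
    }

    /// Call at session end so a gap still open is not lost.
    mutating func finish(at t: Double) {
        if let g = open {
            gaps.append(VisualGap(start: g.start, end: t, reason: g.reason))
            open = nil
        }
    }
}
```

Because `VisualGap` is `Codable`, `gaps` can be encoded with `JSONEncoder` straight into the timeline file.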

### 2.4 `AppAdapter` (protocol) + four implementations
```swift
import CoreGraphics   // CGRect
import CoreVideo      // CVPixelBuffer
import Foundation     // TimeInterval

protocol AppAdapter {
    static var bundleIDs: [String] { get }
    var preferredFPS: Int { get }
    func analyze(frame: CVPixelBuffer, at t: TimeInterval) -> [SpeakerObservation]
    func namesFromAccessibility() -> [String]?   // optional
}
struct SpeakerObservation {
    let name: String        // OCR'd / a11y name; "" if unknown
    let speaking: Bool      // active-speaker cue detected
    let bbox: CGRect
    let confidence: Double  // 0..1
    let t: TimeInterval     // relative to session t0
}
```
Per-adapter cues:
- **Zoom** (`us.zoom.xos`): colored tile border = active speaker; OCR the tile
  name label; handle speaker + gallery layouts.
- **Teams** (`com.microsoft.teams2`): colored ring/border; labeled; like Zoom.
- **Signal** (`org.whispersystems.signal-desktop`): ring around avatar/initials;
  try `namesFromAccessibility()` first (Electron a11y tree), OCR fallback.
- **Meet** (browser): **hybrid** — names via Accessibility/AppleScript (DOM text),
  speaking cue via Vision (canvas/WebGL animated bars / tile highlight), fused by
  tile position. Most likely to need iteration.
  - **Tab-switch caveat (Meet-specific):** if Meet is a browser *tab* and the user
    switches to a different tab **in the same window**, the browser stops rendering
    the Meet tab → SCK captures a frozen last-frame (a `visual_gap`). Switching to a
    different *app* is fine; switching tabs is not. Mitigations, in order: (1)
    detect the active-tab URL leaving `meet.google.com` and flag a `visual_gap`
    (don't emit stale observations); (2) prefer capturing Meet in a **dedicated
    browser window / PWA / standalone window** so tab-switching can't blank it
    — surface this as a one-time setup tip in the UI; (3) names still come from the
    a11y/DOM tree where available, and audio + voiceprint fallback carry identity
    through the gap regardless.

Each adapter is **testable offline** against PNG/JPEG frame fixtures.

### 2.5 `TimelineBuilder`
Turn noisy per-frame observations into clean `(start, end, name, confidence)`
segments.
- Group by name; open a segment after K consecutive speaking frames (e.g. 2),
  close after M quiet frames (e.g. 2) — hysteresis rides out UI-cue lag/flicker.
- **Allow overlaps** (crosstalk). Do not force one speaker per instant.
- Merge in the mic-VAD **self** spans (the user) with high confidence.
- Normalize OCR name variants ("Sarah J" → "Sarah Jones") via a per-session
  alias table.
- Emit `visual_timeline.json` (schema in `03_DATA_CONTRACTS.md`). The flat
  `segments` array maps directly onto the `timeline` field `label-merge` wants.
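
The open-after-K / close-after-M rule can be sketched per name, which also gives overlapping segments for free. This is an illustrative reduction: `Segment` and `buildSegments` are hypothetical, confidence, alias normalization, the self-span merge, and `visual_gap` handling are left out, and a segment's end is taken as the last frame in which the speaker was seen speaking.

```swift
import Foundation

struct Segment: Equatable {
    let name: String
    let start: Double
    let end: Double
}

/// `frames` lists, in time order, which names showed a speaking cue at time `t`.
func buildSegments(frames: [(t: Double, speaking: Set<String>)],
                   openAfter k: Int = 2,
                   closeAfter m: Int = 2) -> [Segment] {
    struct State {
        var onRun = 0              // consecutive speaking frames
        var offRun = 0             // consecutive quiet frames while open
        var runStart = 0.0         // time of the first frame in the current run
        var openStart: Double?     // non-nil while a segment is open
        var lastSpeaking = 0.0
    }
    var states: [String: State] = [:]
    var out: [Segment] = []

    for frame in frames {
        for name in frame.speaking where states[name] == nil { states[name] = State() }
        for name in states.keys.sorted() {
            var s = states[name]!
            if frame.speaking.contains(name) {
                if s.onRun == 0 { s.runStart = frame.t }
                s.onRun += 1
                s.offRun = 0
                s.lastSpeaking = frame.t
                if s.openStart == nil && s.onRun >= k { s.openStart = s.runStart }
            } else {
                s.onRun = 0
                if let start = s.openStart {
                    s.offRun += 1
                    if s.offRun >= m {
                        out.append(Segment(name: name, start: start, end: s.lastSpeaking))
                        s.openStart = nil
                        s.offRun = 0
                    }
                }
            }
            states[name] = s
        }
    }
    // Flush segments still open when the frames run out.
    for name in states.keys.sorted() {
        if let s = states[name], let start = s.openStart {
            out.append(Segment(name: name, start: start, end: s.lastSpeaking))
        }
    }
    return out
}
```

Since each name carries its own state, two speakers can hold open segments at the same instant, which matches the "allow overlaps" rule.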

### 2.6 `SessionPackager`
Write the session folder and, if the call is longer than ~3 min, produce a
**chunk plan**: ~2–3 min windows on `mixed_mono_16k.wav`, each with its
**timeline slice rebased to chunk-local seconds**.
```
~/Ten31Transcripts/sessions/2026-06-05T14-03_zoom/
  mic.wav  system.wav  mixed_mono_16k.wav
  visual_timeline.json
  manifest.json
  (chunks/ produced transiently if chunking)
  speakers.json          # written after backend hand-off
```
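
A chunk plan with rebased timeline slices might be computed as below. This is a sketch under assumptions: `TimelineSegment`, `Chunk`, and `chunkPlan` are hypothetical names, the 150 s window and 180 s threshold are picked from the "~2–3 min" and "~3 min" ranges above, segments that straddle a boundary are clipped into both chunks, and no special handling is given to a very short final chunk.

```swift
import Foundation

struct TimelineSegment: Equatable {
    let start: Double
    let end: Double
    let name: String
}

struct Chunk: Equatable {
    let start: Double                  // global seconds (session time)
    let end: Double
    let segments: [TimelineSegment]    // rebased to chunk-local seconds
}

func chunkPlan(duration: Double,
               timeline: [TimelineSegment],
               chunkSeconds: Double = 150,
               threshold: Double = 180) -> [Chunk] {
    // Short calls go out as a single request with the timeline untouched.
    guard duration > threshold else {
        return [Chunk(start: 0, end: duration, segments: timeline)]
    }
    var chunks: [Chunk] = []
    var start = 0.0
    while start < duration {
        let end = min(start + chunkSeconds, duration)
        let local = timeline.compactMap { seg -> TimelineSegment? in
            // Clip the segment to the window, then shift it to chunk-local time.
            let s = max(seg.start, start)
            let e = min(seg.end, end)
            guard e > s else { return nil }
            return TimelineSegment(start: s - start, end: e - start, name: seg.name)
        }
        chunks.append(Chunk(start: start, end: end, segments: local))
        start = end
    }
    return chunks
}
```

Each chunk's `start` is the offset that has to be added back to the timestamps the backend returns for that chunk.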

### 2.7 `SparkControlClient`
Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`** with
`file`, `timeline`, `known_voiceprints`, `transcribe=true`.
- **Sequential only** — one audio request in flight (parallel ⇒ `503 + Retry-After`).
- **Self-signed TLS** — skip verification (`URLSession` delegate trusting the
  Start9 cert) or trust the Root CA. **No auth on the LAN.**
- **Per chunk:** call `label-merge` with that chunk's audio + rebased timeline +
  the **accumulated** voiceprints; offset returned timestamps back to global and
  append. Names unify across chunks because the same names/voiceprints are passed
  each time; new voiceprints accumulate into the store.
- Retry on `503` after `Retry-After`; on hard failure keep the session folder and
  surface "Resend" in the UI.
- Limits to respect: **200 MB/request** (`413`), transcription timeout ~300 s,
  diarization ~600 s. Chunking keeps requests well under these.
- See `03_DATA_CONTRACTS.md §4` for exact fields and a real response.
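
The sequential-only and retry-on-`503` rules can be sketched as a plain loop with the transport injected, so nothing here touches the network. `SendResult`, `sendSequentially`, and the `maxRetries` cap are assumptions for illustration; the real client would wrap a `URLSession` upload in the `send` closure and map the HTTP status and `Retry-After` header onto these cases.

```swift
import Foundation

enum SendResult {
    case ok(Data)                         // 2xx with the response body
    case busy(retryAfter: TimeInterval)   // 503 + Retry-After
    case failed(status: Int)              // any other hard failure
}

/// Sends chunks strictly one at a time. Returns nil on hard failure so the
/// caller can keep the session folder and offer "Resend".
func sendSequentially(
    chunks: [Data],
    maxRetries: Int = 3,
    send: (Data) -> SendResult,
    wait: (TimeInterval) -> Void = { Thread.sleep(forTimeInterval: $0) }
) -> [Data]? {
    var responses: [Data] = []
    for chunk in chunks {
        var attempts = 0
        sending: while true {
            switch send(chunk) {
            case .ok(let body):
                responses.append(body)
                break sending             // move on to the next chunk
            case .busy(let retryAfter):
                attempts += 1
                if attempts > maxRetries { return nil }
                wait(retryAfter)          // honor Retry-After before retrying
            case .failed:
                return nil
            }
        }
    }
    return responses
}
```

Because the next chunk is only attempted after the previous one returns, there is never more than one audio request in flight.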

### 2.8 Result assembly → `speakers.json`
Concatenate the per-chunk `label-merge` results into one global, named,
speaker-attributed transcript (timestamps offset to session time). This is the
seam to the user's existing summarizer. The app does not analyze past this.
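
Offsetting and concatenating per-chunk results is a small transformation. In the sketch below `NamedSegment` and its fields are placeholders; the actual response fields are defined in `03_DATA_CONTRACTS.md` and are not reproduced here.

```swift
import Foundation

struct NamedSegment: Equatable {
    var start: Double     // chunk-local on input, session time on output
    var end: Double
    var speaker: String
    var text: String
}

/// `perChunk` pairs each chunk's global start offset with its returned segments.
func stitch(_ perChunk: [(offset: Double, segments: [NamedSegment])]) -> [NamedSegment] {
    let shifted = perChunk.flatMap { chunk -> [NamedSegment] in
        chunk.segments.map { seg in
            NamedSegment(start: seg.start + chunk.offset,
                         end: seg.end + chunk.offset,
                         speaker: seg.speaker,
                         text: seg.text)
        }
    }
    // Order by global start time so the transcript reads chronologically.
    return shifted.sorted { $0.start < $1.start }
}
```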

### 2.9 `VoiceprintStore`
Local persistence of named voiceprints — the compounding-identity layer.
- File: `~/Ten31Transcripts/voiceprints.json` →
  `{ "<name>": { "vector": [192 floats], "updated": <iso>, "calls": <int> } }`.
- **On send:** load all entries → pass as `known_voiceprints` to `label-merge`.
- **On response:** for each speaker resolved by **visual** (or a high-similarity
  **voiceprint** match), store/refresh that name's vector. **Never** store
  `Unknown_N`.
- **Update policy (open, start simple):** overwrite with the latest
  high-confidence vector, or keep a running mean per name. v1 = store latest with
  `overlap_confidence ≥ ~0.8`; refine to averaging later (`01 §7.4`).
- Editable/clearable from the menu-bar UI (rename, delete a person, reset).
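
The v1 policy (store the latest vector at or above the confidence floor, never store `Unknown_N`) could look like the sketch below. The JSON shape follows the file format above; `Voiceprint`, `record`, and the choice to bump `calls` on every stored update are assumptions, and the voiceprint-match path (which reports `match_similarity` rather than `overlap_confidence`) is not modeled.

```swift
import Foundation

struct Voiceprint: Codable, Equatable {
    var vector: [Float]
    var updated: String   // ISO 8601 timestamp
    var calls: Int
}

struct VoiceprintStore {
    private(set) var entries: [String: Voiceprint] = [:]

    init() {}

    /// Load from the contents of voiceprints.json.
    init(json: Data) throws {
        entries = try JSONDecoder().decode([String: Voiceprint].self, from: json)
    }

    /// Serialize in the `{ "<name>": { vector, updated, calls } }` shape.
    func json() throws -> Data {
        try JSONEncoder().encode(entries)
    }

    /// v1 policy: overwrite with the latest sufficiently confident vector.
    mutating func record(name: String,
                         vector: [Float],
                         overlapConfidence: Double,
                         now: Date = Date(),
                         minConfidence: Double = 0.8) {
        guard !name.hasPrefix("Unknown_"), overlapConfidence >= minConfidence else { return }
        let calls = (entries[name]?.calls ?? 0) + 1
        entries[name] = Voiceprint(vector: vector,
                                   updated: ISO8601DateFormatter().string(from: now),
                                   calls: calls)
    }
}
```

Switching to a running mean later would only change the body of `record`; the file format already carries the `calls` count needed to weight it.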

### 2.10 `MenuBarUI` (SwiftUI, `LSUIElement`)
Status (idle / detected / recording / uploading), manual start/stop, recent
sessions (open folder, resend, delete), adapter toggles, **backend host + a
health check** (`GET /api/status`), output folder, voiceprint manager, and a
permissions checklist (Screen Recording, Microphone, Accessibility).

## 3. macOS frameworks & permissions

| Need | Framework | Permission |
|---|---|---|
| System audio + window frames | ScreenCaptureKit | Screen Recording |
| Microphone | AVFoundation / CoreAudio | Microphone |
| Meet/Signal names, tab URL | Accessibility (AXUIElement) / AppleScript | Accessibility + Automation |
| OCR + cue analysis | Vision (`VNRecognizeTextRequest`) | none |
| App/tab detection | AppKit (`NSWorkspace`) | none |

Stable signing identity avoids permission re-prompts on rebuild.

## 4. Performance
Window-scoped capture + 3 fps + Vision-on-Neural-Engine is light; audio is cheap;
frames are released immediately so memory stays flat. The app idles near-zero
until a call starts. Backend requests are sequential and chunked, so they never
saturate the GPU.

## 5. The merge — now done by the backend
The app no longer implements the overlap vote. `label-merge` resolves each
anonymous cluster in order:
1. **visual** — timeline name with the most temporal overlap (`source: "visual"`,
   `overlap_confidence`);
2. **voiceprint** — closest `known_voiceprints` match above `voiceprint_threshold`
   (`source: "voiceprint"`, `match_similarity`);
3. **`Unknown_N`** (`source: "unmatched"`) — never guessed/mislabeled.

The app's contribution is a good timeline (incl. mic-VAD self spans) and an
ever-growing voiceprint library. `min_overlap` and `voiceprint_threshold` are
tunable request params if precision needs adjusting.