Phase 0: menu-bar scaffold, permissions, backend health check

Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00
commit b2ae3a62b9
19 changed files with 1448 additions and 0 deletions
@@ -0,0 +1,151 @@
+# Project Brief — Ten31 Transcripts
+
+> Local macOS app that auto-detects conference calls, records local audio, and
+> produces a **visual-derived, timestamped speaker timeline** — then hands the
+> mixed audio + that timeline to the operator's **SparkControl** backend, which
+> diarizes, names the speakers (majority-overlap vote against the timeline), and
+> returns named transcript segments. A growing **voiceprint library** recovers
+> speakers even when the visual cue is missing.
+
+Master context document. Read this first, then `02_ARCHITECTURE.md`,
+`03_DATA_CONTRACTS.md`, `04_BUILD_PLAN.md`. The SparkControl API is now fully
+specified — see `03_DATA_CONTRACTS.md` (and the source `AUDIO_API.md`).
+
+---
+
+## 1. What we are building
+
+A lightweight, always-running **menu-bar app on macOS** that:
+
+1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams,
+   or Signal.
+2. **Records two local audio tracks** — system audio (everyone else) and the
+   user's microphone (the user) — and **mixes them to one 16 kHz mono WAV** for
+   the backend.
+3. **Watches the call window** at ~2–4 fps and, per app, reads participant
+   **names** and the **active-speaker cue**, producing a
+   `(start, end, name, confidence)` **visual timeline** — its best guess at who
+   was talking when.
+4. **Discards every video frame after extraction.** No video is ever written to
+   disk. Only audio + the derived timeline persist locally.
+5. On call end, **POSTs the mixed audio + the visual timeline (+ the known
+   voiceprint library) to `POST /api/audio/label-merge`** on SparkControl, which
+   returns **named, speaker-attributed transcript segments** and a **voiceprint
+   per speaker**.
+6. **Persists the returned voiceprints** keyed by name, so the next call can pass
+   them as `known_voiceprints` and recover a speaker by voice when the visual cue
+   is absent (camera off, a bad OCR frame).
+
+The app's job ends at receiving and storing the named segments from SparkControl.
+**All transcription, diarization, and the name-merge happen on the backend.** Do
+not build transcription, diarization, or the merge vote in this app.
+
+## 2. Why the visual timeline still matters (the core idea)
+
+Audio diarization (NVIDIA Sortformer on the backend) is excellent at
+**segmentation** — precise speaker boundaries — but its clusters are **anonymous**
+(`Speaker_0`, `Speaker_1`…). It cannot name anyone.
+
+The screen *already knows the names*. Each app visually marks the active speaker
+(colored tile border, animated audio bars, a ring around an avatar) next to that
+person's name/initials.
+
+So responsibilities split cleanly:
+- **Audio (backend)** owns *segmentation* — the exact *when*.
+- **Visual capture (this app)** owns *identity* — the *who*.
+- **`label-merge` (backend)** fuses them: it diarizes, then assigns each cluster
+  the timeline name with the most temporal overlap. The visual track needn't be
+  perfect — it only needs to win the per-cluster vote.
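+
+To make the vote concrete, here is a minimal Swift sketch of a majority-overlap
+assignment. It is an illustration only: the real vote runs on the backend inside
+`label-merge`, and this app must not implement it. The types `Span` and
+`TimelineEntry`, the function names, and the sample names and timings are all
+hypothetical.
+
+```swift
+// Illustration only: the real vote runs on the backend inside `label-merge`.
+// All type names, function names, and sample data are hypothetical.
+
+struct Span {
+    let start: Double  // seconds
+    let end: Double    // seconds
+}
+
+struct TimelineEntry {
+    let span: Span
+    let name: String
+}
+
+/// Seconds during which two spans overlap (0 if they are disjoint).
+func overlap(_ a: Span, _ b: Span) -> Double {
+    return max(0, min(a.end, b.end) - max(a.start, b.start))
+}
+
+/// Returns the timeline name with the most total overlap against one
+/// cluster's diarized segments, or nil if nothing overlaps.
+/// Ties are broken arbitrarily in this sketch.
+func winningName(for segments: [Span], timeline: [TimelineEntry]) -> String? {
+    var totals: [String: Double] = [:]
+    for entry in timeline {
+        for segment in segments {
+            totals[entry.name, default: 0] += overlap(segment, entry.span)
+        }
+    }
+    let best = totals.filter { $0.value > 0 }.max(by: { $0.value < $1.value })
+    return best?.key
+}
+
+// A deliberately imperfect visual timeline (made-up sample data).
+let timeline = [
+    TimelineEntry(span: Span(start: 0, end: 4), name: "Alice"),
+    TimelineEntry(span: Span(start: 4, end: 6), name: "Bob"),
+    TimelineEntry(span: Span(start: 6, end: 10), name: "Alice"),
+]
+
+// Diarized segments for two anonymous clusters.
+let speaker0 = [Span(start: 0.5, end: 4.5), Span(start: 6.2, end: 9.0)]
+let speaker1 = [Span(start: 4.4, end: 6.1)]
+```
+
+With this sample data `speaker0` resolves to Alice (about 6.3 s of overlap
+against 0.5 s for Bob) and `speaker1` resolves to Bob, even though the visual
+boundaries do not line up with the diarized ones.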
+
+**New compounding layer — the voiceprint library.** Every named cluster comes
+back from the backend with a 192-dim TitaNet voiceprint. The app persists these
+keyed by name and replays them as `known_voiceprints` next time. Resolution order
+per cluster becomes: **visual overlap → voiceprint match → `Unknown_N`**, so an
+unmatched cluster stays unknown rather than being mislabeled. The screen capture
+thus *enrolls a voice library for free*, and over a few calls the system can
+name regulars even with cameras off.
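+
+The resolution order above can be sketched in a few lines of Swift. This is an
+illustration of the idea, not the backend's implementation: cosine similarity,
+the `0.7` threshold, the function names, and the tiny 3-dim vectors are all
+assumptions made for the example (real voiceprints are 192-dim).
+
+```swift
+// Illustration only: resolution happens on the backend.
+// Cosine similarity and the 0.7 threshold are assumptions, not the contract.
+
+/// Cosine similarity between two equal-length vectors.
+func cosine(_ a: [Double], _ b: [Double]) -> Double {
+    precondition(a.count == b.count, "vectors must have the same length")
+    var dot = 0.0, normA = 0.0, normB = 0.0
+    for i in a.indices {
+        dot += a[i] * b[i]
+        normA += a[i] * a[i]
+        normB += b[i] * b[i]
+    }
+    guard normA > 0, normB > 0 else { return 0 }
+    return dot / (normA.squareRoot() * normB.squareRoot())
+}
+
+/// Resolution order: visual overlap, then voiceprint match, then Unknown_N.
+func resolveName(visualName: String?,
+                 voiceprint: [Double],
+                 known: [String: [Double]],
+                 unknownIndex: Int,
+                 threshold: Double = 0.7) -> String {
+    // 1. A name that won the visual overlap vote takes priority.
+    if let name = visualName { return name }
+    // 2. Otherwise fall back to the closest stored voiceprint, if close enough.
+    let scored = known.map { (name: $0.key, score: cosine(voiceprint, $0.value)) }
+    if let best = scored.max(by: { $0.score < $1.score }), best.score >= threshold {
+        return best.name
+    }
+    // 3. Never guess: leave the cluster unknown.
+    return "Unknown_\(unknownIndex)"
+}
+
+// Made-up library with two enrolled speakers.
+let knownVoiceprints: [String: [Double]] = [
+    "Alice": [1, 0, 0],
+    "Bob": [0, 1, 0],
+]
+```
+
+A cluster with no visual name and a voiceprint close to Alice's resolves to
+"Alice"; one that matches nobody in the library comes back as `Unknown_0`.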
+
+## 3. Hard scope boundaries
+
+**In scope (this app):**
+- Call detection for Meet / Zoom / Teams / Signal.
+- Dual-track local audio capture + mix-to-mono for the backend.
+- Low-fps window capture → OCR (names) + active-speaker cue detection.
+- Per-app "adapter" modules encapsulating each app's UI quirks.
+- Building the visual timeline; **mic-VAD self-labeling** (the mic track is the
+  user, so hot-mic spans pre-seed the user's name into the timeline).
+- Chunking long calls (~2–3 min) and calling `label-merge` **sequentially**.
+- A local **voiceprint store** (persist + replay named voiceprints).
+- Storing the backend's named transcript segments locally.
+- A minimal menu-bar UI: status, manual start/stop, recent sessions, adapter
+  toggles, backend host/health, output folder.
+
+**Out of scope (owned by the backend):**
+- Transcription, diarization, the name-merge vote, summarization/analysis.
+
+**Explicitly not doing:** saving video; cloud anything. Everything stays on the
+operator's LAN.
+
+## 4. Key decisions (now resolved against the real contract)
+
+| Decision | Choice | Why |
+|---|---|---|
+| Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. |
+| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
+| Backend audio format | **Mixed-mono 16 kHz WAV** | Diarizer separates speakers from one mixed stream; 16 kHz is ideal. |
+| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
+| Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. |
+| Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
+| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
+| Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). |
+| Long calls | Chunk ~2–3 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. |
+| Transport / TLS | `multipart/form-data`, file field `file`; self-signed Start9 cert (skip verify or trust the Root CA); **no auth on LAN** | Matches every other SparkControl endpoint. |
+| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
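+
+As a rough sketch of the "Long calls" row, the chunk boundaries could be
+computed as below. The `Chunk` type, the function name, and the 150-second
+default (picked from the ~2–3 min range) are assumptions for illustration; the
+real chunker may well prefer to cut at silences.
+
+```swift
+// Hypothetical helper: splits a recording into consecutive fixed-length chunks.
+
+struct Chunk: Equatable {
+    let start: Double  // seconds from the start of the call
+    let end: Double    // seconds from the start of the call
+}
+
+/// Returns consecutive chunks of at most `chunkLength` seconds covering
+/// `0..<duration`. The last chunk may be shorter than the others.
+func chunkRanges(duration: Double, chunkLength: Double = 150) -> [Chunk] {
+    precondition(chunkLength > 0, "chunkLength must be positive")
+    var chunks: [Chunk] = []
+    var start = 0.0
+    while start < duration {
+        let end = min(start + chunkLength, duration)
+        chunks.append(Chunk(start: start, end: end))
+        start = end
+    }
+    return chunks
+}
+
+// A 400-second call yields three chunks: 0-150, 150-300, 300-400.
+let chunks = chunkRanges(duration: 400)
+```
+
+Iterating over `chunks` in a plain `for` loop and awaiting each response before
+sending the next keeps exactly one audio request in flight, which is what the
+"Requests" row asks for.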
+
+### On forking Hyprnote
+Unchanged recommendation: the audio capture is the trivial part (~200 lines of
+native Swift) and the rest (Vision screen-reading) is native too. Build native;
+use Hyprnote's capture/detection only as reference. Fork remains an override.
+
+## 5. Target apps & identifiers
+
+| App | Join via | Bundle ID(s) | Speaking cue / names |
+|---|---|---|---|
+| Zoom | Native | `us.zoom.xos` | Colored tile border; name label in tile. |
+| Microsoft Teams | Native (new) | `com.microsoft.teams2` (new), `com.microsoft.teams` (classic) | Colored ring/border; labeled. |
+| Signal | Native (Electron) | `org.whispersystems.signal-desktop` | Ring around avatar/initials; try Accessibility names first. |
+| Google Meet | Browser tab | `com.google.Chrome`, `com.apple.Safari`, `company.thebrowser.Browser` (Arc)… | Canvas video → Vision for the cue; DOM names → Accessibility/AppleScript; confirm via active-tab URL `meet.google.com`. |
+
+Four required adapters; adding a 5th must be one new file conforming to the
+`AppAdapter` protocol.
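+
+The identification step implied by the table might look like the sketch below.
+It covers only the bundle-ID lookup and the Meet URL check, not the
+"mic running" signal or the `AppAdapter` protocol itself, whose shape is not
+defined in this brief. `CallApp` and `identifyCallApp` are hypothetical names,
+and the browser list is incomplete, as in the table.
+
+```swift
+import Foundation
+
+// Hypothetical names; bundle IDs are taken from the table above.
+enum CallApp {
+    case zoom, teams, signal, meet
+}
+
+let nativeApps: [String: CallApp] = [
+    "us.zoom.xos": .zoom,
+    "com.microsoft.teams2": .teams,
+    "com.microsoft.teams": .teams,
+    "org.whispersystems.signal-desktop": .signal,
+]
+
+// Browsers that may host a Meet tab (the table's list is open-ended).
+let browsers: Set<String> = [
+    "com.google.Chrome",
+    "com.apple.Safari",
+    "company.thebrowser.Browser",
+]
+
+/// Maps a frontmost app (and, for browsers, its active-tab URL) to a call app.
+/// Returns nil when the app is not a known call app or the tab is not Meet.
+func identifyCallApp(bundleID: String, activeTabURL: String? = nil) -> CallApp? {
+    if let app = nativeApps[bundleID] { return app }
+    if browsers.contains(bundleID),
+       let url = activeTabURL,
+       let host = URLComponents(string: url)?.host,
+       host == "meet.google.com" {
+        return .meet
+    }
+    return nil
+}
+```
+
+A browser bundle ID alone is never enough: without a `meet.google.com` tab URL
+the function returns `nil`, which mirrors the "confirm via active-tab URL" note
+in the Google Meet row.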
+
+## 6. The backend (do not rebuild) — now concrete
+
+SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
+- **STT:** NVIDIA Parakeet TDT 0.6B — `POST /v1/audio/transcriptions` (OpenAI-compatible).
+- **Diarization:** NVIDIA Sortformer 4spk-v1 — `POST /api/audio/diarize-chunk`
+  (anonymous clusters + voiceprints) and `POST /api/audio/transcribe-with-speakers`.
+- **Embeddings:** NVIDIA TitaNet (192-dim voiceprints).
+- **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize +
+  name from the visual timeline (+ voiceprint fallback), optionally transcribe,
+  in one synchronous call.
+- Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.
+
+Full request/response shapes, curl examples, limits, and error formats are in
+`03_DATA_CONTRACTS.md`.
+
+## 7. Remaining open items (small)
+
+1. **Base URL — RESOLVED.** `https://your-spark-backend.local:62419`. Prefer the
+   `.local` form over a raw IP address; it survives IP changes. Ship the
+   `.local` host as the default; keep it editable in settings.
+   Service-discovery at `GET /api/endpoints`.
+2. **Send trigger** — assume auto-POST on call end; expose a "hold for review"
+   toggle if the user wants to eyeball the timeline first.
+3. **Retention** — keep the session folder after a successful hand-off, or prune
+   audio and keep only `speakers.json` + voiceprints? Default: keep everything,
+   user-configurable.
+4. **Voiceprint update policy** — overwrite vs running-average a person's stored
+   voiceprint across calls (see `02_ARCHITECTURE.md §2.9`). Start simple
+   (store/refresh latest high-confidence), refine later.
+5. **Signing** — stable identity so macOS doesn't re-prompt for permissions on
+   each rebuild.
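+
+For item 4, the two candidate update policies can be compared with a small
+sketch. Everything here is hypothetical (`StoredVoiceprint`, `UpdatePolicy`,
+`updated`), and the 2-dim vectors stand in for real 192-dim voiceprints; the
+actual policy is still open per the list above.
+
+```swift
+// Hypothetical types for comparing "overwrite" with "running average".
+
+struct StoredVoiceprint {
+    var vector: [Double]
+    var count: Int  // number of calls averaged into `vector`
+}
+
+enum UpdatePolicy {
+    case overwrite
+    case runningAverage
+}
+
+/// Returns the voiceprint to persist after a call produced `new` for a speaker.
+/// Falls back to overwriting when nothing is stored yet or dimensions differ.
+func updated(_ stored: StoredVoiceprint?,
+             with new: [Double],
+             policy: UpdatePolicy) -> StoredVoiceprint {
+    guard let stored = stored,
+          policy == .runningAverage,
+          stored.vector.count == new.count else {
+        return StoredVoiceprint(vector: new, count: 1)
+    }
+    // Incremental mean: weight the stored vector by how many calls it represents.
+    let n = Double(stored.count)
+    let averaged = zip(stored.vector, new).map { ($0 * n + $1) / (n + 1) }
+    return StoredVoiceprint(vector: averaged, count: stored.count + 1)
+}
+
+let previous = StoredVoiceprint(vector: [1, 0], count: 1)
+let averagedPrint = updated(previous, with: [0, 1], policy: .runningAverage)
+let overwrittenPrint = updated(previous, with: [0, 1], policy: .overwrite)
+```
+
+Here `averagedPrint.vector` is `[0.5, 0.5]` with a count of 2, while
+`overwrittenPrint` simply holds the latest vector. An averaged vector is no
+longer unit-length, so it may need re-normalizing if the backend expects
+normalized voiceprints; that is an open question, not something this brief
+specifies.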