Grant Gilliam b2ae3a62b9 Phase 0: menu-bar scaffold, permissions, backend health check
Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml
via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00

# Project Brief — Ten31 Transcripts
> Local macOS app that auto-detects conference calls, records local audio, and
> produces a **visual-derived, timestamped speaker timeline** — then hands the
> mixed audio + that timeline to the operator's **SparkControl** backend, which
> diarizes, names the speakers (majority-overlap vote against the timeline), and
> returns named transcript segments. A growing **voiceprint library** recovers
> speakers even when the visual cue is missing.

Master context document. Read this first, then `02_ARCHITECTURE.md`,
`03_DATA_CONTRACTS.md`, `04_BUILD_PLAN.md`. The SparkControl API is now fully
specified; see `03_DATA_CONTRACTS.md` (and the source `AUDIO_API.md`).

---
## 1. What we are building
A lightweight, always-running **menu-bar app on macOS** that:
1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams,
or Signal.
2. **Records two local audio tracks** — system audio (everyone else) and the
user's microphone (the user) — and **mixes them to one 16 kHz mono WAV** for
the backend.
3. **Watches the call window** at a low frame rate (~2-4 fps) and, per app, reads participant
**names** and the **active-speaker cue**, producing a
`(start, end, name, confidence)` **visual timeline** — its best guess at who
was talking when.
4. **Discards every video frame after extraction.** No video is ever written to
disk. Only audio + the derived timeline persist locally.
5. On call end, **POSTs the mixed audio + the visual timeline (+ the known
voiceprint library) to `POST /api/audio/label-merge`** on SparkControl, which
returns **named, speaker-attributed transcript segments** and a **voiceprint
per speaker**.
6. **Persists the returned voiceprints** keyed by name, so the next call can pass
them as `known_voiceprints` and recover a speaker by voice when the visual cue
is absent (camera off, a bad OCR frame).

The app's job ends at receiving and storing the named segments from SparkControl.
**All transcription, diarization, and the name-merge happen on the backend.** Do
not build transcription, diarization, or the merge vote in this app.
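
The `(start, end, name, confidence)` visual timeline from step 3 can be pictured as per-frame readings collapsed into spans. The following is a minimal sketch in Python for illustration only (the app itself is Swift); the `Span` type, `build_timeline`, and the `max_gap` parameter are hypothetical names, not part of the project.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Span:
    start: float       # seconds from call start
    end: float
    name: str
    confidence: float  # mean cue confidence over the merged frames


def build_timeline(
    observations: Iterable[Tuple[float, Optional[str], float]],
    frame_interval: float,
    max_gap: float = 1.0,  # assumed tolerance for dropped frames
) -> List[Span]:
    """Collapse per-frame (time, active_name, confidence) readings into spans.

    Frames with no detected speaker (name is None) are skipped. A frame
    extends the previous span when the name matches and the frame starts
    at most max_gap seconds after that span ended.
    """
    spans: List[Span] = []
    counts: List[int] = []  # frames merged into each span, for the running mean
    for t, name, conf in observations:
        if name is None:
            continue
        last = spans[-1] if spans else None
        if last is not None and last.name == name and t - last.end <= max_gap:
            n = counts[-1]
            last.confidence = (last.confidence * n + conf) / (n + 1)
            last.end = t + frame_interval
            counts[-1] = n + 1
        else:
            spans.append(Span(t, t + frame_interval, name, conf))
            counts.append(1)
    return spans
```

Only these spans and the audio need to persist; the frames themselves can be dropped as soon as each reading is extracted.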
## 2. Why the visual timeline still matters (the core idea)
Audio diarization (NVIDIA Sortformer on the backend) is excellent at
**segmentation** — precise speaker boundaries — but its clusters are **anonymous**
(`Speaker_0`, `Speaker_1`…). It cannot name anyone.

The screen *already knows the names*. Each app visually marks the active speaker
(colored tile border, animated audio bars, a ring around an avatar) next to that
person's name/initials.

So responsibilities split cleanly:
- **Audio (backend)** owns *segmentation* — the exact *when*.
- **Visual capture (this app)** owns *identity* — the *who*.
- **`label-merge` (backend)** fuses them: it diarizes, then assigns each cluster
the timeline name with the most temporal overlap. The visual track needn't be
perfect; it only needs to win the per-cluster vote.

**New compounding layer: the voiceprint library.** Every named cluster comes
back from the backend with a 192-dim TitaNet voiceprint. The app persists these
keyed by name and replays them as `known_voiceprints` next time. Resolution order
per cluster becomes: **visual overlap → voiceprint match → `Unknown_N`** (never
mislabeled). So the screen capture *enrolls a voice library for free*, and over a
few calls the system can name regulars even with cameras off.
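
The resolution order above (visual overlap, then voiceprint match, then `Unknown_N`) can be sketched as follows. This is an illustration of the idea in Python, not the backend's code: the merge vote lives in SparkControl and must not be rebuilt in the app. The cosine similarity measure, the `min_similarity` threshold of 0.7, and all function names are assumptions.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

TimelineSpan = Tuple[float, float, str]  # (start, end, name) from the visual timeline


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def name_cluster(
    segments: List[Tuple[float, float]],       # one diarized cluster's (start, end) spans
    timeline: List[TimelineSpan],
    voiceprint: Optional[Sequence[float]],     # this cluster's embedding, if any
    known: Dict[str, Sequence[float]],         # stored name -> voiceprint
    unknown_index: int,
    min_similarity: float = 0.7,               # assumed threshold
) -> str:
    # 1. Visual overlap: the name with the most total temporal overlap wins.
    totals: Dict[str, float] = {}
    for s, e in segments:
        for ts, te, name in timeline:
            totals[name] = totals.get(name, 0.0) + overlap(s, e, ts, te)
    if totals:
        best = max(totals, key=totals.get)
        if totals[best] > 0.0:
            return best
    # 2. Voiceprint match: fall back to the closest stored voice, if close enough.
    if voiceprint is not None and known:
        best_name, best_sim = max(
            ((n, cosine(voiceprint, v)) for n, v in known.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= min_similarity:
            return best_name
    # 3. Never guess: an unresolved cluster stays anonymous.
    return f"Unknown_{unknown_index}"
```

Because the vote sums overlap across the whole cluster, a few wrong or missing timeline spans do not change the winner as long as the correct name still has the largest total.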
## 3. Hard scope boundaries
**In scope (this app):**
- Call detection for Meet / Zoom / Teams / Signal.
- Dual-track local audio capture + mix-to-mono for the backend.
- Low-fps window capture → OCR (names) + active-speaker cue detection.
- Per-app "adapter" modules encapsulating each app's UI quirks.
- Building the visual timeline; **mic-VAD self-labeling** (the mic track is the
user, so hot-mic spans pre-seed the user's name into the timeline).
- Chunking long calls (~23 min) and calling `label-merge` **sequentially**.
- A local **voiceprint store** (persist + replay named voiceprints).
- Storing the backend's named transcript segments locally.
- A minimal menu-bar UI: status, manual start/stop, recent sessions, adapter
toggles, backend host/health, output folder.

**Out of scope (owned by the backend):**
- Transcription, diarization, the name-merge vote, summarization/analysis.

**Explicitly not doing:** saving video; cloud anything. Everything stays on the
operator's LAN.
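
The mic-VAD self-labeling item in the in-scope list can be illustrated with a simple energy gate over the microphone track. This is a rough Python sketch under stated assumptions: a fixed RMS threshold stands in for a real voice-activity detector, and `hot_mic_spans`, `rms_threshold`, `min_span_s`, and the fixed confidence of 1.0 are all hypothetical choices, not values from the project.

```python
import math
from typing import List, Sequence, Tuple

Span = Tuple[float, float, str, float]  # (start, end, name, confidence)


def hot_mic_spans(
    samples: Sequence[int],      # 16-bit PCM samples from the mic track
    sample_rate: int,
    user_name: str,
    frame_ms: int = 30,
    rms_threshold: float = 500.0,  # assumed; would need tuning per microphone
    min_span_s: float = 0.3,       # drop very short blips
) -> List[Span]:
    """Return timeline spans where the mic is 'hot', labeled with the user's name."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    spans: List[Span] = []
    start = None
    for i in range(n_frames + 1):
        if i < n_frames:
            frame = samples[i * frame_len:(i + 1) * frame_len]
            rms = math.sqrt(sum(x * x for x in frame) / frame_len)
            active = rms >= rms_threshold
        else:
            active = False  # sentinel frame closes a span still open at the end
        t = i * frame_len / sample_rate
        if active and start is None:
            start = t
        elif not active and start is not None:
            if t - start >= min_span_s:
                spans.append((start, t, user_name, 1.0))
            start = None
    return spans
```

Spans produced this way would be merged into the visual timeline before upload, so the user's own cluster wins its overlap vote even when the user's tile is never highlighted on screen.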
## 4. Key decisions (now resolved against the real contract)
| Decision | Choice | Why |
|---|---|---|
| Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. |
| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
| Backend audio format | **Mixed-mono 16 kHz WAV** | Diarizer separates speakers from one mixed stream; 16 kHz is ideal. |
| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
| Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. |
| Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
| Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). |
| Long calls | Chunk ~23 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. |
| Transport / TLS | `multipart/form-data`, file field `file`; self-signed Start9 cert (skip verify or trust the Root CA); **no auth on LAN** | Matches every other SparkControl endpoint. |
| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
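
The "Requests" and "Long calls" rows together imply a simple loop: send one chunk, wait for its answer, and back off when the backend replies `503` with `Retry-After`. Below is a minimal Python sketch of that control flow. The `send` callable is a hypothetical stand-in for the real `label-merge` request, `max_attempts` is an assumed limit, and `Retry-After` is assumed to carry a number of seconds.

```python
import time
from typing import Callable, Dict, List, Tuple

# (status_code, headers, parsed_body) as returned by the hypothetical send()
Response = Tuple[int, Dict[str, str], dict]


def submit_chunks(
    chunks: List[bytes],
    send: Callable[[bytes], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Submit chunks strictly one at a time, honoring 503 + Retry-After."""
    results: List[dict] = []
    for chunk in chunks:  # never more than one audio request in flight
        for _attempt in range(max_attempts):
            status, headers, body = send(chunk)
            if status == 200:
                results.append(body)
                break
            if status == 503:
                # Backend is busy: wait as instructed, then retry this chunk.
                sleep(float(headers.get("Retry-After", 1)))
                continue
            raise RuntimeError(f"label-merge failed with HTTP {status}")
        else:
            raise RuntimeError("label-merge still busy after retries")
    return results
```

Injecting `send` and `sleep` keeps the loop testable without a network or real delays; the results come back in chunk order, ready to be stitched by name and voiceprint.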
### On forking Hyprnote
Unchanged recommendation: the audio capture is the trivial part (~200 lines of
native Swift) and the rest (Vision screen-reading) is native too. Build native;
use Hyprnote's capture/detection only as reference. Fork remains an override.
## 5. Target apps & identifiers
| App | Join via | Bundle ID(s) | Speaking cue / names |
|---|---|---|---|
| Zoom | Native | `us.zoom.xos` | Colored tile border; name label in tile. |
| Microsoft Teams | Native (new) | `com.microsoft.teams2` (new), `com.microsoft.teams` (classic) | Colored ring/border; labeled. |
| Signal | Native (Electron) | `org.whispersystems.signal-desktop` | Ring around avatar/initials; try Accessibility names first. |
| Google Meet | Browser tab | `com.google.Chrome`, `com.apple.Safari`, `company.thebrowser.Browser` (Arc)… | Canvas video → Vision for the cue; DOM names → Accessibility/AppleScript; confirm via active-tab URL `meet.google.com`. |

Four required adapters; adding a 5th must be one new file conforming to the
`AppAdapter` protocol.
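
The "one new file per app" rule suggests a small interface plus a lookup keyed by bundle ID. The sketch below shows that shape in Python for illustration; the real `AppAdapter` is a Swift protocol whose members are not given in this brief, so `bundle_ids`, `read_frame`, and `build_registry` are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Protocol, Tuple

# (participant names seen, active speaker or None, cue confidence)
FrameReading = Tuple[List[str], Optional[str], float]


class AppAdapter(Protocol):
    """What every per-app adapter is assumed to provide."""
    bundle_ids: Tuple[str, ...]

    def read_frame(self, frame: bytes) -> FrameReading: ...


class ZoomAdapter:
    bundle_ids = ("us.zoom.xos",)

    def read_frame(self, frame: bytes) -> FrameReading:
        # Placeholder: real logic would find the colored tile border and OCR the label.
        return [], None, 0.0


class TeamsAdapter:
    bundle_ids = ("com.microsoft.teams2", "com.microsoft.teams")

    def read_frame(self, frame: bytes) -> FrameReading:
        # Placeholder: real logic would find the colored ring/border.
        return [], None, 0.0


def build_registry(adapters: Iterable[AppAdapter]) -> Dict[str, AppAdapter]:
    """Map every bundle ID to the adapter that handles it."""
    registry: Dict[str, AppAdapter] = {}
    for adapter in adapters:
        for bundle_id in adapter.bundle_ids:
            registry[bundle_id] = adapter
    return registry
```

With this shape, supporting another app means adding one adapter class and listing it where the registry is built; nothing in the capture loop changes.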
## 6. The backend (do not rebuild) — now concrete
SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
- **STT:** NVIDIA Parakeet TDT 0.6B — `POST /v1/audio/transcriptions` (OpenAI-compatible).
- **Diarization:** NVIDIA Sortformer 4spk-v1 — `POST /api/audio/diarize-chunk`
(anonymous clusters + voiceprints) and `POST /api/audio/transcribe-with-speakers`.
- **Embeddings:** NVIDIA TitaNet (192-dim voiceprints).
- **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize +
name from the visual timeline (+ voiceprint fallback), optionally transcribe,
in one synchronous call.
- Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.

Full request/response shapes, curl examples, limits, and error formats are in
`03_DATA_CONTRACTS.md`.
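
As an illustration of the health check against a self-signed certificate, here is a small Python sketch using only the standard library. It covers the two TLS options named in this brief (skip verification, or trust the Root CA); the function names, the timeout, and treating any HTTP 200 as "healthy" are assumptions, and the actual response shape is defined in `03_DATA_CONTRACTS.md`.

```python
import ssl
import urllib.request
from typing import Optional


def make_tls_context(skip_verify: bool, root_ca_path: Optional[str] = None) -> ssl.SSLContext:
    """Build a TLS context for a LAN backend with a self-signed certificate."""
    if skip_verify:
        ctx = ssl.create_default_context()
        ctx.check_hostname = False       # must be disabled before setting CERT_NONE
        ctx.verify_mode = ssl.CERT_NONE  # accept any certificate (LAN-only trade-off)
        return ctx
    # Verify normally; root_ca_path may point at the exported Root CA file.
    return ssl.create_default_context(cafile=root_ca_path)


def backend_is_healthy(base_url: str, ctx: ssl.SSLContext, timeout: float = 5.0) -> bool:
    """Return True if GET /api/status answers with HTTP 200."""
    url = base_url.rstrip("/") + "/api/status"
    try:
        with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, TLS failures, and HTTP error statuses.
        return False
```

Trusting the Root CA keeps certificate verification on and is the safer of the two options; skipping verification is simpler but accepts any certificate presented on the network.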
## 7. Remaining open items (small)
1. **Base URL: RESOLVED.** `https://your-spark-backend.local:62419`. The backend
is also reachable by IP address, but prefer the `.local` form; it survives IP
changes. Ship the `.local` host as the default; keep it editable in settings.
Service discovery is available at `GET /api/endpoints`.
2. **Send trigger** — assume auto-POST on call end; expose a "hold for review"
toggle if the user wants to eyeball the timeline first.
3. **Retention** — keep the session folder after a successful hand-off, or prune
audio and keep only `speakers.json` + voiceprints? Default: keep everything,
user-configurable.
4. **Voiceprint update policy** — overwrite vs running-average a person's stored
voiceprint across calls (see `02_ARCHITECTURE.md §2.9`). Start simple
(store/refresh latest high-confidence), refine later.
5. **Signing** — stable identity so macOS doesn't re-prompt for permissions on
each rebuild.
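
For open item 4, the two candidate voiceprint update policies differ only in how a new vector is folded into the stored one. The Python sketch below contrasts them under stated assumptions: voiceprints are plain lists of floats (192 values in practice), and `overwrite` and `running_average` are hypothetical helper names, not project code.

```python
from typing import List, Sequence


def overwrite(stored: Sequence[float], new: Sequence[float]) -> List[float]:
    """Policy A: keep only the latest voiceprint."""
    return list(new)


def running_average(stored: Sequence[float], count: int, new: Sequence[float]) -> List[float]:
    """Policy B: incremental mean.

    count is the number of voiceprints already folded into stored, so the
    caller must persist it alongside the vector.
    """
    if len(stored) != len(new):
        raise ValueError("voiceprint dimensions differ")
    return [(s * count + n) / (count + 1) for s, n in zip(stored, new)]
```

Overwriting adapts quickly but lets one noisy call replace a good voiceprint; averaging is steadier but needs the extra `count` field in the store. Either can be gated on confidence, in line with the "store/refresh latest high-confidence" starting point.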