Project Brief — Ten31 Transcripts
Local macOS app that auto-detects conference calls, records local audio, and produces a visual-derived, timestamped speaker timeline — then hands the mixed audio + that timeline to the operator's SparkControl backend, which diarizes, names the speakers (majority-overlap vote against the timeline), and returns named transcript segments. A growing voiceprint library recovers speakers even when the visual cue is missing.
Master context document. Read this first, then 02_ARCHITECTURE.md,
03_DATA_CONTRACTS.md, 04_BUILD_PLAN.md. The SparkControl API is now fully
specified — see 03_DATA_CONTRACTS.md (and the source AUDIO_API.md).
1. What we are building
A lightweight, always-running menu-bar app on macOS that:
- Detects when the user joins a call in Google Meet, Zoom, Microsoft Teams, or Signal.
- Records two local audio tracks — system audio (everyone else) and the user's microphone (the user) — and mixes them to one 16 kHz mono WAV for the backend.
- Watches the call window at ~2–4 fps and, per app, reads participant
names and the active-speaker cue, producing a
(start, end, name, confidence) visual timeline: its best guess at who was talking when.
- Discards every video frame after extraction. No video is ever written to disk. Only audio + the derived timeline persist locally.
- On call end, POSTs the mixed audio + the visual timeline (+ the known
voiceprint library) to
POST /api/audio/label-merge on SparkControl, which returns named, speaker-attributed transcript segments and a voiceprint per speaker.
- Persists the returned voiceprints keyed by name, so the next call can pass
them as
known_voiceprints and recover a speaker by voice when the visual cue is absent (camera off, a bad OCR frame).
The app's job ends at receiving and storing the named segments from SparkControl. All transcription, diarization, and the name-merge happen on the backend. Do not build transcription, diarization, or the merge vote in this app.
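As an illustration of how per-frame observations could collapse into the (start, end, name, confidence) timeline described above, here is a minimal sketch. It is written in Python for brevity (the app itself is Swift). The `Span` type, the `build_timeline` function, the `frame_interval` and `max_gap` parameters, and the choice to average confidence over a span are all assumptions, not part of the contract in 03_DATA_CONTRACTS.md.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Span:
    start: float       # seconds from call start
    end: float         # seconds from call start
    name: str
    confidence: float  # mean per-frame confidence over the span (assumed policy)


def build_timeline(
    frames: Iterable[Tuple[float, Optional[str], float]],
    frame_interval: float = 0.5,  # assumed: 2 fps capture
    max_gap: float = 1.0,         # assumed: tolerate short cue dropouts
) -> List[Span]:
    """Merge (t, active_name_or_None, confidence) frames into speaker spans.

    Each frame is taken to cover [t, t + frame_interval). Consecutive frames
    with the same name are merged when the gap between them is at most
    max_gap seconds. Frames with no detected speaker are skipped.
    """
    spans: List[Span] = []
    counts: List[int] = []  # frames merged into each span, for the running mean
    for t, name, conf in frames:
        if name is None:
            continue
        if spans and spans[-1].name == name and t - spans[-1].end <= max_gap:
            last, n = spans[-1], counts[-1]
            last.confidence = (last.confidence * n + conf) / (n + 1)
            last.end = t + frame_interval
            counts[-1] = n + 1
        else:
            spans.append(Span(t, t + frame_interval, name, conf))
            counts.append(1)
    return spans
```

Only the resulting spans need to persist; the frames they were derived from can be dropped as soon as the name and cue have been read.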
2. Why the visual timeline still matters (the core idea)
Audio diarization (NVIDIA Sortformer on the backend) is excellent at
segmentation — precise speaker boundaries — but its clusters are anonymous
(Speaker_0, Speaker_1…). It cannot name anyone.
The screen already knows the names. Each app visually marks the active speaker (colored tile border, animated audio bars, a ring around an avatar) next to that person's name/initials.
So responsibilities split cleanly:
- Audio (backend) owns segmentation — the exact when.
- Visual capture (this app) owns identity — the who.
label-merge (backend) fuses them: it diarizes, then assigns each cluster the timeline name with the most temporal overlap. The visual track needn't be perfect; it only needs to win the per-cluster vote.
New compounding layer — the voiceprint library. Every named cluster comes
back from the backend with a 192-dim TitaNet voiceprint. The app persists these
keyed by name and replays them as known_voiceprints next time. Resolution order
per cluster becomes: visual overlap → voiceprint match → Unknown_N (never
mislabeled). So the screen capture enrolls a voice library for free, and over a
few calls the system can name regulars even with cameras off.
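To make the resolution order concrete, the sketch below illustrates the behavior described above. It is explanatory only: this logic lives on the backend and should not be built into the app. The function names, the use of cosine similarity, and the 0.7 threshold are assumptions; the source states only that the fallback is a "voiceprint match".

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Interval = Tuple[float, float]


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the intersection of two intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def vote_by_overlap(
    cluster_segments: Sequence[Interval],
    timeline: Sequence[Tuple[float, float, str]],
) -> Optional[str]:
    """Return the timeline name with the most total overlap, or None."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in cluster_segments:
        for start, end, name in timeline:
            totals[name] += overlap(seg, (start, end))
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best if totals[best] > 0 else None


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def match_voiceprint(
    voiceprint: Sequence[float],
    known: Dict[str, Sequence[float]],
    threshold: float = 0.7,  # assumed value, not from the backend spec
) -> Optional[str]:
    """Return the best-matching known name at or above the threshold."""
    best_name, best_score = None, threshold
    for name, ref in known.items():
        score = cosine(voiceprint, ref)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def resolve(cluster_segments, voiceprint, timeline, known, unknown_index: int) -> str:
    """Visual overlap, then voiceprint match, then Unknown_N."""
    return (
        vote_by_overlap(cluster_segments, timeline)
        or match_voiceprint(voiceprint, known)
        or f"Unknown_{unknown_index}"
    )
```

Falling through to Unknown_N rather than picking the nearest name is what keeps a weak match from turning into a mislabel.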
3. Hard scope boundaries
In scope (this app):
- Call detection for Meet / Zoom / Teams / Signal.
- Dual-track local audio capture + mix-to-mono for the backend.
- Low-fps window capture → OCR (names) + active-speaker cue detection.
- Per-app "adapter" modules encapsulating each app's UI quirks.
- Building the visual timeline; mic-VAD self-labeling (the mic track is the user, so hot-mic spans pre-seed the user's name into the timeline).
- Chunking long calls (~2–3 min) and calling label-merge sequentially.
- A local voiceprint store (persist + replay named voiceprints).
- Storing the backend's named transcript segments locally.
- A minimal menu-bar UI: status, manual start/stop, recent sessions, adapter toggles, backend host/health, output folder.
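The mic-VAD self-labeling item above can be pictured with a simple energy gate. This is a minimal sketch in Python (the app itself is Swift), assuming 16-bit PCM samples from the mic track; the RMS threshold, window length, and function names are hypothetical, and a production VAD would likely be more robust than a fixed threshold.

```python
import math
from typing import List, Sequence, Tuple


def hot_mic_spans(
    samples: Sequence[int],
    sample_rate: int = 16000,
    window_s: float = 0.25,        # assumed analysis window
    rms_threshold: float = 500.0,  # assumed level for 16-bit PCM
) -> List[Tuple[float, float]]:
    """Return (start, end) spans, in seconds, where mic energy exceeds the threshold."""
    window = max(1, int(sample_rate * window_s))
    spans: List[Tuple[float, float]] = []
    for offset in range(0, len(samples), window):
        block = samples[offset:offset + window]
        rms = math.sqrt(sum(x * x for x in block) / len(block))
        if rms < rms_threshold:
            continue
        start = offset / sample_rate
        end = (offset + len(block)) / sample_rate
        if spans and abs(spans[-1][1] - start) < 1e-9:
            spans[-1] = (spans[-1][0], end)  # extend the adjacent span
        else:
            spans.append((start, end))
    return spans


def self_label(
    spans: Sequence[Tuple[float, float]],
    user_name: str,
    confidence: float = 1.0,  # assumed: the mic track is known to be the user
) -> List[Tuple[float, float, str, float]]:
    """Turn hot-mic spans into timeline entries carrying the user's name."""
    return [(start, end, user_name, confidence) for start, end in spans]
```

The labeled spans would then be merged into the visual timeline before hand-off, so the user is named even when their own tile is never highlighted.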
Out of scope (owned by the backend):
- Transcription, diarization, the name-merge vote, summarization/analysis.
Explicitly not doing: saving video; cloud anything. Everything stays on the operator's LAN.
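For the chunking item in the in-scope list, one possible way to plan chunk boundaries is sketched below in Python (the app itself is Swift). The `plan_chunks` name and the 180-second ceiling are assumptions drawn from the "~2–3 min" guidance; splitting on silence rather than at fixed offsets may work better in practice.

```python
import math
from typing import List, Tuple


def plan_chunks(duration_s: float, max_s: float = 180.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into equal chunks, none longer than max_s seconds."""
    if duration_s <= 0:
        return []
    count = max(1, math.ceil(duration_s / max_s))
    size = duration_s / count
    return [
        # Pin the final boundary to duration_s to avoid floating-point drift.
        (i * size, duration_s if i == count - 1 else (i + 1) * size)
        for i in range(count)
    ]
```

A 10-minute call, for example, comes out as four 150-second chunks, which are then sent one after another.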
4. Key decisions (now resolved against the real contract)
| Decision | Choice | Why |
|---|---|---|
| Language / framework | Native Swift + SwiftUI menu-bar app (LSUIElement) | System audio, window capture, Vision all native; one codebase. |
| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
| Backend audio format | Mixed-mono 16 kHz WAV | Diarizer separates speakers from one mixed stream; 16 kHz is ideal. |
| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
| Speaker naming | Backend, via POST /api/audio/label-merge | One call does diarize + overlap-vote naming + transcription. No client merge. |
| Identity recovery | Local voiceprint library replayed as known_voiceprints | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
| Requests | Sequential, one audio request in flight | Parallel audio requests trip a backend GPU race (503 + Retry-After). |
| Long calls | Chunk ~2–3 min, sequential, stitch via names+voiceprints | Diarizer caps at 4 speakers/chunk; voiceprints + names unify across chunks. |
| Transport / TLS | multipart/form-data, file field named file; self-signed Start9 cert (skip verify or trust the Root CA); no auth on LAN | Matches every other SparkControl endpoint. |
| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
On forking Hyprnote
Unchanged recommendation: the audio capture is the trivial part (~200 lines of native Swift) and the rest (Vision screen-reading) is native too. Build native; use Hyprnote's capture/detection only as reference. Fork remains an override.
5. Target apps & identifiers
| App | Join via | Bundle ID(s) | Speaking cue / names |
|---|---|---|---|
| Zoom | Native | us.zoom.xos | Colored tile border; name label in tile. |
| Microsoft Teams | Native (new) | com.microsoft.teams2 (new), com.microsoft.teams (classic) | Colored ring/border; labeled. |
| Signal | Native (Electron) | org.whispersystems.signal-desktop | Ring around avatar/initials; try Accessibility names first. |
| Google Meet | Browser tab | com.google.Chrome, com.apple.Safari, company.thebrowser.Browser (Arc)… | Canvas video → Vision for the cue; DOM names → Accessibility/AppleScript; confirm via active-tab URL meet.google.com. |
Four required adapters; adding a 5th must be one new file conforming to the
AppAdapter protocol.
6. The backend (do not rebuild) — now concrete
SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
- STT: NVIDIA Parakeet TDT 0.6B, via POST /v1/audio/transcriptions (OpenAI-compatible).
- Diarization: NVIDIA Sortformer 4spk-v1, via POST /api/audio/diarize-chunk (anonymous clusters + voiceprints) and POST /api/audio/transcribe-with-speakers.
- Embeddings: NVIDIA TitaNet (192-dim voiceprints).
- ★ Primary endpoint for this app: POST /api/audio/label-merge, which diarizes, names from the visual timeline (+ voiceprint fallback), and optionally transcribes, in one synchronous call.
- Health/discovery: GET /api/status, GET /api/endpoints, GET /v1/models.
Full request/response shapes, curl examples, limits, and error formats are in
03_DATA_CONTRACTS.md.
7. Remaining open items (small)
- Base URL: RESOLVED. A private LAN host, given as a .local mDNS name (preferred over a raw IP, since it survives IP changes), configured in Settings or via the SPARK_BACKEND_URL env var, and never committed. Ship a neutral placeholder as the default; keep it editable in settings. Service discovery at GET /api/endpoints.
- Send trigger: assume auto-POST on call end; expose a "hold for review" toggle if the user wants to eyeball the timeline first.
- Retention: keep the session folder after a successful hand-off, or prune audio and keep only speakers.json + voiceprints? Default: keep everything, user-configurable.
- Voiceprint update policy: overwrite vs running-average a person's stored voiceprint across calls (see 02_ARCHITECTURE.md §2.9). Start simple (store/refresh latest high-confidence), refine later.
- Signing: stable identity so macOS doesn't re-prompt for permissions on each rebuild.
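The two candidate voiceprint update policies from the list above can be compared side by side. This is a minimal sketch in Python (the app itself is Swift); the function names, the 0.8 confidence floor, and the per-person sample count are assumptions, and 02_ARCHITECTURE.md §2.9 remains the reference for the actual policy.

```python
from typing import List, Optional, Sequence, Tuple


def refresh_latest(
    stored: Optional[Sequence[float]],
    new: Sequence[float],
    confidence: float,
    min_confidence: float = 0.8,  # assumed floor for "high-confidence"
) -> List[float]:
    """Simple policy: replace the stored voiceprint only with a high-confidence one."""
    if stored is None or confidence >= min_confidence:
        return list(new)
    return list(stored)


def running_average(
    stored: Optional[Sequence[float]],
    new: Sequence[float],
    count: int,
) -> Tuple[List[float], int]:
    """Alternative policy: element-wise mean over all voiceprints seen so far.

    count is how many voiceprints the stored vector already averages.
    Returns the updated vector and the updated count.
    """
    if stored is None or count == 0:
        return list(new), 1
    merged = [(s * count + n) / (count + 1) for s, n in zip(stored, new)]
    return merged, count + 1
```

Refreshing the latest vector needs no extra state, while the running average needs a stored count per person; that difference is one reason to start with the simpler policy.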