From dda4322de70549595e9f6d934e95b125cf9d7ceb Mon Sep 17 00:00:00 2001
From: Grant Gilliam
Date: Tue, 16 Jun 2026 22:09:04 -0500
Subject: [PATCH] Reconcile docs/ specs with the shipped app

Document the dual-channel label-merge path
(mic_file/system_file/self_name/self_vad) and the recap phase
(transcript.md + recap.html via the backend LLM) across docs/01-03;
correct docs/02 $2.10 to the UI actually shipped; mark docs/01 $7 open
items as settled; remove the dead AUDIO_API.md references; note the
manifest sha256 fields are not emitted; mark docs/04 as a
complete/historical build log. Also drop the last stale "Phase 0" UI
string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
---
 ROADMAP.md                            |  5 +-
 Ten31Transcripts/UI/MenuBarView.swift |  2 +-
 docs/01_PROJECT_BRIEF.md              | 81 +++++++++++++++------------
 docs/02_ARCHITECTURE.md               | 29 ++++++++--
 docs/03_DATA_CONTRACTS.md             | 39 +++++++++----
 docs/04_BUILD_PLAN.md                 |  6 ++
 6 files changed, 106 insertions(+), 56 deletions(-)

diff --git a/ROADMAP.md b/ROADMAP.md
index c54c6bd..80453ca 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -27,10 +27,9 @@ Longer-term backlog and deferred decisions. Near-term status + the next few step
 
 ## Quality / debt (from the 2026-06-13 independent eval — full queue + evidence in `EVALUATION.md`)
 - Guard `RecapAnalyzer.mmss()` (`:137`) against NaN/∞ — a malformed backend `duration` aborts the app at recap render (eval P2). Cheap; fold into the next backend change.
-- Rewrite the stale README: it claims "Phase 0 / no audio capture" for a shipped Phase-6 app; the `AppSettings.swift:7` comment and the `README.md:49` skip-TLS "on by default" line are also stale (eval P2).
 - Add `SessionController` state-machine tests (`pendingAutoStop`, visual-adoption generation guard) before refactoring; then extract its saved-session / open-panel UI (eval P2/P3).
-- Reconcile `docs/` specs with reality: the dual-channel API fields (`mic_file`/`system_file`/`self_name`/`self_vad`) and the recap/LLM phase are undocumented; `docs/01` §7 lists already-resolved open items; `docs/02` §2.10 claims absent MenuBarUI features (eval P3).
-- Smaller P3s in `EVALUATION.md`: incomplete AGENTS Layout listings, unwritten `manifest.json` sha256 contract, unused `NSAppleEventsUsageDescription`, unauthenticated LAN backend (consider a bearer token).
+- Optional: sweep the stale "Phase N" references in source comments (e.g. `SparkControlHealth.swift:7` "arrives in Phase 5", `Ten31TranscriptsApp.swift:6` "Phase 0 only") — historical, not false, but dated. `docs/04_BUILD_PLAN.md` is now marked COMPLETE/historical and is the map for these.
+- Smaller P3s in `EVALUATION.md`: incomplete AGENTS Layout listings, unwritten `manifest.json` sha256 contract (now documented as not-emitted in `docs/03` §2), unused `NSAppleEventsUsageDescription`, unauthenticated LAN backend (consider a bearer token).
 
 ## Deferred decisions
 - Cross-device self unification (same person, desktop mic vs phone speakerphone) does not work by voiceprint and is treated as a separate identity; revisit only if a reliable signal emerges (mic-channel-as-self remains the robust path).
diff --git a/Ten31Transcripts/UI/MenuBarView.swift b/Ten31Transcripts/UI/MenuBarView.swift
index fa25c47..2eb5fc9 100644
--- a/Ten31Transcripts/UI/MenuBarView.swift
+++ b/Ten31Transcripts/UI/MenuBarView.swift
@@ -173,7 +173,7 @@ struct MenuBarView: View {
     private var header: some View {
         VStack(alignment: .leading, spacing: 2) {
             Text("Ten31 Transcripts").font(.headline)
-            Text("Phase 0 · setup & status")
+            Text("Setup & status")
                 .font(.caption)
                 .foregroundStyle(.secondary)
         }
diff --git a/docs/01_PROJECT_BRIEF.md b/docs/01_PROJECT_BRIEF.md
index 63554dc..3290f08 100644
--- a/docs/01_PROJECT_BRIEF.md
+++ b/docs/01_PROJECT_BRIEF.md
@@ -7,9 +7,9 @@
 > returns named transcript segments. A growing **voiceprint library** recovers
 > speakers even when the visual cue is missing.
 
-Master context document. Read this first, then `02_ARCHITECTURE.md`,
-`03_DATA_CONTRACTS.md`, `04_BUILD_PLAN.md`. The SparkControl API is now fully
-specified — see `03_DATA_CONTRACTS.md` (and the source `AUDIO_API.md`).
+Master context document. Read this first, then `02_ARCHITECTURE.md` and
+`03_DATA_CONTRACTS.md`. The SparkControl API is fully specified in
+`03_DATA_CONTRACTS.md`.
 
 ---
 
@@ -20,25 +20,30 @@ A lightweight, always-running **menu-bar app on macOS** that:
 1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams,
    or Signal.
 2. **Records two local audio tracks** — system audio (everyone else) and the
-   user's microphone (the user) — and **mixes them to one 16 kHz mono WAV** for
-   the backend.
+   user's microphone (the user). It sends the backend **dual-channel**
+   (`mic_file` + `system_file`) when the system track is healthy, falling back to
+   a **mixed-mono 16 kHz WAV** otherwise.
 3. **Watches the call window** at ~2–4 fps and, per app, reads participant
    **names** and the **active-speaker cue**, producing a
    `(start, end, name, confidence)` **visual timeline** — its best guess at who
    was talking when.
 4. **Discards every video frame after extraction.** No video is ever written to
    disk. Only audio + the derived timeline persist locally.
-5. On call end, **POSTs the mixed audio + the visual timeline (+ the known
-   voiceprint library) to `POST /api/audio/label-merge`** on SparkControl, which
-   returns **named, speaker-attributed transcript segments** and a **voiceprint
-   per speaker**.
+5. On call end, **POSTs the audio + the visual timeline (+ the known voiceprint
+   library) to `POST /api/audio/label-merge`** on SparkControl, which returns
+   **named, speaker-attributed transcript segments** and a **voiceprint per
+   speaker**.
 6. **Persists the returned voiceprints** keyed by name, so the next call can
    pass them as `known_voiceprints` and recover a speaker by voice when the
    visual cue is absent (camera off, a bad OCR frame).
+7. **Renders the result locally** — a readable `transcript.md` plus an HTML
+   `recap.html` (topics + meeting extras, generated via the backend's LLM
+   endpoint), with an in-app editor for fixing speaker names after the fact.
 
-The app's job ends at receiving and storing the named segments from SparkControl.
-**All transcription, diarization, and the name-merge happen on the backend.** Do
-not build transcription, diarization, or the merge vote in this app.
+The app's job ends at producing the named transcript and recap from SparkControl's
+segments. **All transcription, diarization, name-merge, and LLM analysis happen on
+the backend.** Do not build transcription, diarization, or the merge vote in this
+app.
 
 ## 2. Why the visual timeline still matters (the core idea)
 
@@ -68,19 +73,25 @@ few calls the system can name regulars even with cameras off.
 
 **In scope (this app):**
 - Call detection for Meet / Zoom / Teams / Signal.
-- Dual-track local audio capture + mix-to-mono for the backend.
+- Dual-track local audio capture; **dual-channel send** (mic + system) with a
+  mix-to-mono fallback for the backend.
 - Low-fps window capture → OCR (names) + active-speaker cue detection.
 - Per-app "adapter" modules encapsulating each app's UI quirks.
 - Building the visual timeline; **mic-VAD self-labeling** (the mic track is the
   user, so hot-mic spans pre-seed the user's name into the timeline).
 - Chunking long calls (~2–3 min) and calling `label-merge` **sequentially**.
 - A local **voiceprint store** (persist + replay named voiceprints).
-- Storing the backend's named transcript segments locally.
-- A minimal menu-bar UI: status, manual start/stop, recent sessions, adapter
-  toggles, backend host/health, output folder.
+- Storing the backend's named segments and **rendering** them — `transcript.md`
+  plus an HTML `recap.html` (recap analysis via the backend LLM) — with an in-app
+  speaker-name editor.
+- A minimal menu-bar UI: status, manual start/stop, the last session (reveal,
+  resend, open recap, edit speakers), adapter toggles, backend host/health,
+  output folder.
 
 **Out of scope (owned by the backend):**
-- Transcription, diarization, the name-merge vote, summarization/analysis.
+- Transcription, diarization, the name-merge vote, and LLM summarization — these
+  run on the backend; the app only orchestrates the recap call and renders the
+  result.
 
 **Explicitly not doing:** saving video; cloud anything. Everything stays on the
 operator's LAN.
@@ -91,14 +102,14 @@ operator's LAN.
 |---|---|---|
 | Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. |
 | Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
-| Backend audio format | **Mixed-mono 16 kHz WAV** | Diarizer separates speakers from one mixed stream; 16 kHz is ideal. |
+| Backend audio format | **Dual-channel (mic + system)** when the system track is healthy, else **mixed-mono 16 kHz WAV** | Separate tracks let the backend attribute the user's mic channel directly; the diarizer can still split the mono fallback. |
 | Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
 | Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. |
 | Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
 | Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
 | Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). |
 | Long calls | Chunk ~2–3 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. |
-| Transport / TLS | `multipart/form-data`, file field `file`; self-signed Start9 cert (skip verify or trust the Root CA); **no auth on LAN** | Matches every other SparkControl endpoint. |
+| Transport / TLS | `multipart/form-data`, file field `file` (mono) or `mic_file` + `system_file` (dual-channel); self-signed Start9 cert (trust the Root CA — supported default; host-scoped skip-verify is an off-by-default escape hatch); **no auth on LAN** | Matches every other SparkControl endpoint. |
 | Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
 
 ### On forking Hyprnote
@@ -128,25 +139,25 @@ SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
 - **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize +
   name from the visual timeline (+ voiceprint fallback), optionally transcribe,
   in one synchronous call.
+- **LLM (recap):** Qwen3 via OpenAI-compatible `POST /v1/chat/completions` —
+  generates the readable recap (topics + meeting extras) from the transcript.
 - Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.
 
 Full request/response shapes, curl examples, limits, and error formats are in
 `03_DATA_CONTRACTS.md`.
 
-## 7. Remaining open items (small)
+## 7. Settled decisions (were open at brief time)
 
-1. **Base URL — RESOLVED.** A private LAN host — a `.local` mDNS name (preferred
-   over a raw IP, since it survives IP changes) — configured in Settings or via the
-   `SPARK_BACKEND_URL` env var, and never committed. Ship a neutral placeholder as
-   the default; keep it editable in settings. Service-discovery at
-   `GET /api/endpoints`.
-2. **Send trigger** — assume auto-POST on call end; expose a "hold for review"
-   toggle if the user wants to eyeball the timeline first.
-3. **Retention** — keep the session folder after a successful hand-off, or prune
-   audio and keep only `speakers.json` + voiceprints? Default: keep everything,
-   user-configurable.
-4. **Voiceprint update policy** — overwrite vs running-average a person's stored
-   voiceprint across calls (see `02_ARCHITECTURE.md §2.9`). Start simple
-   (store/refresh latest high-confidence), refine later.
-5. **Signing** — stable identity so macOS doesn't re-prompt for permissions on
-   each rebuild.
+1. **Base URL.** A private LAN host — a `.local` mDNS name (preferred over a raw
+   IP, since it survives IP changes) — configured in Settings or via the
+   `SPARK_BACKEND_URL` env var, never committed. A neutral placeholder ships as the
+   default and stays editable in Settings. Service-discovery at `GET /api/endpoints`.
+2. **Send trigger.** Auto-send on call end is a setting (`autoSendOnStop`), **off
+   by default** — the user reviews the session and sends manually unless they opt in.
+3. **Retention.** The session folder is kept after a successful hand-off (output
+   location is configurable); nothing is pruned automatically.
+4. **Voiceprint update policy.** Store/refresh the latest high-confidence vector
+   per name (`02_ARCHITECTURE.md §2.9`); a per-name running average is a possible
+   later refinement.
+5. **Signing.** A stable identity via `Config/Signing.xcconfig` (gitignored) keeps
+   macOS from re-prompting for permissions on each rebuild.
diff --git a/docs/02_ARCHITECTURE.md b/docs/02_ARCHITECTURE.md
index 752aa03..4edebc6 100644
--- a/docs/02_ARCHITECTURE.md
+++ b/docs/02_ARCHITECTURE.md
@@ -64,6 +64,9 @@ pattern, the macOS APIs, and the SparkControl integration (now fully specified).
 └────────────────┘ └────────────────────┘
 ```
 
+(After `speakers.json`, a recap phase renders `transcript.md` + `recap.html` via
+the backend LLM — see §2.11.)
+
 ## 2. Modules
 
 ### 2.1 `CallDetector`
@@ -176,8 +179,10 @@ Write the session folder and, if the call is longer than ~3 min, produce a
 ```
 
 ### 2.7 `SparkControlClient`
-Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`** with
-`file`, `timeline`, `known_voiceprints`, `transcribe=true`.
+Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`**. Sends
+**dual-channel** (`mic_file` + `system_file` + `self_name` + `self_vad`) when the
+system track is healthy, else the **mono** `file`; always with `timeline`,
+`known_voiceprints`, `transcribe=true`.
 - **Sequential only** — one audio request in flight (parallel ⇒ `503 + Retry-After`).
 - **Self-signed TLS** — skip verification (`URLSession` delegate trusting the
   Start9 cert) or trust the Root CA. **No auth on the LAN.**
@@ -210,10 +215,22 @@ Local persistence of named voiceprints — the compounding-identity layer.
 - Editable/clearable from the menu-bar UI (rename, delete a person, reset).
 
 ### 2.10 `MenuBarUI` (SwiftUI, `LSUIElement`)
-Status (idle / detected / recording / uploading), manual start/stop, recent
-sessions (open folder, resend, delete), adapter toggles, **backend host + a
-health check** (`GET /api/status`), output folder, voiceprint manager, and a
-permissions checklist (Screen Recording, Microphone, Accessibility).
+Status (idle / detected / recording / finishing), manual start/stop with live
+mic/system level meters, and the **last session** — reveal in Finder, resend
+("Send to backend"), open recap, and edit speakers — plus "Open saved session…"
+to reprocess an existing folder. Also a **backend host + health check**
+(`GET /api/status`), adapter toggles, output folder, and a permissions checklist
+(Microphone, Screen Recording, Accessibility). (No multi-session list or
+voiceprint-manager UI yet — those are in `ROADMAP.md`.)
+
+### 2.11 Recap (`RecapAnalyzer`, `RecapRenderer`)
+After `speakers.json`, the recap phase turns the named transcript into the
+human-readable deliverables. `RecapAnalyzer` calls the backend LLM
+(`POST /v1/chat/completions`, Qwen3) for topics + meeting extras; `RecapRenderer`
+writes `transcript.md` (one line per diarized utterance) and `recap.html` (+ a
+`recap.json` sidecar). The in-app speaker editor (`SpeakerEditing` /
+`RecapEditModel`) rewrites names across all outputs after the fact. All
+language-model work stays on the backend; the app orchestrates and renders.
 
 ## 3. macOS frameworks & permissions
 
diff --git a/docs/03_DATA_CONTRACTS.md b/docs/03_DATA_CONTRACTS.md
index a2271d0..dbd8715 100644
--- a/docs/03_DATA_CONTRACTS.md
+++ b/docs/03_DATA_CONTRACTS.md
@@ -1,7 +1,7 @@
 # Data Contracts — Ten31 Transcripts
 
 Companion to docs 01/02. Defines the files the app produces/stores and the **real
-SparkControl contract** (source of truth: `AUDIO_API.md`). The `label-merge`
+SparkControl contract** (verified against the live backend). The `label-merge`
 endpoint is the app's primary integration point.
 
 ---
@@ -69,8 +69,10 @@ When chunking, **slice to the chunk window and rebase to chunk-local seconds**
   "app_version": "0.1.0"
 }
 ```
-(`mixed_mono_16k.wav` is the one the backend gets; the separate tracks are kept
-locally — the mic track is the user's known identity / VAD source.)
+(On the dual-channel path the backend gets `mic.wav` + `system.wav` directly; on
+the mono fallback it gets `mixed_mono_16k.wav`. The mic track is the user's known
+identity / VAD source. **Note:** the per-file `sha256` fields above are part of the
+intended contract but are **not currently emitted** by the pipeline.)
 
 ---
 
@@ -83,15 +85,17 @@ locally — the mic track is the user's known identity / VAD source.)
   endpoints in §4–§5 hang off this base. **Make it a setting** so the host can
   change, and ship a neutral placeholder (`https://your-spark-backend.local`) as
   the default.
-- **TLS:** Start9 self-signed Root CA. Either skip verification (`URLSession`
-  delegate trusting the cert; curl `-k`; `rejectUnauthorized:false`) **or** install
-  the Start9 Root CA into the trust store.
+- **TLS:** Start9 self-signed Root CA. Supported path: install the Start9 Root CA
+  into the System keychain (default trust then succeeds). Skip-verification is an
+  **off-by-default, host-scoped** escape hatch (`InsecureTrustDelegate`, scoped to
+  the configured backend host), not the default.
 - **Auth:** **none on the LAN.** No token/key today.
 - **Limits:** **200 MB/request** (`413` over); timeouts ~300 s (transcription),
   ~600 s (diarization). **Send audio requests SEQUENTIALLY** — concurrent audio
   trips a GPU FFT race → `503 + Retry-After`.
-- **Transport:** `multipart/form-data`, audio file field name **`file`** (bytes,
-  not base64/path).
+- **Transport:** `multipart/form-data`. Audio file field is **`file`** on the mono
+  path, or **`mic_file`** + **`system_file`** on the dual-channel path (bytes, not
+  base64/path).
 - **All endpoints are synchronous** (no job IDs / polling).
 - **Errors:** JSON `{"detail": "..."}`; `400` malformed, `413` too large,
   `503 + Retry-After` transient (retry after the interval).
@@ -105,11 +109,16 @@ Diarize + name clusters from the visual timeline (majority temporal overlap),
 with voiceprint fallback, optionally transcribed. Synchronous. **Stateless** —
 the app owns the timeline and the voiceprint library.
 
-**Multipart fields:**
+**Multipart fields** — two audio shapes: **mono** (`file`) or **dual-channel**
+(`mic_file` + `system_file`, preferred when the system track is healthy):
 | field | required | notes |
 |---|---|---|
-| `file` | **yes** | mixed-mono WAV (the chunk, when chunking) |
-| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1) |
+| `file` | mono path | mixed-mono WAV (the chunk, when chunking) |
+| `mic_file` | dual path | the user's mic track (chunk) — attributed to `self_name` |
+| `system_file` | dual path | the remote/system track (chunk) |
+| `self_name` | dual path | the user's name; the mic channel is attributed to them |
+| `self_vad` | no | chunk-local windows where the mic is genuinely the user (active + louder than system) |
+| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1); on the dual path it names only the remote speakers |
 | `known_voiceprints` | no | JSON `{"":[192 floats], ...}` from `VoiceprintStore` |
 | `transcribe` | no | `"true"` to also return per-segment text (default false) |
 | `min_overlap` | no | min fraction of a cluster's time overlapping the winning name (default `0.0`) |
@@ -213,3 +222,11 @@ Loaded → `known_voiceprints` on every `label-merge` call. Updated from respons
 `fingerprints` for `visual`/high-confidence `voiceprint` speakers only. Never
 stores `Unknown_N`. Update policy (`02 §2.9`): start = store latest with
 `overlap_confidence ≥ ~0.8`; consider per-name running mean later.
+
+## 8. Recap outputs (`transcript.md`, `recap.{html,json}`)
+After `speakers.json` is assembled, the recap phase renders the human-readable
+deliverables: a `transcript.md` (one line per diarized utterance) and an HTML
+`recap.html`, backed by a structured `recap.json`. The recap's topic/summary
+content is generated by the **backend LLM** (`POST /v1/chat/completions`, Qwen3);
+the app owns the rendering and the in-app **speaker-name editor**, which can rewrite
+names across `speakers.json`, the transcript, and the recap after the fact.
diff --git a/docs/04_BUILD_PLAN.md b/docs/04_BUILD_PLAN.md
index e1c6faa..a8424fe 100644
--- a/docs/04_BUILD_PLAN.md
+++ b/docs/04_BUILD_PLAN.md
@@ -1,5 +1,11 @@
 # Build Plan — Ten31 Transcripts
 
+> **Status: COMPLETE (historical).** Phases 0–6 shipped and the app is in daily
+> use; a recap phase (transcript + HTML recap via the backend LLM) was added after
+> this plan was written. Kept as the original build log and as the map for the
+> "Phase N" references in the code comments. Forward-looking work lives in
+> `ROADMAP.md`; current status in `AGENTS.md`.
+
 Companion to docs 01–03. Phased plan for the Claude Code session, each phase with
 a demoable milestone. Build in order; the risky/novel work (visual adapters) is
 isolated for independent tuning. The SparkControl contract is now known