Phase 0: menu-bar scaffold, permissions, backend health check

Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings; app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00
commit b2ae3a62b9
19 changed files with 1448 additions and 0 deletions
@@ -0,0 +1,214 @@
+# Data Contracts — Ten31 Transcripts
+
+Companion to docs 01/02. Defines the files the app produces/stores and the **real
+SparkControl contract** (source of truth: `AUDIO_API.md`). The `label-merge`
+endpoint is the app's primary integration point.
+
+---
+
+## 1. `visual_timeline.json` (the app's primary output)
+
+Rich, app-native record of who-was-speaking-when, inferred from the screen.
+Times are **seconds relative to session `t0`** (the shared audio clock).
+
+```jsonc
+{
+  "schema_version": "1.1",
+  "session_id": "2026-06-05T14-03_zoom",
+  "app": "zoom",                       // zoom | teams | signal | meet
+  "adapter_version": "zoom-0.3.1",
+  "t0_unix": 1749135780.123,
+  "duration_sec": 2841.7,
+  "fps_sampled": 3,
+  "self_name": "Grant",                // user's name; segments from mic VAD
+  "participants": [
+    { "name": "Grant", "is_self": true },
+    { "name": "Sarah Jones", "aliases": ["Sarah J"] },
+    { "name": "Dmitri" }
+  ],
+  "segments": [
+    { "start": 0.0,  "end": 4.5, "name": "Grant",       "confidence": 0.97, "source": "mic_vad" },
+    { "start": 4.8,  "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision" }
+  ],
+  "visual_gaps": [
+    { "start": 120.4, "end": 138.9, "reason": "minimized" }   // reason: minimized | tab_switched
+  ]
+}
+```
+- Segments **may overlap** (crosstalk) — do not flatten.
+- `source ∈ {vision, accessibility, fused, mic_vad}`.
+- The user (mic-VAD `is_self`) is pre-seeded so the backend names them and
+  the app can enroll their voiceprint from the first call.
+
+### 1.1 Mapping to the `label-merge` `timeline` field
+`label-merge` wants a **flat JSON array** of `{start, end, name, confidence}`
+(seconds). Build it by taking `segments` and dropping `source`:
+```json
+[{"start":0.0,"end":4.5,"name":"Grant","confidence":0.97},
+ {"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]
+```
+When chunking, **slice to the chunk window and rebase to chunk-local seconds**
+(subtract chunk start) before sending.
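
The slice-and-rebase step in §1.1 can be sketched as below. This is a minimal illustration, not the app's implementation: the function name `chunk_timeline` is hypothetical, and clipping a segment that straddles a chunk boundary (rather than dropping it) is an assumption. Overlapping segments are passed through untouched, per the "do not flatten" rule.

```python
def chunk_timeline(segments, chunk_start, chunk_end):
    """Build the flat label-merge `timeline` for one chunk (chunk-local seconds)."""
    out = []
    for seg in segments:
        # Clip to the chunk window (assumption); skip segments with no overlap.
        start = max(seg["start"], chunk_start)
        end = min(seg["end"], chunk_end)
        if end <= start:
            continue
        out.append({
            # Rebase to chunk-local time by subtracting the chunk start.
            "start": round(start - chunk_start, 3),
            "end": round(end - chunk_start, 3),
            "name": seg["name"],
            "confidence": seg["confidence"],
            # `source` is intentionally dropped.
        })
    return out


segments = [
    {"start": 0.0, "end": 4.5, "name": "Grant", "confidence": 0.97, "source": "mic_vad"},
    {"start": 4.8, "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision"},
    {"start": 148.0, "end": 153.5, "name": "Dmitri", "confidence": 0.9, "source": "vision"},
]
first_chunk = chunk_timeline(segments, 0.0, 150.0)
second_chunk = chunk_timeline(segments, 150.0, 300.0)
```

With these inputs the third segment appears in both chunks: clipped to end at 150.0 in the first, and rebased to start at 0.0 in the second.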
+
+## 2. `manifest.json`
+```jsonc
+{
+  "session_id": "2026-06-05T14-03_zoom",
+  "app": "zoom",
+  "t0_unix": 1749135780.123,
+  "duration_sec": 2841.7,
+  "audio": {
+    "mixed":  { "file": "mixed_mono_16k.wav", "sr": 16000, "channels": 1, "sha256": "..." },
+    "mic":    { "file": "mic.wav",    "sr": 16000, "channels": 1, "sha256": "..." },
+    "system": { "file": "system.wav", "sr": 16000, "channels": 1, "sha256": "..." }
+  },
+  "chunking": { "enabled": true, "chunk_sec": 150, "chunks": 19 },
+  "visual_timeline": "visual_timeline.json",
+  "backend_result": "speakers.json",
+  "adapter_versions": { "zoom": "0.3.1" },
+  "app_version": "0.1.0"
+}
+```
+(`mixed_mono_16k.wav` is the one the backend gets; the separate tracks are kept
+locally — the mic track is the user's known identity / VAD source.)
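
As a rough check on the `chunking` block, the chunk count follows from `duration_sec` and `chunk_sec` if chunks are fixed-length, non-overlapping windows with a shorter final chunk. That layout is an assumption (the manifest does not say how chunks are cut), and `chunk_windows` is a hypothetical helper.

```python
import math


def chunk_windows(duration_sec, chunk_sec):
    """Return (start, end) windows in session seconds; the last one may be short."""
    count = math.ceil(duration_sec / chunk_sec)
    return [
        (i * chunk_sec, min((i + 1) * chunk_sec, duration_sec))
        for i in range(count)
    ]


windows = chunk_windows(2841.7, 150)
```

For the example manifest this gives 19 windows, matching `"chunks": 19`, with the last one running from 2700 to 2841.7.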
+
+---
+
+## 3. SparkControl — connection (real)
+
+- **Base URL (confirmed):** `https://your-spark-backend.local:62419`. Ship this
+  `.local` form as the default (it survives IP changes), but **make it a
+  setting** so the host can change. Service-discovery JSON is at
+  `GET /api/endpoints` (returns current vLLM / Parakeet / Kokoro URLs). All audio
+  endpoints in §4–§5 hang off this base.
+- **TLS:** Start9 self-signed Root CA. Either skip verification (`URLSession`
+  delegate trusting the cert; curl `-k`; `rejectUnauthorized:false`) **or** install
+  the Start9 Root CA into the trust store.
+- **Auth:** **none on the LAN.** No token/key today.
+- **Limits:** **200 MB/request** (`413` over); timeouts ~300 s (transcription),
+  ~600 s (diarization). **Send audio requests SEQUENTIALLY** — concurrent audio
+  trips a GPU FFT race → `503 + Retry-After`.
+- **Transport:** `multipart/form-data`, audio file field name **`file`** (bytes,
+  not base64/path).
+- **All endpoints are synchronous** (no job IDs / polling).
+- **Errors:** JSON `{"detail": "..."}`; `400` malformed, `413` too large, `503 +
+  Retry-After` transient (retry after the interval).
+- **Health/discovery:** `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.
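
The sequential-send and `503 + Retry-After` rules above could be handled along these lines. This is a sketch under stated assumptions: `send` and `send_chunk` are hypothetical callables returning `(status, headers, body)`, `Retry-After` is assumed to carry a number of seconds, and the attempt limit is arbitrary.

```python
import time


def send_with_retry(send, max_attempts=4, sleep=time.sleep):
    """Call send() until it stops returning 503, waiting Retry-After between tries."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503:
            return status, body
        if attempt == max_attempts:
            break
        # Assumption: Retry-After is delta-seconds; fall back to 1 s otherwise.
        try:
            delay = float(headers.get("Retry-After", 1))
        except ValueError:
            delay = 1.0
        sleep(delay)
    return status, body


def process_chunks(chunks, send_chunk, sleep=time.sleep):
    """Send chunks strictly one at a time, in order (never concurrently)."""
    results = []
    for chunk in chunks:
        results.append(send_with_retry(lambda: send_chunk(chunk), sleep=sleep))
    return results
```

Injecting `sleep` keeps the retry logic testable without real delays; a `400` or `413` is returned to the caller immediately since retrying would not help.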
+
+---
+
+## 4. ★ `POST /api/audio/label-merge` — the app's primary call
+
+Diarize + name clusters from the visual timeline (majority temporal overlap),
+with voiceprint fallback, optionally transcribed. Synchronous. **Stateless** —
+the app owns the timeline and the voiceprint library.
+
+**Multipart fields:**
+| field | required | notes |
+|---|---|---|
+| `file` | **yes** | mixed-mono WAV (the chunk, when chunking) |
+| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1) |
+| `known_voiceprints` | no | JSON `{"<name>":[192 floats], ...}` from `VoiceprintStore` |
+| `transcribe` | no | `"true"` to also return per-segment text (default false) |
+| `min_overlap` | no | min fraction of a cluster's time overlapping the winning name (default `0.0`) |
+| `voiceprint_threshold` | no | cosine sim to accept a voiceprint match (default `0.5`) |
+
+```bash
+curl -k -X POST "https://<host>/api/audio/label-merge" \
+  -F "file=@chunk_003.wav" \
+  -F 'timeline=[{"start":0,"end":4.5,"name":"Grant","confidence":0.97},{"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]' \
+  -F 'known_voiceprints={"Grant":[/*192*/],"Sarah Jones":[/*192*/]}' \
+  -F "transcribe=true"
+```
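
For a client that builds the request programmatically, the non-file form fields from the table above might be assembled like this. The helper name `label_merge_fields` is hypothetical; the sketch only encodes what the table states (JSON-encoded `timeline` and `known_voiceprints`, the string `"true"` for `transcribe`) and omits optional fields that were not set so the backend defaults apply.

```python
import json


def label_merge_fields(timeline, voiceprints=None, transcribe=False,
                       min_overlap=None, voiceprint_threshold=None):
    """Return the multipart text fields (everything except `file`) as strings."""
    fields = {"timeline": json.dumps(timeline)}
    if voiceprints:
        fields["known_voiceprints"] = json.dumps(voiceprints)
    if transcribe:
        fields["transcribe"] = "true"
    if min_overlap is not None:
        fields["min_overlap"] = str(min_overlap)
    if voiceprint_threshold is not None:
        fields["voiceprint_threshold"] = str(voiceprint_threshold)
    return fields


fields = label_merge_fields(
    [{"start": 0.0, "end": 4.5, "name": "Grant", "confidence": 0.97}],
    transcribe=True,
)
```

With the `requests` library, these would go in `data=fields` alongside `files={"file": ...}`, with `verify=False` as the equivalent of curl's `-k`.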
+
+**Response (`transcribe=true`):**
+```jsonc
+{
+  "duration": 9.259,
+  "speakers": [
+    { "cluster": "Speaker_0", "name": "Grant",       "source": "visual",
+      "overlap_confidence": 0.9866, "fingerprint": [/*192 floats*/] },
+    { "cluster": "Speaker_1", "name": "Sarah Jones", "source": "voiceprint",
+      "match_similarity": 0.71,     "fingerprint": [/*192 floats*/] }
+  ],
+  "segments": [
+    { "start_ms": 0,    "end_ms": 4480, "speaker": "Grant",
+      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
+    { "start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones",
+      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
+  ],
+  "fingerprints": { "Grant": [/*192*/], "Sarah Jones": [/*192*/] },
+  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
+              "embedding": "nvidia/speakerverification_en_titanet_large" }
+}
+```
+- `transcribe=false` → segments are `{start_s, end_s, speaker, confidence}` (seconds, not milliseconds; no text).
+- **Resolution order per cluster:** `visual` → `voiceprint` (with
+  `match_similarity`) → `Unknown_N` (`source:"unmatched"`, never mislabeled).
+- **Persist `fingerprints`** keyed by name into `VoiceprintStore` (skip `Unknown_N`).
+- **Diarizer caps at 4 speakers/chunk** and takes **no `num_speakers` hint** — chunk
+  for larger/longer calls; names + voiceprints unify speakers across chunks.
+
+---
+
+## 5. Other SparkControl endpoints (reference / fallback)
+
+### 5.1 `POST /v1/audio/transcriptions` (OpenAI-compatible, sync)
+Plain STT. Fields: `file` (req), `model` (default `parakeet-tdt-0.6b-v3`),
+`response_format` (`json` | `verbose_json` | `text`), `language`, `temperature`,
+`prompt`. `verbose_json` returns word- **and** segment-level timestamps + full
+`text`. Use only if you ever want a transcript without speaker labels.
+
+### 5.2 `POST /api/audio/diarize-chunk` (sync)
+Field: `file`. Returns anonymous clusters `{start_s, end_s, speaker, confidence}`,
+`speakers_detected`, and a 192-dim `fingerprints` map per local speaker. Use if
+you ever want to do the merge client-side instead of via `label-merge` (not the
+default path).
+
+### 5.3 `POST /api/audio/transcribe-with-speakers` (sync)
+Field: `file`. ASR + diarization merged into anonymous speaker-attributed blocks
+`{start_ms, end_ms, speaker, text}`. (Anonymous — no naming. `label-merge` is the
+named version.)
+
+---
+
+## 6. `speakers.json` — final stored output (target end state)
+
+Per-chunk `label-merge` results concatenated, **timestamps offset back to session
+(global) seconds**, names unified across chunks. This is the hand-off to the
+user's downstream summarizer; the app stops here.
+
+```jsonc
+{
+  "session_id": "2026-06-05T14-03_zoom",
+  "app": "zoom",
+  "duration_sec": 2841.7,
+  "speakers": [
+    { "name": "Grant",       "source": "visual",     "overlap_confidence": 0.99 },
+    { "name": "Sarah Jones", "source": "voiceprint", "match_similarity": 0.71 },
+    { "name": "Unknown_0",   "source": "unmatched" }
+  ],
+  "segments": [
+    { "start": 0.0,   "end": 4.48,  "speaker": "Grant",       "text": "..." },
+    { "start": 4.80,  "end": 9.04,  "speaker": "Sarah Jones", "text": "..." },
+    { "start": 152.3, "end": 158.1, "speaker": "Unknown_0",   "text": "..." }
+  ],
+  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
+              "embedding": "nvidia/speakerverification_en_titanet_large",
+              "transcription": "parakeet-tdt-0.6b-v3" }
+}
+```
+(Convert backend `start_ms`/`end_ms` → seconds; add the chunk offset.)
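
The conversion noted above (milliseconds to seconds, plus the chunk offset) can be sketched as follows. `to_session_segments` is a hypothetical helper; it assumes each chunk's start time in session seconds is known and that responses came from `transcribe=true` calls, so segments carry `start_ms`/`end_ms`.

```python
def to_session_segments(chunk_results):
    """chunk_results: list of (chunk_start_sec, label_merge_response) pairs."""
    merged = []
    for chunk_start, response in chunk_results:
        for seg in response["segments"]:
            merged.append({
                # Backend times are chunk-local milliseconds.
                "start": round(chunk_start + seg["start_ms"] / 1000, 3),
                "end": round(chunk_start + seg["end_ms"] / 1000, 3),
                "speaker": seg["speaker"],
                "text": seg.get("text", ""),
            })
    # Keep the combined list in session order.
    merged.sort(key=lambda s: s["start"])
    return merged


chunk_results = [
    (0.0, {"segments": [
        {"start_ms": 0, "end_ms": 4480, "speaker": "Grant", "text": "..."},
        {"start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones", "text": "..."},
    ]}),
    (150.0, {"segments": [
        {"start_ms": 2300, "end_ms": 8100, "speaker": "Unknown_0", "text": "..."},
    ]}),
]
session_segments = to_session_segments(chunk_results)
```

Here the segment at 2300 to 8100 ms in the chunk starting at 150 s lands at 152.3 to 158.1 s, matching the example above.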
+
+## 7. `voiceprints.json` — local voiceprint library
+```jsonc
+{
+  "Grant":       { "vector": [/*192 floats*/], "updated": "2026-06-05T14:51Z", "calls": 12 },
+  "Sarah Jones": { "vector": [/*192 floats*/], "updated": "2026-06-02T09:10Z", "calls": 3 }
+}
+```
+Loaded → `known_voiceprints` on every `label-merge` call. Updated from response
+`fingerprints` for `visual`/high-confidence `voiceprint` speakers only. Never
+stores `Unknown_N`. Update policy (`02 §2.9`): start by storing the latest
+fingerprint when `overlap_confidence ≥ ~0.8`; consider a per-name running mean later.
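
A minimal sketch of the load and update steps, assuming the store shape shown above. Both function names are hypothetical. Only the starting policy is implemented (`visual` speakers at or above the threshold); the high-confidence `voiceprint` case is left out because its threshold is not specified here, and incrementing `calls` on each update is a guess at that field's meaning.

```python
def known_voiceprints(store):
    """Strip metadata: label-merge wants {"<name>": [192 floats]} only."""
    return {name: entry["vector"] for name, entry in store.items()}


def update_store(store, response, now_iso, min_overlap_confidence=0.8):
    """Store the latest fingerprint for confidently visual-matched speakers."""
    for speaker in response["speakers"]:
        name = speaker["name"]
        if name.startswith("Unknown_") or speaker.get("source") != "visual":
            continue  # never store unmatched clusters
        if speaker.get("overlap_confidence", 0.0) < min_overlap_confidence:
            continue
        previous = store.get(name, {})
        store[name] = {
            "vector": speaker["fingerprint"],
            "updated": now_iso,
            "calls": previous.get("calls", 0) + 1,  # assumption, see lead-in
        }
    return store
```

Passing the timestamp in as `now_iso` rather than reading the clock inside the function keeps the update deterministic and easy to test.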