Phase 0: menu-bar scaffold, permissions, backend health check
Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
@@ -0,0 +1,214 @@
# Data Contracts — Ten31 Transcripts

Companion to docs 01/02. Defines the files the app produces/stores and the **real
SparkControl contract** (source of truth: `AUDIO_API.md`). The `label-merge`
endpoint is the app's primary integration point.

---

## 1. `visual_timeline.json` (the app's primary output)

Rich, app-native record of who-was-speaking-when, inferred from the screen.
Times are **seconds relative to session `t0`** (the shared audio clock).

```jsonc
{
  "schema_version": "1.1",
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom", // zoom | teams | signal | meet
  "adapter_version": "zoom-0.3.1",
  "t0_unix": 1749135780.123,
  "duration_sec": 2841.7,
  "fps_sampled": 3,
  "self_name": "Grant", // user's name; segments from mic VAD
  "participants": [
    { "name": "Grant", "is_self": true },
    { "name": "Sarah Jones", "aliases": ["Sarah J"] },
    { "name": "Dmitri" }
  ],
  "segments": [
    { "start": 0.0, "end": 4.5, "name": "Grant", "confidence": 0.97, "source": "mic_vad" },
    { "start": 4.8, "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision" }
  ],
  "visual_gaps": [
    { "start": 120.4, "end": 138.9, "reason": "minimized" } // reason: minimized | tab_switched
  ]
}
```
- Segments **may overlap** (crosstalk); do not flatten them.
- `source ∈ {vision, accessibility, fused, mic_vad}`.
- The user (`is_self`, detected via mic VAD) is pre-seeded so that, from the
  first call, the backend can name them and return a fingerprint the app can
  enroll as their voiceprint.

### 1.1 Mapping to the `label-merge` `timeline` field
`label-merge` wants a **flat JSON array** of `{start, end, name, confidence}`
(seconds). Build it by taking `segments` and dropping `source`:
```json
[{"start":0.0,"end":4.5,"name":"Grant","confidence":0.97},
 {"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]
```
When chunking, **slice to the chunk window and rebase to chunk-local seconds**
(subtract chunk start) before sending.

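
The mapping and rebasing described above can be sketched as follows. This is a minimal illustration in Python (the app itself is Swift), not the app's actual code. The function name `to_label_merge_timeline` is hypothetical, and clipping a segment that straddles a chunk boundary is an assumed reading of "slice to the chunk window".

```python
def to_label_merge_timeline(segments, chunk_start, chunk_end):
    """Build the flat `timeline` array for one chunk.

    `segments` use session-relative seconds (visual_timeline.json).
    The result uses chunk-local seconds and drops `source`.
    Assumption: segments crossing a chunk edge are clipped to the window.
    """
    timeline = []
    for seg in segments:
        start = max(seg["start"], chunk_start)
        end = min(seg["end"], chunk_end)
        if end <= start:
            continue  # segment lies entirely outside this chunk
        timeline.append({
            "start": round(start - chunk_start, 3),  # rebase to chunk-local
            "end": round(end - chunk_start, 3),
            "name": seg["name"],
            "confidence": seg["confidence"],
        })
    return timeline


session_segments = [
    {"start": 0.0, "end": 4.5, "name": "Grant",
     "confidence": 0.97, "source": "mic_vad"},
    {"start": 148.0, "end": 153.0, "name": "Sarah Jones",
     "confidence": 0.82, "source": "vision"},
]
# Second 150 s chunk: only the tail of Sarah's segment falls inside it.
second_chunk = to_label_merge_timeline(session_segments, 150.0, 300.0)
```

Overlapping segments are passed through untouched, which keeps crosstalk intact as §1 requires.
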
## 2. `manifest.json`
```jsonc
{
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",
  "t0_unix": 1749135780.123,
  "duration_sec": 2841.7,
  "audio": {
    "mixed": { "file": "mixed_mono_16k.wav", "sr": 16000, "channels": 1, "sha256": "..." },
    "mic": { "file": "mic.wav", "sr": 16000, "channels": 1, "sha256": "..." },
    "system": { "file": "system.wav", "sr": 16000, "channels": 1, "sha256": "..." }
  },
  "chunking": { "enabled": true, "chunk_sec": 150, "chunks": 19 },
  "visual_timeline": "visual_timeline.json",
  "backend_result": "speakers.json",
  "adapter_versions": { "zoom": "0.3.1" },
  "app_version": "0.1.0"
}
```
Only `mixed_mono_16k.wav` is sent to the backend. The separate tracks are kept
locally; the mic track is the source for the user's known identity and for VAD.

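
Two manifest values are derived rather than recorded: the `chunks` count and each `sha256`. The sketch below (Python for brevity; the app is Swift) shows one way to compute them. Both helper names are hypothetical. It assumes `chunks` is the duration divided by `chunk_sec`, rounded up, which matches the example above (2841.7 s at 150 s per chunk gives 19), and that `sha256` is the lowercase hex digest of the whole file.

```python
import hashlib
import math


def chunk_count(duration_sec, chunk_sec):
    """Number of fixed-length chunks needed to cover the session.

    Assumption: the last chunk may be shorter than `chunk_sec`.
    """
    return max(1, math.ceil(duration_sec / chunk_sec))


def sha256_hex(byte_chunks):
    """Hex SHA-256 over an iterable of byte blocks (streamed, not loaded whole)."""
    digest = hashlib.sha256()
    for block in byte_chunks:
        digest.update(block)
    return digest.hexdigest()


def read_blocks(path, block_size=1 << 20):
    """Yield a file's bytes in 1 MiB blocks so large WAVs stay out of memory."""
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                return
            yield block


chunks = chunk_count(2841.7, 150)
# For a real file: sha256_hex(read_blocks("mixed_mono_16k.wav"))
```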

---

## 3. SparkControl — connection (real)

- **Base URL (confirmed):** `https://your-spark-backend.local:62419`. The `.local`
  hostname survives IP changes, so **prefer it as the default** over a raw IP
  address. Service-discovery JSON is at `GET /api/endpoints` (returns current
  vLLM / Parakeet / Kokoro URLs). All audio endpoints in §4–§5 hang off this
  base. Still **make it a setting** so the host can change.
- **TLS:** Start9 self-signed Root CA. Either skip verification (`URLSession`
  delegate trusting the cert; curl `-k`; `rejectUnauthorized:false`) **or** install
  the Start9 Root CA into the trust store.
- **Auth:** **none on the LAN.** No token/key today.
- **Limits:** **200 MB/request** (`413` over); timeouts ~300 s (transcription),
  ~600 s (diarization). **Send audio requests SEQUENTIALLY**: concurrent audio
  requests trip a GPU FFT race and return `503 + Retry-After`.
- **Transport:** `multipart/form-data`, audio file field name **`file`** (bytes,
  not base64/path).
- **All endpoints are synchronous** (no job IDs / polling).
- **Errors:** JSON `{"detail": "..."}`; `400` malformed, `413` too large, `503 +
  Retry-After` transient (retry after the interval).
- **Health/discovery:** `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.

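
The sequencing and retry rules above can be sketched as a small loop. This is an illustration in Python (the app is Swift), under stated assumptions: `send` is a hypothetical callable that performs one HTTP request and returns `(status, headers, body)`; `Retry-After` is assumed to carry a number of seconds; and the 200 MB limit is approximated as 200,000,000 bytes of audio, ignoring multipart overhead. Injecting `send` and `sleep` keeps the loop testable without a network.

```python
import time

MAX_AUDIO_BYTES = 200 * 1000 * 1000  # assumption: "200 MB" read as decimal MB


def send_sequentially(chunks, send, max_retries=3, sleep=time.sleep):
    """Send audio chunks one at a time, retrying transient 503 responses.

    `send(chunk)` is a hypothetical transport hook returning
    (status_code, headers_dict, parsed_json_body).
    """
    results = []
    for chunk in chunks:
        if len(chunk) > MAX_AUDIO_BYTES:
            raise ValueError("chunk exceeds the 200 MB request limit (would be 413)")
        for attempt in range(max_retries + 1):
            status, headers, body = send(chunk)
            if status == 503 and attempt < max_retries:
                # Transient: wait for the advertised interval, then retry.
                sleep(float(headers.get("Retry-After", 1)))
                continue
            if status != 200:
                detail = body.get("detail") if isinstance(body, dict) else body
                raise RuntimeError(f"HTTP {status}: {detail}")
            results.append(body)
            break
    return results
```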

---

## 4. ★ `POST /api/audio/label-merge` — the app's primary call

Diarize + name clusters from the visual timeline (majority temporal overlap),
with voiceprint fallback, optionally transcribed. Synchronous. **Stateless**:
the app owns the timeline and the voiceprint library.

**Multipart fields:**
| field | required | notes |
|---|---|---|
| `file` | **yes** | mixed-mono WAV (the chunk, when chunking) |
| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1) |
| `known_voiceprints` | no | JSON `{"<name>":[192 floats], ...}` from `VoiceprintStore` |
| `transcribe` | no | `"true"` to also return per-segment text (default false) |
| `min_overlap` | no | min fraction of a cluster's time overlapping the winning name (default `0.0`) |
| `voiceprint_threshold` | no | cosine sim to accept a voiceprint match (default `0.5`) |

```bash
# <host> is the SparkControl host:port from §3; -k skips TLS verification (self-signed cert).
# [/*192*/] is a placeholder for 192 floats, not literal JSON.
curl -k -X POST https://<host>/api/audio/label-merge \
  -F "file=@chunk_003.wav" \
  -F 'timeline=[{"start":0,"end":4.5,"name":"Grant","confidence":0.97},{"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]' \
  -F 'known_voiceprints={"Grant":[/*192*/],"Sarah Jones":[/*192*/]}' \
  -F "transcribe=true"
```

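
The non-file form fields from the table can also be assembled programmatically. The sketch below is in Python for brevity (the app is Swift); `build_label_merge_fields` is a hypothetical helper, and sending the numeric options as plain decimal strings is an assumption. The result would be handed to whatever multipart encoder the client uses, alongside the `file` part.

```python
import json

EMBEDDING_DIM = 192  # voiceprint length stated in the field table


def build_label_merge_fields(timeline, known_voiceprints=None, transcribe=False,
                             min_overlap=None, voiceprint_threshold=None):
    """Return the text form fields for label-merge as a dict of strings."""
    fields = {"timeline": json.dumps(timeline)}
    if known_voiceprints:
        for name, vector in known_voiceprints.items():
            if len(vector) != EMBEDDING_DIM:
                raise ValueError(f"voiceprint for {name!r} is not {EMBEDDING_DIM}-dim")
        fields["known_voiceprints"] = json.dumps(known_voiceprints)
    if transcribe:
        fields["transcribe"] = "true"  # the table specifies the string "true"
    # Optional tuning knobs: omitted so the backend defaults apply.
    if min_overlap is not None:
        fields["min_overlap"] = str(min_overlap)
    if voiceprint_threshold is not None:
        fields["voiceprint_threshold"] = str(voiceprint_threshold)
    return fields


fields = build_label_merge_fields(
    [{"start": 0, "end": 4.5, "name": "Grant", "confidence": 0.97}],
    known_voiceprints={"Grant": [0.0] * EMBEDDING_DIM},
    transcribe=True,
)
```
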
**Response (`transcribe=true`):**
```jsonc
{
  "duration": 9.259,
  "speakers": [
    { "cluster": "Speaker_0", "name": "Grant", "source": "visual",
      "overlap_confidence": 0.9866, "fingerprint": [/*192 floats*/] },
    { "cluster": "Speaker_1", "name": "Sarah Jones", "source": "voiceprint",
      "match_similarity": 0.71, "fingerprint": [/*192 floats*/] }
  ],
  "segments": [
    { "start_ms": 0, "end_ms": 4480, "speaker": "Grant",
      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones",
      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "fingerprints": { "Grant": [/*192*/], "Sarah Jones": [/*192*/] },
  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
              "embedding": "nvidia/speakerverification_en_titanet_large" }
}
```
- `transcribe=false` → segments are `{start_s, end_s, speaker, confidence}` (no
  text). Note the units: seconds here, milliseconds when `transcribe=true`.
- **Resolution order per cluster:** `visual` → `voiceprint` (with
  `match_similarity`) → `Unknown_N` (`source:"unmatched"`, never mislabeled).
- **Persist `fingerprints`** keyed by name into `VoiceprintStore` (skip `Unknown_N`).
- **Diarizer caps at 4 speakers/chunk** and takes **no `num_speakers` hint**, so
  chunk larger or longer calls; names + voiceprints unify speakers across chunks.


---

## 5. Other SparkControl endpoints (reference / fallback)

### 5.1 `POST /v1/audio/transcriptions` (OpenAI-compatible, sync)
Plain STT. Fields: `file` (req), `model` (default `parakeet-tdt-0.6b-v3`),
`response_format` (`json` | `verbose_json` | `text`), `language`, `temperature`,
`prompt`. `verbose_json` returns word- **and** segment-level timestamps + full
`text`. Use only if you want a transcript without speaker labels.

### 5.2 `POST /api/audio/diarize-chunk` (sync)
Field: `file`. Returns anonymous clusters `{start_s, end_s, speaker, confidence}`,
`speakers_detected`, and a 192-dim `fingerprints` map per local speaker. Use if
you ever want to do the merge client-side instead of via `label-merge` (not the
default path).

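
If the merge were done client-side, naming clusters by majority temporal overlap could look roughly like the sketch below. It is written in Python for brevity (the app is Swift) and reflects one reading of "majority temporal overlap" and of `min_overlap` from §4; it is not the backend's actual algorithm. `name_clusters` is a hypothetical helper, and the voiceprint fallback is left out.

```python
from collections import defaultdict


def overlap_sec(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def name_clusters(diar_segments, timeline, min_overlap=0.0):
    """Map each anonymous cluster to the timeline name it overlaps most.

    Returns {cluster: name or None}. A cluster stays None when no timeline
    entry overlaps it, or when the winner covers less than `min_overlap`
    of the cluster's total speaking time.
    """
    total = defaultdict(float)                         # cluster -> seconds spoken
    per_name = defaultdict(lambda: defaultdict(float))  # cluster -> name -> seconds
    for seg in diar_segments:
        total[seg["speaker"]] += seg["end_s"] - seg["start_s"]
        for entry in timeline:
            shared = overlap_sec(seg["start_s"], seg["end_s"],
                                 entry["start"], entry["end"])
            if shared > 0:
                per_name[seg["speaker"]][entry["name"]] += shared

    names = {}
    for cluster, spoken in total.items():
        names[cluster] = None
        candidates = per_name.get(cluster)
        if candidates and spoken > 0:
            winner, shared = max(candidates.items(), key=lambda item: item[1])
            if shared / spoken >= min_overlap:
                names[cluster] = winner
    return names
```

Clusters left as `None` are the ones that would go on to voiceprint matching and, failing that, be reported as `Unknown_N`.
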
### 5.3 `POST /api/audio/transcribe-with-speakers` (sync)
Field: `file`. ASR + diarization merged into anonymous speaker-attributed blocks
`{start_ms, end_ms, speaker, text}`. (Anonymous: no naming. `label-merge` is the
named version.)

---

## 6. `speakers.json` — final stored output (target end state)

Per-chunk `label-merge` results concatenated, **timestamps offset back to session
(global) seconds**, names unified across chunks. This is the hand-off to the
user's downstream summarizer; the app stops here.

```jsonc
{
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",
  "duration_sec": 2841.7,
  "speakers": [
    { "name": "Grant", "source": "visual", "overlap_confidence": 0.99 },
    { "name": "Sarah Jones", "source": "voiceprint", "match_similarity": 0.71 },
    { "name": "Unknown_0", "source": "unmatched" }
  ],
  "segments": [
    { "start": 0.0, "end": 4.48, "speaker": "Grant", "text": "..." },
    { "start": 4.80, "end": 9.04, "speaker": "Sarah Jones", "text": "..." },
    { "start": 152.3, "end": 158.1, "speaker": "Unknown_0", "text": "..." }
  ],
  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
              "embedding": "nvidia/speakerverification_en_titanet_large",
              "transcription": "parakeet-tdt-0.6b-v3" }
}
```
Convert the backend's `start_ms`/`end_ms` to seconds (with `transcribe=false`
the segments already carry `start_s`/`end_s`), then add the chunk's start offset.

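
The conversion back to session seconds can be sketched as below, in Python for brevity (the app is Swift). `to_global_segments` is a hypothetical helper. It assumes fixed-length, non-overlapping chunks, so chunk `i` starts at `i * chunk_sec`, and it does not attempt to reconcile `Unknown_N` labels across chunks.

```python
def to_global_segments(chunk_results, chunk_sec):
    """Concatenate per-chunk label-merge segments in session (global) seconds.

    `chunk_results` is the list of label-merge responses in chunk order.
    Assumption: chunk i covers [i * chunk_sec, (i + 1) * chunk_sec).
    """
    merged = []
    for index, result in enumerate(chunk_results):
        offset = index * chunk_sec
        for seg in result["segments"]:
            if "start_ms" in seg:   # transcribe=true: milliseconds
                start, end = seg["start_ms"] / 1000.0, seg["end_ms"] / 1000.0
            else:                   # transcribe=false: already seconds
                start, end = seg["start_s"], seg["end_s"]
            item = {
                "start": round(offset + start, 3),
                "end": round(offset + end, 3),
                "speaker": seg["speaker"],
            }
            if "text" in seg:
                item["text"] = seg["text"]
            merged.append(item)
    merged.sort(key=lambda s: s["start"])
    return merged


chunk_results = [
    {"segments": [
        {"start_ms": 0, "end_ms": 4480, "speaker": "Grant", "text": "..."},
        {"start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones", "text": "..."},
    ]},
    {"segments": [
        {"start_ms": 2300, "end_ms": 8100, "speaker": "Unknown_0", "text": "..."},
    ]},
]
session_segments = to_global_segments(chunk_results, chunk_sec=150)
```

With 150 s chunks, the `Unknown_0` segment at 2.3 s into the second chunk lands at 152.3 s, in line with the `speakers.json` example above.
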
## 7. `voiceprints.json` — local voiceprint library
```jsonc
{
  "Grant": { "vector": [/*192 floats*/], "updated": "2026-06-05T14:51Z", "calls": 12 },
  "Sarah Jones": { "vector": [/*192 floats*/], "updated": "2026-06-02T09:10Z", "calls": 3 }
}
```
Loaded and sent as `known_voiceprints` on every `label-merge` call. Updated from
the response `fingerprints`, for `visual` and high-confidence `voiceprint`
speakers only. Never stores `Unknown_N`. Update policy (`02 §2.9`): to start,
store the latest fingerprint when `overlap_confidence ≥ ~0.8`; consider a
per-name running mean later.

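
The "store latest" starting policy can be sketched as follows, in Python for brevity (the app is Swift). `update_voiceprints` is a hypothetical helper. The 0.8 cut-off for `voiceprint` matches is an assumption, since the text only says "high-confidence", and so is incrementing `calls` once per update.

```python
def update_voiceprints(store, response, now_iso,
                       min_overlap_conf=0.8, min_match_sim=0.8):
    """Apply the 'store latest' policy to a voiceprints.json-shaped dict.

    `response` is one label-merge response. `min_match_sim` is an assumed
    threshold; the source only specifies ~0.8 for `overlap_confidence`.
    """
    for speaker in response["speakers"]:
        name, source = speaker["name"], speaker["source"]
        if source == "unmatched" or name.startswith("Unknown_"):
            continue  # never store Unknown_N
        if source == "visual":
            trusted = speaker.get("overlap_confidence", 0.0) >= min_overlap_conf
        elif source == "voiceprint":
            trusted = speaker.get("match_similarity", 0.0) >= min_match_sim
        else:
            trusted = False
        vector = response.get("fingerprints", {}).get(name)
        if not trusted or vector is None:
            continue
        previous_calls = store.get(name, {}).get("calls", 0)
        store[name] = {
            "vector": list(vector),       # latest fingerprint replaces the old one
            "updated": now_iso,
            "calls": previous_calls + 1,  # assumption: one increment per update
        }
    return store
```

Under these thresholds, the §4 example response would update Grant (`visual`, 0.9866) but leave Sarah Jones's stored vector untouched (`voiceprint`, 0.71).
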