# Data Contracts: Ten31 Transcripts

Companion to docs 01/02. Defines the files the app produces/stores and the **real SparkControl contract** (verified against the live backend). The `label-merge` endpoint is the app's primary integration point.

---

## 1. `visual_timeline.json` (the app's primary output)

Rich, app-native record of who-was-speaking-when, inferred from the screen. Times are **seconds relative to session `t0`** (the shared audio clock).

```jsonc
{
  "schema_version": "1.1",
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",                          // zoom | teams | signal | meet
  "adapter_version": "zoom-0.3.1",
  "t0_unix": 1749135780.123,
  "duration_sec": 2841.7,
  "fps_sampled": 3,
  "self_name": "Grant",                   // user's name; segments from mic VAD
  "participants": [
    { "name": "Grant", "is_self": true },
    { "name": "Sarah Jones", "aliases": ["Sarah J"] },
    { "name": "Dmitri" }
  ],
  "segments": [
    { "start": 0.0, "end": 4.5, "name": "Grant", "confidence": 0.97, "source": "mic_vad" },
    { "start": 4.8, "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision" }
  ],
  "visual_gaps": [
    { "start": 120.4, "end": 138.9, "reason": "minimized" }   // reason: minimized | tab_switched
  ]
}
```

- Segments **may overlap** (crosstalk); do not flatten.
- `source ∈ {vision, accessibility, fused, mic_vad}`.
- The user (mic-VAD `is_self`) is pre-seeded so the backend names them and enrolls their voiceprint on call one.

### 1.1 Mapping to the `label-merge` `timeline` field

`label-merge` wants a **flat JSON array** of `{start, end, name, confidence}` (seconds). Build it by taking `segments` and dropping `source`:

```json
[{"start":0.0,"end":4.5,"name":"Grant","confidence":0.97},
 {"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]
```

When chunking, **slice to the chunk window and rebase to chunk-local seconds** (subtract chunk start) before sending.

## 2. `manifest.json`

```jsonc
{
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",
  "t0_unix": 1749135780.123,
  "duration_sec": 2841.7,
  "audio": {
    "mixed":  { "file": "mixed_mono_16k.wav", "sr": 16000, "channels": 1, "sha256": "..." },
    "mic":    { "file": "mic.wav",            "sr": 16000, "channels": 1, "sha256": "..." },
    "system": { "file": "system.wav",         "sr": 16000, "channels": 1, "sha256": "..." }
  },
  "chunking": { "enabled": true, "chunk_sec": 150, "chunks": 19 },
  "visual_timeline": "visual_timeline.json",
  "backend_result": "speakers.json",
  "adapter_versions": { "zoom": "0.3.1" },
  "app_version": "0.1.0"
}
```

(On the dual-channel path the backend gets `mic.wav` + `system.wav` directly; on the mono fallback it gets `mixed_mono_16k.wav`. The mic track is the user's known identity / VAD source. **Note:** the per-file `sha256` fields above are part of the intended contract but are **not currently emitted** by the pipeline.)

---

## 3. SparkControl: connection (real)

- **Base URL (confirmed):** a private LAN host, given as a `.local` mDNS name (preferred over a raw IP; it survives IP changes), configured in Settings or via the `SPARK_BACKEND_URL` env var, and **never committed**. Service-discovery JSON is at `GET /api/endpoints` (returns current vLLM / Parakeet / Kokoro URLs). All audio endpoints in §4–§5 hang off this base. **Make it a setting** so the host can change, and ship a neutral placeholder (`https://your-spark-backend.local`) as the default.
- **TLS:** Start9 self-signed Root CA. Supported path: install the Start9 Root CA into the System keychain (default trust then succeeds). Skip-verification is an **off-by-default, host-scoped** escape hatch (`InsecureTrustDelegate`, scoped to the configured backend host), not the default.
- **Auth:** **none on the LAN.** No token/key today.
- **Limits:** **200 MB/request** (`413` over); timeouts ~300 s (transcription), ~600 s (diarization). **Send audio requests SEQUENTIALLY**: concurrent audio trips a GPU FFT race → `503 + Retry-After`.
- **Transport:** `multipart/form-data`. Audio file field is **`file`** on the mono path, or **`mic_file`** + **`system_file`** on the dual-channel path (bytes, not base64/path).
- **All endpoints are synchronous** (no job IDs / polling).
- **Errors:** JSON `{"detail": "..."}`; `400` malformed, `413` too large, `503 + Retry-After` transient (retry after the interval).
- **Health/discovery:** `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.

---

## 4. ★ `POST /api/audio/label-merge`: the app's primary call

Diarize + name clusters from the visual timeline (majority temporal overlap), with voiceprint fallback, optionally transcribed. Synchronous. **Stateless**: the app owns the timeline and the voiceprint library.

**Multipart fields.** Two audio shapes: **mono** (`file`) or **dual-channel** (`mic_file` + `system_file`, preferred when the system track is healthy):

| field | required | notes |
|---|---|---|
| `file` | mono path | mixed-mono WAV (the chunk, when chunking) |
| `mic_file` | dual path | the user's mic track (chunk), attributed to `self_name` |
| `system_file` | dual path | the remote/system track (chunk) |
| `self_name` | dual path | the user's name; the mic channel is attributed to them |
| `self_vad` | no | chunk-local windows where the mic is genuinely the user (active + louder than system) |
| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1); on the dual path it names only the remote speakers |
| `known_voiceprints` | no | JSON `{"<name>": [192 floats], ...}` from `VoiceprintStore` |
| `transcribe` | no | `"true"` to also return per-segment text (default false) |
| `min_overlap` | no | min fraction of a cluster's time overlapping the winning name (default `0.0`) |
| `voiceprint_threshold` | no | cosine sim to accept a voiceprint match (default `0.5`) |

```bash
# -k skips TLS verification; drop it once the Start9 Root CA is trusted (§3).
curl -k -X POST https://your-spark-backend.local/api/audio/label-merge \
  -F "file=@chunk_003.wav" \
  -F 'timeline=[{"start":0,"end":4.5,"name":"Grant"},{"start":4.8,"end":9.3,"name":"Sarah Jones"}]' \
  -F 'known_voiceprints={"Grant":[/*192*/],"Sarah Jones":[/*192*/]}' \
  -F "transcribe=true"
```

**Response (`transcribe=true`):**

```jsonc
{
  "duration": 9.259,
  "speakers": [
    { "cluster": "Speaker_0", "name": "Grant", "source": "visual",
      "overlap_confidence": 0.9866, "fingerprint": [/*192 floats*/] },
    { "cluster": "Speaker_1", "name": "Sarah Jones", "source": "voiceprint",
      "match_similarity": 0.71, "fingerprint": [/*192 floats*/] }
  ],
  "segments": [
    { "start_ms": 0, "end_ms": 4480, "speaker": "Grant",
      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones",
      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "fingerprints": { "Grant": [/*192*/], "Sarah Jones": [/*192*/] },
  "models": {
    "diarization": "nvidia/diar_sortformer_4spk-v1",
    "embedding": "nvidia/speakerverification_en_titanet_large"
  }
}
```

- `transcribe=false` → segments are `{start_s, end_s, speaker, confidence}` (no text).
- **Resolution order per cluster:** `visual` → `voiceprint` (with `match_similarity`) → `Unknown_N` (`source:"unmatched"`, never mislabeled).
- **Persist `fingerprints`** keyed by name into `VoiceprintStore` (skip `Unknown_N`).
- **Diarizer caps at 4 speakers/chunk** and takes **no `num_speakers` hint**: chunk for larger/longer calls; names + voiceprints unify speakers across chunks.

---

## 5. Other SparkControl endpoints (reference / fallback)

### 5.1 `POST /v1/audio/transcriptions` (OpenAI-compatible, sync)

Plain STT. Fields: `file` (req), `model` (default `parakeet-tdt-0.6b-v3`), `response_format` (`json` | `verbose_json` | `text`), `language`, `temperature`, `prompt`. `verbose_json` returns word- **and** segment-level timestamps + full `text`. Use only if you ever want transcript without speaker labels.

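
Before moving on to the remaining fallback endpoints: for the primary `label-merge` path (§4), the chunk-local `timeline` field from §1.1 has to be derived from the session-level `segments`. A minimal Python sketch of that slice-and-rebase step follows. The function name `chunk_timeline`, the clipping of segments that straddle a chunk boundary, and the millisecond rounding are illustrative assumptions, not part of the verified contract.

```python
import json
from typing import Dict, List


def chunk_timeline(segments: List[Dict], chunk_start: float, chunk_end: float) -> List[Dict]:
    """Slice session-level segments to [chunk_start, chunk_end) and rebase
    them to chunk-local seconds, dropping the app-only `source` key.

    Assumption: a segment that straddles a chunk boundary is clipped to the
    window rather than dropped or duplicated whole.
    """
    out = []
    for seg in segments:
        start = max(seg["start"], chunk_start)
        end = min(seg["end"], chunk_end)
        if end <= start:
            continue  # segment does not intersect this chunk window
        out.append({
            "start": round(start - chunk_start, 3),  # chunk-local seconds
            "end": round(end - chunk_start, 3),
            "name": seg["name"],
            "confidence": seg["confidence"],
        })
    return out


if __name__ == "__main__":
    session_segments = [
        {"start": 0.0, "end": 4.5, "name": "Grant", "confidence": 0.97, "source": "mic_vad"},
        {"start": 148.0, "end": 153.5, "name": "Sarah Jones", "confidence": 0.82, "source": "vision"},
    ]
    # Second 150 s chunk of the session (chunk_sec = 150 as in manifest.json).
    timeline_field = json.dumps(chunk_timeline(session_segments, 150.0, 300.0))
    print(timeline_field)
```

The serialized string is what would go into the `timeline` form field. Overlapping segments pass through unchanged, in line with the "do not flatten" rule in §1.
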
### 5.2 `POST /api/audio/diarize-chunk` (sync)

Field: `file`. Returns anonymous clusters `{start_s, end_s, speaker, confidence}`, `speakers_detected`, and a 192-dim `fingerprints` map per local speaker. Use if you ever want to do the merge client-side instead of via `label-merge` (not the default path).

### 5.3 `POST /api/audio/transcribe-with-speakers` (sync)

Field: `file`. ASR + diarization merged into anonymous speaker-attributed blocks `{start_ms, end_ms, speaker, text}`. (Anonymous, no naming. `label-merge` is the named version.)

---

## 6. `speakers.json`: final stored output (target end state)

Per-chunk `label-merge` results concatenated, **timestamps offset back to session (global) seconds**, names unified across chunks. This is the hand-off to the user's downstream summarizer; the app stops here.

```jsonc
{
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",
  "duration_sec": 2841.7,
  "speakers": [
    { "name": "Grant", "source": "visual", "overlap_confidence": 0.99 },
    { "name": "Sarah Jones", "source": "voiceprint", "match_similarity": 0.71 },
    { "name": "Unknown_0", "source": "unmatched" }
  ],
  "segments": [
    { "start": 0.0, "end": 4.48, "speaker": "Grant", "text": "..." },
    { "start": 4.80, "end": 9.04, "speaker": "Sarah Jones", "text": "..." },
    { "start": 152.3, "end": 158.1, "speaker": "Unknown_0", "text": "..." }
  ],
  "models": {
    "diarization": "nvidia/diar_sortformer_4spk-v1",
    "embedding": "nvidia/speakerverification_en_titanet_large",
    "transcription": "parakeet-tdt-0.6b-v3"
  }
}
```

(Convert backend `start_ms`/`end_ms` → seconds; add the chunk offset.)

## 7. `voiceprints.json`: local voiceprint library

```jsonc
{
  "Grant":       { "vector": [/*192 floats*/], "updated": "2026-06-05T14:51Z", "calls": 12 },
  "Sarah Jones": { "vector": [/*192 floats*/], "updated": "2026-06-02T09:10Z", "calls": 3 }
}
```

Loaded → `known_voiceprints` on every `label-merge` call. Updated from response `fingerprints` for `visual`/high-confidence `voiceprint` speakers only. Never stores `Unknown_N`. Update policy (`02 §2.9`): start = store latest with `overlap_confidence ≥ ~0.8`; consider per-name running mean later.

## 8. Recap outputs (`transcript.md`, `recap.{html,json}`)

After `speakers.json` is assembled, the recap phase renders the human-readable deliverables: a `transcript.md` (one line per diarized utterance) and an HTML `recap.html`, backed by a structured `recap.json`. The recap's topic/summary content is generated by the **backend LLM** (`POST /v1/chat/completions`, Qwen3); the app owns the rendering and the in-app **speaker-name editor**, which can rewrite names across `speakers.json`, the transcript, and the recap after the fact.
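
As a closing illustration of §6 and §7, the sketch below shows one way a per-chunk `label-merge` response might be folded into session-level data: converting `start_ms`/`end_ms` (or `start_s`/`end_s` when `transcribe=false`) to global seconds by adding the chunk offset, and applying the initial voiceprint update policy (store the latest vector for `visual` speakers with `overlap_confidence ≥ 0.8`, never `Unknown_N`). The function names, the exact `0.8` cutoff, and the handling of `updated`/`calls` are assumptions for illustration. The threshold for high-confidence `voiceprint` matches is not specified in this document, so the sketch leaves those entries untouched.

```python
from typing import Dict, List


def to_session_segments(chunk_segments: List[Dict], chunk_offset_sec: float) -> List[Dict]:
    """Convert chunk-local backend segments to session (global) seconds."""
    out = []
    for seg in chunk_segments:
        if "start_ms" in seg:  # transcribe=true shape
            start, end = seg["start_ms"] / 1000.0, seg["end_ms"] / 1000.0
        else:                  # transcribe=false shape
            start, end = seg["start_s"], seg["end_s"]
        merged = {
            "start": round(start + chunk_offset_sec, 3),
            "end": round(end + chunk_offset_sec, 3),
            "speaker": seg["speaker"],
        }
        if "text" in seg:
            merged["text"] = seg["text"]
        out.append(merged)
    return out


def update_voiceprints(store: Dict, response: Dict, now_iso: str,
                       min_overlap_conf: float = 0.8) -> Dict:
    """Apply the 'store latest' policy to an in-memory voiceprints.json dict.

    Assumptions: only `visual` speakers at or above `min_overlap_conf` are
    written; `calls` is carried over unchanged because its exact semantics
    (per call vs. per chunk) are not defined in this document.
    """
    fingerprints = response.get("fingerprints", {})
    for spk in response.get("speakers", []):
        name = spk["name"]
        if name.startswith("Unknown_"):
            continue  # never store unmatched clusters
        if spk.get("source") != "visual":
            continue  # voiceprint-sourced updates: policy not specified here
        if spk.get("overlap_confidence", 0.0) < min_overlap_conf:
            continue
        vector = fingerprints.get(name)
        if vector is None:
            continue
        store[name] = {
            "vector": vector,
            "updated": now_iso,
            "calls": store.get(name, {}).get("calls", 0),
        }
    return store
```

A caller would run `to_session_segments` once per chunk with that chunk's start time as the offset, concatenate the results in chunk order, and call `update_voiceprints` on the loaded `voiceprints.json` dict before writing it back.
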