Data Contracts — Ten31 Transcripts
Companion to docs 01/02. Defines the files the app produces/stores and the real
SparkControl contract (source of truth: AUDIO_API.md). The label-merge
endpoint is the app's primary integration point.
1. visual_timeline.json (the app's primary output)
Rich, app-native record of who-was-speaking-when, inferred from the screen.
Times are seconds relative to session t0 (the shared audio clock).
{
"schema_version": "1.1",
"session_id": "2026-06-05T14-03_zoom",
"app": "zoom", // zoom | teams | signal | meet
"adapter_version": "zoom-0.3.1",
"t0_unix": 1749135780.123,
"duration_sec": 2841.7,
"fps_sampled": 3,
"self_name": "Grant", // user's name; segments from mic VAD
"participants": [
{ "name": "Grant", "is_self": true },
{ "name": "Sarah Jones", "aliases": ["Sarah J"] },
{ "name": "Dmitri" }
],
"segments": [
{ "start": 0.0, "end": 4.5, "name": "Grant", "confidence": 0.97, "source": "mic_vad" },
{ "start": 4.8, "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision" }
],
"visual_gaps": [
{ "start": 120.4, "end": 138.9, "reason": "minimized" } // reason: minimized | tab_switched
]
}
- Segments may overlap (crosstalk) — do not flatten.
- source ∈ {vision, accessibility, fused, mic_vad}.
- The user (mic-VAD is_self) is pre-seeded so the backend names them and enrolls their voiceprint on call one.
1.1 Mapping to the label-merge timeline field
label-merge wants a flat JSON array of {start, end, name, confidence}
(seconds). Build it by taking segments and dropping source:
[{"start":0.0,"end":4.5,"name":"Grant","confidence":0.97},
{"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]
When chunking, slice to the chunk window and rebase to chunk-local seconds (subtract chunk start) before sending.
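The slice-and-rebase step can be sketched as below. This is a minimal illustration, not the app's actual code: the function name timeline_for_chunk, the clipping of segments that straddle a chunk boundary, and the rounding to 3 decimals are all assumptions.

```python
from typing import Dict, List


def timeline_for_chunk(segments: List[Dict], chunk_start: float, chunk_end: float) -> List[Dict]:
    """Slice visual_timeline segments to one chunk window and rebase to chunk-local seconds."""
    out = []
    for seg in segments:
        # Clip the segment to the chunk window; skip it if nothing remains.
        start = max(seg["start"], chunk_start)
        end = min(seg["end"], chunk_end)
        if end <= start:
            continue
        out.append({
            "start": round(start - chunk_start, 3),  # rounding is an assumption
            "end": round(end - chunk_start, 3),
            "name": seg["name"],
            "confidence": seg["confidence"],
            # "source" is dropped on purpose: the timeline field does not carry it.
        })
    return out


segments = [
    {"start": 0.0, "end": 4.5, "name": "Grant", "confidence": 0.97, "source": "mic_vad"},
    {"start": 4.8, "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision"},
]
first_chunk = timeline_for_chunk(segments, 0.0, 150.0)
```

Overlapping segments pass through untouched, so crosstalk is preserved rather than flattened.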
2. manifest.json
{
"session_id": "2026-06-05T14-03_zoom",
"app": "zoom",
"t0_unix": 1749135780.123,
"duration_sec": 2841.7,
"audio": {
"mixed": { "file": "mixed_mono_16k.wav", "sr": 16000, "channels": 1, "sha256": "..." },
"mic": { "file": "mic.wav", "sr": 16000, "channels": 1, "sha256": "..." },
"system": { "file": "system.wav", "sr": 16000, "channels": 1, "sha256": "..." }
},
"chunking": { "enabled": true, "chunk_sec": 150, "chunks": 19 },
"visual_timeline": "visual_timeline.json",
"backend_result": "speakers.json",
"adapter_versions": { "zoom": "0.3.1" },
"app_version": "0.1.0"
}
(mixed_mono_16k.wav is the one the backend gets; the separate tracks are kept
locally — the mic track is the user's known identity / VAD source.)
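For reference, the chunking numbers above are consistent with fixed-width windows: a 2841.7 s session at chunk_sec 150 gives 19 chunks. A minimal sketch follows, assuming non-overlapping windows with a shorter final chunk (the manifest does not say whether chunks overlap); plan_chunks is a hypothetical helper name.

```python
import math
from typing import List, Tuple


def plan_chunks(duration_sec: float, chunk_sec: float) -> List[Tuple[float, float]]:
    """Return (start, end) windows in session seconds; the last window may be shorter."""
    count = math.ceil(duration_sec / chunk_sec)
    return [(i * chunk_sec, min((i + 1) * chunk_sec, duration_sec)) for i in range(count)]


windows = plan_chunks(2841.7, 150)
```

Each window's start is also the offset to add back when per-chunk results are merged into session seconds.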
3. SparkControl — connection (real)
- Base URL (confirmed): a private LAN host, given as a .local mDNS name (preferred over a raw IP; it survives IP changes), configured in Settings or via the SPARK_BACKEND_URL env var, and never committed. Service-discovery JSON is at GET /api/endpoints (returns current vLLM / Parakeet / Kokoro URLs). All audio endpoints in §4–§5 hang off this base. Make it a setting so the host can change, and ship a neutral placeholder (https://your-spark-backend.local) as the default.
- TLS: Start9 self-signed Root CA. Either skip verification (URLSession delegate trusting the cert; curl -k; rejectUnauthorized:false) or install the Start9 Root CA into the trust store.
- Auth: none on the LAN. No token/key today.
- Limits: 200 MB/request (413 over); timeouts ~300 s (transcription), ~600 s (diarization). Send audio requests SEQUENTIALLY: concurrent audio trips a GPU FFT race → 503 + Retry-After.
- Transport: multipart/form-data, audio file field name file (bytes, not base64/path).
- All endpoints are synchronous (no job IDs / polling).
- Errors: JSON {"detail": "..."}; 400 malformed, 413 too large, 503 + Retry-After transient (retry after the interval).
- Health/discovery: GET /api/status, GET /api/endpoints, GET /v1/models.
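The sequential-send and 503 + Retry-After rules could be handled along these lines. This is a sketch only: the transport is injected as a plain callable so the logic stays independent of any HTTP client, and the names send_with_retry and send_all, the (status, headers, body) tuple shape, the 200 success code, and the 5-attempt cap are all assumptions rather than part of the SparkControl contract.

```python
import time
from typing import Callable, Dict, List, Tuple

# A response is modelled as (status_code, headers, json_body).
Response = Tuple[int, Dict[str, str], Dict]


def send_with_retry(send: Callable[[], Response], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Call send() until it stops returning 503, waiting Retry-After seconds in between."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status == 503 and attempt < max_attempts:
            # Assumes Retry-After carries a number of seconds, not an HTTP date.
            sleep(float(headers.get("Retry-After", "1")))
            continue
        if status != 200:
            # Error bodies are JSON of the form {"detail": "..."}.
            raise RuntimeError(f"HTTP {status}: {body.get('detail', '')}")
        return body
    raise RuntimeError("no attempts made")


def send_all(requests: List[Callable[[], Response]], **kwargs) -> List[Dict]:
    """Send audio requests strictly one after another, never concurrently."""
    return [send_with_retry(r, **kwargs) for r in requests]
```

A 400 or 413 is raised immediately, since retrying a malformed or oversized request would not help.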
4. ★ POST /api/audio/label-merge — the app's primary call
Diarize + name clusters from the visual timeline (majority temporal overlap), with voiceprint fallback, optionally transcribed. Synchronous. Stateless — the app owns the timeline and the voiceprint library.
Multipart fields:
| field | required | notes |
|---|---|---|
| file | yes | mixed-mono WAV (the chunk, when chunking) |
| timeline | yes | flat JSON array [{"start","end","name","confidence"}], chunk-local seconds (§1.1) |
| known_voiceprints | no | JSON {"<name>":[192 floats], ...} from VoiceprintStore |
| transcribe | no | "true" to also return per-segment text (default false) |
| min_overlap | no | min fraction of a cluster's time overlapping the winning name (default 0.0) |
| voiceprint_threshold | no | cosine sim to accept a voiceprint match (default 0.5) |
curl -k -X POST https://<host>/api/audio/label-merge \
-F "file=@chunk_003.wav" \
-F 'timeline=[{"start":0,"end":4.5,"name":"Grant"},{"start":4.8,"end":9.3,"name":"Sarah Jones"}]' \
-F 'known_voiceprints={"Grant":[/*192*/],"Sarah Jones":[/*192*/]}' \
-F "transcribe=true"
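The same request's text fields could be assembled programmatically roughly as follows. This is a sketch under assumptions: label_merge_fields is a hypothetical helper, and the 192-length check is a client-side sanity guard inferred from the field table, not something the backend is documented to require of the caller.

```python
import json
from typing import Dict, List, Optional


def label_merge_fields(timeline: List[Dict],
                       known_voiceprints: Optional[Dict[str, List[float]]] = None,
                       transcribe: bool = False) -> Dict[str, str]:
    """Build the non-file multipart fields; every value is sent as a string."""
    fields = {"timeline": json.dumps(timeline)}
    if known_voiceprints:
        for name, vector in known_voiceprints.items():
            if len(vector) != 192:
                raise ValueError(f"voiceprint for {name!r} has {len(vector)} floats, expected 192")
        fields["known_voiceprints"] = json.dumps(known_voiceprints)
    if transcribe:
        # The field table gives "true" as the literal that enables transcription.
        fields["transcribe"] = "true"
    return fields
```

The WAV bytes go separately in the file part of the same multipart body; min_overlap and voiceprint_threshold are left at their server defaults here.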
Response (transcribe=true):
{
"duration": 9.259,
"speakers": [
{ "cluster": "Speaker_0", "name": "Grant", "source": "visual",
"overlap_confidence": 0.9866, "fingerprint": [/*192 floats*/] },
{ "cluster": "Speaker_1", "name": "Sarah Jones", "source": "voiceprint",
"match_similarity": 0.71, "fingerprint": [/*192 floats*/] }
],
"segments": [
{ "start_ms": 0, "end_ms": 4480, "speaker": "Grant",
"text": "Good morning everyone. I think the energy thesis is strong this quarter." },
{ "start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones",
"text": "I agree, but I am worried about the lockup terms and the fee load this time." }
],
"fingerprints": { "Grant": [/*192*/], "Sarah Jones": [/*192*/] },
"models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
"embedding": "nvidia/speakerverification_en_titanet_large" }
}
- transcribe=false → segments are {start_s, end_s, speaker, confidence} (no text).
- Resolution order per cluster: visual → voiceprint (with match_similarity) → Unknown_N (source:"unmatched", never mislabeled).
- Persist fingerprints keyed by name into VoiceprintStore (skip Unknown_N).
- Diarizer caps at 4 speakers/chunk and takes no num_speakers hint, so chunk for larger/longer calls; names + voiceprints unify speakers across chunks.
5. Other SparkControl endpoints (reference / fallback)
5.1 POST /v1/audio/transcriptions (OpenAI-compatible, sync)
Plain STT. Fields: file (req), model (default parakeet-tdt-0.6b-v3),
response_format (json | verbose_json | text), language, temperature,
prompt. verbose_json returns word- and segment-level timestamps + full
text. Use only if you ever want transcript without speaker labels.
5.2 POST /api/audio/diarize-chunk (sync)
Field: file. Returns anonymous clusters {start_s, end_s, speaker, confidence},
speakers_detected, and a 192-dim fingerprints map per local speaker. Use if
you ever want to do the merge client-side instead of via label-merge (not the
default path).
5.3 POST /api/audio/transcribe-with-speakers (sync)
Field: file. ASR + diarization merged into anonymous speaker-attributed blocks
{start_ms, end_ms, speaker, text}. (Anonymous — no naming. label-merge is the
named version.)
6. speakers.json — final stored output (target end state)
Per-chunk label-merge results concatenated, timestamps offset back to session
(global) seconds, names unified across chunks. This is the hand-off to the
user's downstream summarizer; the app stops here.
{
"session_id": "2026-06-05T14-03_zoom",
"app": "zoom",
"duration_sec": 2841.7,
"speakers": [
{ "name": "Grant", "source": "visual", "overlap_confidence": 0.99 },
{ "name": "Sarah Jones", "source": "voiceprint", "match_similarity": 0.71 },
{ "name": "Unknown_0", "source": "unmatched" }
],
"segments": [
{ "start": 0.0, "end": 4.48, "speaker": "Grant", "text": "..." },
{ "start": 4.80, "end": 9.04, "speaker": "Sarah Jones", "text": "..." },
{ "start": 152.3, "end": 158.1, "speaker": "Unknown_0", "text": "..." }
],
"models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
"embedding": "nvidia/speakerverification_en_titanet_large",
"transcription": "parakeet-tdt-0.6b-v3" }
}
(Convert backend start_ms/end_ms → seconds; add the chunk offset.)
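The conversion can be sketched as below, assuming each chunk's start offset in session seconds is known. The helper name to_session_segments, the rounding to 3 decimals, and the final sort by start time are assumptions. The sketch does not attempt to unify Unknown_N labels across chunks.

```python
from typing import Dict, List


def to_session_segments(chunk_results: List[Dict], chunk_starts: List[float]) -> List[Dict]:
    """Convert per-chunk label-merge segments (ms, chunk-local) to session seconds."""
    merged = []
    for result, offset in zip(chunk_results, chunk_starts):
        for seg in result["segments"]:
            merged.append({
                "start": round(seg["start_ms"] / 1000.0 + offset, 3),
                "end": round(seg["end_ms"] / 1000.0 + offset, 3),
                "speaker": seg["speaker"],
                "text": seg.get("text", ""),
            })
    # Sort by start time so the concatenated output reads chronologically.
    return sorted(merged, key=lambda s: s["start"])


chunk_results = [
    {"segments": [
        {"start_ms": 0, "end_ms": 4480, "speaker": "Grant", "text": "..."},
        {"start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones", "text": "..."},
    ]},
    {"segments": [
        {"start_ms": 2300, "end_ms": 8100, "speaker": "Unknown_0", "text": "..."},
    ]},
]
session_segments = to_session_segments(chunk_results, [0.0, 150.0])
```

With a second chunk starting at 150 s, the 2300 to 8100 ms segment lands at 152.3 to 158.1 s, matching the example above.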
7. voiceprints.json — local voiceprint library
{
"Grant": { "vector": [/*192 floats*/], "updated": "2026-06-05T14:51Z", "calls": 12 },
"Sarah Jones": { "vector": [/*192 floats*/], "updated": "2026-06-02T09:10Z", "calls": 3 }
}
Loaded into known_voiceprints on every label-merge call. Updated from response
fingerprints for visual or high-confidence voiceprint speakers only. Never
stores Unknown_N. Update policy (02 §2.9): to start, store the latest fingerprint
when overlap_confidence ≥ ~0.8; consider a per-name running mean later.
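A minimal sketch of the starting policy, covering only the visual case: the name update_voiceprints, the exact 0.8 cut-off, and incrementing calls once per invocation are assumptions. Voiceprint-sourced updates are left out because no threshold for them is given here.

```python
from typing import Dict

MIN_OVERLAP_CONFIDENCE = 0.8  # "~0.8" in the policy; the exact cut-off is an assumption


def update_voiceprints(library: Dict[str, Dict], response: Dict, now_iso: str) -> Dict[str, Dict]:
    """Store the latest fingerprint for confidently visual-named speakers."""
    for speaker in response["speakers"]:
        if speaker.get("source") != "visual":
            continue  # voiceprint and unmatched (Unknown_N) speakers are skipped in this sketch
        if speaker.get("overlap_confidence", 0.0) < MIN_OVERLAP_CONFIDENCE:
            continue
        name = speaker["name"]
        previous = library.get(name, {})
        library[name] = {
            "vector": speaker["fingerprint"],  # "store latest": overwrite, no averaging
            "updated": now_iso,
            "calls": previous.get("calls", 0) + 1,
        }
    return library
```

When chunking, this would need to run once per call rather than once per chunk to keep the calls counter meaningful.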