Spark Control — Audio API reference (transcription + diarization)
For external clients (e.g. the Ten31 Transcripts capture app) integrating with the transcription and diarization endpoints. All examples are real responses from the live deployment.
1. Connection / auth
- Base URL: https://<spark-control-host> (the operator's Start9 LAN address, e.g. https://<spark-control-host>:62419). A .local form also exists (survives IP changes); the operator can provide it.
- TLS: Start9's self-signed Root CA. On the LAN, set verify=False / rejectUnauthorized:false (curl -k), or install the Start9 Root CA into your trust store. Same story as every other Spark Control endpoint.
- Auth: none on the LAN today. The endpoints sit behind StartOS access control + TLS on a trusted network. No bearer token / API key. (If you need per-client auth later, the operator can add it; it's not there now.)
- Limits:
  - Max upload: 200 MB per request (413 if exceeded). For long calls, chunk (see §4).
  - Request timeout: transcription ~300 s, diarization ~600 s per request.
  - Send requests sequentially, not in parallel. Concurrent audio requests can trip a GPU FFT race on the backend (returns 503 + Retry-After). One in flight at a time is the safe pattern.
2. Audio format
- The backend (NVIDIA Parakeet / NeMo) decodes common formats (WAV, FLAC, MP3, m4a) and internally resamples to 16 kHz mono. So 16 kHz mono WAV is ideal; anything decodable also works (stereo is downmixed).
- Single mixed-mono file is what the endpoints expect. Diarization (Sortformer) separates speakers from one mixed stream, so for diarization you want everyone in one file — mix your system-audio track + mic track to one mono WAV before sending.
- Your two-track capture is an asset: your mic track is, by definition, you (a known identity). Two clean options:
  - Mix both tracks → diarize the mix (simplest; pairs perfectly with your visual-timeline name-merge, see §4 note).
  - Diarize only the system track (the other participants) and label your own mic track as the user directly (no diarization needed for your own voice).
- Upload mechanism: multipart/form-data, file field name file (OpenAI-compatible). Not base64, not a path/URL: send the bytes.
3. Transcription endpoint
POST /v1/audio/transcriptions — OpenAI-compatible. Synchronous (returns the
result; no job/polling).
Multipart fields:
| field | required | notes |
|---|---|---|
| file | yes | the audio bytes |
| model | no | default parakeet-tdt-0.6b-v3 (one STT model server-side; you don't need to pick) |
| response_format | no | json (default, just text) · verbose_json (timestamps) · text |
| language | no | default auto/en |
| temperature, prompt | no | passthrough |
curl -k -X POST https://<host>/v1/audio/transcriptions \
-F "file=@call.wav" -F "response_format=verbose_json"
Real verbose_json response — includes word-level AND segment-level
timestamps (seconds):
{
"task": "transcribe",
"language": "en",
"duration": 9.259,
"text": "Good morning everyone. I think the energy thesis is strong this quarter. I agree, but I am worried about the lockup terms and the fee load this time.",
"segments": [
{ "start": 0.0, "end": 1.28, "text": "Good morning everyone." },
{ "start": 1.44, "end": 4.48, "text": "I think the energy thesis is strong this quarter." }
],
"words": [
{ "start": 0.0, "end": 0.32, "text": "Good" },
{ "start": 0.32, "end": 0.72, "text": "morning" },
{ "start": 0.8, "end": 1.28, "text": "everyone." }
]
}
(json → {"text": "..."}; text → plain text body.)
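For clients that cannot shell out to curl, the same request can be built with the Python standard library. This is a sketch under assumptions: `build_multipart` and `transcribe` are hypothetical helper names, the 300 s timeout mirrors the limit listed in §1, and certificate verification is disabled to match the curl -k example (install the Start9 Root CA instead if you can).

```python
import json
import ssl
import urllib.request
import uuid


def build_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Encode text fields plus one file part as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode()
        + data
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(base_url: str, wav_path: str, timeout_s: float = 300.0) -> dict:
    """POST one audio file and return the parsed verbose_json response."""
    with open(wav_path, "rb") as fh:
        body, content_type = build_multipart(
            {"response_format": "verbose_json"}, "file", "call.wav", fh.read()
        )
    req = urllib.request.Request(
        base_url + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Equivalent of curl -k: skip verification of the self-signed certificate.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, timeout=timeout_s, context=ctx) as resp:
        return json.load(resp)
```

Usage would look like `result = transcribe("https://<host>", "call.wav")`, after which `result["segments"]` and `result["words"]` carry the timestamps shown above.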
4. Diarization
Two endpoints, both synchronous, both returning anonymous clusters
(Speaker_0, Speaker_1, …) with timestamps. NVIDIA Sortformer owns
segmentation; TitaNet produces a voiceprint per speaker.
POST /api/audio/diarize-chunk — segmentation + voice fingerprints
Multipart: file (required). Designed to be called per chunk for long calls;
returns a 192-dim fingerprint per local speaker so you can re-cluster the same
person across chunks.
curl -k -X POST https://<host>/api/audio/diarize-chunk -F "file=@call.wav"
Real response:
{
"duration": 9.259,
"segments": [
{ "start_s": 0.0, "end_s": 1.52, "speaker": "Speaker_0", "confidence": 0.931 },
{ "start_s": 1.6, "end_s": 4.56, "speaker": "Speaker_0", "confidence": 0.9662 },
{ "start_s": 4.88, "end_s": 9.04, "speaker": "Speaker_1", "confidence": 0.9681 }
],
"speakers_detected": ["Speaker_0", "Speaker_1"],
"fingerprints": {
"Speaker_0": [0.0028, 0.0173, -0.0114, "...192 floats"],
"Speaker_1": [0.0020, 0.0056, -0.0045, "...192 floats"]
},
"models": {
"diarization": "nvidia/diar_sortformer_4spk-v1",
"embedding": "nvidia/speakerverification_en_titanet_large"
}
}
- confidence ∈ [0,1] = mean probability the assigned speaker was active over the segment (threshold it to render uncertain segments as "Speaker_0?").
- speaker labels are local to this chunk. Use the fingerprints + cosine similarity (NeMo default distance threshold ~0.7) to merge chunkA.Speaker_0 with chunkB.Speaker_2 when they're the same voice.
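The cross-chunk merge and the confidence rendering can be sketched as follows. This is an illustrative client-side approach, not the backend's code: `assign_global_ids`, `display_label`, and the `Global_N` naming are hypothetical, the 0.7 value is used here as a minimum cosine similarity (an assumption about how to apply the threshold quoted above), and the 0.8 confidence cutoff is an arbitrary example.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_global_ids(chunk_fingerprints: dict, registry: dict,
                      min_similarity: float = 0.7) -> dict:
    """Map chunk-local speaker labels to call-wide ids.

    `registry` maps global id -> reference fingerprint and is updated in
    place whenever a chunk introduces a voice that matches nobody so far.
    """
    mapping = {}
    taken = set()  # two speakers in the same chunk never share a global id
    for local, vec in chunk_fingerprints.items():
        best_id, best_sim = None, min_similarity
        for gid, ref in registry.items():
            if gid in taken:
                continue
            sim = cosine(vec, ref)
            if sim >= best_sim:
                best_id, best_sim = gid, sim
        if best_id is None:
            best_id = f"Global_{len(registry)}"
            registry[best_id] = list(vec)
        taken.add(best_id)
        mapping[local] = best_id
    return mapping


def display_label(segment: dict, min_confidence: float = 0.8) -> str:
    """Append '?' to the speaker label when segment confidence is low."""
    label = segment["speaker"]
    return label if segment["confidence"] >= min_confidence else label + "?"
```

Call `assign_global_ids` once per chunk, in order, reusing the same `registry`; the returned mapping tells you which call-wide id each chunk-local `Speaker_N` belongs to.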
POST /api/audio/transcribe-with-speakers — ASR + diarization merged
Multipart: file (required). Runs transcription + diarization and merges by
timestamp into speaker-attributed text blocks (start_ms/end_ms).
{
"duration": 9.259, "language": "en",
"speakers_detected": ["Speaker_0", "Speaker_1"],
"segments": [
{ "start_ms": 0, "end_ms": 4480, "speaker": "Speaker_0",
"text": "Good morning everyone. I think the energy thesis is strong this quarter." },
{ "start_ms": 4800, "end_ms": 9040, "speaker": "Speaker_1",
"text": "I agree, but I am worried about the lockup terms and the fee load this time." }
],
"models": { "transcription": "parakeet", "diarization": "nvidia/diar_sortformer_4spk-v1" }
}
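As a small illustration of consuming this response shape, the sketch below turns the segments into readable transcript lines. `format_ms` and `render_transcript` are hypothetical helpers; only the field names (segments, start_ms, end_ms, speaker, text) come from the response above.

```python
def format_ms(ms: int) -> str:
    """Format milliseconds as MM:SS.mmm."""
    minutes, rem = divmod(ms, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def render_transcript(response: dict) -> str:
    """One line per speaker-attributed block, ordered by start time."""
    lines = []
    for seg in sorted(response["segments"], key=lambda s: s["start_ms"]):
        span = f"{format_ms(seg['start_ms'])}-{format_ms(seg['end_ms'])}"
        lines.append(f"[{span}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)
```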
Your key question — can it accept a prior (named timeline) to label clusters?
Yes — that's exactly what /api/audio/label-merge does (§4.5 below). You POST
the audio plus your visual (start, end, name) timeline; the backend diarizes,
runs the majority-temporal-overlap vote, and returns named segments — no
client-side merge needed. (The two endpoints above still return anonymous clusters
if you'd rather do the merge yourself.) Note the diarizer is fixed at ≤4 speakers
per chunk (NVIDIA Sortformer, not pyannote) and takes no num_speakers hint —
for >4-person calls, chunk the audio; your visual timeline actually helps
disambiguate across chunks.
The fingerprints compound. Every named cluster comes back with its 192-dim
voiceprint. Persist those (keyed by name) and pass them back as known_voiceprints
on the next call — label-merge will recover a speaker by voice even when the
visual cue is missing (camera off, a bad OCR frame). Your visual capture enrolls a
voice library for free.
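Because the backend persists nothing, the voiceprint library lives on the client. A minimal sketch of that bookkeeping, assuming a single JSON file on disk: the function names, the file layout, the latest-call-wins overwrite, and the decision to skip Unknown_N entries are all illustrative choices rather than anything the API prescribes.

```python
import json
from pathlib import Path


def load_library(path: str) -> dict:
    """Load {name: [floats]} from disk, or an empty library if absent."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def update_library(path: str, fingerprints: dict) -> dict:
    """Merge the named fingerprints from one response into the library.

    Assumption: the newest voiceprint for a name replaces the old one, and
    unresolved clusters (Unknown_N) are not enrolled.
    """
    library = load_library(path)
    for name, vec in fingerprints.items():
        if name.startswith("Unknown_"):
            continue
        library[name] = vec
    Path(path).write_text(json.dumps(library))
    return library


def known_voiceprints_field(library: dict) -> str:
    """Serialize the library as the known_voiceprints form-field value."""
    return json.dumps(library)
```

After each call, pass the response's fingerprints map to `update_library`; before the next call, send `known_voiceprints_field(load_library(path))` as the known_voiceprints field.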
4.5 POST /api/audio/label-merge — named segments from a visual timeline
Diarize + name the clusters from your screen-derived timeline (majority temporal overlap), with an optional voiceprint fallback for anyone the visual track missed. Synchronous. Stateless — you own the timeline and the voiceprint library; the backend just diarizes + merges and persists nothing.
multipart/form-data fields:
| field | required | notes |
|---|---|---|
| file | yes | mixed-mono audio |
| timeline | yes | JSON array: [{"start":0.0,"end":4.5,"name":"Alice","confidence":0.9}, ...] (seconds) |
| known_voiceprints | no | JSON object {"Alice":[192 floats], "Bob":[...]}: named voiceprints from past calls, used to label clusters with no visual overlap |
| transcribe | no | "true" to also return per-segment text (default false) |
| min_overlap | no | min fraction of a cluster's time that must overlap the winning name (default 0.0 = any overlap wins) |
| voiceprint_threshold | no | cosine similarity to accept a voiceprint match (default 0.5) |
curl -k -X POST https://<host>/api/audio/label-merge \
-F "file=@call.wav" \
-F 'timeline=[{"start":0,"end":4.5,"name":"Alice"},{"start":4.8,"end":9.3,"name":"Bob"}]' \
-F "transcribe=true"
Real response (the 2-speaker test clip; visual named both, with transcript):
{
"duration": 9.259,
"speakers": [
{ "cluster": "Speaker_0", "name": "Alice", "source": "visual", "overlap_confidence": 0.9866,
"fingerprint": [0.0028, 0.0173, "...192 floats"] },
{ "cluster": "Speaker_1", "name": "Bob", "source": "visual", "overlap_confidence": 1.0,
"fingerprint": [0.0020, 0.0056, "...192 floats"] }
],
"segments": [
{ "start_ms": 0, "end_ms": 4480, "speaker": "Alice", "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
{ "start_ms": 4800, "end_ms": 9040, "speaker": "Bob", "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
],
"fingerprints": { "Alice": [192 floats], "Bob": [192 floats] },
"models": { "diarization": "nvidia/diar_sortformer_4spk-v1", "embedding": "nvidia/speakerverification_en_titanet_large" }
}
Name resolution per cluster, in order: (1) the visual-timeline name with the
most temporal overlap (source: "visual"); (2) if none, the closest
known_voiceprints match above voiceprint_threshold (source: "voiceprint",
with match_similarity); (3) otherwise Unknown_N (source: "unmatched") — never
mislabeled. The fingerprints map (keyed by the resolved name) is what you persist
to grow your voiceprint library for the next call. When transcribe=false, segments
are {start_s, end_s, speaker, confidence} instead of text blocks.
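For clients that prefer to do the merge themselves on diarize-chunk output, the three-step resolution order can be approximated as below. This is a sketch of the described behaviour, not the backend's implementation: `resolve_names` and its helpers are hypothetical, ties are broken arbitrarily, and "above voiceprint_threshold" is read as greater than or equal.

```python
import math


def overlap_s(a_start, a_end, b_start, b_end) -> float:
    """Length in seconds of the intersection of two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def resolve_names(segments, timeline, fingerprints, known_voiceprints=None,
                  min_overlap=0.0, voiceprint_threshold=0.5) -> dict:
    """Return {cluster: {"name", "source", ...}} for every diarized cluster."""
    known = known_voiceprints or {}
    resolved, unknown_count = {}, 0
    for cluster in sorted({seg["speaker"] for seg in segments}):
        spans = [(s["start_s"], s["end_s"]) for s in segments
                 if s["speaker"] == cluster]
        total = sum(end - start for start, end in spans)

        # (1) Visual timeline: the name sharing the most time with the cluster.
        votes = {}
        for entry in timeline:
            shared = sum(overlap_s(st, en, entry["start"], entry["end"])
                         for st, en in spans)
            if shared > 0:
                votes[entry["name"]] = votes.get(entry["name"], 0.0) + shared
        if votes:
            name, shared = max(votes.items(), key=lambda kv: kv[1])
            if shared / total >= min_overlap:
                resolved[cluster] = {"name": name, "source": "visual"}
                continue

        # (2) Voiceprint fallback: closest known voice above the threshold.
        vec = fingerprints.get(cluster)
        scored = [(cosine(vec, ref), n) for n, ref in known.items()] if vec else []
        if scored and max(scored)[0] >= voiceprint_threshold:
            sim, name = max(scored)
            resolved[cluster] = {"name": name, "source": "voiceprint",
                                 "match_similarity": round(sim, 4)}
            continue

        # (3) Nothing matched: keep the cluster anonymous rather than guess.
        resolved[cluster] = {"name": f"Unknown_{unknown_count}",
                             "source": "unmatched"}
        unknown_count += 1
    return resolved
```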
Verified live — visual match (both speakers named), voiceprint recovery (a
camera-off speaker matched by voice), and unmatched (→ Unknown_0) all confirmed.
Dual-channel mode (recommended for Ten31 Transcripts)
If you capture two sample-aligned tracks — mic_file (the local user) + system_file
(everyone else, from screen capture) — send them instead of file. This is strictly
better than mixing to mono: the diarizer over-segments a mono mix (a stereo clip of two
clean voices comes back as 3 speakers), whereas the two channels let each model get
the easiest possible mono input.
Extra form fields for dual mode:
| field | required | notes |
|---|---|---|
| mic_file + system_file | yes (dual) | the two aligned mono-16k tracks |
| self_name | no | the local user's name (mic channel). Default "Me". |
| self_vad | no | JSON [{"start","end"}]: windows where the mic is active and louder than system. If omitted, computed server-side per-window. |
How it works:
- The mic track → your words, gated to windows where the mic is genuinely you speaking. The mic also picks up the remote audio as quiet bleed, so this loudness gate is essential; without it the bleed gets transcribed as you.
- The system track → diarized (it only has to separate the remote people) and named via the timeline + voiceprints.
- Your clean voiceprint is enrolled from the mic track and injected into the library, so a system cluster that's you dialed in from a second device (dual-login) resolves to you, not a stranger.
- You also free a Sortformer speaker slot (you no longer consume one).
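If you want to supply self_vad yourself rather than rely on the server-side computation, the loudness gate can be sketched as a per-window RMS comparison over the two aligned tracks. Everything numeric here is an assumption for illustration (window length, the absolute activity floor `min_rms`, and the `margin` by which the mic must exceed the system track), as is the reading of start/end as seconds; the server's actual gate may differ.

```python
import math


def rms(samples) -> float:
    """Root-mean-square level of a block of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def self_vad_windows(mic, system, sample_rate=16000, window_s=0.5,
                     min_rms=500.0, margin=1.5):
    """Return [{"start", "end"}] windows where the mic is active AND
    clearly louder than the system track.

    mic / system: sample-aligned sequences of int16 PCM values.
    """
    win = max(1, int(sample_rate * window_s))
    n = min(len(mic), len(system))
    windows = []
    for start in range(0, n, win):
        end = min(start + win, n)
        mic_level = rms(mic[start:end])
        sys_level = rms(system[start:end])
        if mic_level >= min_rms and mic_level > margin * sys_level:
            t0, t1 = start / sample_rate, end / sample_rate
            if windows and windows[-1]["end"] == t0:
                windows[-1]["end"] = t1  # extend the previous adjacent window
            else:
                windows.append({"start": t0, "end": t1})
    return windows
```

Serialize the result with `json.dumps(...)` and send it as the self_vad form field alongside mic_file and system_file.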
curl -k -X POST https://<host>/api/audio/label-merge \
-F "mic_file=@mic.wav" -F "system_file=@system.wav" \
-F "self_name=Alice" -F 'timeline=[...]' -F "transcribe=true" \
-F 'known_voiceprints={"Alice":[...],"Bob":[...]}' # include your own
Response is the same shape with "mode":"dual_channel"; speakers includes a
{"name":self_name,"source":"mic_channel"} entry, and fingerprints[self_name] is your
clean mic-enrolled voiceprint to store.
Validated on a real misattributing call: dual-channel fixed both mono-mix
misattributions (a remote "Go Bitcoin" no longer credited to the user; a local "There"
recovered from Unknown), and correctly split overlapping speech — two people saying
"Hello" at once that the coarse ground truth itself had conflated.
One known limit: if loud remote bleed masks a quiet local word, the mic-track ASR can miss it entirely (we can't attribute a word that was never transcribed). A cleaner mic (headphones, so there's no speaker bleed) avoids it; channel-subtraction echo-cancellation is a possible future enhancement since the tracks are sample-aligned.
5. Anything else
- No OpenAPI/Swagger yet. This doc + the curl examples are the contract.
- Health / discovery:
  - GET /api/status: per-service health (parakeet, etc.).
  - GET /api/endpoints: service-discovery JSON (base URLs + ready flags).
  - GET /v1/models: lists the STT model + diarizer.
- Errors: JSON body, conventional status codes: 400 malformed, 413 too large, 503 + Retry-After if the backend briefly wedges (retry after the interval; transcription auto-recovers). Most error bodies are {"detail": "..."}.
- Long calls: chunk into ~2–3 min pieces, send sequentially, diarize each with /api/audio/diarize-chunk, and stitch speakers across chunks via the fingerprints. (The operator's other apps use exactly this pattern.)
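The bookkeeping for the long-call pattern can be sketched as two helpers: one that plans chunk boundaries and one that shifts each chunk's local timestamps back onto the full-call clock. `chunk_spans`, `to_call_time`, and the 150 s default (the middle of the ~2–3 min range) are illustrative assumptions; cutting the audio itself and stitching speakers via fingerprints are left out.

```python
def chunk_spans(duration_s: float, chunk_s: float = 150.0):
    """Split [0, duration_s) into consecutive (start, end) spans in seconds."""
    duration_s = float(duration_s)
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    spans, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


def to_call_time(segments, offset_s: float):
    """Shift chunk-local start_s/end_s by the chunk's offset in the call."""
    return [
        {**seg,
         "start_s": round(seg["start_s"] + offset_s, 3),
         "end_s": round(seg["end_s"] + offset_s, 3)}
        for seg in segments
    ]
```

For each span, cut that slice of audio, send it to /api/audio/diarize-chunk (one request at a time), then pass the returned segments through `to_call_time` with the span's start as the offset.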
Backend: NVIDIA Parakeet TDT 0.6B (STT) + Sortformer 4spk-v1 (diarization) + TitaNet (voice fingerprints) on DGX Spark, fronted by Spark Control. All on the operator's LAN — nothing leaves the box.