v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API
- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
@@ -0,0 +1,288 @@
# Spark Control — Audio API reference (transcription + diarization)

For external clients (e.g. the **Ten31 Transcripts** capture app) integrating with
the transcription and diarization endpoints. All examples are **real responses**
from the live deployment.

---

## 1. Connection / auth

- **Base URL:** `https://<spark-control-host>` (the operator's Start9 LAN address,
  e.g. `https://<spark-control-host>:62419`). A `.local` form also exists (survives IP
  changes); the operator can provide it.
- **TLS:** Start9's self-signed Root CA. On the LAN, set `verify=False` /
  `rejectUnauthorized:false` (curl `-k`), or install the Start9 Root CA into your
  trust store. Same story as every other Spark Control endpoint.
- **Auth:** **none on the LAN** today — the endpoints sit behind StartOS access
  control + TLS on a trusted network. No bearer token / API key. (If you need
  per-client auth later, the operator can add it; it's not there now.)
- **Limits:**
  - Max upload: **200 MB** per request (`413` if exceeded). For long calls, chunk
    (see §4).
  - Request timeout: transcription ~300 s, diarization ~600 s per request.
- **Send requests sequentially, not in parallel.** Concurrent audio requests can
  trip a GPU FFT race on the backend (returns `503` + `Retry-After`). One in
  flight at a time is the safe pattern.
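
The `503` + `Retry-After` behaviour can be handled with a small retry wrapper. The following is a minimal sketch, not part of Spark Control: `send_with_retry`, the `(status, headers, body)` tuple shape, and the defaults are all hypothetical, and `Retry-After` is assumed to carry a number of seconds. The caller supplies `send`, a function that performs exactly one request, so the wrapper itself stays sequential.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape for this sketch: (status_code, headers, body).
Response = Tuple[int, Mapping[str, str], bytes]


def send_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    default_wait_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it returns a non-503 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts:
            return status, headers, body
        # Assumption: Retry-After is a number of seconds. Fall back to a
        # fixed wait if it is missing or not numeric (e.g. an HTTP date).
        # Note: a plain dict lookup is case-sensitive.
        try:
            wait_s = float(headers.get("Retry-After", ""))
        except ValueError:
            wait_s = default_wait_s
        sleep(max(wait_s, 0.0))
    raise RuntimeError("unreachable")
```

Because each call blocks until the previous one has finished (including its waits), driving all uploads through one loop like this also keeps a single request in flight.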

---

## 2. Audio format

- The backend (NVIDIA Parakeet / NeMo) **decodes common formats** (WAV, FLAC, MP3,
  m4a) and internally resamples to **16 kHz mono**. So **16 kHz mono WAV is ideal**;
  anything decodable also works (stereo is downmixed).
- **Single mixed-mono file** is what the endpoints expect. Diarization (Sortformer)
  separates speakers *from one mixed stream*, so for diarization you want everyone
  in one file — **mix your system-audio track + mic track to one mono WAV** before
  sending.
- **Your two-track capture is an asset:** your mic track is, by definition, *you* —
  a known identity. Two clean options:
  1. **Mix both tracks → diarize the mix** (simplest; pairs perfectly with your
     visual-timeline name-merge — see §4 note).
  2. **Diarize only the system track** (the other participants) and label your own
     mic track as the user directly (no diarization needed for your own voice).
- **Upload mechanism:** `multipart/form-data`, file field name **`file`**
  (OpenAI-compatible). Not base64, not a path/URL — send the bytes.
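
Mixing the two captured tracks into one mono WAV can be done with the Python standard library alone. This is a rough sketch under stated assumptions: both inputs are 16-bit mono WAV files at the same sample rate, and the function name `mix_to_mono` is made up for illustration. It averages the tracks sample by sample and pads the shorter one with silence.

```python
import array
import sys
import wave


def mix_to_mono(mic_path: str, system_path: str, out_path: str) -> None:
    """Average two 16-bit mono WAV tracks sample by sample into one mono WAV."""
    with wave.open(mic_path, "rb") as mic, wave.open(system_path, "rb") as system:
        for track in (mic, system):
            if track.getnchannels() != 1 or track.getsampwidth() != 2:
                raise ValueError("expected 16-bit mono WAV input")
        if mic.getframerate() != system.getframerate():
            raise ValueError("tracks must share a sample rate")
        rate = mic.getframerate()
        a = array.array("h", mic.readframes(mic.getnframes()))
        b = array.array("h", system.readframes(system.getnframes()))

    # WAV samples are little-endian; swap on big-endian hosts.
    if sys.byteorder == "big":
        a.byteswap()
        b.byteswap()

    # Pad the shorter track with silence so both have the same length.
    n = max(len(a), len(b))
    a.extend([0] * (n - len(a)))
    b.extend([0] * (n - len(b)))

    # Averaging (rather than summing) keeps every sample inside int16 range.
    mixed = array.array("h", ((x + y) // 2 for x, y in zip(a, b)))
    if sys.byteorder == "big":
        mixed.byteswap()

    with wave.open(out_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(mixed.tobytes())
```

Averaging halves the level of each track; whether that matters for recognition quality is not something this document states, so treat the gain choice as open.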

---

## 3. Transcription endpoint

**`POST /v1/audio/transcriptions`** — OpenAI-compatible. **Synchronous** (returns the
result; no job/polling).

Multipart fields:

| field | required | notes |
|---|---|---|
| `file` | **yes** | the audio bytes |
| `model` | no | default `parakeet-tdt-0.6b-v3` (one STT model server-side; you don't need to pick) |
| `response_format` | no | `json` (default, just text) · `verbose_json` (timestamps) · `text` |
| `language` | no | default auto/en |
| `temperature`, `prompt` | no | passthrough |

```bash
curl -k -X POST https://<host>/v1/audio/transcriptions \
  -F "file=@call.wav" -F "response_format=verbose_json"
```

**Real `verbose_json` response** — includes **word-level AND segment-level
timestamps** (seconds):

```json
{
  "task": "transcribe",
  "language": "en",
  "duration": 9.259,
  "text": "Good morning everyone. I think the energy thesis is strong this quarter. I agree, but I am worried about the lockup terms and the fee load this time.",
  "segments": [
    { "start": 0.0, "end": 1.28, "text": "Good morning everyone." },
    { "start": 1.44, "end": 4.48, "text": "I think the energy thesis is strong this quarter." }
  ],
  "words": [
    { "start": 0.0, "end": 0.32, "text": "Good" },
    { "start": 0.32, "end": 0.72, "text": "morning" },
    { "start": 0.8, "end": 1.28, "text": "everyone." }
  ]
}
```

(`json` → `{"text": "..."}`; `text` → plain text body.)
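
Since `verbose_json` returns `segments` and `words` as two separate flat lists, a client that wants the words of each segment has to join them by timestamp. One possible way to do that, shown as a sketch only (the function name `words_by_segment` and the midpoint rule are illustrative choices, not something the API prescribes):

```python
from typing import Any, Dict, List


def words_by_segment(resp: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Attach each word to the segment whose time span contains its midpoint."""
    grouped = []
    for seg in resp.get("segments", []):
        words = [
            w["text"]
            for w in resp.get("words", [])
            if seg["start"] <= (w["start"] + w["end"]) / 2 <= seg["end"]
        ]
        grouped.append({"start": seg["start"], "end": seg["end"], "words": words})
    return grouped
```

Using the word midpoint rather than its start or end avoids double-assigning a word that sits exactly on a segment boundary in most cases.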

---

## 4. Diarization

Two endpoints, both **synchronous**, both returning **anonymous clusters**
(`Speaker_0`, `Speaker_1`, …) with timestamps. NVIDIA **Sortformer** owns
segmentation; **TitaNet** produces a voiceprint per speaker.

### `POST /api/audio/diarize-chunk` — segmentation + voice fingerprints

Multipart: `file` (required). Designed to be called **per chunk** for long calls;
returns a 192-dim fingerprint per local speaker so you can re-cluster the same
person across chunks.

```bash
curl -k -X POST https://<host>/api/audio/diarize-chunk -F "file=@call.wav"
```

**Real response:**

```json
{
  "duration": 9.259,
  "segments": [
    { "start_s": 0.0, "end_s": 1.52, "speaker": "Speaker_0", "confidence": 0.931 },
    { "start_s": 1.6, "end_s": 4.56, "speaker": "Speaker_0", "confidence": 0.9662 },
    { "start_s": 4.88, "end_s": 9.04, "speaker": "Speaker_1", "confidence": 0.9681 }
  ],
  "speakers_detected": ["Speaker_0", "Speaker_1"],
  "fingerprints": {
    "Speaker_0": [0.0028, 0.0173, -0.0114, "...192 floats"],
    "Speaker_1": [0.0020, 0.0056, -0.0045, "...192 floats"]
  },
  "models": {
    "diarization": "nvidia/diar_sortformer_4spk-v1",
    "embedding": "nvidia/speakerverification_en_titanet_large"
  }
}
```

- `confidence` ∈ [0,1] = mean probability the assigned speaker was active over the
  segment (threshold it to render uncertain segments as "Speaker_0?").
- `speaker` labels are **local to this chunk** — use the fingerprints + cosine
  similarity (NeMo default distance threshold ~0.7) to merge `chunkA.Speaker_0` with
  `chunkB.Speaker_2` when they're the same voice.
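
The cross-chunk merge described above can be sketched as a greedy match against the first fingerprint seen for each voice. This is an illustrative approach rather than the backend's method: `assign_global_ids` and the `Global_N` labels are invented here, and the `0.7` default is treated as a similarity cut-off that should be tuned on real calls.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_global_ids(
    chunks: List[Dict[str, Sequence[float]]], threshold: float = 0.7
) -> List[Dict[str, str]]:
    """Map each chunk-local speaker label to a call-wide ID.

    `chunks` holds one `fingerprints` dict per chunk, in call order.
    """
    references: Dict[str, Sequence[float]] = {}  # global ID -> first-seen fingerprint
    mappings = []
    for fingerprints in chunks:
        mapping = {}
        for local_label, vec in fingerprints.items():
            best_id, best_sim = None, threshold
            for global_id, ref in references.items():
                sim = cosine_similarity(vec, ref)
                if sim >= best_sim:
                    best_id, best_sim = global_id, sim
            if best_id is None:
                # No known voice is close enough: register a new one.
                best_id = f"Global_{len(references)}"
                references[best_id] = vec
            mapping[local_label] = best_id
        mappings.append(mapping)
    return mappings
```

A greedy match like this can map two local speakers of one chunk to the same global ID if their fingerprints are close; a stricter variant would enforce one-to-one assignment within each chunk.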

### `POST /api/audio/transcribe-with-speakers` — ASR + diarization merged

Multipart: `file` (required). Runs transcription + diarization and **merges by
timestamp** into speaker-attributed text blocks (`start_ms`/`end_ms`).

```json
{
  "duration": 9.259, "language": "en",
  "speakers_detected": ["Speaker_0", "Speaker_1"],
  "segments": [
    { "start_ms": 0, "end_ms": 4480, "speaker": "Speaker_0",
      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Speaker_1",
      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "models": { "transcription": "parakeet", "diarization": "nvidia/diar_sortformer_4spk-v1" }
}
```
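
For clients that call the transcription and `diarize-chunk` endpoints separately, a timestamp merge can be approximated client-side. The sketch below is one plausible way to do it and is not the server's actual algorithm: it labels each ASR segment with the diarization speaker that overlaps it longest, and it does not fold adjacent same-speaker segments into one block the way the response above does. The function names are hypothetical.

```python
from typing import Dict, List, Optional


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time spans."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attribute_speakers(asr_segments: List[dict], diar_segments: List[dict]) -> List[dict]:
    """Label each ASR segment with the speaker that overlaps it the longest."""
    merged = []
    for seg in asr_segments:
        totals: Dict[str, float] = {}
        for d in diar_segments:
            shared = overlap(seg["start"], seg["end"], d["start_s"], d["end_s"])
            if shared > 0:
                totals[d["speaker"]] = totals.get(d["speaker"], 0.0) + shared
        speaker: Optional[str] = max(totals, key=totals.get) if totals else None
        merged.append({
            "start_ms": round(seg["start"] * 1000),
            "end_ms": round(seg["end"] * 1000),
            "speaker": speaker,  # None when no diarization segment overlaps
            "text": seg["text"],
        })
    return merged
```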

### Your key question — can it accept a *prior* (named timeline) to label clusters?

**Yes — that's exactly what `/api/audio/label-merge` does (§4.5 below).** You POST
the audio plus your visual `(start, end, name)` timeline; the backend diarizes,
runs the majority-temporal-overlap vote, and returns **named** segments — no
client-side merge needed. (The two endpoints above still return anonymous clusters
if you'd rather do the merge yourself.) Note the diarizer is fixed at **≤4 speakers
per chunk** (NVIDIA Sortformer, not pyannote) and takes **no `num_speakers` hint** —
for >4-person calls, chunk the audio; your visual timeline actually helps
disambiguate across chunks.

**The fingerprints compound.** Every named cluster comes back with its 192-dim
voiceprint. Persist those (keyed by name) and pass them back as `known_voiceprints`
on the next call — `label-merge` will recover a speaker by **voice** even when the
visual cue is missing (camera off, a bad OCR frame). Your visual capture *enrolls a
voice library for free*.
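
Keeping that voice library between calls needs very little client code. The following sketch assumes the response carries a `fingerprints` map keyed by resolved name and that placeholder names start with `Unknown_`; the helper names and the "latest fingerprint wins" policy are illustrative choices, not behaviour defined by the API.

```python
import json
from typing import Dict, List


def update_library(library: Dict[str, List[float]], response: dict) -> Dict[str, List[float]]:
    """Return a copy of `library` with the response's named fingerprints added."""
    merged = dict(library)
    for name, vec in response.get("fingerprints", {}).items():
        # Assumption: unnamed clusters use an "Unknown_" prefix and should
        # not be enrolled under that placeholder.
        if name.startswith("Unknown_"):
            continue
        merged[name] = list(vec)  # latest fingerprint replaces any older one
    return merged


def known_voiceprints_field(library: Dict[str, List[float]]) -> str:
    """Serialise the library as the JSON string for the `known_voiceprints` field."""
    return json.dumps(library)
```

Averaging old and new fingerprints instead of overwriting is another reasonable policy; which one tracks a voice better over time would need testing.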

---

## 4.5 `POST /api/audio/label-merge` — named segments from a visual timeline

Diarize + **name the clusters** from your screen-derived timeline (majority temporal
overlap), with an optional **voiceprint fallback** for anyone the visual track
missed. **Synchronous. Stateless** — you own the timeline and the voiceprint
library; the backend just diarizes + merges and persists nothing.

`multipart/form-data` fields:

| field | required | notes |
|---|---|---|
| `file` | **yes** | mixed-mono audio |
| `timeline` | **yes** | JSON array: `[{"start":0.0,"end":4.5,"name":"Alice","confidence":0.9}, ...]` (seconds) |
| `known_voiceprints` | no | JSON object `{"Alice":[192 floats], "Bob":[...]}` — named voiceprints from past calls, used to label clusters with no visual overlap |
| `transcribe` | no | `"true"` to also return per-segment text (default false) |
| `min_overlap` | no | min fraction of a cluster's time that must overlap the winning name (default `0.0` = any overlap wins) |
| `voiceprint_threshold` | no | cosine similarity to accept a voiceprint match (default `0.5`) |

```bash
curl -k -X POST https://<host>/api/audio/label-merge \
  -F "file=@call.wav" \
  -F 'timeline=[{"start":0,"end":4.5,"name":"Alice"},{"start":4.8,"end":9.3,"name":"Bob"}]' \
  -F "transcribe=true"
```
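
From Python, the non-file form fields are all plain strings, with `timeline` and `known_voiceprints` JSON-encoded. A minimal sketch of a field builder follows; `build_label_merge_fields` is a hypothetical helper, and the validation it performs is a client-side convenience rather than a rule stated by the API. The resulting dict would be passed as the form data of whichever HTTP client you use, alongside the `file` upload.

```python
import json
from typing import Dict, List, Optional


def build_label_merge_fields(
    timeline: List[dict],
    known_voiceprints: Optional[Dict[str, List[float]]] = None,
    transcribe: bool = False,
    min_overlap: Optional[float] = None,
    voiceprint_threshold: Optional[float] = None,
) -> Dict[str, str]:
    """Build the non-file multipart fields for /api/audio/label-merge."""
    for entry in timeline:
        if not {"start", "end", "name"} <= entry.keys():
            raise ValueError(f"timeline entry needs start, end, name: {entry!r}")
        if entry["end"] <= entry["start"]:
            raise ValueError(f"timeline entry has end <= start: {entry!r}")

    fields = {"timeline": json.dumps(timeline)}
    if known_voiceprints:
        fields["known_voiceprints"] = json.dumps(known_voiceprints)
    if transcribe:
        # Only "true" is documented, so the field is omitted otherwise.
        fields["transcribe"] = "true"
    if min_overlap is not None:
        fields["min_overlap"] = str(min_overlap)
    if voiceprint_threshold is not None:
        fields["voiceprint_threshold"] = str(voiceprint_threshold)
    return fields
```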

**Real response** (the 2-speaker test clip; visual named both, with transcript):

```json
{
  "duration": 9.259,
  "speakers": [
    { "cluster": "Speaker_0", "name": "Alice", "source": "visual", "overlap_confidence": 0.9866,
      "fingerprint": [0.0028, 0.0173, "...192 floats"] },
    { "cluster": "Speaker_1", "name": "Bob", "source": "visual", "overlap_confidence": 1.0,
      "fingerprint": [0.0020, 0.0056, "...192 floats"] }
  ],
  "segments": [
    { "start_ms": 0, "end_ms": 4480, "speaker": "Alice", "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Bob", "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "fingerprints": { "Alice": ["...192 floats"], "Bob": ["...192 floats"] },
  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1", "embedding": "nvidia/speakerverification_en_titanet_large" }
}
```

**Name resolution per cluster, in order:** (1) the visual-timeline name with the
most temporal overlap (`source: "visual"`); (2) if none, the closest
`known_voiceprints` match above `voiceprint_threshold` (`source: "voiceprint"`,
with `match_similarity`); (3) otherwise `Unknown_N` (`source: "unmatched"`) — never
mislabeled. The `fingerprints` map (keyed by the resolved name) is what you persist
to grow your voiceprint library for the next call. When `transcribe=false`, segments
are `{start_s, end_s, speaker, confidence}` instead of text blocks.
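
The three-step resolution order can be illustrated with a short client-side model. This is a sketch of the logic as described in this section, not the backend's code: the `clusters` input shape (`spans` plus `fingerprint`), the function names, and the way `min_overlap` is applied are assumptions made for the example.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Span = Tuple[float, float]


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def _overlap(a: Span, b: Span) -> float:
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def resolve_names(
    clusters: Dict[str, dict],
    timeline: List[dict],
    known_voiceprints: Optional[Dict[str, Sequence[float]]] = None,
    min_overlap: float = 0.0,
    voiceprint_threshold: float = 0.5,
) -> Dict[str, dict]:
    """Resolve each cluster to a name: visual vote, then voiceprint, then Unknown_N."""
    known = known_voiceprints or {}
    resolved: Dict[str, dict] = {}
    unknown_count = 0
    for cluster, info in clusters.items():
        spans: List[Span] = info["spans"]
        total = sum(end - start for start, end in spans)

        # (1) Visual timeline: the name with the most temporal overlap.
        votes: Dict[str, float] = {}
        for entry in timeline:
            shared = sum(_overlap(s, (entry["start"], entry["end"])) for s in spans)
            if shared > 0:
                votes[entry["name"]] = votes.get(entry["name"], 0.0) + shared
        if votes and total > 0:
            winner = max(votes, key=votes.get)
            if votes[winner] / total >= min_overlap:
                resolved[cluster] = {"name": winner, "source": "visual"}
                continue

        # (2) Voiceprint fallback: closest known voice at or above the threshold.
        best_name, best_sim = None, voiceprint_threshold
        for name, ref in known.items():
            sim = _cosine(info["fingerprint"], ref)
            if sim >= best_sim:
                best_name, best_sim = name, sim
        if best_name is not None:
            resolved[cluster] = {
                "name": best_name, "source": "voiceprint", "match_similarity": best_sim,
            }
            continue

        # (3) Nothing matched: keep the cluster anonymous rather than guess.
        resolved[cluster] = {"name": f"Unknown_{unknown_count}", "source": "unmatched"}
        unknown_count += 1
    return resolved
```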

**Verified live** — visual match (both speakers named), voiceprint recovery (a
camera-off speaker matched by voice), and unmatched (→ `Unknown_0`) all confirmed.

### Dual-channel mode (recommended for Ten31 Transcripts)

If you capture two sample-aligned tracks — **`mic_file`** (the local user) + **`system_file`**
(everyone else, from screen capture) — send them *instead of* `file`. This is strictly
better than mixing to mono: the diarizer over-segments a mono mix (a stereo clip of two
clean voices comes back as **3** speakers), whereas the two channels let each model get
the easiest possible mono input.

Extra form fields for dual mode:

| field | required | notes |
|---|---|---|
| `mic_file` + `system_file` | **yes (dual)** | the two aligned mono-16k tracks |
| `self_name` | no | the local user's name (mic channel). Default `"Me"`. |
| `self_vad` | no | JSON `[{"start","end"}]` — windows where the mic is active *and louder than* system. If omitted, computed server-side per-window. |

How it works: the **mic track** → your words, gated to windows where the mic is genuinely
you speaking (the mic also picks up the remote audio as quiet bleed, so this loudness gate
is essential — without it the bleed gets transcribed as you). The **system track** →
diarized (it only has to separate the *remote* people) and named via the timeline +
voiceprints. Your clean voiceprint is **enrolled from the mic track** and injected into the
library, so a system cluster that's you dialed in from a second device (dual-login) resolves
to you, not a stranger. You also free a Sortformer speaker slot (you no longer consume one).
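
If you prefer to supply `self_vad` yourself, a loudness gate of the kind described above can be approximated with per-window RMS. The sketch below is a guess at a workable client-side version, not the server's implementation: the window length, the `min_rms` floor, and the function names are all assumptions, and samples are taken to be floats in [-1, 1] from two sample-aligned tracks.

```python
import math
from typing import Dict, List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def compute_self_vad(
    mic: Sequence[float],
    system: Sequence[float],
    sample_rate: int = 16000,
    window_s: float = 0.5,
    min_rms: float = 0.01,
) -> List[Dict[str, float]]:
    """Return windows where the mic is active and louder than the system track."""
    win = max(1, int(sample_rate * window_s))
    n = min(len(mic), len(system))
    windows: List[Dict[str, float]] = []
    for i in range(0, n, win):
        mic_level = rms(mic[i:i + win])
        system_level = rms(system[i:i + win])
        if mic_level < min_rms or mic_level <= system_level:
            continue  # silence, or remote audio bleeding into the mic
        start = i / sample_rate
        end = min(i + win, n) / sample_rate
        if windows and windows[-1]["end"] == start:
            windows[-1]["end"] = end  # extend the previous adjacent window
        else:
            windows.append({"start": start, "end": end})
    return windows
```

The returned list already has the `[{"start", "end"}]` shape of the `self_vad` field and can be JSON-encoded as is.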

```bash
curl -k -X POST https://<host>/api/audio/label-merge \
  -F "mic_file=@mic.wav" -F "system_file=@system.wav" \
  -F "self_name=Alice" -F 'timeline=[...]' -F "transcribe=true" \
  -F 'known_voiceprints={"Alice":[...],"Bob":[...]}'   # include your own
```

Response is the same shape with `"mode":"dual_channel"`; `speakers` includes a
`{"name":self_name,"source":"mic_channel"}` entry, and `fingerprints[self_name]` is your
clean mic-enrolled voiceprint to store.

**Validated on a real misattributing call:** dual-channel fixed both mono-mix
misattributions (a remote "Go Bitcoin" no longer credited to the user; a local "There"
recovered from `Unknown`), and **correctly split overlapping speech** — two people saying
"Hello" at once that the coarse ground truth itself had conflated.

> **One known limit:** if *loud* remote bleed masks a *quiet* local word, the mic-track ASR
> can miss it entirely (we can't attribute a word that was never transcribed). A cleaner mic
> (headphones, so there's no speaker bleed) avoids it; channel-subtraction echo-cancellation
> is a possible future enhancement since the tracks are sample-aligned.

---

## 5. Anything else

- **No OpenAPI/Swagger yet.** This doc + the curl examples are the contract.
- **Health / discovery:**
  - `GET /api/status` — per-service health (`parakeet`, etc.).
  - `GET /api/endpoints` — service-discovery JSON (base URLs + ready flags).
  - `GET /v1/models` — lists the STT model + diarizer.
- **Errors:** JSON body, conventional status codes — `400` malformed, `413` too
  large, `503` + `Retry-After` if the backend briefly wedges (retry after the
  interval; transcription auto-recovers). Most error bodies are `{"detail": "..."}`.
- **Long calls:** chunk into ~2–3 min pieces, send **sequentially**, diarize each
  with `/api/audio/diarize-chunk`, and stitch speakers across chunks via the
  fingerprints. (The operator's other apps use exactly this pattern.)
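
The bookkeeping for the long-call pattern is mostly arithmetic: pick chunk boundaries, then shift each chunk's timestamps back onto the call's clock, since every chunk is uploaded as its own file and its timestamps start at zero. A minimal sketch, with hypothetical helper names and a 150 s default chosen from the ~2–3 min guidance:

```python
from typing import List, Tuple


def plan_chunks(duration_s: float, chunk_s: float = 150.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into consecutive chunks of at most chunk_s seconds."""
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def shift_segments(segments: List[dict], offset_s: float) -> List[dict]:
    """Convert chunk-local `start_s`/`end_s` into call-level timestamps."""
    return [
        {**seg, "start_s": seg["start_s"] + offset_s, "end_s": seg["end_s"] + offset_s}
        for seg in segments
    ]
```

For a 400 s call, `plan_chunks(400)` yields three chunks starting at 0, 150 and 300 s; the segments of the second chunk would then be passed through `shift_segments(..., 150.0)`. Hard cuts can split a word or a speaker turn at the boundary, so cutting at a pause near each boundary may work better in practice.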

---

*Backend: NVIDIA Parakeet TDT 0.6B (STT) + Sortformer 4spk-v1 (diarization) +
TitaNet (voice fingerprints) on DGX Spark, fronted by Spark Control. All on the
operator's LAN — nothing leaves the box.*