26 Commits

Author SHA1 Message Date
Grant 6434b01a95 v0.7.0 - Pre-flight launch validation (Test button on every model card)
validate.py:
- Builds the same args list a real swap would pass to 'vllm serve'
- SSHes into Spark 1 and runs vLLM's own argparse layer inside the running vllm_node container, WITHOUT initializing the engine
- Uses FlexibleArgumentParser (from vllm.utils.argparse_utils, with fallback to engine.arg_utils) + make_arg_parser — the exact same parser the 'vllm serve' CLI uses. Earlier attempt with bare argparse.ArgumentParser was too strict (rejected '--moe_backend' with underscore that the real CLI accepts via FlexibleArgumentParser's normalization)
- Returns structured {ok, stage, error, cmd_args, launch_cmd} so the UI can surface the exact failure cause

Endpoint: POST /api/swap/{key}/validate. Cheap (~5s), no engine init, no disruption to the currently-loaded model.

Frontend: 'Test' button on every model card, inline result below the action row (green check or red detailed error). Result stays visible until the user reloads or clicks Test again.

Catches: typos in flag names, deprecated/removed flags after a vLLM upgrade, type mismatches. Does NOT catch runtime-only failures (Mamba block-size assertion, OOM at load, kernel-compat). Ok=true is necessary-but-not-sufficient; ok=false is definitive 'don't bother running it'.
2026-05-12 13:37:37 -05:00
Grant 5827683a09 v0.6.0:1 - fix Qwen3.6 Mamba block-size assertion at launch
vLLM trips on launching Qwen3.6-35B-A3B-NVFP4 with:
  AssertionError: In Mamba cache align mode, block_size (2096) must be
  <= max_num_batched_tokens (2048).

Qwen3.6 uses a Mamba-attention hybrid. The default --max-num-batched-tokens of 2048 is just under the model's required block_size of 2096. The upstream sibling recipe (qwen3.5-35b-a3b-fp8.yaml) sets it to 16384; use the same value.

Earlier qwen36 swaps in this session worked because vLLM hadn't reached the Mamba-validation code path on that prior path (different attention backend pick or auto-retry). Whatever the reason, the explicit flag avoids the dance.

Also documented in known-issues.md.
2026-05-12 13:22:24 -05:00
Grant ee8c2406b8 v0.6.0 - Service-level connectivity tracking + passive failure-report endpoint
connectivity.py:
- Generalized 'spark' subject to any string; renamed 'spark' field to 'subject'
- Legacy v0.5 events with the old 'spark' field are migrated transparently on read (kind defaults to 'transition')
- New record_report(subject, ok, source, detail, latency_ms): always appends an event with kind='report'; does NOT mutate the current state (only active polling is authoritative)
- summary() returns events normalized to the new schema

Wiring:
- /api/status now calls record_state for vllm/parakeet/magpie (dedup on no-change)
- /api/services calls record_state for each service after its http check
- Result: dashboard observes service-level transitions automatically with no extra polling

Passive endpoint:
- POST /api/health-event with {service, ok, source?, error?, ms?}
- Useful for external apps (e.g. Open WebUI) to surface sub-poll-interval failures the dashboard would otherwise miss

UI:
- Connectivity dialog groups events by subject (hosts ordered first, then services)
- Per-subject summary shows transition count, down count, report count, failed-report count
- Transitions and reports render inline with distinct styling; reports show source app + error + latency
- Legacy v0.5 events render unchanged

Docs:
- README documents /api/health-event with a curl example

Package: bump to 0.6.0:0
2026-05-12 13:19:27 -05:00
Grant a02f4db850 v0.5.0 - Wake-on-LAN + connectivity history
wol.py:
- build_magic_packet(): standard 6x0xFF + 16x MAC layout
- send_local_broadcast(): direct from container (ports 9 + 7 for safety)
- send_via_peer(): preferred path; SSHes to the OTHER Spark and runs a Python one-liner there so the packet originates on the target's LAN segment (most reliable)
- MAC validation + normalization

connectivity.py:
- /data/connectivity.json persistence (thread-safe, atomic rename)
- Stores per-Spark current state + last_change timestamp + rolling 200-event log
- Records up/down transitions; computes down_seconds / up_seconds durations
- MAC cache populated lazily during hardware probes

hardware.py:
- Probe now reads MAC via /sys/class/net/<default-route-iface>/address
- After each probe, record_state() emits a transition event if state changed
- record_mac() caches the address so WoL works when the Spark next goes down

Endpoints:
- GET /api/connectivity: macs, current state, last_change, events[]
- POST /api/spark/{name}/wake: tries via-peer first, falls back to direct broadcast

UI:
- Unreachable hardware card shows the cached MAC + 'Wake (WoL)' button (only if MAC known)
- New 'Connectivity log' button opens a modal with per-Spark transition history (last 25 each), including duration of each prior up/down period
- pollHardware also pulls /api/connectivity so WoL buttons appear without an extra fetch

Package: bump 0.5.0:0; main.ts sets CONNECTIVITY_LOG=/data/connectivity.json
2026-05-12 12:51:49 -05:00
Grant 1889ab45fb v0.4.0 - NIM installer + dashboard resilience
Hotfix (was v0.3.1):
- services.py: cache 'unreachable' per (host,user) for 25s so a dead Spark doesn't hang every /api/services call behind 6s ssh timeout
- ssh_run timeout reduced 10 -> 6s for docker_state probes
- hardware probe: shorter SSH timeout (6s), longer cache TTL for failures (25s)
- JS pollStatus retries loadModels() if state.models is empty (recovers from cold-start proxy timeout)
- Unreachable hardware card now includes troubleshooting steps (Spark Control cannot SSH into an unreachable Spark to restart it)

v0.4 NIM installer:
- nim.py module: curated SUGGESTED_NIMS list (Parakeet, Magpie, Riva) + NimManager that runs docker login nvcr.io + docker pull + docker run -d --gpus all -p PORT:PORT -v VOL:/opt/nim/.cache -e NGC_API_KEY -e ... --restart=unless-stopped + chown the volume to uid 1000 + restart. Streams all output via SSE; redacts the API key from log lines.
- custom_services.py: persists installed NIMs to /data/services-overrides.yaml so they appear in the services panel after install
- services.py: merges custom services into the panel
- /api/nim/catalog GET, /api/nim/install POST + GET/SSE
- /api/services/{name} DELETE for custom services
- UI: '+ Install NIM' button next to 'Always-on services'; modal lists curated images each with a 'Pick' button + a custom-image form; installation runs in a second dialog with phase + elapsed timer + collapsible log
- NGC API key field added to Configure Sparks (masked); injected as NGC_API_KEY env var into the container

Package: bump 0.4.0:0; main.ts adds SERVICES_OVERRIDES + NGC_API_KEY env vars
2026-05-12 12:32:29 -05:00
Grant e88fdcfde4 v0.3.0:1 - hotfix: parallel SSH probes + longer timeout
- Hardware probes for spark1 and spark2 now run via asyncio.gather (parallel) so the worst-case wall time is max(per-probe), not sum
- Bump per-probe SSH timeout from 8s to 12s to absorb first-call overhead (StrictHostKeyChecking=accept-new on first connect + nvidia-smi cold start)
- Unreachable Spark now shows up cleanly in the UI as a single 'unreachable' card with the error message
2026-05-12 12:14:36 -05:00
Grant 64ce0fca10 v0.3.0 - Hardware dashboard + knob context + Explain context + Open WebUI link
Hardware dashboard:
- New hardware.py module: SSH probes each Spark for hostname, uptime, load+cores, RAM, disk, GPU (name, util, temp, power) + per-process GPU memory sum
- DGX Spark uses unified memory (nvidia-smi memory.total returns N/A); fall back to per-process compute memory and compute fraction against system RAM. Marks with gpu_unified_memory=true.
- 4s TTL cache in HardwareProbe to avoid hammering
- /api/hardware returns per-Spark snapshot
- UI: 'Spark hardware' section at the top with per-Spark cards (CPU load, RAM, GPU mem (unified), GPU util + temp + power, disk) — bars with warn threshold styling
- Polls every 8s

Knob context (tied to live hardware):
- Each Advanced knob now shows plain-English help text
- 'GPU memory %' shows '~N GB allocated · ~M GB left for OS/buffers' computed from actual Spark RAM
- 'Max context' shows '~N pages of text'
- Toggles show tradeoff descriptions

Explain context:
- ' Explain context' button on the update banner
- /api/explain-updates POST: forwards pending commits to the loaded vLLM model and streams its response back as SSE
- Renders into an expandable 'Explained by the loaded LLM' section under Pending commits
- Reasoning tokens shown italicized when the model emits them

Open WebUI integration:
- New 'Open WebUI URL' optional field in Configure Sparks
- /api/config exposes it; UI shows 'Open chat ↗' button in the top bar if set

Downloads:
- Third radio option: Spark 1 only / Spark 2 only / Both Sparks
- Backend picks SSH target based on mode
- HF repo link icon next to the input
- Helper line about NVFP4 for Blackwell

Model cards:
- Repo name is now a clickable link to its Hugging Face page

Package: bump 0.3.0:0
2026-05-12 12:00:15 -05:00
Grant c6da6b0784 v0.2.4 - Hotfix: Unknown status + copy UX + update banner context
Bug fix:
- config.py: empty PARAKEET_CONTAINER / MAGPIE_CONTAINER env vars (from migrating to v0.2.0+ where the field is optional and saved as '') now fall back to 'parakeet-asr' / 'magpie-tts' via the 'or' idiom. Confirmed live: services classify as 'running' instead of 'unknown'.

UX:
- Replaced text 'Copy' buttons with compact icon buttons (clipboard SVG)
- Endpoint Base URL + Model ID + curl snippet are now click-to-copy themselves (the value AND a separate icon button)
- Service cards: host, base URL, and model are now three separate copyable rows
- Update banner: leading explanatory line — 'Updates to eugr/spark-vllm-docker — the upstream project that orchestrates vLLM on your Sparks. These are not firmware, OS, or model updates.' with a link to the repo.
2026-05-12 11:45:55 -05:00
Grant 75c0ecfd08 docs: update README with v0.2 feature summary 2026-05-12 11:31:14 -05:00
Grant 75fd0846b4 v0.2.3 - Per-model Advanced settings + catalog-add for downloaded models
Backend:
- overrides.py: read/write /data/models-overrides.yaml (knobs + custom entries)
- apply_knobs_to_args(): strip matching flags from bundled vllm_args and append knob values, so knob changes properly override bundled defaults
- extract_knobs_from_args(): seed UI knob values from bundled args so the Advanced dialog has correct starting state
- models.py: load_catalog merges overrides on top of bundled yaml
- GET /api/models returns effective_knobs per model
- PUT /api/models/{key}/knobs persists knob changes
- POST /api/models adds a custom catalog entry
- DELETE /api/models/{key} removes a custom entry (bundled models cannot be deleted)
- swap_manager.reload_catalog() called after each mutation so swaps see latest

Frontend:
- New 'Advanced' button on every card opens a modal dialog: max-model-len input, gpu-memory-utilization slider, three optimization checkboxes (fastsafetensors, prefix caching, FP8 KV cache). Save persists; Cancel discards. Custom models also have a Delete button.
- After a successful download, automatically open the 'Add to catalog' dialog pre-filled with the repo, with the same knob defaults — user just enters key, display name, and clicks Save.
- Custom catalog entries are tagged with a blue 'custom' pill on the card.

Package: bump 0.2.3:0; main.ts sets MODELS_OVERRIDES=/data/models-overrides.yaml so overrides persist on the StartOS volume.
2026-05-12 11:30:47 -05:00
Grant 474417b458 v0.2.2 - spark-vllm-docker update checks + Apply Update
Backend:
- updates.py: get_update_status() runs git fetch + git rev-list --left-right --count HEAD...origin/main to learn ahead/behind/dirty, plus git log for pending commits
- UpdateManager class with asyncio.Lock; one update at a time
- POST /api/updates/apply triggers "git pull --ff-only && ./build-and-copy.sh -c" over SSH with streamed log + phase detection (Pulling / Building the vLLM container / Copying to peer Sparks)
- GET /api/updates returns {ok, behind, ahead, dirty, current, log[], branch}

Frontend:
- Persistent banner near footer: hidden when up-to-date, blue when N commits behind, warn (orange) when local dirty changes block update
- 'Show details' expands a list of pending commits
- 'Apply update' triggers the long-running build with phase + elapsed timer + collapsible logs
- Confirmation dialog explains the 5–40 min duration

Package: bump 0.2.2:0
2026-05-12 11:26:55 -05:00
Grant 9dde938348 v0.2.1 - Model download with %% progress
Backend:
- download.py module: drives ./hf-download.sh <repo> [-c --copy-parallel] over SSH, parses tqdm output (regex matches '8%|...| 2.06G/25.1G [03:20<18:35, 20.6MB/s]') into percent + bytes done/total + elapsed + ETA + rate
- DownloadManager: in-memory job tracking with asyncio.Lock (one download at a time)
- POST /api/download, GET /api/download/{id}, SSE /api/download/{id}/stream
- Phase detection: Connecting / Fetching N files / Downloading / Copying to peer Sparks / Done

Frontend:
- '+ Download a new model' button next to LLM swap section title
- Inline form: HF repo text field + solo/cluster radio + Cancel/Start
- Progress UI: spinner, elapsed timer, phase label, percent fill, stats line (bytes/rate/ETA), collapsible raw logs

Package: bump 0.2.1:0
2026-05-12 11:24:31 -05:00
Grant 27699a2469 v0.2.0 - Always-on services panel with per-service host config
Dashboard:
- New 'Always-on services' section with cards for Parakeet and Magpie
- Each card: host:port, model loaded, status pill (Healthy/Unhealthy/Starting/Not configured)
- Start, Restart, Stop buttons. Buttons disabled when not applicable for current state
- Restart counter shown when > 1 (would have surfaced the old magpie crash loop)

Backend:
- New /api/services GET: docker container state + http health for each support service
- New POST /api/services/{name}/{action} for start | stop | restart
- services.py module: docker_state, run_action via SSH
- config.py: PARAKEET_HOST/USER/CONTAINER and MAGPIE_* env vars, default to spark2_*
- health.py: use per-service hosts (no longer hard-wired to spark2_host)

Package:
- sparkConfig.yaml.ts: add 6 new optional fields
- configureSparks action: optional 'Parakeet host', 'Parakeet container', 'Magpie host', 'Magpie container' fields; descriptions explain they default to Spark 2 when blank
- Handler normalizes nulls to empty strings before merge
- main.ts: pass new env vars to container
- bump to 0.2.0:0
2026-05-12 11:21:15 -05:00
Grant ed54f85442 known-issues: mark magpie crash loop RESOLVED with chown fix recipe
Volume magpie-model-cache was owned by root, container drops to uid 1000. Fix:
docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache
+ docker restart magpie-tts. After ~3 GB NGC model download, healthy on :9000.
2026-05-12 11:12:25 -05:00
Grant 4cda453c8a 0.1.0:4 - expose /api/endpoints as separate StartOS service interface
Adds a second sdk.createInterface with type='api' and path='/api/endpoints' on the
same uiPort (9999). StartOS dashboard now shows two service interfaces: Web UI and
OpenAI-compatible API. The API URL is discoverable to other services without users
needing to remember the /api/endpoints suffix.
2026-05-12 11:07:51 -05:00
Grant 2ba3da55b1 0.1.0:3 - Show Public Key layout + /api/endpoints service-discovery
- showPublicKey now uses result.group: install command and raw key are each their own one-click copy box; description is brief
- /api/endpoints returns stable shape { vllm, parakeet, magpie } with base_url + model + ready, for other LAN services to consume without hardcoding Spark IPs
- health.py: parakeet/magpie now also expose base_url
- README: documented /api/endpoints shape
2026-05-12 10:52:57 -05:00
Grant 51804b2e5e 0.1.0:2 - remove '<spark-user>' default everywhere (it's Alice's username, not factory)
Per user correction: '<spark-user>' is not the DGX Spark factory default. Generic-ize:
- configureSparks: no default user; placeholder 'your SSH username'
- sparkConfig schema: empty string defaults
- main.ts env fallback: empty
- showPublicKey: drop the '<spark-user>' fallback; skip Spark if user not configured
- Update feedback memory with the correction
2026-05-12 10:39:57 -05:00
Grant 0ddab99468 Bump to 0.1.0:1 — portability + endpoint display
- configureSparks.ts: generic placeholders (e.g. 192.168.1.10), no Alice-specific IPs; descriptions explain the role of each node instead of naming his hardware
- showPublicKey.ts: reads sparkConfig.yaml; emits a ready-to-paste one-liner (KEY='...' followed by 'ssh user@host "echo $KEY >> authorized_keys"' for each configured Spark). Falls back to generic instructions if Configure Sparks hasn't been run yet.
- /api/status now includes vllm.base_url for the OpenAI endpoint
- New endpoint panel in UI: base URL + model ID rows with copy buttons + collapsible curl example
- Bump version to 0.1.0:1
2026-05-12 10:38:18 -05:00
Grant 87334f85f0 Add per-model descriptions + repo-cleanup polish
- models.yaml: add 'description' field for all 5 models (generic, anyone-can-use)
- ModelDef gains optional description: str | None field
- UI: render description below meta tags; mute the repo line further
- escapeHtml() for safety in case descriptions/names contain HTML chars
- Update runbook: how to add a new model with description
2026-05-12 10:19:09 -05:00
Grant c0aebfc98b Add friendly swap UI: timer + phase indicator + progress bar + collapsible logs
- Elapsed timer (mm:ss) in top-right of swap panel
- Phase display: Stopping / Starting / Loading weights (N/M shards) / Compiling / Warming up / Ready
- Progress bar with smooth fill mapped from phase
- Raw vLLM logs hidden behind <details> 'Show technical logs'
- Detection from log content (safetensors %, torch.compile, Application startup, Ray cluster join)
- Backfill from /api/swap/{id} on reattach (mid-swap reload works)
2026-05-12 10:11:14 -05:00
Grant 34bdbb7aba Add Spark prerequisites section to runbook (spark-vllm-docker is upstream + Spark-side) 2026-05-12 10:05:17 -05:00
Grant 53a0b01d88 Update README with build flow + post-install steps; note IPv6/mDNS quirk 2026-05-12 10:03:37 -05:00
Grant 72bf754baa Pack spark-control_x86_64.s9pk (55 MB)
- Move models.yaml into image/ so the docker build context is self-contained
- Fix manifest: dockerfile=../image/Dockerfile, workdir=../image
- Add LICENSE (MIT) and assets/README.md (StartOS marketplace listing)
- s9pk validates: id=spark-control, version=0.1.0:0, osVersion=0.4.0-beta.6, sdkVersion=1.3.3
- Image embeds python:3.12-slim + openssh-client + FastAPI app + models.yaml
2026-05-12 09:52:53 -05:00
Grant 342e150266 Add safe optimization flags to gemma4 + qwen36 (fastsafetensors, prefix-caching, fp8 kv)
Aligned with sibling recipes in eugr/spark-vllm-docker. Applies on next swap to each model.
First real swap gemma4 -> qwen36 succeeded in 5:30 with --moe_backend=flashinfer_cutlass.
2026-05-12 09:49:08 -05:00
Grant dd9d53060b Add StartOS 0.4 package scaffold (manifest, main, interfaces, 2 actions)
- package/Makefile + s9pk.mk + package.json + tsconfig.json
- startos/manifest: dockerBuild source pointing at ../image/Dockerfile
- startos/main: reads /data/config.yaml reactively, passes env vars to container
- startos/interfaces: binds port 9999 as HTTP UI
- startos/actions: showPublicKey (read /data/ssh/id_ed25519.pub), configureSparks
- TS + JS bundle compile clean (tsc --noEmit, ncc build)
2026-05-12 09:36:15 -05:00
Grant ae8efa1754 Initial scaffold: image/ FastAPI app, models.yaml, docs
- image/ FastAPI app: /api/status, /api/swap, /api/swap/{id}/stream, /api/test-connection
- models.yaml: 5-model catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen25-72b)
- README, runbook, known-issues
- Dry-run swap verified against live Spark 1 (gemma4 currently loaded)
2026-05-12 09:29:13 -05:00
32 changed files with 56 additions and 3604 deletions
+1 -1
View File
@@ -1,6 +1,6 @@
MIT License
Copyright (c) 2026 Grant
Copyright (c) 2026 Alice
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
+10 -10
View File
@@ -31,17 +31,17 @@ Two layers in this repo:
cd image
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
export SPARK1_HOST=192.168.1.103
export SPARK1_USER=modelo
export SPARK2_HOST=192.168.1.87
export SPARK2_USER=modelo
export SPARK1_HOST=<spark-1-ip>
export SPARK1_USER=<spark-user>
export SPARK2_HOST=<spark-2-ip>
export SPARK2_USER=<spark-user>
export SSH_KEY_PATH="$HOME/Library/Application Support/NVIDIA/Sync/config/nvsync.key"
uvicorn app.server:app --host 0.0.0.0 --port 9999 --reload
```
Open <http://localhost:9999>.
> **Note:** use the **IP** `192.168.1.103` for Spark 1, not `spark-27ea.local`. mDNS resolves to IPv6 first and `httpx` hangs on it because vLLM only binds IPv4.
> **Note:** use the **IP** `<spark-1-ip>` for Spark 1, not `<spark-1-host>.local`. mDNS resolves to IPv6 first and `httpx` hangs on it because vLLM only binds IPv4.
## Build the StartOS package
@@ -58,8 +58,8 @@ To sideload onto your Start9: `make install` (needs `host:` set in `~/.startos/c
## Post-install setup (one-time per Start9 install)
1. Open the Spark Control service → **Actions****Show Public Key** → copy the line.
2. SSH to each Spark and append the line to `~/.ssh/authorized_keys` for the `modelo` user.
3. **Actions****Configure Sparks** → enter `192.168.1.103` / `modelo` for Spark 1 and `192.168.1.87` / `modelo` for Spark 2.
2. SSH to each Spark and append the line to `~/.ssh/authorized_keys` for the `<spark-user>` user.
3. **Actions****Configure Sparks** → enter `<spark-1-ip>` / `<spark-user>` for Spark 1 and `<spark-2-ip>` / `<spark-user>` for Spark 2.
4. Start the service. Open the Web UI — current model + health should show within ~5 s.
## Repo layout
@@ -76,9 +76,9 @@ Other services on your LAN can hit `GET /api/endpoints` to learn where the curre
```json
{
"vllm": { "ready": true, "base_url": "http://192.168.1.103:8888/v1", "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "openai_compat": true },
"parakeet":{ "ready": true, "base_url": "http://192.168.1.87:8000", "kind": "stt", "model": "nvidia/parakeet-tdt-0.6b-v3" },
"magpie": { "ready": false, "base_url": "http://192.168.1.87:9000", "kind": "tts" }
"vllm": { "ready": true, "base_url": "http://<spark-1-ip>:8888/v1", "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "openai_compat": true },
"parakeet":{ "ready": true, "base_url": "http://<spark-2-ip>:8000", "kind": "stt", "model": "nvidia/parakeet-tdt-0.6b-v3" },
"magpie": { "ready": false, "base_url": "http://<spark-2-ip>:9000", "kind": "tts" }
}
```
+19 -19
View File
@@ -1,7 +1,7 @@
# Project: spark-control — Model switcher web UI for dual DGX Spark cluster
> **Update 2026-05-12 — Direction change:** the web UI is being built as a
> **StartOS 0.4 package** (sideloaded onto Grant's existing Start9 server),
> **StartOS 0.4 package** (sideloaded onto Alice's existing Start9 server),
> **not** as a FastAPI service running directly on Spark 1. The Start9 server
> shares a LAN with the Sparks and SSHes into Spark 1 to invoke
> `launch-cluster.sh`. StartOS handles `.local` exposure and HTTPS; SSH
@@ -38,8 +38,8 @@ The web UI itself, when deployed, will run on **Spark 1** (where it can directly
From my laptop I can SSH to either Spark directly:
```bash
ssh modelo@192.168.1.103 # Spark 1
ssh modelo@192.168.1.87 # Spark 2
ssh <spark-user>@<spark-1-ip> # Spark 1
ssh <spark-user>@<spark-2-ip> # Spark 2
```
(I can also use SSH key auth — set up earlier.)
@@ -47,7 +47,7 @@ ssh modelo@192.168.1.87 # Spark 2
When you need to run a command on a Spark, use this pattern:
```bash
ssh modelo@192.168.1.103 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
```
For multi-line commands or scripts, you can pipe a heredoc or just SSH in directly and run them interactively. Either works — but always tell me what you're about to run so I can review.
@@ -55,19 +55,19 @@ For multi-line commands or scripts, you can pipe a heredoc or just SSH in direct
For file transfers between my laptop and the Sparks, use `rsync`:
```bash
rsync -avz ~/Projects/spark-control/ modelo@192.168.1.103:~/spark-control/
rsync -avz ~/Projects/spark-control/ <spark-user>@<spark-1-ip>:~/spark-control/
```
## My hardware and what's running
**Two NVIDIA DGX Spark units** networked together:
- **Spark 1** — hostname `spark-27ea`, LAN IP `192.168.1.103`, QSFP IP `192.168.100.10`. Head node for the vLLM cluster.
- **Spark 2** — hostname `spark-32d0`, LAN IP `192.168.1.87`, QSFP IP `192.168.100.11`. Worker node for vLLM cluster, also hosts standalone services.
- **Spark 1** — hostname `<spark-1-host>`, LAN IP `<spark-1-ip>`, QSFP IP `<spark-1-qsfp-ip>`. Head node for the vLLM cluster.
- **Spark 2** — hostname `<spark-2-host>`, LAN IP `<spark-2-ip>`, QSFP IP `<spark-2-qsfp-ip>`. Worker node for vLLM cluster, also hosts standalone services.
Both run Ubuntu 24.04, NVIDIA driver 580.x, CUDA 13.0, Docker, and have 128 GB unified memory each. They share a QSFP cable for high-speed (200 Gb/s) inter-node networking.
Passwordless SSH works in both directions via `~/.ssh/id_ed25519_shared` key. My Linux username on both machines is `modelo`.
Passwordless SSH works in both directions via `~/.ssh/<ssh-key>` key. My Linux username on both machines is `<spark-user>`.
**Currently running:**
- One LLM at a time on the cluster (via the `eugr/spark-vllm-docker` project — see below)
@@ -88,7 +88,7 @@ Key commands (all run from `~/spark-vllm-docker` on Spark 1):
Container names: `vllm_node` (the main vLLM container), `ray_head` and `ray_worker` (Ray cluster), plus support containers.
The vLLM server binds to port **8888** and exposes an OpenAI-compatible API at `http://192.168.1.103:8888/v1`.
The vLLM server binds to port **8888** and exposes an OpenAI-compatible API at `http://<spark-1-ip>:8888/v1`.
## Models I have on disk (both Sparks)
@@ -154,7 +154,7 @@ Note: the `--moe_backend flashinfer_cutlass` flag is Blackwell-specific. If it e
- Status check: `./launch-cluster.sh status`
- See vLLM logs: `docker logs vllm_node` (add `-f` to follow)
- Hard reset if stuck: `./launch-cluster.sh stop && docker ps -aq | xargs -r docker rm -f`
- Health check (is API responding?): `curl -s http://192.168.1.103:8888/v1/models`
- Health check (is API responding?): `curl -s http://<spark-1-ip>:8888/v1/models`
### "Ready" signal
The model is ready to serve when `docker logs vllm_node` contains the line `Application startup complete.` Until then, it's still loading weights or compiling CUDA graphs.
@@ -163,8 +163,8 @@ The model is ready to serve when `docker logs vllm_node` contains the line `Appl
These don't get touched by model swaps:
- **`parakeet-asr`** — STT on port 8000. Already running 24/7. Verify with `curl http://192.168.1.87:8000/health` which should return `{"status":"ready",...}`.
- **`magpie-tts`** — TTS on port 9000. May or may not be running; verify with `docker ps` on Spark 2 and `curl http://192.168.1.87:9000/v1/health/ready`.
- **`parakeet-asr`** — STT on port 8000. Already running 24/7. Verify with `curl http://<spark-2-ip>:8000/health` which should return `{"status":"ready",...}`.
- **`magpie-tts`** — TTS on port 9000. May or may not be running; verify with `docker ps` on Spark 2 and `curl http://<spark-2-ip>:9000/v1/health/ready`.
## What I want you to build
@@ -201,7 +201,7 @@ spark-control/
5. Return exit code 0 on success, non-zero on failure
Two versions might be useful:
- The version that runs on **my laptop** — wraps everything in `ssh modelo@192.168.1.103 ...`
- The version that runs on **my laptop** — wraps everything in `ssh <spark-user>@<spark-1-ip> ...`
- A simpler version that lives on **Spark 1** — runs commands directly without SSH (used by the deployed web UI)
You can either share one script with a `--remote` flag, or make them two distinct files. Your call — propose the cleaner option.
@@ -246,14 +246,14 @@ The web UI runs on **Spark 1** so it can directly invoke `launch-cluster.sh` wit
## First task
1. First, **verify SSH access to both Sparks** from my laptop:
- `ssh modelo@192.168.1.103 hostname` should return `spark-27ea`
- `ssh modelo@192.168.1.87 hostname` should return `spark-32d0`
- `ssh <spark-user>@<spark-1-ip> hostname` should return `<spark-1-host>`
- `ssh <spark-user>@<spark-2-ip> hostname` should return `<spark-2-host>`
2. Then **verify the current state of the cluster** via SSH:
- Confirm `~/spark-vllm-docker` exists on Spark 1 and `launch-cluster.sh` is there: `ssh modelo@192.168.1.103 'ls ~/spark-vllm-docker/launch-cluster.sh'`
- Check which LLM (if any) is currently loaded: `ssh modelo@192.168.1.103 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'` and `ssh modelo@192.168.1.103 'curl -s http://localhost:8888/v1/models'`
- Verify which models are downloaded: `ssh modelo@192.168.1.103 'ls ~/.cache/huggingface/hub/ | grep -iE "qwen|gemma"'`
- Confirm `~/spark-vllm-docker` exists on Spark 1 and `launch-cluster.sh` is there: `ssh <spark-user>@<spark-1-ip> 'ls ~/spark-vllm-docker/launch-cluster.sh'`
- Check which LLM (if any) is currently loaded: `ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'` and `ssh <spark-user>@<spark-1-ip> 'curl -s http://localhost:8888/v1/models'`
- Verify which models are downloaded: `ssh <spark-user>@<spark-1-ip> 'ls ~/.cache/huggingface/hub/ | grep -iE "qwen|gemma"'`
- Specifically check if `Qwen3.6-35B-A3B-NVFP4` is downloaded; if not, that's the prerequisite step (run the `hf-download.sh` command on Spark 1)
- Check what's running on Spark 2: `ssh modelo@192.168.1.87 'docker ps'` (looking for parakeet-asr and possibly magpie-tts)
- Check what's running on Spark 2: `ssh <spark-user>@<spark-2-ip> 'docker ps'` (looking for parakeet-asr and possibly magpie-tts)
3. Then create the repo structure on my laptop at `~/Projects/spark-control/`
4. Then propose the design for `models.yaml` and the swap script before implementing
-12
View File
@@ -12,18 +12,6 @@ RUN chmod +x /app/entrypoint.sh
COPY models.yaml /app/models.yaml
# Parakeet container wrapper patches (diarizer.py + main.py overlay).
# Shipped inside spark-control so the "Reapply speech-model patches" action
# can copy these into the parakeet-asr container on Spark 2 over SSH at any
# time — survives docker rm + redeploy of the parakeet container.
COPY parakeet_patches /app/parakeet_patches
# WhisperX container build context (Dockerfile + requirements.txt + app/).
# The "Install WhisperX" action in spark-control ships these files to Spark 2
# over SSH, then runs `docker build` + `docker run` there. The container
# becomes a managed always-on service alongside parakeet-asr and magpie-tts.
COPY whisperx_container /app/whisperx_container
RUN pip install --no-cache-dir -e .
ENV BIND_PORT=9999
-434
View File
@@ -1,434 +0,0 @@
"""OpenAI-compatible audio proxy: lets any OpenAI-shaped client (Open WebUI,
Home Assistant, etc.) talk to Parakeet (STT) and Magpie (TTS) through one URL.
Endpoints exposed on spark-control's port (same as the dashboard):
GET /v1/models — lists STT model + Magpie voices in OpenAI shape
POST /v1/audio/speech — OpenAI TTS → Magpie /v1/audio/synthesize
POST /v1/audio/transcriptions — forward to Parakeet (already OpenAI-compatible)
Both downstream services already speak HTTP on the LAN; this module just adapts
request/response shapes so OpenAI clients don't need a custom integration.
When Parakeet returns a 500 (commonly the recurring CUDA wedge), the proxy
returns a clearer 503 with Retry-After=60, and fires the deep-health probe in
the background — which detects the wedge and triggers a rate-limited container
restart inside seconds. The client's next attempt ~60s later then succeeds.
"""
from __future__ import annotations
import asyncio
import logging
from typing import Any, Optional
import httpx
from fastapi import APIRouter, Form, HTTPException, Request, UploadFile, File
from fastapi.responses import Response, StreamingResponse
from pydantic import BaseModel
from .config import Settings
logger = logging.getLogger("spark-control.audio")
# Magpie voice name encodes its language. Example:
# Magpie-Multilingual.EN-US.Mia -> en-US
# Magpie-Multilingual.ES-US.Diego -> es-US
# Magpie-Multilingual.FR-FR.Pascal -> fr-FR
def _lang_from_voice(voice: str) -> str:
try:
parts = voice.split(".")
# parts = ["Magpie-Multilingual", "EN-US", "Mia"] (or with emotion suffix)
if len(parts) >= 2 and "-" in parts[1]:
lang_part = parts[1] # "EN-US"
primary, region = lang_part.split("-", 1)
return f"{primary.lower()}-{region.upper()}"
except Exception:
pass
return "en-US"
# Default voice: configurable, falls back to a sensible English voice if unset.
DEFAULT_VOICE = "Magpie-Multilingual.EN-US.Mia"
class SpeechRequest(BaseModel):
"""OpenAI /v1/audio/speech request body."""
model: Optional[str] = None # ignored — Magpie has one model
input: str # the text to speak
voice: Optional[str] = None # e.g. "Magpie-Multilingual.EN-US.Mia"
response_format: Optional[str] = "wav" # only "wav" supported today
speed: Optional[float] = 1.0 # ignored by Magpie
# Magpie-specific extensions (clients may pass these through)
language: Optional[str] = None
sample_rate_hz: Optional[int] = 22050
encoding: Optional[str] = "LINEAR_PCM"
def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
"""Build the audio proxy router.
If `deep_health` is provided, 500s from Parakeet trigger an immediate
background probe (which contains the same wedge-detect → auto-restart
logic as the 5-minute periodic loop, but fires now instead of waiting).
"""
router = APIRouter()
def _parakeet_base() -> str:
return f"http://{settings.parakeet_host}:{settings.parakeet_port}"
def _magpie_base() -> str:
return f"http://{settings.magpie_host}:{settings.magpie_port}"
# ---- /v1/models ----
@router.get("/v1/models")
async def list_models() -> dict:
"""Advertise the STT model + a small voice menu so clients can
populate their voice-picker UIs. Falls back gracefully if Magpie
is offline (returns just the STT entry)."""
data: list[dict] = [
{
"id": "parakeet-tdt-0.6b-v3",
"object": "model",
"owned_by": "nvidia",
"kind": "stt",
},
]
# Try to enumerate voices from Magpie; if unreachable, just skip.
try:
async with httpx.AsyncClient(timeout=5.0) as client:
r = await client.get(f"{_magpie_base()}/v1/audio/list_voices")
if r.status_code == 200:
voices_by_locales = r.json()
seen = set()
for _locales, payload in voices_by_locales.items():
for v in payload.get("voices", []):
# Collapse emotion variants — expose only the base voice name.
# "Magpie-Multilingual.EN-US.Mia.Angry" -> "Magpie-Multilingual.EN-US.Mia"
parts = v.split(".")
base = ".".join(parts[:3]) if len(parts) >= 3 else v
if base not in seen:
seen.add(base)
data.append({
"id": base,
"object": "model",
"owned_by": "nvidia",
"kind": "tts",
})
except Exception as e:
logger.warning("magpie voice list unavailable: %s", e)
return {"object": "list", "data": data}
# ---- /v1/audio/speech (TTS) ----
@router.post("/v1/audio/speech")
async def speech(body: SpeechRequest) -> Response:
"""OpenAI-style TTS. Translates to Magpie's multipart synth call.
Returns raw WAV bytes (Content-Type: audio/wav) — browsers and most
clients play these directly.
"""
text = (body.input or "").strip()
if not text:
raise HTTPException(400, "input text is required")
voice = body.voice or DEFAULT_VOICE
language = body.language or _lang_from_voice(voice)
sample_rate = int(body.sample_rate_hz or 22050)
encoding = body.encoding or "LINEAR_PCM"
form = {
"text": text,
"language": language,
"voice": voice,
"sample_rate_hz": str(sample_rate),
"encoding": encoding,
}
try:
async with httpx.AsyncClient(timeout=120.0) as client:
r = await client.post(f"{_magpie_base()}/v1/audio/synthesize", data=form)
except httpx.HTTPError as e:
raise HTTPException(502, f"magpie unreachable: {e}")
if r.status_code != 200:
# Surface Magpie's error message verbatim so clients can debug voice/lang typos.
raise HTTPException(r.status_code, r.text[:500])
# Magpie returns WAV bytes already (Content-Type: audio/wav). Pass through.
media_type = r.headers.get("content-type", "audio/wav")
return Response(content=r.content, media_type=media_type)
# ---- /v1/audio/transcriptions (STT) ----
@router.post("/v1/audio/transcriptions")
async def transcriptions(
file: UploadFile = File(...),
model: Optional[str] = Form(default=None),
language: Optional[str] = Form(default=None),
prompt: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json"),
temperature: Optional[float] = Form(default=None),
) -> Response:
"""Forward to Parakeet's already-OpenAI-compatible endpoint.
We relay rather than redirect so clients only need to know one URL
(spark-control's) — and so any future client-side rewrites of the
request shape (e.g. translating Whisper-format params) happen here.
"""
body = await file.read()
files = {"file": (file.filename or "audio.wav", body, file.content_type or "application/octet-stream")}
data: dict[str, str] = {}
if model: data["model"] = model
if language: data["language"] = language
if prompt: data["prompt"] = prompt
if response_format: data["response_format"] = response_format
if temperature is not None: data["temperature"] = str(temperature)
try:
async with httpx.AsyncClient(timeout=300.0) as client:
r = await client.post(
f"{_parakeet_base()}/v1/audio/transcriptions",
files=files, data=data,
)
except httpx.HTTPError as e:
raise HTTPException(502, f"parakeet unreachable: {e}")
if r.status_code == 500:
# Parakeet 500s are almost always the CUDA wedge (CUBLAS_*_ERROR
# mid-attention). Kick deep-health to detect+restart in the
# background, and return a clean retry signal to the client.
err_snippet = r.text[:400]
logger.warning("parakeet 500 — firing deep-health probe in background. detail=%s", err_snippet)
if deep_health is not None:
try:
asyncio.create_task(deep_health.run_one("parakeet"))
except Exception as e:
logger.error("failed to schedule deep-health probe: %s", e)
raise HTTPException(
status_code=503,
detail="Parakeet returned a transient error (likely CUDA wedge). Auto-restart triggered; retry in ~60s.",
headers={"Retry-After": "60"},
)
if r.status_code != 200:
raise HTTPException(r.status_code, r.text[:500])
return Response(content=r.content, media_type=r.headers.get("content-type", "application/json"))
def _whisperx_base() -> str:
return f"http://{settings.whisperx_host}:{settings.whisperx_port}"
async def _whisperx_healthy() -> bool:
try:
async with httpx.AsyncClient(timeout=2.0) as client:
r = await client.get(f"{_whisperx_base()}/health")
return r.status_code == 200 and bool(r.json().get("diarizer_loaded"))
except Exception:
return False
# ---- /api/audio/transcribe-with-speakers (STT + diarization, merged) ----
@router.post("/api/audio/transcribe-with-speakers")
async def transcribe_with_speakers(
file: UploadFile = File(...),
) -> dict:
"""Diarized transcription: run Parakeet ASR and Sortformer diarization on
the same audio in parallel, then merge by timestamp.
Response shape (designed for downstream UIs like recap-relay):
{
"duration": 90.5,
"language": "en",
"speakers_detected": ["Speaker_0", "Speaker_1"],
"segments": [
{"start_ms": 39308, "end_ms": 51000,
"speaker": "Speaker_0", "text": "good morning i think..."},
...
],
"models": {
"transcription": "parakeet-tdt-0.6b-v3",
"diarization": "nvidia/diar_sortformer_4spk-v1"
}
}
Each segment is a block of consecutive words by the same speaker. Speaker
labels are anonymous (Speaker_0, Speaker_1, ...) — name resolution is the
caller's responsibility (LLM analysis with optional participant hints,
or manual mapping UI).
"""
body = await file.read()
if not body:
raise HTTPException(400, "Empty file")
filename = file.filename or "audio.wav"
content_type = file.content_type or "application/octet-stream"
# Prefer WhisperX (single-pipeline, handles long audio properly) when it's
# installed and healthy. Fall back to Parakeet + Sortformer otherwise.
if await _whisperx_healthy():
files = {"file": (filename, body, content_type)}
try:
async with httpx.AsyncClient(timeout=1800.0) as client:
r = await client.post(
f"{_whisperx_base()}/v1/audio/transcribe-with-speakers",
files=files,
)
except httpx.HTTPError as e:
raise HTTPException(502, f"whisperx unreachable: {e}")
if r.status_code != 200:
raise HTTPException(r.status_code, r.text[:500])
return r.json()
# ── Legacy fallback: Parakeet ASR + Sortformer diarizer in parallel ──
async def _call_transcribe(client: httpx.AsyncClient) -> dict:
files = {"file": (filename, body, content_type)}
data = {"response_format": "verbose_json"}
r = await client.post(
f"{_parakeet_base()}/v1/audio/transcriptions",
files=files, data=data,
)
r.raise_for_status()
return r.json()
async def _call_diarize(client: httpx.AsyncClient) -> dict:
files = {"file": (filename, body, content_type)}
r = await client.post(
f"{_parakeet_base()}/v1/audio/diarize",
files=files,
)
r.raise_for_status()
return r.json()
# Run both in parallel against the same Parakeet container — Sortformer
# and Parakeet ASR are independent forward passes that share the GPU.
try:
async with httpx.AsyncClient(timeout=600.0) as client:
stt, diar = await asyncio.gather(
_call_transcribe(client),
_call_diarize(client),
)
except httpx.HTTPStatusError as e:
# Surface upstream errors. If transcribe wedged, kick deep-health.
if e.response.status_code == 500 and deep_health is not None:
try:
asyncio.create_task(deep_health.run_one("parakeet"))
except Exception:
pass
raise HTTPException(
status_code=503,
detail="Parakeet transient error (likely CUDA wedge). Auto-restart triggered; retry in ~60s.",
headers={"Retry-After": "60"},
)
raise HTTPException(e.response.status_code, e.response.text[:500])
except httpx.HTTPError as e:
raise HTTPException(502, f"parakeet unreachable: {e}")
merged = _merge_words_with_speakers(
words=stt.get("words", []),
diar_turns=diar.get("segments", []),
)
return {
"duration": stt.get("duration") or diar.get("duration") or 0.0,
"language": stt.get("language", "en"),
"speakers_detected": diar.get("speakers_detected", []),
"segments": merged,
"models": {
"transcription": stt.get("model") if isinstance(stt.get("model"), str) else "parakeet",
"diarization": diar.get("model", "sortformer"),
},
}
return router
# ---- Merge helper: assign speaker to each word, then group into blocks ----
def _assign_speaker_to_word(word_start_s: float, word_end_s: float, diar_turns: list[dict]) -> str:
"""Find the diarization turn that contains this word, or has the most
overlap with it. Returns the speaker label, or 'Speaker_unknown' if no
turn overlaps at all."""
word_mid = (word_start_s + word_end_s) / 2.0
# Fast path: find the turn containing the midpoint
for t in diar_turns:
if t["start_s"] <= word_mid <= t["end_s"]:
return t["speaker"]
# Slow path: pick the turn with max overlap with the word's span
best_speaker = "Speaker_unknown"
best_overlap = 0.0
for t in diar_turns:
overlap = max(0.0, min(word_end_s, t["end_s"]) - max(word_start_s, t["start_s"]))
if overlap > best_overlap:
best_overlap = overlap
best_speaker = t["speaker"]
return best_speaker
def _merge_words_with_speakers(words: list[dict], diar_turns: list[dict]) -> list[dict]:
"""Group consecutive same-speaker words into blocks.
Each input word: {"start": float_s, "end": float_s, "text": str} (Parakeet
verbose_json format; values are seconds).
Each input turn: {"start_s": float, "end_s": float, "speaker": str}.
Output: [{"start_ms": int, "end_ms": int, "speaker": str, "text": str}, ...]
Also breaks a block on a long silence gap (>1.5 s) even within the same
speaker — keeps blocks readable in UI rendering.
"""
if not words:
return []
SILENCE_BREAK_S = 1.5
def _join_words(parts: list[str]) -> str:
"""Join word tokens with proper spacing. Different STT outputs vary —
some include leading spaces in the word text (' morning'), some don't
('morning'). Normalize by stripping each token then joining with one
space; collapse multiple spaces. Keeps punctuation tight (no space
before period/comma/etc.)."""
cleaned = [p.strip() for p in parts if p and p.strip()]
if not cleaned:
return ""
out = cleaned[0]
for token in cleaned[1:]:
# No leading space before pure-punctuation tokens
if token and token[0] in ".,;:!?)]}'\"":
out += token
else:
out += " " + token
return out
blocks: list[dict] = []
cur_words: list[str] = []
cur_speaker: Optional[str] = None
cur_start_s: Optional[float] = None
cur_end_s: Optional[float] = None
for w in words:
ws = float(w.get("start", 0.0))
we = float(w.get("end", ws))
wt = str(w.get("text", ""))
spk = _assign_speaker_to_word(ws, we, diar_turns)
is_new_block = (
cur_speaker is None
or spk != cur_speaker
or (cur_end_s is not None and ws - cur_end_s > SILENCE_BREAK_S)
)
if is_new_block:
if cur_speaker is not None:
blocks.append({
"start_ms": int(cur_start_s * 1000),
"end_ms": int(cur_end_s * 1000),
"speaker": cur_speaker,
"text": _join_words(cur_words),
})
cur_words = [wt]
cur_speaker = spk
cur_start_s = ws
cur_end_s = we
else:
cur_words.append(wt)
cur_end_s = we
if cur_speaker is not None and cur_words:
blocks.append({
"start_ms": int(cur_start_s * 1000),
"end_ms": int(cur_end_s * 1000),
"speaker": cur_speaker,
"text": _join_words(cur_words),
})
return blocks
+1 -11
View File
@@ -35,11 +35,6 @@ class Settings:
magpie_host: str
magpie_user: str
magpie_container: str
whisperx_host: str
whisperx_user: str
whisperx_container: str
whisperx_port: int
whisperx_model: str
ssh_key_path: str
ssh_known_hosts: str
models_yaml: str
@@ -54,7 +49,7 @@ class Settings:
def from_env(cls) -> "Settings":
spark2_host = _env("SPARK2_HOST")
spark2_user = _env("SPARK2_USER")
# Parakeet, Magpie, and WhisperX all default to Spark 2 unless overridden.
# Parakeet and Magpie default to Spark 2 unless explicitly overridden.
return cls(
spark1_host=_env("SPARK1_HOST"),
spark1_user=_env("SPARK1_USER"),
@@ -66,11 +61,6 @@ class Settings:
magpie_host=_env("MAGPIE_HOST") or spark2_host,
magpie_user=_env("MAGPIE_USER") or spark2_user,
magpie_container=_env("MAGPIE_CONTAINER") or "magpie-tts",
whisperx_host=_env("WHISPERX_HOST") or spark2_host,
whisperx_user=_env("WHISPERX_USER") or spark2_user,
whisperx_container=_env("WHISPERX_CONTAINER") or "whisperx-asr",
whisperx_port=int(_env("WHISPERX_PORT", "8002")),
whisperx_model=_env("WHISPERX_MODEL", "medium"),
ssh_key_path=_env("SSH_KEY_PATH"),
ssh_known_hosts=_env("SSH_KNOWN_HOSTS"),
models_yaml=_resolve_models_yaml(),
+2 -2
View File
@@ -4,8 +4,8 @@ Format:
custom:
- key: my-riva
kind: stt
host: 192.168.1.87
user: modelo
host: <spark-2-ip>
user: <spark-user>
container: riva-asr
port: 8001
health_path: /health
-363
View File
@@ -1,363 +0,0 @@
"""Deep health probes for each service.
Why this exists: Triton's /health endpoint returns 200 as long as the HTTP
layer is alive and the model is registered. It does NOT verify that the CUDA
context inside the worker process is healthy. We've observed Parakeet getting
its CUDA context wedged after an OOM, where /health stays green but every
real transcription returns 500 cudaErrorUnknown.
So this module sends *real* but tiny synthetic inference requests:
- Parakeet: 1 second of digital silence (16 kHz mono PCM, in-memory WAV)
- Magpie: short text-to-speech, response audio discarded
- vLLM: 1-token chat completion against whatever model is loaded
All synthetic payloads are generated on demand into BytesIO, sent over HTTP,
and never touched the filesystem (on either spark-control's side or the
target service's side beyond normal Triton/Riva working memory).
When a probe fails with a signal that looks like a CUDA wedge, we
automatically issue `docker restart <container>`. Rate-limited to 3 restarts
per service per 30 minutes to avoid restart loops.
"""
from __future__ import annotations
import asyncio
import io
import time
import wave
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import httpx
from .config import Settings
from .connectivity import record_report
from .services import ServiceDef, run_action, services_from_settings
# Default 5-minute interval, controllable via env. Sub-minute is silly for a
# heavy synthetic probe; we just want to catch wedges within a reasonable
# window — much faster than the user noticing on their next real call.
DEFAULT_INTERVAL_SEC = 300.0
PROBE_TIMEOUT_SEC = 20.0
RESTART_RATE_LIMIT = 3 # max auto-restarts per service
RESTART_RATE_WINDOW_SEC = 1800.0 # within a 30-min window
RESTART_COOLDOWN_SEC = 120.0 # don't restart again within this many seconds of the last one
STARTUP_GRACE_SEC = 60.0 # don't auto-restart for the first minute after this app boots
def _silence_wav(seconds: float = 1.0, sample_rate: int = 16000) -> io.BytesIO:
"""Return an in-memory WAV file containing `seconds` of digital silence."""
n_frames = int(seconds * sample_rate)
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
w.setnchannels(1)
w.setsampwidth(2) # int16
w.setframerate(sample_rate)
w.writeframes(b"\x00\x00" * n_frames)
buf.seek(0)
return buf
def _looks_like_wedge(error: str) -> bool:
"""Heuristic: does this error string look like a stuck CUDA context that
a container restart would clear? We want to be conservative — only act
on signals we're confident about, otherwise leave the user in charge."""
err = (error or "").lower()
needles = [
"cudaerrorunknown",
"cuda error: unknown",
"cuda kernel errors",
"internal server error",
"engine core initialization failed",
"503", # service unavailable from a dependency
"500", # generic 5xx with a body that may not parse
]
return any(n in err for n in needles)
@dataclass
class ProbeResult:
ok: bool
at: str
latency_ms: Optional[int] = None
error: str = ""
note: str = ""
@dataclass
class ServiceState:
last: Optional[ProbeResult] = None
last_ok_at: Optional[str] = None
restarts: list[float] = field(default_factory=list)
class DeepHealth:
def __init__(self, settings: Settings, interval_sec: float = DEFAULT_INTERVAL_SEC) -> None:
self.settings = settings
self.interval_sec = interval_sec
self.state: dict[str, ServiceState] = {
"parakeet": ServiceState(),
"magpie": ServiceState(),
"vllm": ServiceState(),
}
self._stop = asyncio.Event()
self._boot_at = time.monotonic()
# ---- probes ---------------------------------------------------------
async def probe_parakeet(self) -> ProbeResult:
s = self.settings
now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
if not s.parakeet_host:
return ProbeResult(ok=False, at=now_iso, error="not configured")
url = f"http://{s.parakeet_host}:{s.parakeet_port}/v1/audio/transcriptions"
wav = _silence_wav(1.0)
t0 = time.monotonic()
try:
async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
r = await c.post(
url,
files={"file": ("probe.wav", wav, "audio/wav")},
data={"model": "parakeet-tdt-0.6b-v3"},
)
latency = round((time.monotonic() - t0) * 1000)
if 200 <= r.status_code < 300:
return ProbeResult(ok=True, at=now_iso, latency_ms=latency)
return ProbeResult(
ok=False,
at=now_iso,
latency_ms=latency,
error=f"HTTP {r.status_code}: {r.text[:240]}",
)
except Exception as e:
return ProbeResult(ok=False, at=now_iso, error=f"{type(e).__name__}: {e}")
async def probe_magpie(self) -> ProbeResult:
s = self.settings
now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
if not s.magpie_host:
return ProbeResult(ok=False, at=now_iso, error="not configured")
# Magpie /v1/audio/synthesize expects multipart form-data, not JSON.
# The (None, value) tuple in httpx's `files=` produces a non-file form field.
url = f"http://{s.magpie_host}:{s.magpie_port}/v1/audio/synthesize"
form: dict = {"text": (None, "hi"), "language": (None, "en-US")}
t0 = time.monotonic()
try:
async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
r = await c.post(url, files=form)
latency = round((time.monotonic() - t0) * 1000)
if 200 <= r.status_code < 300:
return ProbeResult(ok=True, at=now_iso, latency_ms=latency)
# 4xx that aren't 5xx mean server is alive but our payload is off —
# don't classify as wedge.
if 400 <= r.status_code < 500:
return ProbeResult(
ok=True,
at=now_iso,
latency_ms=latency,
note=f"{r.status_code} — server alive (probe payload may need a voice name)",
)
return ProbeResult(
ok=False,
at=now_iso,
latency_ms=latency,
error=f"HTTP {r.status_code}: {r.text[:240]}",
)
except Exception as e:
return ProbeResult(ok=False, at=now_iso, error=f"{type(e).__name__}: {e}")
async def probe_vllm(self) -> ProbeResult:
s = self.settings
now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
if not s.spark1_host:
return ProbeResult(ok=False, at=now_iso, error="not configured")
base = f"http://{s.spark1_host}:{s.vllm_port}"
# Step 1: is there a model loaded?
try:
async with httpx.AsyncClient(timeout=5.0) as c:
r = await c.get(f"{base}/v1/models")
if 200 <= r.status_code < 300:
models = r.json().get("data") or []
else:
# 5xx on /v1/models suggests something wedged after a model loaded
return ProbeResult(
ok=False,
at=now_iso,
error=f"list_models HTTP {r.status_code}: {r.text[:240]}",
)
except Exception:
# Connection refused / timeout: usually means no vLLM process listening
# (the vllm_node container is alive but no `vllm serve` is running yet).
# That's an idle state, not a wedge — don't trigger auto-restart.
return ProbeResult(
ok=True,
at=now_iso,
note="no model currently loaded (idle)",
)
if not models:
return ProbeResult(
ok=True,
at=now_iso,
note="no model currently loaded (idle)",
)
model_id = models[0]["id"]
# Step 2: model is loaded; verify it can actually complete a 1-token request.
t0 = time.monotonic()
try:
async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
r = await c.post(
f"{base}/v1/chat/completions",
json={
"model": model_id,
"messages": [{"role": "user", "content": "hi"}],
"max_tokens": 1,
"temperature": 0,
},
)
latency = round((time.monotonic() - t0) * 1000)
if 200 <= r.status_code < 300:
return ProbeResult(ok=True, at=now_iso, latency_ms=latency, note=f"model={model_id}")
return ProbeResult(
ok=False,
at=now_iso,
latency_ms=latency,
error=f"HTTP {r.status_code}: {r.text[:240]}",
)
except Exception as e:
return ProbeResult(ok=False, at=now_iso, error=f"{type(e).__name__}: {e}")
# ---- orchestration --------------------------------------------------
PROBES = {
"parakeet": "probe_parakeet",
"magpie": "probe_magpie",
"vllm": "probe_vllm",
}
async def run_one(self, service: str) -> ProbeResult:
fn = getattr(self, self.PROBES[service])
result: ProbeResult = await fn()
st = self.state[service]
prev_ok = st.last.ok if st.last else None
st.last = result
if result.ok:
st.last_ok_at = result.at
# Log to connectivity history: every failure, plus the first success
# after a failure (recovery), plus the first probe ever — but skip
# the "still ok" steady-state to keep the log readable.
if not result.ok:
record_report(
service,
ok=False,
source="deep-health",
detail=result.error[:240],
latency_ms=result.latency_ms,
)
elif prev_ok is False:
record_report(
service,
ok=True,
source="deep-health",
detail="recovered" + (f"{result.note}" if result.note else ""),
latency_ms=result.latency_ms,
)
elif prev_ok is None:
record_report(
service,
ok=True,
source="deep-health",
detail="first probe ok" + (f"{result.note}" if result.note else ""),
latency_ms=result.latency_ms,
)
# Maybe auto-restart
if not result.ok and _looks_like_wedge(result.error):
await self._maybe_restart(service, result.error)
return result
async def _maybe_restart(self, service: str, error: str) -> None:
# No restarts during the boot grace period.
if time.monotonic() - self._boot_at < STARTUP_GRACE_SEC:
return
st = self.state[service]
now = time.monotonic()
st.restarts = [t for t in st.restarts if now - t < RESTART_RATE_WINDOW_SEC]
if st.restarts and now - st.restarts[-1] < RESTART_COOLDOWN_SEC:
return # already restarted recently, give it time
if len(st.restarts) >= RESTART_RATE_LIMIT:
record_report(
service,
ok=False,
source="deep-health",
detail=f"rate-limited; not auto-restarting (would be #{len(st.restarts)+1} in 30 min)",
)
return
services = services_from_settings(self.settings)
if service not in services:
return
svc = services[service]
if not svc.host or not svc.user:
return
result = await run_action(self.settings, svc, "restart")
st.restarts.append(now)
ok = result.get("ok", False)
record_report(
service,
ok=False,
source="deep-health",
detail=f"auto-restart triggered (wedge: {error[:120]}); restart {'OK' if ok else 'FAILED'}",
)
async def run_all(self) -> dict[str, ProbeResult]:
results = {}
for name in self.PROBES:
results[name] = await self.run_one(name)
return results
async def run_periodic(self) -> None:
"""Long-running loop. Cancel via .stop()."""
# Brief initial wait to let app finish startup
try:
await asyncio.wait_for(self._stop.wait(), timeout=10.0)
return
except asyncio.TimeoutError:
pass
while not self._stop.is_set():
try:
await self.run_all()
except Exception:
# Never let the loop die; the periodic check is best-effort
pass
try:
await asyncio.wait_for(self._stop.wait(), timeout=self.interval_sec)
return
except asyncio.TimeoutError:
continue
def stop(self) -> None:
self._stop.set()
def summary(self) -> dict:
out = {}
for name, st in self.state.items():
last = st.last
out[name] = {
"last_ok_at": st.last_ok_at,
"last": (
{
"ok": last.ok,
"at": last.at,
"latency_ms": last.latency_ms,
"error": last.error,
"note": last.note,
}
if last
else None
),
"auto_restarts_window": len(st.restarts),
}
return out
-134
View File
@@ -1,134 +0,0 @@
"""On-disk presence + deletion for Hugging Face model caches on the Sparks.
The HF cache layout for a repo `org/name` is:
~/.cache/huggingface/hub/models--org--name/
We use `du -sb` to measure size (bytes) and `rm -rf` to free it. All operations
are gated by the server endpoints, which refuse to delete a currently-loaded
model or one tied to an in-flight swap/download.
"""
from __future__ import annotations
import asyncio
import re
from dataclasses import dataclass
from typing import Optional
from .config import Settings
from .ssh import ssh_run
# HF cache dirnames are `models--<org>--<name>` where <org> and <name> only contain
# Hugging Face's allowed identifier chars: letters, digits, dot, dash, underscore.
# Validate against this whitelist so we can safely embed the dirname into a shell
# command without quoting (we need $HOME outside the quotes to expand).
_SAFE_DIRNAME = re.compile(r"^[A-Za-z0-9._\-]+$")
def repo_to_cache_dirname(repo: str) -> str:
"""Convert 'org/name' to 'models--org--name' (the HF hub cache directory)."""
if "/" not in repo:
raise ValueError(f"repo must be in 'org/name' form: {repo!r}")
dn = "models--" + repo.replace("/", "--")
if not _SAFE_DIRNAME.fullmatch(dn):
raise ValueError(f"unsafe cache dirname (rejected by whitelist): {dn!r}")
return dn
@dataclass
class HostDiskResult:
host: str
on_disk: bool
size_bytes: int = 0
error: Optional[str] = None
@dataclass
class DiskStatus:
repo: str
on_disk: bool # True if present on AT LEAST one host
total_bytes: int # sum across hosts
per_host: list[HostDiskResult]
async def probe_host(host: str, user: str, repo: str, settings: Settings) -> HostDiskResult:
"""Return whether the model's cache dir exists on this host and its size."""
if not host or not user:
return HostDiskResult(host=host or "?", on_disk=False, error="host not configured")
dn = repo_to_cache_dirname(repo) # whitelisted; safe to embed
# $HOME must expand server-side, so we build the path with double quotes
# (which DO allow variable expansion) rather than shlex.quote single quotes.
cmd = (
f'P="$HOME/.cache/huggingface/hub/{dn}"; '
f'if [ -d "$P" ]; then du -sb "$P" 2>/dev/null | cut -f1; '
f'else echo MISSING; fi'
)
rc, out, err = await ssh_run(host, user, cmd, settings, timeout=20.0)
if rc != 0:
return HostDiskResult(host=host, on_disk=False, error=(err or out).strip() or f"rc={rc}")
raw = out.strip()
if raw == "MISSING" or raw == "":
return HostDiskResult(host=host, on_disk=False)
try:
size = int(raw.splitlines()[-1])
except ValueError:
return HostDiskResult(host=host, on_disk=False, error=f"unparsable du output: {raw!r}")
return HostDiskResult(host=host, on_disk=True, size_bytes=size)
async def probe_disk(repo: str, mode: str, settings: Settings) -> DiskStatus:
"""Probe one model across the relevant Sparks based on its mode (solo|cluster)."""
hosts: list[tuple[str, str]] = [(settings.spark1_host, settings.spark1_user)]
if mode == "cluster" and settings.spark2_host:
hosts.append((settings.spark2_host, settings.spark2_user))
results = await asyncio.gather(*(probe_host(h, u, repo, settings) for h, u in hosts))
on_disk = any(r.on_disk for r in results)
total = sum(r.size_bytes for r in results)
return DiskStatus(repo=repo, on_disk=on_disk, total_bytes=total, per_host=list(results))
async def delete_host(host: str, user: str, repo: str, settings: Settings) -> HostDiskResult:
"""Probe + rm -rf on one host. Returns bytes freed (0 if the dir wasn't there)."""
if not host or not user:
return HostDiskResult(host=host or "?", on_disk=False, error="host not configured")
dn = repo_to_cache_dirname(repo) # whitelisted; safe to embed
# Compute size first, then remove. If absent, still return success (idempotent).
# $HOME is in double-quoted context so it expands; the dirname is whitelisted.
cmd = (
f'set -e; '
f'P="$HOME/.cache/huggingface/hub/{dn}"; '
f'if [ -d "$P" ]; then '
f' SIZE=$(du -sb "$P" 2>/dev/null | cut -f1); '
f' rm -rf -- "$P"; '
f' echo "FREED $SIZE"; '
f'else '
f' echo "FREED 0"; '
f'fi'
)
rc, out, err = await ssh_run(host, user, cmd, settings, timeout=120.0)
if rc != 0:
return HostDiskResult(host=host, on_disk=False, error=(err or out).strip() or f"rc={rc}")
# Parse the "FREED N" line
freed = 0
for line in out.splitlines():
parts = line.strip().split()
if len(parts) == 2 and parts[0] == "FREED":
try:
freed = int(parts[1])
except ValueError:
pass
break
return HostDiskResult(host=host, on_disk=False, size_bytes=freed)
async def delete_from_disk(repo: str, mode: str, settings: Settings) -> DiskStatus:
"""rm -rf the model's cache dir on the relevant Sparks. Idempotent."""
hosts: list[tuple[str, str]] = [(settings.spark1_host, settings.spark1_user)]
if mode == "cluster" and settings.spark2_host:
hosts.append((settings.spark2_host, settings.spark2_user))
results = await asyncio.gather(*(delete_host(h, u, repo, settings) for h, u in hosts))
total_freed = sum(r.size_bytes for r in results)
# After deletion, on_disk should be False on all hosts.
return DiskStatus(repo=repo, on_disk=False, total_bytes=total_freed, per_host=list(results))
-233
View File
@@ -12,9 +12,6 @@ from typing import Literal
from .config import Settings
from .connectivity import get_mac, record_report, record_state, summary as connectivity_summary
from .custom_services import add_custom_service, delete_custom_service
from .audio_proxy import build_router as build_audio_router
from .deep_health import DeepHealth
from .disk import delete_from_disk, probe_disk
from .download import DownloadManager
from .hardware import HardwareProbe
from .health import check_magpie, check_parakeet, check_vllm
@@ -22,9 +19,7 @@ from .models import load_catalog
from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager
from .overrides import add_custom, delete_custom, extract_knobs_from_args, load_overrides, set_knobs
from .services import docker_state, run_action, services_from_settings
from .speech_models import SpeechModelsManager
from .ssh import ssh_run
from .whisperx_install import WhisperXInstaller
from .swap import SwapManager
from .updates import UpdateManager, get_update_status
from .validate import validate_launch
@@ -38,34 +33,12 @@ download_manager = DownloadManager(settings)
update_manager = UpdateManager(settings)
hardware_probe = HardwareProbe(settings)
nim_manager = NimManager(settings)
deep_health = DeepHealth(settings)
speech_models = SpeechModelsManager(settings)
whisperx_installer = WhisperXInstaller(settings)
app = FastAPI(title="spark-control", version="0.1.0")
@app.on_event("startup")
async def _start_deep_health() -> None:
# Fire-and-forget; the loop catches its own exceptions.
asyncio.create_task(deep_health.run_periodic())
@app.on_event("shutdown")
async def _stop_deep_health() -> None:
deep_health.stop()
_STATIC_DIR = Path(__file__).resolve().parent / "static"
app.mount("/static", StaticFiles(directory=_STATIC_DIR), name="static")
# OpenAI-compatible audio proxy: /v1/audio/speech, /v1/audio/transcriptions, /v1/models.
# Lets Open WebUI, Home Assistant, and any other OpenAI-shaped client talk to
# Parakeet (STT) and Magpie (TTS) through a single spark-control URL.
# Passing deep_health lets the proxy fire an immediate wedge-detect + auto-restart
# when Parakeet returns 500, instead of waiting up to 5 min for the periodic probe.
app.include_router(build_audio_router(settings, deep_health=deep_health))
@app.get("/", include_in_schema=False)
async def index() -> FileResponse:
@@ -152,89 +125,6 @@ async def del_model(key: str) -> dict:
return {"ok": True, "key": key}
@app.get("/api/models/disk-status")
async def get_models_disk_status() -> dict:
"""Probe each catalog model's HF cache on the appropriate Spark(s) in parallel.
Result is keyed by model key: {on_disk, total_bytes, per_host:[{host,on_disk,size_bytes,error?}]}.
Designed to be called once on dashboard load; takes ~13s depending on Spark count.
"""
if not settings.configured:
return {"configured": False, "models": {}}
keys = list(catalog.models.keys())
statuses = await asyncio.gather(*(
probe_disk(catalog.models[k].repo, catalog.models[k].mode, settings) for k in keys
), return_exceptions=True)
out: dict[str, dict] = {}
for k, s in zip(keys, statuses):
if isinstance(s, Exception):
out[k] = {"on_disk": False, "total_bytes": 0, "per_host": [], "error": str(s)}
continue
out[k] = {
"on_disk": s.on_disk,
"total_bytes": s.total_bytes,
"per_host": [
{"host": r.host, "on_disk": r.on_disk, "size_bytes": r.size_bytes, **({"error": r.error} if r.error else {})}
for r in s.per_host
],
}
return {"configured": True, "models": out}
@app.delete("/api/models/{key}/disk")
async def del_model_disk(key: str) -> dict:
"""Delete a model's weights from the Spark filesystem(s). The catalog entry stays.
Safety rails:
- Refuses if the model is currently loaded on vLLM.
- Refuses if a swap or download is in flight.
- Idempotent: if the cache dir is already gone on a host, that host reports 0 bytes freed.
"""
if key not in catalog.models:
raise HTTPException(404, f"unknown model: {key}")
m = catalog.models[key]
# Refuse if currently loaded
try:
vllm = await check_vllm(settings)
except Exception:
vllm = {}
if vllm.get("ok") and vllm.get("current_model") == m.repo:
raise HTTPException(
409,
f"'{m.display_name}' is the currently loaded model. Switch to a different model first, then try again."
)
# Refuse if a swap is in flight
if swap_manager.current_job_id:
raise HTTPException(409, "a model swap is in progress; wait for it to finish")
# Refuse if a download is in flight for this same repo (a different model's download is fine)
if download_manager.current_job_id:
job = download_manager.get(download_manager.current_job_id)
if job and job.repo == m.repo:
raise HTTPException(409, "this model is currently downloading; cancel or wait for it to finish")
status = await delete_from_disk(m.repo, m.mode, settings)
# Audit log
record_report(
f"disk:{key}",
ok=True,
source="disk-delete",
detail=f"freed {status.total_bytes} bytes across {len(status.per_host)} host(s)",
)
return {
"ok": True,
"key": key,
"repo": m.repo,
"bytes_freed": status.total_bytes,
"per_host": [
{"host": r.host, "size_bytes": r.size_bytes, **({"error": r.error} if r.error else {})}
for r in status.per_host
],
}
@app.get("/api/hardware")
async def get_hardware() -> dict:
"""Per-Spark hardware snapshot — RAM, disk, GPU mem + util, CPU load, uptime."""
@@ -247,27 +137,6 @@ async def get_connectivity() -> dict:
return connectivity_summary()
@app.get("/api/deep-health")
async def get_deep_health() -> dict:
"""Last result + auto-restart counters for each service's synthetic probe."""
return deep_health.summary()
@app.post("/api/deep-health/{service}/run")
async def run_deep_health(service: str) -> dict:
"""Manually run a single service's deep-health probe right now."""
if service not in deep_health.PROBES:
raise HTTPException(404, f"unknown service: {service}")
result = await deep_health.run_one(service)
return {
"ok": result.ok,
"at": result.at,
"latency_ms": result.latency_ms,
"error": result.error,
"note": result.note,
}
class HealthEventBody(BaseModel):
service: str # e.g. "parakeet", "magpie", "vllm"
ok: bool # true on success, false on failure
@@ -499,108 +368,6 @@ async def service_action(name: str, action: str) -> dict:
return {"name": name, "action": action, **result}
# ---- Speech model patch management ----
@app.get("/api/speech-models")
async def get_speech_models() -> dict:
"""Status of the parakeet-asr container + the spark-control overlay patches
(diarizer.py + main.py). Drift between local shipped patches and what's
inside the container is surfaced so the UI can prompt for reapply."""
return await speech_models.status()
@app.post("/api/speech-models/reapply")
async def post_speech_models_reapply() -> dict:
"""Copy spark-control's shipped diarizer.py + patched main.py into the
parakeet-asr container, verify Python syntax, restart the container, and
wait for both models (Parakeet ASR + Sortformer) to reload. ~60120 seconds."""
try:
result = await speech_models.reapply_patches()
except RuntimeError as e:
raise HTTPException(409, str(e))
if not result.get("ok"):
# Bubble up which step failed for client-side error rendering.
raise HTTPException(500, {"detail": "patch reapply failed", "result": result})
return result
@app.post("/api/speech-models/restart")
async def post_speech_models_restart() -> dict:
"""`docker restart parakeet-asr` only — no file changes. Useful when the
container's models look wedged but patches are already current."""
try:
result = await speech_models.restart_container()
except RuntimeError as e:
raise HTTPException(409, str(e))
if not result.get("ok"):
raise HTTPException(500, {"detail": "container restart failed", "result": result})
return result
# ---- WhisperX install (Phase 2 of the WhisperX migration) ----
@app.get("/api/whisperx/status")
async def get_whisperx_status() -> dict:
"""Is WhisperX installed + healthy on Spark 2 right now?"""
return await whisperx_installer.status()
@app.post("/api/whisperx/install")
async def post_whisperx_install() -> dict:
"""One-click install: ships the WhisperX build context from inside
spark-control to Spark 2, runs `docker build` + `docker run`, polls
/health until both models are loaded. Streams progress via the matching
GET /api/whisperx/install/{job_id}/stream SSE endpoint."""
try:
job = await whisperx_installer.trigger()
except RuntimeError as e:
raise HTTPException(409, str(e))
return {"job_id": job.id, "started_at": job.started_at}
@app.get("/api/whisperx/install/{job_id}")
async def get_whisperx_install(job_id: str) -> dict:
job = whisperx_installer.get(job_id)
if not job:
raise HTTPException(404, "unknown job")
return {
"id": job.id,
"state": job.state,
"phase": job.phase,
"lines": job.lines,
"started_at": job.started_at,
"finished_at": job.finished_at,
"returncode": job.returncode,
}
@app.get("/api/whisperx/install/{job_id}/stream")
async def stream_whisperx_install(job_id: str) -> StreamingResponse:
job = whisperx_installer.get(job_id)
if not job:
raise HTTPException(404, "unknown job")
async def event_stream():
last_idx = 0
last_phase = ""
last_state = ""
while True:
new_lines = job.lines[last_idx:]
last_idx = len(job.lines)
for line in new_lines:
yield f"data: {json.dumps({'line': line})}\n\n"
if job.phase != last_phase or job.state != last_state:
yield f"event: phase\ndata: {json.dumps({'phase': job.phase, 'state': job.state})}\n\n"
last_phase = job.phase
last_state = job.state
if job.finished_at:
yield f"event: done\ndata: {json.dumps({'state': job.state, 'returncode': job.returncode})}\n\n"
return
await asyncio.sleep(0.6)
return StreamingResponse(event_stream(), media_type="text/event-stream")
@app.get("/api/endpoints")
async def get_endpoints() -> dict:
"""Service-discovery summary. Stable shape; other apps on the LAN can poll this
-8
View File
@@ -65,14 +65,6 @@ def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
container=s.magpie_container,
port=s.magpie_port,
),
"whisperx": ServiceDef(
name="whisperx",
kind="stt+diarize",
host=s.whisperx_host,
user=s.whisperx_user,
container=s.whisperx_container,
port=s.whisperx_port,
),
}
for entry in load_custom_services():
key = entry.get("key")
-319
View File
@@ -1,319 +0,0 @@
"""Speech-model patch management for the parakeet-asr container on Spark 2.
The parakeet-asr container ships with a stock FastAPI wrapper that only supports
ASR (Parakeet TDT). Spark Control augments it with two overlay files —
`diarizer.py` and a patched `main.py` — that add Sortformer-based diarization
and the `/v1/audio/diarize` endpoint.
These overlays survive `docker restart` (writable layer) but NOT `docker rm`
(volume rebuild). If the parakeet container is ever recreated, the overlays
need to be re-applied. This module handles that:
- GET /api/speech-models → current state (loaded models, patch
checksums, drift detection)
- POST /api/speech-models/reapply → copy overlays from spark-control's
shipped /app/parakeet_patches into
the parakeet container + restart
- POST /api/speech-models/restart → just `docker restart parakeet-asr`,
no overlay changes
"""
from __future__ import annotations
import asyncio
import hashlib
import json
import shlex
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
import httpx
from .config import Settings
from .connectivity import record_report
from .ssh import ssh_run
# /app/parakeet_patches inside the spark-control container image (set up by
# the Dockerfile COPY directive). Each file under here is the canonical
# version we'd push into the parakeet container.
PATCHES_DIR = Path(__file__).resolve().parent.parent / "parakeet_patches"
# Files we manage. Mapped local-source-path -> destination-path-in-container.
MANAGED_FILES = {
"diarizer.py": "/opt/parakeet/app/diarizer.py",
"main.py": "/opt/parakeet/app/main.py",
}
def _sha256_short(text: bytes) -> str:
return hashlib.sha256(text).hexdigest()[:12]
def _local_patches() -> dict[str, dict]:
"""Read the canonical patch files shipped inside spark-control.
Returns: {local_name: {"path": str, "sha": str, "size": int, "missing": bool}}
"""
out: dict[str, dict] = {}
for local_name in MANAGED_FILES:
p = PATCHES_DIR / local_name
if not p.exists():
out[local_name] = {"path": str(p), "missing": True}
continue
body = p.read_bytes()
out[local_name] = {
"path": str(p),
"sha": _sha256_short(body),
"size": len(body),
"missing": False,
}
return out
async def _parakeet_health(settings: Settings) -> dict:
"""Pull current model loading state from Parakeet's /health endpoint."""
url = f"http://{settings.parakeet_host}:{settings.parakeet_port}/health"
try:
async with httpx.AsyncClient(timeout=4.0) as client:
r = await client.get(url)
if r.status_code == 200:
return r.json()
return {"reachable": False, "status_code": r.status_code, "error": r.text[:200]}
except Exception as e:
return {"reachable": False, "error": f"{type(e).__name__}: {e}"}
async def _remote_file_sha(settings: Settings, container_path: str) -> Optional[str]:
"""sha256 of a file inside the parakeet container, or None if missing/error."""
if not settings.parakeet_host or not settings.parakeet_user:
return None
cmd = (
f"docker exec parakeet-asr sh -c "
f"'[ -f {shlex.quote(container_path)} ] && "
f"sha256sum {shlex.quote(container_path)} 2>/dev/null | cut -c1-12 || echo MISSING'"
)
rc, out, _ = await ssh_run(settings.parakeet_host, settings.parakeet_user, cmd, settings, timeout=15)
if rc != 0:
return None
s = out.strip()
if s == "MISSING" or not s:
return None
return s
class SpeechModelsManager:
"""Tracks last-reapply state in-memory; persists nothing across spark-control
restarts (the source-of-truth is what's actually inside the parakeet
container, which we read fresh on every status call)."""
def __init__(self, settings: Settings) -> None:
self.settings = settings
self.last_reapply_at: Optional[str] = None
self.last_reapply_result: Optional[dict] = None
self.last_restart_at: Optional[str] = None
self._reapply_lock = asyncio.Lock()
async def status(self) -> dict:
"""Build the full speech-models status payload for the UI.
Compares the SHAs of files we shipped inside spark-control vs what's
actually running inside the parakeet container — surfaces drift if
patches were applied from an older spark-control version, or never
applied at all.
"""
local = _local_patches()
health = await _parakeet_health(self.settings)
# Probe remote SHAs in parallel
async def _probe(local_name: str) -> tuple[str, Optional[str]]:
return local_name, await _remote_file_sha(self.settings, MANAGED_FILES[local_name])
remote_results = await asyncio.gather(*(_probe(n) for n in MANAGED_FILES))
remote = {name: sha for name, sha in remote_results}
files = []
all_in_sync = True
any_missing_remote = False
for local_name in MANAGED_FILES:
local_info = local.get(local_name, {})
local_sha = local_info.get("sha")
remote_sha = remote.get(local_name)
in_sync = bool(local_sha) and (local_sha == remote_sha)
if not in_sync:
all_in_sync = False
if remote_sha is None:
any_missing_remote = True
files.append({
"name": local_name,
"container_path": MANAGED_FILES[local_name],
"local_sha": local_sha,
"remote_sha": remote_sha,
"in_sync": in_sync,
"size_bytes": local_info.get("size"),
})
# Coarse status for the UI to render a single pill
if any_missing_remote:
patch_status = "missing" # overlay files missing in container
elif all_in_sync:
patch_status = "in_sync"
else:
patch_status = "drift" # local files newer than container
return {
"container_health": health,
"patches": {
"status": patch_status,
"files": files,
"last_reapply_at": self.last_reapply_at,
"last_reapply_result": self.last_reapply_result,
"last_restart_at": self.last_restart_at,
},
}
async def reapply_patches(self) -> dict:
"""Copy the patches shipped inside spark-control into the parakeet
container, verify syntax, and restart it. Same logic as apply.sh but
run from inside spark-control's FastAPI process."""
if self._reapply_lock.locked():
raise RuntimeError("a patch reapply is already in progress")
async with self._reapply_lock:
return await self._do_reapply()
async def _do_reapply(self) -> dict:
s = self.settings
if not s.parakeet_host or not s.parakeet_user:
raise RuntimeError("parakeet host/user not configured")
steps: list[dict] = []
# 0. Verify local patches present
local = _local_patches()
for name, info in local.items():
if info.get("missing"):
steps.append({"step": "verify_local", "ok": False, "name": name, "error": "patch file missing inside spark-control image"})
return self._finish_reapply(False, steps)
steps.append({"step": "verify_local", "ok": True, "files": list(local.keys())})
# 1. Backup main.py inside container (idempotent — only if backup doesn't already exist)
backup_cmd = (
"docker exec parakeet-asr sh -c '"
"test -f /opt/parakeet/app/main.py.pre-sortformer || "
"cp /opt/parakeet/app/main.py /opt/parakeet/app/main.py.pre-sortformer"
"'"
)
rc, out, err = await ssh_run(s.parakeet_host, s.parakeet_user, backup_cmd, s, timeout=15)
steps.append({"step": "backup_original", "ok": rc == 0, "stdout": out.strip()[:200], "stderr": err.strip()[:200]})
if rc != 0:
return self._finish_reapply(False, steps)
# 2. Copy each patch file into the container via `docker exec -i ... 'cat > path'`
for local_name, container_path in MANAGED_FILES.items():
local_body = (PATCHES_DIR / local_name).read_bytes()
copy_cmd = f"docker exec -i parakeet-asr sh -c {shlex.quote('cat > ' + container_path)}"
ok, out, err = await self._ssh_pipe_to_remote(
s.parakeet_host, s.parakeet_user, copy_cmd, local_body, s, timeout=30
)
steps.append({"step": "copy_file", "name": local_name, "ok": ok,
"bytes": len(local_body), "stdout": out[:200], "stderr": err[:200]})
if not ok:
return self._finish_reapply(False, steps)
# 3. Verify Python syntax inside the container
syntax_cmd = (
"docker exec parakeet-asr python3 -c "
"'import ast; "
"ast.parse(open(\"/opt/parakeet/app/diarizer.py\").read()); "
"ast.parse(open(\"/opt/parakeet/app/main.py\").read()); "
"print(\"py OK\")'"
)
rc, out, err = await ssh_run(s.parakeet_host, s.parakeet_user, syntax_cmd, s, timeout=30)
ok = rc == 0 and "py OK" in out
steps.append({"step": "verify_syntax", "ok": ok, "stdout": out.strip()[:300], "stderr": err.strip()[:300]})
if not ok:
return self._finish_reapply(False, steps)
# 4. Restart the container
restart_cmd = "docker restart parakeet-asr"
rc, out, err = await ssh_run(s.parakeet_host, s.parakeet_user, restart_cmd, s, timeout=60)
steps.append({"step": "docker_restart", "ok": rc == 0, "stdout": out.strip()[:200], "stderr": err.strip()[:200]})
if rc != 0:
return self._finish_reapply(False, steps)
# 5. Poll /health until both models are loaded again (up to ~120s)
loaded = False
for _ in range(40):
await asyncio.sleep(3)
h = await _parakeet_health(s)
if h.get("asr_loaded") and h.get("diarizer_loaded"):
loaded = True
steps.append({"step": "verify_health", "ok": True, "asr_loaded": True, "diarizer_loaded": True})
break
if not loaded:
steps.append({"step": "verify_health", "ok": False, "error": "models did not load within 120s"})
return self._finish_reapply(False, steps)
return self._finish_reapply(True, steps)
def _finish_reapply(self, success: bool, steps: list[dict]) -> dict:
now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
self.last_reapply_at = now
result = {"ok": success, "at": now, "steps": steps}
self.last_reapply_result = result
record_report(
"parakeet",
ok=success,
source="speech-models-reapply",
detail=f"reapply patches: {'OK' if success else 'FAILED at step ' + str([s for s in steps if not s.get('ok')][:1])}",
)
return result
async def restart_container(self) -> dict:
"""Restart the parakeet-asr container without changing any files."""
s = self.settings
if not s.parakeet_host or not s.parakeet_user:
raise RuntimeError("parakeet host/user not configured")
rc, out, err = await ssh_run(s.parakeet_host, s.parakeet_user,
"docker restart parakeet-asr", s, timeout=60)
ok = rc == 0
now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
self.last_restart_at = now
record_report(
"parakeet",
ok=ok,
source="speech-models-restart",
detail=f"manual restart: {'OK' if ok else 'rc=' + str(rc) + ' ' + err.strip()[:120]}",
)
return {"ok": ok, "at": now, "stdout": out.strip()[:200], "stderr": err.strip()[:200]}
async def _ssh_pipe_to_remote(
self,
host: str,
user: str,
remote_cmd: str,
payload: bytes,
settings: Settings,
timeout: float = 30.0,
) -> tuple[bool, str, str]:
"""Run `ssh user@host <remote_cmd>` while piping `payload` to its stdin.
This is the bash equivalent of `ssh ... '<cmd>' < local_file`.
Returns (success, stdout_str, stderr_str)."""
from .ssh import _base_args
args = _base_args(settings) + [f"{user}@{host}", remote_cmd]
proc = await asyncio.create_subprocess_exec(
*args,
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:
stdout_b, stderr_b = await asyncio.wait_for(
proc.communicate(input=payload), timeout=timeout
)
except asyncio.TimeoutError:
proc.kill()
await proc.wait()
return False, "", f"timeout after {timeout}s"
ok = proc.returncode == 0
return ok, stdout_b.decode(errors="replace"), stderr_b.decode(errors="replace")
+3 -521
View File
@@ -17,9 +17,6 @@ const state = {
config: {},
configured: true,
timer_handle: null,
deep_health: {},
disk_status: {}, // keyed by model key: { on_disk, total_bytes, per_host }
disk_status_loaded: false,
};
const el = (sel) => document.querySelector(sel);
@@ -59,49 +56,12 @@ function renderCards() {
? `<div class="desc">${escapeHtml(m.description)}</div>`
: '';
const customPill = m.custom ? `<span class="tag custom-pill">custom</span>` : '';
// Disk-presence pill + trash button. Until /api/models/disk-status comes back,
// we don't know — render a neutral placeholder.
const disk = state.disk_status[key];
let diskPill = '';
if (state.disk_status_loaded) {
if (disk && disk.on_disk) {
const gb = (disk.total_bytes / 1e9);
diskPill = `<span class="tag on-disk" title="Weights present on disk">on disk · ${gb.toFixed(1)} GB</span>`;
} else {
diskPill = `<span class="tag not-on-disk" title="Weights not downloaded">not downloaded</span>`;
}
}
// Trash button — hidden if not on disk; disabled (with tooltip) if currently loaded.
let trashBtn = '';
if (state.disk_status_loaded && disk && disk.on_disk) {
const disabled = isActive || isSwapping;
const tip = isActive
? 'Currently loaded — switch to another model first'
: isSwapping
? 'A swap is in progress'
: 'Delete weights from disk';
trashBtn = `<button class="icon-btn danger" data-disk-del-key="${key}" title="${escapeHtml(tip)}" aria-label="Delete from disk" ${disabled ? 'disabled' : ''}>${trashIcon}</button>`;
}
// Primary card action: "Switch to this" (green) when on disk; "Download" (blue) when not.
// Before disk-status loads we render the swap button as a sensible default.
const isOnDisk = !state.disk_status_loaded || (disk && disk.on_disk);
const dlInFlight = !!(typeof dlState !== 'undefined' && dlState && dlState.job_id);
let primaryBtn = '';
if (isActive) {
primaryBtn = `<button class="btn" disabled>Current</button>`;
} else if (isOnDisk) {
primaryBtn = `<button class="btn primary" data-swap-key="${key}" ${isSwapping ? 'disabled' : ''}>Switch to this</button>`;
} else {
const tip = dlInFlight ? 'A download is already in progress' : 'Download weights to the Spark(s)';
primaryBtn = `<button class="btn info" data-download-key="${key}" title="${escapeHtml(tip)}" ${dlInFlight ? 'disabled' : ''}>Download</button>`;
}
card.innerHTML = `
<div class="name">${escapeHtml(m.display_name)}</div>
<div class="meta">
<span class="tag mode-${m.mode}">${m.mode}</span>
<span class="tag">${m.size_gb} GB</span>
${customPill}
${diskPill}
${(m.capabilities || []).map(c => `<span class="tag cap">${escapeHtml(c)}</span>`).join('')}
</div>
${desc}
@@ -110,10 +70,11 @@ function renderCards() {
</div>
<div class="spacer"></div>
<div class="card-actions">
${primaryBtn}
<button class="btn ${isActive ? '' : 'primary'}" data-swap-key="${key}" ${isActive || isSwapping ? 'disabled' : ''}>
${isActive ? 'Current' : 'Switch to this'}
</button>
<button class="btn test-btn" data-test-key="${key}" title="Pre-flight check the launch command without starting the engine">Test</button>
<button class="btn adv-btn" data-adv-key="${key}" title="Advanced settings">Advanced</button>
${trashBtn}
</div>
<div class="test-result hidden" data-test-result-for="${key}"></div>
`;
@@ -122,22 +83,14 @@ function renderCards() {
for (const btn of root.querySelectorAll('[data-swap-key]')) {
btn.addEventListener('click', () => triggerSwap(btn.dataset.swapKey));
}
for (const btn of root.querySelectorAll('[data-download-key]')) {
btn.addEventListener('click', () => triggerDownloadForKey(btn.dataset.downloadKey));
}
for (const btn of root.querySelectorAll('[data-adv-key]')) {
btn.addEventListener('click', () => openAdvanced(btn.dataset.advKey));
}
for (const btn of root.querySelectorAll('[data-test-key]')) {
btn.addEventListener('click', () => testLaunch(btn.dataset.testKey, btn));
}
for (const btn of root.querySelectorAll('[data-disk-del-key]')) {
btn.addEventListener('click', () => openDiskDeleteDialog(btn.dataset.diskDelKey));
}
}
const trashIcon = '<svg viewBox="0 0 24 24" width="14" height="14" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true"><polyline points="3 6 5 6 21 6"></polyline><path d="M19 6l-1 14a2 2 0 0 1-2 2H8a2 2 0 0 1-2-2L5 6"></path><path d="M10 11v6"></path><path d="M14 11v6"></path><path d="M9 6V4a2 2 0 0 1 2-2h2a2 2 0 0 1 2 2v2"></path></svg>';
async function testLaunch(key, btn) {
const resultEl = document.querySelector(`[data-test-result-for="${key}"]`);
if (!resultEl) return;
@@ -460,35 +413,6 @@ async function renderServices() {
const restartsRow = s.restart_count != null && s.restart_count > 1
? `<div class="row"><span class="k">Restarts</span><span class="v">${s.restart_count}</span></div>`
: '';
const dh = state.deep_health?.[name];
let deepRow = '';
if (dh && dh.last) {
const last = dh.last;
const when = (last.at || '').slice(11, 19); // HH:MM:SS
const verdict = last.ok
? `<span class="dh-ok">deep check ok</span>`
: `<span class="dh-fail">deep check FAILED</span>`;
const lat = last.latency_ms != null ? ` <span class="muted">${last.latency_ms} ms</span>` : '';
const restarts = dh.auto_restarts_window > 0
? ` <span class="muted">· ${dh.auto_restarts_window} auto-restart${dh.auto_restarts_window === 1 ? '' : 's'} in 30 min</span>`
: '';
deepRow = `
<div class="row deep-row">
<span class="k">Deep</span>
<span class="v deep-v">${verdict} <span class="muted small">${escapeHtml(when)}</span>${lat}${restarts}</span>
<button class="icon-btn dh-run-btn" data-dh-run="${escapeHtml(name)}" title="Run deep check now">↻</button>
</div>
${last.ok ? '' : `<div class="deep-error muted small">${escapeHtml((last.error || last.note || '').slice(0, 200))}</div>`}
`;
} else if (dh) {
deepRow = `
<div class="row deep-row">
<span class="k">Deep</span>
<span class="v muted-v">no probe yet</span>
<button class="icon-btn dh-run-btn" data-dh-run="${escapeHtml(name)}" title="Run deep check now">↻</button>
</div>
`;
}
card.innerHTML = `
<div class="head">
<span class="name">${escapeHtml(name)}</span>
@@ -499,7 +423,6 @@ async function renderServices() {
${urlRow}
${modelRow}
${restartsRow}
${deepRow}
<div class="service-actions">
<button class="btn" data-svc-action="${name}:start" ${disable('start') ? 'disabled' : ''}>Start</button>
<button class="btn" data-svc-action="${name}:restart" ${disable('restart') ? 'disabled' : ''}>Restart</button>
@@ -511,268 +434,6 @@ async function renderServices() {
for (const btn of grid.querySelectorAll('.btn[data-svc-action]')) {
btn.addEventListener('click', () => onServiceAction(btn.dataset.svcAction));
}
for (const btn of grid.querySelectorAll('[data-dh-run]')) {
btn.addEventListener('click', () => onDeepHealthRun(btn.dataset.dhRun, btn));
}
}
async function onDeepHealthRun(name, btn) {
btn.disabled = true;
const orig = btn.textContent;
btn.textContent = '…';
try {
await fetchJSON(`/api/deep-health/${encodeURIComponent(name)}/run`, { method: 'POST' });
} catch (e) {
console.warn('deep-health run failed', e);
} finally {
try { state.deep_health = await fetchJSON('/api/deep-health'); } catch {}
btn.textContent = orig;
btn.disabled = false;
renderServices();
}
}
// ===================== speech-model patches (v0.11) =====================
async function renderSpeechModels() {
const panel = el('#speech-models-panel');
const card = el('#speech-models-card');
if (!panel || !card) return;
let data;
try {
data = await fetchJSON('/api/speech-models');
} catch (e) {
// If parakeet host isn't even configured, hide the section entirely
panel.classList.add('hidden');
return;
}
if (!data || !data.patches) { panel.classList.add('hidden'); return; }
panel.classList.remove('hidden');
const patches = data.patches || {};
const health = data.container_health || {};
const status = patches.status || 'unknown';
let statusPill;
if (status === 'in_sync') {
statusPill = `<span class="tag ok">patches in sync</span>`;
} else if (status === 'drift') {
statusPill = `<span class="tag warn">spark-control has newer patches</span>`;
} else if (status === 'missing') {
statusPill = `<span class="tag bad">patches missing in container</span>`;
} else {
statusPill = `<span class="tag warn">unknown</span>`;
}
const asrLoaded = !!health.asr_loaded;
const diarLoaded = !!health.diarizer_loaded;
const asrModel = escapeHtml(health.model || '—');
const diarModel = escapeHtml(health.diarizer_model || '—');
const fileRows = (patches.files || []).map((f) => {
const sync = f.in_sync
? '<span class="sm-file-ok">✓ in sync</span>'
: f.remote_sha == null
? '<span class="sm-file-bad">✗ missing</span>'
: '<span class="sm-file-warn">⚠ drift</span>';
const local = f.local_sha ? `<code>${escapeHtml(f.local_sha)}</code>` : '<span class="muted">—</span>';
const remote = f.remote_sha ? `<code>${escapeHtml(f.remote_sha)}</code>` : '<span class="muted">—</span>';
return `
<div class="sm-file-row">
<span class="sm-file-name"><code>${escapeHtml(f.name)}</code></span>
<span class="sm-file-sync">${sync}</span>
<span class="sm-file-sha muted small">local ${local} → remote ${remote}</span>
</div>
`;
}).join('');
const lastReapply = patches.last_reapply_at ? new Date(patches.last_reapply_at).toLocaleString() : 'never (since spark-control boot)';
const lastRestart = patches.last_restart_at ? new Date(patches.last_restart_at).toLocaleString() : 'never (since spark-control boot)';
card.innerHTML = `
<div class="sm-header">
<div class="sm-title">parakeet-asr container</div>
${statusPill}
</div>
<div class="sm-models">
<div class="sm-model-row">
<span class="sm-model-kind">Parakeet ASR</span>
<span class="sm-model-name">${asrModel}</span>
<span class="sm-model-loaded">${asrLoaded ? '<span class="tag ok">loaded</span>' : '<span class="tag bad">not loaded</span>'}</span>
</div>
<div class="sm-model-row">
<span class="sm-model-kind">Sortformer diarizer</span>
<span class="sm-model-name">${diarModel}</span>
<span class="sm-model-loaded">${diarLoaded ? '<span class="tag ok">loaded</span>' : '<span class="tag bad">not loaded</span>'}</span>
</div>
</div>
<div class="sm-files">${fileRows}</div>
<div class="sm-meta muted small">
Last reapply: ${escapeHtml(lastReapply)} · Last manual restart: ${escapeHtml(lastRestart)}
</div>
<div class="sm-actions">
<button class="btn primary" id="sm-reapply">Reapply patches</button>
<button class="btn" id="sm-restart">Restart container</button>
</div>
`;
el('#sm-reapply').addEventListener('click', onSpeechModelsReapply);
el('#sm-restart').addEventListener('click', onSpeechModelsRestart);
}
async function onSpeechModelsReapply() {
if (!confirm('Reapply Sortformer patches to the parakeet-asr container? The container will restart and both ASR + diarizer will be unavailable for ~60120 seconds.')) return;
const dlg = el('#speech-models-progress-dialog');
const steps = el('#sm-prog-steps');
const closeBtn = el('#sm-prog-close');
steps.innerHTML = '<div class="muted small">Starting…</div>';
closeBtn.disabled = true;
closeBtn.onclick = () => dlg.close();
dlg.showModal();
try {
const r = await fetchJSON('/api/speech-models/reapply', { method: 'POST' });
steps.innerHTML = (r.steps || []).map((s) => {
const mark = s.ok ? '<span class="sm-file-ok">✓</span>' : '<span class="sm-file-bad">✗</span>';
const extra = s.error ? `<div class="muted small">${escapeHtml(s.error)}</div>` : '';
return `<div class="sm-prog-step">${mark} <strong>${escapeHtml(s.step)}</strong>${s.name ? ` (${escapeHtml(s.name)})` : ''}${extra}</div>`;
}).join('') + `<div class="sm-prog-done sm-file-ok">Done — both models reloaded.</div>`;
} catch (e) {
let parsed = null;
try { parsed = JSON.parse(e.message.split(':').slice(2).join(':').trim()); } catch {}
const stepHtml = parsed && parsed.result && parsed.result.steps
? parsed.result.steps.map((s) => {
const mark = s.ok ? '<span class="sm-file-ok">✓</span>' : '<span class="sm-file-bad">✗</span>';
return `<div class="sm-prog-step">${mark} <strong>${escapeHtml(s.step)}</strong>${s.name ? ` (${escapeHtml(s.name)})` : ''}${s.error ? `<div class="muted small">${escapeHtml(s.error)}</div>` : ''}</div>`;
}).join('')
: `<div class="sm-file-bad">${escapeHtml(e.message)}</div>`;
steps.innerHTML = stepHtml + `<div class="sm-prog-done sm-file-bad">Failed.</div>`;
} finally {
closeBtn.disabled = false;
try { await renderSpeechModels(); } catch {}
}
}
async function onSpeechModelsRestart() {
if (!confirm('Restart parakeet-asr container? STT + diarization will be unavailable for ~30 seconds.')) return;
try {
await fetchJSON('/api/speech-models/restart', { method: 'POST' });
} catch (e) {
alert('Restart failed: ' + e.message);
} finally {
try { await renderSpeechModels(); } catch {}
}
}
// ===================== WhisperX install (v0.12) =====================
const wxState = {
job_id: null,
eventsource: null,
timer_handle: null,
started_at: null,
};
async function renderWhisperXBanner() {
const card = el('#whisperx-install-card');
if (!card) return;
let status;
try {
status = await fetchJSON('/api/whisperx/status');
} catch {
card.classList.add('hidden');
return;
}
if (status.installed && status.healthy) {
card.classList.add('hidden');
} else if (status.configured) {
card.classList.remove('hidden');
} else {
card.classList.add('hidden');
}
}
async function onWhisperXInstall() {
if (wxState.job_id) {
// Just re-attach to the running job
showWhisperXDialog();
return;
}
if (!confirm('Install WhisperX on Spark 2? This builds a new Docker image (~1015 min first time, mostly downloading pyannote + whisper weights). Parakeet/Magpie stay untouched.')) return;
try {
const r = await fetchJSON('/api/whisperx/install', { method: 'POST' });
attachToWhisperXInstall(r.job_id);
} catch (e) {
alert('Failed to start WhisperX install: ' + e.message);
}
}
function showWhisperXDialog() {
el('#whisperx-progress-dialog').showModal();
}
function attachToWhisperXInstall(jobId) {
wxState.job_id = jobId;
el('#wx-prog-title').textContent = 'Installing WhisperX…';
el('#wx-prog-phase').textContent = 'Starting…';
el('#wx-prog-log').textContent = '';
showWhisperXDialog();
// Tick a timer
wxState.started_at = Date.now();
if (wxState.timer_handle) clearInterval(wxState.timer_handle);
wxState.timer_handle = setInterval(() => {
const sec = Math.max(0, Math.floor((Date.now() - wxState.started_at) / 1000));
const m = Math.floor(sec / 60);
el('#wx-prog-elapsed').textContent = `${m}:${(sec % 60).toString().padStart(2, '0')}`;
}, 500);
// Backfill snapshot then connect SSE
fetchJSON(`/api/whisperx/install/${jobId}`).then((snap) => {
el('#wx-prog-phase').textContent = snap.phase || 'Working…';
el('#wx-prog-log').textContent = (snap.lines || []).join('\n');
el('#wx-prog-log').scrollTop = el('#wx-prog-log').scrollHeight;
if (snap.finished_at) {
handleWhisperXDone(snap);
return;
}
const es = new EventSource(`/api/whisperx/install/${jobId}/stream`);
wxState.eventsource = es;
es.onmessage = (ev) => {
try {
const log = el('#wx-prog-log');
log.textContent += JSON.parse(ev.data).line + '\n';
log.scrollTop = log.scrollHeight;
} catch {}
};
es.addEventListener('phase', (ev) => {
try { el('#wx-prog-phase').textContent = JSON.parse(ev.data).phase; } catch {}
});
es.addEventListener('done', (ev) => {
try { handleWhisperXDone(JSON.parse(ev.data)); } catch {}
es.close();
wxState.eventsource = null;
});
es.onerror = () => { es.close(); wxState.eventsource = null; };
}).catch(() => {});
}
function handleWhisperXDone(d) {
if (wxState.timer_handle) { clearInterval(wxState.timer_handle); wxState.timer_handle = null; }
wxState.job_id = null;
const rc = d.returncode;
if (d.state === 'failed' || (rc !== 0 && rc != null)) {
el('#wx-prog-title').textContent = `WhisperX install failed (rc=${rc})`;
el('#wx-prog-phase').textContent = 'Failed — check the build log below';
} else {
el('#wx-prog-title').textContent = 'WhisperX installed';
el('#wx-prog-phase').textContent = 'Ready ✓ — appears in Always-on services below';
// Refresh services + banner state
setTimeout(() => {
renderServices();
renderWhisperXBanner();
}, 1000);
}
}
async function onServiceAction(key) {
@@ -879,64 +540,6 @@ function renderHealth(status) {
function renderBanner(status) {
el('#setup-banner').classList.toggle('hidden', !!status.configured);
// Dashboard tabs share the same "configured" gate as the rest of the
// body — hidden until SSH is set up, then visible.
const tabs = el('#dashboard-tabs');
if (tabs) tabs.classList.toggle('hidden', !status.configured);
}
// ===================== dashboard tabs (LLM / Audio) =====================
const TABS_STORAGE_KEY = 'sparkcontrol.dashboard.activeTab';
function setupDashboardTabs() {
const buttons = $$('.dashboard-tab');
if (!buttons.length) return;
// Restore the last-selected tab, default to "llm"
let saved;
try { saved = localStorage.getItem(TABS_STORAGE_KEY); } catch {}
const initial = saved === 'audio' || saved === 'llm' ? saved : 'llm';
function selectTab(name) {
buttons.forEach((b) => {
const active = b.dataset.tab === name;
b.classList.toggle('active', active);
b.setAttribute('aria-selected', active ? 'true' : 'false');
});
$$('.tab-content').forEach((c) => {
c.classList.toggle('active', c.id === `tab-${name}`);
});
try { localStorage.setItem(TABS_STORAGE_KEY, name); } catch {}
}
buttons.forEach((b) => {
b.addEventListener('click', () => selectTab(b.dataset.tab));
});
selectTab(initial);
}
// ===================== collapsible endpoint card =====================
const ENDPOINT_COLLAPSED_KEY = 'sparkcontrol.endpoint.collapsed';
function setupEndpointCollapse() {
const panel = el('#endpoint-panel');
const btn = el('#ep-collapse');
if (!panel || !btn) return;
// Default: collapsed (most of the time you don't need to see endpoint details)
let collapsed = true;
try {
const v = localStorage.getItem(ENDPOINT_COLLAPSED_KEY);
if (v === 'false') collapsed = false;
else if (v === 'true') collapsed = true;
} catch {}
panel.classList.toggle('collapsed', collapsed);
btn.addEventListener('click', () => {
const nowCollapsed = !panel.classList.contains('collapsed');
panel.classList.toggle('collapsed', nowCollapsed);
try { localStorage.setItem(ENDPOINT_COLLAPSED_KEY, nowCollapsed ? 'true' : 'false'); } catch {}
});
}
function renderSwapPanel() {
@@ -1065,7 +668,6 @@ async function pollStatus() {
// Refresh services state lazily — every 5s poll triggers this too.
try {
state.services = await fetchJSON('/api/services');
try { state.deep_health = await fetchJSON('/api/deep-health'); } catch {}
renderServices();
} catch {}
if (status.current_swap_job && status.current_swap_job !== state.swap_job_id) {
@@ -1086,78 +688,6 @@ async function loadModels() {
state.models = data.models || {};
}
async function loadDiskStatus() {
// Probes each catalog model's HF cache over SSH; takes a beat. Best-effort.
try {
const r = await fetchJSON('/api/models/disk-status');
if (r && r.models) {
state.disk_status = r.models;
state.disk_status_loaded = true;
renderCards();
}
} catch (e) {
// Silent — pills just won't render. Don't block dashboard.
console.warn('disk-status probe failed:', e.message);
}
}
function fmtBytesShort(n) {
if (!Number.isFinite(n) || n <= 0) return '0 B';
if (n >= 1e9) return `${(n / 1e9).toFixed(1)} GB`;
if (n >= 1e6) return `${(n / 1e6).toFixed(1)} MB`;
if (n >= 1e3) return `${(n / 1e3).toFixed(1)} KB`;
return `${n} B`;
}
function openDiskDeleteDialog(key) {
const m = state.models[key];
const disk = state.disk_status[key];
if (!m || !disk || !disk.on_disk) return;
const dlg = el('#disk-delete-dialog');
el('#dd-summary').innerHTML = `Free <strong>${fmtBytesShort(disk.total_bytes)}</strong> by removing <strong>${escapeHtml(m.display_name)}</strong> (<code>${escapeHtml(m.repo)}</code>) from disk.`;
const hostsEl = el('#dd-hosts');
hostsEl.innerHTML = '';
for (const h of (disk.per_host || [])) {
if (!h.on_disk) continue;
const li = document.createElement('li');
li.innerHTML = `<code>${escapeHtml(h.host)}</code> — ${fmtBytesShort(h.size_bytes)}`;
hostsEl.appendChild(li);
}
const errEl = el('#dd-error');
errEl.classList.add('hidden');
errEl.textContent = '';
const confirm = el('#dd-confirm');
const cancel = el('#dd-cancel');
const onCancel = () => dlg.close();
const onConfirm = async () => {
confirm.disabled = true;
cancel.disabled = true;
confirm.textContent = 'Deleting…';
try {
const r = await fetchJSON(`/api/models/${encodeURIComponent(key)}/disk`, { method: 'DELETE' });
dlg.close();
// Optimistically clear local disk state for this key, then refresh.
delete state.disk_status[key];
renderCards();
// Eagerly re-probe so size is accurate (and shows "not downloaded" pill).
loadDiskStatus();
const freed = r && typeof r.bytes_freed === 'number' ? fmtBytesShort(r.bytes_freed) : '';
console.log(`Deleted ${m.display_name} from disk${freed ? ` — freed ${freed}` : ''}.`);
} catch (e) {
errEl.textContent = e.message || 'Delete failed';
errEl.classList.remove('hidden');
} finally {
confirm.disabled = false;
cancel.disabled = false;
confirm.textContent = 'Delete from disk';
}
};
cancel.onclick = onCancel;
confirm.onclick = onConfirm;
dlg.showModal();
}
async function triggerSwap(modelKey) {
if (state.swap_job_id) return;
try {
@@ -1172,38 +702,6 @@ async function triggerSwap(modelKey) {
}
}
async function triggerDownloadForKey(modelKey) {
const m = state.models[modelKey];
if (!m) return;
if (dlState.job_id) {
alert('A download is already in progress; wait for it to finish.');
return;
}
// Pick the download target from the model's mode:
// solo -> spark1 only
// cluster -> both Sparks (fetch on Spark 1, rsync to Spark 2 in parallel)
const dlMode = m.mode === 'cluster' ? 'cluster' : 'spark1';
const sizeNote = m.size_gb ? ` (~${m.size_gb} GB)` : '';
const target = m.mode === 'cluster' ? 'both Sparks' : 'Spark 1';
if (!confirm(`Download "${m.display_name}"${sizeNote} to ${target}? Large models can take a while; you can watch progress in the download panel.`)) {
return;
}
dlState.last_repo = m.repo;
dlState.last_mode = dlMode;
try {
const r = await fetchJSON('/api/download', {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({ repo: m.repo, mode: dlMode }),
});
// Open the download panel + attach to progress stream
openDownloadForm();
attachToDownload(r.job_id);
} catch (e) {
alert('Failed to start download: ' + e.message);
}
}
async function attachToSwap(jobId, needsBackfill) {
if (state.swap_eventsource) {
state.swap_eventsource.close();
@@ -1969,30 +1467,14 @@ async function init() {
a.classList.remove('hidden');
}
} catch {}
setupDashboardTabs();
setupEndpointCollapse();
// WhisperX install button
const wxBtn = el('#wx-install');
if (wxBtn) wxBtn.addEventListener('click', onWhisperXInstall);
const wxCloseBtn = el('#wx-prog-close');
if (wxCloseBtn) wxCloseBtn.addEventListener('click', () => el('#whisperx-progress-dialog').close());
await loadModels();
await pollStatus();
await renderServices();
pollHardware();
pollUpdates();
// Disk-status probe runs after first paint — slow over SSH and not blocking.
loadDiskStatus();
// Speech-model patches panel — slow over SSH, runs after first paint.
renderSpeechModels();
// WhisperX install banner — show only when not yet installed/healthy.
renderWhisperXBanner();
setInterval(pollStatus, 5000);
setInterval(pollHardware, 8000); // every 8s
setInterval(pollUpdates, 300000); // every 5 min
setInterval(loadDiskStatus, 60000); // every 60s — disk state changes rarely
setInterval(renderSpeechModels, 120000); // every 2 min — patches change rarely
setInterval(renderWhisperXBanner, 60000); // every 60s — auto-hides banner after install
}
init();
+2 -100
View File
@@ -44,14 +44,8 @@
</dialog>
</section>
<section id="endpoint-panel" class="endpoint-panel hidden collapsed">
<div class="ep-header">
<div class="ep-title muted small">OpenAI-compatible endpoint</div>
<button type="button" class="icon-btn ep-collapse-btn" id="ep-collapse" title="Show / hide endpoint details" aria-label="Toggle endpoint details">
<svg viewBox="0 0 24 24" width="14" height="14" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true"><polyline points="6 9 12 15 18 9"></polyline></svg>
</button>
</div>
<div class="ep-body">
<section id="endpoint-panel" class="endpoint-panel hidden">
<div class="ep-title muted small">OpenAI-compatible endpoint</div>
<div class="ep-row">
<span class="ep-label">Base URL</span>
<code class="ep-value copyable" id="ep-url" data-copy-self title="Click to copy"></code>
@@ -73,7 +67,6 @@
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2"/><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"/></svg>
</button>
</details>
</div><!-- /.ep-body -->
</section>
<section id="swap-panel" class="swap-panel hidden">
@@ -96,53 +89,6 @@
</details>
</section>
<nav id="dashboard-tabs" class="dashboard-tabs hidden" role="tablist">
<button type="button" class="dashboard-tab" data-tab="llm" role="tab" aria-selected="true">LLM</button>
<button type="button" class="dashboard-tab" data-tab="audio" role="tab" aria-selected="false">Audio / Speech</button>
</nav>
<div class="tab-content" id="tab-audio" role="tabpanel" aria-labelledby="tab-audio-trigger">
<section id="whisperx-install-card" class="whisperx-install hidden">
<div class="wx-install-body">
<div class="wx-install-title">
<strong>Add WhisperX</strong>
<span class="tag ok">recommended</span>
</div>
<p class="muted small">
WhisperX is a single-container speech pipeline (faster-whisper for transcription + pyannote 3.1 for diarization)
designed to handle long audio cleanly. Replaces the Parakeet + Sortformer combo we patched together,
which crashed on a 90-min meeting. Pulled and built directly on Spark 2 (~1015 min first time;
you only do this once).
</p>
<p class="muted small">
Requires a Hugging Face token at <code>~/.cache/huggingface/token</code> on Spark 2 (already set up).
</p>
<div class="wx-install-actions">
<button id="wx-install" class="btn primary">Install WhisperX</button>
</div>
</div>
</section>
<dialog id="whisperx-progress-dialog" class="modal">
<form method="dialog" class="modal-form">
<h3 id="wx-prog-title">Installing WhisperX…</h3>
<div class="phase-row">
<span class="spinner"></span>
<div class="phase" id="wx-prog-phase">Starting…</div>
<span class="spacer"></span>
<span class="timer" id="wx-prog-elapsed">0:00</span>
</div>
<details open>
<summary class="muted small">Build log</summary>
<pre id="wx-prog-log" class="log"></pre>
</details>
<div class="modal-actions">
<button type="button" id="wx-prog-close" class="btn">Close</button>
</div>
</form>
</dialog>
<section id="services-panel" class="services hidden">
<div class="section-header">
<h2 class="section-title">Always-on services</h2>
@@ -206,34 +152,6 @@
</dialog>
</section>
<section id="speech-models-panel" class="speech-models hidden">
<div class="section-header">
<h2 class="section-title">Speech model patches</h2>
</div>
<p class="muted small sm-blurb">
Spark Control adds Sortformer speaker diarization to the third-party Parakeet ASR
container via two Python overlays (<code>diarizer.py</code> + a patched <code>main.py</code>).
Overlays survive container restart but not a fresh redeploy — if the parakeet container is
ever rebuilt, click <strong>Reapply patches</strong> below to restore them.
</p>
<div id="speech-models-card" class="speech-models-card"></div>
<dialog id="speech-models-progress-dialog" class="modal">
<form method="dialog" class="modal-form">
<h3>Reapplying speech-model patches…</h3>
<p class="muted small">Copying overlays into the parakeet container, verifying syntax, restarting, waiting for both models to load. Takes ~60120 s.</p>
<div id="sm-prog-steps" class="sm-prog-steps"></div>
<div class="modal-actions">
<button type="button" id="sm-prog-close" class="btn" disabled>Close</button>
</div>
</form>
</dialog>
</section>
</div><!-- /#tab-audio -->
<div class="tab-content" id="tab-llm" role="tabpanel" aria-labelledby="tab-llm-trigger">
<section id="models-section">
<div class="section-header">
<h2 class="section-title">LLM swap</h2>
@@ -270,20 +188,6 @@
</form>
</dialog>
<dialog id="disk-delete-dialog" class="modal">
<form method="dialog" class="modal-form">
<h3>Delete model weights from disk?</h3>
<p id="dd-summary" class="muted small"></p>
<ul class="muted small dd-hosts" id="dd-hosts"></ul>
<p class="muted small">This is reversible — you can re-download from the catalog at any time. The catalog entry stays intact.</p>
<p id="dd-error" class="muted small dd-error hidden"></p>
<div class="modal-actions">
<button type="button" id="dd-cancel" class="btn">Cancel</button>
<button type="button" id="dd-confirm" class="btn danger">Delete from disk</button>
</div>
</form>
</dialog>
<dialog id="advanced-dialog" class="modal">
<form method="dialog" class="modal-form" id="advanced-form">
<h3 id="adv-title">Advanced settings</h3>
@@ -386,8 +290,6 @@
</div>
</section>
</div><!-- /#tab-llm -->
<footer class="footer">
<div class="health">
<span class="health-item" id="h-vllm"><span class="dot"></span> vLLM</span>
+3 -182
View File
@@ -622,19 +622,6 @@ main {
.service-card .row .v.copyable.copied { outline: 1px solid var(--accent); background: rgba(74, 222, 128, 0.05); }
.service-card .row .icon-btn { padding: 3px 6px; }
.service-card .row .icon-btn svg { width: 12px; height: 12px; }
.service-card .deep-row .deep-v { display: flex; align-items: center; gap: 6px; font-family: inherit; flex-wrap: wrap; }
.service-card .dh-ok { color: var(--accent); }
.service-card .dh-fail { color: var(--error); font-weight: 500; }
.service-card .dh-run-btn { font-family: inherit; }
.service-card .deep-error {
padding: 4px 8px;
background: rgba(239, 68, 68, 0.06);
border-left: 2px solid var(--error);
border-radius: 4px;
font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
font-size: 11px;
word-break: break-word;
}
.service-actions {
display: flex;
@@ -687,27 +674,21 @@ main {
border: 1px solid var(--border);
padding: 2px 8px;
border-radius: 999px;
font-size: 12px;
font-size: 11px;
}
.tag.mode-cluster { color: var(--info); border-color: rgba(96, 165, 250, 0.4); }
.tag.mode-solo { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.tag.cap { color: var(--muted); }
/* Semantic status pills — reuse .tag sizing so every pill on the page
renders at the same 11px / 2px×8px footprint. */
.tag.ok { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.tag.warn { color: var(--warn); border-color: rgba(245, 158, 11, 0.4); }
.tag.bad { color: var(--error); border-color: rgba(239, 68, 68, 0.4); }
.btn {
appearance: none;
border: 1px solid var(--border);
background: var(--surface-2);
color: var(--text);
padding: 6px 12px;
padding: 8px 14px;
border-radius: 8px;
cursor: pointer;
font: inherit;
font-size: 12px;
font-weight: 500;
transition: background 0.15s, border-color 0.15s, opacity 0.15s;
}
@@ -717,23 +698,12 @@ main {
.btn:disabled { opacity: 0.45; cursor: not-allowed; }
.btn.danger { color: var(--error); border-color: rgba(239, 68, 68, 0.3); }
.btn.danger:hover:not(:disabled) { background: rgba(239, 68, 68, 0.08); border-color: var(--error); }
.btn.info { background: var(--info); color: #0a1e3d; border-color: var(--info); }
.btn.info:hover:not(:disabled) { background: #82baff; border-color: #82baff; }
.card.active .btn { background: rgba(74, 222, 128, 0.12); color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.card-actions { display: flex; gap: 6px; }
.card-actions .btn.primary,
.card-actions .btn.info { flex: 1; }
.card-actions .btn.primary { flex: 1; }
.card .adv-btn,
.card .test-btn { padding: 8px 12px; font-size: 12px; }
.card .custom-pill { color: var(--info); border-color: rgba(96, 165, 250, 0.4); }
.tag.on-disk { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.tag.not-on-disk { color: var(--muted); border-color: var(--border); opacity: 0.7; }
.card-actions .icon-btn.danger { color: var(--error); border-color: rgba(239, 68, 68, 0.3); margin-left: auto; }
.card-actions .icon-btn.danger:hover:not(:disabled) { background: rgba(239, 68, 68, 0.08); border-color: var(--error); color: var(--error); }
.card-actions .icon-btn.danger:disabled { opacity: 0.35; cursor: not-allowed; }
.dd-hosts { padding-left: 18px; margin: 4px 0 8px; }
.dd-hosts code { background: var(--surface-2); padding: 1px 5px; border-radius: 4px; }
.dd-error { color: var(--error); }
.test-result {
font-size: 12px;
@@ -770,152 +740,3 @@ main {
main { padding: 16px 14px 80px; }
.cards { grid-template-columns: 1fr; }
}
/* ===== Speech model patches (v0.11) ===== */
.speech-models { margin-top: 28px; }
.sm-blurb { max-width: 880px; margin-bottom: 14px; }
.sm-blurb code {
background: var(--surface-2);
padding: 1px 6px;
border-radius: 4px;
font-size: 12px;
}
.speech-models-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 10px;
padding: 16px;
display: flex;
flex-direction: column;
gap: 14px;
}
.sm-header {
display: flex;
align-items: center;
gap: 10px;
}
.sm-title {
font-weight: 600;
color: var(--text);
}
/* .sm-pill removed in v0.11.0:1 — speech-models pills now reuse the shared
.tag styling (+ .tag.ok / .tag.warn / .tag.bad color modifiers) so every
pill on the page renders identically. */
.sm-models { display: flex; flex-direction: column; gap: 6px; }
.sm-model-row {
display: grid;
grid-template-columns: 160px 1fr auto;
align-items: center;
gap: 12px;
padding: 6px 0;
border-top: 1px solid var(--border);
}
.sm-model-row:first-child { border-top: none; }
.sm-model-kind { color: var(--muted); font-size: 13px; }
.sm-model-name { font-family: ui-monospace, monospace; font-size: 12px; word-break: break-all; }
.sm-files { display: flex; flex-direction: column; gap: 4px; }
.sm-file-row {
display: grid;
grid-template-columns: 160px 100px 1fr;
gap: 12px;
font-size: 12px;
padding: 4px 0;
}
.sm-file-name code {
background: var(--surface-2);
padding: 1px 6px;
border-radius: 4px;
}
.sm-file-ok { color: var(--accent); }
.sm-file-warn { color: var(--warn); }
.sm-file-bad { color: var(--error); }
.sm-file-sha code {
background: var(--surface-2);
padding: 1px 4px;
border-radius: 3px;
font-size: 11px;
}
.sm-meta { margin-top: 4px; }
.sm-actions { display: flex; gap: 10px; }
.sm-prog-steps {
display: flex;
flex-direction: column;
gap: 6px;
margin: 12px 0;
font-size: 13px;
}
.sm-prog-step {
padding: 6px 10px;
background: var(--surface-2);
border-radius: 6px;
}
.sm-prog-done {
font-weight: 600;
margin-top: 8px;
}
/* ===== Collapsible endpoint card (v0.11.0:1) ===== */
.endpoint-panel .ep-header {
display: flex;
align-items: center;
gap: 10px;
}
.endpoint-panel .ep-title { flex: 1; margin: 0; }
.endpoint-panel .ep-collapse-btn {
flex-shrink: 0;
transition: transform 0.2s;
}
.endpoint-panel.collapsed .ep-body { display: none; }
.endpoint-panel.collapsed .ep-collapse-btn svg { transform: rotate(-90deg); }
.endpoint-panel:not(.collapsed) .ep-header { margin-bottom: 10px; }
/* ===== Dashboard tabs (LLM / Audio) (v0.11.0:1) ===== */
.dashboard-tabs {
display: flex;
gap: 4px;
margin-top: 8px;
margin-bottom: 16px;
border-bottom: 1px solid var(--border);
padding: 0 2px;
}
.dashboard-tab {
appearance: none;
background: transparent;
border: 1px solid transparent;
border-bottom: none;
color: var(--muted);
padding: 8px 16px;
border-radius: 6px 6px 0 0;
cursor: pointer;
font: inherit;
font-size: 14px;
font-weight: 500;
margin-bottom: -1px;
transition: color 0.15s, background 0.15s, border-color 0.15s;
}
.dashboard-tab:hover { color: var(--text); }
.dashboard-tab.active {
color: var(--text);
background: var(--surface);
border-color: var(--border);
border-bottom: 1px solid var(--surface);
}
.tab-content { display: none; }
.tab-content.active { display: block; }
/* ===== WhisperX install banner (v0.12) ===== */
.whisperx-install {
background: var(--surface);
border: 1px solid var(--info);
border-radius: var(--radius);
padding: 16px 18px;
margin-bottom: 20px;
}
.wx-install-body { display: flex; flex-direction: column; gap: 10px; }
.wx-install-title { display: flex; align-items: center; gap: 10px; }
.wx-install-title strong { font-size: 15px; color: var(--text); }
.wx-install-actions { display: flex; gap: 10px; margin-top: 4px; }
-267
View File
@@ -1,267 +0,0 @@
"""WhisperX install action — ships the build context from inside spark-control
to Spark 2 over SSH, then runs `docker build` + `docker run` on Spark 2 and
streams progress back as SSE.
Pattern mirrors NimManager (see nim.py) but for a locally-built container
rather than an `nvcr.io` pull. Build context lives at
/app/whisperx_container/ inside the spark-control Docker image (set up by
the Dockerfile COPY directive).
Endpoints:
POST /api/whisperx/install — kick off
GET /api/whisperx/install/{job_id} — snapshot
GET /api/whisperx/install/{job_id}/stream — SSE phase + log lines
GET /api/whisperx/status — installed + healthy?
"""
from __future__ import annotations
import asyncio
import shlex
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
import httpx
from .config import Settings
from .ssh import _base_args, ssh_run, ssh_stream, StreamHandle
# Build context shipped inside the spark-control image (Dockerfile COPYs it).
BUILD_CONTEXT_DIR = Path(__file__).resolve().parent.parent / "whisperx_container"
# Files we ship to Spark 2's build dir. Mapped local-name → remote-relative-path.
BUILD_FILES = {
"Dockerfile": "Dockerfile",
"requirements.txt": "requirements.txt",
"README.md": "README.md",
"app/main.py": "app/main.py",
}
@dataclass
class WhisperXInstallJob:
id: str
started_at: str
state: str = "starting" # starting | sending | building | running | done | failed
phase: str = "Starting…"
lines: list[str] = field(default_factory=list)
returncode: Optional[int] = None
finished_at: Optional[str] = None
def append(self, line: str) -> None:
self.lines.append(line)
if len(self.lines) > 1500:
del self.lines[: len(self.lines) - 1500]
class WhisperXInstaller:
def __init__(self, settings: Settings) -> None:
self.settings = settings
self.lock = asyncio.Lock()
self.jobs: dict[str, WhisperXInstallJob] = {}
self.current_job_id: Optional[str] = None
def get(self, job_id: str) -> WhisperXInstallJob | None:
return self.jobs.get(job_id)
async def status(self) -> dict:
"""Probe whether WhisperX is installed + healthy on its configured host."""
s = self.settings
host_present = bool(s.whisperx_host and s.whisperx_user)
if not host_present:
return {"configured": False, "installed": False, "healthy": False}
# Probe HTTP health
url = f"http://{s.whisperx_host}:{s.whisperx_port}/health"
try:
async with httpx.AsyncClient(timeout=3.0) as client:
r = await client.get(url)
if r.status_code == 200:
body = r.json()
return {
"configured": True,
"installed": True,
"healthy": True,
"model": body.get("model"),
"device": body.get("device"),
"diarizer_loaded": body.get("diarizer_loaded", False),
}
except Exception:
pass
# No HTTP — check if the container exists at all
container_present = await self._container_exists()
return {
"configured": True,
"installed": container_present,
"healthy": False,
"current_job_id": self.current_job_id,
}
async def _container_exists(self) -> bool:
s = self.settings
cmd = f"docker ps -a --filter name=^{s.whisperx_container}$ --format '{{{{.Names}}}}'"
rc, out, _ = await ssh_run(s.whisperx_host, s.whisperx_user, cmd, s, timeout=10)
return rc == 0 and s.whisperx_container in out
async def trigger(self) -> WhisperXInstallJob:
if self.lock.locked():
raise RuntimeError("a WhisperX install is already in progress")
s = self.settings
if not s.whisperx_host or not s.whisperx_user:
raise RuntimeError("whisperx host/user not configured")
for local_name in BUILD_FILES:
if not (BUILD_CONTEXT_DIR / local_name).exists():
raise RuntimeError(f"build context file missing inside spark-control image: {local_name}")
job = WhisperXInstallJob(
id=uuid.uuid4().hex[:8],
started_at=datetime.now(timezone.utc).isoformat(),
)
self.jobs[job.id] = job
self.current_job_id = job.id
asyncio.create_task(self._run(job))
return job
async def _run(self, job: WhisperXInstallJob) -> None:
async with self.lock:
try:
await self._do(job)
if job.state != "failed":
job.state = "done"
job.returncode = 0
job.phase = "Done — WhisperX is running on port 8002"
except Exception as e:
job.append(f"[error] {type(e).__name__}: {e}")
job.state = "failed"
if job.returncode is None:
job.returncode = 1
finally:
job.finished_at = datetime.now(timezone.utc).isoformat()
if self.current_job_id == job.id:
self.current_job_id = None
async def _ssh_pipe(self, host: str, user: str, remote_cmd: str,
payload: bytes, timeout: float = 60.0) -> tuple[bool, str, str]:
"""ssh user@host <remote_cmd> with payload piped to stdin."""
args = _base_args(self.settings) + [f"{user}@{host}", remote_cmd]
proc = await asyncio.create_subprocess_exec(
*args,
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:
stdout_b, stderr_b = await asyncio.wait_for(
proc.communicate(input=payload), timeout=timeout
)
except asyncio.TimeoutError:
proc.kill(); await proc.wait()
return False, "", f"timeout after {timeout}s"
return proc.returncode == 0, stdout_b.decode(errors="replace"), stderr_b.decode(errors="replace")
async def _do(self, job: WhisperXInstallJob) -> None:
s = self.settings
host = s.whisperx_host
user = s.whisperx_user
# NOTE: `~` does not expand inside shlex.quote() single-quotes (bit us
# in v0.12.0:0). Use a $HOME-relative path that the REMOTE shell
# expands; all path components are hardcoded so injection is moot.
build_dir_remote = "\"$HOME\"/whisperx-build"
build_dir_display = "~/whisperx-build"
# ── Phase 1: stage build context on Spark 2 ──
job.state = "sending"
job.phase = "Sending build context to Spark 2…"
job.append(f"$ ssh {user}@{host} 'mkdir -p {build_dir_display}/app'")
rc, out, err = await ssh_run(
host, user,
f"mkdir -p {build_dir_remote}/app && "
f"rm -f {build_dir_remote}/Dockerfile {build_dir_remote}/requirements.txt "
f"{build_dir_remote}/README.md {build_dir_remote}/app/main.py",
s, timeout=10,
)
if rc != 0:
job.append(f"[mkdir failed] {err.strip()}")
raise RuntimeError("failed to create build directory")
for local_name, remote_rel in BUILD_FILES.items():
local_path = BUILD_CONTEXT_DIR / local_name
body = local_path.read_bytes()
remote_path_for_shell = f"{build_dir_remote}/{remote_rel}"
# remote_rel is hardcoded ("Dockerfile" / "app/main.py" etc.) — safe
# to embed unquoted inside the double-quoted $HOME path.
cmd = f"cat > {remote_path_for_shell}"
ok, out, err = await self._ssh_pipe(host, user, cmd, body, timeout=30)
if not ok:
job.append(f"[scp {local_name} failed] {err.strip()[:200]}")
raise RuntimeError(f"failed to ship {local_name}")
job.append(f"{build_dir_display}/{remote_rel} ({len(body)} bytes)")
# ── Phase 2: docker build ──
job.state = "building"
job.phase = "Building Docker image on Spark 2 (this is the slow part — 515 min if base layers aren't cached)…"
build_cmd = (
f"set -e; "
f"cd {build_dir_remote}; "
f"echo '=== docker build -t {s.whisperx_container}:latest . ==='; "
f"docker build -t {s.whisperx_container}:latest ."
)
job.append(f"$ {build_cmd}")
handle = StreamHandle()
async for line in ssh_stream(host, user, build_cmd, s, handle=handle):
job.append(line)
if "Step " in line and "/" in line:
# docker build progress: "Step 5/10 : RUN pip install ..."
job.phase = f"Building: {line.strip()[:120]}"
elif "Successfully built" in line or "naming to" in line:
job.phase = "Image built — preparing to start container…"
if (handle.returncode or 0) != 0:
job.returncode = handle.returncode
raise RuntimeError(f"docker build failed (rc={handle.returncode})")
# ── Phase 3: docker run ──
job.state = "running"
job.phase = "Starting container…"
run_cmd = (
f"set -e; "
f"echo '=== removing any prior {s.whisperx_container} container ==='; "
f"docker rm -f {s.whisperx_container} 2>/dev/null || true; "
f"echo '=== docker run -d --restart unless-stopped --name {s.whisperx_container} ==='; "
f"HF_TOKEN=$(cat ~/.cache/huggingface/token 2>/dev/null || true); "
f"if [ -z \"$HF_TOKEN\" ]; then echo 'WARN: no HF_TOKEN found at ~/.cache/huggingface/token — diarization will be disabled until you set one'; fi; "
f"docker run -d --restart unless-stopped "
f"--name {s.whisperx_container} "
f"--gpus all --memory=40g "
f"-p {s.whisperx_port}:{s.whisperx_port} "
f"-v whisperx-models:/root/.cache/huggingface "
f"-e HF_TOKEN=\"$HF_TOKEN\" "
f"-e WHISPER_MODEL={s.whisperx_model} "
f"{s.whisperx_container}:latest"
)
job.append(f"$ {run_cmd}")
rc, out, err = await ssh_run(host, user, run_cmd, s, timeout=60)
if rc != 0:
job.append(f"[docker run failed rc={rc}] {(err or out).strip()[:300]}")
raise RuntimeError("docker run failed")
job.append(out.strip())
# ── Phase 4: wait for /health to report ready ──
job.phase = "Container is starting; loading whisper + alignment + pyannote models (~60120 s on first boot)…"
url = f"http://{s.whisperx_host}:{s.whisperx_port}/health"
ready = False
for i in range(60): # up to ~180 s
await asyncio.sleep(3)
try:
async with httpx.AsyncClient(timeout=4.0) as client:
r = await client.get(url)
if r.status_code == 200:
body = r.json()
if body.get("status") == "ready":
ready = True
job.append(f"[ready] {body}")
break
job.phase = f"Loading models (transcribe={body.get('transcribe_loaded')}, align={body.get('align_loaded')}, diarize={body.get('diarizer_loaded')})…"
except Exception:
pass
if not ready:
raise RuntimeError("container started but /health did not report ready within ~180 s — check `docker logs whisperx-asr` on Spark 2")
job.phase = "Done — WhisperX is healthy and reachable on port 8002"
-2
View File
@@ -30,7 +30,6 @@ models:
- -tp=2
- --distributed-executor-backend=ray
- --max-model-len=32768
- --max-num-batched-tokens=16384
gemma4:
display_name: "Gemma 4 31B"
@@ -46,7 +45,6 @@ models:
vllm_args:
- --gpu-memory-utilization=0.8
- --max-model-len=32768
- --max-num-batched-tokens=16384
- --reasoning-parser=gemma4
- --tool-call-parser=gemma4
- --enable-auto-tool-choice
-54
View File
@@ -1,54 +0,0 @@
#!/bin/bash
# Apply Sortformer diarization patches to a running parakeet-asr container.
#
# Run from the spark-control repo root on the laptop:
# bash image/parakeet_patches/apply.sh <spark2-host> <ssh-user>
#
# What it does:
# 1. Backs up the current /opt/parakeet/app/main.py inside the container
# (writable layer; survives docker restart but NOT docker rm).
# 2. Copies the patched main.py + new diarizer.py into the container.
# 3. Restarts the container so the new code + Sortformer model load.
#
# Reversibility:
# - The backup of main.py is at /opt/parakeet/app/main.py.pre-sortformer
# inside the container. Restore with:
# docker exec parakeet-asr cp /opt/parakeet/app/main.py.pre-sortformer /opt/parakeet/app/main.py
# docker exec parakeet-asr rm -f /opt/parakeet/app/diarizer.py
# docker restart parakeet-asr
# - If the container is ever `docker rm`'d (volume rebuild), re-run this
# script. We will eventually fold this into spark-control as an action.
set -e
HOST="${1:?usage: apply.sh <spark2-host> <ssh-user>}"
USER="${2:?usage: apply.sh <spark2-host> <ssh-user>}"
CONTAINER="${CONTAINER:-parakeet-asr}"
REPO_DIR="$(cd "$(dirname "$0")" && pwd)"
echo "→ Backing up current main.py inside ${CONTAINER}..."
ssh "${USER}@${HOST}" "docker exec ${CONTAINER} sh -c \
'test -f /opt/parakeet/app/main.py.pre-sortformer || cp /opt/parakeet/app/main.py /opt/parakeet/app/main.py.pre-sortformer'"
echo "→ Copying diarizer.py into container..."
ssh "${USER}@${HOST}" "docker exec -i ${CONTAINER} sh -c \
'cat > /opt/parakeet/app/diarizer.py'" < "${REPO_DIR}/diarizer.py"
echo "→ Copying patched main.py into container..."
ssh "${USER}@${HOST}" "docker exec -i ${CONTAINER} sh -c \
'cat > /opt/parakeet/app/main.py'" < "${REPO_DIR}/main.py"
echo "→ Verifying syntax inside container..."
ssh "${USER}@${HOST}" "docker exec ${CONTAINER} python3 -c \
'import ast; ast.parse(open(\"/opt/parakeet/app/diarizer.py\").read()); ast.parse(open(\"/opt/parakeet/app/main.py\").read()); print(\"py OK\")'"
echo "→ Restarting ${CONTAINER}..."
ssh "${USER}@${HOST}" "docker restart ${CONTAINER}"
echo
echo "✔ Patches applied. Sortformer model (~150 MB) will download on first load — wait ~30s before testing."
echo
echo "Test once it's ready:"
echo " curl -sS http://${HOST}:8000/health"
echo " curl -sS -X POST http://${HOST}:8000/v1/audio/diarize -F file=@some-audio.mp3 | head -c 500"
-164
View File
@@ -1,164 +0,0 @@
"""Speaker diarization via NVIDIA NeMo Sortformer.
This module is dropped into the Parakeet container at /opt/parakeet/app/diarizer.py
and loaded alongside the existing ASR model. The Sortformer model identifies who
is speaking when in an audio file, output as a list of {start_s, end_s, speaker}
turns. It does NOT transcribe — pair its output with Parakeet's word-level
timestamps to produce a diarized transcript.
Model: nvidia/diar_sortformer_4spk-v1 (~150 MB, NeMo ecosystem, ungated)
Memory: adds ~200 MB to the running container. Same GPU as Parakeet (Spark 2
unified GB10). No interference with Parakeet inference because they're called
on separate code paths and CUDA handles concurrent kernels.
"""
import io
import os
import logging
import tempfile
import subprocess
from pathlib import Path
from typing import Optional
import torch
import soundfile as sf
import numpy as np
logger = logging.getLogger(__name__)
DIARIZER_MODEL = os.getenv("DIARIZER_MODEL", "nvidia/diar_sortformer_4spk-v1")
TARGET_SAMPLE_RATE = 16000
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def _convert_to_wav_16k_mono(audio_bytes: bytes, original_filename: str) -> str:
"""Same conversion as transcriber.py — keeps a uniform input format
for the diarizer regardless of upload mime type."""
suffix = Path(original_filename).suffix.lower() if original_filename else ".wav"
with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp_in:
tmp_in.write(audio_bytes)
tmp_in_path = tmp_in.name
tmp_out_path = tmp_in_path + ".converted.wav"
try:
cmd = ["ffmpeg", "-y", "-i", tmp_in_path, "-ac", "1", "-ar", "16000",
"-sample_fmt", "s16", "-f", "wav", tmp_out_path]
result = subprocess.run(cmd, capture_output=True, timeout=300)
if result.returncode != 0:
raise RuntimeError(f"ffmpeg failed: {result.stderr.decode()[:500]}")
return tmp_out_path
finally:
try: os.unlink(tmp_in_path)
except OSError: pass
def _parse_sortformer_segments(raw_output) -> list[dict]:
"""Sortformer.diarize() returns List[List[str]] where each inner list is
per-file results: each entry is a space-separated 'start_s end_s speaker_label'
triplet (e.g., '0.00 4.50 speaker_0'). Normalize to our canonical format."""
if not raw_output:
return []
# Single-file invocation → take first inner list
entries = raw_output[0] if isinstance(raw_output, list) and raw_output and isinstance(raw_output[0], list) else raw_output
segments = []
for entry in entries:
if not entry:
continue
if isinstance(entry, str):
parts = entry.strip().split()
if len(parts) >= 3:
try:
start = float(parts[0])
end = float(parts[1])
speaker_raw = parts[2]
# Normalize "speaker_0" / "spk_0" / "0" → "Speaker_0"
if speaker_raw.lower().startswith("speaker_"):
idx = speaker_raw.split("_", 1)[1]
elif speaker_raw.lower().startswith("spk_"):
idx = speaker_raw.split("_", 1)[1]
elif speaker_raw.isdigit():
idx = speaker_raw
else:
idx = speaker_raw
segments.append({
"start_s": start,
"end_s": end,
"speaker": f"Speaker_{idx}",
})
except (ValueError, IndexError) as e:
logger.warning(f"unparsable sortformer entry: {entry!r} ({e})")
continue
return segments
class SortformerDiarizer:
def __init__(self):
self.model = None
self._loaded = False
def load_model(self):
if self._loaded:
return
logger.info(f"Loading diarizer {DIARIZER_MODEL} on {DEVICE}...")
from nemo.collections.asr.models import SortformerEncLabelModel
self.model = SortformerEncLabelModel.from_pretrained(DIARIZER_MODEL)
self.model.eval()
if DEVICE == "cuda":
self.model = self.model.cuda()
self._loaded = True
logger.info(f"Diarizer loaded on {DEVICE}")
def diarize(self, audio_bytes: bytes, filename: str = "audio.wav") -> dict:
"""Run diarization on a single audio file.
Returns:
{
"segments": [{"start_s": float, "end_s": float, "speaker": str}, ...],
"speakers_detected": ["Speaker_0", "Speaker_1", ...],
"duration": float,
"model": str,
"device": str,
}
Speaker labels are zero-indexed strings like "Speaker_0", "Speaker_1",
etc. They are NOT real names — that mapping happens downstream via LLM
analysis or manual UI correction.
"""
if not self._loaded:
self.load_model()
if not audio_bytes:
raise ValueError("empty audio")
wav_path = None
try:
wav_path = _convert_to_wav_16k_mono(audio_bytes, filename)
data, sr = sf.read(wav_path)
duration = len(data) / sr
logger.info(f"Diarizing {duration:.1f}s of audio ({filename})")
with torch.no_grad():
raw = self.model.diarize(
audio=[wav_path],
batch_size=1,
verbose=False,
)
segments = _parse_sortformer_segments(raw)
speakers = sorted({s["speaker"] for s in segments})
logger.info(f"Detected {len(speakers)} speakers across {len(segments)} turns")
if DEVICE == "cuda":
torch.cuda.empty_cache()
return {
"segments": segments,
"speakers_detected": speakers,
"duration": round(duration, 3),
"model": DIARIZER_MODEL,
"device": DEVICE,
}
finally:
if wav_path:
try: os.unlink(wav_path)
except OSError: pass
diarizer = SortformerDiarizer()
-158
View File
@@ -1,158 +0,0 @@
import os
import time
import logging
from contextlib import asynccontextmanager
from typing import Optional
import torch
from fastapi import FastAPI, File, Form, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from app.transcriber import transcriber, MODEL_NAME, DEVICE
from app.diarizer import diarizer, DIARIZER_MODEL
logging.basicConfig(level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
logger = logging.getLogger("parakeet-api")
@asynccontextmanager
async def lifespan(app: FastAPI):
logger.info(f"Loading ASR model {MODEL_NAME} on {DEVICE}")
transcriber.load_model()
logger.info("ASR model ready")
logger.info(f"Loading diarizer {DIARIZER_MODEL} on {DEVICE}")
diarizer.load_model()
logger.info("Diarizer ready")
yield
app = FastAPI(title="Parakeet ASR + Sortformer Diarization API", version="1.2.0", lifespan=lifespan)
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_credentials=True,
allow_methods=["*"], allow_headers=["*"])
@app.get("/")
async def root():
return {"service": "parakeet-asr", "model": MODEL_NAME, "diarizer": DIARIZER_MODEL, "device": DEVICE,
"endpoints": {"transcribe": "/v1/audio/transcriptions",
"diarize": "/v1/audio/diarize",
"models": "/v1/models", "health": "/health"}}
@app.get("/health")
async def health():
return {"status": "ready" if (transcriber._loaded and diarizer._loaded) else "loading",
"asr_loaded": transcriber._loaded,
"diarizer_loaded": diarizer._loaded,
"model": MODEL_NAME,
"diarizer_model": DIARIZER_MODEL,
"device": DEVICE}
@app.get("/v1/models")
async def list_models():
return {"object": "list", "data": [
{"id": "parakeet-tdt-0.6b-v3", "object": "model", "owned_by": "nvidia", "kind": "stt"},
{"id": "whisper-1", "object": "model", "owned_by": "nvidia", "kind": "stt"},
{"id": DIARIZER_MODEL.split("/")[-1], "object": "model", "owned_by": "nvidia", "kind": "diarization"}]}
@app.post("/v1/audio/transcriptions")
async def transcribe(
file: UploadFile = File(...),
model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json"),
temperature: Optional[float] = Form(default=0.0),
prompt: Optional[str] = Form(default=None),
):
if not transcriber._loaded:
raise HTTPException(status_code=503, detail="Model loading")
audio_bytes = await file.read()
if len(audio_bytes) == 0:
raise HTTPException(status_code=400, detail="Empty file")
max_size = int(os.getenv("MAX_UPLOAD_MB", "200")) * 1024 * 1024
if len(audio_bytes) > max_size:
raise HTTPException(status_code=413, detail=f"File too large")
want_timestamps = response_format == "verbose_json"
start_time = time.time()
try:
result = transcriber.transcribe(
audio_bytes, file.filename, language, timestamps=want_timestamps
)
except Exception as e:
logger.exception("Transcription failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
elapsed = time.time() - start_time
duration = result.get("duration", 0)
rtfx = duration / elapsed if elapsed > 0 else 0
logger.info(f"Done: {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt)")
if response_format == "text":
return JSONResponse(content=result["text"], media_type="text/plain")
if response_format == "verbose_json":
return {
"task": "transcribe",
"language": language or "en",
"duration": duration,
"text": result["text"],
"segments": result.get("segments", []),
"words": result.get("words", []),
}
return {"text": result["text"]}
@app.post("/v1/audio/translations")
async def translate(file: UploadFile = File(...),
model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json")):
return await transcribe(file=file, model=model, language=language,
response_format=response_format)
@app.post("/v1/audio/diarize")
async def diarize(
file: UploadFile = File(...),
):
"""Speaker diarization via Sortformer.
Returns who-spoke-when as a list of turns. Does NOT transcribe — pair this
output with /v1/audio/transcriptions (verbose_json) and merge by timestamp
to produce a diarized transcript.
Response shape:
{
"segments": [{"start_s": 0.00, "end_s": 4.50, "speaker": "Speaker_0"}, ...],
"speakers_detected": ["Speaker_0", "Speaker_1"],
"duration": 90.5,
"model": "nvidia/diar_sortformer_4spk-v1",
"device": "cuda"
}
"""
if not diarizer._loaded:
raise HTTPException(status_code=503, detail="Diarizer loading")
audio_bytes = await file.read()
if len(audio_bytes) == 0:
raise HTTPException(status_code=400, detail="Empty file")
max_size = int(os.getenv("MAX_UPLOAD_MB", "200")) * 1024 * 1024
if len(audio_bytes) > max_size:
raise HTTPException(status_code=413, detail="File too large")
start_time = time.time()
try:
result = diarizer.diarize(audio_bytes, file.filename or "audio.wav")
except Exception as e:
logger.exception("Diarization failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
elapsed = time.time() - start_time
duration = result.get("duration", 0)
rtfx = duration / elapsed if elapsed > 0 else 0
logger.info(f"Diarized {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt), "
f"{len(result['speakers_detected'])} speakers, {len(result['segments'])} turns")
return result
-105
View File
@@ -1,105 +0,0 @@
import os
import time
import logging
from contextlib import asynccontextmanager
from typing import Optional
import torch
from fastapi import FastAPI, File, Form, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from app.transcriber import transcriber, MODEL_NAME, DEVICE
logging.basicConfig(level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
logger = logging.getLogger("parakeet-api")
@asynccontextmanager
async def lifespan(app: FastAPI):
logger.info(f"Loading model {MODEL_NAME} on {DEVICE}")
transcriber.load_model()
logger.info("Model ready")
yield
app = FastAPI(title="Parakeet ASR API", version="1.1.0", lifespan=lifespan)
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_credentials=True,
allow_methods=["*"], allow_headers=["*"])
@app.get("/")
async def root():
return {"service": "parakeet-asr", "model": MODEL_NAME, "device": DEVICE,
"endpoints": {"transcribe": "/v1/audio/transcriptions",
"models": "/v1/models", "health": "/health"}}
@app.get("/health")
async def health():
return {"status": "ready" if transcriber._loaded else "loading",
"model": MODEL_NAME, "device": DEVICE}
@app.get("/v1/models")
async def list_models():
return {"object": "list", "data": [
{"id": "parakeet-tdt-0.6b-v3", "object": "model", "owned_by": "nvidia"},
{"id": "whisper-1", "object": "model", "owned_by": "nvidia"}]}
@app.post("/v1/audio/transcriptions")
async def transcribe(
file: UploadFile = File(...),
model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json"),
temperature: Optional[float] = Form(default=0.0),
prompt: Optional[str] = Form(default=None),
):
if not transcriber._loaded:
raise HTTPException(status_code=503, detail="Model loading")
audio_bytes = await file.read()
if len(audio_bytes) == 0:
raise HTTPException(status_code=400, detail="Empty file")
max_size = int(os.getenv("MAX_UPLOAD_MB", "200")) * 1024 * 1024
if len(audio_bytes) > max_size:
raise HTTPException(status_code=413, detail=f"File too large")
want_timestamps = response_format == "verbose_json"
start_time = time.time()
try:
result = transcriber.transcribe(
audio_bytes, file.filename, language, timestamps=want_timestamps
)
except Exception as e:
logger.exception("Transcription failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
elapsed = time.time() - start_time
duration = result.get("duration", 0)
rtfx = duration / elapsed if elapsed > 0 else 0
logger.info(f"Done: {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt)")
if response_format == "text":
return JSONResponse(content=result["text"], media_type="text/plain")
if response_format == "verbose_json":
return {
"task": "transcribe",
"language": language or "en",
"duration": duration,
"text": result["text"],
"segments": result.get("segments", []),
"words": result.get("words", []),
}
return {"text": result["text"]}
@app.post("/v1/audio/translations")
async def translate(file: UploadFile = File(...),
model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json")):
return await transcribe(file=file, model=model, language=language,
response_format=response_format)
-1
View File
@@ -9,7 +9,6 @@ dependencies = [
"pydantic>=2.9",
"pyyaml>=6.0",
"httpx>=0.27",
"python-multipart>=0.0.9",
]
[build-system]
-51
View File
@@ -1,51 +0,0 @@
# WhisperX ASR + diarization container for Spark 2 (Blackwell GB10, sm_120).
#
# Replaces the custom Parakeet wrapper + Sortformer overlay with a single
# mainline pipeline: faster-whisper for transcription + pyannote.audio 3.1
# for diarization + wav2vec2 forced alignment for word-level timestamps.
#
# Build (on Spark 2, where Blackwell + nvcr.io credentials are available):
# docker build -t whisperx-asr:latest .
#
# Run:
# docker run -d --restart unless-stopped --name whisperx-asr \
# --gpus all --memory=40g \
# -p 8002:8002 \
# -v whisperx-models:/root/.cache/huggingface \
# -e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
# -e WHISPER_MODEL=medium \
# whisperx-asr:latest
#
# The memory cap is intentional: even if WhisperX hits a pathological input,
# it gets OOM-killed cleanly instead of swap-thrashing the whole Spark.
FROM nvcr.io/nvidia/pytorch:25.11-py3
# WhisperX runs ffmpeg under the hood for audio decoding
RUN apt-get update \
&& apt-get install -y --no-install-recommends ffmpeg \
&& rm -rf /var/lib/apt/lists/*
# Install whisperx + the FastAPI wrapper deps. --break-system-packages because
# the NGC PyTorch image has its own managed Python that's flagged "system".
COPY requirements.txt /tmp/requirements.txt
RUN pip install --break-system-packages --no-cache-dir -r /tmp/requirements.txt
# Pre-warm the default Whisper + alignment models at build time so first-call
# latency on a fresh container is small. (~3 GB cached into the image; if you
# want a smaller image, comment this out and accept the first-call download.)
ARG WHISPER_MODEL=medium
ENV WHISPER_MODEL=${WHISPER_MODEL}
RUN python3 -c "import whisperx; whisperx.load_model('${WHISPER_MODEL}', 'cpu', compute_type='int8')" \
&& python3 -c "import whisperx; whisperx.load_align_model(language_code='en', device='cpu')"
WORKDIR /opt/whisperx
COPY app /opt/whisperx/app
# Expose for spark-control's proxy on Spark 2
EXPOSE 8002
HEALTHCHECK --interval=30s --timeout=10s --start-period=180s \
CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8002/health')" || exit 1
CMD ["python3", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8002", "--workers", "1"]
-74
View File
@@ -1,74 +0,0 @@
# WhisperX container for Spark 2
Replaces the custom Parakeet wrapper + Sortformer overlay (v0.10/v0.11) with a
single mainline pipeline:
- **faster-whisper** (CTranslate2-optimized) for STT
- **pyannote.audio 3.1** for speaker diarization (sliding-window — handles
long files in bounded memory, fixes the Sortformer OOM on 90-min audio)
- **wav2vec2 forced alignment** for word-level timestamps
Exposes the same API surface spark-control already proxies to, so the cutover
is a one-URL change in the audio proxy:
- `GET /health` — readiness probe
- `GET /v1/models` — model list
- `POST /v1/audio/transcriptions` — OpenAI-shaped STT
- `POST /v1/audio/transcribe-with-speakers` — merged diarized transcript
(matches spark-control's response shape exactly)
## Deploy to Spark 2
```bash
# 1. Copy this directory to Spark 2
rsync -av --delete image/whisperx_container/ modelo@192.168.1.87:~/whisperx-build/
# 2. SSH in and build
ssh modelo@192.168.1.87
cd ~/whisperx-build
docker build -t whisperx-asr:latest .
# 3. Run alongside the existing parakeet-asr (which stays on 8000 for now)
docker run -d --restart unless-stopped --name whisperx-asr \
--gpus all --memory=40g \
-p 8002:8002 \
-v whisperx-models:/root/.cache/huggingface \
-e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
-e WHISPER_MODEL=medium \
whisperx-asr:latest
# 4. Watch first-start logs (model load + first health check)
docker logs -f whisperx-asr
```
## Model size knobs
`WHISPER_MODEL` env var. Defaults to `medium`. Options:
| Model | Size | Speed (GB10) | Quality |
|---|---|---|---|
| `tiny` | ~75M | ~120x rt | low |
| `base` | ~74M | ~80x rt | ok |
| `small` | ~244M | ~50x rt | good |
| `medium`| ~769M | ~30x rt | excellent (**default**) |
| `large-v3`| ~1.5B | ~15x rt | best |
For a 90-min file, medium takes ~3 min STT + ~9 min diarize ≈ ~12 min total.
## Memory budget
The `--memory=40g` cap is intentional. Spark 2 has 122 GB unified, of which
~35 GB is consumed by parakeet-asr + magpie-tts. The 40 GB cap leaves
comfortable headroom for both the model weights (~5 GB) and pyannote's
in-memory features (~515 GB for a 90-min audio). If WhisperX hits a
pathological input it gets OOM-killed cleanly instead of swap-thrashing the
whole Spark — the symptom we hit with the unbounded Sortformer container.
## Rollback to Parakeet+Sortformer
```bash
docker stop whisperx-asr && docker rm whisperx-asr
```
The parakeet-asr container stays running throughout — spark-control's proxy
URL switch is reversible via config or version downgrade.
-355
View File
@@ -1,355 +0,0 @@
"""WhisperX FastAPI wrapper — STT + speaker diarization in a single endpoint.
Endpoints (designed to be drop-in compatible with the existing spark-control
audio API surface, so the proxy just changes its upstream URL):
GET / — service info
GET /health — readiness probe
GET /v1/models — list loaded models
POST /v1/audio/transcriptions — OpenAI-shaped STT (no speakers)
POST /v1/audio/transcribe-with-speakers — merged diarized transcript
The /transcribe-with-speakers response shape EXACTLY matches what
spark-control's /api/audio/transcribe-with-speakers returns today (the one
that recap-relay's PR spec was written against), so swapping the upstream
from Parakeet+Sortformer to WhisperX is a one-URL change in the proxy.
"""
from __future__ import annotations
import os
import time
import tempfile
import logging
from contextlib import asynccontextmanager
from typing import Optional
import torch
import whisperx
from fastapi import FastAPI, File, Form, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("whisperx-api")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
COMPUTE_TYPE = os.getenv("COMPUTE_TYPE", "float16" if DEVICE == "cuda" else "int8")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "medium")
DEFAULT_LANG = os.getenv("DEFAULT_LANGUAGE", "en")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "16"))
HF_TOKEN = os.getenv("HF_TOKEN") or None
class WhisperXEngine:
def __init__(self) -> None:
self.transcribe_model = None
self.align_model = None
self.align_metadata = None
self.diarize_model = None
self._loaded = False
def load(self) -> None:
if self._loaded:
return
logger.info(f"Loading whisper-{WHISPER_MODEL} on {DEVICE} ({COMPUTE_TYPE})")
self.transcribe_model = whisperx.load_model(
WHISPER_MODEL, DEVICE, compute_type=COMPUTE_TYPE
)
logger.info(f"Loading alignment model for {DEFAULT_LANG}")
self.align_model, self.align_metadata = whisperx.load_align_model(
language_code=DEFAULT_LANG, device=DEVICE
)
if HF_TOKEN:
logger.info("Loading pyannote diarization pipeline (3.1)")
try:
self.diarize_model = whisperx.DiarizationPipeline(
use_auth_token=HF_TOKEN, device=DEVICE
)
except Exception as e:
logger.exception(f"Diarization pipeline failed to load: {e}")
self.diarize_model = None
else:
logger.warning(
"HF_TOKEN not set — diarization disabled. /transcribe-with-speakers "
"will return 503. /transcriptions still works."
)
self._loaded = True
logger.info("WhisperX engine ready")
def transcribe(self, audio_bytes: bytes, filename: str, want_timestamps: bool = True) -> dict:
if not self._loaded:
self.load()
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
tmp.write(audio_bytes)
tmp_path = tmp.name
try:
audio = whisperx.load_audio(tmp_path)
duration = float(audio.shape[0]) / 16000.0
result = self.transcribe_model.transcribe(
audio, batch_size=BATCH_SIZE, language=DEFAULT_LANG
)
language = result.get("language") or DEFAULT_LANG
if want_timestamps:
aligned = whisperx.align(
result["segments"],
self.align_model,
self.align_metadata,
audio,
DEVICE,
return_char_alignments=False,
)
segments = aligned.get("segments", [])
else:
segments = result.get("segments", [])
full_text = " ".join(s.get("text", "").strip() for s in segments).strip()
return {
"duration": duration,
"language": language,
"text": full_text,
"segments": segments,
"audio_path": tmp_path,
"audio": audio, # caller can reuse for diarization without re-loading
}
finally:
# NOTE: caller is responsible for unlinking the temp file. We expose it
# in the return dict so diarization can run on the same audio without
# disk re-IO. The unlink happens in the request handler's finally.
pass
def diarize(self, audio) -> dict:
if self.diarize_model is None:
raise RuntimeError(
"Diarization pipeline not loaded (HF_TOKEN missing or load failed)"
)
diar = self.diarize_model(audio)
return diar
engine = WhisperXEngine()
@asynccontextmanager
async def lifespan(app: FastAPI):
engine.load()
yield
app = FastAPI(
title="WhisperX ASR + Diarization",
version="1.0.0",
lifespan=lifespan,
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
@app.get("/")
async def root() -> dict:
return {
"service": "whisperx",
"device": DEVICE,
"models": {
"transcription": f"whisper-{WHISPER_MODEL}",
"alignment": f"wav2vec2-{DEFAULT_LANG}",
"diarization": "pyannote-speaker-diarization-3.1" if engine.diarize_model else None,
},
"endpoints": {
"transcriptions": "/v1/audio/transcriptions",
"transcribe_with_speakers": "/v1/audio/transcribe-with-speakers",
"models": "/v1/models",
"health": "/health",
},
}
@app.get("/health")
async def health() -> dict:
return {
"status": "ready" if engine._loaded else "loading",
"transcribe_loaded": engine.transcribe_model is not None,
"align_loaded": engine.align_model is not None,
"diarizer_loaded": engine.diarize_model is not None,
"model": f"whisper-{WHISPER_MODEL}",
"device": DEVICE,
}
@app.get("/v1/models")
async def list_models() -> dict:
data = [
{"id": f"whisper-{WHISPER_MODEL}", "object": "model", "owned_by": "openai", "kind": "stt"},
]
if engine.diarize_model is not None:
data.append(
{"id": "pyannote-speaker-diarization-3.1", "object": "model",
"owned_by": "pyannote", "kind": "diarization"}
)
return {"object": "list", "data": data}
def _normalize_speaker(label: str) -> str:
"""WhisperX/pyannote uses 'SPEAKER_00' / 'SPEAKER_01' / ... — normalize to
the same 'Speaker_0' shape spark-control's existing endpoint returns."""
if not label:
return "Speaker_unknown"
if label.upper().startswith("SPEAKER_"):
idx = label.split("_", 1)[1].lstrip("0") or "0"
return f"Speaker_{idx}"
return label
def _segments_to_blocks(segments: list[dict]) -> list[dict]:
"""Convert WhisperX's per-utterance segments into the
[{start_ms, end_ms, speaker, text}, ...] block shape spark-control returns
today. Groups consecutive same-speaker segments into one block."""
blocks: list[dict] = []
cur = None
for s in segments:
spk_raw = s.get("speaker") or "Speaker_unknown"
spk = _normalize_speaker(spk_raw)
text = (s.get("text") or "").strip()
start_ms = int(float(s.get("start", 0)) * 1000)
end_ms = int(float(s.get("end", 0)) * 1000)
if not text:
continue
if cur is None or cur["speaker"] != spk or start_ms - cur["end_ms"] > 1500:
if cur is not None:
blocks.append(cur)
cur = {"start_ms": start_ms, "end_ms": end_ms, "speaker": spk, "text": text}
else:
cur["text"] = (cur["text"] + " " + text).strip()
cur["end_ms"] = end_ms
if cur is not None:
blocks.append(cur)
return blocks
@app.post("/v1/audio/transcriptions")
async def transcribe(
file: UploadFile = File(...),
model: Optional[str] = Form(default=None),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json"),
temperature: Optional[float] = Form(default=None),
prompt: Optional[str] = Form(default=None),
):
if not engine._loaded:
raise HTTPException(status_code=503, detail="Engine loading")
audio_bytes = await file.read()
if not audio_bytes:
raise HTTPException(status_code=400, detail="Empty file")
start_t = time.time()
audio_path = None
try:
result = engine.transcribe(
audio_bytes,
file.filename or "audio.wav",
want_timestamps=(response_format == "verbose_json"),
)
audio_path = result.pop("audio_path", None)
result.pop("audio", None)
except Exception as e:
logger.exception("Transcription failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
finally:
if audio_path:
try: os.unlink(audio_path)
except OSError: pass
elapsed = time.time() - start_t
duration = result.get("duration", 0.0)
logger.info(f"Transcribed {duration:.1f}s in {elapsed:.1f}s ({duration/elapsed:.0f}x rt)")
if response_format == "text":
return JSONResponse(content=result["text"], media_type="text/plain")
if response_format == "verbose_json":
words = []
for s in result.get("segments", []):
for w in s.get("words", []) or []:
words.append({
"word": w.get("word"),
"start": w.get("start"),
"end": w.get("end"),
"score": w.get("score"),
})
return {
"task": "transcribe",
"language": result.get("language", "en"),
"duration": duration,
"text": result["text"],
"segments": [
{"start": s.get("start"), "end": s.get("end"), "text": s.get("text", "").strip()}
for s in result.get("segments", [])
],
"words": words,
}
return {"text": result["text"]}
@app.post("/v1/audio/transcribe-with-speakers")
async def transcribe_with_speakers(file: UploadFile = File(...)) -> dict:
"""Merged STT + diarization. Response shape matches spark-control's
/api/audio/transcribe-with-speakers exactly — recap-relay's PR spec
needs no changes when we cut over."""
if not engine._loaded:
raise HTTPException(status_code=503, detail="Engine loading")
if engine.diarize_model is None:
raise HTTPException(
status_code=503,
detail="Diarization unavailable — HF_TOKEN not set or pyannote failed to load",
)
audio_bytes = await file.read()
if not audio_bytes:
raise HTTPException(status_code=400, detail="Empty file")
start_t = time.time()
audio_path = None
try:
result = engine.transcribe(
audio_bytes, file.filename or "audio.wav", want_timestamps=True
)
audio_path = result.pop("audio_path", None)
audio = result.pop("audio")
# Diarize on the in-memory audio (no second decode)
logger.info("Running pyannote diarization…")
diar = engine.diarize(audio)
# whisperx.assign_word_speakers writes speaker labels into the
# aligned segments + their nested words
result_with_speakers = whisperx.assign_word_speakers(
diar, {"segments": result["segments"]}
)
segments_in = result_with_speakers.get("segments", [])
blocks = _segments_to_blocks(segments_in)
speakers = sorted({b["speaker"] for b in blocks if b["speaker"] != "Speaker_unknown"})
except Exception as e:
logger.exception("Diarized transcription failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
finally:
if audio_path:
try: os.unlink(audio_path)
except OSError: pass
elapsed = time.time() - start_t
duration = result.get("duration", 0.0)
logger.info(
f"Transcribed+diarized {duration:.1f}s in {elapsed:.1f}s "
f"({duration/elapsed:.0f}x rt), {len(speakers)} speakers, {len(blocks)} blocks"
)
return {
"duration": duration,
"language": result.get("language", "en"),
"speakers_detected": speakers,
"segments": blocks,
"models": {
"transcription": f"whisper-{WHISPER_MODEL}",
"diarization": "pyannote-speaker-diarization-3.1",
},
}
@@ -1,5 +0,0 @@
whisperx==3.4.3
fastapi>=0.115
uvicorn[standard]>=0.32
python-multipart>=0.0.9
soundfile>=0.12
+2 -6
View File
@@ -9,7 +9,7 @@
**Fix:**
```bash
ssh modelo@<spark-2-host> 'docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache && docker restart magpie-tts'
ssh <spark-user>@<spark-2-host> 'docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache && docker restart magpie-tts'
```
The trick is the `docker run --rm alpine chown` — it runs as root inside the throwaway container, which is enough to chown the bind-mounted volume on the host, without needing `sudo` on the host itself. After the chown + restart, magpie downloaded its ~3 GB model from NGC into the cache and came up healthy on `:9000`.
@@ -24,13 +24,9 @@ This flag is Blackwell-specific. If vLLM in the container reports `unrecognized
Qwen3.6 uses a Mamba-attention hybrid that requires `--max-num-batched-tokens >= 2096`. vLLM's default is 2048, which trips `AssertionError: In Mamba cache align mode, block_size (2096) must be <= max_num_batched_tokens (2048)`. Fix: bake `--max-num-batched-tokens=16384` into the bundled qwen36 entry — matches the upstream qwen3.5-35b-a3b-fp8 recipe.
## Multimodal token budget for vision models (fixed in v0.8.0:1)
After the eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets. Vision-capable models like Gemma 4 31B and Qwen3-VL crash at engine init with `ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048)`. Fix: bake `--max-num-batched-tokens=16384` into every model that has the `vision` capability. Now applied to qwen3-vl, gemma4, and qwen36 (which was already set for the Mamba issue).
## Two SSH paths to Spark 1 from the laptop
`ssh modelo@192.168.1.103` does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for `spark-27ea.local`. Always use the `.local` hostname or `192.168.1.87`-style entries that ARE matched.
`ssh <spark-user>@<spark-1-ip>` does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for `<spark-1-host>.local`. Always use the `.local` hostname or `<spark-2-ip>`-style entries that ARE matched.
## Older models in `models.yaml`
+1 -1
View File
@@ -1,6 +1,6 @@
MIT License
Copyright (c) 2026 Grant
Copyright (c) 2026 Alice
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
+1 -1
View File
@@ -19,7 +19,7 @@ This package SSHes into your Spark server to run cluster commands, so it needs a
```bash
echo "<paste-pubkey-here>" >> ~/.ssh/authorized_keys
```
3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `modelo`).
3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `<spark-user>`).
4. **Open the Web UI.** It will hit each Spark to confirm. If both indicators are green you're done.
## Using Spark Control
+1 -1
View File
@@ -19,7 +19,7 @@ This package SSHes into your Spark server to run cluster commands, so it needs a
```bash
echo "<paste-pubkey-here>" >> ~/.ssh/authorized_keys
```
3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `modelo`).
3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `<spark-user>`).
4. **Open the Web UI.** It will hit each Spark to confirm. If both indicators are green you're done.
## Using Spark Control
+2 -2
View File
@@ -1,10 +1,10 @@
import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
export const v0_1_0 = VersionInfo.of({
version: '0.12.0:1',
version: '0.7.0:2',
releaseNotes: {
en_US:
'v0.12.0:1 — hotfix: 0.12.0:0\'s install action used shlex.quote() on the remote build path, which wraps `~/whisperx-build/...` in single quotes — the remote shell then doesn\'t expand the tilde and treats it as a literal directory named `~`. Result: "bash: line 1: ~/whisperx-build/Dockerfile: No such file or directory" on the very first file copy. Same bug pattern we hit before with $HOME in the disk probe. Rewrote to embed $HOME in double-quoted remote shell strings; hardcoded file names (Dockerfile, requirements.txt, README.md, app/main.py) embed unquoted inside that scope. All other 0.12.0 behavior is unchanged.',
'v0.7: pre-flight launch validation. New "Test" button on every model card runs vLLM\'s argparse against the proposed launch command inside the running vllm_node container — without starting an engine. Catches unknown flags, bad types, and version-removed flags in about 5 seconds, before disrupting the currently-loaded model. (Runtime-only failures like the Qwen3.6 Mamba block-size assertion still only surface during a real swap, but argparse-stage bugs are now caught up front.)',
},
migrations: {
up: async ({ effects }) => {},
+8 -8
View File
@@ -37,7 +37,7 @@ These take effect on the **next swap to that model**. If a swap fails after this
## Adding a new model
1. Add an entry to `image/models.yaml`. Required fields: `display_name`, `repo`, `size_gb`, `mode` (`solo` or `cluster`), `vllm_args`. Optional but recommended: `description` (one paragraph — what the model is, what it's good for, how it differs from others; renders below the meta tags in each card), `capabilities` (tags like `[vision, reasoning, tools]`), `expected_ready_seconds`.
2. Confirm the weights are on the Spark: `ssh modelo@spark-27ea.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host>.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
3. Rebuild + redeploy the package: `cd package && make x86 && make install`.
If `description` is omitted, the card simply hides that section — no need to populate it for every model. Keep descriptions generic (not user-specific) so the catalog stays portable.
@@ -47,7 +47,7 @@ If `description` is omitted, the card simply hides that section — no need to p
If the UI is unavailable and you need to swap by hand:
```bash
ssh modelo@spark-27ea.local
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
@@ -61,19 +61,19 @@ docker logs -f vllm_node # wait for "Application startup complete."
```bash
# Is vLLM serving?
curl -s http://192.168.1.103:8888/v1/models | jq .
curl -s http://<spark-1-ip>:8888/v1/models | jq .
# Cluster status (containers up?)
ssh modelo@spark-27ea.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
ssh <spark-user>@<spark-1-host>.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
# Tail current model's logs
ssh modelo@spark-27ea.local 'docker logs --tail 200 -f vllm_node'
ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'
# Parakeet
curl -s http://192.168.1.87:8000/health
curl -s http://<spark-2-ip>:8000/health
# Magpie (see known-issues.md)
curl -s http://192.168.1.87:9000/v1/health/ready
curl -s http://<spark-2-ip>:9000/v1/health/ready
```
## Hard reset
@@ -81,7 +81,7 @@ curl -s http://192.168.1.87:9000/v1/health/ready
If launch-cluster.sh gets stuck:
```bash
ssh modelo@spark-27ea.local
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
docker ps -aq | xargs -r docker rm -f