43 Commits

Author SHA1 Message Date
Keysat 0ba2a3a3fc v0.12.0:1 - hotfix: WhisperX install fails on first scp because ~ doesn't
expand inside shlex.quote()

Symptom: "Failed to ship Dockerfile — bash: line 1: ~/whisperx-build/
Dockerfile: No such file or directory"

Same bug pattern as v0.8.1:1 (disk probe). shlex.quote() wraps in single
quotes, and the remote shell doesn't do tilde expansion inside single
quotes — so it tries to write to a literal directory named "~".

Fix: use $HOME in double-quoted shell context, which the remote shell
expands correctly. The file names (Dockerfile, requirements.txt, etc.)
are hardcoded so they're safe to embed unquoted.
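
A minimal sketch of the failure and the fix described above. The helper name `remote_build_path` and the extra name whitelist are illustrative assumptions, not the code in whisperx_install.py; only `shlex.quote()` and the `whisperx-build` directory come from this message.

```python
import re
import shlex

# Assumed whitelist for the hardcoded build-context names (illustrative only).
_SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def remote_build_path(filename: str, build_dir: str = "whisperx-build") -> str:
    """Return a shell word that the remote shell expands to $HOME/<build_dir>/<filename>."""
    for part in (build_dir, filename):
        if not _SAFE_NAME.fullmatch(part):
            raise ValueError(f"unsafe path component: {part!r}")
    # Double quotes allow $HOME to expand; single quotes would keep it literal.
    return f'"$HOME/{build_dir}/{filename}"'


# The failing pattern: shlex.quote() single-quotes the tilde, so it stays literal.
broken = shlex.quote("~/whisperx-build/Dockerfile")

# The fixed pattern: $HOME inside a double-quoted shell word.
fixed = remote_build_path("Dockerfile")
```

Here `broken` is the single-quoted string the remote shell treated as a literal `~` directory, while `fixed` is a double-quoted word that the remote shell expands.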

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:16:44 -05:00
Keysat b87cb0f99b v0.12.0:0 - WhisperX as a one-click dashboard install + managed service
Replaces the manual rsync+build+run with a proper spark-control feature.
First feature in the audio path that doesn't require shell access on Spark 2.

What's in the box
─────────────────
* image/whisperx_container/   - the build context (Dockerfile, requirements,
  app/main.py FastAPI wrapper). Mainline pipeline: faster-whisper for STT +
  pyannote 3.1 for diarization + wav2vec2 forced alignment. Single endpoint
  /v1/audio/transcribe-with-speakers returns the exact same shape spark-
  control's existing endpoint does, so the recap-relay PR spec needs no
  changes when we cut over.

* image/app/whisperx_install.py - install manager. Ships the build context to
  Spark 2 over SSH, runs `docker build`, runs `docker run` with a 40 GB
  memory cap (Sortformer ran uncapped and thrashed Spark 2 on a 90-min
  file), and polls /health until both Whisper + pyannote report loaded.

* Audio proxy: /api/audio/transcribe-with-speakers now prefers WhisperX
  when its /health reports diarizer_loaded=true, falls back to the legacy
  Parakeet + Sortformer path otherwise. Same response shape either way.
  Clean cutover, easy rollback (`docker rm whisperx-asr`).

* Dashboard (Audio / Speech tab):
  - "Add WhisperX" banner appears when not installed, with a primary
    "Install WhisperX" button. One click triggers the install.
  - Build progress dialog with phase + elapsed timer + live build log via
    SSE (`/api/whisperx/install/{job_id}/stream`).
  - After install, WhisperX auto-registers as a managed service alongside
    Parakeet and Magpie (Start/Restart/Stop, deep-check, auto-restart).
  - Banner self-hides once /api/whisperx/status reports healthy.

New endpoints
─────────────
  GET  /api/whisperx/status
  POST /api/whisperx/install
  GET  /api/whisperx/install/{job_id}
  GET  /api/whisperx/install/{job_id}/stream  (SSE phase + log)

Config additions (env)
──────────────────────
  WHISPERX_HOST       (defaults to spark2_host)
  WHISPERX_USER       (defaults to spark2_user)
  WHISPERX_CONTAINER  (default: whisperx-asr)
  WHISPERX_PORT       (default: 8002)
  WHISPERX_MODEL      (default: medium; tiny/base/small/medium/large-v3)

Dockerfile
──────────
Added COPY whisperx_container /app/whisperx_container so the runtime
install manager can read the build context from inside the spark-control
image and ship it over SSH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:02:26 -05:00
Keysat 346df907d2 v0.11.0:3 - button sizing fix: unify base .btn to 12px / 6px 12px
User feedback: every action button OUTSIDE the parakeet/magpie service
cards looked too big. Specifically called out: "Reapply patches",
"Restart container", "Switch to this", "Download". The ones on the
service cards (Start/Restart/Stop) were the size he liked.

Root cause: the base .btn used font: inherit, so it picked up 15px from
body. .service-actions .btn was the only place with an explicit
font-size: 12px + padding: 6px 12px override.

Fix: change .btn base directly to font-size: 12px + padding: 6px 12px.
Every button across the dashboard now matches the service-card button
footprint. The existing per-context overrides become redundant but
remain in place; they no longer create visible differences.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:54:46 -05:00
Keysat 82224f53e7 v0.11.0:2 - pill sizing fix: match .tag exactly to .status "Healthy" pill
User feedback: every pill outside the Always-On Services cards was rendering
visually taller than the "Healthy" status pill they liked. Root cause was
the .tag additions in 0.11.0:1 (line-height: 1.5, display: inline-block)
that didn't match the .status pill on service cards (which has neither).

Dropped both additions, bumped font-size from 11px → 12px so .tag is now
pixel-identical to .status:
  font-size: 12px;
  padding: 2px 8px;
  border-radius: 999px;
  background: var(--surface-2);
  border: 1px solid var(--border);

Every pill on the dashboard (mode-cluster/mode-solo/cap/on-disk/not-on-disk/
custom-pill/.tag.ok/.tag.warn/.tag.bad) now renders at the same footprint
as the Healthy/Unhealthy/Starting pills on the service cards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:46:57 -05:00
Keysat a7102105aa v0.11.0:1 - dashboard polish: tabs, collapsible endpoint, pill consistency
Three UX improvements, all client-side; no backend or behavior changes.

1. LLM / Audio tabs under the hardware section. The single long column got
   split into two tabbed views:
     * LLM       -> model swap + download panel + spark-vllm-docker updates
     * Audio     -> Parakeet/Magpie services + speech-model patches
   Selection persists in localStorage; default is LLM. The swap-panel
   (in-flight LLM swap) sits ABOVE the tab strip so it stays visible
   regardless of which tab is active.

2. Collapsible OpenAI-compatible Endpoint card. New chevron in the card
   header collapses everything except the title. State persists per browser
   via localStorage. Defaults to collapsed since you rarely need the URL/
   model details visible (and the same info is one tab swap away).

3. Unified pill sizing. The .sm-pill class in speech-models was rendering
   subtly larger than .tag pills on model cards. Dropped .sm-pill entirely
   and reused .tag with semantic color modifiers (.tag.ok / .tag.warn /
   .tag.bad). Same 11px / 2px×8px footprint everywhere now. Also added
   explicit line-height: 1.5 + display: inline-block to .tag to lock down
   vertical sizing.

No new endpoints, no new dependencies. Tested locally with node --check
and ast.parse(). Verified the tab DOM structure wraps the right sections
and the speech-models panel still self-shows/hides on data load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:33:16 -05:00
Keysat ddfd508c2f v0.11.0:0 - Speech model patches panel (lifecycle for v0.10.0 overlays)
Folds the image/parakeet_patches/apply.sh script into a one-click
dashboard action and adds drift detection so you can see at a glance
whether the parakeet-asr container has the latest Sortformer overlays
that spark-control ships.

Backend:
  * image/app/speech_models.py - SpeechModelsManager: reads /health from
    Parakeet, sha256s the local overlay files inside spark-control's
    Docker image (/app/parakeet_patches), sha256s the same files inside
    the parakeet-asr container via `docker exec ... sha256sum`, surfaces
    in_sync / drift / missing status per file.
  * GET  /api/speech-models           - status payload
  * POST /api/speech-models/reapply   - copies overlays into container,
                                         verifies python syntax, restarts,
                                         polls /health for ~120s, returns
                                         step-by-step result
  * POST /api/speech-models/restart   - plain `docker restart parakeet-asr`

Dockerfile: now COPY parakeet_patches into the image at /app/parakeet_patches
so the runtime can read them. Future spark-control releases auto-carry
newer overlay versions; the panel surfaces drift after upgrade.

Frontend: new "Speech model patches" section on the dashboard with
  * Status pill (in sync / drift / missing)
  * Per-file SHA comparison (local vs container)
  * Loaded-models pills (ASR + diarizer)
  * Reapply + Restart buttons (both with confirmation modals)
  * Live progress display during reapply with per-step ✓/✗

Verified post-install against the running cluster:
  GET /api/speech-models shows both files in_sync (SHAs match) and both
  models loaded and ready on Spark 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:58:13 -05:00
Keysat 6664543dec v0.10.0:1 - hotfix: merge function now joins words with proper spacing
Smoke testing v0.10.0:0 against a real anarlog audio.mp3 showed the
output running words together: "I'mrecordingrightnow", "don'tyoutry".

Root cause: _merge_words_with_speakers was doing "".join(cur_words),
assuming Parakeet returns words with leading whitespace (which the
hyprnote local Parakeet does, but the Spark-hosted Parakeet does not).

Rewrote the join with a small helper that:
  - Strips each token (handles both leading-space and no-leading-space
    word formats)
  - Joins with a single space
  - Keeps punctuation tight — no space before period/comma/colon/etc.
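
The three rules above can be sketched roughly as follows. The function name `join_words` and the exact punctuation set are assumptions for illustration; the real helper in _merge_words_with_speakers may differ.

```python
# Punctuation that attaches to the preceding word (assumed set, not from the source).
_TIGHT_PUNCT = set(".,:;!?")


def join_words(tokens):
    """Join word tokens with single spaces, keeping punctuation tight."""
    out = ""
    for raw in tokens:
        tok = raw.strip()  # handles both " word" and "word" token formats
        if not tok:
            continue
        # Add a separating space unless this token starts with tight punctuation.
        if out and tok[0] not in _TIGHT_PUNCT:
            out += " "
        out += tok
    return out
```

Both token formats produce the same text: `join_words([" I'm", " recording"])` and `join_words(["I'm", "recording"])` each return `"I'm recording"`.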

Verified post-install with the same test audio:
  [00:06] Speaker_0: I'm I'm recording right now.
  [00:18] Speaker_1: you're you're on your computer and your phone, right?

No other changes — Parakeet container patches and the endpoint shape
stay identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:42:04 -05:00
Keysat 33768ae3d7 v0.10.0:0 - speaker diarization via Sortformer + merged transcribe-with-speakers
Adds a new pipeline for diarized transcription that any client (recap-relay,
ad-hoc curl, future Mac-side tools) can call. Pure data pipeline, no LLM
or UI included — name resolution / analysis happen downstream where prompts
and rendering are configurable.

Architecture:
  Spark 2 / parakeet-asr container:
    + /opt/parakeet/app/diarizer.py        (new: SortformerDiarizer class)
    + /opt/parakeet/app/main.py            (patched: loads diarizer, adds
                                            /v1/audio/diarize endpoint)
    Model: nvidia/diar_sortformer_4spk-v1  (~150 MB, ungated, NeMo native)

  Spark Control:
    + POST /api/audio/transcribe-with-speakers
      Body: multipart file
      Returns: {
        duration, language, speakers_detected,
        segments: [{start_ms, end_ms, speaker, text}, ...],
        models: {transcription, diarization}
      }
      Runs Parakeet ASR + Sortformer in parallel, merges words to speaker
      turns by timestamp, groups into speaker-change blocks (breaks also
      on >1.5s silence gaps).
    + If Parakeet 500s mid-pipeline, kicks deep-health probe and returns
      503/Retry-After: 60 — same wedge-recovery pattern as v0.9.0:2.
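
A rough sketch of the merge step described above, under stated assumptions: words and diarizer turns are plain dicts with `start_ms`/`end_ms`, a word is assigned to the turn it overlaps most, and text is joined with a plain space. The function names and the overlap rule are hypothetical; only the segment shape and the 1.5 s gap rule come from this message.

```python
GAP_MS = 1500  # break a block on silence longer than 1.5 s


def speaker_for(word, turns):
    """Return the speaker of the turn with the largest time overlap, or None."""
    best, best_overlap = None, 0
    for turn in turns:
        overlap = min(word["end_ms"], turn["end_ms"]) - max(word["start_ms"], turn["start_ms"])
        if overlap > best_overlap:
            best, best_overlap = turn["speaker"], overlap
    return best


def merge(words, turns):
    """Group timestamped words into speaker-change blocks."""
    segments = []
    for word in sorted(words, key=lambda w: w["start_ms"]):
        last = segments[-1] if segments else None
        speaker = speaker_for(word, turns)
        if speaker is None:
            # No overlapping turn: inherit the previous speaker (assumed fallback).
            speaker = last["speaker"] if last else "unknown"
        same_block = (
            last is not None
            and last["speaker"] == speaker
            and word["start_ms"] - last["end_ms"] <= GAP_MS
        )
        if same_block:
            last["text"] += " " + word["text"]
            last["end_ms"] = word["end_ms"]
        else:
            segments.append({
                "start_ms": word["start_ms"],
                "end_ms": word["end_ms"],
                "speaker": speaker,
                "text": word["text"],
            })
    return segments
```

A new block starts either when the speaker changes or when the same speaker resumes after more than `GAP_MS` of silence.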

Apply Sortformer patches to the running Parakeet container with:
  bash image/parakeet_patches/apply.sh <spark2-host> <ssh-user>

Patches are reversible — apply.sh backs up the original main.py inside the
container at main.py.pre-sortformer before overwriting. Restore by copying
that file back and removing diarizer.py, then docker restart.

v0.11 follow-up: dashboard "Speech Models" panel to swap/update model
versions from the UI instead of needing to re-run apply.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:14:48 -05:00
Keysat 8a5862dcb2 v0.9.0:2 - audio proxy: turn Parakeet wedge 500 into clean 503 + immediate auto-restart
Parakeet's recurring CUDA wedge (CUBLAS_STATUS_*_ERROR mid-attention)
fires reliably on Open WebUI's WebM/Opus->MP3 audio. Previously the
proxy relayed the upstream 500 verbatim, Open WebUI showed "Server
connection error" with no signal to retry, and recovery took up to
5 minutes (waiting for the next periodic deep-health probe).

Now the proxy:
  1. Detects 500 from /v1/audio/transcriptions
  2. Fires deep_health.run_one("parakeet") as a background asyncio task
     (which contains the same wedge-detect + rate-limited auto-restart
     logic, but runs immediately instead of waiting for the next tick)
  3. Returns 503 with a clear detail message and Retry-After: 60

The client (Open WebUI, Home Assistant, etc.) gets a proper retry
signal; the auto-restart triggers within seconds; the next attempt
~60s later succeeds. Rate-limiting (3 restarts per 30 min) is
inherited from the deep-health module so this can't cause restart
storms.
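
For illustration, the "3 restarts per 30 min" limit can be modeled as a sliding window. This is a sketch under assumptions: the class name `RestartLimiter` and its API are hypothetical and are not the deep-health module's actual implementation.

```python
import time
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts within any window_s-second window."""

    def __init__(self, max_restarts=3, window_s=30 * 60, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._stamps = deque()

    def try_acquire(self) -> bool:
        """Record a restart and return True if one is allowed right now."""
        now = self._clock()
        # Forget restarts that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False
        self._stamps.append(now)
        return True
```

With a limiter like this in front of `docker restart`, a burst of upstream 500s can trigger at most three restarts before further attempts are refused until the window slides.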

server.py: pass deep_health into build_audio_router().
audio_proxy.py: new 503-with-restart branch; signature now accepts
                deep_health as an optional dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 18:07:35 -05:00
Keysat 589b3e59ab v0.9.0:1 - hotfix: add python-multipart for /v1/audio/transcriptions
v0.9.0:0 introduced the OpenAI audio proxy whose /v1/audio/transcriptions
endpoint uses FastAPI's Form() + File() parameters. Those require
python-multipart at runtime; it wasn't in image/pyproject.toml because
none of the prior endpoints needed multipart.

Result: FastAPI raised RuntimeError("Form data requires python-multipart")
during route registration, the entrypoint exited 1, and StartOS's
reverse proxy started closing TLS handshakes with PR_END_OF_FILE_ERROR
because there was no upstream to forward to.

Fix: add python-multipart>=0.0.9 to dependencies. Dashboard, /api/*,
and the new /v1/* audio endpoints all come back up cleanly. No other
code changes.

Verified post-install: Uvicorn running on http://0.0.0.0:9999,
"Application startup complete" in the logs, package status 'installed'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 17:01:52 -05:00
Keysat d67152624e v0.9.0:0 - OpenAI-compatible audio proxy for Open WebUI / Home Assistant
Adds three new endpoints to spark-control that translate OpenAI's
audio API shapes to the Parakeet (STT) and Magpie (TTS, NVIDIA Riva)
services on the Sparks:

  GET  /v1/models                — STT model + Magpie's 60+ voices
  POST /v1/audio/speech          — OpenAI body -> Magpie multipart synthesize
                                    (returns audio/wav passthrough)
  POST /v1/audio/transcriptions  — relay to Parakeet (already compatible)

Verified shapes against the live services:
  - Parakeet returns OpenAI-style {"text": "..."} or verbose_json with
    segments+words. Already a perfect drop-in for OpenAI clients.
  - Magpie returns raw WAV bytes with Content-Type: audio/wav. NOT
    base64-wrapped JSON as one might assume. The proxy is literally a
    body-translation on the request side; response is passthrough.

Voice language is auto-derived from the voice name (e.g.
Magpie-Multilingual.EN-US.Mia -> language=en-US) so clients don't
need to set it explicitly.
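
One way the derivation could look, assuming voice names are dot-separated with the locale as the second segment (as in the example above). The function name and the `en-US` fallback are assumptions, not the proxy's actual code.

```python
def language_from_voice(voice: str, default: str = "en-US") -> str:
    """Derive a language tag such as 'en-US' from a voice name."""
    parts = voice.split(".")
    if len(parts) < 3:
        return default  # unexpected name shape: fall back (assumed behavior)
    lang, _, region = parts[1].partition("-")
    if not lang or not region:
        return default
    # Normalize case: lowercase language, uppercase region.
    return f"{lang.lower()}-{region.upper()}"
```

For the voice named in this message, `language_from_voice("Magpie-Multilingual.EN-US.Mia")` returns `"en-US"`.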

Open WebUI / Home Assistant / Recap Relay can now all point at one
URL — https://<spark-control>.local/v1 — and get LLM, STT, TTS
behind a single identity. No shim service to deploy.

Pure addition: no existing routes touched; the dashboard, /api/*,
download flow, deep-health, hardware probes are all unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 16:41:48 -05:00
Keysat 26382dc932 v0.8.1:2 - card button flips to blue "Download" when weights are absent
When a model's weights aren't on disk, the green "Switch to this"
button on the card is replaced by a blue "Download" button that
calls /api/download directly with the model's repo and the right
mode (solo -> spark1, cluster -> both). One-click re-install of a
previously-deleted model, no more pasting the repo into the manual
download form.

Also adds a confirmation dialog showing the model name, size, and
target Spark(s) before kicking off the download — and disables the
button when another download is already in flight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 09:30:51 -05:00
Keysat c07eaeb4ee v0.8.1:1 - fix disk probe: $HOME wasn't expanding inside shlex.quote
The 0.8.1:0 probe wrapped the entire path (including $HOME) in
shlex.quote, which produces single quotes — preventing shell
variable expansion. The resulting `[ -d '$HOME/.cache/...' ]` test
looked for a literal path starting with the string $HOME and
always failed, so every model reported as "not downloaded" and no
trash icons rendered.

Fix: embed $HOME in a double-quoted shell context (which allows
expansion) and validate the cache dirname against a whitelist
[A-Za-z0-9._-]+ rather than relying on shlex quoting. The dirname
is fully constrained by HF's naming rules + our org--name munging,
so the whitelist is tight enough.
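
A sketch of the whitelist-plus-double-quotes approach. The function name and the `cache_root` parameter are hypothetical, and the cache path used in the usage note is a placeholder, not the real path from disk.py. The explicit rejection of "." and ".." is an extra guard added here, since the character class alone would accept them.

```python
import re

# Whitelist from this commit message: letters, digits, dot, underscore, hyphen.
_DIRNAME_RE = re.compile(r"[A-Za-z0-9._-]+")


def dir_test_command(cache_root: str, dirname: str) -> str:
    """Build a `[ -d ... ]` test whose $HOME is expanded by the remote shell.

    cache_root is assumed to be a trusted constant; dirname is validated.
    """
    if dirname in (".", "..") or not _DIRNAME_RE.fullmatch(dirname):
        raise ValueError(f"refusing unsafe cache dirname: {dirname!r}")
    # Double quotes keep the path a single word while still expanding $HOME.
    return f'[ -d "$HOME/{cache_root}/{dirname}" ]'
```

For example, `dir_test_command(".cache/example", "models--org--name")` yields `[ -d "$HOME/.cache/example/models--org--name" ]`, while a dirname containing `;`, a space, or a quote raises `ValueError` before anything reaches SSH.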

Verified against Spark 1: probe now correctly reports the
25,075,981,924 bytes (23.4 GiB) of Qwen3.6's cache dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:58:43 -05:00
Keysat 36ca99f73b v0.8.1:0 - delete model weights from disk via card trash icon
Each model card now shows whether its weights are present on disk
(with GB size) or not yet downloaded. When present and the model
isn't currently loaded, a trash icon appears; clicking it pops a
confirmation showing exactly how many GB will be freed and on
which Spark(s), then runs rm -rf on the HF cache directory via SSH.

Cluster-mode models are removed from both Sparks; solo-mode from
Spark 1 only. Safety rails: refuses to delete the currently-loaded
model, refuses during an in-flight swap or download, and the
catalog entry stays intact so it can be re-downloaded anytime.

Backend:
  - new image/app/disk.py: probe_disk + delete_from_disk over SSH
  - GET  /api/models/disk-status — parallel probe across all catalog models
  - DELETE /api/models/{key}/disk — guarded rm -rf, logs to connectivity events

Frontend:
  - on-disk / not-downloaded pills on every card
  - trash icon-btn in card-actions row (hidden when not on disk)
  - confirmation dialog showing per-host bytes-to-free
  - disk-status re-checked every 60s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:07:20 -05:00
Grant 5662d957af v0.8.0:4 - vLLM deep-health: 'no model loaded' is idle, not a wedge
Previously a ConnectError on /v1/models classified vLLM as failing, which would feed into the wedge auto-restart heuristic. But when no model is loaded (the normal idle state between swaps, or after a failed swap leaves the vllm_node container up with no process serving), nothing is listening on 8888 — that's by design, not a wedge.

The vLLM probe now does a two-step check:
  1. GET /v1/models. ConnectError or empty list -> ok=true with note='no model currently loaded (idle)'. No auto-restart triggered (it wouldn't help anyway — restarting vllm_node kills any loaded model and doesn't load a new one).
  2. If a model is loaded, POST 1-token chat completion. A 5xx here is a genuine wedge worth restarting for.

Result: deep-health correctly reports 'no model loaded' as informational rather than flagging it as a failure. Auto-restart for vLLM only fires when a model is actually loaded AND inference fails — the right semantics.
2026-05-12 14:50:00 -05:00
Grant 5a5634a3a9 v0.8.0:3 - add --max-num-batched-tokens=16384 to vision models (gemma4, qwen3-vl)
After the recent eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets:

  ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is
  larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.

Each image input produces 2496 tokens, which exceeds vLLM's default --max-num-batched-tokens of 2048. Same class of bug as the Qwen3.6 Mamba block-size assertion we fixed in 0.6.0:1, surfacing on different models.

Fix: bake --max-num-batched-tokens=16384 into every multimodal model entry. Now applied to:
  - qwen36 (already had it for the Mamba constraint; works for multimodal too since Qwen3.6 has vision)
  - gemma4 (crashed today on engine init)
  - qwen3-vl (would crash with the same error if anyone tried it)

The pre-flight Test button validates argparse, but the 2048<2496 check happens at runtime engine init, so Test does not catch it; only an actual load does. v0.7's Test catches *syntax* errors in the launch flags but not *semantic* ones, so runtime errors like this still surface only on a real swap. Known limitation documented in v0.7 release notes.
2026-05-12 14:47:32 -05:00
Grant 9edb70418e v0.8.0 - Deep health probes + auto-restart on CUDA wedge
deep_health.py:
- Synthetic probes per service, all payloads generated in-memory (BytesIO), never written to disk:
  - Parakeet: 1s of digital silence via in-memory WAV → POST /v1/audio/transcriptions
  - Magpie:   short 'hi' text → POST /v1/audio/synthesize (multipart form-data, real TTS API endpoint discovered via openapi.json)
  - vLLM:     1-token completion against currently-loaded model
- Background loop runs every 5 minutes (configurable). Best-effort: exceptions in the loop never kill it.
- Auto-restart on wedge-pattern errors (cudaErrorUnknown / CUFFT_INTERNAL_ERROR / 500 / Engine core init failed): docker restart of the affected container.
  - Rate-limited: max 3 restarts per service per 30 min.
  - Cooldown: 120 s between consecutive restarts on the same service.
  - 60 s startup grace before any auto-restart can fire after the app boots.
- Probe failures + recoveries logged via record_report(source='deep-health') into the connectivity history alongside the polling-based transitions.
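
A minimal sketch of the in-memory Parakeet probe payload (1 s of digital silence as a WAV in a BytesIO). The 16 kHz mono 16-bit format and the function name are assumptions; deep_health.py may use different parameters.

```python
import io
import wave


def silent_wav(seconds: float = 1.0, sample_rate: int = 16000) -> bytes:
    """Return a mono 16-bit PCM WAV of digital silence, built entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(sample_rate)
        # Each frame is one 16-bit zero sample, i.e. two zero bytes.
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    # The wave module does not close a file object it did not open itself.
    return buf.getvalue()
```

The returned bytes can be posted as the multipart file field of the transcription request without touching disk.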

API:
- GET /api/deep-health: per-service last result + auto-restart counters
- POST /api/deep-health/{service}/run: manual trigger now

UI:
- Service cards show 'Deep check ok/FAILED <time> <latency>' inline, plus a ↻ button to run-now
- Auto-restart count in 30-min window surfaced on the card when > 0
- Inline error excerpt shown for failed probes

Bug fix: server.py app startup hook was placed before the FastAPI app object was constructed (would crash on import). Moved it after the app is constructed.
2026-05-12 14:41:01 -05:00
Grant b7be1bab24 v0.7.0 - Pre-flight launch validation (Test button on every model card)
validate.py:
- Builds the same args list a real swap would pass to 'vllm serve'
- SSHes into Spark 1 and runs vLLM's own argparse layer inside the running vllm_node container, WITHOUT initializing the engine
- Uses FlexibleArgumentParser (from vllm.utils.argparse_utils, with fallback to engine.arg_utils) + make_arg_parser — the exact same parser the 'vllm serve' CLI uses. Earlier attempt with bare argparse.ArgumentParser was too strict (rejected '--moe_backend' with underscore that the real CLI accepts via FlexibleArgumentParser's normalization)
- Returns structured {ok, stage, error, cmd_args, launch_cmd} so the UI can surface the exact failure cause

Endpoint: POST /api/swap/{key}/validate. Cheap (~5s), no engine init, no disruption to the currently-loaded model.

Frontend: 'Test' button on every model card, inline result below the action row (green check or red detailed error). Result stays visible until the user reloads or clicks Test again.

Catches: typos in flag names, deprecated/removed flags after a vLLM upgrade, type mismatches. Does NOT catch runtime-only failures (Mamba block-size assertion, OOM at load, kernel-compat). ok=true is necessary but not sufficient; ok=false definitively means 'don't bother running it'.
2026-05-12 13:37:37 -05:00
Grant 6b799113c4 v0.6.0:1 - fix Qwen3.6 Mamba block-size assertion at launch
vLLM trips on launching Qwen3.6-35B-A3B-NVFP4 with:
  AssertionError: In Mamba cache align mode, block_size (2096) must be
  <= max_num_batched_tokens (2048).

Qwen3.6 uses a Mamba-attention hybrid. The default --max-num-batched-tokens of 2048 is just under the model's required block_size of 2096. The upstream sibling recipe (qwen3.5-35b-a3b-fp8.yaml) sets it to 16384; use the same value.

Earlier qwen36 swaps in this session worked because vLLM hadn't reached the Mamba-validation code path then (a different attention backend pick, or an auto-retry). Whatever the reason, the explicit flag avoids the problem.

Also documented in known-issues.md.
2026-05-12 13:22:24 -05:00
Grant d90e7a230a v0.6.0 - Service-level connectivity tracking + passive failure-report endpoint
connectivity.py:
- Generalized 'spark' subject to any string; renamed 'spark' field to 'subject'
- Legacy v0.5 events with the old 'spark' field are migrated transparently on read (kind defaults to 'transition')
- New record_report(subject, ok, source, detail, latency_ms): always appends an event with kind='report'; does NOT mutate the current state (only active polling is authoritative)
- summary() returns events normalized to the new schema
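
The migrate-on-read behavior could look roughly like this. The function name `normalize_event` is hypothetical; only the field rename ('spark' to 'subject') and the default kind come from this message.

```python
def normalize_event(event: dict) -> dict:
    """Return a copy of an event in the v0.6 schema, migrating legacy v0.5 fields."""
    out = dict(event)  # never mutate the stored event
    if "subject" not in out and "spark" in out:
        out["subject"] = out.pop("spark")  # legacy field name
    out.setdefault("kind", "transition")   # v0.5 events were all transitions
    return out
```

Because the migration happens on read, the file on disk can keep old and new events side by side and no one-off rewrite is needed.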

Wiring:
- /api/status now calls record_state for vllm/parakeet/magpie (dedup on no-change)
- /api/services calls record_state for each service after its http check
- Result: dashboard observes service-level transitions automatically with no extra polling

Passive endpoint:
- POST /api/health-event with {service, ok, source?, error?, ms?}
- Useful for external apps (e.g. Open WebUI) to surface sub-poll-interval failures the dashboard would otherwise miss

UI:
- Connectivity dialog groups events by subject (hosts ordered first, then services)
- Per-subject summary shows transition count, down count, report count, failed-report count
- Transitions and reports render inline with distinct styling; reports show source app + error + latency
- Legacy v0.5 events render unchanged

Docs:
- README documents /api/health-event with a curl example

Package: bump to 0.6.0:0
2026-05-12 13:19:27 -05:00
Grant 6209c40f79 v0.5.0 - Wake-on-LAN + connectivity history
wol.py:
- build_magic_packet(): standard layout of 6x 0xFF followed by 16x the MAC
- send_local_broadcast(): direct from container (ports 9 + 7 for safety)
- send_via_peer(): preferred path; SSHes to the OTHER Spark and runs a Python one-liner there so the packet originates on the target's LAN segment (most reliable)
- MAC validation + normalization
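
A sketch of the magic-packet layout and local broadcast. The function names `build_magic_packet` and `send_local_broadcast` match this message, but the bodies, the `normalize_mac` helper, and the broadcast address are assumptions rather than the contents of wol.py.

```python
import re
import socket

_MAC_RE = re.compile(r"[0-9a-f]{12}")


def normalize_mac(mac: str) -> str:
    """Strip common separators and validate that 12 hex digits remain."""
    cleaned = re.sub(r"[:.\-]", "", mac.strip().lower())
    if not _MAC_RE.fullmatch(cleaned):
        raise ValueError(f"invalid MAC address: {mac!r}")
    return cleaned


def build_magic_packet(mac: str) -> bytes:
    """Six 0xFF bytes followed by the 6-byte MAC repeated 16 times (102 bytes)."""
    return b"\xff" * 6 + bytes.fromhex(normalize_mac(mac)) * 16


def send_local_broadcast(mac: str, ports=(9, 7), address="255.255.255.255") -> None:
    """Send the packet as a UDP broadcast to each port."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for port in ports:
            sock.sendto(packet, (address, port))
```

Sending to both ports mirrors the "ports 9 + 7 for safety" note above; the packet itself is identical for either port.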

connectivity.py:
- /data/connectivity.json persistence (thread-safe, atomic rename)
- Stores per-Spark current state + last_change timestamp + rolling 200-event log
- Records up/down transitions; computes down_seconds / up_seconds durations
- MAC cache populated lazily during hardware probes

hardware.py:
- Probe now reads MAC via /sys/class/net/<default-route-iface>/address
- After each probe, record_state() emits a transition event if state changed
- record_mac() caches the address so WoL works when the Spark next goes down

Endpoints:
- GET /api/connectivity: macs, current state, last_change, events[]
- POST /api/spark/{name}/wake: tries via-peer first, falls back to direct broadcast

UI:
- Unreachable hardware card shows the cached MAC + 'Wake (WoL)' button (only if MAC known)
- New 'Connectivity log' button opens a modal with per-Spark transition history (last 25 each), including duration of each prior up/down period
- pollHardware also pulls /api/connectivity so WoL buttons appear without an extra fetch

Package: bump 0.5.0:0; main.ts sets CONNECTIVITY_LOG=/data/connectivity.json
2026-05-12 12:51:49 -05:00
Grant e332363004 v0.4.0 - NIM installer + dashboard resilience
Hotfix (was v0.3.1):
- services.py: cache 'unreachable' per (host,user) for 25s so a dead Spark doesn't hang every /api/services call behind a 6s SSH timeout
- ssh_run timeout reduced from 10s to 6s for docker_state probes
- hardware probe: shorter SSH timeout (6s), longer cache TTL for failures (25s)
- JS pollStatus retries loadModels() if state.models is empty (recovers from cold-start proxy timeout)
- Unreachable hardware card now includes troubleshooting steps (Spark Control cannot SSH into an unreachable Spark to restart it)

v0.4 NIM installer:
- nim.py module: curated SUGGESTED_NIMS list (Parakeet, Magpie, Riva) + NimManager that runs docker login nvcr.io + docker pull + docker run -d --gpus all -p PORT:PORT -v VOL:/opt/nim/.cache -e NGC_API_KEY -e ... --restart=unless-stopped + chown the volume to uid 1000 + restart. Streams all output via SSE; redacts the API key from log lines.
- custom_services.py: persists installed NIMs to /data/services-overrides.yaml so they appear in the services panel after install
- services.py: merges custom services into the panel
- /api/nim/catalog GET, /api/nim/install POST + GET/SSE
- /api/services/{name} DELETE for custom services
- UI: '+ Install NIM' button next to 'Always-on services'; modal lists curated images each with a 'Pick' button + a custom-image form; installation runs in a second dialog with phase + elapsed timer + collapsible log
- NGC API key field added to Configure Sparks (masked); injected as NGC_API_KEY env var into the container

Package: bump 0.4.0:0; main.ts adds SERVICES_OVERRIDES + NGC_API_KEY env vars
2026-05-12 12:32:29 -05:00
Grant ea35ac03ef v0.3.0:1 - hotfix: parallel SSH probes + longer timeout
- Hardware probes for spark1 and spark2 now run via asyncio.gather (parallel) so the worst-case wall time is max(per-probe), not sum
- Bump per-probe SSH timeout from 8s to 12s to absorb first-call overhead (StrictHostKeyChecking=accept-new on first connect + nvidia-smi cold start)
- Unreachable Spark now shows up cleanly in the UI as a single 'unreachable' card with the error message
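
A sketch of the parallel-probe pattern with `asyncio.gather`. The `probe_all` name, the result shape, and the injected `probe` coroutine are hypothetical; only the 12 s per-probe timeout and the use of gather come from this message.

```python
import asyncio


async def probe_all(probe, hosts, timeout_s: float = 12.0) -> dict:
    """Run probe(host) for every host concurrently.

    Wall time is roughly the slowest probe, not the sum of all probes.
    """

    async def guarded(host):
        try:
            return await asyncio.wait_for(probe(host), timeout_s)
        except Exception as exc:
            # Report a failed or timed-out probe as an 'unreachable' result
            # instead of letting it fail the whole gather.
            return {
                "host": host,
                "reachable": False,
                "error": str(exc) or type(exc).__name__,
            }

    results = await asyncio.gather(*(guarded(h) for h in hosts))
    return dict(zip(hosts, results))
```

Catching exceptions per host is what lets one dead Spark render as a single 'unreachable' card while the other Spark's snapshot still comes through.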
2026-05-12 12:14:36 -05:00
Grant 1a86fb0bf0 v0.3.0 - Hardware dashboard + knob context + Explain context + Open WebUI link
Hardware dashboard:
- New hardware.py module: SSH probes each Spark for hostname, uptime, load+cores, RAM, disk, GPU (name, util, temp, power) + per-process GPU memory sum
- DGX Spark uses unified memory (nvidia-smi memory.total returns N/A); fall back to per-process compute memory and compute fraction against system RAM. Marks with gpu_unified_memory=true.
- 4s TTL cache in HardwareProbe to avoid hammering
- /api/hardware returns per-Spark snapshot
- UI: 'Spark hardware' section at the top with per-Spark cards (CPU load, RAM, GPU mem (unified), GPU util + temp + power, disk) — bars with warn threshold styling
- Polls every 8s

Knob context (tied to live hardware):
- Each Advanced knob now shows plain-English help text
- 'GPU memory %' shows '~N GB allocated · ~M GB left for OS/buffers' computed from actual Spark RAM
- 'Max context' shows '~N pages of text'
- Toggles show tradeoff descriptions

Explain context:
- 'Explain context' button on the update banner
- /api/explain-updates POST: forwards pending commits to the loaded vLLM model and streams its response back as SSE
- Renders into an expandable 'Explained by the loaded LLM' section under Pending commits
- Reasoning tokens shown italicized when the model emits them

Open WebUI integration:
- New 'Open WebUI URL' optional field in Configure Sparks
- /api/config exposes it; UI shows 'Open chat ↗' button in the top bar if set

Downloads:
- Third radio option: Spark 1 only / Spark 2 only / Both Sparks
- Backend picks SSH target based on mode
- HF repo link icon next to the input
- Helper line about NVFP4 for Blackwell

Model cards:
- Repo name is now a clickable link to its Hugging Face page

Package: bump 0.3.0:0
2026-05-12 12:00:15 -05:00
Grant 66be0c1fc1 v0.2.4 - Hotfix: Unknown status + copy UX + update banner context
Bug fix:
- config.py: empty PARAKEET_CONTAINER / MAGPIE_CONTAINER env vars (from migrating to v0.2.0+ where the field is optional and saved as '') now fall back to 'parakeet-asr' / 'magpie-tts' via the 'or' idiom. Confirmed live: services classify as 'running' instead of 'unknown'.

UX:
- Replaced text 'Copy' buttons with compact icon buttons (clipboard SVG)
- Endpoint Base URL + Model ID + curl snippet are now click-to-copy themselves (the value AND a separate icon button)
- Service cards: host, base URL, and model are now three separate copyable rows
- Update banner: leading explanatory line — 'Updates to eugr/spark-vllm-docker — the upstream project that orchestrates vLLM on your Sparks. These are not firmware, OS, or model updates.' with a link to the repo.
2026-05-12 11:45:55 -05:00
Grant 91b5d6d6a6 docs: update README with v0.2 feature summary
2026-05-12 11:31:14 -05:00
Grant 4c67ccd28d v0.2.3 - Per-model Advanced settings + catalog-add for downloaded models
Backend:
- overrides.py: read/write /data/models-overrides.yaml (knobs + custom entries)
- apply_knobs_to_args(): strip matching flags from bundled vllm_args and append knob values, so knob changes properly override bundled defaults
- extract_knobs_from_args(): seed UI knob values from bundled args so the Advanced dialog has correct starting state
- models.py: load_catalog merges overrides on top of bundled yaml
- GET /api/models returns effective_knobs per model
- PUT /api/models/{key}/knobs persists knob changes
- POST /api/models adds a custom catalog entry
- DELETE /api/models/{key} removes a custom entry (bundled models cannot be deleted)
- swap_manager.reload_catalog() called after each mutation so swaps see latest
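A rough sketch of the strip-then-append behaviour described for `apply_knobs_to_args()`. The real signature and knob representation in `overrides.py` are not shown in this log, so the argument types below (a flat argv list and a flag-to-value mapping) are assumptions made for illustration.

```python
def apply_knobs_to_args(args: list, knobs: dict) -> list:
    """Drop every flag named in `knobs` from `args`, then append knob values.

    Assumed shapes (not confirmed by the source): `args` is a flat argv list,
    `knobs` maps a flag such as "--max-model-len" to a value, True, or False.
    Handles both "--flag value" and "--flag=value" spellings.
    """
    out = []
    i = 0
    while i < len(args):
        tok = args[i]
        name = tok.split("=", 1)[0]
        if name in knobs:
            # Skip the flag, plus its separate value token when it has one.
            has_inline_value = "=" in tok
            next_is_value = i + 1 < len(args) and not args[i + 1].startswith("--")
            i += 1 if (has_inline_value or not next_is_value) else 2
            continue
        out.append(tok)
        i += 1
    for flag, value in knobs.items():
        if value is True:
            out.append(flag)                 # boolean switch turned on
        elif value is False or value is None:
            continue                         # switch turned off: leave it stripped
        else:
            out.extend([flag, str(value)])
    return out
```

Stripping first means a knob set to off removes a bundled switch instead of leaving the bundled default in place.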

Frontend:
- New 'Advanced' button on every card opens a modal dialog: max-model-len input, gpu-memory-utilization slider, three optimization checkboxes (fastsafetensors, prefix caching, FP8 KV cache). Save persists; Cancel discards. Custom models also have a Delete button.
- After a successful download, automatically open the 'Add to catalog' dialog pre-filled with the repo, with the same knob defaults — user just enters key, display name, and clicks Save.
- Custom catalog entries are tagged with a blue 'custom' pill on the card.

Package: bump 0.2.3:0; main.ts sets MODELS_OVERRIDES=/data/models-overrides.yaml so overrides persist on the StartOS volume.
2026-05-12 11:30:47 -05:00
Grant ea328c2e2f v0.2.2 - spark-vllm-docker update checks + Apply Update
Backend:
- updates.py: get_update_status() runs git fetch + git rev-list --left-right --count HEAD...origin/main to learn ahead/behind/dirty, plus git log for pending commits
- UpdateManager class with asyncio.Lock; one update at a time
- POST /api/updates/apply triggers "git pull --ff-only && ./build-and-copy.sh -c" over SSH with streamed log + phase detection (Pulling / Building the vLLM container / Copying to peer Sparks)
- GET /api/updates returns {ok, behind, ahead, dirty, current, log[], branch}
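For reference, `git rev-list --left-right --count HEAD...origin/main` prints two whitespace-separated integers: commits reachable only from `HEAD` (ahead), then commits reachable only from `origin/main` (behind). A minimal parsing sketch, with a hypothetical helper name that is not taken from `updates.py`:

```python
def parse_left_right_count(output: str) -> tuple:
    """Parse `git rev-list --left-right --count A...B` output into (ahead, behind).

    The left count is commits only on A (ahead); the right is only on B (behind).
    """
    left, right = output.split()
    return int(left), int(right)
```

An output of `0\t3` would therefore mean the local checkout is three commits behind and a fast-forward pull is possible.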

Frontend:
- Persistent banner near footer: hidden when up-to-date, blue when N commits behind, warn (orange) when local dirty changes block update
- 'Show details' expands a list of pending commits
- 'Apply update' triggers the long-running build with phase + elapsed timer + collapsible logs
- Confirmation dialog explains the 5–40 min duration

Package: bump 0.2.2:0
2026-05-12 11:26:55 -05:00
Grant 27bfc2d6fd v0.2.1 - Model download with % progress
Backend:
- download.py module: drives ./hf-download.sh <repo> [-c --copy-parallel] over SSH, parses tqdm output (regex matches '8%|...| 2.06G/25.1G [03:20<18:35, 20.6MB/s]') into percent + bytes done/total + elapsed + ETA + rate
- DownloadManager: in-memory job tracking with asyncio.Lock (one download at a time)
- POST /api/download, GET /api/download/{id}, SSE /api/download/{id}/stream
- Phase detection: Connecting / Fetching N files / Downloading / Copying to peer Sparks / Done
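One way the tqdm parsing could look, using the sample line quoted above. The regular expression and the field names are an illustrative guess; the actual pattern in `download.py` is not shown in this log.

```python
import re

# Illustrative pattern for lines like:
#   8%|...| 2.06G/25.1G [03:20<18:35, 20.6MB/s]
TQDM_RE = re.compile(
    r"(?P<percent>\d+)%\|.*?\|\s*"
    r"(?P<done>[\d.]+\w*)/(?P<total>[\d.]+\w*)\s*"
    r"\[(?P<elapsed>[\d:]+)<(?P<eta>[\d:?]+),\s*(?P<rate>[^\]]+)\]"
)


def parse_tqdm_line(line: str):
    """Return progress fields from a tqdm status line, or None if it is not one."""
    m = TQDM_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["percent"] = int(fields["percent"])
    return fields
```

Lines that do not match, such as ordinary script output, return `None` and can be passed through to the raw log unchanged.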

Frontend:
- '+ Download a new model' button next to LLM swap section title
- Inline form: HF repo text field + solo/cluster radio + Cancel/Start
- Progress UI: spinner, elapsed timer, phase label, percent fill, stats line (bytes/rate/ETA), collapsible raw logs

Package: bump 0.2.1:0
2026-05-12 11:24:31 -05:00
Grant 61e5d5cce8 v0.2.0 - Always-on services panel with per-service host config
Dashboard:
- New 'Always-on services' section with cards for Parakeet and Magpie
- Each card: host:port, model loaded, status pill (Healthy/Unhealthy/Starting/Not configured)
- Start, Restart, Stop buttons. Buttons disabled when not applicable for current state
- Restart counter shown when > 1 (would have surfaced the old magpie crash loop)

Backend:
- New /api/services GET: docker container state + http health for each support service
- New POST /api/services/{name}/{action} for start | stop | restart
- services.py module: docker_state, run_action via SSH
- config.py: PARAKEET_HOST/USER/CONTAINER and MAGPIE_* env vars, default to spark2_*
- health.py: use per-service hosts (no longer hard-wired to spark2_host)

Package:
- sparkConfig.yaml.ts: add 6 new optional fields
- configureSparks action: optional 'Parakeet host', 'Parakeet container', 'Magpie host', 'Magpie container' fields; descriptions explain they default to Spark 2 when blank
- Handler normalizes nulls to empty strings before merge
- main.ts: pass new env vars to container
- bump to 0.2.0:0
2026-05-12 11:21:15 -05:00
Grant 0aa4bfb303 known-issues: mark magpie crash loop RESOLVED with chown fix recipe
Volume magpie-model-cache was owned by root, container drops to uid 1000. Fix:
docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache
+ docker restart magpie-tts. After ~3 GB NGC model download, healthy on :9000.
2026-05-12 11:12:25 -05:00
Grant ad68e0e16e 0.1.0:4 - expose /api/endpoints as separate StartOS service interface
Adds a second sdk.createInterface with type='api' and path='/api/endpoints' on the
same uiPort (9999). StartOS dashboard now shows two service interfaces: Web UI and
OpenAI-compatible API. The API URL is discoverable to other services without users
needing to remember the /api/endpoints suffix.
2026-05-12 11:07:51 -05:00
Grant d6fefec017 0.1.0:3 - Show Public Key layout + /api/endpoints service-discovery
- showPublicKey now uses result.group: install command and raw key are each their own one-click copy box; description is brief
- /api/endpoints returns stable shape { vllm, parakeet, magpie } with base_url + model + ready, for other LAN services to consume without hardcoding Spark IPs
- health.py: parakeet/magpie now also expose base_url
- README: documented /api/endpoints shape
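A consumer of the `/api/endpoints` shape might reduce it to the services that are currently usable. This sketch assumes only the fields named above (`base_url`, `ready`); the function name is hypothetical.

```python
def ready_base_urls(endpoints: dict) -> dict:
    """Map service name -> base_url for every service that reports ready."""
    return {
        name: info["base_url"]
        for name, info in endpoints.items()
        if info.get("ready") and info.get("base_url")
    }
```

A caller would fetch the JSON from `GET /api/endpoints` with any HTTP client and pass the decoded object in, so it never has to hardcode a Spark IP.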
2026-05-12 10:52:57 -05:00
Grant 22c817a4ec 0.1.0:2 - remove 'modelo' default everywhere (it's Grant's username, not factory)
Per user correction: 'modelo' is not the DGX Spark factory default. Generic-ize:
- configureSparks: no default user; placeholder 'your SSH username'
- sparkConfig schema: empty string defaults
- main.ts env fallback: empty
- showPublicKey: drop the 'modelo' fallback; skip Spark if user not configured
- Update feedback memory with the correction
2026-05-12 10:39:57 -05:00
Grant d718a3b78a Bump to 0.1.0:1 — portability + endpoint display
- configureSparks.ts: generic placeholders (e.g. 192.168.1.10), no Grant-specific IPs; descriptions explain the role of each node instead of naming his hardware
- showPublicKey.ts: reads sparkConfig.yaml; emits a ready-to-paste one-liner (KEY='...' followed by 'ssh user@host "echo $KEY >> authorized_keys"' for each configured Spark). Falls back to generic instructions if Configure Sparks hasn't been run yet.
- /api/status now includes vllm.base_url for the OpenAI endpoint
- New endpoint panel in UI: base URL + model ID rows with copy buttons + collapsible curl example
- Bump version to 0.1.0:1
2026-05-12 10:38:18 -05:00
Grant f99241ec3e Add per-model descriptions + repo-cleanup polish
- models.yaml: add 'description' field for all 5 models (generic, anyone-can-use)
- ModelDef gains optional description: str | None field
- UI: render description below meta tags; mute the repo line further
- escapeHtml() for safety in case descriptions/names contain HTML chars
- Update runbook: how to add a new model with description
2026-05-12 10:19:09 -05:00
Grant d6f4390372 Add friendly swap UI: timer + phase indicator + progress bar + collapsible logs
- Elapsed timer (mm:ss) in top-right of swap panel
- Phase display: Stopping / Starting / Loading weights (N/M shards) / Compiling / Warming up / Ready
- Progress bar with smooth fill mapped from phase
- Raw vLLM logs hidden behind <details> 'Show technical logs'
- Detection from log content (safetensors %, torch.compile, Application startup, Ray cluster join)
- Backfill from /api/swap/{id} on reattach (mid-swap reload works)
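Phase detection from log content could be sketched as an ordered list of markers. Only `Application startup complete` is quoted from the project notes; the other two patterns are illustrative guesses based on the terms in the list above, not verbatim vLLM log lines.

```python
import re

# (pattern, phase) pairs, checked in order. Apart from the startup marker,
# these patterns are assumptions for illustration.
PHASE_MARKERS = [
    (re.compile(r"Application startup complete"), "Ready"),
    (re.compile(r"torch\.compile"), "Compiling"),
    (re.compile(r"safetensors.*?\d+%"), "Loading weights"),
]


def detect_phase(line: str):
    """Return the phase a log line signals, or None if it signals nothing."""
    for pattern, phase in PHASE_MARKERS:
        if pattern.search(line):
            return phase
    return None
```

Replaying stored log lines through the same function is one way the mid-swap backfill could rebuild the current phase after a page reload.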
2026-05-12 10:11:14 -05:00
Grant 8a95609504 Add Spark prerequisites section to runbook (spark-vllm-docker is upstream + Spark-side) 2026-05-12 10:05:17 -05:00
Grant f553547c32 Update README with build flow + post-install steps; note IPv6/mDNS quirk 2026-05-12 10:03:37 -05:00
Grant 49f776f172 Pack spark-control_x86_64.s9pk (55 MB)
- Move models.yaml into image/ so the docker build context is self-contained
- Fix manifest: dockerfile=../image/Dockerfile, workdir=../image
- Add LICENSE (MIT) and assets/README.md (StartOS marketplace listing)
- s9pk validates: id=spark-control, version=0.1.0:0, osVersion=0.4.0-beta.6, sdkVersion=1.3.3
- Image embeds python:3.12-slim + openssh-client + FastAPI app + models.yaml
2026-05-12 09:52:53 -05:00
Grant 4a3eeb4f20 Add safe optimization flags to gemma4 + qwen36 (fastsafetensors, prefix-caching, fp8 kv)
Aligned with sibling recipes in eugr/spark-vllm-docker. Applies on next swap to each model.
First real swap gemma4 -> qwen36 succeeded in 5:30 with --moe_backend=flashinfer_cutlass.
2026-05-12 09:49:08 -05:00
Grant f2beb500e7 Add StartOS 0.4 package scaffold (manifest, main, interfaces, 2 actions)
- package/Makefile + s9pk.mk + package.json + tsconfig.json
- startos/manifest: dockerBuild source pointing at ../image/Dockerfile
- startos/main: reads /data/config.yaml reactively, passes env vars to container
- startos/interfaces: binds port 9999 as HTTP UI
- startos/actions: showPublicKey (read /data/ssh/id_ed25519.pub), configureSparks
- TS + JS bundle compile clean (tsc --noEmit, ncc build)
2026-05-12 09:36:15 -05:00
Grant db2f4269da Initial scaffold: image/ FastAPI app, models.yaml, docs
- image/ FastAPI app: /api/status, /api/swap, /api/swap/{id}/stream, /api/test-connection
- models.yaml: 5-model catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen25-72b)
- README, runbook, known-issues
- Dry-run swap verified against live Spark 1 (gemma4 currently loaded)
2026-05-12 09:29:13 -05:00
33 changed files with 3805 additions and 57 deletions
+1 -1
@@ -1,6 +1,6 @@
MIT License
Copyright (c) 2026 Alice
Copyright (c) 2026 Grant
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
+10 -10
@@ -31,17 +31,17 @@ Two layers in this repo:
cd image
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
export SPARK1_HOST=<spark-1-ip>
export SPARK1_USER=<spark-user>
export SPARK2_HOST=<spark-2-ip>
export SPARK2_USER=<spark-user>
export SPARK1_HOST=192.168.1.103
export SPARK1_USER=modelo
export SPARK2_HOST=192.168.1.87
export SPARK2_USER=modelo
export SSH_KEY_PATH="$HOME/Library/Application Support/NVIDIA/Sync/config/nvsync.key"
uvicorn app.server:app --host 0.0.0.0 --port 9999 --reload
```
Open <http://localhost:9999>.
> **Note:** use the **IP** `<spark-1-ip>` for Spark 1, not `<spark-1-host>.local`. mDNS resolves to IPv6 first and `httpx` hangs on it because vLLM only binds IPv4.
> **Note:** use the **IP** `192.168.1.103` for Spark 1, not `spark-27ea.local`. mDNS resolves to IPv6 first and `httpx` hangs on it because vLLM only binds IPv4.
## Build the StartOS package
@@ -58,8 +58,8 @@ To sideload onto your Start9: `make install` (needs `host:` set in `~/.startos/c
## Post-install setup (one-time per Start9 install)
1. Open the Spark Control service → **Actions** → **Show Public Key** → copy the line.
2. SSH to each Spark and append the line to `~/.ssh/authorized_keys` for the `<spark-user>` user.
3. **Actions** → **Configure Sparks** → enter `<spark-1-ip>` / `<spark-user>` for Spark 1 and `<spark-2-ip>` / `<spark-user>` for Spark 2.
2. SSH to each Spark and append the line to `~/.ssh/authorized_keys` for the `modelo` user.
3. **Actions** → **Configure Sparks** → enter `192.168.1.103` / `modelo` for Spark 1 and `192.168.1.87` / `modelo` for Spark 2.
4. Start the service. Open the Web UI — current model + health should show within ~5 s.
## Repo layout
@@ -76,9 +76,9 @@ Other services on your LAN can hit `GET /api/endpoints` to learn where the curre
```json
{
"vllm": { "ready": true, "base_url": "http://<spark-1-ip>:8888/v1", "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "openai_compat": true },
"parakeet":{ "ready": true, "base_url": "http://<spark-2-ip>:8000", "kind": "stt", "model": "nvidia/parakeet-tdt-0.6b-v3" },
"magpie": { "ready": false, "base_url": "http://<spark-2-ip>:9000", "kind": "tts" }
"vllm": { "ready": true, "base_url": "http://192.168.1.103:8888/v1", "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "openai_compat": true },
"parakeet":{ "ready": true, "base_url": "http://192.168.1.87:8000", "kind": "stt", "model": "nvidia/parakeet-tdt-0.6b-v3" },
"magpie": { "ready": false, "base_url": "http://192.168.1.87:9000", "kind": "tts" }
}
```
+19 -19
@@ -1,7 +1,7 @@
# Project: spark-control — Model switcher web UI for dual DGX Spark cluster
> **Update 2026-05-12 — Direction change:** the web UI is being built as a
> **StartOS 0.4 package** (sideloaded onto Alice's existing Start9 server),
> **StartOS 0.4 package** (sideloaded onto Grant's existing Start9 server),
> **not** as a FastAPI service running directly on Spark 1. The Start9 server
> shares a LAN with the Sparks and SSHes into Spark 1 to invoke
> `launch-cluster.sh`. StartOS handles `.local` exposure and HTTPS; SSH
@@ -38,8 +38,8 @@ The web UI itself, when deployed, will run on **Spark 1** (where it can directly
From my laptop I can SSH to either Spark directly:
```bash
ssh <spark-user>@<spark-1-ip> # Spark 1
ssh <spark-user>@<spark-2-ip> # Spark 2
ssh modelo@192.168.1.103 # Spark 1
ssh modelo@192.168.1.87 # Spark 2
```
(I can also use SSH key auth — set up earlier.)
@@ -47,7 +47,7 @@ ssh <spark-user>@<spark-2-ip> # Spark 2
When you need to run a command on a Spark, use this pattern:
```bash
ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
ssh modelo@192.168.1.103 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
```
For multi-line commands or scripts, you can pipe a heredoc or just SSH in directly and run them interactively. Either works — but always tell me what you're about to run so I can review.
@@ -55,19 +55,19 @@ For multi-line commands or scripts, you can pipe a heredoc or just SSH in direct
For file transfers between my laptop and the Sparks, use `rsync`:
```bash
rsync -avz ~/Projects/spark-control/ <spark-user>@<spark-1-ip>:~/spark-control/
rsync -avz ~/Projects/spark-control/ modelo@192.168.1.103:~/spark-control/
```
## My hardware and what's running
**Two NVIDIA DGX Spark units** networked together:
- **Spark 1** — hostname `<spark-1-host>`, LAN IP `<spark-1-ip>`, QSFP IP `<spark-1-qsfp-ip>`. Head node for the vLLM cluster.
- **Spark 2** — hostname `<spark-2-host>`, LAN IP `<spark-2-ip>`, QSFP IP `<spark-2-qsfp-ip>`. Worker node for vLLM cluster, also hosts standalone services.
- **Spark 1** — hostname `spark-27ea`, LAN IP `192.168.1.103`, QSFP IP `192.168.100.10`. Head node for the vLLM cluster.
- **Spark 2** — hostname `spark-32d0`, LAN IP `192.168.1.87`, QSFP IP `192.168.100.11`. Worker node for vLLM cluster, also hosts standalone services.
Both run Ubuntu 24.04, NVIDIA driver 580.x, CUDA 13.0, Docker, and have 128 GB unified memory each. They share a QSFP cable for high-speed (200 Gb/s) inter-node networking.
Passwordless SSH works in both directions via `~/.ssh/<ssh-key>` key. My Linux username on both machines is `<spark-user>`.
Passwordless SSH works in both directions via `~/.ssh/id_ed25519_shared` key. My Linux username on both machines is `modelo`.
**Currently running:**
- One LLM at a time on the cluster (via the `eugr/spark-vllm-docker` project — see below)
@@ -88,7 +88,7 @@ Key commands (all run from `~/spark-vllm-docker` on Spark 1):
Container names: `vllm_node` (the main vLLM container), `ray_head` and `ray_worker` (Ray cluster), plus support containers.
The vLLM server binds to port **8888** and exposes an OpenAI-compatible API at `http://<spark-1-ip>:8888/v1`.
The vLLM server binds to port **8888** and exposes an OpenAI-compatible API at `http://192.168.1.103:8888/v1`.
## Models I have on disk (both Sparks)
@@ -154,7 +154,7 @@ Note: the `--moe_backend flashinfer_cutlass` flag is Blackwell-specific. If it e
- Status check: `./launch-cluster.sh status`
- See vLLM logs: `docker logs vllm_node` (add `-f` to follow)
- Hard reset if stuck: `./launch-cluster.sh stop && docker ps -aq | xargs -r docker rm -f`
- Health check (is API responding?): `curl -s http://<spark-1-ip>:8888/v1/models`
- Health check (is API responding?): `curl -s http://192.168.1.103:8888/v1/models`
### "Ready" signal
The model is ready to serve when `docker logs vllm_node` contains the line `Application startup complete.` Until then, it's still loading weights or compiling CUDA graphs.
@@ -163,8 +163,8 @@ The model is ready to serve when `docker logs vllm_node` contains the line `Appl
These don't get touched by model swaps:
- **`parakeet-asr`** — STT on port 8000. Already running 24/7. Verify with `curl http://<spark-2-ip>:8000/health` which should return `{"status":"ready",...}`.
- **`magpie-tts`** — TTS on port 9000. May or may not be running; verify with `docker ps` on Spark 2 and `curl http://<spark-2-ip>:9000/v1/health/ready`.
- **`parakeet-asr`** — STT on port 8000. Already running 24/7. Verify with `curl http://192.168.1.87:8000/health` which should return `{"status":"ready",...}`.
- **`magpie-tts`** — TTS on port 9000. May or may not be running; verify with `docker ps` on Spark 2 and `curl http://192.168.1.87:9000/v1/health/ready`.
## What I want you to build
@@ -201,7 +201,7 @@ spark-control/
5. Return exit code 0 on success, non-zero on failure
Two versions might be useful:
- The version that runs on **my laptop** — wraps everything in `ssh <spark-user>@<spark-1-ip> ...`
- The version that runs on **my laptop** — wraps everything in `ssh modelo@192.168.1.103 ...`
- A simpler version that lives on **Spark 1** — runs commands directly without SSH (used by the deployed web UI)
You can either share one script with a `--remote` flag, or make them two distinct files. Your call — propose the cleaner option.
@@ -246,14 +246,14 @@ The web UI runs on **Spark 1** so it can directly invoke `launch-cluster.sh` wit
## First task
1. First, **verify SSH access to both Sparks** from my laptop:
- `ssh <spark-user>@<spark-1-ip> hostname` should return `<spark-1-host>`
- `ssh <spark-user>@<spark-2-ip> hostname` should return `<spark-2-host>`
- `ssh modelo@192.168.1.103 hostname` should return `spark-27ea`
- `ssh modelo@192.168.1.87 hostname` should return `spark-32d0`
2. Then **verify the current state of the cluster** via SSH:
- Confirm `~/spark-vllm-docker` exists on Spark 1 and `launch-cluster.sh` is there: `ssh <spark-user>@<spark-1-ip> 'ls ~/spark-vllm-docker/launch-cluster.sh'`
- Check which LLM (if any) is currently loaded: `ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'` and `ssh <spark-user>@<spark-1-ip> 'curl -s http://localhost:8888/v1/models'`
- Verify which models are downloaded: `ssh <spark-user>@<spark-1-ip> 'ls ~/.cache/huggingface/hub/ | grep -iE "qwen|gemma"'`
- Confirm `~/spark-vllm-docker` exists on Spark 1 and `launch-cluster.sh` is there: `ssh modelo@192.168.1.103 'ls ~/spark-vllm-docker/launch-cluster.sh'`
- Check which LLM (if any) is currently loaded: `ssh modelo@192.168.1.103 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'` and `ssh modelo@192.168.1.103 'curl -s http://localhost:8888/v1/models'`
- Verify which models are downloaded: `ssh modelo@192.168.1.103 'ls ~/.cache/huggingface/hub/ | grep -iE "qwen|gemma"'`
- Specifically check if `Qwen3.6-35B-A3B-NVFP4` is downloaded; if not, that's the prerequisite step (run the `hf-download.sh` command on Spark 1)
- Check what's running on Spark 2: `ssh <spark-user>@<spark-2-ip> 'docker ps'` (looking for parakeet-asr and possibly magpie-tts)
- Check what's running on Spark 2: `ssh modelo@192.168.1.87 'docker ps'` (looking for parakeet-asr and possibly magpie-tts)
3. Then create the repo structure on my laptop at `~/Projects/spark-control/`
4. Then propose the design for `models.yaml` and the swap script before implementing
+12
@@ -12,6 +12,18 @@ RUN chmod +x /app/entrypoint.sh
COPY models.yaml /app/models.yaml
# Parakeet container wrapper patches (diarizer.py + main.py overlay).
# Shipped inside spark-control so the "Reapply speech-model patches" action
# can copy these into the parakeet-asr container on Spark 2 over SSH at any
# time — survives docker rm + redeploy of the parakeet container.
COPY parakeet_patches /app/parakeet_patches
# WhisperX container build context (Dockerfile + requirements.txt + app/).
# The "Install WhisperX" action in spark-control ships these files to Spark 2
# over SSH, then runs `docker build` + `docker run` there. The container
# becomes a managed always-on service alongside parakeet-asr and magpie-tts.
COPY whisperx_container /app/whisperx_container
RUN pip install --no-cache-dir -e .
ENV BIND_PORT=9999
+434
@@ -0,0 +1,434 @@
"""OpenAI-compatible audio proxy: lets any OpenAI-shaped client (Open WebUI,
Home Assistant, etc.) talk to Parakeet (STT) and Magpie (TTS) through one URL.

Endpoints exposed on spark-control's port (same as the dashboard):
  GET  /v1/models                — lists STT model + Magpie voices in OpenAI shape
  POST /v1/audio/speech          — OpenAI TTS → Magpie /v1/audio/synthesize
  POST /v1/audio/transcriptions  — forward to Parakeet (already OpenAI-compatible)

Both downstream services already speak HTTP on the LAN; this module just adapts
request/response shapes so OpenAI clients don't need a custom integration.

When Parakeet returns a 500 (commonly the recurring CUDA wedge), the proxy
returns a clearer 503 with Retry-After=60, and fires the deep-health probe in
the background — which detects the wedge and triggers a rate-limited container
restart inside seconds. The client's next attempt ~60s later then succeeds.
"""
from __future__ import annotations

import asyncio
import logging
from typing import Any, Optional

import httpx
from fastapi import APIRouter, Form, HTTPException, Request, UploadFile, File
from fastapi.responses import Response, StreamingResponse
from pydantic import BaseModel

from .config import Settings

logger = logging.getLogger("spark-control.audio")


# Magpie voice name encodes its language. Example:
#   Magpie-Multilingual.EN-US.Mia    -> en-US
#   Magpie-Multilingual.ES-US.Diego  -> es-US
#   Magpie-Multilingual.FR-FR.Pascal -> fr-FR
def _lang_from_voice(voice: str) -> str:
    try:
        parts = voice.split(".")
        # parts = ["Magpie-Multilingual", "EN-US", "Mia"] (or with emotion suffix)
        if len(parts) >= 2 and "-" in parts[1]:
            lang_part = parts[1]  # "EN-US"
            primary, region = lang_part.split("-", 1)
            return f"{primary.lower()}-{region.upper()}"
    except Exception:
        pass
    return "en-US"


# Default voice: configurable, falls back to a sensible English voice if unset.
DEFAULT_VOICE = "Magpie-Multilingual.EN-US.Mia"


class SpeechRequest(BaseModel):
    """OpenAI /v1/audio/speech request body."""
    model: Optional[str] = None             # ignored — Magpie has one model
    input: str                              # the text to speak
    voice: Optional[str] = None             # e.g. "Magpie-Multilingual.EN-US.Mia"
    response_format: Optional[str] = "wav"  # only "wav" supported today
    speed: Optional[float] = 1.0            # ignored by Magpie
    # Magpie-specific extensions (clients may pass these through)
    language: Optional[str] = None
    sample_rate_hz: Optional[int] = 22050
    encoding: Optional[str] = "LINEAR_PCM"
def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
    """Build the audio proxy router.

    If `deep_health` is provided, 500s from Parakeet trigger an immediate
    background probe (which contains the same wedge-detect → auto-restart
    logic as the 5-minute periodic loop, but fires now instead of waiting).
    """
    router = APIRouter()

    def _parakeet_base() -> str:
        return f"http://{settings.parakeet_host}:{settings.parakeet_port}"

    def _magpie_base() -> str:
        return f"http://{settings.magpie_host}:{settings.magpie_port}"

    # ---- /v1/models ----
    @router.get("/v1/models")
    async def list_models() -> dict:
        """Advertise the STT model + a small voice menu so clients can
        populate their voice-picker UIs. Falls back gracefully if Magpie
        is offline (returns just the STT entry)."""
        data: list[dict] = [
            {
                "id": "parakeet-tdt-0.6b-v3",
                "object": "model",
                "owned_by": "nvidia",
                "kind": "stt",
            },
        ]
        # Try to enumerate voices from Magpie; if unreachable, just skip.
        try:
            async with httpx.AsyncClient(timeout=5.0) as client:
                r = await client.get(f"{_magpie_base()}/v1/audio/list_voices")
                if r.status_code == 200:
                    voices_by_locales = r.json()
                    seen = set()
                    for _locales, payload in voices_by_locales.items():
                        for v in payload.get("voices", []):
                            # Collapse emotion variants — expose only the base voice name.
                            # "Magpie-Multilingual.EN-US.Mia.Angry" -> "Magpie-Multilingual.EN-US.Mia"
                            parts = v.split(".")
                            base = ".".join(parts[:3]) if len(parts) >= 3 else v
                            if base not in seen:
                                seen.add(base)
                                data.append({
                                    "id": base,
                                    "object": "model",
                                    "owned_by": "nvidia",
                                    "kind": "tts",
                                })
        except Exception as e:
            logger.warning("magpie voice list unavailable: %s", e)
        return {"object": "list", "data": data}
    # ---- /v1/audio/speech (TTS) ----
    @router.post("/v1/audio/speech")
    async def speech(body: SpeechRequest) -> Response:
        """OpenAI-style TTS. Translates to Magpie's multipart synth call.

        Returns raw WAV bytes (Content-Type: audio/wav) — browsers and most
        clients play these directly.
        """
        text = (body.input or "").strip()
        if not text:
            raise HTTPException(400, "input text is required")
        voice = body.voice or DEFAULT_VOICE
        language = body.language or _lang_from_voice(voice)
        sample_rate = int(body.sample_rate_hz or 22050)
        encoding = body.encoding or "LINEAR_PCM"
        form = {
            "text": text,
            "language": language,
            "voice": voice,
            "sample_rate_hz": str(sample_rate),
            "encoding": encoding,
        }
        try:
            async with httpx.AsyncClient(timeout=120.0) as client:
                r = await client.post(f"{_magpie_base()}/v1/audio/synthesize", data=form)
        except httpx.HTTPError as e:
            raise HTTPException(502, f"magpie unreachable: {e}")
        if r.status_code != 200:
            # Surface Magpie's error message verbatim so clients can debug voice/lang typos.
            raise HTTPException(r.status_code, r.text[:500])
        # Magpie returns WAV bytes already (Content-Type: audio/wav). Pass through.
        media_type = r.headers.get("content-type", "audio/wav")
        return Response(content=r.content, media_type=media_type)
    # ---- /v1/audio/transcriptions (STT) ----
    @router.post("/v1/audio/transcriptions")
    async def transcriptions(
        file: UploadFile = File(...),
        model: Optional[str] = Form(default=None),
        language: Optional[str] = Form(default=None),
        prompt: Optional[str] = Form(default=None),
        response_format: Optional[str] = Form(default="json"),
        temperature: Optional[float] = Form(default=None),
    ) -> Response:
        """Forward to Parakeet's already-OpenAI-compatible endpoint.

        We relay rather than redirect so clients only need to know one URL
        (spark-control's) — and so any future client-side rewrites of the
        request shape (e.g. translating Whisper-format params) happen here.
        """
        body = await file.read()
        files = {"file": (file.filename or "audio.wav", body, file.content_type or "application/octet-stream")}
        data: dict[str, str] = {}
        if model: data["model"] = model
        if language: data["language"] = language
        if prompt: data["prompt"] = prompt
        if response_format: data["response_format"] = response_format
        if temperature is not None: data["temperature"] = str(temperature)
        try:
            async with httpx.AsyncClient(timeout=300.0) as client:
                r = await client.post(
                    f"{_parakeet_base()}/v1/audio/transcriptions",
                    files=files, data=data,
                )
        except httpx.HTTPError as e:
            raise HTTPException(502, f"parakeet unreachable: {e}")
        if r.status_code == 500:
            # Parakeet 500s are almost always the CUDA wedge (CUBLAS_*_ERROR
            # mid-attention). Kick deep-health to detect+restart in the
            # background, and return a clean retry signal to the client.
            err_snippet = r.text[:400]
            logger.warning("parakeet 500 — firing deep-health probe in background. detail=%s", err_snippet)
            if deep_health is not None:
                try:
                    asyncio.create_task(deep_health.run_one("parakeet"))
                except Exception as e:
                    logger.error("failed to schedule deep-health probe: %s", e)
            raise HTTPException(
                status_code=503,
                detail="Parakeet returned a transient error (likely CUDA wedge). Auto-restart triggered; retry in ~60s.",
                headers={"Retry-After": "60"},
            )
        if r.status_code != 200:
            raise HTTPException(r.status_code, r.text[:500])
        return Response(content=r.content, media_type=r.headers.get("content-type", "application/json"))
    def _whisperx_base() -> str:
        return f"http://{settings.whisperx_host}:{settings.whisperx_port}"

    async def _whisperx_healthy() -> bool:
        try:
            async with httpx.AsyncClient(timeout=2.0) as client:
                r = await client.get(f"{_whisperx_base()}/health")
                return r.status_code == 200 and bool(r.json().get("diarizer_loaded"))
        except Exception:
            return False

    # ---- /api/audio/transcribe-with-speakers (STT + diarization, merged) ----
    @router.post("/api/audio/transcribe-with-speakers")
    async def transcribe_with_speakers(
        file: UploadFile = File(...),
    ) -> dict:
        """Diarized transcription: run Parakeet ASR and Sortformer diarization on
        the same audio in parallel, then merge by timestamp.

        Response shape (designed for downstream UIs like recap-relay):
            {
              "duration": 90.5,
              "language": "en",
              "speakers_detected": ["Speaker_0", "Speaker_1"],
              "segments": [
                {"start_ms": 39308, "end_ms": 51000,
                 "speaker": "Speaker_0", "text": "good morning i think..."},
                ...
              ],
              "models": {
                "transcription": "parakeet-tdt-0.6b-v3",
                "diarization": "nvidia/diar_sortformer_4spk-v1"
              }
            }

        Each segment is a block of consecutive words by the same speaker. Speaker
        labels are anonymous (Speaker_0, Speaker_1, ...) — name resolution is the
        caller's responsibility (LLM analysis with optional participant hints,
        or manual mapping UI).
        """
        body = await file.read()
        if not body:
            raise HTTPException(400, "Empty file")
        filename = file.filename or "audio.wav"
        content_type = file.content_type or "application/octet-stream"

        # Prefer WhisperX (single-pipeline, handles long audio properly) when it's
        # installed and healthy. Fall back to Parakeet + Sortformer otherwise.
        if await _whisperx_healthy():
            files = {"file": (filename, body, content_type)}
            try:
                async with httpx.AsyncClient(timeout=1800.0) as client:
                    r = await client.post(
                        f"{_whisperx_base()}/v1/audio/transcribe-with-speakers",
                        files=files,
                    )
            except httpx.HTTPError as e:
                raise HTTPException(502, f"whisperx unreachable: {e}")
            if r.status_code != 200:
                raise HTTPException(r.status_code, r.text[:500])
            return r.json()
        # ── Legacy fallback: Parakeet ASR + Sortformer diarizer in parallel ──
        async def _call_transcribe(client: httpx.AsyncClient) -> dict:
            files = {"file": (filename, body, content_type)}
            data = {"response_format": "verbose_json"}
            r = await client.post(
                f"{_parakeet_base()}/v1/audio/transcriptions",
                files=files, data=data,
            )
            r.raise_for_status()
            return r.json()

        async def _call_diarize(client: httpx.AsyncClient) -> dict:
            files = {"file": (filename, body, content_type)}
            r = await client.post(
                f"{_parakeet_base()}/v1/audio/diarize",
                files=files,
            )
            r.raise_for_status()
            return r.json()

        # Run both in parallel against the same Parakeet container — Sortformer
        # and Parakeet ASR are independent forward passes that share the GPU.
        try:
            async with httpx.AsyncClient(timeout=600.0) as client:
                stt, diar = await asyncio.gather(
                    _call_transcribe(client),
                    _call_diarize(client),
                )
        except httpx.HTTPStatusError as e:
            # Surface upstream errors. If transcribe wedged, kick deep-health.
            if e.response.status_code == 500 and deep_health is not None:
                try:
                    asyncio.create_task(deep_health.run_one("parakeet"))
                except Exception:
                    pass
                raise HTTPException(
                    status_code=503,
                    detail="Parakeet transient error (likely CUDA wedge). Auto-restart triggered; retry in ~60s.",
                    headers={"Retry-After": "60"},
                )
            raise HTTPException(e.response.status_code, e.response.text[:500])
        except httpx.HTTPError as e:
            raise HTTPException(502, f"parakeet unreachable: {e}")

        merged = _merge_words_with_speakers(
            words=stt.get("words", []),
            diar_turns=diar.get("segments", []),
        )
        return {
            "duration": stt.get("duration") or diar.get("duration") or 0.0,
            "language": stt.get("language", "en"),
            "speakers_detected": diar.get("speakers_detected", []),
            "segments": merged,
            "models": {
                "transcription": stt.get("model") if isinstance(stt.get("model"), str) else "parakeet",
                "diarization": diar.get("model", "sortformer"),
            },
        }

    return router
# ---- Merge helper: assign speaker to each word, then group into blocks ----
def _assign_speaker_to_word(word_start_s: float, word_end_s: float, diar_turns: list[dict]) -> str:
"""Find the diarization turn that contains this word, or has the most
overlap with it. Returns the speaker label, or 'Speaker_unknown' if no
turn overlaps at all."""
word_mid = (word_start_s + word_end_s) / 2.0
# Fast path: find the turn containing the midpoint
for t in diar_turns:
if t["start_s"] <= word_mid <= t["end_s"]:
return t["speaker"]
# Slow path: pick the turn with max overlap with the word's span
best_speaker = "Speaker_unknown"
best_overlap = 0.0
for t in diar_turns:
overlap = max(0.0, min(word_end_s, t["end_s"]) - max(word_start_s, t["start_s"]))
if overlap > best_overlap:
best_overlap = overlap
best_speaker = t["speaker"]
return best_speaker
def _merge_words_with_speakers(words: list[dict], diar_turns: list[dict]) -> list[dict]:
"""Group consecutive same-speaker words into blocks.
Each input word: {"start": float_s, "end": float_s, "text": str} (Parakeet
verbose_json format; values are seconds).
Each input turn: {"start_s": float, "end_s": float, "speaker": str}.
Output: [{"start_ms": int, "end_ms": int, "speaker": str, "text": str}, ...]
Also breaks a block on a long silence gap (>1.5 s) even within the same
speaker — keeps blocks readable in UI rendering.
"""
if not words:
return []
SILENCE_BREAK_S = 1.5
def _join_words(parts: list[str]) -> str:
"""Join word tokens with proper spacing. Different STT outputs vary —
some include leading spaces in the word text (' morning'), some don't
('morning'). Normalize by stripping each token then joining with one
space; collapse multiple spaces. Keeps punctuation tight (no space
before period/comma/etc.)."""
cleaned = [p.strip() for p in parts if p and p.strip()]
if not cleaned:
return ""
out = cleaned[0]
for token in cleaned[1:]:
# No leading space before pure-punctuation tokens
if token and token[0] in ".,;:!?)]}'\"":
out += token
else:
out += " " + token
return out
blocks: list[dict] = []
cur_words: list[str] = []
cur_speaker: Optional[str] = None
cur_start_s: Optional[float] = None
cur_end_s: Optional[float] = None
for w in words:
ws = float(w.get("start", 0.0))
we = float(w.get("end", ws))
wt = str(w.get("text", ""))
spk = _assign_speaker_to_word(ws, we, diar_turns)
is_new_block = (
cur_speaker is None
or spk != cur_speaker
or (cur_end_s is not None and ws - cur_end_s > SILENCE_BREAK_S)
)
if is_new_block:
if cur_speaker is not None:
blocks.append({
"start_ms": int(cur_start_s * 1000),
"end_ms": int(cur_end_s * 1000),
"speaker": cur_speaker,
"text": _join_words(cur_words),
})
cur_words = [wt]
cur_speaker = spk
cur_start_s = ws
cur_end_s = we
else:
cur_words.append(wt)
cur_end_s = we
if cur_speaker is not None and cur_words:
blocks.append({
"start_ms": int(cur_start_s * 1000),
"end_ms": int(cur_end_s * 1000),
"speaker": cur_speaker,
"text": _join_words(cur_words),
})
return blocks
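A minimal worked sketch of the merge logic above, assuming word and turn dictionaries shaped as the docstring describes. `assign_speaker` and `merge` are simplified stand-ins written for illustration (they are not the module's own functions), punctuation-aware joining is omitted, and the sample words and turns are made up.

```python
from typing import Optional

SILENCE_BREAK_S = 1.5  # same threshold as the helper above


def assign_speaker(ws: float, we: float, turns: list[dict]) -> str:
    """Midpoint containment first, then largest overlap, else unknown."""
    mid = (ws + we) / 2.0
    for t in turns:
        if t["start_s"] <= mid <= t["end_s"]:
            return t["speaker"]
    best, best_overlap = "Speaker_unknown", 0.0
    for t in turns:
        overlap = max(0.0, min(we, t["end_s"]) - max(ws, t["start_s"]))
        if overlap > best_overlap:
            best, best_overlap = t["speaker"], overlap
    return best


def merge(words: list[dict], turns: list[dict]) -> list[dict]:
    """Group consecutive same-speaker words; split on long silences."""
    blocks: list[dict] = []
    for w in words:
        spk = assign_speaker(w["start"], w["end"], turns)
        last: Optional[dict] = blocks[-1] if blocks else None
        if (
            last is not None
            and last["speaker"] == spk
            and w["start"] - last["end_s"] <= SILENCE_BREAK_S
        ):
            # Same speaker, short gap: extend the current block.
            last["text"] += " " + w["text"].strip()
            last["end_s"] = w["end"]
        else:
            # Speaker change or long silence: start a new block.
            blocks.append({"speaker": spk, "start_s": w["start"],
                           "end_s": w["end"], "text": w["text"].strip()})
    return blocks


# Hypothetical input: two words from speaker A, then two from speaker B
# separated by a 2.4 s pause.
words = [
    {"start": 0.0, "end": 0.4, "text": "good"},
    {"start": 0.5, "end": 0.9, "text": " morning"},
    {"start": 1.2, "end": 1.6, "text": "hi"},
    {"start": 4.0, "end": 4.5, "text": "there"},
]
turns = [
    {"start_s": 0.0, "end_s": 1.0, "speaker": "A"},
    {"start_s": 1.0, "end_s": 5.0, "speaker": "B"},
]
result = merge(words, turns)
```

With this input the sketch yields three blocks: one for speaker A and two for speaker B, because the pause before the last word exceeds the 1.5 s silence threshold.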
+11 -1
@@ -35,6 +35,11 @@ class Settings:
magpie_host: str
magpie_user: str
magpie_container: str
whisperx_host: str
whisperx_user: str
whisperx_container: str
whisperx_port: int
whisperx_model: str
ssh_key_path: str
ssh_known_hosts: str
models_yaml: str
@@ -49,7 +54,7 @@ class Settings:
def from_env(cls) -> "Settings":
spark2_host = _env("SPARK2_HOST")
spark2_user = _env("SPARK2_USER")
# Parakeet and Magpie default to Spark 2 unless explicitly overridden.
# Parakeet, Magpie, and WhisperX all default to Spark 2 unless overridden.
return cls(
spark1_host=_env("SPARK1_HOST"),
spark1_user=_env("SPARK1_USER"),
@@ -61,6 +66,11 @@ class Settings:
magpie_host=_env("MAGPIE_HOST") or spark2_host,
magpie_user=_env("MAGPIE_USER") or spark2_user,
magpie_container=_env("MAGPIE_CONTAINER") or "magpie-tts",
whisperx_host=_env("WHISPERX_HOST") or spark2_host,
whisperx_user=_env("WHISPERX_USER") or spark2_user,
whisperx_container=_env("WHISPERX_CONTAINER") or "whisperx-asr",
whisperx_port=int(_env("WHISPERX_PORT", "8002")),
whisperx_model=_env("WHISPERX_MODEL", "medium"),
ssh_key_path=_env("SSH_KEY_PATH"),
ssh_known_hosts=_env("SSH_KNOWN_HOSTS"),
models_yaml=_resolve_models_yaml(),
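The fallback pattern in this hunk can be sketched on its own. This is an illustration under assumptions: `resolve_whisperx` is a hypothetical function, it reads from a plain mapping instead of the module's `_env` helper (whose exact behavior is not shown here), and the host names are invented.

```python
from typing import Mapping


def resolve_whisperx(env: Mapping[str, str]) -> dict:
    """Resolve WhisperX settings, defaulting the host to Spark 2."""
    spark2_host = env.get("SPARK2_HOST", "")
    return {
        # `or` treats an empty value the same as a missing one.
        "host": env.get("WHISPERX_HOST") or spark2_host,
        "container": env.get("WHISPERX_CONTAINER") or "whisperx-asr",
        # A default argument only applies when the key is absent.
        "port": int(env.get("WHISPERX_PORT", "8002")),
        "model": env.get("WHISPERX_MODEL", "medium"),
    }


# Hypothetical environments for illustration.
defaults = resolve_whisperx({"SPARK2_HOST": "spark2.local"})
override = resolve_whisperx({
    "SPARK2_HOST": "spark2.local",
    "WHISPERX_HOST": "10.0.0.5",
    "WHISPERX_PORT": "9000",
})
```

In this sketch an empty `WHISPERX_HOST` falls back to the Spark 2 host, while an empty `WHISPERX_PORT` would make `int("")` raise; whether the real `_env` helper behaves the same way depends on its implementation.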
+2 -2
@@ -4,8 +4,8 @@ Format:
custom:
- key: my-riva
kind: stt
host: <spark-2-ip>
user: <spark-user>
host: 192.168.1.87
user: modelo
container: riva-asr
port: 8001
health_path: /health
+363
@@ -0,0 +1,363 @@
"""Deep health probes for each service.
Why this exists: Triton's /health endpoint returns 200 as long as the HTTP
layer is alive and the model is registered. It does NOT verify that the CUDA
context inside the worker process is healthy. We've observed Parakeet getting
its CUDA context wedged after an OOM, where /health stays green but every
real transcription returns 500 cudaErrorUnknown.
So this module sends *real* but tiny synthetic inference requests:
- Parakeet: 1 second of digital silence (16 kHz mono PCM, in-memory WAV)
- Magpie: short text-to-speech, response audio discarded
- vLLM: 1-token chat completion against whatever model is loaded
All synthetic payloads are generated on demand into BytesIO, sent over HTTP,
and never touch the filesystem (on either spark-control's side or the
target service's side beyond normal Triton/Riva working memory).
When a probe fails with a signal that looks like a CUDA wedge, we
automatically issue `docker restart <container>`. Rate-limited to 3 restarts
per service per 30 minutes to avoid restart loops.
"""
from __future__ import annotations
import asyncio
import io
import time
import wave
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import httpx
from .config import Settings
from .connectivity import record_report
from .services import ServiceDef, run_action, services_from_settings
# Default 5-minute interval, controllable via env. Sub-minute is silly for a
# heavy synthetic probe; we just want to catch wedges within a reasonable
# window — much faster than the user noticing on their next real call.
DEFAULT_INTERVAL_SEC = 300.0
PROBE_TIMEOUT_SEC = 20.0
RESTART_RATE_LIMIT = 3 # max auto-restarts per service
RESTART_RATE_WINDOW_SEC = 1800.0 # within a 30-min window
RESTART_COOLDOWN_SEC = 120.0 # don't restart again within this many seconds of the last one
STARTUP_GRACE_SEC = 60.0 # don't auto-restart for the first minute after this app boots
def _silence_wav(seconds: float = 1.0, sample_rate: int = 16000) -> io.BytesIO:
"""Return an in-memory WAV file containing `seconds` of digital silence."""
n_frames = int(seconds * sample_rate)
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
w.setnchannels(1)
w.setsampwidth(2) # int16
w.setframerate(sample_rate)
w.writeframes(b"\x00\x00" * n_frames)
buf.seek(0)
return buf
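A quick way to sanity-check a synthetic payload like this is to read it back with the standard `wave` module. The sketch below repeats the generator under the name `silence_wav` (a copy for illustration, so the block is self-contained) and inspects the result.

```python
import io
import wave


def silence_wav(seconds: float = 1.0, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono int16 WAV filled with zeros."""
    n_frames = int(seconds * sample_rate)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = int16
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * n_frames)
    buf.seek(0)
    return buf


buf = silence_wav(1.0)
total_size = len(buf.getvalue())  # header plus PCM payload

# Read the header and payload back to confirm the shape of the probe.
with wave.open(buf, "rb") as r:
    n_channels = r.getnchannels()
    rate = r.getframerate()
    n_frames = r.getnframes()
    payload = r.readframes(n_frames)
```

One second at 16 kHz mono int16 is 32,000 bytes of PCM data plus a small header, so the probe stays cheap to build and send.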
def _looks_like_wedge(error: str) -> bool:
"""Heuristic: does this error string look like a stuck CUDA context that
a container restart would clear? We want to be conservative — only act
on signals we're confident about, otherwise leave the user in charge."""
err = (error or "").lower()
needles = [
"cudaerrorunknown",
"cuda error: unknown",
"cuda kernel errors",
"internal server error",
"engine core initialization failed",
"503", # service unavailable from a dependency
"500", # generic 5xx with a body that may not parse
]
return any(n in err for n in needles)
@dataclass
class ProbeResult:
ok: bool
at: str
latency_ms: Optional[int] = None
error: str = ""
note: str = ""
@dataclass
class ServiceState:
last: Optional[ProbeResult] = None
last_ok_at: Optional[str] = None
restarts: list[float] = field(default_factory=list)
class DeepHealth:
def __init__(self, settings: Settings, interval_sec: float = DEFAULT_INTERVAL_SEC) -> None:
self.settings = settings
self.interval_sec = interval_sec
self.state: dict[str, ServiceState] = {
"parakeet": ServiceState(),
"magpie": ServiceState(),
"vllm": ServiceState(),
}
self._stop = asyncio.Event()
self._boot_at = time.monotonic()
# ---- probes ---------------------------------------------------------
async def probe_parakeet(self) -> ProbeResult:
s = self.settings
now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
if not s.parakeet_host:
return ProbeResult(ok=False, at=now_iso, error="not configured")
url = f"http://{s.parakeet_host}:{s.parakeet_port}/v1/audio/transcriptions"
wav = _silence_wav(1.0)
t0 = time.monotonic()
try:
async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
r = await c.post(
url,
files={"file": ("probe.wav", wav, "audio/wav")},
data={"model": "parakeet-tdt-0.6b-v3"},
)
latency = round((time.monotonic() - t0) * 1000)
if 200 <= r.status_code < 300:
return ProbeResult(ok=True, at=now_iso, latency_ms=latency)
return ProbeResult(
ok=False,
at=now_iso,
latency_ms=latency,
error=f"HTTP {r.status_code}: {r.text[:240]}",
)
except Exception as e:
return ProbeResult(ok=False, at=now_iso, error=f"{type(e).__name__}: {e}")
async def probe_magpie(self) -> ProbeResult:
s = self.settings
now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
if not s.magpie_host:
return ProbeResult(ok=False, at=now_iso, error="not configured")
# Magpie /v1/audio/synthesize expects multipart form-data, not JSON.
# The (None, value) tuple in httpx's `files=` produces a non-file form field.
url = f"http://{s.magpie_host}:{s.magpie_port}/v1/audio/synthesize"
form: dict = {"text": (None, "hi"), "language": (None, "en-US")}
t0 = time.monotonic()
try:
async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
r = await c.post(url, files=form)
latency = round((time.monotonic() - t0) * 1000)
if 200 <= r.status_code < 300:
return ProbeResult(ok=True, at=now_iso, latency_ms=latency)
# A 4xx response means the server is alive but our payload is off, so
# don't classify it as a wedge.
if 400 <= r.status_code < 500:
return ProbeResult(
ok=True,
at=now_iso,
latency_ms=latency,
note=f"{r.status_code} — server alive (probe payload may need a voice name)",
)
return ProbeResult(
ok=False,
at=now_iso,
latency_ms=latency,
error=f"HTTP {r.status_code}: {r.text[:240]}",
)
except Exception as e:
return ProbeResult(ok=False, at=now_iso, error=f"{type(e).__name__}: {e}")
async def probe_vllm(self) -> ProbeResult:
s = self.settings
now_iso = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
if not s.spark1_host:
return ProbeResult(ok=False, at=now_iso, error="not configured")
base = f"http://{s.spark1_host}:{s.vllm_port}"
# Step 1: is there a model loaded?
try:
async with httpx.AsyncClient(timeout=5.0) as c:
r = await c.get(f"{base}/v1/models")
if 200 <= r.status_code < 300:
models = r.json().get("data") or []
else:
# 5xx on /v1/models suggests something wedged after a model loaded
return ProbeResult(
ok=False,
at=now_iso,
error=f"list_models HTTP {r.status_code}: {r.text[:240]}",
)
except Exception:
# Connection refused / timeout: usually means no vLLM process listening
# (the vllm_node container is alive but no `vllm serve` is running yet).
# That's an idle state, not a wedge — don't trigger auto-restart.
return ProbeResult(
ok=True,
at=now_iso,
note="no model currently loaded (idle)",
)
if not models:
return ProbeResult(
ok=True,
at=now_iso,
note="no model currently loaded (idle)",
)
model_id = models[0]["id"]
# Step 2: model is loaded; verify it can actually complete a 1-token request.
t0 = time.monotonic()
try:
async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
r = await c.post(
f"{base}/v1/chat/completions",
json={
"model": model_id,
"messages": [{"role": "user", "content": "hi"}],
"max_tokens": 1,
"temperature": 0,
},
)
latency = round((time.monotonic() - t0) * 1000)
if 200 <= r.status_code < 300:
return ProbeResult(ok=True, at=now_iso, latency_ms=latency, note=f"model={model_id}")
return ProbeResult(
ok=False,
at=now_iso,
latency_ms=latency,
error=f"HTTP {r.status_code}: {r.text[:240]}",
)
except Exception as e:
return ProbeResult(ok=False, at=now_iso, error=f"{type(e).__name__}: {e}")
# ---- orchestration --------------------------------------------------
PROBES = {
"parakeet": "probe_parakeet",
"magpie": "probe_magpie",
"vllm": "probe_vllm",
}
async def run_one(self, service: str) -> ProbeResult:
fn = getattr(self, self.PROBES[service])
result: ProbeResult = await fn()
st = self.state[service]
prev_ok = st.last.ok if st.last else None
st.last = result
if result.ok:
st.last_ok_at = result.at
# Log to connectivity history: every failure, plus the first success
# after a failure (recovery), plus the first probe ever — but skip
# the "still ok" steady-state to keep the log readable.
if not result.ok:
record_report(
service,
ok=False,
source="deep-health",
detail=result.error[:240],
latency_ms=result.latency_ms,
)
elif prev_ok is False:
record_report(
service,
ok=True,
source="deep-health",
detail="recovered" + (f" ({result.note})" if result.note else ""),
latency_ms=result.latency_ms,
)
elif prev_ok is None:
record_report(
service,
ok=True,
source="deep-health",
detail="first probe ok" + (f" ({result.note})" if result.note else ""),
latency_ms=result.latency_ms,
)
# Maybe auto-restart
if not result.ok and _looks_like_wedge(result.error):
await self._maybe_restart(service, result.error)
return result
async def _maybe_restart(self, service: str, error: str) -> None:
# No restarts during the boot grace period.
if time.monotonic() - self._boot_at < STARTUP_GRACE_SEC:
return
st = self.state[service]
now = time.monotonic()
st.restarts = [t for t in st.restarts if now - t < RESTART_RATE_WINDOW_SEC]
if st.restarts and now - st.restarts[-1] < RESTART_COOLDOWN_SEC:
return # already restarted recently, give it time
if len(st.restarts) >= RESTART_RATE_LIMIT:
record_report(
service,
ok=False,
source="deep-health",
detail=f"rate-limited; not auto-restarting (would be #{len(st.restarts)+1} in 30 min)",
)
return
services = services_from_settings(self.settings)
if service not in services:
return
svc = services[service]
if not svc.host or not svc.user:
return
result = await run_action(self.settings, svc, "restart")
st.restarts.append(now)
ok = result.get("ok", False)
record_report(
service,
ok=False,
source="deep-health",
detail=f"auto-restart triggered (wedge: {error[:120]}); restart {'OK' if ok else 'FAILED'}",
)
async def run_all(self) -> dict[str, ProbeResult]:
results = {}
for name in self.PROBES:
results[name] = await self.run_one(name)
return results
async def run_periodic(self) -> None:
"""Long-running loop. Cancel via .stop()."""
# Brief initial wait to let app finish startup
try:
await asyncio.wait_for(self._stop.wait(), timeout=10.0)
return
except asyncio.TimeoutError:
pass
while not self._stop.is_set():
try:
await self.run_all()
except Exception:
# Never let the loop die; the periodic check is best-effort
pass
try:
await asyncio.wait_for(self._stop.wait(), timeout=self.interval_sec)
return
except asyncio.TimeoutError:
continue
def stop(self) -> None:
self._stop.set()
def summary(self) -> dict:
out = {}
for name, st in self.state.items():
last = st.last
out[name] = {
"last_ok_at": st.last_ok_at,
"last": (
{
"ok": last.ok,
"at": last.at,
"latency_ms": last.latency_ms,
"error": last.error,
"note": last.note,
}
if last
else None
),
"auto_restarts_window": len(st.restarts),
}
return out
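The restart gating in `_maybe_restart` combines a sliding window with a cooldown. A minimal sketch of just that decision, assuming the same constants as above; `restart_decision` is a hypothetical pure function written for illustration, and it leaves out the boot grace period and the actual `docker restart` call.

```python
RATE_LIMIT = 3            # max auto-restarts per window
RATE_WINDOW_SEC = 1800.0  # 30-minute sliding window
COOLDOWN_SEC = 120.0      # minimum spacing between restarts


def restart_decision(history: list[float], now: float) -> tuple[str, list[float]]:
    """Return ("restart" | "cooldown" | "rate-limited", updated history)."""
    # Drop restarts that have aged out of the window.
    recent = [t for t in history if now - t < RATE_WINDOW_SEC]
    if recent and now - recent[-1] < COOLDOWN_SEC:
        return "cooldown", recent
    if len(recent) >= RATE_LIMIT:
        return "rate-limited", recent
    return "restart", recent + [now]


# Simulated wedge detections at these monotonic timestamps (seconds).
history: list[float] = []
decisions: list[str] = []
for now in (0.0, 60.0, 200.0, 400.0, 600.0, 2100.0):
    decision, history = restart_decision(history, now)
    decisions.append(decision)
```

In this simulation the second detection is skipped by the cooldown, the fifth is refused by the rate limit, and the sixth is allowed again because the two oldest restarts have aged out of the window.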
+134
@@ -0,0 +1,134 @@
"""On-disk presence + deletion for Hugging Face model caches on the Sparks.
The HF cache layout for a repo `org/name` is:
~/.cache/huggingface/hub/models--org--name/
We use `du -sb` to measure size (bytes) and `rm -rf` to free it. All operations
are gated by the server endpoints, which refuse to delete a currently-loaded
model or one tied to an in-flight swap/download.
"""
from __future__ import annotations
import asyncio
import re
from dataclasses import dataclass
from typing import Optional
from .config import Settings
from .ssh import ssh_run
# HF cache dirnames are `models--<org>--<name>` where <org> and <name> only contain
# Hugging Face's allowed identifier chars: letters, digits, dot, dash, underscore.
# Validate against this whitelist so we can safely embed the dirname into a shell
# command without quoting (we need $HOME outside the quotes to expand).
_SAFE_DIRNAME = re.compile(r"^[A-Za-z0-9._\-]+$")
def repo_to_cache_dirname(repo: str) -> str:
"""Convert 'org/name' to 'models--org--name' (the HF hub cache directory)."""
if "/" not in repo:
raise ValueError(f"repo must be in 'org/name' form: {repo!r}")
dn = "models--" + repo.replace("/", "--")
if not _SAFE_DIRNAME.fullmatch(dn):
raise ValueError(f"unsafe cache dirname (rejected by whitelist): {dn!r}")
return dn
@dataclass
class HostDiskResult:
host: str
on_disk: bool
size_bytes: int = 0
error: Optional[str] = None
@dataclass
class DiskStatus:
repo: str
on_disk: bool # True if present on AT LEAST one host
total_bytes: int # sum across hosts
per_host: list[HostDiskResult]
async def probe_host(host: str, user: str, repo: str, settings: Settings) -> HostDiskResult:
"""Return whether the model's cache dir exists on this host and its size."""
if not host or not user:
return HostDiskResult(host=host or "?", on_disk=False, error="host not configured")
dn = repo_to_cache_dirname(repo) # whitelisted; safe to embed
# $HOME must expand server-side, so we build the path with double quotes
# (which DO allow variable expansion) rather than shlex.quote single quotes.
cmd = (
f'P="$HOME/.cache/huggingface/hub/{dn}"; '
f'if [ -d "$P" ]; then du -sb "$P" 2>/dev/null | cut -f1; '
f'else echo MISSING; fi'
)
rc, out, err = await ssh_run(host, user, cmd, settings, timeout=20.0)
if rc != 0:
return HostDiskResult(host=host, on_disk=False, error=(err or out).strip() or f"rc={rc}")
raw = out.strip()
if raw == "MISSING" or raw == "":
return HostDiskResult(host=host, on_disk=False)
try:
size = int(raw.splitlines()[-1])
except ValueError:
return HostDiskResult(host=host, on_disk=False, error=f"unparsable du output: {raw!r}")
return HostDiskResult(host=host, on_disk=True, size_bytes=size)
async def probe_disk(repo: str, mode: str, settings: Settings) -> DiskStatus:
"""Probe one model across the relevant Sparks based on its mode (solo|cluster)."""
hosts: list[tuple[str, str]] = [(settings.spark1_host, settings.spark1_user)]
if mode == "cluster" and settings.spark2_host:
hosts.append((settings.spark2_host, settings.spark2_user))
results = await asyncio.gather(*(probe_host(h, u, repo, settings) for h, u in hosts))
on_disk = any(r.on_disk for r in results)
total = sum(r.size_bytes for r in results)
return DiskStatus(repo=repo, on_disk=on_disk, total_bytes=total, per_host=list(results))
async def delete_host(host: str, user: str, repo: str, settings: Settings) -> HostDiskResult:
"""Probe + rm -rf on one host. Returns bytes freed (0 if the dir wasn't there)."""
if not host or not user:
return HostDiskResult(host=host or "?", on_disk=False, error="host not configured")
dn = repo_to_cache_dirname(repo) # whitelisted; safe to embed
# Compute size first, then remove. If absent, still return success (idempotent).
# $HOME is in double-quoted context so it expands; the dirname is whitelisted.
cmd = (
f'set -e; '
f'P="$HOME/.cache/huggingface/hub/{dn}"; '
f'if [ -d "$P" ]; then '
f' SIZE=$(du -sb "$P" 2>/dev/null | cut -f1); '
f' rm -rf -- "$P"; '
f' echo "FREED $SIZE"; '
f'else '
f' echo "FREED 0"; '
f'fi'
)
rc, out, err = await ssh_run(host, user, cmd, settings, timeout=120.0)
if rc != 0:
return HostDiskResult(host=host, on_disk=False, error=(err or out).strip() or f"rc={rc}")
# Parse the "FREED N" line
freed = 0
for line in out.splitlines():
parts = line.strip().split()
if len(parts) == 2 and parts[0] == "FREED":
try:
freed = int(parts[1])
except ValueError:
pass
break
return HostDiskResult(host=host, on_disk=False, size_bytes=freed)
async def delete_from_disk(repo: str, mode: str, settings: Settings) -> DiskStatus:
"""rm -rf the model's cache dir on the relevant Sparks. Idempotent."""
hosts: list[tuple[str, str]] = [(settings.spark1_host, settings.spark1_user)]
if mode == "cluster" and settings.spark2_host:
hosts.append((settings.spark2_host, settings.spark2_user))
results = await asyncio.gather(*(delete_host(h, u, repo, settings) for h, u in hosts))
total_freed = sum(r.size_bytes for r in results)
# After deletion, on_disk should be False on all hosts.
return DiskStatus(repo=repo, on_disk=False, total_bytes=total_freed, per_host=list(results))
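The quoting choice in this file can be demonstrated without SSH. The sketch below shows why `shlex.quote` is unsuitable for a `~` path and how the whitelist makes embedding the dirname in a double-quoted `$HOME` path safe. `cache_dirname` and `size_command` are illustrative copies, not the module's functions, and `org/name` is a placeholder repo.

```python
import re
import shlex

SAFE_DIRNAME = re.compile(r"^[A-Za-z0-9._\-]+$")


def cache_dirname(repo: str) -> str:
    """Convert 'org/name' to 'models--org--name', rejecting unsafe chars."""
    if "/" not in repo:
        raise ValueError(f"repo must be in 'org/name' form: {repo!r}")
    dn = "models--" + repo.replace("/", "--")
    if not SAFE_DIRNAME.fullmatch(dn):
        raise ValueError(f"unsafe cache dirname: {dn!r}")
    return dn


def size_command(repo: str) -> str:
    """Build a remote command; $HOME expands inside double quotes."""
    dn = cache_dirname(repo)
    return f'du -sb "$HOME/.cache/huggingface/hub/{dn}"'


# shlex.quote wraps the path in single quotes, so the remote shell would
# look for a literal directory named "~" instead of the home directory.
quoted = shlex.quote("~/.cache/huggingface/hub")

cmd = size_command("org/name")

# Shell metacharacters never reach the command string.
try:
    cache_dirname("org/name; rm -rf /")
    rejected = False
except ValueError:
    rejected = True
```

The trade-off is that safety now rests on the whitelist instead of on quoting, which is why the dirname is validated before it is embedded.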
+243
@@ -12,6 +12,9 @@ from typing import Literal
from .config import Settings
from .connectivity import get_mac, record_report, record_state, summary as connectivity_summary
from .custom_services import add_custom_service, delete_custom_service
from .audio_proxy import build_router as build_audio_router
from .deep_health import DeepHealth
from .disk import delete_from_disk, probe_disk
from .download import DownloadManager
from .hardware import HardwareProbe
from .health import check_magpie, check_parakeet, check_vllm
@@ -19,9 +22,12 @@ from .models import load_catalog
from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager
from .overrides import add_custom, delete_custom, extract_knobs_from_args, load_overrides, set_knobs
from .services import docker_state, run_action, services_from_settings
from .speech_models import SpeechModelsManager
from .ssh import ssh_run
from .whisperx_install import WhisperXInstaller
from .swap import SwapManager
from .updates import UpdateManager, get_update_status
from .validate import validate_launch
from .wol import send_local_broadcast, send_via_peer
@@ -32,12 +38,34 @@ download_manager = DownloadManager(settings)
update_manager = UpdateManager(settings)
hardware_probe = HardwareProbe(settings)
nim_manager = NimManager(settings)
deep_health = DeepHealth(settings)
speech_models = SpeechModelsManager(settings)
whisperx_installer = WhisperXInstaller(settings)
app = FastAPI(title="spark-control", version="0.1.0")
@app.on_event("startup")
async def _start_deep_health() -> None:
# Fire-and-forget; the loop catches its own exceptions.
asyncio.create_task(deep_health.run_periodic())
@app.on_event("shutdown")
async def _stop_deep_health() -> None:
deep_health.stop()
_STATIC_DIR = Path(__file__).resolve().parent / "static"
app.mount("/static", StaticFiles(directory=_STATIC_DIR), name="static")
# OpenAI-compatible audio proxy: /v1/audio/speech, /v1/audio/transcriptions, /v1/models.
# Lets Open WebUI, Home Assistant, and any other OpenAI-shaped client talk to
# Parakeet (STT) and Magpie (TTS) through a single spark-control URL.
# Passing deep_health lets the proxy fire an immediate wedge-detect + auto-restart
# when Parakeet returns 500, instead of waiting up to 5 min for the periodic probe.
app.include_router(build_audio_router(settings, deep_health=deep_health))
@app.get("/", include_in_schema=False)
async def index() -> FileResponse:
@@ -124,6 +152,89 @@ async def del_model(key: str) -> dict:
return {"ok": True, "key": key}
@app.get("/api/models/disk-status")
async def get_models_disk_status() -> dict:
"""Probe each catalog model's HF cache on the appropriate Spark(s) in parallel.
Result is keyed by model key: {on_disk, total_bytes, per_host:[{host,on_disk,size_bytes,error?}]}.
Designed to be called once on dashboard load; takes ~1-3 s depending on Spark count.
"""
if not settings.configured:
return {"configured": False, "models": {}}
keys = list(catalog.models.keys())
statuses = await asyncio.gather(*(
probe_disk(catalog.models[k].repo, catalog.models[k].mode, settings) for k in keys
), return_exceptions=True)
out: dict[str, dict] = {}
for k, s in zip(keys, statuses):
if isinstance(s, Exception):
out[k] = {"on_disk": False, "total_bytes": 0, "per_host": [], "error": str(s)}
continue
out[k] = {
"on_disk": s.on_disk,
"total_bytes": s.total_bytes,
"per_host": [
{"host": r.host, "on_disk": r.on_disk, "size_bytes": r.size_bytes, **({"error": r.error} if r.error else {})}
for r in s.per_host
],
}
return {"configured": True, "models": out}
@app.delete("/api/models/{key}/disk")
async def del_model_disk(key: str) -> dict:
"""Delete a model's weights from the Spark filesystem(s). The catalog entry stays.
Safety rails:
- Refuses if the model is currently loaded on vLLM.
- Refuses if a swap or download is in flight.
- Idempotent: if the cache dir is already gone on a host, that host reports 0 bytes freed.
"""
if key not in catalog.models:
raise HTTPException(404, f"unknown model: {key}")
m = catalog.models[key]
# Refuse if currently loaded
try:
vllm = await check_vllm(settings)
except Exception:
vllm = {}
if vllm.get("ok") and vllm.get("current_model") == m.repo:
raise HTTPException(
409,
f"'{m.display_name}' is the currently loaded model. Switch to a different model first, then try again."
)
# Refuse if a swap is in flight
if swap_manager.current_job_id:
raise HTTPException(409, "a model swap is in progress; wait for it to finish")
# Refuse if a download is in flight for this same repo (a different model's download is fine)
if download_manager.current_job_id:
job = download_manager.get(download_manager.current_job_id)
if job and job.repo == m.repo:
raise HTTPException(409, "this model is currently downloading; cancel or wait for it to finish")
status = await delete_from_disk(m.repo, m.mode, settings)
# Audit log
record_report(
f"disk:{key}",
ok=True,
source="disk-delete",
detail=f"freed {status.total_bytes} bytes across {len(status.per_host)} host(s)",
)
return {
"ok": True,
"key": key,
"repo": m.repo,
"bytes_freed": status.total_bytes,
"per_host": [
{"host": r.host, "size_bytes": r.size_bytes, **({"error": r.error} if r.error else {})}
for r in status.per_host
],
}
@app.get("/api/hardware")
async def get_hardware() -> dict:
"""Per-Spark hardware snapshot — RAM, disk, GPU mem + util, CPU load, uptime."""
@@ -136,6 +247,27 @@ async def get_connectivity() -> dict:
return connectivity_summary()
@app.get("/api/deep-health")
async def get_deep_health() -> dict:
"""Last result + auto-restart counters for each service's synthetic probe."""
return deep_health.summary()
@app.post("/api/deep-health/{service}/run")
async def run_deep_health(service: str) -> dict:
"""Manually run a single service's deep-health probe right now."""
if service not in deep_health.PROBES:
raise HTTPException(404, f"unknown service: {service}")
result = await deep_health.run_one(service)
return {
"ok": result.ok,
"at": result.at,
"latency_ms": result.latency_ms,
"error": result.error,
"note": result.note,
}
class HealthEventBody(BaseModel):
service: str # e.g. "parakeet", "magpie", "vllm"
ok: bool # true on success, false on failure
@@ -367,6 +499,108 @@ async def service_action(name: str, action: str) -> dict:
return {"name": name, "action": action, **result}
# ---- Speech model patch management ----
@app.get("/api/speech-models")
async def get_speech_models() -> dict:
"""Status of the parakeet-asr container + the spark-control overlay patches
(diarizer.py + main.py). Drift between local shipped patches and what's
inside the container is surfaced so the UI can prompt for reapply."""
return await speech_models.status()
@app.post("/api/speech-models/reapply")
async def post_speech_models_reapply() -> dict:
"""Copy spark-control's shipped diarizer.py + patched main.py into the
parakeet-asr container, verify Python syntax, restart the container, and
wait for both models (Parakeet ASR + Sortformer) to reload. ~60-120 seconds."""
try:
result = await speech_models.reapply_patches()
except RuntimeError as e:
raise HTTPException(409, str(e))
if not result.get("ok"):
# Bubble up which step failed for client-side error rendering.
raise HTTPException(500, {"detail": "patch reapply failed", "result": result})
return result
@app.post("/api/speech-models/restart")
async def post_speech_models_restart() -> dict:
"""`docker restart parakeet-asr` only — no file changes. Useful when the
container's models look wedged but patches are already current."""
try:
result = await speech_models.restart_container()
except RuntimeError as e:
raise HTTPException(409, str(e))
if not result.get("ok"):
raise HTTPException(500, {"detail": "container restart failed", "result": result})
return result
# ---- WhisperX install (Phase 2 of the WhisperX migration) ----
@app.get("/api/whisperx/status")
async def get_whisperx_status() -> dict:
"""Is WhisperX installed + healthy on Spark 2 right now?"""
return await whisperx_installer.status()
@app.post("/api/whisperx/install")
async def post_whisperx_install() -> dict:
"""One-click install: ships the WhisperX build context from inside
spark-control to Spark 2, runs `docker build` + `docker run`, polls
/health until both models are loaded. Streams progress via the matching
GET /api/whisperx/install/{job_id}/stream SSE endpoint."""
try:
job = await whisperx_installer.trigger()
except RuntimeError as e:
raise HTTPException(409, str(e))
return {"job_id": job.id, "started_at": job.started_at}
@app.get("/api/whisperx/install/{job_id}")
async def get_whisperx_install(job_id: str) -> dict:
job = whisperx_installer.get(job_id)
if not job:
raise HTTPException(404, "unknown job")
return {
"id": job.id,
"state": job.state,
"phase": job.phase,
"lines": job.lines,
"started_at": job.started_at,
"finished_at": job.finished_at,
"returncode": job.returncode,
}
@app.get("/api/whisperx/install/{job_id}/stream")
async def stream_whisperx_install(job_id: str) -> StreamingResponse:
job = whisperx_installer.get(job_id)
if not job:
raise HTTPException(404, "unknown job")
async def event_stream():
last_idx = 0
last_phase = ""
last_state = ""
while True:
new_lines = job.lines[last_idx:]
last_idx = len(job.lines)
for line in new_lines:
yield f"data: {json.dumps({'line': line})}\n\n"
if job.phase != last_phase or job.state != last_state:
yield f"event: phase\ndata: {json.dumps({'phase': job.phase, 'state': job.state})}\n\n"
last_phase = job.phase
last_state = job.state
if job.finished_at:
yield f"event: done\ndata: {json.dumps({'state': job.state, 'returncode': job.returncode})}\n\n"
return
await asyncio.sleep(0.6)
return StreamingResponse(event_stream(), media_type="text/event-stream")
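A client consuming this stream sees three frame types: unnamed frames carrying a log line, `phase` frames, and a final `done` frame. A minimal parsing sketch, assuming the framing produced by `event_stream` above; `parse_sse` is a hypothetical helper, and the sample log line, phase, and state values are invented for illustration.

```python
import json


def parse_sse(text: str) -> list[tuple[str, dict]]:
    """Split an SSE body into (event_name, payload) pairs.

    Frames without an `event:` field get the default name "message".
    """
    events: list[tuple[str, dict]] = []
    for frame in text.split("\n\n"):
        name = "message"
        data_lines: list[str] = []
        for line in frame.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:  # skip the empty frame after the final separator
            events.append((name, json.loads("\n".join(data_lines))))
    return events


# Hypothetical stream body in the shape emitted by the endpoint.
sample = (
    'data: {"line": "building"}\n\n'
    'event: phase\ndata: {"phase": "build", "state": "running"}\n\n'
    'event: done\ndata: {"state": "done", "returncode": 0}\n\n'
)
events = parse_sse(sample)
```

A real client would read the response incrementally and stop on the `done` event; this sketch parses a complete body only to show the frame layout.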
@app.get("/api/endpoints")
async def get_endpoints() -> dict:
"""Service-discovery summary. Stable shape; other apps on the LAN can poll this
@@ -434,6 +668,15 @@ class SwapRequest(BaseModel):
dry_run: bool = False
@app.post("/api/swap/{key}/validate")
async def validate_swap(key: str) -> dict:
"""Pre-flight check: run vLLM's argparse layer against the proposed launch
command WITHOUT starting an engine. Cheap (~5 s) and doesn't disturb the
currently-loaded model.
"""
return await validate_launch(key, catalog, settings)
@app.post("/api/swap")
async def post_swap(req: SwapRequest) -> dict:
if not settings.configured and not req.dry_run:
+8
@@ -65,6 +65,14 @@ def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
container=s.magpie_container,
port=s.magpie_port,
),
"whisperx": ServiceDef(
name="whisperx",
kind="stt+diarize",
host=s.whisperx_host,
user=s.whisperx_user,
container=s.whisperx_container,
port=s.whisperx_port,
),
}
for entry in load_custom_services():
key = entry.get("key")
+319
@@ -0,0 +1,319 @@
"""Speech-model patch management for the parakeet-asr container on Spark 2.
The parakeet-asr container ships with a stock FastAPI wrapper that only supports
ASR (Parakeet TDT). Spark Control augments it with two overlay files —
`diarizer.py` and a patched `main.py` — that add Sortformer-based diarization
and the `/v1/audio/diarize` endpoint.
These overlays survive `docker restart` (writable layer) but NOT `docker rm`
(volume rebuild). If the parakeet container is ever recreated, the overlays
need to be re-applied. This module handles that:
- GET /api/speech-models → current state (loaded models, patch
checksums, drift detection)
- POST /api/speech-models/reapply → copy overlays from spark-control's
shipped /app/parakeet_patches into
the parakeet container + restart
- POST /api/speech-models/restart → just `docker restart parakeet-asr`,
no overlay changes
"""
from __future__ import annotations
import asyncio
import hashlib
import json
import shlex
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
import httpx
from .config import Settings
from .connectivity import record_report
from .ssh import ssh_run
# /app/parakeet_patches inside the spark-control container image (set up by
# the Dockerfile COPY directive). Each file under here is the canonical
# version we'd push into the parakeet container.
PATCHES_DIR = Path(__file__).resolve().parent.parent / "parakeet_patches"
# Files we manage. Mapped local-source-path -> destination-path-in-container.
MANAGED_FILES = {
"diarizer.py": "/opt/parakeet/app/diarizer.py",
"main.py": "/opt/parakeet/app/main.py",
}
def _sha256_short(text: bytes) -> str:
return hashlib.sha256(text).hexdigest()[:12]
def _local_patches() -> dict[str, dict]:
"""Read the canonical patch files shipped inside spark-control.
Returns: {local_name: {"path": str, "sha": str, "size": int, "missing": bool}}
"""
out: dict[str, dict] = {}
for local_name in MANAGED_FILES:
p = PATCHES_DIR / local_name
if not p.exists():
out[local_name] = {"path": str(p), "missing": True}
continue
body = p.read_bytes()
out[local_name] = {
"path": str(p),
"sha": _sha256_short(body),
"size": len(body),
"missing": False,
}
return out
async def _parakeet_health(settings: Settings) -> dict:
"""Pull current model loading state from Parakeet's /health endpoint."""
url = f"http://{settings.parakeet_host}:{settings.parakeet_port}/health"
try:
async with httpx.AsyncClient(timeout=4.0) as client:
r = await client.get(url)
if r.status_code == 200:
return r.json()
return {"reachable": False, "status_code": r.status_code, "error": r.text[:200]}
except Exception as e:
return {"reachable": False, "error": f"{type(e).__name__}: {e}"}
async def _remote_file_sha(settings: Settings, container_path: str) -> Optional[str]:
"""sha256 of a file inside the parakeet container, or None if missing/error."""
if not settings.parakeet_host or not settings.parakeet_user:
return None
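# Quote the path for the inner shell, then quote the whole inner script once
# for the outer shell, so a path containing quotes or spaces cannot break out.
inner = (
f"[ -f {shlex.quote(container_path)} ] && "
f"sha256sum {shlex.quote(container_path)} 2>/dev/null | cut -c1-12 || echo MISSING"
)
cmd = f"docker exec parakeet-asr sh -c {shlex.quote(inner)}"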
rc, out, _ = await ssh_run(settings.parakeet_host, settings.parakeet_user, cmd, settings, timeout=15)
if rc != 0:
return None
s = out.strip()
if s == "MISSING" or not s:
return None
return s
class SpeechModelsManager:
"""Tracks last-reapply state in-memory; persists nothing across spark-control
restarts (the source-of-truth is what's actually inside the parakeet
container, which we read fresh on every status call)."""
def __init__(self, settings: Settings) -> None:
self.settings = settings
self.last_reapply_at: Optional[str] = None
self.last_reapply_result: Optional[dict] = None
self.last_restart_at: Optional[str] = None
self._reapply_lock = asyncio.Lock()
async def status(self) -> dict:
"""Build the full speech-models status payload for the UI.
Compares the SHAs of files we shipped inside spark-control vs what's
actually running inside the parakeet container — surfaces drift if
patches were applied from an older spark-control version, or never
applied at all.
"""
local = _local_patches()
health = await _parakeet_health(self.settings)
# Probe remote SHAs in parallel
async def _probe(local_name: str) -> tuple[str, Optional[str]]:
return local_name, await _remote_file_sha(self.settings, MANAGED_FILES[local_name])
remote_results = await asyncio.gather(*(_probe(n) for n in MANAGED_FILES))
remote = {name: sha for name, sha in remote_results}
files = []
all_in_sync = True
any_missing_remote = False
for local_name in MANAGED_FILES:
local_info = local.get(local_name, {})
local_sha = local_info.get("sha")
remote_sha = remote.get(local_name)
in_sync = bool(local_sha) and (local_sha == remote_sha)
if not in_sync:
all_in_sync = False
if remote_sha is None:
any_missing_remote = True
files.append({
"name": local_name,
"container_path": MANAGED_FILES[local_name],
"local_sha": local_sha,
"remote_sha": remote_sha,
"in_sync": in_sync,
"size_bytes": local_info.get("size"),
})
# Coarse status for the UI to render a single pill
if any_missing_remote:
patch_status = "missing" # overlay files missing in container
elif all_in_sync:
patch_status = "in_sync"
else:
patch_status = "drift" # container copy differs from the shipped patch
return {
"container_health": health,
"patches": {
"status": patch_status,
"files": files,
"last_reapply_at": self.last_reapply_at,
"last_reapply_result": self.last_reapply_result,
"last_restart_at": self.last_restart_at,
},
}
async def reapply_patches(self) -> dict:
"""Copy the patches shipped inside spark-control into the parakeet
container, verify syntax, and restart it. Same logic as apply.sh but
run from inside spark-control's FastAPI process."""
if self._reapply_lock.locked():
raise RuntimeError("a patch reapply is already in progress")
async with self._reapply_lock:
return await self._do_reapply()
async def _do_reapply(self) -> dict:
s = self.settings
if not s.parakeet_host or not s.parakeet_user:
raise RuntimeError("parakeet host/user not configured")
steps: list[dict] = []
# 0. Verify local patches present
local = _local_patches()
for name, info in local.items():
if info.get("missing"):
steps.append({"step": "verify_local", "ok": False, "name": name, "error": "patch file missing inside spark-control image"})
return self._finish_reapply(False, steps)
steps.append({"step": "verify_local", "ok": True, "files": list(local.keys())})
# 1. Backup main.py inside container (idempotent — only if backup doesn't already exist)
backup_cmd = (
"docker exec parakeet-asr sh -c '"
"test -f /opt/parakeet/app/main.py.pre-sortformer || "
"cp /opt/parakeet/app/main.py /opt/parakeet/app/main.py.pre-sortformer"
"'"
)
rc, out, err = await ssh_run(s.parakeet_host, s.parakeet_user, backup_cmd, s, timeout=15)
steps.append({"step": "backup_original", "ok": rc == 0, "stdout": out.strip()[:200], "stderr": err.strip()[:200]})
if rc != 0:
return self._finish_reapply(False, steps)
# 2. Copy each patch file into the container via `docker exec -i ... 'cat > path'`
for local_name, container_path in MANAGED_FILES.items():
local_body = (PATCHES_DIR / local_name).read_bytes()
copy_cmd = f"docker exec -i parakeet-asr sh -c {shlex.quote('cat > ' + container_path)}"
ok, out, err = await self._ssh_pipe_to_remote(
s.parakeet_host, s.parakeet_user, copy_cmd, local_body, s, timeout=30
)
steps.append({"step": "copy_file", "name": local_name, "ok": ok,
"bytes": len(local_body), "stdout": out[:200], "stderr": err[:200]})
if not ok:
return self._finish_reapply(False, steps)
# 3. Verify Python syntax inside the container
syntax_cmd = (
"docker exec parakeet-asr python3 -c "
"'import ast; "
"ast.parse(open(\"/opt/parakeet/app/diarizer.py\").read()); "
"ast.parse(open(\"/opt/parakeet/app/main.py\").read()); "
"print(\"py OK\")'"
)
rc, out, err = await ssh_run(s.parakeet_host, s.parakeet_user, syntax_cmd, s, timeout=30)
ok = rc == 0 and "py OK" in out
steps.append({"step": "verify_syntax", "ok": ok, "stdout": out.strip()[:300], "stderr": err.strip()[:300]})
if not ok:
return self._finish_reapply(False, steps)
# 4. Restart the container
restart_cmd = "docker restart parakeet-asr"
rc, out, err = await ssh_run(s.parakeet_host, s.parakeet_user, restart_cmd, s, timeout=60)
steps.append({"step": "docker_restart", "ok": rc == 0, "stdout": out.strip()[:200], "stderr": err.strip()[:200]})
if rc != 0:
return self._finish_reapply(False, steps)
# 5. Poll /health until both models are loaded again (up to ~120s)
loaded = False
for _ in range(40):
await asyncio.sleep(3)
h = await _parakeet_health(s)
if h.get("asr_loaded") and h.get("diarizer_loaded"):
loaded = True
steps.append({"step": "verify_health", "ok": True, "asr_loaded": True, "diarizer_loaded": True})
break
if not loaded:
steps.append({"step": "verify_health", "ok": False, "error": "models did not load within 120s"})
return self._finish_reapply(False, steps)
return self._finish_reapply(True, steps)
def _finish_reapply(self, success: bool, steps: list[dict]) -> dict:
now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
self.last_reapply_at = now
result = {"ok": success, "at": now, "steps": steps}
self.last_reapply_result = result
record_report(
"parakeet",
ok=success,
source="speech-models-reapply",
detail=f"reapply patches: {'OK' if success else 'FAILED at step ' + str([s for s in steps if not s.get('ok')][:1])}",
)
return result
async def restart_container(self) -> dict:
"""Restart the parakeet-asr container without changing any files."""
s = self.settings
if not s.parakeet_host or not s.parakeet_user:
raise RuntimeError("parakeet host/user not configured")
rc, out, err = await ssh_run(s.parakeet_host, s.parakeet_user,
"docker restart parakeet-asr", s, timeout=60)
ok = rc == 0
now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
self.last_restart_at = now
record_report(
"parakeet",
ok=ok,
source="speech-models-restart",
detail=f"manual restart: {'OK' if ok else 'rc=' + str(rc) + ' ' + err.strip()[:120]}",
)
return {"ok": ok, "at": now, "stdout": out.strip()[:200], "stderr": err.strip()[:200]}
async def _ssh_pipe_to_remote(
self,
host: str,
user: str,
remote_cmd: str,
payload: bytes,
settings: Settings,
timeout: float = 30.0,
) -> tuple[bool, str, str]:
"""Run `ssh user@host <remote_cmd>` while piping `payload` to its stdin.
Equivalent to the shell command `ssh ... '<cmd>' < local_file`.
Returns (success, stdout_str, stderr_str)."""
from .ssh import _base_args
args = _base_args(settings) + [f"{user}@{host}", remote_cmd]
proc = await asyncio.create_subprocess_exec(
*args,
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:
stdout_b, stderr_b = await asyncio.wait_for(
proc.communicate(input=payload), timeout=timeout
)
except asyncio.TimeoutError:
proc.kill()
await proc.wait()
return False, "", f"timeout after {timeout}s"
ok = proc.returncode == 0
return ok, stdout_b.decode(errors="replace"), stderr_b.decode(errors="replace")
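The remote commands in this module pass through two shells: the login shell on the SSH host, then `sh -c` inside the container. A minimal sketch of quoting for both layers follows, assuming a path that might contain spaces or quotes; `docker_exec_write_cmd` is a hypothetical helper, not part of this module.

```python
import shlex


def docker_exec_write_cmd(container: str, container_path: str) -> str:
    """Build a remote command that writes stdin to container_path."""
    # Layer 1: quote the path for the shell that runs inside the container.
    inner = "cat > " + shlex.quote(container_path)
    # Layer 2: quote the whole inner script once for the SSH login shell.
    return f"docker exec -i {shlex.quote(container)} sh -c {shlex.quote(inner)}"
```

Quoting each layer separately avoids hand-written single quotes around already-quoted text, which only works while every path is free of shell metacharacters.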
+554 -3
@@ -17,6 +17,9 @@ const state = {
config: {},
configured: true,
timer_handle: null,
deep_health: {},
disk_status: {}, // keyed by model key: { on_disk, total_bytes, per_host }
disk_status_loaded: false,
};
const el = (sel) => document.querySelector(sel);
@@ -56,12 +59,49 @@ function renderCards() {
? `<div class="desc">${escapeHtml(m.description)}</div>`
: '';
const customPill = m.custom ? `<span class="tag custom-pill">custom</span>` : '';
// Disk-presence pill + trash button. Until /api/models/disk-status comes back,
// we don't know — render a neutral placeholder.
const disk = state.disk_status[key];
let diskPill = '';
if (state.disk_status_loaded) {
if (disk && disk.on_disk) {
const gb = (disk.total_bytes / 1e9);
diskPill = `<span class="tag on-disk" title="Weights present on disk">on disk · ${gb.toFixed(1)} GB</span>`;
} else {
diskPill = `<span class="tag not-on-disk" title="Weights not downloaded">not downloaded</span>`;
}
}
// Trash button — hidden if not on disk; disabled (with tooltip) if currently loaded.
let trashBtn = '';
if (state.disk_status_loaded && disk && disk.on_disk) {
const disabled = isActive || isSwapping;
const tip = isActive
? 'Currently loaded — switch to another model first'
: isSwapping
? 'A swap is in progress'
: 'Delete weights from disk';
trashBtn = `<button class="icon-btn danger" data-disk-del-key="${key}" title="${escapeHtml(tip)}" aria-label="Delete from disk" ${disabled ? 'disabled' : ''}>${trashIcon}</button>`;
}
// Primary card action: "Switch to this" (green) when on disk; "Download" (blue) when not.
// Before disk-status loads we render the swap button as a sensible default.
const isOnDisk = !state.disk_status_loaded || (disk && disk.on_disk);
const dlInFlight = !!(typeof dlState !== 'undefined' && dlState && dlState.job_id);
let primaryBtn = '';
if (isActive) {
primaryBtn = `<button class="btn" disabled>Current</button>`;
} else if (isOnDisk) {
primaryBtn = `<button class="btn primary" data-swap-key="${key}" ${isSwapping ? 'disabled' : ''}>Switch to this</button>`;
} else {
const tip = dlInFlight ? 'A download is already in progress' : 'Download weights to the Spark(s)';
primaryBtn = `<button class="btn info" data-download-key="${key}" title="${escapeHtml(tip)}" ${dlInFlight ? 'disabled' : ''}>Download</button>`;
}
card.innerHTML = `
<div class="name">${escapeHtml(m.display_name)}</div>
<div class="meta">
<span class="tag mode-${m.mode}">${m.mode}</span>
<span class="tag">${m.size_gb} GB</span>
${customPill}
${diskPill}
${(m.capabilities || []).map(c => `<span class="tag cap">${escapeHtml(c)}</span>`).join('')}
</div>
${desc}
@@ -70,20 +110,60 @@ function renderCards() {
</div>
<div class="spacer"></div>
<div class="card-actions">
<button class="btn ${isActive ? '' : 'primary'}" data-swap-key="${key}" ${isActive || isSwapping ? 'disabled' : ''}>
${isActive ? 'Current' : 'Switch to this'}
</button>
${primaryBtn}
<button class="btn test-btn" data-test-key="${key}" title="Pre-flight check the launch command without starting the engine">Test</button>
<button class="btn adv-btn" data-adv-key="${key}" title="Advanced settings">Advanced</button>
${trashBtn}
</div>
<div class="test-result hidden" data-test-result-for="${key}"></div>
`;
root.appendChild(card);
}
for (const btn of root.querySelectorAll('[data-swap-key]')) {
btn.addEventListener('click', () => triggerSwap(btn.dataset.swapKey));
}
for (const btn of root.querySelectorAll('[data-download-key]')) {
btn.addEventListener('click', () => triggerDownloadForKey(btn.dataset.downloadKey));
}
for (const btn of root.querySelectorAll('[data-adv-key]')) {
btn.addEventListener('click', () => openAdvanced(btn.dataset.advKey));
}
for (const btn of root.querySelectorAll('[data-test-key]')) {
btn.addEventListener('click', () => testLaunch(btn.dataset.testKey, btn));
}
for (const btn of root.querySelectorAll('[data-disk-del-key]')) {
btn.addEventListener('click', () => openDiskDeleteDialog(btn.dataset.diskDelKey));
}
}
const trashIcon = '<svg viewBox="0 0 24 24" width="14" height="14" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true"><polyline points="3 6 5 6 21 6"></polyline><path d="M19 6l-1 14a2 2 0 0 1-2 2H8a2 2 0 0 1-2-2L5 6"></path><path d="M10 11v6"></path><path d="M14 11v6"></path><path d="M9 6V4a2 2 0 0 1 2-2h2a2 2 0 0 1 2 2v2"></path></svg>';
async function testLaunch(key, btn) {
const resultEl = document.querySelector(`[data-test-result-for="${key}"]`);
if (!resultEl) return;
const originalText = btn.textContent;
btn.disabled = true;
btn.textContent = 'Testing…';
resultEl.classList.remove('hidden', 'ok', 'fail');
resultEl.innerHTML = '<span class="muted small">Checking launch args against vLLM\'s parser…</span>';
try {
const r = await fetchJSON(`/api/swap/${encodeURIComponent(key)}/validate`, { method: 'POST' });
if (r.ok) {
resultEl.classList.add('ok');
resultEl.innerHTML = `<span class="ok-mark">✓</span> Launch args parse OK. <span class="muted small">(Doesn't guarantee runtime success — only catches argparse-level issues.)</span>`;
} else {
resultEl.classList.add('fail');
const err = escapeHtml(r.error || 'unknown error');
const stage = r.stage ? ` <span class="muted small">(${escapeHtml(r.stage)})</span>` : '';
resultEl.innerHTML = `<span class="fail-mark">✗</span> Would fail: ${err}${stage}`;
}
} catch (e) {
resultEl.classList.add('fail');
resultEl.innerHTML = `<span class="fail-mark">✗</span> Test failed: ${escapeHtml(e.message)}`;
} finally {
btn.disabled = false;
btn.textContent = originalText;
}
}
function renderCurrent(status) {
@@ -380,6 +460,35 @@ async function renderServices() {
const restartsRow = s.restart_count != null && s.restart_count > 1
? `<div class="row"><span class="k">Restarts</span><span class="v">${s.restart_count}</span></div>`
: '';
const dh = state.deep_health?.[name];
let deepRow = '';
if (dh && dh.last) {
const last = dh.last;
const when = (last.at || '').slice(11, 19); // HH:MM:SS
const verdict = last.ok
? `<span class="dh-ok">deep check ok</span>`
: `<span class="dh-fail">deep check FAILED</span>`;
const lat = last.latency_ms != null ? ` <span class="muted">${last.latency_ms} ms</span>` : '';
const restarts = dh.auto_restarts_window > 0
? ` <span class="muted">· ${dh.auto_restarts_window} auto-restart${dh.auto_restarts_window === 1 ? '' : 's'} in 30 min</span>`
: '';
deepRow = `
<div class="row deep-row">
<span class="k">Deep</span>
<span class="v deep-v">${verdict} <span class="muted small">${escapeHtml(when)}</span>${lat}${restarts}</span>
<button class="icon-btn dh-run-btn" data-dh-run="${escapeHtml(name)}" title="Run deep check now">↻</button>
</div>
${last.ok ? '' : `<div class="deep-error muted small">${escapeHtml((last.error || last.note || '').slice(0, 200))}</div>`}
`;
} else if (dh) {
deepRow = `
<div class="row deep-row">
<span class="k">Deep</span>
<span class="v muted-v">no probe yet</span>
<button class="icon-btn dh-run-btn" data-dh-run="${escapeHtml(name)}" title="Run deep check now">↻</button>
</div>
`;
}
card.innerHTML = `
<div class="head">
<span class="name">${escapeHtml(name)}</span>
@@ -390,6 +499,7 @@ async function renderServices() {
${urlRow}
${modelRow}
${restartsRow}
${deepRow}
<div class="service-actions">
<button class="btn" data-svc-action="${name}:start" ${disable('start') ? 'disabled' : ''}>Start</button>
<button class="btn" data-svc-action="${name}:restart" ${disable('restart') ? 'disabled' : ''}>Restart</button>
@@ -401,6 +511,268 @@ async function renderServices() {
for (const btn of grid.querySelectorAll('.btn[data-svc-action]')) {
btn.addEventListener('click', () => onServiceAction(btn.dataset.svcAction));
}
for (const btn of grid.querySelectorAll('[data-dh-run]')) {
btn.addEventListener('click', () => onDeepHealthRun(btn.dataset.dhRun, btn));
}
}
async function onDeepHealthRun(name, btn) {
btn.disabled = true;
const orig = btn.textContent;
btn.textContent = '…';
try {
await fetchJSON(`/api/deep-health/${encodeURIComponent(name)}/run`, { method: 'POST' });
} catch (e) {
console.warn('deep-health run failed', e);
} finally {
try { state.deep_health = await fetchJSON('/api/deep-health'); } catch {}
btn.textContent = orig;
btn.disabled = false;
renderServices();
}
}
// ===================== speech-model patches (v0.11) =====================
async function renderSpeechModels() {
const panel = el('#speech-models-panel');
const card = el('#speech-models-card');
if (!panel || !card) return;
let data;
try {
data = await fetchJSON('/api/speech-models');
} catch (e) {
// If parakeet host isn't even configured, hide the section entirely
panel.classList.add('hidden');
return;
}
if (!data || !data.patches) { panel.classList.add('hidden'); return; }
panel.classList.remove('hidden');
const patches = data.patches || {};
const health = data.container_health || {};
const status = patches.status || 'unknown';
let statusPill;
if (status === 'in_sync') {
statusPill = `<span class="tag ok">patches in sync</span>`;
} else if (status === 'drift') {
statusPill = `<span class="tag warn">spark-control has newer patches</span>`;
} else if (status === 'missing') {
statusPill = `<span class="tag bad">patches missing in container</span>`;
} else {
statusPill = `<span class="tag warn">unknown</span>`;
}
const asrLoaded = !!health.asr_loaded;
const diarLoaded = !!health.diarizer_loaded;
const asrModel = escapeHtml(health.model || '—');
const diarModel = escapeHtml(health.diarizer_model || '—');
const fileRows = (patches.files || []).map((f) => {
const sync = f.in_sync
? '<span class="sm-file-ok">✓ in sync</span>'
: f.remote_sha == null
? '<span class="sm-file-bad">✗ missing</span>'
: '<span class="sm-file-warn">⚠ drift</span>';
const local = f.local_sha ? `<code>${escapeHtml(f.local_sha)}</code>` : '<span class="muted">—</span>';
const remote = f.remote_sha ? `<code>${escapeHtml(f.remote_sha)}</code>` : '<span class="muted">—</span>';
return `
<div class="sm-file-row">
<span class="sm-file-name"><code>${escapeHtml(f.name)}</code></span>
<span class="sm-file-sync">${sync}</span>
<span class="sm-file-sha muted small">local ${local} → remote ${remote}</span>
</div>
`;
}).join('');
const lastReapply = patches.last_reapply_at ? new Date(patches.last_reapply_at).toLocaleString() : 'never (since spark-control boot)';
const lastRestart = patches.last_restart_at ? new Date(patches.last_restart_at).toLocaleString() : 'never (since spark-control boot)';
card.innerHTML = `
<div class="sm-header">
<div class="sm-title">parakeet-asr container</div>
${statusPill}
</div>
<div class="sm-models">
<div class="sm-model-row">
<span class="sm-model-kind">Parakeet ASR</span>
<span class="sm-model-name">${asrModel}</span>
<span class="sm-model-loaded">${asrLoaded ? '<span class="tag ok">loaded</span>' : '<span class="tag bad">not loaded</span>'}</span>
</div>
<div class="sm-model-row">
<span class="sm-model-kind">Sortformer diarizer</span>
<span class="sm-model-name">${diarModel}</span>
<span class="sm-model-loaded">${diarLoaded ? '<span class="tag ok">loaded</span>' : '<span class="tag bad">not loaded</span>'}</span>
</div>
</div>
<div class="sm-files">${fileRows}</div>
<div class="sm-meta muted small">
Last reapply: ${escapeHtml(lastReapply)} · Last manual restart: ${escapeHtml(lastRestart)}
</div>
<div class="sm-actions">
<button class="btn primary" id="sm-reapply">Reapply patches</button>
<button class="btn" id="sm-restart">Restart container</button>
</div>
`;
el('#sm-reapply').addEventListener('click', onSpeechModelsReapply);
el('#sm-restart').addEventListener('click', onSpeechModelsRestart);
}
async function onSpeechModelsReapply() {
if (!confirm('Reapply Sortformer patches to the parakeet-asr container? The container will restart and both ASR + diarizer will be unavailable for ~60-120 seconds.')) return;
const dlg = el('#speech-models-progress-dialog');
const steps = el('#sm-prog-steps');
const closeBtn = el('#sm-prog-close');
steps.innerHTML = '<div class="muted small">Starting…</div>';
closeBtn.disabled = true;
closeBtn.onclick = () => dlg.close();
dlg.showModal();
try {
const r = await fetchJSON('/api/speech-models/reapply', { method: 'POST' });
steps.innerHTML = (r.steps || []).map((s) => {
const mark = s.ok ? '<span class="sm-file-ok">✓</span>' : '<span class="sm-file-bad">✗</span>';
const extra = s.error ? `<div class="muted small">${escapeHtml(s.error)}</div>` : '';
return `<div class="sm-prog-step">${mark} <strong>${escapeHtml(s.step)}</strong>${s.name ? ` (${escapeHtml(s.name)})` : ''}${extra}</div>`;
}).join('') + `<div class="sm-prog-done sm-file-ok">Done — both models reloaded.</div>`;
} catch (e) {
let parsed = null;
try { parsed = JSON.parse(e.message.split(':').slice(2).join(':').trim()); } catch {}
const stepHtml = parsed && parsed.result && parsed.result.steps
? parsed.result.steps.map((s) => {
const mark = s.ok ? '<span class="sm-file-ok">✓</span>' : '<span class="sm-file-bad">✗</span>';
return `<div class="sm-prog-step">${mark} <strong>${escapeHtml(s.step)}</strong>${s.name ? ` (${escapeHtml(s.name)})` : ''}${s.error ? `<div class="muted small">${escapeHtml(s.error)}</div>` : ''}</div>`;
}).join('')
: `<div class="sm-file-bad">${escapeHtml(e.message)}</div>`;
steps.innerHTML = stepHtml + `<div class="sm-prog-done sm-file-bad">Failed.</div>`;
} finally {
closeBtn.disabled = false;
try { await renderSpeechModels(); } catch {}
}
}
async function onSpeechModelsRestart() {
if (!confirm('Restart parakeet-asr container? STT + diarization will be unavailable for ~30 seconds.')) return;
try {
await fetchJSON('/api/speech-models/restart', { method: 'POST' });
} catch (e) {
alert('Restart failed: ' + e.message);
} finally {
try { await renderSpeechModels(); } catch {}
}
}
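The status pill in this section reflects a three-way classification computed on the backend from per-file checksums. A minimal sketch of that rule as a pure function, assuming both inputs map file names to short SHA strings (or `None` when a file is absent); `classify_patches` is a hypothetical name used only for illustration.

```python
from typing import Optional


def classify_patches(
    local: dict[str, Optional[str]],
    remote: dict[str, Optional[str]],
) -> str:
    """Collapse per-file checksums into one coarse status string."""
    any_missing_remote = False
    all_in_sync = True
    for name, local_sha in local.items():
        remote_sha = remote.get(name)
        if remote_sha is None:
            any_missing_remote = True
        # A file counts as in sync only when a local checksum exists
        # and the container reports the same one.
        if not local_sha or local_sha != remote_sha:
            all_in_sync = False
    # "missing" takes precedence over "drift", matching the pill order.
    if any_missing_remote:
        return "missing"
    return "in_sync" if all_in_sync else "drift"
```

Note that "drift" only means the checksums differ; it does not establish which side is newer.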
// ===================== WhisperX install (v0.12) =====================
const wxState = {
job_id: null,
eventsource: null,
timer_handle: null,
started_at: null,
};
async function renderWhisperXBanner() {
const card = el('#whisperx-install-card');
if (!card) return;
let status;
try {
status = await fetchJSON('/api/whisperx/status');
} catch {
card.classList.add('hidden');
return;
}
if (status.installed && status.healthy) {
card.classList.add('hidden');
} else if (status.configured) {
card.classList.remove('hidden');
} else {
card.classList.add('hidden');
}
}
async function onWhisperXInstall() {
if (wxState.job_id) {
// Just re-attach to the running job
showWhisperXDialog();
return;
}
if (!confirm('Install WhisperX on Spark 2? This builds a new Docker image (~10-15 min first time, mostly downloading pyannote + whisper weights). Parakeet/Magpie stay untouched.')) return;
try {
const r = await fetchJSON('/api/whisperx/install', { method: 'POST' });
attachToWhisperXInstall(r.job_id);
} catch (e) {
alert('Failed to start WhisperX install: ' + e.message);
}
}
function showWhisperXDialog() {
el('#whisperx-progress-dialog').showModal();
}
function attachToWhisperXInstall(jobId) {
wxState.job_id = jobId;
el('#wx-prog-title').textContent = 'Installing WhisperX…';
el('#wx-prog-phase').textContent = 'Starting…';
el('#wx-prog-log').textContent = '';
showWhisperXDialog();
// Tick a timer
wxState.started_at = Date.now();
if (wxState.timer_handle) clearInterval(wxState.timer_handle);
wxState.timer_handle = setInterval(() => {
const sec = Math.max(0, Math.floor((Date.now() - wxState.started_at) / 1000));
const m = Math.floor(sec / 60);
el('#wx-prog-elapsed').textContent = `${m}:${(sec % 60).toString().padStart(2, '0')}`;
}, 500);
// Backfill snapshot then connect SSE
fetchJSON(`/api/whisperx/install/${jobId}`).then((snap) => {
el('#wx-prog-phase').textContent = snap.phase || 'Working…';
el('#wx-prog-log').textContent = (snap.lines || []).join('\n');
el('#wx-prog-log').scrollTop = el('#wx-prog-log').scrollHeight;
if (snap.finished_at) {
handleWhisperXDone(snap);
return;
}
const es = new EventSource(`/api/whisperx/install/${jobId}/stream`);
wxState.eventsource = es;
es.onmessage = (ev) => {
try {
const log = el('#wx-prog-log');
log.textContent += JSON.parse(ev.data).line + '\n';
log.scrollTop = log.scrollHeight;
} catch {}
};
es.addEventListener('phase', (ev) => {
try { el('#wx-prog-phase').textContent = JSON.parse(ev.data).phase; } catch {}
});
es.addEventListener('done', (ev) => {
try { handleWhisperXDone(JSON.parse(ev.data)); } catch {}
es.close();
wxState.eventsource = null;
});
es.onerror = () => { es.close(); wxState.eventsource = null; };
}).catch(() => {});
}
function handleWhisperXDone(d) {
if (wxState.timer_handle) { clearInterval(wxState.timer_handle); wxState.timer_handle = null; }
wxState.job_id = null;
const rc = d.returncode;
if (d.state === 'failed' || (rc !== 0 && rc != null)) {
el('#wx-prog-title').textContent = `WhisperX install failed (rc=${rc})`;
el('#wx-prog-phase').textContent = 'Failed — check the build log below';
} else {
el('#wx-prog-title').textContent = 'WhisperX installed';
el('#wx-prog-phase').textContent = 'Ready ✓ — appears in Always-on services below';
// Refresh services + banner state
setTimeout(() => {
renderServices();
renderWhisperXBanner();
}, 1000);
}
}
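The install dialog above listens for three kinds of Server-Sent Events: unnamed messages carrying a `line`, and named `phase` and `done` events. A minimal sketch of how such frames could be produced and read back follows; `sse_frame` and `parse_sse_frame` are hypothetical helpers, and the parser handles only the single-line `data:` case used here.

```python
import json
from typing import Optional


def sse_frame(data: dict, event: Optional[str] = None) -> str:
    """Serialize one SSE frame; the trailing blank line terminates it."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def parse_sse_frame(frame: str) -> tuple[str, dict]:
    """Inverse of sse_frame; frames without an event name are 'message'."""
    event = "message"
    payload = "{}"
    for line in frame.strip().split("\n"):
        field, _, value = line.partition(": ")
        if field == "event":
            event = value
        elif field == "data":
            payload = value
    return event, json.loads(payload)
```

Frames without an `event:` field are the ones a browser `EventSource` delivers to `onmessage`, which is why the log lines need no event name while `phase` and `done` use `addEventListener`.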
async function onServiceAction(key) {
@@ -507,6 +879,64 @@ function renderHealth(status) {
function renderBanner(status) {
el('#setup-banner').classList.toggle('hidden', !!status.configured);
// Dashboard tabs share the same "configured" gate as the rest of the
// body — hidden until SSH is set up, then visible.
const tabs = el('#dashboard-tabs');
if (tabs) tabs.classList.toggle('hidden', !status.configured);
}
// ===================== dashboard tabs (LLM / Audio) =====================
const TABS_STORAGE_KEY = 'sparkcontrol.dashboard.activeTab';
function setupDashboardTabs() {
const buttons = $$('.dashboard-tab');
if (!buttons.length) return;
// Restore the last-selected tab, default to "llm"
let saved;
try { saved = localStorage.getItem(TABS_STORAGE_KEY); } catch {}
const initial = saved === 'audio' || saved === 'llm' ? saved : 'llm';
function selectTab(name) {
buttons.forEach((b) => {
const active = b.dataset.tab === name;
b.classList.toggle('active', active);
b.setAttribute('aria-selected', active ? 'true' : 'false');
});
$$('.tab-content').forEach((c) => {
c.classList.toggle('active', c.id === `tab-${name}`);
});
try { localStorage.setItem(TABS_STORAGE_KEY, name); } catch {}
}
buttons.forEach((b) => {
b.addEventListener('click', () => selectTab(b.dataset.tab));
});
selectTab(initial);
}
// ===================== collapsible endpoint card =====================
const ENDPOINT_COLLAPSED_KEY = 'sparkcontrol.endpoint.collapsed';
function setupEndpointCollapse() {
const panel = el('#endpoint-panel');
const btn = el('#ep-collapse');
if (!panel || !btn) return;
// Default: collapsed (most of the time you don't need to see endpoint details)
let collapsed = true;
try {
const v = localStorage.getItem(ENDPOINT_COLLAPSED_KEY);
if (v === 'false') collapsed = false;
else if (v === 'true') collapsed = true;
} catch {}
panel.classList.toggle('collapsed', collapsed);
btn.addEventListener('click', () => {
const nowCollapsed = !panel.classList.contains('collapsed');
panel.classList.toggle('collapsed', nowCollapsed);
try { localStorage.setItem(ENDPOINT_COLLAPSED_KEY, nowCollapsed ? 'true' : 'false'); } catch {}
});
}
function renderSwapPanel() {
@@ -635,6 +1065,7 @@ async function pollStatus() {
// Refresh services state lazily — every 5s poll triggers this too.
try {
state.services = await fetchJSON('/api/services');
try { state.deep_health = await fetchJSON('/api/deep-health'); } catch {}
renderServices();
} catch {}
if (status.current_swap_job && status.current_swap_job !== state.swap_job_id) {
@@ -655,6 +1086,78 @@ async function loadModels() {
state.models = data.models || {};
}
async function loadDiskStatus() {
// Probes each catalog model's HF cache over SSH; takes a beat. Best-effort.
try {
const r = await fetchJSON('/api/models/disk-status');
if (r && r.models) {
state.disk_status = r.models;
state.disk_status_loaded = true;
renderCards();
}
} catch (e) {
// Silent — pills just won't render. Don't block dashboard.
console.warn('disk-status probe failed:', e.message);
}
}
function fmtBytesShort(n) {
if (!Number.isFinite(n) || n <= 0) return '0 B';
if (n >= 1e9) return `${(n / 1e9).toFixed(1)} GB`;
if (n >= 1e6) return `${(n / 1e6).toFixed(1)} MB`;
if (n >= 1e3) return `${(n / 1e3).toFixed(1)} KB`;
return `${n} B`;
}
function openDiskDeleteDialog(key) {
const m = state.models[key];
const disk = state.disk_status[key];
if (!m || !disk || !disk.on_disk) return;
const dlg = el('#disk-delete-dialog');
el('#dd-summary').innerHTML = `Free <strong>${fmtBytesShort(disk.total_bytes)}</strong> by removing <strong>${escapeHtml(m.display_name)}</strong> (<code>${escapeHtml(m.repo)}</code>) from disk.`;
const hostsEl = el('#dd-hosts');
hostsEl.innerHTML = '';
for (const h of (disk.per_host || [])) {
if (!h.on_disk) continue;
const li = document.createElement('li');
li.innerHTML = `<code>${escapeHtml(h.host)}</code> — ${fmtBytesShort(h.size_bytes)}`;
hostsEl.appendChild(li);
}
const errEl = el('#dd-error');
errEl.classList.add('hidden');
errEl.textContent = '';
const confirm = el('#dd-confirm');
const cancel = el('#dd-cancel');
const onCancel = () => dlg.close();
const onConfirm = async () => {
confirm.disabled = true;
cancel.disabled = true;
confirm.textContent = 'Deleting…';
try {
const r = await fetchJSON(`/api/models/${encodeURIComponent(key)}/disk`, { method: 'DELETE' });
dlg.close();
// Optimistically clear local disk state for this key, then refresh.
delete state.disk_status[key];
renderCards();
// Eagerly re-probe so size is accurate (and shows "not downloaded" pill).
loadDiskStatus();
const freed = r && typeof r.bytes_freed === 'number' ? fmtBytesShort(r.bytes_freed) : '';
console.log(`Deleted ${m.display_name} from disk${freed ? ` — freed ${freed}` : ''}.`);
} catch (e) {
errEl.textContent = e.message || 'Delete failed';
errEl.classList.remove('hidden');
} finally {
confirm.disabled = false;
cancel.disabled = false;
confirm.textContent = 'Delete from disk';
}
};
cancel.onclick = onCancel;
confirm.onclick = onConfirm;
dlg.showModal();
}
async function triggerSwap(modelKey) {
if (state.swap_job_id) return;
try {
@@ -669,6 +1172,38 @@ async function triggerSwap(modelKey) {
}
}
async function triggerDownloadForKey(modelKey) {
const m = state.models[modelKey];
if (!m) return;
if (dlState.job_id) {
alert('A download is already in progress; wait for it to finish.');
return;
}
// Pick the download target from the model's mode:
// solo -> spark1 only
// cluster -> both Sparks (fetch on Spark 1, rsync to Spark 2 in parallel)
const dlMode = m.mode === 'cluster' ? 'cluster' : 'spark1';
const sizeNote = m.size_gb ? ` (~${m.size_gb} GB)` : '';
const target = m.mode === 'cluster' ? 'both Sparks' : 'Spark 1';
if (!confirm(`Download "${m.display_name}"${sizeNote} to ${target}? Large models can take a while; you can watch progress in the download panel.`)) {
return;
}
dlState.last_repo = m.repo;
dlState.last_mode = dlMode;
try {
const r = await fetchJSON('/api/download', {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({ repo: m.repo, mode: dlMode }),
});
// Open the download panel + attach to progress stream
openDownloadForm();
attachToDownload(r.job_id);
} catch (e) {
alert('Failed to start download: ' + e.message);
}
}
async function attachToSwap(jobId, needsBackfill) {
if (state.swap_eventsource) {
state.swap_eventsource.close();
@@ -1434,14 +1969,30 @@ async function init() {
a.classList.remove('hidden');
}
} catch {}
setupDashboardTabs();
setupEndpointCollapse();
// WhisperX install button
const wxBtn = el('#wx-install');
if (wxBtn) wxBtn.addEventListener('click', onWhisperXInstall);
const wxCloseBtn = el('#wx-prog-close');
if (wxCloseBtn) wxCloseBtn.addEventListener('click', () => el('#whisperx-progress-dialog').close());
await loadModels();
await pollStatus();
await renderServices();
pollHardware();
pollUpdates();
// Disk-status probe runs after first paint — slow over SSH and not blocking.
loadDiskStatus();
// Speech-model patches panel — slow over SSH, runs after first paint.
renderSpeechModels();
// WhisperX install banner — show only when not yet installed/healthy.
renderWhisperXBanner();
setInterval(pollStatus, 5000);
setInterval(pollHardware, 8000); // every 8s
setInterval(pollUpdates, 300000); // every 5 min
setInterval(loadDiskStatus, 60000); // every 60s — disk state changes rarely
setInterval(renderSpeechModels, 120000); // every 2 min — patches change rarely
setInterval(renderWhisperXBanner, 60000); // every 60s — auto-hides banner after install
}
init();
+99 -1
@@ -44,8 +44,14 @@
</dialog>
</section>
<section id="endpoint-panel" class="endpoint-panel hidden">
<section id="endpoint-panel" class="endpoint-panel hidden collapsed">
<div class="ep-header">
<div class="ep-title muted small">OpenAI-compatible endpoint</div>
<button type="button" class="icon-btn ep-collapse-btn" id="ep-collapse" title="Show / hide endpoint details" aria-label="Toggle endpoint details">
<svg viewBox="0 0 24 24" width="14" height="14" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true"><polyline points="6 9 12 15 18 9"></polyline></svg>
</button>
</div>
<div class="ep-body">
<div class="ep-row">
<span class="ep-label">Base URL</span>
<code class="ep-value copyable" id="ep-url" data-copy-self title="Click to copy"></code>
@@ -67,6 +73,7 @@
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2"/><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"/></svg>
</button>
</details>
</div><!-- /.ep-body -->
</section>
<section id="swap-panel" class="swap-panel hidden">
@@ -89,6 +96,53 @@
</details>
</section>
<nav id="dashboard-tabs" class="dashboard-tabs hidden" role="tablist">
<button type="button" class="dashboard-tab" id="tab-llm-trigger" data-tab="llm" role="tab" aria-selected="true">LLM</button>
<button type="button" class="dashboard-tab" id="tab-audio-trigger" data-tab="audio" role="tab" aria-selected="false">Audio / Speech</button>
</nav>
<div class="tab-content" id="tab-audio" role="tabpanel" aria-labelledby="tab-audio-trigger">
<section id="whisperx-install-card" class="whisperx-install hidden">
<div class="wx-install-body">
<div class="wx-install-title">
<strong>Add WhisperX</strong>
<span class="tag ok">recommended</span>
</div>
<p class="muted small">
WhisperX is a single-container speech pipeline (faster-whisper for transcription + pyannote 3.1 for diarization)
designed to handle long audio cleanly. Replaces the Parakeet + Sortformer combo we patched together,
which crashed on a 90-min meeting. Pulled and built directly on Spark 2 (~10–15 min first time;
you only do this once).
</p>
<p class="muted small">
Requires a Hugging Face token at <code>~/.cache/huggingface/token</code> on Spark 2 (already set up).
</p>
<div class="wx-install-actions">
<button id="wx-install" class="btn primary">Install WhisperX</button>
</div>
</div>
</section>
<dialog id="whisperx-progress-dialog" class="modal">
<form method="dialog" class="modal-form">
<h3 id="wx-prog-title">Installing WhisperX…</h3>
<div class="phase-row">
<span class="spinner"></span>
<div class="phase" id="wx-prog-phase">Starting…</div>
<span class="spacer"></span>
<span class="timer" id="wx-prog-elapsed">0:00</span>
</div>
<details open>
<summary class="muted small">Build log</summary>
<pre id="wx-prog-log" class="log"></pre>
</details>
<div class="modal-actions">
<button type="button" id="wx-prog-close" class="btn">Close</button>
</div>
</form>
</dialog>
<section id="services-panel" class="services hidden">
<div class="section-header">
<h2 class="section-title">Always-on services</h2>
@@ -152,6 +206,34 @@
</dialog>
</section>
<section id="speech-models-panel" class="speech-models hidden">
<div class="section-header">
<h2 class="section-title">Speech model patches</h2>
</div>
<p class="muted small sm-blurb">
Spark Control adds Sortformer speaker diarization to the third-party Parakeet ASR
container via two Python overlays (<code>diarizer.py</code> + a patched <code>main.py</code>).
Overlays survive container restart but not a fresh redeploy — if the parakeet container is
ever rebuilt, click <strong>Reapply patches</strong> below to restore them.
</p>
<div id="speech-models-card" class="speech-models-card"></div>
<dialog id="speech-models-progress-dialog" class="modal">
<form method="dialog" class="modal-form">
<h3>Reapplying speech-model patches…</h3>
<p class="muted small">Copying overlays into the parakeet container, verifying syntax, restarting, waiting for both models to load. Takes ~60–120 s.</p>
<div id="sm-prog-steps" class="sm-prog-steps"></div>
<div class="modal-actions">
<button type="button" id="sm-prog-close" class="btn" disabled>Close</button>
</div>
</form>
</dialog>
</section>
</div><!-- /#tab-audio -->
<div class="tab-content" id="tab-llm" role="tabpanel" aria-labelledby="tab-llm-trigger">
<section id="models-section">
<div class="section-header">
<h2 class="section-title">LLM swap</h2>
@@ -188,6 +270,20 @@
</form>
</dialog>
<dialog id="disk-delete-dialog" class="modal">
<form method="dialog" class="modal-form">
<h3>Delete model weights from disk?</h3>
<p id="dd-summary" class="muted small"></p>
<ul class="muted small dd-hosts" id="dd-hosts"></ul>
<p class="muted small">This is reversible — you can re-download from the catalog at any time. The catalog entry stays intact.</p>
<p id="dd-error" class="muted small dd-error hidden"></p>
<div class="modal-actions">
<button type="button" id="dd-cancel" class="btn">Cancel</button>
<button type="button" id="dd-confirm" class="btn danger">Delete from disk</button>
</div>
</form>
</dialog>
<dialog id="advanced-dialog" class="modal">
<form method="dialog" class="modal-form" id="advanced-form">
<h3 id="adv-title">Advanced settings</h3>
@@ -290,6 +386,8 @@
</div>
</section>
</div><!-- /#tab-llm -->
<footer class="footer">
<div class="health">
<span class="health-item" id="h-vllm"><span class="dot"></span> vLLM</span>
+198 -4
@@ -622,6 +622,19 @@ main {
.service-card .row .v.copyable.copied { outline: 1px solid var(--accent); background: rgba(74, 222, 128, 0.05); }
.service-card .row .icon-btn { padding: 3px 6px; }
.service-card .row .icon-btn svg { width: 12px; height: 12px; }
.service-card .deep-row .deep-v { display: flex; align-items: center; gap: 6px; font-family: inherit; flex-wrap: wrap; }
.service-card .dh-ok { color: var(--accent); }
.service-card .dh-fail { color: var(--error); font-weight: 500; }
.service-card .dh-run-btn { font-family: inherit; }
.service-card .deep-error {
padding: 4px 8px;
background: rgba(239, 68, 68, 0.06);
border-left: 2px solid var(--error);
border-radius: 4px;
font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
font-size: 11px;
word-break: break-word;
}
.service-actions {
display: flex;
@@ -674,21 +687,27 @@ main {
border: 1px solid var(--border);
padding: 2px 8px;
border-radius: 999px;
font-size: 11px;
font-size: 12px;
}
.tag.mode-cluster { color: var(--info); border-color: rgba(96, 165, 250, 0.4); }
.tag.mode-solo { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.tag.cap { color: var(--muted); }
/* Semantic status pills — reuse .tag sizing so every pill on the page
renders at the same 12px / 2px×8px footprint. */
.tag.ok { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.tag.warn { color: var(--warn); border-color: rgba(245, 158, 11, 0.4); }
.tag.bad { color: var(--error); border-color: rgba(239, 68, 68, 0.4); }
.btn {
appearance: none;
border: 1px solid var(--border);
background: var(--surface-2);
color: var(--text);
padding: 8px 14px;
padding: 6px 12px;
border-radius: 8px;
cursor: pointer;
font: inherit;
font-size: 12px;
font-weight: 500;
transition: background 0.15s, border-color 0.15s, opacity 0.15s;
}
@@ -698,11 +717,37 @@ main {
.btn:disabled { opacity: 0.45; cursor: not-allowed; }
.btn.danger { color: var(--error); border-color: rgba(239, 68, 68, 0.3); }
.btn.danger:hover:not(:disabled) { background: rgba(239, 68, 68, 0.08); border-color: var(--error); }
.btn.info { background: var(--info); color: #0a1e3d; border-color: var(--info); }
.btn.info:hover:not(:disabled) { background: #82baff; border-color: #82baff; }
.card.active .btn { background: rgba(74, 222, 128, 0.12); color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.card-actions { display: flex; gap: 6px; }
.card-actions .btn.primary { flex: 1; }
.card .adv-btn { padding: 8px 12px; font-size: 12px; }
.card-actions .btn.primary,
.card-actions .btn.info { flex: 1; }
.card .adv-btn,
.card .test-btn { padding: 8px 12px; font-size: 12px; }
.card .custom-pill { color: var(--info); border-color: rgba(96, 165, 250, 0.4); }
.tag.on-disk { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.tag.not-on-disk { color: var(--muted); border-color: var(--border); opacity: 0.7; }
.card-actions .icon-btn.danger { color: var(--error); border-color: rgba(239, 68, 68, 0.3); margin-left: auto; }
.card-actions .icon-btn.danger:hover:not(:disabled) { background: rgba(239, 68, 68, 0.08); border-color: var(--error); color: var(--error); }
.card-actions .icon-btn.danger:disabled { opacity: 0.35; cursor: not-allowed; }
.dd-hosts { padding-left: 18px; margin: 4px 0 8px; }
.dd-hosts code { background: var(--surface-2); padding: 1px 5px; border-radius: 4px; }
.dd-error { color: var(--error); }
.test-result {
font-size: 12px;
line-height: 1.45;
padding: 8px 10px;
border-radius: 5px;
margin-top: 4px;
border: 1px solid var(--border);
background: var(--surface-2);
}
.test-result.ok { border-color: rgba(74, 222, 128, 0.4); background: rgba(74, 222, 128, 0.04); }
.test-result.fail { border-color: rgba(239, 68, 68, 0.45); background: rgba(239, 68, 68, 0.06); word-break: break-word; }
.test-result .ok-mark { color: var(--accent); font-weight: 600; }
.test-result .fail-mark { color: var(--error); font-weight: 600; }
.footer {
margin-top: 28px;
@@ -725,3 +770,152 @@ main {
main { padding: 16px 14px 80px; }
.cards { grid-template-columns: 1fr; }
}
/* ===== Speech model patches (v0.11) ===== */
.speech-models { margin-top: 28px; }
.sm-blurb { max-width: 880px; margin-bottom: 14px; }
.sm-blurb code {
background: var(--surface-2);
padding: 1px 6px;
border-radius: 4px;
font-size: 12px;
}
.speech-models-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 10px;
padding: 16px;
display: flex;
flex-direction: column;
gap: 14px;
}
.sm-header {
display: flex;
align-items: center;
gap: 10px;
}
.sm-title {
font-weight: 600;
color: var(--text);
}
/* .sm-pill removed in v0.11.0:1 — speech-models pills now reuse the shared
.tag styling (+ .tag.ok / .tag.warn / .tag.bad color modifiers) so every
pill on the page renders identically. */
.sm-models { display: flex; flex-direction: column; gap: 6px; }
.sm-model-row {
display: grid;
grid-template-columns: 160px 1fr auto;
align-items: center;
gap: 12px;
padding: 6px 0;
border-top: 1px solid var(--border);
}
.sm-model-row:first-child { border-top: none; }
.sm-model-kind { color: var(--muted); font-size: 13px; }
.sm-model-name { font-family: ui-monospace, monospace; font-size: 12px; word-break: break-all; }
.sm-files { display: flex; flex-direction: column; gap: 4px; }
.sm-file-row {
display: grid;
grid-template-columns: 160px 100px 1fr;
gap: 12px;
font-size: 12px;
padding: 4px 0;
}
.sm-file-name code {
background: var(--surface-2);
padding: 1px 6px;
border-radius: 4px;
}
.sm-file-ok { color: var(--accent); }
.sm-file-warn { color: var(--warn); }
.sm-file-bad { color: var(--error); }
.sm-file-sha code {
background: var(--surface-2);
padding: 1px 4px;
border-radius: 3px;
font-size: 11px;
}
.sm-meta { margin-top: 4px; }
.sm-actions { display: flex; gap: 10px; }
.sm-prog-steps {
display: flex;
flex-direction: column;
gap: 6px;
margin: 12px 0;
font-size: 13px;
}
.sm-prog-step {
padding: 6px 10px;
background: var(--surface-2);
border-radius: 6px;
}
.sm-prog-done {
font-weight: 600;
margin-top: 8px;
}
/* ===== Collapsible endpoint card (v0.11.0:1) ===== */
.endpoint-panel .ep-header {
display: flex;
align-items: center;
gap: 10px;
}
.endpoint-panel .ep-title { flex: 1; margin: 0; }
.endpoint-panel .ep-collapse-btn {
flex-shrink: 0;
transition: transform 0.2s;
}
.endpoint-panel.collapsed .ep-body { display: none; }
.endpoint-panel.collapsed .ep-collapse-btn svg { transform: rotate(-90deg); }
.endpoint-panel:not(.collapsed) .ep-header { margin-bottom: 10px; }
/* ===== Dashboard tabs (LLM / Audio) (v0.11.0:1) ===== */
.dashboard-tabs {
display: flex;
gap: 4px;
margin-top: 8px;
margin-bottom: 16px;
border-bottom: 1px solid var(--border);
padding: 0 2px;
}
.dashboard-tab {
appearance: none;
background: transparent;
border: 1px solid transparent;
border-bottom: none;
color: var(--muted);
padding: 8px 16px;
border-radius: 6px 6px 0 0;
cursor: pointer;
font: inherit;
font-size: 14px;
font-weight: 500;
margin-bottom: -1px;
transition: color 0.15s, background 0.15s, border-color 0.15s;
}
.dashboard-tab:hover { color: var(--text); }
.dashboard-tab.active {
color: var(--text);
background: var(--surface);
border-color: var(--border);
border-bottom: 1px solid var(--surface);
}
.tab-content { display: none; }
.tab-content.active { display: block; }
/* ===== WhisperX install banner (v0.12) ===== */
.whisperx-install {
background: var(--surface);
border: 1px solid var(--info);
border-radius: var(--radius);
padding: 16px 18px;
margin-bottom: 20px;
}
.wx-install-body { display: flex; flex-direction: column; gap: 10px; }
.wx-install-title { display: flex; align-items: center; gap: 10px; }
.wx-install-title strong { font-size: 15px; color: var(--text); }
.wx-install-actions { display: flex; gap: 10px; margin-top: 4px; }
+137
@@ -0,0 +1,137 @@
"""Pre-flight validation of a proposed vLLM launch command.
Runs vLLM's own argparse layer (EngineArgs) inside the vllm_node container WITHOUT
starting the engine. Catches:
* unknown flag names (typos)
* bad types / values that argparse rejects
* deprecated flags removed in the installed vLLM version
Does NOT catch (these surface only during real engine init):
* model-architecture-specific constraints (e.g. Qwen3.6 Mamba block_size)
* OOM at weight-loading time
* Triton / CUDA-kernel compatibility errors
A pre-flight check that returns "ok" is therefore NOT a guarantee — but a
"failed" verdict is a definitive 'don't bother with the real swap'.
"""
from __future__ import annotations
import json
import shlex
from typing import Any
from .config import Settings
from .models import Catalog, build_launch_command
from .ssh import ssh_run
# Validates the proposed args against the same combined parser vLLM uses for
# `vllm serve` (engine args + server args + frontend args). Returns one JSON
# line on stdout: {"ok": true, ...} or {"ok": false, ...}.
_VALIDATOR_SCRIPT = r"""
import argparse, json, sys
# Mirror what `vllm serve` does internally: FlexibleArgumentParser (which is
# more lenient about dashes vs underscores) wrapped with make_arg_parser
# (which adds engine + server + frontend args).
parser = None
try:
# Newer vLLM path
from vllm.utils.argparse_utils import FlexibleArgumentParser
except Exception:
try:
# Older fallback
from vllm.engine.arg_utils import FlexibleArgumentParser
except Exception:
FlexibleArgumentParser = argparse.ArgumentParser # type: ignore
try:
from vllm.entrypoints.openai.cli_args import make_arg_parser
parser = make_arg_parser(FlexibleArgumentParser(add_help=False))
except Exception:
pass
if parser is None:
try:
from vllm.engine.arg_utils import EngineArgs
parser = FlexibleArgumentParser(add_help=False)
EngineArgs.add_cli_args(parser)
except Exception as e:
print(json.dumps({"ok": False, "stage": "import", "error": f"{type(e).__name__}: {e}"}))
sys.exit(0)
class _ArgError(Exception):
pass
def _err(message):
raise _ArgError(message)
parser.error = _err # capture argparse errors instead of sys.exit(2)
try:
raw = sys.stdin.read()
arglist = json.loads(raw)
ns = parser.parse_args(arglist)
print(json.dumps({"ok": True, "model": getattr(ns, "model", None)}))
except _ArgError as e:
print(json.dumps({"ok": False, "stage": "parse", "error": str(e)}))
except SystemExit as e:
print(json.dumps({"ok": False, "stage": "parse", "error": f"argparse exit {e.code}"}))
except Exception as e:
print(json.dumps({"ok": False, "stage": "parse", "error": f"{type(e).__name__}: {e}"}))
"""
def _vllm_arg_list(key: str, model_def, catalog: Catalog) -> list[str]:
"""Reconstruct the args list passed to `vllm serve` (without the positional model)."""
cmd = build_launch_command(key, model_def, catalog.defaults)
# build_launch_command yields:
# ./launch-cluster.sh [--solo] -d exec vllm serve <repo> <args...>
# We just want the bits after `vllm serve <repo>`.
tokens = shlex.split(cmd)
if "serve" not in tokens:
return []
i = tokens.index("serve")
after = tokens[i + 1 :] # repo, then args
if not after:
return []
args = after[1:] # drop the repo
# EngineArgs expects --model=REPO rather than positional, so prepend it.
return [f"--model={after[0]}", *args]
async def validate_launch(key: str, catalog: Catalog, settings: Settings) -> dict:
if key not in catalog.models:
return {"ok": False, "stage": "lookup", "error": f"unknown model: {key}"}
if not settings.spark1_host or not settings.spark1_user:
return {"ok": False, "stage": "config", "error": "spark1 not configured"}
model = catalog.models[key]
arg_list = _vllm_arg_list(key, model, catalog)
if not arg_list:
return {"ok": False, "stage": "build", "error": "failed to build args list"}
# Pipe the JSON args list to the validator over stdin, so the args need no
# shell-escaping beyond a single shlex.quote(). printf is used instead of
# echo because some shells' echo interprets backslash escapes in the JSON.
payload = shlex.quote(json.dumps(arg_list))
cmd = (
f"printf '%s' {payload} | docker exec -i vllm_node python3 -c "
+ shlex.quote(_VALIDATOR_SCRIPT)
)
rc, out, err = await ssh_run(settings.spark1_host, settings.spark1_user, cmd, settings, timeout=20)
if rc != 0 and not out.strip():
return {
"ok": False,
"stage": "ssh",
"error": err.strip() or f"rc={rc}",
"cmd_args": arg_list,
"launch_cmd": build_launch_command(key, model, catalog.defaults),
}
last = out.strip().splitlines()[-1] if out.strip() else ""
try:
result: dict[str, Any] = json.loads(last)
except json.JSONDecodeError:
result = {"ok": False, "stage": "decode", "error": "validator did not return JSON", "raw": out[-500:]}
result["cmd_args"] = arg_list
result["launch_cmd"] = build_launch_command(key, model, catalog.defaults)
return result
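The validator script above relies on replacing `parser.error` so that argparse reports problems as data instead of exiting. A minimal local sketch of that idea follows, assuming a plain `argparse.ArgumentParser` as a stand-in for vLLM's parser; the two flags and the `validate` helper are hypothetical and only illustrate the shape of the result.

```python
import argparse


class ArgError(Exception):
    """Raised in place of argparse's default print-usage-and-exit behaviour."""


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in parser; the real script builds vLLM's own parser.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--model", required=True)
    parser.add_argument("--max-model-len", type=int, default=4096)

    def _raise(message: str) -> None:
        raise ArgError(message)

    # Instance attribute shadows ArgumentParser.error, so parse failures
    # raise ArgError instead of calling sys.exit(2).
    parser.error = _raise
    return parser


def validate(arglist: list[str]) -> dict:
    """Return a JSON-serialisable verdict for a proposed argument list."""
    try:
        ns = build_parser().parse_args(arglist)
    except ArgError as e:
        return {"ok": False, "stage": "parse", "error": str(e)}
    return {"ok": True, "model": ns.model}


if __name__ == "__main__":
    print(validate(["--model=demo/repo", "--max-model-len=8192"]))
    print(validate(["--model=demo/repo", "--bogus-flag"]))
```

As with the real pre-flight check, a passing result here only says the flags parse; it says nothing about whether an engine would start with them.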
+267
@@ -0,0 +1,267 @@
"""WhisperX install action — ships the build context from inside spark-control
to Spark 2 over SSH, then runs `docker build` + `docker run` on Spark 2 and
streams progress back as SSE.
Pattern mirrors NimManager (see nim.py) but for a locally-built container
rather than an `nvcr.io` pull. Build context lives at
/app/whisperx_container/ inside the spark-control Docker image (set up by
the Dockerfile COPY directive).
Endpoints:
POST /api/whisperx/install — kick off
GET /api/whisperx/install/{job_id} — snapshot
GET /api/whisperx/install/{job_id}/stream — SSE phase + log lines
GET /api/whisperx/status — installed + healthy?
"""
from __future__ import annotations
import asyncio
import shlex
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
import httpx
from .config import Settings
from .ssh import _base_args, ssh_run, ssh_stream, StreamHandle
# Build context shipped inside the spark-control image (Dockerfile COPYs it).
BUILD_CONTEXT_DIR = Path(__file__).resolve().parent.parent / "whisperx_container"
# Files we ship to Spark 2's build dir. Mapped local-name → remote-relative-path.
BUILD_FILES = {
"Dockerfile": "Dockerfile",
"requirements.txt": "requirements.txt",
"README.md": "README.md",
"app/main.py": "app/main.py",
}
@dataclass
class WhisperXInstallJob:
id: str
started_at: str
state: str = "starting" # starting | sending | building | running | done | failed
phase: str = "Starting…"
lines: list[str] = field(default_factory=list)
returncode: Optional[int] = None
finished_at: Optional[str] = None
def append(self, line: str) -> None:
self.lines.append(line)
if len(self.lines) > 1500:
del self.lines[: len(self.lines) - 1500]
class WhisperXInstaller:
def __init__(self, settings: Settings) -> None:
self.settings = settings
self.lock = asyncio.Lock()
self.jobs: dict[str, WhisperXInstallJob] = {}
self.current_job_id: Optional[str] = None
def get(self, job_id: str) -> WhisperXInstallJob | None:
return self.jobs.get(job_id)
async def status(self) -> dict:
"""Probe whether WhisperX is installed + healthy on its configured host."""
s = self.settings
host_present = bool(s.whisperx_host and s.whisperx_user)
if not host_present:
return {"configured": False, "installed": False, "healthy": False}
# Probe HTTP health
url = f"http://{s.whisperx_host}:{s.whisperx_port}/health"
try:
async with httpx.AsyncClient(timeout=3.0) as client:
r = await client.get(url)
if r.status_code == 200:
body = r.json()
return {
"configured": True,
"installed": True,
"healthy": True,
"model": body.get("model"),
"device": body.get("device"),
"diarizer_loaded": body.get("diarizer_loaded", False),
}
except Exception:
pass
# No HTTP — check if the container exists at all
container_present = await self._container_exists()
return {
"configured": True,
"installed": container_present,
"healthy": False,
"current_job_id": self.current_job_id,
}
async def _container_exists(self) -> bool:
s = self.settings
cmd = f"docker ps -a --filter name=^{s.whisperx_container}$ --format '{{{{.Names}}}}'"
rc, out, _ = await ssh_run(s.whisperx_host, s.whisperx_user, cmd, s, timeout=10)
return rc == 0 and s.whisperx_container in out
async def trigger(self) -> WhisperXInstallJob:
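# The lock is only acquired once the _run task starts, so also check
# current_job_id to reject a second trigger that arrives before then.
if self.lock.locked() or self.current_job_id is not None:
raise RuntimeError("a WhisperX install is already in progress")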
s = self.settings
if not s.whisperx_host or not s.whisperx_user:
raise RuntimeError("whisperx host/user not configured")
for local_name in BUILD_FILES:
if not (BUILD_CONTEXT_DIR / local_name).exists():
raise RuntimeError(f"build context file missing inside spark-control image: {local_name}")
job = WhisperXInstallJob(
id=uuid.uuid4().hex[:8],
started_at=datetime.now(timezone.utc).isoformat(),
)
self.jobs[job.id] = job
self.current_job_id = job.id
asyncio.create_task(self._run(job))
return job
async def _run(self, job: WhisperXInstallJob) -> None:
async with self.lock:
try:
await self._do(job)
if job.state != "failed":
job.state = "done"
job.returncode = 0
job.phase = f"Done — WhisperX is running on port {self.settings.whisperx_port}"
except Exception as e:
job.append(f"[error] {type(e).__name__}: {e}")
job.state = "failed"
if job.returncode is None:
job.returncode = 1
finally:
job.finished_at = datetime.now(timezone.utc).isoformat()
if self.current_job_id == job.id:
self.current_job_id = None
async def _ssh_pipe(self, host: str, user: str, remote_cmd: str,
payload: bytes, timeout: float = 60.0) -> tuple[bool, str, str]:
"""ssh user@host <remote_cmd> with payload piped to stdin."""
args = _base_args(self.settings) + [f"{user}@{host}", remote_cmd]
proc = await asyncio.create_subprocess_exec(
*args,
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:
stdout_b, stderr_b = await asyncio.wait_for(
proc.communicate(input=payload), timeout=timeout
)
except asyncio.TimeoutError:
proc.kill(); await proc.wait()
return False, "", f"timeout after {timeout}s"
return proc.returncode == 0, stdout_b.decode(errors="replace"), stderr_b.decode(errors="replace")
async def _do(self, job: WhisperXInstallJob) -> None:
s = self.settings
host = s.whisperx_host
user = s.whisperx_user
# NOTE: `~` does not expand inside shlex.quote() single-quotes (bit us
# in v0.12.0:0). Use a $HOME-relative path that the REMOTE shell
# expands; all path components are hardcoded so injection is moot.
build_dir_remote = "\"$HOME\"/whisperx-build"
build_dir_display = "~/whisperx-build"
# ── Phase 1: stage build context on Spark 2 ──
job.state = "sending"
job.phase = "Sending build context to Spark 2…"
job.append(f"$ ssh {user}@{host} 'mkdir -p {build_dir_display}/app'")
rc, out, err = await ssh_run(
host, user,
f"mkdir -p {build_dir_remote}/app && "
f"rm -f {build_dir_remote}/Dockerfile {build_dir_remote}/requirements.txt "
f"{build_dir_remote}/README.md {build_dir_remote}/app/main.py",
s, timeout=10,
)
if rc != 0:
job.append(f"[mkdir failed] {err.strip()}")
raise RuntimeError("failed to create build directory")
for local_name, remote_rel in BUILD_FILES.items():
local_path = BUILD_CONTEXT_DIR / local_name
body = local_path.read_bytes()
remote_path_for_shell = f"{build_dir_remote}/{remote_rel}"
# remote_rel is hardcoded ("Dockerfile" / "app/main.py" etc.) — safe
# to embed unquoted inside the double-quoted $HOME path.
cmd = f"cat > {remote_path_for_shell}"
ok, out, err = await self._ssh_pipe(host, user, cmd, body, timeout=30)
if not ok:
job.append(f"[scp {local_name} failed] {err.strip()[:200]}")
raise RuntimeError(f"failed to ship {local_name}")
job.append(f"{build_dir_display}/{remote_rel} ({len(body)} bytes)")
# ── Phase 2: docker build ──
job.state = "building"
job.phase = "Building Docker image on Spark 2 (this is the slow part — 5–15 min if base layers aren't cached)…"
build_cmd = (
f"set -e; "
f"cd {build_dir_remote}; "
f"echo '=== docker build -t {s.whisperx_container}:latest . ==='; "
f"docker build -t {s.whisperx_container}:latest ."
)
job.append(f"$ {build_cmd}")
handle = StreamHandle()
async for line in ssh_stream(host, user, build_cmd, s, handle=handle):
job.append(line)
if "Step " in line and "/" in line:
# docker build progress: "Step 5/10 : RUN pip install ..."
job.phase = f"Building: {line.strip()[:120]}"
elif "Successfully built" in line or "naming to" in line:
job.phase = "Image built — preparing to start container…"
if (handle.returncode or 0) != 0:
job.returncode = handle.returncode
raise RuntimeError(f"docker build failed (rc={handle.returncode})")
# ── Phase 3: docker run ──
job.state = "running"
job.phase = "Starting container…"
run_cmd = (
f"set -e; "
f"echo '=== removing any prior {s.whisperx_container} container ==='; "
f"docker rm -f {s.whisperx_container} 2>/dev/null || true; "
f"echo '=== docker run -d --restart unless-stopped --name {s.whisperx_container} ==='; "
f"HF_TOKEN=$(cat ~/.cache/huggingface/token 2>/dev/null || true); "
f"if [ -z \"$HF_TOKEN\" ]; then echo 'WARN: no HF_TOKEN found at ~/.cache/huggingface/token — diarization will be disabled until you set one'; fi; "
f"docker run -d --restart unless-stopped "
f"--name {s.whisperx_container} "
f"--gpus all --memory=40g "
f"-p {s.whisperx_port}:{s.whisperx_port} "
f"-v whisperx-models:/root/.cache/huggingface "
f"-e HF_TOKEN=\"$HF_TOKEN\" "
f"-e WHISPER_MODEL={s.whisperx_model} "
f"{s.whisperx_container}:latest"
)
job.append(f"$ {run_cmd}")
rc, out, err = await ssh_run(host, user, run_cmd, s, timeout=60)
if rc != 0:
job.append(f"[docker run failed rc={rc}] {(err or out).strip()[:300]}")
raise RuntimeError("docker run failed")
job.append(out.strip())
# ── Phase 4: wait for /health to report ready ──
job.phase = "Container is starting; loading whisper + alignment + pyannote models (~60–120 s on first boot)…"
url = f"http://{s.whisperx_host}:{s.whisperx_port}/health"
ready = False
for i in range(60): # up to ~180 s
await asyncio.sleep(3)
try:
async with httpx.AsyncClient(timeout=4.0) as client:
r = await client.get(url)
if r.status_code == 200:
body = r.json()
if body.get("status") == "ready":
ready = True
job.append(f"[ready] {body}")
break
job.phase = f"Loading models (transcribe={body.get('transcribe_loaded')}, align={body.get('align_loaded')}, diarize={body.get('diarizer_loaded')})…"
except Exception:
pass
if not ready:
raise RuntimeError(f"container started but /health did not report ready within ~180 s — check `docker logs {s.whisperx_container}` on Spark 2")
job.phase = f"Done — WhisperX is healthy and reachable on port {s.whisperx_port}"
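The tilde note in `_do` can be reproduced locally. The sketch below assumes a POSIX `sh` is available and uses a made-up home directory (`/home/demo`) and a hypothetical helper, `expand_via_shell`; it is an illustration of the quoting behaviour, not code from the installer.

```python
import os
import shlex
import subprocess


def expand_via_shell(fragment: str, home: str = "/home/demo") -> str:
    """Let a POSIX shell expand `fragment` and return the resulting word."""
    env = {**os.environ, "HOME": home}  # made-up HOME so the output is predictable
    done = subprocess.run(
        ["sh", "-c", f"printf '%s' {fragment}"],
        env=env, capture_output=True, text=True, check=True,
    )
    return done.stdout


# shlex.quote() wraps the path in single quotes, which suppresses tilde expansion.
quoted_tilde = shlex.quote("~/whisperx-build/Dockerfile")

# A double-quoted $HOME is still expanded by the shell; the rest is a fixed suffix.
home_relative = '"$HOME"/whisperx-build/Dockerfile'

if __name__ == "__main__":
    print(quoted_tilde)
    print(expand_via_shell(quoted_tilde))
    print(expand_via_shell(home_relative))
```

The `$HOME` form is only safe here because every path component after it is hardcoded; a user-supplied component would still need quoting of its own.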
+3
@@ -30,6 +30,7 @@ models:
- -tp=2
- --distributed-executor-backend=ray
- --max-model-len=32768
- --max-num-batched-tokens=16384
gemma4:
display_name: "Gemma 4 31B"
@@ -45,6 +46,7 @@ models:
vllm_args:
- --gpu-memory-utilization=0.8
- --max-model-len=32768
- --max-num-batched-tokens=16384
- --reasoning-parser=gemma4
- --tool-call-parser=gemma4
- --enable-auto-tool-choice
@@ -66,6 +68,7 @@ models:
vllm_args:
- --gpu-memory-utilization=0.85
- --max-model-len=65536
- --max-num-batched-tokens=16384
- --reasoning-parser=qwen3
- --moe_backend=flashinfer_cutlass
- --load-format=fastsafetensors
+54
View File
@@ -0,0 +1,54 @@
#!/bin/bash
# Apply Sortformer diarization patches to a running parakeet-asr container.
#
# Run from the spark-control repo root on the laptop:
# bash image/parakeet_patches/apply.sh <spark2-host> <ssh-user>
#
# What it does:
# 1. Backs up the current /opt/parakeet/app/main.py inside the container
# (writable layer; survives docker restart but NOT docker rm).
# 2. Copies the patched main.py + new diarizer.py into the container.
# 3. Restarts the container so the new code + Sortformer model load.
#
# Reversibility:
# - The backup of main.py is at /opt/parakeet/app/main.py.pre-sortformer
# inside the container. Restore with:
# docker exec parakeet-asr cp /opt/parakeet/app/main.py.pre-sortformer /opt/parakeet/app/main.py
# docker exec parakeet-asr rm -f /opt/parakeet/app/diarizer.py
# docker restart parakeet-asr
# - If the container is ever `docker rm`'d (volume rebuild), re-run this
# script. We will eventually fold this into spark-control as an action.
set -e
HOST="${1:?usage: apply.sh <spark2-host> <ssh-user>}"
USER="${2:?usage: apply.sh <spark2-host> <ssh-user>}"
CONTAINER="${CONTAINER:-parakeet-asr}"
REPO_DIR="$(cd "$(dirname "$0")" && pwd)"
echo "→ Backing up current main.py inside ${CONTAINER}..."
ssh "${USER}@${HOST}" "docker exec ${CONTAINER} sh -c \
'test -f /opt/parakeet/app/main.py.pre-sortformer || cp /opt/parakeet/app/main.py /opt/parakeet/app/main.py.pre-sortformer'"
echo "→ Copying diarizer.py into container..."
ssh "${USER}@${HOST}" "docker exec -i ${CONTAINER} sh -c \
'cat > /opt/parakeet/app/diarizer.py'" < "${REPO_DIR}/diarizer.py"
echo "→ Copying patched main.py into container..."
ssh "${USER}@${HOST}" "docker exec -i ${CONTAINER} sh -c \
'cat > /opt/parakeet/app/main.py'" < "${REPO_DIR}/main.py"
echo "→ Verifying syntax inside container..."
ssh "${USER}@${HOST}" "docker exec ${CONTAINER} python3 -c \
'import ast; ast.parse(open(\"/opt/parakeet/app/diarizer.py\").read()); ast.parse(open(\"/opt/parakeet/app/main.py\").read()); print(\"py OK\")'"
echo "→ Restarting ${CONTAINER}..."
ssh "${USER}@${HOST}" "docker restart ${CONTAINER}"
echo
echo "✔ Patches applied. Sortformer model (~150 MB) will download on first load — wait ~30s before testing."
echo
echo "Test once it's ready:"
echo " curl -sS http://${HOST}:8000/health"
echo " curl -sS -X POST http://${HOST}:8000/v1/audio/diarize -F file=@some-audio.mp3 | head -c 500"
+164
@@ -0,0 +1,164 @@
"""Speaker diarization via NVIDIA NeMo Sortformer.
This module is dropped into the Parakeet container at /opt/parakeet/app/diarizer.py
and loaded alongside the existing ASR model. The Sortformer model identifies who
is speaking when in an audio file, output as a list of {start_s, end_s, speaker}
turns. It does NOT transcribe — pair its output with Parakeet's word-level
timestamps to produce a diarized transcript.
Model: nvidia/diar_sortformer_4spk-v1 (~150 MB, NeMo ecosystem, ungated)
Memory: adds ~200 MB to the running container. Same GPU as Parakeet (Spark 2
unified GB10). No interference with Parakeet inference because they're called
on separate code paths and CUDA handles concurrent kernels.
"""
import io
import os
import logging
import tempfile
import subprocess
from pathlib import Path
from typing import Optional
import torch
import soundfile as sf
import numpy as np
logger = logging.getLogger(__name__)
DIARIZER_MODEL = os.getenv("DIARIZER_MODEL", "nvidia/diar_sortformer_4spk-v1")
TARGET_SAMPLE_RATE = 16000
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def _convert_to_wav_16k_mono(audio_bytes: bytes, original_filename: str) -> str:
"""Same conversion as transcriber.py — keeps a uniform input format
for the diarizer regardless of upload mime type."""
suffix = Path(original_filename).suffix.lower() if original_filename else ".wav"
with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp_in:
tmp_in.write(audio_bytes)
tmp_in_path = tmp_in.name
tmp_out_path = tmp_in_path + ".converted.wav"
try:
cmd = ["ffmpeg", "-y", "-i", tmp_in_path, "-ac", "1", "-ar", "16000",
"-sample_fmt", "s16", "-f", "wav", tmp_out_path]
result = subprocess.run(cmd, capture_output=True, timeout=300)
if result.returncode != 0:
raise RuntimeError(f"ffmpeg failed: {result.stderr.decode()[:500]}")
return tmp_out_path
finally:
try: os.unlink(tmp_in_path)
except OSError: pass
def _parse_sortformer_segments(raw_output) -> list[dict]:
"""Sortformer.diarize() returns List[List[str]] where each inner list is
per-file results: each entry is a space-separated 'start_s end_s speaker_label'
triplet (e.g., '0.00 4.50 speaker_0'). Normalize to our canonical format."""
if not raw_output:
return []
# Single-file invocation → take first inner list
entries = raw_output[0] if isinstance(raw_output, list) and raw_output and isinstance(raw_output[0], list) else raw_output
segments = []
for entry in entries:
if not entry:
continue
if isinstance(entry, str):
parts = entry.strip().split()
if len(parts) >= 3:
try:
start = float(parts[0])
end = float(parts[1])
speaker_raw = parts[2]
# Normalize "speaker_0" / "spk_0" / "0" → "Speaker_0"
if speaker_raw.lower().startswith("speaker_"):
idx = speaker_raw.split("_", 1)[1]
elif speaker_raw.lower().startswith("spk_"):
idx = speaker_raw.split("_", 1)[1]
elif speaker_raw.isdigit():
idx = speaker_raw
else:
idx = speaker_raw
segments.append({
"start_s": start,
"end_s": end,
"speaker": f"Speaker_{idx}",
})
except (ValueError, IndexError) as e:
logger.warning(f"unparsable sortformer entry: {entry!r} ({e})")
continue
return segments
class SortformerDiarizer:
def __init__(self):
self.model = None
self._loaded = False
def load_model(self):
if self._loaded:
return
logger.info(f"Loading diarizer {DIARIZER_MODEL} on {DEVICE}...")
from nemo.collections.asr.models import SortformerEncLabelModel
self.model = SortformerEncLabelModel.from_pretrained(DIARIZER_MODEL)
self.model.eval()
if DEVICE == "cuda":
self.model = self.model.cuda()
self._loaded = True
logger.info(f"Diarizer loaded on {DEVICE}")
def diarize(self, audio_bytes: bytes, filename: str = "audio.wav") -> dict:
"""Run diarization on a single audio file.
Returns:
{
"segments": [{"start_s": float, "end_s": float, "speaker": str}, ...],
"speakers_detected": ["Speaker_0", "Speaker_1", ...],
"duration": float,
"model": str,
"device": str,
}
Speaker labels are zero-indexed strings like "Speaker_0", "Speaker_1",
etc. They are NOT real names — that mapping happens downstream via LLM
analysis or manual UI correction.
"""
if not self._loaded:
self.load_model()
if not audio_bytes:
raise ValueError("empty audio")
wav_path = None
try:
wav_path = _convert_to_wav_16k_mono(audio_bytes, filename)
data, sr = sf.read(wav_path)
duration = len(data) / sr
logger.info(f"Diarizing {duration:.1f}s of audio ({filename})")
with torch.no_grad():
raw = self.model.diarize(
audio=[wav_path],
batch_size=1,
verbose=False,
)
segments = _parse_sortformer_segments(raw)
speakers = sorted({s["speaker"] for s in segments})
logger.info(f"Detected {len(speakers)} speakers across {len(segments)} turns")
if DEVICE == "cuda":
torch.cuda.empty_cache()
return {
"segments": segments,
"speakers_detected": speakers,
"duration": round(duration, 3),
"model": DIARIZER_MODEL,
"device": DEVICE,
}
finally:
if wav_path:
try: os.unlink(wav_path)
except OSError: pass
diarizer = SortformerDiarizer()
+158
@@ -0,0 +1,158 @@
import os
import time
import logging
from contextlib import asynccontextmanager
from typing import Optional
import torch
from fastapi import FastAPI, File, Form, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from app.transcriber import transcriber, MODEL_NAME, DEVICE
from app.diarizer import diarizer, DIARIZER_MODEL
logging.basicConfig(level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
logger = logging.getLogger("parakeet-api")
@asynccontextmanager
async def lifespan(app: FastAPI):
logger.info(f"Loading ASR model {MODEL_NAME} on {DEVICE}")
transcriber.load_model()
logger.info("ASR model ready")
logger.info(f"Loading diarizer {DIARIZER_MODEL} on {DEVICE}")
diarizer.load_model()
logger.info("Diarizer ready")
yield
app = FastAPI(title="Parakeet ASR + Sortformer Diarization API", version="1.2.0", lifespan=lifespan)
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_credentials=True,
allow_methods=["*"], allow_headers=["*"])
@app.get("/")
async def root():
return {"service": "parakeet-asr", "model": MODEL_NAME, "diarizer": DIARIZER_MODEL, "device": DEVICE,
"endpoints": {"transcribe": "/v1/audio/transcriptions",
"diarize": "/v1/audio/diarize",
"models": "/v1/models", "health": "/health"}}
@app.get("/health")
async def health():
return {"status": "ready" if (transcriber._loaded and diarizer._loaded) else "loading",
"asr_loaded": transcriber._loaded,
"diarizer_loaded": diarizer._loaded,
"model": MODEL_NAME,
"diarizer_model": DIARIZER_MODEL,
"device": DEVICE}
@app.get("/v1/models")
async def list_models():
return {"object": "list", "data": [
{"id": "parakeet-tdt-0.6b-v3", "object": "model", "owned_by": "nvidia", "kind": "stt"},
{"id": "whisper-1", "object": "model", "owned_by": "nvidia", "kind": "stt"},
{"id": DIARIZER_MODEL.split("/")[-1], "object": "model", "owned_by": "nvidia", "kind": "diarization"}]}
@app.post("/v1/audio/transcriptions")
async def transcribe(
file: UploadFile = File(...),
model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json"),
temperature: Optional[float] = Form(default=0.0),
prompt: Optional[str] = Form(default=None),
):
if not transcriber._loaded:
raise HTTPException(status_code=503, detail="Model loading")
audio_bytes = await file.read()
if len(audio_bytes) == 0:
raise HTTPException(status_code=400, detail="Empty file")
max_size = int(os.getenv("MAX_UPLOAD_MB", "200")) * 1024 * 1024
if len(audio_bytes) > max_size:
raise HTTPException(status_code=413, detail="File too large")
want_timestamps = response_format == "verbose_json"
start_time = time.time()
try:
result = transcriber.transcribe(
audio_bytes, file.filename, language, timestamps=want_timestamps
)
except Exception as e:
logger.exception("Transcription failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
elapsed = time.time() - start_time
duration = result.get("duration", 0)
rtfx = duration / elapsed if elapsed > 0 else 0
logger.info(f"Done: {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt)")
if response_format == "text":
return JSONResponse(content=result["text"], media_type="text/plain")
if response_format == "verbose_json":
return {
"task": "transcribe",
"language": language or "en",
"duration": duration,
"text": result["text"],
"segments": result.get("segments", []),
"words": result.get("words", []),
}
return {"text": result["text"]}
@app.post("/v1/audio/translations")
async def translate(file: UploadFile = File(...),
model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json")):
return await transcribe(file=file, model=model, language=language,
response_format=response_format)
@app.post("/v1/audio/diarize")
async def diarize(
file: UploadFile = File(...),
):
"""Speaker diarization via Sortformer.
Returns who-spoke-when as a list of turns. Does NOT transcribe — pair this
output with /v1/audio/transcriptions (verbose_json) and merge by timestamp
to produce a diarized transcript.
Response shape:
{
"segments": [{"start_s": 0.00, "end_s": 4.50, "speaker": "Speaker_0"}, ...],
"speakers_detected": ["Speaker_0", "Speaker_1"],
"duration": 90.5,
"model": "nvidia/diar_sortformer_4spk-v1",
"device": "cuda"
}
"""
if not diarizer._loaded:
raise HTTPException(status_code=503, detail="Diarizer loading")
audio_bytes = await file.read()
if len(audio_bytes) == 0:
raise HTTPException(status_code=400, detail="Empty file")
max_size = int(os.getenv("MAX_UPLOAD_MB", "200")) * 1024 * 1024
if len(audio_bytes) > max_size:
raise HTTPException(status_code=413, detail="File too large")
start_time = time.time()
try:
result = diarizer.diarize(audio_bytes, file.filename or "audio.wav")
except Exception as e:
logger.exception("Diarization failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
elapsed = time.time() - start_time
duration = result.get("duration", 0)
rtfx = duration / elapsed if elapsed > 0 else 0
logger.info(f"Diarized {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt), "
f"{len(result['speakers_detected'])} speakers, {len(result['segments'])} turns")
return result
+105
@@ -0,0 +1,105 @@
import os
import time
import logging
from contextlib import asynccontextmanager
from typing import Optional
import torch
from fastapi import FastAPI, File, Form, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from app.transcriber import transcriber, MODEL_NAME, DEVICE
logging.basicConfig(level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s")
logger = logging.getLogger("parakeet-api")
@asynccontextmanager
async def lifespan(app: FastAPI):
logger.info(f"Loading model {MODEL_NAME} on {DEVICE}")
transcriber.load_model()
logger.info("Model ready")
yield
app = FastAPI(title="Parakeet ASR API", version="1.1.0", lifespan=lifespan)
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_credentials=True,
allow_methods=["*"], allow_headers=["*"])
@app.get("/")
async def root():
return {"service": "parakeet-asr", "model": MODEL_NAME, "device": DEVICE,
"endpoints": {"transcribe": "/v1/audio/transcriptions",
"models": "/v1/models", "health": "/health"}}
@app.get("/health")
async def health():
return {"status": "ready" if transcriber._loaded else "loading",
"model": MODEL_NAME, "device": DEVICE}
@app.get("/v1/models")
async def list_models():
return {"object": "list", "data": [
{"id": "parakeet-tdt-0.6b-v3", "object": "model", "owned_by": "nvidia"},
{"id": "whisper-1", "object": "model", "owned_by": "nvidia"}]}
@app.post("/v1/audio/transcriptions")
async def transcribe(
file: UploadFile = File(...),
model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json"),
temperature: Optional[float] = Form(default=0.0),
prompt: Optional[str] = Form(default=None),
):
if not transcriber._loaded:
raise HTTPException(status_code=503, detail="Model loading")
audio_bytes = await file.read()
if len(audio_bytes) == 0:
raise HTTPException(status_code=400, detail="Empty file")
max_size = int(os.getenv("MAX_UPLOAD_MB", "200")) * 1024 * 1024
if len(audio_bytes) > max_size:
raise HTTPException(status_code=413, detail=f"File too large")
want_timestamps = response_format == "verbose_json"
start_time = time.time()
try:
result = transcriber.transcribe(
audio_bytes, file.filename, language, timestamps=want_timestamps
)
except Exception as e:
logger.exception("Transcription failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
elapsed = time.time() - start_time
duration = result.get("duration", 0)
rtfx = duration / elapsed if elapsed > 0 else 0
logger.info(f"Done: {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt)")
if response_format == "text":
return JSONResponse(content=result["text"], media_type="text/plain")
if response_format == "verbose_json":
return {
"task": "transcribe",
"language": language or "en",
"duration": duration,
"text": result["text"],
"segments": result.get("segments", []),
"words": result.get("words", []),
}
return {"text": result["text"]}
@app.post("/v1/audio/translations")
async def translate(file: UploadFile = File(...),
model: Optional[str] = Form(default="parakeet-tdt-0.6b-v3"),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json")):
return await transcribe(file=file, model=model, language=language,
response_format=response_format)
+1
@@ -9,6 +9,7 @@ dependencies = [
"pydantic>=2.9",
"pyyaml>=6.0",
"httpx>=0.27",
"python-multipart>=0.0.9",
]
[build-system]
+51
@@ -0,0 +1,51 @@
# WhisperX ASR + diarization container for Spark 2 (Blackwell GB10, sm_121).
#
# Replaces the custom Parakeet wrapper + Sortformer overlay with a single
# mainline pipeline: faster-whisper for transcription + pyannote.audio 3.1
# for diarization + wav2vec2 forced alignment for word-level timestamps.
#
# Build (on Spark 2, where Blackwell + nvcr.io credentials are available):
# docker build -t whisperx-asr:latest .
#
# Run:
# docker run -d --restart unless-stopped --name whisperx-asr \
# --gpus all --memory=40g \
# -p 8002:8002 \
# -v whisperx-models:/root/.cache/huggingface \
# -e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
# -e WHISPER_MODEL=medium \
# whisperx-asr:latest
#
# The memory cap is intentional: even if WhisperX hits a pathological input,
# it gets OOM-killed cleanly instead of swap-thrashing the whole Spark.
FROM nvcr.io/nvidia/pytorch:25.11-py3
# WhisperX runs ffmpeg under the hood for audio decoding
RUN apt-get update \
&& apt-get install -y --no-install-recommends ffmpeg \
&& rm -rf /var/lib/apt/lists/*
# Install whisperx + the FastAPI wrapper deps. --break-system-packages because
# the NGC PyTorch image has its own managed Python that's flagged "system".
COPY requirements.txt /tmp/requirements.txt
RUN pip install --break-system-packages --no-cache-dir -r /tmp/requirements.txt
# Pre-warm the default Whisper + alignment models at build time so first-call
# latency on a fresh container is small. (~3 GB cached into the image; if you
# want a smaller image, comment this out and accept the first-call download.)
ARG WHISPER_MODEL=medium
ENV WHISPER_MODEL=${WHISPER_MODEL}
RUN python3 -c "import whisperx; whisperx.load_model('${WHISPER_MODEL}', 'cpu', compute_type='int8')" \
&& python3 -c "import whisperx; whisperx.load_align_model(language_code='en', device='cpu')"
WORKDIR /opt/whisperx
COPY app /opt/whisperx/app
# Expose for spark-control's proxy on Spark 2
EXPOSE 8002
HEALTHCHECK --interval=30s --timeout=10s --start-period=180s \
CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8002/health')" || exit 1
CMD ["python3", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8002", "--workers", "1"]
+74
@@ -0,0 +1,74 @@
# WhisperX container for Spark 2
Replaces the custom Parakeet wrapper + Sortformer overlay (v0.10/v0.11) with a
single mainline pipeline:
- **faster-whisper** (CTranslate2-optimized) for STT
- **pyannote.audio 3.1** for speaker diarization (sliding-window — handles
long files in bounded memory, fixes the Sortformer OOM on 90-min audio)
- **wav2vec2 forced alignment** for word-level timestamps
Exposes the same API surface spark-control already proxies to, so the cutover
is a one-URL change in the audio proxy:
- `GET /health` — readiness probe
- `GET /v1/models` — model list
- `POST /v1/audio/transcriptions` — OpenAI-shaped STT
- `POST /v1/audio/transcribe-with-speakers` — merged diarized transcript
(matches spark-control's response shape exactly)
## Deploy to Spark 2
```bash
# 1. Copy this directory to Spark 2
rsync -av --delete image/whisperx_container/ modelo@192.168.1.87:~/whisperx-build/
# 2. SSH in and build
ssh modelo@192.168.1.87
cd ~/whisperx-build
docker build -t whisperx-asr:latest .
# 3. Run alongside the existing parakeet-asr (which stays on 8000 for now)
docker run -d --restart unless-stopped --name whisperx-asr \
--gpus all --memory=40g \
-p 8002:8002 \
-v whisperx-models:/root/.cache/huggingface \
-e HF_TOKEN="$(cat ~/.cache/huggingface/token)" \
-e WHISPER_MODEL=medium \
whisperx-asr:latest
# 4. Watch first-start logs (model load + first health check)
docker logs -f whisperx-asr
```
## Model size knobs
Set the model size with the `WHISPER_MODEL` env var (default: `medium`). Options:
| Model | Size | Speed (GB10) | Quality |
|---|---|---|---|
| `tiny` | ~39M | ~120x rt | low |
| `base` | ~74M | ~80x rt | ok |
| `small` | ~244M | ~50x rt | good |
| `medium`| ~769M | ~30x rt | excellent (**default**) |
| `large-v3`| ~1.5B | ~15x rt | best |
For a 90-min file, medium takes ~3 min STT + ~9 min diarize ≈ ~12 min total.
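The estimate above is just audio length divided by each stage's real-time factor. A minimal sketch, assuming the STT factors from the table and a diarization factor of about 10x (inferred from "~9 min for a 90-min file", not measured separately); `estimate_minutes` is a hypothetical helper:

```python
def estimate_minutes(audio_minutes: float, stt_rtf: float, diarize_rtf: float = 10.0) -> float:
    """Wall-clock minutes: audio length / real-time factor, summed over both stages.

    diarize_rtf=10.0 is an assumption derived from ~9 min of diarization
    for a 90-min file.
    """
    return audio_minutes / stt_rtf + audio_minutes / diarize_rtf


# 90-min file on `medium` (~30x rt STT) and on `large-v3` (~15x rt STT).
print(estimate_minutes(90, stt_rtf=30))
print(estimate_minutes(90, stt_rtf=15))
```

With these assumed factors diarization dominates the total, so dropping to a smaller Whisper model saves comparatively little on long files.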
## Memory budget
The `--memory=40g` cap is intentional. Spark 2 has 122 GB unified, of which
~35 GB is consumed by parakeet-asr + magpie-tts. The 40 GB cap leaves
comfortable headroom for both the model weights (~5 GB) and pyannote's
in-memory features (~5-15 GB for a 90-min file). If WhisperX hits a
pathological input it gets OOM-killed cleanly instead of swap-thrashing the
whole Spark — the symptom we hit with the unbounded Sortformer container.
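The budget can be sanity-checked with simple arithmetic. This sketch uses only the figures quoted in this section (122 GB unified memory, ~35 GB for parakeet-asr + magpie-tts, 40 GB cap); `headroom_after_cap` is a hypothetical helper, and the numbers are approximations rather than measurements:

```python
def headroom_after_cap(total_gb: float, other_services_gb: float, cap_gb: float) -> float:
    """Memory left on the host if WhisperX grows all the way to its cap."""
    return total_gb - other_services_gb - cap_gb


# Figures quoted in the README text above.
left = headroom_after_cap(total_gb=122, other_services_gb=35, cap_gb=40)
print(f"{left} GB remain for the OS and other workloads at full cap")
```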
## Rollback to Parakeet+Sortformer
```bash
docker stop whisperx-asr && docker rm whisperx-asr
```
The parakeet-asr container stays running throughout — spark-control's proxy
URL switch is reversible via config or version downgrade.
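The commit description says the audio proxy prefers WhisperX when its `/health` reports `diarizer_loaded=true` and otherwise falls back to the legacy path. A minimal sketch of that selection rule follows; `choose_upstream` and the `spark-2.local` host are hypothetical, only the ports (8002 for WhisperX, 8000 for parakeet-asr) and the `diarizer_loaded` field come from this PR, and the real proxy code may differ:

```python
from typing import Optional

# Placeholder host; ports are the ones used in this README.
WHISPERX_URL = "http://spark-2.local:8002"
PARAKEET_URL = "http://spark-2.local:8000"


def choose_upstream(whisperx_health: Optional[dict]) -> str:
    """Pick WhisperX only when its health payload says the diarizer is loaded.

    Pass None when the health probe failed (container stopped or removed).
    """
    if whisperx_health and whisperx_health.get("diarizer_loaded") is True:
        return WHISPERX_URL
    return PARAKEET_URL


print(choose_upstream({"status": "ready", "diarizer_loaded": True}))
print(choose_upstream(None))
```

Under this rule, removing the `whisperx-asr` container is enough to route traffic back to the legacy path, which matches the rollback described above.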
+355
@@ -0,0 +1,355 @@
"""WhisperX FastAPI wrapper — STT + speaker diarization in a single endpoint.
Endpoints (designed to be drop-in compatible with the existing spark-control
audio API surface, so the proxy just changes its upstream URL):
GET / — service info
GET /health — readiness probe
GET /v1/models — list loaded models
POST /v1/audio/transcriptions — OpenAI-shaped STT (no speakers)
POST /v1/audio/transcribe-with-speakers — merged diarized transcript
The /transcribe-with-speakers response shape EXACTLY matches what
spark-control's /api/audio/transcribe-with-speakers returns today (the one
that recap-relay's PR spec was written against), so swapping the upstream
from Parakeet+Sortformer to WhisperX is a one-URL change in the proxy.
"""
from __future__ import annotations
import os
import time
import tempfile
import logging
from contextlib import asynccontextmanager
from typing import Optional
import torch
import whisperx
from fastapi import FastAPI, File, Form, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("whisperx-api")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
COMPUTE_TYPE = os.getenv("COMPUTE_TYPE", "float16" if DEVICE == "cuda" else "int8")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "medium")
DEFAULT_LANG = os.getenv("DEFAULT_LANGUAGE", "en")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "16"))
HF_TOKEN = os.getenv("HF_TOKEN") or None
class WhisperXEngine:
def __init__(self) -> None:
self.transcribe_model = None
self.align_model = None
self.align_metadata = None
self.diarize_model = None
self._loaded = False
def load(self) -> None:
if self._loaded:
return
logger.info(f"Loading whisper-{WHISPER_MODEL} on {DEVICE} ({COMPUTE_TYPE})")
self.transcribe_model = whisperx.load_model(
WHISPER_MODEL, DEVICE, compute_type=COMPUTE_TYPE
)
logger.info(f"Loading alignment model for {DEFAULT_LANG}")
self.align_model, self.align_metadata = whisperx.load_align_model(
language_code=DEFAULT_LANG, device=DEVICE
)
if HF_TOKEN:
logger.info("Loading pyannote diarization pipeline (3.1)")
try:
self.diarize_model = whisperx.DiarizationPipeline(
use_auth_token=HF_TOKEN, device=DEVICE
)
except Exception as e:
logger.exception(f"Diarization pipeline failed to load: {e}")
self.diarize_model = None
else:
logger.warning(
"HF_TOKEN not set — diarization disabled. /transcribe-with-speakers "
"will return 503. /transcriptions still works."
)
self._loaded = True
logger.info("WhisperX engine ready")
def transcribe(self, audio_bytes: bytes, filename: str, want_timestamps: bool = True) -> dict:
if not self._loaded:
self.load()
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
tmp.write(audio_bytes)
tmp_path = tmp.name
try:
audio = whisperx.load_audio(tmp_path)
duration = float(audio.shape[0]) / 16000.0
result = self.transcribe_model.transcribe(
audio, batch_size=BATCH_SIZE, language=DEFAULT_LANG
)
language = result.get("language") or DEFAULT_LANG
if want_timestamps:
aligned = whisperx.align(
result["segments"],
self.align_model,
self.align_metadata,
audio,
DEVICE,
return_char_alignments=False,
)
segments = aligned.get("segments", [])
else:
segments = result.get("segments", [])
full_text = " ".join(s.get("text", "").strip() for s in segments).strip()
return {
"duration": duration,
"language": language,
"text": full_text,
"segments": segments,
"audio_path": tmp_path,
"audio": audio, # caller can reuse for diarization without re-loading
}
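except Exception:
# On failure the caller never receives audio_path, so remove the temp file here.
try: os.unlink(tmp_path)
except OSError: pass
raise
# NOTE: on success the caller is responsible for unlinking the temp file. It is
# exposed in the return dict so diarization can run on the same audio without
# disk re-IO; the unlink happens in the request handler's finally.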
def diarize(self, audio) -> dict:
if self.diarize_model is None:
raise RuntimeError(
"Diarization pipeline not loaded (HF_TOKEN missing or load failed)"
)
diar = self.diarize_model(audio)
return diar
engine = WhisperXEngine()
@asynccontextmanager
async def lifespan(app: FastAPI):
engine.load()
yield
app = FastAPI(
title="WhisperX ASR + Diarization",
version="1.0.0",
lifespan=lifespan,
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
@app.get("/")
async def root() -> dict:
return {
"service": "whisperx",
"device": DEVICE,
"models": {
"transcription": f"whisper-{WHISPER_MODEL}",
"alignment": f"wav2vec2-{DEFAULT_LANG}",
"diarization": "pyannote-speaker-diarization-3.1" if engine.diarize_model else None,
},
"endpoints": {
"transcriptions": "/v1/audio/transcriptions",
"transcribe_with_speakers": "/v1/audio/transcribe-with-speakers",
"models": "/v1/models",
"health": "/health",
},
}
@app.get("/health")
async def health() -> dict:
return {
"status": "ready" if engine._loaded else "loading",
"transcribe_loaded": engine.transcribe_model is not None,
"align_loaded": engine.align_model is not None,
"diarizer_loaded": engine.diarize_model is not None,
"model": f"whisper-{WHISPER_MODEL}",
"device": DEVICE,
}
@app.get("/v1/models")
async def list_models() -> dict:
data = [
{"id": f"whisper-{WHISPER_MODEL}", "object": "model", "owned_by": "openai", "kind": "stt"},
]
if engine.diarize_model is not None:
data.append(
{"id": "pyannote-speaker-diarization-3.1", "object": "model",
"owned_by": "pyannote", "kind": "diarization"}
)
return {"object": "list", "data": data}
def _normalize_speaker(label: str) -> str:
"""WhisperX/pyannote uses 'SPEAKER_00' / 'SPEAKER_01' / ... — normalize to
the same 'Speaker_0' shape spark-control's existing endpoint returns."""
if not label:
return "Speaker_unknown"
if label.upper().startswith("SPEAKER_"):
idx = label.split("_", 1)[1].lstrip("0") or "0"
return f"Speaker_{idx}"
return label
def _segments_to_blocks(segments: list[dict]) -> list[dict]:
"""Convert WhisperX's per-utterance segments into the
[{start_ms, end_ms, speaker, text}, ...] block shape spark-control returns
today. Groups consecutive same-speaker segments into one block."""
blocks: list[dict] = []
cur = None
for s in segments:
spk_raw = s.get("speaker") or "Speaker_unknown"
spk = _normalize_speaker(spk_raw)
text = (s.get("text") or "").strip()
start_ms = int(float(s.get("start", 0)) * 1000)
end_ms = int(float(s.get("end", 0)) * 1000)
if not text:
continue
if cur is None or cur["speaker"] != spk or start_ms - cur["end_ms"] > 1500:
if cur is not None:
blocks.append(cur)
cur = {"start_ms": start_ms, "end_ms": end_ms, "speaker": spk, "text": text}
else:
cur["text"] = (cur["text"] + " " + text).strip()
cur["end_ms"] = end_ms
if cur is not None:
blocks.append(cur)
return blocks
@app.post("/v1/audio/transcriptions")
async def transcribe(
file: UploadFile = File(...),
model: Optional[str] = Form(default=None),
language: Optional[str] = Form(default=None),
response_format: Optional[str] = Form(default="json"),
temperature: Optional[float] = Form(default=None),
prompt: Optional[str] = Form(default=None),
):
if not engine._loaded:
raise HTTPException(status_code=503, detail="Engine loading")
audio_bytes = await file.read()
if not audio_bytes:
raise HTTPException(status_code=400, detail="Empty file")
start_t = time.time()
audio_path = None
try:
result = engine.transcribe(
audio_bytes,
file.filename or "audio.wav",
want_timestamps=(response_format == "verbose_json"),
)
audio_path = result.pop("audio_path", None)
result.pop("audio", None)
except Exception as e:
logger.exception("Transcription failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
finally:
if audio_path:
try: os.unlink(audio_path)
except OSError: pass
elapsed = time.time() - start_t
duration = result.get("duration", 0.0)
rtfx = duration / elapsed if elapsed > 0 else 0  # guard against a zero elapsed time
logger.info(f"Transcribed {duration:.1f}s in {elapsed:.1f}s ({rtfx:.0f}x rt)")
if response_format == "text":
return JSONResponse(content=result["text"], media_type="text/plain")
if response_format == "verbose_json":
words = []
for s in result.get("segments", []):
for w in s.get("words", []) or []:
words.append({
"word": w.get("word"),
"start": w.get("start"),
"end": w.get("end"),
"score": w.get("score"),
})
return {
"task": "transcribe",
"language": result.get("language", "en"),
"duration": duration,
"text": result["text"],
"segments": [
{"start": s.get("start"), "end": s.get("end"), "text": s.get("text", "").strip()}
for s in result.get("segments", [])
],
"words": words,
}
return {"text": result["text"]}
@app.post("/v1/audio/transcribe-with-speakers")
async def transcribe_with_speakers(file: UploadFile = File(...)) -> dict:
"""Merged STT + diarization. Response shape matches spark-control's
/api/audio/transcribe-with-speakers exactly — recap-relay's PR spec
needs no changes when we cut over."""
if not engine._loaded:
raise HTTPException(status_code=503, detail="Engine loading")
if engine.diarize_model is None:
raise HTTPException(
status_code=503,
detail="Diarization unavailable — HF_TOKEN not set or pyannote failed to load",
)
audio_bytes = await file.read()
if not audio_bytes:
raise HTTPException(status_code=400, detail="Empty file")
start_t = time.time()
audio_path = None
try:
result = engine.transcribe(
audio_bytes, file.filename or "audio.wav", want_timestamps=True
)
audio_path = result.pop("audio_path", None)
audio = result.pop("audio")
# Diarize on the in-memory audio (no second decode)
logger.info("Running pyannote diarization…")
diar = engine.diarize(audio)
# whisperx.assign_word_speakers writes speaker labels into the
# aligned segments + their nested words
result_with_speakers = whisperx.assign_word_speakers(
diar, {"segments": result["segments"]}
)
segments_in = result_with_speakers.get("segments", [])
blocks = _segments_to_blocks(segments_in)
speakers = sorted({b["speaker"] for b in blocks if b["speaker"] != "Speaker_unknown"})
except Exception as e:
logger.exception("Diarized transcription failed")
raise HTTPException(status_code=500, detail=f"Failed: {e}")
finally:
if audio_path:
try: os.unlink(audio_path)
except OSError: pass
elapsed = time.time() - start_t
duration = result.get("duration", 0.0)
rtfx = duration / elapsed if elapsed > 0 else 0  # guard against a zero elapsed time
logger.info(
f"Transcribed+diarized {duration:.1f}s in {elapsed:.1f}s "
f"({rtfx:.0f}x rt), {len(speakers)} speakers, {len(blocks)} blocks"
)
return {
"duration": duration,
"language": result.get("language", "en"),
"speakers_detected": speakers,
"segments": blocks,
"models": {
"transcription": f"whisper-{WHISPER_MODEL}",
"diarization": "pyannote-speaker-diarization-3.1",
},
}
@@ -0,0 +1,5 @@
whisperx==3.4.3
fastapi>=0.115
uvicorn[standard]>=0.32
python-multipart>=0.0.9
soundfile>=0.12
+10 -2
@@ -9,7 +9,7 @@
**Fix:**
```bash
ssh <spark-user>@<spark-2-host> 'docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache && docker restart magpie-tts'
ssh modelo@<spark-2-host> 'docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache && docker restart magpie-tts'
```
The trick is the `docker run --rm alpine chown` — it runs as root inside the throwaway container, which is enough to chown the bind-mounted volume on the host, without needing `sudo` on the host itself. After the chown + restart, magpie downloaded its ~3 GB model from NGC into the cache and came up healthy on `:9000`.
@@ -20,9 +20,17 @@ The trick is the `docker run --rm alpine chown` — it runs as root inside the t
This flag is Blackwell-specific. If vLLM in the container reports `unrecognized arguments: --moe_backend` or similar, edit `models.yaml` for `qwen36` and drop that flag. The swap UI does NOT auto-fallback in v0.1 — failure surfaces in the log stream.
## Qwen3.6 Mamba block-size assertion (fixed in v0.6.0:1)
Qwen3.6 uses a Mamba-attention hybrid that requires `--max-num-batched-tokens >= 2096`. vLLM's default is 2048, which trips `AssertionError: In Mamba cache align mode, block_size (2096) must be <= max_num_batched_tokens (2048)`. Fix: bake `--max-num-batched-tokens=16384` into the bundled qwen36 entry — matches the upstream qwen3.5-35b-a3b-fp8 recipe.
## Multimodal token budget for vision models (fixed in v0.8.0:1)
After the eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets. Vision-capable models like Gemma 4 31B and Qwen3-VL crash at engine init with `ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048)`. Fix: bake `--max-num-batched-tokens=16384` into every model that has the `vision` capability. Now applied to qwen3-vl, gemma4, and qwen36 (which was already set for the Mamba issue).
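Both fixes above come down to the same check: the effective `--max-num-batched-tokens` must be at least the value vLLM asserts on (2096 for the Mamba case, 2496 for the multimodal case). A minimal sketch of a lint over model entries, assuming each entry carries a `vllm_args` list as in `models.yaml`; the helper names and the in-memory dict are hypothetical, and the 2048 default is the one quoted in the error messages above:

```python
import re
from typing import Optional

_FLAG = re.compile(r"^--max-num-batched-tokens=(\d+)$")
VLLM_DEFAULT = 2048  # default quoted in the assertion messages above


def batched_tokens(vllm_args: list[str]) -> Optional[int]:
    """Return the explicit --max-num-batched-tokens value, or None if unset."""
    for arg in vllm_args:
        match = _FLAG.match(arg)
        if match:
            return int(match.group(1))
    return None


def underprovisioned(models: dict, minimum: int) -> list[str]:
    """Names of models whose effective batched-token budget is below `minimum`."""
    bad = []
    for name, cfg in models.items():
        value = batched_tokens(cfg.get("vllm_args", []))
        effective = VLLM_DEFAULT if value is None else value
        if effective < minimum:
            bad.append(name)
    return sorted(bad)


# Hypothetical entries: one patched as in this PR, one missing the flag.
models = {
    "gemma4": {"vllm_args": ["--max-model-len=32768", "--max-num-batched-tokens=16384"]},
    "unpatched": {"vllm_args": ["--max-model-len=32768"]},
}
print(underprovisioned(models, minimum=2496))
```

Running a check like this before a swap would surface the missing flag up front instead of at engine init.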
## Two SSH paths to Spark 1 from the laptop
`ssh <spark-user>@<spark-1-ip>` does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for `<spark-1-host>.local`. Always use the `.local` hostname or `<spark-2-ip>`-style entries that ARE matched.
`ssh modelo@192.168.1.103` does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for `spark-27ea.local`. Always use the `.local` hostname or `192.168.1.87`-style entries that ARE matched.
## Older models in `models.yaml`
+1 -1
@@ -1,6 +1,6 @@
MIT License
Copyright (c) 2026 Alice
Copyright (c) 2026 Grant
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
+1 -1
@@ -19,7 +19,7 @@ This package SSHes into your Spark server to run cluster commands, so it needs a
```bash
echo "<paste-pubkey-here>" >> ~/.ssh/authorized_keys
```
3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `<spark-user>`).
3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `modelo`).
4. **Open the Web UI.** It will contact each Spark to confirm it is reachable. If both indicators are green, you're done.
## Using Spark Control
+1 -1
@@ -19,7 +19,7 @@ This package SSHes into your Spark server to run cluster commands, so it needs a
```bash
echo "<paste-pubkey-here>" >> ~/.ssh/authorized_keys
```
3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `<spark-user>`).
3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `modelo`).
4. **Open the Web UI.** It will contact each Spark to confirm it is reachable. If both indicators are green, you're done.
## Using Spark Control
+2 -2
@@ -1,10 +1,10 @@
import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
export const v0_1_0 = VersionInfo.of({
version: '0.6.0:0',
version: '0.12.0:1',
releaseNotes: {
en_US:
'v0.6: Service-level connectivity tracking and a passive failure-report endpoint. The connectivity log now records up/down transitions for Parakeet, Magpie, and vLLM in addition to the Spark hosts (driven by the existing /api/status and /api/services polling). A new POST /api/health-event endpoint lets external apps (e.g. Open WebUI) record failures they observed even when the failure was brief enough to slip between polls. The Connectivity log dialog shows hosts and services with separate badges, and reports appear inline with their source app + error detail.',
'v0.12.0:1 — hotfix: 0.12.0:0\'s install action used shlex.quote() on the remote build path, which wraps `~/whisperx-build/...` in single quotes — the remote shell then doesn\'t expand the tilde and treats it as a literal directory named `~`. Result: "bash: line 1: ~/whisperx-build/Dockerfile: No such file or directory" on the very first file copy. Same bug pattern we hit before with $HOME in the disk probe. Rewrote to embed $HOME in double-quoted remote shell strings; hardcoded file names (Dockerfile, requirements.txt, README.md, app/main.py) embed unquoted inside that scope. All other 0.12.0 behavior is unchanged.',
},
migrations: {
up: async ({ effects }) => {},
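The quoting bug described in the release note above can be reproduced locally, without a remote host. This is a minimal sketch; `quoted_tilde_path` and `home_path` are hypothetical helpers for illustration and are not the functions in `whisperx_install.py`.

```python
import shlex

REMOTE_DIR = "whisperx-build"  # build directory under the remote home


def quoted_tilde_path(name: str) -> str:
    """The broken form: shlex.quote() single-quotes the path because '~' is
    not in its safe character set, so the remote shell never expands it."""
    return shlex.quote(f"~/{REMOTE_DIR}/{name}")


def home_path(name: str) -> str:
    """The fixed form: $HOME inside double quotes is expanded by the remote
    shell. Only safe for hardcoded names, since name is embedded unescaped."""
    return f'"$HOME/{REMOTE_DIR}/{name}"'


print(quoted_tilde_path("Dockerfile"))  # '~/whisperx-build/Dockerfile'
print(home_path("Dockerfile"))          # "$HOME/whisperx-build/Dockerfile"
```

The first form reaches the remote shell as a literal path beginning with `~`, which matches the "No such file or directory" symptom. The second defers expansion to the remote shell.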
+8 -8
@@ -37,7 +37,7 @@ These take effect on the **next swap to that model**. If a swap fails after this
## Adding a new model
1. Add an entry to `image/models.yaml`. Required fields: `display_name`, `repo`, `size_gb`, `mode` (`solo` or `cluster`), `vllm_args`. Optional but recommended: `description` (one paragraph — what the model is, what it's good for, how it differs from others; renders below the meta tags in each card), `capabilities` (tags like `[vision, reasoning, tools]`), `expected_ready_seconds`.
2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host>.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
2. Confirm the weights are on the Spark: `ssh modelo@spark-27ea.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
3. Rebuild + redeploy the package: `cd package && make x86 && make install`.
If `description` is omitted, the card simply hides that section — no need to populate it for every model. Keep descriptions generic (not user-specific) so the catalog stays portable.
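Before rebuilding, a new entry can be sanity-checked against the required fields listed in step 1. This is a sketch assuming the entry has already been loaded from `image/models.yaml` into a dict; `validate_model_entry` is a hypothetical helper, and the package may perform its own validation differently.

```python
REQUIRED_FIELDS = ("display_name", "repo", "size_gb", "mode", "vllm_args")
VALID_MODES = ("solo", "cluster")


def validate_model_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks complete."""
    problems = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    mode = entry.get("mode")
    if mode is not None and mode not in VALID_MODES:
        problems.append(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return problems


# Placeholder values for illustration only; size_gb is not a real measurement.
example = {
    "display_name": "Example model",
    "repo": "RedHatAI/gemma-4-31B-it-NVFP4",
    "size_gb": 0,
    "mode": "solo",
    "vllm_args": [],
}
print(validate_model_entry(example))
```

Optional fields such as `description`, `capabilities`, and `expected_ready_seconds` are deliberately not checked, since omitting them is allowed.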
@@ -47,7 +47,7 @@ If `description` is omitted, the card simply hides that section — no need to p
If the UI is unavailable and you need to swap by hand:
```bash
ssh <spark-user>@<spark-1-host>.local
ssh modelo@spark-27ea.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
@@ -61,19 +61,19 @@ docker logs -f vllm_node # wait for "Application startup complete."
```bash
# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .
curl -s http://192.168.1.103:8888/v1/models | jq .
# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host>.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
ssh modelo@spark-27ea.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
# Tail current model's logs
ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'
ssh modelo@spark-27ea.local 'docker logs --tail 200 -f vllm_node'
# Parakeet
curl -s http://<spark-2-ip>:8000/health
curl -s http://192.168.1.87:8000/health
# Magpie (see known-issues.md)
curl -s http://<spark-2-ip>:9000/v1/health/ready
curl -s http://192.168.1.87:9000/v1/health/ready
```
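The three HTTP checks above can also be run in one pass from a script. This is a sketch using only the standard library. `health_urls` and `is_up` are hypothetical helpers, and the host arguments stand in for the Spark 1 and Spark 2 addresses used in the `curl` commands.

```python
import urllib.request


def health_urls(spark1: str, spark2: str) -> dict[str, str]:
    """Map each service to the endpoint probed by the curl commands above."""
    return {
        "vllm": f"http://{spark1}:8888/v1/models",
        "parakeet": f"http://{spark2}:8000/health",
        "magpie": f"http://{spark2}:9000/v1/health/ready",
    }


def is_up(url: str, timeout: float = 3.0) -> bool:
    """True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers URLError, HTTPError, refused connections, and timeouts.
        return False


def check_all(spark1: str, spark2: str) -> dict[str, bool]:
    """Probe every service and return its up/down state by name."""
    return {name: is_up(url) for name, url in health_urls(spark1, spark2).items()}
```

Calling `check_all("<spark-1-ip>", "<spark-2-ip>")` with real addresses returns a dict of service name to up/down state.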
## Hard reset
@@ -81,7 +81,7 @@ curl -s http://<spark-2-ip>:9000/v1/health/ready
If `launch-cluster.sh` gets stuck:
```bash
ssh <spark-user>@<spark-1-host>.local
ssh modelo@spark-27ea.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
docker ps -aq | xargs -r docker rm -f