spark-control/docs/COORDINATION.md
Keysat 7ae6ab3ba8 v0.25.0:0 - cluster coordination layer (swap lock + webhook + schedule registry)
GPU-arbiter safety layer for when automation, not just the dashboard, swaps
models:
- swap reservation lock (POST/GET/DELETE /api/swap/lock); 423-enforced in
  post_swap via a single-read gate, TTL-bounded, secret-token auth, human
  force-release override + dashboard banner
- swap webhook (swap_complete/swap_failed) fired outside the swap lock, optional
  HMAC signature, configurable URL+secret
- read-only schedule registry (GET/POST/DELETE /api/schedule) + dashboard panel

New module image/app/coordination.py; docs/COORDINATION.md for consumers; 22
offline tests in test_coordination.py.
2026-06-18 07:07:08 -05:00

Cluster coordination through Spark Control (v0.25.0)

Spark Control is the GPU arbiter, not a job runner. Your recurring pipelines (model-warming crons, "daily X" generators, batch jobs) live in your own services and drive Spark Control's swap API. This page documents the safety layer around that: a swap reservation lock, a swap-event webhook, and a read-only schedule registry.

If only the dashboard ever swaps models, you don't need any of this — it's for when something automated also swaps.

All endpoints are on the Spark Control host (same LAN/VPN URL as the LLM, audio, and embeddings proxies). By design there is no API-token authentication: access is limited to the LAN and split-tunnel VPN, and a non-browser client passes the same-origin guard automatically. The only token involved is the swap-lock token described in section 1.


1. Swap reservation lock

A short, TTL-bounded reservation of the swap path. While a lock is held, any real swap that doesn't present the holder's token is refused with 423 Locked — including the dashboard's manual swap. The holder name is descriptive; the returned token is the secret that authorises swaps and the release.

The lock is in-memory: it resets to unlocked if Spark Control restarts (the safe-for-availability default), and the swap engine's own in-progress guard still prevents two swaps running at once.

POST /api/swap/lock — acquire (or extend)

// request
{ "holder": "openclaw-daily-vol", "ttl_seconds": 900, "note": "daily vol run" }

// 200 response
{
  "held": true,
  "holder": "openclaw-daily-vol",
  "acquired_at": "2026-06-17T12:00:00+00:00",
  "expires_at":  "2026-06-17T12:15:00+00:00",
  "seconds_remaining": 900,
  "note": "daily vol run",
  "token": "a1b2c3…"        // SECRET — store it; needed to swap and to release
}
  • ttl_seconds is optional (default 900) and clamped to [1, 86400].
  • 409 if a different holder already holds it (body includes the current lock state). To extend your own lock, POST again with the same holder and your token — the token is preserved and the window slides forward.
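The acquire/extend call above might be wrapped like this. It is a minimal sketch using only the standard library, not Spark Control's own client: `BASE_URL`, `build_lock_request`, and `acquire_lock` are hypothetical names, and sending the token in `X-Swap-Lock-Token` when extending is an assumption (the text above only says to POST again "with your token").

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://spark-control.lan:8080"  # hypothetical; use your Spark Control host


def build_lock_request(holder, ttl_seconds=900, note="", token=None):
    """Build the POST /api/swap/lock request (acquire, or extend when token is given)."""
    # The server clamps ttl_seconds to [1, 86400]; mirroring that here keeps logs honest.
    body = {"holder": holder, "ttl_seconds": max(1, min(86400, int(ttl_seconds)))}
    if note:
        body["note"] = note
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: extension uses the same header as swap and release.
        headers["X-Swap-Lock-Token"] = token
    return urllib.request.Request(
        BASE_URL + "/api/swap/lock",
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )


def acquire_lock(holder, ttl_seconds=900, note="", token=None,
                 opener=urllib.request.urlopen):
    """Return the lock state (including the secret token), or None on 409."""
    req = build_lock_request(holder, ttl_seconds, note, token)
    try:
        with opener(req, timeout=10) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 409:  # a different holder already has the lock
            return None
        raise
```

A caller would typically treat `None` as "back off and retry later" and keep the returned `token` in memory only, since it authorises swaps and the release.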

GET /api/swap/lock — status (no token)

{ "held": true, "holder": "openclaw-daily-vol", "expires_at": "…", "seconds_remaining": 612, "note": "…" }
// or
{ "held": false }

DELETE /api/swap/lock — release

Send your token in the X-Swap-Lock-Token header (or ?token=):

DELETE /api/swap/lock
X-Swap-Lock-Token: a1b2c3…
  • 403 if the token doesn't match. The dashboard's human override is DELETE /api/swap/lock?force=true (no token).

Swapping while you hold the lock

Pass the token on the swap call; the dashboard (no token) is then blocked:

POST /api/swap
X-Swap-Lock-Token: a1b2c3…
{ "model_key": "gemma-3-27b" }

Recommended scheduler flow: acquire → swap (with token) → poll /api/swap/{id} → release. Always release in a finally; if you crash, the TTL frees it.

POST /api/swap/{key}/validate (pre-flight) and dry-run swaps are not blocked by the lock — they don't touch the cluster.
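The recommended flow (acquire, swap with token, poll, release in a finally) can be sketched as follows. This is an illustration under stated assumptions rather than a reference client: `BASE_URL`, `http_call`, and `locked_swap` are hypothetical names, the swap response is assumed to carry the job id under `job_id`, and the polled job is assumed to report `state` as `ready` or `failed` (the values shown in the webhook payload).

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://spark-control.lan:8080"  # hypothetical host; substitute your own


def http_call(method, path, body=None, token=None):
    """Send one JSON request and return (status_code, parsed_body_or_None)."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["X-Swap-Lock-Token"] = token
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:  # 4xx/5xx, e.g. 409 or 423
        status, raw = err.code, err.read()
    try:
        return status, (json.loads(raw) if raw else None)
    except ValueError:  # non-JSON error page
        return status, None


def locked_swap(model_key, holder, call=http_call, poll_seconds=5.0, ttl_seconds=900):
    """Acquire the lock, swap with the token, poll the job, and always release."""
    status, lock = call("POST", "/api/swap/lock",
                        {"holder": holder, "ttl_seconds": ttl_seconds})
    if status != 200:
        raise RuntimeError(f"swap lock not acquired (HTTP {status})")
    token = lock["token"]
    deadline = time.monotonic() + ttl_seconds  # stop once our lock would have expired
    try:
        status, job = call("POST", "/api/swap", {"model_key": model_key}, token)
        if status >= 300:
            raise RuntimeError(f"swap refused (HTTP {status})")
        job_id = job["job_id"]  # assumption: field name of the job id in the response
        while time.monotonic() < deadline:
            status, job = call("GET", f"/api/swap/{job_id}")
            if status == 200 and job and job.get("state") in ("ready", "failed"):
                return job
            time.sleep(poll_seconds)
        raise TimeoutError(f"swap {job_id} did not finish within {ttl_seconds}s")
    finally:
        # Runs on success, failure, and timeout; a hard crash is covered by the TTL.
        call("DELETE", "/api/swap/lock", None, token)
```

Passing `call` as a parameter keeps the flow testable without a live cluster; in production the default `http_call` is used, for example `locked_swap("gemma-3-27b", "openclaw-daily-vol")`.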


2. Swap-event webhook

Configure a URL in Configure Sparks → "Swap webhook URL". After every real swap, Spark Control POSTs:

{
  "event": "swap_complete",   // or "swap_failed"
  "job_id": "1a2b3c4d",
  "model_key": "gemma-3-27b",
  "state": "ready",           // or "failed"
  "returncode": 0,
  "started_at": "2026-06-17T12:00:00+00:00",
  "finished_at": "2026-06-17T12:03:11+00:00",
  "dry_run": false
}

Headers: X-Spark-Event: swap_complete. If you set a webhook secret, the body is signed: X-Spark-Signature: sha256=<hmac> (HMAC-SHA256 of the raw body with the shared secret). Verify it like:

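import hashlib
import hmac
# raw_body must be the unparsed request bytes; secret is the shared webhook secret (str).
expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
if not hmac.compare_digest(expected, request.headers.get("X-Spark-Signature", "")):
    raise PermissionError("X-Spark-Signature missing or invalid")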

Delivery is best-effort and fire-and-forget (5 s timeout, no retries) — a webhook failure never affects the swap itself. Dry runs don't fire.
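On the receiving side, a framework-independent handler for this payload could look like the sketch below. `sign` and `parse_swap_event` are hypothetical helper names, and `headers` is assumed to be a plain dict keyed exactly as `X-Spark-Signature`; most web frameworks offer a case-insensitive mapping instead.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign(secret: str, raw_body: bytes) -> str:
    """Compute the X-Spark-Signature value for a raw body."""
    return "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()


def parse_swap_event(raw_body: bytes, headers: dict, secret: Optional[str] = None) -> dict:
    """Verify the signature (when a secret is configured) and return the event."""
    if secret is not None:
        received = headers.get("X-Spark-Signature", "")
        # Compare as bytes so a malformed non-ASCII header fails cleanly.
        if not hmac.compare_digest(sign(secret, raw_body).encode(), received.encode()):
            raise PermissionError("X-Spark-Signature missing or invalid")
    event = json.loads(raw_body)
    if event.get("event") not in ("swap_complete", "swap_failed"):
        raise ValueError(f"unexpected event: {event.get('event')!r}")
    return event
```

Because delivery is fire-and-forget with no retries, a consumer that must not miss a result can still fall back to polling `/api/swap/{id}`; the webhook is then only a faster path.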


3. Schedule registry (read-only display)

So the dashboard can show what's scheduled to touch the GPU and when, your schedulers register their jobs here. Spark Control only displays these — it never executes them.

POST /api/schedule — register / update

// request (pass a stable `id` to update in place on re-register)
{ "id": "daily-vol", "name": "Daily Vol", "owner": "openclaw",
  "cron": "0 6 * * *", "next_run": "2026-06-18T06:00:00Z",
  "description": "Swaps to the big model, generates the vol report" }

// response: the stored entry (an id is generated if you omit one)

name is required. id, if given, may contain only the characters A-Z, a-z, 0-9, _, . and -, and must be at most 64 characters long.
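A scheduler can check these rules locally before posting. The sketch below is a client-side pre-check only, with hypothetical names (`SCHEDULE_ID_RE`, `build_schedule_entry`); the minimum length of 1 is an assumption, and the server remains the authority on what it accepts.

```python
import re

# Assumption: 1 to 64 characters drawn from [A-Za-z0-9_.-].
SCHEDULE_ID_RE = re.compile(r"[A-Za-z0-9_.-]{1,64}")


def build_schedule_entry(name, schedule_id=None, **fields):
    """Return a request body for POST /api/schedule, validating name and id."""
    if not name:
        raise ValueError("name is required")
    entry = {"name": name, **fields}
    if schedule_id is not None:
        if not SCHEDULE_ID_RE.fullmatch(schedule_id):
            raise ValueError(f"invalid schedule id: {schedule_id!r}")
        entry["id"] = schedule_id
    return entry
```

For example, `build_schedule_entry("Daily Vol", "daily-vol", owner="openclaw", cron="0 6 * * *")` yields a body matching the request shown above, while an id such as `"daily vol"` (with a space) is rejected before any network call.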

GET /api/schedule — list

{ "schedules": [ { "id": "daily-vol", "name": "Daily Vol", "owner": "openclaw",
  "cron": "0 6 * * *", "next_run": "…", "description": "…",
  "registered_at": "…", "updated_at": "…" } ] }

DELETE /api/schedule/{id} — deregister

{ "deleted": true }

The registry is in-memory, so every entry is lost when Spark Control restarts. Re-register your schedules whenever your own service starts; with a stable id this is an in-place update, so repeating it is harmless.
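Re-registration at startup might be sketched like this. `BASE_URL`, `SCHEDULES`, `post_json`, and `register_all` are hypothetical names; the example entry reuses the fields from the request shown earlier.

```python
import json
import urllib.request

BASE_URL = "http://spark-control.lan:8080"  # hypothetical host; substitute your own

SCHEDULES = [
    {"id": "daily-vol", "name": "Daily Vol", "owner": "openclaw",
     "cron": "0 6 * * *",
     "description": "Swaps to the big model, generates the vol report"},
]


def post_json(path, body):
    """POST a JSON body and return the parsed JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def register_all(schedules, post=post_json):
    """Register every entry; return the ids that could not be registered."""
    failed = []
    for entry in schedules:
        try:
            post("/api/schedule", entry)
        except OSError:  # URLError, HTTPError and timeouts are all OSError subclasses
            failed.append(entry["id"])
    return failed
```

Since the registry is display-only, a failed registration should not stop the scheduler itself; logging the returned ids and retrying later is usually enough.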