v0.25.0:0 - cluster coordination layer (swap lock + webhook + schedule registry)
GPU-arbiter safety layer for when automation, not just the dashboard, swaps models:
- swap reservation lock (POST/GET/DELETE /api/swap/lock); 423-enforced in post_swap via a single-read gate, TTL-bounded, secret-token auth, human force-release override + dashboard banner
- swap webhook (swap_complete/swap_failed) fired outside the swap lock, optional HMAC signature, configurable URL+secret
- read-only schedule registry (GET/POST/DELETE /api/schedule) + dashboard panel

New module image/app/coordination.py; docs/COORDINATION.md for consumers; 22 offline tests in test_coordination.py.
@@ -0,0 +1,157 @@
# Cluster coordination through Spark Control (v0.25.0)

Spark Control is the **GPU arbiter, not a job runner.** Your recurring pipelines
(model-warming crons, "daily X" generators, batch jobs) live in your own
services and *drive Spark Control's swap API*. This page documents the safety
layer around that: a **swap reservation lock**, a **swap-event webhook**, and a
**read-only schedule registry**.

If only the dashboard ever swaps models, you don't need any of this; it's for
when something automated also swaps.

All endpoints are on the Spark Control host (same LAN/VPN URL as the LLM, audio,
and embeddings proxies). There is no API-token auth by design (LAN + split-tunnel
VPN only); a non-browser client passes the same-origin guard automatically.

---

## 1. Swap reservation lock

A short, TTL-bounded reservation of the swap path. While a lock is held, **any
real swap that doesn't present the holder's token is refused with `423 Locked`**,
including the dashboard's manual swap. The holder *name* is descriptive; the
returned **token** is the secret that authorises swaps and the release.

The lock is in-memory: it resets to *unlocked* if Spark Control restarts (the
safe-for-availability default), and the swap engine's own in-progress guard
still prevents two swaps running at once.

### `POST /api/swap/lock`: acquire (or extend)

```json
// request
{ "holder": "openclaw-daily-vol", "ttl_seconds": 900, "note": "daily vol run" }

// 200 response
{
  "held": true,
  "holder": "openclaw-daily-vol",
  "acquired_at": "2026-06-17T12:00:00+00:00",
  "expires_at": "2026-06-17T12:15:00+00:00",
  "seconds_remaining": 900,
  "note": "daily vol run",
  "token": "a1b2c3…"   // SECRET: store it; needed to swap and to release
}
```

- `ttl_seconds` is optional (default 900) and clamped to `[1, 86400]`.
- **`409`** if a *different* holder already holds it (body includes the current
  `lock` state). To **extend** your own lock, POST again with the same `holder`
  **and** your `token`; the token is preserved and the window slides forward.

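As a rough client-side illustration of the acquire and extend calls above, the sketch below builds the `POST /api/swap/lock` request with the standard library only. The helper name `build_lock_request` and the base URL passed to it are hypothetical. This page does not say where the token goes when extending, so sending it in the `X-Swap-Lock-Token` header (as the swap and release calls do) is an assumption to verify against your Spark Control version.

```python
import json
import urllib.request
from typing import Optional

# Clamp range documented above; the server clamps too, this only mirrors it locally.
MIN_TTL, MAX_TTL = 1, 86400


def build_lock_request(base_url: str, holder: str, ttl_seconds: int = 900,
                       note: str = "", token: Optional[str] = None) -> urllib.request.Request:
    """Build a POST /api/swap/lock request; pass `token` to extend a lock you already hold."""
    ttl = max(MIN_TTL, min(MAX_TTL, int(ttl_seconds)))
    body = {"holder": holder, "ttl_seconds": ttl}
    if note:
        body["note"] = note
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/swap/lock",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    if token is not None:
        # ASSUMPTION: the extend call accepts the token in this header.
        req.add_header("X-Swap-Lock-Token", token)
    return req
```

Sending the result with `urllib.request.urlopen(req)` raises `urllib.error.HTTPError` on a `409`, so a caller can check `err.code` and back off until the other holder releases.
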
### `GET /api/swap/lock`: status (no token)

```json
{ "held": true, "holder": "openclaw-daily-vol", "expires_at": "…", "seconds_remaining": 612, "note": "…" }
// or
{ "held": false }
```

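One possible use of the status endpoint is to wait politely for another holder to finish before trying to acquire. The sketch below is an illustration, not part of Spark Control: `wait_until_unlocked` is a hypothetical helper, and `get_status` stands for any callable that returns the parsed `GET /api/swap/lock` body. It relies only on the `held` and `seconds_remaining` fields shown above.

```python
import time
from typing import Callable


def wait_until_unlocked(get_status: Callable[[], dict], timeout: float = 1800.0,
                        max_sleep: float = 30.0,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll the lock status until it reports unlocked; return False on timeout."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if not status.get("held"):
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Sleep roughly until the lock's TTL runs out, capped so an early release is noticed.
        hint = max(1.0, float(status.get("seconds_remaining", max_sleep)))
        sleep(min(max_sleep, remaining, hint))
```

An unlocked status is only a hint: another client can acquire the lock between the poll and your own `POST /api/swap/lock`, so the `409` case still has to be handled.
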
### `DELETE /api/swap/lock`: release

Send your token in the `X-Swap-Lock-Token` header (or `?token=`):

```
DELETE /api/swap/lock
X-Swap-Lock-Token: a1b2c3…
```

- **`403`** if the token doesn't match. The dashboard's human override is
  `DELETE /api/swap/lock?force=true` (no token).

### Swapping while you hold the lock

Pass the token on the swap call; the dashboard (no token) is then blocked:

```
POST /api/swap
X-Swap-Lock-Token: a1b2c3…

{ "model_key": "gemma-3-27b" }
```

Recommended scheduler flow: **acquire → swap (with token) → poll `/api/swap/{id}`
→ release**. Always release in a `finally`; if you crash, the TTL frees it.

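The recommended flow could be sketched roughly as follows. Several details are assumptions not confirmed by this page: the `BASE_URL` value, the `job_id` field in the `POST /api/swap` reply, and a `state` field in the poll reply that ends in `ready` or `failed` (those values are documented only for the webhook payload). The `send` parameter is a hypothetical transport hook so the flow can be exercised without a live cluster.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

BASE_URL = "http://spark-control.example:8080"  # hypothetical; use your Spark Control host

# A transport takes (method, path, body, headers) and returns the parsed JSON reply.
Transport = Callable[[str, str, Optional[dict], dict], dict]


def http_transport(method: str, path: str, body: Optional[dict], headers: dict) -> dict:
    """urllib-based transport; urlopen raises urllib.error.HTTPError on 4xx/5xx."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method, headers=dict(headers))
    if data is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=30) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else {}


def swap_with_lock(model_key: str, holder: str, send: Transport = http_transport,
                   ttl_seconds: int = 900, poll_interval: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep) -> dict:
    """Acquire the lock, swap with the token, poll the job, and always release."""
    lock = send("POST", "/api/swap/lock", {"holder": holder, "ttl_seconds": ttl_seconds}, {})
    auth = {"X-Swap-Lock-Token": lock["token"]}
    deadline = time.monotonic() + ttl_seconds  # stop polling once the reservation would lapse
    try:
        job = send("POST", "/api/swap", {"model_key": model_key}, auth)
        job_id = job["job_id"]  # ASSUMPTION: field name taken from the webhook payload
        while time.monotonic() < deadline:
            status = send("GET", f"/api/swap/{job_id}", None, {})
            if status.get("state") in ("ready", "failed"):  # ASSUMPTION: terminal states
                return status
            sleep(poll_interval)
        raise TimeoutError(f"swap {job_id} did not finish within {ttl_seconds}s")
    finally:
        # Runs on success, failure, and timeout; a hard crash is covered by the TTL.
        send("DELETE", "/api/swap/lock", None, auth)
```
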
> `POST /api/swap/{key}/validate` (pre-flight) and dry-run swaps are **not**
> blocked by the lock because they don't touch the cluster.

---

## 2. Swap-event webhook

Configure a URL in **Configure Sparks → "Swap webhook URL"**. After every real
swap, Spark Control POSTs:

```json
{
  "event": "swap_complete",          // or "swap_failed"
  "job_id": "1a2b3c4d",
  "model_key": "gemma-3-27b",
  "state": "ready",                  // or "failed"
  "returncode": 0,
  "started_at": "2026-06-17T12:00:00+00:00",
  "finished_at": "2026-06-17T12:03:11+00:00",
  "dry_run": false
}
```

Headers: `X-Spark-Event: swap_complete`. If you set a **webhook secret**, the
body is signed: `X-Spark-Signature: sha256=<hmac>` (HMAC-SHA256 of the raw body
with the shared secret). Verify it like this:

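```python
import hashlib
import hmac

def signature_is_valid(secret: str, raw_body: bytes, signature: str) -> bool:
    # raw_body must be the exact bytes received, before any JSON parsing;
    # signature is the value of the X-Spark-Signature request header.
    expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)  # constant-time comparison
```
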
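Building on the signature check above, a receiver might combine verification and payload parsing in one framework-independent function. This is only a sketch: `handle_swap_event` is a hypothetical name, and the reply codes it returns (`401` for a bad signature, `400` for a malformed body) are this example's own choices, not something Spark Control requires. A plain `dict` is case-sensitive, so pass headers with the exact `X-Spark-Signature` spelling or use your framework's case-insensitive header map.

```python
import hashlib
import hmac
import json
from typing import Mapping, Optional, Tuple

KNOWN_EVENTS = ("swap_complete", "swap_failed")  # event names documented above


def handle_swap_event(headers: Mapping[str, str], raw_body: bytes,
                      secret: Optional[str]) -> Tuple[int, Optional[dict]]:
    """Return (HTTP status to reply with, parsed event or None)."""
    if secret:
        expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
        received = headers.get("X-Spark-Signature", "")
        # Compare as bytes so a non-ASCII header value cannot raise TypeError.
        if not hmac.compare_digest(expected.encode(), received.encode()):
            return 401, None
    try:
        event = json.loads(raw_body)
    except ValueError:  # covers JSONDecodeError and invalid UTF-8
        return 400, None
    if not isinstance(event, dict) or event.get("event") not in KNOWN_EVENTS:
        return 400, None
    return 200, event
```
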
Delivery is best-effort and fire-and-forget (5 s timeout, no retries), so a
webhook failure never affects the swap itself. Dry runs don't fire a webhook.

---

## 3. Schedule registry (read-only display)

So the dashboard can show *what's scheduled to touch the GPU and when*, your
schedulers register their jobs here. **Spark Control only displays these; it
never executes them.**

### `POST /api/schedule`: register / update

```json
// request (pass a stable `id` to update in place on re-register)
{ "id": "daily-vol", "name": "Daily Vol", "owner": "openclaw",
  "cron": "0 6 * * *", "next_run": "2026-06-18T06:00:00Z",
  "description": "Swaps to the big model, generates the vol report" }

// response: the stored entry (generates an id if you omit one)
```

`name` is required; `id` (if given) may contain only characters from
`[A-Za-z0-9_.-]` and must be at most 64 characters long.

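A scheduler may want to catch a bad entry before sending it. The sketch below checks the two documented rules on the client side; `build_schedule_entry` is a hypothetical helper, and treating an empty `id` as invalid is an assumption, since the page only states the character set and the length limit.

```python
import re
from typing import Optional

# Character set and 64-character limit as documented; rejecting "" is an assumption.
_ID_PATTERN = re.compile(r"[A-Za-z0-9_.-]{1,64}")


def build_schedule_entry(name: str, schedule_id: Optional[str] = None, **fields: str) -> dict:
    """Return a POST /api/schedule body, raising ValueError on a bad name or id."""
    if not name:
        raise ValueError("name is required")
    if schedule_id is not None and not _ID_PATTERN.fullmatch(schedule_id):
        raise ValueError(f"invalid schedule id: {schedule_id!r}")
    entry = {"name": name, **fields}  # extra fields: owner, cron, next_run, description
    if schedule_id is not None:
        entry["id"] = schedule_id
    return entry
```

For example, `build_schedule_entry("Daily Vol", "daily-vol", cron="0 6 * * *")` yields a body ready to serialise, while an id such as `"daily vol"` is rejected locally because of the space.
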
### `GET /api/schedule`: list

```json
{ "schedules": [ { "id": "daily-vol", "name": "Daily Vol", "owner": "openclaw",
                   "cron": "0 6 * * *", "next_run": "…", "description": "…",
                   "registered_at": "…", "updated_at": "…" } ] }
```

### `DELETE /api/schedule/{id}`: deregister

```json
{ "deleted": true }
```

The registry is in-memory: entries are lost when Spark Control restarts, so
re-register your schedules on your own startup (re-registering with a stable
`id` updates the entry in place).

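Because the registry forgets everything on a Spark Control restart, a scheduler could also reconcile periodically instead of only at its own startup. The following sketch is one way to do that under stated assumptions: `sync_schedules` and the `send` transport hook are hypothetical names, and every local entry is assumed to carry a stable `id`. It uses only the documented `GET /api/schedule` response shape and `POST /api/schedule` body.

```python
from typing import Callable, Iterable, List, Optional

# A transport takes (method, path, body) and returns the parsed JSON reply.
Send = Callable[[str, str, Optional[dict]], dict]


def sync_schedules(local: Iterable[dict], send: Send) -> List[str]:
    """POST every local entry whose id is missing from the registry; return those ids."""
    listed = send("GET", "/api/schedule", None).get("schedules", [])
    known = {entry.get("id") for entry in listed}
    restored = []
    for entry in local:
        if entry["id"] not in known:
            send("POST", "/api/schedule", entry)
            restored.append(entry["id"])
    return restored
```

This version only restores missing entries. To keep `next_run` fresh as well, POST each entry unconditionally, which the stable `id` turns into an in-place update.
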