v0.25.0:0 - cluster coordination layer (swap lock + webhook + schedule registry)

GPU-arbiter safety layer for when automation, not just the dashboard, swaps models:

- swap reservation lock (POST/GET/DELETE /api/swap/lock); 423-enforced in post_swap via a single-read gate, TTL-bounded, secret-token auth, human force-release override + dashboard banner
- swap webhook (swap_complete/swap_failed) fired outside the swap lock, optional HMAC signature, configurable URL+secret
- read-only schedule registry (GET/POST/DELETE /api/schedule) + dashboard panel

New module image/app/coordination.py; docs/COORDINATION.md for consumers; 22 offline tests in test_coordination.py.
2026-06-18 07:07:08 -05:00
parent dd3d1412d4
commit 7ae6ab3ba8
15 changed files with 1026 additions and 15 deletions
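
For consumers of the swap webhook, the optional signature could be checked on the receiving side roughly as below. This is a minimal sketch, not the implementation in coordination.py. It assumes the "X-Spark-Signature" value is "sha256=" followed by the hex-encoded HMAC-SHA256 of the raw request body; the commit only says "HMAC of the body", so the hex encoding is an assumption to confirm against docs/COORDINATION.md. The helper names `signBody` and `verifySparkSignature` are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto'

// Assumption: signature = "sha256=" + hex(HMAC-SHA256(secret, raw body)).
// The commit only states "HMAC of the body"; verify the encoding in the docs.
function signBody(rawBody: string, secret: string): string {
  const digest = createHmac('sha256', secret).update(rawBody, 'utf8').digest('hex')
  return `sha256=${digest}`
}

// Hypothetical receiver-side check. Pass the raw body exactly as received,
// before any JSON parsing, because re-serialising can change the bytes.
function verifySparkSignature(
  rawBody: string,
  header: string | undefined,
  secret: string,
): boolean {
  if (!header) return false
  const expected = Buffer.from(signBody(rawBody, secret), 'utf8')
  const received = Buffer.from(header, 'utf8')
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (expected.length !== received.length) return false
  return timingSafeEqual(expected, received)
}
```

A receiver would call `verifySparkSignature(rawBody, req.headers['x-spark-signature'], secret)` and reject the event when it returns false. The constant-time comparison avoids leaking how many leading characters of a forged signature were correct.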
@@ -173,6 +173,24 @@ const inputSpec = InputSpec.of({
    placeholder: 'starts with "nvapi-..."',
    masked: true,
  }),
+  swap_webhook_url: Value.text({
+    name: 'Swap webhook URL (optional)',
+    description:
+      'If you run automation that needs to know when the loaded model changes, paste a URL here. Spark Control POSTs a small JSON event (swap_complete / swap_failed) to it after every model swap, so the consumer can re-point its config to the new model. Leave blank to disable. Only needed if something other than this dashboard cares about swaps.',
+    required: false,
+    default: null,
+    placeholder: 'e.g. https://my-service.local/spark-swap',
+    masked: false,
+  }),
+  swap_webhook_secret: Value.text({
+    name: 'Swap webhook secret (optional)',
+    description:
+      'Optional shared secret. If set, each webhook is signed with an "X-Spark-Signature: sha256=…" header (HMAC of the body) so the receiver can verify it really came from Spark Control. Leave blank to send the webhook unsigned.',
+    required: false,
+    default: null,
+    placeholder: 'a random string the receiver also knows',
+    masked: true,
+  }),
 })

 export const configureSparks = sdk.Action.withInput(
@@ -35,6 +35,11 @@ export const sparkConfigSchema = z.object({
  open_webui_url: z.string().catch(''),
  // Optional NGC API key for pulling NIM containers from nvcr.io/nim/...
  ngc_api_key: z.string().catch(''),
+  // Optional coordination webhook: POSTed on swap_complete/swap_failed so
+  // downstream consumers re-point their model config. Blank => disabled.
+  swap_webhook_url: z.string().catch(''),
+  // Optional shared secret; if set, the webhook body is HMAC-signed.
+  swap_webhook_secret: z.string().catch(''),
 })

 export type SparkConfig = z.infer<typeof sparkConfigSchema>
@@ -32,6 +32,8 @@ export const main = sdk.setupMain(async ({ effects }) => {
    matrix_bridge_user: '',
    open_webui_url: '',
    ngc_api_key: '',
+    swap_webhook_url: '',
+    swap_webhook_secret: '',
  }

  return sdk.Daemons.of(effects).addDaemon('primary', {
@@ -75,6 +77,8 @@ export const main = sdk.setupMain(async ({ effects }) => {
        CONNECTIVITY_LOG: '/data/connectivity.json',
        OPEN_WEBUI_URL: cfg.open_webui_url,
        NGC_API_KEY: cfg.ngc_api_key,
+        SWAP_WEBHOOK_URL: cfg.swap_webhook_url,
+        SWAP_WEBHOOK_SECRET: cfg.swap_webhook_secret,
        BIND_PORT: String(uiPort),
      },
    },
@@ -1,10 +1,10 @@
 import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'

 export const v0_1_0 = VersionInfo.of({
-  version: '0.24.0:0',
+  version: '0.25.0:0',
  releaseNotes: {
    en_US:
-      "v0.24.0:0 — configurable cluster topology. Spark Control no longer assumes our exact layout, so a cluster that's wired differently can be monitored without forking. Three new optional settings in Configure Sparks: (1) vLLM container name — defaults to \"vllm_node\"; set it if your swappable vLLM runs under a different container name (the swap log view and pre-flight validator exec into it by name). (2) Services to hide — a comma-separated list of built-in services your cluster doesn't run (parakeet, kokoro, embeddings, qdrant); hidden ones show no tile and are never probed, so e.g. a vLLM sharing Parakeet's default port 8000 no longer gets a confusing Parakeet probe. (3) Monitor a second vLLM — register a vLLM on another Spark as a custom service with kind \"vllm\" (in /data/services-overrides.yaml); it gets a read-only health tile (loaded model + container state + start/stop/restart) alongside the swappable one. API: /api/endpoints now reports a `disabled` flag per service.",
+      "v0.25.0:0 — cluster coordination layer (GPU arbiter). For clusters where automation, not just this dashboard, swaps models. Three additions: (1) Swap reservation lock — an external scheduler can reserve the GPU swap path (POST /api/swap/lock) and gets a secret token; while held, any swap without the token is refused (423), so the dashboard's manual swap is paused and shows who holds the GPU and until when (with a human Release override). The lock is TTL-bounded and self-frees. (2) Swap webhook — set a URL (and optional signing secret) in Configure Sparks; Spark Control POSTs a swap_complete / swap_failed event after each swap so downstream consumers re-point their model config. (3) Schedule registry — your automation can register its cron jobs (POST /api/schedule) for a read-only \"Scheduled jobs\" panel on the dashboard; Spark Control only displays them, it never runs them. New API: /api/swap/lock (GET/POST/DELETE), /api/schedule (GET/POST/DELETE). See docs/COORDINATION.md. Spark Control remains a control plane, not a job runner — business pipelines stay in their own services and call the swap API.",
  },
  migrations: {
    up: async ({ effects }) => {},