v0.26.0:0 - disk-driven model menu (scan sparks; recipes; needs-setup)

The dashboard menu is now the set of models actually downloaded on the Sparks, not a hard-coded catalog. models.yaml + overrides are reframed as launch recipes matched to an on-disk model by repo; an on-disk model with no recipe is flagged needs_setup and its launch settings are inferred from its config.json for a one-time operator confirmation (discovery.py).

- delete now removes weights AND the menu card (delete_from_disk sweeps all hosts; the delete endpoint resolves keys via the live menu)
- new GET /api/models/suggest; /api/models returns the menu + a recipes list (download autocomplete); GET /api/models/disk-status removed
- dropped the two legacy Qwen recipes (235B FP8, 2.5 72B)
- tests: +test_discovery.py (cache parsing, infer_recipe, build_menu merge)
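A minimal sketch of the merge described above, for orientation only. It is not the code in discovery.py: the function names `repo_from_cache_dir` and `build_menu_sketch`, the menu entry fields other than `needs_setup`, and the `acme/unknown-model` repo are all hypothetical. The only outside fact assumed is the Hugging Face hub cache convention of naming model folders `models--<org>--<name>`.

```python
from typing import Any, Dict, List, Optional


def repo_from_cache_dir(dirname: str) -> Optional[str]:
    """Map a hub cache folder name to a repo id; None for non-model folders.

    'models--org--name' becomes 'org/name'. Only the first '--' after the
    prefix is treated as the org/name separator.
    """
    prefix = "models--"
    if not dirname.startswith(prefix):
        return None
    return dirname[len(prefix):].replace("--", "/", 1)


def build_menu_sketch(
    on_disk_repos: List[str],
    recipes: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Build one menu entry per on-disk repo (hypothetical entry shape).

    A repo with a matching recipe is launchable as-is; a repo without one
    is flagged needs_setup so its settings can be inferred and confirmed.
    """
    # Index recipes by repo, since matching is by repo and not by recipe key.
    by_repo = {recipe["repo"]: (key, recipe) for key, recipe in recipes.items()}

    menu = []
    for repo in on_disk_repos:
        if repo in by_repo:
            key, recipe = by_repo[repo]
            menu.append(
                {"key": key, "repo": repo, "needs_setup": False, "recipe": recipe}
            )
        else:
            menu.append(
                {"key": None, "repo": repo, "needs_setup": True, "recipe": None}
            )
    return menu


# Example: one known model, one unknown model, one non-model cache folder.
recipes = {"gemma4-26b": {"repo": "nvidia/Gemma-4-26B-A4B-NVFP4", "mode": "solo"}}
cache_dirs = [
    "models--nvidia--Gemma-4-26B-A4B-NVFP4",
    "models--acme--unknown-model",  # hypothetical repo with no recipe
    "datasets--foo--bar",           # ignored: not a model folder
]
repos = [r for r in map(repo_from_cache_dir, cache_dirs) if r is not None]
menu = build_menu_sketch(repos, recipes)
```

Recipes that have no on-disk model never reach the menu in this sketch, which matches the commit's statement that the menu is driven by what is downloaded; per the commit message, /api/models exposes them separately as a recipes list.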
2026-06-18 11:09:56 -05:00
parent c0b35184ba
commit df9f244eae
14 changed files with 795 additions and 238 deletions
@@ -1,9 +1,14 @@
-# spark-control model catalog
+# spark-control launch recipes
 #
-# Edit this file (or override at runtime via the StartOS "Edit Model Catalog"
-# action) to add or change available models.
+# These are NOT the dashboard menu. The menu is whatever is actually downloaded
+# on the Sparks — Spark Control scans the Hugging Face cache on each load and
+# shows what it finds. These entries are launch *recipes*: matched to an on-disk
+# model by `repo`, they say HOW to launch it. A downloaded model with no recipe
+# here shows up as "needs setup", and the dashboard infers + saves one on first
+# use (from the model's own config.json). Add a recipe to make a known model
+# launch correctly the moment it's downloaded, with no setup prompt.
 #
-# Each model entry produces this command on Spark 1:
+# Each recipe produces this command on Spark 1:
 #   cd ~/spark-vllm-docker
 #   ./launch-cluster.sh [--solo] -d exec vllm serve <repo> \
 #     --port=<defaults.port> --host=<defaults.host> <vllm_args...>
@@ -54,6 +59,34 @@ models:
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8

+  gemma4-26b:
+    display_name: "Gemma 4 26B-A4B (vision, light)"
+    description: >-
+      Lighter, faster sibling of the Gemma 4 31B above: a Mixture-of-Experts
+      model with 26B total parameters but only ~4B active per token, so it
+      generates quickly. Takes images as well as text (good for tasks like
+      reading a business card into structured text). Reasoning is a bit
+      shallower than the dense 31B. Runs solo on one Spark.
+    repo: nvidia/Gemma-4-26B-A4B-NVFP4
+    size_gb: 17
+    mode: solo
+    capabilities: [vision, reasoning, tools]
+    expected_ready_seconds: 240
+    vllm_args:
+      - --gpu-memory-utilization=0.8
+      - --max-model-len=32768
+      - --max-num-batched-tokens=16384
+      - --reasoning-parser=gemma4
+      - --tool-call-parser=gemma4
+      - --enable-auto-tool-choice
+      # MoE backend: research found this model's expert layers fall back to
+      # 'marlin' on GB10 (the fast flashinfer_cutlass path errors on sm_121).
+      # If a swap fails to start, this flag is the first thing to flip.
+      - --moe_backend=marlin
+      - --load-format=fastsafetensors
+      - --enable-prefix-caching
+      - --kv-cache-dtype=fp8
+
  qwen36:
    display_name: "Qwen3.6 35B-A3B (daily driver)"
    description: >-
@@ -74,36 +107,3 @@ models:
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8
-
-  qwen3-235b-fp8:
-    display_name: "Qwen3 235B-A22B FP8 (legacy)"
-    description: >-
-      Earlier generation of the Qwen 235B family in native FP8 precision.
-      Runs across both Sparks. Mostly superseded by Qwen3-VL above; keep
-      around for text-only baseline comparisons.
-    repo: Qwen/Qwen3-235B-A22B-FP8
-    size_gb: 220
-    mode: cluster
-    capabilities: []
-    expected_ready_seconds: 360
-    vllm_args:
-      - --gpu-memory-utilization=0.7
-      - -tp=2
-      - --distributed-executor-backend=ray
-      - --max-model-len=32768
-
-  qwen25-72b:
-    display_name: "Qwen2.5 72B (legacy)"
-    description: >-
-      Last-generation 72B dense model. Cluster mode required due to size.
-      Kept for compatibility and baseline comparison against newer Qwens.
-    repo: Qwen/Qwen2.5-72B-Instruct
-    size_gb: 145
-    mode: cluster
-    capabilities: []
-    expected_ready_seconds: 360
-    vllm_args:
-      - --gpu-memory-utilization=0.7
-      - -tp=2
-      - --distributed-executor-backend=ray
-      - --max-model-len=32768
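
The header comment in the models.yaml hunk above gives the command template each recipe produces. A rough sketch of that expansion follows, under stated assumptions: `build_launch_argv` is a hypothetical helper, the `defaults` values (port 8000, host 0.0.0.0) are placeholders rather than values from this repository, and tying `--solo` to `mode: solo` is an inference from the `[--solo]` in the template, not something the diff confirms.

```python
import shlex
from typing import Any, Dict, List


def build_launch_argv(recipe: Dict[str, Any], defaults: Dict[str, Any]) -> List[str]:
    """Expand a recipe into the argv for launch-cluster.sh (hypothetical helper).

    Follows the template in the models.yaml header comment:
      ./launch-cluster.sh [--solo] -d exec vllm serve <repo> \
        --port=<defaults.port> --host=<defaults.host> <vllm_args...>
    """
    argv = ["./launch-cluster.sh"]
    # Assumption: the optional --solo flag corresponds to `mode: solo`.
    if recipe.get("mode") == "solo":
        argv.append("--solo")
    argv += [
        "-d", "exec", "vllm", "serve", recipe["repo"],
        f"--port={defaults['port']}",
        f"--host={defaults['host']}",
    ]
    # Recipe-specific flags are appended verbatim, in file order.
    argv += list(recipe.get("vllm_args", []))
    return argv


# Placeholder defaults; the real values live in models.yaml under `defaults`.
defaults = {"port": 8000, "host": "0.0.0.0"}
recipe = {
    "repo": "nvidia/Gemma-4-26B-A4B-NVFP4",
    "mode": "solo",
    "vllm_args": ["--gpu-memory-utilization=0.8", "--max-model-len=32768"],
}
argv = build_launch_argv(recipe, defaults)
command = shlex.join(argv)  # shell-safe string, e.g. for logging
```

Building an argv list and joining it with `shlex.join` only for display avoids quoting bugs if a recipe argument ever contains spaces or shell metacharacters.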