v0.26.0:0 - disk-driven model menu (scan sparks; recipes; needs-setup)
The dashboard menu is now the set of models actually downloaded on the Sparks, not a hard-coded catalog. models.yaml + overrides are reframed as launch recipes matched to an on-disk model by repo; an on-disk model with no recipe is flagged needs_setup and its launch settings are inferred from its config.json for a one-time operator confirmation (discovery.py).

- delete now removes weights AND the menu card (delete_from_disk sweeps all hosts; the delete endpoint resolves keys via the live menu)
- new GET /api/models/suggest; /api/models returns the menu + a recipes list (download autocomplete); GET /api/models/disk-status removed
- dropped the two legacy Qwen recipes (235B FP8, 2.5 72B)
- tests: +test_discovery.py (cache parsing, infer_recipe, build_menu merge)
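
The scan-then-merge flow described above can be sketched roughly as follows. This is an illustration only, not the code in discovery.py: the function names `repos_in_cache` and `merge_menu`, the card fields, and the `example-org/unknown-model` repo are all hypothetical. The one outside fact it relies on is that the Hugging Face hub cache stores each model in a folder named `models--<org>--<name>`.

```python
import tempfile
from pathlib import Path


def repos_in_cache(hub_dir: Path) -> list[str]:
    """List repo ids found in a Hugging Face hub cache directory.

    Each cached model lives in a folder named ``models--<org>--<name>``;
    this maps the folder name back to ``<org>/<name>``.
    """
    repos = []
    for entry in sorted(hub_dir.glob("models--*")):
        if entry.is_dir():
            repos.append(entry.name[len("models--"):].replace("--", "/"))
    return repos


def merge_menu(on_disk: list[str], recipes: dict[str, dict]) -> list[dict]:
    """Build one menu card per on-disk repo (hypothetical card shape).

    A recipe is matched to a model by its ``repo`` field. A model with no
    matching recipe still gets a card, flagged ``needs_setup``.
    """
    by_repo = {recipe["repo"]: (key, recipe) for key, recipe in recipes.items()}
    menu = []
    for repo in on_disk:
        if repo in by_repo:
            key, recipe = by_repo[repo]
            menu.append({"key": key, "repo": repo, "needs_setup": False, "recipe": recipe})
        else:
            menu.append({"key": None, "repo": repo, "needs_setup": True, "recipe": None})
    return menu


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        hub = Path(tmp)
        (hub / "models--nvidia--Gemma-4-26B-A4B-NVFP4").mkdir()
        (hub / "models--example-org--unknown-model").mkdir()  # no recipe for this one
        recipes = {"gemma4-26b": {"repo": "nvidia/Gemma-4-26B-A4B-NVFP4", "mode": "solo"}}
        for card in merge_menu(repos_in_cache(hub), recipes):
            print(card["repo"], "needs_setup" if card["needs_setup"] else "ready")
```

Driving the menu from the disk scan rather than from the recipe list means a recipe with no downloaded weights never produces a card, which is the behaviour the commit message describes.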
+37
-37
@@ -1,9 +1,14 @@
-# spark-control model catalog
+# spark-control launch recipes
 #
-# Edit this file (or override at runtime via the StartOS "Edit Model Catalog"
-# action) to add or change available models.
+# These are NOT the dashboard menu. The menu is whatever is actually downloaded
+# on the Sparks — Spark Control scans the Hugging Face cache on each load and
+# shows what it finds. These entries are launch *recipes*: matched to an on-disk
+# model by `repo`, they say HOW to launch it. A downloaded model with no recipe
+# here shows up as "needs setup", and the dashboard infers + saves one on first
+# use (from the model's own config.json). Add a recipe to make a known model
+# launch correctly the moment it's downloaded, with no setup prompt.
 #
-# Each model entry produces this command on Spark 1:
+# Each recipe produces this command on Spark 1:
 # cd ~/spark-vllm-docker
 # ./launch-cluster.sh [--solo] -d exec vllm serve <repo> \
 # --port=<defaults.port> --host=<defaults.host> <vllm_args...>
@@ -54,6 +59,34 @@ models:
       - --enable-prefix-caching
       - --kv-cache-dtype=fp8
 
+  gemma4-26b:
+    display_name: "Gemma 4 26B-A4B (vision, light)"
+    description: >-
+      Lighter, faster sibling of the Gemma 4 31B above: a Mixture-of-Experts
+      model with 26B total parameters but only ~4B active per token, so it
+      generates quickly. Takes images as well as text (good for tasks like
+      reading a business card into structured text). Reasoning is a bit
+      shallower than the dense 31B. Runs solo on one Spark.
+    repo: nvidia/Gemma-4-26B-A4B-NVFP4
+    size_gb: 17
+    mode: solo
+    capabilities: [vision, reasoning, tools]
+    expected_ready_seconds: 240
+    vllm_args:
+      - --gpu-memory-utilization=0.8
+      - --max-model-len=32768
+      - --max-num-batched-tokens=16384
+      - --reasoning-parser=gemma4
+      - --tool-call-parser=gemma4
+      - --enable-auto-tool-choice
+      # MoE backend: research found this model's expert layers fall back to
+      # 'marlin' on GB10 (the fast flashinfer_cutlass path errors on sm_121).
+      # If a swap fails to start, this flag is the first thing to flip.
+      - --moe_backend=marlin
+      - --load-format=fastsafetensors
+      - --enable-prefix-caching
+      - --kv-cache-dtype=fp8
+
   qwen36:
     display_name: "Qwen3.6 35B-A3B (daily driver)"
     description: >-
@@ -74,36 +107,3 @@ models:
       - --load-format=fastsafetensors
       - --enable-prefix-caching
       - --kv-cache-dtype=fp8
-
-  qwen3-235b-fp8:
-    display_name: "Qwen3 235B-A22B FP8 (legacy)"
-    description: >-
-      Earlier generation of the Qwen 235B family in native FP8 precision.
-      Runs across both Sparks. Mostly superseded by Qwen3-VL above; keep
-      around for text-only baseline comparisons.
-    repo: Qwen/Qwen3-235B-A22B-FP8
-    size_gb: 220
-    mode: cluster
-    capabilities: []
-    expected_ready_seconds: 360
-    vllm_args:
-      - --gpu-memory-utilization=0.7
-      - -tp=2
-      - --distributed-executor-backend=ray
-      - --max-model-len=32768
-
-  qwen25-72b:
-    display_name: "Qwen2.5 72B (legacy)"
-    description: >-
-      Last-generation 72B dense model. Cluster mode required due to size.
-      Kept for compatibility and baseline comparison against newer Qwens.
-    repo: Qwen/Qwen2.5-72B-Instruct
-    size_gb: 145
-    mode: cluster
-    capabilities: []
-    expected_ready_seconds: 360
-    vllm_args:
-      - --gpu-memory-utilization=0.7
-      - -tp=2
-      - --distributed-executor-backend=ray
-      - --max-model-len=32768
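
The header comment in the recipe file says each recipe expands to a `launch-cluster.sh` invocation. As a rough sketch of that expansion, and not the project's actual launcher code: the function name `render_launch_command` is hypothetical, the mapping of `mode: solo` to the `--solo` flag is an assumption read off the `[--solo]` placeholder, and the port and host values below are placeholders rather than the project's defaults.

```python
import shlex


def render_launch_command(recipe: dict, defaults: dict) -> str:
    """Render one recipe into the shell command shape from the file header.

    Assumption: ``mode: solo`` adds ``--solo``; any other mode omits it.
    """
    argv = ["./launch-cluster.sh"]
    if recipe.get("mode") == "solo":
        argv.append("--solo")
    argv += [
        "-d", "exec", "vllm", "serve", recipe["repo"],
        f"--port={defaults['port']}",
        f"--host={defaults['host']}",
    ]
    # Recipe-specific vLLM flags are appended verbatim, in file order.
    argv += recipe.get("vllm_args", [])
    # shlex.join quotes any argument that needs it for the shell.
    return "cd ~/spark-vllm-docker && " + shlex.join(argv)


if __name__ == "__main__":
    recipe = {
        "repo": "nvidia/Gemma-4-26B-A4B-NVFP4",
        "mode": "solo",
        "vllm_args": ["--gpu-memory-utilization=0.8", "--max-model-len=32768"],
    }
    defaults = {"port": 8000, "host": "0.0.0.0"}  # placeholder values
    print(render_launch_command(recipe, defaults))
```

Building the command as an argument list and joining it at the end keeps quoting in one place, so a recipe flag containing spaces or shell metacharacters cannot split into extra arguments.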