spark-control/image/models.yaml
Keysat df9f244eae v0.26.0:0 - disk-driven model menu (scan sparks; recipes; needs-setup)
The dashboard menu is now the set of models actually downloaded on the
Sparks, not a hard-coded catalog. models.yaml + overrides are reframed as
launch recipes matched to an on-disk model by repo; an on-disk model with
no recipe is flagged needs_setup and its launch settings are inferred from
its config.json for a one-time operator confirmation (discovery.py).

- delete now removes weights AND the menu card (delete_from_disk sweeps all
  hosts; the delete endpoint resolves keys via the live menu)
- new GET /api/models/suggest; /api/models returns the menu + a recipes list
  (download autocomplete); GET /api/models/disk-status removed
- dropped the two legacy Qwen recipes (235B FP8, 2.5 72B)
- tests: +test_discovery.py (cache parsing, infer_recipe, build_menu merge)
2026-06-18 11:09:56 -05:00

# spark-control launch recipes
#
# These are NOT the dashboard menu. The menu is whatever is actually downloaded
# on the Sparks — Spark Control scans the Hugging Face cache on each load and
# shows what it finds. These entries are launch *recipes*: matched to an on-disk
# model by `repo`, they say HOW to launch it. A downloaded model with no recipe
# here shows up as "needs setup", and the dashboard infers + saves one on first
# use (from the model's own config.json). Add a recipe to make a known model
# launch correctly the moment it's downloaded, with no setup prompt.
#
# Each recipe produces this command on Spark 1:
#   cd ~/spark-vllm-docker
#   ./launch-cluster.sh [--solo] -d exec vllm serve <repo> \
#       --port=<defaults.port> --host=<defaults.host> <vllm_args...>
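#
# Worked example (a sketch, not output captured from the tool). Assuming
# `--solo` is passed when a recipe has `mode: solo`, and that vllm_args are
# appended in the order listed, the gemma4 recipe below would expand to:
#   cd ~/spark-vllm-docker
#   ./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
#       --port=8888 --host=0.0.0.0 \
#       --gpu-memory-utilization=0.8 --max-model-len=32768 \
#       --max-num-batched-tokens=16384 --reasoning-parser=gemma4 \
#       --tool-call-parser=gemma4 --enable-auto-tool-choice \
#       --load-format=fastsafetensors --enable-prefix-caching \
#       --kv-cache-dtype=fp8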
defaults:
  port: 8888
  host: 0.0.0.0
models:
  qwen3-vl:
    display_name: "Qwen3-VL 235B (vision)"
    description: >-
      Qwen's flagship multimodal model. 235B total parameters with ~22B
      active per token (Mixture-of-Experts). Handles text, images, and
      many languages. The most capable model in this catalog, and also the
      slowest to load because it splits across both Sparks.
    repo: RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4
    size_gb: 135
    mode: cluster
    capabilities: [vision, multilingual]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.7
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768
      - --max-num-batched-tokens=16384
  gemma4:
    display_name: "Gemma 4 31B"
    description: >-
      Google's mid-size reasoning model. 31B dense parameters with built-in
      thinking mode and function-calling. Strong on math, logic, and
      structured outputs; also supports vision input. Runs solo on one Spark.
    repo: RedHatAI/gemma-4-31B-it-NVFP4
    size_gb: 23
    mode: solo
    capabilities: [vision, reasoning, tools]
    expected_ready_seconds: 240
    vllm_args:
      - --gpu-memory-utilization=0.8
      - --max-model-len=32768
      - --max-num-batched-tokens=16384
      - --reasoning-parser=gemma4
      - --tool-call-parser=gemma4
      - --enable-auto-tool-choice
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8
  gemma4-26b:
    display_name: "Gemma 4 26B-A4B (vision, light)"
    description: >-
      Lighter, faster sibling of the Gemma 4 31B above: a Mixture-of-Experts
      model with 26B total parameters but only ~4B active per token, so it
      generates quickly. Takes images as well as text (good for tasks like
      reading a business card into structured text). Reasoning is a bit
      shallower than the dense 31B. Runs solo on one Spark.
    repo: nvidia/Gemma-4-26B-A4B-NVFP4
    size_gb: 17
    mode: solo
    capabilities: [vision, reasoning, tools]
    expected_ready_seconds: 240
    vllm_args:
      - --gpu-memory-utilization=0.8
      - --max-model-len=32768
      - --max-num-batched-tokens=16384
      - --reasoning-parser=gemma4
      - --tool-call-parser=gemma4
      - --enable-auto-tool-choice
      # MoE backend: research found this model's expert layers fall back to
      # 'marlin' on GB10 (the fast flashinfer_cutlass path errors on sm_121).
      # If a swap fails to start, this flag is the first thing to flip.
      - --moe_backend=marlin
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8
  qwen36:
    display_name: "Qwen3.6 35B-A3B (daily driver)"
    description: >-
      Qwen's latest fast Mixture-of-Experts model: 35B total parameters but
      only ~3B active per token, making inference quick. Long 64K-token
      context window. A good default for everyday chat and longer documents.
    repo: RedHatAI/Qwen3.6-35B-A3B-NVFP4
    size_gb: 20
    mode: solo
    capabilities: [reasoning]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.85
      - --max-model-len=65536
      - --max-num-batched-tokens=16384
      - --reasoning-parser=qwen3
      - --moe_backend=flashinfer_cutlass
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8
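  # Template for adding a recipe, following the shape of the entries above.
  # Illustrative only: the key `my-model`, the repo `example-org/my-model-7B`,
  # and every value are hypothetical placeholders. Which keys are required is
  # an assumption here, so check the loader before relying on it.
  # my-model:
  #   display_name: "My Model 7B"
  #   description: >-
  #     One or two sentences shown on the dashboard card.
  #   repo: example-org/my-model-7B   # matched against the on-disk model's repo
  #   size_gb: 8
  #   mode: solo                      # solo = one Spark; cluster = both Sparks
  #   capabilities: [reasoning]
  #   expected_ready_seconds: 240
  #   vllm_args:
  #     - --gpu-memory-utilization=0.8
  #     - --max-model-len=32768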