1f359e3c79
A 12MP photo expands past vLLM's ~4096-image-token limit -> 400. Cap via --mm-processor-kwargs max_pixels in the qwen36 recipe so big images auto-downscale server-side for every /v1 consumer (verified live: 400->200). Remove the v0.27.2 in-dashboard vision-check button per owner request; the vision badge already signals capability.
118 lines
4.7 KiB
YAML
# spark-control launch recipes
#
# These are NOT the dashboard menu. The menu is whatever is actually downloaded
# on the Sparks; Spark Control scans the Hugging Face cache on each load and
# shows what it finds. These entries are launch *recipes*: matched to an on-disk
# model by `repo`, they say HOW to launch it. A downloaded model with no recipe
# here shows up as "needs setup", and the dashboard infers + saves one on first
# use (from the model's own config.json). Add a recipe to make a known model
# launch correctly the moment it's downloaded, with no setup prompt.
#
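# A minimal recipe skeleton, as a sketch: the key `my-model` and every value
# are placeholders (hypothetical), and the field names are simply the ones the
# recipes below use; which of them are strictly required is not stated here.
#   my-model:
#     display_name: "My Model"
#     repo: some-org/some-model   # placeholder; must match the on-disk repo
#     mode: solo                  # `solo` or `cluster`, as in the entries below
#     vllm_args:
#       - --max-model-len=32768
#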
# Each recipe produces this command on Spark 1:
#   cd ~/spark-vllm-docker
#   ./launch-cluster.sh [--solo] -d exec vllm serve <repo> \
#       --port=<defaults.port> --host=<defaults.host> <vllm_args...>
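#
# For example (a sketch, assuming cluster-mode recipes omit --solo and that
# vllm_args are appended in the order listed), the qwen3-vl recipe below would
# expand to roughly:
#   cd ~/spark-vllm-docker
#   ./launch-cluster.sh -d exec vllm serve RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
#       --port=8888 --host=0.0.0.0 --gpu-memory-utilization=0.7 -tp=2 \
#       --distributed-executor-backend=ray --max-model-len=32768 \
#       --max-num-batched-tokens=16384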

defaults:
  port: 8888
  host: 0.0.0.0

models:
  qwen3-vl:
    display_name: "Qwen3-VL 235B (vision)"
    description: >-
      Qwen's flagship multimodal model. 235B total parameters with ~22B
      active per token (Mixture-of-Experts). Handles text, images, and
      many languages. The most capable model in this catalog, and also the
      slowest to load because it splits across both Sparks.
    repo: RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4
    size_gb: 135
    mode: cluster
    capabilities: [vision, multilingual]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.7
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768
      - --max-num-batched-tokens=16384

  gemma4:
    display_name: "Gemma 4 31B"
    description: >-
      Google's mid-size reasoning model. 31B dense parameters with built-in
      thinking mode and function-calling. Strong on math, logic, and
      structured outputs; also supports vision input. Runs solo on one Spark.
    repo: RedHatAI/gemma-4-31B-it-NVFP4
    size_gb: 23
    mode: solo
    capabilities: [vision, reasoning, tools]
    expected_ready_seconds: 240
    vllm_args:
      - --gpu-memory-utilization=0.8
      - --max-model-len=32768
      - --max-num-batched-tokens=16384
      - --reasoning-parser=gemma4
      - --tool-call-parser=gemma4
      - --enable-auto-tool-choice
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8

  gemma4-26b:
    display_name: "Gemma 4 26B-A4B (vision, light)"
    description: >-
      Lighter, faster sibling of the Gemma 4 31B above: a Mixture-of-Experts
      model with 26B total parameters but only ~4B active per token, so it
      generates quickly. Takes images as well as text (good for tasks like
      reading a business card into structured text). Reasoning is a bit
      shallower than the dense 31B. Runs solo on one Spark.
    repo: nvidia/Gemma-4-26B-A4B-NVFP4
    size_gb: 17
    mode: solo
    capabilities: [vision, reasoning, tools]
    expected_ready_seconds: 240
    vllm_args:
      - --gpu-memory-utilization=0.8
      - --max-model-len=32768
      - --max-num-batched-tokens=16384
      - --reasoning-parser=gemma4
      - --tool-call-parser=gemma4
      - --enable-auto-tool-choice
      # MoE backend: research found this model's expert layers fall back to
      # 'marlin' on GB10 (the fast flashinfer_cutlass path errors on sm_121).
      # If a swap fails to start, this flag is the first thing to flip.
      - --moe_backend=marlin
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8

  qwen36:
    display_name: "Qwen3.6 35B-A3B (daily driver)"
    description: >-
      Qwen's latest fast Mixture-of-Experts model: 35B total parameters but
      only ~3B active per token, making inference quick. Long 64K-token
      context window. A good default for everyday chat and longer documents.
    repo: RedHatAI/Qwen3.6-35B-A3B-NVFP4
    size_gb: 20
    mode: solo
    # Qwen3.6-35B-A3B is natively multimodal (Qwen3_5MoeForConditionalGeneration,
    # vision tower ships in the checkpoint). Confirmed reading a business card
    # cleanly on this cluster; the vision badge on the live card signals this.
    capabilities: [vision, reasoning]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.85
      - --max-model-len=65536
      - --max-num-batched-tokens=16384
      - --reasoning-parser=qwen3
      - --moe_backend=flashinfer_cutlass
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8
      # Cap image resolution: a large phone photo (e.g. 12MP) otherwise expands
      # to ~11.8k vision tokens, blowing past vLLM's ~4096-image-token limit and
      # getting rejected with a 400. ~2MP auto-downscales big images server-side
      # (so every /v1 consumer is covered) while staying sharp enough for OCR.
      - '--mm-processor-kwargs={"max_pixels": 2000000}'
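      # Rough arithmetic behind the max_pixels cap above (a sketch: ~1024 pixels
      # per vision token is an assumption inferred from the ~11.8k figure, not a
      # documented constant):
      #   12,000,000 px / 1024 ≈ 11,700 tokens  (over the ~4096 limit -> 400)
      #    2,000,000 px / 1024 ≈  1,950 tokens  (under the limit)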