# spark-control launch recipes
#
# These are NOT the dashboard menu. The menu is whatever is actually downloaded
# on the Sparks — Spark Control scans the Hugging Face cache on each load and
# shows what it finds. These entries are launch *recipes*: matched to an on-disk
# model by `repo`, they say HOW to launch it. A downloaded model with no recipe
# here shows up as "needs setup", and the dashboard infers + saves one on first
# use (from the model's own config.json). Add a recipe to make a known model
# launch correctly the moment it's downloaded, with no setup prompt.
#
# Each recipe produces this command on Spark 1:
#   cd ~/spark-vllm-docker
#   ./launch-cluster.sh [--solo] -d exec vllm serve \
#       --port= --host=

defaults:
  port: 8888
  host: 0.0.0.0

models:
  qwen3-vl:
    display_name: "Qwen3-VL 235B (vision)"
    description: >-
      Qwen's flagship multimodal model. 235B total parameters with ~22B active
      per token (Mixture-of-Experts). Handles text, images, and many languages.
      The most capable model in this catalog — also the slowest to load because
      it splits across both Sparks.
    repo: RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4
    size_gb: 135
    mode: cluster
    capabilities: [vision, multilingual]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.7
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768
      - --max-num-batched-tokens=16384

  gemma4:
    display_name: "Gemma 4 31B"
    description: >-
      Google's mid-size reasoning model. 31B dense parameters with built-in
      thinking mode and function-calling. Strong on math, logic, and structured
      outputs; also supports vision input. Runs solo on one Spark.
    repo: RedHatAI/gemma-4-31B-it-NVFP4
    size_gb: 23
    mode: solo
    capabilities: [vision, reasoning, tools]
    expected_ready_seconds: 240
    vllm_args:
      - --gpu-memory-utilization=0.8
      - --max-model-len=32768
      - --max-num-batched-tokens=16384
      - --reasoning-parser=gemma4
      - --tool-call-parser=gemma4
      - --enable-auto-tool-choice
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8

  gemma4-26b:
    display_name: "Gemma 4 26B-A4B (vision, light)"
    description: >-
      Lighter, faster sibling of the Gemma 4 31B above: a Mixture-of-Experts
      model with 26B total parameters but only ~4B active per token, so it
      generates quickly. Takes images as well as text (good for tasks like
      reading a business card into structured text). Reasoning is a bit
      shallower than the dense 31B. Runs solo on one Spark.
    repo: nvidia/Gemma-4-26B-A4B-NVFP4
    size_gb: 17
    mode: solo
    capabilities: [vision, reasoning, tools]
    expected_ready_seconds: 240
    vllm_args:
      - --gpu-memory-utilization=0.8
      - --max-model-len=32768
      - --max-num-batched-tokens=16384
      - --reasoning-parser=gemma4
      - --tool-call-parser=gemma4
      - --enable-auto-tool-choice
      # MoE backend: research found this model's expert layers fall back to
      # 'marlin' on GB10 (the fast flashinfer_cutlass path errors on sm_121).
      # If a swap fails to start, this flag is the first thing to flip.
      - --moe_backend=marlin
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8

  qwen36:
    display_name: "Qwen3.6 35B-A3B (daily driver)"
    description: >-
      Qwen's latest fast Mixture-of-Experts model: 35B total parameters but
      only ~3B active per token, making inference quick. Long 64K-token context
      window. A good default for everyday chat and longer documents.
    repo: RedHatAI/Qwen3.6-35B-A3B-NVFP4
    size_gb: 20
    mode: solo
    # Qwen3.6-35B-A3B is natively multimodal (Qwen3_5MoeForConditionalGeneration,
    # vision tower ships in the checkpoint). Confirmed reading a business card
    # cleanly on this cluster — use the "Vision check" button on the live card.
    capabilities: [vision, reasoning]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.85
      - --max-model-len=65536
      - --max-num-batched-tokens=16384
      - --reasoning-parser=qwen3
      - --moe_backend=flashinfer_cutlass
      - --load-format=fastsafetensors
      - --enable-prefix-caching
      - --kv-cache-dtype=fp8
      # Cap image resolution: a large phone photo (e.g. 12MP) otherwise expands
      # to ~11.8k vision tokens, blowing past vLLM's ~4096-image-token limit and
      # getting rejected with a 400. ~2MP auto-downscales big images server-side
      # (so every /v1 consumer is covered) while staying sharp enough for OCR.
      - '--mm-processor-kwargs={"max_pixels": 2000000}'
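
# Recipe template (an illustrative sketch, not a real model). The field names
# mirror the recipes in this file; every value below is a hypothetical
# placeholder, including the recipe key and the repo. To add a recipe, copy
# the block under `models:`, strip the leading "# " from each line, and
# replace the values with ones that fit the downloaded model.
#
#   example-model:                      # hypothetical recipe key
#     display_name: "Example Model"     # placeholder
#     description: >-
#       Short description of the model.
#     repo: example-org/example-model   # placeholder; must match the on-disk repo
#     size_gb: 0                        # placeholder
#     mode: solo                        # `solo` or `cluster`, as in the recipes in this file
#     capabilities: []                  # e.g. vision, reasoning, tools, multilingual
#     expected_ready_seconds: 300       # placeholder
#     vllm_args:
#       - --gpu-memory-utilization=0.8  # placeholder; tune per model
#       - --max-model-len=32768         # placeholder; tune per model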