Eugene Rakhmatulin
b87854fd4c
Fixed qwen3.6 recipes
2026-05-06 10:56:09 -07:00
Eugene Rakhmatulin
c67c5b5c1e
Add chat template and recipe for Qwen3.6-35B-A3B-FP8 model
2026-05-06 10:32:46 -07:00
Eugene Rakhmatulin
97e51d5d23
fixed gemma4 recipe
2026-04-29 12:56:07 -07:00
Eugene Rakhmatulin
87cb9f6e1e
Reverted gemma4 to safetensors. Fixes #214 and #217.
2026-04-29 10:56:40 -07:00
eugr
e3243bf555
Merge pull request #197 from mmonad/minimax-m2.7-awq-recipe
Add recipe for MiniMax-M2.7-AWQ
2026-04-25 19:26:43 -07:00
Eugene Rakhmatulin
43a00ed90f
Fixed #205
2026-04-25 18:39:46 -07:00
L.B.R.
caa28c8e12
Add recipe for MiniMax-M2.7-AWQ
Add a vLLM serving recipe for the MiniMax M2.7 model using
the cyankiwi/MiniMax-M2.7-AWQ-4bit quantization. Uses the
same minimax_m2 tool-call and reasoning parsers as the
existing M2 recipe, with Ray distributed backend on 2 GPUs.
2026-04-18 22:44:26 +01:00
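The commit body above names the quantized checkpoint, the parsers, and the distributed backend; a minimal sketch of the corresponding `vllm serve` invocation might look like the following (hypothetical reconstruction from the description, not the recipe file itself; flag names are standard vLLM CLI options):

```shell
# Sketch of the serving setup described in the commit message:
# AWQ 4-bit checkpoint, minimax_m2 parsers, Ray backend across 2 GPUs.
vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2
```

The actual recipe YAML may set additional parameters (context length, KV cache dtype, memory utilization) not listed in the commit message.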
Eugene Rakhmatulin
d49fac1b8b
Re-enable flashinfer_cutlass
2026-04-16 16:40:56 -07:00
Eugene Rakhmatulin
1ad85442ac
Added a helper mod for Qwen3.5-397B recipe
2026-04-12 19:14:23 -07:00
Eugene Rakhmatulin
288da8e911
Mod to fix Gemma4 tool parser
2026-04-04 16:48:07 -07:00
Eugene Rakhmatulin
7bc4e4ce5e
Fixes #158 by adding build args to gemma4 recipe
2026-04-04 10:46:06 -07:00
Eugene Rakhmatulin
ed32612cdd
A recipe for Gemma4-26B
2026-04-02 23:53:55 -07:00
Eugene Rakhmatulin
12caec228e
switching gpt-oss-120b to solo only for now
2026-04-01 10:27:50 -07:00
Eugene Rakhmatulin
27eb35f08d
Fixed 4x qwen recipe
2026-04-01 10:09:01 -07:00
Eugene Rakhmatulin
044557943c
Bugfixes
2026-03-31 17:49:17 -07:00
Eugene Rakhmatulin
c1a6cec074
Updated documentation; default image tags in build script
2026-03-27 16:41:09 -07:00
eugr
47a896d722
Removed expert-parallel from 3x-node Qwen
2026-03-26 22:44:48 -07:00
Eugene Rakhmatulin
0fa585f909
Fix typo in pipeline_parallel setting in Qwen3.5-397B-INT4-Autoround recipe
2026-03-26 18:43:17 -07:00
Eugene Rakhmatulin
cecec74828
Add recipe for Qwen3.5-397B-INT4-Autoround in pipeline-parallel mode
2026-03-26 18:41:57 -07:00
Eugene Rakhmatulin
efacbd69f2
Updated Nemotron3-Super recipe
2026-03-25 12:43:12 -07:00
Eugene Rakhmatulin
9e089acf2b
Updated Nemotron recipes to use VLLM CUTLASS
2026-03-22 23:03:24 -07:00
Eugene Rakhmatulin
57b458570e
Added experimental Qwen3.5-397B support for dual Spark configuration
2026-03-17 19:05:36 -07:00
Eugene Rakhmatulin
b1eeefc0eb
Changed Nemotron-3-Nano-NVFP4 to Marlin backend
2026-03-17 13:10:48 -07:00
eugr
7c198b1ceb
Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)
2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin
8fec9bed06
Updated Nemotron to support dual sparks
2026-03-12 13:30:15 -07:00
Eugene Rakhmatulin
6f9a2f981c
Adjusted model parameters
2026-03-12 12:59:05 -07:00
remi
122edc8229
super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
2026-03-11 20:53:44 +01:00
Eugene Rakhmatulin
7ceea85647
Fixed qwen3-coder-next-int4-autoround to exclude Ray
2026-03-11 11:20:56 -07:00
Eugene Rakhmatulin
45066e2b16
Updated README
2026-03-11 09:57:34 -07:00
Eugene Rakhmatulin
f2cf11b047
Added a recipe for qwen3-coder-next-int4-autoround
2026-03-11 09:23:23 -07:00
sonusflow
3baca14eb1
Move recipe to 4x-spark-cluster/ and add UMA memory optimizations
- Move qwen3.5-397b-int4-autoround.yaml to recipes/4x-spark-cluster/
per maintainer request (multi-node recipes in separate directory)
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to recipe env
- Optimize Ray for GB10 UMA (128GB shared CPU/GPU memory):
- Disable Ray dashboard (saves ~1.2 GiB per node)
- Limit Ray object store to 1 GiB (default 30% of RAM = 33 GiB)
- Disable pre-started idle workers (saves ~8 GiB on head node)
- Set --num-cpus 2 and --disable-usage-stats on all nodes
- Net effect: ~40+ GiB freed across 4-node cluster for model/KV cache
2026-03-11 07:29:45 +00:00
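The Ray memory optimizations listed above map to standard `ray start` options; a minimal head-node sketch under those assumptions (exact values from the commit message; the recipe itself may wire these in differently):

```shell
# Sketch of the GB10 UMA-friendly Ray head-node startup described above.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

ray start --head \
  --include-dashboard=false \              # dashboard off: saves ~1.2 GiB/node
  --object-store-memory=1073741824 \       # cap object store at 1 GiB
  --num-cpus=2 \
  --disable-usage-stats
# (The commit also disables pre-started idle workers, ~8 GiB on the head node;
#  the exact knob for that is not given in the message, so it is omitted here.)
```

Worker nodes would run `ray start --address=<head-ip>:6379` with the same CPU and usage-stats flags.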
sonusflow
006734910c
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.
Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate
The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution:
ReplicatedLinear for B/A projections, applied via diff patches.
Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)
Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
2026-03-09 21:30:28 +00:00
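The TP arithmetic behind the Marlin fix can be illustrated with a toy check (a hypothetical helper, not code from the repo; `MIN_THREAD_N = 64` and the `in_proj_ba` numbers come from the commit message above):

```python
MIN_THREAD_N = 64  # Marlin kernel tile constraint cited in the commit


def marlin_shard_ok(output_size: int, tp: int) -> bool:
    """A column-parallel layer splits its output dim across tp ranks;
    Marlin requires each per-rank shard to be at least MIN_THREAD_N wide."""
    return output_size % tp == 0 and output_size // tp >= MIN_THREAD_N


# The in_proj_ba case: 128 / 4 = 32 < 64, so sharding breaks at TP=4...
print(marlin_shard_ok(128, tp=4))  # False
# ...while keeping the full 128-wide layer on every rank is fine,
# which is why the fix replicates the B/A projections (ReplicatedLinear).
print(marlin_shard_ok(128, tp=1))  # True
```

This is why only the small low-rank projections needed replication; wider layers (output size ≥ 256 at TP=4) shard normally.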
Eugene Rakhmatulin
d42c4199fa
Unsloth chat template for qwen3.5
2026-03-06 23:35:18 -08:00
Eugene Rakhmatulin
9dc09bd04b
Renamed recipe for qwen3.5-35b-a3b-fp8 to match others
2026-03-06 13:56:06 -08:00
eugr
d148d95a19
Merge pull request #80 from oliverjohnwilson/recipe-add_minimax-m2.5_qwen3.5-397b-a17B-fp8
added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory
2026-03-06 11:46:37 -08:00
eugr
3fabd3fb1c
Merge pull request #72 from erikvullings/main
Add Qwen35-35B-A3B recipe in FP8 format
2026-03-05 16:27:50 -08:00
Eugene Rakhmatulin
a749fcce87
Added a recipe for qwen3.5-122B-FP8
2026-03-04 16:49:39 -08:00
oliverjohnwilson
4303f8b6d0
added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory
2026-03-04 16:01:37 -06:00
Erik Vullings
163f23d85b
Update qwen35-35b-a3b-fp8.yaml
--max_num_batched_tokens is now a default variable, which can be overridden via the CLI
2026-03-03 12:46:12 +01:00
Eugene Rakhmatulin
7d8465fd9c
Added recipe for qwen3.5-122b-int4-autoround, updated README
2026-03-02 12:18:16 -08:00
Erik Vullings
e8f94d6b8b
Add Qwen35-35B-A3B recipe in FP8 format
2026-02-27 17:46:06 +01:00
Eugene Rakhmatulin
4c8f90395b
Changed reasoning parser in MiniMax for better compatibility with modern clients (like coding tools).
2026-02-21 11:53:13 -08:00
Eugene Rakhmatulin
5b2313dddb
Changed KV type to fp8 in qwen3-coder-next recipe and reduced default context size to 131072 to ensure it all fits in a single Spark.
2026-02-17 13:07:54 -08:00
Eugene Rakhmatulin
1e7f2d5640
Small fix for M2.5 recipe
2026-02-16 11:38:34 -08:00
Eugene Rakhmatulin
24f42be5cc
Added a recipe for MiniMax M2.5 AWQ
2026-02-16 11:35:53 -08:00
Eugene Rakhmatulin
701147b1eb
Qwen3-Coder-Next fixes and updated recipe
2026-02-12 15:56:32 -08:00
Eugene Rakhmatulin
c6b245cfe8
Added prefix caching to nemotron recipe
2026-02-10 18:25:01 -08:00
Eugene Rakhmatulin
74876dd442
Added recipes for nemotron-nano-3 and qwen3-coder-next
2026-02-09 14:33:35 -08:00
Raphael Amorim
6943a51ced
Adding tests and refactoring repeated methods
2026-02-09 17:21:32 -05:00
Raphael Amorim
b7c3cdcfcb
Enhancement: add -- pass-through for arbitrary vLLM arguments
Implements Unix-style pass-through allowing any vLLM argument to be
passed after `--` separator. Arguments are appended verbatim to the
generated vLLM command.
Examples:
./run-recipe.py model --solo -- --load-format safetensors
./run-recipe.py model --solo -- --served-model-name my-api
./run-recipe.py model --solo -- -cc.cudagraph_mode=PIECEWISE
Features:
- Uses parse_known_args() to capture arguments after --
- Warns when extra args duplicate CLI overrides (--port, --tp, etc.)
- Works in both solo and cluster modes
Adds 10 integration tests covering:
- --load-format, --served-model-name, equals syntax
- Multiple arguments, empty --, cluster mode
- Duplicate detection warnings for port/tp/gpu-mem
Closes #30
2026-02-08 02:36:49 -05:00
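The `--` pass-through mechanism described above can be sketched as follows (a simplified stand-alone illustration, not the repo's `run-recipe.py`; the flag set and command shape are assumptions for the example):

```python
import argparse


def build_vllm_cmd(argv):
    """Split argv at the Unix-style `--` separator: known flags are parsed
    normally, everything after `--` is appended verbatim to the vLLM command."""
    if "--" in argv:
        idx = argv.index("--")
        argv, extra = argv[:idx], argv[idx + 1:]
    else:
        extra = []

    parser = argparse.ArgumentParser()
    parser.add_argument("model")
    parser.add_argument("--solo", action="store_true")
    parser.add_argument("--port", type=int, default=8000)
    args, _unknown = parser.parse_known_args(argv)

    # Warn when pass-through args duplicate CLI overrides (as the commit does).
    for flag in ("--port", "--tp", "--gpu-mem"):
        if any(a == flag or a.startswith(flag + "=") for a in extra):
            print(f"warning: {flag} is also controlled by a CLI override")

    return ["vllm", "serve", args.model, "--port", str(args.port)] + extra


print(build_vllm_cmd(["model", "--solo", "--", "--load-format", "safetensors"]))
# → ['vllm', 'serve', 'model', '--port', '8000', '--load-format', 'safetensors']
```

Splitting at `--` before handing the remainder to `parse_known_args()` keeps the pass-through arguments completely opaque to the launcher, so any current or future vLLM flag works without the wrapper needing to know about it.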