Eugene Rakhmatulin
b87854fd4c
Fixed qwen3.6 recipes
2026-05-06 10:56:09 -07:00
Eugene Rakhmatulin
c67c5b5c1e
Add chat template and recipe for Qwen3.6-35B-A3B-FP8 model
2026-05-06 10:32:46 -07:00
Eugene Rakhmatulin
97e51d5d23
fixed gemma4 recipe
2026-04-29 12:56:07 -07:00
Eugene Rakhmatulin
87cb9f6e1e
Reverted gemma4 to safetensors. Fixes #214 and #217.
2026-04-29 10:56:40 -07:00
eugr
e3243bf555
Merge pull request #197 from mmonad/minimax-m2.7-awq-recipe
Add recipe for MiniMax-M2.7-AWQ
2026-04-25 19:26:43 -07:00
Eugene Rakhmatulin
43a00ed90f
Fixed #205
2026-04-25 18:39:46 -07:00
L.B.R.
caa28c8e12
Add recipe for MiniMax-M2.7-AWQ
Add a vLLM serving recipe for the MiniMax M2.7 model using
the cyankiwi/MiniMax-M2.7-AWQ-4bit quantization. Uses the
same minimax_m2 tool-call and reasoning parsers as the
existing M2 recipe, with Ray distributed backend on 2 GPUs.
2026-04-18 22:44:26 +01:00
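The commit body above names the quantized checkpoint, the parsers, and the distributed backend; a minimal sketch of the corresponding `vllm serve` invocation might look like the following (hypothetical reconstruction from the description, not the recipe file itself; flag names are standard vLLM CLI options):

```shell
# Sketch of the serving setup described in the commit message:
# AWQ 4-bit checkpoint, minimax_m2 parsers, Ray backend across 2 GPUs.
vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2
```

The actual recipe YAML may set additional parameters (context length, KV cache dtype, memory utilization) not listed in the commit message.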
Eugene Rakhmatulin
d49fac1b8b
Re-enable flashinfer_cutlass
2026-04-16 16:40:56 -07:00
Eugene Rakhmatulin
1ad85442ac
Added a helper mod for Qwen3.5-397B recipe
2026-04-12 19:14:23 -07:00
Eugene Rakhmatulin
288da8e911
Mod to fix Gemma4 tool parser
2026-04-04 16:48:07 -07:00
Eugene Rakhmatulin
7bc4e4ce5e
Fixes #158 by adding build args to gemma4 recipe
2026-04-04 10:46:06 -07:00
Eugene Rakhmatulin
ed32612cdd
A recipe for Gemma4-26B
2026-04-02 23:53:55 -07:00
Eugene Rakhmatulin
12caec228e
switching gpt-oss-120b to solo only for now
2026-04-01 10:27:50 -07:00
Eugene Rakhmatulin
27eb35f08d
Fixed 4x qwen recipe
2026-04-01 10:09:01 -07:00
Eugene Rakhmatulin
044557943c
Bugfixes
2026-03-31 17:49:17 -07:00
Eugene Rakhmatulin
c1a6cec074
Updated documentation; default image tags in build script
2026-03-27 16:41:09 -07:00
eugr
47a896d722
Removed expert-parallel from 3x-node Qwen
2026-03-26 22:44:48 -07:00
Eugene Rakhmatulin
0fa585f909
Fix typo in pipeline_parallel setting in Qwen3.5-397B-INT4-Autoround recipe
2026-03-26 18:43:17 -07:00
Eugene Rakhmatulin
cecec74828
Add recipe for Qwen3.5-397B-INT4-Autoround in pipeline-parallel mode
2026-03-26 18:41:57 -07:00
Eugene Rakhmatulin
efacbd69f2
Updated Nemotron3-Super recipe
2026-03-25 12:43:12 -07:00
Eugene Rakhmatulin
9e089acf2b
Updated Nemotron recipes to use VLLM CUTLASS
2026-03-22 23:03:24 -07:00
Eugene Rakhmatulin
57b458570e
Added experimental Qwen3.5-397B support for dual Spark configuration
2026-03-17 19:05:36 -07:00
Eugene Rakhmatulin
b1eeefc0eb
Changed Nemotron-3-Nano-NVFP4 to Marlin backend
2026-03-17 13:10:48 -07:00
eugr
7c198b1ceb
Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)
2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin
8fec9bed06
Updated Nemotron to support dual sparks
2026-03-12 13:30:15 -07:00
Eugene Rakhmatulin
6f9a2f981c
Adjusted model parameters
2026-03-12 12:59:05 -07:00
remi
122edc8229
super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
2026-03-11 20:53:44 +01:00
Eugene Rakhmatulin
7ceea85647
Fixed qwen3-coder-next-int4-autoround to exclude Ray
2026-03-11 11:20:56 -07:00
Eugene Rakhmatulin
45066e2b16
Updated README
2026-03-11 09:57:34 -07:00
Eugene Rakhmatulin
f2cf11b047
Added a recipe for qwen3-coder-next-int4-autoround
2026-03-11 09:23:23 -07:00
sonusflow
3baca14eb1
Move recipe to 4x-spark-cluster/ and add UMA memory optimizations
- Move qwen3.5-397b-int4-autoround.yaml to recipes/4x-spark-cluster/
per maintainer request (multi-node recipes in separate directory)
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to recipe env
- Optimize Ray for GB10 UMA (128GB shared CPU/GPU memory):
- Disable Ray dashboard (saves ~1.2 GiB per node)
- Limit Ray object store to 1 GiB (default 30% of RAM = 33 GiB)
- Disable pre-started idle workers (saves ~8 GiB on head node)
- Set --num-cpus 2 and --disable-usage-stats on all nodes
- Net effect: ~40+ GiB freed across 4-node cluster for model/KV cache
2026-03-11 07:29:45 +00:00
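The Ray memory optimizations listed above map to standard `ray start` options; a minimal head-node sketch under those assumptions (exact values from the commit message; the recipe itself may wire these in differently):

```shell
# Sketch of the GB10 UMA-friendly Ray head-node startup described above.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

ray start --head \
  --include-dashboard=false \              # dashboard off: saves ~1.2 GiB/node
  --object-store-memory=1073741824 \       # cap object store at 1 GiB
  --num-cpus=2 \
  --disable-usage-stats
# (The commit also disables pre-started idle workers, ~8 GiB on the head node;
#  the exact knob for that is not given in the message, so it is omitted here.)
```

Worker nodes would run `ray start --address=<head-ip>:6379` with the same CPU and usage-stats flags.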
sonusflow
006734910c
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.
Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate
The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution:
ReplicatedLinear for B/A projections, applied via diff patches.
Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)
Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
2026-03-09 21:30:28 +00:00
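The TP arithmetic behind the Marlin fix can be illustrated with a toy check (a hypothetical helper, not code from the repo; `MIN_THREAD_N = 64` and the `in_proj_ba` numbers come from the commit message above):

```python
MIN_THREAD_N = 64  # Marlin kernel tile constraint cited in the commit


def marlin_shard_ok(output_size: int, tp: int) -> bool:
    """A column-parallel layer splits its output dim across tp ranks;
    Marlin requires each per-rank shard to be at least MIN_THREAD_N wide."""
    return output_size % tp == 0 and output_size // tp >= MIN_THREAD_N


# The in_proj_ba case: 128 / 4 = 32 < 64, so sharding breaks at TP=4...
print(marlin_shard_ok(128, tp=4))  # False
# ...while keeping the full 128-wide layer on every rank is fine,
# which is why the fix replicates the B/A projections (ReplicatedLinear).
print(marlin_shard_ok(128, tp=1))  # True
```

This is why only the small low-rank projections needed replication; wider layers (output size ≥ 256 at TP=4) shard normally.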
Eugene Rakhmatulin
d42c4199fa
Unsloth chat template for qwen3.5
2026-03-06 23:35:18 -08:00
Eugene Rakhmatulin
9dc09bd04b
Renamed recipe for qwen3.5-35b-a3b-fp8 to match others
2026-03-06 13:56:06 -08:00
eugr
d148d95a19
Merge pull request #80 from oliverjohnwilson/recipe-add_minimax-m2.5_qwen3.5-397b-a17B-fp8
added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory
2026-03-06 11:46:37 -08:00
eugr
3fabd3fb1c
Merge pull request #72 from erikvullings/main
Add Qwen35-35B-A3B recipe in FP8 format
2026-03-05 16:27:50 -08:00
Eugene Rakhmatulin
a749fcce87
Added a recipe for qwen3.5-122B-FP8
2026-03-04 16:49:39 -08:00
oliverjohnwilson
4303f8b6d0
added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory
2026-03-04 16:01:37 -06:00
Erik Vullings
163f23d85b
Update qwen35-35b-a3b-fp8.yaml
--max_num_batched_tokens is now a default variable, which can be overridden via the CLI
2026-03-03 12:46:12 +01:00
Eugene Rakhmatulin
7d8465fd9c
Added recipe for qwen3.5-122b-int4-autoround, updated README
2026-03-02 12:18:16 -08:00
Erik Vullings
e8f94d6b8b
Add Qwen35-35B-A3B recipe in FP8 format
2026-02-27 17:46:06 +01:00
Eugene Rakhmatulin
4c8f90395b
Changed reasoning parser in MiniMax for better compatibility with modern clients (like coding tools).
2026-02-21 11:53:13 -08:00
Eugene Rakhmatulin
5b2313dddb
Changed KV type to fp8 in qwen3-coder-next recipe and reduced default context size to 131072 to ensure it all fits in a single Spark.
2026-02-17 13:07:54 -08:00
Eugene Rakhmatulin
1e7f2d5640
Small fix for M2.5 recipe
2026-02-16 11:38:34 -08:00
Eugene Rakhmatulin
24f42be5cc
Added a recipe for MiniMax M2.5 AWQ
2026-02-16 11:35:53 -08:00
Eugene Rakhmatulin
701147b1eb
Qwen3-Coder-Next fixes and updated recipe
2026-02-12 15:56:32 -08:00
Eugene Rakhmatulin
c6b245cfe8
Added prefix caching to nemotron recipe
2026-02-10 18:25:01 -08:00
Eugene Rakhmatulin
74876dd442
Added recipes for nemotron-nano-3 and qwen3-coder-next
2026-02-09 14:33:35 -08:00
Raphael Amorim
6943a51ced
Adding tests and refactoring repeated methods
2026-02-09 17:21:32 -05:00
Raphael Amorim
b7c3cdcfcb
Enhancement: add -- pass-through for arbitrary vLLM arguments
Implements Unix-style pass-through allowing any vLLM argument to be
passed after `--` separator. Arguments are appended verbatim to the
generated vLLM command.
Examples:
./run-recipe.py model --solo -- --load-format safetensors
./run-recipe.py model --solo -- --served-model-name my-api
./run-recipe.py model --solo -- -cc.cudagraph_mode=PIECEWISE
Features:
- Uses parse_known_args() to capture arguments after --
- Warns when extra args duplicate CLI overrides (--port, --tp, etc.)
- Works in both solo and cluster modes
Adds 10 integration tests covering:
- --load-format, --served-model-name, equals syntax
- Multiple arguments, empty --, cluster mode
- Duplicate detection warnings for port/tp/gpu-mem
Closes #30
2026-02-08 02:36:49 -05:00
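The `--` pass-through mechanism described above can be sketched as follows (a simplified stand-alone illustration, not the repo's `run-recipe.py`; the flag set and command shape are assumptions for the example):

```python
import argparse


def build_vllm_cmd(argv):
    """Split argv at the Unix-style `--` separator: known flags are parsed
    normally, everything after `--` is appended verbatim to the vLLM command."""
    if "--" in argv:
        idx = argv.index("--")
        argv, extra = argv[:idx], argv[idx + 1:]
    else:
        extra = []

    parser = argparse.ArgumentParser()
    parser.add_argument("model")
    parser.add_argument("--solo", action="store_true")
    parser.add_argument("--port", type=int, default=8000)
    args, _unknown = parser.parse_known_args(argv)

    # Warn when pass-through args duplicate CLI overrides (as the commit does).
    for flag in ("--port", "--tp", "--gpu-mem"):
        if any(a == flag or a.startswith(flag + "=") for a in extra):
            print(f"warning: {flag} is also controlled by a CLI override")

    return ["vllm", "serve", args.model, "--port", str(args.port)] + extra


print(build_vllm_cmd(["model", "--solo", "--", "--load-format", "safetensors"]))
# → ['vllm', 'serve', 'model', '--port', '8000', '--load-format', 'safetensors']
```

Splitting at `--` before handing the remainder to `parse_known_args()` keeps the pass-through arguments completely opaque to the launcher, so any current or future vLLM flag works without the wrapper needing to know about it.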