spark-vllm-docker

Author	SHA1	Message	Date
Eugene Rakhmatulin	2755b62d12	Fixes #108	2026-03-18 13:26:39 -07:00
Eugene Rakhmatulin	f327b92abe	Fixes #106 and #108	2026-03-18 13:06:44 -07:00
Eugene Rakhmatulin	57b458570e	Added experimental Qwen3.5-397B support for dual Spark configuration	2026-03-17 19:05:36 -07:00
Eugene Rakhmatulin	57ed099465	Updated README file to reflect new launch-cluster options.	2026-03-17 16:16:04 -07:00
Eugene Rakhmatulin	fb0687cd1b	Updated README to describe no-ray mode	2026-03-17 15:27:22 -07:00
Eugene Rakhmatulin	ccea2ba861	Bugfixes	2026-03-17 13:54:42 -07:00
Eugene Rakhmatulin	957605498c	Added extra passthrough variables to run-recipe	2026-03-17 13:41:40 -07:00
Eugene Rakhmatulin	b1eeefc0eb	Changed Nemotron-3-Nano-NVFP4 to Marlin backend	2026-03-17 13:10:48 -07:00
Alan Pairmont	b879b7748f	add network arg to common build flags	2026-03-16 12:09:59 -04:00
Eugene Rakhmatulin	fa645f3e4b	bugfixes	2026-03-13 13:39:30 -07:00
Eugene Rakhmatulin	dedbd0a01d	bugfixes	2026-03-13 12:41:48 -07:00
Eugene Rakhmatulin	caa83d9e5b	Bugfixes	2026-03-13 12:32:43 -07:00
Eugene Rakhmatulin	4bcbbaa25a	Bugfixes	2026-03-13 12:23:41 -07:00
Eugene Rakhmatulin	d08266a123	Bugfixes	2026-03-13 12:18:22 -07:00
Eugene Rakhmatulin	03b055d7f0	Major cluster orchestration refactoring to support running without Ray	2026-03-13 11:55:18 -07:00
Eugene Rakhmatulin	d609fecef3	Merge branch 'main' of github.com:eugr/spark-vllm-docker	2026-03-12 15:04:41 -07:00
eugr	7c198b1ceb	Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4 Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)	2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin	8ae51192e5	Experimental mod to support gpu-memory-utilization-gb	2026-03-12 13:37:44 -07:00
Eugene Rakhmatulin	8fec9bed06	Updated Nemotron to support dual sparks	2026-03-12 13:30:15 -07:00
Eugene Rakhmatulin	6a323cc6f5	Merge pull request #93	2026-03-12 13:00:13 -07:00
Eugene Rakhmatulin	6f9a2f981c	Adjusted model parameters	2026-03-12 12:59:05 -07:00
remi	122edc8229	super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4	2026-03-11 20:53:44 +01:00
Eugene Rakhmatulin	7ceea85647	Fixed qwen3-coder-next-int4-autoround to exclude Ray	2026-03-11 11:20:56 -07:00
Eugene Rakhmatulin	45066e2b16	Updated README	2026-03-11 09:57:34 -07:00
Eugene Rakhmatulin	f2cf11b047	Added a recipe for qwen3-coder-next-int4-autoround	2026-03-11 09:23:23 -07:00
sonusflow	3baca14eb1	Move recipe to 4x-spark-cluster/ and add UMA memory optimizations - Move qwen3.5-397b-int4-autoround.yaml to recipes/4x-spark-cluster/ per maintainer request (multi-node recipes in separate directory) - Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to recipe env - Optimize Ray for GB10 UMA (128GB shared CPU/GPU memory): - Disable Ray dashboard (saves ~1.2 GiB per node) - Limit Ray object store to 1 GiB (default 30% of RAM = 33 GiB) - Disable pre-started idle workers (saves ~8 GiB on head node) - Set --num-cpus 2 and --disable-usage-stats on all nodes - Net effect: ~40+ GiB freed across 4-node cluster for model/KV cache	2026-03-11 07:29:45 +00:00
Eugene Rakhmatulin	66b5c85907	Merge branch 'main' of github.com:eugr/spark-vllm-docker	2026-03-10 10:29:10 -07:00
eugr	0019bdf5ed	Merge pull request #85 from saladinomario/feat/recipe-env-passthrough Add -e/--env passthrough to run-recipe.py	2026-03-10 10:28:29 -07:00
sonusflow	006734910c	Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound quantization across 4 DGX Spark nodes using tensor parallelism. Performance (4× DGX Spark, driver 580.126.09): - Single user: 37 tok/s - 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution: ReplicatedLinear for B/A projections, applied via diff patches. Key config: - VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness) - KV cache FP8, prefix caching enabled - gpu_memory_utilization 0.78 (UMA safe margin) - CUDAGraphs enabled (default, requires driver 580.x) Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory. Stay on driver 580.126.09.	2026-03-09 21:30:28 +00:00
Eugene Rakhmatulin	e225c709fb	Revert "fix: add temporary patch for CUDA graphs estimation" as it has been merged to main This reverts commit `63b2a8dbed`.	2026-03-09 09:46:50 -07:00
Eugene Rakhmatulin	63b2a8dbed	fix: add temporary patch for CUDA graphs estimation	2026-03-08 22:43:41 -07:00
eugr	9724619dbd	Merge pull request #87 from SeraphimSerapis/fix_wheels_download fix: skip empty lines in wheel download read loop	2026-03-07 09:34:31 -08:00
Eugene Rakhmatulin	d42c4199fa	Unsloth chat template for qwen3.5 [staging-current-1772875976]	2026-03-06 23:35:18 -08:00
Tim Messerschmidt	b9fc32ec34	fix: skip empty lines in wheel download read loop Add a guard to skip empty lines (e.g. trailing newlines) in the while-read loop to prevent try_download_wheels from breaking on unexpected blank input.	2026-03-07 05:06:12 +01:00
Eugene Rakhmatulin	9dc09bd04b	Renamed recipe for qwen3.5-35b-a3b-fp8 to match others	2026-03-06 13:56:06 -08:00
eugr	e88426646b	Merge pull request #76 from mmonad/fix-exec-arg-quoting Fix shell quoting for exec command arguments	2026-03-06 13:45:53 -08:00
mariosaladino	f95beba566	Add -e/--env passthrough to run-recipe.py Fixes #81. Allows passing environment variables (e.g. HF_TOKEN) through to the container when launching via recipes, mirroring the existing -e flag in launch-cluster.sh. Usage: ./run-recipe.sh glm-4.7-flash-awq --solo -e HF_TOKEN=$HF_TOKEN	2026-03-06 21:50:29 +01:00
Olivier Paroz	eb8abcca7f	Prevent 169.254.x.x fallback when setting fix IP address (#84) * Prevent 169.254.x.x fallback when setting fix IP address To force the use of the IP we've chosen to be assigned to the interface, it's safer to disable the fallback to avoid problems down the line * Prevent 169.254.x.x fallback when setting fix IP address To force the use of the static IP address we've chosen to be assigned to the interface, it's safer to disable the fallback to avoid problems down the line	2026-03-06 11:47:47 -08:00
eugr	d148d95a19	Merge pull request #80 from oliverjohnwilson/recipe-add_minimax-m2.5_qwen3.5-397b-a17B-fp8 added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory	2026-03-06 11:46:37 -08:00
Eugene Rakhmatulin	5346372f14	More robust wheels check before download	2026-03-05 17:06:57 -08:00
Eugene Rakhmatulin	5f8f988d91	Merge branch 'main' of github.com:eugr/spark-vllm-docker	2026-03-05 16:29:00 -08:00
eugr	3fabd3fb1c	Merge pull request #72 from erikvullings/main Add Qwen35-35B-A3B recipe in FP8 format	2026-03-05 16:27:50 -08:00
Eugene Rakhmatulin	2d03bc138d	saving flashinfer and vllm commits in wheels directories	2026-03-05 14:41:25 -08:00
Eugene Rakhmatulin	a749fcce87	Added a recipe for qwen3.5-122B-FP8 [staging-current-1772696417] [staging-current-1772696532]	2026-03-04 16:49:39 -08:00
Eugene Rakhmatulin	505a060a7d	vLLM prebuilt wheels support	2026-03-04 16:01:50 -08:00
Eugene Rakhmatulin	ca34ebcffc	Merge branch 'main' into vllm-wheels	2026-03-04 15:59:16 -08:00
oliverjohnwilson	4303f8b6d0	added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory	2026-03-04 16:01:37 -06:00
Eugene Rakhmatulin	2152ef127d	Now can use prebuilt vLLM wheels	2026-03-04 13:33:32 -08:00
Eugene Rakhmatulin	19f06a0d16	Fixed a bug with checking whether we need to download remote wheels [staging-current-1772668424] [staging-current-1772668553]	2026-03-04 13:00:40 -08:00
Eugene Rakhmatulin	bbd7db2813	revert bumping up base image [staging-current-1772642670] [staging-current-1772642791]	2026-03-04 07:29:53 -08:00

324 Commits