324 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
2755b62d12 Fixes #108 2026-03-18 13:26:39 -07:00
Eugene Rakhmatulin
f327b92abe Fixes #106 and #108 2026-03-18 13:06:44 -07:00
Eugene Rakhmatulin
57b458570e Added experimental Qwen3.5-397B support for dual Spark configuration 2026-03-17 19:05:36 -07:00
Eugene Rakhmatulin
57ed099465 Updated README file to reflect new launch-cluster options. 2026-03-17 16:16:04 -07:00
Eugene Rakhmatulin
fb0687cd1b Updated README to describe no-ray mode 2026-03-17 15:27:22 -07:00
Eugene Rakhmatulin
ccea2ba861 Bugfixes 2026-03-17 13:54:42 -07:00
Eugene Rakhmatulin
957605498c Added extra passthrough variables to run-recipe 2026-03-17 13:41:40 -07:00
Eugene Rakhmatulin
b1eeefc0eb Changed Nemotron-3-Nano-NVFP4 to Marlin backend 2026-03-17 13:10:48 -07:00
Alan Pairmont
b879b7748f add network arg to common build flags 2026-03-16 12:09:59 -04:00
Eugene Rakhmatulin
fa645f3e4b bugfixes 2026-03-13 13:39:30 -07:00
Eugene Rakhmatulin
dedbd0a01d bugfixes 2026-03-13 12:41:48 -07:00
Eugene Rakhmatulin
caa83d9e5b Bugfixes 2026-03-13 12:32:43 -07:00
Eugene Rakhmatulin
4bcbbaa25a Bugfixes 2026-03-13 12:23:41 -07:00
Eugene Rakhmatulin
d08266a123 Bugfixes 2026-03-13 12:18:22 -07:00
Eugene Rakhmatulin
03b055d7f0 Major cluster orchestration refactoring to support running without Ray 2026-03-13 11:55:18 -07:00
Eugene Rakhmatulin
d609fecef3 Merge branch 'main' of github.com:eugr/spark-vllm-docker 2026-03-12 15:04:41 -07:00
eugr
7c198b1ceb Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)
2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin
8ae51192e5 Experimental mod to support gpu-memory-utilization-gb 2026-03-12 13:37:44 -07:00
Eugene Rakhmatulin
8fec9bed06 Updated Nemotron to support dual sparks 2026-03-12 13:30:15 -07:00
Eugene Rakhmatulin
6a323cc6f5 Merge pull request #93 2026-03-12 13:00:13 -07:00
Eugene Rakhmatulin
6f9a2f981c Adjusted model parameters 2026-03-12 12:59:05 -07:00
remi
122edc8229 super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 2026-03-11 20:53:44 +01:00
Eugene Rakhmatulin
7ceea85647 Fixed qwen3-coder-next-int4-autoround to exclude Ray 2026-03-11 11:20:56 -07:00
Eugene Rakhmatulin
45066e2b16 Updated README 2026-03-11 09:57:34 -07:00
Eugene Rakhmatulin
f2cf11b047 Added a recipe for qwen3-coder-next-int4-autoround 2026-03-11 09:23:23 -07:00
sonusflow
3baca14eb1 Move recipe to 4x-spark-cluster/ and add UMA memory optimizations
- Move qwen3.5-397b-int4-autoround.yaml to recipes/4x-spark-cluster/
  per maintainer request (multi-node recipes in separate directory)
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to recipe env
- Optimize Ray for GB10 UMA (128GB shared CPU/GPU memory):
  - Disable Ray dashboard (saves ~1.2 GiB per node)
  - Limit Ray object store to 1 GiB (default 30% of RAM = 33 GiB)
  - Disable pre-started idle workers (saves ~8 GiB on head node)
  - Set --num-cpus 2 and --disable-usage-stats on all nodes
- Net effect: ~40+ GiB freed across 4-node cluster for model/KV cache
2026-03-11 07:29:45 +00:00
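
The Ray settings listed in commit 3baca14eb1 could be assembled roughly as follows. This is a minimal sketch, not the repository's actual launch code: the helper name `ray_start_args` and its parameters are hypothetical. The flags are standard `ray start` options, and the dashboard flag is applied to the head node only as a simplifying assumption. The "pre-started idle workers" setting is left out because the commit message does not say how it is applied.

```python
from typing import List

GIB = 1024 ** 3  # bytes per GiB


def ray_start_args(head: bool,
                   head_address: str = "",
                   object_store_gib: int = 1,
                   num_cpus: int = 2) -> List[str]:
    """Build a `ray start` argv that mirrors the commit's memory-saving settings.

    Hypothetical helper for illustration only.
    """
    args = ["ray", "start"]
    if head:
        # Head node: start the cluster and skip the dashboard.
        args += ["--head", "--include-dashboard=false"]
    else:
        # Worker node: join an existing head at host:port.
        args += [f"--address={head_address}"]
    args += [
        # Cap the object store (value is in bytes) instead of taking the default share of RAM.
        f"--object-store-memory={object_store_gib * GIB}",
        f"--num-cpus={num_cpus}",
        "--disable-usage-stats",
    ]
    return args


if __name__ == "__main__":
    print(" ".join(ray_start_args(head=True)))
    print(" ".join(ray_start_args(head=False, head_address="10.0.0.1:6379")))
```

The address `10.0.0.1:6379` is a placeholder; substitute the head node's real address.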
Eugene Rakhmatulin
66b5c85907 Merge branch 'main' of github.com:eugr/spark-vllm-docker 2026-03-10 10:29:10 -07:00
eugr
0019bdf5ed Merge pull request #85 from saladinomario/feat/recipe-env-passthrough
Add -e/--env passthrough to run-recipe.py
2026-03-10 10:28:29 -07:00
sonusflow
006734910c Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate

The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution:
ReplicatedLinear for B/A projections, applied via diff patches.

Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)

Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
2026-03-09 21:30:28 +00:00
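
The shard-width arithmetic behind the Marlin TP fix in commit 006734910c can be checked with a short sketch. It models only the inequality quoted in the commit message (per-rank output width must be at least MIN_THREAD_N=64); the real kernel may impose further divisibility requirements. The function names are hypothetical and are not vLLM APIs.

```python
MIN_THREAD_N = 64  # value quoted in the commit message


def shard_width(output_size: int, tp_size: int) -> int:
    """Per-rank output width when a linear layer is split across tp_size ranks."""
    if output_size % tp_size != 0:
        raise ValueError("output_size must be divisible by tp_size")
    return output_size // tp_size


def can_shard_for_marlin(output_size: int, tp_size: int,
                         min_thread_n: int = MIN_THREAD_N) -> bool:
    """True if each rank's shard is at least min_thread_n wide (assumed constraint)."""
    return shard_width(output_size, tp_size) >= min_thread_n


if __name__ == "__main__":
    # in_proj_ba per the commit message: output_size=128 at TP=4 gives 32 per rank.
    for tp in (1, 2, 4):
        print(tp, shard_width(128, tp), can_shard_for_marlin(128, tp))
```

At TP=4 the shard is 32 wide, so the check fails. This is consistent with the commit's choice to replicate the B/A projections on every rank instead of sharding them.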
Eugene Rakhmatulin
e225c709fb Revert "fix: add temporary patch for CUDA graphs estimation" as it has been merged to main
This reverts commit 63b2a8dbed.
2026-03-09 09:46:50 -07:00
Eugene Rakhmatulin
63b2a8dbed fix: add temporary patch for CUDA graphs estimation 2026-03-08 22:43:41 -07:00
eugr
9724619dbd Merge pull request #87 from SeraphimSerapis/fix_wheels_download
fix: skip empty lines in wheel download read loop
2026-03-07 09:34:31 -08:00
Eugene Rakhmatulin
d42c4199fa Unsloth chat template for qwen3.5 staging-current-1772875976 2026-03-06 23:35:18 -08:00
Tim Messerschmidt
b9fc32ec34 fix: skip empty lines in wheel download read loop
Add a guard to skip empty lines (e.g. trailing newlines) in the
while-read loop to prevent try_download_wheels from breaking on
unexpected blank input.
2026-03-07 05:06:12 +01:00
Eugene Rakhmatulin
9dc09bd04b Renamed recipe for qwen3.5-35b-a3b-fp8 to match others 2026-03-06 13:56:06 -08:00
eugr
e88426646b Merge pull request #76 from mmonad/fix-exec-arg-quoting
Fix shell quoting for exec command arguments
2026-03-06 13:45:53 -08:00
mariosaladino
f95beba566 Add -e/--env passthrough to run-recipe.py
Fixes #81. Allows passing environment variables (e.g. HF_TOKEN)
through to the container when launching via recipes, mirroring
the existing -e flag in launch-cluster.sh.

Usage: ./run-recipe.sh glm-4.7-flash-awq --solo -e HF_TOKEN=$HF_TOKEN
2026-03-06 21:50:29 +01:00
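
A repeatable -e/--env option like the one described in commit f95beba566 could be wired up roughly as below. This is a sketch under assumptions, not the actual run-recipe.py code: `build_parser` and `docker_env_args` are hypothetical names, and the sketch accepts only the KEY=VALUE form shown in the usage line.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with a repeatable -e/--env option."""
    parser = argparse.ArgumentParser(prog="run-recipe-sketch")
    parser.add_argument("recipe", help="recipe name")
    parser.add_argument("--solo", action="store_true", help="single-node run")
    parser.add_argument("-e", "--env", action="append", default=[],
                        metavar="KEY=VALUE",
                        help="environment variable to pass to the container")
    return parser


def docker_env_args(env_pairs: List[str]) -> List[str]:
    """Turn ['K=V', ...] into ['-e', 'K=V', ...] for a container command line."""
    args: List[str] = []
    for pair in env_pairs:
        key, sep, _value = pair.partition("=")
        if not key or not sep:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        args += ["-e", pair]
    return args


if __name__ == "__main__":
    ns = build_parser().parse_args(
        ["glm-4.7-flash-awq", "--solo", "-e", "HF_TOKEN=example"])
    print(docker_env_args(ns.env))
```

Using `action="append"` lets the flag be given several times, so each variable is forwarded as its own `-e` pair.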
Olivier Paroz
eb8abcca7f Prevent 169.254.x.x fallback when setting fix IP address (#84)
* Prevent 169.254.x.x fallback when setting fix IP address

To force the use of the static IP address we've chosen to be assigned to the interface, it's safer to disable the fallback to avoid problems down the line
2026-03-06 11:47:47 -08:00
eugr
d148d95a19 Merge pull request #80 from oliverjohnwilson/recipe-add_minimax-m2.5_qwen3.5-397b-a17B-fp8
added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory
2026-03-06 11:46:37 -08:00
Eugene Rakhmatulin
5346372f14 More robust wheels check before download 2026-03-05 17:06:57 -08:00
Eugene Rakhmatulin
5f8f988d91 Merge branch 'main' of github.com:eugr/spark-vllm-docker 2026-03-05 16:29:00 -08:00
eugr
3fabd3fb1c Merge pull request #72 from erikvullings/main
Add Qwen35-35B-A3B recipe in FP8 format
2026-03-05 16:27:50 -08:00
Eugene Rakhmatulin
2d03bc138d saving flashinfer and vllm commits in wheels directories 2026-03-05 14:41:25 -08:00
Eugene Rakhmatulin
a749fcce87 Added a recipe for qwen3.5-122B-FP8 staging-current-1772696417 staging-current-1772696532 2026-03-04 16:49:39 -08:00
Eugene Rakhmatulin
505a060a7d vLLM prebuilt wheels support 2026-03-04 16:01:50 -08:00
Eugene Rakhmatulin
ca34ebcffc Merge branch 'main' into vllm-wheels 2026-03-04 15:59:16 -08:00
oliverjohnwilson
4303f8b6d0 added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory 2026-03-04 16:01:37 -06:00
Eugene Rakhmatulin
2152ef127d Now can use prebuilt vLLM wheels 2026-03-04 13:33:32 -08:00
Eugene Rakhmatulin
19f06a0d16 Fixed a bug with checking whether we need to download remote wheels staging-current-1772668424 staging-current-1772668553 2026-03-04 13:00:40 -08:00
Eugene Rakhmatulin
bbd7db2813 revert bumping up base image staging-current-1772642670 staging-current-1772642791 2026-03-04 07:29:53 -08:00