Eugene Rakhmatulin
2d749742e4
Changed base image back to base CUDA development one
2026-03-21 18:11:20 -07:00
Eugene Rakhmatulin
7a54657abf
Revert "cuda 13.2 torch"
This reverts commit 926dd57a87.
2026-03-21 15:36:17 -07:00
Eugene Rakhmatulin
926dd57a87
cuda 13.2 torch
2026-03-21 15:15:01 -07:00
Eugene Rakhmatulin
6e8d85c914
cleanup
2026-03-21 15:12:12 -07:00
Eugene Rakhmatulin
8385506c5e
Fixes
2026-03-20 23:51:21 -07:00
Eugene Rakhmatulin
8caebe3155
Reverting back to CUDA image + pytorch from wheels
2026-03-20 17:03:18 -07:00
Eugene Rakhmatulin
919a881cb1
Merge branch 'main' of gitlab.home.eugr.net:ai/spark-vllm
2026-03-18 22:03:25 -07:00
Eugene Rakhmatulin
8ddc259619
Fixed #111
2026-03-18 22:03:04 -07:00
eugr
22f3fa6c21
Merge pull request #103 from apairmont/network_arg
Add docker --network arg to common build flags
2026-03-18 21:48:48 -07:00
Eugene Rakhmatulin
15d295887c
Updated README to reflect --master-port parameter
2026-03-18 21:23:28 -07:00
Eugene Rakhmatulin
7e4150feed
Added master-port argument
2026-03-18 16:57:55 -07:00
eugr
7b752c31c5
Merge pull request #110 from voloszad/patch-1
Remove run-cluster-node.sh script copy and permission commands from Dockerfile.mxfp4
2026-03-18 14:54:11 -07:00
Andrej V.
bdd2b10f54
Remove script copy and permission commands from Dockerfile
Removed script copying and permission setting for run-cluster-node.sh.
2026-03-18 21:57:56 +01:00
Eugene Rakhmatulin
2755b62d12
Fixes #108
2026-03-18 13:26:39 -07:00
Eugene Rakhmatulin
f327b92abe
Fixes #106 and #108
2026-03-18 13:06:44 -07:00
Eugene Rakhmatulin
57b458570e
Added experimental Qwen3.5-397B support for dual Spark configuration
2026-03-17 19:05:36 -07:00
Eugene Rakhmatulin
57ed099465
Updated README file to reflect new launch-cluster options.
2026-03-17 16:16:04 -07:00
Eugene Rakhmatulin
fb0687cd1b
Updated README to describe no-ray mode
2026-03-17 15:27:22 -07:00
Eugene Rakhmatulin
ccea2ba861
Bugfixes
2026-03-17 13:54:42 -07:00
Eugene Rakhmatulin
957605498c
Added extra passthrough variables to run-recipe
2026-03-17 13:41:40 -07:00
Eugene Rakhmatulin
b1eeefc0eb
Changed Nemotron-3-Nano-NVFP4 to Marlin backend
2026-03-17 13:10:48 -07:00
Alan Pairmont
b879b7748f
add network arg to common build flags
2026-03-16 12:09:59 -04:00
Eugene Rakhmatulin
fa645f3e4b
bugfixes
2026-03-13 13:39:30 -07:00
Eugene Rakhmatulin
dedbd0a01d
bugfixes
2026-03-13 12:41:48 -07:00
Eugene Rakhmatulin
caa83d9e5b
Bugfixes
2026-03-13 12:32:43 -07:00
Eugene Rakhmatulin
4bcbbaa25a
Bugfixes
2026-03-13 12:23:41 -07:00
Eugene Rakhmatulin
d08266a123
Bugfixes
2026-03-13 12:18:22 -07:00
Eugene Rakhmatulin
03b055d7f0
Major cluster orchestration refactoring to support running without Ray
2026-03-13 11:55:18 -07:00
Eugene Rakhmatulin
d609fecef3
Merge branch 'main' of github.com:eugr/spark-vllm-docker
2026-03-12 15:04:41 -07:00
eugr
7c198b1ceb
Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)
2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin
8ae51192e5
Experimental mod to support gpu-memory-utilization-gb
2026-03-12 13:37:44 -07:00
Eugene Rakhmatulin
8fec9bed06
Updated Nemotron to support dual sparks
2026-03-12 13:30:15 -07:00
Eugene Rakhmatulin
6a323cc6f5
Merge pull request #93
2026-03-12 13:00:13 -07:00
Eugene Rakhmatulin
6f9a2f981c
Adjusted model parameters
2026-03-12 12:59:05 -07:00
remi
122edc8229
super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
2026-03-11 20:53:44 +01:00
Eugene Rakhmatulin
7ceea85647
Fixed qwen3-coder-next-int4-autoround to exclude Ray
2026-03-11 11:20:56 -07:00
Eugene Rakhmatulin
45066e2b16
Updated README
2026-03-11 09:57:34 -07:00
Eugene Rakhmatulin
f2cf11b047
Added a recipe for qwen3-coder-next-int4-autoround
2026-03-11 09:23:23 -07:00
sonusflow
3baca14eb1
Move recipe to 4x-spark-cluster/ and add UMA memory optimizations
- Move qwen3.5-397b-int4-autoround.yaml to recipes/4x-spark-cluster/
  per maintainer request (multi-node recipes in separate directory)
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to recipe env
- Optimize Ray for GB10 UMA (128GB shared CPU/GPU memory):
  - Disable Ray dashboard (saves ~1.2 GiB per node)
  - Limit Ray object store to 1 GiB (default 30% of RAM = 33 GiB)
  - Disable pre-started idle workers (saves ~8 GiB on head node)
  - Set --num-cpus 2 and --disable-usage-stats on all nodes
- Net effect: ~40+ GiB freed across 4-node cluster for model/KV cache
2026-03-11 07:29:45 +00:00
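Note on 3baca14eb1: the Ray settings listed in that commit can be pictured as a set of `ray start` arguments. The following is only a minimal sketch, not the repository's launch script. The helper name `ray_start_args` and its parameters are hypothetical, the dashboard flag is passed on the head node only as an assumption, and the step that disables pre-started idle workers is left out because the commit message does not say how it is done.

```python
GIB = 1024 ** 3  # bytes per GiB; `ray start --object-store-memory` takes bytes


def ray_start_args(is_head: bool, head_address: str = "") -> list[str]:
    """Build a `ray start` command line with the memory-saving options.

    Hypothetical helper for illustration; the real script may differ.
    """
    args = ["ray", "start"]
    if is_head:
        # Assumption: the dashboard is switched off on the head node.
        args += ["--head", "--include-dashboard=false"]
    else:
        # Worker nodes join an existing head at host:port.
        args += [f"--address={head_address}"]
    args += [
        f"--object-store-memory={1 * GIB}",  # cap the object store at 1 GiB
        "--num-cpus=2",
        "--disable-usage-stats",
    ]
    return args


if __name__ == "__main__":
    print(" ".join(ray_start_args(is_head=True)))
    print(" ".join(ray_start_args(is_head=False, head_address="10.0.0.1:6379")))
```

The address `10.0.0.1:6379` is a placeholder, not a value taken from the repository.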
Eugene Rakhmatulin
66b5c85907
Merge branch 'main' of github.com:eugr/spark-vllm-docker
2026-03-10 10:29:10 -07:00
eugr
0019bdf5ed
Merge pull request #85 from saladinomario/feat/recipe-env-passthrough
Add -e/--env passthrough to run-recipe.py
2026-03-10 10:28:29 -07:00
sonusflow
006734910c
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.
Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate
The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution:
ReplicatedLinear for B/A projections, applied via diff patches.
Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)
Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
2026-03-09 21:30:28 +00:00
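Note on 006734910c: the Marlin constraint described in that commit comes down to simple arithmetic on the per-rank shard width. The sketch below only reproduces that arithmetic. The function names are hypothetical, and reading the constraint as "shard width must be at least MIN_THREAD_N" is an assumption based on the commit message, not on the vLLM source.

```python
MIN_THREAD_N = 64  # constraint value quoted in the commit message


def shard_width(output_size: int, tp_size: int) -> int:
    """Output columns each tensor-parallel rank receives."""
    return output_size // tp_size


def marlin_shardable(output_size: int, tp_size: int,
                     min_thread_n: int = MIN_THREAD_N) -> bool:
    """True if the layer splits evenly and each shard meets the minimum width.

    Assumption: the limit is a lower bound on the shard width.
    """
    if output_size % tp_size != 0:
        return False
    return shard_width(output_size, tp_size) >= min_thread_n


if __name__ == "__main__":
    # in_proj_ba case from the commit: output_size=128 at TP=4
    print(shard_width(128, 4))       # 32
    print(marlin_shardable(128, 4))  # False
```

A layer that fails this check cannot be column-sharded under the constraint, which is why the commit replicates the B/A projections instead of splitting them.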
Eugene Rakhmatulin
e225c709fb
Revert "fix: add temporary patch for CUDA graphs estimation" as it has been merged to main
This reverts commit 63b2a8dbed.
2026-03-09 09:46:50 -07:00
Eugene Rakhmatulin
63b2a8dbed
fix: add temporary patch for CUDA graphs estimation
2026-03-08 22:43:41 -07:00
eugr
9724619dbd
Merge pull request #87 from SeraphimSerapis/fix_wheels_download
fix: skip empty lines in wheel download read loop
2026-03-07 09:34:31 -08:00
Eugene Rakhmatulin
d42c4199fa
Unsloth chat template for qwen3.5
staging-current-1772875976
2026-03-06 23:35:18 -08:00
Tim Messerschmidt
b9fc32ec34
fix: skip empty lines in wheel download read loop
Add a guard to skip empty lines (e.g. trailing newlines) in the
while-read loop to prevent try_download_wheels from breaking on
unexpected blank input.
2026-03-07 05:06:12 +01:00
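Note on b9fc32ec34: the actual fix lives in a shell while-read loop, which is not shown in this log. As a rough Python analogue of the same guard, the sketch below skips blank lines before handing each entry to the download step. The name `wheel_entries` and the sample file names are hypothetical.

```python
from typing import Iterable, Iterator


def wheel_entries(lines: Iterable[str]) -> Iterator[str]:
    """Yield non-empty entries, ignoring blank lines such as trailing newlines."""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Guard: a blank line must not reach the download step.
            continue
        yield line


if __name__ == "__main__":
    listing = ["torch.whl\n", "\n", "vllm.whl\n", ""]
    for entry in wheel_entries(listing):
        print(entry)
```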
Eugene Rakhmatulin
9dc09bd04b
Renamed recipe for qwen3.5-35b-a3b-fp8 to match others
2026-03-06 13:56:06 -08:00
eugr
e88426646b
Merge pull request #76 from mmonad/fix-exec-arg-quoting
Fix shell quoting for exec command arguments
2026-03-06 13:45:53 -08:00
mariosaladino
f95beba566
Add -e/--env passthrough to run-recipe.py
Fixes #81. Allows passing environment variables (e.g. HF_TOKEN)
through to the container when launching via recipes, mirroring
the existing -e flag in launch-cluster.sh.
Usage: ./run-recipe.sh glm-4.7-flash-awq --solo -e HF_TOKEN=$HF_TOKEN
2026-03-06 21:50:29 +01:00
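Note on f95beba566: a repeatable `-e/--env` option of this kind is commonly built with `argparse` and `action="append"`, then forwarded as `-e KEY=VALUE` pairs to the container runtime. The sketch below shows that pattern under stated assumptions. It is not the code from run-recipe.py, and the function names `parse_env_args` and `docker_env_flags` are hypothetical.

```python
import argparse


def parse_env_args(argv: list[str]) -> argparse.Namespace:
    """Parse a recipe name, --solo, and any number of -e KEY=VALUE options."""
    parser = argparse.ArgumentParser(prog="run-recipe")
    parser.add_argument("recipe")
    parser.add_argument("--solo", action="store_true")
    # action="append" collects every -e occurrence into one list.
    parser.add_argument("-e", "--env", action="append", default=[],
                        metavar="KEY=VALUE")
    return parser.parse_args(argv)


def docker_env_flags(env_items: list[str]) -> list[str]:
    """Turn ["K=V", ...] into ["-e", "K=V", ...], rejecting malformed items."""
    flags: list[str] = []
    for item in env_items:
        key, sep, _value = item.partition("=")
        if not key or not sep:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        flags += ["-e", item]
    return flags


if __name__ == "__main__":
    args = parse_env_args(["glm-4.7-flash-awq", "--solo", "-e", "HF_TOKEN=abc"])
    print(docker_env_flags(args.env))
```

Validating the `KEY=VALUE` shape up front keeps a malformed option from being passed on to the container command unnoticed.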