Commit Graph

387 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
7a54657abf Revert "cuda 13.2 torch"
This reverts commit 926dd57a87.
2026-03-21 15:36:17 -07:00
Eugene Rakhmatulin
926dd57a87 cuda 13.2 torch 2026-03-21 15:15:01 -07:00
Eugene Rakhmatulin
6e8d85c914 cleanup 2026-03-21 15:12:12 -07:00
Drew Botwinick
d6e76f8e2f add build metadata generation and include in Dockerfiles 2026-03-21 16:10:04 -05:00
Eugene Rakhmatulin
8385506c5e Fixes 2026-03-20 23:51:21 -07:00
Eugene Rakhmatulin
8caebe3155 Reverting back to CUDA image + pytorch from wheels 2026-03-20 17:03:18 -07:00
Eugene Rakhmatulin
919a881cb1 Merge branch 'main' of gitlab.home.eugr.net:ai/spark-vllm 2026-03-18 22:03:25 -07:00
Eugene Rakhmatulin
8ddc259619 Fixed #111 2026-03-18 22:03:04 -07:00
eugr
22f3fa6c21 Merge pull request #103 from apairmont/network_arg
Add docker --network arg to common build flags
2026-03-18 21:48:48 -07:00
Eugene Rakhmatulin
15d295887c Updated README to reflect --master-port parameter 2026-03-18 21:23:28 -07:00
Eugene Rakhmatulin
7e4150feed Added master-port argument 2026-03-18 16:57:55 -07:00
eugr
7b752c31c5 Merge pull request #110 from voloszad/patch-1
Remove run-cluster-node.sh script copy and permission commands from Dockerfile.mxfp4
2026-03-18 14:54:11 -07:00
Andrej V.
bdd2b10f54 Remove script copy and permission commands from Dockerfile
Removed script copying and permission setting for run-cluster-node.sh.
2026-03-18 21:57:56 +01:00
Eugene Rakhmatulin
2755b62d12 Fixes #108 2026-03-18 13:26:39 -07:00
Eugene Rakhmatulin
f327b92abe Fixes #106 and #108 2026-03-18 13:06:44 -07:00
Eugene Rakhmatulin
57b458570e Added experimental Qwen3.5-397B support for dual Spark configuration 2026-03-17 19:05:36 -07:00
Eugene Rakhmatulin
57ed099465 Updated README file to reflect new launch-cluster options. 2026-03-17 16:16:04 -07:00
Eugene Rakhmatulin
fb0687cd1b Updated README to describe no-ray mode 2026-03-17 15:27:22 -07:00
Eugene Rakhmatulin
ccea2ba861 Bugfixes 2026-03-17 13:54:42 -07:00
Eugene Rakhmatulin
957605498c Added extra passthrough variables to run-recipe 2026-03-17 13:41:40 -07:00
Eugene Rakhmatulin
b1eeefc0eb Changed Nemotron-3-Nano-NVFP4 to Marlin backend 2026-03-17 13:10:48 -07:00
Alan Pairmont
b879b7748f add network arg to common build flags 2026-03-16 12:09:59 -04:00
Eugene Rakhmatulin
fa645f3e4b bugfixes 2026-03-13 13:39:30 -07:00
Eugene Rakhmatulin
dedbd0a01d bugfixes 2026-03-13 12:41:48 -07:00
Eugene Rakhmatulin
caa83d9e5b Bugfixes 2026-03-13 12:32:43 -07:00
Eugene Rakhmatulin
4bcbbaa25a Bugfixes 2026-03-13 12:23:41 -07:00
Eugene Rakhmatulin
d08266a123 Bugfixes 2026-03-13 12:18:22 -07:00
Eugene Rakhmatulin
03b055d7f0 Major cluster orchestration refactoring to support running without Ray 2026-03-13 11:55:18 -07:00
Eugene Rakhmatulin
d609fecef3 Merge branch 'main' of github.com:eugr/spark-vllm-docker 2026-03-12 15:04:41 -07:00
eugr
7c198b1ceb Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)
2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin
8ae51192e5 Experimental mod to support gpu-memory-utilization-gb 2026-03-12 13:37:44 -07:00
Eugene Rakhmatulin
8fec9bed06 Updated Nemotron to support dual sparks 2026-03-12 13:30:15 -07:00
Eugene Rakhmatulin
6a323cc6f5 Merge pull request #93 2026-03-12 13:00:13 -07:00
Eugene Rakhmatulin
6f9a2f981c Adjusted model parameters 2026-03-12 12:59:05 -07:00
remi
122edc8229 super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 2026-03-11 20:53:44 +01:00
Eugene Rakhmatulin
7ceea85647 Fixed qwen3-coder-next-int4-autoround to exclude Ray 2026-03-11 11:20:56 -07:00
Eugene Rakhmatulin
45066e2b16 Updated README 2026-03-11 09:57:34 -07:00
Eugene Rakhmatulin
f2cf11b047 Added a recipe for qwen3-coder-next-int4-autoround 2026-03-11 09:23:23 -07:00
sonusflow
3baca14eb1 Move recipe to 4x-spark-cluster/ and add UMA memory optimizations
- Move qwen3.5-397b-int4-autoround.yaml to recipes/4x-spark-cluster/
  per maintainer request (multi-node recipes in separate directory)
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to recipe env
- Optimize Ray for GB10 UMA (128GB shared CPU/GPU memory):
  - Disable Ray dashboard (saves ~1.2 GiB per node)
  - Limit Ray object store to 1 GiB (default 30% of RAM = 33 GiB)
  - Disable pre-started idle workers (saves ~8 GiB on head node)
  - Set --num-cpus 2 and --disable-usage-stats on all nodes
- Net effect: ~40+ GiB freed across 4-node cluster for model/KV cache
2026-03-11 07:29:45 +00:00
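The Ray options named in the commit above can be sketched as a `ray start` command line. This is an illustration under assumptions, not the repository's actual launch script: the helper name `ray_start_args` and the example head address are hypothetical, and the "pre-started idle workers" setting is left out because the commit message does not say how it is disabled.

```python
import shlex

GIB = 1024 ** 3  # bytes per GiB


def ray_start_args(head: bool, head_address: str = "") -> list[str]:
    """Build a `ray start` command with the memory-saving flags from the commit.

    `head_address` is a hypothetical placeholder for worker nodes; the real
    scripts may derive it differently.
    """
    args = ["ray", "start"]
    if head:
        # The dashboard runs on the head node; the commit reports ~1.2 GiB saved.
        args += ["--head", "--include-dashboard=false"]
    else:
        args.append(f"--address={head_address}")
    args += [
        f"--object-store-memory={1 * GIB}",  # cap the object store at 1 GiB
        "--num-cpus=2",
        "--disable-usage-stats",
    ]
    return args


if __name__ == "__main__":
    print(shlex.join(ray_start_args(head=True)))
    print(shlex.join(ray_start_args(head=False, head_address="192.168.1.10:6379")))
```

Building the arguments as a list keeps the head and worker variants in one place, so the object-store cap and CPU limit cannot drift apart between nodes.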
Eugene Rakhmatulin
66b5c85907 Merge branch 'main' of github.com:eugr/spark-vllm-docker 2026-03-10 10:29:10 -07:00
eugr
0019bdf5ed Merge pull request #85 from saladinomario/feat/recipe-env-passthrough
Add -e/--env passthrough to run-recipe.py
2026-03-10 10:28:29 -07:00
sonusflow
006734910c Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate

The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution:
ReplicatedLinear for B/A projections, applied via diff patches.

Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)

Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
2026-03-09 21:30:28 +00:00
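The arithmetic behind the Marlin constraint in the commit above can be checked directly. The sketch below only reproduces the numbers quoted in the message (output_size=128, MIN_THREAD_N=64); the function names are hypothetical and this is not vLLM's actual validation code.

```python
MIN_THREAD_N = 64  # minimum per-shard width quoted in the commit message


def shard_width(output_size: int, tp_size: int) -> int:
    """Output columns each tensor-parallel rank holds after an even split."""
    return output_size // tp_size


def can_shard(output_size: int, tp_size: int, min_thread_n: int = MIN_THREAD_N) -> bool:
    """True if every shard is still at least `min_thread_n` columns wide."""
    return output_size % tp_size == 0 and shard_width(output_size, tp_size) >= min_thread_n


if __name__ == "__main__":
    for tp in (1, 2, 4):
        width = shard_width(128, tp)
        plan = "shard" if can_shard(128, tp) else "replicate"
        print(f"TP={tp}: width={width} -> {plan}")
```

At TP=4 the shard width drops to 32, below the threshold of 64, which is the case the commit addresses by replicating the B/A projections on every rank instead of splitting them.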
Eugene Rakhmatulin
e225c709fb Revert "fix: add temporary patch for CUDA graphs estimation" as it has been merged to main
This reverts commit 63b2a8dbed.
2026-03-09 09:46:50 -07:00
Eugene Rakhmatulin
63b2a8dbed fix: add temporary patch for CUDA graphs estimation 2026-03-08 22:43:41 -07:00
eugr
9724619dbd Merge pull request #87 from SeraphimSerapis/fix_wheels_download
fix: skip empty lines in wheel download read loop
2026-03-07 09:34:31 -08:00
Eugene Rakhmatulin
d42c4199fa Unsloth chat template for qwen3.5 staging-current-1772875976 2026-03-06 23:35:18 -08:00
Tim Messerschmidt
b9fc32ec34 fix: skip empty lines in wheel download read loop
Add a guard to skip empty lines (e.g. trailing newlines) in the
while-read loop to prevent try_download_wheels from breaking on
unexpected blank input.
2026-03-07 05:06:12 +01:00
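The guard described in the commit above lives in a shell while-read loop. The same idea, expressed in Python rather than shell, might look like the sketch below; `iter_nonempty_lines` and the sample wheel names are hypothetical and not taken from the repository.

```python
from typing import Iterator


def iter_nonempty_lines(text: str) -> Iterator[str]:
    """Yield each line of `text`, skipping blank or whitespace-only lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A trailing newline or stray blank line must not reach the downloader.
            continue
        yield line


if __name__ == "__main__":
    listing = "torch-example.whl\n\nvllm-example.whl\n\n"
    for name in iter_nonempty_lines(listing):
        print("would download:", name)
```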
Eugene Rakhmatulin
9dc09bd04b Renamed recipe for qwen3.5-35b-a3b-fp8 to match others 2026-03-06 13:56:06 -08:00
eugr
e88426646b Merge pull request #76 from mmonad/fix-exec-arg-quoting
Fix shell quoting for exec command arguments
2026-03-06 13:45:53 -08:00
mariosaladino
f95beba566 Add -e/--env passthrough to run-recipe.py
Fixes #81. Allows passing environment variables (e.g. HF_TOKEN)
through to the container when launching via recipes, mirroring
the existing -e flag in launch-cluster.sh.

Usage: ./run-recipe.sh glm-4.7-flash-awq --solo -e HF_TOKEN=$HF_TOKEN
2026-03-06 21:50:29 +01:00
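A repeatable `-e/--env` option like the one in the commit above is commonly built with `argparse` and `action="append"`. The following is a minimal sketch under assumptions: the parser layout and the helper `docker_env_flags` are hypothetical, and the real run-recipe.py may parse and forward the values differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the usage line in the commit message."""
    parser = argparse.ArgumentParser(prog="run-recipe")
    parser.add_argument("recipe")
    parser.add_argument("--solo", action="store_true")
    # action="append" lets -e be repeated; each use adds one KEY=VALUE string.
    parser.add_argument("-e", "--env", action="append", default=[], metavar="KEY=VALUE")
    return parser


def docker_env_flags(pairs: list[str]) -> list[str]:
    """Turn KEY=VALUE strings into `-e KEY=VALUE` arguments for docker."""
    flags: list[str] = []
    for pair in pairs:
        key, sep, _value = pair.partition("=")
        if not key or not sep:
            # This sketch insists on KEY=VALUE; that strictness is an assumption.
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        flags += ["-e", pair]
    return flags


if __name__ == "__main__":
    argv = ["glm-4.7-flash-awq", "--solo", "-e", "HF_TOKEN=abc123"]
    args = build_parser().parse_args(argv)
    print(docker_env_flags(args.env))
```

Passing the flags as separate list items rather than a single string avoids shell quoting problems when a value contains spaces.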