Commit Graph

297 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
07fac71dac Fixed bug with CONTAINER_NAME variable 2026-03-25 14:42:01 -07:00
Eugene Rakhmatulin
ad2cd3373f .env configuration support for launch-cluster.sh 2026-03-25 14:18:00 -07:00
Eugene Rakhmatulin
c4b078b868 Merge branch 'main' into 3-node 2026-03-24 22:21:25 -07:00
Eugene Rakhmatulin
3be2fb24a8 Merge pull request #122 2026-03-24 22:18:52 -07:00
Eugene Rakhmatulin
7fa69187df metadata changes 2026-03-24 22:18:07 -07:00
Drew Botwinick
8298c3d7f8 Merge remote-tracking branch 'upstream/main'
# Conflicts:
#	Dockerfile
2026-03-24 15:41:09 -05:00
Eugene Rakhmatulin
f8c2653fd3 Quick fix for NCCL dependency 2026-03-23 23:20:59 -07:00
Eugene Rakhmatulin
990a7b3837 Use mesh-optimized NCCL 2026-03-23 15:43:18 -07:00
Eugene Rakhmatulin
9e089acf2b Updated Nemotron recipes to use VLLM CUTLASS 2026-03-22 23:03:24 -07:00
Eugene Rakhmatulin
2d749742e4 Changed base image back to base CUDA development one 2026-03-21 18:11:20 -07:00
Eugene Rakhmatulin
7a54657abf Revert "cuda 13.2 torch"
This reverts commit 926dd57a87.
2026-03-21 15:36:17 -07:00
Eugene Rakhmatulin
926dd57a87 cuda 13.2 torch 2026-03-21 15:15:01 -07:00
Eugene Rakhmatulin
6e8d85c914 cleanup 2026-03-21 15:12:12 -07:00
Drew Botwinick
d6e76f8e2f add build metadata generation and include in Dockerfiles 2026-03-21 16:10:04 -05:00
Eugene Rakhmatulin
8385506c5e Fixes 2026-03-20 23:51:21 -07:00
Eugene Rakhmatulin
8caebe3155 Reverting back to CUDA image + pytorch from wheels 2026-03-20 17:03:18 -07:00
Eugene Rakhmatulin
919a881cb1 Merge branch 'main' of gitlab.home.eugr.net:ai/spark-vllm 2026-03-18 22:03:25 -07:00
Eugene Rakhmatulin
8ddc259619 Fixed #111 2026-03-18 22:03:04 -07:00
eugr
22f3fa6c21 Merge pull request #103 from apairmont/network_arg
Add docker --network arg to common build flags
2026-03-18 21:48:48 -07:00
Eugene Rakhmatulin
15d295887c Updated README to reflect --master-port parameter 2026-03-18 21:23:28 -07:00
Eugene Rakhmatulin
7e4150feed Added master-port argument 2026-03-18 16:57:55 -07:00
eugr
7b752c31c5 Merge pull request #110 from voloszad/patch-1
Remove run-cluster-node.sh script copy and permission commands from Dockerfile.mxfp4
2026-03-18 14:54:11 -07:00
Andrej V.
bdd2b10f54 Remove script copy and permission commands from Dockerfile
Removed script copying and permission setting for run-cluster-node.sh.
2026-03-18 21:57:56 +01:00
Eugene Rakhmatulin
2755b62d12 Fixes #108 2026-03-18 13:26:39 -07:00
Eugene Rakhmatulin
f327b92abe Fixes #106 and #108 2026-03-18 13:06:44 -07:00
Eugene Rakhmatulin
57b458570e Added experimental Qwen3.5-397B support for dual Spark configuration 2026-03-17 19:05:36 -07:00
Eugene Rakhmatulin
57ed099465 Updated README file to reflect new launch-cluster options. 2026-03-17 16:16:04 -07:00
Eugene Rakhmatulin
fb0687cd1b Updated README to describe no-ray mode 2026-03-17 15:27:22 -07:00
Eugene Rakhmatulin
ccea2ba861 Bugfixes 2026-03-17 13:54:42 -07:00
Eugene Rakhmatulin
957605498c Added extra passthrough variables to run-recipe 2026-03-17 13:41:40 -07:00
Eugene Rakhmatulin
b1eeefc0eb Changed Nemotron-3-Nano-NVFP4 to Marlin backend 2026-03-17 13:10:48 -07:00
Alan Pairmont
b879b7748f add network arg to common build flags 2026-03-16 12:09:59 -04:00
Eugene Rakhmatulin
fa645f3e4b bugfixes 2026-03-13 13:39:30 -07:00
Eugene Rakhmatulin
dedbd0a01d bugfixes 2026-03-13 12:41:48 -07:00
Eugene Rakhmatulin
caa83d9e5b Bugfixes 2026-03-13 12:32:43 -07:00
Eugene Rakhmatulin
4bcbbaa25a Bugfixes 2026-03-13 12:23:41 -07:00
Eugene Rakhmatulin
d08266a123 Bugfixes 2026-03-13 12:18:22 -07:00
Eugene Rakhmatulin
03b055d7f0 Major cluster orchestration refactoring to support running without Ray 2026-03-13 11:55:18 -07:00
Eugene Rakhmatulin
d609fecef3 Merge branch 'main' of github.com:eugr/spark-vllm-docker 2026-03-12 15:04:41 -07:00
eugr
7c198b1ceb Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)
2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin
8ae51192e5 Experimental mod to support gpu-memory-utilization-gb 2026-03-12 13:37:44 -07:00
Eugene Rakhmatulin
8fec9bed06 Updated Nemotron to support dual sparks 2026-03-12 13:30:15 -07:00
Eugene Rakhmatulin
6a323cc6f5 Merge pull request #93 2026-03-12 13:00:13 -07:00
Eugene Rakhmatulin
6f9a2f981c Adjusted model parameters 2026-03-12 12:59:05 -07:00
remi
122edc8229 super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 2026-03-11 20:53:44 +01:00
Eugene Rakhmatulin
7ceea85647 Fixed qwen3-coder-next-int4-autoround to exclude Ray 2026-03-11 11:20:56 -07:00
Eugene Rakhmatulin
45066e2b16 Updated README 2026-03-11 09:57:34 -07:00
Eugene Rakhmatulin
f2cf11b047 Added a recipe for qwen3-coder-next-int4-autoround 2026-03-11 09:23:23 -07:00
sonusflow
3baca14eb1 Move recipe to 4x-spark-cluster/ and add UMA memory optimizations
- Move qwen3.5-397b-int4-autoround.yaml to recipes/4x-spark-cluster/
  per maintainer request (multi-node recipes in separate directory)
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to recipe env
- Optimize Ray for GB10 UMA (128GB shared CPU/GPU memory):
  - Disable Ray dashboard (saves ~1.2 GiB per node)
  - Limit Ray object store to 1 GiB (default 30% of RAM = 33 GiB)
  - Disable pre-started idle workers (saves ~8 GiB on head node)
  - Set --num-cpus 2 and --disable-usage-stats on all nodes
- Net effect: ~40+ GiB freed across 4-node cluster for model/KV cache
2026-03-11 07:29:45 +00:00
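Note on commit 3baca14eb1: the Ray settings listed in that message map onto `ray start` command-line options. The sketch below is a hypothetical illustration, not code from the repository. The helper name `ray_start_args`, the example address, and the choice to pass the dashboard flag only on the head node are assumptions. The "disable pre-started idle workers" item is left out because the commit message does not say how it is done.

```python
import shlex
from typing import List, Optional

GIB = 1024 ** 3  # bytes per GiB; --object-store-memory takes a byte count


def ray_start_args(head_address: Optional[str] = None,
                   object_store_gib: int = 1,
                   num_cpus: int = 2) -> List[str]:
    """Build a `ray start` argument list (hypothetical helper).

    head_address=None means "this is the head node"; otherwise it is the
    host:port of the head node that a worker should join.
    """
    args = ["ray", "start"]
    if head_address is None:
        # Head node: start the cluster and skip the dashboard.
        args += ["--head", "--include-dashboard=false"]
    else:
        # Worker node: join an existing head.
        args.append(f"--address={head_address}")
    args += [
        f"--object-store-memory={object_store_gib * GIB}",  # cap object store
        f"--num-cpus={num_cpus}",
        "--disable-usage-stats",
    ]
    return args


if __name__ == "__main__":
    # Example address is a placeholder, not taken from the repository.
    print(shlex.join(ray_start_args()))
    print(shlex.join(ray_start_args("10.0.0.1:6379")))
```

Returning a list instead of a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.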
Eugene Rakhmatulin
66b5c85907 Merge branch 'main' of github.com:eugr/spark-vllm-docker 2026-03-10 10:29:10 -07:00