Eugene Rakhmatulin | c8ee2a2511 | Perform node count check in any mode | 2026-03-26 18:15:09 -07:00
Eugene Rakhmatulin | ce293b5f05 | Additional checks for parallelism and cluster size | 2026-03-26 17:52:47 -07:00
Eugene Rakhmatulin | f872cc17a8 | Fix for --setup behavior | 2026-03-26 16:49:09 -07:00
Eugene Rakhmatulin | 00c16746e5 | Handle new copy hosts setup in run-recipe.py | 2026-03-26 16:45:35 -07:00
Eugene Rakhmatulin | f163ca69de | Autodiscover tweaks | 2026-03-26 16:30:05 -07:00
Eugene Rakhmatulin | a78e221de3 | Autodiscovery refactoring with mesh support | 2026-03-26 15:47:41 -07:00
Eugene Rakhmatulin | 83a74bccec | Removed extra solo mode check | 2026-03-26 07:45:23 -07:00
Eugene Rakhmatulin | ff18a9ad5b | Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node | 2026-03-25 23:38:44 -07:00
Eugene Rakhmatulin | c08b34a218 | add --config passthrough to run-recipe | 2026-03-25 23:35:52 -07:00
Eugene Rakhmatulin | 23cca2a11a | Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node | 2026-03-25 23:17:25 -07:00
Eugene Rakhmatulin | c2fe579ccc | Enhance .env file handling and validation in scripts | 2026-03-25 23:16:56 -07:00
Eugene Rakhmatulin | 8b7c02aa25 | add .env support to build-and-copy.sh | 2026-03-25 22:47:02 -07:00
Eugene Rakhmatulin | 73fec1bdf8 | bugfix | 2026-03-25 15:40:09 -07:00
Eugene Rakhmatulin | 2f5ff0211e | Cleanup in build script | 2026-03-25 15:39:23 -07:00
Eugene Rakhmatulin | 63ee72e729 | Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node | 2026-03-25 15:36:31 -07:00
Eugene Rakhmatulin | 4a0feea6c3 | Added --cleanup option to build script | 2026-03-25 15:35:32 -07:00
Eugene Rakhmatulin | 429042b7dc | Revert "Added --cleanup option". This reverts commit b8930b05a1. | 2026-03-25 15:35:15 -07:00
Eugene Rakhmatulin | ef95336937 | Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node | 2026-03-25 15:25:19 -07:00
Eugene Rakhmatulin | b8930b05a1 | Added --cleanup option | 2026-03-25 15:24:59 -07:00
Eugene Rakhmatulin | 49d505ad14 | Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node | 2026-03-25 15:16:47 -07:00
Eugene Rakhmatulin | 1755dfd114 | Added LOCAL_IP support | 2026-03-25 15:16:06 -07:00
Eugene Rakhmatulin | 3d4dc4c82e | Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node | 2026-03-25 14:42:37 -07:00
Eugene Rakhmatulin | 07fac71dac | Fixed bug with CONTAINER_NAME variable | 2026-03-25 14:42:01 -07:00
Eugene Rakhmatulin | 1702f47df6 | Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node | 2026-03-25 14:18:32 -07:00
Eugene Rakhmatulin | ad2cd3373f | .env configuration support for launch-cluster.sh | 2026-03-25 14:18:00 -07:00
Eugene Rakhmatulin | 1fd8c7afc3 | Merge branch 'main' into 3-node | 2026-03-25 12:45:40 -07:00
Eugene Rakhmatulin | 3dcd2a90c1 | Updated Nemotron-3-Super recipe | 2026-03-25 12:44:44 -07:00
Eugene Rakhmatulin | efacbd69f2 | Updated Nemotron3-Super recipe | 2026-03-25 12:43:12 -07:00
Eugene Rakhmatulin | c4b078b868 | Merge branch 'main' into 3-node | 2026-03-24 22:21:25 -07:00
Eugene Rakhmatulin | 3be2fb24a8 | Merge pull request #122 | 2026-03-24 22:18:52 -07:00
Eugene Rakhmatulin | 7fa69187df | metadata changes | 2026-03-24 22:18:07 -07:00
Drew Botwinick | 8298c3d7f8 | Merge remote-tracking branch 'upstream/main' (conflicts: Dockerfile) | 2026-03-24 15:41:09 -05:00
Eugene Rakhmatulin | f8c2653fd3 | Quick fix for NCCL dependency | 2026-03-23 23:20:59 -07:00
Eugene Rakhmatulin | 990a7b3837 | Use mesh-optimized NCCL | 2026-03-23 15:43:18 -07:00
Eugene Rakhmatulin | 9e089acf2b | Updated Nemotron recipes to use VLLM CUTLASS | 2026-03-22 23:03:24 -07:00
Eugene Rakhmatulin | 2d749742e4 | Changed base image back to base CUDA development one | 2026-03-21 18:11:20 -07:00
Eugene Rakhmatulin | 7a54657abf | Revert "cuda 13.2 torch". This reverts commit 926dd57a87. | 2026-03-21 15:36:17 -07:00
Eugene Rakhmatulin | 926dd57a87 | cuda 13.2 torch | 2026-03-21 15:15:01 -07:00
Eugene Rakhmatulin | 6e8d85c914 | cleanup | 2026-03-21 15:12:12 -07:00
Drew Botwinick | d6e76f8e2f | add build metadata generation and include in Dockerfiles | 2026-03-21 16:10:04 -05:00
Eugene Rakhmatulin | 8385506c5e | Fixes | 2026-03-20 23:51:21 -07:00
Eugene Rakhmatulin | 8caebe3155 | Reverting back to CUDA image + pytorch from wheels | 2026-03-20 17:03:18 -07:00
Eugene Rakhmatulin | 919a881cb1 | Merge branch 'main' of gitlab.home.eugr.net:ai/spark-vllm | 2026-03-18 22:03:25 -07:00
Eugene Rakhmatulin | 8ddc259619 | Fixed #111 | 2026-03-18 22:03:04 -07:00
eugr | 22f3fa6c21 | Merge pull request #103 from apairmont/network_arg. Add docker --network arg to common build flags | 2026-03-18 21:48:48 -07:00
Eugene Rakhmatulin | 15d295887c | Updated README to reflect --master-port parameter | 2026-03-18 21:23:28 -07:00
Eugene Rakhmatulin | 7e4150feed | Added master-port argument | 2026-03-18 16:57:55 -07:00
eugr | 7b752c31c5 | Merge pull request #110 from voloszad/patch-1. Remove run-cluster-node.sh script copy and permission commands from Dockerfile.mxfp4 | 2026-03-18 14:54:11 -07:00
Andrej V. | bdd2b10f54 | Remove script copy and permission commands from Dockerfile. Removed script copying and permission setting for run-cluster-node.sh. | 2026-03-18 21:57:56 +01:00
Eugene Rakhmatulin | 2755b62d12 | Fixes #108 | 2026-03-18 13:26:39 -07:00