Eugene Rakhmatulin
|
294d155532
|
Add NCCL debug level option to launch-cluster.sh
|
2025-12-18 23:28:12 -08:00 |
|
Eugene Rakhmatulin
|
0377e9badf
|
Bugfix: don't shut down on exit if cluster is already running
|
2025-12-18 23:12:39 -08:00 |
|
Eugene Rakhmatulin
|
2a2f8f24e2
|
Allow launch-cluster.sh to be executed in non-TTY environment
|
2025-12-18 23:02:58 -08:00 |
|
Eugene Rakhmatulin
|
8c53179cc2
|
changed extra docker args variable to VLLM_SPARK_EXTRA_DOCKER_ARGS for consistency
|
2025-12-18 22:27:27 -08:00 |
|
Eugene Rakhmatulin
|
cf937af897
|
Merge pull request #6
|
2025-12-18 22:17:12 -08:00 |
|
Eugene Rakhmatulin
|
cf9da89545
|
Updated README
|
2025-12-18 22:03:46 -08:00 |
|
Eugene Rakhmatulin
|
8a0cb3c853
|
Merge branch 'main' into pr-6
|
2025-12-18 22:02:13 -08:00 |
|
Eugene Rakhmatulin
|
442f7369ad
|
Updated build script to handle BUILD_JOBS argument
|
2025-12-18 22:02:04 -08:00 |
|
Eugene Rakhmatulin
|
e6efd668cd
|
Added Table of Contents to README
|
2025-12-18 15:43:09 -08:00 |
|
Eugene Rakhmatulin
|
8be691e806
|
Fixed issue with argument passing
|
2025-12-18 15:31:53 -08:00 |
|
Eugene Rakhmatulin
|
369283f655
|
Updated README.md with launch-cluster details.
|
2025-12-18 15:25:22 -08:00 |
|
Eugene Rakhmatulin
|
db5c443905
|
Enhance launch-cluster script with improved node detection and SSH scanning using netcat and Python
|
2025-12-18 14:52:23 -08:00 |
|
Eugene Rakhmatulin
|
6c04ebfca1
|
Refactor launch-cluster script to include cluster running checks and streamline start process for head and worker nodes
|
2025-12-18 14:50:26 -08:00 |
|
Eugene Rakhmatulin
|
f7a15bfaf5
|
Enhance launch-cluster script with improved SSH connectivity checks for worker nodes
|
2025-12-18 14:22:48 -08:00 |
|
Eugene Rakhmatulin
|
25b1d8eb4f
|
Enhance launch-cluster script with auto-detection for interfaces and nodes
|
2025-12-18 13:53:28 -08:00 |
|
Eugene Rakhmatulin
|
a1ed352635
|
renamed launch-cluster for consistency
|
2025-12-18 13:11:48 -08:00 |
|
Eugene Rakhmatulin
|
20a6699bf7
|
Add launch_cluster script for managing cluster nodes and actions
|
2025-12-18 13:11:13 -08:00 |
|
Eugene Rakhmatulin
|
1025243316
|
Added launch_cluster script to simplify launching cluster on nodes.
|
2025-12-18 13:10:57 -08:00 |
|
Christopher Owen
|
a13a9f6806
|
Limit build parallelism to reduce OOM situations
|
2025-12-18 13:36:35 +01:00 |
|
Eugene Rakhmatulin
|
e0f6cff132
|
Merge pull request #1
|
2025-12-16 21:32:42 -08:00 |
|
TeskaLabs Admin
|
f1abfb85b6
|
Bump of the version
|
2025-12-16 17:58:48 +00:00 |
|
Eugene Rakhmatulin
|
79f6a204d1
|
Update README.md
|
2025-12-15 09:51:49 -08:00 |
|
Eugene Rakhmatulin
|
0606b1b984
|
Refactor Triton and vLLM reference handling in Dockerfile and build script
|
2025-12-14 23:28:08 -08:00 |
|
eugr
|
4551795908
|
Fixed missing Infiniband dependency, added CuDNN
|
2025-12-14 21:49:50 -08:00 |
|
eugr
|
33720fc9d6
|
Use no-build-isolation for Triton Kernels build
|
2025-12-14 18:35:26 -08:00 |
|
eugr
|
dc614dc6ae
|
Separated Triton build into a dedicated phase for better caching
|
2025-12-14 10:32:28 -08:00 |
|
eugr
|
25f759fec8
|
Optimized triton caching
|
2025-12-14 09:26:10 -08:00 |
|
eugr
|
02f842e1fd
|
Updated README
|
2025-12-14 00:39:15 -08:00 |
|
eugr
|
e8a12da072
|
Build triton from source; add TRITON_SHA argument to specify triton release, and add timing statistics
|
2025-12-14 00:30:50 -08:00 |
|
eugr
|
a8217a1fd8
|
Improved dependency handling
|
2025-12-13 22:41:30 -08:00 |
|
eugr
|
cc3e73feb1
|
Improved caching
|
2025-12-13 21:34:57 -08:00 |
|
eugr
|
76a8e92c86
|
Multistage build with caching
|
2025-12-13 21:18:26 -08:00 |
|
eugr
|
295e1f2266
|
Removed MiniMax M2 temporary patch from Dockerfile; updated README.md
|
2025-12-11 13:24:57 -08:00 |
|
eugr
|
37c12cf9e4
|
Removed MiniMax M2 patch since the fix is merged into main
|
2025-12-11 13:23:30 -08:00 |
|
eugr
|
5fba205db4
|
Implemented a temporary patch for recently broken MiniMax-M2 (in builds after 12/10) for some quants.
|
2025-12-11 11:13:05 -08:00 |
|
eugr
|
9d351cd6d5
|
Updated README
|
2025-12-05 11:32:02 -08:00 |
|
eugr
|
270446be27
|
Add build-and-copy script for automated image building and deployment
|
2025-12-05 11:28:43 -08:00 |
|
eugr
|
b10ed739fe
|
formatting changes
|
2025-11-29 10:04:12 -08:00 |
|
eugr
|
6a66a4b66f
|
Added patch to allow fastsafetensors in cluster config
|
2025-11-26 21:25:04 -08:00 |
|
eugr
|
712637a348
|
Added second RoCE interface to examples
|
2025-11-26 19:53:37 -08:00 |
|
eugr
|
bdf16a0a34
|
Formatting
|
2025-11-26 14:02:15 -08:00 |
|
eugr
|
cf8e411ad2
|
Added benchmarking
|
2025-11-26 14:01:04 -08:00 |
|
eugr
|
676fa2ace9
|
Formatting fix
|
2025-11-26 13:52:30 -08:00 |
|
eugr
|
4f27899939
|
Added some details on networking
|
2025-11-26 13:50:39 -08:00 |
|
eugr
|
1a4bc1d7aa
|
Typo
|
2025-11-26 13:44:34 -08:00 |
|
eugr
|
2a7d31ad81
|
Updated README
|
2025-11-26 13:30:17 -08:00 |
|
eugr
|
549214e6ed
|
Added missing Infiniband and RDMA libraries
|
2025-11-25 16:14:08 -08:00 |
|
eugr
|
a96a3a2dac
|
Removed temporary patch for NVFP4 quants support as it's been merged into main
|
2025-11-25 12:48:58 -08:00 |
|
eugr
|
a93bd56389
|
Updated README
|
2025-11-24 21:44:01 -08:00 |
|
eugr
|
4c976375c5
|
Added missing dependencies; added dashboard support for Ray clusters
|
2025-11-24 21:13:06 -08:00 |
|