Commit Graph

212 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
294d155532 Add NCCL debug level option to launch-cluster.sh 2025-12-18 23:28:12 -08:00
Eugene Rakhmatulin
0377e9badf Bugfix: don't shut down on exit if cluster is already running 2025-12-18 23:12:39 -08:00
Eugene Rakhmatulin
2a2f8f24e2 Allow launch-cluster.sh to be executed in non-TTY environment 2025-12-18 23:02:58 -08:00
Eugene Rakhmatulin
8c53179cc2 changed extra docker args variable to VLLM_SPARK_EXTRA_DOCKER_ARGS for consistency 2025-12-18 22:27:27 -08:00
Eugene Rakhmatulin
cf937af897 Merge pull request #6 2025-12-18 22:17:12 -08:00
Eugene Rakhmatulin
cf9da89545 Updated README 2025-12-18 22:03:46 -08:00
Eugene Rakhmatulin
8a0cb3c853 Merge branch 'main' into pr-6 2025-12-18 22:02:13 -08:00
Eugene Rakhmatulin
442f7369ad Updated build script to handle BUILD_JOBS argument 2025-12-18 22:02:04 -08:00
Eugene Rakhmatulin
e6efd668cd Added Table of Contents to README 2025-12-18 15:43:09 -08:00
Eugene Rakhmatulin
8be691e806 Fixed issue with argument passing 2025-12-18 15:31:53 -08:00
Eugene Rakhmatulin
369283f655 Updated README.md with launch-cluster details. 2025-12-18 15:25:22 -08:00
Eugene Rakhmatulin
db5c443905 Enhance launch-cluster script with improved node detection and SSH scanning using netcat and Python 2025-12-18 14:52:23 -08:00
Eugene Rakhmatulin
6c04ebfca1 Refactor launch-cluster script to include cluster running checks and streamline start process for head and worker nodes 2025-12-18 14:50:26 -08:00
Eugene Rakhmatulin
f7a15bfaf5 Enhance launch-cluster script with improved SSH connectivity checks for worker nodes 2025-12-18 14:22:48 -08:00
Eugene Rakhmatulin
25b1d8eb4f Enhance launch-cluster script with auto-detection for interfaces and nodes 2025-12-18 13:53:28 -08:00
Eugene Rakhmatulin
a1ed352635 renamed launch-cluster for consistency 2025-12-18 13:11:48 -08:00
Eugene Rakhmatulin
20a6699bf7 Add launch_cluster script for managing cluster nodes and actions 2025-12-18 13:11:13 -08:00
Eugene Rakhmatulin
1025243316 Added launch_cluster script to simplify launching cluster on nodes. 2025-12-18 13:10:57 -08:00
Christopher Owen
a13a9f6806 Limit build parallelism to reduce OOM situations 2025-12-18 13:36:35 +01:00
Eric Lewis
11355677f6 Add parallel copy option to build-and-copy.sh
Introduced the --copy-parallel flag to enable concurrent copying of Docker images to multiple hosts. Updated the README with usage instructions and details about the new option. Refactored the script to support both serial and parallel copy modes for improved efficiency.
2025-12-18 01:24:48 -05:00
Eric Lewis
e67abd5e6e Add multi-host copy support to build-and-copy.sh
Updated build-and-copy.sh to support copying Docker images to multiple hosts using the new -c/--copy-to flag, which accepts space- or comma-separated host lists. The old --copy-to-host flag is retained as an alias for backward compatibility, and -h is now used for help. The README was updated to document these changes and provide new usage examples.
2025-12-18 00:32:45 -05:00
Eugene Rakhmatulin
e0f6cff132 Merge pull request #1 2025-12-16 21:32:42 -08:00
TeskaLabs Admin
f1abfb85b6 Bump of the version 2025-12-16 17:58:48 +00:00
Eugene Rakhmatulin
79f6a204d1 Update README.md 2025-12-15 09:51:49 -08:00
Eugene Rakhmatulin
0606b1b984 Refactor Triton and vLLM reference handling in Dockerfile and build script 2025-12-14 23:28:08 -08:00
eugr
4551795908 Fixed missing Infiniband dependency, added CuDNN 2025-12-14 21:49:50 -08:00
eugr
33720fc9d6 Use no-build-isolation for Triton Kernels build 2025-12-14 18:35:26 -08:00
eugr
dc614dc6ae Separated Triton build into a dedicated phase for better caching 2025-12-14 10:32:28 -08:00
eugr
25f759fec8 Optimized triton caching 2025-12-14 09:26:10 -08:00
eugr
02f842e1fd Updated README 2025-12-14 00:39:15 -08:00
eugr
e8a12da072 Build triton from source; add TRITON_SHA argument to specify triton release, and add timing statistics 2025-12-14 00:30:50 -08:00
eugr
a8217a1fd8 Improved dependency handling 2025-12-13 22:41:30 -08:00
eugr
cc3e73feb1 Improved caching 2025-12-13 21:34:57 -08:00
eugr
76a8e92c86 Multistage build with caching 2025-12-13 21:18:26 -08:00
eugr
295e1f2266 Removed MiniMax M2 temporary patch from Dockerfile; updated README.md 2025-12-11 13:24:57 -08:00
eugr
37c12cf9e4 Removed MiniMax M2 patch since the fix is merged into main 2025-12-11 13:23:30 -08:00
eugr
5fba205db4 Implemented a temporary patch for recently broken MiniMax-M2 (in builds after 12/10) for some quants. 2025-12-11 11:13:05 -08:00
eugr
9d351cd6d5 Updated README 2025-12-05 11:32:02 -08:00
eugr
270446be27 Add build-and-copy script for automated image building and deployment 2025-12-05 11:28:43 -08:00
eugr
b10ed739fe formatting changes 2025-11-29 10:04:12 -08:00
eugr
6a66a4b66f Added patch to allow fastsafetensors in cluster config 2025-11-26 21:25:04 -08:00
eugr
712637a348 Added second RoCE interface to examples 2025-11-26 19:53:37 -08:00
eugr
bdf16a0a34 Formatting 2025-11-26 14:02:15 -08:00
eugr
cf8e411ad2 Added benchmarking 2025-11-26 14:01:04 -08:00
eugr
676fa2ace9 Formatting fix 2025-11-26 13:52:30 -08:00
eugr
4f27899939 Added some details on networking 2025-11-26 13:50:39 -08:00
eugr
1a4bc1d7aa Typo 2025-11-26 13:44:34 -08:00
eugr
2a7d31ad81 Updated README 2025-11-26 13:30:17 -08:00
eugr
549214e6ed Added missing Infiniband and RDMA libraries 2025-11-25 16:14:08 -08:00
eugr
a96a3a2dac Removed temporary patch for NVFP4 quants support as it's been merged into main 2025-11-25 12:48:58 -08:00