58 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
2a2f8f24e2 Allow launch-cluster.sh to be executed in non-TTY environment 2025-12-18 23:02:58 -08:00
Eugene Rakhmatulin
8c53179cc2 changed extra docker args variable to VLLM_SPARK_EXTRA_DOCKER_ARGS for consistency 2025-12-18 22:27:27 -08:00
Eugene Rakhmatulin
cf937af897 Merge pull request #6 2025-12-18 22:17:12 -08:00
Eugene Rakhmatulin
cf9da89545 Updated README 2025-12-18 22:03:46 -08:00
Eugene Rakhmatulin
8a0cb3c853 Merge branch 'main' into pr-6 2025-12-18 22:02:13 -08:00
Eugene Rakhmatulin
442f7369ad Updated build script to handle BUILD_JOBS argument 2025-12-18 22:02:04 -08:00
Eugene Rakhmatulin
e6efd668cd Added Table of Contents to README 2025-12-18 15:43:09 -08:00
Eugene Rakhmatulin
8be691e806 Fixed issue with argument passing 2025-12-18 15:31:53 -08:00
Eugene Rakhmatulin
369283f655 Updated README.md with launch-cluster details. 2025-12-18 15:25:22 -08:00
Eugene Rakhmatulin
db5c443905 Enhance launch-cluster script with improved node detection and SSH scanning using netcat and Python 2025-12-18 14:52:23 -08:00
Eugene Rakhmatulin
6c04ebfca1 Refactor launch-cluster script to include cluster running checks and streamline start process for head and worker nodes 2025-12-18 14:50:26 -08:00
Eugene Rakhmatulin
f7a15bfaf5 Enhance launch-cluster script with improved SSH connectivity checks for worker nodes 2025-12-18 14:22:48 -08:00
Eugene Rakhmatulin
25b1d8eb4f Enhance launch-cluster script with auto-detection for interfaces and nodes 2025-12-18 13:53:28 -08:00
Eugene Rakhmatulin
a1ed352635 renamed launch-cluster for consistency 2025-12-18 13:11:48 -08:00
Eugene Rakhmatulin
20a6699bf7 Add launch_cluster script for managing cluster nodes and actions 2025-12-18 13:11:13 -08:00
Eugene Rakhmatulin
1025243316 Added launch_cluster script to simplify launching cluster on nodes. 2025-12-18 13:10:57 -08:00
Christopher Owen
a13a9f6806 Limit build parallelism to reduce OOM situations 2025-12-18 13:36:35 +01:00
Eugene Rakhmatulin
e0f6cff132 Merge pull request #1 2025-12-16 21:32:42 -08:00
TeskaLabs Admin
f1abfb85b6 Bump of the version 2025-12-16 17:58:48 +00:00
Eugene Rakhmatulin
79f6a204d1 Update README.md 2025-12-15 09:51:49 -08:00
Eugene Rakhmatulin
0606b1b984 Refactor Triton and vLLM reference handling in Dockerfile and build script 2025-12-14 23:28:08 -08:00
eugr
4551795908 Fixed missing Infiniband dependency, added CuDNN 2025-12-14 21:49:50 -08:00
eugr
33720fc9d6 Use no-build-isolation for Triton Kernels build 2025-12-14 18:35:26 -08:00
eugr
dc614dc6ae Separated Triton build into a dedicated phase for better caching 2025-12-14 10:32:28 -08:00
eugr
25f759fec8 Optimized triton caching 2025-12-14 09:26:10 -08:00
eugr
02f842e1fd Updated README 2025-12-14 00:39:15 -08:00
eugr
e8a12da072 Build triton from source; add TRITON_SHA argument to specify triton release, and add timing statistics 2025-12-14 00:30:50 -08:00
eugr
a8217a1fd8 Improved dependency handling 2025-12-13 22:41:30 -08:00
eugr
cc3e73feb1 Improved caching 2025-12-13 21:34:57 -08:00
eugr
76a8e92c86 Multistage build with caching 2025-12-13 21:18:26 -08:00
eugr
295e1f2266 Removed MiniMax M2 temporary patch from Dockerfile; updated README.md 2025-12-11 13:24:57 -08:00
eugr
37c12cf9e4 Removed MiniMax M2 patch since the fix is merged into main 2025-12-11 13:23:30 -08:00
eugr
5fba205db4 Implemented a temporary patch for recently broken MiniMax-M2 (in builds after 12/10) for some quants. 2025-12-11 11:13:05 -08:00
eugr
9d351cd6d5 Updated README 2025-12-05 11:32:02 -08:00
eugr
270446be27 Add build-and-copy script for automated image building and deployment 2025-12-05 11:28:43 -08:00
eugr
b10ed739fe formatting changes 2025-11-29 10:04:12 -08:00
eugr
6a66a4b66f Added patch to allow fastsafetensors in cluster config 2025-11-26 21:25:04 -08:00
eugr
712637a348 Added second RoCE interface to examples 2025-11-26 19:53:37 -08:00
eugr
bdf16a0a34 Formatting 2025-11-26 14:02:15 -08:00
eugr
cf8e411ad2 Added benchmarking 2025-11-26 14:01:04 -08:00
eugr
676fa2ace9 Formatting fix 2025-11-26 13:52:30 -08:00
eugr
4f27899939 Added some details on networking 2025-11-26 13:50:39 -08:00
eugr
1a4bc1d7aa Typo 2025-11-26 13:44:34 -08:00
eugr
2a7d31ad81 Updated README 2025-11-26 13:30:17 -08:00
eugr
549214e6ed Added missing Infiniband and RDMA libraries 2025-11-25 16:14:08 -08:00
eugr
a96a3a2dac Removed temporary patch for NVFP4 quants support as it's been merged into main 2025-11-25 12:48:58 -08:00
eugr
a93bd56389 Updated README 2025-11-24 21:44:01 -08:00
eugr
4c976375c5 Added missing dependencies; added dashboard support for Ray clusters 2025-11-24 21:13:06 -08:00
eugr
399948a725 Added missing modules for flashinfer 2025-11-24 17:02:04 -08:00
eugr
bd48032c45 Fixed typo in docker command in README 2025-11-24 16:34:19 -08:00