Commit Graph

125 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
9f35dbdd2d Reverted back to release flashinfer 2025-12-20 23:01:49 -08:00
Eugene Rakhmatulin
d5d85aaac7 Added optional flashinfer packages, using pre-release flashinfer 2025-12-20 22:56:40 -08:00
Eugene Rakhmatulin
76988e0c75 Added --use-wheels to use precompiled vLLM wheels instead of compiling from the source 2025-12-20 20:25:07 -08:00
Eugene Rakhmatulin
a83200573a Enhance Dockerfile: limit ccache size, enable compression, and optimize git repo size 2025-12-20 15:29:37 -08:00
Eugene Rakhmatulin
fbb1bf73d5 Switching to flashinfer 0.6.x pre-release wheels 2025-12-20 13:28:06 -08:00
Eugene Rakhmatulin
f075801c59 Fixed launch_cluster bug introduced by refactoring 2025-12-19 10:51:50 -08:00
Eugene Rakhmatulin
0cac77c286 Fixed contributor username 2025-12-19 10:41:03 -08:00
Eugene Rakhmatulin
3eb57a6d49 Updated README - autodiscovery in copy ops 2025-12-19 10:39:28 -08:00
Eugene Rakhmatulin
a351f182cc Implement autodiscovery for copy hosts and enhance interface detection in build-and-copy and launch-cluster scripts 2025-12-19 10:36:39 -08:00
Eugene Rakhmatulin
244ad758d2 Updated README 2025-12-19 09:56:24 -08:00
Eugene Rakhmatulin
074316de68 Merge pull request #2 2025-12-19 08:59:29 -08:00
Eugene Rakhmatulin
23858a3c7f Merge branch 'main' into pr-2 2025-12-19 08:51:52 -08:00
Eugene Rakhmatulin
de055928b8 Update CHANGELOG: Document --nccl-debug option for NCCL debug level control 2025-12-18 23:29:03 -08:00
Eugene Rakhmatulin
294d155532 Add NCCL debug level option to launch-cluster.sh 2025-12-18 23:28:12 -08:00
Eugene Rakhmatulin
0377e9badf Bugfix: don't shut down on exit if cluster is already running 2025-12-18 23:12:39 -08:00
Eugene Rakhmatulin
2a2f8f24e2 Allow launch-cluster.sh to be executed in non-TTY environment 2025-12-18 23:02:58 -08:00
Eugene Rakhmatulin
8c53179cc2 changed extra docker args variable to VLLM_SPARK_EXTRA_DOCKER_ARGS for consistency 2025-12-18 22:27:27 -08:00
Eugene Rakhmatulin
cf937af897 Merge pull request #6 2025-12-18 22:17:12 -08:00
Eugene Rakhmatulin
cf9da89545 Updated README 2025-12-18 22:03:46 -08:00
Eugene Rakhmatulin
8a0cb3c853 Merge branch 'main' into pr-6 2025-12-18 22:02:13 -08:00
Eugene Rakhmatulin
442f7369ad Updated build script to handle BUILD_JOBS argument 2025-12-18 22:02:04 -08:00
Eugene Rakhmatulin
e6efd668cd Added Table of Contents to README 2025-12-18 15:43:09 -08:00
Eugene Rakhmatulin
8be691e806 Fixed issue with argument passing 2025-12-18 15:31:53 -08:00
Eugene Rakhmatulin
369283f655 Updated README.md with launch-cluster details. 2025-12-18 15:25:22 -08:00
Eugene Rakhmatulin
db5c443905 Enhance launch-cluster script with improved node detection and SSH scanning using netcat and Python 2025-12-18 14:52:23 -08:00
Eugene Rakhmatulin
6c04ebfca1 Refactor launch-cluster script to include cluster running checks and streamline start process for head and worker nodes 2025-12-18 14:50:26 -08:00
Eugene Rakhmatulin
f7a15bfaf5 Enhance launch-cluster script with improved SSH connectivity checks for worker nodes 2025-12-18 14:22:48 -08:00
Eugene Rakhmatulin
25b1d8eb4f Enhance launch-cluster script with auto-detection for interfaces and nodes 2025-12-18 13:53:28 -08:00
Eugene Rakhmatulin
a1ed352635 renamed launch-cluster for consistency 2025-12-18 13:11:48 -08:00
Eugene Rakhmatulin
20a6699bf7 Add launch_cluster script for managing cluster nodes and actions 2025-12-18 13:11:13 -08:00
Eugene Rakhmatulin
1025243316 Added launch_cluster script to simplify launching cluster on nodes. 2025-12-18 13:10:57 -08:00
Christopher Owen
a13a9f6806 Limit build parallelism to reduce OOM situations 2025-12-18 13:36:35 +01:00
Eric Lewis
11355677f6 Add parallel copy option to build-and-copy.sh
Introduced the --copy-parallel flag to enable concurrent copying of Docker images to multiple hosts. Updated the README with usage instructions and details about the new option. Refactored the script to support both serial and parallel copy modes for improved efficiency.
2025-12-18 01:24:48 -05:00
Eric Lewis
e67abd5e6e Add multi-host copy support to build-and-copy.sh
Updated build-and-copy.sh to support copying Docker images to multiple hosts using the new -c/--copy-to flag, which accepts space- or comma-separated host lists. The old --copy-to-host flag is retained as an alias for backward compatibility, and -h is now used for help. The README was updated to document these changes and provide new usage examples.
2025-12-18 00:32:45 -05:00
Eugene Rakhmatulin
e0f6cff132 Merge pull request #1 2025-12-16 21:32:42 -08:00
TeskaLabs Admin
f1abfb85b6 Bump of the version 2025-12-16 17:58:48 +00:00
Eugene Rakhmatulin
79f6a204d1 Update README.md 2025-12-15 09:51:49 -08:00
Eugene Rakhmatulin
0606b1b984 Refactor Triton and vLLM reference handling in Dockerfile and build script 2025-12-14 23:28:08 -08:00
eugr
4551795908 Fixed missing Infiniband dependency, added CuDNN 2025-12-14 21:49:50 -08:00
eugr
33720fc9d6 Use no-build-isolation for Triton Kernels build 2025-12-14 18:35:26 -08:00
eugr
dc614dc6ae Separated Triton build into a dedicated phase for better caching 2025-12-14 10:32:28 -08:00
eugr
25f759fec8 Optimized triton caching 2025-12-14 09:26:10 -08:00
eugr
02f842e1fd Updated README 2025-12-14 00:39:15 -08:00
eugr
e8a12da072 Build triton from source; add TRITON_SHA argument to specify triton release, and add timing statistics 2025-12-14 00:30:50 -08:00
eugr
a8217a1fd8 Improved dependency handling 2025-12-13 22:41:30 -08:00
eugr
cc3e73feb1 Improved caching 2025-12-13 21:34:57 -08:00
eugr
76a8e92c86 Multistage build with caching 2025-12-13 21:18:26 -08:00
eugr
295e1f2266 Removed MiniMax M2 temporary patch from Dockerfile; updated README.md 2025-12-11 13:24:57 -08:00
eugr
37c12cf9e4 Removed MiniMax M2 patch since the fix is merged into main 2025-12-11 13:23:30 -08:00
eugr
5fba205db4 Implemented a temporary patch for recently broken MiniMax-M2 (in builds after 12/10) for some quants. 2025-12-11 11:13:05 -08:00