Commit Graph

28 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
1c853b725e allows to use $HF_HOME as huggingface cache directory, closes #68 2026-02-25 16:38:04 -08:00
Eugene Rakhmatulin
f886505436 Added --non-privileged flag to launch-cluster.sh 2026-02-15 00:12:06 -08:00
Eugene Rakhmatulin
6d3f5dfd5c map flashinfer/torch/triton cache directories by default 2026-02-10 16:36:02 -08:00
Eugene Rakhmatulin
ef6a5eca29 Merge branch 'main' into pr-19 2026-02-04 11:36:59 -08:00
Eugene Rakhmatulin
f7830636af Cleaning up launch-cluster changes 2026-02-04 11:36:55 -08:00
Raphael Amorim
28ba6090fc Adding suggestions from Eugr and unit tests 2026-02-03 17:32:59 -05:00
Eugene Rakhmatulin
4b9ab0de7c Added ability to launch NGC container in the cluster 2026-02-02 16:57:04 -08:00
Raphael Amorim
751bc5a47a Adding sample profile and profile loader 2026-02-02 10:25:53 -05:00
Eugene Rakhmatulin
4a4b4e7610 Fixed a bug when solo mode failed on a standalone Spark without configured RoCE. 2026-01-30 16:39:11 -08:00
Eugene Rakhmatulin
ace61c2d55 added new mod for glm4.7-flash-awq, solo model support. 2026-01-29 18:18:00 -08:00
Eugene Rakhmatulin
7a81e90cd2 added -e parameter 2026-01-29 13:06:22 -08:00
Eugene Rakhmatulin
9ad61078ce Added multiple mods support 2025-12-23 17:45:55 -08:00
Eugene Rakhmatulin
c90a6d0bde Fixed remote docker execution 2025-12-23 13:49:38 -08:00
Eugene Rakhmatulin
19dec79c5c initial mod implementation 2025-12-23 13:38:10 -08:00
Eugene Rakhmatulin
1464b0dc8f Display image name in launch-cluster.sh output 2025-12-21 22:44:01 -08:00
Eugene Rakhmatulin
f075801c59 Fixed launch_cluster bug introduced by refactoring 2025-12-19 10:51:50 -08:00
Eugene Rakhmatulin
a351f182cc Implement autodiscovery for copy hosts and enhance interface detection in build-and-copy and launch-cluster scripts 2025-12-19 10:36:39 -08:00
Eugene Rakhmatulin
294d155532 Add NCCL debug level option to launch-cluster.sh 2025-12-18 23:28:12 -08:00
Eugene Rakhmatulin
0377e9badf Bugfix: don't shut down on exit if cluster is already running 2025-12-18 23:12:39 -08:00
Eugene Rakhmatulin
2a2f8f24e2 Allow launch-cluster.sh to be executed in non-TTY environment 2025-12-18 23:02:58 -08:00
Eugene Rakhmatulin
8c53179cc2 changed extra docker args variable to VLLM_SPARK_EXTRA_DOCKER_ARGS for consistency 2025-12-18 22:27:27 -08:00
Eugene Rakhmatulin
8be691e806 Fixed issue with argument passing 2025-12-18 15:31:53 -08:00
Eugene Rakhmatulin
369283f655 Updated README.md with launch-cluster details. 2025-12-18 15:25:22 -08:00
Eugene Rakhmatulin
db5c443905 Enhance launch-cluster script with improved node detection and SSH scanning using netcat and Python 2025-12-18 14:52:23 -08:00
Eugene Rakhmatulin
6c04ebfca1 Refactor launch-cluster script to include cluster running checks and streamline start process for head and worker nodes 2025-12-18 14:50:26 -08:00
Eugene Rakhmatulin
f7a15bfaf5 Enhance launch-cluster script with improved SSH connectivity checks for worker nodes 2025-12-18 14:22:48 -08:00
Eugene Rakhmatulin
25b1d8eb4f Enhance launch-cluster script with auto-detection for interfaces and nodes 2025-12-18 13:53:28 -08:00
Eugene Rakhmatulin
a1ed352635 renamed launch-cluster for consistency 2025-12-18 13:11:48 -08:00