Eugene Rakhmatulin
|
1c853b725e
|
allow using $HF_HOME as the Hugging Face cache directory, closes #68
|
2026-02-25 16:38:04 -08:00 |
|
Eugene Rakhmatulin
|
f886505436
|
Added --non-privileged flag to launch-cluster.sh
|
2026-02-15 00:12:06 -08:00 |
|
Eugene Rakhmatulin
|
6d3f5dfd5c
|
map flashinfer/torch/triton cache directories by default
|
2026-02-10 16:36:02 -08:00 |
|
Eugene Rakhmatulin
|
ef6a5eca29
|
Merge branch 'main' into pr-19
|
2026-02-04 11:36:59 -08:00 |
|
Eugene Rakhmatulin
|
f7830636af
|
Cleaning up launch-cluster changes
|
2026-02-04 11:36:55 -08:00 |
|
Raphael Amorim
|
28ba6090fc
|
Adding suggestions from Eugr and unit tests
|
2026-02-03 17:32:59 -05:00 |
|
Eugene Rakhmatulin
|
4b9ab0de7c
|
Added ability to launch NGC container in the cluster
|
2026-02-02 16:57:04 -08:00 |
|
Raphael Amorim
|
751bc5a47a
|
Adding sample profile and profile loader
|
2026-02-02 10:25:53 -05:00 |
|
Eugene Rakhmatulin
|
4a4b4e7610
|
Fixed a bug when solo mode failed on a standalone Spark without configured RoCE.
|
2026-01-30 16:39:11 -08:00 |
|
Eugene Rakhmatulin
|
ace61c2d55
|
added new mod for glm4.7-flash-awq, solo model support.
|
2026-01-29 18:18:00 -08:00 |
|
Eugene Rakhmatulin
|
7a81e90cd2
|
added -e parameter
|
2026-01-29 13:06:22 -08:00 |
|
Eugene Rakhmatulin
|
9ad61078ce
|
Added multiple mods support
|
2025-12-23 17:45:55 -08:00 |
|
Eugene Rakhmatulin
|
c90a6d0bde
|
Fixed remote docker execution
|
2025-12-23 13:49:38 -08:00 |
|
Eugene Rakhmatulin
|
19dec79c5c
|
initial mod implementation
|
2025-12-23 13:38:10 -08:00 |
|
Eugene Rakhmatulin
|
1464b0dc8f
|
Display image name in launch-cluster.sh output
|
2025-12-21 22:44:01 -08:00 |
|
Eugene Rakhmatulin
|
f075801c59
|
Fixed launch_cluster bug introduced by refactoring
|
2025-12-19 10:51:50 -08:00 |
|
Eugene Rakhmatulin
|
a351f182cc
|
Implement autodiscovery for copy hosts and enhance interface detection in build-and-copy and launch-cluster scripts
|
2025-12-19 10:36:39 -08:00 |
|
Eugene Rakhmatulin
|
294d155532
|
Add NCCL debug level option to launch-cluster.sh
|
2025-12-18 23:28:12 -08:00 |
|
Eugene Rakhmatulin
|
0377e9badf
|
Bugfix: don't shut down on exit if cluster is already running
|
2025-12-18 23:12:39 -08:00 |
|
Eugene Rakhmatulin
|
2a2f8f24e2
|
Allow launch-cluster.sh to be executed in non-TTY environment
|
2025-12-18 23:02:58 -08:00 |
|
Eugene Rakhmatulin
|
8c53179cc2
|
changed extra docker args variable to VLLM_SPARK_EXTRA_DOCKER_ARGS for consistency
|
2025-12-18 22:27:27 -08:00 |
|
Eugene Rakhmatulin
|
8be691e806
|
Fixed issue with argument passing
|
2025-12-18 15:31:53 -08:00 |
|
Eugene Rakhmatulin
|
369283f655
|
Updated README.md with launch-cluster details.
|
2025-12-18 15:25:22 -08:00 |
|
Eugene Rakhmatulin
|
db5c443905
|
Enhance launch-cluster script with improved node detection and SSH scanning using netcat and Python
|
2025-12-18 14:52:23 -08:00 |
|
Eugene Rakhmatulin
|
6c04ebfca1
|
Refactor launch-cluster script to include cluster running checks and streamline start process for head and worker nodes
|
2025-12-18 14:50:26 -08:00 |
|
Eugene Rakhmatulin
|
f7a15bfaf5
|
Enhance launch-cluster script with improved SSH connectivity checks for worker nodes
|
2025-12-18 14:22:48 -08:00 |
|
Eugene Rakhmatulin
|
25b1d8eb4f
|
Enhance launch-cluster script with auto-detection for interfaces and nodes
|
2025-12-18 13:53:28 -08:00 |
|
Eugene Rakhmatulin
|
a1ed352635
|
renamed launch-cluster for consistency
|
2025-12-18 13:11:48 -08:00 |
|