57 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
6b7f8dace6 Fixes #187 2026-04-15 22:32:14 -07:00
Eugene Rakhmatulin
15a04ada32 Bug fixes 2026-03-31 16:20:23 -07:00
Eugene Rakhmatulin
48318380f9 Bugfix 2026-03-31 13:41:35 -07:00
Eugene Rakhmatulin
287d3c72e5 Fix for forced autodiscovery 2026-03-31 13:34:59 -07:00
Eugene Rakhmatulin
9370b2bb34 Don't start the cluster if only --setup/--discover is specified 2026-03-31 13:29:56 -07:00
Eugene Rakhmatulin
c8ee2a2511 Perform node count check in any mode 2026-03-26 18:15:09 -07:00
Eugene Rakhmatulin
ce293b5f05 Additional checks for parallelism and cluster size 2026-03-26 17:52:47 -07:00
Eugene Rakhmatulin
a78e221de3 Autodiscovery refactoring with mesh support 2026-03-26 15:47:41 -07:00
Eugene Rakhmatulin
83a74bccec Removed extra solo mode check 2026-03-26 07:45:23 -07:00
Eugene Rakhmatulin
c2fe579ccc Enhance .env file handling and validation in scripts 2026-03-25 23:16:56 -07:00
Eugene Rakhmatulin
429042b7dc Revert "Added --cleanup option"
This reverts commit b8930b05a1.
2026-03-25 15:35:15 -07:00
Eugene Rakhmatulin
b8930b05a1 Added --cleanup option 2026-03-25 15:24:59 -07:00
Eugene Rakhmatulin
1755dfd114 Added LOCAL_IP support 2026-03-25 15:16:06 -07:00
Eugene Rakhmatulin
07fac71dac Fixed bug with CONTAINER_NAME variable 2026-03-25 14:42:01 -07:00
Eugene Rakhmatulin
ad2cd3373f .env configuration support for launch-cluster.sh 2026-03-25 14:18:00 -07:00
Eugene Rakhmatulin
8ddc259619 Fixed #111 2026-03-18 22:03:04 -07:00
Eugene Rakhmatulin
7e4150feed Added master-port argument 2026-03-18 16:57:55 -07:00
Eugene Rakhmatulin
f327b92abe Fixes #106 and #108 2026-03-18 13:06:44 -07:00
Eugene Rakhmatulin
fa645f3e4b bugfixes 2026-03-13 13:39:30 -07:00
Eugene Rakhmatulin
dedbd0a01d bugfixes 2026-03-13 12:41:48 -07:00
Eugene Rakhmatulin
caa83d9e5b Bugfixes 2026-03-13 12:32:43 -07:00
Eugene Rakhmatulin
4bcbbaa25a Bugfixes 2026-03-13 12:23:41 -07:00
Eugene Rakhmatulin
d08266a123 Bugfixes 2026-03-13 12:18:22 -07:00
Eugene Rakhmatulin
03b055d7f0 Major cluster orchestration refactoring to support running without Ray 2026-03-13 11:55:18 -07:00
L.B.R.
50b3ca60f3 Fix shell quoting for exec command arguments
Arguments with special characters (e.g. JSON strings) were passed
unquoted, causing breakage for commands like:
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Use printf %q in launch-cluster.sh and shlex.quote() in run-recipe.py
to properly escape arguments.
2026-03-04 15:22:42 +00:00
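The quoting problem described in commit 50b3ca60f3 can be illustrated with a minimal Python sketch. This is not the code from run-recipe.py: the helper name `build_exec_command` is hypothetical, and `shlex.split` is used here only as a stand-in for POSIX shell word splitting. The argument values are taken from the commit message.

```python
import shlex


def build_exec_command(args):
    """Join arguments into one shell-safe string (hypothetical helper)."""
    return " ".join(shlex.quote(arg) for arg in args)


# Arguments from the commit message: a flag followed by a JSON string.
args = [
    "--speculative-config",
    '{"method":"qwen3_next_mtp","num_speculative_tokens":2}',
]

# Naive join: the JSON's double quotes are consumed by shell parsing.
unquoted = " ".join(args)

# Quoted join: each argument survives shell parsing unchanged.
quoted = build_exec_command(args)

# shlex.split approximates how a POSIX shell would split each string.
print(shlex.split(unquoted))
print(shlex.split(quoted))
```

With the naive join, the shell strips the double quotes inside the JSON, so the receiving command no longer sees valid JSON. With `shlex.quote()`, the string splits back into the original two arguments. `printf %q` plays the equivalent role on the Bash side.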
Eugene Rakhmatulin
df88997449 piping exec command to docker logs when running in the daemon mode. 2026-02-26 18:19:38 -08:00
Eugene Rakhmatulin
c1c3b9d66a support for daemon mode with exec command 2026-02-26 15:23:08 -08:00
Eugene Rakhmatulin
e9aa411e6c Merge branch 'main' into pr-62 2026-02-26 14:57:32 -08:00
Eugene Rakhmatulin
1c853b725e allows to use $HF_HOME as huggingface cache directory, closes #68 2026-02-25 16:38:04 -08:00
Drew Botwinick
a276a76be2 support daemon mode for ACTION == exec 2026-02-23 23:12:52 -06:00
Eugene Rakhmatulin
f886505436 Added --non-privileged flag to launch-cluster.sh 2026-02-15 00:12:06 -08:00
Eugene Rakhmatulin
6d3f5dfd5c map flashinfer/torch/triton cache directories by default 2026-02-10 16:36:02 -08:00
Eugene Rakhmatulin
ef6a5eca29 Merge branch 'main' into pr-19 2026-02-04 11:36:59 -08:00
Eugene Rakhmatulin
f7830636af Cleaning up launch-cluster changes 2026-02-04 11:36:55 -08:00
Raphael Amorim
28ba6090fc Adding suggestions from Eugr and unit tests 2026-02-03 17:32:59 -05:00
Eugene Rakhmatulin
4b9ab0de7c Added ability to launch NGC container in the cluster 2026-02-02 16:57:04 -08:00
Raphael Amorim
751bc5a47a Adding sample profile and profile loader 2026-02-02 10:25:53 -05:00
Eugene Rakhmatulin
4a4b4e7610 Fixed a bug when solo mode failed on a standalone Spark without configured RoCE. 2026-01-30 16:39:11 -08:00
Eugene Rakhmatulin
ace61c2d55 added new mod for glm4.7-flash-awq, solo model support. 2026-01-29 18:18:00 -08:00
Eugene Rakhmatulin
7a81e90cd2 added -e parameter 2026-01-29 13:06:22 -08:00
Eugene Rakhmatulin
9ad61078ce Added multiple mods support 2025-12-23 17:45:55 -08:00
Eugene Rakhmatulin
c90a6d0bde Fixed remote docker execution 2025-12-23 13:49:38 -08:00
Eugene Rakhmatulin
19dec79c5c initial mod implementation 2025-12-23 13:38:10 -08:00
Eugene Rakhmatulin
1464b0dc8f Display image name in launch-cluster.sh output 2025-12-21 22:44:01 -08:00
Eugene Rakhmatulin
f075801c59 Fixed launch_cluster bug introduced by refactoring 2025-12-19 10:51:50 -08:00
Eugene Rakhmatulin
a351f182cc Implement autodiscovery for copy hosts and enhance interface detection in build-and-copy and launch-cluster scripts 2025-12-19 10:36:39 -08:00
Eugene Rakhmatulin
294d155532 Add NCCL debug level option to launch-cluster.sh 2025-12-18 23:28:12 -08:00
Eugene Rakhmatulin
0377e9badf Bugfix: don't shut down on exit if cluster is already running 2025-12-18 23:12:39 -08:00
Eugene Rakhmatulin
2a2f8f24e2 Allow launch-cluster.sh to be executed in non-TTY environment 2025-12-18 23:02:58 -08:00
Eugene Rakhmatulin
8c53179cc2 changed extra docker args variable to VLLM_SPARK_EXTRA_DOCKER_ARGS for consistency 2025-12-18 22:27:27 -08:00