33 Commits

Author SHA1 Message Date
L.B.R.
50b3ca60f3 Fix shell quoting for exec command arguments
Arguments with special characters (e.g. JSON strings) were passed
unquoted, causing breakage for commands like:
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Use printf %q in launch-cluster.sh and shlex.quote() in run-recipe.py
to properly escape arguments.
2026-03-04 15:22:42 +00:00
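
The quoting fix described in the commit above can be illustrated with a minimal sketch. It assumes the exec command arrives as a list of argument strings and is later joined into a single shell command line; the helper name `build_exec_command` and the `vllm serve` prefix are hypothetical and are not taken from run-recipe.py.

```python
import shlex


def build_exec_command(args):
    """Join argv-style arguments into a single shell-safe command string.

    Hypothetical helper for illustration only; the real run-recipe.py
    may structure this differently.
    """
    # shlex.quote() leaves simple tokens alone and wraps anything that
    # contains shell-special characters (braces, double quotes, spaces)
    # in single quotes.
    return " ".join(shlex.quote(arg) for arg in args)


# Illustrative arguments, using the JSON value from the commit message.
args = [
    "vllm",
    "serve",
    "--speculative-config",
    '{"method":"qwen3_next_mtp","num_speculative_tokens":2}',
]

command = build_exec_command(args)
print(command)

# Splitting the quoted string with shell rules recovers the original
# arguments, so the JSON value survives as one intact argument.
assert shlex.split(command) == args
```

On the Bash side the commit uses `printf %q` for the same purpose. The round-trip check through `shlex.split()` is a quick way to confirm that no argument was split or altered by the quoting.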
Eugene Rakhmatulin
df88997449 piping exec command to docker logs when running in the daemon mode. 2026-02-26 18:19:38 -08:00
Eugene Rakhmatulin
c1c3b9d66a support for daemon mode with exec command 2026-02-26 15:23:08 -08:00
Eugene Rakhmatulin
e9aa411e6c Merge branch 'main' into pr-62 2026-02-26 14:57:32 -08:00
Eugene Rakhmatulin
1c853b725e allows using $HF_HOME as the Hugging Face cache directory, closes #68 2026-02-25 16:38:04 -08:00
Drew Botwinick
a276a76be2 support daemon mode for ACTION == exec 2026-02-23 23:12:52 -06:00
Eugene Rakhmatulin
f886505436 Added --non-privileged flag to launch-cluster.sh 2026-02-15 00:12:06 -08:00
Eugene Rakhmatulin
6d3f5dfd5c map flashinfer/torch/triton cache directories by default 2026-02-10 16:36:02 -08:00
Eugene Rakhmatulin
ef6a5eca29 Merge branch 'main' into pr-19 2026-02-04 11:36:59 -08:00
Eugene Rakhmatulin
f7830636af Cleaning up launch-cluster changes 2026-02-04 11:36:55 -08:00
Raphael Amorim
28ba6090fc Adding suggestions from Eugr and unit tests 2026-02-03 17:32:59 -05:00
Eugene Rakhmatulin
4b9ab0de7c Added ability to launch NGC container in the cluster 2026-02-02 16:57:04 -08:00
Raphael Amorim
751bc5a47a Adding sample profile and profile loader 2026-02-02 10:25:53 -05:00
Eugene Rakhmatulin
4a4b4e7610 Fixed a bug when solo mode failed on a standalone Spark without configured RoCE. 2026-01-30 16:39:11 -08:00
Eugene Rakhmatulin
ace61c2d55 added new mod for glm4.7-flash-awq, solo model support. 2026-01-29 18:18:00 -08:00
Eugene Rakhmatulin
7a81e90cd2 added -e parameter 2026-01-29 13:06:22 -08:00
Eugene Rakhmatulin
9ad61078ce Added multiple mods support 2025-12-23 17:45:55 -08:00
Eugene Rakhmatulin
c90a6d0bde Fixed remote docker execution 2025-12-23 13:49:38 -08:00
Eugene Rakhmatulin
19dec79c5c initial mod implementation 2025-12-23 13:38:10 -08:00
Eugene Rakhmatulin
1464b0dc8f Display image name in launch-cluster.sh output 2025-12-21 22:44:01 -08:00
Eugene Rakhmatulin
f075801c59 Fixed launch_cluster bug introduced by refactoring 2025-12-19 10:51:50 -08:00
Eugene Rakhmatulin
a351f182cc Implement autodiscovery for copy hosts and enhance interface detection in build-and-copy and launch-cluster scripts 2025-12-19 10:36:39 -08:00
Eugene Rakhmatulin
294d155532 Add NCCL debug level option to launch-cluster.sh 2025-12-18 23:28:12 -08:00
Eugene Rakhmatulin
0377e9badf Bugfix: don't shut down on exit if cluster is already running 2025-12-18 23:12:39 -08:00
Eugene Rakhmatulin
2a2f8f24e2 Allow launch-cluster.sh to be executed in non-TTY environment 2025-12-18 23:02:58 -08:00
Eugene Rakhmatulin
8c53179cc2 changed extra docker args variable to VLLM_SPARK_EXTRA_DOCKER_ARGS for consistency 2025-12-18 22:27:27 -08:00
Eugene Rakhmatulin
8be691e806 Fixed issue with argument passing 2025-12-18 15:31:53 -08:00
Eugene Rakhmatulin
369283f655 Updated README.md with launch-cluster details. 2025-12-18 15:25:22 -08:00
Eugene Rakhmatulin
db5c443905 Enhance launch-cluster script with improved node detection and SSH scanning using netcat and Python 2025-12-18 14:52:23 -08:00
Eugene Rakhmatulin
6c04ebfca1 Refactor launch-cluster script to include cluster running checks and streamline start process for head and worker nodes 2025-12-18 14:50:26 -08:00
Eugene Rakhmatulin
f7a15bfaf5 Enhance launch-cluster script with improved SSH connectivity checks for worker nodes 2025-12-18 14:22:48 -08:00
Eugene Rakhmatulin
25b1d8eb4f Enhance launch-cluster script with auto-detection for interfaces and nodes 2025-12-18 13:53:28 -08:00
Eugene Rakhmatulin
a1ed352635 renamed launch-cluster for consistency 2025-12-18 13:11:48 -08:00