Eugene Rakhmatulin
8ddc259619
Fixed #111
2026-03-18 22:03:04 -07:00
Eugene Rakhmatulin
7e4150feed
Added master-port argument
2026-03-18 16:57:55 -07:00
Eugene Rakhmatulin
f327b92abe
Fixes #106 and #108
2026-03-18 13:06:44 -07:00
Eugene Rakhmatulin
fa645f3e4b
bugfixes
2026-03-13 13:39:30 -07:00
Eugene Rakhmatulin
dedbd0a01d
bugfixes
2026-03-13 12:41:48 -07:00
Eugene Rakhmatulin
caa83d9e5b
Bugfixes
2026-03-13 12:32:43 -07:00
Eugene Rakhmatulin
4bcbbaa25a
Bugfixes
2026-03-13 12:23:41 -07:00
Eugene Rakhmatulin
d08266a123
Bugfixes
2026-03-13 12:18:22 -07:00
Eugene Rakhmatulin
03b055d7f0
Major cluster orchestration refactoring to support running without Ray
2026-03-13 11:55:18 -07:00
L.B.R.
50b3ca60f3
Fix shell quoting for exec command arguments
Arguments with special characters (e.g. JSON strings) were passed
unquoted, causing breakage for commands like:
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
Use printf %q in launch-cluster.sh and shlex.quote() in run-recipe.py
to properly escape arguments.
2026-03-04 15:22:42 +00:00
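The quoting fix described in the commit above can be illustrated with a minimal sketch. This is not the code from run-recipe.py; the helper name `build_exec_command` and the `echo` placeholder command are hypothetical, and only `shlex.quote()` and the example JSON argument come from the commit message. It shows why an unquoted JSON string breaks when joined into a shell command line, and how quoting each argument preserves it.

```python
import shlex


def build_exec_command(args):
    """Join argv-style arguments into one shell-safe command string.

    Hypothetical helper for illustration; each argument is escaped
    with shlex.quote() so the shell sees it as a single word.
    """
    return " ".join(shlex.quote(arg) for arg in args)


# The JSON argument quoted in the commit message.
spec = '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
args = ["echo", "--speculative-config", spec]  # "echo" is a placeholder command

# Naive joining: the shell would strip the double quotes inside the JSON.
naive = " ".join(args)

# Quoted joining: the JSON survives a round trip through shell parsing.
quoted = build_exec_command(args)

print(shlex.split(naive) == args)   # False: the JSON was mangled
print(shlex.split(quoted) == args)  # True: the arguments are intact
```

On the shell side, the commit mentions `printf %q` in launch-cluster.sh for the same purpose; the sketch above covers only the Python half.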
Eugene Rakhmatulin
df88997449
Pipe exec command to docker logs when running in daemon mode.
2026-02-26 18:19:38 -08:00
Eugene Rakhmatulin
c1c3b9d66a
support for daemon mode with exec command
2026-02-26 15:23:08 -08:00
Eugene Rakhmatulin
e9aa411e6c
Merge branch 'main' into pr-62
2026-02-26 14:57:32 -08:00
Eugene Rakhmatulin
1c853b725e
Allow using $HF_HOME as the Hugging Face cache directory, closes #68
2026-02-25 16:38:04 -08:00
Drew Botwinick
a276a76be2
support daemon mode for ACTION == exec
2026-02-23 23:12:52 -06:00
Eugene Rakhmatulin
f886505436
Added --non-privileged flag to launch-cluster.sh
2026-02-15 00:12:06 -08:00
Eugene Rakhmatulin
6d3f5dfd5c
map flashinfer/torch/triton cache directories by default
2026-02-10 16:36:02 -08:00
Eugene Rakhmatulin
ef6a5eca29
Merge branch 'main' into pr-19
2026-02-04 11:36:59 -08:00
Eugene Rakhmatulin
f7830636af
Cleaning up launch-cluster changes
2026-02-04 11:36:55 -08:00
Raphael Amorim
28ba6090fc
Adding suggestions from Eugr and unit tests
2026-02-03 17:32:59 -05:00
Eugene Rakhmatulin
4b9ab0de7c
Added ability to launch NGC container in the cluster
2026-02-02 16:57:04 -08:00
Raphael Amorim
751bc5a47a
Adding sample profile and profile loader
2026-02-02 10:25:53 -05:00
Eugene Rakhmatulin
4a4b4e7610
Fixed a bug where solo mode failed on a standalone Spark without configured RoCE.
2026-01-30 16:39:11 -08:00
Eugene Rakhmatulin
ace61c2d55
added new mod for glm4.7-flash-awq, solo model support.
2026-01-29 18:18:00 -08:00
Eugene Rakhmatulin
7a81e90cd2
added -e parameter
2026-01-29 13:06:22 -08:00
Eugene Rakhmatulin
9ad61078ce
Added multiple mods support
2025-12-23 17:45:55 -08:00
Eugene Rakhmatulin
c90a6d0bde
Fixed remote docker execution
2025-12-23 13:49:38 -08:00
Eugene Rakhmatulin
19dec79c5c
initial mod implementation
2025-12-23 13:38:10 -08:00
Eugene Rakhmatulin
1464b0dc8f
Display image name in launch-cluster.sh output
2025-12-21 22:44:01 -08:00
Eugene Rakhmatulin
f075801c59
Fixed launch_cluster bug introduced by refactoring
2025-12-19 10:51:50 -08:00
Eugene Rakhmatulin
a351f182cc
Implement autodiscovery for copy hosts and enhance interface detection in build-and-copy and launch-cluster scripts
2025-12-19 10:36:39 -08:00
Eugene Rakhmatulin
294d155532
Add NCCL debug level option to launch-cluster.sh
2025-12-18 23:28:12 -08:00
Eugene Rakhmatulin
0377e9badf
Bugfix: don't shut down on exit if cluster is already running
2025-12-18 23:12:39 -08:00
Eugene Rakhmatulin
2a2f8f24e2
Allow launch-cluster.sh to be executed in non-TTY environment
2025-12-18 23:02:58 -08:00
Eugene Rakhmatulin
8c53179cc2
changed extra docker args variable to VLLM_SPARK_EXTRA_DOCKER_ARGS for consistency
2025-12-18 22:27:27 -08:00
Eugene Rakhmatulin
8be691e806
Fixed issue with argument passing
2025-12-18 15:31:53 -08:00
Eugene Rakhmatulin
369283f655
Updated README.md with launch-cluster details.
2025-12-18 15:25:22 -08:00
Eugene Rakhmatulin
db5c443905
Enhance launch-cluster script with improved node detection and SSH scanning using netcat and Python
2025-12-18 14:52:23 -08:00
Eugene Rakhmatulin
6c04ebfca1
Refactor launch-cluster script to include cluster running checks and streamline start process for head and worker nodes
2025-12-18 14:50:26 -08:00
Eugene Rakhmatulin
f7a15bfaf5
Enhance launch-cluster script with improved SSH connectivity checks for worker nodes
2025-12-18 14:22:48 -08:00
Eugene Rakhmatulin
25b1d8eb4f
Enhance launch-cluster script with auto-detection for interfaces and nodes
2025-12-18 13:53:28 -08:00
Eugene Rakhmatulin
a1ed352635
renamed launch-cluster for consistency
2025-12-18 13:11:48 -08:00