Commit Graph

345 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
287d3c72e5 Fix for forced autodiscovery 2026-03-31 13:34:59 -07:00
Eugene Rakhmatulin
9370b2bb34 Don't start the cluster if only --setup/--discover is specified 2026-03-31 13:29:56 -07:00
Eugene Rakhmatulin
bb177383ff Bugfix in autodiscovery dedup 2026-03-31 12:46:15 -07:00
Eugene Rakhmatulin
7f0be29fcc Handle edge case when two sparks have both cables plugged and assigned IPs 2026-03-31 11:59:03 -07:00
Eugene Rakhmatulin
41c0ce2c9a Fixed FI PR 2026-03-30 14:25:42 -07:00
Eugene Rakhmatulin
45494688d1 Updated README, added NVFP4 fix 2026-03-30 11:45:40 -07:00
Eugene Rakhmatulin
a3201f8873 --flashinfer-ref / --apply-flashinfer-pr 2026-03-29 22:40:35 -07:00
Eugene Rakhmatulin
e471ca2436 Don't copy if -c is not specified 2026-03-28 18:12:32 -07:00
Eugene Rakhmatulin
32674c2619 removed temporary patch as it causes more issues. 2026-03-28 17:49:17 -07:00
Eugene Rakhmatulin
47f5f931b5 Allow to specify config file when doing setup 2026-03-28 14:55:31 -07:00
Eugene Rakhmatulin
d37217bad0 moved PR patch before the requirements patching 2026-03-28 09:22:19 -07:00
Eugene Rakhmatulin
e70c87b4f6 Added PR38423 (temp) 2026-03-28 08:50:54 -07:00
Eugene Rakhmatulin
c1a6cec074 Updated documentation; default image tags in build script 2026-03-27 16:41:09 -07:00
Eugene Rakhmatulin
51d69c5c17 commenting out non-applicable PRs 2026-03-27 16:15:54 -07:00
Eugene Rakhmatulin
101ae6fd56 Merge branch 'main' into 3-node-autodiscover 2026-03-27 09:02:10 -07:00
Eugene Rakhmatulin
f4ca15ce18 Made autoround mod optional to support latest version of vLLM. Fixes #144. 2026-03-27 09:00:50 -07:00
Eugene Rakhmatulin
3d918e0b82 Merge branch '3-node' into 3-node-autodiscover 2026-03-27 07:51:08 -07:00
eugr
47a896d722 Removed expert-parallel from 3x-node Qwen 2026-03-26 22:44:48 -07:00
Eugene Rakhmatulin
0fa585f909 Fix typo in pipeline_parallel setting in Qwen3.5-397B-INT4-Autoround recipe 2026-03-26 18:43:17 -07:00
Eugene Rakhmatulin
cecec74828 Add recipe for Qwen3.5-397B-INT4-Autoround in pipeline-parallel mode 2026-03-26 18:41:57 -07:00
Eugene Rakhmatulin
c8ee2a2511 Perform node count check in any mode 2026-03-26 18:15:09 -07:00
Eugene Rakhmatulin
ce293b5f05 Additional checks for parallelism and cluster size 2026-03-26 17:52:47 -07:00
Eugene Rakhmatulin
f872cc17a8 Fix for --setup behavior 2026-03-26 16:49:09 -07:00
Eugene Rakhmatulin
00c16746e5 Handle new copy hosts setup in run-recipe.py 2026-03-26 16:45:35 -07:00
Eugene Rakhmatulin
f163ca69de Autodiscover tweaks 2026-03-26 16:30:05 -07:00
Eugene Rakhmatulin
a78e221de3 Autodiscovery refactoring with mesh support 2026-03-26 15:47:41 -07:00
Eugene Rakhmatulin
e6ee108cdf Temporary patch for NVFP4 2026-03-26 11:43:44 -07:00
Eugene Rakhmatulin
174de6f0a8 temporary patch for PR38126 2026-03-26 08:58:04 -07:00
Eugene Rakhmatulin
83a74bccec Removed extra solo mode check 2026-03-26 07:45:23 -07:00
Eugene Rakhmatulin
ff18a9ad5b Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node 2026-03-25 23:38:44 -07:00
Eugene Rakhmatulin
c08b34a218 add --config passthrough to run-recipe 2026-03-25 23:35:52 -07:00
Eugene Rakhmatulin
23cca2a11a Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node 2026-03-25 23:17:25 -07:00
Eugene Rakhmatulin
c2fe579ccc Enhance .env file handling and validation in scripts 2026-03-25 23:16:56 -07:00
Eugene Rakhmatulin
8b7c02aa25 add .env support to build-and-copy.sh 2026-03-25 22:47:02 -07:00
Eugene Rakhmatulin
73fec1bdf8 bugfix 2026-03-25 15:40:09 -07:00
Eugene Rakhmatulin
2f5ff0211e Cleanup in build script 2026-03-25 15:39:23 -07:00
Eugene Rakhmatulin
63ee72e729 Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node 2026-03-25 15:36:31 -07:00
Eugene Rakhmatulin
4a0feea6c3 Added --cleanup option to build script 2026-03-25 15:35:32 -07:00
Eugene Rakhmatulin
429042b7dc Revert "Added --cleanup option" (reverts commit b8930b05a1) 2026-03-25 15:35:15 -07:00
Eugene Rakhmatulin
ef95336937 Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node 2026-03-25 15:25:19 -07:00
Eugene Rakhmatulin
b8930b05a1 Added --cleanup option 2026-03-25 15:24:59 -07:00
Eugene Rakhmatulin
49d505ad14 Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node 2026-03-25 15:16:47 -07:00
Eugene Rakhmatulin
1755dfd114 Added LOCAL_IP support 2026-03-25 15:16:06 -07:00
Eugene Rakhmatulin
3d4dc4c82e Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node 2026-03-25 14:42:37 -07:00
Eugene Rakhmatulin
07fac71dac Fixed bug with CONTAINER_NAME variable 2026-03-25 14:42:01 -07:00
Eugene Rakhmatulin
1702f47df6 Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node 2026-03-25 14:18:32 -07:00
Eugene Rakhmatulin
ad2cd3373f .env configuration support for launch-cluster.sh 2026-03-25 14:18:00 -07:00
Eugene Rakhmatulin
1fd8c7afc3 Merge branch 'main' into 3-node 2026-03-25 12:45:40 -07:00
Eugene Rakhmatulin
3dcd2a90c1 Updated Nemotron-3-Super recipe 2026-03-25 12:44:44 -07:00
Eugene Rakhmatulin
efacbd69f2 Updated Nemotron3-Super recipe 2026-03-25 12:43:12 -07:00