Eugene Rakhmatulin
44808f7018
Apply vLLM PR 35568
2026-04-02 17:13:54 -07:00
Eugene Rakhmatulin
12caec228e
switching gpt-oss-120b to solo only for now
2026-04-01 10:27:50 -07:00
Eugene Rakhmatulin
27eb35f08d
Fixed 4x qwen recipe
2026-04-01 10:09:01 -07:00
eugr
3335540972
Merge branch 'pr-152'
2026-04-01 08:59:01 -07:00
eugr
ae25d64ac0
Changed CUTLASS ref for mxfp4 build
2026-04-01 08:58:31 -07:00
Eugene Rakhmatulin
a770865834
Updated PRs to apply
2026-04-01 08:31:34 -07:00
Artyom
7b47235463
Pin nvidia-nvshmem-cu13 to <3.6 in Dockerfile.mxfp4
nvidia-nvshmem-cu13 3.6.5 (released Mar 24) introduced a breaking
change — nvshmemi_device_state_d was removed from NVSHMEM headers,
which breaks FlashInfer AOT compilation of nvshmem_binding.cu.
2026-04-01 07:38:53 +02:00
Eugene Rakhmatulin
3a3ab98b3e
Temporarily added PR2897 to Dockerfile
2026-03-31 22:06:08 -07:00
Eugene Rakhmatulin
23fb7dcc20
Merge branch '3-node-autodiscover'
2026-03-31 18:22:23 -07:00
Eugene Rakhmatulin
c4860b86a2
Updated README with 3-node support
2026-03-31 18:19:22 -07:00
Eugene Rakhmatulin
044557943c
Bugfixes
2026-03-31 17:49:17 -07:00
Eugene Rakhmatulin
ead749239d
Bugfix
2026-03-31 16:57:56 -07:00
Eugene Rakhmatulin
a889fed254
Updated README
2026-03-31 16:54:19 -07:00
Eugene Rakhmatulin
e89104d91b
Always rerun discovery when --discover is specified
2026-03-31 16:25:05 -07:00
Eugene Rakhmatulin
15a04ada32
Bug fixes
2026-03-31 16:20:23 -07:00
Eugene Rakhmatulin
a467a7a0bd
Updated README for 3-node
2026-03-31 13:47:04 -07:00
Eugene Rakhmatulin
48318380f9
Bugfix
2026-03-31 13:41:35 -07:00
Eugene Rakhmatulin
287d3c72e5
Fix for forced autodiscovery
2026-03-31 13:34:59 -07:00
Eugene Rakhmatulin
9370b2bb34
Don't start the cluster if only --setup/--discover is specified
2026-03-31 13:29:56 -07:00
Eugene Rakhmatulin
bb177383ff
Bugfix in autodiscovery dedup
2026-03-31 12:46:15 -07:00
Eugene Rakhmatulin
7f0be29fcc
Handle edge case when two sparks have both cables plugged and assigned IPs
2026-03-31 11:59:03 -07:00
Eugene Rakhmatulin
41c0ce2c9a
Fixed FI PR
2026-03-30 14:25:42 -07:00
Eugene Rakhmatulin
45494688d1
Updated README, added NVFP4 fix
2026-03-30 11:45:40 -07:00
Eugene Rakhmatulin
a3201f8873
--flashinfer-ref / --apply-flashinfer-pr
2026-03-29 22:40:35 -07:00
Eugene Rakhmatulin
e471ca2436
Don't copy if -c is not specified
2026-03-28 18:12:32 -07:00
Eugene Rakhmatulin
32674c2619
removed temporary patch as it causes more issues.
2026-03-28 17:49:17 -07:00
Eugene Rakhmatulin
47f5f931b5
Allow specifying a config file when doing setup
2026-03-28 14:55:31 -07:00
Eugene Rakhmatulin
d37217bad0
moved PR patch before the requirements patching
2026-03-28 09:22:19 -07:00
Eugene Rakhmatulin
e70c87b4f6
Added PR38423 (temp)
2026-03-28 08:50:54 -07:00
Eugene Rakhmatulin
c1a6cec074
Updated documentation; default image tags in build script
2026-03-27 16:41:09 -07:00
Eugene Rakhmatulin
51d69c5c17
commenting out non-applicable PRs
2026-03-27 16:15:54 -07:00
Eugene Rakhmatulin
e7f2ee692f
Added temporary patch to apply PR38126 that fixes broken NVFP4 quants
2026-03-27 09:30:26 -07:00
Eugene Rakhmatulin
101ae6fd56
Merge branch 'main' into 3-node-autodiscover
2026-03-27 09:02:10 -07:00
Eugene Rakhmatulin
f4ca15ce18
Made autoround mod optional to support latest version of vLLM. Fixes #144.
2026-03-27 09:00:50 -07:00
Eugene Rakhmatulin
3d918e0b82
Merge branch '3-node' into 3-node-autodiscover
2026-03-27 07:51:08 -07:00
eugr
47a896d722
Removed expert-parallel from 3x-node Qwen
2026-03-26 22:44:48 -07:00
Eugene Rakhmatulin
0fa585f909
Fix typo in pipeline_parallel setting in Qwen3.5-397B-INT4-Autoround recipe
2026-03-26 18:43:17 -07:00
Eugene Rakhmatulin
cecec74828
Add recipe for Qwen3.5-397B-INT4-Autoround in pipeline-parallel mode
2026-03-26 18:41:57 -07:00
Eugene Rakhmatulin
c8ee2a2511
Perform node count check in any mode
2026-03-26 18:15:09 -07:00
Eugene Rakhmatulin
ce293b5f05
Additional checks for parallelism and cluster size
2026-03-26 17:52:47 -07:00
Eugene Rakhmatulin
f872cc17a8
Fix for --setup behavior
2026-03-26 16:49:09 -07:00
Eugene Rakhmatulin
00c16746e5
Handle new copy hosts setup in run-recipe.py
2026-03-26 16:45:35 -07:00
Eugene Rakhmatulin
f163ca69de
Autodiscover tweaks
2026-03-26 16:30:05 -07:00
Eugene Rakhmatulin
a78e221de3
Autodiscovery refactoring with mesh support
2026-03-26 15:47:41 -07:00
Eugene Rakhmatulin
e6ee108cdf
Temporary patch for NVFP4
2026-03-26 11:43:44 -07:00
Eugene Rakhmatulin
174de6f0a8
temporary patch for PR38126
2026-03-26 08:58:04 -07:00
Eugene Rakhmatulin
83a74bccec
Removed extra solo mode check
2026-03-26 07:45:23 -07:00
Eugene Rakhmatulin
ff18a9ad5b
Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node
2026-03-25 23:38:44 -07:00
Eugene Rakhmatulin
c08b34a218
add --config passthrough to run-recipe
2026-03-25 23:35:52 -07:00
Eugene Rakhmatulin
23cca2a11a
Merge branch '3-node' of gitlab.home.eugr.net:ai/spark-vllm into 3-node
2026-03-25 23:17:25 -07:00