J.J. Hoesing
358b4795b6
Add --mkpath to rsync args to handle the case where .cache/huggingface/hub doesn't already exist on the destination.
2026-02-26 03:12:34 -08:00
Eugene Rakhmatulin
dbd3d21fb8
allows $HF_HOME in hf-download.sh
2026-02-25 16:39:12 -08:00
Eugene Rakhmatulin
1c853b725e
Allow using $HF_HOME as the Hugging Face cache directory, closes #68
2026-02-25 16:38:04 -08:00
Eugene Rakhmatulin
5a3536b38e
Fixed a bug where updated tags would cause git fetch to fail
2026-02-24 20:59:54 -08:00
Eugene Rakhmatulin
5ed2c23d0d
Mod for Intel/Qwen3-Coder-Next-INT4-Autoround model
2026-02-24 18:24:42 -08:00
Eugene Rakhmatulin
3c27d521bb
Reverting another breaking vLLM PR, fixes #60
2026-02-23 09:51:45 -08:00
Eugene Rakhmatulin
4c8f90395b
Changed reasoning parser in MiniMax for better compatibility with modern clients (like coding tools).
2026-02-21 11:53:13 -08:00
Eugene Rakhmatulin
349a270c1e
More robust handling of wheels downloads
2026-02-19 13:47:59 -08:00
Eugene Rakhmatulin
ad662f9bab
Changed MXFP4 CUTLASS SHA
2026-02-18 18:20:15 -08:00
Eugene Rakhmatulin
b959818536
MXFP4 fix cache bug
2026-02-18 16:53:57 -08:00
Eugene Rakhmatulin
c60c16e867
Temporary patch to reverse PR that fails builds
2026-02-18 16:20:20 -08:00
Eugene Rakhmatulin
f09c2c3ac8
Refactoring, updated README
2026-02-18 15:58:53 -08:00
Eugene Rakhmatulin
8873a0d959
Handle failed downloads properly
2026-02-18 14:55:43 -08:00
Eugene Rakhmatulin
12fd8a4503
Merge branch 'flashinfer-gen' of gitlab.home.eugr.net:ai/spark-vllm into flashinfer-gen
2026-02-18 14:47:20 -08:00
Eugene Rakhmatulin
34fff7b3fb
Download flashinfer wheels from releases
2026-02-18 14:46:01 -08:00
Eugene Rakhmatulin
a6fdf58a82
Merge branch 'main' into flashinfer-gen
2026-02-18 13:35:41 -08:00
Eugene Rakhmatulin
bd3f45f920
Updated MXFP4 build to use fresh repo references
2026-02-18 13:35:09 -08:00
Eugene Rakhmatulin
b06531f70b
Backup old wheels before rebuilding and restore on failure
2026-02-17 23:13:25 -08:00
Eugene Rakhmatulin
a49b89a0e5
Remove old wheels before rebuilding
2026-02-17 23:04:58 -08:00
Eugene Rakhmatulin
ec0f189256
Initial refactoring to enable separate wheel builds
2026-02-17 19:15:32 -08:00
Eugene Rakhmatulin
5b2313dddb
Changed KV type to fp8 in qwen3-coder-next recipe and reduced default context size to 131072 to ensure it all fits in a single Spark.
2026-02-17 13:07:54 -08:00
Eugene Rakhmatulin
0249f1fdde
Merge branch 'main' into privileged
2026-02-17 13:01:31 -08:00
Eugene Rakhmatulin
ef07046d51
Now using an open PR for glm-4.7-flash crash fix in the mod
2026-02-17 12:45:17 -08:00
Eugene Rakhmatulin
6aafc9c7d3
Merge branch 'main' into privileged
2026-02-16 11:38:41 -08:00
Eugene Rakhmatulin
1e7f2d5640
Small fix for M2.5 recipe
2026-02-16 11:38:34 -08:00
Eugene Rakhmatulin
bd2085d783
Merge branch 'main' into privileged
2026-02-16 11:36:06 -08:00
Eugene Rakhmatulin
24f42be5cc
Added a recipe for MiniMax M2.5 AWQ
2026-02-16 11:35:53 -08:00
Eugene Rakhmatulin
88a5d09748
Merge branch 'main' into privileged
2026-02-16 09:29:09 -08:00
Eugene Rakhmatulin
c23aff91d3
Temporary fix for #38
2026-02-16 09:23:10 -08:00
Eugene Rakhmatulin
f886505436
Added --non-privileged flag to launch-cluster.sh
2026-02-15 00:12:06 -08:00
Eugene Rakhmatulin
4214d4fefe
Caching cubins during build for reuse
2026-02-13 19:30:28 -08:00
Eugene Rakhmatulin
3470345624
Another fix for the Qwen mod as the slow PR was reversed in main
2026-02-13 13:46:00 -08:00
Eugene Rakhmatulin
c0524608c2
Qwen3-coder-next mod - use a new PR instead of reverting previous one
2026-02-13 12:03:44 -08:00
Eugene Rakhmatulin
701147b1eb
Qwen3-Coder-Next fixes and updated recipe
2026-02-12 15:56:32 -08:00
Eugene Rakhmatulin
da4185cb12
Fixed an issue with fetching latest vLLM code
2026-02-11 22:35:49 -08:00
Eugene Rakhmatulin
3b1e49dcb0
Supporting other CUDA archs via --gpu-arch flag
2026-02-11 13:10:41 -08:00
Eugene Rakhmatulin
c6b245cfe8
Added prefix caching to nemotron recipe
2026-02-10 18:25:01 -08:00
Eugene Rakhmatulin
6d3f5dfd5c
map flashinfer/torch/triton cache directories by default
2026-02-10 16:36:02 -08:00
Eugene Rakhmatulin
b990a1b8ac
Fixed #37
2026-02-10 14:31:43 -08:00
Eugene Rakhmatulin
ace16f3a8f
Applied new fastsafetensors fix to mxfp4 build; disabled wheel builds by default
2026-02-09 23:47:06 -08:00
Eugene Rakhmatulin
74876dd442
Added recipes for nemotron-nano-3 and qwen3-coder-next
2026-02-09 14:33:35 -08:00
Eugene Rakhmatulin
3aa5e5dce4
Merge pull request #34
2026-02-09 14:28:30 -08:00
Raphael Amorim
6943a51ced
Adding tests and refactoring repeated methods
2026-02-09 17:21:32 -05:00
Raphael Amorim
d07ad5450f
Adding solo_only option to the recipe
2026-02-09 17:03:57 -05:00
Eugene Rakhmatulin
2923fe6ea5
Removed temp fastsafetensors patch
2026-02-09 10:21:14 -08:00
Eugene Rakhmatulin
06e8817f18
Triton 3.6.0 is now default
2026-02-08 22:38:31 -08:00
Eugene Rakhmatulin
d845cd0401
changed arch to 12.1a again
2026-02-08 14:18:12 -08:00
Eugene Rakhmatulin
5bf422a2ca
Merge branch 'main' into pytorch-base
2026-02-08 13:01:17 -08:00
Eugene Rakhmatulin
15c1506d0c
Merge pull request #32
2026-02-08 07:17:20 -08:00
Raphael Amorim
b7c3cdcfcb
Enhancement: add -- pass-through for arbitrary vLLM arguments
Implements Unix-style pass-through allowing any vLLM argument to be
passed after the `--` separator. Arguments are appended verbatim to the
generated vLLM command.

Examples:
  ./run-recipe.py model --solo -- --load-format safetensors
  ./run-recipe.py model --solo -- --served-model-name my-api
  ./run-recipe.py model --solo -- -cc.cudagraph_mode=PIECEWISE

Features:
- Uses parse_known_args() to capture arguments after --
- Warns when extra args duplicate CLI overrides (--port, --tp, etc.)
- Works in both solo and cluster modes

Adds 10 integration tests covering:
- --load-format, --served-model-name, equals syntax
- Multiple arguments, empty --, cluster mode
- Duplicate detection warnings for port/tp/gpu-mem

Closes #30
2026-02-08 02:36:49 -05:00
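
The `--` pass-through described in b7c3cdcfcb can be sketched roughly as follows. This is not the code from run-recipe.py: it is a minimal illustration that splits the argument list at `--` by hand, so the behavior is explicit, instead of using parse_known_args() as the commit does. The names `split_passthrough`, `find_duplicates`, `build_command`, and the `OVERRIDE_FLAGS` set are hypothetical, as is the shape of the generated command.

```python
import argparse
import sys

# Hypothetical set of flags the wrapper already sets from its own CLI overrides.
OVERRIDE_FLAGS = {"--port", "--tensor-parallel-size", "--gpu-memory-utilization"}


def split_passthrough(argv):
    """Split argv at the first bare "--" into (own_args, extra_args)."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return list(argv), []


def find_duplicates(extra_args, overrides=OVERRIDE_FLAGS):
    """Return pass-through flags that collide with wrapper overrides."""
    dupes = []
    for arg in extra_args:
        flag = arg.split("=", 1)[0]  # also matches the --flag=value syntax
        if flag in overrides and flag not in dupes:
            dupes.append(flag)
    return dupes


def build_command(argv):
    """Parse the wrapper's own options and append extra args verbatim."""
    own, extra = split_passthrough(argv)
    parser = argparse.ArgumentParser()
    parser.add_argument("recipe")
    parser.add_argument("--solo", action="store_true")
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args(own)

    for flag in find_duplicates(extra):
        print(f"warning: {flag} duplicates a CLI override", file=sys.stderr)

    # Illustrative command shape only; the real script builds it from a recipe.
    return ["vllm", "serve", args.recipe, "--port", str(args.port)] + extra
```

Called with `["model", "--solo", "--", "--load-format", "safetensors"]`, the sketch parses `model` and `--solo` itself and appends `--load-format safetensors` unchanged to the end of the command. An empty `--` yields no extra arguments.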