Commit Graph

306 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
3c27d521bb Reverting another breaking vLLM PR, fixes #60 2026-02-23 09:51:45 -08:00
Eugene Rakhmatulin
4c8f90395b Changed reasoning parser in MiniMax for better compatibility with modern clients (like coding tools). 2026-02-21 11:53:13 -08:00
Eugene Rakhmatulin
349a270c1e More robust handling of wheels downloads 2026-02-19 13:47:59 -08:00
Eugene Rakhmatulin
ad662f9bab Changed MXFP4 CUTLASS SHA 2026-02-18 18:20:15 -08:00
Eugene Rakhmatulin
b959818536 MXFP4 fix cache bug 2026-02-18 16:53:57 -08:00
Eugene Rakhmatulin
c60c16e867 Temporary patch to reverse PR that fails builds 2026-02-18 16:20:20 -08:00
Eugene Rakhmatulin
f09c2c3ac8 Refactoring, updated README 2026-02-18 15:58:53 -08:00
Eugene Rakhmatulin
8873a0d959 Handle failed downloads properly 2026-02-18 14:55:43 -08:00
Eugene Rakhmatulin
12fd8a4503 Merge branch 'flashinfer-gen' of gitlab.home.eugr.net:ai/spark-vllm into flashinfer-gen 2026-02-18 14:47:20 -08:00
Eugene Rakhmatulin
34fff7b3fb Download flashinfer wheels from releases 2026-02-18 14:46:01 -08:00
Eugene Rakhmatulin
a6fdf58a82 Merge branch 'main' into flashinfer-gen 2026-02-18 13:35:41 -08:00
Eugene Rakhmatulin
bd3f45f920 Updated MXFP4 build to use fresh repo references 2026-02-18 13:35:09 -08:00
Eugene Rakhmatulin
b06531f70b Backup old wheels before rebuilding and restore on failure 2026-02-17 23:13:25 -08:00
Eugene Rakhmatulin
a49b89a0e5 Remove old wheels before rebuilding 2026-02-17 23:04:58 -08:00
Eugene Rakhmatulin
ec0f189256 Initial refactoring to enable separate wheel builds 2026-02-17 19:15:32 -08:00
Eugene Rakhmatulin
5b2313dddb Changed KV type to fp8 in qwen3-coder-next recipe and reduced default context size to 131072 to ensure it all fits in a single Spark. 2026-02-17 13:07:54 -08:00
Eugene Rakhmatulin
0249f1fdde Merge branch 'main' into privileged 2026-02-17 13:01:31 -08:00
Eugene Rakhmatulin
ef07046d51 Now using an opened PR for glm-4.7-flash crash fix in the mod 2026-02-17 12:45:17 -08:00
Eugene Rakhmatulin
6aafc9c7d3 Merge branch 'main' into privileged 2026-02-16 11:38:41 -08:00
Eugene Rakhmatulin
1e7f2d5640 Small fix for M2.5 recipe 2026-02-16 11:38:34 -08:00
Eugene Rakhmatulin
bd2085d783 Merge branch 'main' into privileged 2026-02-16 11:36:06 -08:00
Eugene Rakhmatulin
24f42be5cc Added a recipe for MiniMax M2.5 AWQ 2026-02-16 11:35:53 -08:00
Eugene Rakhmatulin
88a5d09748 Merge branch 'main' into privileged 2026-02-16 09:29:09 -08:00
Eugene Rakhmatulin
c23aff91d3 Temporary fix for #38 2026-02-16 09:23:10 -08:00
Eugene Rakhmatulin
f886505436 Added --non-privileged flag to launch-cluster.sh 2026-02-15 00:12:06 -08:00
Eugene Rakhmatulin
4214d4fefe Caching cubins during build for reuse 2026-02-13 19:30:28 -08:00
Eugene Rakhmatulin
3470345624 Another fix for the Qwen mod as the slow PR was reversed in main 2026-02-13 13:46:00 -08:00
Eugene Rakhmatulin
c0524608c2 Qwen3-coder-next mod - use a new PR instead of reverting previous one 2026-02-13 12:03:44 -08:00
Eugene Rakhmatulin
701147b1eb Qwen3-Coder-Next fixes and updated recipe 2026-02-12 15:56:32 -08:00
Eugene Rakhmatulin
da4185cb12 Fixed an issue with fetching latest vLLM code 2026-02-11 22:35:49 -08:00
Eugene Rakhmatulin
3b1e49dcb0 Supporting other CUDA archs via --gpu-arch flag 2026-02-11 13:10:41 -08:00
Eugene Rakhmatulin
c6b245cfe8 Added prefix caching to nemotron recipe 2026-02-10 18:25:01 -08:00
Eugene Rakhmatulin
6d3f5dfd5c map flashinfer/torch/triton cache directories by default 2026-02-10 16:36:02 -08:00
Eugene Rakhmatulin
b990a1b8ac Fixed #37 2026-02-10 14:31:43 -08:00
Eugene Rakhmatulin
ace16f3a8f Applied new fastsafetensors fix to mxfp4 build; disabled wheel builds by default 2026-02-09 23:47:06 -08:00
Eugene Rakhmatulin
74876dd442 Added recipes for nemotron-nano-3 and qwen3-coder-next 2026-02-09 14:33:35 -08:00
Eugene Rakhmatulin
3aa5e5dce4 Merge pull request #34 2026-02-09 14:28:30 -08:00
Raphael Amorim
6943a51ced Adding tests and refactoring repeated methods 2026-02-09 17:21:32 -05:00
Raphael Amorim
d07ad5450f Adding solo_only option to the recipe 2026-02-09 17:03:57 -05:00
Eugene Rakhmatulin
2923fe6ea5 Removed temp fastsafetensors patch 2026-02-09 10:21:14 -08:00
Eugene Rakhmatulin
06e8817f18 Triton 3.6.0 is now default 2026-02-08 22:38:31 -08:00
Eugene Rakhmatulin
d845cd0401 changed arch to 12.1a again 2026-02-08 14:18:12 -08:00
Eugene Rakhmatulin
5bf422a2ca Merge branch 'main' into pytorch-base 2026-02-08 13:01:17 -08:00
Eugene Rakhmatulin
15c1506d0c Merge pull request #32 2026-02-08 07:17:20 -08:00
Raphael Amorim
b7c3cdcfcb Enhancement: add -- pass-through for arbitrary vLLM arguments
Implements Unix-style pass-through allowing any vLLM argument to be
passed after `--` separator. Arguments are appended verbatim to the
generated vLLM command.

Examples:
  ./run-recipe.py model --solo -- --load-format safetensors
  ./run-recipe.py model --solo -- --served-model-name my-api
  ./run-recipe.py model --solo -- -cc.cudagraph_mode=PIECEWISE

Features:
- Uses parse_known_args() to capture arguments after --
- Warns when extra args duplicate CLI overrides (--port, --tp, etc.)
- Works in both solo and cluster modes

Adds 10 integration tests covering:
- --load-format, --served-model-name, equals syntax
- Multiple arguments, empty --, cluster mode
- Duplicate detection warnings for port/tp/gpu-mem

Closes #30
2026-02-08 02:36:49 -05:00
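
The pass-through behaviour described in commit b7c3cdcfcb can be illustrated with a minimal sketch. This is not the code from `run-recipe.py`. The commit mentions `parse_known_args()`; the sketch instead splits on the first bare `--` explicitly, so the result does not depend on how a given argparse version treats the separator. The names `OVERRIDE_FLAGS`, `split_passthrough`, `find_duplicates`, and `build_command`, the `--gpu-mem` option spelling, and the use of the recipe name as the model argument are all assumptions made for illustration.

```python
import argparse

# Hypothetical mapping: vLLM flag -> the wrapper option that already sets it.
OVERRIDE_FLAGS = {
    "--port": "port",
    "--tensor-parallel-size": "tp",
    "-tp": "tp",
    "--gpu-memory-utilization": "gpu_mem",
}


def split_passthrough(argv):
    """Split argv at the first bare '--' into (wrapper args, pass-through args)."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return list(argv), []


def find_duplicates(extra_args, namespace):
    """Return pass-through flags that repeat an override set on the wrapper."""
    dupes = []
    for arg in extra_args:
        flag = arg.split("=", 1)[0]  # handle the --flag=value spelling
        dest = OVERRIDE_FLAGS.get(flag)
        if dest is not None and getattr(namespace, dest) is not None:
            dupes.append(flag)
    return dupes


def build_command(argv):
    """Build a vLLM command list and a list of duplicate-flag warnings."""
    own, extra = split_passthrough(argv)

    parser = argparse.ArgumentParser(prog="run-recipe.py")
    parser.add_argument("recipe")
    parser.add_argument("--solo", action="store_true")
    parser.add_argument("--port", type=int)
    parser.add_argument("--tp", type=int)
    parser.add_argument("--gpu-mem", type=float, dest="gpu_mem")  # assumed name
    ns = parser.parse_args(own)

    # The recipe name stands in for the model here; a real recipe would
    # resolve it to a model identifier and further options.
    cmd = ["vllm", "serve", ns.recipe]
    if ns.port is not None:
        cmd += ["--port", str(ns.port)]
    if ns.tp is not None:
        cmd += ["--tensor-parallel-size", str(ns.tp)]
    if ns.gpu_mem is not None:
        cmd += ["--gpu-memory-utilization", str(ns.gpu_mem)]

    warnings = [
        f"warning: {flag} is also set by a CLI override"
        for flag in find_duplicates(extra, ns)
    ]
    # Pass-through arguments are appended verbatim, as the commit describes.
    return cmd + extra, warnings


if __name__ == "__main__":
    command, notes = build_command(
        ["model", "--solo", "--port", "8001", "--", "--port=9000"]
    )
    print(command)
    print(notes)
```

Appending the extra arguments last keeps the wrapper from having to know every vLLM option. Which of two duplicated flags wins is left to vLLM's own parser, which is why the sketch only warns about duplicates and does not rewrite them.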
Eugene Rakhmatulin
dfb300e51a Merge branch 'main' into pytorch-base 2026-02-05 13:54:12 -08:00
Eugene Rakhmatulin
8cb956b972 Updated networking guide 2026-02-05 13:53:57 -08:00
Eugene Rakhmatulin
66210e641d Merge branch 'main' into pytorch-base 2026-02-04 12:07:06 -08:00
Eugene Rakhmatulin
f139c4b55d Updated tests 2026-02-04 12:06:30 -08:00
Eugene Rakhmatulin
c7d45157e0 Merge pull request #19 2026-02-04 12:03:20 -08:00