Commit Graph

21 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
1ad85442ac Added a helper mod for Qwen3.5-397B recipe 2026-04-12 19:14:23 -07:00
Eugene Rakhmatulin
288da8e911 Mod to fix Gemma4 tool parser 2026-04-04 16:48:07 -07:00
Eugene Rakhmatulin
f4ca15ce18 Made autoround mod optional to support latest version of vLLM. Fixes #144. 2026-03-27 09:00:50 -07:00
Eugene Rakhmatulin
03b055d7f0 Major cluster orchestration refactoring to support running without Ray 2026-03-13 11:55:18 -07:00
Eugene Rakhmatulin
d609fecef3 Merge branch 'main' of github.com:eugr/spark-vllm-docker 2026-03-12 15:04:41 -07:00
eugr
7c198b1ceb Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)
2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin
8ae51192e5 Experimental mod to support gpu-memory-utilization-gb 2026-03-12 13:37:44 -07:00
remi
122edc8229 super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 2026-03-11 20:53:44 +01:00
sonusflow
006734910c Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate

The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution:
ReplicatedLinear for B/A projections, applied via diff patches.

Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)

Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
2026-03-09 21:30:28 +00:00
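
Note on commit 006734910c: the Marlin constraint it describes comes down to a per-rank width check. The sketch below only reproduces that arithmetic; it is not code from vLLM or from the patch. The function names are hypothetical, and the assumption that a shard of exactly 64 columns is acceptable is inferred from the "32 < 64" comparison in the message.

```python
# Illustrative only: mirrors the arithmetic quoted in the commit message.
MIN_THREAD_N = 64  # value quoted in the commit message


def shard_width(output_size: int, tp_size: int) -> int:
    """Columns each tensor-parallel rank holds when the output dim is split evenly."""
    if output_size % tp_size != 0:
        raise ValueError("output_size must be divisible by tp_size")
    return output_size // tp_size


def needs_replication(output_size: int, tp_size: int,
                      min_thread_n: int = MIN_THREAD_N) -> bool:
    """True when the per-rank shard is narrower than the minimum (assumed strict <)."""
    return shard_width(output_size, tp_size) < min_thread_n


if __name__ == "__main__":
    # in_proj_ba case from the commit message: output_size=128, TP=4
    print(shard_width(128, 4), needs_replication(128, 4))
```

For the case in the message, 128 / 4 gives a 32-column shard, which falls under the minimum; that is why the commit keeps the B/A projections unsharded with ReplicatedLinear rather than splitting them across ranks.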
Eugene Rakhmatulin
d42c4199fa Unsloth chat template for qwen3.5 2026-03-06 23:35:18 -08:00
Eugene Rakhmatulin
8f11e7e5ed Intel/Qwen3.5-122B-A10B-int4-AutoRound support via mods/fix-qwen3.5-autoround 2026-02-27 10:55:42 -08:00
Eugene Rakhmatulin
5ed2c23d0d Mod for Intel/Qwen3-Coder-Next-INT4-Autoround model 2026-02-24 18:24:42 -08:00
Eugene Rakhmatulin
ef07046d51 Now using an opened PR for glm-4.7-flash crash fix in the mod 2026-02-17 12:45:17 -08:00
Eugene Rakhmatulin
c23aff91d3 Temporary fix for #38 2026-02-16 09:23:10 -08:00
Eugene Rakhmatulin
3470345624 Another fix for the Qwen mod as the slow PR was reversed in main 2026-02-13 13:46:00 -08:00
Eugene Rakhmatulin
c0524608c2 Qwen3-coder-next mod - use a new PR instead of reverting previous one 2026-02-13 12:03:44 -08:00
Eugene Rakhmatulin
701147b1eb Qwen3-Coder-Next fixes and updated recipe 2026-02-12 15:56:32 -08:00
Eugene Rakhmatulin
4b9ab0de7c Added ability to launch NGC container in the cluster 2026-02-02 16:57:04 -08:00
Eugene Rakhmatulin
4634ee92a2 Added a mod for Nemotron Nano 2026-02-02 11:58:07 -08:00
Eugene Rakhmatulin
ace61c2d55 added new mod for glm4.7-flash-awq, solo model support. 2026-01-29 18:18:00 -08:00
Eugene Rakhmatulin
19dec79c5c initial mod implementation 2025-12-23 13:38:10 -08:00