Eugene Rakhmatulin
|
d609fecef3
|
Merge branch 'main' of github.com:eugr/spark-vllm-docker
|
2026-03-12 15:04:41 -07:00 |
|
eugr
|
7c198b1ceb
|
Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4

Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)
|
2026-03-12 15:04:23 -07:00 |
|
Eugene Rakhmatulin
|
8ae51192e5
|
Experimental mod to support gpu-memory-utilization-gb
|
2026-03-12 13:37:44 -07:00 |
|
remi
|
122edc8229
|
super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
|
2026-03-11 20:53:44 +01:00 |
|
sonusflow
|
006734910c
|
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix

Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate

The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution:
ReplicatedLinear for B/A projections, applied via diff patches.
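
The arithmetic behind the Marlin constraint can be sketched as follows. This is a simplified model that only mirrors the comparison quoted in the commit message (per-rank width versus 64); the helper names are hypothetical and are not taken from vLLM, whose real shape check may be stricter.

```python
# Assumption: 64 is the MIN_THREAD_N value quoted in the commit message.
MIN_THREAD_N = 64


def shard_width(output_size: int, tp_size: int) -> int:
    """Per-rank output width when a layer is split evenly across tp_size ranks."""
    if output_size % tp_size != 0:
        raise ValueError("output_size must be divisible by tp_size")
    return output_size // tp_size


def shard_meets_min_width(output_size: int, tp_size: int,
                          min_thread_n: int = MIN_THREAD_N) -> bool:
    """True if the per-rank width is at least min_thread_n (hypothetical helper)."""
    return shard_width(output_size, tp_size) >= min_thread_n


# The in_proj_ba case from the commit message: output_size = 128.
for tp in (1, 2, 4):
    width = shard_width(128, tp)
    print(f"TP={tp}: width={width}, ok={shard_meets_min_width(128, tp)}")
```

At TP=4 the shard is 32 columns wide and fails the check, which is why the commit keeps the B/A projections unsharded (replicated) rather than splitting them.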
Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (safe margin for unified memory)
- CUDA graphs enabled (default, requires driver 580.x)

Note: Driver 590.x has a CUDA graph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
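
As a rough illustration, the key config above might map onto a `vllm serve` invocation like the one assembled below. This is a sketch under assumptions: the model path is a hypothetical placeholder, the multi-node cluster setup needed for TP=4 across four machines is omitted, and the actual recipe in the repository may launch vLLM differently.

```python
import os
import shlex

# Hypothetical placeholder; the commit message does not give the model path.
MODEL = "<model-path>"


def build_launch(model: str, tp_size: int = 4):
    """Return (env, argv) reflecting the settings listed in the commit message."""
    # Environment variable named in the commit message.
    env = dict(os.environ, VLLM_MARLIN_USE_ATOMIC_ADD="1")
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--kv-cache-dtype", "fp8",           # KV cache FP8
        "--enable-prefix-caching",           # prefix caching enabled
        "--gpu-memory-utilization", "0.78",  # margin for unified memory
    ]
    return env, argv


env, argv = build_launch(MODEL)
print(shlex.join(argv))
```

CUDA graphs are left at their default (enabled), so no flag is passed for them here.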
|
2026-03-09 21:30:28 +00:00 |
|
Eugene Rakhmatulin
|
d42c4199fa
|
Unsloth chat template for qwen3.5
|
2026-03-06 23:35:18 -08:00 |
|
Eugene Rakhmatulin
|
8f11e7e5ed
|
Intel/Qwen3.5-122B-A10B-int4-AutoRound support via mods/fix-qwen3.5-autoround
|
2026-02-27 10:55:42 -08:00 |
|
Eugene Rakhmatulin
|
5ed2c23d0d
|
Mod for Intel/Qwen3-Coder-Next-INT4-Autoround model
|
2026-02-24 18:24:42 -08:00 |
|
Eugene Rakhmatulin
|
ef07046d51
|
Now using an open PR for the glm-4.7-flash crash fix in the mod
|
2026-02-17 12:45:17 -08:00 |
|
Eugene Rakhmatulin
|
c23aff91d3
|
Temporary fix for #38
|
2026-02-16 09:23:10 -08:00 |
|
Eugene Rakhmatulin
|
3470345624
|
Another fix for the Qwen mod as the slow PR was reverted in main
|
2026-02-13 13:46:00 -08:00 |
|
Eugene Rakhmatulin
|
c0524608c2
|
Qwen3-coder-next mod - use a new PR instead of reverting previous one
|
2026-02-13 12:03:44 -08:00 |
|
Eugene Rakhmatulin
|
701147b1eb
|
Qwen3-Coder-Next fixes and updated recipe
|
2026-02-12 15:56:32 -08:00 |
|
Eugene Rakhmatulin
|
4b9ab0de7c
|
Added ability to launch NGC container in the cluster
|
2026-02-02 16:57:04 -08:00 |
|
Eugene Rakhmatulin
|
4634ee92a2
|
Added a mod for Nemotron Nano
|
2026-02-02 11:58:07 -08:00 |
|
Eugene Rakhmatulin
|
ace61c2d55
|
Added a new mod for glm4.7-flash-awq and solo model support.
|
2026-01-29 18:18:00 -08:00 |
|
Eugene Rakhmatulin
|
19dec79c5c
|
initial mod implementation
|
2025-12-23 13:38:10 -08:00 |
|