spark-vllm-docker

Author	SHA1	Message	Date
Eugene Rakhmatulin	d609fecef3	Merge branch 'main' of github.com:eugr/spark-vllm-docker	2026-03-12 15:04:41 -07:00
eugr	7c198b1ceb	Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4: Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)	2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin	8ae51192e5	Experimental mod to support gpu-memory-utilization-gb	2026-03-12 13:37:44 -07:00
remi	122edc8229	super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4	2026-03-11 20:53:44 +01:00
sonusflow	006734910c	Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix. Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound quantization across 4 DGX Spark nodes using tensor parallelism. Performance (4× DGX Spark, driver 580.126.09): single user, 37 tok/s; 4 concurrent, ~26 tok/s per user (~103 tok/s aggregate). The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution: ReplicatedLinear for B/A projections, applied via diff patches. Key config: VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness); KV cache FP8, prefix caching enabled; gpu_memory_utilization 0.78 (UMA safe margin); CUDAGraphs enabled (default, requires driver 580.x). Note: Driver 590.x has a CUDAGraph capture deadlock on GB10 unified memory. Stay on driver 580.126.09.	2026-03-09 21:30:28 +00:00
Eugene Rakhmatulin	d42c4199fa	Unsloth chat template for qwen3.5	2026-03-06 23:35:18 -08:00
Eugene Rakhmatulin	8f11e7e5ed	Intel/Qwen3.5-122B-A10B-int4-AutoRound support via mods/fix-qwen3.5-autoround	2026-02-27 10:55:42 -08:00
Eugene Rakhmatulin	5ed2c23d0d	Mod for Intel/Qwen3-Coder-Next-INT4-Autoround model	2026-02-24 18:24:42 -08:00
Eugene Rakhmatulin	ef07046d51	Now using an opened PR for glm-4.7-flash crash fix in the mod	2026-02-17 12:45:17 -08:00
Eugene Rakhmatulin	c23aff91d3	Temporary fix for #38	2026-02-16 09:23:10 -08:00
Eugene Rakhmatulin	3470345624	Another fix for the Qwen mod as the slow PR was reversed in main	2026-02-13 13:46:00 -08:00
Eugene Rakhmatulin	c0524608c2	Qwen3-coder-next mod - use a new PR instead of reverting previous one	2026-02-13 12:03:44 -08:00
Eugene Rakhmatulin	701147b1eb	Qwen3-Coder-Next fixes and updated recipe	2026-02-12 15:56:32 -08:00
Eugene Rakhmatulin	4b9ab0de7c	Added ability to launch NGC container in the cluster	2026-02-02 16:57:04 -08:00
Eugene Rakhmatulin	4634ee92a2	Added a mod for Nemotron Nano	2026-02-02 11:58:07 -08:00
Eugene Rakhmatulin	ace61c2d55	added new mod for glm4.7-flash-awq, solo model support.	2026-01-29 18:18:00 -08:00
Eugene Rakhmatulin	19dec79c5c	initial mod implementation	2025-12-23 13:38:10 -08:00

17 Commits
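
The Marlin constraint described in commit 006734910c comes down to simple arithmetic: an output dimension of 128 split across 4 tensor-parallel ranks leaves 32 columns per rank, which is below the quoted MIN_THREAD_N of 64. The sketch below only illustrates that check. The function names are hypothetical and are not taken from vLLM or from this repository; only the numbers come from the commit message.

```python
# Illustrative check only; names are hypothetical, not vLLM APIs.
MIN_THREAD_N = 64  # minimum width quoted in commit 006734910c


def shard_width(output_size: int, tp_size: int) -> int:
    """Return the per-rank output width when a layer is split across tp_size ranks."""
    if tp_size <= 0:
        raise ValueError("tp_size must be positive")
    if output_size % tp_size != 0:
        raise ValueError("output_size must be divisible by tp_size")
    return output_size // tp_size


def shard_too_narrow(output_size: int, tp_size: int,
                     min_thread_n: int = MIN_THREAD_N) -> bool:
    """True when the per-rank shard is narrower than the minimum width."""
    return shard_width(output_size, tp_size) < min_thread_n


if __name__ == "__main__":
    # Case from the commit message: output_size=128 at TP=4.
    print(shard_width(128, 4), shard_too_narrow(128, 4))
```

Under these assumptions, a layer flagged by `shard_too_narrow` is the kind the commit handles by replicating the projection on every rank (ReplicatedLinear) instead of sharding it.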