spark-vllm-docker

sonusflow 006734910c Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix

Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate

The Marlin TP fix works around the MIN_THREAD_N=64 constraint, which breaks
the in_proj_ba layers at TP=4: the per-rank output size is 128/4 = 32, below
the minimum of 64. Solution: use ReplicatedLinear for the B/A projections,
applied via diff patches.
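
The shard-size arithmetic above can be sketched in a few lines of Python. This is a simplified illustration, not the actual vLLM or Marlin check: the helper names `shard_width` and `needs_replication` are hypothetical, and only the constant 64 and the 128/4 example come from the text above.

```python
# Simplified sketch of the shard-width check described above.
# MIN_THREAD_N = 64 is taken from the commit message; the helper
# names are hypothetical and not part of vLLM.
MIN_THREAD_N = 64


def shard_width(output_size: int, tp_size: int) -> int:
    """Per-rank output width when a linear layer is split across TP ranks."""
    if output_size % tp_size != 0:
        raise ValueError("output_size must be divisible by tp_size")
    return output_size // tp_size


def needs_replication(output_size: int, tp_size: int,
                      min_n: int = MIN_THREAD_N) -> bool:
    """True when the per-rank shard is narrower than the kernel minimum."""
    return shard_width(output_size, tp_size) < min_n


# in_proj_ba has output_size=128 according to the commit message.
for tp in (1, 2, 4):
    print(tp, shard_width(128, tp), needs_replication(128, tp))
```

Under these assumptions only TP=4 falls below the minimum, which is consistent with replicating the B/A projections on every rank instead of sharding them.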

Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)
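
As a rough illustration, the key settings above could map onto a vLLM launch roughly as follows. This is a sketch under assumptions, not the recipe shipped in this repository: `build_launch` is a hypothetical helper, the model path is a placeholder, and the multi-node setup needed for TP across 4 DGX Spark nodes is omitted. CUDAGraphs are left at their default, so no extra flag is passed for them.

```python
import os
import shlex


def build_launch(model: str, tp_size: int = 4):
    """Return (env, argv) for a vLLM server using the settings listed above.

    Hypothetical helper: the actual recipe may pass these options differently.
    """
    # Environment variable named in the key config list.
    env = dict(os.environ, VLLM_MARLIN_USE_ATOMIC_ADD="1")
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--kv-cache-dtype", "fp8",           # KV cache FP8
        "--enable-prefix-caching",           # prefix caching enabled
        "--gpu-memory-utilization", "0.78",  # UMA safe margin
    ]
    return env, argv


# "MODEL_PATH_PLACEHOLDER" stands in for the real model path or repo id.
env, argv = build_launch("MODEL_PATH_PLACEHOLDER")
print(shlex.join(argv))
```

The resulting `env` and `argv` could be handed to `subprocess.run(argv, env=env)` on a node where vLLM is installed.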

Note: Driver 590.x has a CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.

2026-03-09 21:30:28 +00:00

fix-glm-4.7-flash-AWQ

The mod now uses an open PR for the glm-4.7-flash crash fix

2026-02-17 12:45:17 -08:00

fix-qwen3-coder-next

Another fix for the Qwen mod, as the slow PR was reverted in main

2026-02-13 13:46:00 -08:00

fix-qwen3-next-autoround

Mod for the Intel/Qwen3-Coder-Next-INT4-Autoround model

2026-02-24 18:24:42 -08:00

fix-qwen3.5-autoround

Intel/Qwen3.5-122B-A10B-int4-AutoRound support via mods/fix-qwen3.5-autoround