spark-vllm-docker/mods/fix-qwen35-tp4-marlin/fix_rope.py at 43a00ed90f736e0137757df5b2a4c9fed218cb5e

sonusflow 006734910c Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix

Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate
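As a rough reading of these figures, the sketch below derives the per-user rate and the aggregate speedup from the numbers quoted above. The function name is hypothetical, and the inputs are only the rounded values from this commit message, not fresh measurements.

```python
def concurrency_summary(single_tok_s: float, aggregate_tok_s: float, users: int) -> dict:
    """Derive per-user throughput and scaling from quoted benchmark numbers."""
    per_user = aggregate_tok_s / users
    return {
        "per_user_tok_s": per_user,
        # Aggregate throughput relative to one user running alone.
        "aggregate_speedup": aggregate_tok_s / single_tok_s,
        # Fraction of single-user speed each concurrent user still sees.
        "per_user_retention": per_user / single_tok_s,
    }


summary = concurrency_summary(single_tok_s=37.0, aggregate_tok_s=103.0, users=4)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the per-user rate comes out just under 26 tok/s, which matches the "~26 tok/s per user" figure.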

The Marlin TP fix works around Marlin's MIN_THREAD_N=64 constraint, which
breaks in_proj_ba layers at TP=4: sharding output_size=128 across 4 ranks
leaves 32 output columns per rank, below the minimum of 64. Solution: use
ReplicatedLinear for the B/A projections, applied via diff patches.
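The shard-size arithmetic can be sketched as follows. This is a minimal illustration, not the patch itself: the helper names are hypothetical, and it models only the lower bound described in this commit message (per-rank width of at least MIN_THREAD_N). The real Marlin shape check may be stricter.

```python
MIN_THREAD_N = 64  # minimum Marlin N width, value taken from the commit message


def shard_output_size(output_size: int, tp_size: int) -> int:
    """Output columns each rank holds when a layer is split across tp_size ranks."""
    if output_size % tp_size != 0:
        raise ValueError(f"{output_size} is not divisible by tp_size={tp_size}")
    return output_size // tp_size


def choose_linear_kind(output_size: int, tp_size: int,
                       min_thread_n: int = MIN_THREAD_N) -> str:
    """Pick a sharded layer when the shard is wide enough, else replicate it."""
    if shard_output_size(output_size, tp_size) >= min_thread_n:
        return "column_parallel"
    # Too narrow for Marlin once sharded: keep the full weight on every rank.
    return "replicated"


for tp in (1, 2, 4):
    print(tp, shard_output_size(128, tp), choose_linear_kind(128, tp))
```

Replicating a 128-column projection costs little memory per rank, which is presumably why the fix keeps the whole weight on every rank instead of padding the shard.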

Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)
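One way these settings could map onto a vLLM launch is sketched below. The function name and the model path are placeholders, and the multi-node setup needed to span 4 DGX Spark nodes is not shown. The flags are standard `vllm serve` options, but the actual recipe in this repository may pass them differently.

```python
import shlex


def build_launch(model: str, tp_size: int = 4) -> tuple[dict[str, str], list[str]]:
    """Return (environment overrides, argv) mirroring the key config above."""
    env = {"VLLM_MARLIN_USE_ATOMIC_ADD": "1"}  # required for Marlin correctness
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--kv-cache-dtype", "fp8",
        "--enable-prefix-caching",
        "--gpu-memory-utilization", "0.78",  # UMA safe margin
        # CUDA graphs are on by default, so --enforce-eager is deliberately absent.
    ]
    return env, argv


# "path/to/model" is a placeholder, not the real checkpoint location.
env, argv = build_launch("path/to/model")
prefix = " ".join(f"{key}={value}" for key, value in env.items())
print(prefix, shlex.join(argv))
```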

Note: Driver 590.x has a CUDAGraph capture deadlock on GB10 unified
memory. Stay on driver 580.126.09.

2026-03-09 21:30:28 +00:00

642 B
