sonusflow
006734910c
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.
Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate
The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution:
ReplicatedLinear for B/A projections, applied via diff patches.
Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)
Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
2026-03-09 21:30:28 +00:00
..
2026-02-17 12:45:17 -08:00
2026-02-13 13:46:00 -08:00
2026-02-24 18:24:42 -08:00
2026-02-27 10:55:42 -08:00
2026-03-06 23:35:18 -08:00
2026-03-09 21:30:28 +00:00
2025-12-23 13:38:10 -08:00
2026-02-02 16:57:04 -08:00
2026-02-02 16:57:04 -08:00