Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate

The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64).
Solution: ReplicatedLinear for B/A projections, applied via diff patches.

Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)

Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
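The shard-width arithmetic behind the Marlin note (128/4 = 32 < 64) can be checked with a small sketch. This is illustrative only, not code from the patch: `shard_width` and `fits_marlin` are hypothetical helper names, and the check is modeled as a simple "at least MIN_THREAD_N wide" comparison, following the commit message's wording. The real kernel check may be stricter.

```python
# Illustrative sketch, not vLLM code. MIN_THREAD_N = 64 is the value
# quoted in the commit message; the helper names are hypothetical.
MIN_THREAD_N = 64


def shard_width(output_size: int, tp_size: int) -> int:
    """Per-rank output width when a layer is split column-wise across TP ranks."""
    if output_size % tp_size != 0:
        raise ValueError("output_size must divide evenly across TP ranks")
    return output_size // tp_size


def fits_marlin(output_size: int, tp_size: int) -> bool:
    """Assumed check: each rank's shard must be at least MIN_THREAD_N wide."""
    return shard_width(output_size, tp_size) >= MIN_THREAD_N


# in_proj_ba has output_size=128 according to the commit message.
for tp in (1, 2, 4):
    print(f"TP={tp}: width={shard_width(128, tp)}, fits={fits_marlin(128, tp)}")
```

A replicated layer keeps the full 128-wide output on every rank, which corresponds to the `tp_size=1` case here. That is why swapping in ReplicatedLinear for the B/A projections sidesteps the constraint, at the cost of each rank holding a full copy of those weights.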
23
mods/fix-qwen35-tp4-marlin/fix_rope.py
Normal file
@@ -0,0 +1,23 @@
# Fix: ignore_keys_at_rope_validation is a list but transformers uses | (set union)
import re

path = "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py"
with open(path) as f:
    content = f.read()

old = """kwargs["ignore_keys_at_rope_validation"] = [
            "mrope_section",
            "mrope_interleaved",
        ]"""

new = """kwargs["ignore_keys_at_rope_validation"] = {
            "mrope_section",
            "mrope_interleaved",
        }"""

content = content.replace(old, new)

with open(path, "w") as f:
    f.write(content)

print("Fixed ignore_keys_at_rope_validation: list -> set")
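The reason for the list-to-set change can be reproduced in isolation: Python's `|` operator is defined between sets but not between a list and a set. The sketch below demonstrates that, then shows one possible way to make a text patch like the one above fail loudly instead of silently doing nothing when the target text is absent. `union_ok`, `patch_text`, and the `extra` key are hypothetical names used for illustration; they are not part of the commit or of transformers.

```python
# Illustrative sketch; helper names and the `extra` key are hypothetical.
ignore_as_list = ["mrope_section", "mrope_interleaved"]
ignore_as_set = {"mrope_section", "mrope_interleaved"}
extra = {"some_other_key"}  # stands in for whatever the caller unions in


def union_ok(value) -> bool:
    """Return True if `value | extra` works, False if it raises TypeError."""
    try:
        value | extra
    except TypeError:
        return False
    return True


def patch_text(content: str, old: str, new: str) -> str:
    """Replace `old` with `new` once; raise if neither form is present."""
    if new in content:
        return content  # already patched, nothing to do
    if old not in content:
        raise ValueError("target text not found; file layout may have changed")
    return content.replace(old, new, 1)


print(union_ok(ignore_as_list))  # list | set is unsupported
print(union_ok(ignore_as_set))   # set | set is a plain union
```

Because `str.replace` returns the input unchanged when nothing matches, a check of this kind is one way to catch whitespace mismatches between the `old` literal and the installed file before the success message is printed.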