Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate

The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution: ReplicatedLinear for the B/A projections, applied via unified diff patches.

Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDA Graphs enabled (default, requires driver 580.x)

Note: driver 590.x has a CUDA Graph capture deadlock on GB10 unified memory. Stay on driver 580.126.09.
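The arithmetic behind the Marlin constraint can be sketched as follows. This is an illustrative check, not vLLM code: column-parallel layers split output_size across TP ranks, and each shard must still meet the kernel's minimum tile width.

```shell
#!/bin/bash
# Minimum shard width accepted by the Marlin kernel (per the notes above).
MIN_THREAD_N=64
output_size=128   # in_proj_ba output width

for tp in 4 1; do
  shard=$((output_size / tp))
  if [ "$shard" -lt "$MIN_THREAD_N" ]; then
    echo "TP=$tp: shard width $shard < $MIN_THREAD_N -> Marlin rejects in_proj_ba"
  else
    echo "TP=$tp: shard width $shard OK"
  fi
done
```

Replicating the layer (TP=1 semantics via ReplicatedLinear) keeps the full 128 columns on every rank, which is why the patch below sidesteps the constraint at the cost of duplicating a small weight.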
#!/bin/bash
# Fix Marlin TP=4 constraint for Qwen3.5-397B: in_proj_ba output_size=128 / TP=4 = 32 < min_thread_n=64
# Solution: Replace MergedColumnParallelLinear with two ReplicatedLinear for B/A projections
# Delivery: unified diff patches (portable across vLLM versions)

set -e

MOD_DIR="$(dirname "$0")"
MODELS_DIR="/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models"

echo "[fix-qwen35-tp4-marlin] Applying patches..."

# Apply patches with --forward (skip if already applied)
patch --forward --batch -p0 -d "$MODELS_DIR" < "$MOD_DIR/qwen3_next.patch" || {
    echo "[fix-qwen35-tp4-marlin] qwen3_next.patch already applied or failed"
}
patch --forward --batch -p0 -d "$MODELS_DIR" < "$MOD_DIR/qwen3_5.patch" || {
    echo "[fix-qwen35-tp4-marlin] qwen3_5.patch already applied or failed"
}

# Fix rope validation (idempotent)
python3 "$MOD_DIR/fix_rope.py"

echo "[fix-qwen35-tp4-marlin] Done."
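With the patches in place, a serve invocation wiring up the key config above might look like the sketch below. The model path and served port are placeholders, the multi-node process-launch details are omitted, and the flag spellings follow vLLM's CLI naming, so verify them against your installed version. This is a launch-config fragment, not a tested command line.

```shell
# Illustrative launch; model path is a placeholder for this sketch.
export VLLM_MARLIN_USE_ATOMIC_ADD=1   # required for Marlin correctness

vllm serve /models/Qwen3.5-397B-A17B-INT4-AutoRound \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.78
```

Keep gpu_memory_utilization at 0.78 on GB10: the unified memory architecture shares the pool with the host, so the usual 0.9 default leaves no safe margin.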