spark-vllm-docker/recipes/qwen3.5-397b-int4-autoround.yaml
sonusflow 006734910c Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix (2026-03-09 21:30:28 +00:00)
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate

The Marlin TP fix works around Marlin's MIN_THREAD_N=64 constraint, which
breaks the in_proj_ba layers at TP=4 (per-rank output_size 128/4 = 32 < 64).
Solution: ReplicatedLinear for the B/A projections instead of sharding them,
applied via diff patches; a sketch follows.
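For reference, a minimal Python sketch of the idea (the actual fix ships as
diff patches in mods/fix-qwen35-tp4-marlin; the helper below and its name
are illustrative stand-ins, not the patched Qwen3.5 model code):

from vllm.model_executor.layers.linear import (
    ColumnParallelLinear,  # shards output_size across TP ranks
    ReplicatedLinear,      # keeps the full output_size on every rank
)

MARLIN_MIN_THREAD_N = 64  # Marlin kernels reject shards narrower than this

def make_ba_proj(hidden_size: int, tp_size: int, quant_config=None):
    """Build the in_proj_ba-style B/A projection (output_size=128).

    Sharded at TP=4, each rank would get a 128/4 = 32-wide shard, below
    Marlin's 64-column minimum, so kernel init fails. Replicating the
    (tiny) projection on every rank sidesteps the constraint at the cost
    of a small amount of duplicated weight memory.
    """
    output_size = 128
    if output_size // tp_size < MARLIN_MIN_THREAD_N:
        return ReplicatedLinear(hidden_size, output_size, bias=False,
                                quant_config=quant_config)
    return ColumnParallelLinear(hidden_size, output_size, bias=False,
                                quant_config=quant_config)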

Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)

Note: Driver 590.x has a CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
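A quick preflight check for the driver pin (illustrative; assumes nvidia-smi
is on PATH and reports the host driver):

import subprocess

def driver_version() -> str:
    # nvidia-smi prints one driver version per GPU; they match on one host.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()[0]

ver = driver_version()
if not ver.startswith("580."):
    raise SystemExit(f"Driver {ver} detected; this recipe expects 580.x "
                     f"(590.x deadlocks during CUDAGraph capture on GB10).")
print(f"Driver {ver}: OK")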

# Recipe: Qwen3.5-397B-A17B-INT4-Autoround
# Qwen3.5-397B model in Intel INT4-Autoround quantization, TP=4 across 4 DGX Spark nodes
# Benchmarked at 37 tok/s single-user, 103 tok/s aggregate (4 concurrent) on 4× DGX Spark
# Requires NVIDIA driver 580.x (590.x has CUDAGraph deadlock bug on GB10)
recipe_version: "1"
name: Qwen3.5-397B-INT4-Autoround
description: Qwen3.5-397B with TP=4 across 4 DGX Spark nodes (Marlin fix applied)
# HuggingFace model to download (optional, for --download-model)
model: Intel/Qwen3.5-397B-A17B-int4-AutoRound
# Container image to use
container: vllm-node-tf5
build_args:
  - --tf5
# Mods required: coder-next tool/reasoning parser + Marlin TP fix
mods:
  - mods/fix-qwen3-coder-next
  - mods/fix-qwen35-tp4-marlin
# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
# Default settings (can be overridden via CLI, e.g. --tensor_parallel 2)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 4
  gpu_memory_utilization: 0.78
  max_model_len: 32768
  max_num_batched_tokens: 8192
# The vLLM serve command template
command: |
  vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tensor-parallel-size {tensor_parallel} \
    --distributed-executor-backend ray \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --enable-prefix-caching \
    --trust-remote-code \
    --host {host} \
    --port {port}
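
Once the server is up, a minimal smoke test against vLLM's OpenAI-compatible
endpoint (host/port taken from the defaults above; the api_key value is a
required placeholder, and the prompt is arbitrary):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Intel/Qwen3.5-397B-A17B-int4-AutoRound",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)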