Commit Graph

30 Commits

Author SHA1 Message Date
Eugene Rakhmatulin
b1eeefc0eb Changed Nemotron-3-Nano-NVFP4 to Marlin backend 2026-03-17 13:10:48 -07:00
eugr
7c198b1ceb Merge pull request #90 from sonusflow/pr/qwen35-397b-tp4
Add Qwen3.5-397B INT4-AutoRound TP=4 recipe (37 tok/s)
2026-03-12 15:04:23 -07:00
Eugene Rakhmatulin
8fec9bed06 Updated Nemotron to support dual Sparks 2026-03-12 13:30:15 -07:00
Eugene Rakhmatulin
6f9a2f981c Adjusted model parameters 2026-03-12 12:59:05 -07:00
remi
122edc8229 super nemotron mod & recipe for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 2026-03-11 20:53:44 +01:00
Eugene Rakhmatulin
7ceea85647 Fixed qwen3-coder-next-int4-autoround to exclude Ray 2026-03-11 11:20:56 -07:00
Eugene Rakhmatulin
45066e2b16 Updated README 2026-03-11 09:57:34 -07:00
Eugene Rakhmatulin
f2cf11b047 Added a recipe for qwen3-coder-next-int4-autoround 2026-03-11 09:23:23 -07:00
sonusflow
3baca14eb1 Move recipe to 4x-spark-cluster/ and add UMA memory optimizations
- Move qwen3.5-397b-int4-autoround.yaml to recipes/4x-spark-cluster/
  per maintainer request (multi-node recipes in separate directory)
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to recipe env
- Optimize Ray for GB10 UMA (128GB shared CPU/GPU memory):
  - Disable Ray dashboard (saves ~1.2 GiB per node)
  - Limit Ray object store to 1 GiB (default is 30% of RAM ≈ 33 GiB)
  - Disable pre-started idle workers (saves ~8 GiB on head node)
  - Set --num-cpus 2 and --disable-usage-stats on all nodes
- Net effect: over 40 GiB freed across the 4-node cluster for model/KV cache
2026-03-11 07:29:45 +00:00
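The Ray tuning described in the commit above can be sketched as a per-node launch snippet. The flags used (`--include-dashboard`, `--object-store-memory`, `--num-cpus`, `--disable-usage-stats`) are real `ray start` options, but this exact invocation is an illustration assembled from the commit message, not the recipe's literal contents:

```shell
# Hedged sketch of the head-node Ray launch implied by the commit above;
# worker nodes would use `ray start --address=<head-ip>:6379` with the
# same resource flags.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

ray start --head \
    --include-dashboard=false \       # saves ~1.2 GiB per node
    --object-store-memory=1073741824 \ # cap object store at 1 GiB
    --num-cpus=2 \
    --disable-usage-stats
```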
sonusflow
006734910c Add Qwen3.5-397B INT4-AutoRound TP=4 recipe and Marlin fix
Production-tested recipe for running Qwen3.5-397B-A17B with INT4 AutoRound
quantization across 4 DGX Spark nodes using tensor parallelism.

Performance (4× DGX Spark, driver 580.126.09):
- Single user: 37 tok/s
- 4 concurrent: ~26 tok/s per user, ~103 tok/s aggregate

The Marlin TP fix resolves the MIN_THREAD_N=64 constraint that breaks
in_proj_ba layers at TP=4 (output_size=128/4=32 < 64). Solution:
ReplicatedLinear for B/A projections, applied via diff patches.

Key config:
- VLLM_MARLIN_USE_ATOMIC_ADD=1 (required for Marlin correctness)
- KV cache FP8, prefix caching enabled
- gpu_memory_utilization 0.78 (UMA safe margin)
- CUDAGraphs enabled (default, requires driver 580.x)

Note: Driver 590.x has CUDAGraph capture deadlock on GB10 unified memory.
Stay on driver 580.126.09.
2026-03-09 21:30:28 +00:00
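The MIN_THREAD_N constraint described in the commit above can be sketched as a small check. The constant and layer shapes come from the commit message; the helper function is hypothetical, not vLLM's actual API:

```python
# Sketch of the Marlin shard-width check behind the TP=4 breakage above.
MIN_THREAD_N = 64  # Marlin kernel's minimum output-tile width

def marlin_shard_ok(output_size: int, tp: int) -> bool:
    """A column-parallel layer splits output_size across tp ranks;
    each shard must still be at least MIN_THREAD_N columns wide."""
    return output_size % tp == 0 and output_size // tp >= MIN_THREAD_N

# The in_proj_ba projections have output_size=128, so at TP=4 each rank
# gets 128 / 4 = 32 columns, below Marlin's minimum. The fix keeps these
# small layers replicated (ReplicatedLinear) instead of sharding them.
print(marlin_shard_ok(128, 4))  # False: 32 < 64
print(marlin_shard_ok(128, 1))  # True: the full 128-wide layer is fine
```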
Eugene Rakhmatulin
d42c4199fa Unsloth chat template for qwen3.5 2026-03-06 23:35:18 -08:00
Eugene Rakhmatulin
9dc09bd04b Renamed recipe for qwen3.5-35b-a3b-fp8 to match others 2026-03-06 13:56:06 -08:00
eugr
d148d95a19 Merge pull request #80 from oliverjohnwilson/recipe-add_minimax-m2.5_qwen3.5-397b-a17B-fp8
added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory
2026-03-06 11:46:37 -08:00
eugr
3fabd3fb1c Merge pull request #72 from erikvullings/main
Add Qwen35-35B-A3B recipe in FP8 format
2026-03-05 16:27:50 -08:00
Eugene Rakhmatulin
a749fcce87 Added a recipe for qwen3.5-122B-FP8 2026-03-04 16:49:39 -08:00
oliverjohnwilson
4303f8b6d0 added minimax-m2.5 and qwen3.5-397b-a17B-fp8 recipes to a recipes/4x-spark-cluster/ subdirectory 2026-03-04 16:01:37 -06:00
Erik Vullings
163f23d85b Update qwen35-35b-a3b-fp8.yaml
--max_num_batched_tokens now has a default value, which can be overridden via the CLI
2026-03-03 12:46:12 +01:00
Eugene Rakhmatulin
7d8465fd9c Added recipe for qwen3.5-122b-int4-autoround, updated README 2026-03-02 12:18:16 -08:00
Erik Vullings
e8f94d6b8b Add Qwen35-35B-A3B recipe in FP8 format 2026-02-27 17:46:06 +01:00
Eugene Rakhmatulin
4c8f90395b Changed the reasoning parser in Minimax for better compatibility with modern clients (like coding tools). 2026-02-21 11:53:13 -08:00
Eugene Rakhmatulin
5b2313dddb Changed KV type to fp8 in qwen3-coder-next recipe and reduced default context size to 131072 to ensure it all fits in a single Spark. 2026-02-17 13:07:54 -08:00
Eugene Rakhmatulin
1e7f2d5640 Small fix for M2.5 recipe 2026-02-16 11:38:34 -08:00
Eugene Rakhmatulin
24f42be5cc Added a recipe for MiniMax M2.5 AWQ 2026-02-16 11:35:53 -08:00
Eugene Rakhmatulin
701147b1eb Qwen3-Coder-Next fixes and updated recipe 2026-02-12 15:56:32 -08:00
Eugene Rakhmatulin
c6b245cfe8 Added prefix caching to nemotron recipe 2026-02-10 18:25:01 -08:00
Eugene Rakhmatulin
74876dd442 Added recipes for nemotron-nano-3 and qwen3-coder-next 2026-02-09 14:33:35 -08:00
Raphael Amorim
6943a51ced Adding tests and refactoring repeated methods 2026-02-09 17:21:32 -05:00
Raphael Amorim
b7c3cdcfcb Enhancement: add -- pass-through for arbitrary vLLM arguments
Implements Unix-style pass-through allowing any vLLM argument to be
passed after `--` separator. Arguments are appended verbatim to the
generated vLLM command.

Examples:
  ./run-recipe.py model --solo -- --load-format safetensors
  ./run-recipe.py model --solo -- --served-model-name my-api
  ./run-recipe.py model --solo -- -cc.cudagraph_mode=PIECEWISE

Features:
- Uses parse_known_args() to capture arguments after --
- Warns when extra args duplicate CLI overrides (--port, --tp, etc.)
- Works in both solo and cluster modes

Adds 10 integration tests covering:
- --load-format, --served-model-name, equals syntax
- Multiple arguments, empty --, cluster mode
- Duplicate detection warnings for port/tp/gpu-mem

Closes #30
2026-02-08 02:36:49 -05:00
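The `--` pass-through described in the commit above can be sketched with argparse's `parse_known_args()`. The option set here is a minimal assumption; run-recipe.py's real interface is larger:

```python
# Minimal sketch of Unix-style `--` pass-through via parse_known_args().
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument("model")
parser.add_argument("--solo", action="store_true")
parser.add_argument("--port", type=int, default=8000)

def split_passthrough(argv):
    """Return (parsed known options, extra args to append verbatim)."""
    known, extra = parser.parse_known_args(argv)
    if extra and extra[0] == "--":  # strip separator if argparse left it
        extra = extra[1:]
    # warn when pass-through duplicates a first-class CLI override
    for flag in ("--port", "--tp", "--gpu-memory-utilization"):
        if any(a == flag or a.startswith(flag + "=") for a in extra):
            print(f"warning: {flag} is also a run-recipe option",
                  file=sys.stderr)
    return known, extra

known, extra = split_passthrough(
    ["mymodel", "--solo", "--", "--load-format", "safetensors"])
print(extra)  # ['--load-format', 'safetensors']
```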
Eugene Rakhmatulin
ec987259a0 Recipes and Launch Script support 2026-02-04 12:01:53 -08:00
Raphael Amorim
30f16f1d4e feat: Add recipe-based one-click model deployment system
Introduces a YAML recipe system for simplified model deployment:

- run-recipe.py: Main script handling build, download, and launch
- run-recipe.sh: Bash wrapper for dependency management
- recipes/: Pre-configured recipes for common models
  - glm-4.7-flash-awq.yaml: GLM-4.7-Flash with AWQ quantization
  - glm-4.7-nvfp4.yaml: GLM-4.7 with NVFP4 (cluster-only)
  - minimax-m2-awq.yaml: MiniMax M2 with AWQ
  - openai-gpt-oss-120b.yaml: OpenAI GPT-OSS 120B with MXFP4

Key features:
- Auto-discover cluster nodes with --discover, saves to .env
- Load nodes from .env automatically on subsequent runs
- cluster_only flag for models requiring multi-node setup
- build_args field for Dockerfile selection (--pre-tf, --exp-mxfp4)
- Solo mode auto-strips --distributed-executor-backend ray
- --setup flag for full build + download + run workflow
- --dry-run to preview execution without running

Usage:
  ./run-recipe.sh --discover           # Find and save cluster nodes
  ./run-recipe.sh glm-4.7-flash-awq --solo --setup
  ./run-recipe.sh glm-4.7-nvfp4 --setup  # Uses nodes from .env
2026-02-03 16:09:12 -05:00
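The recipe-to-launch flow introduced above can be sketched as follows. Field names mirror the commit message (`cluster_only`, solo mode stripping the Ray executor flag), but the schema and model id are illustrative assumptions, and the recipe is shown as an already-parsed dict rather than a YAML file:

```python
# Hypothetical sketch of turning a parsed recipe into a vLLM launch command.
recipe = {
    "model": "org/example-model-awq",  # illustrative model id
    "cluster_only": False,
    "args": ["--quantization", "awq",
             "--distributed-executor-backend", "ray"],
}

def build_command(recipe: dict, solo: bool) -> list[str]:
    if solo and recipe.get("cluster_only"):
        raise SystemExit("recipe requires a multi-node cluster")
    args = list(recipe["args"])
    if solo and "--distributed-executor-backend" in args:
        # solo mode auto-strips the Ray executor flag and its value
        i = args.index("--distributed-executor-backend")
        del args[i:i + 2]
    return ["vllm", "serve", recipe["model"], *args]

print(build_command(recipe, solo=True))
# ['vllm', 'serve', 'org/example-model-awq', '--quantization', 'awq']
```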