Merge pull request #19

Eugene Rakhmatulin
2026-02-04 12:03:20 -08:00
15 changed files with 3020 additions and 10 deletions

59
.github/workflows/test-recipes.yml vendored Normal file

@@ -0,0 +1,59 @@
name: Recipe Tests
on:
  push:
    branches: [ main, profiles ]
    paths:
      - 'run-recipe.py'
      - 'run-recipe.sh'
      - 'launch-cluster.sh'
      - 'recipes/**'
      - 'tests/**'
      - '.github/workflows/test-recipes.yml'
  pull_request:
    paths:
      - 'run-recipe.py'
      - 'run-recipe.sh'
      - 'launch-cluster.sh'
      - 'recipes/**'
      - 'tests/**'
      - '.github/workflows/test-recipes.yml'
  workflow_dispatch:
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.10', '3.11', '3.12']
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pyyaml
      - name: Make scripts executable
        run: |
          chmod +x run-recipe.py run-recipe.sh launch-cluster.sh
          chmod +x tests/test_recipes.sh
      - name: Run recipe integration tests
        run: |
          ./tests/test_recipes.sh -v
      - name: Verify all recipes with dry-run
        run: |
          for recipe in recipes/*.yaml; do
            name=$(basename "$recipe" .yaml)
            echo "Testing recipe: $name"
            ./run-recipe.py "$name" --dry-run --solo || exit 1
          done

118
README.md

@@ -16,10 +16,11 @@ While it was primarily developed to support multi-node inference, it works just
- [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
- [5. Configuration Details](#5-configuration-details)
- [6. Mods and Patches](#6-mods-and-patches)
- [7. Using cluster mode for inference](#7-using-cluster-mode-for-inference)
- [8. Fastsafetensors](#8-fastsafetensors)
- [9. Benchmarking](#9-benchmarking)
- [10. Downloading Models](#10-downloading-models)
- [7. Launch Scripts](#7-launch-scripts)
- [8. Using cluster mode for inference](#8-using-cluster-mode-for-inference)
- [9. Fastsafetensors](#9-fastsafetensors)
- [10. Benchmarking](#10-benchmarking)
- [11. Downloading Models](#11-downloading-models)
## DISCLAIMER
@@ -158,6 +159,58 @@ Don't do it every time you rebuild, because it will slow down compilation times.
For periodic maintenance, I recommend using a filter: `docker builder prune --filter until=72h`
### 2026-02-04
#### Recipes support
A major contribution from @raphaelamorim: model recipes.
Recipes let you launch models with preconfigured settings in a single command.
Example:
```bash
# List available recipes
./run-recipe.sh --list
# Run a recipe in solo mode (single node)
./run-recipe.sh glm-4.7-flash-awq --solo
# Full setup: build container + download model + run
./run-recipe.sh glm-4.7-flash-awq --solo --setup
# Run with overrides
./run-recipe.sh glm-4.7-flash-awq --solo --port 9000 --gpu-mem 0.8
# Cluster deployment
./run-recipe.sh glm-4.7-nvfp4 --setup
```
Please refer to the [documentation](recipes/README.md) for details.
#### Launch script option
You can now specify a launch script to execute on the head node instead of passing a command directly via the `exec` action.
Example:
```bash
./launch-cluster.sh --launch-script examples/vllm-openai-gpt-oss-120b.sh
```
Thanks @raphaelamorim for the contribution!
#### Ability to apply vLLM PRs during build
`./build-and-copy.sh` now supports applying vLLM PRs to builds. The PR is applied on top of the most recent vLLM commit (or a specific vllm-ref if set). This does NOT apply to the wheels build or the MXFP4 special build!
To use it, specify `--apply-vllm-pr <pr_num>` in the arguments. Note that it may fail if the PR needs a rebase against the specified vLLM reference or main branch. Use with caution!
Example:
```bash
./build-and-copy.sh -t vllm-node-20260204-pr31740 --apply-vllm-pr 31740 -c
```
### 2026-02-02
#### Nemotron Nano mod
@@ -670,6 +723,7 @@ You can override the auto-detected values if needed:
| `--nccl-debug` | NCCL debug level (e.g., INFO, WARN). Defaults to INFO if flag is present but value is omitted. |
| `--check-config` | Check configuration and auto-detection without launching. |
| `--solo` | Solo mode: skip autodetection, launch only on current node, do not launch Ray cluster |
| `--launch-script` | Path to a bash script to execute in the container (from the examples/ directory or an absolute path). If a launch script is specified, the action should be omitted. |
| `-d` | Run in daemon mode (detached). |
## 3\. Running the Container (Manual)
@@ -846,7 +900,55 @@ Mods can be used for:
- Customizing vLLM behavior for specific workloads
- Rapid iteration on development without rebuilding the entire image
## 7\. Using cluster mode for inference
## 7\. Launch Scripts
Launch scripts provide a simple way to define reusable model configurations. Instead of passing long command lines, you can create a bash script that is copied into the container and executed directly.
### Basic Usage
```bash
# Use a launch script by name (looks in examples/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax
# Use with explicit nodes
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script vllm-openai-gpt-oss-120b.sh
# Combine with mods for models requiring patches
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
```
### Script Format
Launch scripts are simple bash files that run directly inside the container:
```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Set environment variables if needed
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# Run your command
vllm serve openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--enable-auto-tool-choice
```
### Available Launch Scripts
The `examples/` directory contains ready-to-use launch scripts:
- **example-vllm-minimax.sh** - MiniMax-M2-AWQ with Ray distributed backend
- **vllm-openai-gpt-oss-120b.sh** - OpenAI GPT-OSS 120B with FlashInfer MOE
- **vllm-glm-4.7-nvfp4.sh** - GLM-4.7-NVFP4 (requires the glm4_moe patch mod)
See [examples/README.md](examples/README.md) for detailed documentation and more examples.
## 8\. Using cluster mode for inference
First, follow the instructions above to start the head container on your first Spark and the node container on the second Spark.
Then, on the first Spark, run vllm like this:
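For example, to serve gpt-oss-120b across both Sparks over Ray, a sketch based on the fastsafetensors command shown later in this README (adjust the model and flags to your setup):
```bash
docker exec -it vllm_node bash
# then, inside the container:
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 \
    --trust_remote_code -tp 2 --distributed-executor-backend ray
```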
@@ -863,7 +965,7 @@ docker exec -it vllm_node
And execute the vllm command inside.
## 8\. Fastsafetensors
## 9\. Fastsafetensors
This build includes support for fastsafetensors loading, which significantly improves loading speeds, especially on DGX Spark, where mmap performance is currently very poor.
[Fastsafetensors](https://github.com/foundation-model-stack/fastsafetensors/) solves this issue by using more efficient multi-threaded loading while avoiding mmap.
@@ -877,11 +979,11 @@ To use this method, simply include `--load-format fastsafetensors` when running
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
```
## 9\. Benchmarking
## 10\. Benchmarking
I recommend using [llama-benchy](https://github.com/eugr/llama-benchy) - a new benchmarking tool that delivers results in the same format as llama-bench from llama.cpp suite.
## 10\. Downloading Models
## 11\. Downloading Models
The `hf-download.sh` script provides a convenient way to download models from HuggingFace and distribute them across your cluster nodes. It uses Huggingface CLI via `uvx` for fast downloads and `rsync` for distribution across the cluster.
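For example, a typical invocation might look like the following (the argument form here is an assumption; see the script's own help for the exact flags):
```bash
# Download a model from HuggingFace and rsync it to the other cluster nodes
./hf-download.sh openai/gpt-oss-120b
```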

186
examples/README.md Normal file

@@ -0,0 +1,186 @@
# Example Launch Scripts
This directory contains example bash scripts that demonstrate how to use the `--launch-script` option directly with `launch-cluster.sh`.
**Note:** For most use cases, the recipe system (`./run-recipe.sh`) is the recommended approach. These examples are provided for reference and for advanced users who need direct control over the launch process.
## Why Launch Scripts?
- **Simple** - Just write a bash script that runs your command
- **Flexible** - Use any bash features: environment variables, conditionals, loops
- **Standalone** - Each script can be tested directly on a head node
- **No magic** - What you see is what gets executed
## Usage
```bash
# Use a launch script by name (looks in examples/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax
# Use a launch script by filename
./launch-cluster.sh --launch-script example-vllm-minimax.sh
# Use a launch script with absolute path
./launch-cluster.sh --launch-script /path/to/my-script.sh
# Combine with mods if needed
./launch-cluster.sh --launch-script my-script.sh --apply-mod mods/my-patch
# Combine with other options
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script my-model.sh -d
```
When using `--launch-script`, the `exec` action is automatically implied if no action is specified.
## Script Structure
Launch scripts are simple bash scripts. The script is copied into the container at `/workspace/exec-script.sh` and executed.
```bash
#!/bin/bash
# PROFILE: Human-readable name
# DESCRIPTION: What this script does
# Optional: Set environment variables
export MY_VAR="value"
# Run your command
vllm serve org/model-name \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.7
```
### Metadata Comments
The `# PROFILE:` and `# DESCRIPTION:` comments are optional but recommended for documentation:
```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
```
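Since the metadata lives in ordinary comments, you can list every script with its name and description using standard tools; a small sketch (not part of the repo):
```bash
# Print each example script with its PROFILE and DESCRIPTION metadata
for f in examples/*.sh; do
    profile=$(grep -m1 '^# PROFILE:' "$f" | sed 's/^# PROFILE: *//')
    desc=$(grep -m1 '^# DESCRIPTION:' "$f" | sed 's/^# DESCRIPTION: *//')
    printf '%-40s %s | %s\n' "$f" "$profile" "$desc"
done
```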
## Examples
### Basic vLLM Serving
```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
vllm serve QuantTrio/MiniMax-M2-AWQ \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 128000 \
--load-format fastsafetensors \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2
```
### With Environment Variables
```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
vllm serve openai/gpt-oss-120b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8000
```
### With Conditional Logic
```bash
#!/bin/bash
# PROFILE: Adaptive Model Server
# DESCRIPTION: Adjusts settings based on available GPUs
GPU_COUNT=$(nvidia-smi -L | wc -l)
echo "Detected $GPU_COUNT GPUs"
if [[ $GPU_COUNT -ge 4 ]]; then
TP_SIZE=4
MEM_UTIL=0.9
else
TP_SIZE=2
MEM_UTIL=0.7
fi
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--port 8000 \
--host 0.0.0.0 \
-tp $TP_SIZE \
--gpu-memory-utilization $MEM_UTIL \
--distributed-executor-backend ray
```
### SGLang
```bash
#!/bin/bash
# PROFILE: SGLang Llama 3.1
# DESCRIPTION: SGLang runtime with Llama 3.1
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--host 0.0.0.0 \
--tp 2
```
### With Model Requiring Patches
If your model requires patches, use `--apply-mod` alongside `--launch-script`:
```bash
# Script: vllm-glm-4.7-nvfp4.sh
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: Requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
vllm serve Salyut1/GLM-4.7-NVFP4 \
--attention-config.backend flashinfer \
--tool-call-parser glm47 \
-tp 2 \
--host 0.0.0.0 \
--port 8000
```
Usage:
```bash
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
```
## Creating a New Launch Script
1. Create a new `.sh` file in this directory
2. Add the shebang `#!/bin/bash`
3. Add `# PROFILE:` and `# DESCRIPTION:` comments
4. Write your command (e.g., `vllm serve ...`)
5. Run with `./launch-cluster.sh --launch-script my-script.sh` (the `exec` action is implied)
## Testing Scripts
Since launch scripts are standard bash files, you can test them directly:
```bash
# Inside a running container or on a head node with the runtime installed
cd examples
./my-script.sh
```
This makes development and debugging much easier than with complex configuration systems.

15
examples/example-vllm-minimax.sh

@@ -0,0 +1,15 @@
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
vllm serve QuantTrio/MiniMax-M2-AWQ \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 128000 \
--load-format fastsafetensors \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think

17
examples/vllm-glm-4.7-nvfp4.sh

@@ -0,0 +1,17 @@
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: This profile requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 to fix k/v scales incompatibility
# See: https://huggingface.co/Salyut1/GLM-4.7-NVFP4/discussions/3#694ab9b6e2efa04b7ecb0c4b
vllm serve Salyut1/GLM-4.7-NVFP4 \
--attention-config.backend flashinfer \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
-tp 2 \
--gpu-memory-utilization 0.88 \
--max-model-len 32000 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8000

20
examples/vllm-openai-gpt-oss-120b.sh

@@ -0,0 +1,20 @@
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
vllm serve openai/gpt-oss-120b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.70 \
--max-model-len 128000 \
--max-num-batched-tokens 4096 \
--max-num-seqs 8 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000

launch-cluster.sh

@@ -26,6 +26,8 @@ ACTION="start"
CLUSTER_WAS_RUNNING="false"
MOD_PATHS=()
MOD_TYPES=()
LAUNCH_SCRIPT_PATH=""
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
ACTIONS_ARG=""
SOLO_MODE="false"
@@ -41,11 +43,16 @@ usage() {
echo " -e, --env Environment variable to pass to container (e.g. -e VAR=val)"
echo " --nccl-debug NCCL debug level (Optional, one of: VERSION, WARN, INFO, TRACE). If no level is provided, defaults to INFO."
echo " --apply-mod Path to directory or zip file containing run.sh to apply before launch (Can be specified multiple times)"
echo " --launch-script Path to bash script to execute in the container (from examples/ directory or absolute path). If launch script is specified, action should be omitted."
echo " --check-config Check configuration and auto-detection without launching"
echo " --solo Solo mode: skip autodetection, launch only on current node, do not launch Ray cluster"
echo " -d Daemon mode (only for 'start' action)"
echo " action start | stop | status | exec (Default: start)"
echo " command Command to run (only for 'exec' action)"
echo " action start | stop | status | exec (Default: start). Not compatible with --launch-script."
echo " command Command to run (only for 'exec' action). Not compatible with --launch-script."
echo ""
echo "Launch Script Usage:"
echo " $0 --launch-script examples/my-script.sh # Script copied to container and executed"
echo " $0 --launch-script /path/to/script.sh # Uses absolute path to script"
exit 1
}
@@ -59,6 +66,7 @@ while [[ "$#" -gt 0 ]]; do
--ib-if) IB_IF="$2"; shift ;;
-e|--env) DOCKER_ARGS="$DOCKER_ARGS -e $2"; shift ;;
--apply-mod) MOD_PATHS+=("$2"); shift ;;
--launch-script) LAUNCH_SCRIPT_PATH="$2"; shift ;;
--nccl-debug)
if [[ -n "$2" && "$2" =~ ^(VERSION|WARN|INFO|TRACE)$ ]]; then
NCCL_DEBUG_VAL="$2"
@@ -72,9 +80,17 @@ while [[ "$#" -gt 0 ]]; do
-d) DAEMON_MODE="true" ;;
-h|--help) usage ;;
start|stop|status)
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
echo "Error: Action '$1' is not compatible with --launch-script. Please omit the action or not use --launch-script."
exit 1
fi
ACTION="$1"
;;
exec)
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
echo "Error: Action 'exec' is not compatible with --launch-script. Please omit the action or not use --launch-script."
exit 1
fi
ACTION="exec"
shift
COMMAND_TO_RUN="$@"
@@ -85,6 +101,10 @@ while [[ "$#" -gt 0 ]]; do
# unless it's the default 'start' implied.
# However, to support "omitted" = start, we need to be careful.
# If the arg looks like a command, it's exec.
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
echo "Error: Command is not compatible with --launch-script. Please omit the command or not use --launch-script."
exit 1
fi
ACTION="exec"
COMMAND_TO_RUN="$@"
break
@@ -107,6 +127,37 @@ if [[ -n "$NCCL_DEBUG_VAL" ]]; then
esac
fi
# Resolve launch script path if specified
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
# Check if it's an absolute path or relative path that exists
if [[ -f "$LAUNCH_SCRIPT_PATH" ]]; then
LAUNCH_SCRIPT_PATH=$(realpath "$LAUNCH_SCRIPT_PATH")
# Check if it's just a filename, look in examples/ directory
elif [[ -f "$SCRIPT_DIR/examples/$LAUNCH_SCRIPT_PATH" ]]; then
LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/examples/$LAUNCH_SCRIPT_PATH"
# Check if it's a name without .sh extension
elif [[ -f "$SCRIPT_DIR/examples/${LAUNCH_SCRIPT_PATH}.sh" ]]; then
LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/examples/${LAUNCH_SCRIPT_PATH}.sh"
else
echo "Error: Launch script '$LAUNCH_SCRIPT_PATH' not found."
echo "Searched in:"
echo " - $LAUNCH_SCRIPT_PATH"
echo " - $SCRIPT_DIR/examples/$LAUNCH_SCRIPT_PATH"
echo " - $SCRIPT_DIR/examples/${LAUNCH_SCRIPT_PATH}.sh"
exit 1
fi
echo "Using launch script: $LAUNCH_SCRIPT_PATH"
# Set command to run the copied script (use an absolute path, since docker exec may not start in /workspace)
COMMAND_TO_RUN="/workspace/exec-script.sh"
# A launch script implies the exec action (explicit actions are rejected during argument parsing)
if [[ "$ACTION" == "start" ]]; then
ACTION="exec"
fi
fi
# Validate MOD_PATHS if set
for i in "${!MOD_PATHS[@]}"; do
mod_path="${MOD_PATHS[$i]}"
@@ -427,6 +478,25 @@ apply_mod_to_container() {
fi
}
# Copy Launch Script to Container Function
copy_launch_script_to_container() {
local container="$1"
local script_path="$2"
echo "Copying launch script to head node..."
local target_script_path="$script_path"
# Copy script into container as /workspace/exec-script.sh
echo " Copying script into container..."
docker cp "$target_script_path" "$container:/workspace/exec-script.sh"
# Make executable
docker exec "$container" chmod +x /workspace/exec-script.sh
echo " Launch script copied to head node"
}
# Start Cluster Function
start_cluster() {
check_cluster_running
@@ -495,6 +565,11 @@ start_cluster() {
done
fi
# Copy launch script to head node only (workers don't need it - they just run Ray)
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
copy_launch_script_to_container "$CONTAINER_NAME" "$LAUNCH_SCRIPT_PATH"
fi
if [[ "$SOLO_MODE" == "false" ]]; then
wait_for_cluster
else

266
recipes/README.md Normal file

@@ -0,0 +1,266 @@
# Recipes
Recipes provide a **one-click solution** for deploying models with pre-configured settings. Each recipe is a YAML file that specifies:
- HuggingFace model to download
- Container image and build arguments
- Required mods/patches
- Default parameters (port, host, tensor parallelism, etc.)
- Environment variables
- The vLLM serve command
## Quick Start
```bash
# List available recipes
./run-recipe.sh --list
# Run a recipe in solo mode (single node)
./run-recipe.sh glm-4.7-flash-awq --solo
# Full setup: build container + download model + run
./run-recipe.sh glm-4.7-flash-awq --solo --setup
# Run with overrides
./run-recipe.sh glm-4.7-flash-awq --solo --port 9000 --gpu-mem 0.8
# Cluster deployment
./run-recipe.sh glm-4.7-nvfp4 -n 192.168.1.10,192.168.1.11 --setup
```
## Cluster Node Discovery
The recipe runner can automatically discover cluster nodes:
```bash
# Auto-discover nodes and save to .env
./run-recipe.sh --discover
# Show current .env configuration
./run-recipe.sh --show-env
# Run recipe (uses nodes from .env automatically)
./run-recipe.sh glm-4.7-nvfp4 --setup
```
When you run `--discover`, it:
1. Scans the network for nodes with SSH access
2. Prompts you to select which nodes to include
3. Saves the configuration to `.env`
Future recipe runs will automatically use nodes from `.env` unless you specify `-n` or `--solo`.
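The file uses shell-style `KEY=value` lines; a hypothetical example (the exact key names are defined by `run-recipe.py`, so treat this as an assumption):
```bash
# .env (hypothetical contents; key names are an assumption)
NODES=192.168.1.10,192.168.1.11
```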
## Workflow Modes
### Solo Mode (Single Node)
```bash
# Explicitly run in solo mode
./run-recipe.sh glm-4.7-flash-awq --solo
# If no nodes configured, defaults to solo
./run-recipe.sh minimax-m2-awq
```
### Cluster Mode (Multiple Nodes)
```bash
# Specify nodes directly (first IP is head node)
./run-recipe.sh glm-4.7-nvfp4 -n 192.168.1.10,192.168.1.11 --setup
# Or use auto-discovered nodes from .env
./run-recipe.sh --discover # First time only
./run-recipe.sh glm-4.7-nvfp4 --setup
```
When using cluster mode with `--setup`:
- Container is built locally and copied to all worker nodes
- Model is downloaded locally and copied to all worker nodes
### Cluster-Only Recipes
Some models are too large to run on a single node. These recipes have `cluster_only: true` and will fail with a helpful error if you try to run them in solo mode:
```bash
$ ./run-recipe.sh glm-4.7-nvfp4 --solo
Error: Recipe 'GLM-4.7-NVFP4' requires cluster mode.
This model is too large to run on a single node.
Options:
1. Specify nodes directly: ./run-recipe.sh glm-4.7-nvfp4 -n node1,node2
2. Auto-discover and save: ./run-recipe.sh --discover
Then run: ./run-recipe.sh glm-4.7-nvfp4
```
## Setup Options
| Flag | Description |
|------|-------------|
| `--setup` | Full setup: build (if missing) + download (if missing) + run |
| `--build-only` | Only build/copy the container, don't run |
| `--download-only` | Only download/copy the model, don't run |
| `--force-build` | Rebuild even if container exists |
| `--force-download` | Re-download even if model exists |
| `--dry-run` | Show what would happen without executing |
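These options compose, so you can stage the heavy steps ahead of time and launch separately, for example:
```bash
# Build the container and fetch the model first, then launch when ready
./run-recipe.sh glm-4.7-flash-awq --build-only
./run-recipe.sh glm-4.7-flash-awq --download-only
./run-recipe.sh glm-4.7-flash-awq --solo
```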
## Recipe Format
```yaml
# Required fields
name: Human-readable name
container: docker-image-name
command: |
  vllm serve model/name \
    --port {port} \
    --host {host}

# Optional fields
description: What this recipe does
model: org/model-name    # HuggingFace model ID for --setup downloads
cluster_only: false      # Set to true if model requires cluster mode
build_args:              # Extra args for build-and-copy.sh
  - --pre-tf             # e.g., for transformers 5.0
  - --exp-mxfp4          # e.g., for MXFP4 Dockerfile
mods:
  - mods/some-patch
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.85
  max_model_len: 32000
env:
  SOME_VAR: "value"
```
### Build Arguments
The `build_args` field passes flags to `build-and-copy.sh`:
| Flag | Description |
|------|-------------|
| `--pre-tf` | Use transformers 5.0 (required for GLM-4.7 models) |
| `--exp-mxfp4` | Use MXFP4 Dockerfile (for MXFP4 quantized models) |
| `--use-wheels` | Use pre-built wheels instead of building from source |
### Parameter Substitution
Use `{param_name}` in the command to substitute values from defaults or CLI overrides:
```yaml
defaults:
  port: 8000
  tensor_parallel: 2
command: |
  vllm serve my/model \
    --port {port} \
    -tp {tensor_parallel}
```
Override at runtime:
```bash
./run-recipe.sh my-recipe --port 9000 --tp 4
```
## CLI Reference
```
Usage: ./run-recipe.sh [OPTIONS] [RECIPE]
Cluster discovery:
--discover Auto-detect cluster nodes and save to .env
--show-env Show current .env configuration
Recipe overrides:
--port PORT Override port
--host HOST Override host
--tensor-parallel, --tp N Override tensor parallelism
--gpu-memory-utilization N Override GPU memory utilization (--gpu-mem)
--max-model-len N Override max model length
Setup options:
--setup Full setup: build + download + run
--build-only Only build/copy container, don't run
--download-only Only download/copy model, don't run
--force-build Rebuild even if container exists
--force-download Re-download even if model exists
Launch options:
--solo Run in solo mode (single node, no Ray)
-n, --nodes IPS Comma-separated node IPs (first = head)
-d, --daemon Run in daemon mode
-t, --container IMAGE Override container from recipe
--nccl-debug LEVEL NCCL debug level (VERSION, WARN, INFO, TRACE)
Other:
--dry-run Show what would be executed
--list, -l List available recipes
```
## Creating a Recipe
1. Create a new `.yaml` file in `recipes/`
2. Specify required fields: `name`, `container`, `command`
3. Add `build_args` if your model needs special build options
4. Add `mods` if your model needs patches
5. Set `cluster_only: true` if model is too large for single node
6. Set sensible `defaults`
7. Add `env` variables if needed
Example:
```yaml
name: My Model
description: My custom model setup
container: vllm-node-tf5
build_args:
  - --pre-tf
mods:
  - mods/my-fix
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.85
command: |
  vllm serve org/my-model \
    --port {port} \
    --host {host} \
    -tp {tensor_parallel} \
    --gpu-memory-utilization {gpu_memory_utilization}
```
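Assuming the file is saved as `recipes/my-model.yaml`, you can sanity-check the generated command without launching anything:
```bash
# Validate the new recipe; prints the launch script instead of executing it
./run-recipe.sh my-model --dry-run --solo
```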
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│ run-recipe.sh / run-recipe.py │
│ - Parses YAML recipe │
│ - Auto-discovers cluster nodes (--discover) │
│ - Loads nodes from .env │
│ - Handles --setup (build + download + run) │
│ - Generates launch script from template │
│ - Applies CLI overrides │
└──────────┬────────────────────────┬─────────────────────┘
│ calls (for build) │ calls (for download)
▼ ▼
┌──────────────────────┐ ┌───────────────────────────────┐
│ build-and-copy.sh │ │ hf-download.sh │
│ - Docker build │ │ - HuggingFace model download │
│ - Copy to workers │ │ - Rsync to workers │
└──────────────────────┘ └───────────────────────────────┘
│ then calls (for run)
┌─────────────────────────────────────────────────────────┐
│ launch-cluster.sh │
│ - Cluster orchestration │
│ - Container lifecycle │
│ - Mod application │
│ - Launch script execution │
└─────────────────────────────────────────────────────────┘
```
This separation follows the Unix philosophy: `run-recipe.sh` provides convenience, while the underlying scripts remain focused on their specific tasks.

64
recipes/glm-4.7-flash-awq.yaml

@@ -0,0 +1,64 @@
# Recipe: GLM-4.7-Flash-AWQ-4bit
# cyankiwi's AWQ quantized GLM-4.7-Flash model
# Requires a patch for inference speed optimization
#
# NOTE: the vLLM implementation is suboptimal even with the patch.
# Model performance is still significantly slower than it should be
# for a model with this number of active parameters. Running in a cluster
# improves prompt processing performance, but not token generation.
# Expect ~40 t/s generation speed on both a single node and a cluster.
recipe_version: "1"
name: GLM-4.7-Flash-AWQ
description: vLLM serving cyankiwi/GLM-4.7-Flash-AWQ-4bit with speed optimization patch
# HuggingFace model to download
model: cyankiwi/GLM-4.7-Flash-AWQ-4bit
# This model can run on single node (solo) or cluster
cluster_only: false
# Container image to use
container: vllm-node-tf5
# Build arguments for build-and-copy.sh
# tf5 = transformers 5.0 (required for GLM-4.7)
build_args:
- --pre-tf
# Mods to apply before running (paths relative to repo root)
# This mod prevents severe inference speed degradation
mods:
- mods/fix-glm-4.7-flash-AWQ
# Default settings (can be overridden via CLI)
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 1
gpu_memory_utilization: 0.7
max_model_len: 202752
max_num_batched_tokens: 4096
max_num_seqs: 64
served_model_name: glm-4.7-flash
# Environment variables to set in the container
env:
# Add any required env vars here
# The vLLM serve command template
# Use {var_name} for substitution from defaults/overrides
# In cluster mode, --distributed-executor-backend ray and -tp 2 are added
command: |
vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name {served_model_name} \
--max-model-len {max_model_len} \
--max-num-batched-tokens {max_num_batched_tokens} \
--max-num-seqs {max_num_seqs} \
--gpu-memory-utilization {gpu_memory_utilization} \
-tp {tensor_parallel} \
--host {host} \
--port {port}

40
recipes/minimax-m2-awq.yaml

@@ -0,0 +1,40 @@
# Recipe: MiniMax-M2-AWQ
# MiniMax M2 model with AWQ quantization
recipe_version: "1"
name: MiniMax-M2-AWQ
description: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
# HuggingFace model to download (optional, for --download-model)
model: QuantTrio/MiniMax-M2-AWQ
# Container image to use
container: vllm-node
# No mods required
mods: []
# Default settings (can be overridden via CLI)
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
gpu_memory_utilization: 0.7
max_model_len: 128000
# Environment variables
env: {}
# The vLLM serve command template
command: |
vllm serve QuantTrio/MiniMax-M2-AWQ \
--port {port} \
--host {host} \
--gpu-memory-utilization {gpu_memory_utilization} \
-tp {tensor_parallel} \
--distributed-executor-backend ray \
--max-model-len {max_model_len} \
--load-format fastsafetensors \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think

52
recipes/openai-gpt-oss-120b.yaml

@@ -0,0 +1,52 @@
# Recipe: OpenAI GPT-OSS 120B
# OpenAI's open source 120B MoE model with MXFP4 quantization support
recipe_version: "1"
name: OpenAI GPT-OSS 120B
description: vLLM serving openai/gpt-oss-120b with MXFP4 quantization and FlashInfer
# HuggingFace model to download (optional, for --download-model)
model: openai/gpt-oss-120b
# Container image to use
container: vllm-node-mxfp4
# Build arguments for build-and-copy.sh
build_args:
- --exp-mxfp4
# No mods required for this model
mods: []
# Default settings (can be overridden via CLI)
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
gpu_memory_utilization: 0.70
max_num_batched_tokens: 8192
# Environment variables to set in the container
env:
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: "1"
# The vLLM serve command template
# Uses MXFP4 quantization for memory efficiency
command: |
vllm serve openai/gpt-oss-120b \
--tool-call-parser openai \
--reasoning-parser openai_gptoss \
--enable-auto-tool-choice \
--tensor-parallel-size {tensor_parallel} \
--distributed-executor-backend ray \
--gpu-memory-utilization {gpu_memory_utilization} \
--enable-prefix-caching \
--load-format fastsafetensors \
--quantization mxfp4 \
--mxfp4-backend CUTLASS \
--mxfp4-layers moe,qkv,o,lm_head \
--attention-backend FLASHINFER \
--kv-cache-dtype fp8 \
--max-num-batched-tokens {max_num_batched_tokens} \
--host {host} \
--port {port}

1124
run-recipe.py Executable file

File diff suppressed because it is too large

42
run-recipe.sh Executable file

@@ -0,0 +1,42 @@
#!/bin/bash
#
# run-recipe.sh - Wrapper for run-recipe.py
#
# Ensures Python dependencies are available and runs the recipe runner.
#
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RECIPE_SCRIPT="$SCRIPT_DIR/run-recipe.py"
# Check for Python 3.10+
if command -v python3 &>/dev/null; then
PYTHON=python3
elif command -v python &>/dev/null; then
PYTHON=python
else
echo "Error: Python 3 not found. Please install Python 3.10 or later."
exit 1
fi
# Verify version
PY_VERSION=$($PYTHON -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')
PY_MAJOR=$($PYTHON -c 'import sys; print(sys.version_info.major)')
PY_MINOR=$($PYTHON -c 'import sys; print(sys.version_info.minor)')
if [[ "$PY_MAJOR" -lt 3 ]] || [[ "$PY_MAJOR" -eq 3 && "$PY_MINOR" -lt 10 ]]; then
echo "Error: Python 3.10+ required, found $PY_VERSION"
exit 1
fi
# Check for PyYAML and install if missing
if ! $PYTHON -c "import yaml" 2>/dev/null; then
echo "Installing PyYAML..."
$PYTHON -m pip install --quiet pyyaml
if [[ $? -ne 0 ]]; then
echo "Error: Failed to install PyYAML. Try: pip install pyyaml"
exit 1
fi
fi
# Run the recipe script
exec $PYTHON "$RECIPE_SCRIPT" "$@"

89
tests/expected_commands.sh

@@ -0,0 +1,89 @@
# Expected vLLM serve arguments for each recipe
# This file is used by test_recipes.sh to verify recipes match README documentation
#
# Format: Each recipe has a section with expected arguments
# Tests will verify these arguments appear in the dry-run output
#
# IMPORTANT: Keep this in sync with README.md documentation
# When updating recipes, update both README.md and this file
# ==============================================================================
# glm-4.7-flash-awq
# README Reference: Lines 186-198 (solo) and 203-218 (cluster)
# ==============================================================================
GLM_FLASH_AWQ_MODEL="cyankiwi/GLM-4.7-Flash-AWQ-4bit"
GLM_FLASH_AWQ_CONTAINER="vllm-node-tf5"
GLM_FLASH_AWQ_MOD="mods/fix-glm-4.7-flash-AWQ"
GLM_FLASH_AWQ_ARGS=(
"--tool-call-parser glm47"
"--reasoning-parser glm45"
"--enable-auto-tool-choice"
"--served-model-name glm-4.7-flash"
"--max-model-len 202752"
"--max-num-batched-tokens 4096"
"--max-num-seqs 64"
"--gpu-memory-utilization 0.7"
"--port 8888"
"--host 0.0.0.0"
)
# ==============================================================================
# openai-gpt-oss-120b
# README Reference: Lines 244-257 (solo) and 264-280 (cluster)
# ==============================================================================
GPT_OSS_MODEL="openai/gpt-oss-120b"
GPT_OSS_CONTAINER="vllm-node-mxfp4"
GPT_OSS_ARGS=(
"--port 8888"
"--host 0.0.0.0"
"--enable-auto-tool-choice"
"--tool-call-parser openai"
"--reasoning-parser openai_gptoss"
"--gpu-memory-utilization 0.7"
"--enable-prefix-caching"
"--load-format fastsafetensors"
"--quantization mxfp4"
"--mxfp4-backend CUTLASS"
"--mxfp4-layers moe,qkv,o,lm_head"
"--attention-backend FLASHINFER"
"--kv-cache-dtype fp8"
"--max-num-batched-tokens 8192"
)
# ==============================================================================
# minimax-m2-awq
# README Reference: Not explicitly documented, but based on model requirements
# ==============================================================================
MINIMAX_MODEL="QuantTrio/MiniMax-M2-AWQ"
MINIMAX_CONTAINER="vllm-node"
MINIMAX_ARGS=(
"--port 8000"
"--host 0.0.0.0"
"--gpu-memory-utilization 0.7"
"--max-model-len 128000"
"--load-format fastsafetensors"
"--enable-auto-tool-choice"
"--tool-call-parser minimax_m2"
"--reasoning-parser minimax_m2_append_think"
)
# ==============================================================================
# Cluster Mode Expected Arguments
# These are arguments that should appear ONLY in cluster mode
# Note: Tests use 2 nodes, so tensor_parallel = 2 (1 GPU per node)
# ==============================================================================
# glm-4.7-flash-awq cluster mode (no distributed backend - single GPU model)
GLM_FLASH_AWQ_CLUSTER_TP="1"
# openai-gpt-oss-120b cluster mode (2 nodes = tp 2)
GPT_OSS_CLUSTER_TP="2"
GPT_OSS_CLUSTER_ARGS=(
"--distributed-executor-backend ray"
)
# minimax-m2-awq cluster mode (2 nodes = tp 2)
MINIMAX_CLUSTER_TP="2"
MINIMAX_CLUSTER_ARGS=(
"--distributed-executor-backend ray"
)

859
tests/test_recipes.sh Executable file

@@ -0,0 +1,859 @@
#!/bin/bash
#
# test_recipes.sh - Integration tests for run-recipe.py and launch-cluster.sh
#
# These tests use --dry-run mode to verify compatibility without actually
# running containers. Suitable for CI/CD pipelines.
#
# Usage:
# ./tests/test_recipes.sh # Run all tests
# ./tests/test_recipes.sh -v # Verbose output
#
set -e
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
VERBOSE="${1:-}"
# Load expected commands for README verification
source "$SCRIPT_DIR/expected_commands.sh"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Test counters
TESTS_PASSED=0
TESTS_FAILED=0
TESTS_SKIPPED=0
# Helper functions
log_test() {
echo -e "${YELLOW}[TEST]${NC} $1"
}
log_pass() {
echo -e "${GREEN}[PASS]${NC} $1"
TESTS_PASSED=$((TESTS_PASSED + 1))
}
log_fail() {
echo -e "${RED}[FAIL]${NC} $1"
TESTS_FAILED=$((TESTS_FAILED + 1))
}
log_skip() {
echo -e "${YELLOW}[SKIP]${NC} $1"
TESTS_SKIPPED=$((TESTS_SKIPPED + 1))
}
log_verbose() {
if [[ "$VERBOSE" == "-v" ]]; then
echo " $1"
fi
}
# Check prerequisites
check_prerequisites() {
log_test "Checking prerequisites..."
if ! command -v python3 &> /dev/null; then
log_fail "python3 not found"
exit 1
fi
# Check Python version (numeric comparison via bc mis-orders 3.9 vs 3.10, so compare the version tuple instead)
python_version=$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')
if ! python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)'; then
log_fail "Python 3.10+ required, found $python_version"
exit 1
fi
# Check PyYAML
if ! python3 -c "import yaml" 2>/dev/null; then
log_fail "PyYAML not installed"
exit 1
fi
log_pass "Prerequisites OK (Python $python_version with PyYAML)"
}
# Test: run-recipe.py exists and is executable
test_run_recipe_exists() {
log_test "run-recipe.py exists and is executable"
if [[ -x "$PROJECT_DIR/run-recipe.py" ]]; then
log_pass "run-recipe.py is executable"
else
log_fail "run-recipe.py not found or not executable"
fi
}
# Test: launch-cluster.sh exists and is executable
test_launch_cluster_exists() {
log_test "launch-cluster.sh exists and is executable"
if [[ -x "$PROJECT_DIR/launch-cluster.sh" ]]; then
log_pass "launch-cluster.sh is executable"
else
log_fail "launch-cluster.sh not found or not executable"
fi
}
# Test: run-recipe.py --list works
test_list_recipes() {
log_test "run-recipe.py --list"
output=$("$PROJECT_DIR/run-recipe.py" --list 2>&1)
if [[ $? -eq 0 ]] && echo "$output" | grep -q "Available recipes"; then
log_pass "--list shows available recipes"
log_verbose "Found recipes in output"
else
log_fail "--list failed or no recipes found"
log_verbose "$output"
fi
}
# Test: All recipes have required recipe_version field
test_recipe_version_required() {
log_test "All recipes have required recipe_version field"
local all_valid=true
for recipe in "$PROJECT_DIR/recipes/"*.yaml; do
if [[ -f "$recipe" ]]; then
recipe_name=$(basename "$recipe")
if ! grep -q "^recipe_version:" "$recipe"; then
log_verbose "$recipe_name missing recipe_version"
all_valid=false
fi
fi
done
if [[ "$all_valid" == "true" ]]; then
log_pass "All recipes have recipe_version field"
else
log_fail "Some recipes missing recipe_version field"
fi
}
# Test: All recipes load without errors
test_all_recipes_load() {
log_test "All recipes load without errors"
local all_valid=true
for recipe in "$PROJECT_DIR/recipes/"*.yaml; do
if [[ -f "$recipe" ]]; then
recipe_name=$(basename "$recipe" .yaml)
# Try to load recipe with --dry-run (will fail early if recipe is invalid)
if ! "$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1 | grep -q "Error:"; then
log_verbose "$recipe_name loads OK"
else
log_verbose "$recipe_name failed to load"
all_valid=false
fi
fi
done
if [[ "$all_valid" == "true" ]]; then
log_pass "All recipes load successfully"
else
log_fail "Some recipes failed to load"
fi
}
# Test: Dry-run generates valid launch script
test_dry_run_generates_script() {
log_test "Dry-run generates valid launch script"
# Find first available recipe
first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
if [[ -z "$first_recipe" ]]; then
log_skip "No recipes found"
return
fi
recipe_name=$(basename "$first_recipe" .yaml)
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)
if echo "$output" | grep -q "#!/bin/bash" && echo "$output" | grep -q "vllm serve"; then
log_pass "Dry-run generates bash script with vllm serve command"
else
log_fail "Dry-run output doesn't contain expected content"
log_verbose "$output"
fi
}
# Test: Solo mode sets tensor_parallel=1
test_solo_mode_tp1() {
log_test "Solo mode sets tensor_parallel=1"
first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
if [[ -z "$first_recipe" ]]; then
log_skip "No recipes found"
return
fi
recipe_name=$(basename "$first_recipe" .yaml)
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)
# Check that -tp 1 is in the output (solo mode should set tp=1)
if echo "$output" | grep -q "\-tp 1"; then
log_pass "Solo mode correctly sets -tp 1"
else
log_fail "Solo mode did not set -tp 1"
log_verbose "$output"
fi
}
# Test: Solo mode removes --distributed-executor-backend ray
test_solo_mode_removes_ray() {
log_test "Solo mode removes --distributed-executor-backend ray"
first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
if [[ -z "$first_recipe" ]]; then
log_skip "No recipes found"
return
fi
recipe_name=$(basename "$first_recipe" .yaml)
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)
# Check that --distributed-executor-backend is NOT in the output
if ! echo "$output" | grep -q "\-\-distributed-executor-backend"; then
log_pass "Solo mode correctly removes --distributed-executor-backend"
else
log_fail "Solo mode did not remove --distributed-executor-backend"
log_verbose "$output"
fi
}
# Test: Cluster mode preserves --distributed-executor-backend ray
test_cluster_mode_keeps_ray() {
log_test "Cluster mode preserves --distributed-executor-backend ray"
# Use minimax-m2-awq which explicitly has --distributed-executor-backend ray
if [[ ! -f "$PROJECT_DIR/recipes/minimax-m2-awq.yaml" ]]; then
log_skip "minimax-m2-awq.yaml not found"
return
fi
output=$("$PROJECT_DIR/run-recipe.py" minimax-m2-awq --dry-run -n "192.168.1.1,192.168.1.2" 2>&1)
# Check that --distributed-executor-backend IS in the output for cluster mode
if echo "$output" | grep -q "\-\-distributed-executor-backend ray"; then
log_pass "Cluster mode correctly preserves --distributed-executor-backend ray"
else
log_fail "Cluster mode did not preserve --distributed-executor-backend"
log_verbose "$output"
fi
}
# Test: CLI overrides work (--port)
test_cli_override_port() {
log_test "CLI override --port works"
first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
if [[ -z "$first_recipe" ]]; then
log_skip "No recipes found"
return
fi
recipe_name=$(basename "$first_recipe" .yaml)
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo --port 9999 2>&1)
if echo "$output" | grep -q "\-\-port 9999"; then
log_pass "--port override correctly applied"
else
log_fail "--port override not found in output"
log_verbose "$output"
fi
}
# Test: launch-cluster.sh --help works
test_launch_cluster_help() {
log_test "launch-cluster.sh --help"
output=$("$PROJECT_DIR/launch-cluster.sh" --help 2>&1 || true)
if echo "$output" | grep -q "Usage:"; then
log_pass "--help shows usage information"
else
log_fail "--help did not show usage"
log_verbose "$output"
fi
}
# Test: launch-cluster.sh references examples/ not profiles/
test_launch_cluster_examples_path() {
log_test "launch-cluster.sh references examples/ directory"
if grep -q "examples/" "$PROJECT_DIR/launch-cluster.sh"; then
log_pass "launch-cluster.sh references examples/"
else
log_fail "launch-cluster.sh does not reference examples/"
fi
if grep -q "profiles/" "$PROJECT_DIR/launch-cluster.sh"; then
log_fail "launch-cluster.sh still references profiles/"
fi
}
# Test: Unsupported recipe version shows warning
test_unsupported_recipe_version() {
log_test "Unsupported recipe_version shows warning"
# Create a temporary recipe with unsupported version
temp_recipe=$(mktemp)
cat > "$temp_recipe" << 'EOF'
recipe_version: "999"
name: Test Recipe
container: test-container
command: echo "test"
EOF
output=$("$PROJECT_DIR/run-recipe.py" "$temp_recipe" --dry-run --solo 2>&1)
rm -f "$temp_recipe"
if echo "$output" | grep -q "Warning.*schema version"; then
log_pass "Unsupported recipe_version shows warning"
else
log_fail "No warning for unsupported recipe_version"
log_verbose "$output"
fi
}
# Test: Missing recipe_version fails
test_missing_recipe_version_fails() {
log_test "Missing recipe_version field fails"
# Create a temporary recipe without recipe_version
temp_recipe=$(mktemp)
cat > "$temp_recipe" << 'EOF'
name: Test Recipe
container: test-container
command: echo "test"
EOF
output=$("$PROJECT_DIR/run-recipe.py" "$temp_recipe" --dry-run --solo 2>&1 || true)
rm -f "$temp_recipe"
if echo "$output" | grep -q "Error.*recipe_version"; then
log_pass "Missing recipe_version correctly fails"
else
log_fail "Missing recipe_version did not fail as expected"
log_verbose "$output"
fi
}
# Test: cluster_only recipe fails in solo mode
test_cluster_only_fails_solo() {
log_test "cluster_only recipe fails in solo mode"
# Create a temporary cluster_only recipe
temp_recipe=$(mktemp)
cat > "$temp_recipe" << 'EOF'
recipe_version: "1"
name: Cluster Only Test
container: test-container
cluster_only: true
command: echo "test"
EOF
output=$("$PROJECT_DIR/run-recipe.py" "$temp_recipe" --dry-run --solo 2>&1 || true)
exit_code=$?
rm -f "$temp_recipe"
if echo "$output" | grep -q "requires cluster mode"; then
log_pass "cluster_only recipe correctly fails in solo mode"
else
log_fail "cluster_only recipe did not fail in solo mode"
log_verbose "$output"
fi
}
# ==============================================================================
# Launch-cluster.sh Command Line Verification Tests
# ==============================================================================
# These tests verify that the dry-run output contains the expected
# launch-cluster.sh command line arguments matching the recipe configuration.
# Helper: Extract launch-cluster command from dry-run output
extract_launch_cmd() {
echo "$1" | grep -A5 "launch-cluster.sh is called with:" | grep -v "launch-cluster.sh is called with:" | tr '\n' ' '
}
# Test: Solo mode generates --solo flag in launch-cluster command
test_launch_cmd_solo_flag() {
log_test "Launch command includes --solo flag in solo mode"
first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
if [[ -z "$first_recipe" ]]; then
log_skip "No recipes found"
return
fi
recipe_name=$(basename "$first_recipe" .yaml)
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
if echo "$launch_cmd" | grep -q "\-\-solo"; then
log_pass "Launch command includes --solo flag"
else
log_fail "Launch command missing --solo flag"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# Test: Cluster mode generates -n flag with nodes
test_launch_cmd_nodes_flag() {
log_test "Launch command includes -n flag with nodes in cluster mode"
output=$("$PROJECT_DIR/run-recipe.py" minimax-m2-awq --dry-run -n "10.0.0.1,10.0.0.2" 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
if echo "$launch_cmd" | grep -q "\-n 10.0.0.1,10.0.0.2"; then
log_pass "Launch command includes -n with correct nodes"
else
log_fail "Launch command missing or incorrect -n flag"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# Test: Container image from recipe is passed to launch-cluster
test_launch_cmd_container_image() {
log_test "Launch command includes correct container image (-t)"
# Use openai-gpt-oss-120b which has a specific container name
if [[ ! -f "$PROJECT_DIR/recipes/openai-gpt-oss-120b.yaml" ]]; then
log_skip "openai-gpt-oss-120b.yaml not found"
return
fi
output=$("$PROJECT_DIR/run-recipe.py" openai-gpt-oss-120b --dry-run --solo 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
# Check the container is vllm-node-mxfp4 (from the recipe)
if echo "$launch_cmd" | grep -q "\-t vllm-node-mxfp4"; then
log_pass "Launch command includes correct container image"
else
log_fail "Launch command has wrong container image"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# Test: Mods from recipe are passed as --apply-mod
test_launch_cmd_mods() {
log_test "Launch command includes --apply-mod for recipe mods"
# Use glm-4.7-flash-awq which has a mod
if [[ ! -f "$PROJECT_DIR/recipes/glm-4.7-flash-awq.yaml" ]]; then
log_skip "glm-4.7-flash-awq.yaml not found"
return
fi
output=$("$PROJECT_DIR/run-recipe.py" glm-4.7-flash-awq --dry-run --solo 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
if echo "$launch_cmd" | grep -q "\-\-apply-mod"; then
log_pass "Launch command includes --apply-mod for mods"
else
log_fail "Launch command missing --apply-mod"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# Test: Daemon mode flag is passed through
test_launch_cmd_daemon_flag() {
log_test "Launch command includes -d flag in daemon mode"
first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
if [[ -z "$first_recipe" ]]; then
log_skip "No recipes found"
return
fi
recipe_name=$(basename "$first_recipe" .yaml)
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo -d 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
if echo "$launch_cmd" | grep -q "\-d"; then
log_pass "Launch command includes -d flag"
else
log_fail "Launch command missing -d flag"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# Test: NCCL debug level is passed through
test_launch_cmd_nccl_debug() {
log_test "Launch command includes --nccl-debug when specified"
first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
if [[ -z "$first_recipe" ]]; then
log_skip "No recipes found"
return
fi
recipe_name=$(basename "$first_recipe" .yaml)
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo --nccl-debug INFO 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
if echo "$launch_cmd" | grep -q "\-\-nccl-debug INFO"; then
log_pass "Launch command includes --nccl-debug INFO"
else
log_fail "Launch command missing --nccl-debug"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# Test: --launch-script is always included
test_launch_cmd_launch_script() {
log_test "Launch command includes --launch-script"
first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
if [[ -z "$first_recipe" ]]; then
log_skip "No recipes found"
return
fi
recipe_name=$(basename "$first_recipe" .yaml)
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
if echo "$launch_cmd" | grep -q "\-\-launch-script"; then
log_pass "Launch command includes --launch-script"
else
log_fail "Launch command missing --launch-script"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# Test: Container override (-t CLI) takes precedence
test_launch_cmd_container_override() {
log_test "CLI container override (-t) takes precedence"
first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
if [[ -z "$first_recipe" ]]; then
log_skip "No recipes found"
return
fi
recipe_name=$(basename "$first_recipe" .yaml)
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo -t my-custom-image 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
if echo "$launch_cmd" | grep -q "\-t my-custom-image"; then
log_pass "Container override correctly applied"
else
log_fail "Container override not applied"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# Test: Cluster mode does NOT include --solo flag
test_launch_cmd_no_solo_in_cluster() {
log_test "Launch command does NOT include --solo in cluster mode"
output=$("$PROJECT_DIR/run-recipe.py" minimax-m2-awq --dry-run -n "10.0.0.1,10.0.0.2" 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
if echo "$launch_cmd" | grep -qv "\-\-solo" || ! echo "$launch_cmd" | grep -q "\-\-solo"; then
log_pass "Cluster mode correctly omits --solo flag"
else
log_fail "Cluster mode incorrectly includes --solo flag"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# ==============================================================================
# README Documentation Verification Tests
# ==============================================================================
# These tests verify that recipe dry-run output matches the expected commands
# documented in README.md. Expected values are defined in expected_commands.sh
# Helper: Extract the generated launch script from dry-run output
extract_vllm_command() {
# Extract lines between "Generated Launch Script" and "What would be executed"
echo "$1" | sed -n '/=== Generated Launch Script ===/,/=== What would be executed ===/p' | grep -v "===" | grep -v "^#" | grep -v "^$"
}
# Helper: Verify a recipe contains all expected arguments
verify_recipe_args() {
local recipe_name="$1"
local expected_model="$2"
local expected_container="$3"
shift 3
local expected_args=("$@")
log_test "README match: $recipe_name"
if [[ ! -f "$PROJECT_DIR/recipes/${recipe_name}.yaml" ]]; then
log_skip "${recipe_name}.yaml not found"
return
fi
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)
vllm_cmd=$(extract_vllm_command "$output")
launch_cmd=$(extract_launch_cmd "$output")
local all_passed=true
local missing_args=()
# Check model name
if ! echo "$vllm_cmd" | grep -q "$expected_model"; then
missing_args+=("model: $expected_model")
all_passed=false
fi
# Check container
if ! echo "$launch_cmd" | grep -q "\-t $expected_container"; then
missing_args+=("container: $expected_container")
all_passed=false
fi
# Check each expected argument
for arg in "${expected_args[@]}"; do
# Handle arguments that may have slight formatting differences
# Extract the flag and value separately for flexible matching
local flag=$(echo "$arg" | awk '{print $1}')
local value=$(echo "$arg" | cut -d' ' -f2-)
# Use grep -F for fixed string matching (avoids -- being treated as grep options)
if ! echo "$vllm_cmd" | grep -qF -- "$flag"; then
missing_args+=("$arg")
all_passed=false
elif [[ -n "$value" ]] && [[ "$value" != "$flag" ]]; then
# Check if value is present (might be on next line due to formatting)
if ! echo "$vllm_cmd" | grep -qF -- "$value"; then
missing_args+=("$arg (flag present, value mismatch)")
all_passed=false
fi
fi
done
if [[ "$all_passed" == "true" ]]; then
log_pass "README match: $recipe_name - all expected arguments present"
else
log_fail "README match: $recipe_name - missing arguments"
for missing in "${missing_args[@]}"; do
log_verbose " Missing: $missing"
done
log_verbose " vLLM command: $vllm_cmd"
fi
}
# Test: glm-4.7-flash-awq matches README documentation
test_readme_glm_flash_awq() {
verify_recipe_args "glm-4.7-flash-awq" \
"$GLM_FLASH_AWQ_MODEL" \
"$GLM_FLASH_AWQ_CONTAINER" \
"${GLM_FLASH_AWQ_ARGS[@]}"
}
# Test: openai-gpt-oss-120b matches README documentation
test_readme_gpt_oss() {
verify_recipe_args "openai-gpt-oss-120b" \
"$GPT_OSS_MODEL" \
"$GPT_OSS_CONTAINER" \
"${GPT_OSS_ARGS[@]}"
}
# Test: minimax-m2-awq matches expected configuration
test_readme_minimax() {
verify_recipe_args "minimax-m2-awq" \
"$MINIMAX_MODEL" \
"$MINIMAX_CONTAINER" \
"${MINIMAX_ARGS[@]}"
}
# Test: glm-4.7-flash-awq includes correct mod
test_readme_glm_flash_mod() {
log_test "README match: glm-4.7-flash-awq mod path"
if [[ ! -f "$PROJECT_DIR/recipes/glm-4.7-flash-awq.yaml" ]]; then
log_skip "glm-4.7-flash-awq.yaml not found"
return
fi
output=$("$PROJECT_DIR/run-recipe.py" glm-4.7-flash-awq --dry-run --solo 2>&1)
launch_cmd=$(extract_launch_cmd "$output")
if echo "$launch_cmd" | grep -q "$GLM_FLASH_AWQ_MOD"; then
log_pass "README match: glm-4.7-flash-awq has correct mod path"
else
log_fail "README match: glm-4.7-flash-awq missing expected mod: $GLM_FLASH_AWQ_MOD"
log_verbose "Launch cmd: $launch_cmd"
fi
}
# Helper: Verify cluster mode specific arguments
verify_cluster_args() {
local recipe_name="$1"
local expected_tp="$2"
shift 2
local expected_args=("$@")
log_test "README match (cluster): $recipe_name"
if [[ ! -f "$PROJECT_DIR/recipes/${recipe_name}.yaml" ]]; then
log_skip "${recipe_name}.yaml not found"
return
fi
# Use fake nodes for cluster mode
output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run -n "10.0.0.1,10.0.0.2" 2>&1)
vllm_cmd=$(extract_vllm_command "$output")
local all_passed=true
local missing_args=()
# Check tensor parallel
if ! echo "$vllm_cmd" | grep -qE "(--tensor-parallel-size|-tp) $expected_tp"; then
missing_args+=("tensor_parallel: $expected_tp")
all_passed=false
fi
# Check cluster-specific arguments
for arg in "${expected_args[@]}"; do
if ! echo "$vllm_cmd" | grep -qF -- "$arg"; then
missing_args+=("$arg")
all_passed=false
fi
done
if [[ "$all_passed" == "true" ]]; then
log_pass "README match (cluster): $recipe_name - cluster args correct"
else
log_fail "README match (cluster): $recipe_name - missing cluster arguments"
for missing in "${missing_args[@]}"; do
log_verbose " Missing: $missing"
done
log_verbose " vLLM command: $vllm_cmd"
fi
}
# Test: openai-gpt-oss-120b cluster mode has correct tensor_parallel and ray backend
test_readme_gpt_oss_cluster() {
verify_cluster_args "openai-gpt-oss-120b" \
"$GPT_OSS_CLUSTER_TP" \
"${GPT_OSS_CLUSTER_ARGS[@]}"
}
# Test: minimax-m2-awq cluster mode has correct tensor_parallel and ray backend
test_readme_minimax_cluster() {
verify_cluster_args "minimax-m2-awq" \
"$MINIMAX_CLUSTER_TP" \
"${MINIMAX_CLUSTER_ARGS[@]}"
}
# Test: glm-4.7-flash-awq cluster mode stays at tp=1 (single GPU model)
test_readme_glm_flash_cluster() {
log_test "README match (cluster): glm-4.7-flash-awq stays tp=1"
if [[ ! -f "$PROJECT_DIR/recipes/glm-4.7-flash-awq.yaml" ]]; then
log_skip "glm-4.7-flash-awq.yaml not found"
return
fi
# Even in cluster mode, this model uses tp=1
output=$("$PROJECT_DIR/run-recipe.py" glm-4.7-flash-awq --dry-run -n "10.0.0.1,10.0.0.2" 2>&1)
vllm_cmd=$(extract_vllm_command "$output")
if echo "$vllm_cmd" | grep -qE "(--tensor-parallel-size|-tp) 1"; then
log_pass "README match (cluster): glm-4.7-flash-awq correctly keeps tp=1"
else
log_fail "README match (cluster): glm-4.7-flash-awq should have tp=1"
log_verbose " vLLM command: $vllm_cmd"
fi
}
# Run all tests
main() {
echo "=============================================="
echo " run-recipe.py Integration Tests"
echo "=============================================="
echo ""
cd "$PROJECT_DIR"
check_prerequisites
echo ""
# File existence tests
test_run_recipe_exists
test_launch_cluster_exists
echo ""
# Basic functionality tests
test_list_recipes
test_recipe_version_required
test_all_recipes_load
echo ""
# Dry-run tests
test_dry_run_generates_script
test_solo_mode_tp1
test_solo_mode_removes_ray
test_cluster_mode_keeps_ray
test_cli_override_port
echo ""
# launch-cluster.sh command line verification tests
echo "--- Launch Command Verification ---"
test_launch_cmd_solo_flag
test_launch_cmd_nodes_flag
test_launch_cmd_container_image
test_launch_cmd_mods
test_launch_cmd_daemon_flag
test_launch_cmd_nccl_debug
test_launch_cmd_launch_script
test_launch_cmd_container_override
test_launch_cmd_no_solo_in_cluster
echo ""
# README documentation verification tests
echo "--- README Documentation Verification (Solo Mode) ---"
test_readme_glm_flash_awq
test_readme_gpt_oss
test_readme_minimax
test_readme_glm_flash_mod
echo ""
# Cluster mode documentation verification tests
echo "--- README Documentation Verification (Cluster Mode) ---"
test_readme_gpt_oss_cluster
test_readme_minimax_cluster
test_readme_glm_flash_cluster
echo ""
# launch-cluster.sh tests
test_launch_cluster_help
test_launch_cluster_examples_path
echo ""
# Validation tests
test_unsupported_recipe_version
test_missing_recipe_version_fails
test_cluster_only_fails_solo
echo ""
# Summary
echo "=============================================="
echo " Test Summary"
echo "=============================================="
echo -e " ${GREEN}Passed:${NC} $TESTS_PASSED"
echo -e " ${RED}Failed:${NC} $TESTS_FAILED"
echo -e " ${YELLOW}Skipped:${NC} $TESTS_SKIPPED"
echo "=============================================="
if [[ $TESTS_FAILED -gt 0 ]]; then
exit 1
fi
exit 0
}
main "$@"