Merge branch 'main' into pytorch-base
59  .github/workflows/test-recipes.yml  (vendored, new file)
@@ -0,0 +1,59 @@
name: Recipe Tests

on:
  push:
    branches: [ main, profiles ]
    paths:
      - 'run-recipe.py'
      - 'run-recipe.sh'
      - 'launch-cluster.sh'
      - 'recipes/**'
      - 'tests/**'
      - '.github/workflows/test-recipes.yml'
  pull_request:
    paths:
      - 'run-recipe.py'
      - 'run-recipe.sh'
      - 'launch-cluster.sh'
      - 'recipes/**'
      - 'tests/**'
      - '.github/workflows/test-recipes.yml'
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest

    strategy:
      matrix:
        python-version: ['3.10', '3.11', '3.12']

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pyyaml

      - name: Make scripts executable
        run: |
          chmod +x run-recipe.py run-recipe.sh launch-cluster.sh
          chmod +x tests/test_recipes.sh

      - name: Run recipe integration tests
        run: |
          ./tests/test_recipes.sh -v

      - name: Verify all recipes with dry-run
        run: |
          for recipe in recipes/*.yaml; do
            name=$(basename "$recipe" .yaml)
            echo "Testing recipe: $name"
            ./run-recipe.py "$name" --dry-run --solo || exit 1
          done
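For reference, the same checks the workflow runs can be executed locally from the repository root before pushing (a sketch assembled from the steps above; assumes Python 3.10+ with PyYAML available):

```bash
pip install pyyaml                      # the only dependency the workflow installs
chmod +x run-recipe.py run-recipe.sh launch-cluster.sh tests/test_recipes.sh
./tests/test_recipes.sh -v              # integration tests
for recipe in recipes/*.yaml; do        # dry-run every recipe, as in CI
  ./run-recipe.py "$(basename "$recipe" .yaml)" --dry-run --solo || exit 1
done
```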
118  README.md
@@ -16,10 +16,11 @@ While it was primarily developed to support multi-node inference, it works just
 - [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
 - [5. Configuration Details](#5-configuration-details)
 - [6. Mods and Patches](#6-mods-and-patches)
-- [7. Using cluster mode for inference](#7-using-cluster-mode-for-inference)
-- [8. Fastsafetensors](#8-fastsafetensors)
-- [9. Benchmarking](#9-benchmarking)
-- [10. Downloading Models](#10-downloading-models)
+- [7. Launch Scripts](#7-launch-scripts)
+- [8. Using cluster mode for inference](#8-using-cluster-mode-for-inference)
+- [9. Fastsafetensors](#9-fastsafetensors)
+- [10. Benchmarking](#10-benchmarking)
+- [11. Downloading Models](#11-downloading-models)
 
 ## DISCLAIMER
 
@@ -158,6 +159,58 @@ Don't do it every time you rebuild, because it will slow down compilation times.
 
 For periodic maintenance, I recommend using a filter: `docker builder prune --filter until=72h`
 
+### 2026-02-04
+
+#### Recipes support
+
+A major contribution from @raphaelamorim - model recipes.
+Recipes let you launch models with preconfigured settings in a single command.
+
+Example:
+
+```bash
+# List available recipes
+./run-recipe.sh --list
+
+# Run a recipe in solo mode (single node)
+./run-recipe.sh glm-4.7-flash-awq --solo
+
+# Full setup: build container + download model + run
+./run-recipe.sh glm-4.7-flash-awq --solo --setup
+
+# Run with overrides
+./run-recipe.sh glm-4.7-flash-awq --solo --port 9000 --gpu-mem 0.8
+
+# Cluster deployment
+./run-recipe.sh glm-4.7-nvfp4 --setup
+```
+
+Please refer to the [documentation](recipes/README.md) for details.
+
+#### Launch script option
+
+You can now specify a launch script to execute on the head node instead of passing a command directly via the `exec` action.
+Example:
+
+```bash
+./launch-cluster.sh --launch-script examples/vllm-openai-gpt-oss-120b.sh
+```
+
+Thanks @raphaelamorim for the contribution!
+
+#### Ability to apply vLLM PRs during build
+
+`./build-and-copy.sh` now supports applying vLLM PRs to builds. The PR is applied to the most recent vLLM commit (or to a specific vllm-ref if set). This does NOT apply to the wheels build or the MXFP4 special build!
+
+To use it, specify `--apply-vllm-pr <pr_num>` in the arguments. Note that it may fail if the PR needs a rebase against the specified vLLM reference or main branch. Use with caution!
+
+Example:
+
+```bash
+./build-and-copy.sh -t vllm-node-20260204-pr31740 --apply-vllm-pr 31740 -c
+```
+
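For context, fetching and applying a GitHub PR to a local checkout generally looks like the sketch below; this illustrates the mechanism only, not necessarily what `build-and-copy.sh` does internally (the PR number is the one from the example above):

```bash
# Inside the vLLM source checkout used for the build
git fetch origin pull/31740/head:pr-31740   # GitHub exposes PRs as pull/<num>/head
git merge --no-edit pr-31740                # fails here if the PR needs a rebase
```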
 ### 2026-02-02
 
 #### Nemotron Nano mod
@@ -670,6 +723,7 @@ You can override the auto-detected values if needed:
 | `--nccl-debug` | NCCL debug level (e.g., INFO, WARN). Defaults to INFO if flag is present but value is omitted. |
 | `--check-config` | Check configuration and auto-detection without launching. |
 | `--solo` | Solo mode: skip autodetection, launch only on current node, do not launch Ray cluster |
+| `--launch-script` | Path to a bash script to execute in the container (from the `examples/` directory or an absolute path). If a launch script is specified, the action should be omitted. |
 | `-d` | Run in daemon mode (detached). |
 
 ## 3\. Running the Container (Manual)
@@ -846,7 +900,55 @@ Mods can be used for:
 - Customizing vLLM behavior for specific workloads
 - Rapid iteration on development without rebuilding the entire image
 
-## 7\. Using cluster mode for inference
+## 7\. Launch Scripts
+
+Launch scripts provide a simple way to define reusable model configurations. Instead of passing long command lines, you can create a bash script that is copied into the container and executed directly.
+
+### Basic Usage
+
+```bash
+# Use a launch script by name (looks in the examples/ directory)
+./launch-cluster.sh --launch-script example-vllm-minimax
+
+# Use with explicit nodes
+./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script vllm-openai-gpt-oss-120b.sh
+
+# Combine with mods for models requiring patches
+./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
+```
+
+### Script Format
+
+Launch scripts are simple bash files that run directly inside the container:
+
+```bash
+#!/bin/bash
+# PROFILE: OpenAI GPT-OSS 120B
+# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
+
+# Set environment variables if needed
+export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
+
+# Run your command
+vllm serve openai/gpt-oss-120b \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --tensor-parallel-size 2 \
+  --distributed-executor-backend ray \
+  --enable-auto-tool-choice
+```
+
+### Available Launch Scripts
+
+The `examples/` directory contains ready-to-use launch scripts:
+
+- **example-vllm-minimax.sh** - MiniMax-M2-AWQ with Ray distributed backend
+- **vllm-openai-gpt-oss-120b.sh** - OpenAI GPT-OSS 120B with FlashInfer MOE
+- **vllm-glm-4.7-nvfp4.sh** - GLM-4.7-NVFP4 (requires the glm4_moe patch mod)
+
+See [examples/README.md](examples/README.md) for detailed documentation and more examples.
+
+## 8\. Using cluster mode for inference
+
 First, follow the instructions above to start the head container on your first Spark, and the node container on the second Spark.
 Then, on the first Spark, run vllm like this:
||||||
@@ -863,7 +965,7 @@ docker exec -it vllm_node
|
|||||||
|
|
||||||
And execute vllm command inside.
|
And execute vllm command inside.
|
||||||
|
|
||||||
## 8\. Fastsafetensors
|
## 9\. Fastsafetensors
|
||||||
|
|
||||||
This build includes support for fastsafetensors loading which significantly improves loading speeds, especially on DGX Spark where MMAP performance is very poor currently.
|
This build includes support for fastsafetensors loading which significantly improves loading speeds, especially on DGX Spark where MMAP performance is very poor currently.
|
||||||
[Fasttensors](https://github.com/foundation-model-stack/fastsafetensors/) solve this issue by using more efficient multi-threaded loading while avoiding mmap.
|
[Fasttensors](https://github.com/foundation-model-stack/fastsafetensors/) solve this issue by using more efficient multi-threaded loading while avoiding mmap.
|
||||||
@@ -877,11 +979,11 @@ To use this method, simply include `--load-format fastsafetensors` when running
|
|||||||
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
|
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
|
||||||
```
|
```
|
||||||
|
|
||||||
## 9\. Benchmarking
|
## 10\. Benchmarking
|
||||||
|
|
||||||
I recommend using [llama-benchy](https://github.com/eugr/llama-benchy) - a new benchmarking tool that delivers results in the same format as llama-bench from llama.cpp suite.
|
I recommend using [llama-benchy](https://github.com/eugr/llama-benchy) - a new benchmarking tool that delivers results in the same format as llama-bench from llama.cpp suite.
|
||||||
|
|
||||||
## 10\. Downloading Models
|
## 11\. Downloading Models
|
||||||
|
|
||||||
The `hf-download.sh` script provides a convenient way to download models from HuggingFace and distribute them across your cluster nodes. It uses Huggingface CLI via `uvx` for fast downloads and `rsync` for distribution across the cluster.
|
The `hf-download.sh` script provides a convenient way to download models from HuggingFace and distribute them across your cluster nodes. It uses Huggingface CLI via `uvx` for fast downloads and `rsync` for distribution across the cluster.
|
||||||
|
|
||||||
|
|||||||
186  examples/README.md  (new file)
@@ -0,0 +1,186 @@
# Example Launch Scripts

This directory contains example bash scripts that demonstrate how to use the `--launch-script` option directly with `launch-cluster.sh`.

**Note:** For most use cases, the recipe system (`./run-recipe.sh`) is the recommended approach. These examples are provided for reference and for advanced users who need direct control over the launch process.

## Why Launch Scripts?

- **Simple** - Just write a bash script that runs your command
- **Flexible** - Use any bash features: environment variables, conditionals, loops
- **Standalone** - Each script can be tested directly on a head node
- **No magic** - What you see is what gets executed

## Usage

```bash
# Use a launch script by name (looks in the examples/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use a launch script by filename
./launch-cluster.sh --launch-script example-vllm-minimax.sh

# Use a launch script with an absolute path
./launch-cluster.sh --launch-script /path/to/my-script.sh

# Combine with mods if needed
./launch-cluster.sh --launch-script my-script.sh --apply-mod mods/my-patch

# Combine with other options
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script my-model.sh -d
```

When using `--launch-script`, the `exec` action is automatically implied; do not pass an action explicitly.

## Script Structure

Launch scripts are simple bash scripts. The script is copied into the container at `/workspace/exec-script.sh` and executed.

```bash
#!/bin/bash
# PROFILE: Human-readable name
# DESCRIPTION: What this script does

# Optional: Set environment variables
export MY_VAR="value"

# Run your command
vllm serve org/model-name \
  --port 8000 \
  --host 0.0.0.0 \
  --gpu-memory-utilization 0.7
```

### Metadata Comments

The `# PROFILE:` and `# DESCRIPTION:` comments are optional but recommended for documentation:

```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
```

## Examples

### Basic vLLM Serving

```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend

vllm serve QuantTrio/MiniMax-M2-AWQ \
  --port 8000 \
  --host 0.0.0.0 \
  --gpu-memory-utilization 0.7 \
  -tp 2 \
  --distributed-executor-backend ray \
  --max-model-len 128000 \
  --load-format fastsafetensors \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2
```

### With Environment Variables

```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

vllm serve openai/gpt-oss-120b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --host 0.0.0.0 \
  --port 8000
```

### With Conditional Logic

```bash
#!/bin/bash
# PROFILE: Adaptive Model Server
# DESCRIPTION: Adjusts settings based on available GPUs

GPU_COUNT=$(nvidia-smi -L | wc -l)
echo "Detected $GPU_COUNT GPUs"

if [[ $GPU_COUNT -ge 4 ]]; then
  TP_SIZE=4
  MEM_UTIL=0.9
else
  TP_SIZE=2
  MEM_UTIL=0.7
fi

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --port 8000 \
  --host 0.0.0.0 \
  -tp $TP_SIZE \
  --gpu-memory-utilization $MEM_UTIL \
  --distributed-executor-backend ray
```

### SGLang

```bash
#!/bin/bash
# PROFILE: SGLang Llama 3.1
# DESCRIPTION: SGLang runtime with Llama 3.1

sglang launch meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --host 0.0.0.0 \
  --tp 2
```

### With Model Requiring Patches

If your model requires patches, use `--apply-mod` alongside `--launch-script`:

```bash
# Script: vllm-glm-4.7-nvfp4.sh
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: Requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4

vllm serve Salyut1/GLM-4.7-NVFP4 \
  --attention-config.backend flashinfer \
  --tool-call-parser glm47 \
  -tp 2 \
  --host 0.0.0.0 \
  --port 8000
```

Usage:
```bash
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
```

## Creating a New Launch Script

1. Create a new `.sh` file in this directory
2. Add the shebang `#!/bin/bash`
3. Add `# PROFILE:` and `# DESCRIPTION:` comments
4. Write your command (e.g., `vllm serve ...`)
5. Run with `./launch-cluster.sh --launch-script my-script.sh`
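Putting the steps together, a minimal new script could look like the sketch below (the model name and filename are placeholders):

```bash
#!/bin/bash
# PROFILE: My Model
# DESCRIPTION: Minimal launch script template (placeholder model)

vllm serve org/my-model \
  --host 0.0.0.0 \
  --port 8000
```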
## Testing Scripts

Since launch scripts are standard bash files, you can test them directly:

```bash
# Inside a running container or on a head node with the runtime installed
cd examples
./my-script.sh
```

This makes development and debugging much easier than complex configuration systems.
15  examples/example-vllm-minimax.sh  (new file)
@@ -0,0 +1,15 @@
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend

vllm serve QuantTrio/MiniMax-M2-AWQ \
  --port 8000 \
  --host 0.0.0.0 \
  --gpu-memory-utilization 0.7 \
  -tp 2 \
  --distributed-executor-backend ray \
  --max-model-len 128000 \
  --load-format fastsafetensors \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
17  examples/vllm-glm-4.7-nvfp4.sh  (new file)
@@ -0,0 +1,17 @@
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: This profile requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 to fix k/v scales incompatibility
# See: https://huggingface.co/Salyut1/GLM-4.7-NVFP4/discussions/3#694ab9b6e2efa04b7ecb0c4b

vllm serve Salyut1/GLM-4.7-NVFP4 \
  --attention-config.backend flashinfer \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  -tp 2 \
  --gpu-memory-utilization 0.88 \
  --max-model-len 32000 \
  --distributed-executor-backend ray \
  --host 0.0.0.0 \
  --port 8000
20  examples/vllm-openai-gpt-oss-120b.sh  (new file)
@@ -0,0 +1,20 @@
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

vllm serve openai/gpt-oss-120b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.70 \
  --max-model-len 128000 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 8 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000
launch-cluster.sh
@@ -26,6 +26,8 @@ ACTION="start"
 CLUSTER_WAS_RUNNING="false"
 MOD_PATHS=()
 MOD_TYPES=()
+LAUNCH_SCRIPT_PATH=""
+SCRIPT_DIR="$(dirname "$(realpath "$0")")"
 
 ACTIONS_ARG=""
 SOLO_MODE="false"
@@ -41,11 +43,16 @@ usage() {
   echo "  -e, --env         Environment variable to pass to container (e.g. -e VAR=val)"
   echo "  --nccl-debug      NCCL debug level (Optional, one of: VERSION, WARN, INFO, TRACE). If no level is provided, defaults to INFO."
   echo "  --apply-mod       Path to directory or zip file containing run.sh to apply before launch (Can be specified multiple times)"
+  echo "  --launch-script   Path to bash script to execute in the container (from examples/ directory or absolute path). If launch script is specified, action should be omitted."
   echo "  --check-config    Check configuration and auto-detection without launching"
   echo "  --solo            Solo mode: skip autodetection, launch only on current node, do not launch Ray cluster"
   echo "  -d                Daemon mode (only for 'start' action)"
-  echo "  action            start | stop | status | exec (Default: start)"
-  echo "  command           Command to run (only for 'exec' action)"
+  echo "  action            start | stop | status | exec (Default: start). Not compatible with --launch-script."
+  echo "  command           Command to run (only for 'exec' action). Not compatible with --launch-script."
+  echo ""
+  echo "Launch Script Usage:"
+  echo "  $0 --launch-script examples/my-script.sh   # Script copied to container and executed"
+  echo "  $0 --launch-script /path/to/script.sh      # Uses absolute path to script"
   exit 1
 }
|
|
||||||
@@ -59,6 +66,7 @@ while [[ "$#" -gt 0 ]]; do
|
|||||||
--ib-if) IB_IF="$2"; shift ;;
|
--ib-if) IB_IF="$2"; shift ;;
|
||||||
-e|--env) DOCKER_ARGS="$DOCKER_ARGS -e $2"; shift ;;
|
-e|--env) DOCKER_ARGS="$DOCKER_ARGS -e $2"; shift ;;
|
||||||
--apply-mod) MOD_PATHS+=("$2"); shift ;;
|
--apply-mod) MOD_PATHS+=("$2"); shift ;;
|
||||||
|
--launch-script) LAUNCH_SCRIPT_PATH="$2"; shift ;;
|
||||||
--nccl-debug)
|
--nccl-debug)
|
||||||
if [[ -n "$2" && "$2" =~ ^(VERSION|WARN|INFO|TRACE)$ ]]; then
|
if [[ -n "$2" && "$2" =~ ^(VERSION|WARN|INFO|TRACE)$ ]]; then
|
||||||
NCCL_DEBUG_VAL="$2"
|
NCCL_DEBUG_VAL="$2"
|
||||||
@@ -72,9 +80,17 @@ while [[ "$#" -gt 0 ]]; do
     -d) DAEMON_MODE="true" ;;
     -h|--help) usage ;;
     start|stop|status)
+      if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
+        echo "Error: Action '$1' is not compatible with --launch-script. Specify either an action or --launch-script, not both."
+        exit 1
+      fi
       ACTION="$1"
       ;;
     exec)
+      if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
+        echo "Error: Action 'exec' is not compatible with --launch-script. Specify either an action or --launch-script, not both."
+        exit 1
+      fi
       ACTION="exec"
       shift
       COMMAND_TO_RUN="$@"
@@ -85,6 +101,10 @@ while [[ "$#" -gt 0 ]]; do
       # unless it's the default 'start' implied.
       # However, to support "omitted" = start, we need to be careful.
       # If the arg looks like a command, it's exec.
+      if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
+        echo "Error: Command is not compatible with --launch-script. Specify either a command or --launch-script, not both."
+        exit 1
+      fi
       ACTION="exec"
       COMMAND_TO_RUN="$@"
       break
@@ -107,6 +127,37 @@ if [[ -n "$NCCL_DEBUG_VAL" ]]; then
   esac
 fi
 
+# Resolve launch script path if specified
+if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
+  # Check if it's an absolute path or a relative path that exists
+  if [[ -f "$LAUNCH_SCRIPT_PATH" ]]; then
+    LAUNCH_SCRIPT_PATH=$(realpath "$LAUNCH_SCRIPT_PATH")
+  # Check if it's just a filename; look in the examples/ directory
+  elif [[ -f "$SCRIPT_DIR/examples/$LAUNCH_SCRIPT_PATH" ]]; then
+    LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/examples/$LAUNCH_SCRIPT_PATH"
+  # Check if it's a name without the .sh extension
+  elif [[ -f "$SCRIPT_DIR/examples/${LAUNCH_SCRIPT_PATH}.sh" ]]; then
+    LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/examples/${LAUNCH_SCRIPT_PATH}.sh"
+  else
+    echo "Error: Launch script '$LAUNCH_SCRIPT_PATH' not found."
+    echo "Searched in:"
+    echo "  - $LAUNCH_SCRIPT_PATH"
+    echo "  - $SCRIPT_DIR/examples/$LAUNCH_SCRIPT_PATH"
+    echo "  - $SCRIPT_DIR/examples/${LAUNCH_SCRIPT_PATH}.sh"
+    exit 1
+  fi
+
+  echo "Using launch script: $LAUNCH_SCRIPT_PATH"
+
+  # Set command to run the copied script (use absolute path since docker exec may not be in /workspace)
+  COMMAND_TO_RUN="/workspace/exec-script.sh"
+
+  # If a launch script is specified, default the action to exec unless explicitly set to stop/status
+  if [[ "$ACTION" == "start" ]]; then
+    ACTION="exec"
+  fi
+fi
+
 # Validate MOD_PATHS if set
 for i in "${!MOD_PATHS[@]}"; do
   mod_path="${MOD_PATHS[$i]}"
@@ -427,6 +478,25 @@ apply_mod_to_container() {
   fi
 }
 
+# Copy Launch Script to Container Function
+copy_launch_script_to_container() {
+  local container="$1"
+  local script_path="$2"
+
+  echo "Copying launch script to head node..."
+
+  local target_script_path="$script_path"
+
+  # Copy script into container as /workspace/exec-script.sh
+  echo "  Copying script into container..."
+  docker cp "$target_script_path" "$container:/workspace/exec-script.sh"
+
+  # Make executable
+  docker exec "$container" chmod +x /workspace/exec-script.sh
+
+  echo "  Launch script copied to head node"
+}
+
 # Start Cluster Function
 start_cluster() {
   check_cluster_running
@@ -495,6 +565,11 @@ start_cluster() {
     done
   fi
 
+  # Copy launch script to head node only (workers don't need it - they just run Ray)
+  if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
+    copy_launch_script_to_container "$CONTAINER_NAME" "$LAUNCH_SCRIPT_PATH"
+  fi
+
   if [[ "$SOLO_MODE" == "false" ]]; then
     wait_for_cluster
   else
266  recipes/README.md  (new file)
@@ -0,0 +1,266 @@
# Recipes

Recipes provide a **one-click solution** for deploying models with pre-configured settings. Each recipe is a YAML file that specifies:

- HuggingFace model to download
- Container image and build arguments
- Required mods/patches
- Default parameters (port, host, tensor parallelism, etc.)
- Environment variables
- The vLLM serve command

## Quick Start

```bash
# List available recipes
./run-recipe.sh --list

# Run a recipe in solo mode (single node)
./run-recipe.sh glm-4.7-flash-awq --solo

# Full setup: build container + download model + run
./run-recipe.sh glm-4.7-flash-awq --solo --setup

# Run with overrides
./run-recipe.sh glm-4.7-flash-awq --solo --port 9000 --gpu-mem 0.8

# Cluster deployment
./run-recipe.sh glm-4.7-nvfp4 -n 192.168.1.10,192.168.1.11 --setup
```

## Cluster Node Discovery

The recipe runner can automatically discover cluster nodes:

```bash
# Auto-discover nodes and save to .env
./run-recipe.sh --discover

# Show current .env configuration
./run-recipe.sh --show-env

# Run recipe (uses nodes from .env automatically)
./run-recipe.sh glm-4.7-nvfp4 --setup
```

When you run `--discover`, it:
1. Scans the network for nodes with SSH access
2. Prompts you to select which nodes to include
3. Saves the configuration to `.env`

Future recipe runs will automatically use nodes from `.env` unless you specify `-n` or `--solo`.
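For illustration, the saved `.env` might look like the sketch below; the exact variable name is an assumption, since the file is generated by `--discover`:

```bash
# Hypothetical .env written by ./run-recipe.sh --discover
# (variable name is an assumption; first IP is the head node)
NODES=192.168.1.10,192.168.1.11
```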
## Workflow Modes

### Solo Mode (Single Node)
```bash
# Explicitly run in solo mode
./run-recipe.sh glm-4.7-flash-awq --solo

# If no nodes configured, defaults to solo
./run-recipe.sh minimax-m2-awq
```

### Cluster Mode (Multiple Nodes)
```bash
# Specify nodes directly (first IP is head node)
./run-recipe.sh glm-4.7-nvfp4 -n 192.168.1.10,192.168.1.11 --setup

# Or use auto-discovered nodes from .env
./run-recipe.sh --discover   # First time only
./run-recipe.sh glm-4.7-nvfp4 --setup
```

When using cluster mode with `--setup`:
- Container is built locally and copied to all worker nodes
- Model is downloaded locally and copied to all worker nodes

### Cluster-Only Recipes

Some models are too large to run on a single node. These recipes have `cluster_only: true` and will fail with a helpful error if you try to run them in solo mode:

```bash
$ ./run-recipe.sh glm-4.7-nvfp4 --solo
Error: Recipe 'GLM-4.7-NVFP4' requires cluster mode.
This model is too large to run on a single node.

Options:
  1. Specify nodes directly:  ./run-recipe.sh glm-4.7-nvfp4 -n node1,node2
  2. Auto-discover and save:  ./run-recipe.sh --discover
     Then run:                ./run-recipe.sh glm-4.7-nvfp4
```

## Setup Options

| Flag | Description |
|------|-------------|
| `--setup` | Full setup: build (if missing) + download (if missing) + run |
| `--build-only` | Only build/copy the container, don't run |
| `--download-only` | Only download/copy the model, don't run |
| `--force-build` | Rebuild even if container exists |
| `--force-download` | Re-download even if model exists |
| `--dry-run` | Show what would happen without executing |

## Recipe Format

```yaml
# Required fields
name: Human-readable name
container: docker-image-name
command: |
  vllm serve model/name \
    --port {port} \
    --host {host}

# Optional fields
description: What this recipe does
model: org/model-name        # HuggingFace model ID for --setup downloads
cluster_only: false          # Set to true if model requires cluster mode
build_args:                  # Extra args for build-and-copy.sh
  - --pre-tf                 # e.g., for transformers 5.0
  - --exp-mxfp4              # e.g., for MXFP4 Dockerfile
mods:
  - mods/some-patch
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.85
  max_model_len: 32000
env:
  SOME_VAR: "value"
```

### Build Arguments

The `build_args` field passes flags to `build-and-copy.sh`:

| Flag | Description |
|------|-------------|
| `--pre-tf` | Use transformers 5.0 (required for GLM-4.7 models) |
| `--exp-mxfp4` | Use the MXFP4 Dockerfile (for MXFP4-quantized models) |
| `--use-wheels` | Use pre-built wheels instead of building from source |

### Parameter Substitution

Use `{param_name}` in the command to substitute values from defaults or CLI overrides:

```yaml
defaults:
  port: 8000
  tensor_parallel: 2

command: |
  vllm serve my/model \
    --port {port} \
    -tp {tensor_parallel}
```

Override at runtime:
```bash
./run-recipe.sh my-recipe --port 9000 --tp 4
```
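Mechanically, the substitution amounts to replacing each `{param}` token with its resolved value. A minimal sketch of the idea in bash (an illustration only, not the actual run-recipe.py implementation):

```bash
PORT=9000; TP=4   # resolved from defaults, then CLI overrides
sed -e "s/{port}/$PORT/g" -e "s/{tensor_parallel}/$TP/g" <<'EOF'
vllm serve my/model \
  --port {port} \
  -tp {tensor_parallel}
EOF
```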
## CLI Reference

```
Usage: ./run-recipe.sh [OPTIONS] [RECIPE]

Cluster discovery:
  --discover                   Auto-detect cluster nodes and save to .env
  --show-env                   Show current .env configuration

Recipe overrides:
  --port PORT                  Override port
  --host HOST                  Override host
  --tensor-parallel, --tp N    Override tensor parallelism
  --gpu-memory-utilization N   Override GPU memory utilization (--gpu-mem)
  --max-model-len N            Override max model length

Setup options:
  --setup                      Full setup: build + download + run
  --build-only                 Only build/copy container, don't run
  --download-only              Only download/copy model, don't run
  --force-build                Rebuild even if container exists
  --force-download             Re-download even if model exists

Launch options:
  --solo                       Run in solo mode (single node, no Ray)
  -n, --nodes IPS              Comma-separated node IPs (first = head)
  -d, --daemon                 Run in daemon mode
  -t, --container IMAGE        Override container from recipe
  --nccl-debug LEVEL           NCCL debug level (VERSION, WARN, INFO, TRACE)

Other:
  --dry-run                    Show what would be executed
  --list, -l                   List available recipes
```

## Creating a Recipe

1. Create a new `.yaml` file in `recipes/`
2. Specify the required fields: `name`, `container`, `command`
3. Add `build_args` if your model needs special build options
4. Add `mods` if your model needs patches
5. Set `cluster_only: true` if the model is too large for a single node
6. Set sensible `defaults`
7. Add `env` variables if needed

Example:
```yaml
name: My Model
description: My custom model setup
container: vllm-node-tf5

build_args:
  - --pre-tf

mods:
  - mods/my-fix

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.85

command: |
  vllm serve org/my-model \
    --port {port} \
    --host {host} \
    -tp {tensor_parallel} \
    --gpu-memory-utilization {gpu_memory_utilization}
```

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│ run-recipe.sh / run-recipe.py                           │
│ - Parses YAML recipe                                    │
│ - Auto-discovers cluster nodes (--discover)             │
│ - Loads nodes from .env                                 │
│ - Handles --setup (build + download + run)              │
│ - Generates launch script from template                 │
│ - Applies CLI overrides                                 │
└──────────┬────────────────────────┬─────────────────────┘
           │ calls (for build)      │ calls (for download)
           ▼                        ▼
┌──────────────────────┐  ┌───────────────────────────────┐
│ build-and-copy.sh    │  │ hf-download.sh                │
│ - Docker build       │  │ - HuggingFace model download  │
│ - Copy to workers    │  │ - Rsync to workers            │
└──────────────────────┘  └───────────────────────────────┘
           │
           │ then calls (for run)
           ▼
┌─────────────────────────────────────────────────────────┐
│ launch-cluster.sh                                       │
│ - Cluster orchestration                                 │
│ - Container lifecycle                                   │
│ - Mod application                                       │
│ - Launch script execution                               │
└─────────────────────────────────────────────────────────┘
```

This separation follows the Unix philosophy: `run-recipe.sh` provides convenience, while the underlying scripts remain focused on their specific tasks.
64  recipes/glm-4.7-flash-awq.yaml  (new file)
@@ -0,0 +1,64 @@
# Recipe: GLM-4.7-Flash-AWQ-4bit
# cyankiwi's AWQ-quantized GLM-4.7-Flash model
# Requires a patch for inference speed optimization
#
# NOTE: The vLLM implementation is suboptimal even with the patch.
# Model performance is still significantly slower than it should be
# for a model with this number of active parameters. Running in a cluster
# increases prompt processing performance, but not token generation.
# Expect ~40 t/s generation speed on both a single node and a cluster.

recipe_version: "1"
name: GLM-4.7-Flash-AWQ
description: vLLM serving cyankiwi/GLM-4.7-Flash-AWQ-4bit with speed optimization patch

# HuggingFace model to download
model: cyankiwi/GLM-4.7-Flash-AWQ-4bit

# This model can run on a single node (solo) or a cluster
cluster_only: false

# Container image to use
container: vllm-node-tf5

# Build arguments for build-and-copy.sh
# tf5 = transformers 5.0 (required for GLM-4.7)
build_args:
  - --pre-tf

# Mods to apply before running (paths relative to repo root)
# This mod prevents severe inference speed degradation
mods:
  - mods/fix-glm-4.7-flash-AWQ

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.7
  max_model_len: 202752
  max_num_batched_tokens: 4096
  max_num_seqs: 64
  served_model_name: glm-4.7-flash

# Environment variables to set in the container
env:
  # Add any required env vars here

# The vLLM serve command template
# Use {var_name} for substitution from defaults/overrides
# In cluster mode, --distributed-executor-backend ray and -tp 2 are added
command: |
  vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name {served_model_name} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --max-num-seqs {max_num_seqs} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    -tp {tensor_parallel} \
    --host {host} \
    --port {port}
40  recipes/minimax-m2-awq.yaml  (new file)
@@ -0,0 +1,40 @@
# Recipe: MiniMax-M2-AWQ
# MiniMax M2 model with AWQ quantization

recipe_version: "1"
name: MiniMax-M2-AWQ
description: vLLM serving MiniMax-M2-AWQ with Ray distributed backend

# HuggingFace model to download (optional, for --download-model)
model: QuantTrio/MiniMax-M2-AWQ

# Container image to use
container: vllm-node

# No mods required
mods: []

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_model_len: 128000

# Environment variables
env: {}

# The vLLM serve command template
command: |
  vllm serve QuantTrio/MiniMax-M2-AWQ \
    --port {port} \
    --host {host} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    -tp {tensor_parallel} \
    --distributed-executor-backend ray \
    --max-model-len {max_model_len} \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
52  recipes/openai-gpt-oss-120b.yaml  (new file)
@@ -0,0 +1,52 @@
# Recipe: OpenAI GPT-OSS 120B
# OpenAI's open-source 120B MoE model with MXFP4 quantization support

recipe_version: "1"
name: OpenAI GPT-OSS 120B
description: vLLM serving openai/gpt-oss-120b with MXFP4 quantization and FlashInfer

# HuggingFace model to download (optional, for --download-model)
model: openai/gpt-oss-120b

# Container image to use
container: vllm-node-mxfp4

# Build arguments for build-and-copy.sh
build_args:
  - --exp-mxfp4

# No mods required for this model
mods: []

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.70
  max_num_batched_tokens: 8192

# Environment variables to set in the container
env:
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: "1"

# The vLLM serve command template
# Uses MXFP4 quantization for memory efficiency
command: |
  vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --reasoning-parser openai_gptoss \
    --enable-auto-tool-choice \
    --tensor-parallel-size {tensor_parallel} \
    --distributed-executor-backend ray \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-prefix-caching \
    --load-format fastsafetensors \
    --quantization mxfp4 \
    --mxfp4-backend CUTLASS \
    --mxfp4-layers moe,qkv,o,lm_head \
    --attention-backend FLASHINFER \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --host {host} \
    --port {port}
1124  run-recipe.py  (new executable file)
(File diff suppressed because it is too large.)
42  run-recipe.sh  (new executable file)
@@ -0,0 +1,42 @@
#!/bin/bash
#
# run-recipe.sh - Wrapper for run-recipe.py
#
# Ensures Python dependencies are available and runs the recipe runner.
#

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RECIPE_SCRIPT="$SCRIPT_DIR/run-recipe.py"

# Check for Python 3.10+
if command -v python3 &>/dev/null; then
    PYTHON=python3
elif command -v python &>/dev/null; then
    PYTHON=python
else
    echo "Error: Python 3 not found. Please install Python 3.10 or later."
    exit 1
fi

# Verify version
PY_VERSION=$($PYTHON -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')
PY_MAJOR=$($PYTHON -c 'import sys; print(sys.version_info.major)')
PY_MINOR=$($PYTHON -c 'import sys; print(sys.version_info.minor)')

if [[ "$PY_MAJOR" -lt 3 ]] || [[ "$PY_MAJOR" -eq 3 && "$PY_MINOR" -lt 10 ]]; then
    echo "Error: Python 3.10+ required, found $PY_VERSION"
    exit 1
fi

# Check for PyYAML and install if missing
if ! $PYTHON -c "import yaml" 2>/dev/null; then
    echo "Installing PyYAML..."
    $PYTHON -m pip install --quiet pyyaml
    if [[ $? -ne 0 ]]; then
        echo "Error: Failed to install PyYAML. Try: pip install pyyaml"
        exit 1
    fi
fi

# Run the recipe script
exec $PYTHON "$RECIPE_SCRIPT" "$@"
89  tests/expected_commands.sh  (new file)
@@ -0,0 +1,89 @@
# Expected vLLM serve arguments for each recipe
# This file is used by test_recipes.sh to verify recipes match the README documentation
#
# Format: Each recipe has a section with expected arguments
# Tests will verify these arguments appear in the dry-run output
#
# IMPORTANT: Keep this in sync with the README.md documentation
# When updating recipes, update both README.md and this file

# ==============================================================================
# glm-4.7-flash-awq
# README Reference: Lines 186-198 (solo) and 203-218 (cluster)
# ==============================================================================
GLM_FLASH_AWQ_MODEL="cyankiwi/GLM-4.7-Flash-AWQ-4bit"
GLM_FLASH_AWQ_CONTAINER="vllm-node-tf5"
GLM_FLASH_AWQ_MOD="mods/fix-glm-4.7-flash-AWQ"
GLM_FLASH_AWQ_ARGS=(
    "--tool-call-parser glm47"
    "--reasoning-parser glm45"
    "--enable-auto-tool-choice"
    "--served-model-name glm-4.7-flash"
    "--max-model-len 202752"
    "--max-num-batched-tokens 4096"
    "--max-num-seqs 64"
    "--gpu-memory-utilization 0.7"
    "--port 8000"
    "--host 0.0.0.0"
)

# ==============================================================================
# openai-gpt-oss-120b
# README Reference: Lines 244-257 (solo) and 264-280 (cluster)
# ==============================================================================
GPT_OSS_MODEL="openai/gpt-oss-120b"
GPT_OSS_CONTAINER="vllm-node-mxfp4"
GPT_OSS_ARGS=(
    "--port 8000"
    "--host 0.0.0.0"
    "--enable-auto-tool-choice"
    "--tool-call-parser openai"
    "--reasoning-parser openai_gptoss"
    "--gpu-memory-utilization 0.7"
    "--enable-prefix-caching"
    "--load-format fastsafetensors"
    "--quantization mxfp4"
    "--mxfp4-backend CUTLASS"
    "--mxfp4-layers moe,qkv,o,lm_head"
    "--attention-backend FLASHINFER"
    "--kv-cache-dtype fp8"
    "--max-num-batched-tokens 8192"
)

# ==============================================================================
# minimax-m2-awq
# README Reference: Not explicitly documented, but based on model requirements
# ==============================================================================
MINIMAX_MODEL="QuantTrio/MiniMax-M2-AWQ"
MINIMAX_CONTAINER="vllm-node"
MINIMAX_ARGS=(
    "--port 8000"
    "--host 0.0.0.0"
    "--gpu-memory-utilization 0.7"
    "--max-model-len 128000"
    "--load-format fastsafetensors"
    "--enable-auto-tool-choice"
    "--tool-call-parser minimax_m2"
    "--reasoning-parser minimax_m2_append_think"
)

# ==============================================================================
# Cluster Mode Expected Arguments
# These are arguments that should appear ONLY in cluster mode
# Note: Tests use 2 nodes, so tensor_parallel = 2 (1 GPU per node)
# ==============================================================================

# glm-4.7-flash-awq cluster mode (no distributed backend - single GPU model)
GLM_FLASH_AWQ_CLUSTER_TP="1"

# openai-gpt-oss-120b cluster mode (2 nodes = tp 2)
GPT_OSS_CLUSTER_TP="2"
GPT_OSS_CLUSTER_ARGS=(
    "--distributed-executor-backend ray"
)

# minimax-m2-awq cluster mode (2 nodes = tp 2)
MINIMAX_CLUSTER_TP="2"
MINIMAX_CLUSTER_ARGS=(
    "--distributed-executor-backend ray"
)
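As a sketch of how these arrays can be consumed (the real checks live in tests/test_recipes.sh below; this standalone loop just illustrates the pattern, run from the repo root):

```bash
source tests/expected_commands.sh
output=$(./run-recipe.py glm-4.7-flash-awq --dry-run --solo 2>&1)
for arg in "${GLM_FLASH_AWQ_ARGS[@]}"; do
    echo "$output" | grep -qF -- "$arg" || echo "missing: $arg"
done
```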
859  tests/test_recipes.sh  (new executable file)
@@ -0,0 +1,859 @@
#!/bin/bash
#
# test_recipes.sh - Integration tests for run-recipe.py and launch-cluster.sh
#
# These tests use --dry-run mode to verify compatibility without actually
# running containers. Suitable for CI/CD pipelines.
#
# Usage:
#   ./tests/test_recipes.sh       # Run all tests
#   ./tests/test_recipes.sh -v    # Verbose output
#

set -e

SCRIPT_DIR="$(dirname "$(realpath "$0")")"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
VERBOSE="${1:-}"

# Load expected commands for README verification
source "$SCRIPT_DIR/expected_commands.sh"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Test counters
TESTS_PASSED=0
TESTS_FAILED=0
TESTS_SKIPPED=0

# Helper functions
log_test() {
    echo -e "${YELLOW}[TEST]${NC} $1"
}

log_pass() {
    echo -e "${GREEN}[PASS]${NC} $1"
    TESTS_PASSED=$((TESTS_PASSED + 1))
}

log_fail() {
    echo -e "${RED}[FAIL]${NC} $1"
    TESTS_FAILED=$((TESTS_FAILED + 1))
}

log_skip() {
    echo -e "${YELLOW}[SKIP]${NC} $1"
    TESTS_SKIPPED=$((TESTS_SKIPPED + 1))
}

log_verbose() {
    if [[ "$VERBOSE" == "-v" ]]; then
        echo "    $1"
    fi
}

# Check prerequisites
check_prerequisites() {
    log_test "Checking prerequisites..."

    if ! command -v python3 &> /dev/null; then
        log_fail "python3 not found"
        exit 1
    fi

    # Check Python version (integer comparison; bc would mis-parse "3.10" as 3.1)
    python_version=$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')
    py_major=$(python3 -c 'import sys; print(sys.version_info.major)')
    py_minor=$(python3 -c 'import sys; print(sys.version_info.minor)')
    if [[ "$py_major" -lt 3 ]] || [[ "$py_major" -eq 3 && "$py_minor" -lt 10 ]]; then
        log_fail "Python 3.10+ required, found $python_version"
        exit 1
    fi

    # Check PyYAML
    if ! python3 -c "import yaml" 2>/dev/null; then
        log_fail "PyYAML not installed"
        exit 1
    fi

    log_pass "Prerequisites OK (Python $python_version with PyYAML)"
}

# Test: run-recipe.py exists and is executable
test_run_recipe_exists() {
    log_test "run-recipe.py exists and is executable"

    if [[ -x "$PROJECT_DIR/run-recipe.py" ]]; then
        log_pass "run-recipe.py is executable"
    else
        log_fail "run-recipe.py not found or not executable"
    fi
}

# Test: launch-cluster.sh exists and is executable
test_launch_cluster_exists() {
    log_test "launch-cluster.sh exists and is executable"

    if [[ -x "$PROJECT_DIR/launch-cluster.sh" ]]; then
        log_pass "launch-cluster.sh is executable"
    else
        log_fail "launch-cluster.sh not found or not executable"
    fi
}

# Test: run-recipe.py --list works
test_list_recipes() {
    log_test "run-recipe.py --list"

    output=$("$PROJECT_DIR/run-recipe.py" --list 2>&1)

    if [[ $? -eq 0 ]] && echo "$output" | grep -q "Available recipes"; then
        log_pass "--list shows available recipes"
        log_verbose "Found recipes in output"
    else
        log_fail "--list failed or no recipes found"
        log_verbose "$output"
    fi
}

# Test: All recipes have required recipe_version field
test_recipe_version_required() {
    log_test "All recipes have required recipe_version field"

    local all_valid=true
    for recipe in "$PROJECT_DIR/recipes/"*.yaml; do
        if [[ -f "$recipe" ]]; then
            recipe_name=$(basename "$recipe")
            if ! grep -q "^recipe_version:" "$recipe"; then
                log_verbose "$recipe_name missing recipe_version"
                all_valid=false
            fi
        fi
    done

    if [[ "$all_valid" == "true" ]]; then
        log_pass "All recipes have recipe_version field"
    else
        log_fail "Some recipes missing recipe_version field"
    fi
}

# Test: All recipes load without errors
test_all_recipes_load() {
    log_test "All recipes load without errors"

    local all_valid=true
    for recipe in "$PROJECT_DIR/recipes/"*.yaml; do
        if [[ -f "$recipe" ]]; then
            recipe_name=$(basename "$recipe" .yaml)
            # Try to load recipe with --dry-run (will fail early if recipe is invalid)
            if ! "$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1 | grep -q "Error:"; then
                log_verbose "$recipe_name loads OK"
            else
                log_verbose "$recipe_name failed to load"
                all_valid=false
            fi
        fi
    done

    if [[ "$all_valid" == "true" ]]; then
        log_pass "All recipes load successfully"
    else
        log_fail "Some recipes failed to load"
    fi
}

# Test: Dry-run generates valid launch script
test_dry_run_generates_script() {
    log_test "Dry-run generates valid launch script"

    # Find first available recipe
    first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
    if [[ -z "$first_recipe" ]]; then
        log_skip "No recipes found"
        return
    fi

    recipe_name=$(basename "$first_recipe" .yaml)
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)

    if echo "$output" | grep -q "#!/bin/bash" && echo "$output" | grep -q "vllm serve"; then
        log_pass "Dry-run generates bash script with vllm serve command"
    else
        log_fail "Dry-run output doesn't contain expected content"
        log_verbose "$output"
    fi
}

# Test: Solo mode sets tensor_parallel=1
test_solo_mode_tp1() {
    log_test "Solo mode sets tensor_parallel=1"

    first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
    if [[ -z "$first_recipe" ]]; then
        log_skip "No recipes found"
        return
    fi

    recipe_name=$(basename "$first_recipe" .yaml)
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)

    # Check that -tp 1 is in the output (solo mode should set tp=1)
    if echo "$output" | grep -q "\-tp 1"; then
        log_pass "Solo mode correctly sets -tp 1"
    else
        log_fail "Solo mode did not set -tp 1"
        log_verbose "$output"
    fi
}

# Test: Solo mode removes --distributed-executor-backend ray
test_solo_mode_removes_ray() {
    log_test "Solo mode removes --distributed-executor-backend ray"

    first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
    if [[ -z "$first_recipe" ]]; then
        log_skip "No recipes found"
        return
    fi

    recipe_name=$(basename "$first_recipe" .yaml)
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)

    # Check that --distributed-executor-backend is NOT in the output
    if ! echo "$output" | grep -q "\-\-distributed-executor-backend"; then
        log_pass "Solo mode correctly removes --distributed-executor-backend"
    else
        log_fail "Solo mode did not remove --distributed-executor-backend"
        log_verbose "$output"
    fi
}

# Test: Cluster mode preserves --distributed-executor-backend ray
test_cluster_mode_keeps_ray() {
    log_test "Cluster mode preserves --distributed-executor-backend ray"

    # Use minimax-m2-awq which explicitly has --distributed-executor-backend ray
    if [[ ! -f "$PROJECT_DIR/recipes/minimax-m2-awq.yaml" ]]; then
        log_skip "minimax-m2-awq.yaml not found"
        return
    fi

    output=$("$PROJECT_DIR/run-recipe.py" minimax-m2-awq --dry-run -n "192.168.1.1,192.168.1.2" 2>&1)

    # Check that --distributed-executor-backend IS in the output for cluster mode
    if echo "$output" | grep -q "\-\-distributed-executor-backend ray"; then
        log_pass "Cluster mode correctly preserves --distributed-executor-backend ray"
    else
        log_fail "Cluster mode did not preserve --distributed-executor-backend"
        log_verbose "$output"
    fi
}

# Test: CLI overrides work (--port)
test_cli_override_port() {
    log_test "CLI override --port works"

    first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
    if [[ -z "$first_recipe" ]]; then
        log_skip "No recipes found"
        return
    fi

    recipe_name=$(basename "$first_recipe" .yaml)
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo --port 9999 2>&1)

    if echo "$output" | grep -q "\-\-port 9999"; then
        log_pass "--port override correctly applied"
    else
        log_fail "--port override not found in output"
        log_verbose "$output"
    fi
}

# Test: launch-cluster.sh --help works
test_launch_cluster_help() {
    log_test "launch-cluster.sh --help"

    output=$("$PROJECT_DIR/launch-cluster.sh" --help 2>&1 || true)

    if echo "$output" | grep -q "Usage:"; then
        log_pass "--help shows usage information"
    else
        log_fail "--help did not show usage"
        log_verbose "$output"
    fi
}

# Test: launch-cluster.sh references examples/ not profiles/
test_launch_cluster_examples_path() {
    log_test "launch-cluster.sh references examples/ directory"

    if grep -q "examples/" "$PROJECT_DIR/launch-cluster.sh"; then
        log_pass "launch-cluster.sh references examples/"
    else
        log_fail "launch-cluster.sh does not reference examples/"
    fi

    if grep -q "profiles/" "$PROJECT_DIR/launch-cluster.sh"; then
        log_fail "launch-cluster.sh still references profiles/"
    fi
}

# Test: Unsupported recipe version shows warning
test_unsupported_recipe_version() {
    log_test "Unsupported recipe_version shows warning"

    # Create a temporary recipe with unsupported version
    temp_recipe=$(mktemp)
    cat > "$temp_recipe" << 'EOF'
recipe_version: "999"
name: Test Recipe
container: test-container
command: echo "test"
EOF

    output=$("$PROJECT_DIR/run-recipe.py" "$temp_recipe" --dry-run --solo 2>&1)
    rm -f "$temp_recipe"

    if echo "$output" | grep -q "Warning.*schema version"; then
        log_pass "Unsupported recipe_version shows warning"
    else
        log_fail "No warning for unsupported recipe_version"
        log_verbose "$output"
    fi
}

# Test: Missing recipe_version fails
test_missing_recipe_version_fails() {
    log_test "Missing recipe_version field fails"

    # Create a temporary recipe without recipe_version
    temp_recipe=$(mktemp)
    cat > "$temp_recipe" << 'EOF'
name: Test Recipe
container: test-container
command: echo "test"
EOF

    output=$("$PROJECT_DIR/run-recipe.py" "$temp_recipe" --dry-run --solo 2>&1 || true)
    rm -f "$temp_recipe"

    if echo "$output" | grep -q "Error.*recipe_version"; then
        log_pass "Missing recipe_version correctly fails"
    else
        log_fail "Missing recipe_version did not fail as expected"
        log_verbose "$output"
    fi
}

# Test: cluster_only recipe fails in solo mode
test_cluster_only_fails_solo() {
    log_test "cluster_only recipe fails in solo mode"

    # Create a temporary cluster_only recipe
    temp_recipe=$(mktemp)
    cat > "$temp_recipe" << 'EOF'
recipe_version: "1"
name: Cluster Only Test
container: test-container
cluster_only: true
command: echo "test"
EOF

    output=$("$PROJECT_DIR/run-recipe.py" "$temp_recipe" --dry-run --solo 2>&1 || true)
    rm -f "$temp_recipe"

    if echo "$output" | grep -q "requires cluster mode"; then
        log_pass "cluster_only recipe correctly fails in solo mode"
    else
        log_fail "cluster_only recipe did not fail in solo mode"
        log_verbose "$output"
    fi
}

# ==============================================================================
# Launch-cluster.sh Command Line Verification Tests
# ==============================================================================
# These tests verify that the dry-run output contains the expected
# launch-cluster.sh command line arguments matching the recipe configuration.

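# For orientation: extract_launch_cmd below pulls whatever follows the
# "launch-cluster.sh is called with:" marker in the dry-run output. An
# illustrative (not exact) shape of that command, composed only of the flags
# asserted in the tests below:
#   ./launch-cluster.sh --solo -t <container-image> --apply-mod <mod-path> --launch-script <tmp-script>
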
# Helper: Extract launch-cluster command from dry-run output
extract_launch_cmd() {
    echo "$1" | grep -A5 "launch-cluster.sh is called with:" | grep -v "launch-cluster.sh is called with:" | tr '\n' ' '
}

# Test: Solo mode generates --solo flag in launch-cluster command
test_launch_cmd_solo_flag() {
    log_test "Launch command includes --solo flag in solo mode"

    first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
    if [[ -z "$first_recipe" ]]; then
        log_skip "No recipes found"
        return
    fi

    recipe_name=$(basename "$first_recipe" .yaml)
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    if echo "$launch_cmd" | grep -q "\-\-solo"; then
        log_pass "Launch command includes --solo flag"
    else
        log_fail "Launch command missing --solo flag"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# Test: Cluster mode generates -n flag with nodes
test_launch_cmd_nodes_flag() {
    log_test "Launch command includes -n flag with nodes in cluster mode"

    output=$("$PROJECT_DIR/run-recipe.py" minimax-m2-awq --dry-run -n "10.0.0.1,10.0.0.2" 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    if echo "$launch_cmd" | grep -q "\-n 10.0.0.1,10.0.0.2"; then
        log_pass "Launch command includes -n with correct nodes"
    else
        log_fail "Launch command missing or incorrect -n flag"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# Test: Container image from recipe is passed to launch-cluster
test_launch_cmd_container_image() {
    log_test "Launch command includes correct container image (-t)"

    # Use openai-gpt-oss-120b which has a specific container name
    if [[ ! -f "$PROJECT_DIR/recipes/openai-gpt-oss-120b.yaml" ]]; then
        log_skip "openai-gpt-oss-120b.yaml not found"
        return
    fi

    output=$("$PROJECT_DIR/run-recipe.py" openai-gpt-oss-120b --dry-run --solo 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    # Check the container is vllm-node-mxfp4 (from the recipe)
    if echo "$launch_cmd" | grep -q "\-t vllm-node-mxfp4"; then
        log_pass "Launch command includes correct container image"
    else
        log_fail "Launch command has wrong container image"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# Test: Mods from recipe are passed as --apply-mod
test_launch_cmd_mods() {
    log_test "Launch command includes --apply-mod for recipe mods"

    # Use glm-4.7-flash-awq which has a mod
    if [[ ! -f "$PROJECT_DIR/recipes/glm-4.7-flash-awq.yaml" ]]; then
        log_skip "glm-4.7-flash-awq.yaml not found"
        return
    fi

    output=$("$PROJECT_DIR/run-recipe.py" glm-4.7-flash-awq --dry-run --solo 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    if echo "$launch_cmd" | grep -q "\-\-apply-mod"; then
        log_pass "Launch command includes --apply-mod for mods"
    else
        log_fail "Launch command missing --apply-mod"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# Test: Daemon mode flag is passed through
test_launch_cmd_daemon_flag() {
    log_test "Launch command includes -d flag in daemon mode"

    first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
    if [[ -z "$first_recipe" ]]; then
        log_skip "No recipes found"
        return
    fi

    recipe_name=$(basename "$first_recipe" .yaml)
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo -d 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    if echo "$launch_cmd" | grep -q "\-d"; then
        log_pass "Launch command includes -d flag"
    else
        log_fail "Launch command missing -d flag"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# Test: NCCL debug level is passed through
test_launch_cmd_nccl_debug() {
    log_test "Launch command includes --nccl-debug when specified"

    first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
    if [[ -z "$first_recipe" ]]; then
        log_skip "No recipes found"
        return
    fi

    recipe_name=$(basename "$first_recipe" .yaml)
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo --nccl-debug INFO 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    if echo "$launch_cmd" | grep -q "\-\-nccl-debug INFO"; then
        log_pass "Launch command includes --nccl-debug INFO"
    else
        log_fail "Launch command missing --nccl-debug"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# Test: --launch-script is always included
test_launch_cmd_launch_script() {
    log_test "Launch command includes --launch-script"

    first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
    if [[ -z "$first_recipe" ]]; then
        log_skip "No recipes found"
        return
    fi

    recipe_name=$(basename "$first_recipe" .yaml)
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    if echo "$launch_cmd" | grep -q "\-\-launch-script"; then
        log_pass "Launch command includes --launch-script"
    else
        log_fail "Launch command missing --launch-script"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# Test: Container override (-t CLI) takes precedence
test_launch_cmd_container_override() {
    log_test "CLI container override (-t) takes precedence"

    first_recipe=$(ls "$PROJECT_DIR/recipes/"*.yaml 2>/dev/null | head -1)
    if [[ -z "$first_recipe" ]]; then
        log_skip "No recipes found"
        return
    fi

    recipe_name=$(basename "$first_recipe" .yaml)
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo -t my-custom-image 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    if echo "$launch_cmd" | grep -q "\-t my-custom-image"; then
        log_pass "Container override correctly applied"
    else
        log_fail "Container override not applied"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# Test: Cluster mode does NOT include --solo flag
test_launch_cmd_no_solo_in_cluster() {
    log_test "Launch command does NOT include --solo in cluster mode"

    output=$("$PROJECT_DIR/run-recipe.py" minimax-m2-awq --dry-run -n "10.0.0.1,10.0.0.2" 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    # A single negative check is enough: the flag must simply be absent
    if ! echo "$launch_cmd" | grep -q "\-\-solo"; then
        log_pass "Cluster mode correctly omits --solo flag"
    else
        log_fail "Cluster mode incorrectly includes --solo flag"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# ==============================================================================
# README Documentation Verification Tests
# ==============================================================================
# These tests verify that recipe dry-run output matches the expected commands
# documented in README.md. Expected values are defined in expected_commands.sh

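# Note: expected_commands.sh is assumed (hypothetical shape shown here; the real
# file is the source of truth) to define one variable set per recipe, e.g.:
#   GLM_FLASH_AWQ_MODEL="<model-id>"
#   GLM_FLASH_AWQ_CONTAINER="<container-image>"
#   GLM_FLASH_AWQ_ARGS=("<vllm-flag> <value>" "<vllm-flag> <value>")
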
# Helper: Extract the generated launch script from dry-run output
extract_vllm_command() {
    # Extract lines between "Generated Launch Script" and "What would be executed"
    echo "$1" | sed -n '/=== Generated Launch Script ===/,/=== What would be executed ===/p' | grep -v "===" | grep -v "^#" | grep -v "^$"
}

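# Assumed layout of the dry-run output that both extractors rely on
# (illustrative only; the markers are taken from the greps in this file):
#   === Generated Launch Script ===
#   #!/bin/bash
#   vllm serve ...
#   === What would be executed ===
#   launch-cluster.sh is called with:
#   ...
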
# Helper: Verify a recipe contains all expected arguments
verify_recipe_args() {
    local recipe_name="$1"
    local expected_model="$2"
    local expected_container="$3"
    shift 3
    local expected_args=("$@")

    log_test "README match: $recipe_name"

    if [[ ! -f "$PROJECT_DIR/recipes/${recipe_name}.yaml" ]]; then
        log_skip "${recipe_name}.yaml not found"
        return
    fi

    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run --solo 2>&1)
    vllm_cmd=$(extract_vllm_command "$output")
    launch_cmd=$(extract_launch_cmd "$output")

    local all_passed=true
    local missing_args=()

    # Check model name
    if ! echo "$vllm_cmd" | grep -q "$expected_model"; then
        missing_args+=("model: $expected_model")
        all_passed=false
    fi

    # Check container
    if ! echo "$launch_cmd" | grep -q "\-t $expected_container"; then
        missing_args+=("container: $expected_container")
        all_passed=false
    fi

    # Check each expected argument
    for arg in "${expected_args[@]}"; do
        # Handle arguments that may have slight formatting differences
        # Extract the flag and value separately for flexible matching
        local flag=$(echo "$arg" | awk '{print $1}')
        local value=$(echo "$arg" | cut -d' ' -f2-)

        # Use grep -F for fixed string matching (avoids -- being treated as grep options)
        if ! echo "$vllm_cmd" | grep -qF -- "$flag"; then
            missing_args+=("$arg")
            all_passed=false
        elif [[ -n "$value" ]] && [[ "$value" != "$flag" ]]; then
            # Check if value is present (might be on next line due to formatting)
            if ! echo "$vllm_cmd" | grep -qF -- "$value"; then
                missing_args+=("$arg (flag present, value mismatch)")
                all_passed=false
            fi
        fi
    done

    if [[ "$all_passed" == "true" ]]; then
        log_pass "README match: $recipe_name - all expected arguments present"
    else
        log_fail "README match: $recipe_name - missing arguments"
        for missing in "${missing_args[@]}"; do
            log_verbose "  Missing: $missing"
        done
        log_verbose "  vLLM command: $vllm_cmd"
    fi
}

# Test: glm-4.7-flash-awq matches README documentation
test_readme_glm_flash_awq() {
    verify_recipe_args "glm-4.7-flash-awq" \
        "$GLM_FLASH_AWQ_MODEL" \
        "$GLM_FLASH_AWQ_CONTAINER" \
        "${GLM_FLASH_AWQ_ARGS[@]}"
}

# Test: openai-gpt-oss-120b matches README documentation
test_readme_gpt_oss() {
    verify_recipe_args "openai-gpt-oss-120b" \
        "$GPT_OSS_MODEL" \
        "$GPT_OSS_CONTAINER" \
        "${GPT_OSS_ARGS[@]}"
}

# Test: minimax-m2-awq matches expected configuration
test_readme_minimax() {
    verify_recipe_args "minimax-m2-awq" \
        "$MINIMAX_MODEL" \
        "$MINIMAX_CONTAINER" \
        "${MINIMAX_ARGS[@]}"
}

# Test: glm-4.7-flash-awq includes correct mod
test_readme_glm_flash_mod() {
    log_test "README match: glm-4.7-flash-awq mod path"

    if [[ ! -f "$PROJECT_DIR/recipes/glm-4.7-flash-awq.yaml" ]]; then
        log_skip "glm-4.7-flash-awq.yaml not found"
        return
    fi

    output=$("$PROJECT_DIR/run-recipe.py" glm-4.7-flash-awq --dry-run --solo 2>&1)
    launch_cmd=$(extract_launch_cmd "$output")

    if echo "$launch_cmd" | grep -q "$GLM_FLASH_AWQ_MOD"; then
        log_pass "README match: glm-4.7-flash-awq has correct mod path"
    else
        log_fail "README match: glm-4.7-flash-awq missing expected mod: $GLM_FLASH_AWQ_MOD"
        log_verbose "Launch cmd: $launch_cmd"
    fi
}

# Helper: Verify cluster mode specific arguments
verify_cluster_args() {
    local recipe_name="$1"
    local expected_tp="$2"
    shift 2
    local expected_args=("$@")

    log_test "README match (cluster): $recipe_name"

    if [[ ! -f "$PROJECT_DIR/recipes/${recipe_name}.yaml" ]]; then
        log_skip "${recipe_name}.yaml not found"
        return
    fi

    # Use fake nodes for cluster mode
    output=$("$PROJECT_DIR/run-recipe.py" "$recipe_name" --dry-run -n "10.0.0.1,10.0.0.2" 2>&1)
    vllm_cmd=$(extract_vllm_command "$output")

    local all_passed=true
    local missing_args=()

    # Check tensor parallel
    if ! echo "$vllm_cmd" | grep -qE "(--tensor-parallel-size|-tp) $expected_tp"; then
        missing_args+=("tensor_parallel: $expected_tp")
        all_passed=false
    fi

    # Check cluster-specific arguments
    for arg in "${expected_args[@]}"; do
        if ! echo "$vllm_cmd" | grep -qF -- "$arg"; then
            missing_args+=("$arg")
            all_passed=false
        fi
    done

    if [[ "$all_passed" == "true" ]]; then
        log_pass "README match (cluster): $recipe_name - cluster args correct"
    else
        log_fail "README match (cluster): $recipe_name - missing cluster arguments"
        for missing in "${missing_args[@]}"; do
            log_verbose "  Missing: $missing"
        done
        log_verbose "  vLLM command: $vllm_cmd"
    fi
}

# Test: openai-gpt-oss-120b cluster mode has correct tensor_parallel and ray backend
test_readme_gpt_oss_cluster() {
    verify_cluster_args "openai-gpt-oss-120b" \
        "$GPT_OSS_CLUSTER_TP" \
        "${GPT_OSS_CLUSTER_ARGS[@]}"
}

# Test: minimax-m2-awq cluster mode has correct tensor_parallel and ray backend
test_readme_minimax_cluster() {
    verify_cluster_args "minimax-m2-awq" \
        "$MINIMAX_CLUSTER_TP" \
        "${MINIMAX_CLUSTER_ARGS[@]}"
}

# Test: glm-4.7-flash-awq cluster mode stays at tp=1 (single GPU model)
test_readme_glm_flash_cluster() {
    log_test "README match (cluster): glm-4.7-flash-awq stays tp=1"

    if [[ ! -f "$PROJECT_DIR/recipes/glm-4.7-flash-awq.yaml" ]]; then
        log_skip "glm-4.7-flash-awq.yaml not found"
        return
    fi

    # Even in cluster mode, this model uses tp=1
    output=$("$PROJECT_DIR/run-recipe.py" glm-4.7-flash-awq --dry-run -n "10.0.0.1,10.0.0.2" 2>&1)
    vllm_cmd=$(extract_vllm_command "$output")

    if echo "$vllm_cmd" | grep -qE "(--tensor-parallel-size|-tp) 1"; then
        log_pass "README match (cluster): glm-4.7-flash-awq correctly keeps tp=1"
    else
        log_fail "README match (cluster): glm-4.7-flash-awq should have tp=1"
        log_verbose "  vLLM command: $vllm_cmd"
    fi
}

# Run all tests
main() {
    echo "=============================================="
    echo " run-recipe.py Integration Tests"
    echo "=============================================="
    echo ""

    cd "$PROJECT_DIR"

    check_prerequisites
    echo ""

    # File existence tests
    test_run_recipe_exists
    test_launch_cluster_exists
    echo ""

    # Basic functionality tests
    test_list_recipes
    test_recipe_version_required
    test_all_recipes_load
    echo ""

    # Dry-run tests
    test_dry_run_generates_script
    test_solo_mode_tp1
    test_solo_mode_removes_ray
    test_cluster_mode_keeps_ray
    test_cli_override_port
    echo ""

    # launch-cluster.sh command line verification tests
    echo "--- Launch Command Verification ---"
    test_launch_cmd_solo_flag
    test_launch_cmd_nodes_flag
    test_launch_cmd_container_image
    test_launch_cmd_mods
    test_launch_cmd_daemon_flag
    test_launch_cmd_nccl_debug
    test_launch_cmd_launch_script
    test_launch_cmd_container_override
    test_launch_cmd_no_solo_in_cluster
    echo ""

    # README documentation verification tests
    echo "--- README Documentation Verification (Solo Mode) ---"
    test_readme_glm_flash_awq
    test_readme_gpt_oss
    test_readme_minimax
    test_readme_glm_flash_mod
    echo ""

    # Cluster mode documentation verification tests
    echo "--- README Documentation Verification (Cluster Mode) ---"
    test_readme_gpt_oss_cluster
    test_readme_minimax_cluster
    test_readme_glm_flash_cluster
    echo ""

    # launch-cluster.sh tests
    test_launch_cluster_help
    test_launch_cluster_examples_path
    echo ""

    # Validation tests
    test_unsupported_recipe_version
    test_missing_recipe_version_fails
    test_cluster_only_fails_solo
    echo ""

    # Summary
    echo "=============================================="
    echo " Test Summary"
    echo "=============================================="
    echo -e " ${GREEN}Passed:${NC} $TESTS_PASSED"
    echo -e " ${RED}Failed:${NC} $TESTS_FAILED"
    echo -e " ${YELLOW}Skipped:${NC} $TESTS_SKIPPED"
    echo "=============================================="

    if [[ $TESTS_FAILED -gt 0 ]]; then
        exit 1
    fi
    exit 0
}

main "$@"