Raphael Amorim 30f16f1d4e feat: Add recipe-based one-click model deployment system

Introduces a YAML recipe system for simplified model deployment:

- run-recipe.py: Main script handling build, download, and launch
- run-recipe.sh: Bash wrapper for dependency management
- recipes/: Pre-configured recipes for common models
  - glm-4.7-flash-awq.yaml: GLM-4.7-Flash with AWQ quantization
  - glm-4.7-nvfp4.yaml: GLM-4.7 with NVFP4 (cluster-only)
  - minimax-m2-awq.yaml: MiniMax M2 with AWQ
  - openai-gpt-oss-120b.yaml: OpenAI GPT-OSS 120B with MXFP4

Key features:
- Auto-discover cluster nodes with --discover, saves to .env
- Load nodes from .env automatically on subsequent runs
- cluster_only flag for models requiring multi-node setup
- build_args field for Dockerfile selection (--pre-tf, --exp-mxfp4)
- Solo mode auto-strips --distributed-executor-backend ray
- --setup flag for full build + download + run workflow
- --dry-run to preview execution without running

Usage:
  ./run-recipe.sh --discover           # Find and save cluster nodes
  ./run-recipe.sh glm-4.7-flash-awq --solo --setup
  ./run-recipe.sh glm-4.7-nvfp4 --setup  # Uses nodes from .env

2026-02-03 16:09:12 -05:00

glm-4.7-flash-awq.yaml

minimax-m2-awq.yaml

openai-gpt-oss-120b.yaml

README.md


Recipes

Recipes provide a one-click solution for deploying models with pre-configured settings. Each recipe is a YAML file that specifies:

- HuggingFace model to download
- Container image and build arguments
- Required mods/patches
- Default parameters (port, host, tensor parallelism, etc.)
- Environment variables
- The vLLM serve command

Quick Start

# List available recipes
./run-recipe.sh --list

# Run a recipe in solo mode (single node)
./run-recipe.sh glm-4.7-flash-awq --solo

# Full setup: build container + download model + run
./run-recipe.sh glm-4.7-flash-awq --solo --setup

# Run with overrides
./run-recipe.sh glm-4.7-flash-awq --solo --port 9000 --gpu-mem 0.8

# Cluster deployment
./run-recipe.sh glm-4.7-nvfp4 -n 192.168.1.10,192.168.1.11 --setup

Cluster Node Discovery

The recipe runner can automatically discover cluster nodes:

# Auto-discover nodes and save to .env
./run-recipe.sh --discover

# Show current .env configuration
./run-recipe.sh --show-env

# Run recipe (uses nodes from .env automatically)
./run-recipe.sh glm-4.7-nvfp4 --setup

When you run --discover, the runner:

1. Scans the network for nodes with SSH access
2. Prompts you to select which nodes to include
3. Saves the configuration to .env

Future recipe runs will automatically use nodes from .env unless you specify -n or --solo.
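
The node selection rule above can be sketched in Python. This is a minimal illustration, not the code in run-recipe.py: the key name CLUSTER_NODES, the comma-separated value format, and the choice to let --solo win over -n when both are given are all assumptions.

```python
from pathlib import Path

# Hypothetical key name; the .env written by --discover may use a different one.
NODES_KEY = "CLUSTER_NODES"


def read_env_nodes(env_path):
    """Return the node list stored in a .env file, or [] if none is found."""
    path = Path(env_path)
    if not path.is_file():
        return []
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == NODES_KEY:
            value = value.strip().strip('"').strip("'")
            return [ip.strip() for ip in value.split(",") if ip.strip()]
    return []


def resolve_nodes(cli_nodes=None, solo=False, env_path=".env"):
    """Pick nodes in this order: --solo, then -n, then .env."""
    if solo:
        return []  # solo mode ignores any configured nodes
    if cli_nodes:
        return [ip.strip() for ip in cli_nodes.split(",") if ip.strip()]
    return read_env_nodes(env_path)  # an empty list means "run solo"
```

An empty result corresponds to solo mode, which matches the default when no nodes are configured.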

Workflow Modes

Solo Mode (Single Node)

# Explicitly run in solo mode
./run-recipe.sh glm-4.7-flash-awq --solo

# If no nodes configured, defaults to solo
./run-recipe.sh minimax-m2-awq
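
Solo mode runs without Ray, so a recipe command that sets --distributed-executor-backend ray has to lose that flag before launch. The sketch below shows one way to do this; the function name strip_ray_backend is hypothetical and the actual logic in run-recipe.py may differ.

```python
import shlex

FLAG = "--distributed-executor-backend"


def strip_ray_backend(command):
    """Return the command without '--distributed-executor-backend ray'."""
    # Join backslash-continued lines so the command tokenizes as one line.
    tokens = shlex.split(command.replace("\\\n", " "))
    kept = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == FLAG and i + 1 < len(tokens) and tokens[i + 1] == "ray":
            i += 2  # drop the flag and its value
            continue
        if tok == FLAG + "=ray":
            i += 1  # drop the single-token form
            continue
        kept.append(tok)
        i += 1
    return shlex.join(kept)
```

Only the ray value is removed here; any other backend value is left untouched.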

Cluster Mode (Multiple Nodes)

# Specify nodes directly (first IP is head node)
./run-recipe.sh glm-4.7-nvfp4 -n 192.168.1.10,192.168.1.11 --setup

# Or use auto-discovered nodes from .env
./run-recipe.sh --discover  # First time only
./run-recipe.sh glm-4.7-nvfp4 --setup

When using cluster mode with --setup:

- Container is built locally and copied to all worker nodes
- Model is downloaded locally and copied to all worker nodes

Cluster-Only Recipes

Some models are too large to run on a single node. Their recipes set cluster_only: true, and the runner exits with an error listing your options if you try to run them in solo mode:

$ ./run-recipe.sh glm-4.7-nvfp4 --solo
Error: Recipe 'GLM-4.7-NVFP4' requires cluster mode.
This model is too large to run on a single node.

Options:
  1. Specify nodes directly:  ./run-recipe.sh glm-4.7-nvfp4 -n node1,node2
  2. Auto-discover and save:  ./run-recipe.sh --discover
     Then run:                ./run-recipe.sh glm-4.7-nvfp4

Setup Options

| Flag | Description |
|------|-------------|
| `--setup` | Full setup: build (if missing) + download (if missing) + run |
| `--build-only` | Only build/copy the container, don't run |
| `--download-only` | Only download/copy the model, don't run |
| `--force-build` | Rebuild even if container exists |
| `--force-download` | Re-download even if model exists |
| `--dry-run` | Show what would happen without executing |
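
The way these flags combine can be sketched as a small planning function. This illustrates the table above and is not the logic in run-recipe.py: the function name plan_steps is hypothetical, and it assumes that the --force-* flags imply their step and that --build-only and --download-only skip work that is already done unless forced.

```python
def plan_steps(setup=False, build_only=False, download_only=False,
               force_build=False, force_download=False,
               container_exists=False, model_exists=False):
    """Return the ordered list of steps the runner would perform."""
    steps = []
    wants_build = setup or build_only or force_build
    wants_download = setup or download_only or force_download
    if wants_build and (force_build or not container_exists):
        steps.append("build")
    if wants_download and (force_download or not model_exists):
        steps.append("download")
    if not (build_only or download_only):
        steps.append("run")  # the *-only flags stop before launching
    return steps


# --setup on a fresh machine performs all three steps.
print(plan_steps(setup=True))
# --setup with everything already present only launches the model.
print(plan_steps(setup=True, container_exists=True, model_exists=True))
```

Printing such a plan instead of executing it is roughly what --dry-run offers.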

Recipe Format

# Required fields
name: Human-readable name
container: docker-image-name
command: |
  vllm serve model/name \
      --port {port} \
      --host {host}

# Optional fields
description: What this recipe does
model: org/model-name              # HuggingFace model ID for --setup downloads
cluster_only: false                # Set to true if model requires cluster mode
build_args:                        # Extra args for build-and-copy.sh
  - --pre-tf                       # e.g., for transformers 5.0
  - --exp-mxfp4                    # e.g., for MXFP4 Dockerfile
mods:
  - mods/some-patch
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.85
  max_model_len: 32000
env:
  SOME_VAR: "value"
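
To illustrate the split between required and optional fields, here is a minimal validation sketch. It works on an already parsed recipe (a plain dict, such as a YAML loader would return); the function name normalize_recipe and the fallback values for optional fields are assumptions, not the runner's actual behavior.

```python
import copy

REQUIRED_FIELDS = ("name", "container", "command")

# Assumed fallbacks for optional fields that a recipe leaves out.
OPTIONAL_DEFAULTS = {
    "description": "",
    "model": None,
    "cluster_only": False,
    "build_args": [],
    "mods": [],
    "defaults": {},
    "env": {},
}


def normalize_recipe(raw):
    """Check required fields and fill in missing optional ones."""
    missing = [field for field in REQUIRED_FIELDS if not raw.get(field)]
    if missing:
        raise ValueError(
            "recipe is missing required field(s): " + ", ".join(missing)
        )
    # Deep-copy the fallbacks so recipes never share mutable lists or dicts.
    recipe = {key: copy.deepcopy(value) for key, value in OPTIONAL_DEFAULTS.items()}
    recipe.update(raw)
    return recipe
```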

Build Arguments

The build_args field passes flags to build-and-copy.sh:

| Flag | Description |
|------|-------------|
| `--pre-tf` | Use transformers 5.0 (required for GLM-4.7 models) |
| `--exp-mxfp4` | Use MXFP4 Dockerfile (for MXFP4 quantized models) |
| `--use-wheels` | Use pre-built wheels instead of building from source |

Parameter Substitution

Use {param_name} placeholders in the command. Each one is replaced with the matching value from defaults, or with the CLI override when one is given:

defaults:
  port: 8000
  tensor_parallel: 2

command: |
  vllm serve my/model \
      --port {port} \
      -tp {tensor_parallel}

Override at runtime:

./run-recipe.sh my-recipe --port 9000 --tp 4
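
The substitution can be sketched with Python's built-in str.format. Whether run-recipe.py uses str.format or its own replacement logic is an assumption here, and render_command is a hypothetical name.

```python
def render_command(template, defaults, overrides=None):
    """Fill {param} placeholders, letting CLI overrides win over defaults."""
    params = dict(defaults)
    # Only overrides that were actually passed (not None) replace a default.
    params.update({k: v for k, v in (overrides or {}).items() if v is not None})
    try:
        return template.format(**params)
    except KeyError as exc:
        raise ValueError(f"command references undefined parameter {exc}") from None


template = (
    "vllm serve my/model \\\n"
    "    --port {port} \\\n"
    "    -tp {tensor_parallel}\n"
)
defaults = {"port": 8000, "tensor_parallel": 2}

# Equivalent to: ./run-recipe.sh my-recipe --port 9000 --tp 4
print(render_command(template, defaults, {"port": 9000, "tensor_parallel": 4}))
```

A placeholder with no default and no override raises an error rather than being left in the command.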

CLI Reference

Usage: ./run-recipe.sh [OPTIONS] [RECIPE]

Cluster discovery:
  --discover                  Auto-detect cluster nodes and save to .env
  --show-env                  Show current .env configuration

Recipe overrides:
  --port PORT                 Override port
  --host HOST                 Override host
  --tensor-parallel, --tp N   Override tensor parallelism
  --gpu-memory-utilization N  Override GPU memory utilization (alias: --gpu-mem)
  --max-model-len N           Override max model length

Setup options:
  --setup                     Full setup: build + download + run
  --build-only                Only build/copy container, don't run
  --download-only             Only download/copy model, don't run
  --force-build               Rebuild even if container exists
  --force-download            Re-download even if model exists

Launch options:
  --solo                      Run in solo mode (single node, no Ray)
  -n, --nodes IPS             Comma-separated node IPs (first = head)
  -d, --daemon                Run in daemon mode
  -t, --container IMAGE       Override container from recipe
  --nccl-debug LEVEL          NCCL debug level (VERSION, WARN, INFO, TRACE)

Other:
  --dry-run                   Show what would be executed
  --list, -l                  List available recipes

Creating a Recipe

1. Create a new .yaml file in recipes/
2. Specify required fields: name, container, command
3. Add build_args if your model needs special build options
4. Add mods if your model needs patches
5. Set cluster_only: true if the model is too large for a single node
6. Set sensible defaults
7. Add env variables if needed

Example:

name: My Model
description: My custom model setup
container: vllm-node-tf5

build_args:
  - --pre-tf

mods:
  - mods/my-fix

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.85

command: |
  vllm serve org/my-model \
      --port {port} \
      --host {host} \
      -tp {tensor_parallel} \
      --gpu-memory-utilization {gpu_memory_utilization}

Architecture

┌─────────────────────────────────────────────────────────┐
│  run-recipe.sh / run-recipe.py                          │
│  - Parses YAML recipe                                   │
│  - Auto-discovers cluster nodes (--discover)            │
│  - Loads nodes from .env                                │
│  - Handles --setup (build + download + run)             │
│  - Generates launch script from template                │
│  - Applies CLI overrides                                │
└──────────┬────────────────────────┬─────────────────────┘
           │ calls (for build)      │ calls (for download)
           ▼                        ▼
┌──────────────────────┐  ┌───────────────────────────────┐
│  build-and-copy.sh   │  │  hf-download.sh               │
│  - Docker build      │  │  - HuggingFace model download │
│  - Copy to workers   │  │  - Rsync to workers           │
└──────────────────────┘  └───────────────────────────────┘
           │ 
           │ then calls (for run)
           ▼
┌─────────────────────────────────────────────────────────┐
│  launch-cluster.sh                                      │
│  - Cluster orchestration                                │
│  - Container lifecycle                                  │
│  - Mod application                                      │
│  - Launch script execution                              │
└─────────────────────────────────────────────────────────┘

This separation follows the Unix philosophy: run-recipe.sh provides convenience, while the underlying scripts remain focused on their specific tasks.