Adding sample profile and profile loader

This commit is contained in:
Raphael Amorim
2026-01-25 21:22:45 -05:00
parent 133ed9cfb9
commit 751bc5a47a
6 changed files with 390 additions and 8 deletions

View File

@@ -16,10 +16,11 @@ While it was primarily developed to support multi-node inference, it works just
- [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
- [5. Configuration Details](#5-configuration-details)
- [6. Mods and Patches](#6-mods-and-patches)
- [7. Using cluster mode for inference](#7-using-cluster-mode-for-inference)
- [8. Fastsafetensors](#8-fastsafetensors)
- [9. Benchmarking](#9-benchmarking)
- [10. Downloading Models](#10-downloading-models)
- [7. Launch Scripts](#7-launch-scripts)
- [8. Using cluster mode for inference](#8-using-cluster-mode-for-inference)
- [9. Fastsafetensors](#9-fastsafetensors)
- [10. Benchmarking](#10-benchmarking)
- [11. Downloading Models](#11-downloading-models)
## DISCLAIMER
@@ -770,7 +771,55 @@ Mods can be used for:
- Customizing vLLM behavior for specific workloads
- Rapid iteration on development without rebuilding the entire image
## 7\. Using cluster mode for inference
## 7\. Launch Scripts
Launch scripts provide a simple way to define reusable model configurations. Instead of passing long command lines, you can create a bash script that is copied into the container and executed directly.
### Basic Usage
```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use with explicit nodes
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script vllm-openai-gpt-oss-120b.sh

# Combine with mods for models requiring patches
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
```
### Script Format
Launch scripts are simple bash files that run directly inside the container:
```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Set environment variables if needed
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# Run your command
vllm serve openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--enable-auto-tool-choice
```
### Available Launch Scripts
The `profiles/` directory contains ready-to-use launch scripts:
- **example-vllm-minimax.sh** - MiniMax-M2-AWQ with Ray distributed backend
- **vllm-openai-gpt-oss-120b.sh** - OpenAI GPT-OSS 120B with FlashInfer MOE
- **vllm-glm-4.7-nvfp4.sh** - GLM-4.7-NVFP4 (requires the glm4_moe patch mod)
See [profiles/README.md](profiles/README.md) for detailed documentation and more examples.
## 8\. Using cluster mode for inference
First, follow the instructions above to start the head container on your first Spark and the node container on the second Spark.
Then, on the first Spark, run vllm like this:
@@ -787,7 +836,7 @@ docker exec -it vllm_node
And execute the vllm command inside.
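For example (a sketch; substitute your own model and flags):
```bash
# Open a shell in the node container
docker exec -it vllm_node bash

# Then, inside the container, run vllm as usual, e.g.:
vllm serve openai/gpt-oss-120b --host 0.0.0.0 --port 8000 -tp 2 --distributed-executor-backend ray
```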
## 8\. Fastsafetensors
## 9\. Fastsafetensors
This build includes support for fastsafetensors loading, which significantly improves loading speeds, especially on DGX Spark, where mmap performance is currently very poor.
[Fastsafetensors](https://github.com/foundation-model-stack/fastsafetensors/) solves this issue by using more efficient multi-threaded loading while avoiding mmap.
@@ -801,11 +850,11 @@ To use this method, simply include `--load-format fastsafetensors` when running
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
```
## 9\. Benchmarking
## 10\. Benchmarking
I recommend using [llama-benchy](https://github.com/eugr/llama-benchy) - a new benchmarking tool that delivers results in the same format as llama-bench from the llama.cpp suite.
## 10\. Downloading Models
## 11\. Downloading Models
The `hf-download.sh` script provides a convenient way to download models from HuggingFace and distribute them across your cluster nodes. It uses the Hugging Face CLI via `uvx` for fast downloads and `rsync` for distribution across the cluster.
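Conceptually, the script does something like the following (a sketch; `hf-download.sh`'s actual flags and cache paths may differ):
```bash
# Download a model using the Hugging Face CLI, run via uvx
uvx --from huggingface_hub huggingface-cli download openai/gpt-oss-120b

# Then distribute the local cache to another cluster node with rsync
rsync -av ~/.cache/huggingface/hub/ 192.168.1.2:~/.cache/huggingface/hub/
```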

View File

@@ -26,6 +26,8 @@ ACTION="start"
CLUSTER_WAS_RUNNING="false"
MOD_PATHS=()
MOD_TYPES=()
LAUNCH_SCRIPT_PATH=""
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
ACTIONS_ARG=""
SOLO_MODE="false"
@@ -41,11 +43,16 @@ usage() {
echo " -e, --env Environment variable to pass to container (e.g. -e VAR=val)"
echo " --nccl-debug NCCL debug level (Optional, one of: VERSION, WARN, INFO, TRACE). If no level is provided, defaults to INFO."
echo " --apply-mod Path to directory or zip file containing run.sh to apply before launch (Can be specified multiple times)"
echo " --launch-script Path to bash script to execute in the container (from profiles/ directory or absolute path)"
echo " --check-config Check configuration and auto-detection without launching"
echo " --solo Solo mode: skip autodetection, launch only on current node, do not launch Ray cluster"
echo " -d Daemon mode (only for 'start' action)"
echo " action start | stop | status | exec (Default: start)"
echo " command Command to run (only for 'exec' action)"
echo ""
echo "Launch Script Usage:"
echo " $0 --launch-script profiles/my-script.sh # Script copied to container and executed"
echo " $0 --launch-script /path/to/script.sh # Uses absolute path to script"
exit 1
}
@@ -59,6 +66,7 @@ while [[ "$#" -gt 0 ]]; do
--ib-if) IB_IF="$2"; shift ;;
-e|--env) DOCKER_ARGS="$DOCKER_ARGS -e $2"; shift ;;
--apply-mod) MOD_PATHS+=("$2"); shift ;;
--launch-script) LAUNCH_SCRIPT_PATH="$2"; shift ;;
--nccl-debug)
if [[ -n "$2" && "$2" =~ ^(VERSION|WARN|INFO|TRACE)$ ]]; then
NCCL_DEBUG_VAL="$2"
@@ -107,6 +115,37 @@ if [[ -n "$NCCL_DEBUG_VAL" ]]; then
esac
fi
# Resolve launch script path if specified
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
# Check if it's an absolute path or relative path that exists
if [[ -f "$LAUNCH_SCRIPT_PATH" ]]; then
LAUNCH_SCRIPT_PATH=$(realpath "$LAUNCH_SCRIPT_PATH")
# Check if it's just a filename, look in profiles/ directory
elif [[ -f "$SCRIPT_DIR/profiles/$LAUNCH_SCRIPT_PATH" ]]; then
LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/profiles/$LAUNCH_SCRIPT_PATH"
# Check if it's a name without .sh extension
elif [[ -f "$SCRIPT_DIR/profiles/${LAUNCH_SCRIPT_PATH}.sh" ]]; then
LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/profiles/${LAUNCH_SCRIPT_PATH}.sh"
else
echo "Error: Launch script '$LAUNCH_SCRIPT_PATH' not found."
echo "Searched in:"
echo " - $LAUNCH_SCRIPT_PATH"
echo " - $SCRIPT_DIR/profiles/$LAUNCH_SCRIPT_PATH"
echo " - $SCRIPT_DIR/profiles/${LAUNCH_SCRIPT_PATH}.sh"
exit 1
fi
echo "Using launch script: $LAUNCH_SCRIPT_PATH"
# Set command to run the copied script (use an absolute path, since docker exec may not start in /workspace)
COMMAND_TO_RUN="/workspace/exec-script.sh"
# If launch script is specified, default action to exec unless explicitly set to stop/status
if [[ "$ACTION" == "start" ]]; then
ACTION="exec"
fi
fi
# Validate MOD_PATHS if set
for i in "${!MOD_PATHS[@]}"; do
mod_path="${MOD_PATHS[$i]}"
@@ -426,6 +465,51 @@ apply_mod_to_container() {
fi
}
# Copy Launch Script to Container Function
copy_launch_script_to_container() {
local node_ip="$1"
local container="$2"
local is_local="$3" # true/false
local script_path="$4"
echo "Copying launch script to $node_ip..."
# Command prefix for remote vs local
local cmd_prefix=""
if [[ "$is_local" == "false" ]]; then
cmd_prefix="ssh -o BatchMode=yes -o StrictHostKeyChecking=no $node_ip"
fi
local target_script_path="$script_path"
local remote_cleanup_path=""
# Copy script to remote node first if needed
if [[ "$is_local" == "false" ]]; then
local remote_tmp="/tmp/exec_script_$(date +%s)_$RANDOM.sh"
echo " Copying script to $node_ip:$remote_tmp..."
if ! scp -o BatchMode=yes -o StrictHostKeyChecking=no "$script_path" "$node_ip:$remote_tmp"; then
echo "Error: Failed to copy launch script to $node_ip"
exit 1
fi
target_script_path="$remote_tmp"
remote_cleanup_path="$remote_tmp"
fi
# Copy script into container as /workspace/exec-script.sh
echo " Copying script into container..."
$cmd_prefix docker cp "$target_script_path" "$container:/workspace/exec-script.sh"
# Make executable
$cmd_prefix docker exec "$container" chmod +x /workspace/exec-script.sh
# Cleanup remote temp
if [[ -n "$remote_cleanup_path" ]]; then
ssh -o BatchMode=yes -o StrictHostKeyChecking=no "$node_ip" "rm -f $remote_cleanup_path"
fi
echo " Launch script copied to $node_ip"
}
# Start Cluster Function
start_cluster() {
check_cluster_running
@@ -494,6 +578,19 @@ start_cluster() {
done
fi
# Copy launch script if specified
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
echo "Copying launch script to cluster nodes..."
# Copy to Head
copy_launch_script_to_container "$HEAD_IP" "$CONTAINER_NAME" "true" "$LAUNCH_SCRIPT_PATH"
# Copy to Workers
for worker in "${PEER_NODES[@]}"; do
copy_launch_script_to_container "$worker" "$CONTAINER_NAME" "false" "$LAUNCH_SCRIPT_PATH"
done
fi
if [[ "$SOLO_MODE" == "false" ]]; then
wait_for_cluster
else

profiles/README.md Normal file
View File

@@ -0,0 +1,184 @@
# Launch Scripts
This directory contains launch scripts for the `--launch-script` option: simple, executable bash files that are copied into the container and run directly inside it.
## Why Launch Scripts?
- **Simple** - Just write a bash script that runs your command
- **Flexible** - Use any bash features: environment variables, conditionals, loops
- **Standalone** - Each script can be tested directly on a head node
- **No magic** - What you see is what gets executed
## Usage
```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use a launch script by filename
./launch-cluster.sh --launch-script example-vllm-minimax.sh

# Use a launch script with absolute path
./launch-cluster.sh --launch-script /path/to/my-script.sh

# Combine with mods if needed
./launch-cluster.sh --launch-script my-script.sh --apply-mod mods/my-patch

# Combine with other options
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script my-model.sh -d
```
When using `--launch-script`, the `exec` action is automatically implied if no action is specified.
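For example, these two invocations behave identically:
```bash
# Explicit exec action
./launch-cluster.sh --launch-script my-script.sh exec

# No action given: exec is implied by --launch-script
./launch-cluster.sh --launch-script my-script.sh
```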
## Script Structure
Launch scripts are simple bash scripts. The script is copied into the container at `/workspace/exec-script.sh` and executed.
```bash
#!/bin/bash
# PROFILE: Human-readable name
# DESCRIPTION: What this script does
# Optional: Set environment variables
export MY_VAR="value"
# Run your command
vllm serve org/model-name \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.7
```
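Under the hood, the launcher does roughly the equivalent of the following (a sketch of what `launch-cluster.sh` automates; `vllm_node` stands in for the actual container name):
```bash
# Copy the script into the running container
docker cp my-script.sh vllm_node:/workspace/exec-script.sh

# Make it executable
docker exec vllm_node chmod +x /workspace/exec-script.sh

# Run it inside the container
docker exec -it vllm_node /workspace/exec-script.sh
```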
### Metadata Comments
The `# PROFILE:` and `# DESCRIPTION:` comments are optional but recommended for documentation:
```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
```
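Because the metadata lives in ordinary comments, standard tools can read it. For example, to list every profile in this directory with its name (a convenience one-liner, not part of the launcher):
```bash
# Print each script alongside its PROFILE line
grep -H '^# PROFILE:' *.sh
```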
## Examples
### Basic vLLM Serving
```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
vllm serve QuantTrio/MiniMax-M2-AWQ \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 128000 \
--load-format fastsafetensors \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2
```
### With Environment Variables
```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
vllm serve openai/gpt-oss-120b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8000
```
### With Conditional Logic
```bash
#!/bin/bash
# PROFILE: Adaptive Model Server
# DESCRIPTION: Adjusts settings based on available GPUs
GPU_COUNT=$(nvidia-smi -L | wc -l)
echo "Detected $GPU_COUNT GPUs"
if [[ $GPU_COUNT -ge 4 ]]; then
TP_SIZE=4
MEM_UTIL=0.9
else
TP_SIZE=2
MEM_UTIL=0.7
fi
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--port 8000 \
--host 0.0.0.0 \
-tp $TP_SIZE \
--gpu-memory-utilization $MEM_UTIL \
--distributed-executor-backend ray
```
### SGLang
```bash
#!/bin/bash
# PROFILE: SGLang Llama 3.1
# DESCRIPTION: SGLang runtime with Llama 3.1
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--host 0.0.0.0 \
--tp 2
```
### With Model Requiring Patches
If your model requires patches, use `--apply-mod` alongside `--launch-script`:
```bash
# Script: vllm-glm-4.7-nvfp4.sh
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: Requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
vllm serve Salyut1/GLM-4.7-NVFP4 \
--attention-config.backend flashinfer \
--tool-call-parser glm47 \
-tp 2 \
--host 0.0.0.0 \
--port 8000
```
Usage:
```bash
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 exec
```
## Creating a New Launch Script
1. Create a new `.sh` file in this directory
2. Add the shebang `#!/bin/bash`
3. Add `# PROFILE:` and `# DESCRIPTION:` comments
4. Write your command (e.g., `vllm serve ...`)
5. Run with `./launch-cluster.sh --launch-script my-script.sh exec`
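Putting those steps together, a minimal new script might look like this (`org/model-name` is a placeholder):
```bash
#!/bin/bash
# PROFILE: My Model
# DESCRIPTION: vLLM serving org/model-name (placeholder)

vllm serve org/model-name \
--host 0.0.0.0 \
--port 8000
```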
## Testing Scripts
Since launch scripts are standard bash files, you can test them directly:
```bash
# Inside a running container or on a head node with the runtime installed
cd profiles
./my-script.sh
```
This makes development and debugging much easier than it would be with a more complex configuration system.
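You can also have bash parse a script for syntax errors without executing anything:
```bash
# Syntax check only; no commands are run
bash -n profiles/my-script.sh
```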

View File

@@ -0,0 +1,15 @@
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
vllm serve QuantTrio/MiniMax-M2-AWQ \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 128000 \
--load-format fastsafetensors \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think

View File

@@ -0,0 +1,17 @@
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: This profile requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 to fix k/v scales incompatibility
# See: https://huggingface.co/Salyut1/GLM-4.7-NVFP4/discussions/3#694ab9b6e2efa04b7ecb0c4b
vllm serve Salyut1/GLM-4.7-NVFP4 \
--attention-config.backend flashinfer \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
-tp 2 \
--gpu-memory-utilization 0.88 \
--max-model-len 32000 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8000

View File

@@ -0,0 +1,20 @@
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
vllm serve openai/gpt-oss-120b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.70 \
--max-model-len 128000 \
--max-num-batched-tokens 4096 \
--max-num-seqs 8 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000