Adding sample profile and profile loader
README.md
@@ -16,10 +16,11 @@ While it was primarily developed to support multi-node inference, it works just
- [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
- [5. Configuration Details](#5-configuration-details)
- [6. Mods and Patches](#6-mods-and-patches)
- [7. Using cluster mode for inference](#7-using-cluster-mode-for-inference)
- [8. Fastsafetensors](#8-fastsafetensors)
- [9. Benchmarking](#9-benchmarking)
- [10. Downloading Models](#10-downloading-models)
- [7. Launch Scripts](#7-launch-scripts)
- [8. Using cluster mode for inference](#8-using-cluster-mode-for-inference)
- [9. Fastsafetensors](#9-fastsafetensors)
- [10. Benchmarking](#10-benchmarking)
- [11. Downloading Models](#11-downloading-models)

## DISCLAIMER

@@ -770,7 +771,55 @@ Mods can be used for:
- Customizing vLLM behavior for specific workloads
- Rapid iteration on development without rebuilding the entire image

## 7\. Using cluster mode for inference
## 7\. Launch Scripts

Launch scripts provide a simple way to define reusable model configurations. Instead of passing long command lines, you can create a bash script that is copied into the container and executed directly.

### Basic Usage

```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use with explicit nodes
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script vllm-openai-gpt-oss-120b.sh

# Combine with mods for models requiring patches
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
```

### Script Format

Launch scripts are simple bash files that run directly inside the container:

```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Set environment variables if needed
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

# Run your command
vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --enable-auto-tool-choice
```

### Available Launch Scripts

The `profiles/` directory contains ready-to-use launch scripts:

- **example-vllm-minimax.sh** - MiniMax-M2-AWQ with Ray distributed backend
- **vllm-openai-gpt-oss-120b.sh** - OpenAI GPT-OSS 120B with FlashInfer MOE
- **vllm-glm-4.7-nvfp4.sh** - GLM-4.7-NVFP4 (requires the glm4_moe patch mod)

See [profiles/README.md](profiles/README.md) for detailed documentation and more examples.

## 8\. Using cluster mode for inference

First, follow the instructions above to start the head container on your first Spark and the node container on the second Spark.
Then, on the first Spark, run vllm like this:
@@ -787,7 +836,7 @@ docker exec -it vllm_node

And execute vllm command inside.

## 8\. Fastsafetensors
## 9\. Fastsafetensors

This build includes support for fastsafetensors loading, which significantly improves loading speeds, especially on DGX Spark where mmap performance is currently very poor.
[Fastsafetensors](https://github.com/foundation-model-stack/fastsafetensors/) solves this issue by using more efficient multi-threaded loading while avoiding mmap.
@@ -801,11 +850,11 @@ To use this method, simply include `--load-format fastsafetensors` when running
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
```

## 9\. Benchmarking
## 10\. Benchmarking

I recommend using [llama-benchy](https://github.com/eugr/llama-benchy) - a new benchmarking tool that delivers results in the same format as llama-bench from the llama.cpp suite.

## 10\. Downloading Models
## 11\. Downloading Models

The `hf-download.sh` script provides a convenient way to download models from HuggingFace and distribute them across your cluster nodes. It uses Huggingface CLI via `uvx` for fast downloads and `rsync` for distribution across the cluster.
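
As a rough illustration, the download-and-distribute flow automated by `hf-download.sh` looks something like the following (a hedged sketch only; the exact flags, cache paths, and node handling used by the script may differ):

```bash
# Download a model into the local HuggingFace cache using the HF CLI via uvx
uvx --from huggingface_hub huggingface-cli download openai/gpt-oss-120b

# Distribute the cached model to another cluster node with rsync
rsync -a ~/.cache/huggingface/hub/ 192.168.1.2:~/.cache/huggingface/hub/
```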

launch-cluster.sh
@@ -26,6 +26,8 @@ ACTION="start"
CLUSTER_WAS_RUNNING="false"
MOD_PATHS=()
MOD_TYPES=()
LAUNCH_SCRIPT_PATH=""
SCRIPT_DIR="$(dirname "$(realpath "$0")")"

ACTIONS_ARG=""
SOLO_MODE="false"
@@ -41,11 +43,16 @@ usage() {
    echo " -e, --env Environment variable to pass to container (e.g. -e VAR=val)"
    echo " --nccl-debug NCCL debug level (Optional, one of: VERSION, WARN, INFO, TRACE). If no level is provided, defaults to INFO."
    echo " --apply-mod Path to directory or zip file containing run.sh to apply before launch (Can be specified multiple times)"
    echo " --launch-script Path to bash script to execute in the container (from profiles/ directory or absolute path)"
    echo " --check-config Check configuration and auto-detection without launching"
    echo " --solo Solo mode: skip autodetection, launch only on current node, do not launch Ray cluster"
    echo " -d Daemon mode (only for 'start' action)"
    echo " action start | stop | status | exec (Default: start)"
    echo " command Command to run (only for 'exec' action)"
    echo ""
    echo "Launch Script Usage:"
    echo " $0 --launch-script profiles/my-script.sh # Script copied to container and executed"
    echo " $0 --launch-script /path/to/script.sh # Uses absolute path to script"
    exit 1
}

@@ -59,6 +66,7 @@ while [[ "$#" -gt 0 ]]; do
        --ib-if) IB_IF="$2"; shift ;;
        -e|--env) DOCKER_ARGS="$DOCKER_ARGS -e $2"; shift ;;
        --apply-mod) MOD_PATHS+=("$2"); shift ;;
        --launch-script) LAUNCH_SCRIPT_PATH="$2"; shift ;;
        --nccl-debug)
            if [[ -n "$2" && "$2" =~ ^(VERSION|WARN|INFO|TRACE)$ ]]; then
                NCCL_DEBUG_VAL="$2"
@@ -107,6 +115,37 @@ if [[ -n "$NCCL_DEBUG_VAL" ]]; then
    esac
fi

# Resolve launch script path if specified
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
    # Check if it's an absolute path or relative path that exists
    if [[ -f "$LAUNCH_SCRIPT_PATH" ]]; then
        LAUNCH_SCRIPT_PATH=$(realpath "$LAUNCH_SCRIPT_PATH")
    # Check if it's just a filename, look in profiles/ directory
    elif [[ -f "$SCRIPT_DIR/profiles/$LAUNCH_SCRIPT_PATH" ]]; then
        LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/profiles/$LAUNCH_SCRIPT_PATH"
    # Check if it's a name without .sh extension
    elif [[ -f "$SCRIPT_DIR/profiles/${LAUNCH_SCRIPT_PATH}.sh" ]]; then
        LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/profiles/${LAUNCH_SCRIPT_PATH}.sh"
    else
        echo "Error: Launch script '$LAUNCH_SCRIPT_PATH' not found."
        echo "Searched in:"
        echo " - $LAUNCH_SCRIPT_PATH"
        echo " - $SCRIPT_DIR/profiles/$LAUNCH_SCRIPT_PATH"
        echo " - $SCRIPT_DIR/profiles/${LAUNCH_SCRIPT_PATH}.sh"
        exit 1
    fi

    echo "Using launch script: $LAUNCH_SCRIPT_PATH"

    # Set command to run the copied script (use absolute path since docker exec may not be in /workspace)
    COMMAND_TO_RUN="/workspace/exec-script.sh"

    # If launch script is specified, default action to exec unless explicitly set to stop/status
    if [[ "$ACTION" == "start" ]]; then
        ACTION="exec"
    fi
fi

# Validate MOD_PATHS if set
for i in "${!MOD_PATHS[@]}"; do
    mod_path="${MOD_PATHS[$i]}"
@@ -426,6 +465,51 @@ apply_mod_to_container() {
    fi
}

# Copy Launch Script to Container Function
copy_launch_script_to_container() {
    local node_ip="$1"
    local container="$2"
    local is_local="$3"  # true/false
    local script_path="$4"

    echo "Copying launch script to $node_ip..."

    # Command prefix for remote vs local
    local cmd_prefix=""
    if [[ "$is_local" == "false" ]]; then
        cmd_prefix="ssh -o BatchMode=yes -o StrictHostKeyChecking=no $node_ip"
    fi

    local target_script_path="$script_path"
    local remote_cleanup_path=""

    # Copy script to remote node first if needed
    if [[ "$is_local" == "false" ]]; then
        local remote_tmp="/tmp/exec_script_$(date +%s)_$RANDOM.sh"
        echo " Copying script to $node_ip:$remote_tmp..."
        if ! scp -o BatchMode=yes -o StrictHostKeyChecking=no "$script_path" "$node_ip:$remote_tmp"; then
            echo "Error: Failed to copy launch script to $node_ip"
            exit 1
        fi
        target_script_path="$remote_tmp"
        remote_cleanup_path="$remote_tmp"
    fi

    # Copy script into container as /workspace/exec-script.sh
    echo " Copying script into container..."
    $cmd_prefix docker cp "$target_script_path" "$container:/workspace/exec-script.sh"

    # Make executable
    $cmd_prefix docker exec "$container" chmod +x /workspace/exec-script.sh

    # Cleanup remote temp
    if [[ -n "$remote_cleanup_path" ]]; then
        ssh -o BatchMode=yes -o StrictHostKeyChecking=no "$node_ip" "rm -f $remote_cleanup_path"
    fi

    echo " Launch script copied to $node_ip"
}

# Start Cluster Function
start_cluster() {
    check_cluster_running
@@ -494,6 +578,19 @@ start_cluster() {
        done
    fi

    # Copy launch script if specified
    if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
        echo "Copying launch script to cluster nodes..."

        # Copy to Head
        copy_launch_script_to_container "$HEAD_IP" "$CONTAINER_NAME" "true" "$LAUNCH_SCRIPT_PATH"

        # Copy to Workers
        for worker in "${PEER_NODES[@]}"; do
            copy_launch_script_to_container "$worker" "$CONTAINER_NAME" "false" "$LAUNCH_SCRIPT_PATH"
        done
    fi

    if [[ "$SOLO_MODE" == "false" ]]; then
        wait_for_cluster
    else

profiles/README.md (new file)
@@ -0,0 +1,184 @@
# Launch Scripts

This directory contains bash scripts that can be executed in the container using the `--launch-script` option. Launch scripts are simple, executable bash files that run directly inside the container.

## Why Launch Scripts?

- **Simple** - Just write a bash script that runs your command
- **Flexible** - Use any bash features: environment variables, conditionals, loops
- **Standalone** - Each script can be tested directly on a head node
- **No magic** - What you see is what gets executed

## Usage

```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use a launch script by filename
./launch-cluster.sh --launch-script example-vllm-minimax.sh

# Use a launch script with absolute path
./launch-cluster.sh --launch-script /path/to/my-script.sh

# Combine with mods if needed
./launch-cluster.sh --launch-script my-script.sh --apply-mod mods/my-patch

# Combine with other options
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script my-model.sh -d
```

When using `--launch-script`, the `exec` action is automatically implied if no action is specified.
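
For example, the following two invocations are equivalent, because `launch-cluster.sh` promotes the default `start` action to `exec` whenever a launch script is given:

```bash
# 'exec' is implied when --launch-script is present
./launch-cluster.sh --launch-script example-vllm-minimax
./launch-cluster.sh --launch-script example-vllm-minimax exec
```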

## Script Structure

Launch scripts are simple bash scripts. The script is copied into the container at `/workspace/exec-script.sh` and executed.

```bash
#!/bin/bash
# PROFILE: Human-readable name
# DESCRIPTION: What this script does

# Optional: Set environment variables
export MY_VAR="value"

# Run your command
vllm serve org/model-name \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7
```

### Metadata Comments

The `# PROFILE:` and `# DESCRIPTION:` comments are optional but recommended for documentation:

```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
```
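
The metadata also makes it easy to see what is available at a glance. For example, this one-liner (just a convenience, not part of the tooling) lists every profile and its description:

```bash
# Print the PROFILE and DESCRIPTION lines from all launch scripts
grep -EH '^# (PROFILE|DESCRIPTION):' profiles/*.sh
```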

## Examples

### Basic vLLM Serving

```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend

vllm serve QuantTrio/MiniMax-M2-AWQ \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --max-model-len 128000 \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2
```

### With Environment Variables

```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000
```

### With Conditional Logic

```bash
#!/bin/bash
# PROFILE: Adaptive Model Server
# DESCRIPTION: Adjusts settings based on available GPUs

GPU_COUNT=$(nvidia-smi -L | wc -l)
echo "Detected $GPU_COUNT GPUs"

if [[ $GPU_COUNT -ge 4 ]]; then
    TP_SIZE=4
    MEM_UTIL=0.9
else
    TP_SIZE=2
    MEM_UTIL=0.7
fi

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    -tp $TP_SIZE \
    --gpu-memory-utilization $MEM_UTIL \
    --distributed-executor-backend ray
```

### SGLang

```bash
#!/bin/bash
# PROFILE: SGLang Llama 3.1
# DESCRIPTION: SGLang runtime with Llama 3.1

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    --tp 2
```

### With Model Requiring Patches

If your model requires patches, use `--apply-mod` alongside `--launch-script`:

```bash
# Script: vllm-glm-4.7-nvfp4.sh
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: Requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4

vllm serve Salyut1/GLM-4.7-NVFP4 \
    --attention-config.backend flashinfer \
    --tool-call-parser glm47 \
    -tp 2 \
    --host 0.0.0.0 \
    --port 8000
```

Usage:
```bash
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 exec
```

## Creating a New Launch Script

1. Create a new `.sh` file in this directory
2. Add the shebang `#!/bin/bash`
3. Add `# PROFILE:` and `# DESCRIPTION:` comments
4. Write your command (e.g., `vllm serve ...`)
5. Run with `./launch-cluster.sh --launch-script my-script.sh exec`

## Testing Scripts

Since launch scripts are standard bash files, you can test them directly:

```bash
# Inside a running container or on a head node with the runtime installed
cd profiles
./my-script.sh
```
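
If a cluster is already up with a launch script loaded, you can also re-run the copy that `launch-cluster.sh` placed in the container (assuming the container name used in the main README; adjust if yours differs):

```bash
# Re-run the launch script that was copied to /workspace inside the container
docker exec -it vllm_node /workspace/exec-script.sh
```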

This makes development and debugging much easier than with complex configuration systems.

profiles/example-vllm-minimax.sh (new file)
@@ -0,0 +1,15 @@
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend

vllm serve QuantTrio/MiniMax-M2-AWQ \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --max-model-len 128000 \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think

profiles/vllm-glm-4.7-nvfp4.sh (new file)
@@ -0,0 +1,17 @@
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: This profile requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 to fix k/v scales incompatibility
# See: https://huggingface.co/Salyut1/GLM-4.7-NVFP4/discussions/3#694ab9b6e2efa04b7ecb0c4b

vllm serve Salyut1/GLM-4.7-NVFP4 \
    --attention-config.backend flashinfer \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    -tp 2 \
    --gpu-memory-utilization 0.88 \
    --max-model-len 32000 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000

profiles/vllm-openai-gpt-oss-120b.sh (new file)
@@ -0,0 +1,20 @@
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.70 \
    --max-model-len 128000 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 8 \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000