Adding sample profile and profile loader

This commit is contained in:
Raphael Amorim
2026-01-25 21:22:45 -05:00
parent 133ed9cfb9
commit 751bc5a47a
6 changed files with 390 additions and 8 deletions

View File

@@ -16,10 +16,11 @@ While it was primarily developed to support multi-node inference, it works just
- [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
- [5. Configuration Details](#5-configuration-details)
- [6. Mods and Patches](#6-mods-and-patches)
- [7. Using cluster mode for inference](#7-using-cluster-mode-for-inference)
- [8. Fastsafetensors](#8-fastsafetensors)
- [9. Benchmarking](#9-benchmarking)
- [10. Downloading Models](#10-downloading-models)
- [7. Launch Scripts](#7-launch-scripts)
- [8. Using cluster mode for inference](#8-using-cluster-mode-for-inference)
- [9. Fastsafetensors](#9-fastsafetensors)
- [10. Benchmarking](#10-benchmarking)
- [11. Downloading Models](#11-downloading-models)
## DISCLAIMER
@@ -770,7 +771,55 @@ Mods can be used for:
- Customizing vLLM behavior for specific workloads
- Rapid iteration on development without rebuilding the entire image
## 7\. Using cluster mode for inference
## 7\. Launch Scripts
Launch scripts provide a simple way to define reusable model configurations. Instead of passing long command lines, you can create a bash script that is copied into the container and executed directly.
### Basic Usage
```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use with explicit nodes
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script vllm-openai-gpt-oss-120b.sh

# Combine with mods for models requiring patches
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
```
### Script Format
Launch scripts are simple bash files that run directly inside the container:
```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Set environment variables if needed
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# Run your command
vllm serve openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--enable-auto-tool-choice
```
### Available Launch Scripts
The `profiles/` directory contains ready-to-use launch scripts:
- **example-vllm-minimax.sh** - MiniMax-M2-AWQ with Ray distributed backend
- **vllm-openai-gpt-oss-120b.sh** - OpenAI GPT-OSS 120B with FlashInfer MOE
- **vllm-glm-4.7-nvfp4.sh** - GLM-4.7-NVFP4 (requires the glm4_moe patch mod)
See [profiles/README.md](profiles/README.md) for detailed documentation and more examples.
## 8\. Using cluster mode for inference
First, follow the instructions above to start the head container on your first Spark and the node container on the second Spark.
Then, on the first Spark, run vllm like this:
@@ -787,7 +836,7 @@ docker exec -it vllm_node
And execute the vllm command inside.
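For example (a sketch; substitute your own model and flags):
```bash
# Open a shell in the node container
docker exec -it vllm_node bash

# Then, inside the container, run vllm as usual, e.g.:
vllm serve openai/gpt-oss-120b --host 0.0.0.0 --port 8000 -tp 2 --distributed-executor-backend ray
```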
## 8\. Fastsafetensors
## 9\. Fastsafetensors
This build includes support for fastsafetensors loading, which significantly improves loading speeds, especially on DGX Spark, where mmap performance is currently very poor.
[Fastsafetensors](https://github.com/foundation-model-stack/fastsafetensors/) solves this issue by using more efficient multi-threaded loading while avoiding mmap.
@@ -801,11 +850,11 @@ To use this method, simply include `--load-format fastsafetensors` when running
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
```
## 9\. Benchmarking
## 10\. Benchmarking
I recommend using [llama-benchy](https://github.com/eugr/llama-benchy) - a new benchmarking tool that delivers results in the same format as llama-bench from the llama.cpp suite.
## 10\. Downloading Models
## 11\. Downloading Models
The `hf-download.sh` script provides a convenient way to download models from HuggingFace and distribute them across your cluster nodes. It uses the Hugging Face CLI via `uvx` for fast downloads and `rsync` for distribution across the cluster.
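Conceptually, the script does something like the following (a sketch; `hf-download.sh`'s actual flags and cache paths may differ):
```bash
# Download a model using the Hugging Face CLI, run via uvx
uvx --from huggingface_hub huggingface-cli download openai/gpt-oss-120b

# Then distribute the local cache to another cluster node with rsync
rsync -av ~/.cache/huggingface/hub/ 192.168.1.2:~/.cache/huggingface/hub/
```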

View File

@@ -26,6 +26,8 @@ ACTION="start"
CLUSTER_WAS_RUNNING="false"
MOD_PATHS=()
MOD_TYPES=()
LAUNCH_SCRIPT_PATH=""
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
ACTIONS_ARG=""
SOLO_MODE="false"
@@ -41,11 +43,16 @@ usage() {
echo " -e, --env Environment variable to pass to container (e.g. -e VAR=val)"
echo " --nccl-debug NCCL debug level (Optional, one of: VERSION, WARN, INFO, TRACE). If no level is provided, defaults to INFO."
echo " --apply-mod Path to directory or zip file containing run.sh to apply before launch (Can be specified multiple times)"
echo " --launch-script Path to bash script to execute in the container (from profiles/ directory or absolute path)"
echo " --check-config Check configuration and auto-detection without launching"
echo " --solo Solo mode: skip autodetection, launch only on current node, do not launch Ray cluster"
echo " -d Daemon mode (only for 'start' action)"
echo " action start | stop | status | exec (Default: start)"
echo " command Command to run (only for 'exec' action)"
echo ""
echo "Launch Script Usage:"
echo " $0 --launch-script profiles/my-script.sh # Script copied to container and executed"
echo " $0 --launch-script /path/to/script.sh # Uses absolute path to script"
exit 1
}
@@ -59,6 +66,7 @@ while [[ "$#" -gt 0 ]]; do
--ib-if) IB_IF="$2"; shift ;;
-e|--env) DOCKER_ARGS="$DOCKER_ARGS -e $2"; shift ;;
--apply-mod) MOD_PATHS+=("$2"); shift ;;
--launch-script) LAUNCH_SCRIPT_PATH="$2"; shift ;;
--nccl-debug)
if [[ -n "$2" && "$2" =~ ^(VERSION|WARN|INFO|TRACE)$ ]]; then
NCCL_DEBUG_VAL="$2"
@@ -107,6 +115,37 @@ if [[ -n "$NCCL_DEBUG_VAL" ]]; then
esac
fi
# Resolve launch script path if specified
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
# Check if it's an absolute path or relative path that exists
if [[ -f "$LAUNCH_SCRIPT_PATH" ]]; then
LAUNCH_SCRIPT_PATH=$(realpath "$LAUNCH_SCRIPT_PATH")
# Check if it's just a filename, look in profiles/ directory
elif [[ -f "$SCRIPT_DIR/profiles/$LAUNCH_SCRIPT_PATH" ]]; then
LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/profiles/$LAUNCH_SCRIPT_PATH"
# Check if it's a name without .sh extension
elif [[ -f "$SCRIPT_DIR/profiles/${LAUNCH_SCRIPT_PATH}.sh" ]]; then
LAUNCH_SCRIPT_PATH="$SCRIPT_DIR/profiles/${LAUNCH_SCRIPT_PATH}.sh"
else
echo "Error: Launch script '$LAUNCH_SCRIPT_PATH' not found."
echo "Searched in:"
echo " - $LAUNCH_SCRIPT_PATH"
echo " - $SCRIPT_DIR/profiles/$LAUNCH_SCRIPT_PATH"
echo " - $SCRIPT_DIR/profiles/${LAUNCH_SCRIPT_PATH}.sh"
exit 1
fi
echo "Using launch script: $LAUNCH_SCRIPT_PATH"
# Set command to run the copied script (use an absolute path, since docker exec may not start in /workspace)
COMMAND_TO_RUN="/workspace/exec-script.sh"
# If launch script is specified, default action to exec unless explicitly set to stop/status
if [[ "$ACTION" == "start" ]]; then
ACTION="exec"
fi
fi
# Validate MOD_PATHS if set
for i in "${!MOD_PATHS[@]}"; do
mod_path="${MOD_PATHS[$i]}"
@@ -426,6 +465,51 @@ apply_mod_to_container() {
fi
}
# Copy Launch Script to Container Function
copy_launch_script_to_container() {
local node_ip="$1"
local container="$2"
local is_local="$3" # true/false
local script_path="$4"
echo "Copying launch script to $node_ip..."
# Command prefix for remote vs local
local cmd_prefix=""
if [[ "$is_local" == "false" ]]; then
cmd_prefix="ssh -o BatchMode=yes -o StrictHostKeyChecking=no $node_ip"
fi
local target_script_path="$script_path"
local remote_cleanup_path=""
# Copy script to remote node first if needed
if [[ "$is_local" == "false" ]]; then
local remote_tmp="/tmp/exec_script_$(date +%s)_$RANDOM.sh"
echo " Copying script to $node_ip:$remote_tmp..."
if ! scp -o BatchMode=yes -o StrictHostKeyChecking=no "$script_path" "$node_ip:$remote_tmp"; then
echo "Error: Failed to copy launch script to $node_ip"
exit 1
fi
target_script_path="$remote_tmp"
remote_cleanup_path="$remote_tmp"
fi
# Copy script into container as /workspace/exec-script.sh
echo " Copying script into container..."
$cmd_prefix docker cp "$target_script_path" "$container:/workspace/exec-script.sh"
# Make executable
$cmd_prefix docker exec "$container" chmod +x /workspace/exec-script.sh
# Cleanup remote temp
if [[ -n "$remote_cleanup_path" ]]; then
ssh -o BatchMode=yes -o StrictHostKeyChecking=no "$node_ip" "rm -f $remote_cleanup_path"
fi
echo " Launch script copied to $node_ip"
}
# Start Cluster Function
start_cluster() {
check_cluster_running
@@ -494,6 +578,19 @@ start_cluster() {
done
fi
# Copy launch script if specified
if [[ -n "$LAUNCH_SCRIPT_PATH" ]]; then
echo "Copying launch script to cluster nodes..."
# Copy to Head
copy_launch_script_to_container "$HEAD_IP" "$CONTAINER_NAME" "true" "$LAUNCH_SCRIPT_PATH"
# Copy to Workers
for worker in "${PEER_NODES[@]}"; do
copy_launch_script_to_container "$worker" "$CONTAINER_NAME" "false" "$LAUNCH_SCRIPT_PATH"
done
fi
if [[ "$SOLO_MODE" == "false" ]]; then
wait_for_cluster
else

profiles/README.md Normal file
View File

@@ -0,0 +1,184 @@
# Launch Scripts
This directory contains launch scripts for the `--launch-script` option: simple, executable bash files that are copied into the container and run directly inside it.
## Why Launch Scripts?
- **Simple** - Just write a bash script that runs your command
- **Flexible** - Use any bash features: environment variables, conditionals, loops
- **Standalone** - Each script can be tested directly on a head node
- **No magic** - What you see is what gets executed
## Usage
```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use a launch script by filename
./launch-cluster.sh --launch-script example-vllm-minimax.sh

# Use a launch script with absolute path
./launch-cluster.sh --launch-script /path/to/my-script.sh

# Combine with mods if needed
./launch-cluster.sh --launch-script my-script.sh --apply-mod mods/my-patch

# Combine with other options
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script my-model.sh -d
```
When using `--launch-script`, the `exec` action is automatically implied if no action is specified.
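For example, these two invocations behave identically:
```bash
# Explicit exec action
./launch-cluster.sh --launch-script my-script.sh exec

# No action given: exec is implied by --launch-script
./launch-cluster.sh --launch-script my-script.sh
```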
## Script Structure
Launch scripts are simple bash scripts. The script is copied into the container at `/workspace/exec-script.sh` and executed.
```bash
#!/bin/bash
# PROFILE: Human-readable name
# DESCRIPTION: What this script does
# Optional: Set environment variables
export MY_VAR="value"
# Run your command
vllm serve org/model-name \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.7
```
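Under the hood, the launcher does roughly the equivalent of the following (a sketch of what `launch-cluster.sh` automates; `vllm_node` stands in for the actual container name):
```bash
# Copy the script into the running container
docker cp my-script.sh vllm_node:/workspace/exec-script.sh

# Make it executable
docker exec vllm_node chmod +x /workspace/exec-script.sh

# Run it inside the container
docker exec -it vllm_node /workspace/exec-script.sh
```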
### Metadata Comments
The `# PROFILE:` and `# DESCRIPTION:` comments are optional but recommended for documentation:
```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
```
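Because the metadata lives in ordinary comments, standard tools can read it. For example, to list every profile in this directory with its name (a convenience one-liner, not part of the launcher):
```bash
# Print each script alongside its PROFILE line
grep -H '^# PROFILE:' *.sh
```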
## Examples
### Basic vLLM Serving
```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
vllm serve QuantTrio/MiniMax-M2-AWQ \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 128000 \
--load-format fastsafetensors \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2
```
### With Environment Variables
```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
vllm serve openai/gpt-oss-120b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8000
```
### With Conditional Logic
```bash
#!/bin/bash
# PROFILE: Adaptive Model Server
# DESCRIPTION: Adjusts settings based on available GPUs
GPU_COUNT=$(nvidia-smi -L | wc -l)
echo "Detected $GPU_COUNT GPUs"
if [[ $GPU_COUNT -ge 4 ]]; then
TP_SIZE=4
MEM_UTIL=0.9
else
TP_SIZE=2
MEM_UTIL=0.7
fi
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--port 8000 \
--host 0.0.0.0 \
-tp $TP_SIZE \
--gpu-memory-utilization $MEM_UTIL \
--distributed-executor-backend ray
```
### SGLang
```bash
#!/bin/bash
# PROFILE: SGLang Llama 3.1
# DESCRIPTION: SGLang runtime with Llama 3.1
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--host 0.0.0.0 \
--tp 2
```
### With Model Requiring Patches
If your model requires patches, use `--apply-mod` alongside `--launch-script`:
```bash
# Script: vllm-glm-4.7-nvfp4.sh
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: Requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
vllm serve Salyut1/GLM-4.7-NVFP4 \
--attention-config.backend flashinfer \
--tool-call-parser glm47 \
-tp 2 \
--host 0.0.0.0 \
--port 8000
```
Usage:
```bash
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 exec
```
## Creating a New Launch Script
1. Create a new `.sh` file in this directory
2. Add the shebang `#!/bin/bash`
3. Add `# PROFILE:` and `# DESCRIPTION:` comments
4. Write your command (e.g., `vllm serve ...`)
5. Run with `./launch-cluster.sh --launch-script my-script.sh exec`
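Putting those steps together, a minimal new script might look like this (`org/model-name` is a placeholder):
```bash
#!/bin/bash
# PROFILE: My Model
# DESCRIPTION: vLLM serving org/model-name (placeholder)

vllm serve org/model-name \
--host 0.0.0.0 \
--port 8000
```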
## Testing Scripts
Since launch scripts are standard bash files, you can test them directly:
```bash
# Inside a running container or on a head node with the runtime installed
cd profiles
./my-script.sh
```
This makes development and debugging much easier than it would be with a more complex configuration system.
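You can also have bash parse a script for syntax errors without executing anything:
```bash
# Syntax check only; no commands are run
bash -n profiles/my-script.sh
```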

View File

@@ -0,0 +1,15 @@
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
vllm serve QuantTrio/MiniMax-M2-AWQ \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 128000 \
--load-format fastsafetensors \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think

View File

@@ -0,0 +1,17 @@
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: This profile requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 to fix k/v scales incompatibility
# See: https://huggingface.co/Salyut1/GLM-4.7-NVFP4/discussions/3#694ab9b6e2efa04b7ecb0c4b
vllm serve Salyut1/GLM-4.7-NVFP4 \
--attention-config.backend flashinfer \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
-tp 2 \
--gpu-memory-utilization 0.88 \
--max-model-len 32000 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8000

View File

@@ -0,0 +1,20 @@
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
vllm serve openai/gpt-oss-120b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.70 \
--max-model-len 128000 \
--max-num-batched-tokens 4096 \
--max-num-seqs 8 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000