Adding sample profile and profile loader

184 profiles/README.md Normal file
@@ -0,0 +1,184 @@
# Launch Scripts

This directory contains bash scripts that can be executed in the container using the `--launch-script` option. Launch scripts are simple, executable bash files that run directly inside the container.

## Why Launch Scripts?

- **Simple** - Just write a bash script that runs your command
- **Flexible** - Use any bash features: environment variables, conditionals, loops
- **Standalone** - Each script can be tested directly on a head node
- **No magic** - What you see is what gets executed

## Usage

```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use a launch script by filename
./launch-cluster.sh --launch-script example-vllm-minimax.sh

# Use a launch script with an absolute path
./launch-cluster.sh --launch-script /path/to/my-script.sh

# Combine with mods if needed
./launch-cluster.sh --launch-script my-script.sh --apply-mod mods/my-patch

# Combine with other options
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script my-model.sh -d
```

When using `--launch-script`, the `exec` action is automatically implied if no action is specified.
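
For example, the following two invocations behave the same:

```bash
# Equivalent: `exec` is implied when --launch-script is given with no action
./launch-cluster.sh --launch-script my-script.sh
./launch-cluster.sh --launch-script my-script.sh exec
```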

## Script Structure

Launch scripts are simple bash scripts. The script is copied into the container at `/workspace/exec-script.sh` and executed.
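
Conceptually, the copy-and-run step looks like the sketch below. This is a hypothetical illustration only: the actual container runtime and container name are internal to `launch-cluster.sh`, and `cluster-node` is an assumed name.

```bash
# Hypothetical sketch of the loader's copy-and-execute step (assuming Docker)
docker cp my-script.sh cluster-node:/workspace/exec-script.sh
docker exec cluster-node chmod +x /workspace/exec-script.sh
docker exec cluster-node /workspace/exec-script.sh
```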

```bash
#!/bin/bash
# PROFILE: Human-readable name
# DESCRIPTION: What this script does

# Optional: Set environment variables
export MY_VAR="value"

# Run your command
vllm serve org/model-name \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7
```

### Metadata Comments

The `# PROFILE:` and `# DESCRIPTION:` comments are optional but recommended for documentation:

```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
```
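
Because the metadata lives in ordinary comments, standard text tools can extract it. A minimal sketch of a listing helper (the loop and output format here are illustrative, not part of the loader):

```bash
# List every profile in this directory with its name and description
for script in profiles/*.sh; do
    name=$(sed -n 's/^# PROFILE: //p' "$script" | head -n1)
    desc=$(sed -n 's/^# DESCRIPTION: //p' "$script" | head -n1)
    echo "${script##*/}: ${name:-unnamed} - ${desc:-no description}"
done
```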

## Examples

### Basic vLLM Serving

```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend

vllm serve QuantTrio/MiniMax-M2-AWQ \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --max-model-len 128000 \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2
```

### With Environment Variables

```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000
```

### With Conditional Logic

```bash
#!/bin/bash
# PROFILE: Adaptive Model Server
# DESCRIPTION: Adjusts settings based on available GPUs

GPU_COUNT=$(nvidia-smi -L | wc -l)
echo "Detected $GPU_COUNT GPUs"

if [[ $GPU_COUNT -ge 4 ]]; then
    TP_SIZE=4
    MEM_UTIL=0.9
else
    TP_SIZE=2
    MEM_UTIL=0.7
fi

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    -tp $TP_SIZE \
    --gpu-memory-utilization $MEM_UTIL \
    --distributed-executor-backend ray
```

### SGLang

```bash
#!/bin/bash
# PROFILE: SGLang Llama 3.1
# DESCRIPTION: SGLang runtime with Llama 3.1

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    --tp 2
```

### With Model Requiring Patches

If your model requires patches, use `--apply-mod` alongside `--launch-script`:

```bash
#!/bin/bash
# Script: vllm-glm-4.7-nvfp4.sh
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: Requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4

vllm serve Salyut1/GLM-4.7-NVFP4 \
    --attention-config.backend flashinfer \
    --tool-call-parser glm47 \
    -tp 2 \
    --host 0.0.0.0 \
    --port 8000
```

Usage:

```bash
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 exec
```

## Creating a New Launch Script

1. Create a new `.sh` file in this directory
2. Add the shebang `#!/bin/bash`
3. Add `# PROFILE:` and `# DESCRIPTION:` comments
4. Write your command (e.g., `vllm serve ...`; see the template below)
5. Run with `./launch-cluster.sh --launch-script my-script.sh exec`
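
Putting those steps together, a minimal new script might look like this (the model name and flags are placeholders):

```bash
#!/bin/bash
# PROFILE: My New Model
# DESCRIPTION: vLLM serving a placeholder model

vllm serve org/model-name \
    --port 8000 \
    --host 0.0.0.0
```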

## Testing Scripts

Since launch scripts are standard bash files, you can test them directly:

```bash
# Inside a running container or on a head node with the runtime installed
cd profiles
./my-script.sh
```
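
You can also sanity-check a script without executing it: `bash -n` parses without running, and ShellCheck (if installed) catches common bash pitfalls:

```bash
bash -n my-script.sh      # syntax check only; nothing is executed
shellcheck my-script.sh   # optional static analysis
```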

This makes development and debugging much easier than it would be with a complex configuration system.

15 profiles/example-vllm-minimax.sh Normal file
@@ -0,0 +1,15 @@
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend

vllm serve QuantTrio/MiniMax-M2-AWQ \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --max-model-len 128000 \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think

17 profiles/vllm-glm-4.7-nvfp4.sh Normal file
@@ -0,0 +1,17 @@
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: This profile requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 to fix a k/v scales incompatibility
# See: https://huggingface.co/Salyut1/GLM-4.7-NVFP4/discussions/3#694ab9b6e2efa04b7ecb0c4b

vllm serve Salyut1/GLM-4.7-NVFP4 \
    --attention-config.backend flashinfer \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    -tp 2 \
    --gpu-memory-utilization 0.88 \
    --max-model-len 32000 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000

20 profiles/vllm-openai-gpt-oss-120b.sh Normal file
@@ -0,0 +1,20 @@
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.70 \
    --max-model-len 128000 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 8 \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000