Adding sample profile and profile loader

184 profiles/README.md Normal file
@@ -0,0 +1,184 @@
# Launch Scripts

This directory contains bash scripts that can be executed in the container using the `--launch-script` option. Launch scripts are simple, executable bash files that run directly inside the container.

## Why Launch Scripts?

- **Simple** - Just write a bash script that runs your command
- **Flexible** - Use any bash features: environment variables, conditionals, loops
- **Standalone** - Each script can be tested directly on a head node
- **No magic** - What you see is what gets executed

## Usage

```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use a launch script by filename
./launch-cluster.sh --launch-script example-vllm-minimax.sh

# Use a launch script with an absolute path
./launch-cluster.sh --launch-script /path/to/my-script.sh

# Combine with mods if needed
./launch-cluster.sh --launch-script my-script.sh --apply-mod mods/my-patch

# Combine with other options
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script my-model.sh -d
```

When using `--launch-script`, the `exec` action is automatically implied if no action is specified.
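
For example, the following two invocations behave the same:

```bash
# Equivalent: `exec` is implied when --launch-script is given with no action
./launch-cluster.sh --launch-script my-script.sh
./launch-cluster.sh --launch-script my-script.sh exec
```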

## Script Structure

Launch scripts are simple bash scripts. The script is copied into the container at `/workspace/exec-script.sh` and executed.
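
Conceptually, the copy-and-run step looks like the sketch below. This is a hypothetical illustration only: the actual container runtime and container name are internal to `launch-cluster.sh`, and `cluster-node` is an assumed name.

```bash
# Hypothetical sketch of the loader's copy-and-execute step (assuming Docker)
docker cp my-script.sh cluster-node:/workspace/exec-script.sh
docker exec cluster-node chmod +x /workspace/exec-script.sh
docker exec cluster-node /workspace/exec-script.sh
```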

```bash
#!/bin/bash
# PROFILE: Human-readable name
# DESCRIPTION: What this script does

# Optional: Set environment variables
export MY_VAR="value"

# Run your command
vllm serve org/model-name \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7
```

### Metadata Comments

The `# PROFILE:` and `# DESCRIPTION:` comments are optional but recommended for documentation:

```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend
```
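
Because the metadata lives in ordinary comments, standard text tools can extract it. A minimal sketch of a listing helper (the loop and output format here are illustrative, not part of the loader):

```bash
# List every profile in this directory with its name and description
for script in profiles/*.sh; do
    name=$(sed -n 's/^# PROFILE: //p' "$script" | head -n1)
    desc=$(sed -n 's/^# DESCRIPTION: //p' "$script" | head -n1)
    echo "${script##*/}: ${name:-unnamed} - ${desc:-no description}"
done
```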

## Examples

### Basic vLLM Serving

```bash
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend

vllm serve QuantTrio/MiniMax-M2-AWQ \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --max-model-len 128000 \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2
```

### With Environment Variables

```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000
```

### With Conditional Logic

```bash
#!/bin/bash
# PROFILE: Adaptive Model Server
# DESCRIPTION: Adjusts settings based on available GPUs

GPU_COUNT=$(nvidia-smi -L | wc -l)
echo "Detected $GPU_COUNT GPUs"

if [[ $GPU_COUNT -ge 4 ]]; then
    TP_SIZE=4
    MEM_UTIL=0.9
else
    TP_SIZE=2
    MEM_UTIL=0.7
fi

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    -tp $TP_SIZE \
    --gpu-memory-utilization $MEM_UTIL \
    --distributed-executor-backend ray
```

### SGLang

```bash
#!/bin/bash
# PROFILE: SGLang Llama 3.1
# DESCRIPTION: SGLang runtime with Llama 3.1

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    --tp 2
```

### With Model Requiring Patches

If your model requires patches, use `--apply-mod` alongside `--launch-script`:

```bash
#!/bin/bash
# Script: vllm-glm-4.7-nvfp4.sh
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: Requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4

vllm serve Salyut1/GLM-4.7-NVFP4 \
    --attention-config.backend flashinfer \
    --tool-call-parser glm47 \
    -tp 2 \
    --host 0.0.0.0 \
    --port 8000
```

Usage:

```bash
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 exec
```

## Creating a New Launch Script

1. Create a new `.sh` file in this directory
2. Add the shebang `#!/bin/bash`
3. Add `# PROFILE:` and `# DESCRIPTION:` comments
4. Write your command (e.g., `vllm serve ...`; see the template below)
5. Run with `./launch-cluster.sh --launch-script my-script.sh exec`
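
Putting those steps together, a minimal new script might look like this (the model name and flags are placeholders):

```bash
#!/bin/bash
# PROFILE: My New Model
# DESCRIPTION: vLLM serving a placeholder model

vllm serve org/model-name \
    --port 8000 \
    --host 0.0.0.0
```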

## Testing Scripts

Since launch scripts are standard bash files, you can test them directly:

```bash
# Inside a running container or on a head node with the runtime installed
cd profiles
./my-script.sh
```
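
You can also sanity-check a script without executing it: `bash -n` parses without running, and ShellCheck (if installed) catches common bash pitfalls:

```bash
bash -n my-script.sh      # syntax check only; nothing is executed
shellcheck my-script.sh   # optional static analysis
```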

This makes development and debugging much easier than it would be with a complex configuration system.

15 profiles/example-vllm-minimax.sh Normal file
@@ -0,0 +1,15 @@
#!/bin/bash
# PROFILE: MiniMax-M2-AWQ Example
# DESCRIPTION: vLLM serving MiniMax-M2-AWQ with Ray distributed backend

vllm serve QuantTrio/MiniMax-M2-AWQ \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --max-model-len 128000 \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think

17 profiles/vllm-glm-4.7-nvfp4.sh Normal file
@@ -0,0 +1,17 @@
#!/bin/bash
# PROFILE: Salyut1/GLM-4.7-NVFP4
# DESCRIPTION: vLLM serving GLM-4.7-NVFP4
# NOTE: This profile requires --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4 to fix a k/v scales incompatibility
# See: https://huggingface.co/Salyut1/GLM-4.7-NVFP4/discussions/3#694ab9b6e2efa04b7ecb0c4b

vllm serve Salyut1/GLM-4.7-NVFP4 \
    --attention-config.backend flashinfer \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    -tp 2 \
    --gpu-memory-utilization 0.88 \
    --max-model-len 32000 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000

20 profiles/vllm-openai-gpt-oss-120b.sh Normal file
@@ -0,0 +1,20 @@
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Enable FlashInfer MOE with MXFP4/MXFP8 quantization
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.70 \
    --max-model-len 128000 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 8 \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000