Adding sample profile and profile loader

Raphael Amorim
2026-01-25 21:22:45 -05:00
parent 133ed9cfb9
commit 751bc5a47a
6 changed files with 390 additions and 8 deletions


@@ -16,10 +16,11 @@ While it was primarily developed to support multi-node inference, it works just
- [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
- [5. Configuration Details](#5-configuration-details)
- [6. Mods and Patches](#6-mods-and-patches)
- [7. Using cluster mode for inference](#7-using-cluster-mode-for-inference)
- [8. Fastsafetensors](#8-fastsafetensors)
- [9. Benchmarking](#9-benchmarking)
- [10. Downloading Models](#10-downloading-models)
- [7. Launch Scripts](#7-launch-scripts)
- [8. Using cluster mode for inference](#8-using-cluster-mode-for-inference)
- [9. Fastsafetensors](#9-fastsafetensors)
- [10. Benchmarking](#10-benchmarking)
- [11. Downloading Models](#11-downloading-models)
## DISCLAIMER
@@ -770,7 +771,55 @@ Mods can be used for:
- Customizing vLLM behavior for specific workloads
- Rapid iteration on development without rebuilding the entire image
## 7\. Using cluster mode for inference
## 7\. Launch Scripts
Launch scripts provide a simple way to define reusable model configurations. Instead of passing long command lines, you can create a bash script that is copied into the container and executed directly.
### Basic Usage
```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax
# Use with explicit nodes
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script vllm-openai-gpt-oss-120b.sh
# Combine with mods for models requiring patches
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
```
### Script Format
Launch scripts are simple bash files that run directly inside the container:
```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Set environment variables if needed
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# Run your command
vllm serve openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--enable-auto-tool-choice
```
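To add your own model configuration, drop a script in the same format into the `profiles/` directory and reference it by name. A minimal sketch, assuming a hypothetical profile called `my-model.sh` (the model name and flags are placeholders):

```bash
# Create a hypothetical custom profile (adjust the model and flags to your setup)
cat > profiles/my-model.sh <<'EOF'
#!/bin/bash
# PROFILE: My Model
# DESCRIPTION: example custom profile

vllm serve my-org/my-model \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
EOF

# Launch it by name, just like the bundled profiles
./launch-cluster.sh --launch-script my-model
```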
### Available Launch Scripts
The `profiles/` directory contains ready-to-use launch scripts:
- **example-vllm-minimax.sh** - MiniMax-M2-AWQ with Ray distributed backend
- **vllm-openai-gpt-oss-120b.sh** - OpenAI GPT-OSS 120B with FlashInfer MOE
- **vllm-glm-4.7-nvfp4.sh** - GLM-4.7-NVFP4 (requires the glm4_moe patch mod)
See [profiles/README.md](profiles/README.md) for detailed documentation and more examples.
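Under the hood this is essentially a copy-and-execute step: the script is copied into the container and run there. A rough sketch of the equivalent manual commands, assuming the `vllm_node` container name used later in this README (the actual loader may differ):

```bash
# Sketch only - roughly what the launch-script mechanism amounts to:
# copy the profile into the running container, then execute it there
docker cp profiles/vllm-openai-gpt-oss-120b.sh vllm_node:/tmp/launch.sh
docker exec -it vllm_node bash /tmp/launch.sh
```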
## 8\. Using cluster mode for inference
First, follow the instructions above to start the head container on your first Spark and the node container on your second Spark.
Then, on the first Spark, run vllm like this:
@@ -787,7 +836,7 @@ docker exec -it vllm_node
And execute the vllm command inside.
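For example, a minimal sketch that combines the `docker exec` step above with the serve flags used elsewhere in this README (the exact model and flags depend on your setup):

```bash
# Open a shell in the node container
docker exec -it vllm_node bash

# Inside the container: serve across both Sparks with the Ray backend
vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
```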
## 8\. Fastsafetensors
## 9\. Fastsafetensors
This build includes support for fastsafetensors loading, which significantly improves model loading speeds, especially on DGX Spark, where mmap performance is currently very poor.
[Fastsafetensors](https://github.com/foundation-model-stack/fastsafetensors/) solves this issue by using more efficient multi-threaded loading while avoiding mmap.
@@ -801,11 +850,11 @@ To use this method, simply include `--load-format fastsafetensors` when running
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
```
## 9\. Benchmarking
## 10\. Benchmarking
I recommend using [llama-benchy](https://github.com/eugr/llama-benchy) - a new benchmarking tool that delivers results in the same format as llama-bench from the llama.cpp suite.
## 10\. Downloading Models
## 11\. Downloading Models
The `hf-download.sh` script provides a convenient way to download models from HuggingFace and distribute them across your cluster nodes. It uses the HuggingFace CLI via `uvx` for fast downloads and `rsync` for distribution.
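As a rough illustration only (not the script's actual interface), the manual equivalent looks something like this, with a hypothetical local directory and node address:

```bash
# Sketch of what hf-download.sh automates (paths and host are placeholders)
# 1) Download the model with the HuggingFace CLI, run via uvx
uvx --from huggingface_hub huggingface-cli download openai/gpt-oss-120b \
    --local-dir /models/openai/gpt-oss-120b

# 2) Distribute the downloaded weights to the other cluster node with rsync
rsync -avP /models/openai/gpt-oss-120b/ 192.168.1.2:/models/openai/gpt-oss-120b/
```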