Adding sample profile and profile loader

Raphael Amorim
2026-01-25 21:22:45 -05:00
parent 133ed9cfb9
commit 751bc5a47a
6 changed files with 390 additions and 8 deletions


@@ -16,10 +16,11 @@ While it was primarily developed to support multi-node inference, it works just
- [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
- [5. Configuration Details](#5-configuration-details)
- [6. Mods and Patches](#6-mods-and-patches)
- [7. Using cluster mode for inference](#7-using-cluster-mode-for-inference)
- [8. Fastsafetensors](#8-fastsafetensors)
- [9. Benchmarking](#9-benchmarking)
- [10. Downloading Models](#10-downloading-models)
- [7. Launch Scripts](#7-launch-scripts)
- [8. Using cluster mode for inference](#8-using-cluster-mode-for-inference)
- [9. Fastsafetensors](#9-fastsafetensors)
- [10. Benchmarking](#10-benchmarking)
- [11. Downloading Models](#11-downloading-models)
## DISCLAIMER
@@ -770,7 +771,55 @@ Mods can be used for:
- Customizing vLLM behavior for specific workloads
- Rapid iteration on development without rebuilding the entire image
## 7\. Using cluster mode for inference
## 7\. Launch Scripts
Launch scripts provide a simple way to define reusable model configurations. Instead of passing long command lines, you can create a bash script that is copied into the container and executed directly.
### Basic Usage
```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax
# Use with explicit nodes
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script vllm-openai-gpt-oss-120b.sh
# Combine with mods for models requiring patches
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
```
### Script Format
Launch scripts are simple bash files that run directly inside the container:
```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization
# Set environment variables if needed
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# Run your command
vllm serve openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--enable-auto-tool-choice
```
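To add your own model configuration, drop a script in the same format into the `profiles/` directory and reference it by name. A minimal sketch, assuming a hypothetical profile called `my-model.sh` (the model name and flags are placeholders):

```bash
# Create a hypothetical custom profile (adjust the model and flags to your setup)
cat > profiles/my-model.sh <<'EOF'
#!/bin/bash
# PROFILE: My Model
# DESCRIPTION: example custom profile

vllm serve my-org/my-model \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
EOF

# Launch it by name, just like the bundled profiles
./launch-cluster.sh --launch-script my-model
```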
### Available Launch Scripts
The `profiles/` directory contains ready-to-use launch scripts:
- **example-vllm-minimax.sh** - MiniMax-M2-AWQ with Ray distributed backend
- **vllm-openai-gpt-oss-120b.sh** - OpenAI GPT-OSS 120B with FlashInfer MOE
- **vllm-glm-4.7-nvfp4.sh** - GLM-4.7-NVFP4 (requires the glm4_moe patch mod)
See [profiles/README.md](profiles/README.md) for detailed documentation and more examples.
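Under the hood this is essentially a copy-and-execute step: the script is copied into the container and run there. A rough sketch of the equivalent manual commands, assuming the `vllm_node` container name used later in this README (the actual loader may differ):

```bash
# Sketch only - roughly what the launch-script mechanism amounts to:
# copy the profile into the running container, then execute it there
docker cp profiles/vllm-openai-gpt-oss-120b.sh vllm_node:/tmp/launch.sh
docker exec -it vllm_node bash /tmp/launch.sh
```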
## 8\. Using cluster mode for inference
First, follow the instructions above to start the head container on your first Spark and the node container on your second Spark.
Then, on the first Spark, run vllm like this:
@@ -787,7 +836,7 @@ docker exec -it vllm_node
And execute the vllm command inside.
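For example, a minimal sketch that combines the `docker exec` step above with the serve flags used elsewhere in this README (the exact model and flags depend on your setup):

```bash
# Open a shell in the node container
docker exec -it vllm_node bash

# Inside the container: serve across both Sparks with the Ray backend
vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
```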
## 8\. Fastsafetensors
## 9\. Fastsafetensors
This build includes support for fastsafetensors loading, which significantly improves model loading speeds, especially on DGX Spark, where mmap performance is currently very poor.
[Fastsafetensors](https://github.com/foundation-model-stack/fastsafetensors/) solves this issue by using more efficient multi-threaded loading while avoiding mmap.
@@ -801,11 +850,11 @@ To use this method, simply include `--load-format fastsafetensors` when running
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
```
## 9\. Benchmarking
## 10\. Benchmarking
I recommend using [llama-benchy](https://github.com/eugr/llama-benchy) - a new benchmarking tool that delivers results in the same format as llama-bench from the llama.cpp suite.
## 10\. Downloading Models
## 11\. Downloading Models
The `hf-download.sh` script provides a convenient way to download models from HuggingFace and distribute them across your cluster nodes. It uses the HuggingFace CLI via `uvx` for fast downloads and `rsync` for distribution.
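As a rough illustration only (not the script's actual interface), the manual equivalent looks something like this, with a hypothetical local directory and node address:

```bash
# Sketch of what hf-download.sh automates (paths and host are placeholders)
# 1) Download the model with the HuggingFace CLI, run via uvx
uvx --from huggingface_hub huggingface-cli download openai/gpt-oss-120b \
    --local-dir /models/openai/gpt-oss-120b

# 2) Distribute the downloaded weights to the other cluster node with rsync
rsync -avP /models/openai/gpt-oss-120b/ 192.168.1.2:/models/openai/gpt-oss-120b/
```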