Adding sample profile and profile loader
README.md
@@ -16,10 +16,11 @@ While it was primarily developed to support multi-node inference, it works just
- [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
- [5. Configuration Details](#5-configuration-details)
- [6. Mods and Patches](#6-mods-and-patches)
- [7. Using cluster mode for inference](#7-using-cluster-mode-for-inference)
- [8. Fastsafetensors](#8-fastsafetensors)
- [9. Benchmarking](#9-benchmarking)
- [10. Downloading Models](#10-downloading-models)
- [7. Launch Scripts](#7-launch-scripts)
- [8. Using cluster mode for inference](#8-using-cluster-mode-for-inference)
- [9. Fastsafetensors](#9-fastsafetensors)
- [10. Benchmarking](#10-benchmarking)
- [11. Downloading Models](#11-downloading-models)

## DISCLAIMER
@@ -770,7 +771,55 @@ Mods can be used for:
- Customizing vLLM behavior for specific workloads
- Rapid iteration on development without rebuilding the entire image

## 7\. Using cluster mode for inference
## 7\. Launch Scripts

Launch scripts provide a simple way to define reusable model configurations. Instead of passing long command lines, you can create a bash script that is copied into the container and executed directly.

### Basic Usage

```bash
# Use a launch script by name (looks in profiles/ directory)
./launch-cluster.sh --launch-script example-vllm-minimax

# Use with explicit nodes
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script vllm-openai-gpt-oss-120b.sh

# Combine with mods for models requiring patches
./launch-cluster.sh --launch-script vllm-glm-4.7-nvfp4.sh --apply-mod mods/fix-Salyut1-GLM-4.7-NVFP4
```

### Script Format

Launch scripts are simple bash files that run directly inside the container:

```bash
#!/bin/bash
# PROFILE: OpenAI GPT-OSS 120B
# DESCRIPTION: vLLM serving openai/gpt-oss-120b with FlashInfer MOE optimization

# Set environment variables if needed
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

# Run your command
vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --enable-auto-tool-choice
```

### Available Launch Scripts

The `profiles/` directory contains ready-to-use launch scripts:

- **example-vllm-minimax.sh** - MiniMax-M2-AWQ with Ray distributed backend
- **vllm-openai-gpt-oss-120b.sh** - OpenAI GPT-OSS 120B with FlashInfer MOE
- **vllm-glm-4.7-nvfp4.sh** - GLM-4.7-NVFP4 (requires the glm4_moe patch mod)

See [profiles/README.md](profiles/README.md) for detailed documentation and more examples.
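To add your own configuration, drop a script into `profiles/` and reference it by name. A minimal sketch, where `my-model.sh` is a hypothetical script name and not one shipped in the repository:

```bash
# Hypothetical example - "my-model.sh" is a placeholder, not a bundled profile.
# Save your script into profiles/, then reference it by name
# (the .sh suffix is optional, as in the Basic Usage examples above).
./launch-cluster.sh -n 192.168.1.1,192.168.1.2 --launch-script my-model
```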
## 8\. Using cluster mode for inference

First, follow the instructions above to start the head container on your first Spark and the node container on your second Spark.
Then, on the first Spark, run vllm like this:
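(The full `vllm serve` command sits in an unchanged part of the README that this hunk does not show; as a rough sketch, it follows the same shape as the fastsafetensors example further below, without the `--load-format` flag.)

```bash
# Illustrative sketch only - mirrors the fastsafetensors example below,
# minus --load-format; the exact command is outside this diff hunk.
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b \
    --port 8888 --host 0.0.0.0 \
    --trust_remote_code \
    --swap-space 16 \
    --gpu-memory-utilization 0.7 \
    -tp 2 \
    --distributed-executor-backend ray
```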
@@ -787,7 +836,7 @@ docker exec -it vllm_node
And execute the vllm command inside.

## 8\. Fastsafetensors
## 9\. Fastsafetensors

This build includes support for fastsafetensors loading, which significantly improves loading speeds, especially on DGX Spark, where mmap performance is currently very poor.
[Fastsafetensors](https://github.com/foundation-model-stack/fastsafetensors/) solves this issue by using more efficient multi-threaded loading while avoiding mmap.

@@ -801,11 +850,11 @@ To use this method, simply include `--load-format fastsafetensors` when running
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
```

## 9\. Benchmarking
## 10\. Benchmarking

I recommend using [llama-benchy](https://github.com/eugr/llama-benchy) - a new benchmarking tool that delivers results in the same format as llama-bench from the llama.cpp suite.

## 10\. Downloading Models
## 11\. Downloading Models

The `hf-download.sh` script provides a convenient way to download models from HuggingFace and distribute them across your cluster nodes. It uses the Hugging Face CLI via `uvx` for fast downloads and `rsync` to copy the files to the other nodes.
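The script's exact options aren't documented here, but as a rough sketch of the two steps it automates (the model id, node hostname, and cache path below are illustrative assumptions, not the script's actual interface):

```bash
# Illustration of what hf-download.sh automates - not its actual interface.
# 1. Download a model into the local Hugging Face cache via uvx
uvx --from huggingface_hub huggingface-cli download openai/gpt-oss-120b

# 2. Distribute the cached files to another node with rsync
#    ("spark2" and the cache path are placeholder assumptions)
rsync -avP ~/.cache/huggingface/hub/ spark2:~/.cache/huggingface/hub/
```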