Added Quickstart section to README
This commit is contained in:
68
README.md
68
README.md
@@ -8,6 +8,7 @@ While it was primarily developed to support multi-node inference, it works just
|
|||||||
## Table of Contents
|
## Table of Contents
|
||||||
|
|
||||||
- [DISCLAIMER](#disclaimer)
|
- [DISCLAIMER](#disclaimer)
|
||||||
|
- [QUICK START](#quick-start)
|
||||||
- [CHANGELOG](#changelog)
|
- [CHANGELOG](#changelog)
|
||||||
- [1. Building the Docker Image](#1-building-the-docker-image)
|
- [1. Building the Docker Image](#1-building-the-docker-image)
|
||||||
- [2. Launching the Cluster (Recommended)](#2-launching-the-cluster-recommended)
|
- [2. Launching the Cluster (Recommended)](#2-launching-the-cluster-recommended)
|
||||||
@@ -24,6 +25,73 @@ This repository is not affiliated with NVIDIA or their subsidiaries. This is a c
|
|||||||
|
|
||||||
The Dockerfile builds from the main branch of VLLM, so depending on when you run the build process, it may not be in fully functioning state. You can target a specific vLLM release by setting `--vllm-ref` parameter or use `--use-wheels release` to install pre-built release wheels.
|
The Dockerfile builds from the main branch of VLLM, so depending on when you run the build process, it may not be in fully functioning state. You can target a specific vLLM release by setting `--vllm-ref` parameter or use `--use-wheels release` to install pre-built release wheels.
|
||||||
|
|
||||||
|
## QUICK START
|
||||||
|
|
||||||
|
### Build
|
||||||
|
|
||||||
|
Check out locally. If using DGX Spark cluster, do it on the head node.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git clone https://github.com/eugr/spark-vllm-docker.git
|
||||||
|
cd spark-vllm-docker
|
||||||
|
```
|
||||||
|
|
||||||
|
Build the container.
|
||||||
|
|
||||||
|
**If you have only one DGX Spark:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./build-and-copy.sh --use-wheels
|
||||||
|
```
|
||||||
|
|
||||||
|
**On DGX Spark cluster:**
|
||||||
|
|
||||||
|
Make sure you connect your Sparks together and enable passwordless SSH as described in NVidia's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks).
|
||||||
|
|
||||||
|
Then run the following command that will build and distribute image across the cluster.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./build-and-copy.sh --use-wheels -c
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run
|
||||||
|
|
||||||
|
**On a single node**:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker run \
|
||||||
|
--privileged \
|
||||||
|
--gpus all \
|
||||||
|
-it --rm \
|
||||||
|
--network host --ipc=host \
|
||||||
|
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||||
|
vllm-node \
|
||||||
|
bash -c -i "vllm serve \
|
||||||
|
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
|
||||||
|
--port 8000 --host 0.0.0.0 \
|
||||||
|
--gpu-memory-utilization 0.7 \
|
||||||
|
--load-format fastsafetensors"
|
||||||
|
```
|
||||||
|
|
||||||
|
**On a cluster**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./launch-cluster.sh exec vllm serve \
|
||||||
|
QuantTrio/MiniMax-M2-AWQ \
|
||||||
|
--port 8000 --host 0.0.0.0 \
|
||||||
|
--gpu-memory-utilization 0.7 \
|
||||||
|
-tp 2 \
|
||||||
|
--distributed-executor-backend ray \
|
||||||
|
--max-model-len 128000 \
|
||||||
|
--load-format fastsafetensors \
|
||||||
|
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
|
||||||
|
--reasoning-parser minimax_m2_append_think
|
||||||
|
```
|
||||||
|
|
||||||
|
This will run the model on all available cluster nodes.
|
||||||
|
|
||||||
|
**NOTE:** do not use `--load-format fastsafetensors` if you are loading models that would take >0.8 of available RAM (without KV cache) as it may result in out of memory situation.
|
||||||
|
|
||||||
## CHANGELOG
|
## CHANGELOG
|
||||||
|
|
||||||
**IMPORTANT**
|
**IMPORTANT**
|
||||||
|
|||||||
Reference in New Issue
Block a user