Added Quickstart section to README

Eugene Rakhmatulin
2025-12-21 14:53:05 -08:00
parent 11db634aad
commit 82802f0cad

@@ -8,6 +8,7 @@ While it was primarily developed to support multi-node inference, it works just
## Table of Contents
- [DISCLAIMER](#disclaimer)
- [QUICK START](#quick-start)
- [CHANGELOG](#changelog)
- [1. Building the Docker Image](#1-building-the-docker-image)
- [2. Launching the Cluster (Recommended)](#2-launching-the-cluster-recommended)
@@ -24,6 +25,73 @@ This repository is not affiliated with NVIDIA or their subsidiaries. This is a c
The Dockerfile builds from the main branch of vLLM, so depending on when you run the build process, it may not be in a fully functioning state. You can target a specific vLLM release by setting the `--vllm-ref` parameter, or use `--use-wheels release` to install pre-built release wheels.
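For example, pinning the build to a release looks roughly like this (a minimal sketch, assuming both flags are accepted by `build-and-copy.sh`; the tag is illustrative, substitute the release you actually want):
```bash
# Pin the build to a specific vLLM ref (the tag here is illustrative).
./build-and-copy.sh --vllm-ref v0.10.0

# Or skip source compilation and install pre-built release wheels:
./build-and-copy.sh --use-wheels release
```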
## QUICK START
### Build
Check out the repository locally. If you are using a DGX Spark cluster, do this on the head node.
```bash
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
```
Build the container.
**If you have only one DGX Spark:**
```bash
./build-and-copy.sh --use-wheels
```
**On DGX Spark cluster:**
Make sure your Sparks are connected together and passwordless SSH is enabled, as described in NVIDIA's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks).
Then run the following command, which will build the image and distribute it across the cluster.
```bash
./build-and-copy.sh --use-wheels -c
```
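To sanity-check the cluster setup afterwards, you can verify passwordless SSH and confirm the image landed on the other node (the hostname below is hypothetical; substitute your own):
```bash
# Hypothetical second-node hostname -- replace with yours.
# Should print the remote hostname without prompting for a password.
ssh spark-node2 hostname

# Confirm the vllm-node image is present on the remote node.
ssh spark-node2 docker images vllm-node
```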
### Run
**On a single node**:
```bash
docker run \
--privileged \
--gpus all \
-it --rm \
--network host --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm-node \
bash -c -i "vllm serve \
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
--port 8000 --host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
--load-format fastsafetensors"
```
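Once the server reports it is ready, you can smoke-test it through vLLM's OpenAI-compatible API (a minimal sketch; adjust host and port if you changed them):
```bash
# Send a single chat completion request to the local server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```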
**On a cluster**:
```bash
./launch-cluster.sh exec vllm serve \
QuantTrio/MiniMax-M2-AWQ \
--port 8000 --host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 128000 \
--load-format fastsafetensors \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
```
This will run the model on all available cluster nodes.
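To confirm the server came up, you can query the models endpoint from the head node (or any machine that can reach it):
```bash
# Lists the served model(s); should include QuantTrio/MiniMax-M2-AWQ.
curl http://localhost:8000/v1/models
```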
**NOTE:** Do not use `--load-format fastsafetensors` when loading models whose weights (excluding KV cache) would take more than 80% of available RAM, as it may result in an out-of-memory situation.
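A quick way to gauge this is to compare the on-disk weight size against free memory (a sketch; the cache path follows the standard Hugging Face layout and is illustrative, adjust it to your model):
```bash
# Size of the downloaded weights (path is illustrative).
du -sh ~/.cache/huggingface/hub/models--QuantTrio--MiniMax-M2-AWQ

# Available system memory; keep weights under ~80% of this.
free -h
```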
## CHANGELOG
**IMPORTANT**