Added Quickstart section to README

Eugene Rakhmatulin
2025-12-21 14:53:05 -08:00
parent 11db634aad
commit 82802f0cad

@@ -8,6 +8,7 @@ While it was primarily developed to support multi-node inference, it works just
## Table of Contents
- [DISCLAIMER](#disclaimer)
- [QUICK START](#quick-start)
- [CHANGELOG](#changelog)
- [1. Building the Docker Image](#1-building-the-docker-image)
- [2. Launching the Cluster (Recommended)](#2-launching-the-cluster-recommended)
@@ -24,6 +25,73 @@ This repository is not affiliated with NVIDIA or their subsidiaries. This is a c
The Dockerfile builds from the main branch of vLLM, so depending on when you run the build process, it may not be in a fully functioning state. You can target a specific vLLM release by setting the `--vllm-ref` parameter, or use `--use-wheels release` to install pre-built release wheels.
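For example, pinning the build to a release looks roughly like this (a minimal sketch, assuming both flags are accepted by `build-and-copy.sh`; the tag is illustrative, substitute the release you actually want):
```bash
# Pin the build to a specific vLLM ref (the tag here is illustrative).
./build-and-copy.sh --vllm-ref v0.10.0

# Or skip source compilation and install pre-built release wheels:
./build-and-copy.sh --use-wheels release
```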
## QUICK START
### Build
Check out the repository locally. If you are using a DGX Spark cluster, do this on the head node.
```bash
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
```
Build the container.
**If you have only one DGX Spark:**
```bash
./build-and-copy.sh --use-wheels
```
**On DGX Spark cluster:**
Make sure your Sparks are connected together and passwordless SSH is enabled, as described in NVIDIA's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks).
Then run the following command, which will build the image and distribute it across the cluster.
```bash
./build-and-copy.sh --use-wheels -c
```
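To sanity-check the cluster setup afterwards, you can verify passwordless SSH and confirm the image landed on the other node (the hostname below is hypothetical; substitute your own):
```bash
# Hypothetical second-node hostname -- replace with yours.
# Should print the remote hostname without prompting for a password.
ssh spark-node2 hostname

# Confirm the vllm-node image is present on the remote node.
ssh spark-node2 docker images vllm-node
```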
### Run
**On a single node**:
```bash
docker run \
--privileged \
--gpus all \
-it --rm \
--network host --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm-node \
bash -c -i "vllm serve \
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
--port 8000 --host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
--load-format fastsafetensors"
```
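Once the server reports it is ready, you can smoke-test it through vLLM's OpenAI-compatible API (a minimal sketch; adjust host and port if you changed them):
```bash
# Send a single chat completion request to the local server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```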
**On a cluster**:
```bash
./launch-cluster.sh exec vllm serve \
QuantTrio/MiniMax-M2-AWQ \
--port 8000 --host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 128000 \
--load-format fastsafetensors \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
```
This will run the model on all available cluster nodes.
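To confirm the server came up, you can query the models endpoint from the head node (or any machine that can reach it):
```bash
# Lists the served model(s); should include QuantTrio/MiniMax-M2-AWQ.
curl http://localhost:8000/v1/models
```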
**NOTE:** Do not use `--load-format fastsafetensors` when loading models whose weights (excluding KV cache) would take more than 80% of available RAM, as it may result in an out-of-memory situation.
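A quick way to gauge this is to compare the on-disk weight size against free memory (a sketch; the cache path follows the standard Hugging Face layout and is illustrative, adjust it to your model):
```bash
# Size of the downloaded weights (path is illustrative).
du -sh ~/.cache/huggingface/hub/models--QuantTrio--MiniMax-M2-AWQ

# Available system memory; keep weights under ~80% of this.
free -h
```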
## CHANGELOG
**IMPORTANT**