Added Quickstart section to README

2025-12-21 14:53:05 -08:00
parent 11db634aad
commit 82802f0cad
1 changed files with 68 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -8,6 +8,7 @@ While it was primarily developed to support multi-node inference, it works just
 ## Table of Contents
 - [DISCLAIMER](#disclaimer)
 - [QUICK START](#quick-start)
 - [CHANGELOG](#changelog)
 - [1. Building the Docker Image](#1-building-the-docker-image)
 - [2. Launching the Cluster (Recommended)](#2-launching-the-cluster-recommended)
@@ -24,6 +25,73 @@ This repository is not affiliated with NVIDIA or their subsidiaries. This is a c
 The Dockerfile builds from the main branch of VLLM, so depending on when you run the build process, it may not be in fully functioning state. You can target a specific vLLM release by setting `--vllm-ref` parameter or use `--use-wheels release` to install pre-built release wheels.
 ## QUICK START
 ### Build
 Check out locally. If using DGX Spark cluster, do it on the head node.
 ```bash
 git clone https://github.com/eugr/spark-vllm-docker.git
 cd spark-vllm-docker
 ```
 Build the container.
 **If you have only one DGX Spark:**
 ```bash
 ./build-and-copy.sh --use-wheels
 ```
 **On DGX Spark cluster:**
 Make sure you connect your Sparks together and enable passwordless SSH as described in NVidia's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks). 
 Then run the following command that will build and distribute image across the cluster.
 ```bash
 ./build-and-copy.sh --use-wheels -c
 ```
 ### Run
 **On a single node**:
 ```bash
 docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host --ipc=host \
  -v  ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node \
  bash -c -i "vllm serve \
  QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
  --port 8000 --host 0.0.0.0 \
  --gpu-memory-utilization 0.7 \
  --load-format fastsafetensors"
 ```
 **On a cluster**
 ```bash
 ./launch-cluster.sh exec vllm serve \
  QuantTrio/MiniMax-M2-AWQ \
  --port 8000 --host 0.0.0.0 \
  --gpu-memory-utilization 0.7 \
  -tp 2 \
  --distributed-executor-backend ray \
  --max-model-len 128000 \
  --load-format fastsafetensors \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
 ```
 This will run the model on all available cluster nodes.
 **NOTE:** do not use `--load-format fastsafetensors` if you are loading models that would take >0.8 of available RAM (without KV cache) as it may result in out of memory situation.
 ## CHANGELOG
 **IMPORTANT**