Merge branch 'main' into pr-2

2025-12-19 08:51:52 -08:00
parent 11355677f6 de055928b8
commit 23858a3c7f
5 changed files with 599 additions and 10 deletions
--- a/README.md
+++ b/README.md
@@ -3,6 +3,19 @@

 This repository contains the Docker configuration and startup scripts to run a multi-node vLLM inference cluster using Ray. It supports InfiniBand/RDMA (NCCL) and custom environment configuration for high-performance setups.

+## Table of Contents
+
+- [DISCLAIMER](#disclaimer)
+- [CHANGELOG](#changelog)
+- [1. Building the Docker Image](#1-building-the-docker-image)
+- [2. Launching the Cluster (Recommended)](#2-launching-the-cluster-recommended)
+- [3. Running the Container (Manual)](#3-running-the-container-manual)
+- [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
+- [5. Configuration Details](#5-configuration-details)
+- [6. Using cluster mode for inference](#6-using-cluster-mode-for-inference)
+- [7. Fastsafetensors](#7-fastsafetensors)
+- [8. Benchmarking](#8-benchmarking)
+
 ## DISCLAIMER

 This repository is not affiliated with NVIDIA or their subsidiaries. The content is provided as a reference material only, not intended for production use.
@@ -12,12 +25,18 @@ The Dockerfile builds from the main branch of VLLM, so depending on when you run

 ## CHANGELOG

-### 2025-12-18
+### 2025-12-19

-Updated `build-and-copy.sh` to support copying to multiple hosts.
+Updated `build-and-copy.sh` to support copying to multiple hosts (thanks @eric-humane for the contribution).
 - Added `-c, --copy-to` (accepts space- or comma-separated host lists) and kept `--copy-to-host` as a backward-compatible alias.
 - Added `--copy-parallel` to copy to all hosts concurrently.
- Short `-h` is now used for help.
+- **BREAKING CHANGE**: Short `-h` argument is now used for help. Use `-c` for copy.
+
+### 2025-12-18
+
+- Added `launch-cluster.sh` convenience script for basic cluster management - see details below.
+- Added `-j` / `--build-jobs` argument to `build-and-copy.sh` to control build parallelism.
+- Added `--nccl-debug` option to specify NCCL debug level. Default is none to decrease verbosity.

 ### 2025-12-15

@@ -69,6 +88,7 @@ Using a provided build script is recommended, but if you want to build using `do
 | `CACHEBUST_VLLM` | `1` | Change this to force a fresh git clone and rebuild of vLLM source code. |
 | `TRITON_REF` | `v3.5.1` | Triton commit SHA, branch, or tag to build. |
 | `VLLM_REF` | `main` | vLLM commit SHA, branch, or tag to build. |
+| `BUILD_JOBS` | `16` | Number of parallel build jobs (default: 16). |

 ### Using the Build Script (Recommended)

@@ -154,6 +174,7 @@ Using a different username:
 | `-c, --copy-to <host[,host...] or host host...>` | Host(s) to copy the image to after building (space- or comma-separated list after the flag). |
 | `--copy-to-host` | Alias for `--copy-to` (backwards compatibility). |
 | `--copy-parallel` | Copy to all specified hosts concurrently. |
+| `-j, --build-jobs <jobs>` | Number of parallel build jobs (default: Dockerfile default) |
 | `-u, --user <user>` | Username for SSH connection (default: current user) |
 | `--no-build` | Skip building, only copy existing image (requires `--copy-to`) |
 | `-h, --help` | Show help message |
@@ -172,7 +193,85 @@ docker save vllm-node | ssh your_username@another_spark_hostname_or_ip "docker l

 -----

-## 2\. Running the Container
+## 2\. Launching the Cluster (Recommended)
+
+The `launch-cluster.sh` script simplifies the process of starting the cluster nodes. It handles Docker parameters, network interface detection, and node configuration automatically.
+
+### Basic Usage
+
+**Start the container (auto-detects everything):**
+
+```bash
+./launch-cluster.sh
+```
+
+This will:
+1.  Auto-detect the active InfiniBand and Ethernet interfaces.
+2.  Auto-detect the node IP.
+3.  Launch the container in interactive mode.
+4.  Start the Ray cluster node (head or worker depending on the IP).
+
+Assumptions and limitations:
+
+- It assumes that you've already set up passwordless SSH access on all nodes. If not, follow NVidia's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks). I recommend setting up static IPs in the configuration instead of automatically assigning them every time, but this script should work with automatically assigned addresses too.
+- By default, it assumes that the container image name is `vllm-node`. If it differs, you need to specify it with `-t <name>` parameter.
+- If both ConnectX **physical** ports are utilized, and both have IP addresses, it will use whatever interface it finds first. Use `--eth-if` to override.
+- It will ignore IPs associated with the 2nd "clone" of the physical interface. For instance, the outermost port on Spark has two logical Ethernet interfaces: `enp1s0f1np1` and `enP2p1s0f1np1`. Only `enp1s0f1np1` will be used. To override, use `--eth-if` parameter.
+- It assumes that the same physical interfaces are named the same on all nodes (IOW, enp1s0f1np1 refers to the same physical port on all nodes). If it's not the case, you will have to launch cluster nodes manually or modify the script.
+- It will mount only `~/.cache/huggingface` to the container by default. If you want to mount other caches, you'll have to pass set `VLLM_SPARK_EXTRA_DOCKER_ARGS` environment variable, e.g.: `VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/.cache/vllm:/root/.cache/vllm" ./launch-cluster.sh ...`. Please note that you must use `$HOME` instead of `~` here as the latter won't be expanded if passed through the variable to docker arguments.
+
+
+**Start in daemon mode (background):**
+
+```bash
+./launch-cluster.sh -d
+```
+
+**Stop the container:**
+
+```bash
+./launch-cluster.sh stop
+```
+
+**Check status:**
+
+```bash
+./launch-cluster.sh status
+```
+
+**Execute a command inside the running container:**
+
+```bash
+./launch-cluster.sh exec vllm serve ...
+```
+
+### Auto-Detection
+
+The script attempts to automatically detect:
+*   **Ethernet Interface:** The interface associated with the active InfiniBand device that has an IP address.
+*   **InfiniBand Interface:** The active InfiniBand devices. By default both active RoCE interfaces that correspond to active IB port(s) will be utilized.
+*   **Node Role:** Based on the detected IP address and the list of nodes (defaults to `192.168.177.11` as head and `192.168.177.12` as worker).
+
+### Manual Overrides
+
+You can override the auto-detected values if needed:
+
+```bash
+./launch-cluster.sh --nodes "10.0.0.1,10.0.0.2" --eth-if enp1s0f1np1 --ib-if rocep1s0f1
+```
+
+| Flag | Description |
+| :--- | :--- |
+| `-n, --nodes` | Comma-separated list of node IPs (Head node first). |
+| `-t` | Docker image name (default: `vllm-node`). |
+| `--name` | Container name (default: `vllm_node`). |
+| `--eth-if` | Ethernet interface name. |
+| `--ib-if` | InfiniBand interface name. |
+| `--nccl-debug` | NCCL debug level (e.g., INFO, WARN). Defaults to INFO if flag is present but value is omitted. |
+| `--check-config` | Check configuration and auto-detection without launching. |
+| `-d` | Run in daemon mode (detached). |
+
+## 3\. Running the Container (Manual)

 Ray and NCCL require specific Docker flags to function correctly across multiple nodes (Shared memory, Network namespace, and Hardware access).

@@ -231,7 +330,7 @@ docker run --privileged --gpus all -it --rm \

 -----

-## 3\. Using `run-cluster-node.sh`
+## 4\. Using `run-cluster-node.sh` (Internal)

 The script is used to configure the environment and launch Ray either in head or node mode.

@@ -289,7 +388,7 @@ You need to make sure you allocate IP addresses to them (no need to allocate IP

 -----

-## 4\. Configuration Details
+## 5\. Configuration Details

 ### Environment Persistence

@@ -301,7 +400,7 @@ docker exec -it vllm_node bash

 All environment variables (NCCL, Ray, vLLM config) set by the startup script will be loaded automatically in this new session.

-## 5\. Using cluster mode for inference
+## 6\. Using cluster mode for inference

 First, start follow the instructions above to start the head container on your first Spark, and node container on the second Spark.
 Then, on the first Spark, run vllm like this:
@@ -318,7 +417,7 @@ docker exec -it vllm_node

 And execute vllm command inside.

-## 6\. Fastsafetensors
+## 7\. Fastsafetensors

 This build includes support for fastsafetensors loading which significantly improves loading speeds, especially on DGX Spark where MMAP performance is very poor currently.
 [Fasttensors](https://github.com/foundation-model-stack/fastsafetensors/) solve this issue by using more efficient multi-threaded loading while avoiding mmap.
@@ -332,7 +431,7 @@ To use this method, simply include `--load-format fastsafetensors` when running
 HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
 ```

-## 7\. Benchmarking
+## 8\. Benchmarking

 Follow the guidance in [VLLM Benchmark Suites](https://docs.vllm.ai/en/latest/contributing/benchmarks/) to download benchmarking dataset, and then run a benchmark with a command like this (assuming you are running on head node, otherwise specify `--host` parameter):