Updated Dockerfile to include 2 levels of cache busters, added the cluster script and README.

2025-11-24 15:21:08 -08:00
parent 0ad880e0fe
commit 3ecca4d2b7
3 changed files with 250 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,118 @@
+# vLLM Ray Cluster Node Docker for DGX Spark
+
+This repository contains the Docker configuration and startup scripts to run a multi-node vLLM inference cluster using Ray. It supports InfiniBand/RDMA (NCCL) and custom environment configuration for high-performance setups.
+
+## 1\. Building the Docker Image
+
+The Dockerfile includes specific **Build Arguments** to allow you to selectively rebuild layers (e.g., update the vLLM source code without re-downloading PyTorch).
+
+### Option A: Standard Build (First Time)
+
+```bash
+docker build -t vllm-node .
+```
+
+### Option B: Fast Rebuild (Update vLLM Source Only)
+
+Use this if you want to pull the latest code from GitHub but keep the heavy dependencies (Torch, FlashInfer, system deps) cached.
+
+```bash
+docker build \
+  --build-arg CACHEBUST_VLLM=$(date +%s) \
+  -t vllm-node .
+```
+
+### Option C: Full Rebuild (Update All Dependencies)
+
+Use this to force a re-download of PyTorch, FlashInfer, and system packages.
+
+```bash
+docker build \
+  --build-arg CACHEBUST_DEPS=$(date +%s) \
+  -t vllm-node .
+```
+
+-----
+
+## 2\. Running the Container
+
+Ray and NCCL require specific Docker flags to function correctly across multiple nodes (Shared memory, Network namespace, and Hardware access).
+
+```bash
+docker run -it --rm \
+  --gpus all \
+  --net=host \
+  --ipc=host \
+  --privileged \
+  --name vllm_node \
+  vllm-node bash
+```
+
+**Flags Explained:**
+
+  * `--net=host`: **Required.** Ray needs full control over network ports; port mapping is insufficient for multi-node clusters.
+  * `--ipc=host`: **Required.** Allows shared memory access for PyTorch/NCCL.
+  * `--privileged`: **Required for InfiniBand.** Grants the container access to RDMA devices (`/dev/infiniband`).
+
+-----
+
+## 3\. Using `run-cluster-node.sh`
+
+Once inside the container, use the included script to configure the environment and launch Ray.
+
+### Syntax
+
+```bash
+./run-cluster-node.sh [OPTIONS]
+```
+
+| Flag | Long Flag | Description | Required? |
+| :--- | :--- | :--- | :--- |
+| `-r` | `--role` | Role of the machine: `head` or `node`. | **Yes** |
+| `-h` | `--host-ip` | The IP address of **this** specific machine (IB or Eth IP). | **Yes** |
+| `-e` | `--eth-if` | Ethernet interface name (e.g., `eth0`, `enp3s0`). | **Yes** |
+| `-i` | `--ib-if` | InfiniBand interface name (e.g., `ib0`, `rocep1s0f1`). | **Yes** |
+| `-m` | `--head-ip` | The IP address of the **Head Node**. | Only if role is `node` |
+
+### Example: Starting the Head Node
+
+```bash
+./run-cluster-node.sh \
+  --role head \
+  --host-ip 192.168.177.11 \
+  --eth-if enp1s0f1np1 \
+  --ib-if rocep1s0f1
+```
+
+### Example: Starting a Worker Node
+
+```bash
+./run-cluster-node.sh \
+  --role node \
+  --host-ip 192.168.177.12 \
+  --eth-if enp1s0f1np1 \
+  --ib-if rocep1s0f1 \
+  --head-ip 192.168.177.11
+```
+
+-----
+
+## 4\. Configuration Details
+
+### Environment Persistence
+
+The script automatically appends exported variables to `~/.bashrc`. If you need to open a second terminal into the running container for debugging, simply run:
+
+```bash
+docker exec -it vllm_node bash
+```
+
+All environment variables (NCCL, Ray, vLLM config) set by the startup script will be loaded automatically in this new session.
+
+### Hardware Architecture
+
+**Note:** The Dockerfile defaults to `TORCH_CUDA_ARCH_LIST=12.1a` (NVIDIA GB10). If you are using different hardware, update the `ENV` variable in the Dockerfile before building:
+
+  * **H100:** `9.0`
+  * **A100:** `8.0`
+  * **L40S:** `8.9`