# vLLM Ray Cluster Node Docker for DGX Spark
This repository contains the Docker configuration and startup scripts to run a multi-node vLLM inference cluster using Ray. It supports InfiniBand/RDMA (NCCL) and custom environment configuration for high-performance setups.
## 1. Building the Docker Image

The Dockerfile exposes dedicated build arguments so you can selectively rebuild layers (e.g., update the vLLM source code without re-downloading PyTorch).
### Option A: Standard Build (First Time)

```bash
docker build -t vllm-node .
```
### Option B: Fast Rebuild (Update vLLM Source Only)

Use this if you want to pull the latest code from GitHub but keep the heavy dependencies (PyTorch, FlashInfer, system deps) cached.

```bash
docker build \
  --build-arg CACHEBUST_VLLM=$(date +%s) \
  -t vllm-node .
```
### Option C: Full Rebuild (Update All Dependencies)

Use this to force a re-download of PyTorch, FlashInfer, and system packages.

```bash
docker build \
  --build-arg CACHEBUST_DEPS=$(date +%s) \
  -t vllm-node .
```
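The cache-bust arguments work because changing a build argument's value invalidates the Docker build cache for every instruction after the point where that argument is consumed; passing the current epoch time (`date +%s`) guarantees a fresh value on each invocation. As a quick sanity check (a sketch, assuming the `vllm-node` tag from above), you can inspect which layers were actually rebuilt:

```bash
# Layers created "seconds/minutes ago" were invalidated by the CACHEBUST arg;
# older timestamps mean the layer was served from cache.
docker history --format '{{.CreatedSince}}\t{{.CreatedBy}}' vllm-node | head -n 20
```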
### Copying the image to another Spark node

To avoid extra network overhead, you can copy the image directly to your second Spark node over the ConnectX-7 interface:

```bash
docker save vllm-node | ssh your_username@another_spark_hostname_or_ip "docker load"
```
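If the link is bandwidth-constrained, compressing the stream in transit can help (a sketch; assumes `gzip` is available on both hosts):

```bash
# Compress on the sending side, decompress on the receiving side before loading.
docker save vllm-node | gzip | ssh your_username@another_spark_hostname_or_ip "gunzip | docker load"
```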
## 2. Running the Container

Ray and NCCL require specific Docker flags to function correctly across multiple nodes (shared memory, network namespace, and hardware access).
```bash
docker run -it --rm \
  --gpus all \
  --net=host \
  --ipc=host \
  --privileged \
  --name vllm_node \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node bash
```
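Once inside, it is worth confirming that the container actually sees the GPU and the RDMA devices before starting Ray (a sketch; `ibv_devices` assumes the rdma-core utilities are installed in the image):

```bash
# Should list the GB10 GPU(s); if empty, check the --gpus flag.
nvidia-smi -L

# Should list the RDMA device(s), e.g. rocep1s0f1; requires --privileged.
ibv_devices
```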
Alternatively, to start a cluster node (head or worker) directly, launch the container with the `run-cluster-node.sh` script (see details below):

On the head node:
```bash
docker run --privileged --gpus all -it --rm \
  --ipc=host --shm-size 10.24g \
  --network host \
  --name vllm_node \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node ./run-cluster-node.sh \
  --role head \
  --host-ip 192.168.177.11 \
  --eth-if enp1s0f1np1 \
  --ib-if rocep1s0f1
```
On the worker node:
```bash
docker run --privileged --gpus all -it --rm \
  --ipc=host --shm-size 10.24g \
  --network host \
  --name vllm_node \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node ./run-cluster-node.sh \
  --role node \
  --host-ip 192.168.177.12 \
  --eth-if enp1s0f1np1 \
  --ib-if rocep1s0f1 \
  --head-ip 192.168.177.11
```
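After both containers are up, you can verify from the host that the worker joined the head (a sketch; assumes the container name `vllm_node` from the commands above):

```bash
# Expect two nodes and the aggregate GPU count in the output.
docker exec vllm_node ray status
```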
### Flags Explained

- `--net=host`: Required. Ray needs full control over network ports; port mapping is insufficient for multi-node clusters.
- `--ipc=host`: Required. Allows shared-memory access for PyTorch/NCCL.
- `--privileged`: Required for InfiniBand. Grants the container access to RDMA devices (`/dev/infiniband`).
## 3. Using `run-cluster-node.sh`
Once inside the container, use the included script to configure the environment and launch Ray.
### Syntax

```bash
./run-cluster-node.sh [OPTIONS]
```
| Flag | Long Flag | Description | Required? |
|---|---|---|---|
| `-r` | `--role` | Role of the machine: `head` or `node`. | Yes |
| `-h` | `--host-ip` | The IP address of this specific machine (IB or Eth IP). | Yes |
| `-e` | `--eth-if` | Ethernet interface name (e.g., `eth0`, `enp3s0`). | Yes |
| `-i` | `--ib-if` | InfiniBand interface name (e.g., `ib0`, `rocep1s0f1`). | Yes |
| `-m` | `--head-ip` | The IP address of the head node. | Only if role is `node` |
### Example: Starting the Head Node

```bash
./run-cluster-node.sh \
  --role head \
  --host-ip 192.168.177.11 \
  --eth-if enp1s0f1np1 \
  --ib-if rocep1s0f1
```
### Example: Starting a Worker Node

```bash
./run-cluster-node.sh \
  --role node \
  --host-ip 192.168.177.12 \
  --eth-if enp1s0f1np1 \
  --ib-if rocep1s0f1 \
  --head-ip 192.168.177.11
```
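With the Ray cluster formed, vLLM can be launched from the head node and will place workers across both machines. A minimal sketch (the model name is a placeholder; choose parallel sizes to match your total GPU count):

```bash
# Run on the head node. With one GPU per Spark, tensor-parallel-size 2
# spans the two nodes; the Ray backend handles cross-node placement.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray
```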
## 4. Configuration Details

### Environment Persistence

The script automatically appends exported variables to `~/.bashrc`. If you need to open a second terminal into the running container for debugging, simply run:

```bash
docker exec -it vllm_node bash
```

All environment variables (NCCL, Ray, vLLM config) set by the startup script will be loaded automatically in this new session.
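To confirm the variables made it into the new session (a quick sketch; the exact variable names depend on what the script exports):

```bash
# Inside the exec'd shell: list the NCCL / Ray / vLLM-related settings.
env | grep -E 'NCCL|RAY|VLLM'
```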
### Hardware Architecture

Note: The Dockerfile defaults to `TORCH_CUDA_ARCH_LIST=12.1a` (NVIDIA GB10). If you are using different hardware, update the `ENV` variable in the Dockerfile before building:

- H100: `9.0`
- A100: `8.0`
- L40S: `8.9`
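If you are unsure which value your GPU needs, recent NVIDIA drivers can report the compute capability directly (a sketch; the `compute_cap` query field requires a reasonably new driver):

```bash
# Prints e.g. "9.0" on an H100 -- use this value for TORCH_CUDA_ARCH_LIST.
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```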