Merge branch 'main' into pr-6

This commit is contained in:
Eugene Rakhmatulin
2025-12-18 22:02:13 -08:00
3 changed files with 531 additions and 7 deletions

README.md

@@ -3,6 +3,19 @@
This repository contains the Docker configuration and startup scripts to run a multi-node vLLM inference cluster using Ray. It supports InfiniBand/RDMA (NCCL) and custom environment configuration for high-performance setups.
## Table of Contents
- [DISCLAIMER](#disclaimer)
- [CHANGELOG](#changelog)
- [1. Building the Docker Image](#1-building-the-docker-image)
- [2. Launching the Cluster (Recommended)](#2-launching-the-cluster-recommended)
- [3. Running the Container (Manual)](#3-running-the-container-manual)
- [4. Using `run-cluster-node.sh` (Internal)](#4-using-run-cluster-nodesh-internal)
- [5. Configuration Details](#5-configuration-details)
- [6. Using cluster mode for inference](#6-using-cluster-mode-for-inference)
- [7. Fastsafetensors](#7-fastsafetensors)
- [8. Benchmarking](#8-benchmarking)
## DISCLAIMER
This repository is not affiliated with NVIDIA or its subsidiaries. The content is provided as reference material only and is not intended for production use.
@@ -12,6 +25,10 @@ The Dockerfile builds from the main branch of VLLM, so depending on when you run
## CHANGELOG
### 2025-12-18
Added `launch-cluster.sh` convenience script for basic cluster management - see details below.
### 2025-12-15
Updated `build-and-copy.sh` flags:
@@ -151,7 +168,84 @@ docker save vllm-node | ssh your_username@another_spark_hostname_or_ip "docker l
-----
## 2\. Running the Container
## 2\. Launching the Cluster (Recommended)
The `launch-cluster.sh` script simplifies the process of starting the cluster nodes. It handles Docker parameters, network interface detection, and node configuration automatically.
### Basic Usage
**Start the container (auto-detects everything):**
```bash
./launch-cluster.sh
```
This will:
1. Auto-detect the active InfiniBand and Ethernet interfaces.
2. Auto-detect the node IP.
3. Launch the container in interactive mode.
4. Start the Ray cluster node (head or worker depending on the IP).
Assumptions and limitations:
- It assumes that you've already set up passwordless SSH access on all nodes. If not, follow NVIDIA's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks). I recommend configuring static IPs instead of relying on automatic assignment every time, but this script should work with automatically assigned addresses too.
- By default, it assumes the container image is named `vllm-node`. If it differs, specify it with the `-t <name>` parameter.
- If both ConnectX **physical** ports are utilized, and both have IP addresses, it will use whatever interface it finds first. Use `--eth-if` to override.
- It will ignore IPs associated with the 2nd "clone" of the physical interface. For instance, the outermost port on Spark has two logical Ethernet interfaces: `enp1s0f1np1` and `enP2p1s0f1np1`. Only `enp1s0f1np1` will be used. To override, use `--eth-if` parameter.
- It assumes that physical interfaces are named identically on all nodes (i.e., `enp1s0f1np1` refers to the same physical port on every node). If that's not the case, you will have to launch the cluster nodes manually or modify the script.
- It mounts only `~/.cache/huggingface` into the container by default. To mount other caches, set the `VLLMSPARK_EXTRA_DOCKER_ARGS` environment variable, e.g.: `VLLMSPARK_EXTRA_DOCKER_ARGS="-v $HOME/.cache/vllm:/root/.cache/vllm" ./launch-cluster.sh ...`. Please note that you must use `$HOME` instead of `~` here, as the latter won't be expanded when passed through the variable to the docker arguments.
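To see why `$HOME` is required, here is a small demonstration (the cache paths are illustrative): tilde expansion never happens inside a quoted variable, so a literal `~` would reach docker unexpanded.

```shell
# Tilde is not expanded inside a quoted variable; $HOME is.
GOOD="-v $HOME/.cache/vllm:/root/.cache/vllm"
BAD='-v ~/.cache/vllm:/root/.cache/vllm'
echo "$GOOD"   # contains an absolute path
echo "$BAD"    # still contains the literal "~", which docker would not expand
```
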
**Start in daemon mode (background):**
```bash
./launch-cluster.sh -d
```
**Stop the container:**
```bash
./launch-cluster.sh stop
```
**Check status:**
```bash
./launch-cluster.sh status
```
**Execute a command inside the running container:**
```bash
./launch-cluster.sh exec vllm serve ...
```
### Auto-Detection
The script attempts to automatically detect:
* **Ethernet Interface:** The interface associated with the active InfiniBand device that has an IP address assigned.
* **InfiniBand Interface:** The active InfiniBand device(s). By default, every active RoCE interface that corresponds to an active IB port is used.
* **Node Role:** Based on the detected IP address and the list of nodes (defaults to `192.168.177.11` as head and `192.168.177.12` as worker).
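Interface detection is driven by `ibdev2netdev` output. A minimal sketch of the parsing step, run here against a sample of that output (the device names and port states below are illustrative, not read from real hardware):

```shell
# Keep only ports that are Up; field 1 is the IB device, field 5 the netdev.
sample='rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)'
echo "$sample" | awk '/Up\)/ {print $1 " " $5}'
# prints:
#   rocep1s0f0 enp1s0f0np0
#   rocep1s0f1 enp1s0f1np1
```

The `(Down)` port is filtered out, which is how inactive IB ports are excluded before IP-address checks.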
### Manual Overrides
You can override the auto-detected values if needed:
```bash
./launch-cluster.sh --nodes "10.0.0.1,10.0.0.2" --eth-if enp1s0f1np1 --ib-if rocep1s0f1
```
| Flag | Description |
| :--- | :--- |
| `-n, --nodes` | Comma-separated list of node IPs (Head node first). |
| `-t` | Docker image name (default: `vllm-node`). |
| `--name` | Container name (default: `vllm_node`). |
| `--eth-if` | Ethernet interface name. |
| `--ib-if` | InfiniBand interface name. |
| `--check-config` | Check configuration and auto-detection without launching. |
| `-d` | Run in daemon mode (detached). |
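The node-role decision described above boils down to comparing the local IP against the node list, head first. A minimal sketch using the documented default addresses (hard-coded here for illustration; the script detects them):

```shell
# First entry in the node list is the head; everything else is a worker.
NODES="192.168.177.11,192.168.177.12"
LOCAL_IP="192.168.177.12"
HEAD_IP="${NODES%%,*}"          # strip everything after the first comma
if [ "$LOCAL_IP" = "$HEAD_IP" ]; then ROLE="head"; else ROLE="worker"; fi
echo "$ROLE"                    # prints "worker" for this LOCAL_IP
```
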
## 3\. Running the Container (Manual)
Ray and NCCL require specific Docker flags to function correctly across multiple nodes (shared memory, host network namespace, and hardware access).
@@ -210,7 +304,7 @@ docker run --privileged --gpus all -it --rm \
-----
## 3\. Using `run-cluster-node.sh`
## 4\. Using `run-cluster-node.sh` (Internal)
This script configures the environment and launches Ray in either head or worker (node) mode.
@@ -268,7 +362,7 @@ You need to make sure you allocate IP addresses to them (no need to allocate IP
-----
## 4\. Configuration Details
## 5\. Configuration Details
### Environment Persistence
@@ -280,7 +374,7 @@ docker exec -it vllm_node bash
All environment variables (NCCL, Ray, vLLM config) set by the startup script will be loaded automatically in this new session.
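One common way to achieve this persistence (an assumption about the mechanism for illustration; the authoritative version lives in `run-cluster-node.sh`) is for the startup script to append its exports to a profile file that later login shells source:

```shell
# Hypothetical persistence pattern: write exports to a profile file so a
# later `docker exec -it vllm_node bash` session inherits them on login.
PROFILE="$(mktemp)"             # stands in for e.g. a file under /etc/profile.d
echo 'export NCCL_DEBUG=INFO' >> "$PROFILE"
unset NCCL_DEBUG                # simulate a fresh shell without the variable
. "$PROFILE"                    # a login shell would source this automatically
echo "$NCCL_DEBUG"              # prints "INFO"
rm -f "$PROFILE"
```
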
## 5\. Using cluster mode for inference
## 6\. Using cluster mode for inference
First, follow the instructions above to start the head container on your first Spark and the node container on the second Spark.
Then, on the first Spark, run vllm like this:
@@ -297,7 +391,7 @@ docker exec -it vllm_node
And execute vllm command inside.
## 6\. Fastsafetensors
## 7\. Fastsafetensors
This build includes support for fastsafetensors loading, which significantly improves model loading speed, especially on DGX Spark where mmap performance is currently very poor.
[Fastsafetensors](https://github.com/foundation-model-stack/fastsafetensors/) solves this issue by using more efficient multi-threaded loading while avoiding mmap.
@@ -311,7 +405,7 @@ To use this method, simply include `--load-format fastsafetensors` when running
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
```
## 7\. Benchmarking
## 8\. Benchmarking
Follow the guidance in [VLLM Benchmark Suites](https://docs.vllm.ai/en/latest/contributing/benchmarks/) to download a benchmarking dataset, then run a benchmark with a command like this (assuming you are running on the head node; otherwise specify the `--host` parameter):

launch-cluster.sh (new executable file)

@@ -0,0 +1,429 @@
#!/bin/bash
# Default Configuration
IMAGE_NAME="vllm-node"
DEFAULT_CONTAINER_NAME="vllm_node"
# Pass additional docker args by editing DOCKER_ARGS below, or by setting the VLLMSPARK_EXTRA_DOCKER_ARGS variable
DOCKER_ARGS="-e NCCL_DEBUG=INFO -e NCCL_IGNORE_CPU_AFFINITY=1 -v $HOME/.cache/huggingface:/root/.cache/huggingface"
# Append additional arguments from environment variable
if [[ -n "$VLLMSPARK_EXTRA_DOCKER_ARGS" ]]; then
DOCKER_ARGS="$DOCKER_ARGS $VLLMSPARK_EXTRA_DOCKER_ARGS"
fi
# ETH_IF and IB_IF will be auto-detected if not provided
ETH_IF=""
IB_IF=""
# Initialize variables
NODES_ARG=""
CONTAINER_NAME="$DEFAULT_CONTAINER_NAME"
COMMAND_TO_RUN=""
DAEMON_MODE="false"
CHECK_CONFIG="false"
ACTION="start"
# Function to print usage
usage() {
echo "Usage: $0 [-n <node_ips>] [-t <image_name>] [--name <container_name>] [--eth-if <if_name>] [--ib-if <if_name>] [--check-config] [-d] [action] [command]"
echo " -n, --nodes Comma-separated list of node IPs (Optional, auto-detected if omitted)"
echo " -t Docker image name (Optional, default: $IMAGE_NAME)"
echo " --name Container name (Optional, default: $DEFAULT_CONTAINER_NAME)"
echo " --eth-if Ethernet interface (Optional, auto-detected)"
echo " --ib-if InfiniBand interface (Optional, auto-detected)"
echo " --check-config Check configuration and auto-detection without launching"
echo " -d Daemon mode (only for 'start' action)"
echo " action start | stop | status | exec (Default: start)"
echo " command Command to run (only for 'exec' action)"
exit 1
}
# Parse arguments
while [[ "$#" -gt 0 ]]; do
case $1 in
-n|--nodes) NODES_ARG="$2"; shift ;;
-t) IMAGE_NAME="$2"; shift ;;
--name) CONTAINER_NAME="$2"; shift ;;
--eth-if) ETH_IF="$2"; shift ;;
--ib-if) IB_IF="$2"; shift ;;
--check-config) CHECK_CONFIG="true" ;;
-d) DAEMON_MODE="true" ;;
-h|--help) usage ;;
start|stop|status)
ACTION="$1"
;;
exec)
ACTION="exec"
shift
COMMAND_TO_RUN="$*"
break
;;
*)
# Not a recognized flag or action: treat the remaining arguments
# as a command to execute (implicit 'exec') for backward compatibility.
ACTION="exec"
COMMAND_TO_RUN="$*"
break
break
;;
esac
shift
done
# --- Auto-Detection Logic ---
# Check for required tools if auto-detection is needed
if [[ -z "$ETH_IF" || -z "$IB_IF" || -z "$NODES_ARG" ]]; then
if ! command -v ibdev2netdev &> /dev/null; then
echo "Error: ibdev2netdev not found. Cannot auto-detect interfaces."
exit 1
fi
fi
# 1. Detect Interfaces (ETH_IF and IB_IF)
if [[ -z "$ETH_IF" || -z "$IB_IF" ]]; then
echo "Auto-detecting interfaces..."
# Get all Up interfaces: "rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)"
# We capture: IB_DEV, NET_DEV
mapfile -t IB_NET_PAIRS < <(ibdev2netdev | awk '/Up\)/ {print $1 " " $5}')
if [ ${#IB_NET_PAIRS[@]} -eq 0 ]; then
echo "Error: No active IB interfaces found."
exit 1
fi
DETECTED_IB_IFS=()
CANDIDATE_ETH_IFS=()
for pair in "${IB_NET_PAIRS[@]}"; do
ib_dev=$(echo "$pair" | awk '{print $1}')
net_dev=$(echo "$pair" | awk '{print $2}')
DETECTED_IB_IFS+=("$ib_dev")
# Check if interface has an IP address
if ip addr show "$net_dev" | grep -q "inet "; then
CANDIDATE_ETH_IFS+=("$net_dev")
fi
done
# Set IB_IF if not provided
if [[ -z "$IB_IF" ]]; then
IB_IF=$(IFS=,; echo "${DETECTED_IB_IFS[*]}")
echo " Detected IB_IF: $IB_IF"
fi
# Set ETH_IF if not provided
if [[ -z "$ETH_IF" ]]; then
if [ ${#CANDIDATE_ETH_IFS[@]} -eq 0 ]; then
echo "Error: No active IB-associated interfaces have IP addresses."
exit 1
fi
# Selection logic: prefer the interface without a capital 'P' (the second logical "clone", e.g. enP2p1s0f1np1, contains one)
SELECTED_ETH=""
for iface in "${CANDIDATE_ETH_IFS[@]}"; do
if [[ "$iface" != *"P"* ]]; then
SELECTED_ETH="$iface"
break
fi
done
# Fallback: Use the first one if all have 'P' or none found yet
if [[ -z "$SELECTED_ETH" ]]; then
SELECTED_ETH="${CANDIDATE_ETH_IFS[0]}"
fi
ETH_IF="$SELECTED_ETH"
echo " Detected ETH_IF: $ETH_IF"
fi
fi
# 2. Detect Nodes if not provided
if [[ -z "$NODES_ARG" ]]; then
echo "Auto-detecting nodes..."
if ! command -v nc &> /dev/null; then
echo "Error: nc (netcat) not found. Please install netcat."
exit 1
fi
if ! command -v python3 &> /dev/null; then
echo "Error: python3 not found. Please install python3."
exit 1
fi
# Get CIDR of the selected ETH_IF
CIDR=$(ip -o -f inet addr show "$ETH_IF" | awk '{print $4}' | head -n 1)
if [[ -z "$CIDR" ]]; then
echo "Error: Could not determine IP/CIDR for interface $ETH_IF"
exit 1
fi
LOCAL_IP=${CIDR%/*}
echo " Detected Local IP: $LOCAL_IP ($CIDR)"
DETECTED_IPS=("$LOCAL_IP")
echo " Scanning for SSH peers on $CIDR..."
# Generate list of IPs using python
ALL_IPS=$(python3 -c "import ipaddress, sys; [print(ip) for ip in ipaddress.ip_network(sys.argv[1], strict=False).hosts()]" "$CIDR")
TEMP_IPS_FILE=$(mktemp)
# Scan in parallel
for ip in $ALL_IPS; do
# Skip own IP
if [[ "$ip" == "$LOCAL_IP" ]]; then continue; fi
(
# Check port 22 with 1 second timeout
if nc -z -w 1 "$ip" 22 &>/dev/null; then
echo "$ip" >> "$TEMP_IPS_FILE"
fi
) &
done
# Wait for all background scans to complete
wait
# Read found IPs
if [[ -f "$TEMP_IPS_FILE" ]]; then
while read -r ip; do
DETECTED_IPS+=("$ip")
echo " Found peer: $ip"
done < "$TEMP_IPS_FILE"
rm -f "$TEMP_IPS_FILE"
fi
# Sort IPs
IFS=$'\n' SORTED_IPS=($(sort <<<"${DETECTED_IPS[*]}"))
unset IFS
NODES_ARG=$(IFS=,; echo "${SORTED_IPS[*]}")
echo " Cluster Nodes: $NODES_ARG"
fi
if [[ -z "$NODES_ARG" ]]; then
echo "Error: Nodes argument (-n) is mandatory or could not be auto-detected."
usage
fi
# Split nodes into array
IFS=',' read -r -a ALL_NODES <<< "$NODES_ARG"
# Detect Head IP (Local IP)
HEAD_IP=""
LOCAL_IPS=$(hostname -I)
for ip in "${ALL_NODES[@]}"; do
# Trim whitespace
ip=$(echo "$ip" | xargs)
if [[ " $LOCAL_IPS " == *" $ip "* ]]; then
HEAD_IP="$ip"
break
fi
done
if [[ -z "$HEAD_IP" ]]; then
echo "Error: Could not determine Head IP. This script must be run on one of the nodes specified in -n."
exit 1
fi
# Identify Worker Nodes
WORKER_NODES=()
for ip in "${ALL_NODES[@]}"; do
ip=$(echo "$ip" | xargs)
if [[ "$ip" != "$HEAD_IP" ]]; then
WORKER_NODES+=("$ip")
fi
done
echo "Head Node: $HEAD_IP"
echo "Worker Nodes: ${WORKER_NODES[*]}"
echo "Container Name: $CONTAINER_NAME"
echo "Action: $ACTION"
# Check SSH connectivity to worker nodes
if [[ "$ACTION" == "start" || "$ACTION" == "exec" || "$CHECK_CONFIG" == "true" ]]; then
if [ ${#WORKER_NODES[@]} -gt 0 ]; then
echo "Checking SSH connectivity to worker nodes..."
for worker in "${WORKER_NODES[@]}"; do
if ! ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$worker" true 2>/dev/null; then
echo "Error: Passwordless SSH to $worker failed."
echo " Please ensure SSH keys are configured and the host is reachable."
exit 1
fi
echo " SSH to $worker: OK"
done
fi
fi
if [[ "$CHECK_CONFIG" == "true" ]]; then
echo "Configuration Check Complete."
echo " Image Name: $IMAGE_NAME"
echo " ETH Interface: $ETH_IF"
echo " IB Interface: $IB_IF"
exit 0
fi
# Cleanup Function
cleanup() {
# Remove traps to prevent nested cleanup
trap - EXIT INT TERM HUP
echo ""
echo "Stopping cluster..."
# Stop Head
echo "Stopping head node ($HEAD_IP)..."
docker stop "$CONTAINER_NAME" >/dev/null 2>&1 || true
# Stop Workers
for worker in "${WORKER_NODES[@]}"; do
echo "Stopping worker node ($worker)..."
ssh "$worker" "docker stop $CONTAINER_NAME" >/dev/null 2>&1 || true
done
echo "Cluster stopped."
}
# Handle 'stop' action
if [[ "$ACTION" == "stop" ]]; then
cleanup
exit 0
fi
# Handle 'status' action
if [[ "$ACTION" == "status" ]]; then
echo "Checking status..."
# Check Head
if docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
echo "[HEAD] $HEAD_IP: Container '$CONTAINER_NAME' is RUNNING."
echo "--- Ray Status ---"
docker exec "$CONTAINER_NAME" ray status || echo "Failed to get ray status."
echo "------------------"
else
echo "[HEAD] $HEAD_IP: Container '$CONTAINER_NAME' is NOT running."
fi
# Check Workers
for worker in "${WORKER_NODES[@]}"; do
if ssh "$worker" "docker ps --format '{{.Names}}' | grep -q '^${CONTAINER_NAME}$'"; then
echo "[WORKER] $worker: Container '$CONTAINER_NAME' is RUNNING."
else
echo "[WORKER] $worker: Container '$CONTAINER_NAME' is NOT running."
fi
done
exit 0
fi
# Trap signals
# Only trap if we are NOT in daemon mode, OR if we are in exec mode (always cleanup after exec)
if [[ "$DAEMON_MODE" == "false" ]] || [[ "$ACTION" == "exec" ]]; then
trap cleanup EXIT INT TERM HUP
fi
# Check if cluster is already running
check_cluster_running() {
local running=false
# Check Head
if docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
echo "Warning: Container '$CONTAINER_NAME' is already running on head node ($HEAD_IP)."
running=true
fi
# Check Workers
for worker in "${WORKER_NODES[@]}"; do
if ssh "$worker" "docker ps --format '{{.Names}}' | grep -q '^${CONTAINER_NAME}$'"; then
echo "Warning: Container '$CONTAINER_NAME' is already running on worker node ($worker)."
running=true
fi
done
if [[ "$running" == "true" ]]; then
echo "Cluster containers are already running. Please stop them first or use a different name."
exit 1
fi
}
# Start Cluster Function
start_cluster() {
check_cluster_running
# Start Head Node
echo "Starting Head Node on $HEAD_IP..."
docker run -d --privileged --gpus all --rm \
--ipc=host --network host \
--name "$CONTAINER_NAME" \
$DOCKER_ARGS \
"$IMAGE_NAME" \
./run-cluster-node.sh \
--role head \
--host-ip "$HEAD_IP" \
--eth-if "$ETH_IF" \
--ib-if "$IB_IF"
# Start Worker Nodes
for worker in "${WORKER_NODES[@]}"; do
echo "Starting Worker Node on $worker..."
ssh "$worker" "docker run -d --privileged --gpus all --rm \
--ipc=host --network host \
--name $CONTAINER_NAME \
$DOCKER_ARGS \
$IMAGE_NAME \
./run-cluster-node.sh \
--role node \
--host-ip $worker \
--eth-if $ETH_IF \
--ib-if $IB_IF \
--head-ip $HEAD_IP"
done
wait_for_cluster
}
# Wait for Cluster Readiness
wait_for_cluster() {
echo "Waiting for cluster to be ready..."
local retries=30
local count=0
while [[ $count -lt $retries ]]; do
# Check if ray is responsive
if docker exec "$CONTAINER_NAME" ray status >/dev/null 2>&1; then
echo "Cluster head is responsive."
# Give workers a moment to connect
sleep 5
return 0
fi
sleep 2
((count++))
done
echo "Timeout waiting for cluster to start."
exit 1
}
if [[ "$ACTION" == "exec" ]]; then
start_cluster
echo "Executing command on head node: $COMMAND_TO_RUN"
docker exec -it "$CONTAINER_NAME" bash -i -c "$COMMAND_TO_RUN"
elif [[ "$ACTION" == "start" ]]; then
start_cluster
if [[ "$DAEMON_MODE" == "true" ]]; then
echo "Cluster started in background (Daemon mode)."
else
echo "Cluster started. Tailing logs from head node..."
echo "Press Ctrl+C to stop the cluster."
docker logs -f "$CONTAINER_NAME" &
wait $!
fi
fi


@@ -109,7 +109,8 @@ if [ "${NODE_TYPE}" == "head" ]; then
--node-ip-address "$VLLM_HOST_IP" \
--include-dashboard=True \
--dashboard-host "0.0.0.0" \
--dashboard-port 8265
--dashboard-port 8265 \
--disable-usage-stats
else
echo "Starting Ray WORKER node connecting to $HEAD_IP..."
exec ray start --block \