Merge branch 'main' into pytorch-base

This commit is contained in:
Eugene Rakhmatulin
2026-02-03 12:55:03 -08:00
6 changed files with 398 additions and 4 deletions

@@ -77,6 +77,19 @@ Then run the following command that will build and distribute image across the c
**On a single node**:
**NEW** - `launch-cluster.sh` now supports solo mode, which is now the recommended way to run the container on a single Spark:
```bash
./launch-cluster.sh --solo exec \
vllm serve \
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
--port 8000 --host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
--load-format fastsafetensors
```
**To launch using regular `docker run`**
```bash
docker run \
--privileged \
@@ -146,6 +159,8 @@ For periodic maintenance, I recommend using a filter: `docker builder prune --fi
### 2026-02-02
#### Nemotron Nano mod
Added a mod for nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B support. It supports all Nemotron Nano models/quants using the same reasoning parser.
To use, add `--apply-mod mods/nemotron-nano` to `./launch-cluster.sh` arguments.
@@ -172,6 +187,38 @@ For example, to run nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 on a single node
Please note that NVFP4 models on Spark are not yet fully supported by vLLM (any build), so performance will not be optimal. You will likely see Flashinfer errors during load. This model is also known to crash sometimes.
#### Ability to use launch-cluster.sh with NVIDIA NGC containers
Added a new mod that enables using the cluster launch script with the NVIDIA NGC vLLM container or any other vLLM container that includes Infiniband libraries and Ray support.
To use, add `--apply-mod mods/use-ngc-vllm` to `./launch-cluster.sh` arguments. It can be combined with other mods.
For example, to launch Nemotron Nano in the cluster using the NGC container, you can use the following command:
```bash
./launch-cluster.sh \
-t nvcr.io/nvidia/vllm:26.01-py3 \
--apply-mod mods/use-ngc-vllm \
--apply-mod mods/nemotron-nano \
-e VLLM_USE_FLASHINFER_MOE_FP4=1 \
-e VLLM_FLASHINFER_MOE_BACKEND=throughput \
exec vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--max-model-len 262144 \
--port 8888 --host 0.0.0.0 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.7 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray
```
Make sure you have the container pulled on both nodes!
At this point, the NGC container doesn't seem to perform any better for this model than a custom build.
### 2026-01-29
#### New Parameters for launch-cluster.sh

223
docs/NETWORKING.md Normal file

@@ -0,0 +1,223 @@
# DGX Spark Networking
The following guide is for a two-node cluster, but it is also applicable to larger clusters.
See [this post](https://forums.developer.nvidia.com/t/6x-spark-setup/354399/56) for an example of a 6-8 node Spark cluster.
Please keep in mind that to get the most from vLLM, the number of nodes should be a power of 2, e.g. 2, 4 or 8 nodes.
The guide assumes that the nodes are named `spark` and `spark2`, but you can use any names.
Same with IP addresses: we use the `192.168.177.0/24` subnet with `.11` and `.12` assigned to the two nodes respectively, but you can use any IP addresses as long as they are in the same subnet.
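
As a small illustration of the power-of-2 note above, a launch wrapper could refuse node counts that are not a power of 2. This is only a sketch: `is_power_of_two` is a hypothetical helper name and is not part of `launch-cluster.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of launch-cluster.sh):
# succeeds only if the given node count is a power of two (1, 2, 4, 8, ...).
is_power_of_two() {
    local n="$1"
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    (( n > 0 && (n & (n - 1)) == 0 ))
}

# Example usage:
#   is_power_of_two 6 || echo "6 nodes: consider using 4 or 8 instead"
```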
## DGX Spark ConnectX quirks
DGX Spark has a pretty unique ConnectX setup.
To achieve 200G transfer speed, a ConnectX NIC needs ~x8 PCIe 5.0 lanes.
However, the DGX Spark SoC can't provide more than x4 PCIe lanes per device due to hardware limitations.
So, to achieve 200G on a single cable connection, each physical port shares the same pair of PCIe 5.0 x4 links.
Each PCIe 5 x4 link is represented by two Ethernet and two RoCE interfaces:
```bash
eugr@spark:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
```
In this case, the single cable is plugged into the outermost QSFP port (the right one if looking from the back).
This port has two pairs of "twins" associated with it:
- Ethernet: `enp1s0f1np1` and `enP2p1s0f1np1`
- RoCE/IB: `rocep1s0f1` and `roceP2p1s0f1`
Each of the twins represents one PCIe x4 link and can provide up to 100G link speed.
For vLLM, we need RDMA over RoCE, so Ethernet speed is not that important. That's why we can assign an IP address to only one of the ports - in this case `enp1s0f1np1`.
However, in order to get full bandwidth in NCCL RDMA mode, we need to utilize **both** RoCE twins. This is achieved by setting `NCCL_IB_HCA` to both RoCE interfaces: `export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1`
`./launch-cluster.sh` does this automatically, along with autodiscovery of interfaces, so as long as you set up your Ethernet interface properly, vLLM will utilize both RoCE twins.
Also, note that connecting two Sparks using **both** ports won't give you any noticeable advantage in bandwidth, so a single connection is sufficient.
If you connect 3 Sparks by daisy-chaining them, you will only be able to sustain 100G between each pair of Sparks.
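
If you ever need to set `NCCL_IB_HCA` by hand, the value can be derived from the `ibdev2netdev` output shown above by keeping only the RoCE devices whose link is `Up`. The following is a minimal sketch; `active_roce_hca` is a hypothetical helper name, and this is not necessarily how `./launch-cluster.sh` performs its autodiscovery.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `ibdev2netdev` output on stdin and prints a
# comma-separated list of the RDMA devices whose netdev is reported as (Up).
active_roce_hca() {
    # Field 1 is the RDMA device name; the last field is "(Up)" or "(Down)".
    awk '$NF == "(Up)" { print $1 }' | paste -sd, -
}

# Example usage on a Spark:
#   export NCCL_IB_HCA="$(ibdev2netdev | active_roce_hca)"
```

With the sample output above, this yields `rocep1s0f1,roceP2p1s0f1`.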
## Connecting more than 2 Sparks in the cluster
To connect more than 2 Sparks, you will need a proper switch, for example [MikroTik CRS812-DDQ](https://mikrotik.com/product/crs812_ddq).
Please refer to [this post](https://forums.developer.nvidia.com/t/6x-spark-setup/354399/56) for an example of setting up a 6-8 node Spark cluster.
## Network setup
The following assumes both nodes are connected using the rightmost QSFP port (when looking from the back).
Create `/etc/netplan/40-cx7.yaml` on `spark`:
```yaml
network:
version: 2
ethernets:
enp1s0f1np1:
dhcp4: no
dhcp6: no # Explicitly disable DHCPv6
link-local: [ ipv4 ] # Restrict link-local addresses to IPv4 only
mtu: 9000
addresses: [192.168.177.11/24]
enP2p1s0f1np1:
dhcp4: no
dhcp6: no
link-local: [ ipv4 ]
mtu: 9000
```
Create `/etc/netplan/40-cx7.yaml` on `spark2`:
```yaml
network:
version: 2
ethernets:
enp1s0f1np1:
dhcp4: no
dhcp6: no # Explicitly disable DHCPv6
link-local: [ ipv4 ] # Restrict link-local addresses to IPv4 only
mtu: 9000
addresses: [192.168.177.12/24]
enP2p1s0f1np1:
dhcp4: no
dhcp6: no
link-local: [ ipv4 ]
mtu: 9000
```
Please note that only one interface of the "twin" pair needs an IP address, but the MTU needs to be set on both.
You can also assign a separate address to the other "twin" if you want to utilize the second interface independently, but make sure you assign an IP address from a different subnet.
For instance, in the example above, if you want to assign an IP to `enP2p1s0f1np1` on `spark`, you need to use an address outside `192.168.177.0/24` (for example, one from `192.168.178.0/24`). **DO NOT use the same subnet on both "twins"** - it will confuse autodiscovery and mess up routing.
This will not affect vLLM performance as it will use RDMA over RoCE using both "twins", even if the IP is only set on one.
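
To double-check the "different subnet" rule before editing the netplan file, you can compare the network parts of two IPv4 addresses. The sketch below is illustrative only: `ip_to_int` and `same_subnet` are hypothetical helper names, and `192.168.178.11` is just an example address that is not taken from the configuration above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking whether two IPv4 addresses share a subnet.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    local IFS=.
    read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# same_subnet <ip1> <ip2> <prefix-length>
# Succeeds if both addresses fall into the same network for the given prefix.
same_subnet() {
    local prefix="$3"
    local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    (( ( $(ip_to_int "$1") & mask ) == ( $(ip_to_int "$2") & mask ) ))
}

# Example usage (addresses for the two "twins" on one node):
#   if same_subnet 192.168.177.11 192.168.178.11 24; then
#       echo "Both twins are in the same subnet - pick another address"
#   fi
```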
Then run on each node:
```bash
sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
```
Set up passwordless SSH. On `spark`:
```bash
wget https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks
chmod +x discover-sparks
./discover-sparks
```
To set the MTU temporarily (for testing):
```bash
sudo ip link set dev enp1s0f1np1 mtu 9000
```
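
Since the MTU has to match on both "twins", it can be handy to verify the value that is actually in effect. The sketch below reads it from sysfs; `check_mtu` is a hypothetical helper name, and the optional third argument exists only so the function can be pointed at a different directory than `/sys/class/net`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: check_mtu <interface> <expected-mtu> [sysfs-root]
# Succeeds if the interface's current MTU equals the expected value.
check_mtu() {
    local iface="$1" expected="$2" sysfs="${3:-/sys/class/net}"
    local actual
    # Linux exposes the current MTU of each interface as a plain text file.
    actual="$(cat "$sysfs/$iface/mtu" 2>/dev/null)" || return 2
    [ "$actual" -eq "$expected" ]
}

# Example usage:
#   for ifname in enp1s0f1np1 enP2p1s0f1np1; do
#       check_mtu "$ifname" 9000 || echo "$ifname: MTU is not 9000"
#   done
```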
Benchmark the connection (using the `perftest` package):
```
$ ib_write_bw 192.168.177.12 -d rocep1s0f1 --report_gbits -q 4 -R --force-link IB
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : rocep1s0f1
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x03ec PSN 0xb680ae
local address: LID 0000 QPN 0x03ed PSN 0x808800
local address: LID 0000 QPN 0x03ee PSN 0x5b694a
local address: LID 0000 QPN 0x03ef PSN 0xe2efd1
remote address: LID 0000 QPN 0x03eb PSN 0x75f6ee
remote address: LID 0000 QPN 0x03ec PSN 0x436140
remote address: LID 0000 QPN 0x03ed PSN 0x81698a
remote address: LID 0000 QPN 0x03ee PSN 0x4a8b11
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 20000 111.72 111.71 0.213070
---------------------------------------------------------------------------------------
```
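
If you want to track the result over time or compare it against a threshold, the average bandwidth can be pulled out of the report. This is a minimal sketch that assumes the table layout shown above (a `#bytes` header line followed by one result row); `ib_bw_average` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads ib_write_bw output on stdin and prints the
# "BW average" column of the first result row after the "#bytes" header.
ib_bw_average() {
    awk '/#bytes/ { header = 1; next }
         header && NF >= 4 { print $4; exit }'
}

# Example usage:
#   ib_write_bw 192.168.177.12 -d rocep1s0f1 --report_gbits -q 4 -R --force-link IB | ib_bw_average
```

For the run above, this prints `111.71`, which is the throughput of a single RoCE "twin".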
Latency test:
```bash
ib_write_lat 192.168.177.12 -d rocep1s0f1 --report_gbits -R --force-link IB
```
```
---------------------------------------------------------------------------------------
RDMA_Write Latency Test
Dual-port : OFF Device : rocep1s0f1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: OFF
ibv_wr* API : ON
TX depth : 1
Mtu : 1024[B]
Link type : IB
Max inline data : 220[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x02ee PSN 0xb0c21c
remote address: LID 0000 QPN 0x02ee PSN 0x14568b
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000 1.42 1.93 1.47 1.47 0.00 1.57 1.93
---------------------------------------------------------------------------------------
```
## NCCL Setup
From https://build.nvidia.com/spark/nccl/stacked-sparks
```bash
# Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
cd ~/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
# Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```
Build NCCL Test Suite:
```bash
# Clone and build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git ~/nccl-tests/
cd ~/nccl-tests/
make MPI=1
```
Test on both nodes:
```bash
# Set network interface environment variables (use your active interface)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
# Run the all_gather performance test across both nodes
mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
```

View File

@@ -403,14 +403,15 @@ apply_mod_to_container() {
# 3. Run run.sh
echo " Running patch script on $node_ip..."
local exec_cmd="cd $container_dest && chmod +x run.sh && ./run.sh"
local local_exec_cmd="export WORKSPACE_DIR=\$PWD && cd $container_dest && chmod +x run.sh && ./run.sh"
local remote_exec_cmd="export WORKSPACE_DIR=\\\$PWD && cd $container_dest && chmod +x run.sh && ./run.sh"
local ret_code=0
if [[ "$is_local" == "true" ]]; then
docker exec "$container" bash -c "$exec_cmd"
docker exec "$container" bash -c "$local_exec_cmd"
ret_code=$?
else
$cmd_prefix docker exec "$container" bash -c "\"$exec_cmd\""
$cmd_prefix docker exec "$container" bash -c "\"$remote_exec_cmd\""
ret_code=$?
fi

View File

@@ -1,4 +1,4 @@
#!/bin/bash
set -e
cd $VLLM_BASE_DIR
cd $WORKSPACE_DIR
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py

View File

@@ -0,0 +1,117 @@
#!/bin/bash
set -e
# Define a function to export immediately AND save to .bashrc for future sessions
export_persist() {
local var_name="$1"
local var_value="$2"
# 1. Export for the current running process
export "$var_name"="$var_value"
# 2. Append to .bashrc (idempotent check to avoid duplicate lines)
if ! grep -q "export $var_name=" ~/.bashrc; then
echo "export $var_name=\"$var_value\"" >> ~/.bashrc
else
# Optional: Update the existing line if it exists
sed -i "s|export $var_name=.*|export $var_name=\"$var_value\"|" ~/.bashrc
fi
}
# --- Help Function ---
usage() {
echo "Usage: $0 [OPTIONS]"
echo ""
echo "Required Arguments:"
echo " -r, --role <head|node> : Set the node type"
echo " -h, --host-ip <ip> : IP address of this interface (Host IP)"
echo " -e, --eth-if <name> : Ethernet interface name (e.g., eth0)"
echo " -i, --ib-if <name> : InfiniBand/RDMA interface name"
echo ""
echo "Conditional Arguments:"
echo " -m, --head-ip <ip> : IP of the head node (REQUIRED if role is 'node')"
echo ""
echo "Example:"
echo " $0 --role head --host-ip 192.168.1.10 --eth-if eth0 --ib-if ib0"
echo " $0 --role node --host-ip 192.168.1.20 --eth-if eth0 --ib-if ib0 --head-ip 192.168.1.10"
exit 1
}
# --- Argument Parsing ---
# Initialize variables to empty
NODE_TYPE=""
HOST_IP=""
ETH_IF_NAME=""
IB_IF_NAME=""
HEAD_IP=""
while [[ "$#" -gt 0 ]]; do
case $1 in
-r|--role) NODE_TYPE="$2"; shift ;;
-h|--host-ip) HOST_IP="$2"; shift ;;
-e|--eth-if) ETH_IF_NAME="$2"; shift ;;
-i|--ib-if) IB_IF_NAME="$2"; shift ;;
-m|--head-ip) HEAD_IP="$2"; shift ;;
*) echo "Unknown parameter passed: $1"; usage ;;
esac
shift
done
# --- Validation ---
# 1. Check if all common required arguments are present
if [[ -z "$NODE_TYPE" || -z "$HOST_IP" || -z "$ETH_IF_NAME" || -z "$IB_IF_NAME" ]]; then
echo "Error: Missing required arguments."
usage
fi
# 2. Validate Role
if [[ "$NODE_TYPE" != "head" && "$NODE_TYPE" != "node" ]]; then
echo "Error: --role must be 'head' or 'node'."
exit 1
fi
# 3. Conditional Check for Head IP
if [[ "$NODE_TYPE" == "node" && -z "$HEAD_IP" ]]; then
echo "Error: When --role is 'node', you must provide --head-ip."
exit 1
fi
# --- Environment Configuration ---
echo "Configuring environment for [$NODE_TYPE] at $HOST_IP..."
export_persist VLLM_HOST_IP "$HOST_IP"
export_persist RAY_NODE_IP_ADDRESS "$HOST_IP"
export_persist RAY_OVERRIDE_NODE_IP_ADDRESS "$HOST_IP"
# Network Interface
export_persist MN_IF_NAME "$ETH_IF_NAME"
export_persist UCX_NET_DEVICES "$ETH_IF_NAME"
export_persist NCCL_SOCKET_IFNAME "$ETH_IF_NAME"
# InfiniBand
export_persist NCCL_IB_HCA "$IB_IF_NAME"
export_persist NCCL_IB_DISABLE "0"
# Sockets/Transport
export_persist OMPI_MCA_btl_tcp_if_include "$ETH_IF_NAME"
export_persist GLOO_SOCKET_IFNAME "$ETH_IF_NAME"
export_persist TP_SOCKET_IFNAME "$ETH_IF_NAME"
export_persist RAY_memory_monitor_refresh_ms "0"
# --- Execution ---
if [ "${NODE_TYPE}" == "head" ]; then
echo "Starting Ray HEAD node..."
exec ray start --block --head --port 6379 \
--node-ip-address "$VLLM_HOST_IP" \
--disable-usage-stats
else
echo "Starting Ray WORKER node connecting to $HEAD_IP..."
exec ray start --block \
--address="$HEAD_IP:6379" \
--node-ip-address "$VLLM_HOST_IP"
fi

6
mods/use-ngc-vllm/run.sh Normal file

@@ -0,0 +1,6 @@
#!/bin/bash
set -e
echo "Setting up cluster initialization script..."
cp run-cluster-node.sh $WORKSPACE_DIR/run-cluster-node.sh
chmod +x $WORKSPACE_DIR/run-cluster-node.sh