From f8eb294c580e5f3f6951650e3281003e198b32d7 Mon Sep 17 00:00:00 2001
From: Eugene Rakhmatulin
Date: Tue, 3 Feb 2026 12:54:38 -0800
Subject: [PATCH] Updated README.md and added Networking Guide.

---
 README.md          |  13 +++
 docs/NETWORKING.md | 223 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 236 insertions(+)
 create mode 100644 docs/NETWORKING.md

diff --git a/README.md b/README.md
index e2ca582..0548cfd 100644
--- a/README.md
+++ b/README.md
@@ -77,6 +77,19 @@ Then run the following command that will build and distribute image across the c
 
 **On a single node**:
 
+**NEW** - `launch-cluster.sh` now supports solo mode, which is the recommended way to run the container on a single Spark:
+
+```bash
+./launch-cluster.sh --solo exec \
+  vllm serve \
+  QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
+  --port 8000 --host 0.0.0.0 \
+  --gpu-memory-utilization 0.7 \
+  --load-format fastsafetensors
+```
+
+**To launch using regular `docker run`:**
+
 ```bash
 docker run \
   --privileged \
diff --git a/docs/NETWORKING.md b/docs/NETWORKING.md
new file mode 100644
index 0000000..3821336
--- /dev/null
+++ b/docs/NETWORKING.md
@@ -0,0 +1,223 @@
+# DGX Spark Networking
+
+This guide is written for a two-node cluster, but it also applies to larger clusters.
+
+See [this post](https://forums.developer.nvidia.com/t/6x-spark-setup/354399/56) for an example of a 6-8 node Spark cluster.
+Keep in mind that to get the most out of vLLM, the number of nodes should be a power of 2, e.g. 2, 4, or 8 nodes.
+
+The guide assumes that the nodes are named `spark` and `spark2`, but you can use any names.
+The same goes for IP addresses: we use the `192.168.177.0/24` subnet with `.11` and `.12` assigned to the two nodes, but you can use any IP addresses as long as they are in the same subnet.
+
+## DGX Spark ConnectX quirks
+
+DGX Spark has a pretty unique ConnectX setup.
+
+To achieve 200G transfer speed, the ConnectX NIC needs roughly x8 PCIe 5.0 lanes.
+However, the DGX Spark SoC can't provide more than x4 PCIe lanes per device due to hardware limitations.
+So to achieve 200G over a single cable, each physical port shares the same pair of PCIe 5.0 x4 links, and that pair of links is exposed as two Ethernet and two RoCE interfaces per port:
+
+```bash
+eugr@spark:~$ ibdev2netdev
+rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
+rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
+roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
+roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
+```
+
+In this case, the single cable is plugged into the outermost QSFP port (the right one when looking from the back).
+This port has two pairs of "twins" associated with it:
+
+- Ethernet: `enp1s0f1np1` and `enP2p1s0f1np1`
+- RoCE/IB: `rocep1s0f1` and `roceP2p1s0f1`
+
+Each of the twins represents one PCIe x4 link and can provide up to 100G of link speed.
+
+For vLLM we need RDMA over RoCE, so Ethernet speed is not that important; that is why an IP address is assigned to only one of the ports, in this case `enp1s0f1np1`.
+However, in order to get full bandwidth in NCCL RDMA mode, we need to utilize **both** RoCE twins. This is achieved by setting `NCCL_IB_HCA` to both RoCE interfaces: `export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1`
+
+`./launch-cluster.sh` does this automatically, along with autodiscovery of interfaces, so as long as you set up your Ethernet interface properly, vLLM will utilize both RoCE twins (a manual sketch of this discovery is shown at the end of this section).
+
+Also, note that connecting two Sparks using **both** ports won't give you any noticeable advantage in bandwidth, so a single connection is sufficient.
+If you connect 3 Sparks by daisy-chaining them, you will only be able to sustain 100G between each pair of Sparks.
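+
+For reference, here is a minimal sketch of how that interface discovery could be done by hand: it simply collects every RoCE device whose Ethernet twin is reported as `Up` by `ibdev2netdev`. `launch-cluster.sh` uses its own logic, so treat this only as an illustration:
+
+```bash
+# Hypothetical manual equivalent of the autodiscovery: pick the RoCE devices
+# whose Ethernet twins are Up and hand them to NCCL.
+ACTIVE_HCAS=$(ibdev2netdev | awk '/\(Up\)/ {print $1}' | paste -sd, -)
+export NCCL_IB_HCA="$ACTIVE_HCAS"   # e.g. rocep1s0f1,roceP2p1s0f1
+echo "NCCL_IB_HCA=$NCCL_IB_HCA"
+```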
+
+## Connecting more than 2 Sparks in the cluster
+
+To connect more than 2 Sparks, you will need a proper switch, for example the [MikroTik CRS812-DDQ](https://mikrotik.com/product/crs812_ddq).
+Please refer to [this post](https://forums.developer.nvidia.com/t/6x-spark-setup/354399/56) for an example of setting up a 6-8 node Spark cluster.
+
+## Network setup
+
+This assumes both nodes are connected using the rightmost QSFP port (when looking from the back).
+
+Create `/etc/netplan/40-cx7.yaml` on `spark`:
+```yaml
+network:
+  version: 2
+  ethernets:
+    enp1s0f1np1:
+      dhcp4: no
+      dhcp6: no # Explicitly disable DHCPv6
+      link-local: [ ipv4 ] # Restrict link-local addresses to IPv4 only
+      mtu: 9000
+      addresses: [192.168.177.11/24]
+    enP2p1s0f1np1:
+      dhcp4: no
+      dhcp6: no
+      link-local: [ ipv4 ]
+      mtu: 9000
+```
+
+Create `/etc/netplan/40-cx7.yaml` on `spark2`:
+```yaml
+network:
+  version: 2
+  ethernets:
+    enp1s0f1np1:
+      dhcp4: no
+      dhcp6: no # Explicitly disable DHCPv6
+      link-local: [ ipv4 ] # Restrict link-local addresses to IPv4 only
+      mtu: 9000
+      addresses: [192.168.177.12/24]
+    enP2p1s0f1np1:
+      dhcp4: no
+      dhcp6: no
+      link-local: [ ipv4 ]
+      mtu: 9000
+```
+
+Please note that only one interface of the "twin" pair needs an IP address, but the MTU needs to be set on both.
+You can also assign a separate address to the other "twin" if you want to use the second interface independently, but make sure that address comes from a different subnet.
+
+For instance, in the example above, if you want to assign an IP to `enP2p1s0f1np1` on `spark`, it must come from a subnet other than `192.168.177.0/24`. **DO NOT use the same subnet on both "twins"** - it will confuse autodiscovery and mess up routing.
+
+This will not affect vLLM performance, as it will use RDMA over RoCE on both "twins" even if an IP is set on only one of them.
+
+Then run on each node:
+
+```bash
+sudo chmod 600 /etc/netplan/40-cx7.yaml
+sudo netplan apply
+```
+
+Set up passwordless SSH. On `spark`:
+
+```bash
+wget https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks
+chmod +x discover-sparks
+./discover-sparks
+```
+
+To set the MTU manually for a quick test (the netplan configs above make it persistent):
+
+```bash
+sudo ip link set dev enp1s0f1np1 mtu 9000
+```
+
+Benchmark the connection using the `perftest` package. Start the server side on `spark2` first by running the same `ib_write_bw` command without the target IP address, then run the client on `spark`:
+
+```
+$ ib_write_bw 192.168.177.12 -d rocep1s0f1 --report_gbits -q 4 -R --force-link IB
+---------------------------------------------------------------------------------------
+                    RDMA_Write BW Test
+ Dual-port       : OFF          Device         : rocep1s0f1
+ Number of qps   : 4            Transport type : IB
+ Connection type : RC           Using SRQ      : OFF
+ PCIe relax order: ON
+ ibv_wr* API     : ON
+ TX depth        : 128
+ CQ Moderation   : 1
+ Mtu             : 1024[B]
+ Link type       : IB
+ Max inline data : 0[B]
+ rdma_cm QPs     : ON
+ Data ex. method : rdma_cm
+---------------------------------------------------------------------------------------
+ local address: LID 0000 QPN 0x03ec PSN 0xb680ae
+ local address: LID 0000 QPN 0x03ed PSN 0x808800
+ local address: LID 0000 QPN 0x03ee PSN 0x5b694a
+ local address: LID 0000 QPN 0x03ef PSN 0xe2efd1
+ remote address: LID 0000 QPN 0x03eb PSN 0x75f6ee
+ remote address: LID 0000 QPN 0x03ec PSN 0x436140
+ remote address: LID 0000 QPN 0x03ed PSN 0x81698a
+ remote address: LID 0000 QPN 0x03ee PSN 0x4a8b11
+---------------------------------------------------------------------------------------
+ #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
+ 65536      20000          111.72             111.71               0.213070
+---------------------------------------------------------------------------------------
+```
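+
+If the measured bandwidth is far below ~100 Gb/s per link, double-check the physical link before digging further. A quick sanity check, assuming the interface name from the netplan config above:
+
+```bash
+# Verify negotiated speed and link state for the active port
+sudo ethtool enp1s0f1np1 | grep -E 'Speed|Link detected'
+# The output should include "mtu 9000" and "state UP"
+ip link show enp1s0f1np1
+```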
+
+Latency test:
+
+```bash
+ib_write_lat 192.168.177.12 -d rocep1s0f1 --report_gbits -R --force-link IB
+```
+
+```
+---------------------------------------------------------------------------------------
+                    RDMA_Write Latency Test
+ Dual-port       : OFF          Device         : rocep1s0f1
+ Number of qps   : 1            Transport type : IB
+ Connection type : RC           Using SRQ      : OFF
+ PCIe relax order: OFF
+ ibv_wr* API     : ON
+ TX depth        : 1
+ Mtu             : 1024[B]
+ Link type       : IB
+ Max inline data : 220[B]
+ rdma_cm QPs     : ON
+ Data ex. method : rdma_cm
+---------------------------------------------------------------------------------------
+ local address: LID 0000 QPN 0x02ee PSN 0xb0c21c
+ remote address: LID 0000 QPN 0x02ee PSN 0x14568b
+---------------------------------------------------------------------------------------
+ #bytes #iterations   t_min[usec]   t_max[usec]   t_typical[usec]   t_avg[usec]   t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
+ 2      1000          1.42          1.93          1.47              1.47          0.00            1.57                   1.93
+---------------------------------------------------------------------------------------
+```
+
+## NCCL Setup
+
+Based on https://build.nvidia.com/spark/nccl/stacked-sparks:
+
+```bash
+# Install dependencies and build NCCL
+sudo apt-get update && sudo apt-get install -y libopenmpi-dev
+git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
+cd ~/nccl/
+make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
+
+# Set environment variables
+export CUDA_HOME="/usr/local/cuda"
+export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
+export NCCL_HOME="$HOME/nccl/build/"
+export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
+```
+
+Build the NCCL test suite (in the same shell, so that `MPI_HOME` and `NCCL_HOME` from the previous step are still set):
+
+```bash
+# Clone and build NCCL tests
+git clone https://github.com/NVIDIA/nccl-tests.git ~/nccl-tests/
+cd ~/nccl-tests/
+make MPI=1
+```
+
+On both nodes, set the following environment variables, then run the `mpirun` command from one of them:
+
+```bash
+# Set network interface environment variables (use your active interface)
+export UCX_NET_DEVICES=enp1s0f1np1
+export NCCL_SOCKET_IFNAME=enp1s0f1np1
+export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
+export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
+export NCCL_IB_DISABLE=0
+
+# Run the all_gather performance test across both nodes
+mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
+    --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
+    -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
+    $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
+```
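+
+The same invocation works for the other collectives that `nccl-tests` builds; for example, `all_reduce_perf` exercises the all-reduce pattern used by tensor-parallel inference. A sketch, reusing the paths and IPs from above:
+
+```bash
+# Same setup as the all_gather run above, only the test binary changes
+mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
+    --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
+    -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
+    $HOME/nccl-tests/build/all_reduce_perf -b 16G -e 16G -f 2
+```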