Updated documentation; default image tags in build script

This commit is contained in:
Eugene Rakhmatulin
2026-03-27 16:41:09 -07:00
parent 51d69c5c17
commit c1a6cec074
4 changed files with 451 additions and 23 deletions


@@ -42,13 +42,54 @@ However, in order to get full bandwidth in NCCL RDMA mode, we need to utilize **
Also, note that connecting two Sparks using **both** ports won't give you any noticeable advantage in bandwidth, so a single connection is sufficient.
If you connect 3 Sparks by daisy-chaining them, you will only be able to sustain 100G between each pair of Sparks.
## Connecting 3 Sparks in a mesh cluster without a switch
Three Sparks can be connected together in a cluster without using a separate RoCE switch.
However, all three Sparks need to be on the same wired network using their 10G Ethernet ports (RJ-45, not QSFP). Being on the same wireless network should work too, but it is not recommended and was not tested.
Make sure they are connected the following way: port 0 on one Spark should connect to port 1 on the next Spark (unlike the non-mesh configuration).
Example diagram:
```mermaid
block-beta
columns 1
block:Spark3
columns 2
Title3["Spark 3"]:2
s3p0["Port 0<br>192.168.187.13<br>192.168.188.13"] s3p1["Port 1<br>192.168.197.13<br>192.168.198.13"]
end
space
block:Spark2
columns 2
Title2["Spark 2"]:2
s2p0["Port 0<br>192.168.197.12<br>192.168.198.12"] s2p1["Port 1<br>192.168.177.12<br>192.168.178.12"]
end
space
block:Spark1
columns 2
Title1["Spark 1"]:2
s1p0["Port 0<br>192.168.177.11<br>192.168.178.11"] s1p1["Port 1<br>192.168.187.11<br>192.168.188.11"]
end
s1p0 <--> s2p1
s2p0 <--> s3p1
s3p0 <--> s1p1
```
## Connecting more than 2 Sparks in the cluster using a switch
To connect more than 2 Sparks, you will need a proper switch, for example the [MikroTik CRS812-DDQ](https://mikrotik.com/product/crs812_ddq).
Please refer to [this post](https://forums.developer.nvidia.com/t/6x-spark-setup/354399/56) for an example of setting up a 6-8 node Spark cluster.
## Network setup
### For dual Sparks or multiple Sparks using a QSFP switch
Assuming both Sparks are connected using the rightmost QSFP port (when looking from the back).
Create `/etc/netplan/40-cx7.yaml` on `spark`:
@@ -115,6 +156,122 @@ MTU setting (testing):
sudo ip link set dev enp1s0f1np1 mtu 9000
```
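A quick way to confirm the jumbo-frame setting took effect, assuming the same interface name as in the command above:

```bash
# Should report "mtu 9000" after the set command above
ip link show dev enp1s0f1np1 | grep -o 'mtu [0-9]*'
```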
### For 3-node mesh
A 3-node mesh is configured differently than a dual cluster or a cluster using a QSFP switch.
Assuming your Sparks are connected according to the diagram above:
Create `/etc/netplan/40-cx7.yaml` on `spark1`:
```yaml
network:
version: 2
ethernets:
enp1s0f0np0:
dhcp4: no
dhcp6: no # Explicitly disable DHCPv6
link-local: [] # Restrict link-local addresses to static IPv4 only
mtu: 9000
addresses: [192.168.177.11/24]
enP2p1s0f0np0:
dhcp4: no
dhcp6: no
link-local: []
mtu: 9000
addresses: [192.168.178.11/24]
enp1s0f1np1:
dhcp4: no
dhcp6: no # Explicitly disable DHCPv6
link-local: [] # Restrict link-local addresses to static IPv4 only
mtu: 9000
addresses: [192.168.187.11/24]
enP2p1s0f1np1:
dhcp4: no
dhcp6: no
link-local: []
mtu: 9000
addresses: [192.168.188.11/24]
```
Create `/etc/netplan/40-cx7.yaml` on `spark2`:
```yaml
network:
version: 2
ethernets:
enp1s0f0np0:
dhcp4: no
dhcp6: no # Explicitly disable DHCPv6
link-local: [] # Restrict link-local addresses to static IPv4 only
mtu: 9000
addresses: [192.168.197.12/24]
enP2p1s0f0np0:
dhcp4: no
dhcp6: no
link-local: []
mtu: 9000
addresses: [192.168.198.12/24]
enp1s0f1np1:
dhcp4: no
dhcp6: no # Explicitly disable DHCPv6
link-local: [] # Restrict link-local addresses to static IPv4 only
mtu: 9000
addresses: [192.168.177.12/24]
enP2p1s0f1np1:
dhcp4: no
dhcp6: no
link-local: []
mtu: 9000
addresses: [192.168.178.12/24]
```
Create `/etc/netplan/40-cx7.yaml` on `spark3`:
```yaml
network:
version: 2
ethernets:
enp1s0f0np0:
dhcp4: no
dhcp6: no # Explicitly disable DHCPv6
link-local: [] # Restrict link-local addresses to static IPv4 only
mtu: 9000
addresses: [192.168.187.13/24]
enP2p1s0f0np0:
dhcp4: no
dhcp6: no
link-local: []
mtu: 9000
addresses: [192.168.188.13/24]
enp1s0f1np1:
dhcp4: no
dhcp6: no # Explicitly disable DHCPv6
link-local: [] # Restrict link-local addresses to static IPv4 only
mtu: 9000
addresses: [192.168.197.13/24]
enP2p1s0f1np1:
dhcp4: no
dhcp6: no
link-local: []
mtu: 9000
addresses: [192.168.198.13/24]
```
Then run (on each Spark):
```bash
sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
```
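After applying, it is worth checking that the static addresses came up and that full-size jumbo frames actually pass between directly connected ports. A minimal check from `spark1`, assuming the addresses from the configs above (8972 bytes = 9000-byte MTU minus 28 bytes of IP/ICMP headers):

```bash
# List the static addresses assigned to the CX7 interfaces
ip -br addr show | grep -E '192\.168\.1[789]'

# Ping the directly connected peer (spark2's Port 1 here) with a
# full-size packet and fragmentation disabled (-M do); this fails
# if the 9000 MTU is not in effect on both ends of the link
ping -c 3 -M do -s 8972 192.168.177.12
```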
### Passwordless SSH and benchmarks
Set up passwordless SSH. On the first Spark:
```bash
wget https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks
chmod +x discover-sparks
./discover-sparks
```
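Before running any MPI jobs, you can confirm passwordless SSH actually works; the hostnames below are placeholders for your actual nodes:

```bash
# BatchMode makes ssh fail immediately instead of prompting for a
# password, so any remaining auth problem is visible right away
for host in spark2 spark3; do
  ssh -o BatchMode=yes "$host" hostname
done
```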
**Benchmark the connection (using the `perftest` package):**
Run the receiver on `spark2` node:
@@ -196,7 +353,9 @@ ib_write_lat 192.168.177.12 -d rocep1s0f1 --report_gbits -R --force-link IB
---------------------------------------------------------------------------------------
```
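Besides latency, you can measure bandwidth with `ib_write_bw` from the same `perftest` package. A sketch mirroring the latency test above, assuming the same device and addresses:

```bash
# On spark2 (receiver): wait for an incoming bandwidth test
ib_write_bw -d rocep1s0f1 --report_gbits -R

# On spark1 (sender): connect to spark2's address on the same device
ib_write_bw 192.168.177.12 -d rocep1s0f1 --report_gbits -R
```

Depending on your setup you may need the same `--force-link IB` flag used in the latency test above.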
## NCCL Tests
### Dual Sparks or Sparks via QSFP switch
From https://build.nvidia.com/spark/nccl/stacked-sparks
@@ -240,3 +399,51 @@ mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
```
### 3-node mesh
```bash
# Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b dgxspark-3node-ring https://github.com/zyang-dev/nccl.git ~/nccl/
cd ~/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
# Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```
Build NCCL Test Suite:
```bash
# Clone and build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git ~/nccl-tests/
cd ~/nccl-tests/
make MPI=1
```
Run the test from one node (replace spark1, spark2, spark3 with the actual hostnames or IP addresses on the non-QSFP interface):
```bash
# Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
# For the 3-node mesh we have to use the 10G interface for OOB communication!
export UCX_NET_DEVICES=enP7s7
export NCCL_SOCKET_IFNAME=enP7s7
export OMPI_MCA_btl_tcp_if_include=enP7s7
export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
# Run the all_gather performance test across all three nodes
mpirun -np 3 -H spark1:1,spark2:1,spark3:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -x NCCL_IB_MERGE_NICS=0 -x NCCL_NET_PLUGIN=none -x NCCL_IB_SUBNET_AWARE_ROUTING=1 \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 3
```