Updated documentation; default image tags in build script

This commit is contained in:
Eugene Rakhmatulin
2026-03-27 16:41:09 -07:00
parent 51d69c5c17
commit c1a6cec074
4 changed files with 451 additions and 23 deletions

README.md

@@ -52,8 +52,8 @@ Build the container.
**On DGX Spark cluster:**
Make sure you connect your Sparks together and enable passwordless SSH as described in NVIDIA's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks).
You can also check out our new [Networking Guide](docs/NETWORKING.md).
Make sure you connect your Sparks together and enable passwordless SSH as described in our [Networking Guide](docs/NETWORKING.md). You can also check out NVIDIA's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks), but using our guide is the best way to get started.
**NEW**: the guide now includes instructions on setting up a 3-node Spark mesh!
Then run the following command, which will build and distribute the image across the cluster.
@@ -127,8 +127,6 @@ This will run the model on all available cluster nodes.
**Also:** You can use any vLLM container that has "bash" as its default entrypoint with the launch script. It was tested with NGC vLLM, but it can work with others too. To use such a container in the cluster, you need to specify the `--apply-mod use-ngc-vllm` argument to `./launch-cluster.sh`. However, it's recommended to build the container using this repository for best compatibility and the most up-to-date features.
## CHANGELOG
**IMPORTANT**
You may want to prune your build cache every once in a while, especially if you've been using these container builds since the beginning.
@@ -149,6 +147,97 @@ Don't do it every time you rebuild, because it will slow down compilation times.
For periodic maintenance, I recommend using a filter: `docker builder prune --filter until=72h`
## CHANGELOG
### 2026-03-27
#### Default image tag in `build-and-copy.sh`
`build-and-copy.sh` now automatically sets a sensible default image tag when `-t` is not specified:
- `--tf5` / `--pre-tf` - tag defaults to `vllm-node-tf5`
- `--exp-mxfp4` - tag defaults to `vllm-node-mxfp4`
- in all other cases - tag defaults to `vllm-node` (no change)
An explicit `-t <tag>` always takes precedence.
#### Support for 3-node mesh setups
Added initial support for setups where 3 Sparks are connected in a ring-like mesh without an additional switch.
See the [Networking Guide](docs/NETWORKING.md) for instructions on how to connect and set up networking in such a cluster.
The autodiscover function in both `launch-cluster.sh` and `run-recipe.sh` can now detect mesh setups and configure parameters accordingly.
You can try running a model on all 3 nodes in pipeline-parallel configuration using the following recipe:
```bash
./run-recipe.sh recipes/3x-spark-cluster/qwen3.5-397b-int4-autoround --setup # you can drop --setup on subsequent calls
```
Please note that `--tensor-parallel-size 3` (`-tp 3`) is not supported by any commonly used model, so the only two viable options to utilize all three nodes for a single model are:
- `--pipeline-parallel-size 3` will let you run a model that can't fit on dual Sparks, but without additional speed improvements (total throughput may improve, though).
- `--data-parallel-size 3` (possibly with `--enable-expert-parallel`) will let you run a model that can fit on a single Spark, while allowing for better concurrency.
You can also run models with `--tensor-parallel-size 2` in a 3-node configuration - in this case only the first two nodes (from autodiscovery/`.env` or from the CLI parameters) will be utilized.
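For illustration, a minimal sketch of how the two all-three-node layouts map to launch commands (model names are placeholders, not tested configurations):
```bash
# Hypothetical: pipeline-parallel across all 3 nodes (fits models too large for dual Sparks)
./launch-cluster.sh exec vllm serve <large-model> --pipeline-parallel-size 3

# Hypothetical: data-parallel across all 3 nodes for better concurrency
# (--enable-expert-parallel applies to MoE models)
./launch-cluster.sh exec vllm serve <small-model> --data-parallel-size 3 --enable-expert-parallel
```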
#### GB10 Verification During Node Discovery
Node discovery now confirms each SSH-reachable peer is a GB10 system before adding it to the cluster:
only hosts reporting `NVIDIA GB10` are included. This prevents accidentally adding non-Spark machines that happen to be on the same subnet.
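Conceptually, the check amounts to something like this (a minimal sketch, not the script's exact code; assumes passwordless SSH is already set up and the candidate-list variable name is illustrative):
```bash
# Sketch: include a peer only if it reports a GB10 GPU over SSH
if ssh "$peer" nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | grep -q "NVIDIA GB10"; then
  CLUSTER_CANDIDATES+=("$peer")   # hypothetical variable
fi
```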
#### Separate COPY_HOSTS Discovery
Autodiscover now determines the host list used for image and model distribution separately from `CLUSTER_NODES`:
- **Non-mesh**: `COPY_HOSTS` mirrors `CLUSTER_NODES` (no change in behaviour).
- **Mesh**: scans the direct IB-attached `enp1s0f0np0` and `enp1s0f1np1` interfaces (not the OOB ETH interface), so large file transfers use the faster direct InfiniBand path.
`COPY_HOSTS` is saved to `.env` and respected by `build-and-copy.sh`, `hf-download.sh`, and `run-recipe.py`.
#### Interactive Configuration Save in `autodiscover.sh`
`autodiscover.sh` now handles `.env` creation with a guided interactive flow, replacing the previous logic in `run-recipe.py`:
- Runs automatically when `.env` is absent.
- Asks per-node confirmation for both `CLUSTER_NODES` and `COPY_HOSTS`.
- Skips if `.env` already exists (use `--setup` to force).
`run-recipe.py` no longer contains its own `.env`-save prompt — it delegates entirely to `autodiscover.sh`.
#### `--setup` Flag in `launch-cluster.sh` and `build-and-copy.sh`
Both scripts now accept `--setup` to force a full autodiscovery run and overwrite the existing `.env`:
```bash
./launch-cluster.sh --setup exec vllm serve ...
./build-and-copy.sh --setup -c
```
This is equivalent to the existing `--setup` in `run-recipe.sh`.
#### `--config` Flag
`hf-download.sh`, `build-and-copy.sh` and `launch-cluster.sh` now accept `--config <file>` to load a custom `.env` configuration file. `COPY_HOSTS` from the config is used for model distribution:
```bash
./hf-download.sh QuantTrio/MiniMax-M2-AWQ --config /path/to/cluster.env -c --copy-parallel
```
#### Parallelism-Aware Node Trimming
`launch-cluster.sh` now parses `-tp` / `--tensor-parallel-size`, `-pp` / `--pipeline-parallel-size`, and `-dp` / `--data-parallel-size` from the exec command or launch script and adjusts the active node count accordingly — for both Ray and no-Ray modes.
- If **fewer nodes are needed** than configured, only the required nodes get containers started (excess nodes are left idle).
- If **more nodes are needed** than available, an error is raised before anything starts.
```
Note: Command requires 2 node(s) (tp=2 * pp=1 * dp=1); using 2 of 3 configured node(s).
Error: Command requires 4 nodes (tp=4 * pp=1 * dp=1) but only 3 node(s) are configured.
```
No flags required — the check is automatic whenever parallelism arguments are present in the command.
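For example, with 3 nodes configured in `.env`, a hypothetical 2-way tensor-parallel launch (model name illustrative) would print the first message above and start containers on only two nodes:
```bash
./launch-cluster.sh exec vllm serve <model> -tp 2
```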
### 2026-03-18
#### `--master-port` / `--head-port` Parameter
@@ -591,7 +680,8 @@ See [this post on NVIDIA forums](https://forums.developer.nvidia.com/t/make-glm-
To use the mod, first build the container with Transformers 5 support (`--pre-tf`) flag, e.g.:
```bash
./build-and-copy.sh -t vllm-node-tf5 --pre-tf -c
# Image tag defaults to vllm-node-tf5 when --tf5/--pre-tf is used
./build-and-copy.sh --pre-tf -c
```
Then, to run on a single node:
@@ -641,7 +731,8 @@ It is currently the fastest way to run GPT-OSS on DGX Spark, achieving 60 t/s on
To use this build, first build the container with `--exp-mxfp4` flag. I recommend using a separate label as it is currently not recommended to use this build for models other than gpt-oss:
```bash
./build-and-copy.sh -t vllm-node-mxfp4 --exp-mxfp4 -c
# Image tag defaults to vllm-node-mxfp4 when --exp-mxfp4 is used
./build-and-copy.sh --exp-mxfp4 -c
```
Then, to run on a single Spark:
@@ -885,7 +976,7 @@ Using a different username:
| Flag | Description |
| :--- | :--- |
| `-t, --tag <tag>` | Image tag (default: `vllm-node`) |
| `-t, --tag <tag>` | Image tag (default: `vllm-node`; auto-set to `vllm-node-tf5` with `--tf5`, `vllm-node-mxfp4` with `--exp-mxfp4`) |
| `--gpu-arch <arch>` | Target GPU architecture (default: `12.1a`) |
| `--rebuild-flashinfer` | Skip prebuilt wheel download; force a fresh local FlashInfer build |
| `--rebuild-vllm` | Force rebuild vLLM from source |
@@ -900,9 +991,13 @@ Using a different username:
| `-u, --user <user>` | Username for SSH connection (default: current user) |
| `--full-log` | Enable full Docker build output (`--progress=plain`) |
| `--no-build` | Skip building, only copy existing image (requires `--copy-to`) |
| `--network <name>` | Docker network to use during build (e.g. `host`). |
| `--cleanup` | Remove all cached `.whl` and `*-commit` files from the `wheels/` directory. |
| `--config <file>` | Path to `.env` configuration file (default: `.env` in script directory) |
| `--setup` | Force autodiscovery and save configuration to `.env` (even if `.env` already exists) |
| `-h, --help` | Show help message |
**IMPORTANT**: When copying to another node, make sure you use the Spark IP assigned to its ConnectX-7 interface (enp1s0f1np1), and not the 10G interface (enP7s7)! If you omit the IP address and use `-c` without addresses, it will use autodiscovery to detect a proper IP address.
**IMPORTANT**: When copying to another node manually, use the IP assigned to a ConnectX-7 interface (`enp1s0f*`), not the 10G/wireless interfaces. When using `-c` without addresses, autodiscovery selects the correct interface automatically — in mesh mode it uses the direct IB-attached interfaces (`enp1s0f0np0`, `enp1s0f1np1`) for maximum transfer speed.
### Copying the container to another Spark node (Manual Method)
@@ -971,9 +1066,12 @@ Assumptions and limitations:
### Auto-Detection
The script attempts to automatically detect:
* **Ethernet Interface:** The interface associated with the active InfiniBand device that has an IP address.
* **InfiniBand Interface:** The active InfiniBand devices. By default both active RoCE interfaces that correspond to active IB port(s) will be utilized.
* **Node Role:** Based on the detected IP address and the list of nodes (defaults to `192.168.177.11` as head and `192.168.177.12` as worker).
* **Ethernet Interface (`ETH_IF`):** Determined by the number of active CX7 interfaces:
- **2 active** (standard): the `enp*` interface (no capital P) that has an IP address.
- **4 active** (mesh topology): `enP7s7` (preferred) or `wlP9s9` (wireless, shown with a warning) — the cluster coordination interface is separate from the CX7 ports in this configuration.
* **InfiniBand Interface (`IB_IF`):** All active RoCE devices. In mesh mode this is always `rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1`.
* **Cluster peers:** Discovered by scanning the `ETH_IF` subnet for hosts with SSH access **and** a GB10 GPU (`nvidia-smi --query-gpu=name` must return `NVIDIA GB10`).
* **Copy hosts (`COPY_HOSTS`):** In standard mode, same as cluster peers. In mesh mode, scanned separately on `enp1s0f0np0` and `enp1s0f1np1` subnets so that image/model transfers use the direct InfiniBand path.
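To preview what autodiscovery will see on your node, you can list the CX7 ports manually (a plain `iproute2` one-liner, not part of the scripts):
```bash
# 2 active CX7 interfaces = standard topology, 4 = mesh
ip -br addr show | grep -E '^en(p1s0f|P2p1s0f)'
```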
### Manual Overrides
@@ -1006,6 +1104,10 @@ You can override the auto-detected values if needed:
| `--mem-swap-limit-gb` | Memory+swap limit in GB (default: mem-limit + 10, only with `--non-privileged`). |
| `--pids-limit` | Process limit (default: 4096, only with `--non-privileged`). |
| `--shm-size-gb` | Shared memory size in GB (default: 64, only with `--non-privileged`). |
| `--config <file>` | Path to `.env` configuration file (default: `.env` in script directory). |
| `--setup` | Force autodiscovery and save configuration to `.env` (even if `.env` already exists). |
| `start \| stop \| status \| exec` | Action to perform (default: `start`). Not compatible with `--launch-script`. |
| `command` | Command to execute inside the container (only for `exec` action). |
### Non-Privileged Mode
@@ -1149,6 +1251,61 @@ You need to make sure you allocate IP addresses to them (no need to allocate IP
## 5\. Configuration Details
### Cluster Configuration (`.env` file)
The scripts share a `.env` file (default: `.env` in the repo directory) for persistent cluster configuration. It is created automatically by autodiscovery — run `--discover` (via `run-recipe.sh`) or `--setup` (via `launch-cluster.sh` / `build-and-copy.sh`) on first use.
**Supported variables:**
| Variable | Description |
| :--- | :--- |
| `CLUSTER_NODES` | Comma-separated node IPs used for Ray/vLLM cluster (head node first). |
| `COPY_HOSTS` | Comma-separated node IPs used for image and model distribution. In mesh mode these are the IPs on the direct IB-attached interfaces, which may differ from `CLUSTER_NODES`. |
| `LOCAL_IP` | IP address of the local node. |
| `ETH_IF` | Ethernet interface for cluster coordination (e.g. `enp1s0f1np1` or `enP7s7`). |
| `IB_IF` | Comma-separated RoCE/IB device names (e.g. `rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1`). |
| `CONTAINER_*` | Any variable prefixed with `CONTAINER_` (except `CONTAINER_NAME`) is passed as `-e VAR=VALUE` to the container. Example: `CONTAINER_NCCL_DEBUG=INFO` → `-e NCCL_DEBUG=INFO`. |
**Mesh-mode NCCL variables** (written automatically when mesh topology is detected):
```
CONTAINER_NCCL_NET_PLUGIN=none
CONTAINER_NCCL_IB_SUBNET_AWARE_ROUTING=1
CONTAINER_NCCL_IB_MERGE_NICS=0
```
**Example `.env` for a standard 2-node cluster:**
```
CLUSTER_NODES=192.168.177.11,192.168.177.12
COPY_HOSTS=192.168.177.12
LOCAL_IP=192.168.177.11
ETH_IF=enp1s0f1np1
IB_IF=rocep1s0f1,roceP2p1s0f1
```
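**Example `.env` for a 3-node mesh** (a sketch: the `COPY_HOSTS` and `IB_IF` values follow the mesh diagram in the [Networking Guide](docs/NETWORKING.md) as seen from `spark1`, while the `CLUSTER_NODES` addresses on the 10G `enP7s7` network are illustrative):
```
CLUSTER_NODES=192.168.1.11,192.168.1.12,192.168.1.13
COPY_HOSTS=192.168.177.12,192.168.187.13
LOCAL_IP=192.168.1.11
ETH_IF=enP7s7
IB_IF=rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1
CONTAINER_NCCL_NET_PLUGIN=none
CONTAINER_NCCL_IB_SUBNET_AWARE_ROUTING=1
CONTAINER_NCCL_IB_MERGE_NICS=0
```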
To use a custom config file path, pass `--config /path/to/file.env` to any script.
### Autodiscovery Workflow
On first run, if no `.env` is present, the scripts will automatically trigger autodiscovery. You can also run it explicitly:
```bash
# Via run-recipe.sh
./run-recipe.sh --discover
# Via launch-cluster.sh or build-and-copy.sh (force re-run even if .env exists)
./launch-cluster.sh --setup exec vllm serve ...
./build-and-copy.sh --setup -c
```
Autodiscovery:
1. Detects active CX7 interfaces and determines mesh vs. standard topology.
2. Scans the network for SSH-reachable GB10 peers.
3. In mesh mode, separately discovers `COPY_HOSTS` on direct IB-attached interfaces.
4. Prompts for per-node confirmation for both `CLUSTER_NODES` and `COPY_HOSTS`.
5. Saves the result to `.env`.
### Environment Persistence
The script automatically appends exported variables to `~/.bashrc`. If you need to open a second terminal into the running container for debugging, simply run:
@@ -1322,6 +1479,32 @@ The `hf-download.sh` script provides a convenient way to download models from Hu
./hf-download.sh -c --copy-parallel QuantTrio/MiniMax-M2-AWQ
```
**Use nodes from `.env` (respects `COPY_HOSTS`):**
```bash
./hf-download.sh -c QuantTrio/MiniMax-M2-AWQ
```
When `-c` is given without explicit hosts, the script checks `COPY_HOSTS` in `.env` first, then falls back to autodiscovery. In mesh mode this means transfers go over the direct IB-attached interfaces automatically.
**Use a custom config file:**
```bash
./hf-download.sh --config /path/to/cluster.env -c QuantTrio/MiniMax-M2-AWQ
```
**Available options:**
| Flag | Description |
| :--- | :--- |
| `<model-name>` | HuggingFace model ID (e.g. `QuantTrio/MiniMax-M2-AWQ`). Required. |
| `-c, --copy-to <hosts>` | Host(s) to copy the model to after download (space- or comma-separated). Omit hosts to use `COPY_HOSTS` from `.env` or autodiscovery. |
| `--copy-to-host` | Alias for `--copy-to` (backwards compatibility). |
| `--copy-parallel` | Copy to all hosts concurrently instead of serially. |
| `-u, --user <user>` | SSH username for remote copies (default: current user). |
| `--config <file>` | Path to `.env` configuration file (default: `.env` in script directory). |
| `-h, --help` | Show help message. |
### Hardware Architecture
**Note:** This project targets the `12.1a` architecture (NVIDIA GB10 / DGX Spark). If you are using different hardware, you can use the `--gpu-arch` flag in `./build-and-copy.sh`.
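For example, a hypothetical build for a non-GB10 card (the architecture value depends on your GPU; you can check it with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`):
```bash
# Hypothetical: arch value and tag are illustrative, not a tested configuration
./build-and-copy.sh --gpu-arch 12.0 -t vllm-node-sm120
```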

build-and-copy.sh

@@ -6,6 +6,7 @@ START_TIME=$(date +%s)
# Default values
IMAGE_TAG="vllm-node"
IMAGE_TAG_SET=false
REBUILD_FLASHINFER=false
REBUILD_VLLM=false
COPY_HOSTS=()
@@ -264,7 +265,7 @@ if downloads:
# Help function
usage() {
echo "Usage: $0 [OPTIONS]"
echo " -t, --tag <tag> : Image tag (default: 'vllm-node')"
echo " -t, --tag <tag> : Image tag (default: 'vllm-node', 'vllm-node-tf5' with --tf5, 'vllm-node-mxfp4' with --exp-mxfp4)"
echo " --gpu-arch <arch> : GPU architecture (default: '12.1a')"
echo " --rebuild-flashinfer : Force rebuild of FlashInfer wheels (ignore cached wheels)"
echo " --rebuild-vllm : Force rebuild of vLLM wheels (ignore cached wheels)"
@@ -291,7 +292,7 @@ usage() {
CONFIG_FILE_SET=false
while [[ "$#" -gt 0 ]]; do
case $1 in
-t|--tag) IMAGE_TAG="$2"; shift ;;
-t|--tag) IMAGE_TAG="$2"; IMAGE_TAG_SET=true; shift ;;
--gpu-arch) GPU_ARCH_LIST="$2"; shift ;;
--rebuild-flashinfer) REBUILD_FLASHINFER=true ;;
--rebuild-vllm) REBUILD_VLLM=true ;;
@@ -342,6 +343,15 @@ while [[ "$#" -gt 0 ]]; do
shift
done
# Apply default IMAGE_TAG based on flags if -t was not specified
if [ "$IMAGE_TAG_SET" = false ]; then
if [ "$PRE_TRANSFORMERS" = true ]; then
IMAGE_TAG="vllm-node-tf5"
elif [ "$EXP_MXFP4" = true ]; then
IMAGE_TAG="vllm-node-mxfp4"
fi
fi
# Source autodiscover.sh to load .env file
source "$(dirname "$0")/autodiscover.sh"

docs/NETWORKING.md

@@ -42,13 +42,54 @@ However, in order to get full bandwidth in NCCL RDMA mode, we need to utilize **
Also, note that connecting two Sparks using **both** ports won't give you any noticeable advantage in bandwidth, so a single connection is sufficient.
If you connect 3 Sparks by daisy-chaining them, you will only be able to sustain 100G between each pair of Sparks.
## Connecting more than 2 Sparks in the cluster
## Connecting 3 Sparks in a mesh cluster without a switch
Three Sparks can be connected together in a cluster without using a separate RoCE switch.
However, all three Sparks need to be on the same wired network using their 10G Ethernet ports (RJ-45, not QSFP). Being on the same wireless network should work too, but it's not recommended and was not tested.
You need to make sure they are connected the following way: port 0 on one Spark should connect to port 1 on another Spark (unlike the non-mesh configuration).
Example diagram:
```mermaid
block-beta
columns 1
block:Spark3
columns 2
Title3["Spark 3"]:2
s3p0["Port 0<br>192.168.187.13<br>192.168.188.13"] s3p1["Port 1<br>192.168.197.13<br>192.168.198.13"]
end
space
block:Spark2
columns 2
Title2["Spark 2"]:2
s2p0["Port 0<br>192.168.197.12<br>192.168.198.12"] s2p1["Port 1<br>192.168.177.12<br>192.168.178.13"]
end
space
block:Spark1
columns 2
Title1["Spark 1"]:2
s1p0["Port 0<br>192.168.177.11<br>192.168.178.11"] s1p1["Port 1<br>192.168.187.11<br>192.168.188.11"]
end
s1p0 <--> s2p1
s2p0 <--> s3p1
s3p0 <--> s1p1
```
## Connecting more than 2 Sparks in the cluster using a switch
To connect more than 2 Sparks, you will need a proper switch, for example the [MikroTik CRS812-DDQ](https://mikrotik.com/product/crs812_ddq).
Please refer to [this post](https://forums.developer.nvidia.com/t/6x-spark-setup/354399/56) for an example of setting up a 6-8 node Spark cluster.
## Network setup
### For dual Sparks or multiple Sparks using a QSFP switch
Assuming both are connected using the rightmost QSFP port (when looking from the back).
Create `/etc/netplan/40-cx7.yaml` on `spark`:
@@ -115,6 +156,122 @@ MTU setting (testing):
sudo ip link set dev enp1s0f1np1 mtu 9000
```
### For 3-node mesh
A 3-node mesh is configured differently than dual clusters or clusters using a QSFP switch.
Assuming your Sparks are connected according to the diagram above:
Create `/etc/netplan/40-cx7.yaml` on `spark1`:
```yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.177.11/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.178.11/24]
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.187.11/24]
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.188.11/24]
```
Create `/etc/netplan/40-cx7.yaml` on `spark2`:
```yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.197.12/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.198.12/24]
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.177.12/24]
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.178.12/24]
```
Create `/etc/netplan/40-cx7.yaml` on `spark3`:
```yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.187.13/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.188.13/24]
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.197.13/24]
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.198.13/24]
```
Then run (on each Spark):
```bash
sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
```
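After applying, you can sanity-check each mesh link with a quick ping over the CX7 subnets (addresses from the diagram above, shown from `spark1`):
```bash
ping -c 2 192.168.177.12   # spark2 via the 192.168.177.0/24 link
ping -c 2 192.168.187.13   # spark3 via the 192.168.187.0/24 link
```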
### Passwordless SSH and benchmarks
Set up passwordless SSH. On the first Spark:
```bash
wget https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks
chmod +x discover-sparks
./discover-sparks
```
**Benchmark connection (using the `perftest` package):**
Run the receiver on the `spark2` node:
@@ -196,7 +353,9 @@ ib_write_lat 192.168.177.12 -d rocep1s0f1 --report_gbits -R --force-link IB
---------------------------------------------------------------------------------------
```
## NCCL Setup
## NCCL Tests
### Dual Sparks or Sparks via QSFP switch
From https://build.nvidia.com/spark/nccl/stacked-sparks
@@ -240,3 +399,51 @@ mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
```
### 3-node mesh
```bash
# Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b dgxspark-3node-ring https://github.com/zyang-dev/nccl.git ~/nccl/
cd ~/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
# Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```
Build NCCL Test Suite:
```bash
# Clone and build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git ~/nccl-tests/
cd ~/nccl-tests/
make MPI=1
```
Test on all three nodes (replace `spark1`, `spark2`, `spark3` with the actual hostnames or IP addresses on the non-QSFP interface):
```bash
# Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
# For the 3-node mesh we have to use the 10G interface for OOB communication!
export UCX_NET_DEVICES=enP7s7
export NCCL_SOCKET_IFNAME=enP7s7
export OMPI_MCA_btl_tcp_if_include=enP7s7
export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
# Run the all_gather performance test across all three nodes
mpirun -np 3 -H spark1:1,spark2:1,spark3:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -x NCCL_IB_MERGE_NICS=0 -x NCCL_NET_PLUGIN=none -x NCCL_IB_SUBNET_AWARE_ROUTING=1 \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 3
```


@@ -44,12 +44,16 @@ The recipe runner can automatically discover cluster nodes:
```
When you run `--discover`, it:
1. Scans the network for nodes with SSH access
2. Prompts you to select which nodes to include
3. Saves the configuration to `.env`
1. Detects active CX7 interfaces and determines mesh vs. standard topology.
2. Scans the network for peers that are both SSH-reachable **and** have an NVIDIA GB10 GPU.
3. In mesh mode, separately discovers `COPY_HOSTS` on the direct IB-attached interfaces.
4. Prompts for per-node confirmation for `CLUSTER_NODES` and `COPY_HOSTS`.
5. Saves the full configuration (including mesh NCCL settings if applicable) to `.env`.
Future recipe runs will automatically use nodes from `.env` unless you specify `-n` or `--solo`.
When distributing the container image or model files, the runner uses `COPY_HOSTS` from `.env` (which may differ from `CLUSTER_NODES` in mesh mode) to ensure transfers go over the fastest available path.
## Workflow Modes
### Solo Mode (Single Node)
@@ -169,6 +173,7 @@ Usage: ./run-recipe.sh [OPTIONS] [RECIPE]
Cluster discovery:
--discover Auto-detect cluster nodes and save to .env
--show-env Show current .env configuration
--config FILE Path to .env configuration file (default: .env in repo directory)
Recipe overrides:
--port PORT Override port
@@ -186,10 +191,25 @@ Setup options:
Launch options:
--solo Run in solo mode (single node, no Ray)
--no-ray Multi-node without Ray (PyTorch distributed backend)
-n, --nodes IPS Comma-separated node IPs (first = head)
-d, --daemon Run in daemon mode
-t, --container IMAGE Override container from recipe
--name NAME Override container name
--nccl-debug LEVEL NCCL debug level (VERSION, WARN, INFO, TRACE)
--master-port PORT Cluster coordination port: Ray head port or PyTorch
distributed master port (default: 29501).
Alias: --head-port
--eth-if IFACE Override Ethernet interface
--ib-if IFACE Override InfiniBand interface
-e VAR=VALUE Pass environment variable to container (repeatable)
-j N Number of parallel build jobs
--no-cache-dirs Do not mount ~/.cache/vllm, ~/.cache/flashinfer, ~/.triton
--non-privileged Run container without --privileged
--mem-limit-gb N Memory limit in GB (only with --non-privileged)
--mem-swap-limit-gb N Memory+swap limit in GB (only with --non-privileged)
--pids-limit N Process limit (only with --non-privileged)
--shm-size-gb N Shared memory size in GB (only with --non-privileged)
Extra vLLM arguments:
-- ARGS... Pass additional arguments directly to vLLM
@@ -261,10 +281,18 @@ command: |
```
┌─────────────────────────────────────────────────────────┐
│ autodiscover.sh │
│ - Interface detection (standard / mesh topology) │
│ - GB10 peer verification via SSH │
│ - CLUSTER_NODES and COPY_HOSTS discovery │
│ - Interactive .env save with per-node confirmation │
└──────────────────────────┬──────────────────────────────┘
│ sourced by
┌─────────────────────────────────────────────────────────┐
│ run-recipe.sh / run-recipe.py │
│ - Parses YAML recipe │
│ - Auto-discovers cluster nodes (--discover) │
│ - Loads nodes from .env │
│ - Loads / triggers cluster discovery (--discover) │
│ - Handles --setup (build + download + run) │
│ - Generates launch script from template │
│ - Applies CLI overrides │
@@ -274,7 +302,7 @@ command: |
┌──────────────────────┐ ┌───────────────────────────────┐
│ build-and-copy.sh │ │ hf-download.sh │
│ - Docker build │ │ - HuggingFace model download │
│ - Copy to workers │ │ - Rsync to workers │
│ - Copy to COPY_HOSTS │ │ - Rsync to COPY_HOSTS │
└──────────────────────┘ └───────────────────────────────┘
│ then calls (for run)
@@ -282,7 +310,7 @@ command: |
┌─────────────────────────────────────────────────────────┐
│ launch-cluster.sh │
│ - Cluster orchestration │
│ - Container lifecycle │
│ - Container lifecycle (trimmed to required node count) │
│ - Mod application │
│ - Launch script execution │
└─────────────────────────────────────────────────────────┘