Updated documentation; default image tags in build script

This commit is contained in:
Eugene Rakhmatulin
2026-03-27 16:41:09 -07:00
parent 51d69c5c17
commit c1a6cec074
4 changed files with 451 additions and 23 deletions

README.md

@@ -52,8 +52,8 @@ Build the container.
**On DGX Spark cluster:**
Make sure you connect your Sparks together and enable passwordless SSH as described in NVIDIA's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks).
You can also check out our new [Networking Guide](docs/NETWORKING.md).
Make sure you connect your Sparks together and enable passwordless SSH as described in our [Networking Guide](docs/NETWORKING.md). You can also check out NVIDIA's [Connect Two Sparks Playbook](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks), but using our guide is the best way to get started.
**NEW**: the guide now includes instructions on setting up a 3-node Spark mesh!
Then run the following command, which will build and distribute the image across the cluster.
@@ -127,8 +127,6 @@ This will run the model on all available cluster nodes.
**Also:** You can use any vLLM container that has "bash" as its default entrypoint with the launch script. It was tested with NGC vLLM, but it can work with others too. To use such a container in the cluster, you need to specify the `--apply-mod use-ngc-vllm` argument to `./launch-cluster.sh`. However, it's recommended to build the container using this repository for best compatibility and the most up-to-date features.
## CHANGELOG
**IMPORTANT**
You may want to prune your build cache every once in a while, especially if you've been using these container builds since the beginning.
@@ -149,6 +147,97 @@ Don't do it every time you rebuild, because it will slow down compilation times.
For periodic maintenance, I recommend using a filter: `docker builder prune --filter until=72h`
## CHANGELOG
### 2026-03-27
#### Default image tag in `build-and-copy.sh`
`build-and-copy.sh` now automatically sets a sensible default image tag when `-t` is not specified:
- `--tf5` / `--pre-tf` - tag defaults to `vllm-node-tf5`
- `--exp-mxfp4` - tag defaults to `vllm-node-mxfp4`
- in all other cases - tag defaults to `vllm-node` (no change)
An explicit `-t <tag>` always takes precedence.
#### Support for 3-node mesh setups
Added initial support for setups where 3 Sparks are connected in a ring-like mesh without an additional switch.
See the [Networking Guide](docs/NETWORKING.md) for instructions on how to connect and set up networking in such a cluster.
The autodiscover function in both `launch-cluster.sh` and `run-recipe.sh` can now detect mesh setups and configure parameters accordingly.
You can try running a model on all 3 nodes in pipeline-parallel configuration using the following recipe:
```bash
./run-recipe.sh recipes/3x-spark-cluster/qwen3.5-397b-int4-autoround --setup # you can drop --setup on subsequent calls
```
Please note that `--tensor-parallel-size 3` (`-tp 3`) is not supported by any commonly used model, so the only two viable options to utilize all three nodes for a single model are:
- `--pipeline-parallel-size 3` will let you run a model that can't fit on dual Sparks, but without additional speed improvements (total throughput may improve, though).
- `--data-parallel-size 3` (possibly with `--enable-expert-parallel`) will let you run a model that can fit on a single Spark, while allowing for better concurrency.
You can also run models with `--tensor-parallel-size 2` in a 3-node configuration - in this case only the first two nodes (from autodiscovery/`.env` or from the CLI parameters) will be utilized.
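For illustration, a minimal sketch of how the two all-three-node layouts map to launch commands (model names are placeholders, not tested configurations):
```bash
# Hypothetical: pipeline-parallel across all 3 nodes (fits models too large for dual Sparks)
./launch-cluster.sh exec vllm serve <large-model> --pipeline-parallel-size 3

# Hypothetical: data-parallel across all 3 nodes for better concurrency
# (--enable-expert-parallel applies to MoE models)
./launch-cluster.sh exec vllm serve <small-model> --data-parallel-size 3 --enable-expert-parallel
```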
#### GB10 Verification During Node Discovery
Node discovery now confirms each SSH-reachable peer is a GB10 system before adding it to the cluster:
only hosts reporting `NVIDIA GB10` are included. This prevents accidentally adding non-Spark machines that happen to be on the same subnet.
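Conceptually, the check amounts to something like this (a minimal sketch, not the script's exact code; assumes passwordless SSH is already set up and the candidate-list variable name is illustrative):
```bash
# Sketch: include a peer only if it reports a GB10 GPU over SSH
if ssh "$peer" nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | grep -q "NVIDIA GB10"; then
  CLUSTER_CANDIDATES+=("$peer")   # hypothetical variable
fi
```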
#### Separate COPY_HOSTS Discovery
Autodiscover now determines the host list used for image and model distribution separately from `CLUSTER_NODES`:
- **Non-mesh**: `COPY_HOSTS` mirrors `CLUSTER_NODES` (no change in behaviour).
- **Mesh**: scans the direct IB-attached `enp1s0f0np0` and `enp1s0f1np1` interfaces (not the OOB ETH interface), so large file transfers use the faster direct InfiniBand path.
`COPY_HOSTS` is saved to `.env` and respected by `build-and-copy.sh`, `hf-download.sh`, and `run-recipe.py`.
#### Interactive Configuration Save in `autodiscover.sh`
`autodiscover.sh` now handles `.env` creation with a guided interactive flow, replacing the previous logic in `run-recipe.py`:
- Runs automatically when `.env` is absent.
- Asks per-node confirmation for both `CLUSTER_NODES` and `COPY_HOSTS`.
- Skips if `.env` already exists (use `--setup` to force).
`run-recipe.py` no longer contains its own `.env`-save prompt — it delegates entirely to `autodiscover.sh`.
#### `--setup` Flag in `launch-cluster.sh` and `build-and-copy.sh`
Both scripts now accept `--setup` to force a full autodiscovery run and overwrite the existing `.env`:
```bash
./launch-cluster.sh --setup exec vllm serve ...
./build-and-copy.sh --setup -c
```
This is equivalent to the existing `--setup` in `run-recipe.sh`.
#### `--config` Flag
`hf-download.sh`, `build-and-copy.sh` and `launch-cluster.sh` now accept `--config <file>` to load a custom `.env` configuration file. `COPY_HOSTS` from the config is used for model distribution:
```bash
./hf-download.sh QuantTrio/MiniMax-M2-AWQ --config /path/to/cluster.env -c --copy-parallel
```
#### Parallelism-Aware Node Trimming
`launch-cluster.sh` now parses `-tp` / `--tensor-parallel-size`, `-pp` / `--pipeline-parallel-size`, and `-dp` / `--data-parallel-size` from the exec command or launch script and adjusts the active node count accordingly — for both Ray and no-Ray modes.
- If **fewer nodes are needed** than configured, only the required nodes get containers started (excess nodes are left idle).
- If **more nodes are needed** than available, an error is raised before anything starts.
```
Note: Command requires 2 node(s) (tp=2 * pp=1 * dp=1); using 2 of 3 configured node(s).
Error: Command requires 4 nodes (tp=4 * pp=1 * dp=1) but only 3 node(s) are configured.
```
No flags required — the check is automatic whenever parallelism arguments are present in the command.
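For example, with 3 nodes configured in `.env`, a hypothetical 2-way tensor-parallel launch (model name illustrative) would print the first message above and start containers on only two nodes:
```bash
./launch-cluster.sh exec vllm serve <model> -tp 2
```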
### 2026-03-18
#### `--master-port` / `--head-port` Parameter
@@ -591,7 +680,8 @@ See [this post on NVIDIA forums](https://forums.developer.nvidia.com/t/make-glm-
To use the mod, first build the container with Transformers 5 support (`--pre-tf`) flag, e.g.:
```bash
./build-and-copy.sh -t vllm-node-tf5 --pre-tf -c
# Image tag defaults to vllm-node-tf5 when --tf5/--pre-tf is used
./build-and-copy.sh --pre-tf -c
```
Then, to run on a single node:
@@ -641,7 +731,8 @@ It is currently the fastest way to run GPT-OSS on DGX Spark, achieving 60 t/s on
To use this build, first build the container with `--exp-mxfp4` flag. I recommend using a separate label as it is currently not recommended to use this build for models other than gpt-oss:
```bash
./build-and-copy.sh -t vllm-node-mxfp4 --exp-mxfp4 -c
# Image tag defaults to vllm-node-mxfp4 when --exp-mxfp4 is used
./build-and-copy.sh --exp-mxfp4 -c
```
Then, to run on a single Spark:
@@ -885,7 +976,7 @@ Using a different username:
| Flag | Description |
| :--- | :--- |
| `-t, --tag <tag>` | Image tag (default: `vllm-node`) |
| `-t, --tag <tag>` | Image tag (default: `vllm-node`; auto-set to `vllm-node-tf5` with `--tf5`, `vllm-node-mxfp4` with `--exp-mxfp4`) |
| `--gpu-arch <arch>` | Target GPU architecture (default: `12.1a`) |
| `--rebuild-flashinfer` | Skip prebuilt wheel download; force a fresh local FlashInfer build |
| `--rebuild-vllm` | Force rebuild vLLM from source |
@@ -900,9 +991,13 @@ Using a different username:
| `-u, --user <user>` | Username for SSH connection (default: current user) |
| `--full-log` | Enable full Docker build output (`--progress=plain`) |
| `--no-build` | Skip building, only copy existing image (requires `--copy-to`) |
| `--network <name>` | Docker network to use during build (e.g. `host`). |
| `--cleanup` | Remove all cached `.whl` and `*-commit` files from the `wheels/` directory. |
| `--config <file>` | Path to `.env` configuration file (default: `.env` in script directory) |
| `--setup` | Force autodiscovery and save configuration to `.env` (even if `.env` already exists) |
| `-h, --help` | Show help message |
**IMPORTANT**: When copying to another node, make sure you use the Spark IP assigned to its ConnectX-7 interface (enp1s0f1np1), and not the 10G interface (enP7s7)! If you omit the IP address and use `-c` without addresses, it will use autodiscovery to detect a proper IP address.
**IMPORTANT**: When copying to another node manually, use the IP assigned to a ConnectX-7 interface (`enp1s0f*`), not the 10G/wireless interfaces. When using `-c` without addresses, autodiscovery selects the correct interface automatically — in mesh mode it uses the direct IB-attached interfaces (`enp1s0f0np0`, `enp1s0f1np1`) for maximum transfer speed.
### Copying the container to another Spark node (Manual Method)
@@ -971,9 +1066,12 @@ Assumptions and limitations:
### Auto-Detection
The script attempts to automatically detect:
* **Ethernet Interface:** The interface associated with the active InfiniBand device that has an IP address.
* **InfiniBand Interface:** The active InfiniBand devices. By default both active RoCE interfaces that correspond to active IB port(s) will be utilized.
* **Node Role:** Based on the detected IP address and the list of nodes (defaults to `192.168.177.11` as head and `192.168.177.12` as worker).
* **Ethernet Interface (`ETH_IF`):** Determined by the number of active CX7 interfaces:
- **2 active** (standard): the `enp*` interface (no capital P) that has an IP address.
- **4 active** (mesh topology): `enP7s7` (preferred) or `wlP9s9` (wireless, shown with a warning) — the cluster coordination interface is separate from the CX7 ports in this configuration.
* **InfiniBand Interface (`IB_IF`):** All active RoCE devices. In mesh mode this is always `rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1`.
* **Cluster peers:** Discovered by scanning the `ETH_IF` subnet for hosts with SSH access **and** a GB10 GPU (`nvidia-smi --query-gpu=name` must return `NVIDIA GB10`).
* **Copy hosts (`COPY_HOSTS`):** In standard mode, same as cluster peers. In mesh mode, scanned separately on `enp1s0f0np0` and `enp1s0f1np1` subnets so that image/model transfers use the direct InfiniBand path.
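To preview what autodiscovery will see on your node, you can list the CX7 ports manually (a plain `iproute2` one-liner, not part of the scripts):
```bash
# 2 active CX7 interfaces = standard topology, 4 = mesh
ip -br addr show | grep -E '^en(p1s0f|P2p1s0f)'
```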
### Manual Overrides
@@ -1006,6 +1104,10 @@ You can override the auto-detected values if needed:
| `--mem-swap-limit-gb` | Memory+swap limit in GB (default: mem-limit + 10, only with `--non-privileged`). |
| `--pids-limit` | Process limit (default: 4096, only with `--non-privileged`). |
| `--shm-size-gb` | Shared memory size in GB (default: 64, only with `--non-privileged`). |
| `--config <file>` | Path to `.env` configuration file (default: `.env` in script directory). |
| `--setup` | Force autodiscovery and save configuration to `.env` (even if `.env` already exists). |
| `start \| stop \| status \| exec` | Action to perform (default: `start`). Not compatible with `--launch-script`. |
| `command` | Command to execute inside the container (only for `exec` action). |
### Non-Privileged Mode
@@ -1149,6 +1251,61 @@ You need to make sure you allocate IP addresses to them (no need to allocate IP
## 5\. Configuration Details
### Cluster Configuration (`.env` file)
The scripts share a `.env` file (default: `.env` in the repo directory) for persistent cluster configuration. It is created automatically by autodiscovery — run `--discover` (via `run-recipe.sh`) or `--setup` (via `launch-cluster.sh` / `build-and-copy.sh`) on first use.
**Supported variables:**
| Variable | Description |
| :--- | :--- |
| `CLUSTER_NODES` | Comma-separated node IPs used for Ray/vLLM cluster (head node first). |
| `COPY_HOSTS` | Comma-separated node IPs used for image and model distribution. In mesh mode these are the IPs on the direct IB-attached interfaces, which may differ from `CLUSTER_NODES`. |
| `LOCAL_IP` | IP address of the local node. |
| `ETH_IF` | Ethernet interface for cluster coordination (e.g. `enp1s0f1np1` or `enP7s7`). |
| `IB_IF` | Comma-separated RoCE/IB device names (e.g. `rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1`). |
| `CONTAINER_*` | Any variable prefixed with `CONTAINER_` (except `CONTAINER_NAME`) is passed as `-e VAR=VALUE` to the container. Example: `CONTAINER_NCCL_DEBUG=INFO` → `-e NCCL_DEBUG=INFO`. |
**Mesh-mode NCCL variables** (written automatically when mesh topology is detected):
```
CONTAINER_NCCL_NET_PLUGIN=none
CONTAINER_NCCL_IB_SUBNET_AWARE_ROUTING=1
CONTAINER_NCCL_IB_MERGE_NICS=0
```
**Example `.env` for a standard 2-node cluster:**
```
CLUSTER_NODES=192.168.177.11,192.168.177.12
COPY_HOSTS=192.168.177.12
LOCAL_IP=192.168.177.11
ETH_IF=enp1s0f1np1
IB_IF=rocep1s0f1,roceP2p1s0f1
```
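**Example `.env` for a 3-node mesh** (a sketch: the `COPY_HOSTS` and `IB_IF` values follow the mesh diagram in the [Networking Guide](docs/NETWORKING.md) as seen from `spark1`, while the `CLUSTER_NODES` addresses on the 10G `enP7s7` network are illustrative):
```
CLUSTER_NODES=192.168.1.11,192.168.1.12,192.168.1.13
COPY_HOSTS=192.168.177.12,192.168.187.13
LOCAL_IP=192.168.1.11
ETH_IF=enP7s7
IB_IF=rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1
CONTAINER_NCCL_NET_PLUGIN=none
CONTAINER_NCCL_IB_SUBNET_AWARE_ROUTING=1
CONTAINER_NCCL_IB_MERGE_NICS=0
```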
To use a custom config file path, pass `--config /path/to/file.env` to any script.
### Autodiscovery Workflow
On first run, if no `.env` is present, the scripts will automatically trigger autodiscovery. You can also run it explicitly:
```bash
# Via run-recipe.sh
./run-recipe.sh --discover
# Via launch-cluster.sh or build-and-copy.sh (force re-run even if .env exists)
./launch-cluster.sh --setup exec vllm serve ...
./build-and-copy.sh --setup -c
```
Autodiscovery:
1. Detects active CX7 interfaces and determines mesh vs. standard topology.
2. Scans the network for SSH-reachable GB10 peers.
3. In mesh mode, separately discovers `COPY_HOSTS` on direct IB-attached interfaces.
4. Prompts for per-node confirmation for both `CLUSTER_NODES` and `COPY_HOSTS`.
5. Saves the result to `.env`.
### Environment Persistence
The script automatically appends exported variables to `~/.bashrc`. If you need to open a second terminal into the running container for debugging, simply run:
@@ -1322,6 +1479,32 @@ The `hf-download.sh` script provides a convenient way to download models from Hu
./hf-download.sh -c --copy-parallel QuantTrio/MiniMax-M2-AWQ
```
**Use nodes from `.env` (respects `COPY_HOSTS`):**
```bash
./hf-download.sh -c QuantTrio/MiniMax-M2-AWQ
```
When `-c` is given without explicit hosts, the script checks `COPY_HOSTS` in `.env` first, then falls back to autodiscovery. In mesh mode this means transfers go over the direct IB-attached interfaces automatically.
**Use a custom config file:**
```bash
./hf-download.sh --config /path/to/cluster.env -c QuantTrio/MiniMax-M2-AWQ
```
**Available options:**
| Flag | Description |
| :--- | :--- |
| `<model-name>` | HuggingFace model ID (e.g. `QuantTrio/MiniMax-M2-AWQ`). Required. |
| `-c, --copy-to <hosts>` | Host(s) to copy the model to after download (space- or comma-separated). Omit hosts to use `COPY_HOSTS` from `.env` or autodiscovery. |
| `--copy-to-host` | Alias for `--copy-to` (backwards compatibility). |
| `--copy-parallel` | Copy to all hosts concurrently instead of serially. |
| `-u, --user <user>` | SSH username for remote copies (default: current user). |
| `--config <file>` | Path to `.env` configuration file (default: `.env` in script directory). |
| `-h, --help` | Show help message. |
### Hardware Architecture
**Note:** This project targets the `12.1a` architecture (NVIDIA GB10 / DGX Spark). If you are using different hardware, you can use the `--gpu-arch` flag in `./build-and-copy.sh`.
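For example, a hypothetical build for a non-GB10 card (the architecture value depends on your GPU; you can check it with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`):
```bash
# Hypothetical: arch value and tag are illustrative, not a tested configuration
./build-and-copy.sh --gpu-arch 12.0 -t vllm-node-sm120
```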

build-and-copy.sh

@@ -6,6 +6,7 @@ START_TIME=$(date +%s)
# Default values
IMAGE_TAG="vllm-node"
IMAGE_TAG_SET=false
REBUILD_FLASHINFER=false
REBUILD_VLLM=false
COPY_HOSTS=()
@@ -264,7 +265,7 @@ if downloads:
# Help function
usage() {
echo "Usage: $0 [OPTIONS]"
echo " -t, --tag <tag> : Image tag (default: 'vllm-node')"
echo " -t, --tag <tag> : Image tag (default: 'vllm-node', 'vllm-node-tf5' with --tf5, 'vllm-node-mxfp4' with --exp-mxfp4)"
echo " --gpu-arch <arch> : GPU architecture (default: '12.1a')"
echo " --rebuild-flashinfer : Force rebuild of FlashInfer wheels (ignore cached wheels)"
echo " --rebuild-vllm : Force rebuild of vLLM wheels (ignore cached wheels)"
@@ -291,7 +292,7 @@ usage() {
CONFIG_FILE_SET=false
while [[ "$#" -gt 0 ]]; do
case $1 in
-t|--tag) IMAGE_TAG="$2"; shift ;;
-t|--tag) IMAGE_TAG="$2"; IMAGE_TAG_SET=true; shift ;;
--gpu-arch) GPU_ARCH_LIST="$2"; shift ;;
--rebuild-flashinfer) REBUILD_FLASHINFER=true ;;
--rebuild-vllm) REBUILD_VLLM=true ;;
@@ -342,6 +343,15 @@ while [[ "$#" -gt 0 ]]; do
shift
done
# Apply default IMAGE_TAG based on flags if -t was not specified
if [ "$IMAGE_TAG_SET" = false ]; then
if [ "$PRE_TRANSFORMERS" = true ]; then
IMAGE_TAG="vllm-node-tf5"
elif [ "$EXP_MXFP4" = true ]; then
IMAGE_TAG="vllm-node-mxfp4"
fi
fi
# Source autodiscover.sh to load .env file
source "$(dirname "$0")/autodiscover.sh"

docs/NETWORKING.md

@@ -42,13 +42,54 @@ However, in order to get full bandwidth in NCCL RDMA mode, we need to utilize **
Also, note that connecting two Sparks using **both** ports won't give you any noticeable advantage in bandwidth, so a single connection is sufficient.
If you connect 3 Sparks by daisy-chaining them, you will only be able to sustain 100G between each pair of Sparks.
## Connecting more than 2 Sparks in the cluster
## Connecting 3 Sparks in a mesh cluster without a switch
Three Sparks can be connected together in a cluster without using a separate RoCE switch.
However, all three Sparks need to be on the same wired network using their 10G Ethernet ports (RJ-45, not QSFP). Being on the same wireless network should work too, but it's not recommended and was not tested.
You need to make sure they are connected the following way: port 0 on one Spark should connect to port 1 on another Spark (unlike the non-mesh configuration).
Example diagram:
```mermaid
block-beta
columns 1
block:Spark3
columns 2
Title3["Spark 3"]:2
s3p0["Port 0<br>192.168.187.13<br>192.168.188.13"] s3p1["Port 1<br>192.168.197.13<br>192.168.198.13"]
end
space
block:Spark2
columns 2
Title2["Spark 2"]:2
s2p0["Port 0<br>192.168.197.12<br>192.168.198.12"] s2p1["Port 1<br>192.168.177.12<br>192.168.178.13"]
end
space
block:Spark1
columns 2
Title1["Spark 1"]:2
s1p0["Port 0<br>192.168.177.11<br>192.168.178.11"] s1p1["Port 1<br>192.168.187.11<br>192.168.188.11"]
end
s1p0 <--> s2p1
s2p0 <--> s3p1
s3p0 <--> s1p1
```
## Connecting more than 2 Sparks in the cluster using a switch
To connect more than 2 Sparks, you will need a proper switch, for example the [MikroTik CRS812-DDQ](https://mikrotik.com/product/crs812_ddq).
Please refer to [this post](https://forums.developer.nvidia.com/t/6x-spark-setup/354399/56) for an example of setting up a 6-8 node Spark cluster.
## Network setup
### For dual Sparks or multiple Sparks using a QSFP switch
Assuming both are connected using the rightmost QSFP port (when looking from the back).
Create `/etc/netplan/40-cx7.yaml` on `spark`:
@@ -115,6 +156,122 @@ MTU setting (testing):
sudo ip link set dev enp1s0f1np1 mtu 9000
```
### For 3-node mesh
A 3-node mesh is configured differently than dual clusters or clusters using a QSFP switch.
Assuming your Sparks are connected according to the diagram above:
Create `/etc/netplan/40-cx7.yaml` on `spark1`:
```yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.177.11/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.178.11/24]
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.187.11/24]
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.188.11/24]
```
Create `/etc/netplan/40-cx7.yaml` on `spark2`:
```yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.197.12/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.198.12/24]
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.177.12/24]
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.178.12/24]
```
Create `/etc/netplan/40-cx7.yaml` on `spark3`:
```yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.187.13/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.188.13/24]
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no        # Explicitly disable DHCPv6
      link-local: []   # Restrict link-local addresses to static IPv4 only
      mtu: 9000
      addresses: [192.168.197.13/24]
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      mtu: 9000
      addresses: [192.168.198.13/24]
```
Then run (on each Spark):
```bash
sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
```
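After applying, you can sanity-check each mesh link with a quick ping over the CX7 subnets (addresses from the diagram above, shown from `spark1`):
```bash
ping -c 2 192.168.177.12   # spark2 via the 192.168.177.0/24 link
ping -c 2 192.168.187.13   # spark3 via the 192.168.187.0/24 link
```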
### Passwordless SSH and benchmarks
Set up passwordless SSH. On the first Spark:
```bash
wget https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks
chmod +x discover-sparks
./discover-sparks
```
**Benchmark connection (using the `perftest` package):**
Run the receiver on the `spark2` node:
@@ -196,7 +353,9 @@ ib_write_lat 192.168.177.12 -d rocep1s0f1 --report_gbits -R --force-link IB
---------------------------------------------------------------------------------------
```
## NCCL Setup
## NCCL Tests
### Dual Sparks or Sparks via QSFP switch
From https://build.nvidia.com/spark/nccl/stacked-sparks
@@ -240,3 +399,51 @@ mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
```
### 3-node mesh
```bash
# Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b dgxspark-3node-ring https://github.com/zyang-dev/nccl.git ~/nccl/
cd ~/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
# Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```
Build NCCL Test Suite:
```bash
# Clone and build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git ~/nccl-tests/
cd ~/nccl-tests/
make MPI=1
```
Test on all three nodes (replace `spark1`, `spark2`, `spark3` with the actual hostnames or IP addresses on the non-QSFP interface):
```bash
# Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
# For the 3-node mesh we have to use the 10G interface for OOB communication!
export UCX_NET_DEVICES=enP7s7
export NCCL_SOCKET_IFNAME=enP7s7
export OMPI_MCA_btl_tcp_if_include=enP7s7
export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
# Run the all_gather performance test across all three nodes
mpirun -np 3 -H spark1:1,spark2:1,spark3:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -x NCCL_IB_MERGE_NICS=0 -x NCCL_NET_PLUGIN=none -x NCCL_IB_SUBNET_AWARE_ROUTING=1 \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 3
```


@@ -44,12 +44,16 @@ The recipe runner can automatically discover cluster nodes:
```
When you run `--discover`, it:
1. Scans the network for nodes with SSH access
2. Prompts you to select which nodes to include
3. Saves the configuration to `.env`
1. Detects active CX7 interfaces and determines mesh vs. standard topology.
2. Scans the network for peers that are both SSH-reachable **and** have an NVIDIA GB10 GPU.
3. In mesh mode, separately discovers `COPY_HOSTS` on the direct IB-attached interfaces.
4. Prompts for per-node confirmation for `CLUSTER_NODES` and `COPY_HOSTS`.
5. Saves the full configuration (including mesh NCCL settings if applicable) to `.env`.
Future recipe runs will automatically use nodes from `.env` unless you specify `-n` or `--solo`.
When distributing the container image or model files, the runner uses `COPY_HOSTS` from `.env` (which may differ from `CLUSTER_NODES` in mesh mode) to ensure transfers go over the fastest available path.
## Workflow Modes
### Solo Mode (Single Node)
@@ -169,6 +173,7 @@ Usage: ./run-recipe.sh [OPTIONS] [RECIPE]
Cluster discovery:
--discover Auto-detect cluster nodes and save to .env
--show-env Show current .env configuration
--config FILE Path to .env configuration file (default: .env in repo directory)
Recipe overrides:
--port PORT Override port
@@ -186,10 +191,25 @@ Setup options:
Launch options:
--solo Run in solo mode (single node, no Ray)
--no-ray Multi-node without Ray (PyTorch distributed backend)
-n, --nodes IPS Comma-separated node IPs (first = head)
-d, --daemon Run in daemon mode
-t, --container IMAGE Override container from recipe
--name NAME Override container name
--nccl-debug LEVEL NCCL debug level (VERSION, WARN, INFO, TRACE)
--master-port PORT Cluster coordination port: Ray head port or PyTorch
distributed master port (default: 29501).
Alias: --head-port
--eth-if IFACE Override Ethernet interface
--ib-if IFACE Override InfiniBand interface
-e VAR=VALUE Pass environment variable to container (repeatable)
-j N Number of parallel build jobs
--no-cache-dirs Do not mount ~/.cache/vllm, ~/.cache/flashinfer, ~/.triton
--non-privileged Run container without --privileged
--mem-limit-gb N Memory limit in GB (only with --non-privileged)
--mem-swap-limit-gb N Memory+swap limit in GB (only with --non-privileged)
--pids-limit N Process limit (only with --non-privileged)
--shm-size-gb N Shared memory size in GB (only with --non-privileged)
Extra vLLM arguments:
-- ARGS... Pass additional arguments directly to vLLM
@@ -261,10 +281,18 @@ command: |
```
┌─────────────────────────────────────────────────────────┐
│ autodiscover.sh │
│ - Interface detection (standard / mesh topology) │
│ - GB10 peer verification via SSH │
│ - CLUSTER_NODES and COPY_HOSTS discovery │
│ - Interactive .env save with per-node confirmation │
└──────────────────────────┬──────────────────────────────┘
│ sourced by
┌─────────────────────────────────────────────────────────┐
│ run-recipe.sh / run-recipe.py │
│ - Parses YAML recipe │
│ - Auto-discovers cluster nodes (--discover) │
│ - Loads nodes from .env │
│ - Loads / triggers cluster discovery (--discover) │
│ - Handles --setup (build + download + run) │
│ - Generates launch script from template │
│ - Applies CLI overrides │
@@ -274,7 +302,7 @@ command: |
┌──────────────────────┐ ┌───────────────────────────────┐
│ build-and-copy.sh │ │ hf-download.sh │
│ - Docker build │ │ - HuggingFace model download │
│ - Copy to workers │ │ - Rsync to workers │
│ - Copy to COPY_HOSTS │ │ - Rsync to COPY_HOSTS │
└──────────────────────┘ └───────────────────────────────┘
│ then calls (for run)
@@ -282,7 +310,7 @@ command: |
┌─────────────────────────────────────────────────────────┐
│ launch-cluster.sh │
│ - Cluster orchestration │
│ - Container lifecycle │
│ - Container lifecycle (trimmed to required node count) │
│ - Mod application │
│ - Launch script execution │
└─────────────────────────────────────────────────────────┘