Updated README to describe no-ray mode

Eugene Rakhmatulin
2026-03-17 15:27:22 -07:00
parent ccea2ba861
commit fb0687cd1b


@@ -149,8 +149,68 @@ Don't do it every time you rebuild, because it will slow down compilation times.
For periodic maintenance, I recommend using a filter: `docker builder prune --filter until=72h`
### 2026-03-17
#### Major Cluster Orchestration Refactoring
Significantly refactored the internal cluster startup logic in `launch-cluster.sh`:
- Removed the standalone `run-cluster-node.sh` script; its logic is now fully integrated into `launch-cluster.sh`.
- Ray head/worker startup, environment variable injection, and launch script distribution are now handled by `launch-cluster.sh` directly.
- Worker containers are started with proper per-node environment variables (`VLLM_HOST_IP`, `NCCL_SOCKET_IFNAME`, etc.) injected via `docker run`/`docker exec` instead of relying on `.bashrc`.
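The injection step above can be sketched roughly as follows. This is a minimal illustration of building `docker run` environment flags per node, not the actual `launch-cluster.sh` code; the function name, IP, and interface are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build per-node env flags for docker run / docker exec,
# so variables land in the container without touching .bashrc.
build_env_flags() {
  local node_ip=$1 iface=$2
  printf -- '-e VLLM_HOST_IP=%s -e NCCL_SOCKET_IFNAME=%s' "$node_ip" "$iface"
}

# e.g.: docker run $(build_env_flags 192.168.1.11 enp1s0) my-vllm-image ...
build_env_flags 192.168.1.11 enp1s0
```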
#### No-Ray Multi-Node Mode
Added a `--no-ray` flag to `launch-cluster.sh` to run multi-node vLLM clusters without Ray, using PyTorch's native distributed backend instead. Running without Ray slightly improves inference performance for most models and reduces memory requirements.
```bash
./launch-cluster.sh --no-ray exec vllm serve ...
```
`--no-ray` is incompatible with `--solo` (which already runs without Ray).
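The mutual exclusion amounts to a simple guard at startup. A minimal sketch (hypothetical function and flag variables, not the actual script code):

```bash
#!/usr/bin/env bash
# Reject --no-ray together with --solo: solo mode already runs without Ray.
check_mode_flags() {
  local no_ray=$1 solo=$2
  if [ "$no_ray" = 1 ] && [ "$solo" = 1 ]; then
    echo "error: --no-ray is incompatible with --solo (solo already runs without Ray)" >&2
    return 1
  fi
  return 0
}
```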
#### `run-recipe.sh` No-Ray Mode and Extended Flag Passthrough
`run-recipe.sh` now supports a `--no-ray` flag for running multi-node inference without Ray (using PyTorch's distributed backend instead):
```bash
./run-recipe.sh qwen3.5-122b-fp8 --no-ray
```
The following `launch-cluster.sh` flags are now also passed through from `run-recipe.sh`:
`--name`, `--eth-if`, `--ib-if`, `-j`, `--no-cache-dirs`, `--non-privileged`, `--mem-limit-gb`, `--mem-swap-limit-gb`, `--pids-limit`, `--shm-size-gb`.
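Passthrough of this kind can be sketched as a loop that recognizes the flags and forwards them (and their values) untouched. The function name is hypothetical and the flag list is abbreviated for illustration:

```bash
#!/usr/bin/env bash
# Illustrative flag passthrough: collect recognized launch-cluster.sh flags
# from run-recipe.sh's arguments, preserving values for flags that take one.
collect_passthrough() {
  local -a out=()
  while [ $# -gt 0 ]; do
    case "$1" in
      --no-ray|--no-cache-dirs|--non-privileged)
        out+=("$1") ;;                 # boolean flags
      --name|--eth-if|--ib-if|--mem-limit-gb|--shm-size-gb)
        out+=("$1" "$2"); shift ;;     # flags that take a value
    esac
    shift
  done
  echo "${out[@]}"
}

collect_passthrough qwen3.5-122b-fp8 --no-ray --mem-limit-gb 96
```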
#### Nemotron-3-Nano-NVFP4 Switched to Marlin Backend
The `nemotron-3-nano-nvfp4` recipe has been updated to use the Marlin backend for better performance and reliability (until Flashinfer fully supports NVFP4 on sm121).
### 2026-03-12
#### Experimental `--gpu-memory-utilization-gb` Mod
Added a new mod `mods/gpu-mem-util-gb` that adds a `--gpu-memory-utilization-gb` flag to vLLM, allowing you to specify GPU memory reservation in GiB instead of as a fraction. This is particularly useful on DGX Spark's unified memory architecture where available memory changes dynamically.
```bash
./launch-cluster.sh --apply-mod mods/gpu-mem-util-gb exec vllm serve ... \
--gpu-memory-utilization-gb 110
```
Cannot be used simultaneously with `--kv-cache-memory-bytes`.
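For intuition, a fixed GiB reservation corresponds to a fraction that varies with total memory, which is exactly why it is awkward on a machine whose available memory shifts. The equivalence is a plain division (the totals below are illustrative, not what the mod computes internally):

```bash
#!/usr/bin/env bash
# Convert a GiB reservation to the fractional --gpu-memory-utilization it
# replaces on a machine with a given total (values are illustrative).
gb_to_fraction() {
  awk -v want="$1" -v total="$2" 'BEGIN { printf "%.4f", want / total }'
}

gb_to_fraction 110 128   # 110 GiB out of a 128 GiB pool
```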
#### Qwen3.5-397B INT4-AutoRound TP=4 Recipe (4× Spark Cluster)
Added `recipes/4x-spark-cluster/qwen3.5-397b-int4-autoround.yaml` for running Intel/Qwen3.5-397B-A17B-int4-AutoRound across 4 DGX Spark nodes with tensor parallelism (TP=4).
Benchmarked at ~37 tok/s single-user, ~103 tok/s aggregate (4 concurrent users).
Includes a new mod `mods/fix-qwen35-tp4-marlin` that resolves a Marlin kernel constraint (`MIN_THREAD_N=64`) that breaks certain projection layers at TP=4.
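The constraint can be illustrated with a quick divisibility check: once a projection's output dimension is sharded across TP ranks, each shard must remain a multiple of `MIN_THREAD_N`. The dimensions below are made up for illustration and are not the actual layer shapes of this model:

```bash
#!/usr/bin/env bash
# Does a projection of output dim N, split across TP ranks, satisfy Marlin's
# MIN_THREAD_N=64 shard-size constraint? (dims are illustrative)
shard_ok() {
  local n=$1 tp=$2 min_thread_n=64
  local shard=$(( n / tp ))
  if [ $(( shard % min_thread_n )) -eq 0 ]; then echo ok; else echo violates; fi
}

shard_ok 4096 4   # 1024 per shard -> multiple of 64
shard_ok 2880 4   # 720 per shard  -> not a multiple of 64
```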
**Note:** Requires NVIDIA driver 580.x. Driver 590.x has a CUDAGraph capture deadlock on GB10 unified memory.
```bash
./run-recipe.sh 4x-spark-cluster/qwen3.5-397b-int4-autoround
```
Thanks @sonusflow for the contribution.
#### Nemotron-3-Super-120B NVFP4 Recipe
Added a new recipe `nemotron-3-super-nvfp4` for running `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` with Marlin kernels. Supports both solo (single Spark) and cluster (dual Spark) modes. Includes a custom reasoning parser (`super_v3_reasoning_parser.py`) fetched from the model repository.
@@ -892,6 +952,7 @@ You can override the auto-detected values if needed:
| `--nccl-debug` | NCCL debug level (e.g., INFO, WARN). Defaults to INFO if flag is present but value is omitted. |
| `--check-config` | Check configuration and auto-detection without launching. |
| `--solo` | Solo mode: skip auto-detection, launch only on the current node, and do not start a Ray cluster. |
| `--no-ray` | No-Ray mode: run multi-node vLLM without Ray (uses PyTorch distributed backend). |
| `--no-cache-dirs` | Do not mount default cache directories (~/.cache/vllm, ~/.cache/flashinfer, ~/.triton). |
| `--launch-script` | Path to bash script to execute in the container (from examples/ directory or absolute path). If launch script is specified, action should be omitted. |
| `-d` | Run in daemon mode (detached). |