Updated README, added NVFP4 fix

This commit is contained in:
Eugene Rakhmatulin
2026-03-30 11:45:40 -07:00
parent a3201f8873
commit 45494688d1
2 changed files with 16 additions and 21 deletions

View File

@@ -14,6 +14,8 @@ ENV MAX_JOBS=${BUILD_JOBS}
ENV CMAKE_BUILD_PARALLEL_LEVEL=${BUILD_JOBS} ENV CMAKE_BUILD_PARALLEL_LEVEL=${BUILD_JOBS}
ENV NINJAFLAGS="-j${BUILD_JOBS}" ENV NINJAFLAGS="-j${BUILD_JOBS}"
ENV MAKEFLAGS="-j${BUILD_JOBS}" ENV MAKEFLAGS="-j${BUILD_JOBS}"
ENV DG_JIT_USE_NVRTC=1
ENV USE_CUDNN=1
# Set non-interactive frontend to prevent apt prompts # Set non-interactive frontend to prevent apt prompts
ENV DEBIAN_FRONTEND=noninteractive ENV DEBIAN_FRONTEND=noninteractive
@@ -120,6 +122,16 @@ RUN if [ -n "$FLASHINFER_PRS" ]; then \
done; \ done; \
fi fi
# TEMPORARY patch for NVFP4 crash (PR 2913)
RUN curl -fsL https://github.com/flashinfer-ai/flashinfer/pull/38423.diff -o pr2913.diff \
&& if git apply --reverse --check pr2913.diff 2>/dev/null; then \
echo "PR #2913 already applied, skipping."; \
else \
echo "Applying FI PR #2913..."; \
git apply -v pr2913.diff; \
fi \
&& rm pr2913.diff
# Apply patch to avoid re-downloading existing cubins # Apply patch to avoid re-downloading existing cubins
COPY flashinfer_cache.patch . COPY flashinfer_cache.patch .
RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv \ RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv \
@@ -247,6 +259,8 @@ ENV MAX_JOBS=${BUILD_JOBS}
ENV CMAKE_BUILD_PARALLEL_LEVEL=${BUILD_JOBS} ENV CMAKE_BUILD_PARALLEL_LEVEL=${BUILD_JOBS}
ENV NINJAFLAGS="-j${BUILD_JOBS}" ENV NINJAFLAGS="-j${BUILD_JOBS}"
ENV MAKEFLAGS="-j${BUILD_JOBS}" ENV MAKEFLAGS="-j${BUILD_JOBS}"
ENV DG_JIT_USE_NVRTC=1
ENV USE_CUDNN=1
ENV DEBIAN_FRONTEND=noninteractive ENV DEBIAN_FRONTEND=noninteractive
ENV PIP_BREAK_SYSTEM_PACKAGES=1 ENV PIP_BREAK_SYSTEM_PACKAGES=1

View File

@@ -69,7 +69,7 @@ An initial build speed depends on your Internet connection speed and whether the
**On a single node**: **On a single node**:
**NEW** - `launch-cluster.sh` now supports solo mode, which is now a recommended way to run the container on a single Spark: `launch-cluster.sh` supports solo mode, which is now a recommended way to run the container on a single Spark:
```bash ```bash
./launch-cluster.sh --solo exec \ ./launch-cluster.sh --solo exec \
@@ -80,23 +80,6 @@ An initial build speed depends on your Internet connection speed and whether the
--load-format fastsafetensors --load-format fastsafetensors
``` ```
**To launch using regular `docker run`**
```bash
docker run \
--privileged \
--gpus all \
-it --rm \
--network host --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm-node \
bash -c -i "vllm serve \
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
--port 8000 --host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
--load-format fastsafetensors"
```
**On a cluster** **On a cluster**
It's recommended to download the model on one node and distribute across the cluster using ConnectX interconnect prior to launching. This is to avoid re-downloading the model from the Internet on every node in the cluster. It's recommended to download the model on one node and distribute across the cluster using ConnectX interconnect prior to launching. This is to avoid re-downloading the model from the Internet on every node in the cluster.
@@ -151,7 +134,7 @@ For periodic maintenance, I recommend using a filter: `docker builder prune --fi
## CHANGELOG ## CHANGELOG
### 2026-03-29 ### 2026-03-30
#### Flags to specify Flashinfer ref and apply PRs #### Flags to specify Flashinfer ref and apply PRs
@@ -162,8 +145,6 @@ For periodic maintenance, I recommend using a filter: `docker builder prune --fi
Both flags are incompatible with `--exp-mxfp4`. Both flags are incompatible with `--exp-mxfp4`.
### 2026-03-27
#### Default image tag in `build-and-copy.sh` #### Default image tag in `build-and-copy.sh`
`build-and-copy.sh` now automatically sets a sensible default image tag when `-t` is not specified: `build-and-copy.sh` now automatically sets a sensible default image tag when `-t` is not specified: