Added patch to allow fastsafetensors in cluster config
@@ -80,6 +80,10 @@ RUN python3 use_existing_torch.py && \
     sed -i "/flashinfer/d" requirements/cuda.txt && \
     pip install -r requirements/build.txt
 
+# TEMPORARY PATCH for fastsafetensors loading in cluster setup - tracking https://github.com/foundation-model-stack/fastsafetensors/issues/36
+COPY fastsafetensors.patch .
+RUN patch -p1 < fastsafetensors.patch
+
 # Final Build
 # Uses --no-build-isolation to respect the pre-installed Torch/FlashInfer
 RUN pip install --no-build-isolation . -v
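Since the patch is applied at image build time, it can be sanity-checked before being baked into the image. A minimal sketch (run from the root of the vLLM checkout, assuming GNU patch):

```bash
# --dry-run reports whether every hunk applies cleanly without touching files;
# a non-zero exit means upstream vLLM drifted and the patch needs rebasing.
patch -p1 --dry-run < fastsafetensors.patch
```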
README.md
@@ -10,6 +10,14 @@ Some of the steps and parameters may be unnecessary, and some may be missing. Th
 The Dockerfile builds from the main branch of VLLM, so depending on when you run the build process, it may not be in a fully functioning state.
 
+## CHANGELOG
+
+### 2025-11-26
+
+Initial release.
+Updated RoCE configuration example to include both interfaces in the list.
+Applied patch to enable fastsafetensors in cluster configuration (EXPERIMENTAL) and added documentation on fastsafetensors use.
+
 ## 1\. Building the Docker Image
 
 The Dockerfile includes specific **Build Arguments** to allow you to selectively rebuild layers (e.g., update the vLLM source code without re-downloading PyTorch).
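For context, the selective rebuilds mentioned above work by passing `--build-arg` at build time so only the layers downstream of the changed ARG are invalidated. A hypothetical illustration (the arg name `VLLM_REF` is a placeholder; check the Dockerfile for the ARGs it actually declares):

```bash
# Rebuild only the vLLM source layers, reusing the cached Torch layers.
# VLLM_REF is hypothetical; substitute the ARG declared in the Dockerfile.
docker build --build-arg VLLM_REF=main -t vllm-node:latest .
```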
@@ -198,7 +206,21 @@ docker exec -it vllm_node
 
 And execute the vllm command inside.
 
-## 6\. Benchmarking
+## 6\. Fastsafetensors
+
+This build includes support for fastsafetensors loading, which significantly improves loading speeds, especially on DGX Spark, where mmap performance is currently very poor.
+[fastsafetensors](https://github.com/foundation-model-stack/fastsafetensors/) solves this issue with more efficient multi-threaded loading that avoids mmap.
+
+This build also applies an EXPERIMENTAL patch to allow use of fastsafetensors in a cluster configuration (cluster loading won't work without it!).
+Please refer to [this issue](https://github.com/foundation-model-stack/fastsafetensors/issues/36) for details.
+
+To use this method, simply include `--load-format fastsafetensors` when running VLLM, for example:
+
+```bash
+HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-120b --port 8888 --host 0.0.0.0 --trust_remote_code --swap-space 16 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
+```
+
+## 7\. Benchmarking
 
 Follow the guidance in [VLLM Benchmark Suites](https://docs.vllm.ai/en/latest/contributing/benchmarks/) to download the benchmarking dataset, and then run a benchmark with a command like this (assuming you are running on the head node; otherwise specify the `--host` parameter):
fastsafetensors.patch (new file)
@@ -0,0 +1,28 @@
diff --git a/vllm/model_executor/model_loader/weight_utils.py b/vllm/model_executor/model_loader/weight_utils.py
index 0809bdfa9..a7878f44f 100644
--- a/vllm/model_executor/model_loader/weight_utils.py
+++ b/vllm/model_executor/model_loader/weight_utils.py
@@ -28,6 +28,7 @@ from vllm import envs
 from vllm.config import ModelConfig
 from vllm.config.load import LoadConfig
 from vllm.distributed import get_tensor_model_parallel_rank
+from vllm.distributed.parallel_state import get_world_group
 from vllm.logger import init_logger
 from vllm.model_executor.layers.quantization import (
     QuantizationConfig,
@@ -770,11 +771,13 @@ def fastsafetensors_weights_iterator(
     """Iterate over the weights in the model safetensor files
     using fastsafetensor library."""
     if torch.distributed.is_initialized():
-        pg = torch.distributed.group.WORLD
+        world = get_world_group()
+        pg = world.device_group
+        device = world.device
     else:
         pg = SingleGroup()
+        device = torch.device(f"cuda:{pg.rank()}")
 
-    device = torch.device(f"cuda:{pg.rank()}")
     weight_files_sub_lists = [
         hf_weights_files[i : i + pg.size()]
         for i in range(0, len(hf_weights_files), pg.size())
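For clarity, this is the shape of the patched selection logic as plain Python, reconstructed from the hunk above with explanatory comments added (the rationale is inferred from the linked issue, not taken verbatim from the source):

```python
import torch
from vllm.distributed.parallel_state import get_world_group

if torch.distributed.is_initialized():
    # Use vLLM's own world group instead of torch.distributed.group.WORLD:
    # its device_group is the device-backed group fastsafetensors can shard
    # loading over, and its .device is this process's local GPU. The old
    # torch.device(f"cuda:{pg.rank()}") used the *global* rank as a CUDA
    # index, which points at a nonexistent device on every node but the
    # first in a multi-node cluster.
    world = get_world_group()
    pg = world.device_group
    device = world.device
else:
    # Single-process fallback keeps the original behavior: rank 0 -> cuda:0.
    pg = SingleGroup()  # defined in vLLM's weight_utils module
    device = torch.device(f"cuda:{pg.rank()}")
```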