4 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Eugene Rakhmatulin | a749fcce87 | Added a recipe for qwen3.5-122B-FP8 | 2026-03-04 16:49:39 -08:00 |
| Eugene Rakhmatulin | 505a060a7d | vLLM prebuilt wheels support | 2026-03-04 16:01:50 -08:00 |
| Eugene Rakhmatulin | ca34ebcffc | Merge branch 'main' into vllm-wheels | 2026-03-04 15:59:16 -08:00 |
| Eugene Rakhmatulin | 2152ef127d | Now can use prebuilt vLLM wheels | 2026-03-04 13:33:32 -08:00 |
3 changed files with 86 additions and 24 deletions

View File

@@ -26,7 +26,10 @@ While it was primarily developed to support multi-node inference, it works just
This repository is not affiliated with NVIDIA or its subsidiaries. This is a community effort aimed at helping DGX Spark users set up and run the most recent versions of vLLM on a Spark cluster or on single nodes.
The Dockerfile builds from the main branch of vLLM, so depending on when you run the build process, it may not be in a fully functioning state. You can target a specific vLLM release by setting the `--vllm-ref` parameter.
Unless `--rebuild-vllm`, `--vllm-ref`, or `--apply-vllm-pr` is specified, the builder will fetch the latest precompiled vLLM wheels from the repository. They are built nightly and tested on multiple models in both cluster and solo configurations before publishing.
We will expand the selection of models we test in the pipeline, but since vLLM is a rapidly developing platform, some things may break.
If you want to build the latest from the main branch, specify the `--rebuild-vllm` flag, or target a specific vLLM release by setting the `--vllm-ref` parameter.
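For example, pinning the build to a specific release might look like this (the release value is only a placeholder, and the space-separated flag syntax is assumed; substitute whatever vLLM tag or commit you actually need):

```
# Pin the build to a specific vLLM release instead of the latest main
./build-and-copy.sh --vllm-ref v0.11.0
```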
## QUICK START
@@ -58,7 +61,7 @@ Then run the following command that will build and distribute image across the c
./build-and-copy.sh -c
```
An initial build will take around 20-30 minutes, but subsequent builds will be faster. Precompiled vLLM wheels for DGX Spark will also be available soon.
Initial build speed depends on your Internet connection and whether the base image is already present on your machine. After the base image has been pulled, the build should take only 2-3 minutes. If `--rebuild-vllm` and/or `--rebuild-flashinfer` is used to trigger a build from source, it will take between 20 and 40 minutes, but subsequent builds will be faster. Prebuilt FlashInfer and vLLM wheels are downloaded automatically from GitHub releases, so compilation from source is usually not required.
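If you do want to force a fresh source build and distribute it across the cluster, the rebuild flags can be combined with the cluster build command shown above (combining them with `-c` is assumed to behave the same as the plain invocation):

```
# Rebuild both vLLM and FlashInfer from source, then copy the image to all nodes
./build-and-copy.sh -c --rebuild-vllm --rebuild-flashinfer
```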
### Run
@@ -120,7 +123,7 @@ To launch the model:
This will run the model on all available cluster nodes.
**NOTE:** do not use `--load-format fastsafetensors` if you are loading models that would take >0.8 of available RAM (without KV cache) as it may result in out of memory situation.
**NOTE:** do not use `--load-format fastsafetensors` if you are loading models that would take >0.85 of available RAM (without KV cache), as it may result in an out-of-memory situation.
**Also:** You can use any vLLM container that has "bash" as its default entrypoint with the launch script. It was tested with the NGC vLLM container, but may work with others too. To use such a container in the cluster, pass the `--apply-mod use-ngc-vllm` argument to `./launch-cluster.sh`. However, it's recommended to build the container using this repository for best compatibility and the most up-to-date features.
@@ -146,6 +149,21 @@ Don't do it every time you rebuild, because it will slow down compilation times.
For periodic maintenance, I recommend using a filter: `docker builder prune --filter until=72h`
### 2026-03-04
#### Prebuilt vLLM Wheels via GitHub Releases
`build-and-copy.sh` now automatically downloads prebuilt vLLM wheels from the [GitHub releases](https://github.com/eugr/spark-vllm-docker/releases/tag/prebuilt-vllm-current) before falling back to a local build — identical to the existing FlashInfer download mechanism. This eliminates the need to compile vLLM from source on first use.
The download logic mirrors the FlashInfer behaviour:
- If prebuilt wheels are available and newer than any locally cached version, they are downloaded automatically.
- If the download fails (e.g. no network, release not found, GPU arch not supported), the script falls back to building locally, or reuses existing local wheels if present.
- `--rebuild-vllm`, `--vllm-ref`, or `--apply-vllm-pr` skip the download entirely and force a local build.
No new flags are required — the download happens transparently.
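Condensed from the script changes below, the decision chain looks roughly like this (simplified sketch, not the verbatim code):

```
BUILD_VLLM=false
if [ "$REBUILD_VLLM" = true ]; then
    BUILD_VLLM=true                                    # forced by --rebuild-vllm / --vllm-ref / --apply-vllm-pr
elif try_download_wheels "$VLLM_RELEASE_TAG" "vllm"; then
    echo "vLLM wheels ready."                          # prebuilt wheels downloaded from the GitHub release
elif compgen -G "./wheels/vllm*.whl" > /dev/null; then
    echo "Using existing local vLLM wheels."           # download failed, but cached wheels are present
else
    BUILD_VLLM=true                                    # nothing available, build from source
fi
```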
All prebuilt wheels are now tested with multiple models in both solo and cluster configurations as part of an automated deployment pipeline that runs nightly. The wheels are released only if they pass all tests and no significant performance regressions are detected.
### 2026-03-02
#### Qwen3.5-122B-INT4-Autoround Support
@@ -178,7 +196,6 @@ Added a new mod for Intel/Qwen3-Coder-Next-INT4-Autoround model support: `mods/f
Changed the reasoning parser in Minimax for better compatibility with modern clients (like coding tools).
### 2026-02-18
#### Completely Redesigned Build Process

View File

@@ -23,6 +23,7 @@ BUILD_JOBS="16"
GPU_ARCH_LIST="12.1a"
WHEELS_REPO="eugr/spark-vllm-docker"
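# GitHub release tags (in WHEELS_REPO) that host the prebuilt FlashInfer and vLLM wheels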
FLASHINFER_RELEASE_TAG="prebuilt-flashinfer-current"
VLLM_RELEASE_TAG="prebuilt-vllm-current"
# Space-separated list of GPU architectures for which prebuilt wheels are available
PREBUILT_WHEELS_SUPPORTED_ARCHS="12.1a"
@@ -347,30 +348,32 @@ if [ "$NO_BUILD" = false ]; then
# ----------------------------------------------------------
# Phase 2: vLLM wheels
# ----------------------------------------------------------
VLLM_WHEELS_EXIST=false
if compgen -G "./wheels/vllm*.whl" > /dev/null 2>&1; then
VLLM_WHEELS_EXIST=true
fi
if [ "$VLLM_REF_SET" = true ] || [ -n "$VLLM_PRS" ]; then
REBUILD_VLLM=true
fi
if [ "$REBUILD_VLLM" = true ] || [ "$VLLM_WHEELS_EXIST" = false ]; then
if [ "$REBUILD_VLLM" = true ]; then
if [ "$VLLM_REF_SET" = true ] && [ -n "$VLLM_PRS" ]; then
echo "Rebuilding vLLM wheels (--vllm-ref and --apply-vllm-pr specified)..."
elif [ "$VLLM_REF_SET" = true ]; then
echo "Rebuilding vLLM wheels (--vllm-ref specified)..."
elif [ -n "$VLLM_PRS" ]; then
echo "Rebuilding vLLM wheels (--apply-vllm-pr specified)..."
else
echo "Rebuilding vLLM wheels (--rebuild-vllm specified)..."
fi
BUILD_VLLM=false
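# Decide how to obtain vLLM wheels: forced source rebuild first, then prebuilt
# wheels from the GitHub release, then cached local wheels, otherwise build from source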
if [ "$REBUILD_VLLM" = true ]; then
if [ "$VLLM_REF_SET" = true ] && [ -n "$VLLM_PRS" ]; then
echo "Rebuilding vLLM wheels (--vllm-ref and --apply-vllm-pr specified)..."
elif [ "$VLLM_REF_SET" = true ]; then
echo "Rebuilding vLLM wheels (--vllm-ref specified)..."
elif [ -n "$VLLM_PRS" ]; then
echo "Rebuilding vLLM wheels (--apply-vllm-pr specified)..."
else
echo "No vLLM wheels found in ./wheels/ — building..."
echo "Rebuilding vLLM wheels (--rebuild-vllm specified)..."
fi
BUILD_VLLM=true
elif try_download_wheels "$VLLM_RELEASE_TAG" "vllm"; then
echo "vLLM wheels ready."
elif compgen -G "./wheels/vllm*.whl" > /dev/null 2>&1; then
echo "Download failed — using existing local vLLM wheels."
else
echo "No vLLM wheels available (download failed) — building..."
BUILD_VLLM=true
fi
if [ "$BUILD_VLLM" = true ]; then
# Back up existing vllm wheels; restore them if the build fails
VLLM_BACKUP="./wheels/.backup-vllm"
rm -rf "$VLLM_BACKUP" && mkdir -p "$VLLM_BACKUP"
@@ -393,7 +396,6 @@ if [ "$NO_BUILD" = false ]; then
VLLM_CMD+=("--build-arg" "VLLM_PRS=$VLLM_PRS")
fi
VLLM_CMD+=(".")
echo "vLLM build command: ${VLLM_CMD[*]}"
@@ -408,8 +410,6 @@ if [ "$NO_BUILD" = false ]; then
rm -rf "$VLLM_BACKUP"
exit 1
fi
else
echo "vLLM wheels already present in ./wheels/ — skipping build."
fi
# ----------------------------------------------------------

View File

@@ -0,0 +1,45 @@
# Recipe: Qwen3.5-122B-A10B-FP8
# Qwen3.5-122B model in native FP8 quantization
recipe_version: "1"
name: Qwen3.5-122B-FP8
description: vLLM serving Qwen3.5-122B-FP8
# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.5-122B-A10B-FP8
# Only cluster is supported
cluster_only: true
# Container image to use
container: vllm-node
# No mods required
mods: []
# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 8192
# Environment variables
env: {}
# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
    --max-model-len {max_model_len} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    -tp {tensor_parallel} --distributed-executor-backend ray \
    --max-num-batched-tokens {max_num_batched_tokens}
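# For reference, with the defaults above the command template is expected to render as
# (assuming the launcher fills each {placeholder} from the "defaults" section):
#
#   vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
#     --max-model-len 262144 --gpu-memory-utilization 0.7 \
#     --port 8000 --host 0.0.0.0 \
#     --load-format fastsafetensors --enable-prefix-caching \
#     --enable-auto-tool-choice --tool-call-parser qwen3_coder \
#     --reasoning-parser qwen3 \
#     -tp 2 --distributed-executor-backend ray \
#     --max-num-batched-tokens 8192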