added new mod for glm4.7-flash-awq, solo mode support.
59
README.md
@@ -146,7 +146,63 @@ For periodic maintenance, I recommend using a filter: `docker builder prune --fi
 
 ### 2026-01-29
 
-Added `-e` / `--env` parameter to `launch-cluster.sh` to pass environment variables to the container.
+#### New Parameters for launch-cluster.sh
+
+- Added **solo mode** to `launch-cluster.sh` to launch models on a single node. Pass the `--solo` flag to enable it explicitly; if you have only a single Spark and no other nodes are found, the script defaults to solo mode.
+- Added `-e` / `--env` parameter to `launch-cluster.sh` to pass environment variables to the container.
+
+#### New Mod for GLM-4.7-Flash-AWQ
+
+Added a mod to prevent severe inference speed degradation when using cyankiwi/GLM-4.7-Flash-AWQ-4bit (and potentially other AWQ quants of this model).
+See [this post on NVIDIA forums](https://forums.developer.nvidia.com/t/make-glm-4-7-flash-go-brrrrr/359111) for implementation details.
+
+To use the mod, first build the container with the Transformers 5 support flag (`--pre-tf`), e.g.:
+
+```bash
+./build-and-copy.sh -t vllm-node-tf5 --use-wheels --pre-tf -c
+```
+
+Drop `--use-wheels` if you experience an error during the build (see the announcement in the Quick Start section).
+
+Then, to run on a single node:
+
+```bash
+./launch-cluster.sh -t vllm-node-tf5 --solo \
+--apply-mod mods/fix-glm-4.7-flash-AWQ \
+exec vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
+--tool-call-parser glm47 \
+--reasoning-parser glm45 \
+--enable-auto-tool-choice \
+--served-model-name glm-4.7-flash \
+--max-model-len 202752 \
+--max-num-batched-tokens 4096 \
+--max-num-seqs 64 \
+--host 0.0.0.0 --port 8888 \
+--gpu-memory-utilization 0.7
+```
+
+To run on a cluster:
+
+```bash
+./launch-cluster.sh -t vllm-node-tf5 \
+--apply-mod mods/fix-glm-4.7-flash-AWQ \
+exec vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
+--tool-call-parser glm47 \
+--reasoning-parser glm45 \
+--enable-auto-tool-choice \
+--served-model-name glm-4.7-flash \
+--max-model-len 202752 \
+--max-num-batched-tokens 4096 \
+--max-num-seqs 64 \
+--host 0.0.0.0 --port 8888 \
+--gpu-memory-utilization 0.7 \
+--distributed-executor-backend ray \
+--tensor-parallel-size 2
+```
+
+**NOTE**: The vLLM implementation is suboptimal even with the patch. Performance is still significantly lower than expected for a model with this number of active parameters. Running on a cluster improves prompt processing speed, but not token generation. Expect ~40 t/s generation speed on both a single node and a cluster.
+
+#### Experimental Optimized MXFP4 Build
 
 Added an experimental build option, optimized for DGX Spark and gpt-oss models by [Christopher Owen](https://github.com/christopherowen/spark-vllm-mxfp4-docker/blob/main/Dockerfile).
 
@@ -537,6 +593,7 @@ You can override the auto-detected values if needed:
 | `--apply-mod` | Apply mods/patches from specified directory. Can be used multiple times to apply multiple mods. |
 | `--nccl-debug` | NCCL debug level (e.g., INFO, WARN). Defaults to INFO if flag is present but value is omitted. |
 | `--check-config` | Check configuration and auto-detection without launching. |
+| `--solo` | Solo mode: skip autodetection, launch only on current node, do not launch Ray cluster |
 | `-d` | Run in daemon mode (detached). |
 
 ## 3. Running the Container (Manual)
launch-cluster.sh
@@ -27,9 +27,12 @@ CLUSTER_WAS_RUNNING="false"
 MOD_PATHS=()
 MOD_TYPES=()
 
+ACTIONS_ARG=""
+SOLO_MODE="false"
+
 # Function to print usage
 usage() {
-    echo "Usage: $0 [-n <node_ips>] [-t <image_name>] [--name <container_name>] [--eth-if <if_name>] [--ib-if <if_name>] [--nccl-debug <level>] [--check-config] [-d] [action] [command]"
+    echo "Usage: $0 [-n <node_ips>] [-t <image_name>] [--name <container_name>] [--eth-if <if_name>] [--ib-if <if_name>] [--nccl-debug <level>] [--check-config] [--solo] [-d] [action] [command]"
     echo " -n, --nodes Comma-separated list of node IPs (Optional, auto-detected if omitted)"
     echo " -t Docker image name (Optional, default: $IMAGE_NAME)"
     echo " --name Container name (Optional, default: $DEFAULT_CONTAINER_NAME)"
@@ -39,6 +42,7 @@ usage() {
     echo " --nccl-debug NCCL debug level (Optional, one of: VERSION, WARN, INFO, TRACE). If no level is provided, defaults to INFO."
     echo " --apply-mod Path to directory or zip file containing run.sh to apply before launch (Can be specified multiple times)"
     echo " --check-config Check configuration and auto-detection without launching"
+    echo " --solo Solo mode: skip autodetection, launch only on current node, do not launch Ray cluster"
     echo " -d Daemon mode (only for 'start' action)"
     echo " action start | stop | status | exec (Default: start)"
     echo " command Command to run (only for 'exec' action)"
@@ -64,6 +68,7 @@ while [[ "$#" -gt 0 ]]; do
             fi
             ;;
         --check-config) CHECK_CONFIG="true" ;;
+        --solo) SOLO_MODE="true" ;;
         -d) DAEMON_MODE="true" ;;
         -h|--help) usage ;;
         start|stop|status)
@@ -145,7 +150,20 @@ source "$(dirname "$0")/autodiscover.sh"
 
 # Perform auto-detection
 detect_interfaces || exit 1
+
+if [[ "$SOLO_MODE" == "true" ]]; then
+    if [[ -n "$NODES_ARG" ]]; then
+        echo "Error: --solo is incompatible with -n/--nodes."
+        exit 1
+    fi
+    # Solo mode: skip node detection, just get local IP
+    detect_local_ip || exit 1
+    NODES_ARG="$LOCAL_IP"
+    PEER_NODES=()
+    echo "Solo mode enabled. Skipping node detection."
+else
 detect_nodes || exit 1
+fi
 
 if [[ -z "$NODES_ARG" ]]; then
     echo "Error: Nodes argument (-n) is mandatory or could not be auto-detected."
@@ -174,6 +192,12 @@ if [ "$FOUND_HEAD" = false ]; then
     exit 1
 fi
 
+# Implicit Solo Mode Detection
+if [[ "$SOLO_MODE" == "false" && ${#PEER_NODES[@]} -eq 0 ]]; then
+    echo "Only local node detected/configured. Activating solo mode (no Ray cluster)."
+    SOLO_MODE="true"
+fi
+
 echo "Head Node: $HEAD_IP"
 echo "Worker Nodes: ${PEER_NODES[*]}"
 echo "Container Name: $CONTAINER_NAME"
@@ -413,11 +437,19 @@ start_cluster() {
     echo "Starting Head Node on $HEAD_IP..."
 
     local head_cmd_args=()
+    if [[ "$SOLO_MODE" == "true" ]]; then
+        if [[ ${#MOD_PATHS[@]} -gt 0 ]]; then
+            head_cmd_args=(bash -c "echo Waiting for mod application...; while [ ! -f /tmp/mod_done ]; do sleep 1; done; echo Mod applied, starting container...; exec sleep infinity")
+        else
+            head_cmd_args=(sleep infinity)
+        fi
+    else
     if [[ ${#MOD_PATHS[@]} -gt 0 ]]; then
         head_cmd_args=(bash -c "echo Waiting for mod application...; while [ ! -f /tmp/mod_done ]; do sleep 1; done; echo Mod applied, starting node...; exec ./run-cluster-node.sh --role head --host-ip $HEAD_IP --eth-if $ETH_IF --ib-if $IB_IF")
     else
         head_cmd_args=(./run-cluster-node.sh --role head --host-ip "$HEAD_IP" --eth-if "$ETH_IF" --ib-if "$IB_IF")
     fi
+    fi
 
     docker run -d --privileged --gpus all --rm \
         --ipc=host --network host \
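Review note: in the hunk above, the container's entry command blocks until `/tmp/mod_done` exists, then hands off to `sleep infinity` (solo) or `run-cluster-node.sh` (cluster). The code that creates the sentinel is not shown in this diff; presumably it is written once the mods have been applied. A minimal standalone sketch of that handshake follows. The function name `wait_for_sentinel` and the temporary directory are hypothetical and used only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical illustration of the sentinel-file handshake.
# None of these names come from launch-cluster.sh.

# Block until the given sentinel file exists, polling once per second.
wait_for_sentinel() {
    local sentinel="$1"
    echo "Waiting for mod application..."
    while [ ! -f "$sentinel" ]; do
        sleep 1
    done
    echo "Mod applied, starting container..."
}

demo_dir="$(mktemp -d)"

# Stand-in for the mod step finishing: create the sentinel after a short delay.
( sleep 1; touch "$demo_dir/mod_done" ) &

wait_for_sentinel "$demo_dir/mod_done"
wait                  # reap the background job
rm -rf "$demo_dir"
```

The loop has no timeout, which matches the diff: if the sentinel is never written, the container waits indefinitely.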
@@ -461,7 +493,13 @@ start_cluster() {
         done
     fi
 
+    if [[ "$SOLO_MODE" == "false" ]]; then
     wait_for_cluster
+    else
+        echo "Solo mode active: Skipping Ray cluster readiness check."
+        # Give container a moment to start up
+        sleep 2
+    fi
 }
 
 # Wait for Cluster Readiness
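Review note: the solo-mode handling is spread across several hunks of `launch-cluster.sh`, so a condensed sketch may help when reading them together. The function `resolve_solo_mode` and its positional arguments are hypothetical and not part of the script. It only mirrors the three cases visible in the diff: an explicit `--solo`, the conflict between `--solo` and `-n/--nodes`, and the implicit fallback when no peer nodes are found.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of launch-cluster.sh.
# Args: $1 = "true" if --solo was passed, otherwise "false"
#       $2 = value given to -n/--nodes (empty string if omitted)
#       $3 = number of peer nodes found by detection
# Prints "solo" or "cluster"; returns 1 on the --solo / -n conflict.
resolve_solo_mode() {
    local solo_flag="$1"
    local nodes_arg="$2"
    local peer_count="$3"

    if [ "$solo_flag" = "true" ]; then
        if [ -n "$nodes_arg" ]; then
            echo "Error: --solo is incompatible with -n/--nodes." >&2
            return 1
        fi
        echo "solo"
    elif [ "$peer_count" -eq 0 ]; then
        # Implicit solo mode: only the local node is available.
        echo "solo"
    else
        echo "cluster"
    fi
}

resolve_solo_mode true  "" 0    # prints "solo"
resolve_solo_mode false "" 0    # prints "solo"
resolve_solo_mode false "" 1    # prints "cluster"
```

In either solo case the script skips starting the Ray cluster and the readiness check, so a command run with `exec` starts directly in the single container.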
13
mods/fix-glm-4.7-flash-AWQ/glm47_flash.patch
Normal file
@@ -0,0 +1,13 @@
+--- a/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py
++++ b/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py
+@@ -135,7 +135,9 @@
+         lse = torch.zeros(B, q_num_heads, dtype=q.dtype, device=q.device)
+ 
+         # For batch invariance, use only 1 split to ensure deterministic reduction
+-        num_kv_splits = 1 if vllm_is_batch_invariant() else 4
++        # Dynamic splits: ~1.5K tokens per split, clamped to [32, 128]
++        max_seq_len = int(attn_metadata.decode.seq_lens.max().item())
++        num_kv_splits = 1 if vllm_is_batch_invariant() else max(32, min(128, max_seq_len // 1500))
+ 
+         # TODO(lucas) Allocate ahead of time
+         attn_logits = torch.empty(
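Review note: the clamp in the patched line is easy to misread, so the arithmetic is worked through below as a small shell sketch. The function name `num_kv_splits` is borrowed from the variable in the patch, but this helper is hypothetical and only reproduces `max(32, min(128, max_seq_len // 1500))`. It leaves out the batch-invariant case, which always uses 1 split.

```bash
#!/usr/bin/env bash
# Hypothetical re-implementation of the split-count formula from the patch.
# Arg: $1 = longest sequence length in the decode batch (tokens)
num_kv_splits() {
    local max_seq_len="$1"
    # Integer division; equals Python's // for non-negative inputs.
    local splits=$(( max_seq_len / 1500 ))

    if [ "$splits" -gt 128 ]; then
        splits=128
    fi
    if [ "$splits" -lt 32 ]; then
        splits=32
    fi
    echo "$splits"
}

for len in 4096 48000 96000 202752; do
    echo "$len -> $(num_kv_splits "$len")"
done
# 4096 -> 32
# 48000 -> 32
# 96000 -> 64
# 202752 -> 128
```

Because of the lower bound, the "~1.5K tokens per split" ratio only holds between roughly 48K and 192K tokens. Shorter sequences always get 32 splits (up from the previous fixed value of 4), and longer ones are capped at 128.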
3
mods/fix-glm-4.7-flash-AWQ/run.sh
Normal file
@@ -0,0 +1,3 @@
+#!/bin/bash
+set -e
+patch -p1 -d / < glm47_flash.patch