diff --git a/README.md b/README.md
index 3eddfe7..d9129cd 100644
--- a/README.md
+++ b/README.md
@@ -164,7 +164,7 @@ Don't do it every time you rebuild, because it will slow down compilation times.
 
 For periodic maintenance, I recommend using a filter: `docker builder prune --filter until=72h`
 
-### 2026-02-14
+### 2026-02-17
 
 #### Non-Privileged Mode Support
 
@@ -181,6 +181,8 @@ Example usage:
 ./launch-cluster.sh --non-privileged --mem-limit-gb 120 --shm-size-gb 64 exec vllm serve ...
 ```
 
+May result in a slightly reduced performance (within 2%) in exchange for better reliability and stability.
+
 ### 2026-02-12
 
 Added a mod for Qwen3-Coder-Next-FP8 that fixes:
diff --git a/recipes/qwen3-coder-next-fp8.yaml b/recipes/qwen3-coder-next-fp8.yaml
index 192db84..68cbe85 100644
--- a/recipes/qwen3-coder-next-fp8.yaml
+++ b/recipes/qwen3-coder-next-fp8.yaml
@@ -24,7 +24,7 @@ defaults:
   host: 0.0.0.0
   tensor_parallel: 2
   gpu_memory_utilization: 0.7
-  max_model_len: 262144
+  max_model_len: 131072
 
 # Environment variables
 env: {}
@@ -37,6 +37,7 @@ command: |
     --gpu-memory-utilization {gpu_memory_utilization} \
     --host {host} \
     --port {port} \
+    --kv-cache-dtype fp8 \
     --load-format fastsafetensors \
     --attention-backend flashinfer \
     --enable-prefix-caching \
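
Review note: the two recipe changes plausibly work together to shrink the KV cache. As a rough sketch, the worst-case cache for one sequence grows linearly with context length and with bytes per element, so halving `max_model_len` and moving from a 16-bit cache to `fp8` would each cut it roughly in half. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Qwen3-Coder-Next values, and the formula assumes plain full attention in every layer with a 2-byte baseline dtype.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Worst-case KV cache size in bytes for one sequence of `tokens` tokens."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape, for illustration only (NOT taken from the model card).
LAYERS, KV_HEADS, HEAD_DIM = 12, 2, 256

# Old recipe: 262144-token context, assumed 2-byte (16-bit) cache entries.
before = kv_cache_bytes(262144, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
# New recipe: 131072-token context, 1-byte fp8 cache entries.
after = kv_cache_bytes(131072, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

ratio = before / after
print(f"before: {before / 2**30:.1f} GiB, after: {after / 2**30:.1f} GiB, ratio: {ratio:.0f}x")
```

Under these assumptions the ratio is 4x regardless of the placeholder shape, because the shape terms cancel; only the absolute sizes depend on the real model dimensions.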