From da5b44f9343cc725535d36722bfef1162d2c2612 Mon Sep 17 00:00:00 2001
From: Ren Xuancheng
Date: Tue, 12 Mar 2024 15:36:02 +0800
Subject: [PATCH] Update README.md

---
 README.md | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index d3f19ce..2b132d1 100644
--- a/README.md
+++ b/README.md
@@ -889,18 +889,13 @@ The statistics are listed below:
 
 For deployment and fast inference, we suggest using vLLM.
 
-If you use cuda 12.1 and pytorch 2.1, you can directly use the following command to install vLLM.
+If you use **CUDA 12.1 and PyTorch 2.1**, you can directly use the following command to install vLLM.
 
 ```bash
-# pip install vllm  # This line is faster but it does not support quantization models.
-
-# The below lines support int4 quantization (int8 will be supported soon). The installation are slower (~10 minutes).
-git clone https://github.com/QwenLM/vllm-gptq
-cd vllm-gptq
-pip install -e .
+pip install vllm
 ```
 
-Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html), or our [vLLM repo for GPTQ quantization](https://github.com/QwenLM/vllm-gptq).
+Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).
 
 #### vLLM + Transformer-like Wrapper
 