Update README.md

Ren Xuancheng
2024-03-12 15:36:02 +08:00
committed by GitHub
parent aa862758f3
commit da5b44f934

@@ -889,18 +889,13 @@ The statistics are listed below:
 For deployment and fast inference, we suggest using vLLM.
-If you use cuda 12.1 and pytorch 2.1, you can directly use the following command to install vLLM.
+If you use **CUDA 12.1 and PyTorch 2.1**, you can directly use the following command to install vLLM.
 ```bash
-# pip install vllm # This line is faster but it does not support quantization models.
-# The below lines support int4 quantization (int8 will be supported soon). The installation are slower (~10 minutes).
-git clone https://github.com/QwenLM/vllm-gptq
-cd vllm-gptq
-pip install -e .
+pip install vllm
 ```
-Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html), or our [vLLM repo for GPTQ quantization](https://github.com/QwenLM/vllm-gptq).
+Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).
 #### vLLM + Transformer-like Wrapper