init commit of recipes (#1027)

Add recipes
2026-05-20 16:35:47 +08:00 · 2024-01-30 01:57:09 -06:00
parent d275e5b91a
commit ee01f36ed9
30 changed files with 5146 additions and 0 deletions
--- a/recipes/inference/dashscope/README.md
+++ b/recipes/inference/dashscope/README.md
@@ -0,0 +1,56 @@
+# Inference Qwen Using DashScope
+
+The simplest way to use Qwen through an API is the DashScope API service from Alibaba Cloud. This guide introduces its usage. Additionally, we provide a script for you to deploy an OpenAI-style API on your own servers.
+
+DashScope is the large language model API service provided by Alibaba Cloud, which now supports Qwen. Note that the models behind DashScope are in-house versions whose details are not currently published. The service offers `qwen-turbo` and `qwen-plus`: the former runs faster, and the latter achieves better performance. For more information, see the [documentation](https://dashscope.aliyun.com).
+
+Follow the [official guide](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) to create a DashScope account and obtain an API key (AK). We recommend setting the AK as an environment variable:
+```bash
+export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
+```
+Then install the SDK; see the [installation documentation](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk) for details. If you use Python, you can install DashScope with pip:
+```bash
+pip install dashscope
+```
+If you use the Java SDK, add the following Maven dependency:
+```xml
+<!-- https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java -->
+<dependency>
+    <groupId>com.alibaba</groupId>
+    <artifactId>dashscope-sdk-java</artifactId>
+    <version>the-latest-version</version>
+</dependency>
+```
+The simplest way to call DashScope is with messages, similar to the OpenAI API. An example is shown below:
+```python
+import random
+from http import HTTPStatus
+from dashscope import Generation
+
+
+def call_with_messages():
+    messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
+                {'role': 'user', 'content': '如何做西红柿鸡蛋？'}]
+    gen = Generation()
+    response = gen.call(
+        Generation.Models.qwen_turbo,
+        messages=messages,
+        seed=random.randint(1, 10000),  # set the random seed, optional, default to 1234 if not set
+        result_format='message',  # set the result to be "message" format.
+    )
+    return response
+
+
+if __name__ == '__main__':
+    response = call_with_messages()
+    if response.status_code == HTTPStatus.OK:
+        print(response)
+    else:
+        print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
+            response.request_id, response.status_code,
+            response.code, response.message
+        ))
+```
+For more usage examples, please visit the official website.
+<br><br>
+
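The messages-based call in the DashScope README above extends naturally to multi-turn chat by appending each reply to the message list before the next request. The following is a minimal sketch under that assumption; the helper names (`start_conversation`, `add_user_turn`, `add_assistant_turn`) are hypothetical and not part of the DashScope SDK, and no network call is made.

```python
from typing import Dict, List

Message = Dict[str, str]


def start_conversation(system_prompt: str = "You are a helpful assistant.") -> List[Message]:
    """Return a new message list that holds only the system prompt."""
    return [{"role": "system", "content": system_prompt}]


def add_user_turn(messages: List[Message], text: str) -> List[Message]:
    """Return a copy of the history with a new user message appended."""
    return messages + [{"role": "user", "content": text}]


def add_assistant_turn(messages: List[Message], text: str) -> List[Message]:
    """Return a copy of the history with the model's reply appended."""
    return messages + [{"role": "assistant", "content": text}]


# Hypothetical flow: in real use, the reply text comes from the API response.
history = start_conversation()
history = add_user_turn(history, "How do I cook tomato and egg stir-fry?")
history = add_assistant_turn(history, "<reply text from the first call>")
history = add_user_turn(history, "How long does it take?")
print([m["role"] for m in history])  # ['system', 'user', 'assistant', 'user']
```

The resulting list would be passed as the `messages` argument of the next call. Returning copies rather than mutating in place keeps earlier histories reusable for retries.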
--- a/recipes/inference/hf_modelscope/README.md
+++ b/recipes/inference/hf_modelscope/README.md
@@ -0,0 +1,248 @@
+# Inference Qwen Using 🤖 ModelScope and 🤗 Transformers
+
+Below, we provide simple examples to show how to run inference on Qwen with 🤖 ModelScope and 🤗 Transformers.
+
+## Requirements
+
+* Python 3.8 and above
+* PyTorch 1.12 and above (2.0 and above recommended)
+* Transformers 4.32 and above
+* CUDA 11.4 and above recommended (for GPU users, flash-attention users, etc.)
+<br>
+
+## Installation
+
+You can use our pre-built Docker images to skip most of the environment setup steps; see the section ["Using Pre-built Docker Images"](https://github.com/QwenLM/Qwen?tab=readme-ov-file#-docker) for more details.
+
+If you are not using Docker, make sure you meet the requirements above, and then install the dependencies:
+
+```bash
+pip install -r Qwen/requirements.txt
+```
+
+If your device supports fp16 or bf16, we recommend installing [flash-attention](https://github.com/Dao-AILab/flash-attention) (**we support flash attention 2 now.**) for higher efficiency and lower memory usage. (**flash-attention is optional and the project can run normally without installing it**)
+
+```bash
+git clone https://github.com/Dao-AILab/flash-attention
+cd flash-attention && pip install .
+# Below are optional. Installing them might be slow.
+# pip install csrc/layer_norm
+# If the version of flash-attn is higher than 2.1.1, the following is not needed.
+# pip install csrc/rotary
+```
+
+Now you can start with ModelScope or Transformers.
+
+## 🤗 Transformers
+
+To use Qwen-Chat for inference, all you need is a few lines of code, as demonstrated below. Remember to pass in the correct model name or path, such as "Qwen/Qwen-7B-Chat" or "Qwen/Qwen-14B-Chat". Also, **please make sure that you are using the latest code.**
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers.generation import GenerationConfig
+
+# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
+
+# use bf16
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
+# use fp16
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
+# use cpu only
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
+# use auto mode, automatically select precision based on the device.
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen-7B-Chat",
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+
+# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
+# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
+
+# 1st dialogue turn
+response, history = model.chat(tokenizer, "你好", history=None)
+print(response)
+# 你好！很高兴为你提供帮助。
+
+# 2nd dialogue turn
+response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
+print(response)
+# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
+# 故事的主人公叫李明，他来自一个普通的家庭，父母都是普通的工人。从小，李明就立下了一个目标：要成为一名成功的企业家。
+# 为了实现这个目标，李明勤奋学习，考上了大学。在大学期间，他积极参加各种创业比赛，获得了不少奖项。他还利用课余时间去实习，积累了宝贵的经验。
+# 毕业后，李明决定开始自己的创业之路。他开始寻找投资机会，但多次都被拒绝了。然而，他并没有放弃。他继续努力，不断改进自己的创业计划，并寻找新的投资机会。
+# 最终，李明成功地获得了一笔投资，开始了自己的创业之路。他成立了一家科技公司，专注于开发新型软件。在他的领导下，公司迅速发展起来，成为了一家成功的科技企业。
+# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险，不断学习和改进自己。他的成功也证明了，只要努力奋斗，任何人都有可能取得成功。
+
+# 3rd dialogue turn
+response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
+print(response)
+# 《奋斗创业：一个年轻人的成功之路》
+```
+
+Running Qwen, the base language model, is also simple.
+
+<details>
+  <summary>Running Qwen</summary>
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers.generation import GenerationConfig
+
+# Model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B" 
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
+# use bf16
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
+# use fp16
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
+# use cpu only
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
+# use auto mode, automatically select precision based on the device.
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen-7B",
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+
+# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
+# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
+
+inputs = tokenizer('蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是', return_tensors='pt')
+inputs = inputs.to(model.device)
+pred = model.generate(**inputs)
+print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
+# 蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是亚的斯亚贝巴（Addis Ababa）...
+```
+
+</details>
+
+<p id="DownloadModel">
+If a network issue prevents you from downloading model checkpoints and code from Hugging Face, you can instead fetch the checkpoint from ModelScope first and then load it from the local directory, as outlined below:
+</p>
+
+```python
+from modelscope import snapshot_download
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Downloading model checkpoint to a local dir model_dir
+# model_dir = snapshot_download('qwen/Qwen-7B')
+# model_dir = snapshot_download('qwen/Qwen-7B-Chat')
+# model_dir = snapshot_download('qwen/Qwen-14B')
+model_dir = snapshot_download('qwen/Qwen-14B-Chat')
+
+# Loading local checkpoints
+# trust_remote_code is still set to True because the model code is loaded from the local dir rather than from transformers
+tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_dir,
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+```
+
+## 🤖 ModelScope
+
+ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below:
+
+```python
+from modelscope import AutoModelForCausalLM, AutoTokenizer
+from modelscope import GenerationConfig
+
+# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
+tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
+model.generation_config = GenerationConfig.from_pretrained("qwen/Qwen-7B-Chat", trust_remote_code=True)  # you can specify a different generation length, top_p, and other hyperparameters
+
+response, history = model.chat(tokenizer, "你好", history=None)
+print(response)
+response, history = model.chat(tokenizer, "浙江的省会在哪里？", history=history) 
+print(response)
+response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)
+print(response)
+```
+
+## Batch Inference
+Qwen supports batch inference. With flash attention enabled, using batch inference can bring a 40% speedup. The example code is shown below:
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers import GenerationConfig
+from qwen_generation_utils import make_context, decode_tokens, get_stop_words_ids
+
+tokenizer = AutoTokenizer.from_pretrained(
+    './',
+    pad_token='<|extra_0|>',
+    eos_token='<|endoftext|>',
+    padding_side='left',
+    trust_remote_code=True
+)
+model = AutoModelForCausalLM.from_pretrained(
+    './',
+    pad_token_id=tokenizer.pad_token_id,
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+model.generation_config = GenerationConfig.from_pretrained('./', pad_token_id=tokenizer.pad_token_id)
+
+all_raw_text = ["我想听你说爱我。", "今天我想吃点啥，甜甜的，推荐下", "我马上迟到了，怎么做才能不迟到"]
+batch_raw_text = []
+for q in all_raw_text:
+    raw_text, _ = make_context(
+        tokenizer,
+        q,
+        system="You are a helpful assistant.",
+        max_window_size=model.generation_config.max_window_size,
+        chat_format=model.generation_config.chat_format,
+    )
+    batch_raw_text.append(raw_text)
+
+batch_input_ids = tokenizer(batch_raw_text, padding='longest')
+batch_input_ids = torch.LongTensor(batch_input_ids['input_ids']).to(model.device)
+batch_out_ids = model.generate(
+    batch_input_ids,
+    return_dict_in_generate=False,
+    generation_config=model.generation_config
+)
+padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
+
+batch_response = [
+    decode_tokens(
+        batch_out_ids[i][padding_lens[i]:],
+        tokenizer,
+        raw_text_len=len(batch_raw_text[i]),
+        context_length=(batch_input_ids[i].size(0)-padding_lens[i]),
+        chat_format="chatml",
+        verbose=False,
+        errors='replace'
+    ) for i in range(len(all_raw_text))
+]
+print(batch_response)
+
+response, _ = model.chat(tokenizer, "我想听你说爱我。", history=None)
+print(response)
+
+response, _ = model.chat(tokenizer, "今天我想吃点啥，甜甜的，推荐下", history=None)
+print(response)
+
+response, _ = model.chat(tokenizer, "我马上迟到了，怎么做才能不迟到", history=None)
+print(response)
+```
+
+## CPU
+
+To deploy our models on CPU, we strongly advise you to use [qwen.cpp](https://github.com/QwenLM/qwen.cpp), which is a pure C++ implementation of Qwen and tiktoken. Check the repo for more details!
+
+It is also simple to run the model directly on CPU, which requires specifying the device:
+
+```python
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
+```
+
+However, inference on CPU is likely to be extremely slow.
+
+## Multiple GPUs
+
+If you lack GPU memory and would like to run the model on more than one GPU, you can directly use the default loading method, which is now supported by Transformers. The previous method based on `utils.py` is deprecated.
+
+Although this method is simple, the efficiency of the native pipeline parallelism is low. We advise you to use vLLM with FastChat; please read [the vLLM section](../vllm/README.md) for deployment.
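The Batch Inference example in the README above sets `padding_side='left'` and later strips `padding_lens` tokens from each output row. A decoder-only model appends new tokens on the right, so every prompt in a batch has to end at the same position, which is what left padding achieves. The following is a minimal pure-Python sketch of that bookkeeping; `left_pad`, `padding_lengths`, and the pad id `0` are hypothetical stand-ins for what the tokenizer does, not the repository's code.

```python
from typing import List


def left_pad(batch: List[List[int]], pad_id: int) -> List[List[int]]:
    """Pad every sequence on the left so all rows share the longest length."""
    longest = max(len(ids) for ids in batch)
    return [[pad_id] * (longest - len(ids)) + ids for ids in batch]


def padding_lengths(padded: List[List[int]], pad_id: int) -> List[int]:
    """Count pad tokens per row, mirroring `padding_lens` in the README example."""
    return [sum(1 for token in row if token == pad_id) for row in padded]


PAD = 0  # hypothetical pad id; the real one is tokenizer.pad_token_id
batch = [[11, 12, 13], [21]]
padded = left_pad(batch, PAD)
pads = padding_lengths(padded, PAD)
# Dropping the leading pads recovers each original prompt.
unpadded = [row[n:] for row, n in zip(padded, pads)]
print(padded)    # [[11, 12, 13], [0, 0, 21]]
print(unpadded)  # [[11, 12, 13], [21]]
```

Counting by equality, as the README example does, assumes the pad id never occurs inside a real prompt.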
--- a/recipes/inference/quantization/README.md
+++ b/recipes/inference/quantization/README.md
@@ -0,0 +1,113 @@
+# Quantization
+
+## GPTQ
+
+We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release Int4 and Int8 quantized models, which are nearly lossless in model quality while reducing memory cost and improving inference speed.
+
+Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:
+
+```bash
+pip install auto-gptq optimum
+```
+
+If you have problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.
+
+> Note: The pre-compiled `auto-gptq` packages strongly depend on the version of `torch` and its CUDA version. Moreover, due to recent updates,
+> you may also encounter unsupported version errors from `transformers`, `optimum`, or `peft`.
+> We recommend using the latest versions meeting the following requirements:
+> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
+> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
+
+Then you can load the quantized model easily and run inference as usual:
+
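+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen-7B-Chat-Int4",
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+response, history = model.chat(tokenizer, "Hi", history=None)
+```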
+
+We report the performance of the BF16, Int8, and Int4 models on the benchmarks, and we find that the quantized models do not suffer from significant performance degradation. Results are shown below:
+
+| Quantization         | MMLU | CEval (val) | GSM8K | Humaneval |
+|----------------------|:----:|:-----------:|:-----:|:---------:|
+| Qwen-1.8B-Chat (BF16)| 43.3 |    55.6     | 33.7  |   26.2    |
+| Qwen-1.8B-Chat (Int8)| 43.1 |    55.8     | 33.0  |   27.4    |
+| Qwen-1.8B-Chat (Int4)| 42.9 |    52.8     | 31.2  |   25.0    |
+| Qwen-7B-Chat (BF16)  | 55.8 |    59.7     | 50.3  |   37.2    |
+| Qwen-7B-Chat (Int8)  | 55.4 |    59.4     | 48.3  |   34.8    |
+| Qwen-7B-Chat (Int4)  | 55.1 |    59.2     | 49.7  |   29.9    |
+| Qwen-14B-Chat (BF16) | 64.6 |    69.8     | 60.1  |   43.9    |
+| Qwen-14B-Chat (Int8) | 63.6 |    68.6     | 60.0  |   48.2    |
+| Qwen-14B-Chat (Int4) | 63.3 |    69.0     | 59.8  |   45.7    |
+| Qwen-72B-Chat (BF16) | 74.4 |    80.1     | 76.4  |   64.6    |
+| Qwen-72B-Chat (Int8) | 73.5 |    80.1     | 73.5  |   62.2    |
+| Qwen-72B-Chat (Int4) | 73.4 |    80.1     | 75.3  |   61.6    |
+
+## Quantization of KV cache
+
+> NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality 
+> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_256.cu`) may be missing. Please manually download
+> them from the Hugging Face Hub and place them into the same folder as the other module files.
+
+The attention KV cache can be quantized and compressed for storage to achieve higher sample throughput. The arguments `use_cache_quantization` and `use_cache_kernel` in `config.json` enable KV cache quantization. Usage is as follows:
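+```python
+from transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen-7B-Chat",
+    device_map="auto",
+    trust_remote_code=True,
+    use_cache_quantization=True,  # store the KV cache in quantized form
+    use_cache_kernel=True,
+    use_flash_attn=False,  # cannot be combined with KV cache quantization
+)
+```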
+Attention: Currently, KV cache quantization and flash attention cannot be used at the same time.
+If you enable both at the same time (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), `use_flash_attn` is automatically disabled (`use_flash_attn=False`).
+
+We have verified that the use of the quantized Int8-KV-Cache model does not suffer from significant performance degradation in downstream evaluation. In the following, we focus on profiling its memory footprint in different conditions. 
+The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. 
+We use BF16 models to generate 1024 tokens by default, and "OOM" indicates out-of-memory error.
+
+With KV cache quantization, the model can infer with a larger batch size (bs).
+
+| KV cache quantization |  bs=1  |  bs=4  | bs=16  | bs=32  | bs=64  | bs=100 |
+|-----------------------|:------:|:------:|:------:|:------:|:------:|:------:|
+| No                    | 16.3GB | 24.1GB | 31.7GB | 48.7GB |  OOM   |  OOM   |
+| Yes                   | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
+
+With KV cache quantization, the model also saves more memory when generating longer sequences (`sl`, sequence length, referring to the number of tokens generated) during inference.
+
+| KV cache quantization | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
+|-----------------------|:------:|:-------:|:-------:|:-------:|:-------:|
+| No                    | 15.2GB | 16.3GB  | 17.6GB  | 19.5GB  | 23.2GB  |
+| Yes                   |  15GB  | 15.5GB  | 15.8GB  | 16.6GB  | 17.6GB  |
+
+With KV cache quantization, the model converts `layer_past` from float to int8, and the quantized `layer_past` also stores the quantization parameters.
+
+Specific steps are as follows:
+
+1. Quantize key/value
+```
+    qv,scale,zero_point=quantize_cache_v(v)
+```
+2. Store into layer_past
+
+The following is the format of quantized `layer_past`:
+```
+    layer_past=((q_key,key_scale,key_zero_point),
+                (q_value,value_scale,value_zero_point))
+```
+
+The original format of `layer_past` is shown below:
+```
+    layer_past=(key,value)
+```
+
+To use the quantized attention KV cache, apply the dequantization operation to convert the Int8 key/value back to float format as follows:
+```
+    v=dequantize_cache_torch(qv,scale,zero_point)
+```
+<br>
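The quantize, store, and dequantize steps described in the README above can be illustrated with a small pure-Python sketch. It is not the repository's `quantize_cache_v` or `dequantize_cache_torch`: the functions `quantize_8bit` and `dequantize_8bit` are hypothetical, operate on a flat list rather than a tensor, and assume an asymmetric min/max scheme over the unsigned range 0 to 255.

```python
from typing import List, Tuple


def quantize_8bit(values: List[float]) -> Tuple[List[int], float, float]:
    """Map floats to integers in [0, 255]; return (q, scale, zero_point)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0
    if scale == 0.0:  # constant input: any non-zero scale works
        scale = 1.0
    zero_point = -lo / scale
    q = [min(255, max(0, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point


def dequantize_8bit(q: List[int], scale: float, zero_point: float) -> List[float]:
    """Invert quantize_8bit up to rounding error."""
    return [scale * (x - zero_point) for x in q]


key = [-1.0, 0.0, 0.5, 2.0]
value = [0.25, 0.75, -0.5, 1.5]
# Same nesting as the quantized layer_past format in the README.
layer_past = (quantize_8bit(key), quantize_8bit(value))
(q_key, key_scale, key_zero_point), _ = layer_past
restored = dequantize_8bit(q_key, key_scale, key_zero_point)
max_error = max(abs(a - b) for a, b in zip(key, restored))
print(max_error <= key_scale / 2 + 1e-9)  # rounding error stays within half a step
```

Storing `scale` and `zero_point` next to the integers is what makes the cache self-describing: dequantization needs nothing beyond the tuple itself.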
--- a/recipes/inference/tensorrt/README.md
+++ b/recipes/inference/tensorrt/README.md
@@ -0,0 +1,46 @@
+# Inference Qwen Using TensorRT-LLM
+Below, we provide a simple example to show how to run Qwen inference with TensorRT-LLM. We recommend GPUs with compute capability of at least SM_80, such as A10 and A800, since these are the GPUs we have tested. You can look up your GPU's compute capability [here](https://developer.nvidia.com/cuda-gpus).
+
+## Installation
+You can use the pre-built Docker image to run this example. Alternatively, refer to the official [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) repository for installation and detailed usage.
+```bash
+docker run --gpus all -it --ipc=host --network=host pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:tensorrt-llm-0.8.0 bash
+```
+## Quickstart
+1. Download the model with ModelScope
+
+```bash
+cd TensorRT-LLM/examples/qwen
+python3 -c "from modelscope.hub.snapshot_download import snapshot_download; snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')"
+mkdir -p ./tmp/Qwen
+mv Qwen/Qwen-1_8B-Chat ./tmp/Qwen/1_8B
+```
+
+2. Build the TensorRT engine from the HF checkpoint
+
+```bash
+python3 build.py --hf_model_dir ./tmp/Qwen/1_8B/ \
+                --dtype float16 \
+                --remove_input_padding \
+                --use_gpt_attention_plugin float16 \
+                --enable_context_fmha \
+                --use_gemm_plugin float16 \
+                --output_dir ./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu/
+```
+
+3. Inference
+```bash
+python3 ../run.py --input_text "你好，请问你叫什么？" \
+                  --max_output_len=512 \
+                  --tokenizer_dir ./tmp/Qwen/1_8B/ \
+                  --engine_dir=./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu
+```
+```
+Input [Text 0]: "<|im_start|>system
+You are a helpful assistant.<|im_end|>
+<|im_start|>user
+你好，请问你叫什么？<|im_end|>
+<|im_start|>assistant
+"
+Output [Text 0 Beam 0]: "你好，我是来自阿里云的大规模语言模型，我叫通义千问。"
+```
--- a/recipes/inference/vllm/README.md
+++ b/recipes/inference/vllm/README.md
@@ -0,0 +1,184 @@
+# Inference Qwen Using vLLM
+
+For deployment and fast inference, we suggest using vLLM. 
+
+## Installation
+
+If you use CUDA 12.1 and PyTorch 2.1, you can install vLLM directly with the following command.
+```bash
+# Install vLLM with CUDA 12.1.
+pip install vllm
+```
+Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).
+
+If you have trouble building vLLM, we recommend using the Docker image.
+
+```bash
+docker run --gpus all -it --rm --ipc=host --network=host qwenllm/qwen:cu121 bash
+```
+
+## GPU Requirements
+
+Qwen models use bfloat16 by default, but bfloat16 is only supported on GPUs with compute capability of at least 8.0. For GPUs with compute capability below 8.0, we recommend setting the dtype to float16. You can look up your GPU's compute capability [here](https://developer.nvidia.com/cuda-gpus).
+
+We have tested GPU memory usage on an NVIDIA Tesla V100 32GB by manually adjusting `gpu-memory-utilization` in eager mode. Refer to the following table to determine whether your machine can run these models.
+
+| Model | seq_len 2048 | seq_len 8192 | seq_len 16384 | seq_len 32768 |
+| :--- | ---: | ---: | ---: | ---: |
+| Qwen-1.8B | 6.22G | 7.46G |  |  |
+| Qwen-7B | 17.94G | 20.96G |  |  |
+| Qwen-7B-Int4 | 9.10G | 12.26G |  |  |
+| Qwen-14B | 33.40G |  |  |  |
+| Qwen-14B-Int4 | 13.30G |  |  |  |
+| Qwen-72B | 166.87G | 185.50G | 210.80G | 253.80G |
+| Qwen-72B-Int4 | 55.37G | 73.66G | 97.79G | 158.80G |
+
+We also list the models that can run on consumer graphics cards at the default sequence length in the following table. If the GPU memory exceeds the model's memory usage by only a small margin, you can make the model fit by reducing the `max-model-len` parameter.<br>
+(Note: To run Qwen-14B-Int4 on an NVIDIA RTX 3080Ti, you need to set `gpu-memory-utilization` to 0.99 and enforce eager mode.)
+
+| GPU Memory | GPU | Support Model |
+| :---: | :---: | :---: |
+| 24GB | NVIDIA RTX 4090/3090/A5000 | Qwen-1.8B/Qwen-7B/Qwen-7B-Int4/Qwen-14B-Int4  |
+| 16GB | NVIDIA RTX A4000 | Qwen-1.8B/Qwen-7B-Int4/Qwen-14B-Int4 |
+| 12GB | NVIDIA RTX 3080Ti/TITAN Xp | Qwen-1.8B/Qwen-14B-Int4 |
+| 11GB | NVIDIA RTX 2080Ti/GTX 1080Ti | Qwen-1.8B |
+| 10GB | NVIDIA RTX 3080 | Qwen-1.8B |
+
+## Usage
+
+### vLLM + Web Demo / OpenAI-like API
+
+You can use FastChat to launch a web demo or an OpenAI API server. First, install FastChat:
+
+```bash
+pip install "fschat[model_worker,webui]==0.2.33" "openai<1.0"
+```
+
+To run Qwen with vLLM and FastChat, you first need to launch a controller:
+```bash
+python -m fastchat.serve.controller
+```
+
+Then you can launch the model worker, which means loading your model for inference. For single GPU inference, you can directly run:
+```bash
+python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16
+# run int4 model or GPUs with compute capability less than 8.0
+# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype float16 
+```
+
+However, if you want to run the model on multiple GPUs for faster inference or larger memory, you can use the tensor parallelism supported by vLLM. For example, to run the model on 4 GPUs:
+```bash
+python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16
+# run int4 model or GPUs with compute capability less than 8.0
+# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype float16 
+```
+
+After launching your model worker, you can launch a:
+
+* Web UI Demo
+```bash
+python -m fastchat.serve.gradio_web_server
+```
+
+* OpenAI API
+```bash
+python -m fastchat.serve.openai_api_server --host localhost --port 8000
+```
+
+For OpenAI API server, you can invoke the server in the following manner.
+
+```python
+import openai
+openai.api_base = "http://localhost:8000/v1"
+openai.api_key = "none"
+
+# create a request activating streaming response
+for chunk in openai.ChatCompletion.create(
+    model="Qwen",
+    messages=[
+        {"role": "user", "content": "你好"}
+    ],
+    stream=True 
+    # Specifying stop words in streaming output format is not yet supported and is under development.
+):
+    if hasattr(chunk.choices[0].delta, "content"):
+        print(chunk.choices[0].delta.content, end="", flush=True)
+
+# create a request not activating streaming response
+response = openai.ChatCompletion.create(
+    model="Qwen",
+    messages=[
+        {"role": "user", "content": "你好"}
+    ],
+    stream=False,
+    stop=[] # You can add custom stop words here, e.g., stop=["Observation:"] for ReAct prompting.
+)
+print(response.choices[0].message.content)
+```
+
+If you find `"POST /v1/chat/completions HTTP/1.1" 200 OK` in the `openai_api_server` log, the call was successful.
+
+vLLM does not support dynamic-NTK RoPE. Therefore, extending the Qwen models to long sequences may lead to quality degradation (even gibberish).
+
+### vLLM + Transformer-like Wrapper
+
+You can download the [wrapper code](vllm_wrapper.py) and run the following code for multi-turn dialogue. (Note: It currently only supports the ``model.chat()`` method.)
+
+```python
+from vllm_wrapper import vLLMWrapper
+
+# bfloat16 (the default) is only supported on GPUs with compute capability of at least 8.0
+model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)
+
+# run int4 model or GPUs with compute capability less than 8.0
+# model = vLLMWrapper('Qwen/Qwen-7B-Chat-Int4', tensor_parallel_size=1, dtype="float16")
+
+response, history = model.chat(query="你好", history=None)
+print(response)
+response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
+print(response)
+response, history = model.chat(query="给这个故事起一个标题", history=history)
+print(response)
+```
+### vLLM Standalone OpenAI-like API
+
+You can also deploy an OpenAI API server independently through vLLM. First, you need to download the [chat template file](template_chatml.jinja).
+
+Then, you can launch an OpenAI API server with the following command:
+
+```bash
+python -m vllm.entrypoints.openai.api_server --model $model_path --trust-remote-code --chat-template template_chatml.jinja
+
+# run int4 model or GPUs with compute capability less than 8.0
+# python -m vllm.entrypoints.openai.api_server --model $model_path --trust-remote-code --dtype float16 --chat-template template_chatml.jinja
+```
+
+For the vLLM standalone OpenAI API server, you need to set the `stop_token_ids` parameter to `[151645]` or the `stop` parameter to `["<|im_end|>"]` when invoking the server.
+
+```python
+import openai
+openai.api_base = "http://localhost:8000/v1"
+openai.api_key = "none"
+
+# create a request activating streaming response
+for chunk in openai.ChatCompletion.create(
+    model="Qwen",
+    messages=[
+        {"role": "user", "content": "你好"}
+    ],
+    stream=True, 
+    stop_token_ids=[151645]
+):
+    if hasattr(chunk.choices[0].delta, "content"):
+        print(chunk.choices[0].delta.content, end="", flush=True)
+
+# create a request not activating streaming response
+response = openai.ChatCompletion.create(
+    model="Qwen",
+    messages=[
+        {"role": "user", "content": "你好"}
+    ],
+    stream=False,
+    stop_token_ids=[151645]
+)
+print(response.choices[0].message.content)
+```
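The vLLM README above repeats one rule in several commands: use `bfloat16` on GPUs with compute capability 8.0 or higher, and fall back to `float16` for Int4 checkpoints or older GPUs. The following is a minimal sketch of that rule as a pure function; `choose_dtype` is a hypothetical helper, not part of vLLM or FastChat.

```python
from typing import Tuple


def choose_dtype(compute_capability: Tuple[int, int], is_int4: bool = False) -> str:
    """Return the --dtype value suggested by the README guidance."""
    major, _minor = compute_capability
    if is_int4 or major < 8:
        return "float16"
    return "bfloat16"


# With PyTorch installed, torch.cuda.get_device_capability() returns such a tuple.
print(choose_dtype((7, 0)))                # V100 -> float16
print(choose_dtype((8, 0)))                # A100 -> bfloat16
print(choose_dtype((8, 6), is_int4=True))  # Int4 checkpoint -> float16
```

The returned string can be passed directly as the value of `--dtype` in the worker or server commands.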
--- a/recipes/inference/vllm/template_chatml.jinja
+++ b/recipes/inference/vllm/template_chatml.jinja
@@ -0,0 +1,6 @@
+{% for message in messages %}
+{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}
+{{'<|im_start|>' + message['role'] + '\n' + message['content']}}
+{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '\n'}}{% endif %}
+{% endfor %}
+{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant\n' }}{% endif %}
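To make the branching in the template above easier to follow, here is a rough Python sketch of the same ChatML assembly. `render_chatml` is a hypothetical helper written for illustration; it mirrors the template's conditions but ignores the newlines that Jinja may emit between tags, so its output is not guaranteed to match the rendered template byte for byte.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"
# Default system prompt as written in the template (no trailing period).
DEFAULT_SYSTEM = "You are a helpful assistant"


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Build a ChatML prompt string from OpenAI-style messages."""
    if not messages:
        raise ValueError("messages must not be empty")
    parts = []
    # Inject the default system turn only when the caller gave none.
    if messages[0]["role"] != "system":
        parts.append(f"{IM_START}system\n{DEFAULT_SYSTEM}{IM_END}\n")
    last = len(messages) - 1
    for i, message in enumerate(messages):
        text = f"{IM_START}{message['role']}\n{message['content']}"
        # The final turn stays open unless a generation prompt is requested.
        if i != last or add_generation_prompt:
            text += f"{IM_END}\n"
        parts.append(text)
    # Open an assistant turn for the model to complete.
    if add_generation_prompt and messages[-1]["role"] != "assistant":
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Hello"}]))
```

Leaving the last turn open when `add_generation_prompt` is false lets the model continue a partially written message instead of starting a new one.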
--- a/recipes/inference/vllm/vllm_wrapper.py
+++ b/recipes/inference/vllm/vllm_wrapper.py
@@ -0,0 +1,239 @@
+from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList
+from typing import Optional, Callable, List, Tuple, Union
+import copy
+import torch
+from transformers import AutoTokenizer
+from transformers.generation.logits_process import LogitsProcessorList
+from packaging import version
+
+_ERROR_BAD_CHAT_FORMAT = """\
+We detect you are probably using the pretrained model (rather than chat model) for chatting, since the chat_format in generation_config is not "chatml".
+If you are directly using the model downloaded from Huggingface, please make sure you are using our "Qwen/Qwen-7B-Chat" Huggingface model (rather than "Qwen/Qwen-7B") when you call model.chat().
+我们检测到您可能在使用预训练模型（而非chat模型）进行多轮chat，因为您当前在generation_config指定的chat_format，并未设置为我们在对话中所支持的"chatml"格式。
+如果您在直接使用我们从Huggingface提供的模型，请确保您在调用model.chat()时，使用的是"Qwen/Qwen-7B-Chat"模型（而非"Qwen/Qwen-7B"预训练模型）。
+"""
+
+IMEND = "<|im_end|>"
+ENDOFTEXT = "<|endoftext|>"
+
+HistoryType = List[Tuple[str, str]]
+TokensType = List[int]
+BatchTokensType = List[List[int]]
+
+def get_stop_words_ids(chat_format, tokenizer):
+    if chat_format == "raw":
+        stop_words_ids = [tokenizer.encode("Human:"), [tokenizer.eod_id]]
+    elif chat_format == "chatml":
+        stop_words_ids = [[tokenizer.im_end_id], [tokenizer.im_start_id]]
+    else:
+        raise NotImplementedError(f"Unknown chat format {chat_format!r}")
+    return stop_words_ids
+
+def make_context(
+    tokenizer: PreTrainedTokenizer,
+    query: str,
+    history: Optional[List[Tuple[str, str]]] = None,
+    system: str = "",
+    max_window_size: int = 6144,
+    chat_format: str = "chatml",
+):
+    if history is None:
+        history = []
+
+    if chat_format == "chatml":
+        im_start_tokens = [tokenizer.im_start_id]
+        im_end_tokens = [tokenizer.im_end_id]
+        im_start, im_end = tokenizer.decode(im_start_tokens, skip_special_tokens=False), tokenizer.decode(im_end_tokens, skip_special_tokens=False)
+        nl_tokens = tokenizer.encode("\n")
+
+        def _tokenize_str(role, content):
+            return f"{role}\n{content}", tokenizer.encode(
+                role, allowed_special=set()
+            ) + nl_tokens + tokenizer.encode(content, allowed_special=set())
+
+        system_text, system_tokens_part = _tokenize_str("system", system)
+        system_tokens = im_start_tokens + system_tokens_part + im_end_tokens
+
+        raw_text = ""
+        context_tokens = []
+
+        for turn_query, turn_response in reversed(history):
+            query_text, query_tokens_part = _tokenize_str("user", turn_query)
+            query_tokens = im_start_tokens + query_tokens_part + im_end_tokens
+            response_text, response_tokens_part = _tokenize_str(
+                "assistant", turn_response
+            )
+            response_tokens = im_start_tokens + response_tokens_part + im_end_tokens
+
+            next_context_tokens = nl_tokens + query_tokens + nl_tokens + response_tokens
+            prev_chat = (
+                f"\n{im_start}{query_text}{im_end}\n{im_start}{response_text}{im_end}"
+            )
+
+            current_context_size = (
+                len(system_tokens) + len(next_context_tokens) + len(context_tokens)
+            )
+            if current_context_size < max_window_size:
+                context_tokens = next_context_tokens + context_tokens
+                raw_text = prev_chat + raw_text
+            else:
+                break
+
+        context_tokens = system_tokens + context_tokens
+        raw_text = f"{im_start}{system_text}{im_end}" + raw_text
+        context_tokens += (
+            nl_tokens
+            + im_start_tokens
+            + _tokenize_str("user", query)[1]
+            + im_end_tokens
+            + nl_tokens
+            + im_start_tokens
+            + tokenizer.encode("assistant")
+            + nl_tokens
+        )
+        raw_text += f"\n{im_start}user\n{query}{im_end}\n{im_start}assistant\n"
+
+    elif chat_format == "raw":
+        raw_text = query
+        context_tokens = tokenizer.encode(raw_text)
+    else:
+        raise NotImplementedError(f"Unknown chat format {chat_format!r}")
+
+    return raw_text, context_tokens
+
+class vLLMWrapper:
+    def __init__(self,
+               model_dir: str,
+               trust_remote_code: bool = True,
+               tensor_parallel_size: int = 1,
+               gpu_memory_utilization: float = 0.98,
+               dtype: str = "bfloat16",
+               **kwargs):
+
+        if dtype not in ("bfloat16", "float16", "float32"):
+            raise ValueError(f"Unsupported dtype {dtype!r}; expected 'bfloat16', 'float16', or 'float32'.")
+
+        # build generation_config
+        self.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=trust_remote_code)
+
+        # build tokenizer
+        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=trust_remote_code)
+        self.tokenizer.eos_token_id = self.generation_config.eos_token_id
+
+        self.stop_words_ids = []
+
+        from vllm import LLM
+        import vllm
+        # repetition_penalty is only passed to SamplingParams when vLLM is 0.2.2 or newer
+        self.__vllm_support_repetition_penalty = version.parse(vllm.__version__) >= version.parse("0.2.2")
+
+        # kwargs is a dict, so read the option with .get(); getattr() on it would always return None
+        quantization = kwargs.get('quantization', None)
+
+        self.model = LLM(model=model_dir,
+                            tokenizer=model_dir,
+                            tensor_parallel_size=tensor_parallel_size,
+                            trust_remote_code=trust_remote_code,
+                            quantization=quantization,
+                            gpu_memory_utilization=gpu_memory_utilization,
+                            dtype=dtype)
+
+        for stop_id in get_stop_words_ids(self.generation_config.chat_format, self.tokenizer):
+            self.stop_words_ids.extend(stop_id)
+        self.stop_words_ids.extend([self.generation_config.eos_token_id])
+
+    def chat(self,
+        query: str,
+        history: Optional[HistoryType],
+        tokenizer: PreTrainedTokenizer = None,
+        system: str = "You are a helpful assistant.",
+        generation_config: Optional[GenerationConfig] = None,
+        **kwargs):
+        """Run one chat turn and return (response, updated history); the caller's history is not modified."""
+        generation_config = generation_config if generation_config is not None else self.generation_config
+        tokenizer = self.tokenizer if tokenizer is None else tokenizer
+
+        assert generation_config.chat_format == 'chatml', _ERROR_BAD_CHAT_FORMAT
+        if not self.__vllm_support_repetition_penalty and generation_config.repetition_penalty != 1:
+            raise RuntimeError("The installed vLLM doesn't support repetition_penalty, please set ``model.generation_config.repetition_penalty = 1`` or install vllm>=0.2.2")
+
+        if history is None:
+            history = []
+        else:
+            # make a copy of the user's input so that it is left untouched
+            history = copy.deepcopy(history)
+
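+        # Optional per-call stop words, given as a list of token-id lists
+        extra_stop_words_ids = kwargs.get('stop_words_ids', None)
+        if extra_stop_words_ids is None:
+            extra_stop_words_ids = []
+        # stop_token_ids is a flat list, so flatten the extras the same way __init__ flattens the defaults
+        stop_token_ids = list(self.stop_words_ids)
+        for stop_ids in extra_stop_words_ids:
+            stop_token_ids.extend(stop_ids)
+
+        max_window_size = kwargs.get('max_window_size', None)
+        if max_window_size is None:
+            max_window_size = generation_config.max_window_size
+
+        from vllm.sampling_params import SamplingParams
+        sampling_kwargs = {
+            "stop_token_ids": stop_token_ids,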
+            "early_stopping": False,
+            "top_p": generation_config.top_p,
+            "top_k": -1 if generation_config.top_k == 0 else generation_config.top_k,
+            "temperature": generation_config.temperature,
+            "max_tokens": generation_config.max_new_tokens,
+            "repetition_penalty": generation_config.repetition_penalty
+        }
+        if not self.__vllm_support_repetition_penalty:
+            sampling_kwargs.pop("repetition_penalty")
+        sampling_params = SamplingParams(**sampling_kwargs)
+
+        raw_text, context_tokens = make_context(
+            self.tokenizer,
+            query,
+            history=history,
+            system=system,
+            max_window_size=max_window_size,
+            chat_format=generation_config.chat_format,
+        )
+
+        req_outputs = self.model.generate([query],
+                                            sampling_params=sampling_params,
+                                            prompt_token_ids=[context_tokens])
+        req_output = req_outputs[0]
+
+        prompt_str = req_output.prompt
+        prompt_ids = req_output.prompt_token_ids
+        req_sample_output_ids = []
+        req_sample_output_strs = []
+        for sample in req_output.outputs:
+            output_str = sample.text
+            output_ids = sample.token_ids
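+            # Strip a special token only when it is trailing; slicing on a mid-string match would cut real text
+            if output_str.endswith(IMEND):
+                output_str = output_str[:-len(IMEND)]
+            if output_str.endswith(ENDOFTEXT):
+                output_str = output_str[:-len(ENDOFTEXT)]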
+            req_sample_output_ids.append(prompt_ids + output_ids)
+            req_sample_output_strs.append(prompt_str + output_str)
+        assert len(req_sample_output_strs) == 1
+        response = req_sample_output_strs[0][len(prompt_str):]
+        # Record the turn as (user query, model response), the shape make_context expects
+        history.append((query, response))
+
+        return response, history
+
+if __name__ == '__main__':
+
+    model_dir = 'Qwen/Qwen-72B-Chat'
+    tensor_parallel_size = 2
+
+    model = vLLMWrapper(model_dir,
+                        tensor_parallel_size=tensor_parallel_size,
+                        )
+
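+    # Query: "Hello"
+    response, history = model.chat(query="你好",
+                                   history=None)
+    print(response)
+    # Query: "Tell me a story about a young person who works hard to start a business and finally succeeds."
+    response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。",
+                                   history=history)
+    print(response)
+    # Query: "Give this story a title"
+    response, history = model.chat(query="给这个故事起一个标题",
+                                   history=history)
+    print(response)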
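
Reviewer note on `make_context`: the following is a minimal standalone sketch of the ChatML layout and the newest-first history truncation that the function appears to implement. It is not part of `vllm_wrapper.py`. `build_chatml_prompt` and `count_tokens` are hypothetical names, and the whitespace word count is an assumption standing in for the real Qwen tokenizer, so actual token counts will differ.

```python
from typing import List, Tuple

IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def count_tokens(text: str) -> int:
    # Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def build_chatml_prompt(
    query: str,
    history: List[Tuple[str, str]],
    system: str = "You are a helpful assistant.",
    max_window_size: int = 6144,
) -> str:
    system_text = f"{IM_START}system\n{system}{IM_END}"
    used = count_tokens(system_text)
    kept = ""
    # Walk the history from newest to oldest and stop at the first turn that does not fit.
    for turn_query, turn_response in reversed(history):
        turn_text = (
            f"\n{IM_START}user\n{turn_query}{IM_END}"
            f"\n{IM_START}assistant\n{turn_response}{IM_END}"
        )
        turn_size = count_tokens(turn_text)
        if used + turn_size >= max_window_size:
            break
        kept = turn_text + kept
        used += turn_size
    # The prompt ends with an open assistant turn for the model to complete.
    return f"{system_text}{kept}\n{IM_START}user\n{query}{IM_END}\n{IM_START}assistant\n"
```

Because truncation starts from the most recent turn, a small window drops the oldest exchanges first while the system message and the current query are always kept.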