mirror of
https://github.com/QwenLM/Qwen.git
synced 2026-05-21 00:45:48 +08:00
fix single-gpu qlora, and add profiling
89
README.md
@@ -15,9 +15,9 @@
</p>
<br><br>

|     | Qwen-Chat | Qwen-Chat (Int4) | Qwen |
|-----|:---------:|:----------------:|:----:|
| 7B  | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
| 14B | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B">🤗</a> |

@@ -60,20 +60,20 @@ Qwen-14B and Qwen-7B (this is the new version trained with more tokens and the c
<p>
<br>

| Model              |   MMLU   |  C-Eval  |  GSM8K   |   MATH   | HumanEval |   MBPP   |   BBH    |  CMMLU   |
|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
|                    |  5-shot  |  5-shot  |  8-shot  |  4-shot  |  0-shot   |  3-shot  |  3-shot  |  5-shot  |
| LLaMA2-7B          |   46.8   |   32.5   |   16.7   |   3.3    |   12.8    |   20.8   |   38.2   |   31.8   |
| LLaMA2-13B         |   55.0   |   41.4   |   29.6   |   5.0    |   18.9    |   30.3   |   45.6   |   38.4   |
| LLaMA2-34B         |   62.6   |    -     |   42.2   |   6.2    |   22.6    |   33.0   |   44.1   |    -     |
| ChatGLM2-6B        |   47.9   |   51.7   |   32.4   |   6.5    |     -     |    -     |   33.7   |    -     |
| InternLM-7B        |   51.0   |   53.4   |   31.2   |   6.3    |   10.4    |   14.0   |   37.0   |   51.8   |
| InternLM-20B       |   62.1   |   58.8   |   52.6   |   7.9    |   25.6    |   35.6   |   52.5   |   59.0   |
| Baichuan2-7B       |   54.7   |   56.3   |   24.6   |   5.6    |   18.3    |   24.2   |   41.6   |   57.1   |
| Baichuan2-13B      |   59.5   |   59.0   |   52.8   |   10.1   |   17.1    |   30.2   |   49.0   |   62.0   |
| Qwen-7B (original) |   56.7   |   59.6   |   51.6   |   10.4   |   24.4    |   31.2   |   40.6   |   58.8   |
| **Qwen-7B**        |   58.2   |   63.5   |   51.7   |   11.6   |   29.9    |   31.6   |   45.0   |   62.2   |
| **Qwen-14B**       | **66.3** | **72.1** | **61.3** | **24.8** | **32.3**  | **40.8** | **53.4** | **71.0** |

For all compared models, we report the higher of their officially reported score and the score on [OpenCompass](https://opencompass.org.cn/leaderboard-llm).

@@ -274,8 +274,8 @@ We also profile the peak GPU memory usage for encoding 2048 tokens as context (a
|----------------------|:-----------------------------------:|:-------------------------------------:|
| Qwen-7B-Chat (BF16)  |               17.66GB               |                22.58GB                |
| Qwen-7B-Chat (Int4)  |               8.21GB                |                13.62GB                |
| Qwen-14B-Chat (BF16) |               30.15GB               |                38.94GB                |
| Qwen-14B-Chat (Int4) |               13.00GB               |                21.79GB                |

The above speed and memory profiling was conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
<br><br>
@@ -308,17 +308,17 @@ We use BF16 models, and generate 1024 tokens (seq-length=1024) by default, and o

With kv-cache quantization turned on, we can run a larger batch size (bs).

| USE KVCache |  bs=1  |  bs=4  | bs=16  | bs=32  | bs=64  | bs=100 |
|-------------|:------:|:------:|:------:|:------:|:------:|:------:|
| no          | 16.3GB | 24.1GB | 31.7GB | 48.7GB |  oom   |  oom   |
| yes         | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |

With kv-cache quantization turned on, the model also saves more memory when generating longer sequences (sl, the number of tokens generated) at inference.

| USE KVCache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
|-------------|:------:|:-------:|:-------:|:-------:|:-------:|
| no          | 15.2GB | 16.3GB  | 17.6GB  | 19.5GB  | 23.2GB  |
| yes         |  15GB  | 15.5GB  | 15.8GB  | 16.6GB  | 17.6GB  |
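
To make the trend in the sequence-length table easier to read, the short sketch below computes the absolute and relative memory saving for each column. The numbers are copied from the table; the helper name `saving` is only illustrative.

```python
# Peak GPU memory (GB) copied from the sequence-length table above.
seq_lengths = [512, 1024, 2048, 4096, 8192]
mem_without_quant = [15.2, 16.3, 17.6, 19.5, 23.2]  # kv-cache quantization off
mem_with_quant = [15.0, 15.5, 15.8, 16.6, 17.6]     # kv-cache quantization on


def saving(off_gb, on_gb):
    """Return (GB saved, percent saved), both rounded to one decimal."""
    saved = off_gb - on_gb
    return round(saved, 1), round(100 * saved / off_gb, 1)


savings = {
    sl: saving(off_gb, on_gb)
    for sl, off_gb, on_gb in zip(seq_lengths, mem_without_quant, mem_with_quant)
}

for sl, (gb, pct) in savings.items():
    print(f"sl={sl}: saves {gb}GB ({pct}%)")
```

The saving grows with the generated length, from 0.2GB at sl=512 to 5.6GB at sl=8192, which is what one would expect when the cache is the part of memory that scales with the number of tokens.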

### Difference of Storage in layer-past
With kv-cache quantization turned on, the model converts the format of layer-past from float to int8, and the quantized layer-past also stores the quantization parameters of the current value.
@@ -343,9 +343,11 @@ you can use the dequantization operation to convert the int8 key/value back to t
```
v=dequantize_cache_torch(qv,scale,zero_point)
```
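
As a rough illustration of how a value, a scale, and a zero point relate to each other, here is a minimal NumPy sketch of asymmetric int8 quantization and its inverse. It is not the repository's implementation: the function names `quantize_cache` and `dequantize_cache` are hypothetical, and computing the range per row along the last axis is an assumption made for the example.

```python
import numpy as np


def quantize_cache(x, qmin=-128, qmax=127):
    """Hypothetical helper: asymmetric int8 quantization along the last axis."""
    x_min = x.min(axis=-1, keepdims=True)
    x_max = x.max(axis=-1, keepdims=True)
    # One quantization step; the floor avoids division by zero for constant rows.
    scale = np.maximum(x_max - x_min, 1e-8) / (qmax - qmin)
    # Integer offset chosen so that x_min maps (approximately) to qmin.
    zero_point = np.round(qmin - x_min / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point


def dequantize_cache(q, scale, zero_point):
    """Hypothetical helper: map int8 values back to float32."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
v = rng.standard_normal((4, 16)).astype(np.float32)  # stand-in for a value tensor
qv, scale, zero_point = quantize_cache(v)
v_restored = dequantize_cache(qv, scale, zero_point)
print("max abs error:", float(np.abs(v - v_restored).max()))
```

The restored values differ from the originals by at most about one quantization step (`scale`), which is the price paid for storing each element in a single byte plus the per-row parameters.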
<br>

## Finetuning

### Usage
We now provide the official training script, `finetune.py`, so that users can finetune the pretrained model for downstream applications in a simple fashion. We also provide shell scripts that launch finetuning for you. The training script supports training with [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The shell scripts that we provide use DeepSpeed (note: this may conflict with the latest version of pydantic) and Peft. You can install them with:
```bash
pip install peft deepspeed
@@ -395,14 +397,16 @@ sh finetune/finetune_lora_single_gpu.sh
sh finetune/finetune_lora_ds.sh
```

Compared with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of the adapter layers and keeps the original large language model layers frozen. This requires much less memory and therefore less computation. However, if you still run out of memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model and other techniques such as paged optimizers to reduce memory costs even further. To run Q-LoRA, directly run the following script (for Q-LoRA, we have temporarily found problems with mixed-precision training in the single-GPU setup; we'll fix it as soon as possible):

Compared with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of the adapter layers and keeps the original large language model layers frozen. This requires much less memory and therefore less computation. However, if you still run out of memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model and other techniques such as paged optimizers to reduce memory costs even further. To run Q-LoRA, directly run the following script:

```bash
# Single GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. However, unlike full-parameter finetuning and LoRA, Q-LoRA only supports fp16.

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. You **SHOULD NOT** use the bf16 models. Unlike full-parameter finetuning and LoRA, Q-LoRA only supports fp16.

Unlike full-parameter finetuning, training with LoRA or Q-LoRA only saves the adapter parameters. If your training starts from Qwen-7B, you can load the finetuned model for inference as shown below:

@@ -416,8 +420,33 @@ model = AutoPeftModelForCausalLM.from_pretrained(
).eval()
```

The shell scripts use `torchrun` to run single-GPU or multi-GPU training. For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine.
<br><br>
For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine.
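
As a hedged illustration of which distributed hyperparameters are usually involved, the sketch below assembles a `torchrun` command line from the standard launcher options (`--nproc_per_node`, `--nnodes`, `--node_rank`, `--master_addr`, `--master_port`). The helper `torchrun_command`, the address `10.0.0.1`, and the default port are hypothetical values chosen for the example; the provided shell scripts may set these differently, and any arguments that `finetune.py` itself requires are left out.

```python
import shlex


def torchrun_command(script, nproc_per_node, nnodes=1, node_rank=0,
                     master_addr="localhost", master_port=6001, script_args=()):
    """Hypothetical helper: build a torchrun command line as a single string."""
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",  # GPUs (processes) on this machine
        f"--nnodes={nnodes}",                  # total number of machines
        f"--node_rank={node_rank}",            # index of this machine, starting at 0
        f"--master_addr={master_addr}",        # address of the rank-0 machine
        f"--master_port={master_port}",        # free port on the rank-0 machine
        script,
        *script_args,
    ]
    return shlex.join(cmd)


# Example: first of two machines, each with 8 GPUs (values are illustrative).
command = torchrun_command("finetune.py", nproc_per_node=8, nnodes=2,
                           node_rank=0, master_addr="10.0.0.1")
print(command)
```

On the second machine only `node_rank` would change; the address and port must point at the same rank-0 machine on every node.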

### Profiling of Memory and Speed
We profile the GPU memory and training speed of both LoRA and Q-LoRA in the single-GPU training setup. In this test, we experiment on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) for inputs of different lengths, namely 256, 512, 1024, and 2048. The statistics are listed below:

<table>
    <tr>
        <th rowspan="2">Model Size</th><th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th>
    </tr>
    <tr>
        <th align="center">256</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th>
    </tr>
    <tr>
        <th rowspan="2">7B</th><td>LoRA</td><td align="center">33.5G / 1.6s/it</td><td align="center">34.0G / 1.7s/it</td><td align="center">35.0G / 3.0s/it</td><td align="center">35.0G / 5.7s/it</td>
    </tr>
    <tr>
        <td>Q-LoRA</td><td align="center">11.5G / 3.0s/it</td><td align="center">12.2G / 3.6s/it</td><td align="center">12.7G / 4.8s/it</td><td align="center">13.9G / 7.3s/it</td>
    </tr>
    <tr>
        <th rowspan="2">14B</th><td>LoRA</td><td align="center">51.0G / 2.1s/it</td><td align="center">51.0G / 2.7s/it</td><td align="center">51.5G / 5.0s/it</td><td align="center">53.9G / 9.2s/it</td>
    </tr>
    <tr>
        <td>Q-LoRA</td><td align="center">18.3G / 5.4s/it</td><td align="center">18.4G / 6.4s/it</td><td align="center">18.5G / 8.5s/it</td><td align="center">19.9G / 12.4s/it</td>
    </tr>
</table>
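
To summarize the trade-off in the table, the following sketch computes how much memory Q-LoRA uses relative to LoRA and how much slower each iteration is. The figures are copied from the table above; the names `profile` and `compare` are illustrative only.

```python
# (memory in GB, seconds per iteration) copied from the profiling table above,
# ordered by sequence length.
seq_lengths = [256, 512, 1024, 2048]
profile = {
    ("7B", "LoRA"): [(33.5, 1.6), (34.0, 1.7), (35.0, 3.0), (35.0, 5.7)],
    ("7B", "Q-LoRA"): [(11.5, 3.0), (12.2, 3.6), (12.7, 4.8), (13.9, 7.3)],
    ("14B", "LoRA"): [(51.0, 2.1), (51.0, 2.7), (51.5, 5.0), (53.9, 9.2)],
    ("14B", "Q-LoRA"): [(18.3, 5.4), (18.4, 6.4), (18.5, 8.5), (19.9, 12.4)],
}


def compare(model_size, seq_length):
    """Return (Q-LoRA memory / LoRA memory, Q-LoRA time / LoRA time)."""
    i = seq_lengths.index(seq_length)
    lora_mem, lora_time = profile[(model_size, "LoRA")][i]
    qlora_mem, qlora_time = profile[(model_size, "Q-LoRA")][i]
    return round(qlora_mem / lora_mem, 2), round(qlora_time / lora_time, 2)


for size in ("7B", "14B"):
    mem_ratio, time_ratio = compare(size, 2048)
    print(f"{size} @ 2048: {mem_ratio:.2f}x memory, {time_ratio:.2f}x time per iteration")
```

At a sequence length of 2048, Q-LoRA needs roughly 40% of the memory of LoRA for both model sizes while each iteration takes about 1.3 times as long.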

<br>

## Demo
