mirror of
https://github.com/QwenLM/Qwen.git
synced 2026-05-20 08:25:47 +08:00
update README
This commit is contained in:
@@ -451,7 +451,7 @@ We illustrate the model performance of both BF16, Int8 and Int4 models on the be
|
||||
### Quantization of KV cache
|
||||
|
||||
> NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality
|
||||
> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_245.cu`) may be missing. Please manually download
|
||||
> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_256.cu`) may be missing. Please manually download
|
||||
> them from the Hugging Face Hub and place them into the same folder as the other module files.
|
||||
|
||||
The attention KV cache can be quantized and compressed for storage, to get a higher sample throughput. The arguments `use_cache_quantization` and `use_cache_kernel` in `config.json` are provided to enable KV cache quantization. The specific use method is as follows:
|
||||
@@ -779,7 +779,6 @@ Our provided scripts support multinode finetuning. You can refer to the comments
|
||||
Note: DeepSpeed ZeRO 3 requires much greater inter-node communication rate than ZeRO 2, which will significantly reduce the training speed in the case of multinode finetuning. Therefore, we do not recommend using DeepSpeed ZeRO 3 configurations in multinode finetuning scripts.
|
||||
|
||||
### Profiling of Memory and Speed
|
||||
|
||||
We profile the GPU memory and training speed of both LoRA (LoRA (emb) refers to training the embedding and output layer, while LoRA has no trainable embedding and output layer) and Q-LoRA in the setup of single-GPU training. In this test, we experiment on a single A100-SXM4-80G GPU, and we use CUDA 11.8 and Pytorch 2.0. Flash attention 2 is applied. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 256, 512, 1024, 2048, 4096, and 8192. We also report the statistics of full-parameter finetuning with Qwen-7B on 2 A100 GPUs. We only report the statistics of 256, 512, and 1024 tokens due to the limitation of GPU memory.
|
||||
|
||||
For Qwen-7B, we also test the performance of multinode finetuning. We experiment using two servers, each containing two A100-SXM4-80G GPUs, and the rest of configurations are the same as other Qwen-7B experiments. The results of multinode finetuning are marked as LoRA (multinode) in the table.
|
||||
@@ -872,7 +871,6 @@ The statistics are listed below:
|
||||
|
||||
<br>
|
||||
|
||||
|
||||
## Deployment
|
||||
|
||||
### vLLM
|
||||
|
||||
@@ -448,7 +448,7 @@ response, history = model.chat(tokenizer, "Hi", history=None)
|
||||
|
||||
### KV cache量化
|
||||
|
||||
> 注意:由于Hugging Face的内部实现,本功能的支持文件`cache_autogptq_cuda_356.cpp`与`cache_autogptq_cuda_kernel_245.cu`可能没被下载。如需开启使用,请手动从相关位置下载,并放置到相应文件中。
|
||||
> 注意:由于Hugging Face的内部实现,本功能的支持文件`cache_autogptq_cuda_256.cpp`与`cache_autogptq_cuda_kernel_256.cu`可能没被下载。如需开启使用,请手动从相关位置下载,并放置到相应文件中。
|
||||
|
||||
在模型推理时,我们可以将中间结果key以及value的值量化后压缩存储,这样便可以在相同的卡上存储更多的key以及value,增加样本吞吐。
|
||||
|
||||
|
||||
@@ -450,7 +450,7 @@ Ilustramos el rendimiento de los modelos BF16, Int8 e Int4 en la prueba de refer
|
||||
### Cuantización de la caché KV
|
||||
|
||||
> NOTA: Por favor, ten en cuenta que debido al mecanismo interno de Hugging Face, los archivos de soporte para esta funcionalidad
|
||||
> (es decir, `cache_autogptq_cuda_256.cpp` y `cache_autogptq_cuda_kernel_245.cu`).
|
||||
> (es decir, `cache_autogptq_cuda_256.cpp` y `cache_autogptq_cuda_kernel_256.cu`).
|
||||
> Por favor, descárguelos manualmente del Hugging Face Hub y colóquelos en la misma carpeta que los demás archivos del módulo.
|
||||
|
||||
La caché KV de atención puede cuantificarse y comprimirse para su almacenamiento, con el fin de obtener un mayor rendimiento de la muestra. Los argumentos `use_cache_quantization` y `use_cache_kernel` en `config.json` se proporcionan para habilitar la cuantización de la caché KV.
|
||||
|
||||
@@ -451,7 +451,7 @@ Nous illustrons les performances des modèles BF16, Int8 et Int4 sur le benchmar
|
||||
### Quantization du cache KV
|
||||
|
||||
> NOTE : Veuillez noter qu'en raison du mécanisme interne de Hugging Face, les fichiers de support pour cette fonctionnalité
|
||||
> (i.e., `cache_autogptq_cuda_256.cpp` et `cache_autogptq_cuda_kernel_245.cu`) peuvent être manquants.
|
||||
> (i.e., `cache_autogptq_cuda_256.cpp` et `cache_autogptq_cuda_kernel_256.cu`) peuvent être manquants.
|
||||
> Veuillez les télécharger manuellement manuellement depuis le Hugging Face Hub et placez-les dans le même dossier que les autres fichiers du module.
|
||||
|
||||
Le cache KV de l'attention peut être quantifié et compressé pour le stockage, afin d'obtenir un débit d'échantillonnage plus élevé. Les arguments `use_cache_quantization` et `use_cache_kernel` dans `config.json` sont fournis pour activer la quantification du cache KV.
|
||||
|
||||
@@ -447,7 +447,7 @@ response, history = model.chat(tokenizer, "Hi", history=None)
|
||||
### KVキャッシュ量子化
|
||||
|
||||
> 注意: Hugging Faceの内部メカニズムにより、この機能のサポートファイル
|
||||
> (すなわち、`cache_autogptq_cuda_256.cpp`と`cache_autogptq_cuda_kernel_245.cu`)が欠落している可能性があります。以下を手動でダウンロードしてください。
|
||||
> (すなわち、`cache_autogptq_cuda_256.cpp`と`cache_autogptq_cuda_kernel_256.cu`)が欠落している可能性があります。以下を手動でダウンロードしてください。
|
||||
> Hugging Face Hubから手動でダウンロードし、他のモジュールファイルと同じフォルダに入れてください。
|
||||
|
||||
アテンション KV キャッシュを量子化して圧縮して保存すると、サンプルのスループットが向上する。この機能を有効にするには、`config.json` に `use_cache_quantization` と `use_cache_kernel` という引数を指定する。
|
||||
|
||||
Reference in New Issue
Block a user