update gifs

2026-05-20 16:35:47 +08:00 · 2023-08-16 16:16:25 +08:00
parent 4957c33d18
commit 512f90a069
6 changed files with 99 additions and 39 deletions
--- a/README.md
+++ b/README.md
@@ -27,7 +27,6 @@ Qwen-7B is the 7B-parameter version of the large language model series, Qwen (ab

 The following sections include information that you might find it helpful. Specifically, we advise you to read the FAQ section before you launch issues.

-
 ## News

 * 2023.8.3 We release both Qwen-7B and Qwen-7B-Chat on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
@@ -250,11 +249,11 @@ Note: The GPU memory usage profiling in the above table is performed on single A

 We measured the average inference speed of generating 2K tokens under BF16 precision and Int8 or NF4 quantization levels, respectively.

-| Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
-| ------ | :---------------------------: | :---------------------------: |
-| BF16 (no quantization) | 30.06 | 27.55 |
-| Int8 (bnb) | 7.94 | 7.86 |
-| NF4 (bnb) | 21.43 | 20.37 |
+| Quantization Level     | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
+| ---------------------- | :----------------------------------------: | :---------------------------------------: |
+| BF16 (no quantization) |                   30.06                    |                   27.55                   |
+| Int8 (bnb)             |                    7.94                    |                   7.86                    |
+| NF4 (bnb)              |                   21.43                    |                   20.37                   |

 In detail, the setting of profiling is generating 2048 new tokens with 1 context token. The profiling runs on single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the generated 2048 tokens.

@@ -265,30 +264,23 @@ We also profile the peak GPU memory usage for encoding 2048 tokens as context (a
 When using flash attention, the memory usage is:

 | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
-| --- | :---: | :---: |
-| BF16 | 18.11GB | 23.52GB |
-| Int8 | 12.17GB | 17.60GB |
-| NF4 | 9.52GB | 14.93GB |
+| ------------------ | :---------------------------------: | :-----------------------------------: |
+| BF16               |               18.11GB               |                23.52GB                |
+| Int8               |               12.17GB               |                17.60GB                |
+| NF4                |               9.52GB                |                14.93GB                |

 When not using flash attention, the memory usage is:

 | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
-| --- | :---: | :---: |
-| BF16 | 18.11GB | 24.40GB |
-| Int8 | 12.18GB | 18.47GB |
-| NF4 | 9.52GB | 15.81GB |
+| ------------------ | :---------------------------------: | :-----------------------------------: |
+| BF16               |               18.11GB               |                24.40GB                |
+| Int8               |               12.18GB               |                18.47GB                |
+| NF4                |               9.52GB                |                15.81GB                |

 The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).

 ## Demo

-### CLI Demo
-
-We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns model outputs in the streaming mode. Run the command below:
-
-```
-python cli_demo.py
-```

 ### Web UI

@@ -304,16 +296,40 @@ Then run the command below and click on the generated link:
 python web_demo.py
 ```

+<p align="center">
+    <br>
+    <img src="assets/web_demo.gif" width="600" />
+    <br>
+<p>
+
+### CLI Demo
+
+We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns model outputs in the streaming mode. Run the command below:
+
+```
+python cli_demo.py
+```
+
+<p align="center">
+    <br>
+    <img src="assets/cli_demo.gif" width="600" />
+    <br>
+<p>
+
 ## API
+
 We provide methods to deploy local API based on OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:

 ```bash
 pip install fastapi uvicorn openai pydantic sse_starlette
 ```
+
 Then run the command to deploy your API:
+
 ```bash
 python openai_api.py
 ```
+
 You can change your arguments, e.g., `-c` for checkpoint name or path, `--cpu-only` for CPU deployment, etc. If you meet problems launching your API deployment, updating the packages to the latest version can probably solve them.

 Using the API is also simple. See the example below:
@@ -345,6 +361,11 @@ response = openai.ChatCompletion.create(
 print(response.choices[0].message.content)
 ```

+<p align="center">
+    <br>
+    <img src="assets/openai_api.gif" width="600" />
+    <br>
+<p>

 ## Tool Usage

--- a/README_CN.md
+++ b/README_CN.md
@@ -280,19 +280,10 @@ model = AutoModelForCausalLM.from_pretrained(
 | Int8 | 12.18GB | 18.47GB |
 | NF4 | 9.52GB | 15.81GB |

-
 以上测速和显存占用情况，均可通过该[评测脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)测算得到。

 ## Demo

-### 交互式Demo
-
-我们提供了一个简单的交互式Demo示例，请查看`cli_demo.py`。当前模型已经支持流式输出，用户可通过输入文字的方式和Qwen-7B-Chat交互，模型将流式输出返回结果。运行如下命令：
-
-```
-python cli_demo.py
-```
-
 ### Web UI

 我们提供了Web UI的demo供用户使用 (感谢 @wysaid 支持)。在开始前，确保已经安装如下代码库：
@@ -307,16 +298,41 @@ pip install -r requirements_web_demo.txt
 python web_demo.py
 ```

+<p align="center">
+    <br>
+    <img src="assets/web_demo.gif" width="600" />
+    <br>
+<p>
+
+
+### 交互式Demo
+
+我们提供了一个简单的交互式Demo示例，请查看`cli_demo.py`。当前模型已经支持流式输出，用户可通过输入文字的方式和Qwen-7B-Chat交互，模型将流式输出返回结果。运行如下命令：
+
+```
+python cli_demo.py
+```
+
+<p align="center">
+    <br>
+    <img src="assets/cli_demo.gif" width="600" />
+    <br>
+<p>
+
 ## API
+
 我们提供了OpenAI API格式的本地API部署方法（感谢@hanpenggit）。在开始之前先安装必要的代码库：

 ```bash
 pip install fastapi uvicorn openai pydantic sse_starlette
 ```
+
 随后即可运行以下命令部署你的本地API：
+
 ```bash
 python openai_api.py
 ```
+
 你也可以修改参数，比如`-c`来修改模型名称或路径, `--cpu-only`改为CPU部署等等。如果部署出现问题，更新上述代码库往往可以解决大多数问题。

 使用API同样非常简单，示例如下：
@@ -348,6 +364,11 @@ response = openai.ChatCompletion.create(
 print(response.choices[0].message.content)
 ```

+<p align="center">
+    <br>
+    <img src="assets/openai_api.gif" width="600" />
+    <br>
+<p>

 ## 工具调用

@@ -405,7 +426,6 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct

 如遇到问题，敬请查阅[FAQ](FAQ_zh.md)以及issue区，如仍无法解决再提交issue。

-
 ## 使用协议

 研究人员与开发者可使用Qwen-7B和Qwen-7B-Chat或进行二次开发。我们同样允许商业使用，具体细节请查看[LICENSE](LICENSE)。如需商用，请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。
--- a/README_JA.md
+++ b/README_JA.md
@@ -285,14 +285,6 @@ Flash attentionを使用しない場合、メモリ使用量は次のように

 ## デモ

-### CLI デモ
-
-`cli_demo.py` に CLI のデモ例を用意しています。ユーザはプロンプトを入力することで Qwen-7B-Chat と対話することができ、モデルはストリーミングモードでモデルの出力を返します。以下のコマンドを実行する：
-
-```
-python cli_demo.py
-```
-
 ### ウェブ UI

 ウェブUIデモを構築するためのコードを提供します（@wysaidに感謝）。始める前に、以下のパッケージがインストールされていることを確認してください：
@@ -307,7 +299,28 @@ pip install -r requirements_web_demo.txt
 python web_demo.py
 ```

+<p align="center">
+    <br>
+    <img src="assets/web_demo.gif" width="600" />
+    <br>
+<p>
+
+### CLI デモ
+
+`cli_demo.py` に CLI のデモ例を用意しています。ユーザはプロンプトを入力することで Qwen-7B-Chat と対話することができ、モデルはストリーミングモードでモデルの出力を返します。以下のコマンドを実行する：
+
+```
+python cli_demo.py
+```
+
+<p align="center">
+    <br>
+    <img src="assets/cli_demo.gif" width="600" />
+    <br>
+<p>
+
 ## API
+
 OpenAI APIをベースにローカルAPIをデプロイする方法を提供する（@hanpenggitに感謝）。始める前に、必要なパッケージをインストールしてください：

 ```bash
@@ -351,6 +364,12 @@ response = openai.ChatCompletion.create(
 print(response.choices[0].message.content)
 ```

+<p align="center">
+    <br>
+    <img src="assets/openai_api.gif" width="600" />
+    <br>
+<p>
+
 ## ツールの使用

 Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利用に特化して最適化されており、ユーザは独自の Qwen-7B ベースの LangChain、エージェント、コードインタプリタを構築することができます。ツール利用能力を評価するための評価[ベンチマーク](eval/EVALUATION.md)では、Qwen-7B は安定した性能に達しています。
--- a/assets/cli_demo.gif
+++ b/assets/cli_demo.gif
--- a/assets/openai_api.gif
+++ b/assets/openai_api.gif
--- a/assets/web_demo.gif
+++ b/assets/web_demo.gif