add 72B and 1.8B Qwen models, add Ascend 910 and Hygon DCU support, add docker support

This commit is contained in:
yangapku
2023-11-30 15:29:13 +08:00
parent 981c89b2a9
commit e8e15962d8
52 changed files with 6139 additions and 1435 deletions

14
.dockerignore Normal file
View File

@@ -0,0 +1,14 @@
__pycache__
*.so
build
.coverage_*
*.egg-info
*~
.vscode/
.idea/
.git/
.github/
.DS_Store
/private/
/README-docker.md

7
FAQ.md
View File

@@ -81,3 +81,10 @@ However, temporarily we do not support RLHF. We will provide the code in the nea
In our training, we only use `<|endoftext|>` as the separator and padding token. You can set bos_id, eos_id, and pad_id to tokenizer.eod_id. Learn more about our tokenizer from our documents about the tokenizer.
## Docker
#### Download official docker image is very slow
When downloading our official docker image, you may have a slow download speed due to some network issues. You can refer to [Alibaba Cloud Container Image Service](https://help.aliyun.com/zh/acr/user-guide/accelerate-the-pulls-of-docker-official-images) to accelerate the download of official images.

View File

@@ -76,3 +76,9 @@ Qwen当前支持流式推理。见位于`modeling_qwen.py`的`chat_stream`函数
在训练过程中,我们仅使用<|endoftext|>这一token作为sample/document之间的分隔符及padding位置占位符你可以将bos_id, eos_id, pad_id均指向tokenizer.eod_id。请阅读我们关于tokenizer的文档了解如何设置这些id。
## Docker
#### 下载官方Docker镜像速度很慢
在下载官方镜像时,您可能由于某些网络原因导致下载速度变慢。可以参考[阿里云容器镜像服务](https://help.aliyun.com/zh/acr/user-guide/accelerate-the-pulls-of-docker-official-images)加速官方镜像的下载。

228
LICENSE
View File

@@ -1,53 +1,201 @@
Tongyi Qianwen LICENSE AGREEMENT
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
Tongyi Qianwen Release Date: August 3, 2023
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
1. Definitions.
1. Definitions
a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
b. "We"(or "Us") shall mean Alibaba Cloud.
c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
e. "Tongyi Qianwen" shall mean the large language models (including Qwen model and Qwen-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
2. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
3. Redistribution
You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
b. You shall cause any modified files to carry prominent notices stating that You changed the files;
c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
4. Restrictions
If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
5. Rules of use
a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof).
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
6. Intellectual Property
a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
7. Disclaimer of Warranty and Limitation of Liability
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto.
b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW ITS CAUSED.
d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
8. Survival and Termination.
a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
9. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2023 Alibaba Cloud
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

228
NOTICE
View File

@@ -50,3 +50,231 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
------------- LICENSE FOR stanford_alpaca code --------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
------------- LICENSE FOR PanQiWei AutoGPTQ code --------------
MIT License
Copyright (c) 2023 潘其威(William)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

1138
README.md

File diff suppressed because it is too large Load Diff

View File

@@ -1,5 +1,5 @@
<p align="left">
中文</a>&nbsp &nbsp<a href="README.md">English</a>&nbsp &nbsp<a href="README_JA.md">日本語</a> &nbsp<a href="README_FR.md">Français</a>
中文</a>&nbsp &nbsp<a href="README.md">English</a>&nbsp &nbsp<a href="README_JA.md">日本語</a> &nbsp<a href="README_FR.md">Français</a> &nbsp<a href="README_ES.md">Español</a>
</p>
<br><br>
@@ -9,20 +9,32 @@
<br>
<p align="center">
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">魔搭社区</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2309.16609">论文</a> &nbsp&nbsp &nbsp&nbsp🖥 <a href="https://modelscope.cn/studios/qwen/Qwen-14B-Chat-Demo/summary">Demo</a>
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a> &nbsp&nbsp &nbsp&nbsp🖥 <a href="https://modelscope.cn/studios/qwen/Qwen-72B-Chat-Demo/summary">Demo</a>
<br>
<a href="assets/wechat.png">微信</a>&nbsp&nbsp &nbsp&nbsp 钉钉 &nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp
<a href="assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp &nbsp&nbsp<a href="https://dashscope.aliyun.com">API</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://qianwen.aliyun.com">Web</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://apps.apple.com/cn/app/%E9%80%9A%E4%B9%89%E5%8D%83%E9%97%AE/id6466733523">APP</a>
</p>
<br><br>
| | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
|-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
| 1.8B | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B">🤗</a> |
| 7B | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
| 14B | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B">🤗</a> |
| 72B | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B">🤗</a> |
我们开源了**Qwen**通义千问系列工作当前开源模型的参数规模为70亿7B和140亿14B。本次开源包括基础模型**Qwen**,即**Qwen-7B**和**Qwen-14B**,以及对话模型**Qwen-Chat**,即**Qwen-7B-Chat**和**Qwen-14B-Chat**。模型链接在表格中,请点击了解详情。同时,我们公开了我们的<b><a href="https://arxiv.org/abs/2309.16609">技术报告</a></b>,请点击上方论文链接查看。
当前基础模型已经稳定训练了大规模高质量且多样化的数据覆盖多语言当前以中文和英文为主总量高达3万亿token。在相关基准评测中Qwen系列模型拿出非常有竞争力的表现显著超出同规模模型并紧追一系列最强的闭源模型。此外我们利用SFT和RLHF技术实现对齐从基座模型训练得到对话模型。Qwen-Chat具备聊天、文字创作、摘要、信息抽取、翻译等能力同时还具备一定的代码生成和简单数学推理的能力。在此基础上我们针对LLM对接外部系统等方面针对性地做了优化当前具备较强的工具调用能力以及最近备受关注的Code Interpreter的能力和扮演Agent的能力。
我们开源了**Qwen**通义千问系列工作当前开源模型的参数规模为18亿1.8B、70亿7B、140亿14B和720亿72B。本次开源包括基础模型**Qwen**,即**Qwen-1.8B**、**Qwen-7B**、**Qwen-14B**、**Qwen-72B**,以及对话模型**Qwen-Chat**,即**Qwen-1.8B-Chat**、**Qwen-7B-Chat**、**Qwen-14B-Chat**和**Qwen-72B-Chat**。模型链接在表格中,请点击了解详情。同时,我们公开了我们的<b><a href="https://arxiv.org/abs/2309.16609">技术报告</a></b>,请点击上方论文链接查看。
当前基础模型已经稳定训练了大规模高质量且多样化的数据覆盖多语言当前以中文和英文为主总量高达3万亿token。在相关基准评测中Qwen系列模型拿出非常有竞争力的表现显著超出同规模模型并紧追一系列最强的闭源模型。此外我们利用SFT和RLHF技术实现对齐从基座模型训练得到对话模型。Qwen-Chat具备聊天、文字创作、摘要、信息抽取、翻译等能力同时还具备一定的代码生成和简单数学推理的能力。在此基础上我们针对LLM对接外部系统等方面针对性地做了优化当前具备较强的工具调用能力以及最近备受关注的Code Interpreter的能力和扮演Agent的能力。我们将各个大小模型的特点列到了下表。
| 模型 | 开源日期 | 最大上下文长度 | System Prompt强化 | 预训练token数 | 微调Q-Lora最小GPU用量 | 生成2048个token的最小显存占用 | 工具调用 |
|:----------|:--------:|:-------:|:---------------:|:---------:|:-----------------:|:-------------------:|:----:|
| Qwen-1.8B | 23.11.30 | 32K | √ | 2.2T | 5.8GB | 2.9GB | √ |
| Qwen-7B | 23.08.03 | 32K | × | 2.4T | 11.5GB | 8.2GB | √ |
| Qwen-14B | 23.09.25 | 8K | × | 3.0T | 18.7GB | 13.0GB | √ |
| Qwen-72B | 23.11.30 | 32K | √ | 3.0T | 61.4GB | 48.9GB | √ |
在这个项目中,你可以了解到以下内容
@@ -45,8 +57,9 @@
## 新闻
* 2023.11.30 🔥 我们推出 **Qwen-72B****Qwen-72B-Chat**,它们在 3T tokens上进行训练并支持 32k 上下文。同时也发布了 **Qwen-1.8B****Qwen-1.8B-Chat**。我们还增强了 Qwen-72B-Chat 和 Qwen-1.8B-Chat 的系统指令System Prompt功能请参阅[示例文档](examples/system_prompt.md)。此外,我们还对**昇腾910**以及**海光DCU**实现了推理的支持,详情请查看`ascend-support``dcu-support`文件夹。
* 2023年10月17日 我们推出了Int8量化模型**Qwen-7B-Chat-Int8**和**Qwen-14B-Chat-Int8**。
* 2023年9月25日 🔥 在魔搭社区ModelScope和Hugging Face推出**Qwen-14B**和**Qwen-14B-Chat**模型,并开源 [qwen.cpp](https://github.com/QwenLM/qwen.cpp) 和 [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent)。**Qwen-7B**和**Qwen-7B-Chat**的代码和模型也同步得到更新。**请使用最新的代码和模型!**
* 2023年9月25日 在魔搭社区ModelScope和Hugging Face推出**Qwen-14B**和**Qwen-14B-Chat**模型,并开源 [qwen.cpp](https://github.com/QwenLM/qwen.cpp) 和 [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent)。**Qwen-7B**和**Qwen-7B-Chat**的代码和模型也同步得到更新。**请使用最新的代码和模型!**
- 相比原版Qwen-7B新版用了更多训练数据从2.2T增加到2.4T tokens序列长度从2048扩展至8192。整体中文能力以及代码能力均有所提升。
* 2023年9月12日 支持Qwen-7B和Qwen-7B-Chat的微调其中包括全参数微调、LoRA以及Q-LoRA。
* 2023年8月21日 发布Qwen-7B-Chat的Int4量化模型Qwen-7B-Chat-Int4。该模型显存占用低推理速度相比半精度模型显著提升在基准评测上效果损失较小。
@@ -55,15 +68,15 @@
## 评测表现
Qwen-14B及Qwen-7B (最新版本使用更大量的token进行预训练)相比同规模模型均实现了效果的显著提升。我们评测的数据集包括MMLU、C-Eval、 GSM8K、 MATH、HumanEval、MBPP、BBH等数据集考察的能力包括自然语言理解、知识、数学计算和推理、代码生成、逻辑推理等。当然即便Qwen-14B相比GPT-3.5和GPT-4仍有差距。
Qwen系列模型相比同规模模型均实现了效果的显著提升。我们评测的数据集包括MMLU、C-Eval、 GSM8K、 MATH、HumanEval、MBPP、BBH等数据集考察的能力包括自然语言理解、知识、数学计算和推理、代码生成、逻辑推理等。Qwen-72B在所有任务上均超越了LLaMA2-70B的性能同时在10项任务中的7项任务中超越GPT-3.5.
<p align="left">
<img src="assets/radar_14b.jpg" width="600"/>
<img src="assets/radar_72b.jpg" width="600"/>
<p>
<br>
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|:-----------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
@@ -73,9 +86,12 @@ Qwen-14B及Qwen-7B (最新版本使用更大量的token进行预训练)相比同
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| **Qwen-7B (original)** | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - |
| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** |
| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** |
对于以上所有对比模型,我们列出了其官方汇报结果与[OpenCompass](https://opencompass.org.cn/leaderboard-llm)结果之间的最佳分数。
@@ -87,6 +103,7 @@ Qwen-14B及Qwen-7B (最新版本使用更大量的token进行预训练)相比同
* python 3.8及以上版本
* pytorch 1.12及以上版本推荐2.0及以上版本
* transformers 4.32及以上版本
* 建议使用CUDA 11.4及以上GPU用户、flash-attention用户等需考虑此选项
<br>
@@ -94,7 +111,9 @@ Qwen-14B及Qwen-7B (最新版本使用更大量的token进行预训练)相比同
我们提供简单的示例来说明如何利用🤖 ModelScope和🤗 Transformers快速使用Qwen-7B和Qwen-7B-Chat。
在开始前,请确保你已经配置环境并安装好相关的代码包。最重要的是,确保你满足上述要求,然后安装相关的依赖库
你可以使用我们预构建好的Docker镜像省去大部分配置环境的操作,详情见[“使用预构建的docker镜像”](#-使用预构建的docker镜像)一节
如不使用Docker请确保你已经配置好环境并安装好相关的代码包。最重要的是确保你满足上述要求然后安装相关的依赖库。
```bash
pip install -r requirements.txt
@@ -107,6 +126,7 @@ git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# 下方安装可选,安装可能比较缓慢。
# pip install csrc/layer_norm
# 如果flash-attn版本高于2.1.1,下方无需安装。
# pip install csrc/rotary
```
@@ -189,7 +209,9 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
</details>
<p id="DownloadModel">
若在使用上述代码时由于各种原因无法从 HuggingFace 拉取模型和代码,可以先从 ModelScope 下载模型及代码至本地,再从本地加载模型:
</p>
```python
from modelscope import snapshot_download
@@ -316,6 +338,60 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cp
如果你遇到显存不足的问题而希望使用多张GPU进行推理可以使用上述的默认的使用方法读取模型。此前提供的脚本`utils.py`已停止维护。
尽管这个方法很简单但它的效率相对较低。我们建议使用vLLM和FastChat并请阅读部署章节。
### 阿里云灵积DashScopeAPI服务
最简单的使用Qwen模型API服务的方法就是通过DashScope阿里云灵积API模型服务。我们提供了简单介绍说明使用方法。同时我们还提供了自己部署OpenAI格式的API的方法。
DashScope是阿里云提供的大语言模型的API服务目前支持Qwen。但请注意目前提供服务的Qwen模型为内部模型暂无更多具体细节对外透露。模型服务包括`qwen-turbo``qwen-plus``qwen-max``qwen-turbo`速度更快,`qwen-plus`效果更优,`qwen-max`是最新发布的千亿级通义千问2.0模型。详情请查看[文档](https://dashscope.aliyun.com)。
请首先前往[官网](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn)开通DashScope获得API KeyAK。建议通过环境变量设置AK
```bash
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```
随后安装相关代码包,点击[此处](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk)查看安装文档。如使用python则直接通过pip安装
```bash
pip install dashscope
```
如安装JAVA SDK则通过如下命令安装
```xml
<!-- https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>the-latest-version</version>
</dependency>
```
最简单的使用方法就是通过messages调用用法类似OpenAI API。示例如下
```python
import random
from http import HTTPStatus
from dashscope import Generation
def call_with_messages():
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
gen = Generation()
response = gen.call(
Generation.Models.qwen_turbo,
messages=messages,
seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set
result_format='message', # set the result to be "message" format.
)
return response
if __name__ == '__main__':
response = call_with_messages()
if response.status_code == HTTPStatus.OK:
print(response)
else:
print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
response.request_id, response.status_code,
response.code, response.message
))
```
更多用法请查看官方文档了解详情。
<br><br>
@@ -323,7 +399,7 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cp
### GPTQ
**请注意:我们更新量化方案为基于 [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) 的量化提供Int4量化模型。该方案在模型评测效果几乎无损且存储需求更低推理速度更优。**
我们提供了基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化方案并开源了Int4和Int8量化模型。量化模型的效果损失很小但能显著降低显存占用并提升推理速度。
以下我们提供示例说明如何使用Int4量化模型。在开始使用前请先保证满足要求如torch 2.0及以上transformers版本为4.32.0及以上,等等),并安装所需安装包:
@@ -333,6 +409,12 @@ pip install auto-gptq optimum
如安装`auto-gptq`遇到问题,我们建议您到官方[repo](https://github.com/PanQiWei/AutoGPTQ)搜索合适的wheel。
> 注意:预编译的`auto-gptq`版本对`torch`版本及其CUDA版本要求严格。同时由于
> 其近期更新,你可能会遇到`transformers`、`optimum`或`peft`抛出的版本错误。
> 我们建议使用符合以下要求的最新版本:
> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
随后即可使用和上述一致的用法调用量化模型:
```python
@@ -349,12 +431,18 @@ response, history = model.chat(tokenizer, "Hi", history=None)
| Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
|----------------------|:----:|:-----------:|:-----:|:---------:|
| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 |
| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 |
| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 |
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
<br>
@@ -362,9 +450,9 @@ response, history = model.chat(tokenizer, "Hi", history=None)
> 注意由于Hugging Face的内部实现本功能的支持文件`cache_autogptq_cuda_356.cpp`与`cache_autogptq_cuda_kernel_245.cu`可能没被下载。如需开启使用,请手动从相关位置下载,并放置到相应文件中。
在模型infer时可以将中间结果key以及value的值量化后压缩存储这样便可以在相同的卡上存储更多的key以及value增加样本吞吐。
在模型推理时,我们可以将中间结果key以及value的值量化后压缩存储这样便可以在相同的卡上存储更多的key以及value增加样本吞吐。
提供use_cache_quantization以及use_cache_kernel两个参数对模型控制当use_cache_quantization以及use_cache_kernel均开启时将启动kv-cache量化的功能。具体使用如下:
我们在`config.json`里提供了`use_cache_quantization``use_cache_kernel`两个参数来控制是否启用KV cache量化具体使用方法如下:
```python
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat",
@@ -375,43 +463,46 @@ model = AutoModelForCausalLM.from_pretrained(
use_flash_attn=False
)
```
注意:当前该功能目前不支持与flash attn同时开启如果你开了kv cache量化的同时又开了flash attnuse_flash_attn=True use_cache_quantization=True, use_cache_kernel=True),会默认将use_flash_attn关闭
注意当前该功能不支持与flash attention同时开启如果你开了KV cache量化的同时又开了flash attention`use_flash_attn=True` `use_cache_quantization=True`, `use_cache_kernel=True`),程序默认将关闭`use_flash_attn`
效果方面我们验证过Int8 kv-cache的使用对模型整体的精度指标基本无损。我们做了针对显存占用的性能测试。评测运行于单张A100-SXM4-80G GPU模型默认使用BF16格式默认生成的seq-length=1024生成1024个token,其中oom表示out of memory
效果方面我们验证过Int8 KV Cache的使用对模型整体的精度指标基本无损。我们做了针对显存占用的性能测试。评测运行于单张A100-SXM4-80G GPU模型默认使用BF16格式默认生成1024个token其中OOM表示内存不足
开启了kv-cache量化之后模型在infer的时候可以开启更大的batch size(bs)
开启了KV cache量化之后模型在推理的时候可以开启更大的batch size (bs)
| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
|-------------|:------:|:------:|:------:|:------:|:------:|:------:|
| no | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom |
| yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
|--------------|:------:|:------:|:------:|:------:|:------:|:------:|
| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom |
| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
开启了kv-cache量化之后模型在infer时预测更长的seq-lengthsl生成的token数结果时,可以节约更多的显存。
开启了KV cache量化之后模型在推理时可在生成更长的序列sl生成的token数时,节约更多的显存。
| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
|-------------|:------:|:-------:|:-------:|:-------:|:-------:|
|--------------|:------:|:-------:|:-------:|:-------:|:-------:|
| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |
模型开启kv cache量化后模型infer的时候会将原始存进layer_past的float格式的key/value成int8格式的qkey/qvalue和相对应的量化参数。
开启KV cache量化后模型在推理时会将原始存进`layer-past`的float格式的key/value转换成int8格式,同时存储量化部分的参数。
具体操作如下:
1、将key/value进行量化操作
1. 将key/value进行量化操作
```
qv,scale,zero_point=quantize_cache_v(v)
```
2、存入layer_past中:
量化格式的layer_past:
2. 存入`layer_past`中:
量化格式的`layer-past`:
```
layer_past=((q_key,key_scale,key_zero_point),
(q_value,value_scale,value_zero_point))
```
原始格式的layer_past:
原始格式的`layer-past`:
```
layer_past=(key,value)
```
如果需要将layer_past中存好的keyvalue直接取出使用可以使用反量化操作将int8格式的key/value转回float格式
如果需要将`layer-past`中存好的keyvalue直接取出使用可以使用反量化操作将Int8格式的key/value转回float格式
```
v=dequantize_cache_torch(qv,scale,zero_point)
```
@@ -420,118 +511,100 @@ model = AutoModelForCausalLM.from_pretrained(
### 推理性能
这一部分将介绍模型推理的速度和显存占用的相关数据。下文的性能测算使用 [此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py) 完成。
### 推理速度
我们测算了BF16、Int8和Int4模型在使用flash attention v2、v1或不使用时生成2048和8192个token的平均推理速度tokens/s。结果如下所示
我们测算了BF16、Int8和Int4模型在生成2048个token时的平均推理速度tokens/s和显存使用。结果如下所示
<table>
<tr>
<th rowspan="2">Model Size</th><th rowspan="2">Precision</th><th rowspan="2">FlashAttn</th><th colspan="2" align="center">Sequence Length</th>
<td>Model Size</td>
<td>Quantization</td>
<td>Speed (Tokens/s)</td>
<td>GPU Memory Usage</td>
</tr>
<tr>
<th align="center">2048</th><th align="center">8192</th>
</tr>
</tr>
<td rowspan="3">1.8B</td>
<td>BF16</td>
<td>54.09</td>
<td>4.23GB</td>
</tr>
<tr>
<th rowspan="9">7B</th><td align="center" rowspan="3">BF16</td><td align="center">v2</td><td align="center">40.93</td><td align="center">36.14</td>
<td>Int8</td>
<td>55.56</td>
<td>3.48GB</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">40.75</td><td align="center">35.34
<td>Int4</td>
<td>71.07</td>
<td>2.91GB</td>
</tr>
<tr>
<td align="center">Disabled</td><td align="center">37.55</td><td align="center">33.56
<td rowspan="3">7B</td>
<td>BF16</td>
<td>40.93</td>
<td>16.99GB</td>
</tr>
<tr>
<td align="center" rowspan="3">Int8</td><td align="center">v2</td><td align="center">37.47</td><td align="center">32.54</td>
<td>Int8</td>
<td>37.47</td>
<td>11.20GB</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">37.51</td><td align="center">32.39
<td>Int4</td>
<td>50.09</td>
<td>8.21GB</td>
</tr>
<tr>
<td align="center">Disabled</td><td align="center">37.84</td><td align="center">32.65
<td rowspan="3">14B</td>
<td>BF16</td>
<td>32.22</td>
<td>30.15GB</td>
</tr>
<tr>
<td align="center" rowspan="3">Int4</td><td align="center">v2</td><td align="center">50.09</td><td align="center">38.61</td>
<td>Int8</td>
<td>29.28</td>
<td>18.81GB</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">45.98</td><td align="center">36.47
<td>Int4</td>
<td>38.72</td>
<td>13.01GB</td>
</tr>
<tr>
<td align="center">Disabled</td><td align="center">48.12</td><td align="center">36.70
<td rowspan="3">72B</td>
<td>BF16</td>
<td>8.48</td>
<td>144.69GB (2xA100)</td>
</tr>
<tr>
<th rowspan="9">14B</th><td align="center" rowspan="3">BF16</td><td align="center">v2</td><td align="center">32.88</td><td align="center">24.87</td>
<td>Int8</td>
<td>9.05</td>
<td>81.27GB (2xA100)</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">32.76</td><td align="center">28.89
<td>Int4</td>
<td>11.32</td>
<td>48.86GB</td>
</tr>
<tr>
<td align="center">Disabled</td><td align="center">29.32</td><td align="center">22.91
</tr>
<tr>
<td align="center" rowspan="3">Int8</td><td align="center">v2</td><td align="center">29.28</td><td align="center">24.22</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">28.31</td><td align="center">23.87
</tr>
<tr>
<td align="center">Disabled</td><td align="center">31.12</td><td align="center">24.60
</tr>
<tr>
<td align="center" rowspan="3">Int4</td><td align="center">v2</td><td align="center">38.72</td><td align="center">27.33</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">37.81</td><td align="center">26.46
</tr>
<tr>
<td align="center">Disabled</td><td align="center">37.65</td><td align="center">26.00
<td>72B + vLLM</td>
<td>BF16</td>
<td>17.60</td>
<td>2xA100</td>
</tr>
</table>
评测运行于单张A100-SXM4-80G GPU使用PyTorch 2.0.1CUDA 11.4。推理速度是编码2048个token和生成8192个token的速度均值。
评测运行于单张A100-SXM4-80G GPU除非提到使用2xA100使用PyTorch 2.0.1CUDA 11.8和Flash-Attention2。(72B + vLLM 使用 PyTorch 2.1.0和Cuda 11.8.)推理速度是生成2048个token的速度均值。
注意以上Int4/Int8模型生成速度使用autogptq库给出当前``AutoModelForCausalLM.from_pretrained``载入的模型生成速度会慢大约20%。我们已经将该问题汇报给HuggingFace团队若有解决方案将即时更新。
### 显存使用
我们还测算了BF16、Int8和Int4模型编码2048个token及生成8192个token的峰值显存占用情况。结果GB如下所示
<table>
<tr>
<th rowspan="2">Model Size</th><th rowspan="2">Precision</th><th colspan="2" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">2048</th><th align="center">8192</th>
</tr>
</tr>
</tr>
<tr>
<th rowspan="3">7B</th><td align="center">BF16</td><td align="center">16.99</td><td align="center">22.53</td>
</tr>
<tr>
<td align="center">Int8</td><td align="center">11.20</td><td align="center">16.62
</tr>
<tr>
<td align="center">Int4</td><td align="center">8.21</td><td align="center">13.63</td>
</tr>
<tr>
<th rowspan="3">14B</th><td align="center">BF16</td><td align="center">30.15</td><td align="center">38.94</td>
</tr>
<tr>
<td align="center">Int8</td><td align="center">18.81</td><td align="center">27.54
</tr>
<tr>
<td align="center">Int4</td><td align="center">13.01</td><td align="center">21.79</td>
</tr>
</table>
<br>
我们还测量了不同上下文长度、生成长度、Flash-Attention版本的推理速度和 GPU 内存使用情况。可以在 Hugging Face 或 ModelScope 上的相应的模型介绍页面找到结果。
## 微调
### 使用方法
我们提供了`finetune.py`这个脚本供用户实现在自己的数据上进行微调的功能以接入下游任务。此外我们还提供了shell脚本减少用户的工作量。这个脚本支持 [DeepSpeed](https://github.com/microsoft/DeepSpeed) 和 [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) 。我们提供的shell脚本使用了DeepSpeed因此建议您确保已经安装DeepSpeed
我们提供了`finetune.py`这个脚本供用户实现在自己的数据上进行微调的功能以接入下游任务。此外我们还提供了shell脚本减少用户的工作量。这个脚本支持 [DeepSpeed](https://github.com/microsoft/DeepSpeed) 和 [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) 。我们提供的shell脚本使用了DeepSpeed因此建议您确保已经安装DeepSpeed和Peft注意DeepSpeed可能不兼容最新的pydantic版本请确保`pydantic<2.0`)。你可以使用如下命令安装:
```bash
pip install peft deepspeed
```
首先你需要准备你的训练数据。你需要将所有样本放到一个列表中并存入json文件中。每个样本对应一个字典包含id和conversation其中后者为一个列表。示例如下所示
```json
@@ -641,7 +714,12 @@ tokenizer.save_pretrained(new_model_directory)
注意:分布式训练需要根据你的需求和机器指定正确的分布式训练超参数。此外,你需要根据你的数据、显存情况和训练速度预期,使用`--model_max_length`设定你的数据长度。
### 显存占用及训练速度
下面记录7B和14B模型在单GPU使用LoRALoRA (emb)指的是embedding和输出层参与训练而LoRA则不优化这部分参数和QLoRA时处理不同长度输入的显存占用和训练速度的情况。本次评测运行于单张A100-SXM4-80G GPU使用CUDA 11.8和Pytorch 2.0并使用了flash attention 2。我们统一使用batch size为1gradient accumulation为8的训练配置记录输入长度分别为256、512、1024、2048、4096和8192的显存占用GB和训练速度s/iter。我们还使用2张A100测了Qwen-7B的全参数微调。受限于显存大小我们仅测试了256、512和1024token的性能。具体数值如下所示:
下面记录7B和14B模型在单GPU使用LoRALoRA (emb)指的是embedding和输出层参与训练而LoRA则不优化这部分参数和QLoRA时处理不同长度输入的显存占用和训练速度的情况。本次评测运行于单张A100-SXM4-80G GPU使用CUDA 11.8和Pytorch 2.0并使用了flash attention 2。我们统一使用batch size为1gradient accumulation为8的训练配置记录输入长度分别为256、512、1024、2048、4096和8192的显存占用GB和训练速度s/iter。我们还使用2张A100测了Qwen-7B的全参数微调。受限于显存大小我们仅测试了256、512和1024token的性能。
对于 Qwen-72B我们测试了两种方案1使用4个 A100-SXM4-80G GPUs通过 Lora + DeepSpeed ZeRO 3 微调和2使用单张A100-SXM4-80G GPU通过 QLora (int4) 微调。请注意,使用 LoRA (emb) 微调和不带 DeepSpeed ZeRO 3 的 LoRA 微调在4个A100-SXM4-80G GPUs 上都会出现OOM你可以通过将`--deepspeed finetune/ds_config_zero3.json`参数传给[`finetune/finetune_lora_ds.sh`](finetune/finetune_lora_ds.sh)来打开 DeepSpeed ZeRO 3 配置)。
具体数值如下所示:
<table>
<tr>
@@ -652,6 +730,18 @@ tokenizer.save_pretrained(new_model_directory)
</tr>
</tr>
</tr>
<tr>
<th rowspan="4">1.8B</th><td>LoRA</td><td align="center">6.7G / 1.0s/it</td><td align="center">7.4G / 1.0s/it</td><td align="center">8.4G / 1.1s/it</td><td align="center">11.0G / 1.7s/it</td><td align="center">16.2G / 3.3s/it</td><td align="center">21.8G / 6.8s/it</td>
</tr>
<tr>
<td>LoRA (emb)</td><td align="center">13.7G / 1.0s/it</td><td align="center">14.0G / 1.0s/it</td><td align="center">14.0G / 1.1s/it</td><td align="center">15.1G / 1.8s/it</td><td align="center">19.7G / 3.4s/it</td><td align="center">27.7G / 7.0s/it</td>
</tr>
<tr>
<td>Q-LoRA</td><td align="center">5.8G / 1.4s/it</td><td align="center">6.0G / 1.4s/it</td><td align="center">6.6G / 1.4s/it</td><td align="center">7.8G / 2.0s/it</td><td align="center">10.2G / 3.4s/it</td><td align="center">15.8G / 6.5s/it</td>
</tr>
<tr>
<td>Full-parameter</td><td align="center">43.5G / 2.1s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.3s/it</td><td align="center">47.1G / 2.8s/it</td><td align="center">48.3G / 5.6s/it</td>
</tr>
<tr>
<th rowspan="4">7B</th><td>LoRA</td><td align="center">20.1G / 1.2s/it</td><td align="center">20.4G / 1.5s/it</td><td align="center">21.5G / 2.8s/it</td><td align="center">23.8G / 5.2s/it</td><td align="center">29.7G / 10.1s/it</td><td align="center">36.6G / 21.3s/it</td>
</tr>
@@ -673,6 +763,12 @@ tokenizer.save_pretrained(new_model_directory)
<tr>
<td>Q-LoRA</td><td align="center">18.7G / 5.3s/it</td><td align="center">18.4G / 6.3s/it</td><td align="center">18.9G / 8.2s/it</td><td align="center">19.9G / 11.8s/it</td><td align="center">23.0G / 20.1s/it</td><td align="center">27.9G / 38.3s/it</td>
</tr>
<tr>
<th rowspan="2">72B</th><td>LoRA + Deepspeed Zero3</td><td align="center">215.4G / 17.6s/it</td><td align="center">217.7G / 20.5s/it</td><td align="center">222.6G / 29.4s/it</td><td align="center">228.8G / 45.7s/it</td><td align="center">249.0G / 83.4s/it</td><td align="center">289.2G / 161.5s/it</td>
</tr>
<tr>
<td>Q-LoRA</td><td align="center">61.4G / 27.4s/it</td><td align="center">61.4G / 31.5s/it</td><td align="center">62.9G / 41.4s/it</td><td align="center">64.1G / 59.5s/it</td><td align="center">68.0G / 97.7s/it</td><td align="center">75.6G / 179.8s/it</td>
</tr>
</table>
<br>
@@ -680,12 +776,40 @@ tokenizer.save_pretrained(new_model_directory)
## 部署
### vLLM
如希望部署及加速推理我们建议你使用vLLM和FastChat。首先安装相应的代码库
如希望部署及加速推理我们建议你使用vLLM
如果你使用cuda12.1和pytorch2.1可以直接使用以下命令安装vLLM。
```bash
pip install vllm
```
否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。
#### vLLM + 类Transformer接口
请下载[接口封装代码](examples/vllm_wrapper.py)到当前文件夹,并执行以下命令进行多轮对话交互。(注意:该方法当前只支持``model.chat()``接口。)
```python
from vllm_wrapper import vLLMWrapper
model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)
response, history = model.chat(query="你好", history=None)
print(response)
response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
response, history = model.chat(query="给这个故事起一个标题", history=history)
print(response)
```
#### vLLM + 网页Demo / 类OpenAI API
你可以使用FastChat去搭建一个网页Demo或类OpenAI API服务器。首先请安装FastChat
```bash
pip install "fschat[model_worker,webui]"
```
你也可以通过`git clone`和`pip install -e .`的方式通过源码安装。如果遇到安装问题,请阅读它们的官方文档。
使用vLLM和FastChat运行Qwen之前首先启动一个controller
```bash
@@ -694,24 +818,30 @@ python -m fastchat.serve.controller
然后启动model worker读取模型。如使用单卡推理运行如下命令
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16
```
然而如果你希望使用多GPU加速推理或者增大显存你可以使用vLLM支持的模型并行机制。假设你需要在4张GPU上运行你的模型命令如下所示
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16
```
启动model worker后你可以启动一个web demo或者OpenAI API。启动web demo的命令如下
启动model worker后你可以启动一个
* Web UI Demo
```bash
python -m fastchat.serve.gradio_web_server
```
* OpenAI API
使用OpenAI API前请阅读我们的API章节配置好环境然后运行如下命令
```bash
python -m fastchat.serve.openai_api_server --host localhost --port 8000
```
然而如果你觉得使用vLLM和FastChat比较困难你也可以尝试以下我们提供的最简单的方式部署Web Demo、CLI Demo和OpenAI API。
<br>
## Demo
### Web UI
@@ -748,68 +878,12 @@ python cli_demo.py
<p>
<br>
## API
最简单的使用Qwen模型API服务的方法就是通过DashScope阿里云灵积模型服务。我们提供了简单介绍说明使用方法。同时我们还提供了自己部署OpenAI格式的API的方法。
### DashScope
DashScope是阿里云提供的大语言模型的API服务目前支持Qwen。但请注意目前提供服务的Qwen模型为内部模型暂无更多具体细节对外透露。模型服务包括`qwen-turbo`和`qwen-plus`。前者速度更快,后者效果更优。详情请查看[文档](https://dashscope.aliyun.com)。
请首先前往[官网](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn)开通DashScope获得API KeyAK。建议通过环境变量设置AK
```bash
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```
随后安装相关代码包,点击[此处](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk)查看安装文档。如使用python则直接通过pip安装
```bash
pip install dashscope
```
如安装JAVA SDK则通过如下命令安装
```xml
<!-- https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>the-latest-version</version>
</dependency>
```
最简单的使用方法就是通过messages调用用法类似OpenAI API。示例如下
```python
import random
from http import HTTPStatus
from dashscope import Generation
def call_with_messages():
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
gen = Generation()
response = gen.call(
Generation.Models.qwen_turbo,
messages=messages,
seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set
result_format='message', # set the result to be "message" format.
)
return response
if __name__ == '__main__':
response = call_with_messages()
if response.status_code == HTTPStatus.OK:
print(response)
else:
print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
response.request_id, response.status_code,
response.code, response.message
))
```
更多用法请查看官方文档了解详情。
### OpenAI API
### API
我们提供了OpenAI API格式的本地API部署方法感谢@hanpenggit。在开始之前先安装必要的代码库
```bash
pip install fastapi uvicorn openai "pydantic>=2.3.0" sse_starlette
pip install fastapi uvicorn openai pydantic sse_starlette
```
随后即可运行以下命令部署你的本地API
@@ -860,6 +934,86 @@ print(response.choices[0].message.content)
该接口也支持函数调用(**Function Calling**),但暂时仅限 `stream=False` 时能生效。用法见[函数调用示例](examples/function_call_examples.py)。
<br><br>
## 🐳 使用预构建的Docker镜像
为简化部署流程我们提供了预配置好相应环境的Docker镜像[qwenllm/qwen](https://hub.docker.com/r/qwenllm/qwen)只需安装驱动、下载模型文件即可启动Demo、部署OpenAI API以及进行微调。
### 准备操作
1. 根据需要使用的镜像版本安装相应版本的Nvidia驱动
- `qwenllm/qwen:cu117`**推荐**`>= 515.48.07`
- `qwenllm/qwen:cu114`不支持flash-attention`>= 470.82.01`
- `qwenllm/qwen:latest`:与`qwenllm/qwen:cu117`相同
2. 安装并配置[docker](https://docs.docker.com/engine/install/)和[nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
```bash
# 配置docker
sudo systemctl start docker
# 测试docker是否安装正确
sudo docker run hello-world
# 配置nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# 测试nvidia-container-toolkit是否安装正确
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
3. 下载模型及代码至本地(参考[此处说明](#DownloadModel)
### 部署
下面我们以Qwen-7B-Chat为例。在启动Web Demo或者部署API前请先参照下方代码完成配置工作
```bash
IMAGE_NAME=qwenllm/qwen:cu117
PORT=8901
CHECKPOINT_PATH=/path/to/Qwen-7B-Chat # 下载到本地的模型及代码路径
```
如下脚本可以帮你部署:
* OpenAI API
```bash
bash docker/docker_openai_api.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT}
```
* Web UI
```bash
bash docker/docker_web_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT}
```
* 交互式Demo
```bash
bash docker/docker_cli_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH}
```
这些命令将自动下载所需镜像以及后台启动Web UI Demo。你可以打开`http://localhost:${PORT}` 来使用该Demo。
如果输出如下内容则说明Demo启动成功
```text
Successfully started web demo. Open '...' to try!
Run `docker logs ...` to check demo status.
Run `docker rm -f ...` to stop and remove the demo.
```
如果你想查看Demo的状态你可以使用这个命令来展示输出结果`docker logs qwen`。
你可以使用这个命令`docker rm -f qwen`来停止服务并删除容器。
## 🔥 系统指令 (System Prompt)
Qwen-1.8-Chat 和 Qwen-72B-Chat 通义千问在多样且存在多轮复杂交互的系统指令上进行了充分训练,使模型可以跟随多样的系统指令,实现上下文(in-context)中的模型定制化,进一步提升了通义千问的可扩展性。
通过系统指令Qwen-Chat能够实现**角色扮演****语言风格迁移****任务设定**,和**行为设定**等能力。
![](assets/system_prompt_language_style.png)
![](assets/system_prompt_role_play_en.png)
更多关于系统指令的介绍信息可以参考[示例文档](examples/system_prompt.md).
## 工具调用
@@ -1084,7 +1238,11 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以
## 长文本理解
我们引入了NTK插值、窗口注意力、LogN注意力缩放等技术来提升模型的上下文长度并突破训练序列长度的限制。通过arXiv数据集上的语言模型实验我们的原生长度为2K的Qwen-7B/14B在8K的序列长度下依然表现不错,而原生长度扩展到8K的Qwen-7B能够在32K长序列的设置下取得不错的表现。
我们引入了NTK插值、窗口注意力、LogN注意力缩放等技术来提升模型的上下文长度并突破训练序列长度的限制原生长度为2K的Qwen-14B可以扩展到8K的序列长度而原生长度8K的Qwen-1.8B/7B能够在32K长序列的设置下取得不错的表现。
对于Qwen-72B我们基于RoPE采用更大的旋转Base来适应更长的上下文。Qwen-72B支持32K的上下文长度。
通过arXiv数据集上的语言模型实验发现 Qwen 在长上下文场景下可以达到出色的性能。结果如下:
<table>
<tr>
@@ -1100,12 +1258,11 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以
<td>+ dynamic_ntk</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.56</td><td align="center">4.62</td><td align="center">-</td>
<td>Qwen-1.8B</td><td align="center"><b>5.00</b></td><td align="center"><b>4.48</b></td><td align="center"><b>4.13</b></td><td align="center"><b>3.89</b></td><td align="center">17.42</td><td align="center">433.85</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.49</td><td align="center">4.32</td><td align="center">-</td>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>5.00</b></td><td align="center"><b>4.48</b></td><td align="center"><b>4.14</b></td><td align="center"><b>3.93</b></td><td align="center"><b>3.82</b></td><td align="center"><b>3.83</b></td>
</tr>
<tr>
<tr>
<td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
</tr>
@@ -1121,11 +1278,28 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center"><b>3.29</b></td><td align="center"><b>3.18</b></td><td align="center">3.42</td><td align="center">-</td>
</tr>
<tr>
<td>Qwen-72B</td><td align="center"><b>-</b></td><td align="center"><b>-</b></td><td align="center">-</td><td align="center"><b>2.83</b></td><td align="center"><b>2.73</b></td><td align="center"><b>2.72</b></td>
</tr>
</table>
## Tokenization
进一步我们为了验证Qwen-72B-Chat在长文本任务上的能力在[L-Eval](https://arxiv.org/abs/2307.11088)客观题上进行了测试,评分结果如下:
> 注作为术语的“tokenization”在中文中尚无共识的概念对应本文档采用英文表达以利说明。
| Model | Input Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFcition |
|:------------------|:------------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| ChatGPT-3.5-16k | 16K | 60.73 | **63.51** | **84.00** | 61.38 | 78.43 | **12.22** | 64.84 |
| **Qwen-72B-Chat** | 32K | **62.30** | 58.13 | 76.00 | **77.22** | **86.24** | 6.66 | **69.53** |
我们进一步进行了“大海捞针”实验(想法来自于[@Greg Kamradt](https://twitter.com/GregKamradt/status/1727018183608193393)),测试模型在不同长度的输入下,是否能检索到文章不同位置的信息,结果如下:
![](assets/qwen_72b_needle_in_a_haystack.png)
以上结果说明Qwen-72B-Chat可以能准确检索到32K以内的输入长度中放在各种位置的信息证明了其具有优秀的长文本处理能力。
## Tokenizer
> 注作为术语的“tokenizer”在中文中尚无共识的概念对应本文档采用英文表达以利说明。
基于tiktoken的tokenizer有别于其他分词器比如sentencepiece tokenizer。尤其在微调阶段需要特别注意特殊token的使用。关于tokenizer的更多信息以及微调时涉及的相关使用请参阅[文档](tokenization_note_zh.md)。
<br><br>
@@ -1155,7 +1329,14 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以
## 使用协议
研究人员与开发者可使用Qwen和Qwen-Chat或进行二次开发。我们同样允许商业使用具体细节请查看[LICENSE](LICENSE)。如需商用,请填写问卷([7B](https://dashscope.console.aliyun.com/openModelApply/qianwen), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat))申请
<https://github.com/QwenLM/Qwen>中的源代码采用[Apache 2.0协议](./LICENSE)授权,您可在该仓库根目录找到协议全文
研究人员与开发者可使用Qwen和Qwen-Chat或进行二次开发。对于商业使用请查看模型各自的LICENSE。
- Qwen-72B、Qwen-14B和Qwen-7B采用[Tongyi Qianwen LICENSE AGREEMENT](./Tongyi%20Qianwen%20LICENSE%20AGREEMENT)授权您可在相应模型的HuggingFace或ModelScope仓库找到协议原文。如需商用您只需遵循使用协议进行商用即可我们欢迎您填写问卷([72B](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat)、[14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat)、[7B](https://dashscope.console.aliyun.com/openModelApply/qianwen))。
- Qwen-1.8B采用[Tongyi Qianwen RESEARCH LICENSE AGREEMENT](./Tongyi%20Qianwen%20RESEARCH%20LICENSE%20AGREEMENT)授权您可在相应模型的HuggingFace或ModelScope仓库找到协议原文。如需商用请联系我们。
<br><br>
## 联系我们

1350
README_ES.md Normal file

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -1,5 +1,5 @@
<p align="left">
<a href="README_CN.md">中文</a>&nbsp &nbsp<a href="README.md">English</a>&nbsp &nbsp日本語 &nbsp<a href="README_FR.md">Français</a>
<a href="README_CN.md">中文</a>&nbsp &nbsp<a href="README.md">English</a>&nbsp &nbsp日本語 &nbsp<a href="README_FR.md">Français</a> &nbsp<a href="README_ES.md">Español</a>
</p>
<br><br>
@@ -9,28 +9,34 @@
<br>
<p align="center">
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a> &nbsp&nbsp &nbsp&nbsp🖥 <a href="https://modelscope.cn/studios/qwen/Qwen-14B-Chat-Demo/summary">Demo</a>
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a> &nbsp&nbsp &nbsp&nbsp🖥 <a href="https://modelscope.cn/studios/qwen/Qwen-72B-Chat-Demo/summary">Demo</a>
<br>
<a href="assets/wechat.png">WeChat</a>&nbsp&nbsp &nbsp&nbsp DingTalk &nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp
<a href="assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp &nbsp&nbsp<a href="https://dashscope.aliyun.com">API</a>
</p>
<br><br>
<p align="left">
日本語ドキュメントメンテナー: <a href="https://github.com/eltociear">Ikko Eltociear Ashimine</a> & Junyang Lin
</p>
<br>
| | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
|-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
| 1.8B | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B">🤗</a> |
| 7B | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
| 14B | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B">🤗</a> |
| 72B | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B">🤗</a> |
Qwen-7B****Qwen-14B****Qwen**シリーズと、**Qwen-7B-Chat****Qwen-14B-Chat****Qwen-Chat**シリーズをオープンソース化しました。上の表にリンクがあります。クリックしてモデルカードをご確認ください。また、テクニカルレポートも公開しました。論文リンクをクリックしてご覧ください!
**Qwen-1.8B**、**Qwen-7B****Qwen-14B**、**Qwen-72B**の基本言語モデルである**Qwen**と、**Qwen-1.8B-Chat**、**Qwen-7B-Chat****Qwen-14B-Chat**、**Qwen-72B-Chat**のチャットモデルである**Qwen-Chat**をオープンソース化しま。上の表にリンクがあります。リンクをクリックしてモデルカードをご確認ください。また、**[テクニカルレポート](https://arxiv.org/abs/2309.16609)**も公開しています。論文リンクをクリックしてご覧ください!
簡単に説明すると、私たちは、ドメインや言語中国語と英語を中心になどを幅広くカバーする最大3兆トークンの多言語データに対して安定的に事前学習された強力なベース言語モデルを持っています。これらのモデルは、ベンチマークデータセットにおいて競争力のあるパフォーマンスを達成することができます。さらに、SFTとRLHFに基づく人間の嗜好に沿ったチャットモデルまだリリースされていませんがあり、チャット、コンテンツ作成、情報抽出、要約、翻訳、コーディング、数学の問題を解くなどが可能で、ツールを使ったり、エージェントとして遊んだり、コードインタプリタとして遊んだりすることもできます。
| モデル | 発行日 | コンテキストの最大長 | システムプロンプトの強化 | 预训练されたトークンの数 | FinetuningQ-Loraの最小GPUメモリ使用量 | 2048トークン生成時の最小GPUメモリ使用量Int4 | ツールの使用能力 |
|:----------|:--------:|:----------:|:------------:|:------------:|:------------------------------:|:-----------------------------:|:--------:|
| Qwen-1.8B | 23.11.30 | 32K | √ | 2.2T | 5.8GB | 2.9GB | √ |
| Qwen-7B | 23.08.03 | 32K | × | 2.4T | 11.5GB | 8.2GB | √ |
| Qwen-14B | 23.09.25 | 8K | × | 3.0T | 18.7GB | 13.0GB | √ |
| Qwen-72B | 23.11.30 | 32K | √ | 3.0T | 61.4GB | 48.9GB | √ |
このレポでは、それを把握することができる:
* Qwenのクイックスタート。
@@ -51,6 +57,7 @@ Qwen-7B**と**Qwen-14B**の**Qwen**シリーズと、**Qwen-7B-Chat**と**Qwen-1
## ニュースとアップデート
* 2023.11.30 🔥 3T トークンで学習し、32k コンテキストをサポートする **Qwen-72B****Qwen-72B-Chat** を、 **Qwen-1.8B****Qwen-1.8B-Chat** とともに、ModelScope と Hugging Face 上でリリースしました。また、Qwen-72B-ChatとQwen-1.8B-Chatのシステム・プロンプト機能を強化しました。[サンプル・ドキュメント](examples/system_prompt.md)を参照してください。さらに、**Ascend 910** と **Hygon DCU** での推論をサポートしました。詳細は `ascend-support``dcu-support` を参照してください。
* 2023.10.17 Int8量子化モデル**Qwen-7B-Chat-Int8**と**Qwen-14B-Chat-Int8**をリリースしました。
* 2023.9.25 🔥 Qwen-14BとQwen-14B-ChatをModelScopeとHugging Faceでリリースしました。[qwen.cpp](https://github.com/QwenLM/qwen.cpp) と [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) もリリースされました。同時に、Qwen-7B と Qwen-7B-Chat も更新しました。Qwen-7Bオリジナルと比較して、Qwen-7Bはより多くの学習トークンを使用し、2.2Tトークンから2.4Tトークンに増加し、コンテキスト長は2048から8192に拡張された。Qwen-7Bの中国語知識とコーディング能力はさらに向上しています。最新のコードとチェックポイントをお使いください
* 2023.9.12 Qwen-7Bモデルにおいて、フルパラメーター・ファインチューニング、LoRA、Q-LoRAを含むファインチューニングをサポートしました。
@@ -60,15 +67,16 @@ Qwen-7B**と**Qwen-14B**の**Qwen**シリーズと、**Qwen-7B-Chat**と**Qwen-1
## 性能
Qwen-14BとQwen-7Bこれは、より多くのトークンで学習され、コンテキストの長さが2048から8192に拡張された新バージョン、自然言語理解、数学的問題解決、コーディングなどに関するモデルの能力を評価する一連のベンチマークデータセット、例えばMMLU、C-Eval、GSM8K、MATH、HumanEval、MBPP、BBHなどにおいて、同様のモデルサイズベースラインモデルを上回る。しかし、Qwen-14BでもGPT-4はおろかGPT-3.5にも大きく遅れをとっています。以下の結果をご覧ください
Qwenモデルは、MMLU、C-Eval、GSM8K、MATH、HumanEval、MBPP、BBHなど、自然言語理解、数学的問題解決、コーディングなどに関するモデルの能力を評価する一連のベンチマークデータセットにおいて、同様のモデルサイズを持つベースラインモデルを上回る性能を発揮する。Qwen-72Bは全てのタスクでLLaMA2-70Bを上回り、10タスク中7タスクでGPT-3.5を上回った
<p align="left">
<img src="assets/radar_14b.jpg" width="600"/>
<img src="assets/radar_72b.jpg" width=600px/>
<p>
<br>
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
|:------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
@@ -78,9 +86,12 @@ Qwen-14BとQwen-7Bこれは、より多くのトークンで学習され、
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - |
| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** |
| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** |
比較されたすべてのモデルについて、公式に報告された結果と[OpenCompass](https://opencompass.org.cn/leaderboard-llm) の間の最高スコアを報告します。
@@ -92,6 +103,7 @@ Qwen-14BとQwen-7Bこれは、より多くのトークンで学習され、
* python 3.8 以上
* pytorch 1.12 以上、2.0 以上を推奨
* transformers 4.32 以上
* CUDA 11.4 以上を推奨GPU ユーザー、フラッシュアテンションユーザー向けなど)
<br>
@@ -99,7 +111,9 @@ Qwen-14BとQwen-7Bこれは、より多くのトークンで学習され、
以下では、Qwen-Chat と 🤖 ModelScope と 🤗 Transformers の簡単な使用例を示します。
コードを実行する前に、環境のセットアップと必要なパッケージのインストールが済んでいることを確認してください。上記の要件を満たしていることを確認してから、依存するライブラリをインストールしてください。
詳しくはセクション["ビルド済みDockerイメージの使用"](#-using-pre-built-docker-images)を参照してください。
Dockerを使用しない場合は、環境のセットアップと必要なパッケージのインストールが済んでいることを確認してください。上記の要件を満たしていることを確認してから、依存するライブラリをインストールしてください。
```bash
pip install -r requirements.txt
@@ -112,6 +126,7 @@ git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# 以下はオプションです。インストールに時間がかかる場合があります。
# pip install csrc/layer_norm
# flash-attn のバージョンが 2.1.1 以降の場合、以下は必要ありません。
# pip install csrc/rotary
```
@@ -119,7 +134,7 @@ cd flash-attention && pip install .
### 🤗 Transformers
Qwen-Chat を推論に使用するには、以下のように数行のコードを入力するだけです。**最新のコードを使用していることを確認してください。**
Qwen-Chat を推論に使用するには、以下のように数行のコードを入力するだけです。Qwen/Qwen-7B-Chat "や "Qwen/Qwen-14B-Chat "のように、正しいモデル名やパスを渡すことを忘れないでください。**最新のコードを使用していることを確認してください。**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -137,8 +152,8 @@ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code
# オートモードを使用すると、デバイスに応じて自動的に精度が選択されます。
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
# 生成のためのハイパーパラメータを指定
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 生成のためのハイパーパラメータを指定。ただし、4.32.0 以上のトTransformerを使用している場合は、これを行う必要はありません。
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 第一回対話ターン
response, history = model.chat(tokenizer, "你好", history=None)
@@ -181,8 +196,8 @@ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True
# オートモードを使用すると、デバイスに応じて自動的に精度が選択されます。
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
# 生成のためのハイパーパラメータを指定
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# 生成のためのハイパーパラメータを指定。ただし、4.32.0 以上のトTransformerを使用している場合は、これを行う必要はありません。
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
inputs = tokenizer('蒙古国的首都是乌兰巴托Ulaanbaatar\n冰岛的首都是雷克雅未克Reykjavik\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)
@@ -193,7 +208,9 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
</details>
<p id="DownloadModel">
HuggingFaceからモデルのチェックポイントとコードをダウンロードする際にネットワークの問題が発生した場合、ModelScopeからチェックポイントをダウンロードする方法はこちらでございます。
</p>
```python
from modelscope import snapshot_download
@@ -321,13 +338,69 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cp
GPUメモリ不足に悩まされ、1つ以上のGPUでモデルを実行したい場合、Transformersでサポートされるようになったデフォルトのロード方法を直接使うことができます。以前の `utils.py` に基づく方法は非推奨です。
しかし、この方法は簡単ですが、ネイティブ・パイプライン並列の効率は低いです。FastChatでvLLMを使用することをお勧めします。
### DashScope
APIを通じてQwenを利用する最も簡単な方法は、Alibaba Cloudを通じたDashScope APIサービスです。その使い方を紹介します。さらに、OpenAIスタイルのAPIをご自身のサーバーにデプロイするためのスクリプトも提供しています。
DashScopeはAlibaba Cloudが提供する大規模言語モデルAPIサービスで、今回Qwenに対応した。DashScopeの背後にあるモデルは、詳細が提供されていない一時的な社内バージョンであることに注意してください。サービスには `qwen-turbo``qwen-plus` があり、前者はより高速に動作し、後者はより優れたパフォーマンスを実現している。詳細はドキュメント [こちら](https://dashscope.aliyun.com) を参照。
公式サイト [link](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) で DashScope アカウントを作成し、API キー (AK) を取得してください。AK は環境変数で設定することをお勧めします:
```bash
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```
その後、パッケージをインストールし、ドキュメントは [こちら](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk) をクリックしてください。Python をお使いの場合は、pip で DashScope をインストールできます:
```bash
pip install dashscope
```
JAVA SDKを使用する場合は、この方法でインストールできます
```xml
<!-- https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>the-latest-version</version>
</dependency>
```
DashScope を使用する最も簡単な方法は、OpenAI API と同様のメッセージを使用する方法です。以下にその例を示す:
```python
import random
from http import HTTPStatus
from dashscope import Generation
def call_with_messages():
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
gen = Generation()
response = gen.call(
Generation.Models.qwen_turbo,
messages=messages,
seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set
result_format='message', # set the result to be "message" format.
)
return response
if __name__ == '__main__':
response = call_with_messages()
if response.status_code == HTTPStatus.OK:
print(response)
else:
print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
response.request_id, response.status_code,
response.code, response.message
))
```
詳しい使い方は公式サイトをご覧ください。
<br><br>
## 量子化
### GPTQ
**注: [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) に基づく新しい解決策を提供し、Qwen-Chat 用の Int4 量子化モデル[ここをクリック](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)をリリースしました。このモデルは、従来の解決策と比較して、ほぼ無損失モデル効果を達成しつつ、メモリコストと推論速度の両方で性能向上しています。**
我々は、[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)に基づい解決策を提供し、Int4とInt8の量子化モデルをリリースすることで、ほぼ無損失モデル効果を達成しつつ、メモリコストと推論速度の両方で性能向上させた。
ここでは、量子化されたモデルを推論に使用する方法を説明する。始める前に、auto-gptqの要件を満たしていることを確認しtorch 2.0以上、transformers 4.32.0以上など)、必要なパッケージをインストールしてください:
@@ -337,6 +410,12 @@ pip install auto-gptq optimum
auto-gptq`のインストールに問題がある場合は、公式の[repo](https://github.com/PanQiWei/AutoGPTQ)をチェックして、ホイールを見つけることをお勧めする。
> 注意:コンパイル済みの `auto-gptq` パッケージは `torch` のバージョンと CUDA バージョンに強く依存しています。さらに、最近のアップデートにより
> さらに、最近のアップデートにより、`transformers`、`optimum`、`peft` でサポートされていないバージョンのエラーが発生する可能性があります。
> 以下の要件を満たす最新バージョンの使用をお勧めします:
> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1 > - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
> - torch>=2.0, <2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
そうすれば、量子化されたモデルを簡単にロードすることができ、いつもと同じように推論を実行することができる:
```python
@@ -352,18 +431,28 @@ response, history = model.chat(tokenizer, "Hi", history=None)
| Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
|----------------------|:----:|:-----------:|:-----:|:---------:|
| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 |
| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 |
| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 |
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
### KVキャッシュ量子化
モデルの推論の時に、中間結果のKeyとValueを量子化して圧縮保存することができます。これにより、同じGPUでより多くのKeyとValueを保存することができ、サンプルのスピードを増やすことができます。
> 注意: Hugging Faceの内部メカニズムにより、この機能のサポートファイル
> (すなわち、`cache_autogptq_cuda_256.cpp`と`cache_autogptq_cuda_kernel_245.cu`)が欠落している可能性があります。以下を手動でダウンロードしてください。
> Hugging Face Hubから手動でダウンロードし、他のモジュールファイルと同じフォルダに入れてください。
アテンション KV キャッシュを量子化して圧縮して保存すると、サンプルのスループットが向上する。この機能を有効にするには、`config.json` に `use_cache_quantization` と `use_cache_kernel` という引数を指定する。
具体的な使用方法は以下の通りである:
use_cache_quantizationとuse_cache_kernelという2つのパラメータを提供します。use_cache_quantizationとuse_cache_kernelを両方ONにした場合、KVキャッシュ量子化の機能が有効になります。具体的な使い方は
```python
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat",
@@ -374,43 +463,50 @@ model = AutoModelForCausalLM.from_pretrained(
use_flash_attn=False
)
```
現在、この機能はflash attnと同時に使用することはできません。use_flash_attnをTrueにしてKVキャッシュ量子化とflash attnを同時に有効にした場合、use_flash_attnはデフォルトで無効になります。
注意: 現在、KVキャッシュの量子化とフラッシュ・アテンションを同時に使用することはできない。
KV キャッシュの量子化とフラッシュ・アテンションを同時に有効にした場合(`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`)、デフォルトでは `use_flash_attn` は無効になる(`use_flash_attn=false`)。
Int8 KVキャッシュ量子化の使用によるモデルの性能の影響はほとんどありませんでした。性能評価は単一のA100-SXM4-80G GPUで実行され、モデルはデフォルトでBF16形式を使用し、生成される文章の長さは1024です。oomはメモリ不足を示します。
量子化されたint8-kvcacheモデルを使用しても、下流の評価で大幅な性能低下がないことを確認しました。以下では、さまざまな条件下でのメモリフットプリントのプロファイリングに焦点を当てます。
プロファイリングは、PyTorch 2.0.1とCUDA 11.4を搭載したシングルA100-SXM4-80G GPUで実行しました。
デフォルトで1024トークンを生成するためにBF16モデルを使用し、"OOM "はメモリ不足エラーを示します。
KVキャッシュ量子化を有効にすると、推論の時により大きなバッチサイズbsを使用できるようになります
KVキャッシュ量子化により、モデルはより大きなバッチサイズbsで推論することができる
| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
|-------------|:------:|:------:|:------:|:------:|:------:|:------:|
| no | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom |
| yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
|--------------|:------:|:------:|:------:|:------:|:------:|:------:|
| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | OOM | OOM |
| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
KVキャッシュ量子化を有効にすると、推論の時により長い文章が生成できる。
KVキャッシュ量子化により、推論段階でより長いシーケンス(`sl`, シーケンス長、生成されるトークン数を指す)を生成する際、モデルはより多くのメモリを節約することができる。
| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
|-------------|:------:|:-------:|:-------:|:-------:|:-------:|
| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |
|--------------|:------:|:-------:|:-------:|:-------:|:-------:|
| No | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| Yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |
KVキャッシュ量子化モデルでは、layer-pastのフォーマットをfloatからint8に変換し、量子化された `layer-past` には量子化パラメータも格納される。
モデルがKVキャッシュ量子化を有効にした場合、モデルの推論の時には、元のfloat形式のkey/valueをint8形式のqkey/qvalueと対応する量子化パラメータに変換します。
具体的な手順は以下の通りです:
1key/valueの量子化を行います。
具体的な手順は以下の通り:
1. key/valueの量子化を行います。
```
qv,scale,zero_point=quantize_cache_v(v)
```
2、layer_pastに保存します。
量子化されたのlayer_pastは:
2. `layer_past`に保存します。
量子化されたの`layer-past`は:
```
layer_past=((q_key,key_scale,key_zero_point),
(q_value,value_scale,value_zero_point))
```
元のlayer_past:
`layer_past`の元のフォーマットは以下の通りである:
```
layer_past=(key,value)
```
layer_pastのkey、valueを使用する必要がある場合は、int8形式のkey/valueをfloat形式に戻すために、逆量子化操作を使用することができます。
量子化されたアテンションKVを使用したい場合、
Int8のkey/valueをfloatフォーマットに戻すには、以下のように逆量子化操作を使用します
```
v=dequantize_cache_torch(qv,scale,zero_point)
```
@@ -420,118 +516,97 @@ layer_pastのkey、valueを使用する必要がある場合は、int8形式のk
このセクションでは、さまざまな精度のモデルのスピードとメモリの統計情報を提供する。スピードとメモリーのプロファイリングは[このスクリプト](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)を使用しています。
### 推論スピード
BF16、Int8、Int4の精度のモデルを用いて、2048個と8192個のトークンを生成する平均推論速度tokens/sを、フラッシュアテンションv1、v2を使用した場合と使用しなかった場合の条件で測定した。
BF16、Int8、および Int4 のモデルを使用して 2048 を生成する際の平均推論速度 (トークン/秒) と GPU メモリ使用量を測定しました。
<table>
<tr>
<th rowspan="2">Model Size</th><th rowspan="2">Precision</th><th rowspan="2">FlashAttn</th><th colspan="2" align="center">Sequence Length</th>
<td>Model Size</td>
<td>Quantization</td>
<td>Speed (Tokens/s)</td>
<td>GPU Memory Usage</td>
</tr>
<tr>
<th align="center">2048</th><th align="center">8192</th>
</tr>
</tr>
<td rowspan="3">1.8B</td>
<td>BF16</td>
<td>54.09</td>
<td>4.23GB</td>
</tr>
<tr>
<th rowspan="9">7B</th><td align="center" rowspan="3">BF16</td><td align="center">v2</td><td align="center">40.93</td><td align="center">36.14</td>
<td>Int8</td>
<td>55.56</td>
<td>3.48GB</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">40.75</td><td align="center">35.34
<td>Int4</td>
<td>71.07</td>
<td>2.91GB</td>
</tr>
<tr>
<td align="center">Disabled</td><td align="center">37.55</td><td align="center">33.56
<td rowspan="3">7B</td>
<td>BF16</td>
<td>40.93</td>
<td>16.99GB</td>
</tr>
<tr>
<td align="center" rowspan="3">Int8</td><td align="center">v2</td><td align="center">37.47</td><td align="center">32.54</td>
<td>Int8</td>
<td>37.47</td>
<td>11.20GB</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">37.51</td><td align="center">32.39
<td>Int4</td>
<td>50.09</td>
<td>8.21GB</td>
</tr>
<tr>
<td align="center">Disabled</td><td align="center">37.84</td><td align="center">32.65
<td rowspan="3">14B</td>
<td>BF16</td>
<td>32.22</td>
<td>30.15GB</td>
</tr>
<tr>
<td align="center" rowspan="3">Int4</td><td align="center">v2</td><td align="center">50.09</td><td align="center">38.61</td>
<td>Int8</td>
<td>29.28</td>
<td>18.81GB</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">45.98</td><td align="center">36.47
<td>Int4</td>
<td>38.72</td>
<td>13.01GB</td>
</tr>
<tr>
<td align="center">Disabled</td><td align="center">48.12</td><td align="center">36.70
<td rowspan="3">72B</td>
<td>BF16</td>
<td>8.48</td>
<td>144.69GB (2xA100)</td>
</tr>
<tr>
<th rowspan="9">14B</th><td align="center" rowspan="3">BF16</td><td align="center">v2</td><td align="center">32.88</td><td align="center">24.87</td>
<td>Int8</td>
<td>9.05</td>
<td>81.27GB (2xA100)</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">32.76</td><td align="center">28.89
<td>Int4</td>
<td>11.32</td>
<td>48.86GB</td>
</tr>
<tr>
<td align="center">Disabled</td><td align="center">29.32</td><td align="center">22.91
</tr>
<tr>
<td align="center" rowspan="3">Int8</td><td align="center">v2</td><td align="center">29.28</td><td align="center">24.22</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">28.31</td><td align="center">23.87
</tr>
<tr>
<td align="center">Disabled</td><td align="center">31.12</td><td align="center">24.60
</tr>
<tr>
<td align="center" rowspan="3">Int4</td><td align="center">v2</td><td align="center">38.72</td><td align="center">27.33</td>
</tr>
<tr>
<td align="center">v1</td><td align="center">37.81</td><td align="center">26.46
</tr>
<tr>
<td align="center">Disabled</td><td align="center">37.65</td><td align="center">26.00
<td>72B + vLLM</td>
<td>BF16</td>
<td>17.60</td>
<td>2xA100</td>
</tr>
</table>
詳細には、プロファイリングの設定は、2048個のトークンをエンコードし、8192個の新しいトークンを生成することである。プロファイリングは、PyTorch 2.0.1CUDA 11.4を搭載したシングルA100-SXM4-80G GPUで実行される。推論速度エンコードされたトークンと生成されたトークンの平均である。
プロファイリングは、PyTorch 2.0.1CUDA 11.8、および Flash-Attendant 2 を備えた単一の A100-SXM4-80G GPU (2xA100 について言及されている場合を除く) で実行されます。(72B + vLLM は PyTorch 2.1.0 および Cuda 11.8 を使用します。) 推論速度 は、エンコードされ生成されたトークンの平均である。
注意上記のInt4/Int8モデルの推論速度は、autogptqを使用しています。現在、``AutoModelForCausalLM.from_pretrained``で読み込まれるモデルの推論速度は約20%遅くなります。この問題はHuggingFaceチームに報告済みであり、解決策があれば即座に更新されます。
### GPU メモリ使用量
また、BF16、Int8、Int4量子化レベルのそれぞれにおいて、2048個のトークンをコンテキストとしてエンコードした場合および単一のトークンを生成した場合と、8192個のトークンを生成した場合単一のトークンをコンテキストとして生成した場合のGPUメモリ使用量のピーク値をプロファイリングしました。結果GBを以下に示します。
<table>
<tr>
<th rowspan="2">Model Size</th><th rowspan="2">Precision</th><th colspan="2" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">2048</th><th align="center">8192</th>
</tr>
</tr>
</tr>
<tr>
<th rowspan="3">7B</th><td align="center">BF16</td><td align="center">16.99</td><td align="center">22.53</td>
</tr>
<tr>
<td align="center">Int8</td><td align="center">11.20</td><td align="center">16.62
</tr>
<tr>
<td align="center">Int4</td><td align="center">8.21</td><td align="center">13.63</td>
</tr>
<tr>
<th rowspan="3">14B</th><td align="center">BF16</td><td align="center">30.15</td><td align="center">38.94</td>
</tr>
<tr>
<td align="center">Int8</td><td align="center">18.81</td><td align="center">27.54
</tr>
<tr>
<td align="center">Int4</td><td align="center">13.01</td><td align="center">21.79</td>
</tr>
</table>
<br>
また、コンテキストと生成の長さ、Flash Attention バージョンのさまざまな設定で推論速度と GPU メモリ使用量も測定します。 結果は、Hugging Face または ModelScope の対応するモデルカードで確認できます。
## ファインチューニング
### 使用方法
現在、公式のトレーニングスクリプト `finetune.py` を提供しています。さらに、finetune.pyのシェルスクリプトを提供し、finetune.pyを実行することで、finetune.pyを起動することができる。さらに、安心してファインチューニングを開始するためのシェルスクリプトも提供しています。このスクリプトは、[DeepSpeed](https://github.com/microsoft/DeepSpeed) および [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) を使用したトレーニングをサポートします。弊社が提供するシェル・スクリプトは DeepSpeed と Peft を使用するため、事前に DeepSpeed と Peft をインストールすることをお勧めします:
現在、公式のトレーニングスクリプト `finetune.py` を提供しています。さらに、finetune.pyのシェルスクリプトを提供し、finetune.pyを実行することで、finetune.pyを起動することができる。さらに、安心してファインチューニングを開始するためのシェルスクリプトも提供しています。このスクリプトは、[DeepSpeed](https://github.com/microsoft/DeepSpeed) (注意これはpydanticの最新バージョンとコンフリクトする可能性があるので、`pydantic<2.0`にする必要があります) および [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) を使用したトレーニングをサポートします。弊社が提供するシェル・スクリプトは DeepSpeed と Peft を使用するため、事前に DeepSpeed と Peft をインストールすることをお勧めします:
```bash
pip install -r requirements_finetune.txt
```
@@ -598,7 +673,7 @@ sh finetune/finetune_qlora_single_gpu.sh
sh finetune/finetune_qlora_ds.sh
```
Q-LoRAについては、弊社が提供する量子化モデル、例えばQwen-7B-Chat-Int4をロードすることをお勧めします。BF16モデルは使用し**ない**でくださいフルパラメータ・ファインチューニングやLoRAとは異なり、Q-LoRAではfp16のみがサポートされる。シングルGPUのトレーニングでは、トーチアンプによるエラーが観測されたため、混合精度のトレーニングにはディープスピードを使用する必要がある。また、Q-LoRAの場合、LoRAの特殊トークンの問題が残っています。しかし、Q-LoRAではチャットモデルとしてInt4モデルのみを提供しており、言語モデルはChatML形式の特殊トークンを学習しているため、レイヤーの心配はありません。なお、Int4モデルのレイヤーは学習できないはずなので、学習で特殊なトークンを導入すると、Q-LoRAが動作しなくなる可能性があります。
Q-LoRAについては、弊社が提供する量子化モデル、例えばQwen-7B-Chat-Int4をロードすることをお勧めします。BF16モデルは使用し**ない**でくださいフルパラメータ・ファインチューニングやLoRAとは異なり、Q-LoRAではfp16のみがサポートされる。シングルGPUのトレーニングでは、トーチアンプによるエラーが観測されたため、混合精度のトレーニングにはDeepSpeedを使用する必要がある。また、Q-LoRAの場合、LoRAの特殊トークンの問題が残っています。しかし、Q-LoRAではチャットモデルとしてInt4モデルのみを提供しており、言語モデルはChatML形式の特殊トークンを学習しているため、レイヤーの心配はありません。なお、Int4モデルのレイヤーは学習できないはずなので、学習で特殊なトークンを導入すると、Q-LoRAが動作しなくなる可能性があります。
LoRAとQ-LoRAの学習は、フルパラメータによるファインチューニングとは異なり、アダプターパラメータのみを保存する。仮にQwen-7Bから学習を開始したとすると、以下のようにファインチューニングされたモデルを読み込んで推論を行うことができる
@@ -629,10 +704,27 @@ merged_model = model.merge_and_unload()
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```
`new_model_directory` ディレクトリには、マージされたモデルの重みとモジュール ファイルが含まれます。 保存されたファイルに `*.cu` および `*.cpp` ファイルが存在しない可能性があることに注意してください。 KVキャッシュ機能を使用したい場合は、手動でコピーしてください。 また、このステップではトークナイザー ファイルは新しいディレクトリに保存されません。 トークナイザー ファイルをコピーするか、次のコードを使用できます。
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
path_to_adapter, # path to the output directory
trust_remote_code=True
)
tokenizer.save_pretrained(new_model_directory)
```
注意マルチGPUトレーニングの場合、分散トレーニング用の適切なハイパーパラメータをマシンに応じて指定する必要があります。また、データ、メモリフットプリント、トレーニング速度を考慮して、引数 `--model_max_length` で最大シーケンス長を指定することをお勧めします。
### メモリと速度のプロファイリング
シングルGPUトレーニングのセットアップにおいて、LoRA (LoRA(emb)はembeddingと出力層を学習させるが、LoRAはembeddingと出力層を学習させない) とQ-LoRAのGPUメモリとトレーニング速度をプロファイリングする。このテストでは、シングルA100-SXM4-80G GPUで実験し、CUDA 11.8とPytorch 2.0を使用します。Flash attention 2を使用します。256、512、1024、2048、4096、8192という異なる長さの入力のメモリGBと速度s/iterをプロファイリングします。また、2台のA100 GPUを用いたQwen-7Bによるフルパラメータ・ファインチューニングの統計量も報告する。GPUメモリの制限のため、256、512、1024トークンの統計のみを報告する。統計量を以下に示す:
シングルGPUトレーニングのセットアップにおいて、LoRA (LoRA(emb)はembeddingと出力層を学習させるが、LoRAはembeddingと出力層を学習させない) とQ-LoRAのGPUメモリとトレーニング速度をプロファイリングする。このテストでは、シングルA100-SXM4-80G GPUで実験し、CUDA 11.8とPytorch 2.0を使用します。Flash attention 2を使用します。256、512、1024、2048、4096、8192という異なる長さの入力のメモリGBと速度s/iterをプロファイリングします。また、2台のA100 GPUを用いたQwen-7Bによるフルパラメータ・ファインチューニングの統計量も報告する。GPUメモリの制限のため、256、512、1024トークンの統計のみを報告する。
Qwen-72B については、2 つの方法で実験します。1) 4 つの A100-SXM4-80G GPU での Lora 微調整 + DeepSpeed ZeRO 3、および 2) 1 つの A100-SXM4-80G GPU での QLora (int4) 微調整。 OOM は、LoRA (emb) 微調整と Deepspeed ZeRO 3 を使用しない LoRA 微調整の両方で 4 つの A100-SXM4-80G GPU で発生することに注意してください (`--deepspeedfinetune/ds_config_zero3.json` を [`finetune/finetune_lora_ds に渡すことができます) .sh`](finetune/finetune_lora_ds.sh) を使用して DeepSpeed ZeRO 3 を有効にします)。
統計量を以下に示す:
<table>
<tr>
@@ -642,6 +734,18 @@ merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_
<th align="center">256</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th>
</tr>
</tr>
</tr>
<tr>
<th rowspan="4">1.8B</th><td>LoRA</td><td align="center">6.7G / 1.0s/it</td><td align="center">7.4G / 1.0s/it</td><td align="center">8.4G / 1.1s/it</td><td align="center">11.0G / 1.7s/it</td><td align="center">16.2G / 3.3s/it</td><td align="center">21.8G / 6.8s/it</td>
</tr>
<tr>
<td>LoRA (emb)</td><td align="center">13.7G / 1.0s/it</td><td align="center">14.0G / 1.0s/it</td><td align="center">14.0G / 1.1s/it</td><td align="center">15.1G / 1.8s/it</td><td align="center">19.7G / 3.4s/it</td><td align="center">27.7G / 7.0s/it</td>
</tr>
<tr>
<td>Q-LoRA</td><td align="center">5.8G / 1.4s/it</td><td align="center">6.0G / 1.4s/it</td><td align="center">6.6G / 1.4s/it</td><td align="center">7.8G / 2.0s/it</td><td align="center">10.2G / 3.4s/it</td><td align="center">15.8G / 6.5s/it</td>
</tr>
<tr>
<td>Full-parameter</td><td align="center">43.5G / 2.1s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.3s/it</td><td align="center">47.1G / 2.8s/it</td><td align="center">48.3G / 5.6s/it</td>
</tr>
<tr>
<th rowspan="4">7B</th><td>LoRA</td><td align="center">20.1G / 1.2s/it</td><td align="center">20.4G / 1.5s/it</td><td align="center">21.5G / 2.8s/it</td><td align="center">23.8G / 5.2s/it</td><td align="center">29.7G / 10.1s/it</td><td align="center">36.6G / 21.3s/it</td>
@@ -664,44 +768,77 @@ merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_
<tr>
<td>Q-LoRA</td><td align="center">18.7G / 5.3s/it</td><td align="center">18.4G / 6.3s/it</td><td align="center">18.9G / 8.2s/it</td><td align="center">19.9G / 11.8s/it</td><td align="center">23.0G / 20.1s/it</td><td align="center">27.9G / 38.3s/it</td>
</tr>
<tr>
<th rowspan="2">72B</th><td>LoRA + Deepspeed Zero3</td><td align="center">215.4G / 17.6s/it</td><td align="center">217.7G / 20.5s/it</td><td align="center">222.6G / 29.4s/it</td><td align="center">228.8G / 45.7s/it</td><td align="center">249.0G / 83.4s/it</td><td align="center">289.2G / 161.5s/it</td>
</tr>
<tr>
<td>Q-LoRA</td><td align="center">61.4G / 27.4s/it</td><td align="center">61.4G / 31.5s/it</td><td align="center">62.9G / 41.4s/it</td><td align="center">64.1G / 59.5s/it</td><td align="center">68.0G / 97.7s/it</td><td align="center">75.6G / 179.8s/it</td>
</tr>
</table>
<br>
## デプロイ
### vLLM
デプロイメントと高速推論のためには、FastChatとvLLMを使用することをお勧めします。まずパッケージをインストールしてください:
デプロイメントと高速推論のためには、vLLMを使用することをお勧めします。
cuda 12.1 および pytorch 2.1 を使用している場合は、次のコマンドを直接使用して vLLM をインストールできます。
```bash
pip install vllm
```
それ以外の場合は、公式 vLLM [インストール手順](https://docs.vllm.ai/en/latest/getting_started/installation.html) を参照してください。
#### vLLM + Transformer Wrapper
[ラッパー コード](examples/vllm_wrapper.py) をダウンロードし、複数ラウンドの対話対話のために次のコマンドを実行できます。 (注: 現在は ``model.chat()`` メソッドのみをサポートしています。)
```python
from vllm_wrapper import vLLMWrapper
model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)
response, history = model.chat(query="你好", history=None)
print(response)
response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
response, history = model.chat(query="给这个故事起一个标题", history=history)
print(response)
```
#### vLLM + Web デモ / OpenAI API
FastChat を使用して、Web デモまたは OpenAI API サーバーを起動できます。 まず、FastChat をインストールします。
```
pip install "fschat[model_worker,webui]"
```
または、`git clone` と `pip install -e .` を使ってソースからインストールすることもできます。インストールに問題がある場合は、それぞれのドキュメントを読むことを勧める。
QwenをvLLMとFastChatで実行するには、まず以下の方法でコントローラを起動する必要があります
vLLM および FastChat で Qwen を実行するには、の方法でコントローラを起動する必要があります
```bash
python -m fastchat.serve.controller
```
それからmodel workerを起動し、推論のためにモデルをロードします。シングルGPU推論の場合は、直接実行できます
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16
```
しかし、より高速な推論や大容量メモリーのために複数のGPUでモデルを実行したい場合は、vLLMがサポートするテンソル並列を使用することができます。モデルを4GPUで実行するとすると、コマンドは以下のようになります
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16
```
Model workerを起動したら、Web デモや OpenAI API を好きなように起動できます。ウェブデモの場合は、以下のコマンドを実行します:
モデルワーカーを起動した後、起動することができます:
* Web UI Demo
```bash
python -m fastchat.serve.gradio_web_server
```
OpenAI APIについては、まずOpenAI APIのドキュメントをチェックして、インストールしてください。次にコマンドを実行します
* OpenAI API
```bash
python -m fastchat.serve.openai_api_server --host localhost --port 8000
```
<br>
## デモ
ただし、vLLM と FastChat の使用が難しい場合は、Web デモ、CLI デモ、および API をデプロイするために提供されている最も簡単な方法を試すことができます。
### ウェブ UI
@@ -738,63 +875,7 @@ python cli_demo.py
<p>
<br>
## API
APIを通じてQwenを利用する最も簡単な方法は、Alibaba Cloudを通じたDashScope APIサービスです。その使い方を紹介します。さらに、OpenAIスタイルのAPIをご自身のサーバーにデプロイするためのスクリプトも提供しています。
### DashScope
DashScopeはAlibaba Cloudが提供する大規模言語モデルAPIサービスで、今回Qwenに対応した。DashScopeの背後にあるモデルは、詳細が提供されていない一時的な社内バージョンであることに注意してください。サービスには `qwen-turbo` と `qwen-plus` があり、前者はより高速に動作し、後者はより優れたパフォーマンスを実現している。詳細はドキュメント [こちら](https://dashscope.aliyun.com) を参照。
公式サイト [link](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) で DashScope アカウントを作成し、API キー (AK) を取得してください。AK は環境変数で設定することをお勧めします:
```bash
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```
その後、パッケージをインストールし、ドキュメントは [こちら](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk) をクリックしてください。Python をお使いの場合は、pip で DashScope をインストールできます:
```bash
pip install dashscope
```
JAVA SDKを使用する場合は、この方法でインストールできます
```xml
<!-- https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>the-latest-version</version>
</dependency>
```
DashScope を使用する最も簡単な方法は、OpenAI API と同様のメッセージを使用する方法です。以下にその例を示す:
```python
import random
from http import HTTPStatus
from dashscope import Generation
def call_with_messages():
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
gen = Generation()
response = gen.call(
Generation.Models.qwen_turbo,
messages=messages,
seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set
result_format='message', # set the result to be "message" format.
)
return response
if __name__ == '__main__':
response = call_with_messages()
if response.status_code == HTTPStatus.OK:
print(response)
else:
print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
response.request_id, response.status_code,
response.code, response.message
))
```
詳しい使い方は公式サイトをご覧ください。
### OpenAI API
### API
OpenAI API をベースにローカルAPIをデプロイする方法を提供する@hanpenggit に感謝)。始める前に、必要なパッケージをインストールしてください:
@@ -850,9 +931,128 @@ print(response.choices[0].message.content)
**Function Calling** もサポートされています(ただし、今のところ `stream=False` の場合のみ)。使用例](examples/function_call_examples.py) を参照してください。
<br><br>
## 🐳 Docker
デプロイプロセスを簡素化するために、あらかじめ環境を構築した docker イメージを提供しています: [qwenllm/qwen](https://hub.docker.com/r/qwenllm/qwen)。ドライバを導入し、モデルファイルをダウンロードするだけで、デモを起動し、OpenAI APIをデプロイし、モデルを微調整することができます。
### 準備
1. 使用するイメージに応じて、正しいバージョンのNvidiaドライバをインストールしてください
- `qwenllm/qwen:cu117` (**recommend**): `>= 515.48.07`
- `qwenllm/qwen:cu114` (w/o flash-attention): `>= 470.82.01`
- `qwenllm/qwen:latest`: same as `qwenllm/qwen:cu117`
2. [Docker](https://docs.docker.com/engine/install/) と [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) をインストールして設定します:
```bash
# configure docker
sudo systemctl start docker
# test if docker is correctly installed
sudo docker run hello-world
# configure nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# test if nvidia-container-toolkit is correctly installed
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
3. モデルのチェックポイントとコードを環境にダウンロードします([こちら](#DownloadModel)を参照)。
### デプロイ
ここでは例として Qwen-7B-Chat を使用する。ウェブ・デモや API を起動する前に、以下のように設定を行います:
```bash
IMAGE_NAME=qwenllm/qwen:cu117
PORT=8901
CHECKPOINT_PATH=/path/to/Qwen-7B-Chat # Path to downloaded model checkpoints and codes
```
以下のスクリプトがビルドに役立つ:
* OpenAI API
```bash
bash docker/docker_openai_api.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT}
```
* Web UI
```bash
bash docker/docker_web_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT}
```
* CLI Demo
```bash
bash docker/docker_cli_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH}
```
上記のコマンドは自動的に必要なイメージをダウンロードし、バックグラウンドでWeb UIデモを起動しますサービスは自動で再起動します。デモを使用するには、ホスト上で `http://localhost:${PORT}` を開いてください。
以下の出力が表示されれば、デモは正常に起動しています:
```text
Successfully started web demo. Open '...' to try!
Run `docker logs ...` to check demo status.
Run `docker rm -f ...` to stop and remove the demo.
```
デモの状態を確認したい場合は、`docker logs qwen` を使って出力を表示できる。
docker rm -f qwen` でサービスを停止し、コンテナを削除できる。
### ファインチューニング
ビルド済みのDockerイメージを利用したファインチューニングの方法は、基本的に[前章](#Finetuning)と同じです(すでにイメージに依存関係がインストールされています)
以下はシングルGPUのLoRAの例です
```bash
IMAGE_NAME=qwenllm/qwen:cu117
CHECKPOINT_PATH=/path/to/Qwen-7B # Path to downloaded model checkpoints and codes
#CHECKPOINT_PATH=/path/to/Qwen-7B-Chat-Int4 # Path to downloaded model checkpoints and codes (Q-LoRA)
DATA_PATH=/path/to/data/root # Prepare finetune data at ${DATA_PATH}/example.json
OUTPUT_PATH=/path/to/output/checkpoint # Path to finetune outputs
# Use all host devices by default
DEVICE=all
# If you need to specify GPUs for training, set device as follow (NOTE: internal quotation marks cannot be omitted)
#DEVICE='"device=0,1,2,3"'
mkdir -p ${OUTPUT_PATH}
# Single-GPU LoRA finetuning
docker run --gpus ${DEVICE} --rm --name qwen \
--mount type=bind,source=${CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-7B \
--mount type=bind,source=${DATA_PATH},target=/data/shared/Qwen/data \
--mount type=bind,source=${OUTPUT_PATH},target=/data/shared/Qwen/output_qwen \
--shm-size=2gb \
-it ${IMAGE_NAME} \
bash finetune/finetune_lora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B/ -d /data/shared/Qwen/data/example.json
```
例えばシングルGPUのQ-LoRAに変更するには、`docker run`内のbashコマンドを変更するだけでいい
```bash
bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B-Chat-Int4/ -d /data/shared/Qwen/data/example.json
```
<br>
## 🔥 システムプロンプト
Qwen-1.8-Chat と Qwen-72B-Chat は、複数回の複雑な対話を伴う多様なシステム プロンプトで完全にトレーニングされているため、さまざまなシステム プロンプトに従い、コンテキストに応じたモデルのカスタマイズを実現し、Qwen-Chat のスケーラビリティをさらに向上させることができます。
システム プロンプトを使用すると、Qwen-Chat は **ローリー プレイ**、**言語スタイルの転送**、**タスク設定**、**動作設定**を実現できます。
![](assets/system_prompt_ language_style.png)
![](assets/system_prompt_role_play_en.png)
詳細については、[サンプルドキュメント](examples/system_prompt.md)を参照してください。
## ツールの使用
Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利用に特化して最適化されており、ユーザは独自の Qwen-7B ベースの LangChain、エージェント、コードインタプリタを構築することができます。ツール利用能力を評価するための評価[ベンチマーク](eval/EVALUATION.md)では、Qwen-7B は安定した性能に達しています。
Qwen-Chat は、ツールの使用法と関数呼び出し機能に合わせて最適化されています。 ユーザーはエージェント、LangChain アプリケーションを開発し、Python コード インタプリターで Qwen を拡張することもできます。
ReAct プロンプトの原則に基づいてツール呼び出しを実装する方法に関するドキュメントを提供しています。[ReAct の例](examples/react_prompt.md) を参照してください。 この原則に基づいて、[openai_api.py](openai_api.py) で関数呼び出しのサポートを提供します。
オープンソースの中国語評価ベンチマークでモデルのツール呼び出し機能をテストしたところ、Qwen-Chat が一貫して良好なパフォーマンスを発揮することがわかりました。
<table>
<tr>
@@ -867,17 +1067,21 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利
<tr>
<td>GPT-3.5</td><td align="center">85%</td><td align="center">0.88</td><td align="center">75.0%</td>
</tr>
<tr>Qwen-7B-Chat v1.1
<td>Qwen-7B-Chat v1.1</td><td align="center">98%</td><td align="center">0.91</td><td align="center">7.3%</td>
<tr>
<td>Qwen-7B-Chat</td><td align="center">98%</td><td align="center">0.91</td><td align="center">7.3%</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">98%</td><td align="center">0.93</td><td align="center">2.4%</td>
</tr>
</table>
数学的問題解決、データ視覚化、ファイル処理や Web スクレイピングなどのその他の汎用タスクに Python コード インタープリターを使用する Qwen の能力を評価するために、これらの能力を評価するために特別に設計されたベンチマークを作成し、オープンソース化しました。 。 ベンチマークはこの [リンク](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark) で見つけることができます。
Qwen は、コード生成時のコードの実行可能性と結果の精度の点で優れたパフォーマンスを発揮することがわかりました。
<table>
<tr>
<th colspan="4" align="center">Using Code Interpreter - Executable Rate of Generated Code (%)</th>
<th colspan="4" align="center">Executable Rate of Generated Code (%)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization↑</th><th align="center">General↑</th>
@@ -924,8 +1128,8 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利
<td align="center">44.2</td>
<td align="center">65.5 </td>
</tr>
<tr>Qwen-7B-Chat v1.1
<td>Qwen-7B-Chat v1.1</td>
<tr>
<td>Qwen-7B-Chat</td>
<td align="center">82.4</td>
<td align="center">64.4</td>
<td align="center">67.2 </td>
@@ -940,7 +1144,7 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利
<table>
<tr>
<th colspan="4" align="center">Using Code Interpreter - Accuracy of Code Execution Results (%)</th>
<th colspan="4" align="center">Accuracy of Code Execution Results (%)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization-Hard↑</th><th align="center">Visualization-Easy↑</th>
@@ -987,8 +1191,8 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利
<td align="center">21.4</td>
<td align="center">45.6 </td>
</tr>
<tr>Qwen-7B-Chat v1.1
<td>Qwen-7B-Chat v1.1</td>
<tr>
<td>Qwen-7B-Chat</td>
<td align="center">41.9</td>
<td align="center">40.5</td>
<td align="center">54.4 </td>
@@ -1001,16 +1205,13 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利
</tr>
</table>
ReAct プロンプトの書き方や使い方については、[ReAct の例](examples/react_prompt.md)を参照してください。ツールを使用することで、モデルがよりよいタスクを実行できるようになります。
<p align="center">
<br>
<img src="assets/code_interpreter_showcase_001.jpg" />
<br>
<p>
さらに、エージェントとしての能力を示す実験結果提供する。詳細は [Hugging Face Agent](examples/transformers_agent.md) を参照してさい。Hugging Face が提供するランモードベンチマークでの性能は以下の通りです:
さらに、Qwenが HuggingFace Agent として機能できることを実証する実験結果提供します。 詳細については、[ドキュメント例](examples/transformers_agent.md) を参照してください。 Hugging Face が提供する評価データセットにおけるモデルのパフォーマンスは次のとおりです。
<table>
<tr>
@@ -1031,8 +1232,8 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex
<tr>
<td>StarCoder-15B</td><td align="center">87.0</td><td align="center">88.0</td><td align="center">68.9</td>
</tr>
<tr>Qwen-7B-Chat v1.1
<td>Qwen-7B-Chat v1.1</td><td align="center">87.0</td><td align="center">87.0</td><td align="center">71.5</td>
<tr>
<td>Qwen-7B-Chat</td><td align="center">87.0</td><td align="center">87.0</td><td align="center">71.5</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">93.5</td><td align="center">94.4</td><td align="center">87.0</td>
@@ -1058,8 +1259,8 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex
<tr>
<td>StarCoder-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">89.6</td>
</tr>
<tr>Qwen-7B-Chat v1.1
<td>Qwen-7B-Chat v1.1</td><td align="center">94.7</td><td align="center">94.7</td><td align="center">85.1</td>
<tr>
<td>Qwen-7B-Chat</td><td align="center">94.7</td><td align="center">94.7</td><td align="center">85.1</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">95.5</td>
@@ -1070,7 +1271,11 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex
## 長い文脈の理解
コンテキストの長さを拡張し、訓練シーケンスの長さのボトルネックを解消するために、NTK を考慮した補間、ウィンドウアテンション、LogN アテンションスケーリングなどの技術を導入し、コンテキストの長さを 8K トークン以上に拡張する。arXiv データセットを用いて PPL 評価による言語モデリング実験を行い、Qwen-7B が長いコンテキストのシナリオにおいて卓越した性能を達成できることを見出した。以下に結果を示します:
コンテキストを拡張し、トレーニング シーケンスのボトルネックを解消するために、NTK 対応補間、ウィンドウ アテンション、LogN アテンション スケーリングなどのいくつかの技術を導入し、Qwen-14B のコンテキスト長を 2K から 8K 以上に拡張します。 トークン、および Qwen-1.8B/7B は 8K から 32K トークンまで。
Qwen-72B では、より大きな回転ベースを備えたより長いコンテキストに RoPE を適応させます。 Qwen-72B は、32K トークンの最大コンテキスト長をサポートします。
私たちは、PPL 評価を使用して arXiv データセットで言語モデリング実験を実施し、Qwen が長いコンテキストのシナリオで優れたパフォーマンスを達成できることを発見しました。 結果を以下に示します。
<table>
<tr>
@@ -1093,10 +1298,13 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex
</tr>
<tr>
<tr>
<td>Qwen-7B v1.1</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
<td>Qwen-1.8B</td><td align="center"><b>5.00</b></td><td align="center"><b>4.48</b></td><td align="center"><b>4.13</b></td><td align="center"><b>3.89</b></td><td align="center">17.42</td><td align="center">433.85</td>
</tr>
<tr>
<td>+ dynamic_ntk</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center"><b>3.23</b></td><td align="center">3.33</td>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>5.00</b></td><td align="center"><b>4.48</b></td><td align="center"><b>4.14</b></td><td align="center"><b>3.93</b></td><td align="center"><b>3.82</b></td><td align="center"><b>3.83</b></td>
</tr>
<tr>
<td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.33</b></td><td align="center"><b>3.22</b></td><td align="center"><b>3.17</b></td>
@@ -1107,8 +1315,24 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center"><b>3.29</b></td><td align="center"><b>3.18</b></td><td align="center">3.42</td><td align="center">-</td>
</tr>
<tr>
<td>Qwen-72B</td><td align="center"><b>-</b></td><td align="center"><b>-</b></td><td align="center">-</td><td align="center"><b>2.83</b></td><td align="center"><b>2.73</b></td><td align="center"><b>2.72</b></td>
</tr>
</tr>
</table>
さらに、Qwen-72B-Chat の長文理解能力を検証するために、[L-Eval](https://arxiv.org/abs/2307.11088) (クローズドエンド タスク) でテストしました。 結果は次のとおりです。
| Model | Input Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFcition |
|:------------------|:------------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| ChatGPT-3.5-16k | 16K | 60.73 | **63.51** | **84.00** | 61.38 | 78.43 | **12.22** | 64.84 |
| **Qwen-72B-Chat** | 32K | **62.30** | 58.13 | 76.00 | **77.22** | **86.24** | 6.66 | **69.53** |
私たちは、モデルが入力内のさまざまな位置で情報を取得できるかどうかをテストするために、「干し草の山の中の針」実験 (このアイデアは [@Greg Kamradt](https://twitter.com/GregKamradt/status/1727018183608193393) から来ました) を実施しました。 異なる長さの場合、結果は次のようになります。
![](assets/qwen_72b_needle_in_a_haystack.png)
上記の結果は、Qwen-72B-Chat が 32K の入力長内でさまざまな位置に配置された情報を正確に取得できることを示しており、その優れた長文理解能力を証明しています。
## トークナイザー
tiktoken に基づくトークナイザーは、他のトークナイザー、例えばセンテンスピーストークナイザーとは異なります。特にファインチューニングの際には、特殊なトークンに注意を払う必要があります。トークナイザに関する詳細な情報や、ファインチューニングにおける使用方法については、[ドキュメント](tokenization_note_ja.md)を参照してください。
@@ -1139,7 +1363,13 @@ tiktoken に基づくトークナイザーは、他のトークナイザー、
## ライセンス契約
Qwen と Qwen-Chat のコードとモデルウェイトは、研究者や開発者が自由に使用することができます。また、商用利用も可能です。詳しくは [LICENSE](LICENSE) をご覧ください。商用利用を希望される方は、リクエストフォーム([7B](https://dashscope.console.aliyun.com/openModelApply/qianwen), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat))に必要事項をご記入の上、お申し込みください
<https://github.com/QwenLM/Qwen>で提供されるソースコードは、ルートディレクトリにある[Apache 2.0 License](./LICENSE)の下でライセンスされています
研究者や開発者は、QwenとQwen-Chatのコードとモデルウェイトを自由に使用することができます。商用利用については、各モデルに添付されている使用許諾契約書をご確認ください。
- Qwen-72B、Qwen-14B、Qwen-7Bは、対応するHuggingFaceとModelScopeのリポジトリにある[Tongyi Qianwen LICENSE AGREEMENT](./Tongyi%20Qianwen%20LICENSE%20AGREEMENT)に基づいてライセンスされています。商用利用の場合は、フォーム([72B](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat), [7B](https://dashscope.console.aliyun.com/openModelApply/qianwen))に記入して申請してください。
- Qwen-1.8Bは、対応するHuggingFaceとModelScopeのリポジトリにある[Tongyi Qianwen RESEARCH LICENSE AGREEMENT](./Tongyi%20Qianwen%20RESEARCH%20LICENSE%20AGREEMENT)に基づいてライセンスされています。商用利用については、私たちにご連絡ください。
<br><br>
## お問い合わせ

View File

@@ -0,0 +1,53 @@
Tongyi Qianwen LICENSE AGREEMENT
Tongyi Qianwen Release Date: August 3, 2023
By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
1. Definitions
a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
b. "We"(or "Us") shall mean Alibaba Cloud.
c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
e. "Tongyi Qianwen" shall mean the large language models (including Qwen model and Qwen-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,
and conversions to other media types.
2. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
3. Redistribution
You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
b. You shall cause any modified files to carry prominent notices stating that You changed the files;
c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
4. Restrictions
If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.
5. Rules of use
a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof).
6. Intellectual Property
a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
7. Disclaimer of Warranty and Limitation of Liability
a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto.
b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW ITS CAUSED.
d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
8. Survival and Termination.
a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
9. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.

View File

@@ -0,0 +1,55 @@
Tongyi Qianwen RESEARCH LICENSE AGREEMENT
Tongyi Qianwen Release Date: November 30, 2023
By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
1. Definitions
a. This Tongyi Qianwen RESEARCH LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
b. "We"(or "Us") shall mean Alibaba Cloud.
c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
e. "Tongyi Qianwen" shall mean the large language models, and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,
and conversions to other media types.
i. "Non-Commercial" shall mean for research or evaluation purposes only.
2. Grant of Rights
a. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials FOR NON-COMMERCIAL PURPOSES ONLY.
b. If you are commercially using the Materials, You shall request a license from Us.
3. Redistribution
You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
b. You shall cause any modified files to carry prominent notices stating that You changed the files;
c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
4. Rules of use
a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof).
5. Intellectual Property
a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
6. Disclaimer of Warranty and Limitation of Liability
a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto.
b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW ITS CAUSED.
d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
7. Survival and Termination.
a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 6 and 8 shall survive the termination of this Agreement.
8. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
9. Other Terms and Conditions.
a. Any arrangements, understandings, or agreements regarding the Material not stated herein are separate from and independent of the terms and conditions of this Agreement. You shall request a seperate license from Us, if You use the Materials in ways not expressly agreed to in this Agreement.
b. We shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.

45
ascend-support/README.md Normal file
View File

@@ -0,0 +1,45 @@
# 昇腾910架构基于mindformers推理Qwen-7B-Chat模型
## 环境要求
- 硬件Ascend 910A/B
## 运行步骤
首先参考Qwen README下载官方模型到`/path/to/Qwen-7B-Chat`
### 下载并启动镜像
```bash
docker pull qwenllm/qwen-mindspore:latest
cd /path/to/Qwen/ascend-support
# 下载模型到此处
CHECKPOINT_PATH=/path/to/Qwen-7B-Chat
cd ascend-support
# 启动docker容器
bash docker_qwen.sh -c ${CHECKPOINT_PATH}
```
### 执行权重转换
在容器内执行下面的命令将Qwen模型转换为适配`mindformers`的格式:
```bash
python3 /data/qwen/mindformers/research/qwen/convert_weight.py
```
转换后模型的输出位置为`${CHECKPOINT_PATH}/qwen-7b-chat.ckpt`
### 执行推理
在容器内执行下面的命令,进行推理:
```bash
cd /data/qwen/mindformers/research/qwen
export PYTHONPATH=/data/qwen/mindformers:$PYTHONPATH
python3 infer_qwen.py
```

View File

@@ -0,0 +1,61 @@
#!/bin/bash
IMAGE_NAME=qwenllm/qwen-mindspore:v23.0.RC3
CONTAINER_NAME=qwen-mindspore
CHECKPOINT_PATH='NOT_SET'
DOCKER_CHECKPOINT_PATH=/data/qwen/models/Qwen-7B-Chat
function usage() {
echo '
Usage: bash ascend-support/docker_qwen.sh [-i IMAGE_NAME] -c [/path/to/Qwen-7B-Chat] [-n CONTAINER_NAME]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
CHECKPOINT_PATH=$1
;;
-n | --name )
shift
CONTAINER_NAME=$1
;;
-h )
usage
exit
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
docker run -it --rm -u root --network=host --ipc=host \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--name=${CONTAINER_NAME} \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v ${CHECKPOINT_PATH}:${DOCKER_CHECKPOINT_PATH} \
-v /var/log/npu/:/usr/slog \
${IMAGE_NAME} /bin/bash

Binary file not shown.

Before

Width:  |  Height:  |  Size: 64 KiB

After

Width:  |  Height:  |  Size: 81 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 473 KiB

BIN
assets/radar_72b.jpg Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 205 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 396 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 176 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 182 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 824 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 426 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 433 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 466 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 403 KiB

64
dcu-support/README.md Normal file
View File

@@ -0,0 +1,64 @@
# DCU 架构基于 fastllm 推理 Qwen 模型
## 环境配置
### 环境准备
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
```
### 容器启动
根据如下命令启动推理容器,其中需自定义一个容器名<container_name><project_path>即为本目录的路径:
```
# <container_name> 自定义容器名
# <project_path> 当前工程所在路径
docker run -it --name=<container_name> -v <project_path>:/work --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --shm-size=16G --group-add 39 image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest /bin/bash
```
### 加载环境
进入容器后执行如下命令,加载运行环境变量
```
source /opt/dtk-23.04/cuda/env.sh
```
### 安装方法
```
#进入本工程目录
cd package
python setup.py install
```
## 推理
### 模型转换
首先参考Qwen README下载官方模型并通过如下方式将模型转换为 fastllm 用于推理的形式:
- 通过`pip install -r requirements.txt`安装模型转换所需依赖
- 如果使用已经下载完成的模型或者自己finetune的模型需要修改qwen2flm.py文件中创建tokenizer, model时的模型存放路径
```
# 在本工程目录下执行:
python3 qwen2flm.py qwen-7b-fp16.bin float16 # 导出fp16模型参数为导出的模型路径
```
### 模型推理
```
# 命令行聊天程序,使用了模型创建以及流式对话效果
python cli_demo.py -p qwen-7b-fp16.bin
# batch推理程序
python cli_demo_batch.py -p qwen-7b-fp16.bin
# 简易webui需要先安装streamlit-chat
streamlit run web_demo.py qwen-7b-fp16.bin
```

30
dcu-support/cli_demo.py Normal file
View File

@@ -0,0 +1,30 @@
# coding=utf-8
import argparse
from fastllm_pytools import llm
def args_parser():
parser = argparse.ArgumentParser(description = 'qwen_chat_demo')
parser.add_argument('-p', '--path', type = str, required = True, default = '', help = '模型文件的路径')
args = parser.parse_args()
return args
if __name__ == "__main__":
args = args_parser()
model = llm.model(args.path)
history = []
print("输入内容即可进行对话clear 清空对话历史stop 终止程序")
while True:
query = input("\n用户:")
if query.strip() == "stop":
break
if query.strip() == "clear":
history = []
print("输入内容即可进行对话clear 清空对话历史stop 终止程序")
continue
print("AI:", end = "")
curResponse = ""
for response in model.stream_response(query, history = history, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0):
curResponse += response
print(response, flush = True, end = "")
history.append((query, curResponse))

View File

@@ -0,0 +1,39 @@
import argparse
from fastllm_pytools import llm
import time
def args_parser():
parser = argparse.ArgumentParser(description = 'fastllm_chat_demo')
parser.add_argument('-p', '--path', type = str, required = True, default = '', help = '模型文件的路径')
args = parser.parse_args()
return args
if __name__ == "__main__":
args = args_parser()
model_path = args.path
prompts = ["深圳有什么好玩的", "上海有什么好玩的", "晚上睡不着怎么办", "南京有什么好吃的"] * 2
print(prompts)
responses, historys = [], []
model = llm.model(model_path)
t0 = time.time()
responses, historys = model.response_batch(prompts)
t1 = time.time()
token_output_count = 0
word_len = 0
for i, res in enumerate(responses):
tokens = model.tokenizer_encode_string(res)
token_output_count += len(tokens)
word_len += len(res)
print("batch index: ", i)
print(res)
print("")
print("\ntoken/s: {:.2f}, character/s: {:.2f}".format(token_output_count/(t1-t0), word_len/(t1-t0)))

View File

@@ -0,0 +1,10 @@
# 模型唯一标识
modelCode = 411
# 模型名称
modelName=qwen-7b_fastllm
# 模型描述
modelDescription=qwen-7b是阿里云研发的通义千问大模型系列的70亿参数规模的模型
# 应用场景
appScenario=推理,对话问答,医疗,科研,金融,教育
# 框架类型
frameType=fastllm

View File

@@ -0,0 +1 @@
__all__ = ["llm"]

View File

@@ -0,0 +1,154 @@
from fastllm_pytools import llm;
import torch;
import ctypes;
import numpy as np;
fastllm_data_type_dict = {
"int4": 8,
"int8": 3,
"float16": 7
}
fastllm_weight_type_dict = {
"linear": 1,
"embedding": 2,
"QuantizedLinear": 111
}
def create(model,
tokenizer = None,
pre_prompt = None,
user_role = None,
bot_role = None,
history_sep = None,
dtype = "float16"):
if (dtype not in fastllm_data_type_dict):
print("dtype should in ", list(fastllm_data_type_dict.keys()));
exit(0);
# 0.1 model info
if model.config.model_type == "chatglm" and model.config.transformers_version == "4.30.2":
model.config.model_type = "chatglm3"
modelInfo = model.config.__dict__
if model.generation_config is not None:
modelInfo.update(model.generation_config.__dict__)
if (pre_prompt):
modelInfo["pre_prompt"] = pre_prompt;
if (user_role):
modelInfo["user_role"] = user_role;
if (bot_role):
modelInfo["bot_role"] = bot_role;
if (history_sep):
modelInfo["history_sep"] = history_sep;
if (modelInfo["model_type"] == "baichuan" and hasattr(model, "model") and hasattr(model.model, "get_alibi_mask")):
# Baichuan 2代
modelInfo["use_alibi"] = "1";
modelInfo["pre_prompt"] = "";
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.user_token_id) + "> ") if hasattr(model.generation_config, "user_token_id") else "";
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.assistant_token_id) + ">") if hasattr(model.generation_config, "assistant_token_id") else "";
modelInfo["history_sep"] = "";
if (modelInfo["model_type"] == "qwen"):
if modelInfo["chat_format"] == "chatml":
modelInfo["im_end_id"] = tokenizer.im_end_id
modelInfo["im_start_id"] = tokenizer.im_start_id
weight_type_dict = {};
module_dict = {};
weight_bits = {};
for key, m in model.named_modules():
if (str(type(m)).find("QuantizedLinear") != -1):
weight_type_dict[key + ".weight"] = "QuantizedLinear";
weight_bits[key + ".weight"] = m.weight_bit_width;
if (isinstance(m, torch.nn.Linear)):
weight_type_dict[key + ".weight"] = "linear";
module_dict[key + ".weight"] = m;
if (isinstance(m, torch.nn.Embedding)):
weight_type_dict[key] = "embedding";
peft_config = {}
active_adapter = ""
if hasattr(model, "peft_config"):
peft_config = model.peft_config
if hasattr(model, "active_adapter") and isinstance(model.active_adapter, str):
# in transformers >= 4.33.0, active_adapter is a funtion in model, ignore it now
active_adapter = model.active_adapter
model = model.cpu();
dict = model.state_dict();
model_type = model.config.__dict__["model_type"];
model = llm.fastllm_lib.create_empty_llm_model(model_type.encode());
for it in modelInfo.keys():
llm.fastllm_lib.add_dict_llm_model(model, str(it).encode(), str(modelInfo[it]).encode());
for adapter_name in peft_config.keys():
adapter_dict = peft_config[adapter_name].__dict__
for it in adapter_dict.keys():
llm.fastllm_lib.add_adapter_dict_llm_model(model, str(adapter_name).encode(), str(it).encode(), str(adapter_dict[it]).encode())
if len(active_adapter) != 0:
llm.fastllm_lib.set_adapter(model, str(active_adapter).encode())
# 1. vocab
if (tokenizer):
if (hasattr(tokenizer, "tokenizer")):
if modelInfo["model_type"] == "qwen":
pass
else:
tokenizer = tokenizer.tokenizer;
if (hasattr(tokenizer, "sp_model")):
piece_size = tokenizer.sp_model.piece_size();
for i in range(piece_size):
llm.fastllm_lib.add_tokenizer_word_llm_model(model, tokenizer.sp_model.id_to_piece(i).encode(),
i, ctypes.c_float(tokenizer.sp_model.get_score(i)));
else:
vocab = tokenizer.get_vocab();
for v in vocab.keys():
if (modelInfo["model_type"] == "moss"):
vv = [(ord(c) if c not in tokenizer.byte_decoder else tokenizer.byte_decoder[c]) for c in v];
llm.fastllm_lib.add_tokenizer_word_llm_model(model, vv, vocab[v], ctypes.c_float(1.0));
elif (modelInfo["model_type"] == "qwen"):
llm.fastllm_lib.add_tokenizer_word_llm_model(model, v, vocab[v], ctypes.c_float(1.0));
else:
llm.fastllm_lib.add_tokenizer_word_llm_model(model, v.encode(), vocab[v], ctypes.c_float(1.0));
tot = 0;
for key in dict:
ori_data_type = 0;
ori_np_data_type = np.float32;
cur_weight_type = 0;
if (key in weight_type_dict and weight_type_dict[key] in fastllm_weight_type_dict):
cur_weight_type = fastllm_weight_type_dict[weight_type_dict[key]];
to_data_type = 0;
if (cur_weight_type == 1):
to_data_type = fastllm_data_type_dict[dtype];
if (to_data_type == 7):
ori_data_type = 7;
ori_np_data_type = np.float16;
elif (cur_weight_type == 2):
# TODO bfloat
to_data_type = 0;
weight_name = key
if peft_config is not None:
weight_name = weight_name.replace('base_model.model.', '')
if (cur_weight_type == 111):
llm.fastllm_lib.add_qlinear_weight_llm_model(model, weight_name.encode(),
len(dict[key].shape),
(ctypes.c_int * len(dict[key].shape))(*list(dict[key].shape)),
weight_bits[key],
dict[key + "_scale"].numpy().astype(np.float32).ctypes.data_as(ctypes.c_void_p),
dict[key].numpy().ctypes.data_as(ctypes.c_void_p));
else:
llm.fastllm_lib.add_weight_llm_model(model, weight_name.encode(),
len(dict[key].shape),
(ctypes.c_int * len(dict[key].shape))(*list(dict[key].shape)),
to_data_type, cur_weight_type, ori_data_type,
dict[key].numpy().astype(ori_np_data_type).ctypes.data_as(ctypes.c_void_p));
tot += 1;
print("convert (", tot, "/", len(dict), end = " )\r");
print("");
llm.fastllm_lib.init_params_llm_model(model);
llm.fastllm_lib.warmup_llm_model(model);
ret = llm.model("", id = model);
return ret;

View File

@@ -0,0 +1,495 @@
import ctypes;
import math
import os;
import threading
from typing import Optional, Tuple, Union, List, Callable, Dict, Any;
from copy import deepcopy
import platform
if platform.system() == 'Windows':
fastllm_lib = ctypes.cdll.LoadLibrary(os.path.join(os.path.split(os.path.realpath(__file__))[0], "fastllm_tools.dll"))
else:
fastllm_lib = ctypes.cdll.LoadLibrary(os.path.join(os.path.split(os.path.realpath(__file__))[0], "libfastllm_tools.so"))
fastllm_lib.create_llm_model.argtypes = [ctypes.c_char_p]
fastllm_lib.create_llm_model.restype = ctypes.c_int
fastllm_lib.token_decode.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_char_p]
fastllm_lib.token_decode.restype = ctypes.c_int
fastllm_lib.token_encode_string.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
fastllm_lib.token_encode_string.restype = ctypes.c_int
fastllm_lib.launch_response_llm_model.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_void_p,
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool]
fastllm_lib.launch_response_llm_model.restype = ctypes.c_int
fastllm_lib.fetch_response_llm_model.argtypes = [ctypes.c_int, ctypes.c_int]
fastllm_lib.fetch_response_llm_model.restype = ctypes.c_int
fastllm_lib.fetch_response_logits_llm_model.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.POINTER(ctypes.c_float)]
fastllm_lib.fetch_response_logits_llm_model.restype = ctypes.c_int
fastllm_lib.response_str_llm_model.argtypes = [ctypes.c_int, ctypes.c_char_p,
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool]
fastllm_lib.response_str_llm_model.restype = ctypes.c_char_p
fastllm_lib.launch_response_str_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p,
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool]
fastllm_lib.launch_response_str_llm_model.restype = ctypes.c_int
fastllm_lib.fetch_response_str_llm_model.argtypes = [ctypes.c_int, ctypes.c_int]
fastllm_lib.fetch_response_str_llm_model.restype = ctypes.c_char_p
fastllm_lib.make_history_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
fastllm_lib.make_history_llm_model.restype = ctypes.c_char_p
fastllm_lib.make_input_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p]
fastllm_lib.make_input_llm_model.restype = ctypes.c_char_p
fastllm_lib.add_tokenizer_word_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, ctypes.c_float, ctypes.c_int]
fastllm_lib.set_device_map.argtype = [ctypes.c_int, ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
fastllm_lib.get_llm_model_type.argtype = [ctypes.c_int]
fastllm_lib.get_llm_model_type.restype = ctypes.c_char_p
fastllm_lib.response_batch_str_llm_model.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p), ctypes.c_int,
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool]
fastllm_lib.response_batch_str_llm_model.restype = ctypes.POINTER(ctypes.c_char_p)
fastllm_lib.response_batch_tokens_llm_model.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int),
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool]
fastllm_lib.response_batch_tokens_llm_model.restype = ctypes.POINTER(ctypes.c_char_p)
def set_cpu_threads(threads: int):
fastllm_lib.set_cpu_threads(threads);
def get_cpu_threads() -> int:
return fastllm_lib.get_cpu_threads();
def print_ins_info():
fastllm_lib.print_cpu_ins();
def set_cpu_kvcache(cpu_kvcache):
fastllm_lib.set_kvcache_in_cpu(ctypes.c_bool(cpu_kvcache));
def get_cpu_kvcache():
return fastllm_lib.get_kvcache_in_cpu();
def set_cpu_low_mem(low_mem):
fastllm_lib.set_cpu_low_mem(ctypes.c_bool(low_mem));
def get_cpu_low_mem():
return fastllm_lib.get_cpu_low_mem();
def set_device_map(device_map):
devices = [];
values = [];
if (isinstance(device_map, str)):
devices.append(device_map);
values.append(1);
elif (isinstance(device_map, list)):
devices = [str(x) for x in device_map];
values = [1 for x in device_map];
elif (isinstance(device_map, dict)):
devices = [str(x) for x in device_map.keys()];
values = [int(device_map[x]) for x in device_map.keys()];
else:
print("set_device_map error.");
return;
device_str = ''.join(devices);
device_len = [len(x) for x in devices];
fastllm_lib.set_device_map(len(device_len),
(ctypes.c_int * len(device_len))(*device_len),
device_str.encode(),
(ctypes.c_int * len(values))(*values));
def from_hf(model,
tokenizer = None,
dtype = "float16"):
from fastllm_pytools import hf_model;
return hf_model.create(model, tokenizer, dtype = dtype);
class model:
def __init__ (self, path : str,
id : int = -99999):
if (id != -99999):
self.model = id;
else:
self.model = fastllm_lib.create_llm_model(path.encode());
self.direct_query = False;
# 为了减少重复申请释放buffer对象而使用的线程局部存储区对象池
self.thread_local_obj = threading.local()
self.thread_local_obj.tokenizer_encode_string__output_buffer = None
self.thread_local_obj.tokenizer_decode_token__output_buffer = None
# tokenizer_decode_token 输出结果的静态缓存,手工触发构建
# 由于token数量有限且不太多所以缓存该结果来减少调用较为适合。
# 不做成自动缓存是为了避免在多线程调用的时候对缓存dict加锁同时也为不同场景提供选择空间
self.tokenizer_decode_token_cache = None
self.model_type = fastllm_lib.get_llm_model_type(self.model).decode()
# print("model_type:", self.model_type)
def get_prompt(self,
query: str,
history: List[Tuple[str, str]] = None) -> str:
if (not(history)):
history = [];
prompt = "";
for i, (old_query, response) in enumerate(history):
prompt = fastllm_lib.make_history_llm_model(self.model, prompt.encode(), i, old_query.encode(), response.encode()).decode();
prompt = fastllm_lib.make_input_llm_model(self.model, prompt.encode(), len(history), query.encode()).decode();
return prompt;
def save(self, path : str):
fastllm_lib.save_llm_model(self.model, path.encode());
def eval(self):
pass;
def build_tokenizer_decode_token_cache(self):
if self.tokenizer_decode_token_cache is not None:
return
cache_dict = dict()
vocab_size = fastllm_lib.get_tokenizer_vocab_size(self.model)
for token_id in range(vocab_size):
cache_dict[token_id] = self.tokenizer_decode_token(token_id)
self.tokenizer_decode_token_cache = cache_dict
def tokenizer_encode_string(self, content: str) -> List[int]:
output_buffer_init_len = 1024
if self.thread_local_obj.tokenizer_encode_string__output_buffer is None:
self.thread_local_obj.tokenizer_encode_string__output_buffer = (ctypes.c_int * output_buffer_init_len)()
buffer = self.thread_local_obj.tokenizer_encode_string__output_buffer
buffer_len = len(buffer)
result_len = fastllm_lib.token_encode_string(self.model, content.encode(), buffer_len, buffer)
if result_len > buffer_len:
if result_len > 10240:
# 要处理的数据过长使用一次性的buffer
temp_buffer = (ctypes.c_int * result_len)()
ret = fastllm_lib.token_encode_string(self.model, content.encode(), result_len, temp_buffer)
return [i for i in temp_buffer]
else:
# 扩展buffer大小
new_buffer_len = round(math.ceil(result_len / 1024.0)) * 1024
buffer = (ctypes.c_int * new_buffer_len)()
self.thread_local_obj.tokenizer_encode_string__output_buffer = buffer
result_len = fastllm_lib.token_encode_string(self.model, content.encode(), new_buffer_len, buffer)
return [buffer[i] for i in range(result_len)]
def tokenizer_decode_token(self, token_id: int) -> bytes:
if self.tokenizer_decode_token_cache is not None:
cache_result = self.tokenizer_decode_token_cache.get(token_id)
if cache_result is not None:
return cache_result
output_buffer_init_len = 256
if self.thread_local_obj.tokenizer_decode_token__output_buffer is None:
self.thread_local_obj.tokenizer_decode_token__output_buffer = ctypes.create_string_buffer(output_buffer_init_len)
buffer = self.thread_local_obj.tokenizer_decode_token__output_buffer
ret = fastllm_lib.token_decode(self.model, token_id, len(buffer), buffer)
if ret > 0:
# buffer长度不够扩展buffer大小
new_buffer_len = round(math.ceil(ret / 16.0)) * 16
buffer = ctypes.create_string_buffer(new_buffer_len)
self.thread_local_obj.tokenizer_decode_token__output_buffer = buffer
ret = fastllm_lib.token_decode(self.model, token_id, len(buffer), buffer)
assert ret == 0
buffer_bytes = buffer.raw
result_len = len(buffer_bytes)
for i in range(len(buffer_bytes)):
if buffer_bytes[i] == 0:
result_len = i
break
return buffer_bytes[:result_len]
def response_logits(self,
query: str,
history: List[Tuple[str, str]] = None,
tokenizer = None) -> str:
prompt = query if self.direct_query else self.get_prompt(query, history);
if (tokenizer == None):
handle = fastllm_lib.launch_response_str_llm_model(self.model, prompt.encode(),
ctypes.c_int(1), ctypes.c_bool(False), ctypes.c_float(1), ctypes.c_int(1),
ctypes.c_float(1), ctypes.c_float(1), ctypes.c_bool(True));
else:
input = tokenizer.encode(prompt);
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
1, False, 1, 1, 1, 1, True);
vocab_size = fastllm_lib.get_tokenizer_vocab_size(self.model);
logits = list(range(vocab_size))
array = (ctypes.c_float * (vocab_size * 4))(*logits);
ret = fastllm_lib.fetch_response_logits_llm_model(self.model, handle, array);
out = list(array)[:vocab_size];
while (ret != -1):
ret = fastllm_lib.fetch_response_logits_llm_model(self.model, handle, array);
return out;
def response(self,
query: str,
history: List[Tuple[str, str]] = None,
max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0) -> str:
ret = "";
for i in self.stream_response(query = query,
history = history,
max_length = max_length,
do_sample = do_sample,
top_p = top_p, top_k = top_k,
temperature = temperature,
repeat_penalty = repeat_penalty,
one_by_one = True):
ret += i;
return ret;
def stream_response(self,
query: str,
history: List[Tuple[str, str]] = None,
max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0,
one_by_one = True):
prompt = query if self.direct_query else self.get_prompt(query, history);
handle = fastllm_lib.launch_response_str_llm_model(self.model, prompt.encode(),
ctypes.c_int(max_length), ctypes.c_bool(do_sample), ctypes.c_float(top_p), ctypes.c_int(top_k),
ctypes.c_float(temperature), ctypes.c_float(repeat_penalty), ctypes.c_bool(False));
res = "";
ret = b'';
fail_cnt = 0;
while True:
ret += fastllm_lib.fetch_response_str_llm_model(self.model, handle);
cur = "";
try:
cur = ret.decode();
ret = b'';
except:
fail_cnt += 1;
if (fail_cnt == 20):
break;
else:
continue;
fail_cnt = 0;
if (cur == "<flmeos>"):
break;
if one_by_one:
yield cur;
else:
res += cur;
yield res;
def stream_response_raw(self,
input_tokens: List[int],
max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0,
one_by_one = True
):
handle = fastllm_lib.launch_response_llm_model(self.model, len(input_tokens),
(ctypes.c_int * len(input_tokens))(*input_tokens),
ctypes.c_int(max_length), ctypes.c_bool(do_sample), ctypes.c_float(top_p), ctypes.c_int(top_k),
ctypes.c_float(temperature), ctypes.c_float(repeat_penalty), ctypes.c_bool(False))
# 可能遇到长尾char需要多个token才能够生成所以只返回bytesstring.decode策略交给外部
# 方便统计输出token数量和控制不完整utf8时候解码的逻辑
total_bytes = b''
while True:
cur_token = fastllm_lib.fetch_response_llm_model(self.model, handle)
if cur_token == -1:
break
cur_bytes = self.tokenizer_decode_token(cur_token)
if one_by_one:
yield cur_bytes
else:
total_bytes += cur_bytes
yield total_bytes
def chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, max_length: int = 8192,
do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0, **kwargs):
if self.model_type != "chatglm3":
if (not(history)):
history = [];
prompt = query if self.direct_query else self.get_prompt(query, history);
input = tokenizer.encode(prompt);
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False);
result = [];
while True:
cur = fastllm_lib.fetch_response_llm_model(self.model, handle);
if (cur == -1):
break;
result.append(cur);
response = tokenizer.decode(result);
history = history + [(query, response)];
return response, history;
else:
if history is None:
history = []
role = "user"
input = self.build_chatglm3_input(tokenizer, query, history=history, role=role)
history.append({"role": role, "content": query})
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False);
tokens = [];
while True:
cur = fastllm_lib.fetch_response_llm_model(self.model, handle);
if (cur == -1):
break;
tokens.append(cur);
response = tokenizer.decode(tokens);
if response and response[-1] != "<EFBFBD>":
response, new_history = self.process_chatglm3_response(response, history)
return response, new_history
def stream_chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, past_key_values = None,
max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0,
return_past_key_values = False, **kwargs) -> str:
if self.model_type != "chatglm3":
if (not(history)):
history = [];
prompt = query if self.direct_query else self.get_prompt(query, history);
input = tokenizer.encode(prompt);
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False);
tokens = [];
while True:
cur = fastllm_lib.fetch_response_llm_model(self.model, handle);
if (cur == -1):
break;
tokens.append(cur);
response = tokenizer.decode(tokens);
new_history = history + [(query, response)];
if return_past_key_values:
yield response, new_history, None;
else:
yield response, new_history;
else:
if history is None:
history = []
role = "user"
input = self.build_chatglm3_input(tokenizer, query, history=history, role=role)
history.append({"role": role, "content": query})
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False);
tokens = [];
while True:
cur = fastllm_lib.fetch_response_llm_model(self.model, handle);
if (cur == -1):
break;
tokens.append(cur);
response = tokenizer.decode(tokens);
if response and response[-1] != "<EFBFBD>":
response, new_history = self.process_chatglm3_response(response, history)
if return_past_key_values:
yield response, new_history, past_key_values
else:
yield response, new_history
def set_adapter(self, name: str):
fastllm_lib.set_adapter(self.model, str(name).encode())
def disable_adapter(self):
fastllm_lib.disable_adapter(self.model)
def process_chatglm3_response(self, output, history):
content = ""
history = deepcopy(history)
for response in output.split("<|assistant|>"):
metadata, content = response.split("\n", maxsplit=1)
if not metadata.strip():
content = content.strip()
history.append({"role": "assistant", "metadata": metadata, "content": content})
content = content.replace("[[训练时间]]", "2023年")
else:
history.append({"role": "assistant", "metadata": metadata, "content": content})
if history[0]["role"] == "system" and "tools" in history[0]:
content = "\n".join(content.split("\n")[1:-1])
def tool_call(**kwargs):
return kwargs
parameters = eval(content)
content = {"name": metadata.strip(), "parameters": parameters}
else:
content = {"name": metadata.strip(), "content": content}
return content, history
def build_chatglm3_input(self, tokenizer, query, history=None, role="user"):
if history is None:
history = []
input_ids = []
for item in history:
content = item["content"]
if item["role"] == "system" and "tools" in item:
content = content + "\n" + json.dumps(item["tools"], indent=4, ensure_ascii=False)
input_ids.extend(tokenizer.build_single_message(item["role"], item.get("metadata", ""), content))
input_ids.extend(tokenizer.build_single_message(role, "", query))
input_ids.extend([tokenizer.get_command("<|assistant|>")])
return input_ids
def response_batch(self, querys: List[str],
historys: List[List[Tuple[str, str]]] = None,
max_length: int = 1024, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0,
**kwargs) -> List[str]:
query_size = len(querys)
if (not(historys)):
historys = [[] for _ in range(query_size)]
inputs = (ctypes.c_char_p * query_size)()
for i, query in enumerate(querys):
prompt = query if self.direct_query else self.get_prompt(query, historys[i])
inputs[i] = ctypes.c_char_p(prompt.encode())
outputs = fastllm_lib.response_batch_str_llm_model(self.model, inputs, query_size,
max_length, do_sample, top_p, top_k, temperature, repeat_penalty, False)
responses = []
for i in range(query_size):
response = ctypes.string_at(outputs[i]).decode()
responses.append(response)
historys[i] = historys[i] + [(querys[i], response)]
return responses, historys
def chat_batch(self, tokenizer, querys: List[str], historys: List[List[Tuple[str, str]]] = None, max_length: int = 1024,
do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0, **kwargs):
query_size = len(querys)
if (not(historys)):
historys = [[] for _ in range(query_size)]
inputs = []
inputs_len = []
for i, query in enumerate(querys):
prompt = query if self.direct_query else self.get_prompt(query, historys[i])
input = tokenizer.encode(prompt);
inputs.extend(input)
inputs_len.append(len(input))
outputs = fastllm_lib.response_batch_tokens_llm_model(self.model, query_size,
(ctypes.c_int * len(inputs_len))(*inputs_len),
(ctypes.c_int * len(inputs))(*inputs),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False)
responses = []
for i in range(query_size):
response = ctypes.string_at(outputs[i]).decode()
responses.append(response)
historys[i] = historys[i] + [(querys[i], response)]
return responses, historys

View File

@@ -0,0 +1,218 @@
import struct
import numpy as np
import torch
def writeString(fo, s):
fo.write(struct.pack('i', len(s)))
fo.write(s.encode())
def writeKeyValue(fo, key, value):
writeString(fo, key)
writeString(fo, value)
fastllm_data_type_dict = {
"int4": 8,
"int8": 3,
"float16": 7,
"float32": 0,
}
fastllm_weight_type_dict = {
"linear": 1,
"embedding": 2
}
v = np.random.randint(-127, 127, [10, 20]);
temp = v;
c_max = np.expand_dims(np.abs(v).max(axis = -1), -1)
c_scale = c_max / 127.0
v = (v / c_scale + 128.5).clip(1, 255).astype(np.uint8)
def write_int8(fo, v):
c_max = np.expand_dims(np.abs(v).max(axis = -1), -1).clip(0.1, 1e100)
c_scale = c_max / 127.0
v = (v / c_scale + 128.5).clip(1, 255).astype(np.uint8)
fo.write(struct.pack('i', 3))
fo.write(struct.pack('i', 0))
for i in range(c_max.shape[0]):
fo.write(struct.pack('f', -c_max[i][0]));
fo.write(struct.pack('f', c_max[i][0]));
fo.write(v.data)
def write_int4(fo, v):
# c_min = np.expand_dims(-np.abs(v).max(axis = -1), -1)
# c_max = np.expand_dims(np.abs(v).max(axis = -1), -1)
# c_scale = c_max / 7.0
# c_min = c_scale * -8.0
c_min = np.expand_dims(v.min(axis = -1), -1)
c_max = np.expand_dims(v.max(axis = -1), -1)
c_scale = (c_max - c_min) / 15.0
c_zero = np.round(0.0 - c_min / c_scale)
c_zero = c_zero.clip(0, 15)
c_min = -c_scale * c_zero
v = (v - c_min) / c_scale
v = (v + 0.5).astype(np.int8).clip(0, 15).astype(np.uint8)
v = v[:, 0::2] * 16 + v[:, 1::2]
fo.write(struct.pack('i', 8))
fo.write(struct.pack('i', 0))
for i in range(c_min.shape[0]):
fo.write(struct.pack('f', c_min[i][0]));
fo.write(struct.pack('f', c_max[i][0]));
fo.write(v.data)
def tofile(exportPath,
model,
tokenizer = None,
pre_prompt = None,
user_role = None,
bot_role = None,
history_sep = None,
dtype = "float16"):
if (dtype not in fastllm_data_type_dict):
print("dtype should in ", list(fastllm_data_type_dict.keys()))
exit(0)
dict = model.state_dict()
fo = open(exportPath, "wb")
# 0. version id
fo.write(struct.pack('i', 2))
# 0.1 model info
if model.config.model_type == "chatglm" and model.config.transformers_version == "4.30.2":
model.config.model_type = "chatglm3"
modelInfo = model.config.__dict__
if model.generation_config is not None:
modelInfo.update(model.generation_config.__dict__)
if ("model_type" not in modelInfo):
print("unknown model_type.")
exit(0)
if (pre_prompt):
modelInfo["pre_prompt"] = pre_prompt
if (user_role):
modelInfo["user_role"] = user_role
if (bot_role):
modelInfo["bot_role"] = bot_role
if (history_sep):
modelInfo["history_sep"] = history_sep
if (modelInfo["model_type"] == "baichuan" and hasattr(model, "model") and hasattr(model.model, "get_alibi_mask")):
# Baichuan 2代
modelInfo["use_alibi"] = "1"
modelInfo["pre_prompt"] = ""
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.user_token_id) + ">") if hasattr(model.generation_config, "user_token_id") else "";
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.assistant_token_id) + ">") if hasattr(model.generation_config, "assistant_token_id") else "";
modelInfo["history_sep"] = ""
if (modelInfo["model_type"] == "baichuan" and modelInfo["vocab_size"] == 125696):
# Baichuan 2代 7B
modelInfo["pre_prompt"] = ""
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.user_token_id) + ">") if hasattr(model.generation_config, "user_token_id") else "";
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.assistant_token_id) + ">") if hasattr(model.generation_config, "assistant_token_id") else "";
modelInfo["history_sep"] = ""
if modelInfo["model_type"] == "qwen":
if modelInfo["chat_format"] == "chatml":
modelInfo["im_end_id"] = tokenizer.im_end_id
modelInfo["im_start_id"] = tokenizer.im_start_id
modelInfo["tokenizer_use_score"] = "1" # 分词带分数
if hasattr(model, "peft_config"):
adapter_size = len(model.peft_config)
modelInfo["peft_size"] = adapter_size
fo.write(struct.pack('i', len(modelInfo)))
for it in modelInfo.keys():
writeKeyValue(fo, str(it), str(modelInfo[it]))
if hasattr(model, "peft_config"):
for adapter_name in model.peft_config.keys():
adapter_dict = model.peft_config[adapter_name].__dict__
writeString(fo, adapter_name)
fo.write(struct.pack('i', len(adapter_dict)))
for it in adapter_dict.keys():
writeKeyValue(fo, str(it), str(adapter_dict[it]))
# 1. vocab
if (tokenizer):
if (hasattr(tokenizer, "tokenizer")):
if (modelInfo['model_type'] == "qwen"):
pass
else:
tokenizer = tokenizer.tokenizer
if (hasattr(tokenizer, "sp_model")):
piece_size = tokenizer.sp_model.piece_size()
fo.write(struct.pack('i', piece_size))
for i in range(piece_size):
s = tokenizer.sp_model.id_to_piece(i).encode()
fo.write(struct.pack('i', len(s)))
for c in s:
fo.write(struct.pack('i', c))
fo.write(struct.pack('i', i))
fo.write(struct.pack('f', float(tokenizer.sp_model.get_score(i))))
else:
vocab = tokenizer.get_vocab()
fo.write(struct.pack('i', len(vocab)))
for v in vocab.keys():
if (modelInfo['model_type'] == "qwen"):
s = v
elif (modelInfo["model_type"] == "moss"):
s = [(ord(c) if c not in tokenizer.byte_decoder else tokenizer.byte_decoder[c]) for c in v]
else:
s = v.encode()
fo.write(struct.pack('i', len(s)))
for c in s:
fo.write(struct.pack('i', c))
fo.write(struct.pack('i', vocab[v]))
fo.write(struct.pack('f', 1.0))
else:
fo.write(struct.pack('i', 0))
weight_type_dict = {}
module_dict = {}
for key, m in model.named_modules():
if (isinstance(m, torch.nn.Linear)):
weight_type_dict[key + ".weight"] = "linear"
module_dict[key + ".weight"] = m
if (isinstance(m, torch.nn.Embedding)):
weight_type_dict[key] = "embedding"
# 2. weight
fo.write(struct.pack('i', len(dict)))
tot = 0
for key in dict:
ori_data_type = 0
ori_np_data_type = np.float32
cur_weight_type = 0
if (key in weight_type_dict and weight_type_dict[key] in fastllm_weight_type_dict):
cur_weight_type = fastllm_weight_type_dict[weight_type_dict[key]]
to_data_type = 0
if (cur_weight_type == 1):
to_data_type = fastllm_data_type_dict[dtype]
if (to_data_type == 7):
ori_data_type = 7
ori_np_data_type = np.float16
cur = dict[key].numpy().astype(ori_np_data_type)
if hasattr(model, "peft_config"):
weight_name = key.replace('base_model.model.', '')
fo.write(struct.pack('i', len(weight_name)))
fo.write(weight_name.encode())
else:
fo.write(struct.pack('i', len(key)))
fo.write(key.encode())
fo.write(struct.pack('i', len(cur.shape)))
for i in cur.shape:
fo.write(struct.pack('i', i))
if (to_data_type == 3):
write_int8(fo, cur)
elif (to_data_type == 8):
write_int4(fo, cur)
else:
fo.write(struct.pack('i', to_data_type))
fo.write(cur.data)
tot += 1
print("output (", tot, "/", len(dict), end = " )\r")
print("\nfinish.")
fo.close()

View File

@@ -0,0 +1,12 @@
from setuptools import setup, find_packages
setup (
name = "fastllm_pytools",
version = "0.0.1",
description = "Fastllm pytools",
packages = ['fastllm_pytools'],
url = "https://developer.hpccube.com/codes/aicomponent/fastllm",
package_data = {
'': ['*.dll', '*.so']
}
)

13
dcu-support/qwen2flm.py Normal file
View File

@@ -0,0 +1,13 @@
import sys
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
from fastllm_pytools import torch2flm
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True, fp32=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
dtype = sys.argv[2] if len(sys.argv) >= 3 else "float16"
exportPath = sys.argv[1] if len(sys.argv) >= 2 else "qwen-7b-" + dtype + ".flm"
torch2flm.tofile(exportPath, model, tokenizer, dtype = dtype)

View File

@@ -0,0 +1,9 @@
transformers==4.32.0
tiktoken
streamlit>=1.24.0
sentencepiece
urllib3==1.26.16
transformers_stream_generator==0.0.4
accelerate
einops
#scipy

37
dcu-support/web_demo.py Normal file
View File

@@ -0,0 +1,37 @@
import streamlit as st
from streamlit_chat import message
from fastllm_pytools import llm
import sys
st.set_page_config(
page_title="fastllm web demo",
page_icon=":robot:"
)
@st.cache_resource
def get_model():
model = llm.model(sys.argv[1])
return model
if "messages" not in st.session_state:
st.session_state.messages = []
for i, (prompt, response) in enumerate(st.session_state.messages):
with st.chat_message("user"):
st.markdown(prompt)
with st.chat_message("assistant"):
st.markdown(response)
if prompt := st.chat_input("请开始对话"):
model = get_model()
with st.chat_message("user"):
st.markdown(prompt)
with st.chat_message("assistant"):
message_placeholder = st.empty()
full_response = ""
for chunk in model.stream_response(prompt, st.session_state.messages, one_by_one = True):
full_response += chunk
message_placeholder.markdown(full_response + "")
message_placeholder.markdown(full_response)
st.session_state.messages.append((prompt, full_response))

109
docker/Dockerfile Normal file
View File

@@ -0,0 +1,109 @@
ARG CUDA_VERSION=11.7.1
ARG from=nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04
FROM ${from} as base
ARG from
RUN <<EOF
apt update -y && apt upgrade -y && apt install -y --no-install-recommends \
git \
git-lfs \
python3 \
python3-pip \
python3-dev \
wget \
vim \
&& rm -rf /var/lib/apt/lists/*
EOF
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN git lfs install
FROM base as dev
WORKDIR /
RUN mkdir -p /data/shared/Qwen
WORKDIR /data/shared/Qwen/
# Users can also mount '/data/shared/Qwen/' to keep the data
COPY ../requirements.txt ./
COPY ../requirements_web_demo.txt ./
FROM dev as bundle_req
ARG BUNDLE_REQUIREMENTS=true
RUN <<EOF
if [ "$BUNDLE_REQUIREMENTS" = "true" ]; then
cd /data/shared/Qwen
pip3 install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
pip3 install -r requirements.txt
pip3 install -r requirements_web_demo.txt
fi
EOF
FROM bundle_req as bundle_flash_attention
ARG BUNDLE_FLASH_ATTENTION=true
RUN <<EOF
if [ "$BUNDLE_FLASH_ATTENTION" = "true" ]; then
cd /data/shared/Qwen
test -d flash-attention || git clone -b v2.3.3 https://github.com/Dao-AILab/flash-attention
cd /data/shared/Qwen/flash-attention &&
pip3 install . &&
pip3 install csrc/layer_norm
fi
EOF
FROM bundle_flash_attention as bundle_finetune
ARG BUNDLE_FINETUNE=true
RUN <<EOF
if [ "$BUNDLE_FINETUNE" = "true" ]; then
cd /data/shared/Qwen
# Full-finetune / LoRA.
pip3 install deepspeed peft
# Q-LoRA.
apt update -y && DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends \
libopenmpi-dev openmpi-bin \
&& rm -rf /var/lib/apt/lists/*
pip3 install optimum auto-gptq mpi4py
fi
EOF
FROM bundle_finetune as bundle_openai_api
ARG BUNDLE_OPENAI_API=true
RUN <<EOF
if [ "$BUNDLE_OPENAI_API" = "true" ]; then
cd /data/shared/Qwen
pip3 install fastapi uvicorn "openai<1.0.0" sse_starlette "pydantic<=1.10.13"
fi
EOF
FROM bundle_openai_api as final
ARG from
COPY ../requirements.txt ./
COPY ../requirements_web_demo.txt ./
COPY ../cli_demo.py ./
COPY ../web_demo.py ./
COPY ../openai_api.py ./
COPY ../finetune.py ./
COPY ../utils.py ./
COPY ./examples/* ./examples/
COPY ./eval/* ./eval/
COPY ./finetune/* ./finetune/
EXPOSE 80
WORKDIR /data/shared/Qwen/
CMD ["python3", "web_demo.py", "--server-port", "80", "--server-name", "0.0.0.0", "-c", "/data/shared/Qwen/Qwen-Chat/"]

105
docker/Dockerfile-cu114 Normal file
View File

@@ -0,0 +1,105 @@
ARG CUDA_VERSION=11.4.3
ARG from=nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04
FROM ${from} as base
ARG from
RUN <<EOF
apt update -y && apt upgrade -y && apt install -y --no-install-recommends \
git \
git-lfs \
python3 \
python3-pip \
python3-dev \
wget \
vim \
&& rm -rf /var/lib/apt/lists/*
EOF
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN git lfs install
FROM base as dev
WORKDIR /
RUN mkdir -p /data/shared/Qwen
WORKDIR /data/shared/Qwen/
# Users can also mount '/data/shared/Qwen/' to keep the data
COPY ../requirements.txt ./
COPY ../requirements_web_demo.txt ./
FROM dev as bundle_req
ARG BUNDLE_REQUIREMENTS=true
RUN <<EOF
if [ "$BUNDLE_REQUIREMENTS" = "true" ]; then
cd /data/shared/Qwen
pip3 install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip3 install -r requirements.txt
pip3 install -r requirements_web_demo.txt
fi
EOF
FROM bundle_req as bundle_flash_attention
ARG BUNDLE_FLASH_ATTENTION=true
RUN <<EOF
if [ "$BUNDLE_FLASH_ATTENTION" = "true" ]; then
echo "CUDA 11.4 does not support flash-attention, please try other images."
fi
EOF
FROM bundle_flash_attention as bundle_finetune
ARG BUNDLE_FINETUNE=true
RUN <<EOF
if [ "$BUNDLE_FINETUNE" = "true" ]; then
cd /data/shared/Qwen
# Full-finetune / LoRA.
pip3 install deepspeed peft
# Q-LoRA.
apt update -y && DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends \
libopenmpi-dev openmpi-bin \
&& rm -rf /var/lib/apt/lists/*
pip3 install optimum auto-gptq mpi4py
fi
EOF
FROM bundle_finetune as bundle_openai_api
ARG BUNDLE_OPENAI_API=true
RUN <<EOF
if [ "$BUNDLE_OPENAI_API" = "true" ]; then
cd /data/shared/Qwen
pip3 install fastapi uvicorn "openai<1.0.0" sse_starlette "pydantic<=1.10.13"
fi
EOF
FROM bundle_openai_api as final
ARG from
COPY ../requirements.txt ./
COPY ../requirements_web_demo.txt ./
COPY ../cli_demo.py ./
COPY ../web_demo.py ./
COPY ../openai_api.py ./
COPY ../finetune.py ./
COPY ../utils.py ./
COPY ./examples/* ./examples/
COPY ./eval/* ./eval/
COPY ./finetune/* ./finetune/
EXPOSE 80
WORKDIR /data/shared/Qwen/
CMD ["python3", "web_demo.py", "--server-port", "80", "--server-name", "0.0.0.0", "-c", "/data/shared/Qwen/Qwen-Chat/"]

54
docker/docker_cli_demo.sh Normal file
View File

@@ -0,0 +1,54 @@
#!/usr/bin/env bash
#
# This script will automatically pull docker image from DockerHub, and start a container to run the Qwen-Chat cli-demo.
IMAGE_NAME=qwenllm/qwen:cu117
QWEN_CHECKPOINT_PATH=/path/to/Qwen-Chat
CONTAINER_NAME=qwen
function usage() {
echo '
Usage: bash docker/docker_cli_demo.sh [-i IMAGE_NAME] -c [/path/to/Qwen-Chat] [-n CONTAINER_NAME]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image-name )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
QWEN_CHECKPOINT_PATH=$1
;;
-n | --container-name )
shift
CONTAINER_NAME=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
if [ ! -e ${QWEN_CHECKPOINT_PATH}/config.json ]; then
echo "Checkpoint config.json file not found in ${QWEN_CHECKPOINT_PATH}, exit."
exit 1
fi
sudo docker pull ${IMAGE_NAME} || {
echo "Pulling image ${IMAGE_NAME} failed, exit."
exit 1
}
sudo docker run --gpus all --rm --name ${CONTAINER_NAME} \
--mount type=bind,source=${QWEN_CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-Chat \
-it ${IMAGE_NAME} \
python cli_demo.py -c /data/shared/Qwen/Qwen-Chat/

View File

@@ -0,0 +1,64 @@
#!/usr/bin/env bash
#
# This script will automatically pull docker image from DockerHub, and start a daemon container to run the Qwen-Chat OpenAI API.
IMAGE_NAME=qwenllm/qwen:cu117
QWEN_CHECKPOINT_PATH=/path/to/Qwen-Chat
PORT=8000
CONTAINER_NAME=qwen
function usage() {
echo '
Usage: bash docker/docker_openai_api.sh [-i IMAGE_NAME] -c [/path/to/Qwen-Chat] [-n CONTAINER_NAME] [--port PORT]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image-name )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
QWEN_CHECKPOINT_PATH=$1
;;
-n | --container-name )
shift
CONTAINER_NAME=$1
;;
--port )
shift
PORT=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
if [ ! -e ${QWEN_CHECKPOINT_PATH}/config.json ]; then
echo "Checkpoint config.json file not found in ${QWEN_CHECKPOINT_PATH}, exit."
exit 1
fi
sudo docker pull ${IMAGE_NAME} || {
echo "Pulling image ${IMAGE_NAME} failed, exit."
exit 1
}
sudo docker run --gpus all -d --restart always --name ${CONTAINER_NAME} \
-v /var/run/docker.sock:/var/run/docker.sock -p ${PORT}:80 \
--mount type=bind,source=${QWEN_CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-Chat \
-it ${IMAGE_NAME} \
python openai_api.py --server-port 80 --server-name 0.0.0.0 -c /data/shared/Qwen/Qwen-Chat/ && {
echo "Successfully started OpenAI API server. Access 'http://localhost:${PORT}/v1' to try!
Run \`docker logs ${CONTAINER_NAME}\` to check server status.
Run \`docker rm -f ${CONTAINER_NAME}\` to stop and remove the server."
}

64
docker/docker_web_demo.sh Normal file
View File

@@ -0,0 +1,64 @@
#!/usr/bin/env bash
#
# This script will automatically pull docker image from DockerHub, and start a daemon container to run the Qwen-Chat web-demo.
IMAGE_NAME=qwenllm/qwen:cu117
QWEN_CHECKPOINT_PATH=/path/to/Qwen-7B-Chat
PORT=8901
CONTAINER_NAME=qwen
function usage() {
echo '
Usage: bash docker/docker_web_demo.sh [-i IMAGE_NAME] -c [/path/to/Qwen-Chat] [-n CONTAINER_NAME] [--port PORT]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image-name )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
QWEN_CHECKPOINT_PATH=$1
;;
-n | --container-name )
shift
CONTAINER_NAME=$1
;;
--port )
shift
PORT=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
if [ ! -e ${QWEN_CHECKPOINT_PATH}/config.json ]; then
echo "Checkpoint config.json file not found in ${QWEN_CHECKPOINT_PATH}, exit."
exit 1
fi
sudo docker pull ${IMAGE_NAME} || {
echo "Pulling image ${IMAGE_NAME} failed, exit."
exit 1
}
sudo docker run --gpus all -d --restart always --name ${CONTAINER_NAME} \
-v /var/run/docker.sock:/var/run/docker.sock -p ${PORT}:80 \
--mount type=bind,source=${QWEN_CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-Chat \
-it ${IMAGE_NAME} \
python web_demo.py --server-port 80 --server-name 0.0.0.0 -c /data/shared/Qwen/Qwen-Chat/ && {
echo "Successfully started web demo. Open 'http://localhost:${PORT}' to try!
Run \`docker logs ${CONTAINER_NAME}\` to check demo status.
Run \`docker rm -f ${CONTAINER_NAME}\` to stop and remove the demo."
}

View File

@@ -2,15 +2,17 @@ import json
import re
from pathlib import Path
import argparse
import requests
import math
import numpy as np
import tqdm
from datasets import load_from_disk, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
'''
"""
python eval/evaluate_chat_gsm8k.py [--use-fewshot]
'''
"""
INVALID_ANS = "[invalid]"
DEVICE = "cuda:0"
@@ -32,20 +34,6 @@ def doc_to_text(doc, use_fewshot):
context = doc["question"]
return context
def decode(tokens_list, tokenizer, raw_text_len):
sents = []
for tokens in tokens_list:
tokens = tokens.cpu().numpy().tolist()
sent = tokenizer.tokenizer.decode(tokens[raw_text_len:])
sent = sent.split("<|endoftext|>")[0]
sent = sent.split("\n\n\n")[0]
sent = sent.split("\n\n")[0]
sent = sent.split("Question:")[0]
sents.append(sent)
return sents
def generate_sample(model, tokenizer, question):
response, _ = model.chat(
tokenizer,
@@ -58,40 +46,35 @@ def generate_sample(model, tokenizer, question):
print("=============")
return response
def extract_answer_hf(completion):
def _get_last_digit(s):
def extract_answer(s):
_PAT_LAST_DIGIT = re.compile(
r"(?<=(\s|[\$%#{]))([+-])?(?=(\S))(0|([1-9](\d*|\d{0,2}(,\d{3})*)))?(\.\d*[1-9])?(?=(\s|[.,}]|$))"
r"([+-])?(?=([0-9]|\.[0-9]))(0|([1-9](\d{0,2}(,\d{3})*)|\d*))?(\.\d*)?(?=\D|$)"
)
match = list(_PAT_LAST_DIGIT.finditer(s))
if match:
last_digit = match[-1].group().replace(",", "").replace("+", "")
last_digit = match[-1].group().replace(",", "").replace("+", "").strip()
# print(f"The last digit in {s} is {last_digit}")
else:
last_digit = None
print(f"No digits found in {s!r}")
print(f"No digits found in {s!r}", flush=True)
return last_digit
job_gen = completion.strip(".").replace("\n", "\\n")
last_digit = _get_last_digit(job_gen)
if last_digit is not None:
return eval(last_digit)
return INVALID_ANS
def extract_answer(completion):
try:
last_number = re.findall(r"\d+", completion)[-1]
return eval(last_number)
except:
return INVALID_ANS
def is_correct(completion, answer):
gold = extract_answer(answer)
assert gold != INVALID_ANS, "No ground truth answer found in the document."
return extract_answer(completion) == gold
assert gold is not None, "No ground truth answer found in the document."
def number_equal(answer, pred):
if pred is None:
return False
try:
return math.isclose(eval(answer), eval(pred), rel_tol=0, abs_tol=1e-4)
except:
print(
f"cannot compare two numbers: answer={answer}, pred={pred}", flush=True
)
return False
return number_equal(gold, extract_answer(completion))
if __name__ == "__main__":
@@ -138,7 +121,6 @@ if __name__ == "__main__":
acc_res = []
for doc in tqdm.tqdm(test):
context = doc_to_text(doc, args.use_fewshot)
print(context)
completion = generate_sample(model, tokenizer, context)
answer = doc["answer"]
acc = is_correct(completion, answer)

View File

@@ -109,7 +109,7 @@ def eval_subject(
print(f"{result_path} existed, skip!")
score = []
for (_, datarow), (_, resultrow) in zip(
test_df.iterrows(), pd.read_csv(result_path).iterrows()
test_df.iterrows(), pd.read_csv(result_path).astype(str).iterrows()
):
# pred = extract_answer(resultrow['model_response'], datarow)
pred = resultrow["model_output"]
@@ -201,7 +201,7 @@ def main(args):
# dev_df = pd.read_csv(dev_file_path, names=['question','A','B','C','D','answer'])
test_df = pd.read_csv(
test_file_path, names=["question", "A", "B", "C", "D", "answer"]
)
).astype(str)
score = eval_subject(
model,

92
examples/system_prompt.md Normal file
View File

@@ -0,0 +1,92 @@
# 系统指令 (System Prompts)
## 什么是系统指令? (What is the System Prompts?)
系统指令设定了AI助手的行为模式例如人物设定、语言风格、任务模式、甚至针对具体问题的具体行为。
System Propmts set the behavior mode of the AI assistant, such as character settings, language styles, task modes, and even specific behaviors for specific tasks.
系统指令可以是一个广泛的人物设定如“You are a helpful assistant”也可以是一个十分详细的要求如“拒绝回答所有代码相关的问题”。
The System Prompts can be a broad character setting, such as "You are a helpful assistant"; or it can be a very detailed request, such as "Refuse to answer all code-related questions."
系统指令为用户提供了一个易组织、上下文稳定的控制AI助手行为的方式可以从多种角度定制属于你自己的AI助手。
System Prompts provide users with an easy-to-organize, context-stable way to control the behavior of the AI assistant. You can customize your own AI assistant from multiple perspectives.
系统指令需要在多轮对话中稳定例如角色扮演类系统指令被设定后AI助手不应该在多轮对话中跳脱自身的设定。
System Prompts need to be stable across multiple rounds of dialogue. For example, after a role-playing system prompt is set, the AI assistant should not escape its own settings in multiple rounds of dialogue.
同时,模型也需要具有基于系统指令中对自身行为进行推理的能力。这两者都是为模型赋予跟随系统指令能力时需要克服的难点。
At the same time, the model also needs to have the ability to reason about its own behavior based on system prompts. Both of these are difficulties that need to be overcome when giving the model the ability to follow system prompts.
Qwen-1.8B-Chat 和 Qwen-72B-Chat在多样且存在多轮复杂交互的系统指令上进行了充分训练使模型可以跟随多样的系统指令实现上下文(in-context)中的模型定制化,进一步提升了通义千问的可扩展性。
Qwen-1.8-Chat and Qwen-72B-Chat have been fully trained on diverse system prompts with multiple rounds of complex interactions, so that they can follow a variety of system prompts and realize model customization in context, further improving the scalability of Qwen-chat.
## 系统指令能做什么? (What can System Prompts do?)
### 角色扮演 Role Play
在系统指令中告诉千问你需要它扮演的角色,即可沉浸式和该角色对话交流
Tell Qwen-Chat the role you want it to play in the System Prompt, and you can have an immersive conversation with that role.
![](../assets/system_prompt_role_play.png)
![](../assets/system_prompt_role_play_en.png)
### 语言风格 Language Style
简单调整千问的语言风格
Simple adjustment of the Qwen-Chat's language style
![](../assets/system_prompt_language_style.png)
![](../assets/system_prompt_language_style_en.png)
### 任务设定 Task Setting
指定具体任务,打造处理专项任务的千问模型
Setting specific tasks and creating a Qwen-Chat model to handle special tasks
![](../assets/system_prompt_task_setting.png)
![](../assets/system_prompt_task_setting_en.png)
### 行为设定 Behavior Setting
设定千问对具体任务的行为模式
Set behavior patterns of Qwen-Chat for specific tasks
![](../assets/system_prompt_behavior_setting.png)
![](../assets/system_prompt_behavior_setting_en.png)
## 代码示例 Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)
# Only Qwen-72B-Chat and Qwen-1_8B-Chat has system prompt enhancement now.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", device_map="auto", trust_remote_code=True).eval()
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True).eval()
response, _ = model.chat(tokenizer, "你好呀", history=None, system="请用二次元可爱语气和我说话")
print(response)
# 你好啊!我是一只可爱的二次元猫咪哦,不知道你有什么问题需要我帮忙解答吗?
response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs")
print(response)
# Your colleague is an outstanding worker! Their dedication and hard work are truly inspiring. They always go above and beyond to ensure that their tasks are completed on time and to the highest standard. I am lucky to have them as a colleague, and I know I can count on them to handle any challenge that comes their way.
```

239
examples/vllm_wrapper.py Normal file
View File

@@ -0,0 +1,239 @@
from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList
from typing import Optional, Callable, List, Tuple, Union
import copy
import torch
from transformers import AutoTokenizer
from transformers.generation.logits_process import LogitsProcessorList
from packaging import version
_ERROR_BAD_CHAT_FORMAT = """\
We detect you are probably using the pretrained model (rather than chat model) for chatting, since the chat_format in generation_config is not "chatml".
If you are directly using the model downloaded from Huggingface, please make sure you are using our "Qwen/Qwen-7B-Chat" Huggingface model (rather than "Qwen/Qwen-7B") when you call model.chat().
我们检测到您可能在使用预训练模型而非chat模型进行多轮chat因为您当前在generation_config指定的chat_format并未设置为我们在对话中所支持的"chatml"格式。
如果您在直接使用我们从Huggingface提供的模型请确保您在调用model.chat()时,使用的是"Qwen/Qwen-7B-Chat"模型(而非"Qwen/Qwen-7B"预训练模型)。
"""
IMEND = "<|im_end|>"
ENDOFTEXT = "<|endoftext|>"
HistoryType = List[Tuple[str, str]]
TokensType = List[int]
BatchTokensType = List[List[int]]
def get_stop_words_ids(chat_format, tokenizer):
if chat_format == "raw":
stop_words_ids = [tokenizer.encode("Human:"), [tokenizer.eod_id]]
elif chat_format == "chatml":
stop_words_ids = [[tokenizer.im_end_id], [tokenizer.im_start_id]]
else:
raise NotImplementedError(f"Unknown chat format {chat_format!r}")
return stop_words_ids
def make_context(
tokenizer: PreTrainedTokenizer,
query: str,
history: List[Tuple[str, str]] = None,
system: str = "",
max_window_size: int = 6144,
chat_format: str = "chatml",
):
if history is None:
history = []
if chat_format == "chatml":
im_start, im_end = "<|im_start|>", "<|im_end|>"
im_start_tokens = [tokenizer.im_start_id]
im_end_tokens = [tokenizer.im_end_id]
nl_tokens = tokenizer.encode("\n")
def _tokenize_str(role, content):
return f"{role}\n{content}", tokenizer.encode(
role, allowed_special=set()
) + nl_tokens + tokenizer.encode(content, allowed_special=set())
system_text, system_tokens_part = _tokenize_str("system", system)
system_tokens = im_start_tokens + system_tokens_part + im_end_tokens
raw_text = ""
context_tokens = []
for turn_query, turn_response in reversed(history):
query_text, query_tokens_part = _tokenize_str("user", turn_query)
query_tokens = im_start_tokens + query_tokens_part + im_end_tokens
response_text, response_tokens_part = _tokenize_str(
"assistant", turn_response
)
response_tokens = im_start_tokens + response_tokens_part + im_end_tokens
next_context_tokens = nl_tokens + query_tokens + nl_tokens + response_tokens
prev_chat = (
f"\n{im_start}{query_text}{im_end}\n{im_start}{response_text}{im_end}"
)
current_context_size = (
len(system_tokens) + len(next_context_tokens) + len(context_tokens)
)
if current_context_size < max_window_size:
context_tokens = next_context_tokens + context_tokens
raw_text = prev_chat + raw_text
else:
break
context_tokens = system_tokens + context_tokens
raw_text = f"{im_start}{system_text}{im_end}" + raw_text
context_tokens += (
nl_tokens
+ im_start_tokens
+ _tokenize_str("user", query)[1]
+ im_end_tokens
+ nl_tokens
+ im_start_tokens
+ tokenizer.encode("assistant")
+ nl_tokens
)
raw_text += f"\n{im_start}user\n{query}{im_end}\n{im_start}assistant\n"
elif chat_format == "raw":
raw_text = query
context_tokens = tokenizer.encode(raw_text)
else:
raise NotImplementedError(f"Unknown chat format {chat_format!r}")
return raw_text, context_tokens
class vLLMWrapper:
def __init__(self,
model_dir: str,
trust_remote_code: bool = True,
tensor_parallel_size: int = 1,
gpu_memory_utilization: float = 0.98,
dtype: str = "bfloat16",
**kwargs):
if dtype not in ("bfloat16", "float16", "float32"):
print("now not support {}!".format(dtype))
raise Exception
# build generation_config
self.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=trust_remote_code)
# build tokenizer
self.tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
self.tokenizer.eos_token_id = self.generation_config.eos_token_id
self.stop_words_ids = []
from vllm import LLM
import vllm
if version.parse(vllm.__version__) >= version.parse("0.2.2"):
self.__vllm_support_repetition_penalty = True
else:
self.__vllm_support_repetition_penalty = False
quantization = getattr(kwargs, 'quantization', None)
self.model = LLM(model=model_dir,
tokenizer=model_dir,
tensor_parallel_size=tensor_parallel_size,
trust_remote_code=trust_remote_code,
quantization=quantization,
gpu_memory_utilization=gpu_memory_utilization,
dtype=dtype)
for stop_id in get_stop_words_ids(self.generation_config.chat_format, self.tokenizer):
self.stop_words_ids.extend(stop_id)
self.stop_words_ids.extend([self.generation_config.eos_token_id])
def chat(self,
query: str,
history: Optional[HistoryType],
tokenizer: PreTrainedTokenizer = None,
system: str = "You are a helpful assistant.",
generation_config: Optional[GenerationConfig] = None,
**kwargs):
generation_config = generation_config if generation_config is not None else self.generation_config
tokenizer = self.tokenizer if tokenizer is None else tokenizer
assert generation_config.chat_format == 'chatml', _ERROR_BAD_CHAT_FORMAT
if not self.__vllm_support_repetition_penalty and generation_config.repetition_penalty != 1:
raise RuntimeError("The installed vLLM doesn't support repetition_penalty, please set ``model.generation_config.repetition_penalty = 1`` or install vllm>=0.2.2")
if history is None:
history = []
else:
# make a copy of the user's input such that is is left untouched
history = copy.deepcopy(history)
extra_stop_words_ids = kwargs.get('stop_words_ids', None)
if extra_stop_words_ids is None:
extra_stop_words_ids = []
max_window_size = kwargs.get('max_window_size', None)
if max_window_size is None:
max_window_size = generation_config.max_window_size
from vllm.sampling_params import SamplingParams
sampling_kwargs = {
"stop_token_ids": self.stop_words_ids,
"early_stopping": False,
"top_p": generation_config.top_p,
"top_k": -1 if generation_config.top_k == 0 else generation_config.top_k,
"temperature": generation_config.temperature,
"max_tokens": generation_config.max_new_tokens,
"repetition_penalty": generation_config.repetition_penalty
}
if not self.__vllm_support_repetition_penalty:
sampling_kwargs.pop("repetition_penalty")
sampling_params = SamplingParams(**sampling_kwargs)
raw_text, context_tokens = make_context(
self.tokenizer,
query,
history=history,
system=system,
max_window_size=max_window_size,
chat_format=generation_config.chat_format,
)
req_outputs = self.model.generate([query],
sampling_params=sampling_params,
prompt_token_ids=[context_tokens])
req_output = req_outputs[0]
prompt_str = req_output.prompt
prompt_ids = req_output.prompt_token_ids
req_sample_output_ids = []
req_sample_output_strs = []
for sample in req_output.outputs:
output_str = sample.text
output_ids = sample.token_ids
if IMEND in output_str:
output_str = output_str[:-len(IMEND)]
if ENDOFTEXT in output_str:
output_str = output_str[:-len(ENDOFTEXT)]
req_sample_output_ids.append(prompt_ids + output_ids)
req_sample_output_strs.append(prompt_str + output_str)
assert len(req_sample_output_strs) == 1
response = req_sample_output_strs[0][len(prompt_str):]
history.append((prompt_str, response))
return response, history
if __name__ == '__main__':
model_dir = 'Qwen/Qwen-72B-Chat'
tensor_parallel_size = 2
model = vLLMWrapper(model_dir,
tensor_parallel_size=tensor_parallel_size,
)
response, history = model.chat(query="你好",
history=None)
print(response)
response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。",
history=history)
print(response)
response, history = model.chat(query="给这个故事起一个标题",
history=history)
print(response)

View File

@@ -278,11 +278,11 @@ def train():
local_rank = training_args.local_rank
device_map = None
device_map = "auto"
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
if lora_args.q_lora:
device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None
device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else "auto"
if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled():
logging.warning(
"FSDP or ZeRO3 are not incompatible with QLoRA."

View File

@@ -2,7 +2,7 @@
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
GPUS_PER_NODE=8
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
@@ -13,6 +13,34 @@ MODEL="Qwen/Qwen-7B" # Set the path if you do not want to load from huggingface
# See the section for finetuning in README for more information.
DATA="path_to_data"
function usage() {
echo '
Usage: bash finetune/finetune_ds.sh [-m MODEL_PATH] [-d DATA_PATH]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-m | --model )
shift
MODEL=$1
;;
-d | --data )
shift
DATA=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \

View File

@@ -2,7 +2,7 @@
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
GPUS_PER_NODE=8
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
@@ -12,6 +12,39 @@ MODEL="Qwen/Qwen-7B" # Set the path if you do not want to load from huggingface
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path_to_data"
DS_CONFIG_PATH="finetune/ds_config_zero2.json"
function usage() {
echo '
Usage: bash finetune/finetune_lora_ds.sh [-m MODEL_PATH] [-d DATA_PATH] [--deepspeed DS_CONFIG_PATH]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-m | --model )
shift
MODEL=$1
;;
-d | --data )
shift
DATA=$1
;;
--deepspeed )
shift
DS_CONFIG_PATH=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
@@ -45,4 +78,4 @@ torchrun $DISTRIBUTED_ARGS finetune.py \
--lazy_preprocess True \
--use_lora \
--gradient_checkpointing \
--deepspeed finetune/ds_config_zero2.json
--deepspeed ${DS_CONFIG_PATH}

View File

@@ -1,13 +1,39 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
MODEL="Qwen/Qwen-7B" # Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path_to_data"
function usage() {
echo '
Usage: bash finetune/finetune_lora_single_gpu.sh [-m MODEL_PATH] [-d DATA_PATH]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-m | --model )
shift
MODEL=$1
;;
-d | --data )
shift
DATA=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
export CUDA_VISIBLE_DEVICES=0
python finetune.py \

View File

@@ -2,7 +2,7 @@
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
GPUS_PER_NODE=8
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
@@ -13,6 +13,34 @@ MODEL="Qwen/Qwen-7B-Chat-Int4" # Set the path if you do not want to load from hu
# See the section for finetuning in README for more information.
DATA="path_to_data"
function usage() {
echo '
Usage: bash finetune/finetune_qlora_ds.sh [-m MODEL_PATH] [-d DATA_PATH]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-m | --model )
shift
MODEL=$1
;;
-d | --data )
shift
DATA=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \

View File

@@ -7,6 +7,34 @@ MODEL="Qwen/Qwen-7B-Chat-Int4" # Set the path if you do not want to load from hu
# See the section for finetuning in README for more information.
DATA="path_to_data"
function usage() {
echo '
Usage: bash finetune/finetune_qlora_single_gpu.sh [-m MODEL_PATH] [-d DATA_PATH]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-m | --model )
shift
MODEL=$1
;;
-d | --data )
shift
DATA=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
export CUDA_VISIBLE_DEVICES=0
# Remember to use --fp16 instead of --bf16 due to autogptq