diff --git a/.dockerignore b/.dockerignore
new file mode 100644
index 0000000..a74dbfc
--- /dev/null
+++ b/.dockerignore
@@ -0,0 +1,14 @@
+__pycache__
+*.so
+build
+.coverage_*
+*.egg-info
+*~
+.vscode/
+.idea/
+.git/
+.github/
+.DS_Store
+
+/private/
+/README-docker.md
diff --git a/FAQ.md b/FAQ.md
index 42cab97..38e17ae 100644
--- a/FAQ.md
+++ b/FAQ.md
@@ -81,3 +81,10 @@ However, temporarily we do not support RLHF. We will provide the code in the nea
 In our training, we only use `<|endoftext|>` as the separator and padding token. You can set bos_id, eos_id, and pad_id to tokenizer.eod_id. Learn more about our tokenizer from our documents about the tokenizer.
+
+
+## Docker
+
+#### Download official docker image is very slow
+
+When downloading our official docker image, you may have a slow download speed due to some network issues. You can refer to [Alibaba Cloud Container Image Service](https://help.aliyun.com/zh/acr/user-guide/accelerate-the-pulls-of-docker-official-images) to accelerate the download of official images.
diff --git a/FAQ_zh.md b/FAQ_zh.md index 1161acc..4a88c04 100644 --- a/FAQ_zh.md +++ b/FAQ_zh.md @@ -76,3 +76,9 @@ Qwen当前支持流式推理。见位于`modeling_qwen.py`的`chat_stream`函数 在训练过程中,我们仅使用<|endoftext|>这一token作为sample/document之间的分隔符及padding位置占位符,你可以将bos_id, eos_id, pad_id均指向tokenizer.eod_id。请阅读我们关于tokenizer的文档,了解如何设置这些id。 + +## Docker + +#### 下载官方Docker镜像速度很慢 + +在下载官方镜像时,您可能由于某些网络原因导致下载速度变慢。可以参考[阿里云容器镜像服务](https://help.aliyun.com/zh/acr/user-guide/accelerate-the-pulls-of-docker-official-images)加速官方镜像的下载。 \ No newline at end of file diff --git a/LICENSE b/LICENSE index 5be3338..e9ade92 100644 --- a/LICENSE +++ b/LICENSE @@ -1,53 +1,201 @@ -Tongyi Qianwen LICENSE AGREEMENT + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ -Tongyi Qianwen Release Date: August 3, 2023 + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION -By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately. + 1. Definitions. -1. Definitions - a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement. - b. "We"(or "Us") shall mean Alibaba Cloud. - c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use. - d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You. - e. 
"Tongyi Qianwen" shall mean the large language models (including Qwen model and Qwen-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us. - f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement. - g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files. - h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, - and conversions to other media types. + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. -2. Grant of Rights -You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials. + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. -3. Redistribution -You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: - a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement; - b. You shall cause any modified files to carry prominent notices stating that You changed the files; - c. 
You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and - d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement. + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. -4. Restrictions -If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization. + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. -5. Rules of use - a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials. - b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof). 
+ "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. -6. Intellectual Property - a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications. - b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials. - c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought. + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. -7. Disclaimer of Warranty and Limitation of Liability + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). - a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto. - b. 
THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM. - c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED. - d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials. + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. -8. Survival and Termination. - a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. - b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement. 
+ "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." -9. Governing Law and Jurisdiction. - a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. - b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement. \ No newline at end of file + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of 
the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright 2023 Alibaba Cloud + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. \ No newline at end of file diff --git a/NOTICE b/NOTICE index 22c063e..8c3f123 100644 --- a/NOTICE +++ b/NOTICE @@ -49,4 +49,232 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. \ No newline at end of file +SOFTWARE. + +------------- LICENSE FOR stanford_alpaca code -------------- + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. 
+ + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of 
the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + +------------- LICENSE FOR PanQiWei AutoGPTQ code -------------- + +MIT License + +Copyright (c) 2023 潘其威(William) + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/README.md b/README.md index db5e588..ac5a138 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@

- 中文  |  English  |  日本語 |  Français
+ 中文  |  English  |  日本語 |  Français |  Español



@@ -9,23 +9,32 @@

- 🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |   🖥️ Demo
+ 🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |   🖥️ Demo
-WeChat (微信)   |    DingTalk (钉钉)    |   Discord
+WeChat (微信)   |   Discord   |   API



 |     | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
 |-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
+| 1.8B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
 | 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
 | 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
+| 72B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
-We opensource our **Qwen** series, now including **Qwen**, the base language models, namely **Qwen-7B** and **Qwen-14B**, as well as **Qwen-Chat**, the chat models, namely **Qwen-7B-Chat** and **Qwen-14B-Chat**. Links are on the above table. Click them and check the model cards. Also, we release the **[technical report](https://arxiv.org/abs/2309.16609)**. Please click the paper link and check it out!
+We opensource our **Qwen** series, now including **Qwen**, the base language models, namely **Qwen-1.8B**, **Qwen-7B**, **Qwen-14B**, and **Qwen-72B**, as well as **Qwen-Chat**, the chat models, namely **Qwen-1.8B-Chat**, **Qwen-7B-Chat**, **Qwen-14B-Chat**, and **Qwen-72B-Chat**. Links are on the above table. Click them and check the model cards. Also, we release the **[technical report](https://arxiv.org/abs/2309.16609)**. Please click the paper link and check it out!
 In brief, we have strong base language models, which have been stably pretrained for up to 3 trillion tokens of multilingual data with a wide coverage of domains, languages (with a focus on Chinese and English), etc. They are able to achieve competitive performance on benchmark datasets.
Additionally, we have chat models that are aligned with human preference based on SFT and RLHF (not released yet), which are able to chat, create content, extract information, summarize, translate, code, solve math problems, and so on, and are able to use tools, play as agents, or even play as code interpreters, etc. +| Model | Release Date | Max Length | System Prompt Enhancement | # of Pretrained Tokens | Minimum GPU Memory Usage of Finetuning (Q-Lora) | Minimum GPU Usage of Generating 2048 Tokens (Int4) | Tool Usage | +|:----------|:------------:|:----------:|:-------------------------:|:----------------------:|:-----------------------------------------------:|:--------------------------------------------------:|:----------:| +| Qwen-1.8B | 23.11.30 | 32K | √ | 2.2T | 5.8GB | 2.9GB | √ | +| Qwen-7B | 23.08.03 | 32K | × | 2.4T | 11.5GB | 8.2GB | √ | +| Qwen-14B | 23.09.25 | 8K | × | 3.0T | 18.7GB | 13.0GB | √ | +| Qwen-72B | 23.11.30 | 32K | √ | 3.0T | 61.4GB | 48.9GB | √ | + In this repo, you can figure out: * Quickstart with Qwen, and enjoy the simple inference. @@ -46,7 +55,7 @@ Would like to chat with us or date us coffee time? Welcome to our Discord or WeC

## News and Updates - +* 2023.11.30 🔥 We release **Qwen-72B** and **Qwen-72B-Chat**, which are trained on 3T tokens and support 32k context, along with **Qwen-1.8B**, and **Qwen-1.8B-Chat**, on ModelScope and Hugging Face. We have also strengthened the System Prompt capabilities of the Qwen-72B-Chat and Qwen-1.8B-Chat, see [example documentation](examples/system_prompt.md). Additionally, support the inference on **Ascend 910** and **Hygon DCU**. Check `ascend-support` and `dcu-support` for more details. * 2023.10.17 We release the Int8 quantized model **Qwen-7B-Chat-Int8** and **Qwen-14B-Chat-Int8**. * 2023.9.25 🔥 We release **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face, along with [qwen.cpp](https://github.com/QwenLM/qwen.cpp) and [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent). Codes and checkpoints of **Qwen-7B** and **Qwen-7B-Chat** are also updated. **PLEASE PULL THE LATEST VERSION!** - Compared to **Qwen-7B** (original), **Qwen-7B** uses more training tokens, increasing from 2.2T tokens to 2.4T tokens, while the context length extends from 2048 to 8192. The Chinese knowledge and coding ability of **Qwen-7B** have been further improved. @@ -56,28 +65,32 @@ Would like to chat with us or date us coffee time? Welcome to our Discord or WeC
## Performance - -Qwen-14B and Qwen-7B (this is the new version trained with more tokens and the context length is extended from 2048 to 8192) outperform the baseline models of similar model sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models' capabilities on natural language understanding, mathematic problem solving, coding, etc. However, even Qwen-14B still significantly fall behind GPT-3.5, let alone GPT-4. See the results below. +Qwen models outperform the baseline models of similar model sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models’ capabilities on natural language understanding, mathematical problem solving, coding, etc. Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.

- +

+ +
-| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | -|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:| -| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | -| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | -| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | -| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | -| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | -| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | -| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | -| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | -| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 | -| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 | -| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | -| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** | +| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | +|:------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:| +| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | +| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | +| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | +| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | +| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | +| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | +| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | +| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 | +| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 | +| XVERSE-65B | 
70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - | +| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 | +| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | +| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 | +| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** | For all compared models, we report the best scores between their official reported results and [OpenCompass](https://opencompass.org.cn/leaderboard-llm). @@ -96,7 +109,9 @@ For more experimental results (detailed model performance on more benchmark data Below, we provide simple examples to show how to use Qwen-Chat with 🤖 ModelScope and 🤗 Transformers. -Before running the code, make sure you have setup the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries. +You can use our pre-built docker images to skip most of the environment setup steps, see Section ["Using Pre-built Docker Images"](#-using-pre-built-docker-images) for more details. + +If not using docker, please make sure you have setup the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries. ```bash pip install -r requirements.txt @@ -109,6 +124,7 @@ git clone https://github.com/Dao-AILab/flash-attention cd flash-attention && pip install . # Below are optional. Installing them might be slow. # pip install csrc/layer_norm +# If the version of flash-attn is higher than 2.1.1, the following is not needed. # pip install csrc/rotary ``` @@ -162,7 +178,7 @@ print(response) # 《奋斗创业:一个年轻人的成功之路》 ``` -Running Qwen pretrained base model is also simple. +Running Qwen, the base language model, is also simple.

Running Qwen @@ -198,7 +214,9 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
+

If a network issue prevents you from downloading model checkpoints and code from Hugging Face, you can first fetch the checkpoint from ModelScope and then load it from the local directory, as outlined below: +

```python from modelscope import snapshot_download @@ -222,7 +240,7 @@ model = AutoModelForCausalLM.from_pretrained( ### 🤖 ModelScope -ModelScope is an opensource platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below: +ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below: ```python from modelscope import AutoModelForCausalLM, AutoTokenizer @@ -242,7 +260,7 @@ print(response) ``` ### Batch Inference -Qwen supports batch inference. With flash-attention enabled, using batch inference can bring a 40% speedup. The example code is shown below: +Qwen supports batch inference. With flash attention enabled, using batch inference can bring a 40% speedup. The example code is shown below: ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer @@ -325,456 +343,10 @@ However, it is likely that you suffer from extremely low inference efficiency. If you suffer from lack of GPU memory and you would like to run the model on more than 1 GPU, you can directly use the default loading method, which is now supported by Transformers. The previous method based on `utils.py` is deprecated. However, though this method is simple, the efficiency of the native pipeline parallelism is low. We advise you to use vLLM with FastChat and please read the section for deployment. -

- -## Quantization - -### GPTQ - -We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release the Int4 quantized models, which achieve nearly lossless model effects but improved performance on both memory costs and inference speed. - -Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages: - -```bash -pip install auto-gptq optimum -``` - -If you meet problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel. - -Then you can load the quantized model easily and run inference as same as usual: - -```python -# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4" -model = AutoModelForCausalLM.from_pretrained( - "Qwen/Qwen-7B-Chat-Int4", - device_map="auto", - trust_remote_code=True -).eval() -response, history = model.chat(tokenizer, "Hi", history=None) -``` - -We illustrate the model performance of both BF16, Int8 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below: - -| Quantization | MMLU | CEval (val) | GSM8K | Humaneval | -|----------------------|:----:|:-----------:|:-----:|:---------:| -| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 | -| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 | -| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 | -| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 | -| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 | -| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 | - -### Quantization of KV cache - -> NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality -> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_245.cu`) may be missing. 
Please manually download -> them from the Hugging Face Hub and place them into the same folder as the other module files. - -Attention KV cache can be quantized and compressed for storage, to get a higher sample throughput. The parameters of 'use_cache_quantization' and 'use_cache_kernel' are provided to control kv-cache-quantization behavior -When use_cache_quantization=True and use_cache_kernel=True, kv-cache-quantization will be enabled. -The specific use method is as follows: -```python -model = AutoModelForCausalLM.from_pretrained( - "Qwen/Qwen-7B-Chat", - device_map="auto", - trust_remote_code=True, - use_cache_quantization=True, - use_cache_kernel=True, - use_flash_attn=False -) -``` -Attention: -Currently, kv-cache-quantization and flash attn cannot be turned on at the same time. -If you enable kv cache quantization and use_flash_attn at the same time (use_flash_attn=True, use_cache_quantization=True, use_cache_kernel=True), use_flash_attn is disabled by default(use_flash_attn=false). - -We have verified that the use of the quantized int8-kvcache model does not suffer from significant performance degradation in downstream evaluation. In addition, we evaluate its performance focusing on the memory footprint. -The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. -We use BF16 models, and generate 1024 tokens (seq-length=1024) by default, and oom indicates out of memory. - -With kv-cache quantization turned on, we can run a larger batch size(bs). - -| USE KVCache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 | -|-------------|:------:|:------:|:------:|:------:|:------:|:------:| -| no | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom | -| yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB | - -With kv-cache quantization turned on, the model can save more memory when generate longer seq-length (sl, number of tokens generated) at infer. 
- -| USE KVCache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 | -|-------------|:------:|:-------:|:-------:|:-------:|:-------:| -| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB | -| yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB | - -The model which turn on the kv-cache quantization will convert the format of layer-past from float to int8, meanwhile the quantianted layer-past will also store quantiantion parameters of current value. -Specific steps are as follows: -1、Quantize key/value -``` - qv,scale,zero_point=quantize_cache_v(v) -``` -2、Store into layer_past - -Following is the format of quantized layer_past: -``` - layer_past=((q_key,key_scale,key_zero_point), - (q_value,value_scale,value_zero_point)) -``` -Bascial format of layer_past: -``` - layer_past=(key,value) -``` -If you want to use the attention KV which is quantized, -you can use the dequantization operation to convert the int8 key/value back to the float format as following: -``` - v=dequantize_cache_torch(qv,scale,zero_point) -``` -
- - -## Inference Performance - -This section provides the statistics of speed and memory of models in different precisions. The speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py). - -### Speed - -We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens with the models in the precision of BF16, Int8, and Int4 under the condition of using flash attention v1, v2, or not using it. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
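-| Model Size | Precision | FlashAttn | Sequence Length 2048 | Sequence Length 8192 |
-|:----------:|:---------:|:---------:|:--------------------:|:--------------------:|
-| 7B | BF16 | v2 | 40.93 | 36.14 |
-| 7B | BF16 | v1 | 40.75 | 35.34 |
-| 7B | BF16 | Disabled | 37.55 | 33.56 |
-| 7B | Int8 | v2 | 37.47 | 32.54 |
-| 7B | Int8 | v1 | 37.51 | 32.39 |
-| 7B | Int8 | Disabled | 37.84 | 32.65 |
-| 7B | Int4 | v2 | 50.09 | 38.61 |
-| 7B | Int4 | v1 | 45.98 | 36.47 |
-| 7B | Int4 | Disabled | 48.12 | 36.70 |
-| 14B | BF16 | v2 | 32.88 | 24.87 |
-| 14B | BF16 | v1 | 32.76 | 28.89 |
-| 14B | BF16 | Disabled | 29.32 | 22.91 |
-| 14B | Int8 | v2 | 29.28 | 24.22 |
-| 14B | Int8 | v1 | 28.31 | 23.87 |
-| 14B | Int8 | Disabled | 31.12 | 24.60 |
-| 14B | Int4 | v2 | 38.72 | 27.33 |
-| 14B | Int4 | v1 | 37.81 | 26.46 |
-| 14B | Int4 | Disabled | 37.65 | 26.00 |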
- - -In detail, the setting of profiling is encoding 2048 tokens and generating 8192 new tokens. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the encoded and generated tokens. - -Note: The generation speed of the Int4/Int8 models mentioned above is provided by the autogptq library. The current speed of the model loaded using ``AutoModelForCausalLM.from_pretrained`` will be approximately 20% slower. We have reported this issue to the HuggingFace team and will update it promptly if a solution is available. - -### GPU Memory Usage - -We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16, Int8 or Int4 quantization level, respectively. The results (GB) are shown below. - - - - - - - - - - - - - - - - - - - - - - - - - - -
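-| Model Size | Precision | Sequence Length 2048 | Sequence Length 8192 |
-|:----------:|:---------:|:--------------------:|:--------------------:|
-| 7B | BF16 | 16.99 | 22.53 |
-| 7B | Int8 | 11.20 | 16.62 |
-| 7B | Int4 | 8.21 | 13.63 |
-| 14B | BF16 | 30.15 | 38.94 |
-| 14B | Int8 | 18.81 | 27.54 |
-| 14B | Int4 | 13.01 | 21.79 |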
-
- - -## Finetuning - -### Usage -Now we provide the official training script, `finetune.py`, for users to finetune the pretrained model for downstream applications in a simple fashion. Additionally, we provide shell scripts to launch finetuning with no worries. This script supports the training with [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The shell scripts that we provide use DeepSpeed (Note: this may have conflicts with the latest version of pydantic) and Peft. You can install them by: -```bash -pip install peft deepspeed -``` - -To prepare your training data, you need to put all the samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list for conversation. Below is a simple example list with 1 sample: -```json -[ - { - "id": "identity_0", - "conversations": [ - { - "from": "user", - "value": "你好" - }, - { - "from": "assistant", - "value": "我是一个语言模型,我叫通义千问。" - } - ] - } -] -``` - -After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, `$DATA`. - -The finetuning scripts allow you to perform: -- Full-parameter finetuning -- LoRA -- Q-LoRA - -Full-parameter finetuning requires updating all parameters in the whole training process. To launch your training, run the following script: - -```bash -# Distributed training. We do not provide single-GPU training script as the insufficient GPU memory will break down the training. -sh finetune/finetune_ds.sh -``` - -Remember to specify the correct model name or path, the data path, as well as the output directory in the shell scripts. Another thing to notice is that we use DeepSpeed ZeRO 3 in this script. If you want to make changes, just remove the argument `--deepspeed` or make changes in the DeepSpeed configuration json file based on your requirements. 
Additionally, this script supports mixed-precision training, and thus you can use `--bf16 True` or `--fp16 True`. Remember to use DeepSpeed when you use fp16 due to mixed precision training. -Empirically we advise you to use bf16 to make your training consistent with our pretraining and alignment if your machine supports bf16, and thus we use it by default. - -Similarly, to run LoRA, use another script to run as shown below. Before you start, make sure that you have installed `peft`. Also, you need to specify your paths to your model, data, and output. We advise you to use absolute path for your pretrained model. This is because LoRA only saves the adapter and the absolute path in the adapter configuration json file is used for finding out the pretrained model to load. Also, this script support both bf16 and fp16. - -```bash -# Single GPU training -sh finetune/finetune_lora_single_gpu.sh -# Distributed training -sh finetune/finetune_lora_ds.sh -``` - -In comparison with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of adapter layers but keeps the original large language model layers frozen. This allows much fewer memory costs and thus fewer computation costs. - -Note that if you use LoRA to finetune the base language model, e.g., Qwen-7B, instead of chat models, e.g., Qwen-7B-Chat, the script automatically switches the embedding and output layer as trainable parameters. This is because the base language model has no knowledge of special tokens brought by ChatML format. Thus these layers should be updated for the model to understand and predict the tokens. Or in another word, if your training brings in special tokens in LoRA, you should set the layers to trainable parameters by setting `modules_to_save` inside the code. Also, if we have these parameters trainable, it is not available to use ZeRO 3, and this is why we use ZeRO 2 in the script by default. 
If you do not have new trainable parameters, you can switch to ZeRO 3 by changing the DeepSpeed configuration file. Additionally, we find that there is a significant gap between the memory footprint of LoRA with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to LoRA finetune the chat models. Check the profile below for more information. - -If you still suffer from insufficient memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses the quantized large language model and other techniques such as paged attention to allow even fewer memory costs. - -Note: to run single-GPU Q-LoRA training, you may need to install `mpi4py` through `pip` or `conda`. - -To run Q-LoRA, directly run the following script: - -```bash -# Single GPU training -sh finetune/finetune_qlora_single_gpu.sh -# Distributed training -sh finetune/finetune_qlora_ds.sh -``` - -For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. You **SHOULD NOT** use the bf16 models. Different from full-parameter finetuning and LoRA, only fp16 is supported for Q-LoRA. For single-GPU training, we have to use deepspeed for mixed-precision training due to our observation of errors caused by torch amp. Besides, for Q-LoRA, the troubles with the special tokens in LoRA still exist. However, as we only provide the Int4 models for chat models, which means the language model has learned the special tokens of ChatML format, you have no worry about the layers. Note that the layers of the Int4 model should not be trainable, and thus if you introduce special tokens in your training, Q-LoRA might not work. - -> NOTE: Please be aware that due to the internal mechanisms of Hugging Face, certain non-Python files (e.g., `*.cpp` and `*.cu`) -> may be missing from the saved checkpoint. You may need to manually copy them to the directory containing other files. 
- -Different from full-parameter finetuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. Suppose your training starts from Qwen-7B, you can load the finetuned model for inference as shown below: - -```python -from peft import AutoPeftModelForCausalLM - -model = AutoPeftModelForCausalLM.from_pretrained( - path_to_adapter, # path to the output directory - device_map="auto", - trust_remote_code=True -).eval() -``` - -If you want to merge the adapters and save the finetuned model as a standalone model (you can only do this with LoRA, and you CANNOT merge the parameters from Q-LoRA), you can run the following codes: - -```python -from peft import AutoPeftModelForCausalLM - -model = AutoPeftModelForCausalLM.from_pretrained( - path_to_adapter, # path to the output directory - device_map="auto", - trust_remote_code=True -).eval() - -merged_model = model.merge_and_unload() -# max_shard_size and safe serialization are not necessary. -# They respectively work for sharding checkpoint and save the model to safetensors -merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True) -``` - -The `new_model_directory` directory will contain the merged model weights and module files. Please note that `*.cu` and `*.cpp` files may be missing in the saved files. If you wish to use the KV cache functionality, please manually copy them. Besides, the tokenizer files are not saved in the new directory in this step. You can copy the tokenizer files or use the following code -```python -from transformers import AutoTokenizer - -tokenizer = AutoTokenizer.from_pretrained( - path_to_adapter, # path to the output directory - trust_remote_code=True -) - -tokenizer.save_pretrained(new_model_directory) -``` - - -Note: For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine. 
Besides, we advise you to specify your maximum sequence length with the argument `--model_max_length`, based on your consideration of data, memory footprint, and training speed. - - -### Profiling of Memory and Speed -We profile the GPU memory and training speed of both LoRA (LoRA (emb) refers to training the embedding and output layer, while LoRA has no trainable embedding and output layer) and Q-LoRA in the setup of single-GPU training. In this test, we experiment on a single A100-SXM4-80G GPU, and we use CUDA 11.8 and Pytorch 2.0. Flash attention 2 is applied. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 256, 512, 1024, 2048, 4096, and 8192. We also report the statistics of full-parameter finetuning with Qwen-7B on 2 A100 GPUs. We only report the statistics of 256, 512, and 1024 tokens due to the limitation of GPU memory. The statistics are listed below: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-| Model Size | Method | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
-|:----------:|:------:|:---:|:---:|:----:|:----:|:----:|:----:|
-| 7B | LoRA | 20.1G / 1.2s/it | 20.4G / 1.5s/it | 21.5G / 2.8s/it | 23.8G / 5.2s/it | 29.7G / 10.1s/it | 36.6G / 21.3s/it |
-| 7B | LoRA (emb) | 33.7G / 1.4s/it | 34.1G / 1.6s/it | 35.2G / 2.9s/it | 35.1G / 5.3s/it | 39.2G / 10.3s/it | 48.5G / 21.7s/it |
-| 7B | Q-LoRA | 11.5G / 3.0s/it | 11.5G / 3.0s/it | 12.3G / 3.5s/it | 13.9G / 7.0s/it | 16.9G / 11.6s/it | 23.5G / 22.3s/it |
-| 7B | Full-parameter | 139.2G / 4.0s/it | 148.0G / 4.0s/it | 162.0G / 4.5s/it | - | - | - |
-| 14B | LoRA | 34.6G / 1.6s/it | 35.1G / 2.4s/it | 35.3G / 4.4s/it | 37.4G / 8.4s/it | 42.5G / 17.0s/it | 55.2G / 36.0s/it |
-| 14B | LoRA (emb) | 51.2 / 1.7s/it | 51.1G / 2.6s/it | 51.5G / 4.6s/it | 54.1G / 8.6s/it | 56.8G / 17.2s/it | 67.7G / 36.3s/it |
-| 14B | Q-LoRA | 18.7G / 5.3s/it | 18.4G / 6.3s/it | 18.9G / 8.2s/it | 19.9G / 11.8s/it | 23.0G / 20.1s/it | 27.9G / 38.3s/it |
-
- -## Deployment - -### vLLM -For deployment and fast inference, we suggest using vLLM with FastChat. Install the packages first: -```bash -pip install vllm -pip install "fschat[model_worker,webui]" -``` -Or you can install them from source by `git clone` and `pip install -e .`. We advise you to read their documents if you meet problems in installation. - -To run Qwen with vLLM and FastChat, you need to first launch a controller by: -```bash -python -m fastchat.serve.controller -``` - -Then you can launch the model worker, which means loading your model for inference. For single GPU inference, you can directly run: -```bash -python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code -``` -However, if you hope to run the model on multiple GPUs for faster inference or larger memory, you can use tensor parallelism supported by vLLM. Suppose you run the model on 4 GPUs, the command is shown below: -```bash -python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 -``` - -After launching your model worker, you can launch a web demo or an OpenAI API as you like. For web demo, run the following command: -```bash -python -m fastchat.serve.gradio_web_server -``` -For OpenAI API, check the documentation of our OpenAI API for installation first. Then run the command: -```bash -python -m fastchat.serve.openai_api_server --host localhost --port 8000 -``` -
- -## Demo - -### Web UI - -We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages: - -``` -pip install -r requirements_web_demo.txt -``` - -Then run the command below and click on the generated link: - -```bash -python web_demo.py -``` - -

-
- -
-

- -### CLI Demo - -We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns model outputs in the streaming mode. Run the command below: - -```bash -python cli_demo.py -``` - -

-
- -
-

-
- -## API - -The most simple way to use Qwen through APIs is DashScope API service through Alibaba Cloud. We give an introduction to the usage. Additionally, we provide a script for you to deploy an OpenAI-style API on your own servers. ### DashScope +The most simple way to use Qwen through APIs is DashScope API service through Alibaba Cloud. We give an introduction to the usage. Additionally, we provide a script for you to deploy an OpenAI-style API on your own servers. + DashScope is the large language model API service provided by Alibaba Cloud, which now supports Qwen. Note that the models behind DashScope are in-house versions temporarily without details provided. The services include `qwen-turbo` and `qwen-plus`, where the former one runs faster and the latter achieves better performance. For more information, visit the documentation [here](https://dashscope.aliyun.com). Please head to the official website [link](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) to create a DashScope account and obtain the API key (AK). We recommend setting the AK with an environment variable: @@ -825,13 +397,500 @@ if __name__ == '__main__': )) ``` For more usages, please visit the official website for more details. +

-### OpenAI API +## Quantization + +### GPTQ + +We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release the Int4 and Int8 quantized models, which achieve nearly lossless model effects but improved performance on both memory costs and inference speed. + +Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages: + +```bash +pip install auto-gptq optimum +``` + +If you meet problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel. + +> Note: The pre-compiled `auto-gptq` packages strongly depend on the version of `torch` and its CUDA version. Moreover, due to recent update, +> you may also encounter unsupported version errors from `transformers`, `optimum`, or `peft`. +> We recommend using the latest versions meeting the following requirements: +> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1 +> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0 + +Then you can load the quantized model easily and run inference as same as usual: + +```python +# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4" +model = AutoModelForCausalLM.from_pretrained( + "Qwen/Qwen-7B-Chat-Int4", + device_map="auto", + trust_remote_code=True +).eval() +response, history = model.chat(tokenizer, "Hi", history=None) +``` + +We illustrate the model performance of both BF16, Int8 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. 
Results are shown below: + +| Quantization | MMLU | CEval (val) | GSM8K | Humaneval | +|----------------------|:----:|:-----------:|:-----:|:---------:| +| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 | +| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 | +| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 | +| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 | +| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 | +| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 | +| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 | +| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 | +| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 | +| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 | +| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 | +| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 | + +### Quantization of KV cache + +> NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality +> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_245.cu`) may be missing. Please manually download +> them from the Hugging Face Hub and place them into the same folder as the other module files. + +The attention KV cache can be quantized and compressed for storage, to get a higher sample throughput. The arguments `use_cache_quantization` and `use_cache_kernel` in `config.json` are provided to enable KV cache quantization. The specific use method is as follows: +```python +model = AutoModelForCausalLM.from_pretrained( + "Qwen/Qwen-7B-Chat", + device_map="auto", + trust_remote_code=True, + use_cache_quantization=True, + use_cache_kernel=True, + use_flash_attn=False +) +``` +Attention: Currently, KV cache quantization and flash attention cannot be used at the same time. +If you enable KV cache quantization and flash attention at the same time (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), `use_flash_attn` is disabled by default (`use_flash_attn=false`). 
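The precedence rule in the note above can be sketched as a small helper. This is only an illustration, not the logic in the model code: the function name `resolve_attention_flags` is hypothetical, and treating KV cache quantization as active only when both `use_cache_quantization` and `use_cache_kernel` are set is an assumption based on the example configuration shown earlier.

```python
def resolve_attention_flags(use_flash_attn, use_cache_quantization, use_cache_kernel):
    """Return the effective flags after applying the rule described above.

    Hypothetical helper (not part of Qwen or Transformers): when KV cache
    quantization is active, flash attention is switched off.
    """
    # Assumption: KV cache quantization needs both flags to be True.
    kv_cache_quantized = use_cache_quantization and use_cache_kernel
    if kv_cache_quantized and use_flash_attn:
        # The two features cannot be combined, so flash attention loses.
        use_flash_attn = False
    return {
        "use_flash_attn": use_flash_attn,
        "use_cache_quantization": use_cache_quantization,
        "use_cache_kernel": use_cache_kernel,
    }


if __name__ == "__main__":
    # Requesting both features at once yields use_flash_attn=False.
    print(resolve_attention_flags(True, True, True))
```

Passing `use_flash_attn=False` explicitly, as in the loading example above, makes the intent clear instead of relying on the automatic fallback.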
+ +We have verified that the use of the quantized Int8-KV-Cache model does not suffer from significant performance degradation in downstream evaluation. In the following, we focus on profiling its memory footprint in different conditions. +The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. +We use BF16 models to generate 1024 tokens by default, and "OOM" indicates out-of-memory error. + +With KV cache quantization, the model can infer with a larger batch size (bs). + +| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 | +|--------------|:------:|:------:|:------:|:------:|:------:|:------:| +| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | OOM | OOM | +| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB | + +With KV cache quantization the model can save more memory when generating longer sequence (`sl`, sequence length, referring to the number of tokens generated) at the stage of inference. + +| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 | +|--------------|:------:|:-------:|:-------:|:-------:|:-------:| +| No | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB | +| Yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB | + +The model with KV cache quantization will convert the format of `layer_past` from float to int8, and meanwhile the quantized `layer-past` will also store the quantization parameters. + +Specific steps are as follows: + +1. Quantize key/value +``` + qv,scale,zero_point=quantize_cache_v(v) +``` +2. Store into layer_past + +The following is the format of quantized `layer_past`: +``` + layer_past=((q_key,key_scale,key_zero_point), + (q_value,value_scale,value_zero_point)) +``` + +The original format of `layer_past` is shown below: +``` + layer_past=(key,value) +``` + +If you want to use the attention KV which is quantized, you can use the dequantization operation to convert the Int8 key/value back to the float format as follows: +``` + v=dequantize_cache_torch(qv,scale,zero_point) +``` +
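The quantize, store, and dequantize steps above can be illustrated with a small self-contained sketch. It assumes a simple asymmetric 8-bit scheme over one vector and uses plain Python lists; the names `quantize_8bit`, `dequantize_8bit`, and `pack_layer_past` are hypothetical, and this is not the implementation behind `quantize_cache_v` or `dequantize_cache_torch`, whose exact scheme is not described here.

```python
from typing import List, Tuple

Quantized = Tuple[List[int], float, float]


def quantize_8bit(values: List[float]) -> Quantized:
    """Map floats to integers in [0, 255] and return (q, scale, zero_point)."""
    lo, hi = min(values), max(values)
    # A constant vector would give scale 0, so fall back to 1.0.
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = lo
    q = [round((v - zero_point) / scale) for v in values]
    return q, scale, zero_point


def dequantize_8bit(q: List[int], scale: float, zero_point: float) -> List[float]:
    """Recover approximate floats from the quantized integers."""
    return [x * scale + zero_point for x in q]


def pack_layer_past(key: List[float], value: List[float]) -> Tuple[Quantized, Quantized]:
    """Build the quantized layout ((q_key, key_scale, key_zero_point), (q_value, ...))."""
    return quantize_8bit(key), quantize_8bit(value)


if __name__ == "__main__":
    (q_key, key_scale, key_zp), _ = pack_layer_past([-1.0, 0.0, 0.5, 2.0], [0.1, 0.2, 0.3, 0.4])
    print(q_key, dequantize_8bit(q_key, key_scale, key_zp))
```

In this sketch the round-trip error per element is bounded by half the scale, which is why storing the scale and zero point alongside the integers is enough to reuse the cache.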
+ + +## Inference Performance + +This section reports the inference speed and memory usage of the models at different precisions. The speed and memory profiling was conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py). + +We measured the average inference speed (tokens/s) and GPU memory usage when generating 2048 tokens with the models in BF16, Int8, and Int4. +
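+| Model Size | Quantization | Speed (Tokens/s) | GPU Memory Usage |
+|:-----------|:------------:|:----------------:|:----------------:|
+| 1.8B | BF16 | 54.09 | 4.23GB |
+| 1.8B | Int8 | 55.56 | 3.48GB |
+| 1.8B | Int4 | 71.07 | 2.91GB |
+| 7B | BF16 | 40.93 | 16.99GB |
+| 7B | Int8 | 37.47 | 11.20GB |
+| 7B | Int4 | 50.09 | 8.21GB |
+| 14B | BF16 | 32.22 | 30.15GB |
+| 14B | Int8 | 29.28 | 18.81GB |
+| 14B | Int4 | 38.72 | 13.01GB |
+| 72B | BF16 | 8.48 | 144.69GB (2xA100) |
+| 72B | Int8 | 9.05 | 81.27GB (2xA100) |
+| 72B | Int4 | 11.32 | 48.86GB |
+| 72B + vLLM | BF16 | 17.60 | 2xA100 |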
+ +The profiling runs on a single A100-SXM4-80G GPU (except where 2xA100 is noted) with PyTorch 2.0.1, CUDA 11.8, and Flash-Attention 2. (72B + vLLM uses PyTorch 2.1.0 and CUDA 11.8.) The inference speed is averaged over the encoded and generated tokens. + +Note: The generation speed of the Int4/Int8 models reported above was measured with models loaded through the autogptq library. Models loaded using ``AutoModelForCausalLM.from_pretrained`` are currently about 20% slower. We have reported this issue to the HuggingFace team and will update it promptly if a solution is available. + +We also measure the inference speed and GPU memory usage under different context lengths, generation lengths, and Flash-Attention versions. You can find the results in the corresponding model cards on Hugging Face or ModelScope. + +## Finetuning + +### Usage +We provide the official training script, `finetune.py`, so that users can finetune the pretrained model for downstream applications in a simple fashion. Additionally, we provide shell scripts that launch finetuning for you. The training script supports [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The shell scripts that we provide use DeepSpeed (Note: this may conflict with the latest version of pydantic, so make sure you use `pydantic<2.0`) and Peft. You can install them by: +```bash +pip install peft deepspeed +``` + +To prepare your training data, you need to put all the samples into a list and save it to a JSON file. Each sample is a dictionary consisting of an id and a list of conversation turns. Below is a simple example list with 1 sample: +```json +[ + { + "id": "identity_0", + "conversations": [ + { + "from": "user", + "value": "你好" + }, + { + "from": "assistant", + "value": "我是一个语言模型,我叫通义千问。" + } + ] + } +] +``` + +After data preparation, you can use the provided shell scripts to run finetuning.
Remember to specify the path to the data file, `$DATA`. + +The finetuning scripts allow you to perform: +- Full-parameter finetuning +- LoRA +- Q-LoRA + +Full-parameter finetuning requires updating all parameters in the whole training process. To launch your training, run the following script: + +```bash +# Distributed training. We do not provide single-GPU training script as the insufficient GPU memory will break down the training. +sh finetune/finetune_ds.sh +``` + +Remember to specify the correct model name or path, the data path, as well as the output directory in the shell scripts. Another thing to notice is that we use DeepSpeed ZeRO 3 in this script. If you want to make changes, just remove the argument `--deepspeed` or make changes in the DeepSpeed configuration json file based on your requirements. Additionally, this script supports mixed-precision training, and thus you can use `--bf16 True` or `--fp16 True`. Remember to use DeepSpeed when you use fp16 due to mixed precision training. Empirically we advise you to use bf16 to make your training consistent with our pretraining and alignment if your machine supports bf16, and thus we use it by default. + +Similarly, to run LoRA, use another script to run as shown below. Before you start, make sure that you have installed `peft`. Also, you need to specify your paths to your model, data, and output. We advise you to use absolute path for your pretrained model. This is because LoRA only saves the adapter and the absolute path in the adapter configuration json file is used for finding out the pretrained model to load. Also, this script support both bf16 and fp16. + +```bash +# Single GPU training +sh finetune/finetune_lora_single_gpu.sh +# Distributed training +sh finetune/finetune_lora_ds.sh +``` + +In comparison with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of adapter layers but keeps the original large language model layers frozen. 
This requires much less memory and therefore less computation. + +Note that if you use LoRA to finetune the base language model, e.g., Qwen-7B, instead of chat models, e.g., Qwen-7B-Chat, the script automatically makes the embedding and output layers trainable. This is because the base language model has no knowledge of the special tokens introduced by the ChatML format, so these layers must be updated for the model to understand and predict the tokens. In other words, if your LoRA training introduces special tokens, you should make these layers trainable by setting `modules_to_save` inside the code. Also, when these parameters are trainable, ZeRO 3 cannot be used, and this is why we use ZeRO 2 in the script by default. If you do not have new trainable parameters, you can switch to ZeRO 3 by changing the DeepSpeed configuration file. Additionally, we find that there is a significant gap between the memory footprint of LoRA with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to apply LoRA finetuning to the chat models. Check the profile below for more information. + +If you still suffer from insufficient memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses the quantized large language model and other techniques such as paged optimizers to reduce memory costs even further. + +Note: to run single-GPU Q-LoRA training, you may need to install `mpi4py` through `pip` or `conda`. + +To run Q-LoRA, directly run the following script: + +```bash +# Single GPU training +sh finetune/finetune_qlora_single_gpu.sh +# Distributed training +sh finetune/finetune_qlora_ds.sh +``` + +For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. You **SHOULD NOT** use the bf16 models. Unlike full-parameter finetuning and LoRA, Q-LoRA supports only fp16.
For single-GPU training, we have to use DeepSpeed for mixed-precision training due to our observation of errors caused by torch amp. Besides, for Q-LoRA, the troubles with the special tokens in LoRA still exist. However, as we only provide the Int4 models for chat models, which means the language model has learned the special tokens of ChatML format, you have no worry about the layers. Note that the layers of the Int4 model should not be trainable, and thus if you introduce special tokens in your training, Q-LoRA might not work. + +> NOTE: Please be aware that due to the internal mechanisms of Hugging Face, certain non-Python files (e.g., `*.cpp` and `*.cu`) +> may be missing from the saved checkpoint. You may need to manually copy them to the directory containing other files. + +Different from full-parameter finetuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. Suppose your training starts from Qwen-7B, you can load the finetuned model for inference as shown below: + +```python +from peft import AutoPeftModelForCausalLM + +model = AutoPeftModelForCausalLM.from_pretrained( + path_to_adapter, # path to the output directory + device_map="auto", + trust_remote_code=True +).eval() +``` + +If you want to merge the adapters and save the finetuned model as a standalone model (you can only do this with LoRA, and you CANNOT merge the parameters from Q-LoRA), you can run the following codes: + +```python +from peft import AutoPeftModelForCausalLM + +model = AutoPeftModelForCausalLM.from_pretrained( + path_to_adapter, # path to the output directory + device_map="auto", + trust_remote_code=True +).eval() + +merged_model = model.merge_and_unload() +# max_shard_size and safe serialization are not necessary. 
+# They respectively work for sharding checkpoint and save the model to safetensors +merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True) +``` + +The `new_model_directory` directory will contain the merged model weights and module files. Please note that `*.cu` and `*.cpp` files may be missing in the saved files. If you wish to use the KV cache functionality, please manually copy them. Besides, the tokenizer files are not saved in the new directory in this step. You can copy the tokenizer files or use the following code +```python +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained( + path_to_adapter, # path to the output directory + trust_remote_code=True +) + +tokenizer.save_pretrained(new_model_directory) +``` + + +Note: For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine. Besides, we advise you to specify your maximum sequence length with the argument `--model_max_length`, based on your consideration of data, memory footprint, and training speed. + + +### Profiling of Memory and Speed +We profile the GPU memory and training speed of both LoRA (LoRA (emb) refers to training the embedding and output layer, while LoRA has no trainable embedding and output layer) and Q-LoRA in the setup of single-GPU training. In this test, we experiment on a single A100-SXM4-80G GPU, and we use CUDA 11.8 and Pytorch 2.0. Flash attention 2 is applied. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 256, 512, 1024, 2048, 4096, and 8192. We also report the statistics of full-parameter finetuning with Qwen-7B on 2 A100 GPUs. We only report the statistics of 256, 512, and 1024 tokens due to the limitation of GPU memory. 
+ +For Qwen-72B, we experiment in two ways: 1) LoRA finetuning + DeepSpeed ZeRO 3 on 4 A100-SXM4-80G GPUs and 2) Q-LoRA (Int4) finetuning on a single A100-SXM4-80G GPU. Note that OOM occurs on 4 A100-SXM4-80G GPUs both with LoRA (emb) finetuning and with LoRA finetuning without DeepSpeed ZeRO 3 (you can pass `--deepspeed finetune/ds_config_zero3.json` to [`finetune/finetune_lora_ds.sh`](finetune/finetune_lora_ds.sh) to enable DeepSpeed ZeRO 3). + +The statistics are listed below: +
+Column headers 256 to 8192 are sequence lengths; each cell is GPU memory / time per iteration.
+
+| Model Size | Method | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
+|:-----------|:-------|:---:|:---:|:----:|:----:|:----:|:----:|
+| 1.8B | LoRA | 6.7G / 1.0s/it | 7.4G / 1.0s/it | 8.4G / 1.1s/it | 11.0G / 1.7s/it | 16.2G / 3.3s/it | 21.8G / 6.8s/it |
+| 1.8B | LoRA (emb) | 13.7G / 1.0s/it | 14.0G / 1.0s/it | 14.0G / 1.1s/it | 15.1G / 1.8s/it | 19.7G / 3.4s/it | 27.7G / 7.0s/it |
+| 1.8B | Q-LoRA | 5.8G / 1.4s/it | 6.0G / 1.4s/it | 6.6G / 1.4s/it | 7.8G / 2.0s/it | 10.2G / 3.4s/it | 15.8G / 6.5s/it |
+| 1.8B | Full-parameter | 43.5G / 2.1s/it | 43.5G / 2.2s/it | 43.5G / 2.2s/it | 43.5G / 2.3s/it | 47.1G / 2.8s/it | 48.3G / 5.6s/it |
+| 7B | LoRA | 20.1G / 1.2s/it | 20.4G / 1.5s/it | 21.5G / 2.8s/it | 23.8G / 5.2s/it | 29.7G / 10.1s/it | 36.6G / 21.3s/it |
+| 7B | LoRA (emb) | 33.7G / 1.4s/it | 34.1G / 1.6s/it | 35.2G / 2.9s/it | 35.1G / 5.3s/it | 39.2G / 10.3s/it | 48.5G / 21.7s/it |
+| 7B | Q-LoRA | 11.5G / 3.0s/it | 11.5G / 3.0s/it | 12.3G / 3.5s/it | 13.9G / 7.0s/it | 16.9G / 11.6s/it | 23.5G / 22.3s/it |
+| 7B | Full-parameter | 139.2G / 4.0s/it | 148.0G / 4.0s/it | 162.0G / 4.5s/it | - | - | - |
+| 14B | LoRA | 34.6G / 1.6s/it | 35.1G / 2.4s/it | 35.3G / 4.4s/it | 37.4G / 8.4s/it | 42.5G / 17.0s/it | 55.2G / 36.0s/it |
+| 14B | LoRA (emb) | 51.2G / 1.7s/it | 51.1G / 2.6s/it | 51.5G / 4.6s/it | 54.1G / 8.6s/it | 56.8G / 17.2s/it | 67.7G / 36.3s/it |
+| 14B | Q-LoRA | 18.7G / 5.3s/it | 18.4G / 6.3s/it | 18.9G / 8.2s/it | 19.9G / 11.8s/it | 23.0G / 20.1s/it | 27.9G / 38.3s/it |
+| 72B | LoRA + DeepSpeed ZeRO 3 | 215.4G / 17.6s/it | 217.7G / 20.5s/it | 222.6G / 29.4s/it | 228.8G / 45.7s/it | 249.0G / 83.4s/it | 289.2G / 161.5s/it |
+| 72B | Q-LoRA | 61.4G / 27.4s/it | 61.4G / 31.5s/it | 62.9G / 41.4s/it | 64.1G / 59.5s/it | 68.0G / 97.7s/it | 75.6G / 179.8s/it |
+
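As a companion to the training data format described under Usage, the following sketch writes samples in that shape to a JSON file. It is a minimal illustration, not part of the provided scripts: the helper names `build_sample` and `write_dataset` and the `identity_{i}` id pattern (borrowed from the sample shown in Usage) are assumptions, so adapt them to your own data.

```python
import json
from typing import Dict, List, Tuple


def build_sample(index: int, turns: List[Tuple[str, str]]) -> Dict:
    """Turn (user, assistant) pairs into one sample with an id and a conversation list."""
    conversations = []
    for user_text, assistant_text in turns:
        conversations.append({"from": "user", "value": user_text})
        conversations.append({"from": "assistant", "value": assistant_text})
    return {"id": f"identity_{index}", "conversations": conversations}


def write_dataset(path: str, dialogues: List[List[Tuple[str, str]]]) -> int:
    """Write all dialogues as a single JSON list and return the number of samples."""
    samples = [build_sample(i, turns) for i, turns in enumerate(dialogues)]
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable in the file.
        json.dump(samples, f, ensure_ascii=False, indent=2)
    return len(samples)


if __name__ == "__main__":
    count = write_dataset("example.json", [[("你好", "我是一个语言模型,我叫通义千问。")]])
    print(f"wrote {count} sample(s)")
```

The resulting file is what the `$DATA` path in the shell scripts would point to.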
+ +## Deployment + +### vLLM + +For deployment and fast inference, we suggest using vLLM. + +If you use CUDA 12.1 and PyTorch 2.1, you can directly use the following command to install vLLM. + +```bash +pip install vllm +``` + +Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html). + +#### vLLM + Transformer-like Wrapper + +You can download the [wrapper code](examples/vllm_wrapper.py) and execute the following commands for multiple rounds of dialogue interaction. (Note: It currently only supports the ``model.chat()`` method.) + +```python +from vllm_wrapper import vLLMWrapper + +model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1) + +response, history = model.chat(query="你好", history=None) +print(response) +response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history) +print(response) +response, history = model.chat(query="给这个故事起一个标题", history=history) +print(response) +``` + +#### vLLM + Web Demo / OpenAI-like API + +You can use FastChat to launch a web demo or an OpenAI API server. First, install FastChat: + +```bash +pip install "fschat[model_worker,webui]" +``` + +To run Qwen with vLLM and FastChat, you first need to launch a controller: +```bash +python -m fastchat.serve.controller +``` + +Then you can launch the model worker, which loads your model for inference. For single-GPU inference, you can directly run: +```bash +python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16 +``` +However, if you want to run the model on multiple GPUs for faster inference or larger memory, you can use the tensor parallelism supported by vLLM.
Suppose you run the model on 4 GPUs, the command is shown below: +```bash +python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16 +``` + +After launching your model worker, you can launch a: + +* Web UI Demo +```bash +python -m fastchat.serve.gradio_web_server +``` + +* OpenAI API +```bash +python -m fastchat.serve.openai_api_server --host localhost --port 8000 +``` + +However, if you find it difficult to use vLLM and FastChat, you can try our provided simplest methods to deploy a web demo, CLI demo, and API. + + +### Web UI + +We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages: + +``` +pip install -r requirements_web_demo.txt +``` + +Then run the command below and click on the generated link: + +```bash +python web_demo.py +``` + +

+
+ +
+

+ +### CLI Demo + +We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns model outputs in the streaming mode. Run the command below: + +```bash +python cli_demo.py +``` + +

+
+ +
+

+
+ +### API We provide methods to deploy local API based on OpenAI API (thanks to @hanpenggit). Before you start, install the required packages: ```bash -pip install fastapi uvicorn openai "pydantic>=2.3.0" sse_starlette +pip install fastapi uvicorn openai pydantic sse_starlette ``` Then run the command to deploy your API: @@ -882,7 +941,120 @@ print(response.choices[0].message.content) **Function calling** is also supported (but only when `stream=False` for the moment). See the [example usage](examples/function_call_examples.py) here.
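For readers who prefer not to depend on the `openai` client package, the request to an OpenAI-compatible server can be sketched with the standard library alone. The base URL `http://localhost:8000/v1` and the model name `"Qwen"` are assumptions for illustration (use whatever host, port, and model name your deployment exposes), and `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(base_url: str, model: str,
                       messages: List[Dict[str, str]],
                       stream: bool = False) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, messages: List[Dict[str, str]]) -> str:
    """Send a non-streaming request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, messages)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running server; the address and model name below are assumed.
    print(chat("http://localhost:8000/v1", "Qwen",
               [{"role": "user", "content": "你好"}]))
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing the script at a live server.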

+## 🐳 Docker +To simplify the deployment process, we provide docker images with pre-built environments: [qwenllm/qwen](https://hub.docker.com/r/qwenllm/qwen). You only need to install the driver and download model files to launch demos, deploy OpenAI API, and finetune the model. + +### Preparation + +1. Install the correct version of Nvidia driver depending on the image to use: + - `qwenllm/qwen:cu117` (**recommend**): `>= 515.48.07` + - `qwenllm/qwen:cu114` (w/o flash-attention): `>= 470.82.01` + - `qwenllm/qwen:latest`: same as `qwenllm/qwen:cu117` + +2. Install and configure [docker](https://docs.docker.com/engine/install/) and [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html): + +```bash +# configure docker +sudo systemctl start docker +# test if docker is correctly installed +sudo docker run hello-world + +# configure nvidia-container-toolkit +sudo nvidia-ctk runtime configure --runtime=docker +sudo systemctl restart docker +# test if nvidia-container-toolkit is correctly installed +sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi +``` + +3. Download model checkpoints and codes to your environment (see [here](#DownloadModel)). + +### Deployment + +Here we use Qwen-7B-Chat as an example. 
Before launching a web demo or API, you can setup the configuration as shown below: + +```bash +IMAGE_NAME=qwenllm/qwen:cu117 +PORT=8901 +CHECKPOINT_PATH=/path/to/Qwen-7B-Chat # Path to downloaded model checkpoints and codes +``` +The following scripts can help you build: + +* OpenAI API +```bash +bash docker/docker_openai_api.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* Web UI +```bash +bash docker/docker_web_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* CLI Demo +```bash +bash docker/docker_cli_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} +``` + +The commands above will automatically download the required image and launch a Web UI demo in background (the service will auto-restart). You can open `http://localhost:${PORT}` on the host to use the demo. + +The demo is successfully launched if you see the following output: + +```text +Successfully started web demo. Open '...' to try! +Run `docker logs ...` to check demo status. +Run `docker rm -f ...` to stop and remove the demo. +``` + +If you want to check the status of the demo, you can use `docker logs qwen` to display outputs. + +You can use `docker rm -f qwen` to stop the service and remove the container. 
+ + +### Finetuning + +The method of finetuning using the pre-built Docker image is basically the same as [the above chapter](#Finetuning) (we have already installed dependencies in the image): + +The following is an example of single-GPU LoRA: +```bash +IMAGE_NAME=qwenllm/qwen:cu117 +CHECKPOINT_PATH=/path/to/Qwen-7B # Path to downloaded model checkpoints and codes +#CHECKPOINT_PATH=/path/to/Qwen-7B-Chat-Int4 # Path to downloaded model checkpoints and codes (Q-LoRA) +DATA_PATH=/path/to/data/root # Prepare finetune data at ${DATA_PATH}/example.json +OUTPUT_PATH=/path/to/output/checkpoint # Path to finetune outputs + +# Use all host devices by default +DEVICE=all +# If you need to specify GPUs for training, set device as follow (NOTE: internal quotation marks cannot be omitted) +#DEVICE='"device=0,1,2,3"' + +mkdir -p ${OUTPUT_PATH} + +# Single-GPU LoRA finetuning +docker run --gpus ${DEVICE} --rm --name qwen \ + --mount type=bind,source=${CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-7B \ + --mount type=bind,source=${DATA_PATH},target=/data/shared/Qwen/data \ + --mount type=bind,source=${OUTPUT_PATH},target=/data/shared/Qwen/output_qwen \ + --shm-size=2gb \ + -it ${IMAGE_NAME} \ + bash finetune/finetune_lora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B/ -d /data/shared/Qwen/data/example.json +``` + +To make a change to single-GPU Q-LoRA for example, you just need to modify the bash command inside `docker run`: +```bash +bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B-Chat-Int4/ -d /data/shared/Qwen/data/example.json +``` +
+ +## 🔥 System Prompt +Qwen-1.8B-Chat and Qwen-72B-Chat have been fully trained on diverse system prompts with multiple rounds of complex interactions, so that they can follow a variety of system prompts and realize model customization in context, further improving the scalability of Qwen-Chat. + +With System Prompt, Qwen-Chat can realize **role playing**, **language style transfer**, **task setting**, and **behavior setting**. + +![](assets/system_prompt_language_style.png) + +![](assets/system_prompt_role_play_en.png) + +For more information, please refer to the [example documentation](examples/system_prompt.md). ## Tool Usage @@ -1109,7 +1281,11 @@ In addition, we also provide experimental results demonstrating that our model i ## Long-Context Understanding -To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length of Qwen-7B/14B from 2k to over 8K tokens, and Qwen-7B from 8k to 32k tokens. We conduct language modeling experiments on the arXiv dataset with the PPL evaluation and find that Qwen can reach outstanding performance in the scenario of long context. Results are demonstrated below: +To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length of Qwen-14B from 2K to over 8K tokens, and Qwen-1.8B/7B from 8K to 32K tokens. + +For Qwen-72B, we adapt RoPE to longer contexts with a larger rotary base. Qwen-72B supports a maximum context length of 32K tokens. + +We conduct language modeling experiments on the arXiv dataset with the PPL evaluation and find that Qwen can reach outstanding performance in the scenario of long context.
Results are demonstrated below: @@ -1131,6 +1307,12 @@ To extend the context length and break the bottleneck of training sequence lengt + + + + + + @@ -1143,8 +1325,24 @@ To extend the context length and break the bottleneck of training sequence lengt + + + +
+ dynamic_ntk + logn + window_attn4.233.783.583.494.32-
Qwen-1.8B5.004.484.133.8917.42433.85
+ dynamic_ntk + logn + window_attn5.004.484.143.933.823.83
Qwen-7B4.233.813.523.317.27181.49
+ dynamic_ntk + logn + window_attn-3.463.293.183.42-
Qwen-72B---2.832.732.72
+Furthermore, to verify the ability of Qwen-72B-Chat on long text understanding, we tested it on [L-Eval](https://arxiv.org/abs/2307.11088) (closed-ended tasks). The results are as follows: + +| Model | Input Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFiction | +|:------------------|:------------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:| +| ChatGPT-3.5-16k | 16K | 60.73 | **63.51** | **84.00** | 61.38 | 78.43 | **12.22** | 64.84 | +| **Qwen-72B-Chat** | 32K | **62.30** | 58.13 | 76.00 | **77.22** | **86.24** | 6.66 | **69.53** | + +We conducted the "needle in a haystack" experiment (the idea came from [@Greg Kamradt](https://twitter.com/GregKamradt/status/1727018183608193393)) to test whether the model can retrieve information at different positions in inputs of different lengths. The result is as follows: + +![](assets/qwen_72b_needle_in_a_haystack.png) + +The above results show that Qwen-72B-Chat can accurately retrieve information placed at various positions within an input length of 32K, indicating strong long text understanding capabilities. ## Tokenizer @@ -1176,7 +1374,13 @@ If you find our work helpful, feel free to give us a cite. ## License Agreement -Researchers and developers are free to use the codes and model weights of both Qwen and Qwen-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details. If you have requirements for commercial use, please fill out the form ([7B](https://dashscope.console.aliyun.com/openModelApply/qianwen), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat)) to apply. +The source code provided at is licensed under the [Apache 2.0 License](./LICENSE) that can be found at the root directory. + +Researchers and developers are free to use the code and model weights of both Qwen and Qwen-Chat. For commercial use, please check the License Agreement accompanying each model.
+ +- Qwen-72B, Qwen-14B, and Qwen-7B are licensed under the [Tongyi Qianwen LICENSE AGREEMENT](./Tongyi%20Qianwen%20LICENSE%20AGREEMENT) that can be found at the corresponding HuggingFace and ModelScope repository. For commercial use, please fill out the form ([72B](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat), and [7B](https://dashscope.console.aliyun.com/openModelApply/qianwen)) to apply. + +- Qwen-1.8B is licensed under the [Tongyi Qianwen RESEARCH LICENSE AGREEMENT](./Tongyi%20Qianwen%20RESEARCH%20LICENSE%20AGREEMENT) that can be found at the corresponding HuggingFace and ModelScope repository. For commercial use, please contact us.

## Contact Us diff --git a/README_CN.md b/README_CN.md index f41acdf..71ea0e6 100644 --- a/README_CN.md +++ b/README_CN.md @@ -1,5 +1,5 @@

- 中文  |  English  |  日本語 |  Français + 中文  |  English  |  日本語 |  Français |  Español



@@ -9,21 +9,33 @@

- 🤗 Hugging Face   |   🤖 魔搭社区   |    📑 论文    |   🖥️ Demo + 🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |   🖥️ Demo
-微信   |    钉钉    |   Discord   +WeChat (微信)   |   Discord   |   API   |   Web   |   APP



| | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen | |-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:| +| 1.8B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | | 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | | 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | +| 72B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | -我们开源了**Qwen**(通义千问)系列工作,当前开源模型的参数规模为70亿(7B)和140亿(14B)。本次开源包括基础模型**Qwen**,即**Qwen-7B**和**Qwen-14B**,以及对话模型**Qwen-Chat**,即**Qwen-7B-Chat**和**Qwen-14B-Chat**。模型链接在表格中,请点击了解详情。同时,我们公开了我们的技术报告,请点击上方论文链接查看。 -当前基础模型已经稳定训练了大规模高质量且多样化的数据,覆盖多语言(当前以中文和英文为主),总量高达3万亿token。在相关基准评测中,Qwen系列模型拿出非常有竞争力的表现,显著超出同规模模型并紧追一系列最强的闭源模型。此外,我们利用SFT和RLHF技术实现对齐,从基座模型训练得到对话模型。Qwen-Chat具备聊天、文字创作、摘要、信息抽取、翻译等能力,同时还具备一定的代码生成和简单数学推理的能力。在此基础上,我们针对LLM对接外部系统等方面针对性地做了优化,当前具备较强的工具调用能力,以及最近备受关注的Code Interpreter的能力和扮演Agent的能力。 + +我们开源了**Qwen**(通义千问)系列工作,当前开源模型的参数规模为18亿(1.8B)、70亿(7B)、140亿(14B)和720亿(72B)。本次开源包括基础模型**Qwen**,即**Qwen-1.8B**、**Qwen-7B**、**Qwen-14B**、**Qwen-72B**,以及对话模型**Qwen-Chat**,即**Qwen-1.8B-Chat**、**Qwen-7B-Chat**、**Qwen-14B-Chat**和**Qwen-72B-Chat**。模型链接在表格中,请点击了解详情。同时,我们公开了我们的技术报告,请点击上方论文链接查看。 +当前基础模型已经稳定训练了大规模高质量且多样化的数据,覆盖多语言(当前以中文和英文为主),总量高达3万亿token。在相关基准评测中,Qwen系列模型拿出非常有竞争力的表现,显著超出同规模模型并紧追一系列最强的闭源模型。此外,我们利用SFT和RLHF技术实现对齐,从基座模型训练得到对话模型。Qwen-Chat具备聊天、文字创作、摘要、信息抽取、翻译等能力,同时还具备一定的代码生成和简单数学推理的能力。在此基础上,我们针对LLM对接外部系统等方面针对性地做了优化,当前具备较强的工具调用能力,以及最近备受关注的Code Interpreter的能力和扮演Agent的能力。我们将各个大小模型的特点列到了下表。 + +| 模型 | 开源日期 | 最大上下文长度 | System Prompt强化 | 预训练token数 | 微调(Q-Lora)最小GPU用量 | 生成2048个token的最小显存占用 | 工具调用 | +|:----------|:--------:|:-------:|:---------------:|:---------:|:-----------------:|:-------------------:|:----:| +| Qwen-1.8B | 
23.11.30 | 32K | √ | 2.2T | 5.8GB | 2.9GB | √ | +| Qwen-7B | 23.08.03 | 32K | × | 2.4T | 11.5GB | 8.2GB | √ | +| Qwen-14B | 23.09.25 | 8K | × | 3.0T | 18.7GB | 13.0GB | √ | +| Qwen-72B | 23.11.30 | 32K | √ | 3.0T | 61.4GB | 48.9GB | √ | + + 在这个项目中,你可以了解到以下内容 * 快速上手Qwen-Chat教程,玩转大模型推理 @@ -45,8 +57,9 @@ ## 新闻 +* 2023.11.30 🔥 我们推出 **Qwen-72B** 和 **Qwen-72B-Chat**,它们在 3T tokens上进行训练,并支持 32k 上下文。同时也发布了 **Qwen-1.8B** 和 **Qwen-1.8B-Chat**。我们还增强了 Qwen-72B-Chat 和 Qwen-1.8B-Chat 的系统指令(System Prompt)功能,请参阅[示例文档](examples/system_prompt.md)。此外,我们还对**昇腾910**以及**海光DCU**实现了推理的支持,详情请查看`ascend-support`及`dcu-support`文件夹。 * 2023年10月17日 我们推出了Int8量化模型**Qwen-7B-Chat-Int8**和**Qwen-14B-Chat-Int8**。 -* 2023年9月25日 🔥 在魔搭社区(ModelScope)和Hugging Face推出**Qwen-14B**和**Qwen-14B-Chat**模型,并开源 [qwen.cpp](https://github.com/QwenLM/qwen.cpp) 和 [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent)。**Qwen-7B**和**Qwen-7B-Chat**的代码和模型也同步得到更新。**请使用最新的代码和模型!** +* 2023年9月25日 在魔搭社区(ModelScope)和Hugging Face推出**Qwen-14B**和**Qwen-14B-Chat**模型,并开源 [qwen.cpp](https://github.com/QwenLM/qwen.cpp) 和 [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent)。**Qwen-7B**和**Qwen-7B-Chat**的代码和模型也同步得到更新。**请使用最新的代码和模型!** - 相比原版Qwen-7B,新版用了更多训练数据(从2.2T增加到2.4T tokens),序列长度从2048扩展至8192。整体中文能力以及代码能力均有所提升。 * 2023年9月12日 支持Qwen-7B和Qwen-7B-Chat的微调,其中包括全参数微调、LoRA以及Q-LoRA。 * 2023年8月21日 发布Qwen-7B-Chat的Int4量化模型,Qwen-7B-Chat-Int4。该模型显存占用低,推理速度相比半精度模型显著提升,在基准评测上效果损失较小。 @@ -55,27 +68,30 @@ ## 评测表现 -Qwen-14B及Qwen-7B (最新版本使用更大量的token进行预训练)相比同规模模型均实现了效果的显著提升。我们评测的数据集包括MMLU、C-Eval、 GSM8K、 MATH、HumanEval、MBPP、BBH等数据集,考察的能力包括自然语言理解、知识、数学计算和推理、代码生成、逻辑推理等。当然,即便Qwen-14B相比GPT-3.5和GPT-4仍有差距。 +Qwen系列模型相比同规模模型均实现了效果的显著提升。我们评测的数据集包括MMLU、C-Eval、 GSM8K、 MATH、HumanEval、MBPP、BBH等数据集,考察的能力包括自然语言理解、知识、数学计算和推理、代码生成、逻辑推理等。Qwen-72B在所有任务上均超越了LLaMA2-70B的性能,同时在10项任务中的7项任务中超越GPT-3.5.

- +


-| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | -|:-----------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:| -| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | -| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | -| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | -| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | -| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | -| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | -| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | -| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | -| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 | -| **Qwen-7B (original)** | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 | -| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | -| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** | +| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | +|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:| +| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | +| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | +| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | +| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | +| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | +| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | +| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | +| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 | +| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 | +| 
XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - | +| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 | +| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | +| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 | +| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** | 对于以上所有对比模型,我们列出了其官方汇报结果与[OpenCompass](https://opencompass.org.cn/leaderboard-llm)结果之间的最佳分数。 @@ -87,6 +103,7 @@ Qwen-14B及Qwen-7B (最新版本使用更大量的token进行预训练)相比同 * python 3.8及以上版本 * pytorch 1.12及以上版本,推荐2.0及以上版本 +* transformers 4.32及以上版本 * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
@@ -94,7 +111,9 @@ Qwen-14B及Qwen-7B (最新版本使用更大量的token进行预训练)相比同 我们提供简单的示例来说明如何利用🤖 ModelScope和🤗 Transformers快速使用Qwen-7B和Qwen-7B-Chat。 -在开始前,请确保你已经配置好环境并安装好相关的代码包。最重要的是,确保你满足上述要求,然后安装相关的依赖库。 +你可以使用我们预构建好的Docker镜像,省去大部分配置环境的操作,详情见[“使用预构建的docker镜像”](#-使用预构建的docker镜像)一节。 + +如不使用Docker,请确保你已经配置好环境并安装好相关的代码包。最重要的是,确保你满足上述要求,然后安装相关的依赖库。 ```bash pip install -r requirements.txt @@ -107,6 +126,7 @@ git clone https://github.com/Dao-AILab/flash-attention cd flash-attention && pip install . # 下方安装可选,安装可能比较缓慢。 # pip install csrc/layer_norm +# 如果flash-attn版本高于2.1.1,下方无需安装。 # pip install csrc/rotary ``` @@ -189,7 +209,9 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)) +

若在使用上述代码时由于各种原因无法从 HuggingFace 拉取模型和代码,可以先从 ModelScope 下载模型及代码至本地,再从本地加载模型: +

```python from modelscope import snapshot_download @@ -316,6 +338,60 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cp 如果你遇到显存不足的问题而希望使用多张GPU进行推理,可以使用上述的默认的使用方法读取模型。此前提供的脚本`utils.py`已停止维护。 尽管这个方法很简单,但它的效率相对较低。我们建议使用vLLM和FastChat并请阅读部署章节。 + +### 阿里云灵积(DashScope)API服务 +最简单的使用Qwen模型API服务的方法就是通过DashScope(阿里云灵积API模型服务)。我们提供了简单介绍说明使用方法。同时,我们还提供了自己部署OpenAI格式的API的方法。 + +DashScope是阿里云提供的大语言模型的API服务,目前支持Qwen。但请注意,目前提供服务的Qwen模型为内部模型,暂无更多具体细节对外透露。模型服务包括`qwen-turbo`、`qwen-plus`和`qwen-max`,`qwen-turbo`速度更快,`qwen-plus`效果更优,`qwen-max`是最新发布的千亿级通义千问2.0模型。详情请查看[文档](https://dashscope.aliyun.com)。 + +请首先前往[官网](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn)开通DashScope,获得API Key(AK)。建议通过环境变量设置AK: +```bash +export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY" +``` +随后安装相关代码包,点击[此处](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk)查看安装文档。如使用python,则直接通过pip安装: +```bash +pip install dashscope +``` +如安装JAVA SDK,则通过如下命令安装: +```xml + + + com.alibaba + dashscope-sdk-java + the-latest-version + +``` +最简单的使用方法就是通过messages调用,用法类似OpenAI API。示例如下: +```python +import random +from http import HTTPStatus +from dashscope import Generation + + +def call_with_messages(): + messages = [{'role': 'system', 'content': 'You are a helpful assistant.'}, + {'role': 'user', 'content': '如何做西红柿鸡蛋?'}] + gen = Generation() + response = gen.call( + Generation.Models.qwen_turbo, + messages=messages, + seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set + result_format='message', # set the result to be "message" format. 
+ ) + return response + + +if __name__ == '__main__': + response = call_with_messages() + if response.status_code == HTTPStatus.OK: + print(response) + else: + print('Request id: %s, Status code: %s, error code: %s, error message: %s' % ( + response.request_id, response.status_code, + response.code, response.message + )) +``` +更多用法请查看官方文档了解详情。
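Before calling the service, it can help to fail fast when the API key is missing and to keep the multi-turn `messages` list in one place. The following is a minimal sketch and not part of the DashScope SDK: `require_api_key` and `append_turn` are hypothetical helper names, and the `assistant` role is an assumption made by analogy with the OpenAI-style format mentioned above.

```python
import os


def require_api_key(env_var="DASHSCOPE_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before calling the API")
    return key


def append_turn(messages, role, content):
    """Return a new message list with one more turn (the input list is not mutated)."""
    if role not in ("system", "user", "assistant"):  # assumed role names
        raise ValueError(f"unexpected role: {role}")
    return messages + [{"role": role, "content": content}]


if __name__ == "__main__":
    history = [{"role": "system", "content": "You are a helpful assistant."}]
    history = append_turn(history, "user", "如何做西红柿鸡蛋?")
    print(len(history), history[-1]["role"])
```

The resulting list has the same shape as the `messages` argument in the example above, so it could be passed to the call unchanged.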

@@ -323,7 +399,7 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cp ### GPTQ -**请注意:我们更新量化方案为基于 [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) 的量化,提供Int4量化模型。该方案在模型评测效果几乎无损,且存储需求更低,推理速度更优。** +我们提供了基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化方案,并开源了Int4和Int8量化模型。量化模型的效果损失很小,但能显著降低显存占用并提升推理速度。 以下我们提供示例说明如何使用Int4量化模型。在开始使用前,请先保证满足要求(如torch 2.0及以上,transformers版本为4.32.0及以上,等等),并安装所需安装包: @@ -333,6 +409,12 @@ pip install auto-gptq optimum 如安装`auto-gptq`遇到问题,我们建议您到官方[repo](https://github.com/PanQiWei/AutoGPTQ)搜索合适的wheel。 +> 注意:预编译的`auto-gptq`版本对`torch`版本及其CUDA版本要求严格。同时,由于 +> 其近期更新,你可能会遇到`transformers`、`optimum`或`peft`抛出的版本错误。 +> 我们建议使用符合以下要求的最新版本: +> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1 +> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0 + 随后即可使用和上述一致的用法调用量化模型: ```python @@ -349,12 +431,18 @@ response, history = model.chat(tokenizer, "Hi", history=None) | Quantization | MMLU | CEval (val) | GSM8K | Humaneval | |----------------------|:----:|:-----------:|:-----:|:---------:| +| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 | +| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 | +| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 | | Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 | | Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 | | Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 | | Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 | -| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 | +| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 | | Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 | +| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 | +| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 | +| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
@@ -362,9 +450,9 @@ response, history = model.chat(tokenizer, "Hi", history=None) > 注意:由于Hugging Face的内部实现,本功能的支持文件`cache_autogptq_cuda_356.cpp`与`cache_autogptq_cuda_kernel_245.cu`可能没被下载。如需开启使用,请手动从相关位置下载,并放置到相应文件中。 -在模型infer时,可以将中间结果key以及value的值量化后压缩存储,这样便可以在相同的卡上存储更多的key以及value,增加样本吞吐。 +在模型推理时,我们可以将中间结果key以及value的值量化后压缩存储,这样便可以在相同的卡上存储更多的key以及value,增加样本吞吐。 -提供use_cache_quantization以及use_cache_kernel两个参数对模型控制,当use_cache_quantization以及use_cache_kernel均开启时,将启动kv-cache量化的功能。具体使用如下: +我们在`config.json`里提供了`use_cache_quantization`和`use_cache_kernel`两个参数来控制是否启用KV cache量化,具体使用方法如下: ```python model = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen-7B-Chat", @@ -375,43 +463,46 @@ model = AutoModelForCausalLM.from_pretrained( use_flash_attn=False ) ``` -注意:当前该功能目前不支持与flash attn同时开启,如果你开了kv cache量化的同时又开了flash attn(use_flash_attn=True, use_cache_quantization=True, use_cache_kernel=True),会默认将use_flash_attn关闭。 +注意:当前该功能不支持与flash attention同时开启,如果你开了KV cache量化的同时又开了flash attention(`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`),程序默认将关闭`use_flash_attn`。 -效果方面,我们验证过Int8 kv-cache的使用对模型整体的精度指标基本无损。我们做了针对显存占用的性能测试。评测运行于单张A100-SXM4-80G GPU,模型默认使用BF16格式,默认生成的seq-length=1024(生成1024个token),其中oom表示out of memory。 +效果方面,我们验证过Int8 KV Cache的使用对模型整体的精度指标基本无损。我们做了针对显存占用的性能测试。评测运行于单张A100-SXM4-80G GPU,模型默认使用BF16格式,默认生成1024个token,其中OOM表示内存不足。 -开启了kv-cache量化之后,模型在infer的时候可以开启更大的batch size(bs) +开启了KV cache量化之后,模型在推理的时候可以开启更大的batch size (bs)。 -| USE KVCache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 | -|-------------|:------:|:------:|:------:|:------:|:------:|:------:| -| no | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom | -| yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB | +| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 | +|--------------|:------:|:------:|:------:|:------:|:------:|:------:| +| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom | +| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB | 
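To make the idea of storing keys and values in 8 bits concrete, here is a minimal pure-Python sketch of asymmetric quantization with a scale and a zero point. It is an illustration under stated assumptions, not the repository's kernel: `quantize_int8` and `dequantize_int8` are hypothetical names, the real implementation operates on tensors, and its exact rounding and grouping are not described here.

```python
def quantize_int8(values):
    """Asymmetric 8-bit quantization: map floats onto integers in [0, 255]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for constant input
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Recover approximate floats from the stored integers."""
    return [(x - zero_point) * scale for x in q]


if __name__ == "__main__":
    values = [-1.0, -0.25, 0.0, 0.5, 1.0]
    q, scale, zero_point = quantize_int8(values)
    restored = dequantize_int8(q, scale, zero_point)
    print(q, [round(r, 3) for r in restored])
```

Each stored element takes one byte instead of the two bytes of BF16, at the cost of a reconstruction error bounded by roughly one quantization step, which is consistent with the memory savings reported in the tables.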
-开启了kv-cache量化之后,模型在infer时预测更长的seq-length(sl,生成的token数)结果时,可以节约更多的显存。 +开启了KV cache量化之后,模型在推理时可在生成更长的序列(sl,生成的token数)时,节约更多的显存。 -| USE KVCache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 | -|-------------|:------:|:-------:|:-------:|:-------:|:-------:| -| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB | -| yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB | +| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 | +|--------------|:------:|:-------:|:-------:|:-------:|:-------:| +| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB | +| yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB | -模型开启kv cache量化后再模型infer的时候,会将原始存进layer_past的float格式的key/value变成int8格式的qkey/qvalue和相对应的量化参数。 +开启KV cache量化后,模型在推理时会将原始存进`layer-past`的float格式的key/value转换成int8格式,同时存储量化部分的参数。 + 具体操作如下: -1、将key/value进行量化操作 + +1. 将key/value进行量化操作 ``` qv,scale,zero_point=quantize_cache_v(v) ``` -2、存入layer_past中: -量化格式的layer_past: +2. 存入`layer_past`中: + +量化格式的`layer-past`: ``` layer_past=((q_key,key_scale,key_zero_point), (q_value,value_scale,value_zero_point)) ``` -原始格式的layer_past: +原始格式的`layer-past`: ``` layer_past=(key,value) ``` -如果需要将layer_past中存好的key,value直接取出使用,可以使用反量化操作将int8格式的key/value转回float格式: +如果需要将`layer-past`中存好的key,value直接取出使用,可以使用反量化操作将Int8格式的key/value转回float格式: ``` v=dequantize_cache_torch(qv,scale,zero_point) ``` @@ -420,118 +511,100 @@ model = AutoModelForCausalLM.from_pretrained( ### 推理性能 这一部分将介绍模型推理的速度和显存占用的相关数据。下文的性能测算使用 [此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py) 完成。 -### 推理速度 - -我们测算了BF16、Int8和Int4模型在使用flash attention v2、v1或不使用时生成2048和8192个token的平均推理速度(tokens/s)。结果如下所示: +我们测算了BF16、Int8和Int4模型在生成2048个token时的平均推理速度(tokens/s)和显存使用。结果如下所示: - + + + + - - - + + + + - + + + - + + - + + + - + + + - + + - + + + - + + + - + + - + + + - + + + - + + - - - - - - - - - - - - - - - + + +
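-| Model Size | Precision | FlashAttn | Sequence Length 2048 | Sequence Length 8192 |
-|:----------:|:---------:|:---------:|:--------------------:|:--------------------:|
-| 7B         | BF16      | v2        | 40.93 | 36.14 |
-|            |           | v1        | 40.75 | 35.34 |
-|            |           | Disabled  | 37.55 | 33.56 |
-|            | Int8      | v2        | 37.47 | 32.54 |
-|            |           | v1        | 37.51 | 32.39 |
-|            |           | Disabled  | 37.84 | 32.65 |
-|            | Int4      | v2        | 50.09 | 38.61 |
-|            |           | v1        | 45.98 | 36.47 |
-|            |           | Disabled  | 48.12 | 36.70 |
-| 14B        | BF16      | v2        | 32.88 | 24.87 |
-|            |           | v1        | 32.76 | 28.89 |
-|            |           | Disabled  | 29.32 | 22.91 |
-|            | Int8      | v2        | 29.28 | 24.22 |
-|            |           | v1        | 28.31 | 23.87 |
-|            |           | Disabled  | 31.12 | 24.60 |
-|            | Int4      | v2        | 38.72 | 27.33 |
-|            |           | v1        | 37.81 | 26.46 |
-|            |           | Disabled  | 37.65 | 26.00 |
+| Model Size | Quantization | Speed (Tokens/s) | GPU Memory Usage  |
+|:----------:|:------------:|:----------------:|:-----------------:|
+| 1.8B       | BF16         | 54.09            | 4.23GB            |
+|            | Int8         | 55.56            | 3.48GB            |
+|            | Int4         | 71.07            | 2.91GB            |
+| 7B         | BF16         | 40.93            | 16.99GB           |
+|            | Int8         | 37.47            | 11.20GB           |
+|            | Int4         | 50.09            | 8.21GB            |
+| 14B        | BF16         | 32.22            | 30.15GB           |
+|            | Int8         | 29.28            | 18.81GB           |
+|            | Int4         | 38.72            | 13.01GB           |
+| 72B        | BF16         | 8.48             | 144.69GB (2xA100) |
+|            | Int8         | 9.05             | 81.27GB (2xA100)  |
+|            | Int4         | 11.32            | 48.86GB           |
+| 72B + vLLM | BF16         | 17.60            | 2xA100            |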
-评测运行于单张A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.4。推理速度是编码2048个token和生成8192个token的速度均值。 +评测运行于单张A100-SXM4-80G GPU(除非提到使用2xA100),使用PyTorch 2.0.1、CUDA 11.8和Flash-Attention2。(72B + vLLM 使用 PyTorch 2.1.0和Cuda 11.8.)推理速度是生成2048个token的速度均值。 注意:以上Int4/Int8模型生成速度使用autogptq库给出,当前``AutoModelForCausalLM.from_pretrained``载入的模型生成速度会慢大约20%。我们已经将该问题汇报给HuggingFace团队,若有解决方案将即时更新。 -### 显存使用 - -我们还测算了BF16、Int8和Int4模型编码2048个token及生成8192个token的峰值显存占用情况。结果(GB)如下所示: - - - - - - - - - - - - - - - - - - - - - - - - - - -
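-| Model Size | Precision | Sequence Length 2048 | Sequence Length 8192 |
-|:----------:|:---------:|:--------------------:|:--------------------:|
-| 7B         | BF16      | 16.99 | 22.53 |
-|            | Int8      | 11.20 | 16.62 |
-|            | Int4      | 8.21  | 13.63 |
-| 14B        | BF16      | 30.15 | 38.94 |
-|            | Int8      | 18.81 | 27.54 |
-|            | Int4      | 13.01 | 21.79 |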
- -
+我们还测量了不同上下文长度、生成长度、Flash-Attention版本的推理速度和 GPU 内存使用情况。可以在 Hugging Face 或 ModelScope 上的相应的模型介绍页面找到结果。 ## 微调 ### 使用方法 -我们提供了`finetune.py`这个脚本供用户实现在自己的数据上进行微调的功能,以接入下游任务。此外,我们还提供了shell脚本减少用户的工作量。这个脚本支持 [DeepSpeed](https://github.com/microsoft/DeepSpeed) 和 [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) 。我们提供的shell脚本使用了DeepSpeed,因此建议您确保已经安装DeepSpeed。 +我们提供了`finetune.py`这个脚本供用户实现在自己的数据上进行微调的功能,以接入下游任务。此外,我们还提供了shell脚本减少用户的工作量。这个脚本支持 [DeepSpeed](https://github.com/microsoft/DeepSpeed) 和 [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) 。我们提供的shell脚本使用了DeepSpeed,因此建议您确保已经安装DeepSpeed和Peft(注意:DeepSpeed可能不兼容最新的pydantic版本,请确保`pydantic<2.0`)。你可以使用如下命令安装: +```bash +pip install peft deepspeed +``` 首先,你需要准备你的训练数据。你需要将所有样本放到一个列表中并存入json文件中。每个样本对应一个字典,包含id和conversation,其中后者为一个列表。示例如下所示: ```json @@ -641,7 +714,12 @@ tokenizer.save_pretrained(new_model_directory) 注意:分布式训练需要根据你的需求和机器指定正确的分布式训练超参数。此外,你需要根据你的数据、显存情况和训练速度预期,使用`--model_max_length`设定你的数据长度。 ### 显存占用及训练速度 -下面记录7B和14B模型在单GPU使用LoRA(LoRA (emb)指的是embedding和输出层参与训练,而LoRA则不优化这部分参数)和QLoRA时处理不同长度输入的显存占用和训练速度的情况。本次评测运行于单张A100-SXM4-80G GPU,使用CUDA 11.8和Pytorch 2.0,并使用了flash attention 2。我们统一使用batch size为1,gradient accumulation为8的训练配置,记录输入长度分别为256、512、1024、2048、4096和8192的显存占用(GB)和训练速度(s/iter)。我们还使用2张A100测了Qwen-7B的全参数微调。受限于显存大小,我们仅测试了256、512和1024token的性能。具体数值如下所示: +下面记录7B和14B模型在单GPU使用LoRA(LoRA (emb)指的是embedding和输出层参与训练,而LoRA则不优化这部分参数)和QLoRA时处理不同长度输入的显存占用和训练速度的情况。本次评测运行于单张A100-SXM4-80G GPU,使用CUDA 11.8和Pytorch 2.0,并使用了flash attention 2。我们统一使用batch size为1,gradient accumulation为8的训练配置,记录输入长度分别为256、512、1024、2048、4096和8192的显存占用(GB)和训练速度(s/iter)。我们还使用2张A100测了Qwen-7B的全参数微调。受限于显存大小,我们仅测试了256、512和1024token的性能。 + +对于 Qwen-72B,我们测试了两种方案:1)使用4个 A100-SXM4-80G GPUs,通过 Lora + DeepSpeed ZeRO 3 微调和2)使用单张A100-SXM4-80G GPU,通过 QLora (int4) 微调。请注意,使用 LoRA (emb) 微调和不带 DeepSpeed ZeRO 3 的 LoRA 微调在4个A100-SXM4-80G GPUs 上都会出现OOM(你可以通过将`--deepspeed 
finetune/ds_config_zero3.json`参数传给[`finetune/finetune_lora_ds.sh`](finetune/finetune_lora_ds.sh)来打开 DeepSpeed ZeRO 3 配置)。 + +具体数值如下所示: + @@ -652,6 +730,18 @@ tokenizer.save_pretrained(new_model_directory) + + + + + + + + + + + + @@ -673,6 +763,12 @@ tokenizer.save_pretrained(new_model_directory) + + + + + +
+| 1.8B | LoRA | 6.7G / 1.0s/it | 7.4G / 1.0s/it | 8.4G / 1.1s/it | 11.0G / 1.7s/it | 16.2G / 3.3s/it | 21.8G / 6.8s/it |
+|      | LoRA (emb) | 13.7G / 1.0s/it | 14.0G / 1.0s/it | 14.0G / 1.1s/it | 15.1G / 1.8s/it | 19.7G / 3.4s/it | 27.7G / 7.0s/it |
+|      | Q-LoRA | 5.8G / 1.4s/it | 6.0G / 1.4s/it | 6.6G / 1.4s/it | 7.8G / 2.0s/it | 10.2G / 3.4s/it | 15.8G / 6.5s/it |
+|      | Full-parameter | 43.5G / 2.1s/it | 43.5G / 2.2s/it | 43.5G / 2.2s/it | 43.5G / 2.3s/it | 47.1G / 2.8s/it | 48.3G / 5.6s/it |
 | 7B   | LoRA | 20.1G / 1.2s/it | 20.4G / 1.5s/it | 21.5G / 2.8s/it | 23.8G / 5.2s/it | 29.7G / 10.1s/it | 36.6G / 21.3s/it |
 | ...  | ... (rows between the two hunks are not shown) |
 |      | Q-LoRA | 18.7G / 5.3s/it | 18.4G / 6.3s/it | 18.9G / 8.2s/it | 19.9G / 11.8s/it | 23.0G / 20.1s/it | 27.9G / 38.3s/it |
+| 72B  | LoRA + Deepspeed Zero3 | 215.4G / 17.6s/it | 217.7G / 20.5s/it | 222.6G / 29.4s/it | 228.8G / 45.7s/it | 249.0G / 83.4s/it | 289.2G / 161.5s/it |
+|      | Q-LoRA | 61.4G / 27.4s/it | 61.4G / 31.5s/it | 62.9G / 41.4s/it | 64.1G / 59.5s/it | 68.0G / 97.7s/it | 75.6G / 179.8s/it |

@@ -680,12 +776,40 @@ tokenizer.save_pretrained(new_model_directory) ## 部署 ### vLLM -如希望部署及加速推理,我们建议你使用vLLM和FastChat。首先安装相应的代码库: +如希望部署及加速推理,我们建议你使用vLLM。 + +如果你使用cuda12.1和pytorch2.1,可以直接使用以下命令安装vLLM。 + ```bash pip install vllm +``` + +否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。 + +#### vLLM + 类Transformer接口 + +请下载[接口封装代码](examples/vllm_wrapper.py)到当前文件夹,并执行以下命令进行多轮对话交互。(注意:该方法当前只支持``model.chat()``接口。) + +```python +from vllm_wrapper import vLLMWrapper + +model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1) + +response, history = model.chat(query="你好", history=None) +print(response) +response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history) +print(response) +response, history = model.chat(query="给这个故事起一个标题", history=history) +print(response) +``` + +#### vLLM + 网页Demo / 类OpenAI API + +你可以使用FastChat去搭建一个网页Demo或类OpenAI API服务器。首先,请安装FastChat: + +```bash pip install "fschat[model_worker,webui]" ``` -你也可以通过`git clone`和`pip install -e .`的方式通过源码安装。如果遇到安装问题,请阅读它们的官方文档。 使用vLLM和FastChat运行Qwen之前,首先启动一个controller: ```bash @@ -694,24 +818,30 @@ python -m fastchat.serve.controller 然后启动model worker读取模型。如使用单卡推理,运行如下命令: ```bash -python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code +python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16 ``` 然而,如果你希望使用多GPU加速推理或者增大显存,你可以使用vLLM支持的模型并行机制。假设你需要在4张GPU上运行你的模型,命令如下所示: ```bash -python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 +python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16 ``` -启动model worker后,你可以启动一个web demo或者OpenAI API。启动web demo的命令如下: +启动model worker后,你可以启动一个: + +* Web UI Demo ```bash python -m fastchat.serve.gradio_web_server ``` + +* OpenAI API + 使用OpenAI API前,请阅读我们的API章节配置好环境,然后运行如下命令: ```bash python -m fastchat.serve.openai_api_server --host localhost 
--port 8000 ``` + +然而,如果你觉得使用vLLM和FastChat比较困难,你也可以尝试以下我们提供的最简单的方式部署Web Demo、CLI Demo和OpenAI API。
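Once the OpenAI-style server from the command above is listening on `localhost:8000`, a client only needs to POST a JSON body to the chat completions route. The sketch below uses only the Python standard library; the model name `Qwen-7B-Chat` is an assumption (it has to match whatever name the model worker registered), and `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen-7B-Chat", base_url="http://localhost:8000/v1"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # assumed: must match the name registered by the worker
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request("你好")
    print(req.full_url)
    # print(send(req))  # requires the controller, worker, and API server to be running
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.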
-## Demo ### Web UI @@ -748,68 +878,12 @@ python cli_demo.py


-## API - -最简单的使用Qwen模型API服务的方法就是通过DashScope(阿里云灵积模型服务)。我们提供了简单介绍说明使用方法。同时,我们还提供了自己部署OpenAI格式的API的方法。 - -### DashScope -DashScope是阿里云提供的大语言模型的API服务,目前支持Qwen。但请注意,目前提供服务的Qwen模型为内部模型,暂无更多具体细节对外透露。模型服务包括`qwen-turbo`和`qwen-plus`。前者速度更快,后者效果更优。详情请查看[文档](https://dashscope.aliyun.com)。 - -请首先前往[官网](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn)开通DashScope,获得API Key(AK)。建议通过环境变量设置AK: -```bash -export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY" -``` -随后安装相关代码包,点击[此处](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk)查看安装文档。如使用python,则直接通过pip安装: -```bash -pip install dashscope -``` -如安装JAVA SDK,则通过如下命令安装: -```xml - - - com.alibaba - dashscope-sdk-java - the-latest-version - -``` -最简单的使用方法就是通过messages调用,用法类似OpenAI API。示例如下: -```python -import random -from http import HTTPStatus -from dashscope import Generation - - -def call_with_messages(): - messages = [{'role': 'system', 'content': 'You are a helpful assistant.'}, - {'role': 'user', 'content': '如何做西红柿鸡蛋?'}] - gen = Generation() - response = gen.call( - Generation.Models.qwen_turbo, - messages=messages, - seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set - result_format='message', # set the result to be "message" format. 
- ) - return response - - -if __name__ == '__main__': - response = call_with_messages() - if response.status_code == HTTPStatus.OK: - print(response) - else: - print('Request id: %s, Status code: %s, error code: %s, error message: %s' % ( - response.request_id, response.status_code, - response.code, response.message - )) -``` -更多用法请查看官方文档了解详情。 - -### OpenAI API +### API 我们提供了OpenAI API格式的本地API部署方法(感谢@hanpenggit)。在开始之前先安装必要的代码库: ```bash -pip install fastapi uvicorn openai "pydantic>=2.3.0" sse_starlette +pip install fastapi uvicorn openai pydantic sse_starlette ``` 随后即可运行以下命令部署你的本地API: @@ -860,6 +934,86 @@ print(response.choices[0].message.content) 该接口也支持函数调用(**Function Calling**),但暂时仅限 `stream=False` 时能生效。用法见[函数调用示例](examples/function_call_examples.py)。

+## 🐳 使用预构建的Docker镜像 + +为简化部署流程,我们提供了预配置好相应环境的Docker镜像:[qwenllm/qwen](https://hub.docker.com/r/qwenllm/qwen),只需安装驱动、下载模型文件即可启动Demo、部署OpenAI API以及进行微调。 + +### 准备操作 + +1. 根据需要使用的镜像版本,安装相应版本的Nvidia驱动: + - `qwenllm/qwen:cu117`(**推荐**):`>= 515.48.07` + - `qwenllm/qwen:cu114`(不支持flash-attention):`>= 470.82.01` + - `qwenllm/qwen:latest`:与`qwenllm/qwen:cu117`相同 + +2. 安装并配置[docker](https://docs.docker.com/engine/install/)和[nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html): + +```bash +# 配置docker +sudo systemctl start docker +# 测试docker是否安装正确 +sudo docker run hello-world + +# 配置nvidia-container-toolkit +sudo nvidia-ctk runtime configure --runtime=docker +sudo systemctl restart docker +# 测试nvidia-container-toolkit是否安装正确 +sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi +``` + +3. 下载模型及代码至本地(参考[此处说明](#DownloadModel)) + +### 部署 + +下面我们以Qwen-7B-Chat为例。在启动Web Demo或者部署API前,请先参照下方代码完成配置工作: + +```bash +IMAGE_NAME=qwenllm/qwen:cu117 +PORT=8901 +CHECKPOINT_PATH=/path/to/Qwen-7B-Chat # 下载到本地的模型及代码路径 +``` + +如下脚本可以帮你部署: + +* OpenAI API +```bash +bash docker/docker_openai_api.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* Web UI +```bash +bash docker/docker_web_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* 交互式Demo +```bash +bash docker/docker_cli_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} +``` + +这些命令将自动下载所需镜像以及后台启动Web UI Demo。你可以打开`http://localhost:${PORT}` 来使用该Demo。 + +如果输出如下内容,则说明Demo启动成功: + +```text +Successfully started web demo. Open '...' to try! +Run `docker logs ...` to check demo status. +Run `docker rm -f ...` to stop and remove the demo. 
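+```
+
+To check the status of the demo, use `docker logs qwen` to display its output.
+
+Use `docker rm -f qwen` to stop the service and remove the container.
+
+## 🔥 System Prompt
+Qwen-1.8B-Chat and Qwen-72B-Chat have been thoroughly trained on diverse system prompts that involve complex multi-turn interactions, so the models can follow a wide range of system prompts and be customized in context, which further improves Qwen's extensibility.
+
+With system prompts, Qwen-Chat supports **role playing**, **language style transfer**, **task setting**, and **behavior setting**.
+
+![](assets/system_prompt_language_style.png)
+
+![](assets/system_prompt_role_play_en.png)
+
+For more information on system prompts, see the [example documentation](examples/system_prompt.md).
+

 ## Tool Usage

@@ -1084,7 +1238,11 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以

 ## Long-Context Understanding

-We introduce NTK-aware interpolation, window attention, and LogN attention scaling to extend the context length and go beyond the training sequence length. In language modeling experiments on the arXiv dataset, Qwen-7B/14B with a native length of 2K still perform well at a sequence length of 8K, and Qwen-7B with its native length extended to 8K performs well with sequences as long as 32K.
+We introduce NTK-aware interpolation, window attention, and LogN attention scaling to extend the context length and go beyond the training sequence length. Qwen-14B, with a native length of 2K, can be extended to a sequence length of 8K, while Qwen-1.8B/7B, with a native length of 8K, perform well with sequences as long as 32K.
+
+For Qwen-72B, we adapt RoPE to longer contexts by using a larger rotary base. Qwen-72B supports a context length of 32K.
+
+Language modeling experiments on the arXiv dataset show that Qwen performs well in long-context scenarios. The results are as follows:

@@ -1100,12 +1258,11 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以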
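 | + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 | - |
-| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 | - |
+| Qwen-1.8B | 5.00 | 4.48 | 4.13 | 3.89 | 17.42 | 433.85 |
-| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 | - |
+| + dynamic_ntk + logn + window_attn | 5.00 | 4.48 | 4.14 | 3.93 | 3.82 | 3.83 |
 | Qwen-7B | 4.23 | 3.81 | 3.52 | 3.31 | 7.27 | 181.49 |
 | ... (rows between the two hunks are not shown) |
 | + dynamic_ntk + logn + window_attn | - | 3.46 | 3.29 | 3.18 | 3.42 | - |
+| Qwen-72B | - | - | - | 2.83 | 2.73 | 2.72 |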
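The two scaling ideas named in this section, dynamic NTK and LogN attention scaling, can be sketched in a few lines. This is one widely used formulation, shown under stated assumptions and not as the repository's confirmed implementation: the function names, `head_dim=128`, and `base=10000.0` are illustrative choices.

```python
import math


def ntk_alpha(seq_len, train_len):
    """Scaling factor that grows in steps once seq_len exceeds train_len."""
    context_value = math.log2(seq_len / train_len) + 1
    return max(2 ** math.ceil(context_value) - 1, 1)


def scaled_rope_base(seq_len, train_len, head_dim=128, base=10000.0):
    """Enlarge the rotary base so rotary frequencies stretch over longer inputs."""
    alpha = ntk_alpha(seq_len, train_len)
    return base * alpha ** (head_dim / (head_dim - 2))


def logn_scale(seq_len, train_len):
    """Attention scale log_{train_len}(seq_len), applied only beyond train_len."""
    return math.log(seq_len, train_len) if seq_len > train_len else 1.0


if __name__ == "__main__":
    for n in (2048, 8192, 32768):
        print(n, ntk_alpha(n, 2048), round(logn_scale(n, 2048), 4))
```

Within the training length both factors reduce to the identity, so short inputs behave exactly as they did during training; only longer inputs see a larger rotary base and a rescaled attention query.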
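-## Tokenization
+To further assess the long-context ability of Qwen-72B-Chat, we evaluated it on the closed-ended tasks of [L-Eval](https://arxiv.org/abs/2307.11088). The scores are as follows:

-> Note: the term "tokenization" has no agreed-upon Chinese equivalent, so this document keeps the English term.
+| Model             | Input Length | Average   |  Coursera  |    GSM     |   QuALITY  |    TOEFL   |   CodeU    |  SFiction  |
+|:------------------|:------------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+| ChatGPT-3.5-16k   |     16K      |   60.73   | **63.51**  | **84.00**  |   61.38    |    78.43   | **12.22**  |    64.84   |
+| **Qwen-72B-Chat** |     32K      | **62.30** |   58.13    |   76.00    | **77.22**  |  **86.24** |    6.66    |  **69.53** |
+
+
+We also ran a "needle in a haystack" experiment (the idea comes from [@Greg Kamradt](https://twitter.com/GregKamradt/status/1727018183608193393)) to test whether the model can retrieve information placed at different positions in inputs of different lengths. The result is as follows:
+
+![](assets/qwen_72b_needle_in_a_haystack.png)
+
+These results show that Qwen-72B-Chat can accurately retrieve information placed at various positions within inputs of up to 32K tokens, which demonstrates its strong long-context capability.
+
+## Tokenizer
+
+> Note: the term "tokenizer" has no agreed-upon Chinese equivalent, so this document keeps the English term.

 Our tiktoken-based tokenizer differs from other tokenizers such as the sentencepiece tokenizer. Special tokens need particular attention, especially during fine-tuning. For more about the tokenizer and its use in fine-tuning, see the [documentation](tokenization_note_zh.md).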

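@@ -1155,7 +1329,14 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以

 ## License Agreement

-Researchers and developers may use Qwen and Qwen-Chat or build on them. Commercial use is also permitted; see [LICENSE](LICENSE) for details. For commercial use, please fill in the form ([7B](https://dashscope.console.aliyun.com/openModelApply/qianwen), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat)) to apply.
+The source code in this repository is licensed under the [Apache 2.0 License](./LICENSE); the full text is in the root directory of the repository.
+
+Researchers and developers may use Qwen and Qwen-Chat or build on them. For commercial use, check the license of each model.
+
+- Qwen-72B, Qwen-14B, and Qwen-7B are licensed under the [Tongyi Qianwen LICENSE AGREEMENT](./Tongyi%20Qianwen%20LICENSE%20AGREEMENT); the original text is available in the corresponding HuggingFace or ModelScope repository. Commercial use only requires following the terms of the agreement, and you are welcome to fill in the form ([72B](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat), [7B](https://dashscope.console.aliyun.com/openModelApply/qianwen)).
+
+- Qwen-1.8B is licensed under the [Tongyi Qianwen RESEARCH LICENSE AGREEMENT](./Tongyi%20Qianwen%20RESEARCH%20LICENSE%20AGREEMENT); the original text is available in the corresponding HuggingFace or ModelScope repository. For commercial use, please contact us.
+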

## 联系我们 diff --git a/README_ES.md b/README_ES.md new file mode 100644 index 0000000..ec65de6 --- /dev/null +++ b/README_ES.md @@ -0,0 +1,1350 @@ +

+ 中文  |  English  |  日本語 |  Français |  Español +

+

+ +

+ +

+
+ +

+ 🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |   🖥️ Demo +
+WeChat (微信)   |   Discord   |   API +

+

+ +| | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen | +|-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:| +| 1.8B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | +| 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | +| 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | +| 72B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | + + + +Abrimos nuestra serie **Qwen**, que ahora incluye **Qwen**, los modelos de lenguaje, es decir **Qwen-7B** y **Qwen-14B**, así como **Qwen-Chat**, los modelos de chat, es decir **Qwen-7B-Chat** y **Qwen-14B-Chat**. Los enlaces se encuentran en la tabla anterior. Haz clic en ellos y comprueba las fichas de los modelos. Además, publicamos el **[informe técnico](https://arxiv.org/abs/2309.16609)**. Haz clic en el enlace y compruébalo. + +En resumen, disponemos de modelos lingüísticos sólidos, que han sido preentrenados de forma estable para hasta 3 billones de tokens de datos multilingües con una amplia cobertura de dominios, idiomas (con especial atención al chino y al inglés), etc. Son capaces de lograr un rendimiento competitivo en conjuntos de datos de referencia. Además, disponemos de modelos de chat alineados con las preferencias humanas basados en SFT y RLHF (aún no publicados), que son capaces de chatear, crear contenidos, extraer información, resumir, traducir, codificar, resolver problemas matemáticos, etc., y son capaces de utilizar herramientas, jugar como agentes o incluso jugar como intérpretes de código, etc. + +| Modelo | Fecha de Publicación | Longitud Máx. 
| Mejora del Sistema de Avisos | # de Fichas Preentrenadas | Uso Mínimo de Memoria GPU de Finetuning (Q-Lora) | Uso Mínimo de la GPU para Generar 2048 Tokens (Int4) | Uso de Herramientas | +|:----------|:--------------------:|:-------------:|:----------------------------:|:-------------------------:|:------------------------------------------------:|:----------------------------------------------------:|:-------------------:| +| Qwen-1.8B | 23.11.30 | 32K | √ | 2.2T | 5.8GB | 2.9GB | √ | +| Qwen-7B | 23.08.03 | 32K | × | 2.4T | 11.5GB | 8.2GB | √ | +| Qwen-14B | 23.09.25 | 8K | × | 3.0T | 18.7GB | 13.0GB | √ | +| Qwen-72B | 23.11.30 | 32K | √ | 3.0T | 61.4GB | 48.9GB | √ | + +En este repo, usted puede averiguar: + +* Inicio rápido con Qwen, y disfrute de la simple inferencia. +* Detalles sobre los modelos de cuantificación, incluyendo GPTQ y cuantización de caché KV. +* Estadísticas de rendimiento de la inferencia, incluyendo velocidad y memoria. +* Tutoriales sobre ajuste fino, incluyendo ajuste de parámetros completos, LoRA y Q-LoRA. +* Instrucciones de despliegue, con el ejemplo de vLLM y FastChat. +* Instrucciones para construir demos, incluyendo WebUI, CLI demo, etc. +* Introducción al servicio API de DashScope, así como instrucciones para crear una API de estilo OpenAI para tu modelo. +* Información sobre Qwen para el uso de herramientas, agente e intérprete de código. +* Estadísticas de la evaluación de la comprensión del contexto largo +* Acuerdo de licencia +* ... + +Además, si tienes problemas, consulta primero [FAQ](FAQ.md) para obtener ayuda. ¿Sigues teniendo problemas? No dudes en plantearnos tus problemas (mejor en inglés para que te entienda más gente). Si quieres ayudarnos, ¡envíanos pull requests sin dudarlo! ¡Siempre nos entusiasman los PR! + +¿Quieres charlar con nosotros o quedar para tomar un café? ¡Bienvenido a nuestro Discord o WeChat! +

+ +## Noticias y Actualizaciones + +* 2023.11.30 🔥 Lanzamos **Qwen-72B** y **Qwen-72B-Chat**, que están entrenados en tokens 3T y soportan 32k contextos, junto con **Qwen-1.8B**, y **Qwen-1.8B-Chat**, en ModelScope y Hugging Face. También hemos reforzado las capacidades de System Prompt de Qwen-72B-Chat y Qwen-1.8B-Chat, ver [documentación de ejemplo](examples/system_prompt.md). Adicionalmente, soporta la inferencia en **Ascend 910** y **Hygon DCU**. Consulta `ascend-support` y `dcu-support` para más detalles. +* 2023.10.17 Publicamos el modelo cuantizado Int8 **Qwen-7B-Chat-Int8** y **Qwen-14B-Chat-Int8**. +* 2023.9.25 Publicamos **Qwen-14B** y **Qwen-14B-Chat** en ModelScope y Hugging Face, junto con [qwen.cpp](https://github.com/QwenLM/qwen.cpp) y [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent). También se actualizan los códigos y pesos de **Qwen-7B** y **Qwen-7B-Chat**. **POR FAVOR, DESCARGA LA ÚLTIMA VERSIÓN!** + - En comparación con **Qwen-7B** (original), **Qwen-7B** utiliza más tokens de entrenamiento, pasando de 2,2T tokens a 2,4T tokens, mientras que la longitud del contexto se amplía de 2048 a 8192. El conocimiento del chino y la capacidad de codificación de **Qwen-7B** se han mejorado aún más. +* 2023.9.12 Ahora es posible el ajuste fino de los modelos Qwen-7B, incluido el ajuste fino de parámetros completos, LoRA y Q-LoRA. +* 2023.8.21 Publicamos el modelo cuantizado Int4 para Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, que requiere bajos costes de memoria pero consigue mejorar la velocidad de inferencia. Además, no se produce una degradación significativa del rendimiento en la evaluación comparativa. +* 2023.8.3 Publicamos **Qwen-7B** y **Qwen-7B-Chat** en ModelScope y Hugging Face. También proporcionamos una nota técnica para más detalles sobre el modelo, incluidos los detalles de entrenamiento y el rendimiento del modelo. +
+ +## Rendimiento + +Los modelos Qwen superan a los modelos de referencia de tamaños de modelo similares en una serie de conjuntos de datos de referencia, como MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., que evalúan las capacidades de los modelos en comprensión del lenguaje natural, resolución de problemas matemáticos, codificación, etc. Qwen-72B obtiene mejores resultados que LLaMA2-70B en todas las tareas y supera a GPT-3.5 en 7 de cada 10 tareas. + +

+ +

+
+ +| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | +|:------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:| +| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | +| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | +| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | +| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | +| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | +| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | +| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | +| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 | +| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 | +| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - | +| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 | +| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | +| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 | +| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** | + +Para todos los modelos comparados, presentamos las mejores puntuaciones entre sus resultados oficiales y [OpenCompass](https://opencompass.org.cn/leaderboard-llm). + +Para más resultados experimentales (rendimiento detallado del modelo en más conjuntos de datos de referencia) y detalles, consulte nuestro informe técnico haciendo clic [aquí](https://qianwen-res.oss-cn-beijing.aliyuncs.com/QWEN_TECHNICAL_REPORT.pdf). +

+
+## Requirements
+
+* python 3.8 and above
+* pytorch 1.12 and above, 2.0 and above are recommended
+* transformers 4.32 and above
+* CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
+
+ +## Inicio rápido + +A continuación, proporcionamos ejemplos sencillos para mostrar cómo utilizar Qwen-Chat con 🤖 ModelScope y 🤗 Transformers. + +Puedes usar nuestras imágenes docker pre-construidas para saltarte la mayoría de los pasos de configuración del entorno, mira la Sección ["Usando Imágenes Docker Pre-construidas"](#-using-pre-built-docker-images) para más detalles. + +Si no utiliza Docker, asegúrese de haber configurado el entorno e instalado los paquetes necesarios. Asegúrese de que cumple los requisitos anteriores y, a continuación, instale las bibliotecas dependientes. + +```bash +pip install -r requirements.txt +``` + +Si tu dispositivo soporta fp16 o bf16, te recomendamos instalar [flash-attention](https://github.com/Dao-AILab/flash-attention) (**ahora soportamos flash attention 2.**) para una mayor eficiencia y un menor uso de memoria. (**flash-attention es opcional y el proyecto puede ejecutarse normalmente sin instalarlo**) + +```bash +git clone https://github.com/Dao-AILab/flash-attention +cd flash-attention && pip install . +# Below are optional. Installing them might be slow. +# pip install csrc/layer_norm +# pip install csrc/rotary +``` + +Ahora puedes empezar con ModelScope o Transformers. + +### 🤗 Transformers + +Para utilizar Qwen-Chat para la inferencia, todo lo que tienes que hacer es introducir unas pocas líneas de código como se demuestra a continuación. Recuerda introducir los nombres o rutas correctos de los modelos, como "Qwen/Qwen-7B-Chat" y "Qwen/Qwen-14B-Chat". 
However, **please make sure that you are using the latest code.**
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers.generation import GenerationConfig
+
+# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
+
+# use bf16
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
+# use fp16
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
+# use cpu only
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
+# use auto mode, automatically select precision based on the device.
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen-7B-Chat",
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+
+# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
+# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
+
+# 1st dialogue turn
+response, history = model.chat(tokenizer, "你好", history=None)
+print(response)
+# 你好!很高兴为你提供帮助。
+
+# 2nd dialogue turn
+response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
+print(response)
+# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
+# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
+# 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
+# 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
+# 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
+# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
+
+# 3rd dialogue turn
+response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
+print(response)
+# 《奋斗创业:一个年轻人的成功之路》
+```
+
+Running Qwen, the base language model, is also simple.
+
+

+Running Qwen:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers.generation import GenerationConfig
+
+# Model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B"
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
+# use bf16
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
+# use fp16
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
+# use cpu only
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
+# use auto mode, automatically select precision based on the device.
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen-7B",
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+
+# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
+# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
+
+inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
+inputs = inputs.to(model.device)
+pred = model.generate(**inputs)
+print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
+# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
+```
+
+
+
+If a network problem occurs while downloading model checkpoints and code from Hugging Face, an alternative is to first fetch the checkpoint from ModelScope and then load it from the local directory as shown below:
+
+```python
+from modelscope import snapshot_download
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Downloading model checkpoint to a local dir model_dir
+# model_dir = snapshot_download('qwen/Qwen-7B', revision='v1.1.4')
+# model_dir = snapshot_download('qwen/Qwen-7B-Chat', revision='v1.1.4')
+# model_dir = snapshot_download('qwen/Qwen-14B', revision='v1.0.4')
+model_dir = snapshot_download('qwen/Qwen-14B-Chat', revision='v1.0.4')
+
+# Loading local checkpoints
+# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
+tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_dir,
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+```
+
+### 🤖 ModelScope
+
+ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers.
Similarly, you can run the models with ModelScope as shown below:
+
+```python
+from modelscope import AutoModelForCausalLM, AutoTokenizer
+from modelscope import GenerationConfig
+
+# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
+tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
+# You can specify different generation lengths, top_p, and other related hyperparameters here
+model.generation_config = GenerationConfig.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
+
+response, history = model.chat(tokenizer, "你好", history=None)
+print(response)
+response, history = model.chat(tokenizer, "浙江的省会在哪里?", history=history)
+print(response)
+response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)
+print(response)
+```
+
+### Batch Inference
+Qwen supports batch inference. With flash attention enabled, batch inference can bring a 40% speedup.
The example code is shown below:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers import GenerationConfig
+from qwen_generation_utils import make_context, decode_tokens, get_stop_words_ids
+
+# Left padding keeps every prompt adjacent to its generated tokens.
+tokenizer = AutoTokenizer.from_pretrained(
+    './',
+    pad_token='<|extra_0|>',
+    eos_token='<|endoftext|>',
+    padding_side='left',
+    trust_remote_code=True
+)
+model = AutoModelForCausalLM.from_pretrained(
+    './',
+    pad_token_id=tokenizer.pad_token_id,
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+model.generation_config = GenerationConfig.from_pretrained('./', pad_token_id=tokenizer.pad_token_id)
+
+all_raw_text = ["我想听你说爱我。", "今天我想吃点啥,甜甜的,推荐下", "我马上迟到了,怎么做才能不迟到"]
+batch_raw_text = []
+for q in all_raw_text:
+    raw_text, _ = make_context(
+        tokenizer,
+        q,
+        system="You are a helpful assistant.",
+        max_window_size=model.generation_config.max_window_size,
+        chat_format=model.generation_config.chat_format,
+    )
+    batch_raw_text.append(raw_text)
+
+batch_input_ids = tokenizer(batch_raw_text, padding='longest')
+batch_input_ids = torch.LongTensor(batch_input_ids['input_ids']).to(model.device)
+batch_out_ids = model.generate(
+    batch_input_ids,
+    return_dict_in_generate=False,
+    generation_config=model.generation_config
+)
+# Number of pad tokens at the left of each row, used to strip them before decoding
+padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
+
+batch_response = [
+    decode_tokens(
+        batch_out_ids[i][padding_lens[i]:],
+        tokenizer,
+        raw_text_len=len(batch_raw_text[i]),
+        context_length=(batch_input_ids[i].size(0)-padding_lens[i]),
+        chat_format="chatml",
+        verbose=False,
+        errors='replace'
+    ) for i in range(len(all_raw_text))
+]
+print(batch_response)
+
+# Single-prompt calls for comparison with the batched results
+response, _ = model.chat(tokenizer, "我想听你说爱我。", history=None)
+print(response)
+
+response, _ = model.chat(tokenizer, "今天我想吃点啥,甜甜的,推荐下", history=None)
+print(response)
+
+response, _ = model.chat(tokenizer, "我马上迟到了,怎么做才能不迟到", history=None)
+print(response)
+```
+
+### CPU
+
+To deploy our models on CPU, we strongly recommend [qwen.cpp](https://github.com/QwenLM/qwen.cpp), which is a pure C++ implementation of Qwen and tiktoken. Check the repo for more details.
+
+It is also simple to run the model directly on CPU, which requires you to specify the device:
+
+```python
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
+```
+
+However, inference is likely to be extremely slow this way.
+
+### Multiple GPUs
+
+If you run short of GPU memory and want to run the model on more than 1 GPU, you can directly use the default loading method, which is now supported by Transformers. The previous method based on `utils.py` is deprecated.
+
+However, although this method is simple, the efficiency of the native pipeline parallelism is low. We advise you to use vLLM with FastChat; please read the Deployment section.
+
+### DashScope
+
+The simplest way to use Qwen through APIs is the DashScope API service from Alibaba Cloud. We give an introduction to its usage here. Additionally, we provide a script for you to deploy an OpenAI-style API on your own servers.
+
+DashScope is the large language model API service provided by Alibaba Cloud, which now supports Qwen. Note that the models behind DashScope are in-house versions, and no details are provided for the time being. The services include `qwen-turbo` and `qwen-plus`, where the former runs faster and the latter achieves better performance. For more information, see the documentation [here](https://dashscope.aliyun.com).
+
+Head to the official website [link](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) to create a DashScope account and obtain the API key (AK). We recommend setting the AK with an environment variable:
+```bash
+export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
+```
+Then install the packages; click [here](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk) for the documentation. If you use Python, you can install DashScope with pip:
+```bash
+pip install dashscope
+```
+If you use the Java SDK, you can add it as a Maven dependency:
+```xml
+<dependency>
+    <groupId>com.alibaba</groupId>
+    <artifactId>dashscope-sdk-java</artifactId>
+    <version>the-latest-version</version>
+</dependency>
+```
+The simplest way to use DashScope is with messages, which is similar to the OpenAI API. An example is shown below:
+```python
+import random
+from http import HTTPStatus
+from dashscope import Generation
+
+
+def call_with_messages():
+    messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
+                {'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
+    gen = Generation()
+    response = gen.call(
+        Generation.Models.qwen_turbo,
+        messages=messages,
+        seed=random.randint(1, 10000),  # set the random seed, optional, default to 1234 if not set
+        result_format='message',  # set the result to be "message" format.
+    )
+    return response
+
+
+if __name__ == '__main__':
+    response = call_with_messages()
+    if response.status_code == HTTPStatus.OK:
+        print(response)
+    else:
+        print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
+            response.request_id, response.status_code,
+            response.code, response.message
+        ))
+```
+For more usages, please visit the official website.
+
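Because the key is kept in the `DASHSCOPE_API_KEY` environment variable, a script can fail early with a clear message when the variable was never exported. The sketch below is only an illustration: `load_dashscope_api_key` is a hypothetical helper and not part of the DashScope SDK, and it treats the placeholder string from the `export` example as "not set".

```python
import os


def load_dashscope_api_key(environ=None):
    """Return the key stored in DASHSCOPE_API_KEY, or raise a clear error.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    environ = os.environ if environ is None else environ
    key = environ.get("DASHSCOPE_API_KEY", "").strip()
    # The placeholder from the shell example counts as "not configured".
    if not key or key == "YOUR_DASHSCOPE_API_KEY":
        raise RuntimeError("DASHSCOPE_API_KEY is not set; export it before calling DashScope.")
    return key


if __name__ == "__main__":
    try:
        load_dashscope_api_key()
        print("DashScope API key found.")
    except RuntimeError as error:
        print(error)
```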

+
+## Quantization
+
+### GPTQ
+
+We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release the Int4 and Int8 quantized models, which achieve nearly lossless model quality with lower memory cost and higher inference speed.
+
+Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:
+
+```bash
+pip install auto-gptq optimum
+```
+
+If you have trouble installing `auto-gptq`, we advise you to check the [official repo](https://github.com/PanQiWei/AutoGPTQ) to find a pre-built wheel.
+
+> Note: The pre-compiled `auto-gptq` packages strongly depend on the version of `torch` and its CUDA version. Moreover, due to recent updates,
+> you may also encounter unsupported-version errors from `transformers`, `optimum`, or `peft`.
+> We recommend using the latest versions that meet the following requirements:
+> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
+> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
+
+Then you can load the quantized model easily and run inference as usual:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen-7B-Chat-Int4",
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+response, history = model.chat(tokenizer, "Hi", history=None)
+```
+
+We report the performance of the BF16, Int8, and Int4 models on the benchmarks, and we find that the quantized models do not suffer from significant performance degradation. Results are shown below:
+
+| Quantization         | MMLU | CEval (val) | GSM8K | Humaneval |
+|----------------------|:----:|:-----------:|:-----:|:---------:|
+| Qwen-1.8B-Chat (BF16)| 43.3 |    55.6     | 33.7  |   26.2    |
+| Qwen-1.8B-Chat (Int8)| 43.1 |    55.8     | 33.0  |   27.4    |
+| Qwen-1.8B-Chat (Int4)| 42.9 |    52.8     | 31.2  |   25.0    |
+| Qwen-7B-Chat (BF16)  | 55.8 |    59.7     | 50.3  |   37.2    |
+| Qwen-7B-Chat (Int8)  | 55.4 |    59.4     | 48.3  |   34.8    |
+| Qwen-7B-Chat (Int4)  | 55.1 |    59.2     | 49.7  |   29.9    |
+| Qwen-14B-Chat (BF16) | 64.6 |    69.8     | 60.1  |   43.9    |
+| Qwen-14B-Chat (Int8) | 63.6 |    68.6     | 60.0  |   48.2    |
+| Qwen-14B-Chat (Int4) | 63.3 |    69.0     | 59.8  |   45.7    |
+| Qwen-72B-Chat (BF16) | 74.4 |    80.1     | 76.4  |   64.6    |
+| Qwen-72B-Chat (Int8) | 73.5 |    80.1     | 73.5  |   62.2    |
+| Qwen-72B-Chat (Int4) | 73.4 |    80.1     | 75.3  |   61.6    |
+
+### Quantization of KV cache
+
+> NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality
+> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_245.cu`) may be missing.
+> Please download them manually from the Hugging Face Hub and place them in the same folder as the other module files.
+
+The attention KV cache can be quantized and compressed for storage, in order to get a higher sample throughput. The arguments `use_cache_quantization` and `use_cache_kernel` in `config.json` are provided to enable KV cache quantization.
+The specific usage is as follows:
+
+```python
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen-7B-Chat",
+    device_map="auto",
+    trust_remote_code=True,
+    use_cache_quantization=True,
+    use_cache_kernel=True,
+    use_flash_attn=False
+)
+```
+Attention: Currently, KV cache quantization and flash attention cannot be used at the same time.
+If you enable KV cache quantization and flash attention at the same time (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), `use_flash_attn` is disabled by default (`use_flash_attn=false`).
+
+We have verified that using the quantized int8-kvcache model does not suffer from significant performance degradation in downstream evaluation. In the following, we focus on profiling its memory footprint under different conditions.
+The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.
+We use BF16 models to generate 1024 tokens by default, and "OOM" indicates an out-of-memory error.
+
+With KV cache quantization, the model can run inference with a larger batch size (bs).
+
+| Use KV cache quantization | bs=1   | bs=4   | bs=16  | bs=32  | bs=64  | bs=100 |
+|---------------------------|:------:|:------:|:------:|:------:|:------:|:------:|
+| No                        | 16.3GB | 24.1GB | 31.7GB | 48.7GB |  OOM   |  OOM   |
+| Yes                       | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
+
+With KV cache quantization enabled, the model saves more memory when generating longer sequences (sl, the number of tokens generated) at inference.
+
+| Use KV cache quantization | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
+|---------------------------|:------:|:-------:|:-------:|:-------:|:-------:|
+| No                        | 15.2GB | 16.3GB  | 17.6GB  | 19.5GB  | 23.2GB  |
+| Yes                       |  15GB  | 15.5GB  | 15.8GB  | 16.6GB  | 17.6GB  |
+
+The model with KV cache quantization converts the format of `layer_past` from float to int8, and the quantized `layer_past` also stores the quantization parameters.
+
+The specific steps are as follows:
+
+1. Quantize key/value
+```
+    qv,scale,zero_point=quantize_cache_v(v)
+```
+2. Store into layer_past
+
+The following is the format of the quantized `layer_past`:
+```
+    layer_past=((q_key,key_scale,key_zero_point),
+                (q_value,value_scale,value_zero_point))
+```
+The original format of `layer_past` is shown below:
+```
+    layer_past=(key,value)
+```
+If you want to use the quantized KV for attention,
+you can use the dequantization operation to convert the int8 key/value back to float format as follows:
+```
+    v=dequantize_cache_torch(qv,scale,zero_point)
+```
+
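To make the quantize, store, and dequantize round trip above concrete, here is a minimal pure-Python sketch. It is not the repository's `quantize_cache_v` or `dequantize_cache_torch`, which operate on tensors; the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the scheme assumed here (asymmetric, one scale and zero point for the whole list) is only one common way to map floats to int8.

```python
def quantize_int8(values):
    """Map floats to ints in [-128, 127] using one scale and zero point (assumed scheme)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 when all values are equal
    zero_point = round(-128 - lo / scale)  # chosen so that `lo` lands on -128
    quantized = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_int8(quantized, scale, zero_point):
    """Invert quantize_int8 up to rounding error."""
    return [(q - zero_point) * scale for q in quantized]


if __name__ == "__main__":
    value = [-1.0, -0.25, 0.0, 0.5, 2.0]
    qv, scale, zero_point = quantize_int8(value)
    # The quantized cache entry keeps the parameters next to the int8 data,
    # mirroring the (q_value, value_scale, value_zero_point) layout above.
    cache_entry = (qv, scale, zero_point)
    restored = dequantize_int8(*cache_entry)
    worst = max(abs(a - b) for a, b in zip(value, restored))
    print(f"scale={scale:.5f} zero_point={zero_point} max_error={worst:.5f}")
```

Storing `scale` and `zero_point` alongside the int8 data is what lets attention recover float keys and values later; the reconstruction error is bounded by roughly one quantization step.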
+
+
+## Inference Performance
+
+This section provides the speed and memory statistics of the models at different precisions. The speed and memory profiling is conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
+
+We measured the average inference speed (tokens/s) and GPU memory usage of generating 2048 tokens with the models in BF16, Int8, and Int4.
+
+| Model Size | Quantization | Speed (Tokens/s) | GPU Memory Usage |
+|:-----------|:------------:|:----------------:|:----------------:|
+| 1.8B       |     BF16     |      54.09       |      4.23GB      |
+| 1.8B       |     Int8     |      55.56       |      3.48GB      |
+| 1.8B       |     Int4     |      71.07       |      2.91GB      |
+| 7B         |     BF16     |      40.93       |     16.99GB      |
+| 7B         |     Int8     |      37.47       |     11.20GB      |
+| 7B         |     Int4     |      50.09       |      8.21GB      |
+| 14B        |     BF16     |      32.22       |     30.15GB      |
+| 14B        |     Int8     |      29.28       |     18.81GB      |
+| 14B        |     Int4     |      38.72       |     13.01GB      |
+| 72B        |     BF16     |       8.48       | 144.69GB (2xA100) |
+| 72B        |     Int8     |       9.05       | 81.27GB (2xA100) |
+| 72B        |     Int4     |      11.32       |     48.86GB      |
+| 72B + vLLM |     BF16     |      17.60       |      2xA100      |
+
+The profiling runs on a single A100-SXM4-80G GPU (except where 2xA100 is mentioned) with PyTorch 2.0.1, CUDA 11.8, and Flash-Attention 2. (72B + vLLM uses PyTorch 2.1.0 and CUDA 11.8.) The inference speed is averaged over the encoded and generated tokens.
+
+Note: The generation speed of the Int4/Int8 models mentioned above is provided by the autogptq library. The current speed of a model loaded with ``AutoModelForCausalLM.from_pretrained`` will be approximately 20% slower. We have reported this issue to the HuggingFace team and will update this promptly if a solution becomes available.
+
+We also measure the inference speed and GPU memory usage with different settings of context length, generation length, and Flash-Attention version. You can find the results in the corresponding model cards on Hugging Face or ModelScope.
+
+
+
+## Finetuning
+
+### Usage
+We now provide the official training script, `finetune.py`, for users to finetune the pretrained model for downstream applications in a simple fashion. Additionally, we provide shell scripts to launch finetuning with no worries. This script supports training with [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The shell scripts that we provide use DeepSpeed (Note: this may conflict with the latest version of pydantic, so make sure that `pydantic<2.0`) and Peft. You can install them as follows:
+```bash
+pip install peft deepspeed
+```
+
+To prepare your training data, you need to put all the samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list for the conversation.
Below is a simple example list with 1 sample:
+```json
+[
+  {
+    "id": "identity_0",
+    "conversations": [
+      {
+        "from": "user",
+        "value": "你好"
+      },
+      {
+        "from": "assistant",
+        "value": "我是一个语言模型,我叫通义千问。"
+      }
+    ]
+  }
+]
+```
+
+After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, `$DATA`.
+
+The finetuning scripts allow you to perform:
+- Full-parameter finetuning
+- LoRA
+- Q-LoRA
+
+Full-parameter finetuning requires updating all parameters in the whole training process. To launch your training, run the following script:
+
+```bash
+# Distributed training. We do not provide a single-GPU training script, as insufficient GPU memory would break down the training.
+sh finetune/finetune_ds.sh
+```
+
+Remember to specify the correct model name or path, the data path, and the output directory in the shell scripts. Another thing to note is that we use DeepSpeed ZeRO 3 in this script. If you want to make changes, just remove the argument `--deepspeed` or edit the DeepSpeed configuration json file based on your requirements. Additionally, this script supports mixed-precision training, so you can use `--bf16 True` or `--fp16 True`. Remember to use DeepSpeed when you use fp16, because of mixed-precision training.
+Empirically, we advise you to use bf16 to keep your training consistent with our pretraining and alignment if your machine supports bf16, and thus we use it by default.
+
+To run LoRA, use another script as shown below. Before you start, make sure that you have installed `peft`. Also, you need to specify the paths to your model, data, and output. We advise you to use an absolute path for your pretrained model.
This is because LoRA only saves the adapter, and the absolute path in the adapter configuration json file is used to find the pretrained model to load. Also, this script supports both bf16 and fp16.
+
+```bash
+# Single GPU training
+sh finetune/finetune_lora_single_gpu.sh
+# Distributed training
+sh finetune/finetune_lora_ds.sh
+```
+
+In comparison with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of the adapter layers and keeps the original large language model layers frozen. This allows much lower memory costs and thus lower computation costs.
+
+Note that if you use LoRA to finetune the base language model, e.g., Qwen-7B, instead of the chat models, e.g., Qwen-7B-Chat, the script automatically switches the embedding and output layer to trainable parameters. This is because the base language model has no knowledge of the special tokens brought by the ChatML format. Thus these layers should be updated for the model to understand and predict the tokens. In other words, if your training brings in special tokens in LoRA, you should set these layers as trainable parameters by setting `modules_to_save` inside the code. Also, if these parameters are trainable, ZeRO 3 is not available, and this is why we use ZeRO 2 in the script by default. If there are no new trainable parameters, you can switch to ZeRO 3 by changing the DeepSpeed configuration file. Additionally, we find that there is a significant gap between the memory footprint of LoRA with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to LoRA-finetune the chat models. Check the profile below for more information.
+
+If you still suffer from insufficient memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses the quantized large language model and other techniques such as paged attention to allow even lower memory costs.
+
+Note: to run single-GPU Q-LoRA training, you may need to install `mpi4py` through `pip` or `conda`.
+
+To run Q-LoRA, directly run the following script:
+
+```bash
+# Single GPU training
+sh finetune/finetune_qlora_single_gpu.sh
+# Distributed training
+sh finetune/finetune_qlora_ds.sh
+```
+
+For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. You **SHOULD NOT** use the bf16 models. Different from full-parameter finetuning and LoRA, only fp16 is supported for Q-LoRA. For single-GPU training, we have to use DeepSpeed for mixed-precision training because we observed errors caused by torch amp. Besides, for Q-LoRA, the troubles with the special tokens in LoRA still exist. However, since we only provide Int4 models for the chat models, which means the language model has already learned the special tokens of the ChatML format, you need not worry about these layers. Note that the layers of the Int4 model should not be trainable, so if you introduce special tokens in your training, Q-LoRA might not work.
+
+> NOTE: Please be aware that due to the internal mechanisms of Hugging Face, certain non-Python files (e.g., `*.cpp` and `*.cu`)
+> may be missing from the saved checkpoint. You may need to copy them manually to the directory containing the other files.
+
+Different from full-parameter finetuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. Suppose your training starts from Qwen-7B; you can load the finetuned model for inference as shown below:
+
+```python
+from peft import AutoPeftModelForCausalLM
+
+model = AutoPeftModelForCausalLM.from_pretrained(
+    path_to_adapter, # path to the output directory
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+```
+
+If you want to merge the adapters and save the finetuned model as a standalone model (you can only do this with LoRA, and you CANNOT merge the parameters from Q-LoRA), you can run the following code:
+
+```python
+from peft import AutoPeftModelForCausalLM
+
+model = AutoPeftModelForCausalLM.from_pretrained(
+    path_to_adapter, # path to the output directory
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+
+merged_model = model.merge_and_unload()
+# max_shard_size and safe serialization are not necessary.
+# They respectively work for sharding checkpoint and save the model to safetensors
+merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
+```
+
+Note: For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine. Besides, we advise you to specify the maximum sequence length with the argument `--model_max_length`, based on your data, memory footprint, and training speed.
+
+
+### Profiling of Memory and Speed
+We profile the GPU memory and training speed of both LoRA (LoRA (emb) refers to training the embedding and output layer, while LoRA has no trainable embedding and output layer) and Q-LoRA in the single-GPU training setup.
In this test, we experiment on a single A100-SXM4-80G GPU, and we use CUDA 11.8 and Pytorch 2.0. Flash attention 2 is applied. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 256, 512, 1024, 2048, 4096, and 8192. We also report the statistics of full-parameter finetuning with Qwen-7B on 2 A100 GPUs. We only report the statistics of 256, 512, and 1024 tokens due to the limitation of GPU memory.
+
+For Qwen-72B, we experiment in two ways: 1) LoRA finetuning + DeepSpeed ZeRO 3 on 4 A100-SXM4-80G GPUs and 2) Q-LoRA (int4) finetuning on a single A100-SXM4-80G GPU. Note that OOM occurs on 4 A100-SXM4-80G GPUs both with LoRA (emb) finetuning and with LoRA finetuning without DeepSpeed ZeRO 3 (you can pass `--deepspeed finetune/ds_config_zero3.json` to [`finetune/finetune_lora_ds.sh`](finetune/finetune_lora_ds.sh) to enable DeepSpeed ZeRO 3).
+
+The statistics are listed below. The numeric column headers are sequence lengths, and each cell gives memory / speed:
+
+| Model Size | Method | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
+|:-----------|:-------|:---:|:---:|:----:|:----:|:----:|:----:|
+| 1.8B | LoRA | 6.7G / 1.0s/it | 7.4G / 1.0s/it | 8.4G / 1.1s/it | 11.0G / 1.7s/it | 16.2G / 3.3s/it | 21.8G / 6.8s/it |
+| 1.8B | LoRA (emb) | 13.7G / 1.0s/it | 14.0G / 1.0s/it | 14.0G / 1.1s/it | 15.1G / 1.8s/it | 19.7G / 3.4s/it | 27.7G / 7.0s/it |
+| 1.8B | Q-LoRA | 5.8G / 1.4s/it | 6.0G / 1.4s/it | 6.6G / 1.4s/it | 7.8G / 2.0s/it | 10.2G / 3.4s/it | 15.8G / 6.5s/it |
+| 1.8B | Full-parameter | 43.5G / 2.1s/it | 43.5G / 2.2s/it | 43.5G / 2.2s/it | 43.5G / 2.3s/it | 47.1G / 2.8s/it | 48.3G / 5.6s/it |
+| 7B | LoRA | 20.1G / 1.2s/it | 20.4G / 1.5s/it | 21.5G / 2.8s/it | 23.8G / 5.2s/it | 29.7G / 10.1s/it | 36.6G / 21.3s/it |
+| 7B | LoRA (emb) | 33.7G / 1.4s/it | 34.1G / 1.6s/it | 35.2G / 2.9s/it | 35.1G / 5.3s/it | 39.2G / 10.3s/it | 48.5G / 21.7s/it |
+| 7B | Q-LoRA | 11.5G / 3.0s/it | 11.5G / 3.0s/it | 12.3G / 3.5s/it | 13.9G / 7.0s/it | 16.9G / 11.6s/it | 23.5G / 22.3s/it |
+| 7B | Full-parameter | 139.2G / 4.0s/it | 148.0G / 4.0s/it | 162.0G / 4.5s/it | - | - | - |
+| 14B | LoRA | 34.6G / 1.6s/it | 35.1G / 2.4s/it | 35.3G / 4.4s/it | 37.4G / 8.4s/it | 42.5G / 17.0s/it | 55.2G / 36.0s/it |
+| 14B | LoRA (emb) | 51.2G / 1.7s/it | 51.1G / 2.6s/it | 51.5G / 4.6s/it | 54.1G / 8.6s/it | 56.8G / 17.2s/it | 67.7G / 36.3s/it |
+| 14B | Q-LoRA | 18.7G / 5.3s/it | 18.4G / 6.3s/it | 18.9G / 8.2s/it | 19.9G / 11.8s/it | 23.0G / 20.1s/it | 27.9G / 38.3s/it |
+| 72B | LoRA + Deepspeed Zero3 | 215.4G / 17.6s/it | 217.7G / 20.5s/it | 222.6G / 29.4s/it | 228.8G / 45.7s/it | 249.0G / 83.4s/it | 289.2G / 161.5s/it |
+| 72B | Q-LoRA | 61.4G / 27.4s/it | 61.4G / 31.5s/it | 62.9G / 41.4s/it | 64.1G / 59.5s/it | 68.0G / 97.7s/it | 75.6G / 179.8s/it |
+
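Before launching any of the finetuning scripts in this section, it can help to check that the training file follows the data format described under Usage (a JSON list of samples, each with an `id` and a `conversations` list of `from`/`value` turns). The sketch below is an assumption-laden illustration: `validate_samples` is a hypothetical helper, and the rules it enforces (including the allowed roles `user` and `assistant`) are inferred from the single example sample, not taken from `finetune.py`.

```python
import json

ALLOWED_ROLES = ("user", "assistant")  # assumption: roles seen in the example sample


def validate_samples(samples):
    """Return a list of problems; an empty list means the data looks well-formed."""
    if not isinstance(samples, list):
        return ["top-level JSON value must be a list of samples"]
    problems = []
    for index, sample in enumerate(samples):
        if not isinstance(sample, dict):
            problems.append(f"sample {index}: must be a dictionary")
            continue
        if "id" not in sample:
            problems.append(f"sample {index}: missing 'id'")
        conversations = sample.get("conversations")
        if not isinstance(conversations, list) or not conversations:
            problems.append(f"sample {index}: 'conversations' must be a non-empty list")
            continue
        for turn_index, turn in enumerate(conversations):
            well_formed = (
                isinstance(turn, dict)
                and turn.get("from") in ALLOWED_ROLES
                and isinstance(turn.get("value"), str)
            )
            if not well_formed:
                problems.append(f"sample {index}, turn {turn_index}: needs 'from' and 'value'")
    return problems


if __name__ == "__main__":
    example = json.loads(
        '[{"id": "identity_0", "conversations": ['
        '{"from": "user", "value": "你好"}, '
        '{"from": "assistant", "value": "我是一个语言模型,我叫通义千问。"}]}]'
    )
    print(validate_samples(example) or "data looks well-formed")
```

In practice the list would come from `json.load` on the file passed as `$DATA`.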
+
+## Deployment
+
+### vLLM
+For deployment and fast inference, we suggest using vLLM with FastChat. Install the packages first (FastChat is published on PyPI as `fschat`):
+```bash
+pip install vllm "fschat[model_worker,webui]"
+```
+Or you can install them from source with `git clone` and `pip install -e .`. We advise you to read their documentation if you run into problems during installation.
+
+To run Qwen with vLLM and FastChat, you first need to launch a controller:
+```bash
+python -m fastchat.serve.controller
+```
+
+Then you can launch the model worker, which means loading your model for inference. For single-GPU inference, you can directly run:
+
+```bash
+python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code
+```
+However, if you want to run the model on multiple GPUs for faster inference or more memory, you can use the tensor parallelism supported by vLLM. Suppose you run the model on 4 GPUs; the command is shown below:
+```bash
+python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4
+```
+
+After launching your model worker, you can launch:
+
+* Web UI Demo
+```bash
+python -m fastchat.serve.gradio_web_server
+```
+
+* OpenAI API
+```bash
+python -m fastchat.serve.openai_api_server --host localhost --port 8000
+```
+
+### Web UI
+
+We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:
+```bash
+pip install -r requirements_web_demo.txt
+```
+
+Then run the command below and click on the generated link:
+
+```bash
+python web_demo.py
+```
+

+
+ +
+

+
+Sin embargo, si le resulta difícil utilizar vLLM y FastChat, puede probar los métodos más sencillos que le proporcionamos para desplegar una demo web, una demo CLI y una API.
+
+### Demo CLI
+
+Proporcionamos un ejemplo de demostración CLI en `cli_demo.py`, que soporta la salida en streaming durante la generación. Los usuarios pueden interactuar con Qwen-7B-Chat introduciendo mensajes, y el modelo devuelve los resultados en modo streaming. Ejecute el siguiente comando:
+
+```bash
+python cli_demo.py
+```
+
+

+
+ +
+

+
+ +### API + +Proporcionamos métodos para desplegar la API local basada en la API de OpenAI (gracias a @hanpenggit). Antes de empezar, instala los paquetes necesarios: + +```bash +pip install fastapi uvicorn openai "pydantic>=2.3.0" sse_starlette +``` + +A continuación, ejecute el comando para desplegar su API: + +```bash +python openai_api.py +``` + +Puede cambiar sus argumentos, por ejemplo, `-c` para el nombre o la ruta del punto de control, `--cpu-only` para el despliegue en CPU, etc. Si tienes problemas al iniciar el despliegue de tu API, probablemente puedas solucionarlos actualizando los paquetes a la última versión. + +Utilizar la API también es sencillo. Vea el siguiente ejemplo: + +```python +import openai +openai.api_base = "http://localhost:8000/v1" +openai.api_key = "none" + +# create a request activating streaming response +for chunk in openai.ChatCompletion.create( + model="Qwen", + messages=[ + {"role": "user", "content": "你好"} + ], + stream=True + # Specifying stop words in streaming output format is not yet supported and is under development. +): + if hasattr(chunk.choices[0].delta, "content"): + print(chunk.choices[0].delta.content, end="", flush=True) + +# create a request not activating streaming response +response = openai.ChatCompletion.create( + model="Qwen", + messages=[ + {"role": "user", "content": "你好"} + ], + stream=False, + stop=[] # You can add custom stop words here, e.g., stop=["Observation:"] for ReAct prompting. +) +print(response.choices[0].message.content) +``` + +
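Como referencia, el siguiente esbozo envía la misma petición sin streaming usando solo la biblioteca estándar de Python. Es una ilustración hipotética: supone que `openai_api.py` escucha en `http://localhost:8000`, que la ruta `/v1/chat/completions` responde con el formato de OpenAI y que el servidor no exige cabecera de autenticación. Los nombres `build_chat_payload` y `chat_once` son inventados para el ejemplo y no forman parte del repositorio.

```python
import json
import urllib.request

# Assumption: same base address as in the openai-based example above.
API_BASE = "http://localhost:8000/v1"


def build_chat_payload(prompt, stop=None, model="Qwen"):
    """Build the JSON body of a non-streaming chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    if stop:  # custom stop words, e.g. ["Observation:"] for ReAct prompting
        payload["stop"] = list(stop)
    return payload


def chat_once(prompt, stop=None, api_base=API_BASE, timeout=60):
    """POST the request and return the assistant's reply text.

    Hypothetical helper: requires a server started with `python openai_api.py`.
    """
    request = urllib.request.Request(
        f"{api_base}/chat/completions",
        data=json.dumps(build_chat_payload(prompt, stop)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Con el servidor en marcha, `print(chat_once("你好"))` debería mostrar la respuesta del modelo. Separar la construcción del cuerpo de la petición del envío permite revisar el JSON sin necesidad de tener el servidor activo.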

+
+ +
+

+ +**Function calling** también está soportada (pero sólo cuando `stream=False` por el momento). Ver el [ejemplo de uso](examples/function_call_examples.py) aquí. +

+ +## 🐳 Docker + +Para simplificar el proceso de despliegue, proporcionamos imágenes Docker con entornos preconstruidos: [qwenllm/qwen](https://hub.docker.com/r/qwenllm/qwen). Solo tienes que instalar el controlador y descargar los archivos del modelo para lanzar demos, desplegar la API de OpenAI y ajustar el modelo. + +### Preparación + +1. Instale la versión correcta del controlador Nvidia en función de la imagen que vaya a utilizar: + - `qwenllm/qwen:cu117` (**recomendado**): `>= 515.48.07` + - `qwenllm/qwen:cu114` (w/o flash-attention): `>= 470.82.01` + - `qwenllm/qwen:latest`: igual que `qwenllm/qwen:cu117` + +2. Instale y configure [docker](https://docs.docker.com/engine/install/) y [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html): + +```bash +# configure docker +sudo systemctl start docker +# test if docker is correctly installed +sudo docker run hello-world + +# configure nvidia-container-toolkit +sudo nvidia-ctk runtime configure --runtime=docker +sudo systemctl restart docker +# test if nvidia-container-toolkit is correctly installed +sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi +``` + +3. Descargue los checkpoints y los códigos del modelo a su entorno (véase [aquí](#DownloadModel)). + +### Despliegue + +Aquí usamos Qwen-7B-Chat como ejemplo. 
Antes de lanzar una demo web o API, puede establecer la configuración como se muestra a continuación: + +```bash +IMAGE_NAME=qwenllm/qwen:cu117 +PORT=8901 +CHECKPOINT_PATH=/path/to/Qwen-7B-Chat # Path to downloaded model checkpoints and codes +``` +Los siguientes scripts pueden ayudarte a construir: + +* API OpenAI +```bash +bash docker/docker_openai_api.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* Interfaz Web +```bash +bash docker/docker_web_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* Demo CLI +```bash +bash docker/docker_cli_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} +``` + +Los comandos anteriores descargarán automáticamente la imagen requerida y lanzarán una demo Web UI en segundo plano (el servicio se reiniciará automáticamente). Puede abrir `http://localhost:${PORT}` en el host para utilizar la demo. + +La demostración se ha iniciado correctamente si ve la siguiente salida: + +```text +Successfully started web demo. Open '...' to try! +Run `docker logs ...` to check demo status. +Run `docker rm -f ...` to stop and remove the demo. +``` + +Si quieres comprobar el estado de la demo, puedes usar `docker logs qwen` para mostrar los resultados. + +Puede utilizar `docker rm -f qwen` para detener el servicio y eliminar el contenedor. 
+ + +### Finetuning + +El método de finetuning utilizando la imagen Docker pre-construida es básicamente el mismo que [el capítulo anterior](#Finetuning) (ya hemos instalado dependencias en la imagen): + +A continuación se muestra un ejemplo de LoRA de GPU única: +```bash +IMAGE_NAME=qwenllm/qwen:cu117 +CHECKPOINT_PATH=/path/to/Qwen-7B # Path to downloaded model checkpoints and codes +#CHECKPOINT_PATH=/path/to/Qwen-7B-Chat-Int4 # Path to downloaded model checkpoints and codes (Q-LoRA) +DATA_PATH=/path/to/data/root # Prepare finetune data at ${DATA_PATH}/example.json +OUTPUT_PATH=/path/to/output/checkpoint # Path to finetune outputs + +# Use all host devices by default +DEVICE=all +# If you need to specify GPUs for training, set device as follow (NOTE: internal quotation marks cannot be omitted) +#DEVICE='"device=0,1,2,3"' + +mkdir -p ${OUTPUT_PATH} + +# Single-GPU LoRA finetuning +docker run --gpus ${DEVICE} --rm --name qwen \ + --mount type=bind,source=${CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-7B \ + --mount type=bind,source=${DATA_PATH},target=/data/shared/Qwen/data \ + --mount type=bind,source=${OUTPUT_PATH},target=/data/shared/Qwen/output_qwen \ + --shm-size=2gb \ + -it ${IMAGE_NAME} \ + bash finetune/finetune_lora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B/ -d /data/shared/Qwen/data/example.json +``` + +Para realizar un cambio a Q-LoRA de una sola GPU, por ejemplo, basta con modificar el comando bash dentro de `docker run`: +```bash +bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B-Chat-Int4/ -d /data/shared/Qwen/data/example.json +``` +
+
+## 🔥 Indicaciones del sistema
+Qwen-1.8B-Chat y Qwen-72B-Chat han sido entrenados a fondo con diversas indicaciones del sistema y múltiples rondas de interacciones complejas, de modo que pueden seguir una variedad de indicaciones del sistema y personalizar el modelo en contexto, lo que mejora aún más la escalabilidad de Qwen-Chat.
+
+Gracias a las indicaciones del sistema, Qwen-Chat puede realizar **juegos de rol**, **transferencia de estilos de lenguaje**, **configuración de tareas** y **configuración de comportamientos**.
+
+![](assets/system_prompt_language_style.png)
+
+![](assets/system_prompt_role_play_en.png)
+
+Para más información, consulte la [documentación de ejemplo](examples/system_prompt.md).
+
+
+## Uso de Herramientas
+
+Qwen-Chat ha sido optimizado para el uso de herramientas y las llamadas a funciones. Los usuarios pueden desarrollar agentes, aplicaciones LangChain e incluso aumentar Qwen con un intérprete de código Python.
+
+Proporcionamos documentación sobre cómo implementar llamadas a herramientas basadas en el principio de ReAct Prompting; consulte [el ejemplo de ReAct](examples/react_prompt.md). Basándonos en este principio, proporcionamos soporte para llamadas a funciones en [openai_api.py](openai_api.py).
+
+Hemos probado la capacidad del modelo para llamar a herramientas en nuestro benchmark de evaluación chino de código abierto y hemos comprobado que Qwen-Chat obtiene buenos resultados de forma consistente:
+
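+**Chinese Tool-Use Benchmark**
+
+| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
+|:--------------|:----------------------:|:---------------------:|:---------------------:|
+| GPT-4 | 95% | 0.90 | 15.0% |
+| GPT-3.5 | 85% | 0.88 | 75.0% |
+| Qwen-7B-Chat | 98% | 0.91 | 7.3% |
+| Qwen-14B-Chat | 98% | 0.93 | 2.4% |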
+ +Para evaluar la capacidad de Qwen para utilizar el intérprete de código Python en tareas como la resolución de problemas matemáticos, la visualización de datos y otras tareas de propósito general como el manejo de archivos y el web scraping, hemos creado y puesto a disposición del público un benchmark específicamente diseñado para evaluar estas capacidades. Puede encontrar el punto de referencia en este [enlace](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark). + +Hemos observado que Qwen funciona bien en términos de ejecutabilidad del código y precisión de los resultados al generar código: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
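+**Executable Rate of Generated Code (%)**
+
+| Model | Math↑ | Visualization↑ | General↑ |
+|:-----------------------|:-----:|:--------------:|:--------:|
+| GPT-4 | 91.9 | 85.9 | 82.8 |
+| GPT-3.5 | 89.2 | 65.0 | 74.1 |
+| LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
+| LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
+| CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
+| CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
+| InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
+| InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
+| Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
+| Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |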
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
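+**Accuracy of Code Execution Results (%)**
+
+| Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
+|:-----------------------|:-----:|:-------------------:|:-------------------:|
+| GPT-4 | 82.8 | 66.7 | 60.8 |
+| GPT-3.5 | 47.3 | 33.3 | 55.7 |
+| LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
+| LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
+| CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
+| CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
+| InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
+| InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
+| Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
+| Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |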
+ +

+
+ +
+

+ +Además, también proporcionamos resultados experimentales que demuestran que nuestro modelo es capaz de actuar como un Agente HuggingFace. Para más información, consulte la [documentación del ejemplo](examples/transformers_agent.md). El rendimiento del modelo en el conjunto de datos de evaluación proporcionado por Hugging Face es el siguiente: + + + + + + + + + + + + + + + + + + + + + + + + + + +
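+**HuggingFace Agent Benchmark - Run Mode**
+
+| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
+|:-------------------|:---------------:|:----------:|:-----:|
+| GPT-4 | 100 | 100 | 97.4 |
+| GPT-3.5 | 95.4 | 96.3 | 87.0 |
+| StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
+| StarCoder-15B | 87.0 | 88.0 | 68.9 |
+| Qwen-7B-Chat | 87.0 | 87.0 | 71.5 |
+| Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |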
+ + + + + + + + + + + + + + + + + + + + + + + + + + +
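+**HuggingFace Agent Benchmark - Chat Mode**
+
+| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
+|:-------------------|:---------------:|:----------:|:-----:|
+| GPT-4 | 97.9 | 97.9 | 98.5 |
+| GPT-3.5 | 97.3 | 96.8 | 89.6 |
+| StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
+| StarCoder-15B | 97.9 | 97.9 | 89.6 |
+| Qwen-7B-Chat | 94.7 | 94.7 | 85.1 |
+| Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |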
+ +
+ +## Comprensión del Contexto Largo + +Para ampliar la longitud del contexto y romper el cuello de botella de la longitud de la secuencia de entrenamiento, introducimos varias técnicas, como la interpolación NTK, la atención de ventana y el escalado de atención LogN, para ampliar la longitud del contexto de Qwen-14B de 2K a más de 8K tokens, y Qwen-1.8B/7B de 8K a 32K tokens. + +Para Qwen-72B, adaptamos RoPE a contextos más largos con una base rotatoria mayor. Qwen-72B admite una longitud máxima de contexto de 32K tokens. + +Realizamos experimentos de modelado lingüístico en el conjunto de datos arXiv con la evaluación PPL y descubrimos que Qwen puede alcanzar un rendimiento sobresaliente en el escenario de contextos largos. Los resultados se muestran a continuación: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
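+| Model / Sequence Length | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
+|:----------------------------------|:----:|:----:|:-----:|:------:|:-------:|:------:|
+| Qwen-7B (original) | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | - |
+| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 | - |
+| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 | - |
+| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 | - |
+| Qwen-1.8B | 5.00 | 4.48 | 4.13 | 3.89 | 17.42 | 433.85 |
+| + dynamic_ntk + logn + window_attn | 5.00 | 4.48 | 4.14 | 3.93 | 3.82 | 3.83 |
+| Qwen-7B | 4.23 | 3.81 | 3.52 | 3.31 | 7.27 | 181.49 |
+| + dynamic_ntk + logn + window_attn | 4.23 | 3.81 | 3.52 | 3.33 | 3.22 | 3.17 |
+| Qwen-14B | - | 3.46 | 22.79 | 334.65 | 3168.35 | - |
+| + dynamic_ntk + logn + window_attn | - | 3.46 | 3.29 | 3.18 | 3.42 | - |
+| Qwen-72B | - | - | - | 2.83 | 2.73 | 2.72 |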
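Para dar una idea de cómo actúan dos de las técnicas evaluadas (`dynamic_ntk` y `logn`), el siguiente esbozo calcula un factor de NTK dinámico, la base de RoPE resultante y el factor de escalado LogN para una longitud de secuencia dada. Son formulaciones habituales de estas técnicas, presentadas aquí como suposición: no pretenden reproducir la implementación exacta del repositorio. Los nombres de las funciones, la base `10000.0` y los valores de ejemplo son hipotéticos.

```python
import math


def ntk_alpha(seq_len, train_len):
    """Dynamic NTK factor: 1 up to train_len, then grows in steps (3, 7, ...)."""
    exponent = math.ceil(math.log2(seq_len / train_len) + 1)
    return max(2 ** exponent - 1, 1)


def ntk_rope_base(seq_len, train_len, head_dim, base=10000.0):
    """NTK-aware RoPE base: base * alpha ** (d / (d - 2))."""
    alpha = ntk_alpha(seq_len, train_len)
    return base * alpha ** (head_dim / (head_dim - 2))


def logn_scale(position, train_len):
    """LogN attention scale: log_{train_len}(position) beyond train_len, else 1."""
    if position <= train_len:
        return 1.0
    return math.log(position) / math.log(train_len)
```

Por ejemplo, con una longitud de entrenamiento supuesta de 8192, `ntk_alpha(16384, 8192)` vale 3 y `logn_scale(16384, 8192)` vale 14/13, mientras que dentro de la longitud de entrenamiento ambos factores valen 1 y el modelo se comporta como el original.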
+
+Furthermore, to verify the long-text understanding of Qwen-72B-Chat, we tested it on [L-Eval](https://arxiv.org/abs/2307.11088) (closed-ended tasks). The results are as follows:
+
+| Model | Input Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFiction |
+|:------------------|:------------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+| ChatGPT-3.5-16k | 16K | 60.73 | **63.51** | **84.00** | 61.38 | 78.43 | **12.22** | 64.84 |
+| **Qwen-72B-Chat** | 32K | **62.30** | 58.13 | 76.00 | **77.22** | **86.24** | 6.66 | **69.53** |
+
+Hemos realizado el experimento de la "aguja en el pajar" (la idea procede de [@Greg Kamradt](https://twitter.com/GregKamradt/status/1727018183608193393)) para comprobar si el modelo puede recuperar información situada en distintas posiciones de entradas de distintas longitudes. El resultado es el siguiente:
+
+![](assets/qwen_72b_needle_in_a_haystack.png)
+
+Los resultados anteriores muestran que Qwen-72B-Chat puede recuperar con precisión información situada en varias posiciones dentro de una longitud de entrada de 32K, lo que demuestra su excelente capacidad de comprensión de textos largos.
+
+
+## Tokenizador
+
+Nuestro tokenizador, basado en tiktoken, es diferente de otros tokenizadores, por ejemplo, el tokenizador sentencepiece. Es necesario prestar atención a los tokens especiales, especialmente en el finetuning. Para obtener información más detallada sobre el tokenizador y su uso en el finetuning, consulte la [documentación](tokenization_note.md).
+

+ +## Reproducción + +Para que pueda reproducir el rendimiento del modelo en conjuntos de datos de referencia, le proporcionamos secuencias de comandos para que reproduzca los resultados. Consulte [eval/EVALUATION.md](eval/EVALUATION.md) para obtener más información. Tenga en cuenta que la reproducción puede dar lugar a ligeras diferencias con respecto a nuestros resultados. +

+ +## FAQ + +Si tiene problemas, consulte primero [FAQ](FAQ.md) y las incidencias para buscar una solución antes de lanzar una nueva incidencia. +

+ +## Cita +Si nuestro trabajo le resulta útil, no dude en citarnos. + +``` +@article{qwen, + title={Qwen Technical Report}, + author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu}, + journal={arXiv preprint arXiv:2309.16609}, + year={2023} +} +``` +
+
+## Acuerdo de Licencia
+
+El código fuente proporcionado en este repositorio está licenciado bajo la [Licencia Apache 2.0](./LICENSE), que puede encontrarse en el directorio raíz.
+
+Los investigadores y desarrolladores son libres de utilizar los códigos y los pesos de los modelos tanto de Qwen como de Qwen-Chat. Para su uso comercial, consulte el Acuerdo de Licencia que acompaña a cada modelo.
+
+- Qwen-72B, Qwen-14B y Qwen-7B están licenciados bajo el [Tongyi Qianwen LICENSE AGREEMENT](./Tongyi%20Qianwen%20LICENSE%20AGREEMENT), que se puede encontrar en el repositorio correspondiente de HuggingFace y ModelScope. Para uso comercial, rellene el formulario ([72B](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat) y [7B](https://dashscope.console.aliyun.com/openModelApply/qianwen)) para solicitarlo.
+
+- Qwen-1.8B está licenciado bajo el [Tongyi Qianwen RESEARCH LICENSE AGREEMENT](./Tongyi%20Qianwen%20RESEARCH%20LICENSE%20AGREEMENT), que puede encontrarse en el repositorio correspondiente de HuggingFace y ModelScope. Para uso comercial, póngase en contacto con nosotros.
+

+ +## Contacte con Nosotros + +Si estás interesado en dejar un mensaje a nuestro equipo de investigación o de producto, únete a nuestros grupos de Discord o WeChat. También puedes enviar un correo electrónico a qianwen_opensource@alibabacloud.com. + diff --git a/README_FR.md b/README_FR.md index e5d0d14..69d723c 100644 --- a/README_FR.md +++ b/README_FR.md @@ -1,5 +1,5 @@

- 中文  |  English  |  日本語  |  Français + 中文  |  English  |  日本語  |  Français |  Español



@@ -9,16 +9,18 @@

- 🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |   🖥️ Demo + 🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |   🖥️ Demo
-WeChat (微信)   |    DingTalk (钉钉)    |   Discord   +WeChat (微信)   |   Discord   |   API



| | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen | |-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:| +| 1.8B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | | 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | | 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | +| 72B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | @@ -26,6 +28,14 @@ Nous ouvrons notre série **Qwen**, qui comprend désormais **Qwen**, les modèl En bref, nous disposons de modèles linguistiques solides, qui ont été pré-entraîné de manière stable pour 3 000 milliards de tokens de données multilingues avec une large couverture de domaines, de langues (en particulier le chinois et l'anglais), etc. Ils sont capables d'atteindre des performances compétitives sur des ensembles de données de référence. En outre, nous disposons de modèles de chat alignés sur les préférences humaines basées sur SFT et RLHF (pas encore publiés), qui sont capables de chatter, de créer du contenu, d'extraire des informations, de résumer, de traduire, de coder, de résoudre des problèmes mathématiques, etc. et d'utiliser des outils, de jouer le rôle d'agents ou même code interpreter, etc. 
+| Modèle | Date de sortie | Longueur maximale | Amélioration de l'invite du système | # de tokens pré-formés | Utilisation minimale de la mémoire du GPU pour Finetuning (Q-Lora) | Utilisation minimale du GPU pour générer 2048 jetons (Int4) | Utilisation des outils | +|:----------|:--------------:|:-----------------:|:-----------------------------------:|:----------------------:|:------------------------------------------------------------------:|:-----------------------------------------------------------:|:----------------------:| +| Qwen-1.8B | 23.11.30 | 32K | √ | 2.2T | 5.8GB | 2.9GB | √ | +| Qwen-7B | 23.08.03 | 32K | × | 2.4T | 11.5GB | 8.2GB | √ | +| Qwen-14B | 23.09.25 | 8K | × | 3.0T | 18.7GB | 13.0GB | √ | +| Qwen-72B | 23.11.30 | 32K | √ | 3.0T | 61.4GB | 48.9GB | √ | + + Dans la repo, vous pouvez trouver: * Comment utiliser Qwen, et profiter de l'inférence simple. @@ -47,6 +57,7 @@ Vous voulez discuter avec nous ou prendre un café avec nous ? Bienvenue sur not ## Nouvelles et mises à jour +* 2023.11.30 🔥 Nous publions **Qwen-72B** et **Qwen-72B-Chat**, qui sont entraînés sur des tokens 3T et prennent en charge 32k contextes, ainsi que **Qwen-1.8B** et **Qwen-1.8B-Chat**, sur ModelScope et Hugging Face. Nous avons également renforcé les capacités de l'invite système du Qwen-72B-Chat et du Qwen-1.8B-Chat, voir la [documentation d'exemple](examples/system_prompt.md). De plus, nous supportons l'inférence sur **Ascend 910** et **Hygon DCU**. Consultez `ascend-support` et `dcu-support` pour plus de détails. * 2023.10.17 Nous publions le modèle quantifié Int8 **Qwen-7B-Chat-Int8** et **Qwen-14B-Chat-Int8**. * 2023.9.25 🔥 Nous publions **Qwen-14B** et **Qwen-14B-Chat** sur ModelScope et Hugging Face, ainsi que [qwen.cpp](https://github.com/QwenLM/qwen.cpp) et [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent). Les codes et les poids de **Qwen-7B** et **Qwen-7B-Chat** ont également été mis à jour. 
**S'IL VOUS PLAÎT, TIREZ LA DERNIÈRE VERSION!** - Par rapport à **Qwen-7B** (original), **Qwen-7B** utilise davantage de jetons d'entraînement, passant de 2,2 à 2,4T de jetons, tandis que la longueur du contexte passe de 2048 à 8192. La connaissance du chinois et la capacité de codage de **Qwen-7B** ont été encore améliorées. @@ -57,27 +68,30 @@ Vous voulez discuter avec nous ou prendre un café avec nous ? Bienvenue sur not ## Performance -Qwen-14B et Qwen-7B (il s'agit de la nouvelle version entraînée avec davantage de tokens et la longueur du contexte est passée de 2048 à 8192) surpassent les modèles de référence de tailles similaires sur une série d'ensembles de données de référence, par exemple MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., qui évaluent les capacités des modèles en matière de compréhension du langage naturel, de résolution de problèmes mathématiques, de codage, etc. Cependant, même Qwen-14B reste nettement inférieur à GPT-3.5, sans parler de GPT-4. Voir les résultats ci-dessous. +Les modèles Qwen surpassent les modèles de base de taille similaire sur une série de données de référence, par exemple MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., qui évaluent les capacités des modèles sur la compréhension du langage naturel, la résolution de problèmes mathématiques, le codage, etc. Qwen-72B obtient de meilleures performances que LLaMA2-70B dans toutes les tâches et surpasse GPT-3.5 dans 7 tâches sur 10.

- +


-| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | -|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:| -| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | -| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | -| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | -| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | -| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | -| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | -| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | -| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | -| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 | -| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 | -| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | -| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** | +| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | +|:------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:| +| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | +| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | +| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | +| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | +| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | +| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | +| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | +| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 | +| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 | +| XVERSE-65B | 
70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - | +| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 | +| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | +| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 | +| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** | Pour tous les modèles comparés, nous indiquons les meilleurs scores entre leurs résultats officiels et [OpenCompass] (https://opencompass.org.cn/leaderboard-llm). @@ -96,7 +110,9 @@ Pour plus de résultats expérimentaux (performances détaillées des modèles s Ci-dessous, nous fournissons des exemples simples pour montrer comment utiliser Qwen-Chat avec 🤖 ModelScope et 🤗 Transformers. -Avant d'exécuter le code, assurez-vous d'avoir configuré l'environnement et installé les paquets requis. Assurez-vous que vous répondez aux exigences ci-dessus, puis installez les bibliothèques dépendantes. +Vous pouvez utiliser nos images docker pré-construites pour sauter la plupart des étapes de configuration de l'environnement, voir la section ["Utiliser des images docker pré-construites"](#-using-pre-built-docker-images) pour plus de détails. + +Si vous n'utilisez pas Docker, assurez-vous d'avoir configuré l'environnement et installé les paquets requis. Assurez-vous de répondre aux exigences ci-dessus, puis installez les bibliothèques dépendantes. ```bash pip install -r requirements.txt @@ -325,437 +341,12 @@ Cependant, il est probable que vous souffriez d'une efficacité d'inférence ext Si vous souffrez d'un manque de mémoire GPU et que vous souhaitez exécuter le modèle sur plus d'un GPU, vous pouvez utiliser directement la méthode de chargement par défaut, qui est maintenant supportée par Transformers. La méthode précédente basée sur `utils.py` est obsolète. Cependant, bien que cette méthode soit simple, l'efficacité du parallélisme natif du pipeline est faible. 
Nous vous conseillons d'utiliser vLLM avec FastChat et de lire la section relative au déploiement. -

-## Quantization -### GPTQ - -Nous proposons une solution basée sur [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), et publions les modèles quantifiés Int4, qui permettent d'obtenir des effets de modèle presque sans perte mais des performances améliorées en termes de coûts de mémoire et de vitesse d'inférence. - -Nous démontrons ici comment utiliser les modèles quantifiés que nous fournissons pour l'inférence. Avant de commencer, assurez-vous que vous répondez aux exigences d'auto-gptq (par exemple, torch 2.0 et plus, transformers 4.32.0 et plus, etc.) et installez les paquets requis: - -```bash -pip install auto-gptq optimum -``` - -Si vous rencontrez des problèmes pour installer `auto-gptq`, nous vous conseillons de consulter le [repo](https://github.com/PanQiWei/AutoGPTQ) officiel pour trouver une roue. - -Vous pouvez ensuite charger facilement le modèle quantifié et lancer l'inférence comme d'habitude: - -```python -# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4" -model = AutoModelForCausalLM.from_pretrained( - "Qwen/Qwen-7B-Chat-Int4", - device_map="auto", - trust_remote_code=True -).eval() -response, history = model.chat(tokenizer, "Hi", history=None) -``` - -Nous illustrons les performances des modèles BF16, Int8 et Int4 sur le benchmark, et nous constatons que le modèle quantifié ne souffre pas d'une dégradation significative des performances. 
Les résultats sont présentés ci-dessous: - -| Quantization | MMLU | CEval (val) | GSM8K | Humaneval | -|----------------------|:----:|:-----------:|:-----:|:---------:| -| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 | -| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 | -| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 | -| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 | -| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 | -| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 | - -### Quantization du cache KV - -Attention Le cache KV peut être quantifié et compressé pour le stockage, afin d'obtenir un débit d'échantillonnage plus élevé. Les paramètres `use_cache_quantization` et `use_cache_kernel` sont fournis pour contrôler le comportement de quantification du cache KV -Lorsque `use_cache_quantization=True` et `use_cache_kernel=True`, la quantization de kv-cache est activée. -La méthode d'utilisation spécifique est la suivante: - -```python -model = AutoModelForCausalLM.from_pretrained( - "Qwen/Qwen-7B-Chat", - device_map="auto", - trust_remote_code=True, - use_cache_quantization=True, - use_cache_kernel=True, - use_flash_attn=False -) -``` -Attention : Actuellement, la quantization du cache kv et le flash attn ne peuvent pas être activés en même temps. -Si vous activez la quantification du cache kv et use_flash_attn en même temps (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), use_flash_attn est désactivé par défaut (`use_flash_attn=false`). - -Nous avons vérifié que l'utilisation du modèle int8-kvcache quantifié ne souffre pas d'une dégradation significative des performances dans l'évaluation en aval. En outre, nous évaluons ses performances en nous concentrant sur l'empreinte mémoire. -Le profilage s'exécute sur un seul GPU A100-SXM4-80G avec PyTorch 2.0.1 et CUDA 11.4. -Nous utilisons des modèles BF16, et générons 1024 tokens (seq-length=1024) par défaut, et oom indique qu'il n'y a plus de mémoire. 
- -Lorsque la quantization de kv-cache est activée, nous pouvons utiliser une taille de lot (bs) plus importante. - -| USE KVCache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 | -|-------------|:------:|:------:|:------:|:------:|:------:|:------:| -| no | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom | -| yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB | - -Lorsque la quantification de kv-cache est activée, le modèle peut économiser plus de mémoire lorsqu'il génère des séquences plus longues (sl, nombre de jetons générés) lors de l'inférence. - -| USE KVCache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 | -|-------------|:------:|:-------:|:-------:|:-------:|:-------:| -| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB | -| yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB | - -Le modèle qui active la quantification du kv-cache convertit le format du layer-past de float à int8, tandis que le layer-past quantifié stocke également les paramètres de quantification de la valeur actuelle. -Les étapes spécifiques sont les suivantes : - -1. Quantifier clé/valeur -``` - qv,scale,zero_point=quantize_cache_v(v) -``` -2. Stocker dans layer_past - -Following is the format of quantized layer_past: -``` - layer_past=((q_key,key_scale,key_zero_point), - (q_value,value_scale,value_zero_point)) -``` -Format de base de layer_past: -``` - layer_past=(key,value) -``` -Si vous souhaitez utiliser l'attention KV qui est quantifiée, vous pouvez utiliser l'opération de déquantification pour convertir la clé/valeur int8 en format float comme suit -vous pouvez utiliser l'opération de déquantification pour reconvertir la clé/valeur int8 au format float comme suit : -``` - v=dequantize_cache_torch(qv,scale,zero_point) -``` -
- - -## Performance de l'inférence - -Cette section fournit les statistiques de vitesse et de mémoire des modèles dans différentes précisions. Le profilage de la vitesse et de la mémoire est effectué à l'aide de [ce script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py). - -### Vitesse - -Nous avons mesuré la vitesse moyenne d'inférence (jetons/s) pour la génération de 2048 et 8192 jetons avec les modèles dans la précision de BF16, Int8, et Int4 sous la condition d'utiliser l'attention flash v1, v2, ou de ne pas l'utiliser. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
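-| Model Size | Precision | FlashAttn | Sequence Length 2048 | Sequence Length 8192 |
-|:----------:|:---------:|:---------:|:--------------------:|:--------------------:|
-| 7B | BF16 | v2 | 40.93 | 36.14 |
-| | | v1 | 40.75 | 35.34 |
-| | | Disabled | 37.55 | 33.56 |
-| | Int8 | v2 | 37.47 | 32.54 |
-| | | v1 | 37.51 | 32.39 |
-| | | Disabled | 37.84 | 32.65 |
-| | Int4 | v2 | 50.09 | 38.61 |
-| | | v1 | 45.98 | 36.47 |
-| | | Disabled | 48.12 | 36.70 |
-| 14B | BF16 | v2 | 32.88 | 24.87 |
-| | | v1 | 32.76 | 28.89 |
-| | | Disabled | 29.32 | 22.91 |
-| | Int8 | v2 | 29.28 | 24.22 |
-| | | v1 | 28.31 | 23.87 |
-| | | Disabled | 31.12 | 24.60 |
-| | Int4 | v2 | 38.72 | 27.33 |
-| | | v1 | 37.81 | 26.46 |
-| | | Disabled | 37.65 | 26.00 |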
- - -En détail, le profilage consiste à encoder 2048 jetons et à générer 8192 nouveaux jetons. Le profilage s'exécute sur un seul GPU A100-SXM4-80G avec PyTorch 2.0.1 et CUDA 11.8. La vitesse d'inférence est calculée en moyenne sur les jetons encodés et générés. - -Note : La vitesse de génération des modèles Int4/Int8 mentionnés ci-dessus est fournie par la bibliothèque autogptq. La vitesse actuelle du modèle chargé à l'aide de "AutoModelForCausalLM.from_pretrained" sera environ 20% plus lente. Nous avons signalé ce problème à l'équipe HuggingFace et nous le mettrons à jour rapidement si une solution est disponible. - -### Utilisation de la mémoire du GPU - -Nous avons également établi le profil de l'utilisation maximale de la mémoire du GPU pour l'encodage de 2048 jetons en tant que contexte (et la génération d'un seul jeton) et la génération de 8192 jetons (avec un seul jeton en tant que contexte) sous BF16, Int8 ou Int4 niveau de quantization, respectivement. Les résultats (GB) sont présentés ci-dessous. - - - - - - - - - - - - - - - - - - - - - - - - - - -
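-| Model Size | Precision | Sequence Length 2048 | Sequence Length 8192 |
-|------------|-----------|:--------------------:|:--------------------:|
-| 7B | BF16 | 16.99 | 22.53 |
-| 7B | Int8 | 11.20 | 16.62 |
-| 7B | Int4 | 8.21 | 13.63 |
-| 14B | BF16 | 30.15 | 38.94 |
-| 14B | Int8 | 18.81 | 27.54 |
-| 14B | Int4 | 13.01 | 21.79 |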
- - -
- - -## Finetuning - -### Utilisation -Nous fournissons maintenant le script d'entraînement officiel, `finetune.py`, pour que les utilisateurs puissent ajuster le modèle pré-entraîné pour les applications en aval de manière simple. De plus, nous fournissons des scripts shell pour lancer le finetune sans soucis. Ce script prend en charge l'entraînement avec [DeepSpeed](https://github.com/microsoft/DeepSpeed) et [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). Les scripts que nous fournissons utilisent DeepSpeed (Note : il peut y avoir des conflits avec la dernière version de pydantic) et Peft. Vous pouvez les installer en procédant comme suit: -```bash -pip install peft deepspeed -``` - -Pour préparer vos données d'entraînement, vous devez rassembler tous les échantillons dans une liste et l'enregistrer dans un fichier json. Chaque échantillon est un dictionnaire composé d'un identifiant et d'une liste de conversation. Voici un exemple simple de liste avec 1 échantillon: -```json -[ - { - "id": "identity_0", - "conversations": [ - { - "from": "user", - "value": "你好" - }, - { - "from": "assistant", - "value": "我是一个语言模型,我叫通义千问。" - } - ] - } -] -``` - -Après la préparation des données, vous pouvez utiliser les scripts shell fournis pour lancer le finetuning. N'oubliez pas de spécifier le chemin d'accès au fichier de données, `$DATA`. - -Les scripts de finetuning vous permettent d'effectuer les opérations suivantes -- Finetuning de tous les paramètres -- LoRA -- Q-LoRA - -Le finetuning de tous les paramètres nécessite la mise à jour de tous les paramètres au cours de l'ensemble du processus de formation. Pour lancer votre formation, exécutez le script suivant: - -```bash -# Distributed training. We do not provide single-GPU training script as the insufficient GPU memory will break down the training. 
-sh finetune/finetune_ds.sh -``` - -N'oubliez pas de spécifier le nom ou le chemin d'accès au modèle, le chemin d'accès aux données, ainsi que le répertoire de sortie dans les scripts shell. Une autre chose à noter est que nous utilisons DeepSpeed ZeRO 3 dans ce script. Si vous voulez faire des changements, il suffit de supprimer l'argument `--deepspeed` ou de faire des changements dans le fichier json de configuration de DeepSpeed en fonction de vos besoins. De plus, ce script supporte l'entraînement en précision mixte, et donc vous pouvez utiliser `--bf16 True` ou `--fp16 True`. N'oubliez pas d'utiliser DeepSpeed lorsque vous utilisez fp16 en raison de l'entraînement de précision mixte. Empiriquement, nous vous conseillons d'utiliser bf16 pour rendre votre apprentissage cohérent avec notre pré-entraînement et notre alignement si votre machine supporte bf16, et nous l'utilisons donc par défaut. - -Pour exécuter LoRA, utilisez un autre script à exécuter comme indiqué ci-dessous. Avant de commencer, assurez-vous que vous avez installé `peft`. Vous devez spécifier les chemins d'accès à votre modèle, à vos données et à vos résultats. Nous vous conseillons d'utiliser des chemins absolus pour votre modèle pré-entraîné. En effet, LoRA ne sauvegarde que l'adaptateur et le chemin absolu dans le fichier json de configuration de l'adaptateur est utilisé pour trouver le modèle pré-entraîné à charger. De plus, ce script supporte à la fois bf16 et fp16. - -```bash -# Single GPU training -sh finetune/finetune_lora_single_gpu.sh -# Distributed training -sh finetune/finetune_lora_ds.sh -``` - -Par rapport au finetuning de tous les paramètres, LoRA ([paper](https://arxiv.org/abs/2106.09685)) ne met à jour que les paramètres des couches d'adaptateurs, tout en gelant les couches originales du grand modèle de langage. Cela permet de réduire considérablement les coûts de mémoire et donc les coûts de calcul. 
- -Notez que si vous utilisez LoRA pour affiner le modèle de langue, par exemple Qwen-7B, au lieu des modèles de chat, par exemple Qwen-7B-Chat, le script change automatiquement les embedding et la couche de sortie en tant que paramètres entraînables. En effet, le modèle de langue n'a aucune connaissance des jetons spéciaux apportés par le format ChatML. Ces couches doivent donc être mises à jour pour que le modèle comprenne et prédise les jetons. En d'autres termes, si votre entraînement apporte des tokens spéciaux dans LoRA, vous devez définir les couches comme des paramètres entraînables en définissant `modules_to_save` à l'intérieur du code. De plus, si ces paramètres sont entraînables, il n'est pas possible d'utiliser ZeRO 3, et c'est pourquoi nous utilisons ZeRO 2 par défaut dans le script. Si vous n'avez pas de nouveaux paramètres entraînables, vous pouvez passer à ZeRO 3 en modifiant le fichier de configuration de DeepSpeed. En outre, nous constatons qu'il existe un écart important entre l'empreinte mémoire de LoRA avec et sans ces paramètres d'entraînement. Par conséquent, si vous avez des problèmes de mémoire, nous vous conseillons d'affiner les modèles de chat de LoRA. Consultez le profil ci-dessous pour plus d'informations. - -Si vous souffrez toujours d'un manque de mémoire, vous pouvez envisager Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), qui utilise le modèle de langage quantifié et d'autres techniques telles que l'attention paginée pour réduire encore les coûts de mémoire. - -Note : pour exécuter l'entraînement Q-LoRA sur un seul GPU, vous pouvez avoir besoin d'installer `mpi4py` via `pip` ou `conda`. - -Pour lancer Q-LoRA, exécutez directement le script suivant: - -```bash -# Single GPU training -sh finetune/finetune_qlora_single_gpu.sh -# Distributed training -sh finetune/finetune_qlora_ds.sh -``` - -Pour Q-LoRA, nous vous conseillons de charger le modèle quantifié que nous fournissons, par exemple Qwen-7B-Chat-Int4. 
Vous **NE DEVRIEZ PAS** utiliser les modèles bf16. Contrairement au finetuning de tous les paramètres et à la LoRA, seul le modèle fp16 est pris en charge pour la Q-LoRA. Pour l'entraînement sur un seul GPU, nous devons utiliser DeepSpeed pour l'entraînement en précision mixte en raison de notre observation des erreurs causées par torch amp. En outre, pour Q-LoRA, les problèmes avec les jetons spéciaux dans LoRA existent toujours. Cependant, comme nous ne fournissons que les modèles Int4 pour les modèles de chat, ce qui signifie que le modèle de langage a appris les tokens spéciaux du format ChatML, vous n'avez pas à vous soucier des couches. Notez que les couches du modèle Int4 ne doivent pas être entraînables, et donc si vous introduisez des tokens spéciaux dans votre entraînement, Q-LoRA risque de ne pas fonctionner. - -Contrairement au finetuning des paramètres complets, l'entraînement de LoRA et de Q-LoRA n'enregistre que les paramètres de l'adaptateur. Supposons que votre entraînement commence à partir de Qwen-7B, vous pouvez charger le modèle finalisé pour l'inférence comme indiqué ci-dessous: - -```python -from peft import AutoPeftModelForCausalLM - -model = AutoPeftModelForCausalLM.from_pretrained( - path_to_adapter, # path to the output directory - device_map="auto", - trust_remote_code=True -).eval() -``` - -Si vous souhaitez fusionner les adaptateurs et enregistrer le modèle affiné en tant que modèle autonome (vous ne pouvez le faire qu'avec LoRA, et vous **NE POUVEZ PAS** fusionner les paramètres de Q-LoRA), vous pouvez exécuter les codes suivants: - -```python -from peft import AutoPeftModelForCausalLM - -model = AutoPeftModelForCausalLM.from_pretrained( - path_to_adapter, # path to the output directory - device_map="auto", - trust_remote_code=True -).eval() - -merged_model = model.merge_and_unload() -# max_shard_size and safe serialization are not necessary. 
-# They respectively work for sharding checkpoint and save the model to safetensors -merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True) -``` - -Note : Pour l'entraînement multi-GPU, vous devez spécifier les hyperparamètres appropriés pour l'entraînement distribué en fonction de votre machine. De plus, nous vous conseillons de spécifier votre longueur maximale de séquence avec l'argument `--model_max_length`, en fonction de votre considération des données, de l'empreinte mémoire, et de la vitesse d'apprentissage. - -### Profilage de la mémoire et de la vitesse -Nous profilons la mémoire du GPU et la vitesse d'apprentissage de LoRA (LoRA (emb) se réfère à l'apprentissage de l'embedding et la couche de sortie, tandis que LoRA n'a pas de couche d'intégration et de sortie pouvant être entraînée) et de Q-LoRA dans la configuration de l'apprentissage sur un seul GPU. Dans ce test, nous expérimentons sur un seul GPU A100-SXM4-80G, et nous utilisons CUDA 11.8 et Pytorch 2.0. Flash attention 2 est appliqué. Nous utilisons uniformément une taille de lot de 1 et une accumulation de gradient de 8. Nous profilons la mémoire (GB) et la vitesse (s/iter) des entrées de différentes longueurs, à savoir 256, 512, 1024, 2048, 4096, et 8192. Nous présentons également les statistiques du finetuning de tous les paramètres avec Qwen-7B sur 2 GPU A100. Nous ne présentons que les statistiques de 256, 512 et 1024 jetons en raison de la limitation de la mémoire du GPU. Les statistiques sont listées ci-dessous : - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-| Model Size | Method | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
-|------------|--------|:---:|:---:|:----:|:----:|:----:|:----:|
-| 7B | LoRA | 20.1G / 1.2s/it | 20.4G / 1.5s/it | 21.5G / 2.8s/it | 23.8G / 5.2s/it | 29.7G / 10.1s/it | 36.6G / 21.3s/it |
-| 7B | LoRA (emb) | 33.7G / 1.4s/it | 34.1G / 1.6s/it | 35.2G / 2.9s/it | 35.1G / 5.3s/it | 39.2G / 10.3s/it | 48.5G / 21.7s/it |
-| 7B | Q-LoRA | 11.5G / 3.0s/it | 11.5G / 3.0s/it | 12.3G / 3.5s/it | 13.9G / 7.0s/it | 16.9G / 11.6s/it | 23.5G / 22.3s/it |
-| 7B | Full-parameter | 139.2G / 4.0s/it | 148.0G / 4.0s/it | 162.0G / 4.5s/it | - | - | - |
-| 14B | LoRA | 34.6G / 1.6s/it | 35.1G / 2.4s/it | 35.3G / 4.4s/it | 37.4G / 8.4s/it | 42.5G / 17.0s/it | 55.2G / 36.0s/it |
-| 14B | LoRA (emb) | 51.2G / 1.7s/it | 51.1G / 2.6s/it | 51.5G / 4.6s/it | 54.1G / 8.6s/it | 56.8G / 17.2s/it | 67.7G / 36.3s/it |
-| 14B | Q-LoRA | 18.7G / 5.3s/it | 18.4G / 6.3s/it | 18.9G / 8.2s/it | 19.9G / 11.8s/it | 23.0G / 20.1s/it | 27.9G / 38.3s/it |
-
- -## Déploiement - -### vLLM -Pour le déploiement et l'inférence rapide, nous suggérons d'utiliser vLLM avec FastChat. Installez d'abord les paquets: -```bash -pip install vllm -pip install "fschat[model_worker,webui]" -``` -Ou vous pouvez les installer à partir des sources par `git clone` et `pip install -e .`. Nous vous conseillons de lire leurs documents si vous rencontrez des problèmes lors de l'installation. - -Pour faire fonctionner Qwen avec vLLM et FastChat, vous devez d'abord lancer un contrôleur par: -```bash -python -m fastchat.serve.controller -``` - -Ensuite, vous pouvez lancer le travailleur de modèle, ce qui signifie charger votre modèle pour l'inférence. Pour l'inférence sur un seul GPU, vous pouvez directement lancer: -```bash -python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code -``` -Cependant, si vous souhaitez exécuter le modèle sur plusieurs GPU pour une inférence plus rapide ou une mémoire plus importante, vous pouvez utiliser le parallélisme tensoriel pris en charge par vLLM. Supposons que vous exécutiez le modèle sur 4 GPU, la commande est présentée ci-dessous: -```bash -python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 -``` - -Après avoir lancé votre model worker, vous pouvez lancer une démo web ou une API OpenAI comme vous le souhaitez. Pour la démo web, exécutez la commande suivante: -```bash -python -m fastchat.serve.gradio_web_server -``` -Pour l'API OpenAI, consultez d'abord la documentation de notre API OpenAI pour l'installation. Exécutez ensuite la commande: -```bash -python -m fastchat.serve.openai_api_server --host localhost --port 8000 -``` -
- -## Démo - -### Interface Web - -Nous fournissons du code pour que les utilisateurs puissent construire une démo d'interface web (merci à @wysaid). Avant de commencer, assurez-vous d'installer les paquets suivants: - -``` -pip install -r requirements_web_demo.txt -``` - -Exécutez ensuite la commande ci-dessous et cliquez sur le lien généré: - -```bash -python web_demo.py -``` - -


- -### Démo CLI - -Nous fournissons un exemple de démonstration CLI dans `cli_demo.py`, qui prend en charge la sortie en continu pour la génération. Les utilisateurs peuvent interagir avec Qwen-7B-Chat en saisissant des invites, et le modèle renvoie les sorties du modèle en mode streaming. Exécutez la commande ci-dessous: - -```bash -python cli_demo.py -``` - -

- -## API +### DashScope Le moyen le plus simple d'utiliser Qwen via les API est le service API DashScope via Alibaba Cloud. Nous présentons une introduction à l'utilisation. De plus, nous fournissons un script pour vous permettre de déployer une API de type OpenAI sur vos propres serveurs. -### DashScope DashScope est le service API de grands modèles linguistiques fourni par Alibaba Cloud, qui prend désormais en charge Qwen. Notez que les modèles derrière DashScope sont des versions internes temporairement sans détails fournis. Les services comprennent `qwen-turbo` et `qwen-plus`, le premier fonctionnant plus rapidement et le second atteignant de meilleures performances. Pour plus d'informations, consultez la documentation [ici] (https://dashscope.aliyun.com). Veuillez vous rendre sur le site officiel [lien](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) pour créer un compte DashScope et obtenir la clé API (AK). Nous recommandons de définir l'AK à l'aide d'une variable d'environnement: @@ -806,8 +397,456 @@ if __name__ == '__main__': )) ``` Pour d'autres utilisations, veuillez consulter le site web officiel pour plus de détails. +

-### API OpenAI +## Quantization + +### GPTQ + +Nous proposons une solution basée sur [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), et publions les modèles quantifiés Int4 et Int8, qui permettent d'obtenir des effets de modèle presque sans perte mais des performances améliorées en termes de coûts de mémoire et de vitesse d'inférence. + +Nous démontrons ici comment utiliser les modèles quantifiés que nous fournissons pour l'inférence. Avant de commencer, assurez-vous que vous répondez aux exigences d'auto-gptq (par exemple, torch 2.0 et plus, transformers 4.32.0 et plus, etc.) et installez les paquets requis: + +```bash +pip install auto-gptq optimum +``` + +Si vous rencontrez des problèmes pour installer `auto-gptq`, nous vous conseillons de consulter le [repo](https://github.com/PanQiWei/AutoGPTQ) officiel pour trouver une roue. + +> Note : Les paquets `auto-gptq` précompilés dépendent fortement de la version de `torch` et de sa version CUDA. De plus, en raison d'une récente mise à jour, +> vous pouvez aussi rencontrer des erreurs de version non supportée avec `transformers`, `optimum`, ou `peft`. +> Nous recommandons d'utiliser les dernières versions répondant aux exigences suivantes : +> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1 +> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0 + +Vous pouvez ensuite charger facilement le modèle quantifié et lancer l'inférence comme d'habitude: + +```python +# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4" +model = AutoModelForCausalLM.from_pretrained( + "Qwen/Qwen-7B-Chat-Int4", + device_map="auto", + trust_remote_code=True +).eval() +response, history = model.chat(tokenizer, "Hi", history=None) +``` + +Nous illustrons les performances des modèles BF16, Int8 et Int4 sur le benchmark, et nous constatons que le modèle quantifié ne souffre pas d'une dégradation significative des performances. 
Les résultats sont présentés ci-dessous:
+
+| Quantization         | MMLU | CEval (val) | GSM8K | Humaneval |
+|----------------------|:----:|:-----------:|:-----:|:---------:|
+| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 |
+| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 |
+| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 |
+| Qwen-7B-Chat (BF16)  | 55.8 | 59.7 | 50.3 | 37.2 |
+| Qwen-7B-Chat (Int8)  | 55.4 | 59.4 | 48.3 | 34.8 |
+| Qwen-7B-Chat (Int4)  | 55.1 | 59.2 | 49.7 | 29.9 |
+| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
+| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
+| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
+| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
+| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
+| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
+
+### Quantization du cache KV
+
+> NOTE : En raison du mécanisme interne de Hugging Face, les fichiers de support de cette fonctionnalité
+> (c'est-à-dire `cache_autogptq_cuda_256.cpp` et `cache_autogptq_cuda_kernel_245.cu`) peuvent être manquants.
+> Veuillez les télécharger manuellement depuis le Hugging Face Hub et les placer dans le même dossier que les autres fichiers du module.
+
+Le cache KV de l'attention peut être quantifié et compressé pour le stockage, afin d'obtenir un débit d'échantillonnage plus élevé. Les arguments `use_cache_quantization` et `use_cache_kernel` dans `config.json` permettent d'activer la quantification du cache KV.
+La méthode d'utilisation est la suivante:
+
+```python
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen-7B-Chat",
+    device_map="auto",
+    trust_remote_code=True,
+    use_cache_quantization=True,
+    use_cache_kernel=True,
+    use_flash_attn=False
+)
+```
+Attention : Actuellement, la quantification du cache KV et flash attention ne peuvent pas être utilisées en même temps.
+Si vous activez la quantification du cache KV et flash attention en même temps (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), `use_flash_attn` est automatiquement désactivé (`use_flash_attn=false`).
+
+Nous avons vérifié que l'utilisation du modèle int8-kvcache quantifié ne souffre pas d'une dégradation significative des performances dans l'évaluation en aval. Dans ce qui suit, nous nous concentrons sur le profilage de son empreinte mémoire dans différentes conditions.
+Le profilage s'exécute sur un seul GPU A100-SXM4-80G avec PyTorch 2.0.1 et CUDA 11.4.
+Nous utilisons des modèles BF16 pour générer 1024 jetons par défaut, et "OOM" indique une erreur de mémoire insuffisante.
+
+Avec la quantification du cache KV, le modèle peut inférer avec une taille de lot (`bs`) plus grande.
+
+| Quantification du cache KV | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
+|----------------------------|:------:|:------:|:------:|:------:|:------:|:------:|
+| Non | 16.3GB | 24.1GB | 31.7GB | 48.7GB | OOM | OOM |
+| Oui | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
+
+Avec la quantification du cache KV, le modèle économise davantage de mémoire lorsqu'il génère des séquences plus longues (`sl`, le nombre de jetons générés) lors de l'inférence.
+
+| Quantification du cache KV | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
+|----------------------------|:------:|:-------:|:-------:|:-------:|:-------:|
+| Non | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
+| Oui | 15.0GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |
+
+Le modèle avec quantification du cache KV convertit le format de `layer_past` de float à int8, et le `layer_past` quantifié stocke également les paramètres de quantification.
+
+Les étapes sont les suivantes:
+
+1. Quantifier clé/valeur
+```
+    qv,scale,zero_point=quantize_cache_v(v)
+```
+2. Stocker dans `layer_past`
+
+Voici le format de `layer_past` quantifié:
+```
+    layer_past=((q_key,key_scale,key_zero_point),
+                (q_value,value_scale,value_zero_point))
+```
+
+Le format original de `layer_past` est illustré ci-dessous:
+```
+    layer_past=(key,value)
+```
+
+Si vous souhaitez utiliser l'attention KV quantifiée, vous pouvez utiliser l'opération de déquantification pour reconvertir la clé/valeur int8 au format float comme suit:
+```
+    v=dequantize_cache_torch(qv,scale,zero_point)
+```
+
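The quantize, store, and dequantize steps described above can be illustrated with a minimal pure-Python sketch. It is not the repository's `quantize_cache_v` or `dequantize_cache_torch` implementation (those operate on tensors and CUDA kernels); the helper names `quantize_int8` and `dequantize_int8` are hypothetical, and the per-vector asymmetric scheme with a `[0, 255]` range is an assumption chosen only to show how `scale` and `zero_point` travel alongside the quantized values in `layer_past`.

```python
def quantize_int8(values, qmin=0, qmax=255):
    """Asymmetric quantization of one vector of floats (illustrative helper)."""
    lo, hi = min(values), max(values)
    # A constant vector would give a zero scale, so fall back to 1.0.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = qmin - lo / scale
    quantized = [
        max(qmin, min(qmax, int(round(v / scale + zero_point)))) for v in values
    ]
    return quantized, scale, zero_point


def dequantize_int8(quantized, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [scale * (q - zero_point) for q in quantized]


# Toy key/value vectors standing in for one attention head's cache entries.
key = [-1.0, 0.0, 0.5, 2.0]
value = [0.1, 0.2, 0.3, 0.4]

q_key, key_scale, key_zero_point = quantize_int8(key)
q_value, value_scale, value_zero_point = quantize_int8(value)

# Same layout as the quantized layer_past format shown in the text.
layer_past = (
    (q_key, key_scale, key_zero_point),
    (q_value, value_scale, value_zero_point),
)

restored_key = dequantize_int8(*layer_past[0])
restored_value = dequantize_int8(*layer_past[1])
```

Each restored value differs from the original by at most about one quantization step (`scale`), which is the trade-off that buys the memory savings shown in the tables above.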
+
+
+## Performance de l'inférence
+
+Cette section fournit les statistiques de vitesse et de mémoire des modèles dans différentes précisions. Le profilage de la vitesse et de la mémoire est effectué à l'aide de [ce script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
+
+Nous avons mesuré la vitesse moyenne d'inférence (tokens/s) et l'utilisation de la mémoire GPU pour générer 2048 tokens avec les modèles en BF16, Int8 et Int4.
+
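+| Model Size | Quantization | Speed (Tokens/s) | GPU Memory Usage |
+|------------|--------------|:----------------:|:----------------:|
+| 1.8B | BF16 | 54.09 | 4.23GB |
+| 1.8B | Int8 | 55.56 | 3.48GB |
+| 1.8B | Int4 | 71.07 | 2.91GB |
+| 7B | BF16 | 40.93 | 16.99GB |
+| 7B | Int8 | 37.47 | 11.20GB |
+| 7B | Int4 | 50.09 | 8.21GB |
+| 14B | BF16 | 32.22 | 30.15GB |
+| 14B | Int8 | 29.28 | 18.81GB |
+| 14B | Int4 | 38.72 | 13.01GB |
+| 72B | BF16 | 8.48 | 144.69GB (2xA100) |
+| 72B | Int8 | 9.05 | 81.27GB (2xA100) |
+| 72B | Int4 | 11.32 | 48.86GB |
+| 72B + vLLM | BF16 | 17.60 | 2xA100 |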
+
+Le profilage s'exécute sur un seul GPU A100-SXM4-80G (sauf si 2xA100 est mentionné) avec PyTorch 2.0.1, CUDA 11.8 et Flash-Attention 2. (72B + vLLM utilise PyTorch 2.1.0 et CUDA 11.8.) La vitesse d'inférence est calculée en moyenne sur les tokens encodés et générés.
+
+Note : La vitesse de génération des modèles Int4/Int8 mentionnés ci-dessus est fournie par la bibliothèque autogptq. La vitesse actuelle du modèle chargé en utilisant `AutoModelForCausalLM.from_pretrained` sera environ 20% plus lente. Nous avons signalé ce problème à l'équipe HuggingFace et nous le mettrons à jour rapidement si une solution est disponible.
+
+Nous mesurons également la vitesse d'inférence et l'utilisation de la mémoire du GPU avec différentes longueurs de contexte et de génération et différentes versions de Flash-Attention. Vous pouvez trouver les résultats dans les cartes modèles correspondantes sur Hugging Face ou ModelScope.
+
+
+## Finetuning
+
+### Utilisation
+Nous fournissons maintenant le script d'entraînement officiel, `finetune.py`, pour que les utilisateurs puissent ajuster le modèle pré-entraîné pour les applications en aval de manière simple. De plus, nous fournissons des scripts shell pour lancer le finetuning sans soucis. Ce script prend en charge l'entraînement avec [DeepSpeed](https://github.com/microsoft/DeepSpeed) et [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). Les scripts que nous fournissons utilisent DeepSpeed (Note : il peut y avoir des conflits avec la dernière version de pydantic, et vous devriez vous assurer d'utiliser `pydantic<2.0`) et Peft. Vous pouvez les installer en procédant comme suit :
+```bash
+pip install peft deepspeed
+```
+
+Pour préparer vos données d'entraînement, vous devez rassembler tous les échantillons dans une liste et l'enregistrer dans un fichier json. Chaque échantillon est un dictionnaire composé d'un identifiant et d'une liste de conversation.
Voici un exemple simple de liste avec 1 échantillon : +```json +[ + { + "id": "identity_0", + "conversations": [ + { + "from": "user", + "value": "你好" + }, + { + "from": "assistant", + "value": "我是一个语言模型,我叫通义千问。" + } + ] + } +] +``` + +Après la préparation des données, vous pouvez utiliser les scripts shell fournis pour lancer le finetuning. N'oubliez pas de spécifier le chemin d'accès au fichier de données, `$DATA`. + +Les scripts de finetuning vous permettent d'effectuer les opérations suivantes +- Finetuning de tous les paramètres +- LoRA +- Q-LoRA + +Le finetuning de tous les paramètres nécessite la mise à jour de tous les paramètres au cours de l'ensemble du processus de formation. Pour lancer votre formation, exécutez le script suivant: + +```bash +# Distributed training. We do not provide single-GPU training script as the insufficient GPU memory will break down the training. +sh finetune/finetune_ds.sh +``` + +N'oubliez pas de spécifier le nom ou le chemin d'accès au modèle, le chemin d'accès aux données, ainsi que le répertoire de sortie dans les scripts shell. Une autre chose à noter est que nous utilisons DeepSpeed ZeRO 3 dans ce script. Si vous voulez faire des changements, il suffit de supprimer l'argument `--deepspeed` ou de faire des changements dans le fichier json de configuration de DeepSpeed en fonction de vos besoins. De plus, ce script supporte l'entraînement en précision mixte, et donc vous pouvez utiliser `--bf16 True` ou `--fp16 True`. N'oubliez pas d'utiliser DeepSpeed lorsque vous utilisez fp16 en raison de l'entraînement de précision mixte. Empiriquement, nous vous conseillons d'utiliser bf16 pour rendre votre apprentissage cohérent avec notre pré-entraînement et notre alignement si votre machine supporte bf16, et nous l'utilisons donc par défaut. + +Pour exécuter LoRA, utilisez un autre script à exécuter comme indiqué ci-dessous. Avant de commencer, assurez-vous que vous avez installé `peft`. 
Vous devez spécifier les chemins d'accès à votre modèle, à vos données et à vos résultats. Nous vous conseillons d'utiliser des chemins absolus pour votre modèle pré-entraîné. En effet, LoRA ne sauvegarde que l'adaptateur et le chemin absolu dans le fichier json de configuration de l'adaptateur est utilisé pour trouver le modèle pré-entraîné à charger. De plus, ce script supporte à la fois bf16 et fp16. + +```bash +# Single GPU training +sh finetune/finetune_lora_single_gpu.sh +# Distributed training +sh finetune/finetune_lora_ds.sh +``` + +Par rapport au finetuning de tous les paramètres, LoRA ([paper](https://arxiv.org/abs/2106.09685)) ne met à jour que les paramètres des couches d'adaptateurs, tout en gelant les couches originales du grand modèle de langage. Cela permet de réduire considérablement les coûts de mémoire et donc les coûts de calcul. + +Notez que si vous utilisez LoRA pour affiner le modèle linguistique de base, par exemple Qwen-7B, au lieu des modèles de chat, par exemple Qwen-7B-Chat, le script change automatiquement l'intégration et la couche de sortie en tant que paramètres entraînables. En effet, le modèle linguistique de base n'a aucune connaissance des jetons spéciaux apportés par le format ChatML. Ces couches doivent donc être mises à jour pour que le modèle comprenne et prédise les jetons. En d'autres termes, si votre formation apporte des tokens spéciaux dans LoRA, vous devez définir les couches comme des paramètres entraînables en définissant `modules_to_save` à l'intérieur du code. De plus, si ces paramètres sont entraînables, il n'est pas possible d'utiliser ZeRO 3, et c'est pourquoi nous utilisons ZeRO 2 par défaut dans le script. Si vous n'avez pas de nouveaux paramètres entraînables, vous pouvez passer à ZeRO 3 en modifiant le fichier de configuration de DeepSpeed. En outre, nous constatons qu'il existe un écart important entre l'empreinte mémoire de LoRA avec et sans ces paramètres d'entraînement. 
Par conséquent, si vous avez des problèmes de mémoire, nous vous conseillons d'affiner les modèles de chat de LoRA. Consultez le profil ci-dessous pour plus d'informations. + +Si vous souffrez toujours d'un manque de mémoire, vous pouvez envisager Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), qui utilise le modèle de langage quantifié et d'autres techniques telles que l'attention paginée pour réduire encore les coûts de mémoire. + +Note : pour exécuter l'entraînement Q-LoRA sur un seul GPU, vous pouvez avoir besoin d'installer `mpi4py` via `pip` ou `conda`. + +Pour lancer Q-LoRA, exécutez directement le script suivant : + +```bash +# Single GPU training +sh finetune/finetune_qlora_single_gpu.sh +# Distributed training +sh finetune/finetune_qlora_ds.sh +``` + +Pour Q-LoRA, nous vous conseillons de charger le modèle quantifié que nous fournissons, par exemple Qwen-7B-Chat-Int4. Vous **NE DEVRIEZ PAS** utiliser les modèles bf16. Contrairement au finetuning de tous les paramètres et à la LoRA, seul le modèle fp16 est pris en charge pour la Q-LoRA. Pour l'entraînement sur un seul GPU, nous devons utiliser DeepSpeed pour l'entraînement en précision mixte en raison de notre observation des erreurs causées par torch amp. En outre, pour Q-LoRA, les problèmes avec les jetons spéciaux dans LoRA existent toujours. Cependant, comme nous ne fournissons que les modèles Int4 pour les modèles de chat, ce qui signifie que le modèle de langage a appris les tokens spéciaux du format ChatML, vous n'avez pas à vous soucier des couches. Notez que les couches du modèle Int4 ne doivent pas être entraînables, et donc si vous introduisez des tokens spéciaux dans votre entraînement, Q-LoRA risque de ne pas fonctionner. + +> NOTE : Veuillez noter qu'en raison des mécanismes internes de Hugging Face, certains fichiers non-Python (par exemple, `*.cpp` et `*.cu`) +> peuvent être absents du point de contrôle sauvegardé. 
Vous devrez peut-être les copier manuellement dans le répertoire contenant les autres fichiers. + +Contrairement au finetuning des paramètres complets, l'entraînement de LoRA et de Q-LoRA n'enregistre que les paramètres de l'adaptateur. Supposons que votre entraînement commence à partir de Qwen-7B, vous pouvez charger le modèle finalisé pour l'inférence comme indiqué ci-dessous: + +```python +from peft import AutoPeftModelForCausalLM + +model = AutoPeftModelForCausalLM.from_pretrained( + path_to_adapter, # path to the output directory + device_map="auto", + trust_remote_code=True +).eval() +``` + +Si vous souhaitez fusionner les adaptateurs et enregistrer le modèle affiné en tant que modèle autonome (vous ne pouvez le faire qu'avec LoRA, et vous **NE POUVEZ PAS** fusionner les paramètres de Q-LoRA), vous pouvez exécuter les codes suivants : + +```python +from peft import AutoPeftModelForCausalLM + +model = AutoPeftModelForCausalLM.from_pretrained( + path_to_adapter, # path to the output directory + device_map="auto", + trust_remote_code=True +).eval() + +merged_model = model.merge_and_unload() +# max_shard_size and safe serialization are not necessary. +# They respectively work for sharding checkpoint and save the model to safetensors +merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True) +``` + +Note : Pour l'entraînement multi-GPU, vous devez spécifier les hyperparamètres appropriés pour l'entraînement distribué en fonction de votre machine. De plus, nous vous conseillons de spécifier votre longueur maximale de séquence avec l'argument `--model_max_length`, en fonction de votre considération des données, de l'empreinte mémoire, et de la vitesse d'apprentissage. 
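As a complement to the data format described earlier in this section, here is a minimal sketch of how such a training file could be produced programmatically. The helper `make_sample`, the file name `train_data.json`, and the temporary-directory location are assumptions made for illustration; only the `id` / `conversations` / `from` / `value` layout comes from the example above.

```python
import json
import os
import tempfile


def make_sample(sample_id, turns):
    """Build one sample from (speaker, text) pairs (hypothetical helper)."""
    for speaker, _ in turns:
        if speaker not in ("user", "assistant"):
            raise ValueError(f"unexpected speaker: {speaker!r}")
    return {
        "id": sample_id,
        "conversations": [{"from": s, "value": t} for s, t in turns],
    }


samples = [
    make_sample(
        "identity_0",
        [("user", "你好"), ("assistant", "我是一个语言模型，我叫通义千问。")],
    ),
]

# Assumed output location; any path works as long as the scripts can read it.
data_path = os.path.join(tempfile.gettempdir(), "train_data.json")

# ensure_ascii=False keeps non-ASCII text readable in the saved file.
with open(data_path, "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```

The resulting path is what you would then pass as the data file path (`$DATA`) in the provided shell scripts.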
+
+
+### Profilage de la mémoire et de la vitesse
+Nous profilons la mémoire du GPU et la vitesse d'apprentissage de LoRA (LoRA (emb) se réfère à l'apprentissage de la couche d'intégration et de sortie, tandis que LoRA n'a pas de couche d'intégration et de sortie pouvant être entraînée) et de Q-LoRA dans la configuration de l'apprentissage sur un seul GPU. Dans ce test, nous expérimentons sur un seul GPU A100-SXM4-80G, et nous utilisons CUDA 11.8 et PyTorch 2.0. Flash attention 2 est appliqué. Nous utilisons uniformément une taille de lot de 1 et une accumulation de gradient de 8. Nous profilons la mémoire (GB) et la vitesse (s/iter) des entrées de différentes longueurs, à savoir 256, 512, 1024, 2048, 4096, et 8192. Nous présentons également les statistiques du réglage fin de tous les paramètres avec Qwen-7B sur 2 GPU A100. Nous ne présentons que les statistiques de 256, 512 et 1024 jetons en raison de la limitation de la mémoire du GPU.
+
+Pour Qwen-72B, nous expérimentons de deux manières : 1) finetuning LoRA + DeepSpeed ZeRO 3 sur 4 GPU A100-SXM4-80G et 2) finetuning Q-LoRA (Int4) sur un seul GPU A100-SXM4-80G. Notez que l'OOM se produit sur 4 GPU A100-SXM4-80G à la fois avec le réglage fin LoRA (emb) et le réglage fin LoRA sans DeepSpeed ZeRO 3 (vous pouvez passer `--deepspeed finetune/ds_config_zero3.json` à [`finetune/finetune_lora_ds.sh`](finetune/finetune_lora_ds.sh) afin d'activer DeepSpeed ZeRO 3).
+
+Les statistiques sont listées ci-dessous :
+
Model SizeMethodSequence Length
2565121024204840968192
1.8BLoRA6.7G / 1.0s/it7.4G / 1.0s/it8.4G / 1.1s/it11.0G / 1.7s/it16.2G / 3.3s/it21.8G / 6.8s/it
LoRA (emb)13.7G / 1.0s/it14.0G / 1.0s/it14.0G / 1.1s/it15.1G / 1.8s/it19.7G / 3.4s/it27.7G / 7.0s/it
Q-LoRA5.8G / 1.4s/it6.0G / 1.4s/it6.6G / 1.4s/it7.8G / 2.0s/it10.2G / 3.4s/it15.8G / 6.5s/it
Full-parameter43.5G / 2.1s/it43.5G / 2.2s/it43.5G / 2.2s/it43.5G / 2.3s/it47.1G / 2.8s/it48.3G / 5.6s/it
7BLoRA20.1G / 1.2s/it20.4G / 1.5s/it21.5G / 2.8s/it23.8G / 5.2s/it29.7G / 10.1s/it36.6G / 21.3s/it
LoRA (emb)33.7G / 1.4s/it34.1G / 1.6s/it35.2G / 2.9s/it35.1G / 5.3s/it39.2G / 10.3s/it48.5G / 21.7s/it
Q-LoRA11.5G / 3.0s/it11.5G / 3.0s/it12.3G / 3.5s/it13.9G / 7.0s/it16.9G / 11.6s/it23.5G / 22.3s/it
Full-parameter139.2G / 4.0s/it148.0G / 4.0s/it162.0G / 4.5s/it---
14BLoRA34.6G / 1.6s/it35.1G / 2.4s/it35.3G / 4.4s/it37.4G / 8.4s/it42.5G / 17.0s/it55.2G / 36.0s/it
LoRA (emb)51.2 / 1.7s/it51.1G / 2.6s/it51.5G / 4.6s/it54.1G / 8.6s/it56.8G / 17.2s/it67.7G / 36.3s/it
Q-LoRA18.7G / 5.3s/it18.4G / 6.3s/it18.9G / 8.2s/it19.9G / 11.8s/it23.0G / 20.1s/it27.9G / 38.3s/it
72BLoRA + Deepspeed Zero3215.4G / 17.6s/it217.7G / 20.5s/it222.6G / 29.4s/it228.8G / 45.7s/it249.0G / 83.4s/it289.2G / 161.5s/it
Q-LoRA61.4G / 27.4s/it61.4G / 31.5s/it62.9G / 41.4s/it64.1G / 59.5s/it68.0G / 97.7s/it75.6G / 179.8s/it
+
+ +## Déploiement + +### vLLM +Pour le déploiement et l'inférence rapide, nous suggérons d'utiliser vLLM avec FastChat. Installez d'abord les paquets: +```bash +pip install vllm +pip install "fschat[model_worker,webui]" +``` +Ou vous pouvez les installer à partir des sources par `git clone` et `pip install -e .`. Nous vous conseillons de lire leurs documents si vous rencontrez des problèmes lors de l'installation. + +Pour faire fonctionner Qwen avec vLLM et FastChat, vous devez d'abord lancer un contrôleur par: +```bash +python -m fastchat.serve.controller +``` + +Ensuite, vous pouvez lancer le travailleur de modèle, ce qui signifie charger votre modèle pour l'inférence. Pour l'inférence sur un seul GPU, vous pouvez directement lancer: +```bash +python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code +``` +Cependant, si vous souhaitez exécuter le modèle sur plusieurs GPU pour une inférence plus rapide ou une mémoire plus importante, vous pouvez utiliser le parallélisme tensoriel pris en charge par vLLM. Supposons que vous exécutiez le modèle sur 4 GPU, la commande est présentée ci-dessous: +```bash +python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 +``` + +Après avoir lancé votre model worker, vous pouvez lancer : + +* Démonstration de l'interface web +```bash +python -m fastchat.serve.gradio_web_server +``` + +* API OpenAI +```bash +python -m fastchat.serve.openai_api_server --host localhost --port 8000 +``` + +Cependant, si vous avez des difficultés à utiliser vLLM et FastChat, vous pouvez essayer nos méthodes les plus simples pour déployer une démo web, une démo CLI et une API. + +### Interface Web + +Nous fournissons du code pour que les utilisateurs puissent construire une démo d'interface web (merci à @wysaid). 
Avant de commencer, assurez-vous d'installer les paquets suivants: + +``` +pip install -r requirements_web_demo.txt +``` + +Exécutez ensuite la commande ci-dessous et cliquez sur le lien généré: + +```bash +python web_demo.py +``` + +

+
+ +
+

+ +### Démo CLI + +Nous fournissons un exemple de démonstration CLI dans `cli_demo.py`, qui prend en charge la sortie en continu pour la génération. Les utilisateurs peuvent interagir avec Qwen-7B-Chat en saisissant des invites, et le modèle renvoie les sorties du modèle en mode streaming. Exécutez la commande ci-dessous: + +```bash +python cli_demo.py +``` + +

+
+ +
+

+
+ +### API Nous fournissons des méthodes pour déployer une API locale basée sur l'API OpenAI (merci à @hanpenggit). Avant de commencer, installez les paquets nécessaires: @@ -864,6 +903,122 @@ print(response.choices[0].message.content)

+## 🐳 Docker + +Pour simplifier le processus de déploiement, nous fournissons des images docker avec des environnements préconstruits : [qwenllm/qwen] (https://hub.docker.com/r/qwenllm/qwen). Il vous suffit d'installer le pilote et de télécharger les fichiers de modèle pour lancer les démonstrations, déployer l'API OpenAI et affiner le modèle. + +### Préparation + +1. Installez la version correcte du pilote Nvidia en fonction de l'image à utiliser : + - `qwenllm/qwen:cu117` (**recommandé**): `>= 515.48.07` + - `qwenllm/qwen:cu114` (w/o flash-attention): `>= 470.82.01` + - `qwenllm/qwen:latest`: même que `qwenllm/qwen:cu117` + +2. Installer et configurer [docker](https://docs.docker.com/engine/install/) et [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) : + +```bash +# configure docker +sudo systemctl start docker +# test if docker is correctly installed +sudo docker run hello-world + +# configure nvidia-container-toolkit +sudo nvidia-ctk runtime configure --runtime=docker +sudo systemctl restart docker +# test if nvidia-container-toolkit is correctly installed +sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi +``` + +3. Téléchargez les checkpoints et les codes du modèle dans votre environnement (voir [ici](#DownloadModel)). + +### Déploiement + +Nous utilisons ici Qwen-7B-Chat comme exemple. 
Avant de lancer une démo web ou une API, vous pouvez établir la configuration comme indiqué ci-dessous : + +```bash +IMAGE_NAME=qwenllm/qwen:cu117 +PORT=8901 +CHECKPOINT_PATH=/path/to/Qwen-7B-Chat # Path to downloaded model checkpoints and codes +``` +Les scripts suivants peuvent vous aider à construire : + +* API OpenAI +```bash +bash docker/docker_openai_api.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* Interface Web +```bash +bash docker/docker_web_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* Démo CLI +```bash +bash docker/docker_cli_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} +``` + +Les commandes ci-dessus téléchargeront automatiquement l'image requise et lanceront une démo d'interface Web en arrière-plan (le service redémarrera automatiquement). Vous pouvez ouvrir `http://localhost:${PORT}` sur l'hôte pour utiliser la démo. + +La démo est lancée avec succès si vous obtenez le résultat suivant : + +```text +Successfully started web demo. Open '...' to try! +Run `docker logs ...` to check demo status. +Run `docker rm -f ...` to stop and remove the demo. +``` + +Si vous voulez vérifier le statut de la démo, vous pouvez utiliser `docker logs qwen` pour afficher les résultats. + +Vous pouvez utiliser `docker rm -f qwen` pour arrêter le service et supprimer le conteneur. 
+ + +### Finetuning + +La méthode de finetuning utilisant l'image Docker préconstruite est fondamentalement la même que [le chapitre ci-dessus](#Finetuning) (nous avons déjà installé les dépendances dans l'image) : + +Voici un exemple de LoRA à une seule GPU : +```bash +IMAGE_NAME=qwenllm/qwen:cu117 +CHECKPOINT_PATH=/path/to/Qwen-7B # Path to downloaded model checkpoints and codes +#CHECKPOINT_PATH=/path/to/Qwen-7B-Chat-Int4 # Path to downloaded model checkpoints and codes (Q-LoRA) +DATA_PATH=/path/to/data/root # Prepare finetune data at ${DATA_PATH}/example.json +OUTPUT_PATH=/path/to/output/checkpoint # Path to finetune outputs + +# Use all host devices by default +DEVICE=all +# If you need to specify GPUs for training, set device as follow (NOTE: internal quotation marks cannot be omitted) +#DEVICE='"device=0,1,2,3"' + +mkdir -p ${OUTPUT_PATH} + +# Single-GPU LoRA finetuning +docker run --gpus ${DEVICE} --rm --name qwen \ + --mount type=bind,source=${CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-7B \ + --mount type=bind,source=${DATA_PATH},target=/data/shared/Qwen/data \ + --mount type=bind,source=${OUTPUT_PATH},target=/data/shared/Qwen/output_qwen \ + --shm-size=2gb \ + -it ${IMAGE_NAME} \ + bash finetune/finetune_lora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B/ -d /data/shared/Qwen/data/example.json +``` + +Pour faire un changement vers Q-LoRA à GPU unique par exemple, il suffit de modifier la commande bash à l'intérieur de `docker run` : +```bash +bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B-Chat-Int4/ -d /data/shared/Qwen/data/example.json +``` +
+ +## 🔥 Invite du système +Qwen-1.8-Chat et Qwen-72B-Chat ont été entièrement formés à diverses invites de système avec plusieurs séries d'interactions complexes, de sorte qu'ils peuvent suivre une variété d'invites de système et réaliser la personnalisation du modèle dans le contexte, améliorant ainsi l'évolutivité de Qwen-chat. + +Grâce aux messages-guides du système, Qwen-Chat peut **jouer avec enthousiasme**, **transférer le style de langage**, **fixer des tâches** et **fixer des comportements**. + +![](assets/system_prompt_language_style.png) + +![](assets/system_prompt_role_play_en.png) + +Pour plus d'informations, veuillez vous référer à la [documentation d'exemple](examples/system_prompt.md). + + ## Utilisation des outils Qwen-Chat a été optimisé pour l'utilisation d'outils et les capacités d'appel de fonctions. Les utilisateurs peuvent développer des agents, des applications LangChain, et même augmenter Qwen avec un Code Interpreter. @@ -1087,9 +1242,13 @@ En outre, nous fournissons également des résultats expérimentaux démontrant
-## Compréhension du contexte long +## Compréhension du Contexte Long -Pour étendre la longueur du contexte et briser le goulot d'étranglement de la longueur de la séquence d'entraînement, nous introduisons plusieurs techniques, y compris l'interpolation consciente de NTK, l'attention de fenêtre, et l'échelle d'attention LogN, pour étendre la longueur du contexte de Qwen-7B/14B de 2k à plus de 8k tokens, et Qwen-7B de 8k à 32k tokens. Nous menons des expériences de modélisation du langage sur l'ensemble de données arXiv avec l'évaluation PPL et nous constatons que Qwen peut atteindre des performances exceptionnelles dans le scénario d'un contexte long. Les résultats sont présentés ci-dessous : +Pour augmenter la longueur du contexte et éliminer le goulot d'étranglement que constitue la longueur de la séquence d'entraînement, nous introduisons plusieurs techniques, notamment l'interpolation tenant compte des NTK, l'attention par fenêtre et la mise à l'échelle de l'attention LogN, afin d'augmenter la longueur du contexte de Qwen-14B de 2K à plus de 8K tokens, et de Qwen-1.8B/7B de 8K à 32K tokens. + +Pour Qwen-72B, nous adaptons RoPE à des contextes plus longs avec une base rotative plus importante. Qwen-72B prend en charge la longueur de contexte maximale de 32K tokens. + +Nous menons des expériences de modélisation du langage sur l'ensemble de données arXiv avec l'évaluation PPL et nous constatons que Qwen peut atteindre des performances exceptionnelles dans le scénario d'un contexte long. Les résultats sont présentés ci-dessous : @@ -1111,6 +1270,12 @@ Pour étendre la longueur du contexte et briser le goulot d'étranglement de la + + + + + + @@ -1123,8 +1288,25 @@ Pour étendre la longueur du contexte et briser le goulot d'étranglement de la + + + +
+ dynamic_ntk + logn + window_attn4.233.783.583.494.32-
Qwen-1.8B5.004.484.133.8917.42433.85
+ dynamic_ntk + logn + window_attn5.004.484.143.933.823.83
Qwen-7B4.233.813.523.317.27181.49
+ dynamic_ntk + logn + window_attn-3.463.293.183.42-
Qwen-72B---2.832.732.72
+En outre, pour vérifier la capacité de Qwen-72B-Chat à comprendre des textes longs, nous l'avons testé sur [L-Eval] (https://arxiv.org/abs/2307.11088) (tâches fermées). Les résultats sont les suivants : + +| Model | Input Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFcition | +|:------------------|:------------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:| +| ChatGPT-3.5-16k | 16K | 60.73 | **63.51** | **84.00** | 61.38 | 78.43 | **12.22** | 64.84 | +| **Qwen-72B-Chat** | 32K | **62.30** | 58.13 | 76.00 | **77.22** | **86.24** | 6.66 | **69.53** | + +Nous avons réalisé l'expérience de "l'aiguille dans une botte de foin" (l'idée vient de [@Greg Kamradt](https://twitter.com/GregKamradt/status/1727018183608193393)) pour tester si le modèle peut récupérer des informations à différentes positions dans les entrées de différentes longueurs, le résultat est le suivant : + +![](assets/qwen_72b_needle_in_a_haystack.png) + +Les résultats ci-dessus montrent que Qwen-72B-Chat peut récupérer avec précision des informations placées dans différentes positions dans une longueur d'entrée de 32K, ce qui prouve ses excellentes capacités de compréhension de textes longs. + ## Tokenizer @@ -1156,7 +1338,13 @@ Si vous trouvez notre travail utile, n'hésitez pas à nous citer. ## Accord de Licence -Les chercheurs et les développeurs sont libres d'utiliser les codes et les poids des modèles de Qwen et de Qwen-Chat. Nous autorisons également leur utilisation commerciale. Consultez notre licence à [LICENSE](LICENSE) pour plus de détails. Si vous avez des exigences en matière d'utilisation commerciale, veuillez remplir le formulaire ([7B](https://dashscope.console.aliyun.com/openModelApply/qianwen), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat)) pour en faire la demande. +Le code source fourni à l'adresse est soumis à la licence [Apache 2.0 License](./LICENSE) qui se trouve dans le répertoire racine. 
+ +Les chercheurs et les développeurs sont libres d'utiliser les codes et les poids des modèles de Qwen et de Qwen-Chat. Pour leur utilisation commerciale, veuillez consulter l'accord de licence accompagnant chaque modèle. + +- Qwen-72B, Qwen-14B et Qwen-7B sont sous licence [Tongyi Qianwen LICENSE AGREEMENT](./Tongyi%20Qianwen%20LICENSE%20AGREEMENT) que l'on peut trouver dans les dépôts HuggingFace et ModelScope correspondants. Pour une utilisation commerciale, veuillez remplir le formulaire ([72B](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat), et [7B](https://dashscope.console.aliyun.com/openModelApply/qianwen)) pour en faire la demande. + +- Qwen-1.8B est sous licence [Tongyi Qianwen RESEARCH LICENSE AGREEMENT](./Tongyi%20Qianwen%20RESEARCH%20LICENSE%20AGREEMENT) qui peut être trouvé dans les dépôts HuggingFace et ModelScope correspondants. Pour une utilisation commerciale, veuillez nous contacter.

## Contactez-nous diff --git a/README_JA.md b/README_JA.md index 6955cf3..f12ea60 100644 --- a/README_JA.md +++ b/README_JA.md @@ -1,5 +1,5 @@

- 中文  |  English  |  日本語 |  Français + 中文  |  English  |  日本語 |  Français |  Español



@@ -9,28 +9,34 @@

- 🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |   🖥️ Demo + 🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |   🖥️ Demo
-WeChat   |    DingTalk    |   Discord   +WeChat (微信)   |   Discord   |   API



-

- 日本語ドキュメントメンテナー: Ikko Eltociear Ashimine & Junyang Lin -

-
- | | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen | |-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:| +| 1.8B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | | 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | | 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | +| 72B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | -Qwen-7B**と**Qwen-14B**の**Qwen**シリーズと、**Qwen-7B-Chat**と**Qwen-14B-Chat**の**Qwen-Chat**シリーズをオープンソース化しました。上の表にリンクがあります。クリックしてモデルカードをご確認ください。また、テクニカルレポートも公開しました。論文リンクをクリックしてご覧ください! +**Qwen-1.8B**、**Qwen-7B**、**Qwen-14B**、**Qwen-72B**の基本言語モデルである**Qwen**と、**Qwen-1.8B-Chat**、**Qwen-7B-Chat**、**Qwen-14B-Chat**、**Qwen-72B-Chat**のチャットモデルである**Qwen-Chat**をオープンソース化します。上の表にリンクがあります。リンクをクリックして、モデルカードをご確認ください。また、**[テクニカルレポート](https://arxiv.org/abs/2309.16609)**も公開しています。論文リンクをクリックしてご覧ください! 
簡単に説明すると、私たちは、ドメインや言語(中国語と英語を中心に)などを幅広くカバーする最大3兆トークンの多言語データに対して安定的に事前学習された強力なベース言語モデルを持っています。これらのモデルは、ベンチマークデータセットにおいて競争力のあるパフォーマンスを達成することができます。さらに、SFTとRLHFに基づく人間の嗜好に沿ったチャットモデル(まだリリースされていません)があり、チャット、コンテンツ作成、情報抽出、要約、翻訳、コーディング、数学の問題を解くなどが可能で、ツールを使ったり、エージェントとして遊んだり、コードインタプリタとして遊んだりすることもできます。 + +| モデル | 発行日 | コンテキストの最大長 | システムプロンプトの強化 | 预训练されたトークンの数 | Finetuning(Q-Lora)の最小GPUメモリ使用量 | 2048トークン生成時の最小GPUメモリ使用量(Int4) | ツールの使用能力 | +|:----------|:--------:|:----------:|:------------:|:------------:|:------------------------------:|:-----------------------------:|:--------:| +| Qwen-1.8B | 23.11.30 | 32K | √ | 2.2T | 5.8GB | 2.9GB | √ | +| Qwen-7B | 23.08.03 | 32K | × | 2.4T | 11.5GB | 8.2GB | √ | +| Qwen-14B | 23.09.25 | 8K | × | 3.0T | 18.7GB | 13.0GB | √ | +| Qwen-72B | 23.11.30 | 32K | √ | 3.0T | 61.4GB | 48.9GB | √ | + + このレポでは、それを把握することができる: * Qwenのクイックスタート。 @@ -51,6 +57,7 @@ Qwen-7B**と**Qwen-14B**の**Qwen**シリーズと、**Qwen-7B-Chat**と**Qwen-1 ## ニュースとアップデート +* 2023.11.30 🔥 3T トークンで学習し、32k コンテキストをサポートする **Qwen-72B** と **Qwen-72B-Chat** を、 **Qwen-1.8B** と **Qwen-1.8B-Chat** とともに、ModelScope と Hugging Face 上でリリースしました。また、Qwen-72B-ChatとQwen-1.8B-Chatのシステム・プロンプト機能を強化しました。[サンプル・ドキュメント](examples/system_prompt.md)を参照してください。さらに、**Ascend 910** と **Hygon DCU** での推論をサポートしました。詳細は `ascend-support` と `dcu-support` を参照してください。 * 2023.10.17 Int8量子化モデル**Qwen-7B-Chat-Int8**と**Qwen-14B-Chat-Int8**をリリースしました。 * 2023.9.25 🔥 Qwen-14BとQwen-14B-ChatをModelScopeとHugging Faceでリリースしました。[qwen.cpp](https://github.com/QwenLM/qwen.cpp) と [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) もリリースされました。同時に、Qwen-7B と Qwen-7B-Chat も更新しました。Qwen-7B(オリジナル)と比較して、Qwen-7Bはより多くの学習トークンを使用し、2.2Tトークンから2.4Tトークンに増加し、コンテキスト長は2048から8192に拡張された。Qwen-7Bの中国語知識とコーディング能力はさらに向上しています。最新のコードとチェックポイントをお使いください! 
* 2023.9.12 Qwen-7Bモデルにおいて、フルパラメーター・ファインチューニング、LoRA、Q-LoRAを含むファインチューニングをサポートしました。 @@ -60,27 +67,31 @@ Qwen-7B**と**Qwen-14B**の**Qwen**シリーズと、**Qwen-7B-Chat**と**Qwen-1 ## 性能 -Qwen-14BとQwen-7B(これは、より多くのトークンで学習され、コンテキストの長さが2048から8192に拡張された新バージョン)は、自然言語理解、数学的問題解決、コーディングなどに関するモデルの能力を評価する一連のベンチマークデータセット、例えばMMLU、C-Eval、GSM8K、MATH、HumanEval、MBPP、BBHなどにおいて、同様のモデルサイズのベースラインモデルを上回る。しかし、Qwen-14BでもGPT-4はおろかGPT-3.5にも大きく遅れをとっています。以下の結果をご覧ください。 +Qwenモデルは、MMLU、C-Eval、GSM8K、MATH、HumanEval、MBPP、BBHなど、自然言語理解、数学的問題解決、コーディングなどに関するモデルの能力を評価する一連のベンチマークデータセットにおいて、同様のモデルサイズを持つベースラインモデルを上回る性能を発揮する。Qwen-72Bは全てのタスクでLLaMA2-70Bを上回り、10タスク中7タスクでGPT-3.5を上回った。 +

- +


-| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | -|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:| -| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | -| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | -| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | -| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | -| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | -| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | -| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | -| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | -| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 | -| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 | -| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | -| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** | +| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | +|:------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:| +| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | +| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | +| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | +| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | +| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | +| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | +| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | +| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 | +| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 | +| XVERSE-65B | 
70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - | +| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 | +| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | +| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 | +| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** | 比較されたすべてのモデルについて、公式に報告された結果と[OpenCompass](https://opencompass.org.cn/leaderboard-llm) の間の最高スコアを報告します。 @@ -92,6 +103,7 @@ Qwen-14BとQwen-7B(これは、より多くのトークンで学習され、 * python 3.8 以上 * pytorch 1.12 以上、2.0 以上を推奨 +* transformers 4.32 以上 * CUDA 11.4 以上を推奨(GPU ユーザー、フラッシュアテンションユーザー向けなど)
@@ -99,7 +111,9 @@ Qwen-14BとQwen-7B(これは、より多くのトークンで学習され、 以下では、Qwen-Chat と 🤖 ModelScope と 🤗 Transformers の簡単な使用例を示します。 -コードを実行する前に、環境のセットアップと必要なパッケージのインストールが済んでいることを確認してください。上記の要件を満たしていることを確認してから、依存するライブラリをインストールしてください。 +詳しくはセクション["ビルド済みDockerイメージの使用"](#-using-pre-built-docker-images)を参照してください。 + +Dockerを使用しない場合は、環境のセットアップと必要なパッケージのインストールが済んでいることを確認してください。上記の要件を満たしていることを確認してから、依存するライブラリをインストールしてください。 ```bash pip install -r requirements.txt @@ -112,6 +126,7 @@ git clone https://github.com/Dao-AILab/flash-attention cd flash-attention && pip install . # 以下はオプションです。インストールに時間がかかる場合があります。 # pip install csrc/layer_norm +# flash-attn のバージョンが 2.1.1 以降の場合、以下は必要ありません。 # pip install csrc/rotary ``` @@ -119,7 +134,7 @@ cd flash-attention && pip install . ### 🤗 Transformers -Qwen-Chat を推論に使用するには、以下のように数行のコードを入力するだけです。**最新のコードを使用していることを確認してください。** +Qwen-Chat を推論に使用するには、以下のように数行のコードを入力するだけです。Qwen/Qwen-7B-Chat "や "Qwen/Qwen-14B-Chat "のように、正しいモデル名やパスを渡すことを忘れないでください。**最新のコードを使用していることを確認してください。** ```python from transformers import AutoModelForCausalLM, AutoTokenizer @@ -137,8 +152,8 @@ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code # オートモードを使用すると、デバイスに応じて自動的に精度が選択されます。 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval() -# 生成のためのハイパーパラメータを指定 -model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) +# 生成のためのハイパーパラメータを指定。ただし、4.32.0 以上のトTransformerを使用している場合は、これを行う必要はありません。 +# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 第一回対話ターン response, history = model.chat(tokenizer, "你好", history=None) @@ -181,8 +196,8 @@ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True # オートモードを使用すると、デバイスに応じて自動的に精度が選択されます。 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval() -# 生成のためのハイパーパラメータを指定 -model.generation_config = 
GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True) +# 生成のためのハイパーパラメータを指定。ただし、4.32.0 以上のトTransformerを使用している場合は、これを行う必要はありません。 +# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt') inputs = inputs.to(model.device) @@ -193,7 +208,9 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)) +

HuggingFaceからモデルのチェックポイントとコードをダウンロードする際にネットワークの問題が発生した場合、ModelScopeからチェックポイントをダウンロードする方法はこちらでございます。 +

```python from modelscope import snapshot_download @@ -321,13 +338,69 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cp GPUメモリ不足に悩まされ、1つ以上のGPUでモデルを実行したい場合、Transformersでサポートされるようになったデフォルトのロード方法を直接使うことができます。以前の `utils.py` に基づく方法は非推奨です。 しかし、この方法は簡単ですが、ネイティブ・パイプライン並列の効率は低いです。FastChatでvLLMを使用することをお勧めします。 + +### DashScope + +APIを通じてQwenを利用する最も簡単な方法は、Alibaba Cloudを通じたDashScope APIサービスです。その使い方を紹介します。さらに、OpenAIスタイルのAPIをご自身のサーバーにデプロイするためのスクリプトも提供しています。 + +DashScopeはAlibaba Cloudが提供する大規模言語モデルAPIサービスで、今回Qwenに対応した。DashScopeの背後にあるモデルは、詳細が提供されていない一時的な社内バージョンであることに注意してください。サービスには `qwen-turbo` と `qwen-plus` があり、前者はより高速に動作し、後者はより優れたパフォーマンスを実現している。詳細はドキュメント [こちら](https://dashscope.aliyun.com) を参照。 + +公式サイト [link](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) で DashScope アカウントを作成し、API キー (AK) を取得してください。AK は環境変数で設定することをお勧めします: +```bash +export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY" +``` +その後、パッケージをインストールし、ドキュメントは [こちら](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk) をクリックしてください。Python をお使いの場合は、pip で DashScope をインストールできます: +```bash +pip install dashscope +``` +JAVA SDKを使用する場合は、この方法でインストールできます: +```xml + + + com.alibaba + dashscope-sdk-java + the-latest-version + +``` +DashScope を使用する最も簡単な方法は、OpenAI API と同様のメッセージを使用する方法です。以下にその例を示す: +```python +import random +from http import HTTPStatus +from dashscope import Generation + + +def call_with_messages(): + messages = [{'role': 'system', 'content': 'You are a helpful assistant.'}, + {'role': 'user', 'content': '如何做西红柿鸡蛋?'}] + gen = Generation() + response = gen.call( + Generation.Models.qwen_turbo, + messages=messages, + seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set + result_format='message', # set the result to be "message" format. 
+ ) + return response + + +if __name__ == '__main__': + response = call_with_messages() + if response.status_code == HTTPStatus.OK: + print(response) + else: + print('Request id: %s, Status code: %s, error code: %s, error message: %s' % ( + response.request_id, response.status_code, + response.code, response.message + )) +``` +詳しい使い方は公式サイトをご覧ください。

+ ## 量子化 ### GPTQ -**注: [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) に基づく新しい解決策を提供し、Qwen-Chat 用の Int4 量子化モデル[ここをクリック](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)をリリースしました。このモデルは、従来の解決策と比較して、ほぼ無損失のモデル効果を達成しつつ、メモリコストと推論速度の両方で性能が向上しています。** +我々は、[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)に基づいた解決策を提供し、Int4とInt8の量子化モデルをリリースすることで、ほぼ無損失なモデル効果を達成しつつ、メモリコストと推論速度の両方で性能を向上させた。 ここでは、量子化されたモデルを推論に使用する方法を説明する。始める前に、auto-gptqの要件を満たしていることを確認し(例:torch 2.0以上、transformers 4.32.0以上など)、必要なパッケージをインストールしてください: @@ -337,6 +410,12 @@ pip install auto-gptq optimum auto-gptq`のインストールに問題がある場合は、公式の[repo](https://github.com/PanQiWei/AutoGPTQ)をチェックして、ホイールを見つけることをお勧めする。 +> 注意:コンパイル済みの `auto-gptq` パッケージは `torch` のバージョンと CUDA バージョンに強く依存しています。さらに、最近のアップデートにより +> さらに、最近のアップデートにより、`transformers`、`optimum`、`peft` でサポートされていないバージョンのエラーが発生する可能性があります。 +> 以下の要件を満たす最新バージョンの使用をお勧めします: +> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1 > - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1 +> - torch>=2.0, <2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0 + そうすれば、量子化されたモデルを簡単にロードすることができ、いつもと同じように推論を実行することができる: ```python @@ -352,18 +431,28 @@ response, history = model.chat(tokenizer, "Hi", history=None) | Quantization | MMLU | CEval (val) | GSM8K | Humaneval | |----------------------|:----:|:-----------:|:-----:|:---------:| +| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 | +| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 | +| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 | | Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 | | Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 | | Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 | | Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 | -| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 | +| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 | | Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 | +| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 | +| Qwen-72B-Chat 
(Int8) | 73.5 | 80.1 | 73.5 | 62.2 | +| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 | ### KVキャッシュ量子化 -モデルの推論の時に、中間結果のKeyとValueを量子化して圧縮保存することができます。これにより、同じGPUでより多くのKeyとValueを保存することができ、サンプルのスピードを増やすことができます。 +> 注意: Hugging Faceの内部メカニズムにより、この機能のサポートファイル +> (すなわち、`cache_autogptq_cuda_256.cpp`と`cache_autogptq_cuda_kernel_245.cu`)が欠落している可能性があります。以下を手動でダウンロードしてください。 +> Hugging Face Hubから手動でダウンロードし、他のモジュールファイルと同じフォルダに入れてください。 + +アテンション KV キャッシュを量子化して圧縮して保存すると、サンプルのスループットが向上する。この機能を有効にするには、`config.json` に `use_cache_quantization` と `use_cache_kernel` という引数を指定する。 +具体的な使用方法は以下の通りである: -use_cache_quantizationとuse_cache_kernelという2つのパラメータを提供します。use_cache_quantizationとuse_cache_kernelを両方ONにした場合、KVキャッシュ量子化の機能が有効になります。具体的な使い方は: ```python model = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen-7B-Chat", @@ -374,43 +463,50 @@ model = AutoModelForCausalLM.from_pretrained( use_flash_attn=False ) ``` -現在、この機能はflash attnと同時に使用することはできません。use_flash_attnをTrueにしてKVキャッシュ量子化とflash attnを同時に有効にした場合、use_flash_attnはデフォルトで無効になります。 +注意: 現在、KVキャッシュの量子化とフラッシュ・アテンションを同時に使用することはできない。 +KV キャッシュの量子化とフラッシュ・アテンションを同時に有効にした場合(`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`)、デフォルトでは `use_flash_attn` は無効になる(`use_flash_attn=false`)。 -Int8 KVキャッシュ量子化の使用によるモデルの性能の影響はほとんどありませんでした。性能評価は単一のA100-SXM4-80G GPUで実行され、モデルはデフォルトでBF16形式を使用し、生成される文章の長さは1024です。oomはメモリ不足を示します。 +量子化されたint8-kvcacheモデルを使用しても、下流の評価で大幅な性能低下がないことを確認しました。以下では、さまざまな条件下でのメモリフットプリントのプロファイリングに焦点を当てます。 +プロファイリングは、PyTorch 2.0.1とCUDA 11.4を搭載したシングルA100-SXM4-80G GPUで実行しました。 +デフォルトで1024トークンを生成するためにBF16モデルを使用し、"OOM "はメモリ不足エラーを示します。 -KVキャッシュ量子化を有効にすると、推論の時により大きなバッチサイズ(bs)を使用できるようになります。 +KVキャッシュの量子化により、モデルはより大きなバッチサイズ(bs)で推論することができる。 -| USE KVCache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 | -|-------------|:------:|:------:|:------:|:------:|:------:|:------:| -| no | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom | -| yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB | +| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | 
bs=64 | bs=100 | +|--------------|:------:|:------:|:------:|:------:|:------:|:------:| +| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | OOM | OOM | +| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB | -KVキャッシュ量子化を有効にすると、推論の時により長い文章が生成できる。 +KVキャッシュ量子化により、推論段階でより長いシーケンス(`sl`, シーケンス長、生成されるトークン数を指す)を生成する際、モデルはより多くのメモリを節約することができる。 -| USE KVCache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 | -|-------------|:------:|:-------:|:-------:|:-------:|:-------:| -| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB | -| yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB | +| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 | +|--------------|:------:|:-------:|:-------:|:-------:|:-------:| +| No | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB | +| Yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB | +KVキャッシュ量子化モデルでは、layer-pastのフォーマットをfloatからint8に変換し、量子化された `layer-past` には量子化パラメータも格納される。 -モデルがKVキャッシュ量子化を有効にした場合、モデルの推論の時には、元のfloat形式のkey/valueをint8形式のqkey/qvalueと対応する量子化パラメータに変換します。 -具体的な手順は以下の通りです: -1、key/valueの量子化を行います。 +具体的な手順は以下の通り: + +1. key/valueの量子化を行います。 ``` qv,scale,zero_point=quantize_cache_v(v) ``` -2、layer_pastに保存します。 -量子化されたのlayer_pastは: + +2. 
`layer_past`に保存します。 + +量子化されたの`layer-past`は: ``` layer_past=((q_key,key_scale,key_zero_point), (q_value,value_scale,value_zero_point)) ``` -元のlayer_past: +`layer_past`の元のフォーマットは以下の通りである: ``` layer_past=(key,value) ``` -layer_pastのkey、valueを使用する必要がある場合は、int8形式のkey/valueをfloat形式に戻すために、逆量子化操作を使用することができます。 +量子化されたアテンションKVを使用したい場合、 +Int8のkey/valueをfloatフォーマットに戻すには、以下のように逆量子化操作を使用します: ``` v=dequantize_cache_torch(qv,scale,zero_point) ``` @@ -420,118 +516,97 @@ layer_pastのkey、valueを使用する必要がある場合は、int8形式のk このセクションでは、さまざまな精度のモデルのスピードとメモリの統計情報を提供する。スピードとメモリーのプロファイリングは[このスクリプト](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)を使用しています。 -### 推論スピード - -BF16、Int8、Int4の精度のモデルを用いて、2048個と8192個のトークンを生成する平均推論速度(tokens/s)を、フラッシュアテンションv1、v2を使用した場合と使用しなかった場合の条件で測定した。 +BF16、Int8、および Int4 のモデルを使用して 2048 を生成する際の平均推論速度 (トークン/秒) と GPU メモリ使用量を測定しました。 - + + + + - - - + + + + - + + + - + + - + + + - + + + - + + - + + + - + + + - + + - + + + - + + + - + + - - - - - - - - - - - - - - - + + +
Model SizePrecisionFlashAttnSequence LengthModel SizeQuantizationSpeed (Tokens/s)GPU Memory Usage
20488192
1.8BBF1654.094.23GB
7BBF16v240.9336.14Int855.563.48GB
v140.7535.34 + Int471.072.91GB
Disabled37.5533.56 + 7BBF1640.9316.99GB
Int8v237.4732.54Int837.4711.20GB
v137.5132.39 + Int450.098.21GB
Disabled37.8432.65 + 14BBF1632.2230.15GB
Int4v250.0938.61Int829.2818.81GB
v145.9836.47 + Int438.7213.01GB
Disabled48.1236.70 + 72BBF168.48144.69GB (2xA100)
14BBF16v232.8824.87Int89.0581.27GB (2xA100)
v132.7628.89 + Int411.3248.86GB
Disabled29.3222.91 -
Int8v229.2824.22
v128.3123.87 -
Disabled31.1224.60 -
Int4v238.7227.33
v137.8126.46 -
Disabled37.6526.00 + 72B + vLLMBF1617.602xA100
-詳細には、プロファイリングの設定は、2048個のトークンをエンコードし、8192個の新しいトークンを生成することである。プロファイリングは、PyTorch 2.0.1とCUDA 11.4を搭載したシングルA100-SXM4-80G GPUで実行される。推論速度はエンコードされたトークンと生成されたトークンの平均である。 +プロファイリングは、PyTorch 2.0.1、CUDA 11.8、および Flash-Attendant 2 を備えた単一の A100-SXM4-80G GPU (2xA100 について言及されている場合を除く) で実行されます。(72B + vLLM は PyTorch 2.1.0 および Cuda 11.8 を使用します。) 推論速度 は、エンコードされ生成されたトークンの平均である。 注意:上記のInt4/Int8モデルの推論速度は、autogptqを使用しています。現在、``AutoModelForCausalLM.from_pretrained``で読み込まれるモデルの推論速度は約20%遅くなります。この問題はHuggingFaceチームに報告済みであり、解決策があれば即座に更新されます。 -### GPU メモリ使用量 - -また、BF16、Int8、Int4量子化レベルのそれぞれにおいて、2048個のトークンをコンテキストとしてエンコードした場合(および単一のトークンを生成した場合)と、8192個のトークンを生成した場合(単一のトークンをコンテキストとして生成した場合)のGPUメモリ使用量のピーク値をプロファイリングしました。結果(GB)を以下に示します。 - - - - - - - - - - - - - - - - - - - - - - - - - - -
Model SizePrecisionSequence Length
20488192
7BBF1616.9922.53
Int811.2016.62 -
Int48.2113.63
14BBF1630.1538.94
Int818.8127.54 -
Int413.0121.79
- -
+また、コンテキストと生成の長さ、Flash Attention バージョンのさまざまな設定で推論速度と GPU メモリ使用量も測定します。 結果は、Hugging Face または ModelScope の対応するモデルカードで確認できます。 ## ファインチューニング ### 使用方法 -現在、公式のトレーニングスクリプト `finetune.py` を提供しています。さらに、finetune.pyのシェルスクリプトを提供し、finetune.pyを実行することで、finetune.pyを起動することができる。さらに、安心してファインチューニングを開始するためのシェルスクリプトも提供しています。このスクリプトは、[DeepSpeed](https://github.com/microsoft/DeepSpeed) および [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) を使用したトレーニングをサポートします。弊社が提供するシェル・スクリプトは DeepSpeed と Peft を使用するため、事前に DeepSpeed と Peft をインストールすることをお勧めします: +現在、公式のトレーニングスクリプト `finetune.py` を提供しています。さらに、finetune.pyのシェルスクリプトを提供し、finetune.pyを実行することで、finetune.pyを起動することができる。さらに、安心してファインチューニングを開始するためのシェルスクリプトも提供しています。このスクリプトは、[DeepSpeed](https://github.com/microsoft/DeepSpeed) (注意:これはpydanticの最新バージョンとコンフリクトする可能性があるので、`pydantic<2.0`にする必要があります) および [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) を使用したトレーニングをサポートします。弊社が提供するシェル・スクリプトは DeepSpeed と Peft を使用するため、事前に DeepSpeed と Peft をインストールすることをお勧めします: ```bash pip install -r requirements_finetune.txt ``` @@ -598,7 +673,7 @@ sh finetune/finetune_qlora_single_gpu.sh sh finetune/finetune_qlora_ds.sh ``` -Q-LoRAについては、弊社が提供する量子化モデル、例えばQwen-7B-Chat-Int4をロードすることをお勧めします。BF16モデルは使用し**ない**でください!フルパラメータ・ファインチューニングやLoRAとは異なり、Q-LoRAではfp16のみがサポートされる。シングルGPUのトレーニングでは、トーチアンプによるエラーが観測されたため、混合精度のトレーニングにはディープスピードを使用する必要がある。また、Q-LoRAの場合、LoRAの特殊トークンの問題が残っています。しかし、Q-LoRAではチャットモデルとしてInt4モデルのみを提供しており、言語モデルはChatML形式の特殊トークンを学習しているため、レイヤーの心配はありません。なお、Int4モデルのレイヤーは学習できないはずなので、学習で特殊なトークンを導入すると、Q-LoRAが動作しなくなる可能性があります。 +Q-LoRAについては、弊社が提供する量子化モデル、例えばQwen-7B-Chat-Int4をロードすることをお勧めします。BF16モデルは使用し**ない**でください!フルパラメータ・ファインチューニングやLoRAとは異なり、Q-LoRAではfp16のみがサポートされる。シングルGPUのトレーニングでは、トーチアンプによるエラーが観測されたため、混合精度のトレーニングにはDeepSpeedを使用する必要がある。また、Q-LoRAの場合、LoRAの特殊トークンの問題が残っています。しかし、Q-LoRAではチャットモデルとしてInt4モデルのみを提供しており、言語モデルはChatML形式の特殊トークンを学習しているため、レイヤーの心配はありません。なお、Int4モデルのレイヤーは学習できないはずなので、学習で特殊なトークンを導入すると、Q-LoRAが動作しなくなる可能性があります。 
LoRAとQ-LoRAの学習は、フルパラメータによるファインチューニングとは異なり、アダプターパラメータのみを保存する。仮にQwen-7Bから学習を開始したとすると、以下のようにファインチューニングされたモデルを読み込んで推論を行うことができる: @@ -629,10 +704,27 @@ merged_model = model.merge_and_unload() merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True) ``` +`new_model_directory` ディレクトリには、マージされたモデルの重みとモジュール ファイルが含まれます。 保存されたファイルに `*.cu` および `*.cpp` ファイルが存在しない可能性があることに注意してください。 KVキャッシュ機能を使用したい場合は、手動でコピーしてください。 また、このステップではトークナイザー ファイルは新しいディレクトリに保存されません。 トークナイザー ファイルをコピーするか、次のコードを使用できます。 +```python +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained( + path_to_adapter, # path to the output directory + trust_remote_code=True +) + +tokenizer.save_pretrained(new_model_directory) +``` + 注意:マルチGPUトレーニングの場合、分散トレーニング用の適切なハイパーパラメータをマシンに応じて指定する必要があります。また、データ、メモリフットプリント、トレーニング速度を考慮して、引数 `--model_max_length` で最大シーケンス長を指定することをお勧めします。 ### メモリと速度のプロファイリング -シングルGPUトレーニングのセットアップにおいて、LoRA (LoRA(emb)はembeddingと出力層を学習させるが、LoRAはembeddingと出力層を学習させない) とQ-LoRAのGPUメモリとトレーニング速度をプロファイリングする。このテストでは、シングルA100-SXM4-80G GPUで実験し、CUDA 11.8とPytorch 2.0を使用します。Flash attention 2を使用します。256、512、1024、2048、4096、8192という異なる長さの入力のメモリ(GB)と速度(s/iter)をプロファイリングします。また、2台のA100 GPUを用いたQwen-7Bによるフルパラメータ・ファインチューニングの統計量も報告する。GPUメモリの制限のため、256、512、1024トークンの統計のみを報告する。統計量を以下に示す: +シングルGPUトレーニングのセットアップにおいて、LoRA (LoRA(emb)はembeddingと出力層を学習させるが、LoRAはembeddingと出力層を学習させない) とQ-LoRAのGPUメモリとトレーニング速度をプロファイリングする。このテストでは、シングルA100-SXM4-80G GPUで実験し、CUDA 11.8とPytorch 2.0を使用します。Flash attention 2を使用します。256、512、1024、2048、4096、8192という異なる長さの入力のメモリ(GB)と速度(s/iter)をプロファイリングします。また、2台のA100 GPUを用いたQwen-7Bによるフルパラメータ・ファインチューニングの統計量も報告する。GPUメモリの制限のため、256、512、1024トークンの統計のみを報告する。 + + +Qwen-72B については、2 つの方法で実験します。1) 4 つの A100-SXM4-80G GPU での Lora 微調整 + DeepSpeed ZeRO 3、および 2) 1 つの A100-SXM4-80G GPU での QLora (int4) 微調整。 OOM は、LoRA (emb) 微調整と Deepspeed ZeRO 3 を使用しない LoRA 微調整の両方で 4 つの A100-SXM4-80G GPU で発生することに注意してください (`--deepspeedfinetune/ds_config_zero3.json` を [`finetune/finetune_lora_ds 
に渡すことができます) .sh`](finetune/finetune_lora_ds.sh) を使用して DeepSpeed ZeRO 3 を有効にします)。 + +統計量を以下に示す: @@ -642,6 +734,18 @@ merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_ + + + + + + + + + + + + @@ -664,44 +768,77 @@ merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_ + + + + + +
2565121024204840968192
1.8BLoRA6.7G / 1.0s/it7.4G / 1.0s/it8.4G / 1.1s/it11.0G / 1.7s/it16.2G / 3.3s/it21.8G / 6.8s/it
LoRA (emb)13.7G / 1.0s/it14.0G / 1.0s/it14.0G / 1.1s/it15.1G / 1.8s/it19.7G / 3.4s/it27.7G / 7.0s/it
Q-LoRA5.8G / 1.4s/it6.0G / 1.4s/it6.6G / 1.4s/it7.8G / 2.0s/it10.2G / 3.4s/it15.8G / 6.5s/it
Full-parameter43.5G / 2.1s/it43.5G / 2.2s/it43.5G / 2.2s/it43.5G / 2.3s/it47.1G / 2.8s/it48.3G / 5.6s/it
7BLoRA20.1G / 1.2s/it20.4G / 1.5s/it21.5G / 2.8s/it23.8G / 5.2s/it29.7G / 10.1s/it36.6G / 21.3s/it
Q-LoRA18.7G / 5.3s/it18.4G / 6.3s/it18.9G / 8.2s/it19.9G / 11.8s/it23.0G / 20.1s/it27.9G / 38.3s/it
72BLoRA + Deepspeed Zero3215.4G / 17.6s/it217.7G / 20.5s/it222.6G / 29.4s/it228.8G / 45.7s/it249.0G / 83.4s/it289.2G / 161.5s/it
Q-LoRA61.4G / 27.4s/it61.4G / 31.5s/it62.9G / 41.4s/it64.1G / 59.5s/it68.0G / 97.7s/it75.6G / 179.8s/it

## デプロイ ### vLLM -デプロイメントと高速推論のためには、FastChatとvLLMを使用することをお勧めします。まずパッケージをインストールしてください: +デプロイメントと高速推論のためには、vLLMを使用することをお勧めします。 + +cuda 12.1 および pytorch 2.1 を使用している場合は、次のコマンドを直接使用して vLLM をインストールできます。 ```bash pip install vllm +``` + +それ以外の場合は、公式 vLLM [インストール手順](https://docs.vllm.ai/en/latest/getting_started/installation.html) を参照してください。 + +#### vLLM + Transformer Wrapper + +[ラッパー コード](examples/vllm_wrapper.py) をダウンロードし、複数ラウンドの対話対話のために次のコマンドを実行できます。 (注: 現在は ``model.chat()`` メソッドのみをサポートしています。) + +```python +from vllm_wrapper import vLLMWrapper + +model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1) + +response, history = model.chat(query="你好", history=None) +print(response) +response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history) +print(response) +response, history = model.chat(query="给这个故事起一个标题", history=history) +print(response) +``` +#### vLLM + Web デモ / OpenAI API +FastChat を使用して、Web デモまたは OpenAI API サーバーを起動できます。 まず、FastChat をインストールします。 +``` pip install "fschat[model_worker,webui]" ``` -または、`git clone` と `pip install -e .` を使ってソースからインストールすることもできます。インストールに問題がある場合は、それぞれのドキュメントを読むことを勧める。 -QwenをvLLMとFastChatで実行するには、まず以下の方法でコントローラを起動する必要があります: +vLLM および FastChat で Qwen を実行するには、次の方法でコントローラーを起動する必要があります。 ```bash python -m fastchat.serve.controller ``` それからmodel workerを起動し、推論のためにモデルをロードします。シングルGPU推論の場合は、直接実行できます: ```bash -python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code +python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16 ``` しかし、より高速な推論や大容量メモリーのために複数のGPUでモデルを実行したい場合は、vLLMがサポートするテンソル並列を使用することができます。モデルを4GPUで実行するとすると、コマンドは以下のようになります: ```bash -python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 +python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16 ``` -Model workerを起動したら、Web デモや OpenAI API を好きなように起動できます。ウェブデモの場合は、以下のコマンドを実行します: 
+モデルワーカーを起動した後、起動することができます: + +* Web UI Demo ```bash python -m fastchat.serve.gradio_web_server ``` -OpenAI APIについては、まずOpenAI APIのドキュメントをチェックして、インストールしてください。次にコマンドを実行します: + +* OpenAI API ```bash python -m fastchat.serve.openai_api_server --host localhost --port 8000 ``` -
-## デモ +ただし、vLLM と FastChat の使用が難しい場合は、Web デモ、CLI デモ、および API をデプロイするために提供されている最も簡単な方法を試すことができます。 + ### ウェブ UI @@ -738,63 +875,7 @@ python cli_demo.py


-## API - -APIを通じてQwenを利用する最も簡単な方法は、Alibaba Cloudを通じたDashScope APIサービスです。その使い方を紹介します。さらに、OpenAIスタイルのAPIをご自身のサーバーにデプロイするためのスクリプトも提供しています。 - -### DashScope -DashScopeはAlibaba Cloudが提供する大規模言語モデルAPIサービスで、今回Qwenに対応した。DashScopeの背後にあるモデルは、詳細が提供されていない一時的な社内バージョンであることに注意してください。サービスには `qwen-turbo` と `qwen-plus` があり、前者はより高速に動作し、後者はより優れたパフォーマンスを実現している。詳細はドキュメント [こちら](https://dashscope.aliyun.com) を参照。 - -公式サイト [link](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) で DashScope アカウントを作成し、API キー (AK) を取得してください。AK は環境変数で設定することをお勧めします: -```bash -export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY" -``` -その後、パッケージをインストールし、ドキュメントは [こちら](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk) をクリックしてください。Python をお使いの場合は、pip で DashScope をインストールできます: -```bash -pip install dashscope -``` -JAVA SDKを使用する場合は、この方法でインストールできます: -```xml - - - com.alibaba - dashscope-sdk-java - the-latest-version - -``` -DashScope を使用する最も簡単な方法は、OpenAI API と同様のメッセージを使用する方法です。以下にその例を示す: -```python -import random -from http import HTTPStatus -from dashscope import Generation - - -def call_with_messages(): - messages = [{'role': 'system', 'content': 'You are a helpful assistant.'}, - {'role': 'user', 'content': '如何做西红柿鸡蛋?'}] - gen = Generation() - response = gen.call( - Generation.Models.qwen_turbo, - messages=messages, - seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set - result_format='message', # set the result to be "message" format. 
- ) - return response - - -if __name__ == '__main__': - response = call_with_messages() - if response.status_code == HTTPStatus.OK: - print(response) - else: - print('Request id: %s, Status code: %s, error code: %s, error message: %s' % ( - response.request_id, response.status_code, - response.code, response.message - )) -``` -詳しい使い方は公式サイトをご覧ください。 - -### OpenAI API +### API OpenAI API をベースにローカルAPIをデプロイする方法を提供する(@hanpenggit に感謝)。始める前に、必要なパッケージをインストールしてください: @@ -850,9 +931,128 @@ print(response.choices[0].message.content) **Function Calling** もサポートされています(ただし、今のところ `stream=False` の場合のみ)。使用例](examples/function_call_examples.py) を参照してください。

+## 🐳 Docker + +デプロイプロセスを簡素化するために、あらかじめ環境を構築した docker イメージを提供しています: [qwenllm/qwen](https://hub.docker.com/r/qwenllm/qwen)。ドライバを導入し、モデルファイルをダウンロードするだけで、デモを起動し、OpenAI APIをデプロイし、モデルを微調整することができます。 + +### 準備 + +1. 使用するイメージに応じて、正しいバージョンのNvidiaドライバをインストールしてください: + - `qwenllm/qwen:cu117` (**recommend**): `>= 515.48.07` + - `qwenllm/qwen:cu114` (w/o flash-attention): `>= 470.82.01` + - `qwenllm/qwen:latest`: same as `qwenllm/qwen:cu117` + +2. [Docker](https://docs.docker.com/engine/install/) と [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) をインストールして設定します: + +```bash +# configure docker +sudo systemctl start docker +# test if docker is correctly installed +sudo docker run hello-world + +# configure nvidia-container-toolkit +sudo nvidia-ctk runtime configure --runtime=docker +sudo systemctl restart docker +# test if nvidia-container-toolkit is correctly installed +sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi +``` + +3. モデルのチェックポイントとコードを環境にダウンロードします([こちら](#DownloadModel)を参照)。 + +### デプロイ + +ここでは例として Qwen-7B-Chat を使用する。ウェブ・デモや API を起動する前に、以下のように設定を行います: + +```bash +IMAGE_NAME=qwenllm/qwen:cu117 +PORT=8901 +CHECKPOINT_PATH=/path/to/Qwen-7B-Chat # Path to downloaded model checkpoints and codes +``` +以下のスクリプトがビルドに役立つ: + +* OpenAI API +```bash +bash docker/docker_openai_api.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* Web UI +```bash +bash docker/docker_web_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT} +``` + +* CLI Demo +```bash +bash docker/docker_cli_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} +``` + +上記のコマンドは自動的に必要なイメージをダウンロードし、バックグラウンドでWeb UIデモを起動します(サービスは自動で再起動します)。デモを使用するには、ホスト上で `http://localhost:${PORT}` を開いてください。 + +以下の出力が表示されれば、デモは正常に起動しています: + +```text +Successfully started web demo. Open '...' to try! +Run `docker logs ...` to check demo status. +Run `docker rm -f ...` to stop and remove the demo. 
+``` + +デモの状態を確認したい場合は、`docker logs qwen` を使って出力を表示できる。 + +docker rm -f qwen` でサービスを停止し、コンテナを削除できる。 + + +### ファインチューニング + +ビルド済みのDockerイメージを利用したファインチューニングの方法は、基本的に[前章](#Finetuning)と同じです(すでにイメージに依存関係がインストールされています): + +以下はシングルGPUのLoRAの例です: +```bash +IMAGE_NAME=qwenllm/qwen:cu117 +CHECKPOINT_PATH=/path/to/Qwen-7B # Path to downloaded model checkpoints and codes +#CHECKPOINT_PATH=/path/to/Qwen-7B-Chat-Int4 # Path to downloaded model checkpoints and codes (Q-LoRA) +DATA_PATH=/path/to/data/root # Prepare finetune data at ${DATA_PATH}/example.json +OUTPUT_PATH=/path/to/output/checkpoint # Path to finetune outputs + +# Use all host devices by default +DEVICE=all +# If you need to specify GPUs for training, set device as follow (NOTE: internal quotation marks cannot be omitted) +#DEVICE='"device=0,1,2,3"' + +mkdir -p ${OUTPUT_PATH} + +# Single-GPU LoRA finetuning +docker run --gpus ${DEVICE} --rm --name qwen \ + --mount type=bind,source=${CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-7B \ + --mount type=bind,source=${DATA_PATH},target=/data/shared/Qwen/data \ + --mount type=bind,source=${OUTPUT_PATH},target=/data/shared/Qwen/output_qwen \ + --shm-size=2gb \ + -it ${IMAGE_NAME} \ + bash finetune/finetune_lora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B/ -d /data/shared/Qwen/data/example.json +``` + +例えばシングルGPUのQ-LoRAに変更するには、`docker run`内のbashコマンドを変更するだけでいい: +```bash +bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B-Chat-Int4/ -d /data/shared/Qwen/data/example.json +``` +
+ +## 🔥 システムプロンプト +Qwen-1.8-Chat と Qwen-72B-Chat は、複数回の複雑な対話を伴う多様なシステム プロンプトで完全にトレーニングされているため、さまざまなシステム プロンプトに従い、コンテキストに応じたモデルのカスタマイズを実現し、Qwen-Chat のスケーラビリティをさらに向上させることができます。 + +システム プロンプトを使用すると、Qwen-Chat は **ローリー プレイ**、**言語スタイルの転送**、**タスク設定**、**動作設定**を実現できます。 + +![](assets/system_prompt_ language_style.png) + +![](assets/system_prompt_role_play_en.png) + +詳細については、[サンプルドキュメント](examples/system_prompt.md)を参照してください。 + ## ツールの使用 -Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利用に特化して最適化されており、ユーザは独自の Qwen-7B ベースの LangChain、エージェント、コードインタプリタを構築することができます。ツール利用能力を評価するための評価[ベンチマーク](eval/EVALUATION.md)では、Qwen-7B は安定した性能に達しています。 +Qwen-Chat は、ツールの使用法と関数呼び出し機能に合わせて最適化されています。 ユーザーはエージェント、LangChain アプリケーションを開発し、Python コード インタープリターで Qwen を拡張することもできます。 + +ReAct プロンプトの原則に基づいてツール呼び出しを実装する方法に関するドキュメントを提供しています。[ReAct の例](examples/react_prompt.md) を参照してください。 この原則に基づいて、[openai_api.py](openai_api.py) で関数呼び出しのサポートを提供します。 + +オープンソースの中国語評価ベンチマークでモデルのツール呼び出し機能をテストしたところ、Qwen-Chat が一貫して良好なパフォーマンスを発揮することがわかりました。 @@ -867,17 +1067,21 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利 - Qwen-7B-Chat v1.1 - + +
GPT-3.585%0.8875.0%
Qwen-7B-Chat v1.198%0.917.3%
Qwen-7B-Chat98%0.917.3%
Qwen-14B-Chat98%0.932.4%
+数学的問題解決、データ視覚化、ファイル処理や Web スクレイピングなどのその他の汎用タスクに Python コード インタープリターを使用する Qwen の能力を評価するために、これらの能力を評価するために特別に設計されたベンチマークを作成し、オープンソース化しました。 。 ベンチマークはこの [リンク](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark) で見つけることができます。 + +Qwen は、コード生成時のコードの実行可能性と結果の精度の点で優れたパフォーマンスを発揮することがわかりました。 + - + @@ -924,8 +1128,8 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利 - Qwen-7B-Chat v1.1 - + + @@ -940,7 +1144,7 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利
Using Code Interpreter - Executable Rate of Generated Code (%)Executable Rate of Generated Code (%)
ModelMath↑Visualization↑General↑44.2 65.5
Qwen-7B-Chat v1.1
Qwen-7B-Chat 82.4 64.4 67.2
- + @@ -987,8 +1191,8 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利 - Qwen-7B-Chat v1.1 - + + @@ -1001,16 +1205,13 @@ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利
Using Code Interpreter - Accuracy of Code Execution Results (%)Accuracy of Code Execution Results (%)
ModelMath↑Visualization-Hard↑Visualization-Easy↑21.4 45.6
Qwen-7B-Chat v1.1
Qwen-7B-Chat 41.9 40.5 54.4
- -ReAct プロンプトの書き方や使い方については、[ReAct の例](examples/react_prompt.md)を参照してください。ツールを使用することで、モデルがよりよいタスクを実行できるようになります。 -



-さらに、エージェントとしての能力を示す実験結果を提供する。詳細は [Hugging Face Agent](examples/transformers_agent.md) を参照して下さい。Hugging Face が提供するランモードベンチマークでの性能は以下の通りです: +さらに、Qwenが HuggingFace Agent として機能できることを実証する実験結果も提供します。 詳細については、[ドキュメント例](examples/transformers_agent.md) を参照してください。 Hugging Face が提供する評価データセットにおけるモデルのパフォーマンスは次のとおりです。 @@ -1031,8 +1232,8 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex - Qwen-7B-Chat v1.1 - + + @@ -1058,8 +1259,8 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex - Qwen-7B-Chat v1.1 - + + @@ -1070,7 +1271,11 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex ## 長い文脈の理解 -コンテキストの長さを拡張し、訓練シーケンスの長さのボトルネックを解消するために、NTK を考慮した補間、ウィンドウアテンション、LogN アテンションスケーリングなどの技術を導入し、コンテキストの長さを 8K トークン以上に拡張する。arXiv データセットを用いて PPL 評価による言語モデリング実験を行い、Qwen-7B が長いコンテキストのシナリオにおいて卓越した性能を達成できることを見出した。以下に結果を示します: +コンテキスト長を拡張し、トレーニング シーケンス長のボトルネックを解消するために、NTK 対応補間、ウィンドウ アテンション、LogN アテンション スケーリングなどのいくつかの技術を導入し、Qwen-14B のコンテキスト長を 2K から 8K 以上に拡張します。 トークン、および Qwen-1.8B/7B は 8K から 32K トークンまで。 + +Qwen-72B では、より大きな回転ベースを備えたより長いコンテキストに RoPE を適応させます。 Qwen-72B は、32K トークンの最大コンテキスト長をサポートします。 + +私たちは、PPL 評価を使用して arXiv データセットで言語モデリング実験を実施し、Qwen が長いコンテキストのシナリオで優れたパフォーマンスを達成できることを発見しました。 結果を以下に示します。
StarCoder-15B87.088.068.9
Qwen-7B-Chat v1.187.087.071.5
Qwen-7B-Chat87.087.071.5
Qwen-14B-Chat93.594.487.0
StarCoder-15B97.997.989.6
Qwen-7B-Chat v1.194.794.785.1
Qwen-7B-Chat94.794.785.1
Qwen-14B-Chat97.997.995.5
@@ -1093,10 +1298,13 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex - + - + + + + @@ -1107,8 +1315,24 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex + + + +
Qwen-7B v1.14.233.813.523.317.27181.49Qwen-1.8B5.004.484.133.8917.42433.85
+ dynamic_ntk4.233.813.523.313.233.33+ dynamic_ntk + logn + window_attn5.004.484.143.933.823.83
Qwen-7B4.233.813.523.317.27181.49
+ dynamic_ntk + logn + window_attn4.233.813.523.333.223.17
+ dynamic_ntk + logn + window_attn-3.463.293.183.42-
Qwen-72B---2.832.732.72
+さらに、Qwen-72B-Chat の長文理解能力を検証するために、[L-Eval](https://arxiv.org/abs/2307.11088) (クローズドエンド タスク) でテストしました。 結果は次のとおりです。 + +| Model | Input Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFcition | +|:------------------|:------------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:| +| ChatGPT-3.5-16k | 16K | 60.73 | **63.51** | **84.00** | 61.38 | 78.43 | **12.22** | 64.84 | +| **Qwen-72B-Chat** | 32K | **62.30** | 58.13 | 76.00 | **77.22** | **86.24** | 6.66 | **69.53** | + +私たちは、モデルが入力内のさまざまな位置で情報を取得できるかどうかをテストするために、「干し草の山の中の針」実験 (このアイデアは [@Greg Kamradt](https://twitter.com/GregKamradt/status/1727018183608193393) から来ました) を実施しました。 異なる長さの場合、結果は次のようになります。 +![](assets/qwen_72b_needle_in_a_haystack.png) + +上記の結果は、Qwen-72B-Chat が 32K の入力長内でさまざまな位置に配置された情報を正確に取得できることを示しており、その優れた長文理解能力を証明しています。 + ## トークナイザー tiktoken に基づくトークナイザーは、他のトークナイザー、例えばセンテンスピーストークナイザーとは異なります。特にファインチューニングの際には、特殊なトークンに注意を払う必要があります。トークナイザに関する詳細な情報や、ファインチューニングにおける使用方法については、[ドキュメント](tokenization_note_ja.md)を参照してください。 @@ -1139,7 +1363,13 @@ tiktoken に基づくトークナイザーは、他のトークナイザー、 ## ライセンス契約 -Qwen と Qwen-Chat のコードとモデルウェイトは、研究者や開発者が自由に使用することができます。また、商用利用も可能です。詳しくは [LICENSE](LICENSE) をご覧ください。商用利用を希望される方は、リクエストフォーム([7B](https://dashscope.console.aliyun.com/openModelApply/qianwen), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat))に必要事項をご記入の上、お申し込みください。 +で提供されるソースコードは、ルートディレクトリにある[Apache 2.0 License](./LICENSE)の下でライセンスされています。 + +研究者や開発者は、QwenとQwen-Chatのコードとモデルウェイトを自由に使用することができます。商用利用については、各モデルに添付されている使用許諾契約書をご確認ください。 + +- Qwen-72B、Qwen-14B、Qwen-7Bは、対応するHuggingFaceとModelScopeのリポジトリにある[Tongyi Qianwen LICENSE AGREEMENT](./Tongyi%20Qianwen%20LICENSE%20AGREEMENT)に基づいてライセンスされています。商用利用の場合は、フォーム([72B](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat), [7B](https://dashscope.console.aliyun.com/openModelApply/qianwen))に記入して申請してください。 + +- 
Qwen-1.8Bは、対応するHuggingFaceとModelScopeのリポジトリにある[Tongyi Qianwen RESEARCH LICENSE AGREEMENT](./Tongyi%20Qianwen%20RESEARCH%20LICENSE%20AGREEMENT)に基づいてライセンスされています。商用利用については、私たちにご連絡ください。

## お問い合わせ diff --git a/Tongyi Qianwen LICENSE AGREEMENT b/Tongyi Qianwen LICENSE AGREEMENT new file mode 100644 index 0000000..5be3338 --- /dev/null +++ b/Tongyi Qianwen LICENSE AGREEMENT @@ -0,0 +1,53 @@ +Tongyi Qianwen LICENSE AGREEMENT + +Tongyi Qianwen Release Date: August 3, 2023 + +By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately. + +1. Definitions + a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement. + b. "We"(or "Us") shall mean Alibaba Cloud. + c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use. + d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You. + e. "Tongyi Qianwen" shall mean the large language models (including Qwen model and Qwen-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us. + f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement. + g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files. + h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, + and conversions to other media types. 
+ +2. Grant of Rights +You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials. + +3. Redistribution +You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: + a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement; + b. You shall cause any modified files to carry prominent notices stating that You changed the files; + c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and + d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement. + +4. Restrictions +If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization. + +5. Rules of use + a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials. + b. 
You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof). + +6. Intellectual Property + a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications. + b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials. + c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought. + +7. Disclaimer of Warranty and Limitation of Liability + + a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto. + b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM. + c. 
IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED. + d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials. + +8. Survival and Termination. + a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. + b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement. + +9. Governing Law and Jurisdiction. + a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. + b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement. \ No newline at end of file diff --git a/Tongyi Qianwen RESEARCH LICENSE AGREEMENT b/Tongyi Qianwen RESEARCH LICENSE AGREEMENT new file mode 100644 index 0000000..dc3f801 --- /dev/null +++ b/Tongyi Qianwen RESEARCH LICENSE AGREEMENT @@ -0,0 +1,55 @@ +Tongyi Qianwen RESEARCH LICENSE AGREEMENT + +Tongyi Qianwen Release Date: November 30, 2023 + +By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately. + +1. Definitions + a. 
This Tongyi Qianwen RESEARCH LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement. + b. "We"(or "Us") shall mean Alibaba Cloud. + c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use. + d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You. + e. "Tongyi Qianwen" shall mean the large language models, and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us. + f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement. + g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files. + h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, + and conversions to other media types. + i. "Non-Commercial" shall mean for research or evaluation purposes only. + +2. Grant of Rights + a. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials FOR NON-COMMERCIAL PURPOSES ONLY. + b. If you are commercially using the Materials, You shall request a license from Us. + +3. 
Redistribution +You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: + a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement; + b. You shall cause any modified files to carry prominent notices stating that You changed the files; + c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and + d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement. + +4. Rules of use + a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials. + b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof). + +5. Intellectual Property + a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications. + b. 
No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials. + c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought. + +6. Disclaimer of Warranty and Limitation of Liability + a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto. + b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM. + c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED. + d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials. + +7. Survival and Termination. + a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. + b. 
We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 6 and 8 shall survive the termination of this Agreement. + +8. Governing Law and Jurisdiction. + a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. + b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement. + +9. Other Terms and Conditions. + a. Any arrangements, understandings, or agreements regarding the Material not stated herein are separate from and independent of the terms and conditions of this Agreement. You shall request a separate license from Us, if You use the Materials in ways not expressly agreed to in this Agreement. + b. We shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed. 
diff --git a/ascend-support/README.md b/ascend-support/README.md new file mode 100644 index 0000000..cd67ecd --- /dev/null +++ b/ascend-support/README.md @@ -0,0 +1,45 @@ +# Qwen-7B-Chat inference on the Ascend 910 architecture with mindformers + +## Requirements + +- Hardware: Ascend 910A/B + +## Steps + +First, follow the Qwen README to download the official model to `/path/to/Qwen-7B-Chat`. + +### Pull and start the image + +```bash +docker pull qwenllm/qwen-mindspore:latest + +cd /path/to/Qwen/ascend-support + +# Path where the model was downloaded +CHECKPOINT_PATH=/path/to/Qwen-7B-Chat + +cd ascend-support + +# Start the docker container +bash docker_qwen.sh -c ${CHECKPOINT_PATH} +``` + +### Convert the weights + +Run the following command inside the container to convert the Qwen model to the format expected by `mindformers`: + +```bash +python3 /data/qwen/mindformers/research/qwen/convert_weight.py +``` + +The converted model is written to `${CHECKPOINT_PATH}/qwen-7b-chat.ckpt`. + +### Run inference + +Run the following commands inside the container to run inference: + +```bash +cd /data/qwen/mindformers/research/qwen +export PYTHONPATH=/data/qwen/mindformers:$PYTHONPATH +python3 infer_qwen.py +``` diff --git a/ascend-support/docker_qwen.sh b/ascend-support/docker_qwen.sh new file mode 100644 index 0000000..b615f7b --- /dev/null +++ b/ascend-support/docker_qwen.sh @@ -0,0 +1,61 @@ +#!/bin/bash + +IMAGE_NAME=qwenllm/qwen-mindspore:v23.0.RC3 +CONTAINER_NAME=qwen-mindspore +CHECKPOINT_PATH='NOT_SET' + +DOCKER_CHECKPOINT_PATH=/data/qwen/models/Qwen-7B-Chat + +function usage() { + echo ' +Usage: bash ascend-support/docker_qwen.sh [-i IMAGE_NAME] -c [/path/to/Qwen-7B-Chat] [-n CONTAINER_NAME] +' +} + +while [[ "$1" != "" ]]; do + case $1 in + -i | --image ) + shift + IMAGE_NAME=$1 + ;; + -c | --checkpoint ) + shift + CHECKPOINT_PATH=$1 + ;; + -n | --name ) + shift + CONTAINER_NAME=$1 + ;; + -h ) + usage + exit + ;; + * ) + echo "Unknown argument ${1}" + exit 1 + ;; + esac + shift +done + +docker run -it --rm -u root --network=host --ipc=host \ + --device=/dev/davinci0 \ + --device=/dev/davinci1 \ + --device=/dev/davinci2 \ + --device=/dev/davinci3 \ + --device=/dev/davinci4 \ + --device=/dev/davinci5 \ + --device=/dev/davinci6 \ + --device=/dev/davinci7 \ + 
--name=${CONTAINER_NAME} \ + --device=/dev/davinci_manager \ + --device=/dev/devmm_svm \ + --device=/dev/hisi_hdc \ + -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ + -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \ + -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ + -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \ + -v /etc/ascend_install.info:/etc/ascend_install.info \ + -v ${CHECKPOINT_PATH}:${DOCKER_CHECKPOINT_PATH} \ + -v /var/log/npu/:/usr/slog \ + ${IMAGE_NAME} /bin/bash diff --git a/assets/logo.jpg b/assets/logo.jpg index f12d561..a4b24c6 100644 Binary files a/assets/logo.jpg and b/assets/logo.jpg differ diff --git a/assets/qwen_72b_needle_in_a_haystack.png b/assets/qwen_72b_needle_in_a_haystack.png new file mode 100644 index 0000000..8bc6390 Binary files /dev/null and b/assets/qwen_72b_needle_in_a_haystack.png differ diff --git a/assets/radar_72b.jpg b/assets/radar_72b.jpg new file mode 100644 index 0000000..743e68f Binary files /dev/null and b/assets/radar_72b.jpg differ diff --git a/assets/system_prompt_behavior_setting.png b/assets/system_prompt_behavior_setting.png new file mode 100644 index 0000000..1900fdd Binary files /dev/null and b/assets/system_prompt_behavior_setting.png differ diff --git a/assets/system_prompt_behavior_setting_en.png b/assets/system_prompt_behavior_setting_en.png new file mode 100644 index 0000000..2e26df1 Binary files /dev/null and b/assets/system_prompt_behavior_setting_en.png differ diff --git a/assets/system_prompt_language_style.png b/assets/system_prompt_language_style.png new file mode 100644 index 0000000..265defb Binary files /dev/null and b/assets/system_prompt_language_style.png differ diff --git a/assets/system_prompt_language_style_en.png b/assets/system_prompt_language_style_en.png new file mode 100644 index 0000000..7224a9c Binary files /dev/null and b/assets/system_prompt_language_style_en.png differ diff --git a/assets/system_prompt_role_play.png b/assets/system_prompt_role_play.png new 
file mode 100644 index 0000000..447429c Binary files /dev/null and b/assets/system_prompt_role_play.png differ diff --git a/assets/system_prompt_role_play_en.png b/assets/system_prompt_role_play_en.png new file mode 100644 index 0000000..3451883 Binary files /dev/null and b/assets/system_prompt_role_play_en.png differ diff --git a/assets/system_prompt_task_setting.png b/assets/system_prompt_task_setting.png new file mode 100644 index 0000000..9d69e3a Binary files /dev/null and b/assets/system_prompt_task_setting.png differ diff --git a/assets/system_prompt_task_setting_en.png b/assets/system_prompt_task_setting_en.png new file mode 100644 index 0000000..57f2fd3 Binary files /dev/null and b/assets/system_prompt_task_setting_en.png differ diff --git a/dcu-support/README.md b/dcu-support/README.md new file mode 100644 index 0000000..dc508fd --- /dev/null +++ b/dcu-support/README.md @@ -0,0 +1,64 @@ +# Qwen inference on the DCU architecture with fastllm + + +## Environment setup + +### Prepare the environment + +``` +docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest +``` + +### Start the container + +Start the inference container with the command below; set a custom container name after `--name=` and the path of this directory before `:/work`: +``` +# custom container name +# path of the current project +docker run -it --name= -v :/work --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --shm-size=16G --group-add 39 image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest /bin/bash +``` + +### Load the environment + +After entering the container, run the following command to load the runtime environment variables: + +``` +source /opt/dtk-23.04/cuda/env.sh +``` + +### Installation + +``` +# Enter this project directory +cd package +python setup.py install +``` + +## Inference + +### Model conversion + +First, follow the Qwen README to download the official model, then convert it to the format fastllm uses for inference as follows: + +- Install the dependencies required for model conversion with `pip install -r requirements.txt` + +- If you use an already downloaded model or a model you fine-tuned yourself, change the model path used to create the tokenizer and model in qwen2flm.py + +``` +# Run in this project directory: +python3 qwen2flm.py qwen-7b-fp16.bin float16 # Export the fp16 model; the argument is the output model path +``` + + +### Model inference + +``` +# Command-line chat program, demonstrating model creation and streaming chat +python cli_demo.py -p qwen-7b-fp16.bin + +# Batch inference program +python cli_demo_batch.py -p 
qwen-7b-fp16.bin + +# Simple web UI; install streamlit-chat first +streamlit run web_demo.py qwen-7b-fp16.bin +``` diff --git a/dcu-support/cli_demo.py b/dcu-support/cli_demo.py new file mode 100644 index 0000000..33e7eb4 --- /dev/null +++ b/dcu-support/cli_demo.py @@ -0,0 +1,30 @@ +# coding=utf-8 +import argparse +from fastllm_pytools import llm + +def args_parser(): + parser = argparse.ArgumentParser(description = 'qwen_chat_demo') + parser.add_argument('-p', '--path', type = str, required = True, default = '', help = 'path to the model file') + args = parser.parse_args() + return args + +if __name__ == "__main__": + args = args_parser() + model = llm.model(args.path) + + history = [] + print("Type a message to chat; enter clear to reset the conversation history, stop to exit") + while True: + query = input("\nUser: ") + if query.strip() == "stop": + break + if query.strip() == "clear": + history = [] + print("Type a message to chat; enter clear to reset the conversation history, stop to exit") + continue + print("AI: ", end = "") + curResponse = "" + for response in model.stream_response(query, history = history, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0): + curResponse += response + print(response, flush = True, end = "") + history.append((query, curResponse)) \ No newline at end of file diff --git a/dcu-support/cli_demo_batch.py b/dcu-support/cli_demo_batch.py new file mode 100644 index 0000000..af163b5 --- /dev/null +++ b/dcu-support/cli_demo_batch.py @@ -0,0 +1,39 @@ +import argparse +from fastllm_pytools import llm +import time + +def args_parser(): + parser = argparse.ArgumentParser(description = 'fastllm_chat_demo') + parser.add_argument('-p', '--path', type = str, required = True, default = '', help = 'path to the model file') + args = parser.parse_args() + return args + +if __name__ == "__main__": + args = args_parser() + + model_path = args.path + + prompts = ["深圳有什么好玩的", "上海有什么好玩的", "晚上睡不着怎么办", "南京有什么好吃的"] * 2 + print(prompts) + + responses, historys = [], [] + + model = llm.model(model_path) + + t0 = time.time() + responses, historys = 
model.response_batch(prompts) + t1 = time.time() + + token_output_count = 0 + word_len = 0 + for i, res in enumerate(responses): + tokens = model.tokenizer_encode_string(res) + token_output_count += len(tokens) + word_len += len(res) + + print("batch index: ", i) + print(res) + print("") + + print("\ntoken/s: {:.2f}, character/s: {:.2f}".format(token_output_count/(t1-t0), word_len/(t1-t0))) + diff --git a/dcu-support/model.properties b/dcu-support/model.properties new file mode 100644 index 0000000..fb34a5c --- /dev/null +++ b/dcu-support/model.properties @@ -0,0 +1,10 @@ +# Unique model identifier +modelCode = 411 +# Model name +modelName=qwen-7b_fastllm +# Model description +modelDescription=qwen-7b是阿里云研发的通义千问大模型系列的70亿参数规模的模型 +# Application scenarios +appScenario=推理,对话问答,医疗,科研,金融,教育 +# Framework type +frameType=fastllm diff --git a/dcu-support/package/fastllm_pytools/__init__.py b/dcu-support/package/fastllm_pytools/__init__.py new file mode 100644 index 0000000..012c8ae --- /dev/null +++ b/dcu-support/package/fastllm_pytools/__init__.py @@ -0,0 +1 @@ +__all__ = ["llm"] \ No newline at end of file diff --git a/dcu-support/package/fastllm_pytools/hf_model.py b/dcu-support/package/fastllm_pytools/hf_model.py new file mode 100644 index 0000000..3a952ec --- /dev/null +++ b/dcu-support/package/fastllm_pytools/hf_model.py @@ -0,0 +1,154 @@ +from fastllm_pytools import llm; +import torch; +import ctypes; +import numpy as np; + +fastllm_data_type_dict = { + "int4": 8, + "int8": 3, + "float16": 7 +} +fastllm_weight_type_dict = { + "linear": 1, + "embedding": 2, + "QuantizedLinear": 111 +} + +def create(model, + tokenizer = None, + pre_prompt = None, + user_role = None, + bot_role = None, + history_sep = None, + dtype = "float16"): + if (dtype not in fastllm_data_type_dict): + print("dtype should be in ", list(fastllm_data_type_dict.keys())); + exit(0); + + # 0.1 model info + if model.config.model_type == "chatglm" and model.config.transformers_version == "4.30.2": + model.config.model_type = "chatglm3" + modelInfo = 
model.config.__dict__ + if model.generation_config is not None: + modelInfo.update(model.generation_config.__dict__) + if (pre_prompt): + modelInfo["pre_prompt"] = pre_prompt; + if (user_role): + modelInfo["user_role"] = user_role; + if (bot_role): + modelInfo["bot_role"] = bot_role; + if (history_sep): + modelInfo["history_sep"] = history_sep; + if (modelInfo["model_type"] == "baichuan" and hasattr(model, "model") and hasattr(model.model, "get_alibi_mask")): + # Baichuan 2 (second generation) + modelInfo["use_alibi"] = "1"; + modelInfo["pre_prompt"] = ""; + modelInfo["user_role"] = (" ") if hasattr(model.generation_config, "user_token_id") else ""; + modelInfo["bot_role"] = ("") if hasattr(model.generation_config, "assistant_token_id") else ""; + modelInfo["history_sep"] = ""; + if (modelInfo["model_type"] == "qwen"): + if modelInfo["chat_format"] == "chatml": + modelInfo["im_end_id"] = tokenizer.im_end_id + modelInfo["im_start_id"] = tokenizer.im_start_id + + + weight_type_dict = {}; + module_dict = {}; + weight_bits = {}; + for key, m in model.named_modules(): + if (str(type(m)).find("QuantizedLinear") != -1): + weight_type_dict[key + ".weight"] = "QuantizedLinear"; + weight_bits[key + ".weight"] = m.weight_bit_width; + if (isinstance(m, torch.nn.Linear)): + weight_type_dict[key + ".weight"] = "linear"; + module_dict[key + ".weight"] = m; + if (isinstance(m, torch.nn.Embedding)): + weight_type_dict[key] = "embedding"; + + peft_config = {} + active_adapter = "" + if hasattr(model, "peft_config"): + peft_config = model.peft_config + if hasattr(model, "active_adapter") and isinstance(model.active_adapter, str): + # in transformers >= 4.33.0, active_adapter is a function in model, ignore it now + active_adapter = model.active_adapter + + model = model.cpu(); + dict = model.state_dict(); + model_type = model.config.__dict__["model_type"]; + model = llm.fastllm_lib.create_empty_llm_model(model_type.encode()); + for it in modelInfo.keys(): + llm.fastllm_lib.add_dict_llm_model(model, 
str(it).encode(), str(modelInfo[it]).encode()); + + for adapter_name in peft_config.keys(): + adapter_dict = peft_config[adapter_name].__dict__ + for it in adapter_dict.keys(): + llm.fastllm_lib.add_adapter_dict_llm_model(model, str(adapter_name).encode(), str(it).encode(), str(adapter_dict[it]).encode()) + if len(active_adapter) != 0: + llm.fastllm_lib.set_adapter(model, str(active_adapter).encode()) + + # 1. vocab + if (tokenizer): + if (hasattr(tokenizer, "tokenizer")): + if modelInfo["model_type"] == "qwen": + pass + else: + tokenizer = tokenizer.tokenizer; + if (hasattr(tokenizer, "sp_model")): + piece_size = tokenizer.sp_model.piece_size(); + for i in range(piece_size): + llm.fastllm_lib.add_tokenizer_word_llm_model(model, tokenizer.sp_model.id_to_piece(i).encode(), + i, ctypes.c_float(tokenizer.sp_model.get_score(i))); + else: + vocab = tokenizer.get_vocab(); + for v in vocab.keys(): + if (modelInfo["model_type"] == "moss"): + vv = [(ord(c) if c not in tokenizer.byte_decoder else tokenizer.byte_decoder[c]) for c in v]; + llm.fastllm_lib.add_tokenizer_word_llm_model(model, vv, vocab[v], ctypes.c_float(1.0)); + elif (modelInfo["model_type"] == "qwen"): + llm.fastllm_lib.add_tokenizer_word_llm_model(model, v, vocab[v], ctypes.c_float(1.0)); + else: + llm.fastllm_lib.add_tokenizer_word_llm_model(model, v.encode(), vocab[v], ctypes.c_float(1.0)); + tot = 0; + for key in dict: + ori_data_type = 0; + ori_np_data_type = np.float32; + cur_weight_type = 0; + if (key in weight_type_dict and weight_type_dict[key] in fastllm_weight_type_dict): + cur_weight_type = fastllm_weight_type_dict[weight_type_dict[key]]; + to_data_type = 0; + + if (cur_weight_type == 1): + to_data_type = fastllm_data_type_dict[dtype]; + if (to_data_type == 7): + ori_data_type = 7; + ori_np_data_type = np.float16; + elif (cur_weight_type == 2): + # TODO bfloat + to_data_type = 0; + + weight_name = key + if peft_config is not None: + weight_name = weight_name.replace('base_model.model.', '') + if 
(cur_weight_type == 111): + llm.fastllm_lib.add_qlinear_weight_llm_model(model, weight_name.encode(), + len(dict[key].shape), + (ctypes.c_int * len(dict[key].shape))(*list(dict[key].shape)), + weight_bits[key], + dict[key + "_scale"].numpy().astype(np.float32).ctypes.data_as(ctypes.c_void_p), + dict[key].numpy().ctypes.data_as(ctypes.c_void_p)); + else: + llm.fastllm_lib.add_weight_llm_model(model, weight_name.encode(), + len(dict[key].shape), + (ctypes.c_int * len(dict[key].shape))(*list(dict[key].shape)), + to_data_type, cur_weight_type, ori_data_type, + dict[key].numpy().astype(ori_np_data_type).ctypes.data_as(ctypes.c_void_p)); + tot += 1; + print("convert (", tot, "/", len(dict), end = " )\r"); + + print(""); + llm.fastllm_lib.init_params_llm_model(model); + llm.fastllm_lib.warmup_llm_model(model); + ret = llm.model("", id = model); + return ret; + diff --git a/dcu-support/package/fastllm_pytools/llm.py b/dcu-support/package/fastllm_pytools/llm.py new file mode 100644 index 0000000..b0e186c --- /dev/null +++ b/dcu-support/package/fastllm_pytools/llm.py @@ -0,0 +1,495 @@ +import ctypes; +import math +import os; +import threading +from typing import Optional, Tuple, Union, List, Callable, Dict, Any; +from copy import deepcopy + +import platform +if platform.system() == 'Windows': + fastllm_lib = ctypes.cdll.LoadLibrary(os.path.join(os.path.split(os.path.realpath(__file__))[0], "fastllm_tools.dll")) +else: + fastllm_lib = ctypes.cdll.LoadLibrary(os.path.join(os.path.split(os.path.realpath(__file__))[0], "libfastllm_tools.so")) + +fastllm_lib.create_llm_model.argtypes = [ctypes.c_char_p] +fastllm_lib.create_llm_model.restype = ctypes.c_int + +fastllm_lib.token_decode.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_char_p] +fastllm_lib.token_decode.restype = ctypes.c_int + +fastllm_lib.token_encode_string.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)] +fastllm_lib.token_encode_string.restype = ctypes.c_int + 
+fastllm_lib.launch_response_llm_model.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_void_p, + ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int, + ctypes.c_float, ctypes.c_float, ctypes.c_bool] +fastllm_lib.launch_response_llm_model.restype = ctypes.c_int + +fastllm_lib.fetch_response_llm_model.argtypes = [ctypes.c_int, ctypes.c_int] +fastllm_lib.fetch_response_llm_model.restype = ctypes.c_int + +fastllm_lib.fetch_response_logits_llm_model.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.POINTER(ctypes.c_float)] +fastllm_lib.fetch_response_logits_llm_model.restype = ctypes.c_int + +fastllm_lib.response_str_llm_model.argtypes = [ctypes.c_int, ctypes.c_char_p, + ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int, + ctypes.c_float, ctypes.c_float, ctypes.c_bool] +fastllm_lib.response_str_llm_model.restype = ctypes.c_char_p + +fastllm_lib.launch_response_str_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, + ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int, + ctypes.c_float, ctypes.c_float, ctypes.c_bool] +fastllm_lib.launch_response_str_llm_model.restype = ctypes.c_int + +fastllm_lib.fetch_response_str_llm_model.argtypes = [ctypes.c_int, ctypes.c_int] +fastllm_lib.fetch_response_str_llm_model.restype = ctypes.c_char_p + +fastllm_lib.make_history_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p] +fastllm_lib.make_history_llm_model.restype = ctypes.c_char_p + +fastllm_lib.make_input_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p] +fastllm_lib.make_input_llm_model.restype = ctypes.c_char_p + +fastllm_lib.add_tokenizer_word_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, ctypes.c_float, ctypes.c_int] + +fastllm_lib.set_device_map.argtype = [ctypes.c_int, ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p] + +fastllm_lib.get_llm_model_type.argtype = [ctypes.c_int] +fastllm_lib.get_llm_model_type.restype = ctypes.c_char_p + 
+fastllm_lib.response_batch_str_llm_model.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p), ctypes.c_int, + ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int, + ctypes.c_float, ctypes.c_float, ctypes.c_bool] +fastllm_lib.response_batch_str_llm_model.restype = ctypes.POINTER(ctypes.c_char_p) + +fastllm_lib.response_batch_tokens_llm_model.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int), + ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int, + ctypes.c_float, ctypes.c_float, ctypes.c_bool] +fastllm_lib.response_batch_tokens_llm_model.restype = ctypes.POINTER(ctypes.c_char_p) + +def set_cpu_threads(threads: int): + fastllm_lib.set_cpu_threads(threads); + +def get_cpu_threads() -> int: + return fastllm_lib.get_cpu_threads(); + +def print_ins_info(): + fastllm_lib.print_cpu_ins(); + +def set_cpu_kvcache(cpu_kvcache): + fastllm_lib.set_kvcache_in_cpu(ctypes.c_bool(cpu_kvcache)); + +def get_cpu_kvcache(): + return fastllm_lib.get_kvcache_in_cpu(); + +def set_cpu_low_mem(low_mem): + fastllm_lib.set_cpu_low_mem(ctypes.c_bool(low_mem)); + +def get_cpu_low_mem(): + return fastllm_lib.get_cpu_low_mem(); + +def set_device_map(device_map): + devices = []; + values = []; + if (isinstance(device_map, str)): + devices.append(device_map); + values.append(1); + elif (isinstance(device_map, list)): + devices = [str(x) for x in device_map]; + values = [1 for x in device_map]; + elif (isinstance(device_map, dict)): + devices = [str(x) for x in device_map.keys()]; + values = [int(device_map[x]) for x in device_map.keys()]; + else: + print("set_device_map error."); + return; + device_str = ''.join(devices); + device_len = [len(x) for x in devices]; + fastllm_lib.set_device_map(len(device_len), + (ctypes.c_int * len(device_len))(*device_len), + device_str.encode(), + (ctypes.c_int * len(values))(*values)); +def from_hf(model, + tokenizer = None, + dtype = "float16"): + from fastllm_pytools import hf_model; + return 
hf_model.create(model, tokenizer, dtype = dtype); + +class model: + def __init__ (self, path : str, + id : int = -99999): + if (id != -99999): + self.model = id; + else: + self.model = fastllm_lib.create_llm_model(path.encode()); + self.direct_query = False; + + # Thread-local object pool used to avoid repeatedly allocating and freeing buffer objects + self.thread_local_obj = threading.local() + self.thread_local_obj.tokenizer_encode_string__output_buffer = None + self.thread_local_obj.tokenizer_decode_token__output_buffer = None + + # Static cache of tokenizer_decode_token results, built manually on demand + # The number of tokens is limited and fairly small, so caching the results to reduce calls is a good fit. + # The cache is not built automatically, to avoid locking the cache dict under multi-threaded calls and to leave the choice to each use case + self.tokenizer_decode_token_cache = None + + self.model_type = fastllm_lib.get_llm_model_type(self.model).decode() + # print("model_type:", self.model_type) + + def get_prompt(self, + query: str, + history: List[Tuple[str, str]] = None) -> str: + if (not(history)): + history = []; + prompt = ""; + for i, (old_query, response) in enumerate(history): + prompt = fastllm_lib.make_history_llm_model(self.model, prompt.encode(), i, old_query.encode(), response.encode()).decode(); + prompt = fastllm_lib.make_input_llm_model(self.model, prompt.encode(), len(history), query.encode()).decode(); + return prompt; + + def save(self, path : str): + fastllm_lib.save_llm_model(self.model, path.encode()); + + def eval(self): + pass; + + def build_tokenizer_decode_token_cache(self): + if self.tokenizer_decode_token_cache is not None: + return + + cache_dict = dict() + vocab_size = fastllm_lib.get_tokenizer_vocab_size(self.model) + for token_id in range(vocab_size): + cache_dict[token_id] = self.tokenizer_decode_token(token_id) + + self.tokenizer_decode_token_cache = cache_dict + + def tokenizer_encode_string(self, content: str) -> List[int]: + output_buffer_init_len = 1024 + if self.thread_local_obj.tokenizer_encode_string__output_buffer is None: + self.thread_local_obj.tokenizer_encode_string__output_buffer = (ctypes.c_int * output_buffer_init_len)() + + buffer = 
self.thread_local_obj.tokenizer_encode_string__output_buffer + buffer_len = len(buffer) + result_len = fastllm_lib.token_encode_string(self.model, content.encode(), buffer_len, buffer) + if result_len > buffer_len: + if result_len > 10240: + # The input is too long; use a one-off buffer + temp_buffer = (ctypes.c_int * result_len)() + ret = fastllm_lib.token_encode_string(self.model, content.encode(), result_len, temp_buffer) + return [i for i in temp_buffer] + else: + # Grow the buffer + new_buffer_len = round(math.ceil(result_len / 1024.0)) * 1024 + buffer = (ctypes.c_int * new_buffer_len)() + self.thread_local_obj.tokenizer_encode_string__output_buffer = buffer + result_len = fastllm_lib.token_encode_string(self.model, content.encode(), new_buffer_len, buffer) + + return [buffer[i] for i in range(result_len)] + + def tokenizer_decode_token(self, token_id: int) -> bytes: + if self.tokenizer_decode_token_cache is not None: + cache_result = self.tokenizer_decode_token_cache.get(token_id) + if cache_result is not None: + return cache_result + + output_buffer_init_len = 256 + if self.thread_local_obj.tokenizer_decode_token__output_buffer is None: + self.thread_local_obj.tokenizer_decode_token__output_buffer = ctypes.create_string_buffer(output_buffer_init_len) + + buffer = self.thread_local_obj.tokenizer_decode_token__output_buffer + ret = fastllm_lib.token_decode(self.model, token_id, len(buffer), buffer) + if ret > 0: + # The buffer is too small; grow it + new_buffer_len = round(math.ceil(ret / 16.0)) * 16 + buffer = ctypes.create_string_buffer(new_buffer_len) + self.thread_local_obj.tokenizer_decode_token__output_buffer = buffer + ret = fastllm_lib.token_decode(self.model, token_id, len(buffer), buffer) + assert ret == 0 + + buffer_bytes = buffer.raw + result_len = len(buffer_bytes) + for i in range(len(buffer_bytes)): + if buffer_bytes[i] == 0: + result_len = i + break + return buffer_bytes[:result_len] + + def response_logits(self, + query: str, + history: List[Tuple[str, str]] = None, + tokenizer = 
None) -> str: + prompt = query if self.direct_query else self.get_prompt(query, history); + if (tokenizer == None): + handle = fastllm_lib.launch_response_str_llm_model(self.model, prompt.encode(), + ctypes.c_int(1), ctypes.c_bool(False), ctypes.c_float(1), ctypes.c_int(1), + ctypes.c_float(1), ctypes.c_float(1), ctypes.c_bool(True)); + else: + input = tokenizer.encode(prompt); + handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input), + 1, False, 1, 1, 1, 1, True); + vocab_size = fastllm_lib.get_tokenizer_vocab_size(self.model); + logits = list(range(vocab_size)) + array = (ctypes.c_float * (vocab_size * 4))(*logits); + ret = fastllm_lib.fetch_response_logits_llm_model(self.model, handle, array); + out = list(array)[:vocab_size]; + while (ret != -1): + ret = fastllm_lib.fetch_response_logits_llm_model(self.model, handle, array); + return out; + + def response(self, + query: str, + history: List[Tuple[str, str]] = None, + max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0) -> str: + ret = ""; + for i in self.stream_response(query = query, + history = history, + max_length = max_length, + do_sample = do_sample, + top_p = top_p, top_k = top_k, + temperature = temperature, + repeat_penalty = repeat_penalty, + one_by_one = True): + ret += i; + return ret; + + def stream_response(self, + query: str, + history: List[Tuple[str, str]] = None, + max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0, + one_by_one = True): + prompt = query if self.direct_query else self.get_prompt(query, history); + handle = fastllm_lib.launch_response_str_llm_model(self.model, prompt.encode(), + ctypes.c_int(max_length), ctypes.c_bool(do_sample), ctypes.c_float(top_p), ctypes.c_int(top_k), + ctypes.c_float(temperature), ctypes.c_float(repeat_penalty), ctypes.c_bool(False)); + res = ""; + ret = b''; + fail_cnt = 0; + while True: + ret += 
fastllm_lib.fetch_response_str_llm_model(self.model, handle); + cur = ""; + try: + cur = ret.decode(); + ret = b''; + except: + fail_cnt += 1; + if (fail_cnt == 20): + break; + else: + continue; + fail_cnt = 0; + if (cur == ""): + break; + if one_by_one: + yield cur; + else: + res += cur; + yield res; + + def stream_response_raw(self, + input_tokens: List[int], + max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0, + one_by_one = True + ): + handle = fastllm_lib.launch_response_llm_model(self.model, len(input_tokens), + (ctypes.c_int * len(input_tokens))(*input_tokens), + ctypes.c_int(max_length), ctypes.c_bool(do_sample), ctypes.c_float(top_p), ctypes.c_int(top_k), + ctypes.c_float(temperature), ctypes.c_float(repeat_penalty), ctypes.c_bool(False)) + + # A rare character may need several tokens to be generated, so only bytes are returned and the string decoding strategy is left to the caller + # This makes it easier to count output tokens and to control decoding of incomplete UTF-8 + + total_bytes = b'' + while True: + cur_token = fastllm_lib.fetch_response_llm_model(self.model, handle) + if cur_token == -1: + break + + cur_bytes = self.tokenizer_decode_token(cur_token) + + if one_by_one: + yield cur_bytes + else: + total_bytes += cur_bytes + yield total_bytes + + def chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, max_length: int = 8192, + do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0, **kwargs): + if self.model_type != "chatglm3": + if (not(history)): + history = []; + prompt = query if self.direct_query else self.get_prompt(query, history); + input = tokenizer.encode(prompt); + handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input), + max_length, do_sample, top_p, top_k, temperature, repeat_penalty, + False); + + result = []; + while True: + cur = fastllm_lib.fetch_response_llm_model(self.model, handle); + if (cur == -1): + break; + result.append(cur); + response = tokenizer.decode(result); + history = history + [(query, response)]; + 
return response, history; + else: + if history is None: + history = [] + role = "user" + input = self.build_chatglm3_input(tokenizer, query, history=history, role=role) + history.append({"role": role, "content": query}) + + handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input), + max_length, do_sample, top_p, top_k, temperature, repeat_penalty, + False); + tokens = []; + while True: + cur = fastllm_lib.fetch_response_llm_model(self.model, handle); + if (cur == -1): + break; + tokens.append(cur); + response = tokenizer.decode(tokens); + if response and response[-1] != "�": + response, new_history = self.process_chatglm3_response(response, history) + return response, new_history + + def stream_chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, past_key_values = None, + max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0, + return_past_key_values = False, **kwargs) -> str: + if self.model_type != "chatglm3": + if (not(history)): + history = []; + prompt = query if self.direct_query else self.get_prompt(query, history); + input = tokenizer.encode(prompt); + handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input), + max_length, do_sample, top_p, top_k, temperature, repeat_penalty, + False); + tokens = []; + while True: + cur = fastllm_lib.fetch_response_llm_model(self.model, handle); + if (cur == -1): + break; + tokens.append(cur); + response = tokenizer.decode(tokens); + new_history = history + [(query, response)]; + if return_past_key_values: + yield response, new_history, None; + else: + yield response, new_history; + else: + if history is None: + history = [] + role = "user" + input = self.build_chatglm3_input(tokenizer, query, history=history, role=role) + history.append({"role": role, "content": query}) + + handle = fastllm_lib.launch_response_llm_model(self.model, len(input), 
(ctypes.c_int * len(input))(*input), + max_length, do_sample, top_p, top_k, temperature, repeat_penalty, + False); + tokens = []; + while True: + cur = fastllm_lib.fetch_response_llm_model(self.model, handle); + if (cur == -1): + break; + tokens.append(cur); + response = tokenizer.decode(tokens); + if response and response[-1] != "�": + response, new_history = self.process_chatglm3_response(response, history) + if return_past_key_values: + yield response, new_history, past_key_values + else: + yield response, new_history + + + def set_adapter(self, name: str): + fastllm_lib.set_adapter(self.model, str(name).encode()) + + def disable_adapter(self): + fastllm_lib.disable_adapter(self.model) + + def process_chatglm3_response(self, output, history): + content = "" + history = deepcopy(history) + for response in output.split("<|assistant|>"): + metadata, content = response.split("\n", maxsplit=1) + if not metadata.strip(): + content = content.strip() + history.append({"role": "assistant", "metadata": metadata, "content": content}) + content = content.replace("[[训练时间]]", "2023年") + else: + history.append({"role": "assistant", "metadata": metadata, "content": content}) + if history[0]["role"] == "system" and "tools" in history[0]: + content = "\n".join(content.split("\n")[1:-1]) + def tool_call(**kwargs): + return kwargs + parameters = eval(content) + content = {"name": metadata.strip(), "parameters": parameters} + else: + content = {"name": metadata.strip(), "content": content} + return content, history + + def build_chatglm3_input(self, tokenizer, query, history=None, role="user"): + if history is None: + history = [] + input_ids = [] + for item in history: + content = item["content"] + if item["role"] == "system" and "tools" in item: + content = content + "\n" + json.dumps(item["tools"], indent=4, ensure_ascii=False) + input_ids.extend(tokenizer.build_single_message(item["role"], item.get("metadata", ""), content)) + 
input_ids.extend(tokenizer.build_single_message(role, "", query)) + input_ids.extend([tokenizer.get_command("<|assistant|>")]) + return input_ids + + def response_batch(self, querys: List[str], + historys: List[List[Tuple[str, str]]] = None, + max_length: int = 1024, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0, + **kwargs) -> List[str]: + query_size = len(querys) + if (not(historys)): + historys = [[] for _ in range(query_size)] + inputs = (ctypes.c_char_p * query_size)() + for i, query in enumerate(querys): + prompt = query if self.direct_query else self.get_prompt(query, historys[i]) + inputs[i] = ctypes.c_char_p(prompt.encode()) + + outputs = fastllm_lib.response_batch_str_llm_model(self.model, inputs, query_size, + max_length, do_sample, top_p, top_k, temperature, repeat_penalty, False) + + responses = [] + for i in range(query_size): + response = ctypes.string_at(outputs[i]).decode() + responses.append(response) + historys[i] = historys[i] + [(querys[i], response)] + return responses, historys + + def chat_batch(self, tokenizer, querys: List[str], historys: List[List[Tuple[str, str]]] = None, max_length: int = 1024, + do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.0, **kwargs): + query_size = len(querys) + if (not(historys)): + historys = [[] for _ in range(query_size)] + + inputs = [] + inputs_len = [] + for i, query in enumerate(querys): + prompt = query if self.direct_query else self.get_prompt(query, historys[i]) + input = tokenizer.encode(prompt); + inputs.extend(input) + inputs_len.append(len(input)) + + outputs = fastllm_lib.response_batch_tokens_llm_model(self.model, query_size, + (ctypes.c_int * len(inputs_len))(*inputs_len), + (ctypes.c_int * len(inputs))(*inputs), + max_length, do_sample, top_p, top_k, temperature, repeat_penalty, + False) + + responses = [] + for i in range(query_size): + response = ctypes.string_at(outputs[i]).decode() + responses.append(response) + historys[i] 
= historys[i] + [(querys[i], response)] + return responses, historys + + diff --git a/dcu-support/package/fastllm_pytools/torch2flm.py b/dcu-support/package/fastllm_pytools/torch2flm.py new file mode 100644 index 0000000..9cd218b --- /dev/null +++ b/dcu-support/package/fastllm_pytools/torch2flm.py @@ -0,0 +1,218 @@ +import struct +import numpy as np +import torch + +def writeString(fo, s): + fo.write(struct.pack('i', len(s))) + fo.write(s.encode()) + +def writeKeyValue(fo, key, value): + writeString(fo, key) + writeString(fo, value) + +fastllm_data_type_dict = { + "int4": 8, + "int8": 3, + "float16": 7, + "float32": 0, +} +fastllm_weight_type_dict = { + "linear": 1, + "embedding": 2 +} + +v = np.random.randint(-127, 127, [10, 20]); +temp = v; +c_max = np.expand_dims(np.abs(v).max(axis = -1), -1) +c_scale = c_max / 127.0 +v = (v / c_scale + 128.5).clip(1, 255).astype(np.uint8) + +def write_int8(fo, v): + c_max = np.expand_dims(np.abs(v).max(axis = -1), -1).clip(0.1, 1e100) + c_scale = c_max / 127.0 + v = (v / c_scale + 128.5).clip(1, 255).astype(np.uint8) + fo.write(struct.pack('i', 3)) + fo.write(struct.pack('i', 0)) + for i in range(c_max.shape[0]): + fo.write(struct.pack('f', -c_max[i][0])); + fo.write(struct.pack('f', c_max[i][0])); + fo.write(v.data) + +def write_int4(fo, v): + # c_min = np.expand_dims(-np.abs(v).max(axis = -1), -1) + # c_max = np.expand_dims(np.abs(v).max(axis = -1), -1) + # c_scale = c_max / 7.0 + # c_min = c_scale * -8.0 + + c_min = np.expand_dims(v.min(axis = -1), -1) + c_max = np.expand_dims(v.max(axis = -1), -1) + c_scale = (c_max - c_min) / 15.0 + c_zero = np.round(0.0 - c_min / c_scale) + c_zero = c_zero.clip(0, 15) + c_min = -c_scale * c_zero + + v = (v - c_min) / c_scale + v = (v + 0.5).astype(np.int8).clip(0, 15).astype(np.uint8) + v = v[:, 0::2] * 16 + v[:, 1::2] + fo.write(struct.pack('i', 8)) + fo.write(struct.pack('i', 0)) + for i in range(c_min.shape[0]): + fo.write(struct.pack('f', c_min[i][0])); + fo.write(struct.pack('f', 
c_max[i][0])); + fo.write(v.data) + +def tofile(exportPath, + model, + tokenizer = None, + pre_prompt = None, + user_role = None, + bot_role = None, + history_sep = None, + dtype = "float16"): + if (dtype not in fastllm_data_type_dict): + print("dtype should in ", list(fastllm_data_type_dict.keys())) + exit(0) + + dict = model.state_dict() + fo = open(exportPath, "wb") + + # 0. version id + fo.write(struct.pack('i', 2)) + + # 0.1 model info + if model.config.model_type == "chatglm" and model.config.transformers_version == "4.30.2": + model.config.model_type = "chatglm3" + modelInfo = model.config.__dict__ + if model.generation_config is not None: + modelInfo.update(model.generation_config.__dict__) + if ("model_type" not in modelInfo): + print("unknown model_type.") + exit(0) + + if (pre_prompt): + modelInfo["pre_prompt"] = pre_prompt + if (user_role): + modelInfo["user_role"] = user_role + if (bot_role): + modelInfo["bot_role"] = bot_role + if (history_sep): + modelInfo["history_sep"] = history_sep + if (modelInfo["model_type"] == "baichuan" and hasattr(model, "model") and hasattr(model.model, "get_alibi_mask")): + # Baichuan 2代 + modelInfo["use_alibi"] = "1" + modelInfo["pre_prompt"] = "" + modelInfo["user_role"] = ("") if hasattr(model.generation_config, "user_token_id") else ""; + modelInfo["bot_role"] = ("") if hasattr(model.generation_config, "assistant_token_id") else ""; + modelInfo["history_sep"] = "" + if (modelInfo["model_type"] == "baichuan" and modelInfo["vocab_size"] == 125696): + # Baichuan 2代 7B + modelInfo["pre_prompt"] = "" + modelInfo["user_role"] = ("") if hasattr(model.generation_config, "user_token_id") else ""; + modelInfo["bot_role"] = ("") if hasattr(model.generation_config, "assistant_token_id") else ""; + modelInfo["history_sep"] = "" + if modelInfo["model_type"] == "qwen": + if modelInfo["chat_format"] == "chatml": + modelInfo["im_end_id"] = tokenizer.im_end_id + modelInfo["im_start_id"] = tokenizer.im_start_id + + 
modelInfo["tokenizer_use_score"] = "1" # 分词带分数 + + if hasattr(model, "peft_config"): + adapter_size = len(model.peft_config) + modelInfo["peft_size"] = adapter_size + + fo.write(struct.pack('i', len(modelInfo))) + for it in modelInfo.keys(): + writeKeyValue(fo, str(it), str(modelInfo[it])) + + if hasattr(model, "peft_config"): + for adapter_name in model.peft_config.keys(): + adapter_dict = model.peft_config[adapter_name].__dict__ + writeString(fo, adapter_name) + fo.write(struct.pack('i', len(adapter_dict))) + for it in adapter_dict.keys(): + writeKeyValue(fo, str(it), str(adapter_dict[it])) + + # 1. vocab + if (tokenizer): + if (hasattr(tokenizer, "tokenizer")): + if (modelInfo['model_type'] == "qwen"): + pass + else: + tokenizer = tokenizer.tokenizer + if (hasattr(tokenizer, "sp_model")): + piece_size = tokenizer.sp_model.piece_size() + fo.write(struct.pack('i', piece_size)) + for i in range(piece_size): + s = tokenizer.sp_model.id_to_piece(i).encode() + fo.write(struct.pack('i', len(s))) + for c in s: + fo.write(struct.pack('i', c)) + fo.write(struct.pack('i', i)) + fo.write(struct.pack('f', float(tokenizer.sp_model.get_score(i)))) + else: + vocab = tokenizer.get_vocab() + fo.write(struct.pack('i', len(vocab))) + for v in vocab.keys(): + if (modelInfo['model_type'] == "qwen"): + s = v + elif (modelInfo["model_type"] == "moss"): + s = [(ord(c) if c not in tokenizer.byte_decoder else tokenizer.byte_decoder[c]) for c in v] + else: + s = v.encode() + fo.write(struct.pack('i', len(s))) + for c in s: + fo.write(struct.pack('i', c)) + fo.write(struct.pack('i', vocab[v])) + fo.write(struct.pack('f', 1.0)) + else: + fo.write(struct.pack('i', 0)) + + weight_type_dict = {} + module_dict = {} + for key, m in model.named_modules(): + if (isinstance(m, torch.nn.Linear)): + weight_type_dict[key + ".weight"] = "linear" + module_dict[key + ".weight"] = m + if (isinstance(m, torch.nn.Embedding)): + weight_type_dict[key] = "embedding" + + # 2. 
weight + fo.write(struct.pack('i', len(dict))) + tot = 0 + for key in dict: + ori_data_type = 0 + ori_np_data_type = np.float32 + cur_weight_type = 0 + if (key in weight_type_dict and weight_type_dict[key] in fastllm_weight_type_dict): + cur_weight_type = fastllm_weight_type_dict[weight_type_dict[key]] + to_data_type = 0 + if (cur_weight_type == 1): + to_data_type = fastllm_data_type_dict[dtype] + if (to_data_type == 7): + ori_data_type = 7 + ori_np_data_type = np.float16 + + cur = dict[key].numpy().astype(ori_np_data_type) + + if hasattr(model, "peft_config"): + weight_name = key.replace('base_model.model.', '') + fo.write(struct.pack('i', len(weight_name))) + fo.write(weight_name.encode()) + else: + fo.write(struct.pack('i', len(key))) + fo.write(key.encode()) + fo.write(struct.pack('i', len(cur.shape))) + for i in cur.shape: + fo.write(struct.pack('i', i)) + if (to_data_type == 3): + write_int8(fo, cur) + elif (to_data_type == 8): + write_int4(fo, cur) + else: + fo.write(struct.pack('i', to_data_type)) + fo.write(cur.data) + tot += 1 + print("output (", tot, "/", len(dict), end = " )\r") + print("\nfinish.") + fo.close() \ No newline at end of file diff --git a/dcu-support/package/setup.py b/dcu-support/package/setup.py new file mode 100644 index 0000000..84ee672 --- /dev/null +++ b/dcu-support/package/setup.py @@ -0,0 +1,12 @@ +from setuptools import setup, find_packages + +setup ( + name = "fastllm_pytools", + version = "0.0.1", + description = "Fastllm pytools", + packages = ['fastllm_pytools'], + url = "https://developer.hpccube.com/codes/aicomponent/fastllm", + package_data = { + '': ['*.dll', '*.so'] + } +) diff --git a/dcu-support/qwen2flm.py b/dcu-support/qwen2flm.py new file mode 100644 index 0000000..1dde95d --- /dev/null +++ b/dcu-support/qwen2flm.py @@ -0,0 +1,13 @@ +import sys +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig +from fastllm_pytools import torch2flm + +if __name__ == 
"__main__": + tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) + model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True, fp32=True).eval() + model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参 + + dtype = sys.argv[2] if len(sys.argv) >= 3 else "float16" + exportPath = sys.argv[1] if len(sys.argv) >= 2 else "qwen-7b-" + dtype + ".flm" + torch2flm.tofile(exportPath, model, tokenizer, dtype = dtype) \ No newline at end of file diff --git a/dcu-support/requirements.txt b/dcu-support/requirements.txt new file mode 100644 index 0000000..a5eb0c5 --- /dev/null +++ b/dcu-support/requirements.txt @@ -0,0 +1,9 @@ +transformers==4.32.0 +tiktoken +streamlit>=1.24.0 +sentencepiece +urllib3==1.26.16 +transformers_stream_generator==0.0.4 +accelerate +einops +#scipy diff --git a/dcu-support/web_demo.py b/dcu-support/web_demo.py new file mode 100644 index 0000000..27f93b2 --- /dev/null +++ b/dcu-support/web_demo.py @@ -0,0 +1,37 @@ +import streamlit as st +from streamlit_chat import message +from fastllm_pytools import llm +import sys + +st.set_page_config( + page_title="fastllm web demo", + page_icon=":robot:" +) + +@st.cache_resource +def get_model(): + model = llm.model(sys.argv[1]) + return model + +if "messages" not in st.session_state: + st.session_state.messages = [] + +for i, (prompt, response) in enumerate(st.session_state.messages): + with st.chat_message("user"): + st.markdown(prompt) + with st.chat_message("assistant"): + st.markdown(response) + +if prompt := st.chat_input("请开始对话"): + model = get_model() + with st.chat_message("user"): + st.markdown(prompt) + + with st.chat_message("assistant"): + message_placeholder = st.empty() + full_response = "" + for chunk in model.stream_response(prompt, st.session_state.messages, one_by_one = True): + full_response += chunk + 
message_placeholder.markdown(full_response + "▌") + message_placeholder.markdown(full_response) + st.session_state.messages.append((prompt, full_response)) diff --git a/docker/Dockerfile b/docker/Dockerfile new file mode 100644 index 0000000..a914ef0 --- /dev/null +++ b/docker/Dockerfile @@ -0,0 +1,109 @@ +ARG CUDA_VERSION=11.7.1 +ARG from=nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04 + +FROM ${from} as base + +ARG from + +RUN <")[0] - sent = sent.split("\n\n\n")[0] - sent = sent.split("\n\n")[0] - sent = sent.split("Question:")[0] - sents.append(sent) - return sents - - def generate_sample(model, tokenizer, question): response, _ = model.chat( tokenizer, @@ -58,40 +46,35 @@ def generate_sample(model, tokenizer, question): print("=============") return response - -def extract_answer_hf(completion): - def _get_last_digit(s): - _PAT_LAST_DIGIT = re.compile( - r"(?<=(\s|[\$%#{]))([+-])?(?=(\S))(0|([1-9](\d*|\d{0,2}(,\d{3})*)))?(\.\d*[1-9])?(?=(\s|[.,}]|$))" - ) - match = list(_PAT_LAST_DIGIT.finditer(s)) - if match: - last_digit = match[-1].group().replace(",", "").replace("+", "") - # print(f"The last digit in {s} is {last_digit}") - else: - last_digit = None - print(f"No digits found in {s!r}") - return last_digit - - job_gen = completion.strip(".").replace("\n", "\\n") - last_digit = _get_last_digit(job_gen) - if last_digit is not None: - return eval(last_digit) - return INVALID_ANS - - -def extract_answer(completion): - try: - last_number = re.findall(r"\d+", completion)[-1] - return eval(last_number) - except: - return INVALID_ANS - +def extract_answer(s): + _PAT_LAST_DIGIT = re.compile( + r"([+-])?(?=([0-9]|\.[0-9]))(0|([1-9](\d{0,2}(,\d{3})*)|\d*))?(\.\d*)?(?=\D|$)" + ) + match = list(_PAT_LAST_DIGIT.finditer(s)) + if match: + last_digit = match[-1].group().replace(",", "").replace("+", "").strip() + # print(f"The last digit in {s} is {last_digit}") + else: + last_digit = None + print(f"No digits found in {s!r}", flush=True) + return last_digit def 
is_correct(completion, answer): gold = extract_answer(answer) - assert gold != INVALID_ANS, "No ground truth answer found in the document." - return extract_answer(completion) == gold + assert gold is not None, "No ground truth answer found in the document." + + def number_equal(answer, pred): + if pred is None: + return False + try: + return math.isclose(eval(answer), eval(pred), rel_tol=0, abs_tol=1e-4) + except Exception: + print( + f"cannot compare two numbers: answer={answer}, pred={pred}", flush=True + ) + return False + + return number_equal(gold, extract_answer(completion)) if __name__ == "__main__": @@ -138,7 +121,6 @@ if __name__ == "__main__": acc_res = [] for doc in tqdm.tqdm(test): context = doc_to_text(doc, args.use_fewshot) - print(context) completion = generate_sample(model, tokenizer, context) answer = doc["answer"] acc = is_correct(completion, answer) diff --git a/eval/evaluate_chat_mmlu.py b/eval/evaluate_chat_mmlu.py index 36d0524..bd275a2 100644 --- a/eval/evaluate_chat_mmlu.py +++ b/eval/evaluate_chat_mmlu.py @@ -109,7 +109,7 @@ def eval_subject( print(f"{result_path} existed, skip!") score = [] for (_, datarow), (_, resultrow) in zip( - test_df.iterrows(), pd.read_csv(result_path).iterrows() + test_df.iterrows(), pd.read_csv(result_path).astype(str).iterrows() ): # pred = extract_answer(resultrow['model_response'], datarow) pred = resultrow["model_output"] @@ -201,7 +201,7 @@ def main(args): # dev_df = pd.read_csv(dev_file_path, names=['question','A','B','C','D','answer']) test_df = pd.read_csv( test_file_path, names=["question", "A", "B", "C", "D", "answer"] - ) + ).astype(str) score = eval_subject( model, diff --git a/examples/system_prompt.md b/examples/system_prompt.md new file mode 100644 index 0000000..139728b --- /dev/null +++ b/examples/system_prompt.md @@ -0,0 +1,92 @@ +# 系统指令 (System Prompts) + +## 什么是系统指令? (What are System Prompts?) 
+ +系统指令设定了AI助手的行为模式,例如人物设定、语言风格、任务模式、甚至针对具体问题的具体行为。 + +System Prompts set the behavior mode of the AI assistant, such as character settings, language styles, task modes, and even specific behaviors for specific tasks. + +系统指令可以是一个广泛的人物设定,如“You are a helpful assistant”;也可以是一个十分详细的要求,如“拒绝回答所有代码相关的问题”。 + +A System Prompt can be a broad character setting, such as "You are a helpful assistant", or it can be a very detailed request, such as "Refuse to answer all code-related questions." + +系统指令为用户提供了一个易组织、上下文稳定的控制AI助手行为的方式,可以从多种角度定制属于你自己的AI助手。 + +System Prompts provide users with an easy-to-organize, context-stable way to control the behavior of the AI assistant. You can customize your own AI assistant from multiple perspectives. + +系统指令需要在多轮对话中稳定,例如角色扮演类系统指令被设定后AI助手不应该在多轮对话中跳脱自身的设定。 + +System Prompts need to remain stable across multiple rounds of dialogue. For example, after a role-playing system prompt is set, the AI assistant should not break character in later rounds of the dialogue. + +同时,模型也需要具有基于系统指令中对自身行为进行推理的能力。这两者都是为模型赋予跟随系统指令能力时需要克服的难点。 + +At the same time, the model also needs to be able to reason about its own behavior based on system prompts. Both of these are difficulties that must be overcome when giving the model the ability to follow system prompts. + +Qwen-1.8B-Chat 和 Qwen-72B-Chat在多样且存在多轮复杂交互的系统指令上进行了充分训练,使模型可以跟随多样的系统指令,实现上下文(in-context)中的模型定制化,进一步提升了通义千问的可扩展性。 + +Qwen-1.8B-Chat and Qwen-72B-Chat have been fully trained on diverse system prompts with multiple rounds of complex interactions, so that they can follow a variety of system prompts and realize model customization in context, further improving the scalability of Qwen-Chat. + +## 系统指令能做什么? (What can System Prompts do?) + +### 角色扮演 Role Play + +在系统指令中告诉千问你需要它扮演的角色,即可沉浸式和该角色对话交流 + +Tell Qwen-Chat the role you want it to play in the System Prompt, and you can have an immersive conversation with that role. 
+ + +![](../assets/system_prompt_role_play.png) + +![](../assets/system_prompt_role_play_en.png) + +### 语言风格 Language Style + + +简单调整千问的语言风格 + +Easily adjust Qwen-Chat's language style + +![](../assets/system_prompt_language_style.png) + +![](../assets/system_prompt_language_style_en.png) + +### 任务设定 Task Setting + +指定具体任务,打造处理专项任务的千问模型 + +Specify a task to create a Qwen-Chat model dedicated to handling it + +![](../assets/system_prompt_task_setting.png) + +![](../assets/system_prompt_task_setting_en.png) + +### 行为设定 Behavior Setting + +设定千问对具体任务的行为模式 + +Set behavior patterns of Qwen-Chat for specific tasks + +![](../assets/system_prompt_behavior_setting.png) + +![](../assets/system_prompt_behavior_setting_en.png) + +## 代码示例 Example + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig + +tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True) + +# Only Qwen-72B-Chat and Qwen-1_8B-Chat have system prompt enhancement now. +model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", device_map="auto", trust_remote_code=True).eval() +# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True).eval() + +response, _ = model.chat(tokenizer, "你好呀", history=None, system="请用二次元可爱语气和我说话") +print(response) +# 你好啊!我是一只可爱的二次元猫咪哦,不知道你有什么问题需要我帮忙解答吗? + +response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs") +print(response) +# Your colleague is an outstanding worker! Their dedication and hard work are truly inspiring. They always go above and beyond to ensure that their tasks are completed on time and to the highest standard. I am lucky to have them as a colleague, and I know I can count on them to handle any challenge that comes their way. 
+``` \ No newline at end of file diff --git a/examples/vllm_wrapper.py b/examples/vllm_wrapper.py new file mode 100644 index 0000000..799f4bf --- /dev/null +++ b/examples/vllm_wrapper.py @@ -0,0 +1,239 @@ +from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList +from typing import Optional, Callable, List, Tuple, Union +import copy +import torch +from transformers import AutoTokenizer +from transformers.generation.logits_process import LogitsProcessorList +from packaging import version + +_ERROR_BAD_CHAT_FORMAT = """\ +We detect you are probably using the pretrained model (rather than chat model) for chatting, since the chat_format in generation_config is not "chatml". +If you are directly using the model downloaded from Huggingface, please make sure you are using our "Qwen/Qwen-7B-Chat" Huggingface model (rather than "Qwen/Qwen-7B") when you call model.chat(). +我们检测到您可能在使用预训练模型(而非chat模型)进行多轮chat,因为您当前在generation_config指定的chat_format,并未设置为我们在对话中所支持的"chatml"格式。 +如果您在直接使用我们从Huggingface提供的模型,请确保您在调用model.chat()时,使用的是"Qwen/Qwen-7B-Chat"模型(而非"Qwen/Qwen-7B"预训练模型)。 +""" + +IMEND = "<|im_end|>" +ENDOFTEXT = "<|endoftext|>" + +HistoryType = List[Tuple[str, str]] +TokensType = List[int] +BatchTokensType = List[List[int]] + +def get_stop_words_ids(chat_format, tokenizer): + if chat_format == "raw": + stop_words_ids = [tokenizer.encode("Human:"), [tokenizer.eod_id]] + elif chat_format == "chatml": + stop_words_ids = [[tokenizer.im_end_id], [tokenizer.im_start_id]] + else: + raise NotImplementedError(f"Unknown chat format {chat_format!r}") + return stop_words_ids + +def make_context( + tokenizer: PreTrainedTokenizer, + query: str, + history: List[Tuple[str, str]] = None, + system: str = "", + max_window_size: int = 6144, + chat_format: str = "chatml", +): + if history is None: + history = [] + + if chat_format == "chatml": + im_start, im_end = "<|im_start|>", "<|im_end|>" + im_start_tokens = [tokenizer.im_start_id] + im_end_tokens = 
[tokenizer.im_end_id] + nl_tokens = tokenizer.encode("\n") + + def _tokenize_str(role, content): + return f"{role}\n{content}", tokenizer.encode( + role, allowed_special=set() + ) + nl_tokens + tokenizer.encode(content, allowed_special=set()) + + system_text, system_tokens_part = _tokenize_str("system", system) + system_tokens = im_start_tokens + system_tokens_part + im_end_tokens + + raw_text = "" + context_tokens = [] + + for turn_query, turn_response in reversed(history): + query_text, query_tokens_part = _tokenize_str("user", turn_query) + query_tokens = im_start_tokens + query_tokens_part + im_end_tokens + response_text, response_tokens_part = _tokenize_str( + "assistant", turn_response + ) + response_tokens = im_start_tokens + response_tokens_part + im_end_tokens + + next_context_tokens = nl_tokens + query_tokens + nl_tokens + response_tokens + prev_chat = ( + f"\n{im_start}{query_text}{im_end}\n{im_start}{response_text}{im_end}" + ) + + current_context_size = ( + len(system_tokens) + len(next_context_tokens) + len(context_tokens) + ) + if current_context_size < max_window_size: + context_tokens = next_context_tokens + context_tokens + raw_text = prev_chat + raw_text + else: + break + + context_tokens = system_tokens + context_tokens + raw_text = f"{im_start}{system_text}{im_end}" + raw_text + context_tokens += ( + nl_tokens + + im_start_tokens + + _tokenize_str("user", query)[1] + + im_end_tokens + + nl_tokens + + im_start_tokens + + tokenizer.encode("assistant") + + nl_tokens + ) + raw_text += f"\n{im_start}user\n{query}{im_end}\n{im_start}assistant\n" + + elif chat_format == "raw": + raw_text = query + context_tokens = tokenizer.encode(raw_text) + else: + raise NotImplementedError(f"Unknown chat format {chat_format!r}") + + return raw_text, context_tokens + +class vLLMWrapper: + def __init__(self, + model_dir: str, + trust_remote_code: bool = True, + tensor_parallel_size: int = 1, + gpu_memory_utilization: float = 0.98, + dtype: str = "bfloat16", + 
**kwargs): + + if dtype not in ("bfloat16", "float16", "float32"): + print("now not support {}!".format(dtype)) + raise Exception + + # build generation_config + self.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=trust_remote_code) + + # build tokenizer + self.tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) + self.tokenizer.eos_token_id = self.generation_config.eos_token_id + + self.stop_words_ids = [] + + from vllm import LLM + import vllm + if version.parse(vllm.__version__) >= version.parse("0.2.2"): + self.__vllm_support_repetition_penalty = True + else: + self.__vllm_support_repetition_penalty = False + + quantization = kwargs.get('quantization', None) + + self.model = LLM(model=model_dir, + tokenizer=model_dir, + tensor_parallel_size=tensor_parallel_size, + trust_remote_code=trust_remote_code, + quantization=quantization, + gpu_memory_utilization=gpu_memory_utilization, + dtype=dtype) + + for stop_id in get_stop_words_ids(self.generation_config.chat_format, self.tokenizer): + self.stop_words_ids.extend(stop_id) + self.stop_words_ids.extend([self.generation_config.eos_token_id]) + + def chat(self, + query: str, + history: Optional[HistoryType], + tokenizer: PreTrainedTokenizer = None, + system: str = "You are a helpful assistant.", + generation_config: Optional[GenerationConfig] = None, + **kwargs): + generation_config = generation_config if generation_config is not None else self.generation_config + tokenizer = self.tokenizer if tokenizer is None else tokenizer + + assert generation_config.chat_format == 'chatml', _ERROR_BAD_CHAT_FORMAT + if not self.__vllm_support_repetition_penalty and generation_config.repetition_penalty != 1: + raise RuntimeError("The installed vLLM doesn't support repetition_penalty, please set ``model.generation_config.repetition_penalty = 1`` or install vllm>=0.2.2") + + if history is None: + history = [] + else: + # make a copy of the user's input such that it is 
left untouched + history = copy.deepcopy(history) + + extra_stop_words_ids = kwargs.get('stop_words_ids', None) + if extra_stop_words_ids is None: + extra_stop_words_ids = [] + + max_window_size = kwargs.get('max_window_size', None) + if max_window_size is None: + max_window_size = generation_config.max_window_size + + from vllm.sampling_params import SamplingParams + sampling_kwargs = { + "stop_token_ids": self.stop_words_ids, + "early_stopping": False, + "top_p": generation_config.top_p, + "top_k": -1 if generation_config.top_k == 0 else generation_config.top_k, + "temperature": generation_config.temperature, + "max_tokens": generation_config.max_new_tokens, + "repetition_penalty": generation_config.repetition_penalty + } + if not self.__vllm_support_repetition_penalty: + sampling_kwargs.pop("repetition_penalty") + sampling_params = SamplingParams(**sampling_kwargs) + + raw_text, context_tokens = make_context( + self.tokenizer, + query, + history=history, + system=system, + max_window_size=max_window_size, + chat_format=generation_config.chat_format, + ) + + req_outputs = self.model.generate([query], + sampling_params=sampling_params, + prompt_token_ids=[context_tokens]) + req_output = req_outputs[0] + + prompt_str = req_output.prompt + prompt_ids = req_output.prompt_token_ids + req_sample_output_ids = [] + req_sample_output_strs = [] + for sample in req_output.outputs: + output_str = sample.text + output_ids = sample.token_ids + if IMEND in output_str: + output_str = output_str[:-len(IMEND)] + if ENDOFTEXT in output_str: + output_str = output_str[:-len(ENDOFTEXT)] + req_sample_output_ids.append(prompt_ids + output_ids) + req_sample_output_strs.append(prompt_str + output_str) + assert len(req_sample_output_strs) == 1 + response = req_sample_output_strs[0][len(prompt_str):] + history.append((prompt_str, response)) + + return response, history + +if __name__ == '__main__': + + model_dir = 'Qwen/Qwen-72B-Chat' + tensor_parallel_size = 2 + + model = 
vLLMWrapper(model_dir, + tensor_parallel_size=tensor_parallel_size, + ) + + response, history = model.chat(query="你好", + history=None) + print(response) + response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", + history=history) + print(response) + response, history = model.chat(query="给这个故事起一个标题", + history=history) + print(response) diff --git a/finetune.py b/finetune.py index a756556..4eebd15 100644 --- a/finetune.py +++ b/finetune.py @@ -278,11 +278,11 @@ def train(): local_rank = training_args.local_rank - device_map = None + device_map = "auto" world_size = int(os.environ.get("WORLD_SIZE", 1)) ddp = world_size != 1 if lora_args.q_lora: - device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None + device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else "auto" if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled(): logging.warning( "FSDP or ZeRO3 are not incompatible with QLoRA." diff --git a/finetune/finetune_ds.sh b/finetune/finetune_ds.sh index f6bd381..1953723 100644 --- a/finetune/finetune_ds.sh +++ b/finetune/finetune_ds.sh @@ -2,7 +2,7 @@ export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` -GPUS_PER_NODE=8 +GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())') NNODES=1 NODE_RANK=0 MASTER_ADDR=localhost @@ -13,6 +13,34 @@ MODEL="Qwen/Qwen-7B" # Set the path if you do not want to load from huggingface # See the section for finetuning in README for more information. 
DATA="path_to_data" +function usage() { + echo ' +Usage: bash finetune/finetune_ds.sh [-m MODEL_PATH] [-d DATA_PATH] +' +} + +while [[ "$1" != "" ]]; do + case $1 in + -m | --model ) + shift + MODEL=$1 + ;; + -d | --data ) + shift + DATA=$1 + ;; + -h | --help ) + usage + exit 0 + ;; + * ) + echo "Unknown argument ${1}" + exit 1 + ;; + esac + shift +done + DISTRIBUTED_ARGS=" --nproc_per_node $GPUS_PER_NODE \ --nnodes $NNODES \ @@ -44,4 +72,4 @@ torchrun $DISTRIBUTED_ARGS finetune.py \ --model_max_length 512 \ --gradient_checkpointing True \ --lazy_preprocess True \ - --deepspeed finetune/ds_config_zero3.json \ No newline at end of file + --deepspeed finetune/ds_config_zero3.json diff --git a/finetune/finetune_lora_ds.sh b/finetune/finetune_lora_ds.sh index 30f7883..1dfe814 100644 --- a/finetune/finetune_lora_ds.sh +++ b/finetune/finetune_lora_ds.sh @@ -2,7 +2,7 @@ export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` -GPUS_PER_NODE=8 +GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())') NNODES=1 NODE_RANK=0 MASTER_ADDR=localhost @@ -12,6 +12,39 @@ MODEL="Qwen/Qwen-7B" # Set the path if you do not want to load from huggingface # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. 
 DATA="path_to_data"
+DS_CONFIG_PATH="finetune/ds_config_zero2.json"
+
+function usage() {
+    echo '
+Usage: bash finetune/finetune_lora_ds.sh [-m MODEL_PATH] [-d DATA_PATH] [--deepspeed DS_CONFIG_PATH]
+'
+}
+
+while [[ "$1" != "" ]]; do
+    case $1 in
+        -m | --model )
+            shift
+            MODEL=$1
+            ;;
+        -d | --data )
+            shift
+            DATA=$1
+            ;;
+        --deepspeed )
+            shift
+            DS_CONFIG_PATH=$1
+            ;;
+        -h | --help )
+            usage
+            exit 0
+            ;;
+        * )
+            echo "Unknown argument ${1}"
+            exit 1
+            ;;
+    esac
+    shift
+done

 DISTRIBUTED_ARGS="
     --nproc_per_node $GPUS_PER_NODE \
@@ -45,4 +78,4 @@ torchrun $DISTRIBUTED_ARGS finetune.py \
     --lazy_preprocess True \
     --use_lora \
     --gradient_checkpointing \
-    --deepspeed finetune/ds_config_zero2.json
+    --deepspeed ${DS_CONFIG_PATH}
diff --git a/finetune/finetune_lora_single_gpu.sh b/finetune/finetune_lora_single_gpu.sh
index 3d8be58..74d9d36 100644
--- a/finetune/finetune_lora_single_gpu.sh
+++ b/finetune/finetune_lora_single_gpu.sh
@@ -1,13 +1,39 @@
 #!/bin/bash
 export CUDA_DEVICE_MAX_CONNECTIONS=1
-DIR=`pwd`
-
 MODEL="Qwen/Qwen-7B" # Set the path if you do not want to load from huggingface directly
 # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
 # See the section for finetuning in README for more information.
 DATA="path_to_data"

+function usage() {
+    echo '
+Usage: bash finetune/finetune_lora_single_gpu.sh [-m MODEL_PATH] [-d DATA_PATH]
+'
+}
+
+while [[ "$1" != "" ]]; do
+    case $1 in
+        -m | --model )
+            shift
+            MODEL=$1
+            ;;
+        -d | --data )
+            shift
+            DATA=$1
+            ;;
+        -h | --help )
+            usage
+            exit 0
+            ;;
+        * )
+            echo "Unknown argument ${1}"
+            exit 1
+            ;;
+    esac
+    shift
+done
+
 export CUDA_VISIBLE_DEVICES=0

 python finetune.py \
diff --git a/finetune/finetune_qlora_ds.sh b/finetune/finetune_qlora_ds.sh
index 5ca0ce6..a43d35d 100644
--- a/finetune/finetune_qlora_ds.sh
+++ b/finetune/finetune_qlora_ds.sh
@@ -2,7 +2,7 @@
 export CUDA_DEVICE_MAX_CONNECTIONS=1
 DIR=`pwd`

-GPUS_PER_NODE=8
+GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
 NNODES=1
 NODE_RANK=0
 MASTER_ADDR=localhost
@@ -13,6 +13,34 @@ MODEL="Qwen/Qwen-7B-Chat-Int4" # Set the path if you do not want to load from hu
 # See the section for finetuning in README for more information.
 DATA="path_to_data"

+function usage() {
+    echo '
+Usage: bash finetune/finetune_qlora_ds.sh [-m MODEL_PATH] [-d DATA_PATH]
+'
+}
+
+while [[ "$1" != "" ]]; do
+    case $1 in
+        -m | --model )
+            shift
+            MODEL=$1
+            ;;
+        -d | --data )
+            shift
+            DATA=$1
+            ;;
+        -h | --help )
+            usage
+            exit 0
+            ;;
+        * )
+            echo "Unknown argument ${1}"
+            exit 1
+            ;;
+    esac
+    shift
+done
+
 DISTRIBUTED_ARGS="
     --nproc_per_node $GPUS_PER_NODE \
     --nnodes $NNODES \
diff --git a/finetune/finetune_qlora_single_gpu.sh b/finetune/finetune_qlora_single_gpu.sh
index 20031e8..fb019a0 100644
--- a/finetune/finetune_qlora_single_gpu.sh
+++ b/finetune/finetune_qlora_single_gpu.sh
@@ -7,6 +7,34 @@ MODEL="Qwen/Qwen-7B-Chat-Int4" # Set the path if you do not want to load from hu
 # See the section for finetuning in README for more information.
 DATA="path_to_data"

+function usage() {
+    echo '
+Usage: bash finetune/finetune_qlora_single_gpu.sh [-m MODEL_PATH] [-d DATA_PATH]
+'
+}
+
+while [[ "$1" != "" ]]; do
+    case $1 in
+        -m | --model )
+            shift
+            MODEL=$1
+            ;;
+        -d | --data )
+            shift
+            DATA=$1
+            ;;
+        -h | --help )
+            usage
+            exit 0
+            ;;
+        * )
+            echo "Unknown argument ${1}"
+            exit 1
+            ;;
+    esac
+    shift
+done
+
 export CUDA_VISIBLE_DEVICES=0

 # Remember to use --fp16 instead of --bf16 due to autogptq