init commit of recipes (#1027)

Add recipes
2026-05-20 16:35:47 +08:00 · 2024-01-30 01:57:09 -06:00
parent d275e5b91a
commit ee01f36ed9
30 changed files with 5146 additions and 0 deletions
--- a/recipes/finetune/ascend/README.md
+++ b/recipes/finetune/ascend/README.md
@@ -0,0 +1,142 @@
+# Fine-tuning Qwen on Ascend NPU
+Below is a simple example showing how to fine-tune Qwen on Ascend NPUs. For detailed usage, see the official [mindformers](https://gitee.com/mindspore/mindformers/blob/dev/research/qwen/qwen.md) documentation.
+
+## Environment Requirement
+
+- Hardware: Ascend 910A/B
+
+## Quickstart
+
+1. Launch Docker Image
+
+```bash
+ImageID=pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:qwen_v23.0.rc3
+docker run -it -u root --ipc=host \
+--device=/dev/davinci0 \
+--device=/dev/davinci1 \
+--device=/dev/davinci2 \
+--device=/dev/davinci3 \
+--device=/dev/davinci4 \
+--device=/dev/davinci5 \
+--device=/dev/davinci6 \
+--device=/dev/davinci7 \
+--device=/dev/davinci_manager \
+--device=/dev/devmm_svm \
+--device=/dev/hisi_hdc \
+-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /var/log/npu/:/usr/slog \
+-v /etc/hccn.conf:/etc/hccn.conf \
+${ImageID} /bin/bash
+```
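The eight `--device=/dev/davinciN` flags in the command above follow one pattern, so they can be generated rather than typed out. The following is a minimal sketch, not part of the original recipe: `davinci_device_flags` and `DEVICE_FLAGS` are hypothetical names, and the sketch assumes bash 4 or later and NPU device nodes numbered from 0.

```bash
# Hypothetical helper: print one --device flag per NPU, for indices 0..N-1.
davinci_device_flags() {
    local num_npus="$1"
    local i
    for ((i = 0; i < num_npus; i++)); do
        printf -- '--device=/dev/davinci%d\n' "$i"
    done
}

# Collect the flags into an array so they can be spliced into `docker run`.
mapfile -t DEVICE_FLAGS < <(davinci_device_flags 8)
echo "${DEVICE_FLAGS[@]}"
```

The array can then replace the eight explicit lines, for example `docker run -it -u root --ipc=host "${DEVICE_FLAGS[@]}" ...`, with the remaining flags unchanged.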
+
+2. Download and convert the model
+
+- Download the model with ModelScope
+
+```bash
+cd mindformers
+python3 -c "from modelscope.hub.snapshot_download import snapshot_download; snapshot_download('Qwen/Qwen-7B-Chat', cache_dir='.', revision='master')"
+```
+
+- Convert the Hugging Face model weights to MindSpore ckpt weights
+
+```bash
+python research/qwen/convert_weight.py \
+    --torch_ckpt_dir Qwen/Qwen-7B-Chat \
+    --mindspore_ckpt_path qwen-7b-chat.ckpt
+
+mkdir -vp load_checkpoint/rank_0
+mv qwen-7b-chat.ckpt load_checkpoint/rank_0/
+```
+
+3. Prepare training data
+
+- Download the demo data
+
+```bash
+wget -c https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/alpaca_data_min.json
+```
+
+- Convert the raw data to the specified format
+
+```bash
+python research/qwen/alpaca_converter.py \
+    --data_path alpaca_data_min.json \
+    --output_path alpaca-data-conversation_min.json
+```
+
+- Generate MindRecord data
+
+```bash
+python research/qwen/qwen_preprocess.py \
+    --input_glob alpaca-data-conversation_min.json \
+    --model_file Qwen/Qwen-7B-Chat/qwen.tiktoken \
+    --seq_length 1024 \
+    --output_file alpaca_min.mindrecord
+```
+
+4. Prepare RANK_TABLE_FILE
+
+```bash
+# generate RANK_TABLE_FILE for 8 NPUs
+python mindformers/tools/hccl_tools.py --device_num "[0,8)"
+```
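The fine-tuning command needs the path of the file that this step generates. The following is a minimal sketch for locating it, under a stated assumption: that `hccl_tools.py` writes a JSON file whose name starts with `hccl_` into the current directory. Check the script's actual output before relying on this. `find_rank_table` is a hypothetical helper, not part of mindformers.

```bash
# Hypothetical helper: print the first hccl_*.json file found in a directory.
# Assumption: the generated rank table matches the glob hccl_*.json.
find_rank_table() {
    local dir="${1:-.}"
    local matches=("${dir}"/hccl_*.json)
    # With no match the glob stays literal, so test that the file exists.
    if [ -e "${matches[0]:-}" ]; then
        printf '%s\n' "${matches[0]}"
    else
        return 1
    fi
}

if RANK_TABLE_FILE=$(find_rank_table .); then
    export RANK_TABLE_FILE
    echo "Using rank table: ${RANK_TABLE_FILE}"
else
    echo "No hccl_*.json found; run hccl_tools.py first" >&2
fi
```

If several matching files exist, the sketch picks the first in glob order, so remove stale rank tables before regenerating.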
+
+5. Fine-tune
+
+Replace RANK_TABLE_FILE with the path of the file generated in step 4.
+
+```bash
+export MS_ASCEND_CHECK_OVERFLOW_MODE=INFNAN_MODE
+bash research/run_singlenode.sh "python3 research/qwen/run_qwen.py \
+--config research/qwen/run_qwen_7b.yaml \
+--load_checkpoint /mindformers/research/qwen/load_checkpoint \
+--vocab_file Qwen/Qwen-7B-Chat/qwen.tiktoken \
+--use_parallel True \
+--run_mode finetune \
+--auto_trans_ckpt True \
+--train_data alpaca_min.mindrecord" \
+RANK_TABLE_FILE [0,8] 8
+```
+
+6. Merge model weights
+
+- Rename model weights
+
+```bash
+cd output/checkpoint_network
+mv rank_0/qwen_rank_0-network.ckpt rank_0/checkpoint_0.ckpt
+mv rank_1/qwen_rank_1-network.ckpt rank_1/checkpoint_1.ckpt
+mv rank_2/qwen_rank_2-network.ckpt rank_2/checkpoint_2.ckpt
+mv rank_3/qwen_rank_3-network.ckpt rank_3/checkpoint_3.ckpt
+mv rank_4/qwen_rank_4-network.ckpt rank_4/checkpoint_4.ckpt
+mv rank_5/qwen_rank_5-network.ckpt rank_5/checkpoint_5.ckpt
+mv rank_6/qwen_rank_6-network.ckpt rank_6/checkpoint_6.ckpt
+mv rank_7/qwen_rank_7-network.ckpt rank_7/checkpoint_7.ckpt
+cd ../..
+```
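The eight `mv` commands above differ only in the rank index, so a loop can do the same renaming. The following is a minimal sketch, assuming the `rank_N/qwen_rank_N-network.ckpt` layout shown above; `rename_rank_ckpts` is a hypothetical helper, not part of mindformers, and it is run from the mindformers root rather than from inside `output/checkpoint_network`.

```bash
# Hypothetical helper: rename each rank's checkpoint to checkpoint_<rank>.ckpt,
# mirroring the individual mv commands above.
rename_rank_ckpts() {
    local ckpt_root="$1"   # e.g. output/checkpoint_network
    local num_ranks="$2"   # e.g. 8
    local rank src
    for ((rank = 0; rank < num_ranks; rank++)); do
        src="${ckpt_root}/rank_${rank}/qwen_rank_${rank}-network.ckpt"
        # Skip ranks whose file is missing instead of aborting the loop.
        if [ -f "${src}" ]; then
            mv "${src}" "${ckpt_root}/rank_${rank}/checkpoint_${rank}.ckpt"
        else
            echo "skip: ${src} not found" >&2
        fi
    done
}

# Run only if the fine-tuning output exists.
if [ -d output/checkpoint_network ]; then
    rename_rank_ckpts output/checkpoint_network 8
fi
```

Skipping missing files keeps a partial run from failing halfway, but it also means a skipped rank should be investigated before merging the weights.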
+
+- Merge model weights
+
+```bash
+python mindformers/tools/transform_ckpt.py \
+    --src_ckpt_strategy output/strategy  \
+    --src_ckpt_dir output/checkpoint_network \
+    --dst_ckpt_dir output/merged_model
+```
+
+7. Run inference with the fine-tuned model
+
+```bash
+python research/qwen/run_qwen.py \
+    --config research/qwen/run_qwen_7b.yaml \
+    --predict_data '比较适合深度学习入门的书籍有' \
+    --run_mode predict \
+    --load_checkpoint output/merged_model/rank_0/checkpoint_0.ckpt \
+    --auto_trans_ckpt False \
+    --device_id 0
+```
--- a/recipes/finetune/deepspeed/finetune_fullparameter_multi_gpu.ipynb
+++ b/recipes/finetune/deepspeed/finetune_fullparameter_multi_gpu.ipynb
@@ -0,0 +1,213 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6e6981ab-2d9a-4280-923f-235a166855ba",
+   "metadata": {},
+   "source": [
+    "# Fine-Tuning Qwen-Chat Large Language Model (Multiple GPUs)\n",
+    "\n",
+    "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using an alignment mechanism.\n",
+    "\n",
+    "This notebook uses Qwen-1.8B-Chat as an example to show how to fine-tune the Qianwen model with DeepSpeed.\n",
+    "\n",
+    "## Environment Requirements\n",
+    "\n",
+    "Please refer to **requirements.txt** to install the required dependencies.\n",
+    "\n",
+    "## Preparation\n",
+    "\n",
+    "### Download Qwen-1.8B-Chat\n",
+    "\n",
+    "First, download the model files. You can choose to download directly from ModelScope."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from modelscope.hub.snapshot_download import snapshot_download\n",
+    "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f",
+   "metadata": {},
+   "source": [
+    "### Download Example Training Data\n",
+    "\n",
+    "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n",
+    "\n",
+    "Disclaimer: this dataset may be used for research purposes only."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f",
+   "metadata": {},
+   "source": [
+    "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"我是一个语言模型，我叫通义千问。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "You can also use multi-turn conversations as the training set. Here is a simple example:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好，能告诉我遛狗的最佳时间吗？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"当地最佳遛狗时间因地域差异而异，请问您所在的城市是哪里？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"我在纽约市。\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间，因为这些时间段气温较低，遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "## Fine-Tune the Model\n",
+    "\n",
+    "You can directly run the prepared training script to fine-tune the model. **nproc_per_node** is the number of GPUs used for training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 ../../finetune.py \\\n",
+    "    --model_name_or_path \"Qwen/Qwen-1_8B-Chat/\" \\\n",
+    "    --data_path \"Belle_sampled_qwen.json\" \\\n",
+    "    --bf16 True \\\n",
+    "    --output_dir \"output_qwen\" \\\n",
+    "    --num_train_epochs 5 \\\n",
+    "    --per_device_train_batch_size 1 \\\n",
+    "    --per_device_eval_batch_size 1 \\\n",
+    "    --gradient_accumulation_steps 16 \\\n",
+    "    --evaluation_strategy \"no\" \\\n",
+    "    --save_strategy \"steps\" \\\n",
+    "    --save_steps 1000 \\\n",
+    "    --save_total_limit 10 \\\n",
+    "    --learning_rate 1e-5 \\\n",
+    "    --weight_decay 0.1 \\\n",
+    "    --adam_beta2 0.95 \\\n",
+    "    --warmup_ratio 0.01 \\\n",
+    "    --lr_scheduler_type \"cosine\" \\\n",
+    "    --logging_steps 1 \\\n",
+    "    --report_to \"none\" \\\n",
+    "    --model_max_length 512 \\\n",
+    "    --gradient_checkpointing True \\\n",
+    "    --lazy_preprocess True \\\n",
+    "    --deepspeed \"../../finetune/ds_config_zero2.json\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Test the Model\n",
+    "\n",
+    "We can test the model as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+    "from transformers.generation import GenerationConfig\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen\", trust_remote_code=True)\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    \"output_qwen\",\n",
+    "    device_map=\"auto\",\n",
+    "    trust_remote_code=True\n",
+    ").eval()\n",
+    "\n",
+    "response, history = model.chat(tokenizer, \"你好\", history=None)\n",
+    "print(response)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/recipes/finetune/deepspeed/finetune_fullparameter_single_gpu.ipynb
+++ b/recipes/finetune/deepspeed/finetune_fullparameter_single_gpu.ipynb
@@ -0,0 +1,234 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6e6981ab-2d9a-4280-923f-235a166855ba",
+   "metadata": {},
+   "source": [
+    "# Fine-Tuning Qwen-Chat Large Language Model (Single GPU)\n",
+    "\n",
+    "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using an alignment mechanism.\n",
+    "\n",
+    "This notebook uses Qwen-1.8B-Chat as an example to show how to fine-tune the Qianwen model with DeepSpeed.\n",
+    "\n",
+    "## Environment Requirements\n",
+    "\n",
+    "Please refer to **requirements.txt** to install the required dependencies.\n",
+    "\n",
+    "## Preparation\n",
+    "\n",
+    "### Download Qwen-1.8B-Chat\n",
+    "\n",
+    "First, download the model files. You can choose to download directly from ModelScope."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "execution": {
+     "iopub.execute_input": "2023-12-31T03:19:11.059814Z",
+     "iopub.status.busy": "2023-12-31T03:19:11.059177Z",
+     "iopub.status.idle": "2023-12-31T03:21:54.157827Z",
+     "shell.execute_reply": "2023-12-31T03:21:54.157333Z",
+     "shell.execute_reply.started": "2023-12-31T03:19:11.059783Z"
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from modelscope.hub.snapshot_download import snapshot_download\n",
+    "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f",
+   "metadata": {},
+   "source": [
+    "### Download Example Training Data\n",
+    "\n",
+    "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n",
+    "\n",
+    "Disclaimer: this dataset may be used for research purposes only."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-12-31T03:21:57.596577Z",
+     "iopub.status.busy": "2023-12-31T03:21:57.595847Z",
+     "iopub.status.idle": "2023-12-31T03:21:57.971112Z",
+     "shell.execute_reply": "2023-12-31T03:21:57.970576Z",
+     "shell.execute_reply.started": "2023-12-31T03:21:57.596555Z"
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f",
+   "metadata": {},
+   "source": [
+    "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"我是一个语言模型，我叫通义千问。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "You can also use multi-turn conversations as the training set. Here is a simple example:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好，能告诉我遛狗的最佳时间吗？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"当地最佳遛狗时间因地域差异而异，请问您所在的城市是哪里？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"我在纽约市。\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间，因为这些时间段气温较低，遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "\n",
+    "## Fine-Tune the Model\n",
+    "\n",
+    "You can directly run the prepared training script to fine-tune the model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "execution": {
+     "iopub.execute_input": "2023-12-31T03:23:52.455178Z",
+     "iopub.status.busy": "2023-12-31T03:23:52.454615Z",
+     "iopub.status.idle": "2023-12-31T03:24:15.699948Z",
+     "shell.execute_reply": "2023-12-31T03:24:15.699358Z",
+     "shell.execute_reply.started": "2023-12-31T03:23:52.455144Z"
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!python ../../finetune.py \\\n",
+    "    --model_name_or_path \"Qwen/Qwen-1_8B-Chat/\" \\\n",
+    "    --data_path \"Belle_sampled_qwen.json\" \\\n",
+    "    --bf16 \\\n",
+    "    --output_dir \"output_qwen\" \\\n",
+    "    --num_train_epochs 5 \\\n",
+    "    --per_device_train_batch_size 1 \\\n",
+    "    --per_device_eval_batch_size 1 \\\n",
+    "    --gradient_accumulation_steps 16 \\\n",
+    "    --evaluation_strategy \"no\" \\\n",
+    "    --save_strategy \"steps\" \\\n",
+    "    --save_steps 1000 \\\n",
+    "    --save_total_limit 10 \\\n",
+    "    --learning_rate 1e-5 \\\n",
+    "    --weight_decay 0.1 \\\n",
+    "    --adam_beta2 0.95 \\\n",
+    "    --warmup_ratio 0.01 \\\n",
+    "    --lr_scheduler_type \"cosine\" \\\n",
+    "    --logging_steps 1 \\\n",
+    "    --report_to \"none\" \\\n",
+    "    --model_max_length 512 \\\n",
+    "    --gradient_checkpointing \\\n",
+    "    --lazy_preprocess"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Test the Model\n",
+    "\n",
+    "We can test the model as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+    "from transformers.generation import GenerationConfig\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen\", trust_remote_code=True)\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    \"output_qwen\",\n",
+    "    device_map=\"auto\",\n",
+    "    trust_remote_code=True\n",
+    ").eval()\n",
+    "\n",
+    "response, history = model.chat(tokenizer, \"你好\", history=None)\n",
+    "print(response)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/recipes/finetune/deepspeed/finetune_lora_multi_gpu.ipynb
+++ b/recipes/finetune/deepspeed/finetune_lora_multi_gpu.ipynb
@@ -0,0 +1,267 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6e6981ab-2d9a-4280-923f-235a166855ba",
+   "metadata": {},
+   "source": [
+    "# LoRA Fine-Tuning Qwen-Chat Large Language Model (Multiple GPUs)\n",
+    "\n",
+    "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using an alignment mechanism.\n",
+    "\n",
+    "This notebook uses Qwen-1.8B-Chat as an example to show how to fine-tune the Qianwen model with LoRA using DeepSpeed.\n",
+    "\n",
+    "## Environment Requirements\n",
+    "\n",
+    "Please refer to **requirements.txt** to install the required dependencies.\n",
+    "\n",
+    "## Preparation\n",
+    "\n",
+    "### Download Qwen-1.8B-Chat\n",
+    "\n",
+    "First, download the model files. You can choose to download directly from ModelScope."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from modelscope.hub.snapshot_download import snapshot_download\n",
+    "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f",
+   "metadata": {},
+   "source": [
+    "### Download Example Training Data\n",
+    "\n",
+    "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n",
+    "\n",
+    "Disclaimer: this dataset may be used for research purposes only."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f",
+   "metadata": {},
+   "source": [
+    "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"我是一个语言模型，我叫通义千问。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "You can also use multi-turn conversations as the training set. Here is a simple example:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好，能告诉我遛狗的最佳时间吗？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"当地最佳遛狗时间因地域差异而异，请问您所在的城市是哪里？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"我在纽约市。\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间，因为这些时间段气温较低，遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "## Fine-Tune the Model\n",
+    "\n",
+    "You can directly run the prepared training script to fine-tune the model. **nproc_per_node** is the number of GPUs used for training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 ../../finetune.py \\\n",
+    "    --model_name_or_path \"Qwen/Qwen-1_8B-Chat/\" \\\n",
+    "    --data_path \"Belle_sampled_qwen.json\" \\\n",
+    "    --bf16 True \\\n",
+    "    --output_dir \"output_qwen\" \\\n",
+    "    --num_train_epochs 5 \\\n",
+    "    --per_device_train_batch_size 1 \\\n",
+    "    --per_device_eval_batch_size 1 \\\n",
+    "    --gradient_accumulation_steps 16 \\\n",
+    "    --evaluation_strategy \"no\" \\\n",
+    "    --save_strategy \"steps\" \\\n",
+    "    --save_steps 1000 \\\n",
+    "    --save_total_limit 10 \\\n",
+    "    --learning_rate 1e-5 \\\n",
+    "    --weight_decay 0.1 \\\n",
+    "    --adam_beta2 0.95 \\\n",
+    "    --warmup_ratio 0.01 \\\n",
+    "    --lr_scheduler_type \"cosine\" \\\n",
+    "    --logging_steps 1 \\\n",
+    "    --report_to \"none\" \\\n",
+    "    --model_max_length 512 \\\n",
+    "    --gradient_checkpointing True \\\n",
+    "    --lazy_preprocess True \\\n",
+    "    --deepspeed \"../../finetune/ds_config_zero2.json\" \\\n",
+    "    --use_lora"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "35acf008-1dfe-4d32-8cf5-7022e042aadb",
+   "metadata": {},
+   "source": [
+    "## Merge Weights\n",
+    "\n",
+    "The training of both LoRA and Q-LoRA only saves the adapter parameters. You can load the fine-tuned model and merge weights as shown below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "61021499-4a44-45af-a682-943ed63c2fcb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoModelForCausalLM\n",
+    "from peft import PeftModel\n",
+    "import torch\n",
+    "\n",
+    "model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-1_8B-Chat/\", torch_dtype=torch.float16, device_map=\"auto\", trust_remote_code=True)\n",
+    "model = PeftModel.from_pretrained(model, \"output_qwen/\")\n",
+    "merged_model = model.merge_and_unload()\n",
+    "merged_model.save_pretrained(\"output_qwen_merged\", max_shard_size=\"2048MB\", safe_serialization=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0dfbd261-6451-4532-82e8-3ae19ed93ee1",
+   "metadata": {},
+   "source": [
+    "The tokenizer files are not saved in the new directory in this step. You can copy the tokenizer files or use the following code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ddcba069-340b-4a93-a145-2028b425dd23",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\n",
+    "    \"Qwen/Qwen-1_8B-Chat/\",\n",
+    "    trust_remote_code=True\n",
+    ")\n",
+    "\n",
+    "tokenizer.save_pretrained(\"output_qwen_merged\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe9f2878-79d3-4b1c-ba95-ac2f73aa6e1b",
+   "metadata": {},
+   "source": [
+    "## Test the Model\n",
+    "\n",
+    "After merging the weights, we can test the model as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+    "from transformers.generation import GenerationConfig\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen_merged\", trust_remote_code=True)\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    \"output_qwen_merged\",\n",
+    "    device_map=\"auto\",\n",
+    "    trust_remote_code=True\n",
+    ").eval()\n",
+    "\n",
+    "response, history = model.chat(tokenizer, \"你好\", history=None)\n",
+    "print(response)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/recipes/finetune/deepspeed/finetune_lora_single_gpu.ipynb
+++ b/recipes/finetune/deepspeed/finetune_lora_single_gpu.ipynb
@@ -0,0 +1,274 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6e6981ab-2d9a-4280-923f-235a166855ba",
+   "metadata": {},
+   "source": [
+    "# LoRA Fine-Tuning Qwen-Chat Large Language Model (Single GPU)\n",
+    "\n",
+    "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using an alignment mechanism.\n",
+    "\n",
+    "This notebook uses Qwen-1.8B-Chat as an example to show how to fine-tune the Qianwen model with LoRA using DeepSpeed.\n",
+    "\n",
+    "## Environment Requirements\n",
+    "\n",
+    "Please refer to **requirements.txt** to install the required dependencies.\n",
+    "\n",
+    "## Preparation\n",
+    "\n",
+    "### Download Qwen-1.8B-Chat\n",
+    "\n",
+    "First, download the model files. You can choose to download directly from ModelScope."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from modelscope.hub.snapshot_download import snapshot_download\n",
+    "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f",
+   "metadata": {},
+   "source": [
+    "### Download Example Training Data\n",
+    "\n",
+    "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n",
+    "\n",
+    "Disclaimer: this dataset may be used for research purposes only."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f",
+   "metadata": {},
+   "source": [
+    "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"我是一个语言模型，我叫通义千问。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "You can also use multi-turn conversations as the training set. Here is a simple example:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好，能告诉我遛狗的最佳时间吗？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"当地最佳遛狗时间因地域差异而异，请问您所在的城市是哪里？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"我在纽约市。\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间，因为这些时间段气温较低，遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "## Fine-Tune the Model\n",
+    "\n",
+    "You can directly run the prepared training script to fine-tune the model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "%env CUDA_VISIBLE_DEVICES=0\n",
+    "!python ../../finetune.py \\\n",
+    "    --model_name_or_path \"Qwen/Qwen-1_8B-Chat/\" \\\n",
+    "    --data_path \"Belle_sampled_qwen.json\" \\\n",
+    "    --bf16 \\\n",
+    "    --output_dir \"output_qwen\" \\\n",
+    "    --num_train_epochs 5 \\\n",
+    "    --per_device_train_batch_size 1 \\\n",
+    "    --per_device_eval_batch_size 1 \\\n",
+    "    --gradient_accumulation_steps 16 \\\n",
+    "    --evaluation_strategy \"no\" \\\n",
+    "    --save_strategy \"steps\" \\\n",
+    "    --save_steps 1000 \\\n",
+    "    --save_total_limit 10 \\\n",
+    "    --learning_rate 1e-5 \\\n",
+    "    --weight_decay 0.1 \\\n",
+    "    --adam_beta2 0.95 \\\n",
+    "    --warmup_ratio 0.01 \\\n",
+    "    --lr_scheduler_type \"cosine\" \\\n",
+    "    --logging_steps 1 \\\n",
+    "    --report_to \"none\" \\\n",
+    "    --model_max_length 512 \\\n",
+    "    --gradient_checkpointing \\\n",
+    "    --lazy_preprocess \\\n",
+    "    --use_lora"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5e6f28aa-1772-48ce-aa15-8cf29e7d67b5",
+   "metadata": {},
+   "source": [
+    "## Merge Weights\n",
+    "\n",
+    "The training of both LoRA and Q-LoRA only saves the adapter parameters. You can load the fine-tuned model and merge weights as shown below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4fd5ef2a-34f9-4909-bebe-7b3b086fd16a",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from transformers import AutoModelForCausalLM\n",
+    "from peft import PeftModel\n",
+    "import torch\n",
+    "\n",
+    "model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-1_8B-Chat/\", torch_dtype=torch.float16, device_map=\"auto\", trust_remote_code=True)\n",
+    "model = PeftModel.from_pretrained(model, \"output_qwen/\")\n",
+    "merged_model = model.merge_and_unload()\n",
+    "merged_model.save_pretrained(\"output_qwen_merged\", max_shard_size=\"2048MB\", safe_serialization=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2e3f5b9f-63a1-4599-8d9b-a8d8f764838f",
+   "metadata": {},
+   "source": [
+    "The tokenizer files are not saved in the new directory in this step. You can copy the tokenizer files or use the following code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "10fa5ea3-dd55-4901-86af-c045d4c56533",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\n",
+    "    \"Qwen/Qwen-1_8B-Chat/\",\n",
+    "    trust_remote_code=True\n",
+    ")\n",
+    "\n",
+    "tokenizer.save_pretrained(\"output_qwen_merged\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "804b84d8",
+   "metadata": {},
+   "source": [
+    "## Test the Model\n",
+    "\n",
+    "After merging the weights, we can test the model as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+    "from transformers.generation import GenerationConfig\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen_merged\", trust_remote_code=True)\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    \"output_qwen_merged\",\n",
+    "    device_map=\"auto\",\n",
+    "    trust_remote_code=True\n",
+    ").eval()\n",
+    "\n",
+    "response, history = model.chat(tokenizer, \"你好\", history=None)\n",
+    "print(response)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
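The conversation format described in the notebook above can be checked before training with a short script. This is a minimal sketch and not part of the recipe: the function name `validate_samples` is hypothetical, and the rule that turns alternate between `user` and `assistant`, starting with `user`, is an assumption inferred from the examples rather than a documented requirement of `finetune.py`.

```python
def validate_samples(samples):
    """Return a list of problems found in a list of conversation samples."""
    if not isinstance(samples, list):
        return ["top-level JSON value must be a list of samples"]
    problems = []
    for i, sample in enumerate(samples):
        if not isinstance(sample, dict):
            problems.append(f"sample {i}: not an object")
            continue
        if "id" not in sample:
            problems.append(f"sample {i}: missing 'id'")
        turns = sample.get("conversations")
        if not isinstance(turns, list) or not turns:
            problems.append(f"sample {i}: 'conversations' must be a non-empty list")
            continue
        for j, turn in enumerate(turns):
            # Assumed convention: even turns are from the user, odd turns from the assistant.
            expected = "user" if j % 2 == 0 else "assistant"
            if not isinstance(turn, dict) or turn.get("from") != expected:
                problems.append(f"sample {i}, turn {j}: expected 'from' == '{expected}'")
            elif not isinstance(turn.get("value"), str):
                problems.append(f"sample {i}, turn {j}: 'value' must be a string")
    return problems


# The single-turn example from the notebook passes the check.
example = [
    {
        "id": "identity_0",
        "conversations": [
            {"from": "user", "value": "你好"},
            {"from": "assistant", "value": "我是一个语言模型，我叫通义千问。"},
        ],
    }
]
print(validate_samples(example))  # prints []
```

To check the downloaded file, load it with `json.load` (opened with `encoding="utf-8"`) and pass the result to `validate_samples`.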
--- a/recipes/finetune/deepspeed/finetune_qlora_multi_gpu.ipynb
+++ b/recipes/finetune/deepspeed/finetune_qlora_multi_gpu.ipynb
@@ -0,0 +1,282 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6e6981ab-2d9a-4280-923f-235a166855ba",
+   "metadata": {},
+   "source": [
+    "# QLoRA Fine-Tuning Qwen-Chat Large Language Model (Multiple GPUs)\n",
+    "\n",
+    "Tongyi Qianwen (Qwen) is a Transformer-based large language model developed by Alibaba Cloud and trained on an extensive set of pre-training data. The data is diverse and covers a wide range of sources, including a large amount of internet text, specialized books, and code. In addition, Qwen-Chat, an AI assistant, has been built on top of the pre-trained model using alignment techniques.\n",
+    "\n",
+    "This notebook uses Qwen-1.8B-Chat as an example to show how to fine-tune the Qwen model with QLoRA using DeepSpeed.\n",
+    "\n",
+    "## Environment Requirements\n",
+    "\n",
+    "Please refer to **requirements.txt** to install the required dependencies.\n",
+    "\n",
+    "## Preparation\n",
+    "\n",
+    "### Download Qwen-1.8B-Chat\n",
+    "\n",
+    "First, download the model files. You can download them directly from ModelScope.\n",
+    "\n",
+    "Note that QLoRA training uses the Int4 version of the model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "execution": {
+     "iopub.execute_input": "2023-12-31T08:42:52.842315Z",
+     "iopub.status.busy": "2023-12-31T08:42:52.841665Z",
+     "iopub.status.idle": "2023-12-31T08:44:19.832661Z",
+     "shell.execute_reply": "2023-12-31T08:44:19.832193Z",
+     "shell.execute_reply.started": "2023-12-31T08:42:52.842295Z"
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from modelscope.hub.snapshot_download import snapshot_download\n",
+    "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat-Int4', cache_dir='.', revision='master')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f",
+   "metadata": {},
+   "source": [
+    "### Download Example Training Data\n",
+    "\n",
+    "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n",
+    "\n",
+    "Disclaimer: the dataset may be used for research purposes only."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f",
+   "metadata": {},
+   "source": [
+    "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"我是一个语言模型，我叫通义千问。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "You can also use multi-turn conversations as the training set. Here is a simple example:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好，能告诉我遛狗的最佳时间吗？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"当地最佳遛狗时间因地域差异而异，请问您所在的城市是哪里？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"我在纽约市。\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间，因为这些时间段气温较低，遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "## Fine-Tune the Model\n",
+    "\n",
+    "You can directly run the prepared training script to fine-tune the model. **nproc_per_node** refers to the number of GPUs used for training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "execution": {
+     "iopub.execute_input": "2023-12-31T08:45:37.959631Z",
+     "iopub.status.busy": "2023-12-31T08:45:37.958961Z",
+     "iopub.status.idle": "2023-12-31T08:46:19.501657Z",
+     "shell.execute_reply": "2023-12-31T08:46:19.500854Z",
+     "shell.execute_reply.started": "2023-12-31T08:45:37.959609Z"
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 ../../finetune.py \\\n",
+    "    --model_name_or_path \"Qwen/Qwen-1_8B-Chat-Int4/\" \\\n",
+    "    --data_path \"Belle_sampled_qwen.json\" \\\n",
+    "    --fp16 True \\\n",
+    "    --output_dir \"output_qwen\" \\\n",
+    "    --num_train_epochs 5 \\\n",
+    "    --per_device_train_batch_size 1 \\\n",
+    "    --per_device_eval_batch_size 1 \\\n",
+    "    --gradient_accumulation_steps 16 \\\n",
+    "    --evaluation_strategy \"no\" \\\n",
+    "    --save_strategy \"steps\" \\\n",
+    "    --save_steps 1000 \\\n",
+    "    --save_total_limit 10 \\\n",
+    "    --learning_rate 1e-5 \\\n",
+    "    --weight_decay 0.1 \\\n",
+    "    --adam_beta2 0.95 \\\n",
+    "    --warmup_ratio 0.01 \\\n",
+    "    --lr_scheduler_type \"cosine\" \\\n",
+    "    --logging_steps 1 \\\n",
+    "    --report_to \"none\" \\\n",
+    "    --model_max_length 512 \\\n",
+    "    --gradient_checkpointing True \\\n",
+    "    --lazy_preprocess True \\\n",
+    "    --deepspeed \"../../finetune/ds_config_zero2.json\" \\\n",
+    "    --use_lora \\\n",
+    "    --q_lora"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Merge Weights\n",
+    "\n",
+    "The training of both LoRA and Q-LoRA only saves the adapter parameters. Note that you cannot merge the adapter weights into a quantized model. Instead, merge them into the original (non-quantized) chat model.\n",
+    "\n",
+    "You can load the fine-tuned model and merge weights as shown below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from modelscope.hub.snapshot_download import snapshot_download\n",
+    "snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')\n",
+    "\n",
+    "from transformers import AutoModelForCausalLM\n",
+    "from peft import PeftModel\n",
+    "import torch\n",
+    "\n",
+    "model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-1_8B-Chat/\", torch_dtype=torch.float16, device_map=\"auto\", trust_remote_code=True)\n",
+    "model = PeftModel.from_pretrained(model, \"output_qwen/\")\n",
+    "merged_model = model.merge_and_unload()\n",
+    "merged_model.save_pretrained(\"output_qwen_merged\", max_shard_size=\"2048MB\", safe_serialization=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The tokenizer files are not saved in the new directory in this step. You can copy the tokenizer files or use the following code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\n",
+    "    \"Qwen/Qwen-1_8B-Chat-Int4/\",\n",
+    "    trust_remote_code=True\n",
+    ")\n",
+    "\n",
+    "tokenizer.save_pretrained(\"output_qwen_merged\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Test the Model\n",
+    "\n",
+    "After merging the weights, we can test the model as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+    "from transformers.generation import GenerationConfig\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen_merged\", trust_remote_code=True)\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    \"output_qwen_merged\",\n",
+    "    device_map=\"auto\",\n",
+    "    trust_remote_code=True\n",
+    ").eval()\n",
+    "\n",
+    "response, history = model.chat(tokenizer, \"你好\", history=None)\n",
+    "print(response)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
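For the multi-GPU command in the notebook above, the number of samples that contribute to each optimizer step follows from its flags. The arithmetic can be sketched as below. The helper name is hypothetical, and the sketch assumes plain data parallelism, where every process handles its own micro-batches.

```python
def effective_batch_size(nproc_per_node, nnodes,
                         per_device_train_batch_size,
                         gradient_accumulation_steps):
    """Samples consumed per optimizer step under data-parallel training."""
    world_size = nproc_per_node * nnodes  # total number of training processes
    return world_size * per_device_train_batch_size * gradient_accumulation_steps


# Values from the torchrun command: 2 GPUs, 1 node, batch size 1, 16 accumulation steps.
print(effective_batch_size(2, 1, 1, 16))  # prints 32
```

With the same flags on a single GPU the result is 16, so changing `nproc_per_node` also changes the effective batch size unless `gradient_accumulation_steps` is adjusted to compensate.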
--- a/recipes/finetune/deepspeed/finetune_qlora_single_gpu.ipynb
+++ b/recipes/finetune/deepspeed/finetune_qlora_single_gpu.ipynb
@@ -0,0 +1,283 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6e6981ab-2d9a-4280-923f-235a166855ba",
+   "metadata": {},
+   "source": [
+    "# QLoRA Fine-Tuning Qwen-Chat Large Language Model (Single GPU)\n",
+    "\n",
+    "Tongyi Qianwen (Qwen) is a Transformer-based large language model developed by Alibaba Cloud and trained on an extensive set of pre-training data. The data is diverse and covers a wide range of sources, including a large amount of internet text, specialized books, and code. In addition, Qwen-Chat, an AI assistant, has been built on top of the pre-trained model using alignment techniques.\n",
+    "\n",
+    "This notebook uses Qwen-1.8B-Chat as an example to show how to fine-tune the Qwen model with QLoRA using DeepSpeed.\n",
+    "\n",
+    "## Environment Requirements\n",
+    "\n",
+    "Please refer to **requirements.txt** to install the required dependencies.\n",
+    "\n",
+    "## Preparation\n",
+    "\n",
+    "### Download Qwen-1.8B-Chat\n",
+    "\n",
+    "First, download the model files. You can choose to download directly from ModelScope.\n",
+    "\n",
+    "Note that we use the Int4 version of the models for QLoRA training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from modelscope.hub.snapshot_download import snapshot_download\n",
+    "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat-Int4', cache_dir='.', revision='master')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f",
+   "metadata": {},
+   "source": [
+    "### Download Example Training Data\n",
+    "\n",
+    "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n",
+    "\n",
+    "Disclaimer: the dataset may be used for research purposes only."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f",
+   "metadata": {},
+   "source": [
+    "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"我是一个语言模型，我叫通义千问。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "You can also use multi-turn conversations as the training set. Here is a simple example:\n",
+    "\n",
+    "```json\n",
+    "[\n",
+    "  {\n",
+    "    \"id\": \"identity_0\",\n",
+    "    \"conversations\": [\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"你好，能告诉我遛狗的最佳时间吗？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"当地最佳遛狗时间因地域差异而异，请问您所在的城市是哪里？\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"user\",\n",
+    "        \"value\": \"我在纽约市。\"\n",
+    "      },\n",
+    "      {\n",
+    "        \"from\": \"assistant\",\n",
+    "        \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间，因为这些时间段气温较低，遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n",
+    "      }\n",
+    "    ]\n",
+    "  }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "## Fine-Tune the Model\n",
+    "\n",
+    "You can directly run the prepared training script to fine-tune the model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!python ../../finetune.py \\\n",
+    "    --model_name_or_path \"Qwen/Qwen-1_8B-Chat-Int4/\" \\\n",
+    "    --data_path \"Belle_sampled_qwen.json\" \\\n",
+    "    --fp16 \\\n",
+    "    --output_dir \"output_qwen\" \\\n",
+    "    --num_train_epochs 5 \\\n",
+    "    --per_device_train_batch_size 1 \\\n",
+    "    --per_device_eval_batch_size 1 \\\n",
+    "    --gradient_accumulation_steps 16 \\\n",
+    "    --evaluation_strategy \"no\" \\\n",
+    "    --save_strategy \"steps\" \\\n",
+    "    --save_steps 1000 \\\n",
+    "    --save_total_limit 10 \\\n",
+    "    --learning_rate 1e-5 \\\n",
+    "    --weight_decay 0.1 \\\n",
+    "    --adam_beta2 0.95 \\\n",
+    "    --warmup_ratio 0.01 \\\n",
+    "    --lr_scheduler_type \"cosine\" \\\n",
+    "    --logging_steps 1 \\\n",
+    "    --report_to \"none\" \\\n",
+    "    --model_max_length 512 \\\n",
+    "    --gradient_checkpointing \\\n",
+    "    --lazy_preprocess \\\n",
+    "    --use_lora \\\n",
+    "    --q_lora \\\n",
+    "    --deepspeed \"../../finetune/ds_config_zero2.json\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0a50941d-3c3c-4ed2-9185-d4fe6172da2f",
+   "metadata": {},
+   "source": [
+    "## Merge Weights\n",
+    "\n",
+    "The training of both LoRA and Q-LoRA only saves the adapter parameters. Note that you cannot merge the adapter weights into a quantized model. Instead, merge them into the original (non-quantized) chat model.\n",
+    "\n",
+    "You can load the fine-tuned model and merge weights as shown below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "909ff537-f851-488e-b1e8-1046f6852202",
+   "metadata": {
+    "ExecutionIndicator": {
+     "show": true
+    },
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from modelscope.hub.snapshot_download import snapshot_download\n",
+    "snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')\n",
+    "\n",
+    "from transformers import AutoModelForCausalLM\n",
+    "from peft import PeftModel\n",
+    "import torch\n",
+    "\n",
+    "model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-1_8B-Chat/\", torch_dtype=torch.float16, device_map=\"auto\", trust_remote_code=True)\n",
+    "model = PeftModel.from_pretrained(model, \"output_qwen/\")\n",
+    "merged_model = model.merge_and_unload()\n",
+    "merged_model.save_pretrained(\"output_qwen_merged\", max_shard_size=\"2048MB\", safe_serialization=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7969df6e-ba8a-45f5-8b44-e1cbe74a8ef6",
+   "metadata": {},
+   "source": [
+    "The tokenizer files are not saved in the new directory in this step. You can copy the tokenizer files or use the following code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c01b6a3f-036f-4b7c-b5a6-76a7b6894d4e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\n",
+    "    \"Qwen/Qwen-1_8B-Chat-Int4/\",\n",
+    "    trust_remote_code=True\n",
+    ")\n",
+    "\n",
+    "tokenizer.save_pretrained(\"output_qwen_merged\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2944b9b-89c7-4fb5-bd08-941d4706e943",
+   "metadata": {},
+   "source": [
+    "## Test the Model\n",
+    "\n",
+    "After merging the weights, we can test the model as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b77abbb1-5b29-4eb1-8a6c-e2e146b8d33d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+    "from transformers.generation import GenerationConfig\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen_merged\", trust_remote_code=True)\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    \"output_qwen_merged\",\n",
+    "    device_map=\"auto\",\n",
+    "    trust_remote_code=True\n",
+    ").eval()\n",
+    "\n",
+    "response, history = model.chat(tokenizer, \"你好\", history=None)\n",
+    "print(response)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
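Before running the merge step in the notebook above, it can help to confirm which base model the saved adapter points to. PEFT writes an `adapter_config.json` file into the output directory, and its `base_model_name_or_path` field records the model the adapter was trained on. The following is a minimal sketch: both helper names are hypothetical, and the suffix check is only a heuristic based on the model names used in these recipes.

```python
import json
import os


def read_adapter_base(adapter_dir):
    """Return the base model path recorded in a PEFT adapter directory."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)["base_model_name_or_path"]


def looks_quantized(model_path):
    """Heuristic: quantized checkpoints in these recipes end in -Int4 or -Int8."""
    name = model_path.rstrip("/").lower()
    return name.endswith("-int4") or name.endswith("-int8")


print(looks_quantized("Qwen/Qwen-1_8B-Chat-Int4/"))  # prints True
print(looks_quantized("Qwen/Qwen-1_8B-Chat/"))       # prints False
```

If `looks_quantized(read_adapter_base("output_qwen"))` returns `True`, the adapter was trained on a quantized model, and the merge should load the corresponding non-quantized chat model explicitly, as the notebook does.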
--- a/recipes/finetune/deepspeed/readme.md
+++ b/recipes/finetune/deepspeed/readme.md
@@ -0,0 +1,416 @@
+# Fine-tuning Qwen Using DeepSpeed
+
+
+## TL;DR
+
+We provide the official training script `finetune.py` and several notebooks that let users fine-tune pre-trained models for downstream applications in a simple fashion. The supported algorithms are full-parameter fine-tuning, LoRA fine-tuning, and Q-LoRA fine-tuning. The notebooks for each setting are listed below:
+
+| Algorithm | Single GPU | Multiple GPUs|
+| --- | --- | --- |
+| Full-parameter Fine-tuning | [finetune_fullparameter_single_gpu](finetune_fullparameter_single_gpu.ipynb) | [finetune_fullparameter_multi_gpu](finetune_fullparameter_multi_gpu.ipynb) |
+| LoRA Fine-tuning | [finetune_lora_single_gpu](finetune_lora_single_gpu.ipynb) | [finetune_lora_multi_gpu](finetune_lora_multi_gpu.ipynb) |
+| Q-LoRA Fine-tuning | [finetune_qlora_single_gpu](finetune_qlora_single_gpu.ipynb) | [finetune_qlora_multi_gpu](finetune_qlora_multi_gpu.ipynb) |
+
+## Requirements
+
+### Environments
+
+The basic requirements for running Qwen models include:
+
+- python 3.8 and above
+- pytorch 1.12 and above, 2.0 and above are recommended
+- transformers 4.32 and above
+- CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
+
+Our notebooks launch fine-tuning with DeepSpeed and Peft.
+(Note: these packages may conflict with the latest version of pydantic, so make sure that `pydantic<2.0` is installed.)
+You can install them by:
+```bash
+pip install peft deepspeed
+```
+
+### Settings and GPU Requirements
+
+We first provide the support matrix for different learning settings. Full-parameter fine-tuning requires updating all parameters in the whole training process.
+In comparison with full-parameter fine-tuning, LoRA only updates the parameters of the adapter layers and keeps the original large language model layers frozen. This requires much less memory and thus less computation. If you still run out of memory, consider Q-LoRA, which uses a quantized large language model to reduce memory usage even further. Generally, GPU memory consumption for tuning Qwen ranks as follows: full parameter > full parameter (ZeRO2) > full parameter (ZeRO3) > LoRA > LoRA (ZeRO2) > LoRA (ZeRO3) > Q-LoRA > Q-LoRA (ZeRO2).
+
+| Setting | Full-parameter | LoRA | Q-LoRA |
+| --- | --- | --- | --- |
+| Base | Yes (up to ZeRO3) | Yes (up to ZeRO2) | No |
+| Chat | Yes (up to ZeRO3) | Yes (up to ZeRO3) | No |
+| Chat-Int4/8 | No | No | Yes |
+
+Here are some suggestions for choosing a fine-tuning setting based on GPU memory, especially for users with GeForce RTX 3090/4090 (24GB) GPUs (or similar) and A100 (80GB) GPUs (or similar). In the experiments, we uniformly use a batch size of 1, gradient accumulation of 16, and a max length of 512. Other parameters are the same as those in our notebooks. The results are as follows.
+
+| GPU Memory | Number of GPUs |  Qwen-1.8B-Chat | Qwen-7B-Chat | Qwen-14B-Chat | Qwen-72B-Chat |
+| --- | --- | --- | --- | --- |  --- |
+| 24GB | *1 | Full Parameter | LoRA | Q-LoRA | N/A |
+| 24GB | *2 | Full Parameter | LoRA | Q-LoRA | N/A |
+| 24GB | *4 | Full Parameter | LoRA | LoRA (w/ ZeRO3) | N/A |
+| 80GB | *1 | Full Parameter | LoRA | LoRA | Q-LoRA |
+| 80GB | *2 | Full Parameter | Full Parameter (w/ ZeRO3) | LoRA (w/ ZeRO2) | TBD |
+| 80GB | *4 | Full Parameter | Full Parameter (w/ ZeRO2) | Full Parameter (w/ ZeRO3) | LoRA (w/ ZeRO3) |
+
+Other combinations of LoRA/Q-LoRA and ZeRO stages are likely to fail.
+
+
+## Data Preparation
+
+To prepare your training data, put all the samples into a list and save it to a JSON file. Each sample is a dictionary consisting of an id and a list of conversation turns. Below is a simple example list with 1 sample:
+```json
+[
+  {
+    "id": "identity_0",
+    "conversations": [
+      {
+        "from": "user",
+        "value": "你好"
+      },
+      {
+        "from": "assistant",
+        "value": "我是一个语言模型，我叫通义千问。"
+      }
+    ]
+  }
+]
+```
+
+You can also use multi-turn conversations as the training set. Here is a simple example:
+
+```json
+[
+  {
+    "id": "identity_0",
+    "conversations": [
+      {
+        "from": "user",
+        "value": "你好"
+      },
+      {
+        "from": "assistant",
+        "value": "你好！我是一名AI助手，我叫通义千问，有需要请告诉我。"
+      },
+      {
+        "from": "user",
+        "value": "你都能做什么"
+      },
+      {
+        "from": "assistant",
+        "value": "我能做很多事情，包括但不限于回答各种领域的问题、提供实用建议和指导、进行多轮对话交流、文本生成等。"
+      }
+    ]
+  }
+]
+```
+
+
+## Single-GPU Training
+
+In the single-GPU training setting, we provide three notebooks:
+
+- [finetune_fullparameter_single_gpu](finetune_fullparameter_single_gpu.ipynb)
+- [finetune_lora_single_gpu](finetune_lora_single_gpu.ipynb)
+- [finetune_qlora_single_gpu](finetune_qlora_single_gpu.ipynb)
+
+### Full-parameter Fine-tuning
+
+To launch your training, run the following command (with hyper-parameter settings omitted):
+```bash
+python finetune.py \
+    --model_name_or_path $MODEL \
+    --data_path  $DATA \
+    --output_dir $OUTPUT
+```
+Remember to specify the correct model name or path, the data path, as well as the output directory.
+
+### LoRA Fine-tuning
+
+Similarly, to run LoRA, use another notebook to run the command shown below. Before you start, make sure that you have installed `peft`. You also need to specify the paths to your model, data, and output. We advise you to use an absolute path for your pre-trained model, because LoRA only saves the adapter, and the path stored in the adapter configuration JSON file is used to locate the pre-trained model to load.
+```bash
+python finetune.py \
+    --model_name_or_path $MODEL \
+    --data_path  $DATA \
+    --output_dir $OUTPUT \
+    --use_lora
+```
+Note that if you use LoRA to fine-tune the base language model, e.g., Qwen-7B, instead of a chat model, e.g., Qwen-7B-Chat, the script automatically makes the embedding and output layers trainable. This is because the base language model has no knowledge of the special tokens introduced by the ChatML format, so these layers must be updated for the model to understand and predict the tokens. In other words, if your LoRA training introduces special tokens, you should make these layers trainable by setting `modules_to_save` inside the code. Check out the following code in the training script `finetune.py`:
+```python
+is_chat_model = 'chat' in model_args.model_name_or_path.lower()
+if training_args.use_lora:
+    if lora_args.q_lora or is_chat_model:
+        # Chat models already know the ChatML tokens; Int4 layers are not trainable.
+        modules_to_save = None
+    else:
+        # Base models must also train the embedding and output layers.
+        modules_to_save = ["wte", "lm_head"]
+    lora_config = LoraConfig(
+        r=lora_args.lora_r,
+        lora_alpha=lora_args.lora_alpha,
+        target_modules=lora_args.lora_target_modules,
+        lora_dropout=lora_args.lora_dropout,
+        bias=lora_args.lora_bias,
+        task_type="CAUSAL_LM",
+        modules_to_save=modules_to_save  # This argument serves for adding new tokens.
+    )
+    ...
+    model = get_peft_model(model, lora_config)
+    ...
+```
+Note that the script relies on the model path to identify the model type, so please keep `chat` in the paths of chat models.
+
+
+
+### Q-LoRA Fine-tuning
+
+To run single-GPU Q-LoRA training, you may need to install `mpi4py`. Directly run the following script:
+```bash
+python finetune.py \
+    --model_name_or_path $MODEL \
+    --data_path  $DATA \
+    --output_dir $OUTPUT \
+    --use_lora \
+    --q_lora \
+    --deepspeed "ds_config_zero2.json"
+```
+
+For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. You **SHOULD NOT** use the bf16 models. Unlike full-parameter fine-tuning and LoRA, Q-LoRA supports only fp16. For single-GPU training, we have to use DeepSpeed for mixed-precision training because we observed errors caused by torch amp. The special-token issue described for LoRA also applies to Q-LoRA. However, we only provide Int4 versions of the chat models, which have already learned the special tokens of the ChatML format, so you do not need to worry about these layers. Note that the layers of the Int4 model should not be trainable, so if you introduce new special tokens in your training, Q-LoRA might not work.
+
+
+By default, our notebooks provide training code for Qwen-1.8B-Chat.
+You can also run the training script to fine-tune other versions of the Qwen-series models. We profiled the GPU memory usage of all versions based on our notebooks (without changing any hyper-parameter settings) on a single A800 GPU (80GB). The statistics are listed below:
+
+| Training | Qwen-1.8B-Chat | Qwen-7B-Chat | Qwen-14B-Chat | Qwen-72B-Chat |
+| --- | --- | --- | --- | --- |
+| Full Parameter | 19.6GB | 76.8GB | OOM | OOM |
+| LoRA | 7.4GB | 20.3GB | 34.2GB | OOM |
+| Q-LoRA | 6.1GB | 12.5GB | 17.8GB | 61.9GB |
+
+
+### Merging Weights from LoRA and Q-LoRA
+
+
+#### Inference with Adapters
+
+Unlike full-parameter fine-tuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. Supposing your training starts from Qwen-7B, you can load the fine-tuned model for inference as shown below:
+```python
+from peft import AutoPeftModelForCausalLM
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(
+    path_to_adapter, # path to the output directory
+    trust_remote_code=True
+)
+model = AutoPeftModelForCausalLM.from_pretrained(
+    path_to_adapter, # path to the output directory
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+
+response, history = model.chat(tokenizer, "你好", history=None)
+```
+
+#### Inference with Merged Weights
+
+If you want to merge the adapters and save the fine-tuned model as a standalone model, you can run the following code (taking LoRA as an example):
+```python
+from peft import AutoPeftModelForCausalLM
+
+model = AutoPeftModelForCausalLM.from_pretrained(
+    path_to_adapter, # path to the output directory
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+
+merged_model = model.merge_and_unload()
+# max_shard_size and safe_serialization are optional.
+# They control checkpoint sharding and saving in the safetensors format, respectively.
+merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
+```
+
+The `new_model_directory` directory will contain the merged model weights and module files. Please note that `*.cu` and `*.cpp` files may be missing from the saved files. If you wish to use the KV cache functionality, please copy them manually. In addition, the tokenizer files are not saved to the new directory in this step. You can copy the tokenizer files or use the following code:
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(
+    path_to_adapter, # path to the output directory
+    trust_remote_code=True
+)
+tokenizer.save_pretrained(new_model_directory)
+```
+Next, the model with merged weights can be loaded by the following code:
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(new_model_directory, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    new_model_directory,
+    device_map="auto",
+    trust_remote_code=True
+).eval()
+response, history = model.chat(tokenizer, "你好", history=None)
+```
+
+Note that you cannot merge weights into quantized models. Instead, merge the adapter weights into the original (non-quantized) chat model. Taking an adapter trained on Qwen-7B-Chat-Int4 as an example:
+```python
+from transformers import AutoModelForCausalLM
+from peft import PeftModel
+import torch
+
+# Here, we load the original Qwen-7B-Chat model, instead of the Qwen-7B-Chat-Int4 model.
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
+# We merge the learned adapter to the Qwen-7B-Chat.
+model = PeftModel.from_pretrained(model, path_to_adapter)
+merged_model = model.merge_and_unload()
+# We save the model to a new path.
+merged_model.save_pretrained(path_to_new_model, max_shard_size="2048MB", safe_serialization=True)
+```
+
+
+## Multi-GPU Training
+
+In the multi-GPU training setting, we provide three notebooks:
+
+- [finetune_fullparameter_multi_gpu](finetune_fullparameter_multi_gpu.ipynb)
+- [finetune_lora_multi_gpu](finetune_lora_multi_gpu.ipynb)
+- [finetune_qlora_multi_gpu](finetune_qlora_multi_gpu.ipynb)
+
+We use `torchrun` to launch the training job on multiple GPUs:
+
+```bash
+# for full-parameter fine-tuning
+torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 finetune.py \
+    --model_name_or_path $MODEL \
+    --data_path  $DATA \
+    --output_dir $OUTPUT \
+    --deepspeed "ds_config_zero2.json"
+
+# for LoRA fine-tuning
+torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 finetune.py \
+    --model_name_or_path $MODEL \
+    --data_path  $DATA \
+    --output_dir $OUTPUT \
+    --deepspeed "ds_config_zero2.json" \
+    --use_lora
+
+# for Q-LoRA fine-tuning
+torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 finetune.py \
+    --model_name_or_path $MODEL \
+    --data_path  $DATA \
+    --output_dir $OUTPUT \
+    --deepspeed "ds_config_zero2.json" \
+    --use_lora \
+    --q_lora
+```
+
+For multi-GPU training, you also need to set the distributed-training hyperparameters to match your machine. We also advise you to set the maximum sequence length with the argument `--model_max_length`, taking into account your data, memory footprint, and training speed.
+For the usage of `torchrun` and its distributed arguments, please refer to the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html).
+Additionally, we find a significant gap between the memory footprint of LoRA with and without trainable embedding and output layers. Therefore, if you have trouble with memory, we advise you to LoRA fine-tune the chat models. Check the profile below for more information.
+
+
+### Multi-node Fine-tuning
+
+Our provided scripts also support multi-node fine-tuning. You can refer to the comments in the scripts to correctly set corresponding arguments and launch the script on each node. For more information about multi-node distributed training, please refer to [torchrun](https://pytorch.org/docs/stable/elastic/run.html).
+
+Note: DeepSpeed ZeRO 3 requires a much higher inter-node communication rate than ZeRO 2, which significantly reduces training speed in multi-node fine-tuning. Therefore, we do not recommend using DeepSpeed ZeRO 3 configurations in multi-node fine-tuning scripts.
+
+### Profiling of Memory and Speed
+
+We profile the GPU memory and training speed of both LoRA and Q-LoRA in the single-GPU training setup. Here, LoRA (emb) refers to training the embedding and output layers, while LoRA has no trainable embedding or output layer. In this test, we experiment on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. Flash attention 2 is applied. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) for inputs of different lengths, namely 256, 512, 1024, 2048, 4096, and 8192. We also report the statistics of full-parameter fine-tuning with Qwen-7B on 2 A100 GPUs, for which we only report 256, 512, and 1024 tokens due to the limitation of GPU memory.
+
+For Qwen-7B, we also test the performance of multi-node fine-tuning. We experiment with two servers, each containing two A100-SXM4-80G GPUs; the remaining configuration is the same as in the other Qwen-7B experiments. The results of multi-node fine-tuning are marked as LoRA (multinode) in the table.
+
+For Qwen-72B, we experiment in two ways: 1) LoRA fine-tuning + DeepSpeed ZeRO 3 on 4 A100-SXM4-80G GPUs and 2) Q-LoRA (int4) fine-tuning on a single A100-SXM4-80G GPU. Note that OOM occurs on 4 A100-SXM4-80G GPUs both with LoRA (emb) fine-tuning and with LoRA fine-tuning without DeepSpeed ZeRO 3 (you can pass `--deepspeed ds_config_zero3.json` to `finetune_lora_ds.sh` to enable DeepSpeed ZeRO 3).
+
+The statistics are listed below:
+
+<table>
+    <tr>
+      <th rowspan="2">Model Size</th><th rowspan="2">Method</th><th rowspan="2">#Nodes</th><th rowspan="2">#GPUs per node</th><th colspan="6" align="center">Sequence Length</th>
+    </tr>
+    <tr>
+        <th align="center">256</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th>
+    </tr>
+    <tr>
+        <th rowspan="4">1.8B</th><td>LoRA</td>
+        <td>1</td><td>1</td>
+        <td align="center">6.7G / 1.0s/it</td><td align="center">7.4G / 1.0s/it</td><td align="center">8.4G / 1.1s/it</td><td align="center">11.0G / 1.7s/it</td><td align="center">16.2G / 3.3s/it</td><td align="center">21.8G / 6.8s/it</td>
+    </tr>
+    <tr>
+        <td>LoRA (emb)</td>
+        <td>1</td><td>1</td>
+        <td align="center">13.7G / 1.0s/it</td><td align="center">14.0G / 1.0s/it</td><td align="center">14.0G / 1.1s/it</td><td align="center">15.1G / 1.8s/it</td><td align="center">19.7G / 3.4s/it</td><td align="center">27.7G / 7.0s/it</td>
+    </tr>
+    <tr>
+        <td>Q-LoRA</td>
+        <td>1</td><td>1</td>
+        <td align="center">5.8G / 1.4s/it</td><td align="center">6.0G / 1.4s/it</td><td align="center">6.6G / 1.4s/it</td><td align="center">7.8G / 2.0s/it</td><td align="center">10.2G / 3.4s/it</td><td align="center">15.8G / 6.5s/it</td>
+    </tr>
+    <tr>
+        <td>Full-parameter</td>
+        <td>1</td><td>1</td>
+        <td align="center">43.5G / 2.1s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.3s/it</td><td align="center">47.1G / 2.8s/it</td><td align="center">48.3G / 5.6s/it</td>
+    </tr>
+    <tr>
+        <th rowspan="5">7B</th>
+        <td>LoRA</td>
+        <td>1</td><td>1</td>
+        <td align="center">20.1G / 1.2s/it</td><td align="center">20.4G / 1.5s/it</td><td align="center">21.5G / 2.8s/it</td><td align="center">23.8G / 5.2s/it</td><td align="center">29.7G / 10.1s/it</td><td align="center">36.6G / 21.3s/it</td>
+    </tr>
+    <tr>
+        <td>LoRA (emb)</td>
+        <td>1</td><td>1</td>
+        <td align="center">33.7G / 1.4s/it</td><td align="center">34.1G / 1.6s/it</td><td align="center">35.2G / 2.9s/it</td><td align="center">35.1G / 5.3s/it</td><td align="center">39.2G / 10.3s/it</td><td align="center">48.5G / 21.7s/it</td>
+    </tr>
+    <tr>
+        <td>Q-LoRA</td>
+        <td>1</td><td>1</td>
+        <td align="center">11.5G / 3.0s/it</td><td align="center">11.5G / 3.0s/it</td><td align="center">12.3G / 3.5s/it</td><td align="center">13.9G / 7.0s/it</td><td align="center">16.9G / 11.6s/it</td><td align="center">23.5G / 22.3s/it</td>
+    </tr>
+    <tr>
+        <td>Full-parameter</td>
+        <td>1</td><td>2</td>
+        <td align="center">139.2G / 4.0s/it</td><td align="center">148.0G / 4.0s/it</td><td align="center">162.0G / 4.5s/it</td><td align="center">-</td><td align="center">-</td><td align="center">-</td>
+    </tr>
+    <tr>
+        <td>LoRA (multinode)</td>
+        <td>2</td><td>2</td>
+        <td align="center">74.7G / 2.09s/it</td><td align="center">77.6G / 3.16s/it</td><td align="center">84.9G / 5.17s/it</td><td align="center">95.1G / 9.25s/it</td><td align="center">121.1G / 18.1s/it</td><td align="center">155.5G / 37.4s/it</td>
+    </tr>
+    <tr>
+        <th rowspan="3">14B</th>
+        <td>LoRA</td>
+        <td>1</td><td>1</td>
+        <td align="center">34.6G / 1.6s/it</td><td align="center">35.1G / 2.4s/it</td><td align="center">35.3G / 4.4s/it</td><td align="center">37.4G / 8.4s/it</td><td align="center">42.5G / 17.0s/it</td><td align="center">55.2G / 36.0s/it</td>
+    </tr>
+    <tr>
+        <td>LoRA (emb)</td>
+        <td>1</td><td>1</td>
+        <td align="center">51.2G / 1.7s/it</td><td align="center">51.1G / 2.6s/it</td><td align="center">51.5G / 4.6s/it</td><td align="center">54.1G / 8.6s/it</td><td align="center">56.8G / 17.2s/it</td><td align="center">67.7G / 36.3s/it</td>
+    </tr>
+    <tr>
+        <td>Q-LoRA</td>
+        <td>1</td><td>1</td>
+        <td align="center">18.7G / 5.3s/it</td><td align="center">18.4G / 6.3s/it</td><td align="center">18.9G / 8.2s/it</td><td align="center">19.9G / 11.8s/it</td><td align="center">23.0G / 20.1s/it</td><td align="center">27.9G / 38.3s/it</td>
+    </tr>
+    <tr>
+        <th rowspan="2">72B</th>
+        <td>LoRA + DeepSpeed ZeRO 3</td>
+        <td>1</td><td>4</td>
+        <td align="center">215.4G / 17.6s/it</td><td align="center">217.7G / 20.5s/it</td><td align="center">222.6G / 29.4s/it</td><td align="center">228.8G / 45.7s/it</td><td align="center">249.0G / 83.4s/it</td><td align="center">289.2G / 161.5s/it</td>
+    </tr>
+    <tr>
+        <td>Q-LoRA</td>
+        <td>1</td><td>1</td>
+        <td align="center">61.4G / 27.4s/it</td><td align="center">61.4G / 31.5s/it</td><td align="center">62.9G / 41.4s/it</td><td align="center">64.1G / 59.5s/it</td><td align="center">68.0G / 97.7s/it</td><td align="center">75.6G / 179.8s/it</td>
+    </tr>
+</table>
+<br>
+
+
+
+
+
+
+
+
+
+
+
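As a rough aid for reading the profiling table in the README above, the sketch below encodes its three single-GPU Qwen-7B rows and reports which methods fit within a given memory budget at a given sequence length. The numbers are copied from the table; the names `QWEN_7B_MEMORY_GB`, `methods_that_fit`, and `emb_overhead_gb` are hypothetical and chosen only for illustration.

```python
# Peak GPU memory (GB) for Qwen-7B on a single A100-SXM4-80G, copied from
# the profiling table. Inner keys are sequence lengths in tokens.
QWEN_7B_MEMORY_GB = {
    "LoRA": {256: 20.1, 512: 20.4, 1024: 21.5, 2048: 23.8, 4096: 29.7, 8192: 36.6},
    "LoRA (emb)": {256: 33.7, 512: 34.1, 1024: 35.2, 2048: 35.1, 4096: 39.2, 8192: 48.5},
    "Q-LoRA": {256: 11.5, 512: 11.5, 1024: 12.3, 2048: 13.9, 4096: 16.9, 8192: 23.5},
}


def methods_that_fit(budget_gb, seq_len, table=QWEN_7B_MEMORY_GB):
    """Return the methods whose profiled memory at seq_len is within budget_gb."""
    return [method for method, by_len in table.items() if by_len[seq_len] <= budget_gb]


def emb_overhead_gb(seq_len, table=QWEN_7B_MEMORY_GB):
    """Extra memory (GB) of LoRA (emb) over plain LoRA at seq_len."""
    return round(table["LoRA (emb)"][seq_len] - table["LoRA"][seq_len], 1)


if __name__ == "__main__":
    # Hypothetical budget: a 24GB card and 2048-token sequences.
    print(methods_that_fit(24, 2048))
    print(emb_overhead_gb(256))
```

These figures come from one specific setup (batch size 1, gradient accumulation 8, Flash attention 2), so the result is a starting estimate rather than a guarantee for other configurations.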
--- a/recipes/finetune/deepspeed/requirements.txt
+++ b/recipes/finetune/deepspeed/requirements.txt
@@ -0,0 +1,2 @@
+deepspeed
+peft
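The requirements file above lists `deepspeed` and `peft`, and the README notes that single-GPU Q-LoRA training may also need `mpi4py`. A minimal sketch, using only the Python standard library, for checking which of these are importable before launching a run; the helper name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # mpi4py is only mentioned as possibly needed for single-GPU Q-LoRA.
    missing = missing_packages(["deepspeed", "peft", "mpi4py"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

The check only confirms that a module can be located; it does not verify versions or that the package works with your CUDA setup.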
--- a/recipes/finetune/swift/README.md
+++ b/recipes/finetune/swift/README.md
@@ -0,0 +1,198 @@
+## Introduction
+[SWIFT](https://github.com/modelscope/swift) (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible framework designed to facilitate lightweight model fine-tuning and inference. It integrates implementations of various efficient fine-tuning methods, embracing approaches that are parameter-efficient, memory-efficient, and time-efficient. SWIFT integrates seamlessly into the ModelScope ecosystem and offers the capability to fine-tune various models, with a primary emphasis on LLMs and vision models. Additionally, SWIFT is fully compatible with PEFT, enabling users to leverage the familiar PEFT interface to fine-tune ModelScope models.
+
+## Installation
+
+```shell
+# Set the global pip mirror
+pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
+# Install ms-swift
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e .[llm]
+
+# If you want to use deepspeed
+pip install deepspeed -U
+
+# If you want to use qlora training based on auto_gptq (recommended, performs better than bnb)
+# Models supporting auto_gptq: `https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md#模型`
+# There's a version correspondence between auto_gptq and cuda; refer to `https://github.com/PanQiWei/AutoGPTQ#quick-installation` for selecting the appropriate version
+pip install auto_gptq -U
+
+# If you want to use qlora training based on bnb
+pip install bitsandbytes -U
+
+# Environment alignment (run the following commands if you encounter errors; the repository is tested with the latest environment)
+pip install -r requirements/framework.txt  -U
+pip install -r requirements/llm.txt  -U
+```
+
+## WebUI Usage
+
+Run the following command to start the webui and conduct model training and inference through the graphical interface:
+```shell
+swift web-ui
+```
+A screenshot example can be found at:
+![image](https://modelscope.oss-cn-beijing.aliyuncs.com/resource/swift_webui.jpg)
+
+## Fine-tuning
+
+```shell
+# Experimental environment: A10, 3090, V100, ...
+# GPU memory requirement: 20GB
+CUDA_VISIBLE_DEVICES=0 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --dataset blossom-math-zh \
+    --output_dir output
+
+# Use your own dataset
+CUDA_VISIBLE_DEVICES=0 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --custom_train_dataset_path chatml.jsonl \
+    --output_dir output
+
+# Using DDP (Distributed Data Parallel)
+# Experimental environment: 2 * 3090
+# GPU memory requirement: 2 * 23GB
+CUDA_VISIBLE_DEVICES=0,1 \
+NPROC_PER_NODE=2 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --dataset blossom-math-zh \
+    --output_dir output
+
+# Multi-machine multi-GPU setup
+# node0
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+NNODES=2 \
+NODE_RANK=0 \
+MASTER_ADDR=127.0.0.1 \
+NPROC_PER_NODE=4 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --dataset blossom-math-zh \
+    --output_dir output
+
+# node1 (MASTER_ADDR is the IP address of node0)
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+NNODES=2 \
+NODE_RANK=1 \
+MASTER_ADDR=xxx.xxx.xxx.xxx \
+NPROC_PER_NODE=4 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --dataset blossom-math-zh \
+    --output_dir output
+```
+For more fine-tuning methods, please refer to [here](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E5%BE%AE%E8%B0%83%E6%96%87%E6%A1%A3.md#%E5%BE%AE%E8%B0%83).
+
+
+
+Examples
+
+| Model Name         | Training Method |
+|:-------------------|:---------------------------------------------------------------------------------------------------------------------------|
+| qwen_14b           | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds)             |
+| qwen_14b           | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b/qlora)                         |
+| qwen_14b           | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds)           |
+| qwen_14b_chat      | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds)        |
+| qwen_14b_chat      | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/qlora)                    |
+| qwen_14b_chat      | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds)      |
+| qwen_14b_chat_int4 | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora)               |
+| qwen_14b_chat_int4 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds) |
+| qwen_14b_chat_int8 | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora)               |
+| qwen_14b_chat_int8 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds) |
+| qwen_1_8b_chat     | [full](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_1_8b_chat/full)                     |
+| qwen_1_8b_chat     | [full_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_1_8b_chat/full_ddp)             |
+| qwen_72b_chat      | [lora_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp)                |
+| qwen_72b_chat      | [lora_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp_ddp)        |
+| qwen_72b_chat      | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/qlora)                    |
+| qwen_72b_chat_int4 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat_int4/qlora_ddp_ds) |
+| qwen_72b_chat_int8 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat_int8/qlora_ddp_ds) |
+| qwen_7b            | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds)              |
+| qwen_7b            | [qlora_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp)                  |
+| qwen_7b_chat       | [full](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full)                       |
+| qwen_7b_chat       | [full_freeze_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_freeze_ddp) |
+| qwen_7b_chat       | [full_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp)                 |
+| qwen_7b_chat       | [full_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp)         |
+| qwen_7b_chat       | [lora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora)                       |
+| qwen_7b_chat       | [lora_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp)               |
+| qwen_7b_chat       | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds)         |
+| qwen_7b_chat       | [lora_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp)         |
+| qwen_7b_chat       | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora)                     |
+| qwen_7b_chat       | [qlora_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp)             |
+| qwen_7b_chat       | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds)       |
+| qwen_7b_chat_int4  | [qalora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qalora)              |
+| qwen_7b_chat_int4  | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora)                |
+| qwen_7b_chat_int4  | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds)  |
+| qwen_7b_chat_int8  | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora)                |
+| qwen_7b_chat_int8  | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds)  |
+| qwen_audio_chat    | [full_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/full_mp)              |
+| qwen_audio_chat    | [full_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/full_mp_ddp)      |
+| qwen_audio_chat    | [lora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/lora)                    |
+| qwen_audio_chat    | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/lora_ddp_ds)      |
+| qwen_vl            | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds)              |
+| qwen_vl_chat       | [full_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/full_mp)                 |
+| qwen_vl_chat       | [full_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/full_mp_ddp)         |
+| qwen_vl_chat       | [lora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/lora)                       |
+| qwen_vl_chat       | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds)         |
+| qwen_vl_chat       | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/qlora)                     |
+| qwen_vl_chat_int4  | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora)                |
+| qwen_vl_chat_int4  | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds)  |
+
+
+## Inference
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
+)
+from swift.utils import seed_everything
+
+model_type = ModelType.qwen_7b_chat
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')  # template_type: qwen
+
+
+kwargs = {}
+# kwargs['use_flash_attn'] = True  # Use flash_attn if desired
+
+model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}, **kwargs)
+# Modify max_new_tokens
+model.generation_config.max_new_tokens = 128
+
+template = get_template(template_type, tokenizer)
+seed_everything(42)
+query = 'What is the provincial capital of Zhejiang?'
+response, history = inference(model, template, query)
+print(f'query: {query}')
+print(f'response: {response}')
+
+query = 'What delicious food can be found here?'
+response, history = inference(model, template, query, history)
+print(f'query: {query}')
+print(f'response: {response}')
+print(f'history: {history}')
+
+"""Output[0]:
+query: What is the provincial capital of Zhejiang?
+response: The provincial capital of Zhejiang is Hangzhou.
+query: What delicious food can be found here?
+response: Hangzhou has many famous delicacies, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Spare Ribs, and Maodu. Additionally, there are unique Hangzhou-style pastries like Osmanthus Cake, Lotus Paste Pastry, and Aiwo Steamed Rice Cakes.
+history: [('What is the provincial capital of Zhejiang?', 'The provincial capital of Zhejiang is Hangzhou.'), ('What delicious food can be found here?', 'Hangzhou has many famous delicacies, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Spare Ribs, and Maodu. Additionally, there are unique Hangzhou-style pastries like Osmanthus Cake, Lotus Paste Pastry, and Aiwo Steamed Rice Cakes.')]
+"""
+
+# Streaming dialogue output with verbose mode
+inference(model, template, 'What was the first question?', history, verbose=True, stream=True)
+"""Output[1]:
+[PROMPT]
+You asked your first question, "What is the provincial capital of Zhejiang?"
+[OUTPUT] Your first question was “What is the provincial capital of Zhejiang?”
+"""
+```
+
+For more on inference usage, please refer to [here](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM_Inference_Guide.md).
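The multi-machine example in the Fine-tuning section of the README above sets the same five environment variables on every node, with only `NODE_RANK` and `MASTER_ADDR` differing. The sketch below generates those settings per node. The variable names and `swift sft` arguments are taken from that example; the helpers `node_env` and `launch_line` and the address `10.0.0.1` are hypothetical.

```python
import shlex


def node_env(node_rank, nnodes, gpus_per_node, master_addr):
    """Build the environment variables used to launch one node of a multi-node run."""
    return {
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in range(gpus_per_node)),
        "NNODES": str(nnodes),
        "NODE_RANK": str(node_rank),
        "MASTER_ADDR": master_addr,
        "NPROC_PER_NODE": str(gpus_per_node),
    }


def launch_line(env, args):
    """Render a one-line shell command of the form VAR=value ... swift sft <args>."""
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    return f"{prefix} swift sft {shlex.join(args)}"


if __name__ == "__main__":
    sft_args = [
        "--model_id_or_path", "qwen/Qwen-7B-Chat",
        "--dataset", "blossom-math-zh",
        "--output_dir", "output",
    ]
    # 10.0.0.1 is a placeholder for the IP address of node0.
    for rank in range(2):
        print(launch_line(node_env(rank, 2, 4, "10.0.0.1"), sft_args))
```

Generating the lines from one function keeps the per-node settings consistent, which is where hand-edited multi-node launch commands tend to diverge.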
--- a/recipes/finetune/swift/README_CN.md
+++ b/recipes/finetune/swift/README_CN.md
@@ -0,0 +1,203 @@
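+## Introduction
+[SWIFT](https://github.com/modelscope/swift) (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible, lightweight, one-stop deep learning framework for training and inference. It integrates a variety of efficient fine-tuning methods, such as LoRA, QLoRA, and ResTuning-Bypass (developed in-house by Alibaba Cloud), together with out-of-the-box training and inference scripts, so that developers can fine-tune and run inference on LLM and AIGC models on a single commercial-grade GPU. In addition, SWIFT is fully compatible with PEFT, so developers can use PEFT's capabilities within the ModelScope model ecosystem.
+
+## Installation
+```shell
+# Set the global pip mirror
+pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
+# Install ms-swift
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e .[llm]
+
+# If you want to use deepspeed
+pip install deepspeed -U
+
+# If you want to use qlora training based on auto_gptq (recommended, performs better than bnb)
+# Models supporting auto_gptq: `https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md#模型`
+# auto_gptq and CUDA versions must correspond; select the version according to `https://github.com/PanQiWei/AutoGPTQ#quick-installation`
+pip install auto_gptq -U
+
+# If you want to use qlora training based on bnb
+pip install bitsandbytes -U
+
+# Environment alignment (if you encounter errors, run the commands below; the repository is tested with the latest environment)
+pip install -r requirements/framework.txt  -U
+pip install -r requirements/llm.txt  -U
+```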
+
+
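+## WebUI Usage
+
+Run the following command to start the webui and perform model training and inference through the graphical interface:
+```shell
+swift web-ui
+```
+An example of the interface is shown below: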
+![image](https://modelscope.oss-cn-beijing.aliyuncs.com/resource/swift_webui.jpg)
+
+## Fine-tuning
+```shell
+# Experimental environment: A10, 3090, V100, ...
+# 20GB GPU memory
+CUDA_VISIBLE_DEVICES=0 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --dataset blossom-math-zh \
+    --output_dir output
+
+# Use your own dataset
+CUDA_VISIBLE_DEVICES=0 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --custom_train_dataset_path chatml.jsonl \
+    --output_dir output
+
+# Use DDP
+# Experimental environment: 2 * 3090
+# 2 * 23GB GPU memory
+CUDA_VISIBLE_DEVICES=0,1 \
+NPROC_PER_NODE=2 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --dataset blossom-math-zh \
+    --output_dir output
+
+# Multi-machine multi-GPU
+# node0
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+NNODES=2 \
+NODE_RANK=0 \
+MASTER_ADDR=127.0.0.1 \
+NPROC_PER_NODE=4 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --dataset blossom-math-zh \
+    --output_dir output
+
+# node1 (MASTER_ADDR is the IP address of node0)
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+NNODES=2 \
+NODE_RANK=1 \
+MASTER_ADDR=xxx.xxx.xxx.xxx \
+NPROC_PER_NODE=4 \
+swift sft \
+    --model_id_or_path qwen/Qwen-7B-Chat \
+    --dataset blossom-math-zh \
+    --output_dir output
+```
+For more fine-tuning methods, please refer to [here](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E5%BE%AE%E8%B0%83%E6%96%87%E6%A1%A3.md#%E5%BE%AE%E8%B0%83).
+
+Existing fine-tuning script examples:
+
+| Model Name         | Training Method |
+|:-------------------|:---------------------------------------------------------------------------------------------------------------------------|
+| qwen_14b           | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds)             |
+| qwen_14b           | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b/qlora)                         |
+| qwen_14b           | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds)           |
+| qwen_14b_chat      | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds)        |
+| qwen_14b_chat      | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/qlora)                    |
+| qwen_14b_chat      | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds)      |
+| qwen_14b_chat_int4 | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora)               |
+| qwen_14b_chat_int4 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds) |
+| qwen_14b_chat_int8 | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora)               |
+| qwen_14b_chat_int8 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds) |
+| qwen_1_8b_chat     | [full](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_1_8b_chat/full)                     |
+| qwen_1_8b_chat     | [full_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_1_8b_chat/full_ddp)             |
+| qwen_72b_chat      | [lora_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp)                |
+| qwen_72b_chat      | [lora_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp_ddp)        |
+| qwen_72b_chat      | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/qlora)                    |
+| qwen_72b_chat_int4 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat_int4/qlora_ddp_ds) |
+| qwen_72b_chat_int8 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat_int8/qlora_ddp_ds) |
+| qwen_7b            | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds)              |
+| qwen_7b            | [qlora_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp)                  |
+| qwen_7b_chat       | [full](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full)                       |
+| qwen_7b_chat       | [full_freeze_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_freeze_ddp) |
+| qwen_7b_chat       | [full_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp)                 |
+| qwen_7b_chat       | [full_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp)         |
+| qwen_7b_chat       | [lora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora)                       |
+| qwen_7b_chat       | [lora_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp)               |
+| qwen_7b_chat       | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds)         |
+| qwen_7b_chat       | [lora_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp)         |
+| qwen_7b_chat       | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora)                     |
+| qwen_7b_chat       | [qlora_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp)             |
+| qwen_7b_chat       | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds)       |
+| qwen_7b_chat_int4  | [qalora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qalora)              |
+| qwen_7b_chat_int4  | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora)                |
+| qwen_7b_chat_int4  | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds)  |
+| qwen_7b_chat_int8  | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora)                |
+| qwen_7b_chat_int8  | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds)  |
+| qwen_audio_chat    | [full_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/full_mp)              |
+| qwen_audio_chat    | [full_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/full_mp_ddp)      |
+| qwen_audio_chat    | [lora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/lora)                    |
+| qwen_audio_chat    | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/lora_ddp_ds)      |
+| qwen_vl            | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds)              |
+| qwen_vl_chat       | [full_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/full_mp)                 |
+| qwen_vl_chat       | [full_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/full_mp_ddp)         |
+| qwen_vl_chat       | [lora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/lora)                       |
+| qwen_vl_chat       | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds)         |
+| qwen_vl_chat       | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/qlora)                     |
+| qwen_vl_chat_int4  | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora)                |
+| qwen_vl_chat_int4  | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds)  |
+
+## Inference
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
+)
+from swift.utils import seed_everything
+
+model_type = ModelType.qwen_7b_chat
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')  # template_type: qwen
+
+
+kwargs = {}
+# kwargs['use_flash_attn'] = True  # uncomment to enable flash_attn
+
+model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}, **kwargs)
+# Override max_new_tokens for generation
+model.generation_config.max_new_tokens = 128
+
+template = get_template(template_type, tokenizer)
+seed_everything(42)
+query = '浙江的省会在哪里？'  # "Where is the capital of Zhejiang?"
+response, history = inference(model, template, query)
+print(f'query: {query}')
+print(f'response: {response}')
+query = '这有什么好吃的？'  # "What is good to eat there?"
+response, history = inference(model, template, query, history)
+print(f'query: {query}')
+print(f'response: {response}')
+print(f'history: {history}')
+
+"""Out[0]
+query: 浙江的省会在哪里？
+response: 浙江省的省会是杭州。
+query: 这有什么好吃的？
+response: 杭州市有很多著名的美食，例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外，还有杭州特色的点心，如桂花糕、荷花酥、艾窝窝等。
+history: [('浙江的省会在哪里？', '浙江省的省会是杭州。'), ('这有什么好吃的？', '杭州市有很多著名的美食，例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外，还有杭州特色的点心，如桂花糕、荷花酥、艾窝窝等。')]
+"""
+
+# Stream the output and print the full dialogue template (query: "What was the first question?")
+inference(model, template, '第一个问题是什么', history, verbose=True, stream=True)
+"""Out[1]
+[PROMPT]<|im_start|>system
+You are a helpful assistant.<|im_end|>
+<|im_start|>user
+浙江的省会在哪里？<|im_end|>
+<|im_start|>assistant
+浙江省的省会是杭州。<|im_end|>
+<|im_start|>user
+这有什么好吃的？<|im_end|>
+<|im_start|>assistant
+杭州市有很多著名的美食，例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外，还有杭州特色的点心，如桂花糕、荷花酥、艾窝窝等。<|im_end|>
+<|im_start|>user
+第一个问题是什么<|im_end|>
+<|im_start|>assistant
+[OUTPUT]你的第一个问题是“浙江的省会在哪里？”<|im_end|>
+"""
+```
+For more on inference usage, see the [LLM inference documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E6%8E%A8%E7%90%86%E6%96%87%E6%A1%A3.md).