add 72B and 1.8B Qwen models, add Ascend 910 and Hygon DCU support, add docker support

2026-05-20 16:35:47 +08:00 · 2023-11-30 15:29:13 +08:00
parent 981c89b2a9
commit e8e15962d8
52 changed files with 6139 additions and 1435 deletions
--- a/examples/system_prompt.md
+++ b/examples/system_prompt.md
@@ -0,0 +1,92 @@
+# 系统指令 (System Prompts)
+
+## 什么是系统指令? (What is the System Prompts?)
+
+系统指令设定了AI助手的行为模式，例如人物设定、语言风格、任务模式、甚至针对具体问题的具体行为。
+
+System Propmts set the behavior mode of the AI assistant, such as character settings, language styles, task modes, and even specific behaviors for specific tasks.
+
+系统指令可以是一个广泛的人物设定，如“You are a helpful assistant”；也可以是一个十分详细的要求，如“拒绝回答所有代码相关的问题”。
+
+The System Prompts can be a broad character setting, such as "You are a helpful assistant"; or it can be a very detailed request, such as "Refuse to answer all code-related questions."
+
+系统指令为用户提供了一个易组织、上下文稳定的控制AI助手行为的方式，可以从多种角度定制属于你自己的AI助手。
+
+System Prompts provide users with an easy-to-organize, context-stable way to control the behavior of the AI assistant. You can customize your own AI assistant from multiple perspectives.
+
+系统指令需要在多轮对话中稳定，例如角色扮演类系统指令被设定后AI助手不应该在多轮对话中跳脱自身的设定。
+
+System Prompts need to be stable across multiple rounds of dialogue. For example, after a role-playing system prompt is set, the AI assistant should not escape its own settings in multiple rounds of dialogue.
+
+同时，模型也需要具有基于系统指令中对自身行为进行推理的能力。这两者都是为模型赋予跟随系统指令能力时需要克服的难点。
+
+At the same time, the model also needs to have the ability to reason about its own behavior based on system prompts. Both of these are difficulties that need to be overcome when giving the model the ability to follow system prompts.
+
+Qwen-1.8B-Chat 和 Qwen-72B-Chat在多样且存在多轮复杂交互的系统指令上进行了充分训练，使模型可以跟随多样的系统指令，实现上下文(in-context)中的模型定制化，进一步提升了通义千问的可扩展性。
+
+Qwen-1.8-Chat and Qwen-72B-Chat have been fully trained on diverse system prompts with multiple rounds of complex interactions, so that they can follow a variety of system prompts and realize model customization in context, further improving the scalability of Qwen-chat.
+
+## 系统指令能做什么？ (What can System Prompts do?)
+
+### 角色扮演 Role Play
+
+在系统指令中告诉千问你需要它扮演的角色，即可沉浸式和该角色对话交流
+
+Tell Qwen-Chat the role you want it to play in the System Prompt, and you can have an immersive conversation with that role.
+
+
+![](../assets/system_prompt_role_play.png)
+
+![](../assets/system_prompt_role_play_en.png)
+
+### 语言风格 Language Style
+
+
+简单调整千问的语言风格
+
+Simple adjustment of the Qwen-Chat's language style
+
+![](../assets/system_prompt_language_style.png)
+
+![](../assets/system_prompt_language_style_en.png)
+
+### 任务设定 Task Setting
+
+指定具体任务，打造处理专项任务的千问模型
+
+Setting specific tasks and creating a Qwen-Chat model to handle special tasks
+
+![](../assets/system_prompt_task_setting.png)
+
+![](../assets/system_prompt_task_setting_en.png)
+
+### 行为设定 Behavior Setting
+
+设定千问对具体任务的行为模式
+
+Set behavior patterns of Qwen-Chat for specific tasks
+
+![](../assets/system_prompt_behavior_setting.png)
+
+![](../assets/system_prompt_behavior_setting_en.png)
+
+## 代码示例 Example
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers.generation import GenerationConfig
+
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)
+
+# Only Qwen-72B-Chat and Qwen-1_8B-Chat has system prompt enhancement now.
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", device_map="auto", trust_remote_code=True).eval()
+# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True).eval()
+
+response, _ = model.chat(tokenizer, "你好呀", history=None, system="请用二次元可爱语气和我说话")
+print(response)
+# 你好啊！我是一只可爱的二次元猫咪哦，不知道你有什么问题需要我帮忙解答吗？
+
+response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs")
+print(response)
+# Your colleague is an outstanding worker! Their dedication and hard work are truly inspiring. They always go above and beyond to ensure that their tasks are completed on time and to the highest standard. I am lucky to have them as a colleague, and I know I can count on them to handle any challenge that comes their way.
+```
--- a/examples/vllm_wrapper.py
+++ b/examples/vllm_wrapper.py
@@ -0,0 +1,239 @@
+from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList
+from typing import Optional, Callable, List, Tuple, Union
+import copy
+import torch
+from transformers import AutoTokenizer
+from transformers.generation.logits_process import LogitsProcessorList
+from packaging import version
+
+_ERROR_BAD_CHAT_FORMAT = """\
+We detect you are probably using the pretrained model (rather than chat model) for chatting, since the chat_format in generation_config is not "chatml".
+If you are directly using the model downloaded from Huggingface, please make sure you are using our "Qwen/Qwen-7B-Chat" Huggingface model (rather than "Qwen/Qwen-7B") when you call model.chat().
+我们检测到您可能在使用预训练模型（而非chat模型）进行多轮chat，因为您当前在generation_config指定的chat_format，并未设置为我们在对话中所支持的"chatml"格式。
+如果您在直接使用我们从Huggingface提供的模型，请确保您在调用model.chat()时，使用的是"Qwen/Qwen-7B-Chat"模型（而非"Qwen/Qwen-7B"预训练模型）。
+"""
+
+IMEND = "<|im_end|>"
+ENDOFTEXT = "<|endoftext|>"
+
+HistoryType = List[Tuple[str, str]]
+TokensType = List[int]
+BatchTokensType = List[List[int]]
+
+def get_stop_words_ids(chat_format, tokenizer):
+    if chat_format == "raw":
+        stop_words_ids = [tokenizer.encode("Human:"), [tokenizer.eod_id]]
+    elif chat_format == "chatml":
+        stop_words_ids = [[tokenizer.im_end_id], [tokenizer.im_start_id]]
+    else:
+        raise NotImplementedError(f"Unknown chat format {chat_format!r}")
+    return stop_words_ids
+
+def make_context(
+    tokenizer: PreTrainedTokenizer,
+    query: str,
+    history: List[Tuple[str, str]] = None,
+    system: str = "",
+    max_window_size: int = 6144,
+    chat_format: str = "chatml",
+):
+    if history is None:
+        history = []
+
+    if chat_format == "chatml":
+        im_start, im_end = "<|im_start|>", "<|im_end|>"
+        im_start_tokens = [tokenizer.im_start_id]
+        im_end_tokens = [tokenizer.im_end_id]
+        nl_tokens = tokenizer.encode("\n")
+
+        def _tokenize_str(role, content):
+            return f"{role}\n{content}", tokenizer.encode(
+                role, allowed_special=set()
+            ) + nl_tokens + tokenizer.encode(content, allowed_special=set())
+
+        system_text, system_tokens_part = _tokenize_str("system", system)
+        system_tokens = im_start_tokens + system_tokens_part + im_end_tokens
+
+        raw_text = ""
+        context_tokens = []
+
+        for turn_query, turn_response in reversed(history):
+            query_text, query_tokens_part = _tokenize_str("user", turn_query)
+            query_tokens = im_start_tokens + query_tokens_part + im_end_tokens
+            response_text, response_tokens_part = _tokenize_str(
+                "assistant", turn_response
+            )
+            response_tokens = im_start_tokens + response_tokens_part + im_end_tokens
+
+            next_context_tokens = nl_tokens + query_tokens + nl_tokens + response_tokens
+            prev_chat = (
+                f"\n{im_start}{query_text}{im_end}\n{im_start}{response_text}{im_end}"
+            )
+
+            current_context_size = (
+                len(system_tokens) + len(next_context_tokens) + len(context_tokens)
+            )
+            if current_context_size < max_window_size:
+                context_tokens = next_context_tokens + context_tokens
+                raw_text = prev_chat + raw_text
+            else:
+                break
+
+        context_tokens = system_tokens + context_tokens
+        raw_text = f"{im_start}{system_text}{im_end}" + raw_text
+        context_tokens += (
+            nl_tokens
+            + im_start_tokens
+            + _tokenize_str("user", query)[1]
+            + im_end_tokens
+            + nl_tokens
+            + im_start_tokens
+            + tokenizer.encode("assistant")
+            + nl_tokens
+        )
+        raw_text += f"\n{im_start}user\n{query}{im_end}\n{im_start}assistant\n"
+
+    elif chat_format == "raw":
+        raw_text = query
+        context_tokens = tokenizer.encode(raw_text)
+    else:
+        raise NotImplementedError(f"Unknown chat format {chat_format!r}")
+
+    return raw_text, context_tokens
+
+class vLLMWrapper:
+    def __init__(self,
+               model_dir: str,
+               trust_remote_code: bool = True,
+               tensor_parallel_size: int = 1,
+               gpu_memory_utilization: float = 0.98,
+               dtype: str = "bfloat16",
+               **kwargs):
+
+        if dtype not in ("bfloat16", "float16", "float32"):
+            print("now not support {}!".format(dtype))
+            raise Exception
+
+        # build generation_config
+        self.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=trust_remote_code)
+
+        # build tokenizer
+        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
+        self.tokenizer.eos_token_id = self.generation_config.eos_token_id
+
+        self.stop_words_ids = []
+
+        from vllm import LLM
+        import vllm
+        if version.parse(vllm.__version__) >= version.parse("0.2.2"):
+            self.__vllm_support_repetition_penalty = True
+        else:
+            self.__vllm_support_repetition_penalty = False
+
+        quantization = getattr(kwargs, 'quantization', None)
+
+        self.model = LLM(model=model_dir,
+                            tokenizer=model_dir,
+                            tensor_parallel_size=tensor_parallel_size,
+                            trust_remote_code=trust_remote_code,
+                            quantization=quantization,
+                            gpu_memory_utilization=gpu_memory_utilization,
+                            dtype=dtype)
+
+        for stop_id in get_stop_words_ids(self.generation_config.chat_format, self.tokenizer):
+            self.stop_words_ids.extend(stop_id)
+        self.stop_words_ids.extend([self.generation_config.eos_token_id])
+
+    def chat(self,
+        query: str,
+        history: Optional[HistoryType],
+        tokenizer: PreTrainedTokenizer = None,
+        system: str = "You are a helpful assistant.",
+        generation_config: Optional[GenerationConfig] = None,
+        **kwargs):
+        generation_config = generation_config if generation_config is not None else self.generation_config
+        tokenizer = self.tokenizer if tokenizer is None else tokenizer
+
+        assert generation_config.chat_format == 'chatml', _ERROR_BAD_CHAT_FORMAT
+        if not self.__vllm_support_repetition_penalty and generation_config.repetition_penalty != 1:
+            raise RuntimeError("The installed vLLM doesn't support repetition_penalty, please set ``model.generation_config.repetition_penalty = 1`` or install vllm>=0.2.2")
+
+        if history is None:
+            history = []
+        else:
+            # make a copy of the user's input such that is is left untouched
+            history = copy.deepcopy(history)
+
+        extra_stop_words_ids = kwargs.get('stop_words_ids', None)
+        if extra_stop_words_ids is None:
+            extra_stop_words_ids = []
+
+        max_window_size = kwargs.get('max_window_size', None)
+        if max_window_size is None:
+            max_window_size = generation_config.max_window_size
+
+        from vllm.sampling_params import SamplingParams
+        sampling_kwargs = {
+            "stop_token_ids": self.stop_words_ids,
+            "early_stopping": False,
+            "top_p": generation_config.top_p,
+            "top_k": -1 if generation_config.top_k == 0 else generation_config.top_k,
+            "temperature": generation_config.temperature,
+            "max_tokens": generation_config.max_new_tokens,
+            "repetition_penalty": generation_config.repetition_penalty
+        }
+        if not self.__vllm_support_repetition_penalty:
+            sampling_kwargs.pop("repetition_penalty")
+        sampling_params = SamplingParams(**sampling_kwargs)
+
+        raw_text, context_tokens = make_context(
+            self.tokenizer,
+            query,
+            history=history,
+            system=system,
+            max_window_size=max_window_size,
+            chat_format=generation_config.chat_format,
+        )
+
+        req_outputs = self.model.generate([query],
+                                            sampling_params=sampling_params,
+                                            prompt_token_ids=[context_tokens])
+        req_output = req_outputs[0]
+
+        prompt_str = req_output.prompt
+        prompt_ids = req_output.prompt_token_ids
+        req_sample_output_ids = []
+        req_sample_output_strs = []
+        for sample in req_output.outputs:
+            output_str = sample.text
+            output_ids = sample.token_ids
+            if IMEND in output_str:
+                output_str = output_str[:-len(IMEND)]
+            if ENDOFTEXT in output_str:
+                output_str = output_str[:-len(ENDOFTEXT)]
+            req_sample_output_ids.append(prompt_ids + output_ids)
+            req_sample_output_strs.append(prompt_str + output_str)
+        assert len(req_sample_output_strs) == 1
+        response = req_sample_output_strs[0][len(prompt_str):]
+        history.append((prompt_str, response))
+
+        return response, history
+
+if __name__ == '__main__':
+
+    model_dir = 'Qwen/Qwen-72B-Chat'
+    tensor_parallel_size = 2
+
+    model = vLLMWrapper(model_dir,
+                        tensor_parallel_size=tensor_parallel_size,
+                        )
+
+    response, history = model.chat(query="你好",
+                                   history=None)
+    print(response)
+    response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。",
+                                   history=history)
+    print(response)
+    response, history = model.chat(query="给这个故事起一个标题",
+                                   history=history)
+    print(response)