release the evaluation benchmark for tool use; update tool use results to that of the hf version

2026-05-20 08:25:47 +08:00 · 2023-08-08 17:45:41 +08:00
parent fa33db2a26
commit 9139fbdf99
9 changed files with 339 additions and 16 deletions
--- a/tech_memo.md
+++ b/tech_memo.md
@@ -311,13 +311,13 @@ LLMs have shown capability in coordinating multiple external systems to achieve
 Qwen supports calling plugins/tools/APIs through [ReAct Prompting](https://arxiv.org/abs/2210.03629).
 ReAct is also one of the main approaches used by the [LangChain](https://python.langchain.com/) framework.
 For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md).
-In the soon-to-be-released evaluation benchmark for assessing tool usage capabilities, Qwen's performance is as follows:
+In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, Qwen's performance is as follows:

 | Model       | Tool Selection (Acc.↑)      | Tool Input (Rouge-L↑)      | False Positive Error↓      |
 | :---------- | --------------------------: | -------------------------: | -------------------------: |
 | GPT-4       |                         95% |                   **0.90** |                      15.0% |
 | GPT-3.5     |                         85% |                       0.88 |                      75.0% |
-| **Qwen-7B** |                     **99%** |                       0.89 |                   **8.5%** |
+| **Qwen-7B** |                     **99%** |                       0.89 |                   **9.7%** |

 > The plugins that appear in the evaluation set do not appear in the training set of Qwen.
 > This benchmark evaluates the accuracy of the model in selecting the correct plugin from multiple candidate plugins, the rationality of the parameters passed into the plugin, and the false positive rate.
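
Reviewer note: the three columns in the table above can be illustrated with a minimal Python sketch. It is not the benchmark's actual scoring code, which lives under `eval/EVALUATION.md` and may differ. Three things here are assumptions: Rouge-L is computed as an LCS-based F1 over whitespace tokens, a "false positive" is a plugin call on a query that needs no plugin, and all function names are hypothetical.

```python
from typing import List, Optional, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Rouge-L as an F1 score over whitespace tokens (assumed variant)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def tool_selection_accuracy(expected: List[str], predicted: List[str]) -> float:
    """Fraction of queries where the predicted plugin name matches the expected one."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        return 0.0
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


def false_positive_rate(predicted_on_no_tool_queries: List[Optional[str]]) -> float:
    """Fraction of no-plugin queries on which a plugin was called anyway.

    Each entry is the plugin name the model chose, or None if it answered
    directly. This definition of "false positive" is an assumption.
    """
    if not predicted_on_no_tool_queries:
        return 0.0
    calls = sum(p is not None for p in predicted_on_no_tool_queries)
    return calls / len(predicted_on_no_tool_queries)
```

Under these assumptions, Tool Selection and Tool Input are scored on queries that do require a plugin, while the False Positive column is scored on a separate set of queries that should be answered without one.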