mirror of https://github.com/QwenLM/Qwen.git, synced 2026-05-20 16:35:47 +08:00
release the evaluation benchmark for tool use; update tool use results to that of the hf version
@@ -49,9 +49,9 @@ evaluate_functional_correctness HumanEval_res.jsonl
python evaluate_chat_mmlu.py -f HumanEval.jsonl -o HumanEval_res_chat.jsonl
evaluate_functional_correctness HumanEval_res_chat.jsonl
```

When installing the human-eval package, please note the following disclaimer:

This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.

- GSM8K
@@ -64,3 +64,20 @@ python evaluate_gsm8k.py
python evaluate_chat_gsm8k.py # zeroshot
python evaluate_chat_gsm8k.py --use-fewshot # fewshot
```

- PLUGIN

This script reproduces the ReAct and Hugging Face Agent results reported in the Tool Usage section of the README.

```Shell
# Qwen-7B-Chat
mkdir data;
cd data;
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_positive.jsonl;
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_negative.jsonl;
cd ..;
pip install json5;
pip install jsonlines;
pip install rouge_score;
python evaluate_plugin.py --eval-react-positive --eval-react-negative --eval-hfagent
```
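
Before running the evaluation, it may help to confirm that both benchmark files downloaded completely. The following is a minimal sketch, not part of the repository: the helper name `load_jsonl` is hypothetical, and it assumes only that each file is in JSON Lines format (one JSON value per line) and sits in the `data` directory created above. It makes no assumption about the fields inside each record.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file and return its records as a list.

    Hypothetical helper for illustration; not part of evaluate_plugin.py.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # A truncated download typically fails here on the last line.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err})") from err
    return records


if __name__ == "__main__":
    data_dir = Path("data")  # directory created by the shell steps above
    for name in (
        "exam_plugin_v1_react_positive.jsonl",
        "exam_plugin_v1_react_negative.jsonl",
    ):
        path = data_dir / name
        if path.exists():
            print(f"{name}: {len(load_jsonl(path))} records")
        else:
            print(f"{name}: missing, rerun the wget step")
```

Only the standard library is used here so the check can run before the `pip install` steps. The `jsonlines` package installed above could presumably serve the same purpose.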