From a01d076bf4b717b87baf07d62ff959903958937e Mon Sep 17 00:00:00 2001
From: Zheng Yuan
Date: Fri, 4 Aug 2023 16:40:21 +0800
Subject: [PATCH] Update tech_memo.md

---
 tech_memo.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tech_memo.md b/tech_memo.md
index 32eb2bf..b1c8912 100644
--- a/tech_memo.md
+++ b/tech_memo.md
@@ -36,6 +36,7 @@ It is pretrained on over 2.2 trillion tokens with 2048 context length from publi
 
 
 **Pretraining data**: Our training data includes a mix of data from publicly available sources, consisting mainly of web documents and code files.
+For math reasoning, we include RFT data from [gsm8k-ScRel](https://github.com/OFA-Sys/gsm8k-ScRel).
 Besides, the data are multilingual, with most of them in English and Chinese.
 We made an effort and employed an ensemble of models to exclude data of low quality or deemed unfit for pretraining, such as NSFW content.
 The final data underwent global fuzzy deduplication.