Update tech_memo.md

Yang An
2023-08-04 22:06:32 +08:00
committed by GitHub
parent d0e7159835
commit 957578bd12

@@ -36,9 +36,9 @@ It is pretrained on over 2.2 trillion tokens with 2048 context length from publi
 **Pretraining data**:
 Our training data includes a mix of data from publicly available sources, consisting mainly of web documents and code files.
-For math reasoning, we include RFT data from [gsm8k-ScRel](https://github.com/OFA-Sys/gsm8k-ScRel).
 Besides, the data are multilingual, with most of them in English and Chinese.
 We made an effort and employed an ensemble of models to exclude data of low quality or deemed unfit for pretraining, such as NSFW content.
+For math reasoning, we include RFT data from [gsm8k-ScRel](https://github.com/OFA-Sys/gsm8k-ScRel).
 The final data underwent global fuzzy deduplication.
 The mix of pretraining corpora has been optimized through numerous ablation experiments.