mirror of
https://github.com/QwenLM/Qwen.git
synced 2026-05-20 16:35:47 +08:00
Update tech_memo.md
@@ -36,6 +36,7 @@ It is pretrained on over 2.2 trillion tokens with 2048 context length from publi
**Pretraining data**:
Our training data includes a mix of data from publicly available sources, consisting mainly of web documents and code files.
For math reasoning, we include RFT data from [gsm8k-ScRel](https://github.com/OFA-Sys/gsm8k-ScRel).
The data are also multilingual, with most of it in English and Chinese.
We employed an ensemble of models to exclude data that is of low quality or otherwise deemed unfit for pretraining, such as NSFW content.
The final data underwent global fuzzy deduplication.
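
The memo does not say which fuzzy deduplication method was used. As a rough illustration of the general idea only, the sketch below drops any document whose word-trigram Jaccard similarity to an already kept document reaches a threshold. The function names, the trigram size, and the `0.7` threshold are all assumptions made for this example; a corpus-scale pipeline would more plausibly rely on an approximate scheme such as MinHash with locality-sensitive hashing rather than this quadratic comparison.

```python
def shingles(text, n=3):
    """Return the set of lowercased word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < n:
        # Short documents are represented by a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedup(docs, threshold=0.7, n=3):
    """Keep a document only if it is not a near-duplicate of one kept earlier.

    `threshold` and `n` are illustrative values, not taken from the memo.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog today",
    "an entirely different sentence about training data",
]
unique_docs = fuzzy_dedup(docs)
```

Here the second document shares seven of its eight trigrams with the first, so it is treated as a near-duplicate and dropped, while the third shares none and is kept.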