Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training
Yiwei Qin, Zhen Huang, Tiantian Mi, Weiye Si, Chenyang Zhou, Qipeng Guo, Siyuan Feng, Pengfei Liu

TL;DR
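This paper introduces Data Darwinism, a ten-level taxonomy (L0-L9) for data-model co-evolution, and shows that higher-level data processing makes scientific text more useful for pre-training: after 600B tokens of continued pre-training, the resulting Darwin-Science corpus beats contamination-free baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks.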
Contribution
It proposes a ten-level taxonomy (L0-L9) for data quality and processing, and validates it by constructing the 900B-token Darwin-Science corpus and using it for continued pre-training, where it outperforms contamination-free baselines.
Findings
Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks after 600B tokens of continued pre-training.
Higher-level data processing (L5) yields +1.36 total gain.
Systematic data refinement unlocks latent data value.
Abstract
Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Language and cultural evolution
