Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation
Tiantian Mi, Dongming Shan, Zhen Huang, Yiwei Qin, Muhang Xie, Yuxuan Qiao, Yixiu Liu, Chenyang Zhou, Pengfei Liu

TL;DR
DataEvolve is an automated framework that evolves data curation strategies through iterative optimization, significantly improving pretraining data quality and model performance across diverse categories without manual intervention.
Contribution
The paper introduces an automated evolutionary approach to designing data curation strategies, enabling scalable and effective data processing at pretraining scale.
Findings
Evolved strategies outperform manual ones by 2.93 points.
Models trained on Darwin-CC outperform baselines and previous datasets.
Evolved strategies focus on noise removal and format normalization.
Abstract
Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge…
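The closed loop described in the abstract (identify issues, generate candidates, execute on a sample, evaluate, refine) can be illustrated with a toy sketch. This is not DataEvolve's implementation: the issue detectors, the cleaning functions, the quality score, and every name below (DETECT, FIX, evolve, SAMPLE) are hypothetical stand-ins chosen for illustration, and the real system presumably relies on far richer analysis than these string heuristics.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

NOISE = ("click here", "subscribe", "cookie")  # assumed noise markers (toy)

def collapse_spaces(doc: str) -> str:
    # Normalize runs of whitespace inside each line.
    return "\n".join(" ".join(line.split()) for line in doc.splitlines())

def drop_noise_lines(doc: str) -> str:
    return "\n".join(l for l in doc.splitlines()
                     if not any(m in l.lower() for m in NOISE))

def drop_empty_lines(doc: str) -> str:
    return "\n".join(l for l in doc.splitlines() if l.strip())

# Hypothetical issue catalogue: a per-line detector and a per-document fix.
DETECT: Dict[str, Callable[[str], bool]] = {
    "extra_spaces": lambda l: "  " in l,
    "noise_lines": lambda l: any(m in l.lower() for m in NOISE),
    "empty_lines": lambda l: not l.strip(),
}
FIX: Dict[str, Callable[[str], str]] = {
    "extra_spaces": collapse_spaces,
    "noise_lines": drop_noise_lines,
    "empty_lines": drop_empty_lines,
}

Strategy = Tuple[str, ...]  # ordered issue names whose fixes are applied

def apply_strategy(strategy: Strategy, doc: str) -> str:
    for issue in strategy:
        doc = FIX[issue](doc)
    return doc

def quality(doc: str) -> float:
    # Toy score: fraction of lines that trigger no detector.
    lines = doc.splitlines()
    if not lines:
        return 0.0
    clean = sum(1 for l in lines if not any(d(l) for d in DETECT.values()))
    return clean / len(lines)

def score(strategy: Strategy, sample: List[str]) -> float:
    return mean(quality(apply_strategy(strategy, d)) for d in sample)

def identify_issues(docs: List[str]) -> List[str]:
    return sorted(name for name, hit in DETECT.items()
                  if any(hit(l) for d in docs for l in d.splitlines()))

def evolve(sample: List[str], generations: int = 5):
    best: Strategy = ()                      # empty strategy = raw data
    best_score = score(best, sample)
    history = [best_score]                   # best score per generation
    for _ in range(generations):
        parent = best
        processed = [apply_strategy(parent, d) for d in sample]
        issues = identify_issues(processed)  # 1. identify remaining issues
        if not issues:
            break
        for issue in issues:                 # 2. generate candidates
            candidate = parent + (issue,)
            s = score(candidate, sample)     # 3-4. execute and evaluate
            if s > best_score:               # 5. keep only improvements
                best, best_score = candidate, s
        history.append(best_score)
    return best, best_score, history

SAMPLE = [
    "Title  here\n\nClick here to subscribe\nReal content.",
    "Intro text\n   \nSUBSCRIBE now\nMore   text",
]

if __name__ == "__main__":
    best, best_score, history = evolve(SAMPLE)
    print(best, best_score, history)
```

Each generation extends the current best strategy by one fix and keeps the candidate only if the sampled score strictly improves, so the score history never decreases. The greedy, one-step-at-a-time search is a simplification made for brevity; it says nothing about how DataEvolve actually proposes or selects strategies.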
Taxonomy
Topics: Digital Humanities and Scholarship · Scientific Computing and Data Management · Research Data Management Practices
