MathPile: A Billion-Token-Scale Pretraining Corpus for Math
Zengzhi Wang, Xuefeng Li, Rui Xia, Pengfei Liu

TL;DR
MathPile is a large, high-quality math-focused pretraining corpus with 9.5 billion tokens, designed to enhance mathematical reasoning in language models through meticulous data curation and contamination detection.
Contribution
We introduce MathPile, a novel high-quality, large-scale math corpus with rigorous preprocessing, contamination detection, and open-source resources to improve mathematical reasoning in models.
Findings
Continual pre-training on MathPile improved performance on common mathematical reasoning benchmarks.
High data quality through extensive preprocessing.
Open-source corpus and scripts for community use.
Abstract
High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates and conducted continual pre-training experiments, boosting the performance on common mathematical reasoning benchmarks. We aim for our MathPile to boost language models' mathematical reasoning…
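The abstract names deduplication and benchmark contamination detection as pipeline stages without giving details here. As a rough illustration only, the sketch below shows one common way such stages can be approximated: exact deduplication by hashing normalized text, and contamination flagging by shared word n-grams. The function names, the normalization rule, and the default n-gram length are all assumptions made for this example; they are not taken from the MathPile scripts.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(doc: str, benchmark_items: list[str], n: int = 8) -> bool:
    """Flag a document that shares at least one word n-gram with any benchmark item.

    The default n is an illustrative assumption, not a value from the paper.
    """
    doc_grams = word_ngrams(doc, n)
    return any(doc_grams & word_ngrams(item, n) for item in benchmark_items)
```

In a sketch like this, documents flagged by `is_contaminated` would be dropped before training. Exact hashing only catches identical text after normalization, so a real pipeline would likely add near-duplicate detection on top.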
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Mathematics, Computing, and Information Processing
