Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu

TL;DR
ProX enables small language models to refine training data at scale by treating data cleaning as a programming task, leading to improved model performance and efficiency across diverse benchmarks and domains.
Contribution
The paper introduces ProX, a novel framework that allows models to generate and execute data refinement operations, outperforming traditional human-crafted rules and filtering methods.
Findings
Models trained on ProX-curated data outperform models trained on the original or filtered data by over 2%.
ProX improves domain-specific pre-training accuracy by up to 20%.
ProX significantly reduces training FLOPs, enhancing efficiency.
Abstract
Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated…
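To make the "data refinement as a programming task" idea concrete, here is a minimal sketch of how a model-generated program of fine-grained operations could be executed against a single example. It is an illustration under stated assumptions, not the authors' implementation: the operation names (`normalize`, `remove_lines`, `drop_doc`), their parameters, the `run_program` helper, and the sample program string are all hypothetical stand-ins for whatever a refining model would emit.

```python
import ast

# Hypothetical refinement operations; names and signatures are assumptions.
def normalize(lines, source_str, target_str):
    """String normalization: replace source_str with target_str in every line."""
    return [line.replace(source_str, target_str) for line in lines]

def remove_lines(lines, line_start, line_end):
    """Drop lines whose 0-based index falls in [line_start, line_end]."""
    return [l for i, l in enumerate(lines) if not (line_start <= i <= line_end)]

def drop_doc(lines):
    """Discard the whole example."""
    return []

OPS = {"normalize": normalize, "remove_lines": remove_lines, "drop_doc": drop_doc}

def run_program(document, program):
    """Apply a program (one whitelisted call per line) to one document.

    The program text is parsed rather than passed to eval/exec, so only
    calls to functions in OPS with literal arguments are ever executed.
    """
    lines = document.split("\n")
    for stmt in ast.parse(program).body:
        call = getattr(stmt, "value", None)
        if not (isinstance(stmt, ast.Expr) and isinstance(call, ast.Call)
                and isinstance(call.func, ast.Name) and call.func.id in OPS):
            raise ValueError("unsupported statement in refinement program")
        args = [ast.literal_eval(a) for a in call.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        # Each operation sees the document as left by the previous one,
        # so line indices refer to the current state, not the original.
        lines = OPS[call.func.id](lines, *args, **kwargs)
    return "\n".join(lines)

doc = (
    "Home | Login | Share\n"
    "ProX treats  data cleaning as programming.\n"
    "Click here to subscribe"
)
# A made-up program, standing in for the output of a small refining model.
program = (
    'normalize(source_str="  ", target_str=" ")\n'
    "remove_lines(line_start=2, line_end=2)\n"
    "remove_lines(line_start=0, line_end=0)"
)
refined = run_program(doc, program)
print(refined)
```

In this sketch the per-example program is short and cheap to run, which is what would let such refinement be applied to every example in a corpus; parsing the program and whitelisting calls is one possible way to avoid executing arbitrary model output.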
Code & Models
- 🤗 gair-prox/web-chunk-refining-lm (model) · 22 dl · ♡ 7
- 🤗 gair-prox/RedPJ-ProX-0.3B (model) · 8 dl · ♡ 3
- 🤗 gair-prox/RedPJ-ProX-0.7B (model) · 3 dl · ♡ 1
- 🤗 gair-prox/RedPJ-ProX-1.7B (model) · 2 dl · ♡ 2
- 🤗 gair-prox/C4-ProX-1.7B (model) · 5 dl · ♡ 1
- 🤗 gair-prox/FW-ProX-1.7B (model) · 4 dl · ♡ 4
- 🤗 gair-prox/TinyLlama-1.1B-ProXMath (model) · 8 dl · ♡ 2
- 🤗 gair-prox/Llama-2-7B-ProXMath (model) · 7 dl · ♡ 1
- 🤗 gair-prox/Mistral-7B-ProXMath (model) · 10 dl · ♡ 3
- 🤗 gair-prox/CodeLlama-7B-ProXMath (model) · 3 dl · ♡ 1
Taxonomy
Topics: Big Data and Business Intelligence · Statistics Education and Methodologies · Data Mining Algorithms and Applications
