Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale

Fan Zhou; Zengzhi Wang; Qian Liu; Junlong Li; Pengfei Liu

arXiv:2409.17115·cs.CL·February 17, 2025

Open Access · 1 Repo · 10 Models · 5 Datasets

TL;DR

ProX enables small language models to refine training data at scale by treating data cleaning as a programming task, leading to improved model performance and efficiency across diverse benchmarks and domains.

Contribution

The paper introduces ProX, a novel framework that allows models to generate and execute data refinement operations, outperforming traditional human-crafted rules and filtering methods.

Findings

01. Models pre-trained on ProX-curated data outperform models trained on the original data, or on data filtered by other selection methods, by more than 2%.

02. ProX improves domain-specific pre-training accuracy by up to 20%.

03. ProX significantly reduces training FLOPs, enhancing efficiency.

Abstract

Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data-refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated…
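To make the "data refinement as a programming task" idea concrete, here is a minimal sketch of how a per-example program might be executed. It is not the authors' implementation: the operation names (`normalize`, `remove_lines`, `drop_doc`), their parameters, and the `execute_program` helper are all hypothetical, and the program string stands in for what a small refining model would generate. Arguments are parsed with `ast.literal_eval` so that only whitelisted calls with literal arguments are ever run.

```python
import ast

# Hypothetical fine-grained operations; names and signatures are assumptions.
def normalize(doc, source_str, target_str):
    """String normalization: replace every occurrence of source_str."""
    return doc.replace(source_str, target_str)

def remove_lines(doc, line_start, line_end):
    """Drop lines line_start..line_end (0-indexed, inclusive) of the current doc."""
    lines = doc.split("\n")
    return "\n".join(lines[:line_start] + lines[line_end + 1:])

def drop_doc(doc):
    """Discard the whole example."""
    return None

OPS = {"normalize": normalize, "remove_lines": remove_lines, "drop_doc": drop_doc}

def execute_program(doc, program):
    """Apply a generated program (one whitelisted call per line) to one example."""
    for stmt in ast.parse(program).body:
        # Accept only bare calls to known operations, e.g. drop_doc()
        if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
            raise ValueError("program may contain only function calls")
        call = stmt.value
        if not (isinstance(call.func, ast.Name) and call.func.id in OPS):
            raise ValueError("unknown operation")
        # literal_eval rejects anything that is not a plain literal
        args = [ast.literal_eval(a) for a in call.args]
        kwargs = {k.arg: ast.literal_eval(k.value) for k in call.keywords}
        doc = OPS[call.func.id](doc, *args, **kwargs)
        if doc is None:  # the example was dropped
            return None
    return doc

# Toy example: strip a navigation line, then collapse a double space.
doc = "Home | Login | Share\nProX  treats cleaning as programming.\nCopyright 2024"
program = (
    'remove_lines(line_start=0, line_end=0)\n'
    'normalize(source_str="  ", target_str=" ")'
)
refined = execute_program(doc, program)
```

Because each example gets its own short program, the expensive step (generation) stays small while the edits themselves are cheap, deterministic string operations. Note that in this sketch line indices refer to the document as it stands after earlier operations.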

Code & Models

Repositories

gair-nlp/prox (PyTorch, official)

Taxonomy

Topics: Big Data and Business Intelligence · Statistics Education and Methodologies · Data Mining Algorithms and Applications