GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data
Jifan Zhang, Ziyue Luo, Jia Liu, Ness Shroff, Robert Nowak

TL;DR
This paper introduces SIEVE, a cost-effective filtering method that approximates GPT-4o's high-quality data selection for large language models by combining active learning with lightweight classifiers, enabling scalable and domain-specific data curation.
Contribution
SIEVE provides a scalable, low-cost alternative to GPT-4o for high-quality data filtering, using active learning to achieve similar accuracy with minimal calls to GPT-4o.
Findings
SIEVE matches GPT-4o's filtering accuracy on specific prompts.
SIEVE reduces filtering cost to less than 1% of the cost of using GPT-4o directly.
SIEVE outperforms existing quality filtering methods on web-scale datasets.
Abstract
Large language models require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web-scale. We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight text classification models, using active learning to fine-tune these models in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. Through different filtering prompts, SIEVE can efficiently curate high quality data for general or…
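The loop the abstract describes (an expensive labeler queried sparingly while a cheap classifier learns to imitate it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `mock_oracle` is a hypothetical stand-in for a GPT-4o filtering call, the classifier is a tiny hand-written logistic regression rather than the fine-tuned text model the abstract mentions, and plain uncertainty sampling is swapped in because the abstract does not say which active learning algorithm SIEVE uses. The vocabulary, pool, and budget are invented for the example.

```python
import math
import random

# Toy vocabulary for synthetic snippets (an assumption made for this sketch).
VOCAB = ["proof", "theorem", "lemma", "buy", "cheap", "click", "data", "model"]

def mock_oracle(text):
    """Hypothetical stand-in for one expensive GPT-4o filtering call."""
    words = text.split()
    return 1 if "proof" in words or "theorem" in words else 0

def featurize(text):
    """Binary bag-of-words features plus a constant bias term."""
    words = set(text.split())
    return [1.0 if w in words else 0.0 for w in VOCAB] + [1.0]

def predict(weights, x):
    """Logistic-regression probability that a snippet passes the filter."""
    z = sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=100, lr=0.5):
    """Fit the lightweight classifier with plain stochastic gradient steps."""
    weights = [0.0] * (len(VOCAB) + 1)
    for _ in range(epochs):
        for x, y in examples:
            p = predict(weights, x)
            weights = [w + lr * (y - p) * xi for w, xi in zip(weights, x)]
    return weights

def active_filter(pool, budget, seed=0):
    """Spend at most `budget` oracle calls, then filter the whole pool cheaply."""
    rng = random.Random(seed)
    feats = [featurize(t) for t in pool]
    unlabeled = list(range(len(pool)))
    labels = {}
    # Start from a few randomly chosen oracle labels.
    for i in rng.sample(unlabeled, min(4, budget, len(pool))):
        labels[i] = mock_oracle(pool[i])
        unlabeled.remove(i)
    while len(labels) < budget and unlabeled:
        weights = train([(feats[i], y) for i, y in labels.items()])
        # Uncertainty sampling: query the snippet the classifier is least sure about.
        i = min(unlabeled, key=lambda j: abs(predict(weights, feats[j]) - 0.5))
        labels[i] = mock_oracle(pool[i])
        unlabeled.remove(i)
    weights = train([(feats[i], y) for i, y in labels.items()])
    kept = [t for t, x in zip(pool, feats) if predict(weights, x) >= 0.5]
    return kept, len(labels)

# Synthetic pool of 200 four-word snippets; only 30 receive an oracle label.
rng = random.Random(1)
pool = [" ".join(rng.choice(VOCAB) for _ in range(4)) for _ in range(200)]
kept, oracle_calls = active_filter(pool, budget=30)
agreement = sum((t in kept) == bool(mock_oracle(t)) for t in pool) / len(pool)
print(f"oracle calls: {oracle_calls} of {len(pool)} snippets")
print(f"agreement with oracle: {agreement:.2f}")
```

The point of the design is that oracle cost is fixed by `budget`, while the trained classifier can then score any number of snippets. The agreement figure printed here is measured against the mock oracle only and says nothing about SIEVE's reported accuracy.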
Taxonomy
Topics: Computational Physics and Python Applications
Methods: Gated Linear Unit · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Inverse Square Root Schedule · Dense Connections · Linear Layer · Residual Connection · SentencePiece
