GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data

Jifan Zhang; Ziyue Luo; Jia Liu; Ness Shroff; Robert Nowak

arXiv:2410.02755 · cs.CL · February 3, 2025

TL;DR

This paper introduces SIEVE, a cost-effective filtering method that approximates GPT-4o's judgments of training-data quality by combining active learning with lightweight classifiers, enabling scalable general-purpose and domain-specific curation of language model pretraining data.

Contribution

SIEVE provides a scalable, low-cost alternative to GPT-4o for high-quality data filtering, using active learning to achieve similar accuracy with minimal calls to GPT-4o.

Findings

01 SIEVE matches GPT-4o's filtering accuracy on specific prompts.

02 SIEVE reduces filtering cost to less than 1% of the cost of filtering with GPT-4o.

03 SIEVE outperforms existing quality filtering methods on web-scale datasets.

Abstract

Large language models require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web scale. We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight text classification models, using active learning to fine-tune these models in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. Through different filtering prompts, SIEVE can efficiently curate high-quality data for general or…
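
The general idea described in the abstract, training a cheap classifier from a small number of expensive labels chosen by active learning, can be illustrated with a toy sketch. This is not the authors' algorithm: it uses plain uncertainty sampling and a tiny logistic regression, and every name here (`oracle_label`, `active_learn`, the 2-D feature vectors standing in for text, the decision rule inside the oracle) is a hypothetical placeholder chosen for illustration. In a real pipeline the oracle would be a GPT-4o filtering call and the classifier a lightweight text model.

```python
import math
import random

def oracle_label(x):
    # Hypothetical stand-in for one expensive GPT-4o filtering call.
    # Here "high quality" is simply the region x[0] + x[1] > 1.
    return 1 if x[0] + x[1] > 1.0 else 0

def predict_proba(w, x):
    # Logistic model with a bias term; z is clamped to avoid overflow.
    z = w[0] + w[1] * x[0] + w[2] * x[1]
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit(labeled, epochs=200, lr=0.3):
    # Plain stochastic gradient descent on the logistic loss.
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in labeled:
            err = predict_proba(w, x) - y
            w[0] -= lr * err
            w[1] -= lr * err * x[0]
            w[2] -= lr * err * x[1]
    return w

def active_learn(pool, budget, seed_size=10, rng=None):
    # Label a small random seed set, then repeatedly query the oracle
    # on the pool item the current classifier is least sure about.
    rng = rng or random.Random(0)
    unlabeled = list(range(len(pool)))
    rng.shuffle(unlabeled)
    labeled = {i: oracle_label(pool[i]) for i in unlabeled[:seed_size]}
    unlabeled = unlabeled[seed_size:]
    w = fit([(pool[i], y) for i, y in labeled.items()])
    while len(labeled) < budget and unlabeled:
        i = min(unlabeled, key=lambda j: abs(predict_proba(w, pool[j]) - 0.5))
        unlabeled.remove(i)
        labeled[i] = oracle_label(pool[i])  # one more expensive call
        w = fit([(pool[j], y) for j, y in labeled.items()])
    return w, len(labeled)

if __name__ == "__main__":
    rng = random.Random(1)
    pool = [(rng.random(), rng.random()) for _ in range(300)]
    w, calls = active_learn(pool, budget=40)
    # Filter the whole pool with the cheap classifier only.
    kept = [x for x in pool if predict_proba(w, x) >= 0.5]
    print(f"oracle calls: {calls}, pool size: {len(pool)}, kept: {len(kept)}")
```

In this sketch the oracle is called only `budget` times, while the trained classifier filters the entire pool; the cost saving in a real deployment would depend on the relative price of the two models, which the toy example does not measure.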

Taxonomy

Topics: Computational Physics and Python Applications

Methods: Gated Linear Unit · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Inverse Square Root Schedule · Dense Connections · Linear Layer · Residual Connection · SentencePiece