CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, Sham Kakade

TL;DR
CoLoR-Filter is a scalable data selection method for language model pre-training that uses auxiliary models to identify high-quality data, significantly reducing training data requirements while maintaining performance.
Contribution
The paper introduces CoLoR-Filter, a novel, efficient data selection technique based on auxiliary models' loss values, improving data efficiency in language model pre-training.
Findings
CoLoR-Filter achieves comparable performance with 25x less data on the Books domain adaptation task.
It scales effectively with smaller auxiliary models.
It also significantly reduces data needs for the downstream multiple-choice question answering tasks.
Abstract
Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more…
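The abstract describes the selection criterion only as being based on the relative loss values of two auxiliary models. A minimal sketch of what such a criterion could look like follows, assuming the score is the per-sequence loss under one auxiliary model minus the loss under the other, and that the sequences with the largest reduction are kept. The names `prior_losses`, `conditional_losses`, and `color_filter_select` are hypothetical, and the loss values are made-up illustrative numbers, not results from the paper.

```python
import numpy as np


def color_filter_select(prior_losses, conditional_losses, n_select):
    """Return indices of the n_select sequences with the largest loss reduction.

    Assumption (not stated in the abstract): `prior_losses` come from an
    auxiliary model before it sees downstream data, and `conditional_losses`
    come from a second auxiliary model that has been adapted to that data.
    """
    prior = np.asarray(prior_losses, dtype=float)
    conditional = np.asarray(conditional_losses, dtype=float)
    if prior.shape != conditional.shape:
        raise ValueError("both loss arrays must have the same shape")

    # Loss reduction: positive when the second model finds the sequence
    # easier than the first model does.
    reduction = prior - conditional

    # Sort by descending reduction; a stable sort keeps ties in input order.
    order = np.argsort(-reduction, kind="stable")
    return order[:n_select]


# Illustrative per-sequence losses for four candidate sequences.
prior_losses = [3.2, 2.9, 4.1, 3.5]
conditional_losses = [3.0, 3.1, 3.2, 3.4]

selected = color_filter_select(prior_losses, conditional_losses, n_select=2)
print(selected.tolist())
```

Each candidate sequence needs only one forward pass through each auxiliary model, so scoring can run in parallel over the candidate pool, which is consistent with the abstract's description of the criterion as simple and computationally efficient.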
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
