CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

David Brandfonbrener; Hanlin Zhang; Andreas Kirsch; Jonathan Richard Schwarz; Sham Kakade

arXiv:2406.10670·cs.LG·October 31, 2024

TL;DR

CoLoR-Filter is a scalable data selection method for language model pre-training that uses auxiliary models to identify high-quality data, significantly reducing training data requirements while maintaining performance.

Contribution

The paper introduces CoLoR-Filter, a novel, efficient data selection technique based on auxiliary models' loss values, improving data efficiency in language model pre-training.

Findings

01

CoLoR-Filter achieves comparable performance with 25x less data on the Books domain-adaptation task.

02

It scales favorably even when the auxiliary models are small.

03

It significantly reduces the data needed for the downstream tasks.

Abstract

Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more…
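The selection criterion described in the abstract can be illustrated with a minimal sketch. The abstract states only that the criterion is based on the relative loss values of two auxiliary models. The sketch below assumes one "prior" model and one "conditional" model that has additionally seen target-domain data, and it scores each candidate sequence by how much the conditional model lowers its loss. The function names, the top-k selection rule, and the loss values are hypothetical and chosen for illustration; they are not taken from the official repository.

```python
from typing import List, Sequence


def color_filter_scores(prior_losses: Sequence[float],
                        conditional_losses: Sequence[float]) -> List[float]:
    """Score each candidate by the loss reduction from prior to conditional model.

    Assumption: a larger reduction means the candidate looks more like the
    target data, so it is more useful to keep.
    """
    if len(prior_losses) != len(conditional_losses):
        raise ValueError("loss lists must have the same length")
    return [p - c for p, c in zip(prior_losses, conditional_losses)]


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring candidates, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical per-sequence losses for four candidate sequences
# (illustrative numbers, not results from the paper).
prior_losses = [3.2, 2.9, 4.1, 3.0]
conditional_losses = [2.5, 2.8, 4.0, 2.1]

scores = color_filter_scores(prior_losses, conditional_losses)
selected = select_top_k(scores, k=2)
print(selected)
```

In this sketch each auxiliary model only needs to produce one loss value per candidate, so scoring requires forward passes but no training on the candidate pool. That is one plausible reading of why the abstract calls the criterion computationally efficient.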

Code & Models

Repositories

davidbrandfonbrener/color-filter-olmo
PyTorch · Official

Datasets

davidbrandfonbrener/color-filtered-c4
dataset · 34 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis