Improving Pretraining Data Using Perplexity Correlations

Tristan Thrush; Christopher Potts; Tatsunori Hashimoto

arXiv:2409.05816 · cs.CL · March 11, 2025 · 3 cites

TL;DR

This paper introduces a low-cost statistical framework that selects high-quality pretraining data from correlations between LLM perplexity on candidate texts and downstream benchmark scores. Because the correlations are estimated from existing public models, the selection step requires no LLM training by the authors.

Contribution

The paper presents a data selection method based on perplexity-benchmark correlations that outperforms DSIR on every benchmark at the 160M-parameter scale and matches the best data selector in DataComp-LM.

Findings

1. Outperforms DSIR on all benchmarks at the 160M-parameter scale.
2. Matches the best data selector in DataComp-LM.
3. Shows increasing improvements with larger model scales.

Abstract

Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter…
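The core idea in the abstract can be illustrated with a minimal sketch. It assumes a matrix of per-domain losses for a set of existing models and one benchmark score per model, and it uses a plain Spearman rank correlation as a stand-in; the paper's actual correlation estimator and selection procedure are not reproduced here. The function names `domain_correlations` and `select_domains` are hypothetical.

```python
import numpy as np


def rank(values):
    """Return 0-based ranks along axis 0 (ties broken by position)."""
    return np.argsort(np.argsort(values, axis=0), axis=0).astype(float)


def domain_correlations(losses, scores):
    """Spearman correlation between negated loss and benchmark score.

    losses: array of shape (n_models, n_domains), lower is better.
    scores: array of shape (n_models,), higher is better.
    Returns one correlation per domain; a high value means models that
    fit the domain well also tend to score well on the benchmark.
    """
    r_loss = rank(-np.asarray(losses, dtype=float))
    r_score = rank(np.asarray(scores, dtype=float))
    r_loss -= r_loss.mean(axis=0)
    r_score -= r_score.mean()
    num = r_score @ r_loss
    den = np.sqrt((r_loss ** 2).sum(axis=0) * (r_score ** 2).sum())
    return num / den


def select_domains(losses, scores, k):
    """Return indices of the k highest-correlation domains and all correlations."""
    corr = domain_correlations(losses, scores)
    return np.argsort(-corr)[:k], corr
```

In this sketch no model is trained: the only inputs are losses and benchmark scores that could be measured on already-released models, which is the property the abstract emphasizes.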

Peer Reviews

Decision: ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. This paper focuses on avoiding costly pre-training ablations for LLM pre-training data selection and proposes a method that leverages public, pre-trained LLMs based on the correlation between downstream performance and perplexity. This aim is quite important for today's LLM practitioners and the proposed framework is very original.
2. Although limited (see below "weaknesses"), the proposed framework demonstrates promising results compared to baselines.
3. Although the presentation can be

Weaknesses

1. The main weakness of this paper is the scale of the experiments. The paper only includes pre-training experiments using a 160M-parameter LLM with a 3.2B-token budget. This is very small and raises questions about the validity of the results. For example, the smallest scale in the DataComp-LM paper is 400M parameters and 8.2B tokens, and its authors experimented with 5 different scales, going up to a 7B-parameter model trained on 276B tokens (the DCLM competition scale).
2. Experiments only include 8 evaluation benchm

Reviewer 02 · Rating 5 · Confidence 3

Strengths

The paper presents a new and remarkably simple approach to data selection for LLM pretraining. It leverages readily available resources (publicly available LLMs) and a straightforward correlation-based method. The authors ground their approach in a statistical framework, providing a theoretical basis for their correlation-based data selection. The authors demonstrate the robustness of their approach by showing its effectiveness across various benchmarks and conditions.

Weaknesses

While the paper presents promising results at the 160M parameter scale, further validation at larger scales is needed to assess whether the approach remains effective for even moderately larger LLMs. The method also inherits the biases present in the publicly available LLMs used for estimating correlations. This raises potential concerns about bias amplification, especially if these LLMs were trained on data with inherent biases.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

This approach provides a promising way to make use of “the millions of dollars collectively spent” (line 39) on the experimentation represented by all open weight models. The approach extracts information about performant pretraining data selection even when information about the pretraining data of these open-weight models is not available. Their approach is intuitive and the core of it is quite simple (appearing to be insensitive to the trickier details suggested by theory). This means it sho

Weaknesses

I think the biggest weakness is that the paper’s method targets a single benchmark. Are there obvious ways to extend the work to target some aggregation of benchmark scores? Sure, but the paper offers no theory to suggest how we might do that, and no experiments to show whether the obvious ways of doing that would work. LAMBADA, which accounts for 5 of the 8 downstream evaluations, is a language modeling task (i.e. next word prediction) just like the perplexity measures being used as proxie

Taxonomy

Topics: Computational Physics and Python Applications · Neural Networks and Applications