Improving Pretraining Data Using Perplexity Correlations
Tristan Thrush, Christopher Potts, Tatsunori Hashimoto

TL;DR
This paper introduces a cost-effective statistical framework for selecting high-quality pretraining data based on perplexity-benchmark correlations, improving language model performance without additional LLM training.
Contribution
It presents a novel data selection method using perplexity correlations that outperforms existing techniques across multiple benchmarks at various scales.
Findings
Outperforms DSIR on all benchmarks at 160M scale
Matches the best data selector in DataComp-LM
Shows increasing improvements with larger model scales
Abstract
Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter…
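The core idea in the abstract, ranking web domains by how strongly LLM loss on them tracks benchmark performance, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `spearman_per_domain` and `select_domains` are hypothetical, the inputs are a made-up loss matrix of shape (models, domains) plus one benchmark score per model, and plain Spearman rank correlation stands in for the paper's own estimator, which the text above does not specify.

```python
import numpy as np


def spearman_per_domain(losses, scores):
    """Rank-correlate negated loss with benchmark score, one value per domain.

    losses: array of shape (n_models, n_domains), per-model loss on each domain.
    scores: array of shape (n_models,), benchmark score of each model.
    Loss is negated so that "lower loss goes with higher score" yields a
    positive correlation. Ties are not rank-averaged (illustrative simplification).
    """
    losses = np.asarray(losses, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Double argsort converts values to 0-based ranks along the model axis.
    loss_ranks = np.argsort(np.argsort(-losses, axis=0), axis=0).astype(float)
    score_ranks = np.argsort(np.argsort(scores)).astype(float)

    # Pearson correlation of the centered ranks equals Spearman correlation.
    loss_ranks -= loss_ranks.mean(axis=0)
    score_ranks -= score_ranks.mean()
    cov = score_ranks @ loss_ranks
    denom = np.linalg.norm(loss_ranks, axis=0) * np.linalg.norm(score_ranks)
    return cov / denom


def select_domains(losses, scores, k):
    """Return indices of the k domains with the highest correlation."""
    corr = spearman_per_domain(losses, scores)
    top_k = np.argsort(-corr, kind="stable")[:k]
    return top_k, corr


if __name__ == "__main__":
    # Synthetic stand-in for "90 LLMs on many web domains".
    rng = np.random.default_rng(0)
    n_models, n_domains = 90, 6
    scores = rng.uniform(0.2, 0.8, size=n_models)
    losses = rng.normal(3.0, 0.5, size=(n_models, n_domains))
    losses[:, 0] = 4.0 - 2.0 * scores  # domain 0: loss falls as score rises
    chosen, corr = select_domains(losses, scores, k=2)
    print("selected domains:", chosen)
    print("correlations:", np.round(corr, 2))
```

In this toy setup no model is trained: the only inputs are losses and benchmark scores from already existing models, which is the cost saving the abstract describes. How the selected domains are then turned into a pretraining set is outside the scope of this sketch.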
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper focuses on avoiding costly pre-training ablations for LLM pre-training data selection and proposes a method that leverages public, pre-trained LLMs based on the correlation between downstream performance and perplexity. This aim is quite important for today's LLM practitioners and the proposed framework is very original. 2. Although limited (see below "weaknesses"), the proposed framework demonstrates promising results compared to baselines. 3. Although the presentation can be
1. The main weakness of this paper is the scale of the experiments. The paper only includes pre-training experiments using a 160M-parameter LLM with a 3.2B-token budget. This is very small and raises questions about the validity of the results. For example, the smallest scale in the DataComp-LM paper is 400M parameters and 8.2B tokens, and they experimented with 5 different scales, going up to a 7B-parameter model trained on 276B tokens (the DCLM competition scale). 2. Experiments only include 8 evaluation benchm
The paper presents a new and remarkably simple approach to data selection for LLM pretraining. It leverages readily available resources (publicly available LLMs) and a straightforward correlation-based method. The authors ground their approach in a statistical framework, providing a theoretical basis for their correlation-based data selection. The authors demonstrate the robustness of their approach by showing its effectiveness across various benchmarks and conditions.
While the paper presents promising results at the 160M parameter scale, further validation at larger scales is necessary to assess the scalability and effectiveness of the approach for slightly more massive LLMs. The method inherits the biases present in the publicly available LLMs used for estimating correlations. This raises potential concerns about bias amplification, especially if these LLMs were trained on data with inherent biases.
This approach provides a promising way to make use of “the millions of dollars collectively spent” (line 39) on the experimentation represented by all open weight models. The approach extracts information about performant pretraining data selection even when information about the pretraining data of these open-weight models is not available. Their approach is intuitive and the core of it is quite simple (appearing to be insensitive to the trickier details suggested by theory). This means it sho
I think the biggest weakness is that the paper’s method targets a single benchmark. Are there obvious ways to extend the work to target some aggregation of benchmark scores? Sure, but the paper offers no theory to suggest how we might do that, and no experiments to show whether the obvious ways of doing that would work. Lambada, which accounts for 5 out of 8 of downstream evaluations, is a language modeling task (i.e. next word prediction) just like the perplexity measures being used as proxie
Code & Models
- 🤗perplexity-correlations/fasttext-arc-easy-targetmodel · 1 dl
- 🤗perplexity-correlations/fasttext-piqa-targetmodel · 2 dl · ♡ 1
- 🤗perplexity-correlations/fasttext-sciq-targetmodel · 4 dl
- 🤗perplexity-correlations/fasttext-lambada-targetmodel
- 🤗perplexity-correlations/fasttext-lambada-es-targetmodel · 2 dl
- 🤗perplexity-correlations/fasttext-lambada-de-targetmodel · 3 dl
- 🤗perplexity-correlations/fasttext-lambada-fr-targetmodel · 3 dl
- 🤗perplexity-correlations/fasttext-lambada-it-targetmodel · 6 dl
Taxonomy
Topics: Computational Physics and Python Applications · Neural Networks and Applications
