A Bitter Lesson for Data Filtering

Christopher Mohri; John Duchi; Tatsunori Hashimoto

arXiv:2605.19407·cs.LG·May 20, 2026

TL;DR

In the high-compute, data-scarce regime of large-model pretraining, the paper's experiments suggest that training on all available data without filtering can outperform filtering out low-quality data.

Contribution

The paper challenges the common belief that data filtering is essential for pretraining, presenting scaling studies in which unfiltered data outperforms filtered data given enough compute.

Findings

01

Sufficiently trained large models tolerate, and even benefit from, low-quality and distractor data.

02

With enough compute, data filtering may not be necessary.

03

Unfiltered data can lead to better performance in high-compute, data-scarce pretraining.

Abstract

We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.
