A Bitter Lesson for Data Filtering
Christopher Mohri, John Duchi, Tatsunori Hashimoto

TL;DR
This study suggests that for large model pretraining in the high-compute, data-scarce regime, training on all available data without filtering can be more effective than filtering out low-quality data.
Contribution
The paper challenges the common belief that filtering data down to high-quality content is essential for pretraining. Its new scaling studies suggest that, with enough compute, unfiltered data can outperform filtered data.
Findings
Sufficiently trained large models not only tolerate low-quality and distractor data but benefit from it.
With enough compute, the best data filter may be no data filter.
In the high-compute, data-scarce regime, unfiltered data can lead to better performance than filtered data.
Abstract
We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.
