The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering
Haichao Yu, Yu Tian, Sateesh Kumar, Linjie Yang, Heng Wang

TL;DR
This paper presents a three-stage data filtering strategy for foundation model pre-training data (single-modality filtering, cross-modality filtering, and data distribution alignment), evaluated on the DataComp benchmark.
Contribution
It combines existing filtering methods with new techniques, including computing CLIP scores on horizontally flipped images to reduce the influence of scene text, retrieving training samples for target downstream tasks, and rebalancing the data distribution.
Findings
Outperforms the best method from the DataComp paper by over 4% on average across 38 evaluation tasks
Achieves over 2% improvement on ImageNet
Provides detailed analysis of filtering choices
Abstract
The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models often design their own recipe for data filtering, which makes it hard to analyze and compare different data filtering approaches. DataComp is a new benchmark dedicated to evaluating different methods for data filtering. This paper describes our learning and solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. We integrate existing methods and propose new solutions, such as computing CLIP score on horizontally flipped images to mitigate the interference of scene text, using vision and language models to retrieve training samples for target downstream tasks, rebalancing the data distribution to improve the efficiency of allocating the…
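The flipped-image CLIP score mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `embed_images` is a hypothetical stand-in for a CLIP image encoder, the caption embeddings are assumed to be precomputed, and the keep-fraction rule is an assumed selection criterion rather than one stated in the abstract. The idea shown is that mirroring an image makes embedded scene text harder to read, so the image-text similarity depends less on text rendered inside the image.

```python
import numpy as np

def cosine_scores(img_emb, txt_emb):
    """Row-wise cosine similarity between paired image and text embeddings."""
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return np.sum(img_emb * txt_emb, axis=1)

def flipped_clip_scores(images, txt_emb, embed_images):
    """Score each image-caption pair using the horizontally flipped image.

    images:       array of shape (N, H, W, C)
    txt_emb:      array of shape (N, D), precomputed caption embeddings
    embed_images: hypothetical callable mapping (N, H, W, C) -> (N, D)
    """
    flipped = images[:, :, ::-1, :]  # reverse the width axis
    return cosine_scores(embed_images(flipped), txt_emb)

def keep_top_fraction(scores, fraction):
    """Return sorted indices of the highest-scoring `fraction` of samples."""
    k = max(1, int(round(len(scores) * fraction)))
    return np.sort(np.argsort(scores)[::-1][:k])
```

In practice `embed_images` would wrap a real image encoder, and the fraction (or a score threshold) would be tuned against the benchmark; both are left open here.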
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Contrastive Language-Image Pre-training
