The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

Haichao Yu; Yu Tian; Sateesh Kumar; Linjie Yang; Heng Wang

arXiv:2309.15954 · cs.CV · September 29, 2023 · 2 cites

TL;DR

This paper presents a comprehensive data filtering strategy for foundation models, combining multiple methods and innovations to improve data quality and model performance, validated through the DataComp benchmark.

Contribution

It introduces a multi-stage filtering approach with novel techniques like CLIP score adjustments and data rebalancing, advancing data filtering methods for foundation models.

Findings

01. Outperforms previous methods by over 4% on 38 tasks

02. Achieves over 2% improvement on ImageNet

03. Provides detailed analysis of filtering choices

Abstract

The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models often design their own recipes for data filtering, which makes it hard to analyze and compare different data filtering approaches. DataComp is a new benchmark dedicated to evaluating different methods for data filtering. This paper describes the lessons we learned and our solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. We integrate existing methods and propose new solutions, such as computing CLIP score on horizontally flipped images to mitigate the interference of scene text, using vision and language models to retrieve training samples for target downstream tasks, rebalancing the data distribution to improve the efficiency of allocating the…
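
The flipped-image CLIP score mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `encode_image` is a hypothetical stand-in for a CLIP image encoder, embeddings are plain NumPy vectors, and the `keep_fraction` top-k selection rule is an assumption added only to show how such a score could feed a filter.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def flipped_clip_score(image, text_embedding, encode_image):
    """Score an image-text pair using the horizontally flipped image.

    image: H x W x C array.
    encode_image: hypothetical callable mapping an image array to a 1-D
    embedding (a stand-in for a real CLIP image encoder).
    """
    flipped = image[:, ::-1, :]  # reverse the width axis (horizontal flip)
    return cosine_similarity(encode_image(flipped), text_embedding)

def filter_by_score(scores, keep_fraction):
    """Return sorted indices of the top `keep_fraction` of samples by score.

    The keep-top-fraction rule is an assumption made for illustration.
    """
    scores = np.asarray(scores, dtype=np.float64)
    k = max(1, int(round(len(scores) * keep_fraction)))
    order = np.argsort(-scores, kind="stable")  # highest score first
    return np.sort(order[:k])
```

The intuition stated in the abstract is that flipping mitigates the interference of scene text: mirrored text rendered in the image is harder for the encoder to read, so a caption that merely repeats that text should contribute less to the score.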

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Contrastive Language-Image Pre-training