Devil in the Number: Towards Robust Multi-modality Data Filter

Yichen Xu; Zihan Xu; Wenhao Chai; Zhonghan Zhao; Enxin Song; Gaoang Wang

arXiv:2309.13770 · cs.LG · September 26, 2023

Open Access

TL;DR

This paper introduces a robust multi-modality data filtering method that mitigates the influence of redundant textual elements like numbers, improving data selection quality and model performance on large-scale datasets.

Contribution

It proposes a novel text-masked CLIP filtering approach that outperforms existing methods by reducing the impact of redundant information in multi-modality data.

Findings

01

Text-masked filter outperforms original CLIP filter on DataComp benchmark.

02

Removing numbers from text improves CLIP score reliability.

03

Proposed method achieves 3.6% performance gain on ImageNet distribution shifts.

Abstract

Filtering web-scale multi-modality datasets appropriately is crucial for boosting performance and reducing training costs. For instance, the LAION papers employ a CLIP score filter that keeps only data whose CLIP scores surpass a certain threshold. T-MARS, on the other hand, achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score. Analyzing the dataset, we observe a significant proportion of redundant information, such as numbers, in the textual content. Our experiments on a subset of the data reveal the profound impact of these redundant elements on the CLIP scores. A logical approach is therefore to reevaluate the CLIP scores after eliminating these influences. Experimentally, our text-based CLIP filter outperforms the top-ranked method on the "small scale" of…
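The idea described in the abstract (remove numbers from the caption, re-score, then threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `mask_numbers` and `filter_by_score`, the regular expression used to detect numbers, and the `score_fn` callable are all assumptions made for illustration. In practice `score_fn` would wrap a CLIP image-text similarity; here it is left as a parameter so the sketch stays self-contained. Whether the original or the masked caption is retained after filtering is not stated in the abstract; the sketch keeps the original.

```python
import re
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")

# Hypothetical pattern: digit runs, optionally joined by "." or "," (e.g. "2023", "1,200.50").
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(caption: str) -> str:
    """Remove numeric tokens from a caption and collapse leftover whitespace."""
    without_numbers = _NUMBER.sub(" ", caption)
    return " ".join(without_numbers.split())


def filter_by_score(
    pairs: Sequence[Tuple[Image, str]],
    score_fn: Callable[[Image, str], float],
    threshold: float,
) -> List[Tuple[Image, str]]:
    """Keep (image, caption) pairs whose score on the number-free caption meets the threshold.

    score_fn is a stand-in for an image-text similarity such as a CLIP score
    (an assumption; any callable with this signature works).
    """
    kept = []
    for image, caption in pairs:
        # Score the caption with numbers removed, but keep the original pair.
        if score_fn(image, mask_numbers(caption)) >= threshold:
            kept.append((image, caption))
    return kept
```

With a real model, `score_fn` would compute the cosine similarity between the CLIP image embedding and the CLIP text embedding of the masked caption, and `threshold` would be chosen to retain the desired fraction of the pool.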

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training