Devil in the Number: Towards Robust Multi-modality Data Filter
Yichen Xu, Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang Wang

TL;DR
This paper introduces a robust multi-modality data filtering method that mitigates the influence of redundant textual elements like numbers, improving data selection quality and model performance on large-scale datasets.
Contribution
It proposes a novel text-masked CLIP filtering approach that outperforms existing methods by reducing the impact of redundant information in multi-modality data.
Findings
Text-masked filter outperforms original CLIP filter on DataComp benchmark.
Removing numbers from text improves CLIP score reliability.
Proposed method achieves 3.6% performance gain on ImageNet distribution shifts.
Abstract
Filtering web-scale multi-modality datasets appropriately is crucial for boosting performance and reducing training costs. For instance, the LAION papers employ a CLIP score filter to select data with CLIP scores surpassing a certain threshold. T-MARS, on the other hand, achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score. Analyzing the dataset, we observe a significant proportion of redundant information, such as numbers, in the textual content. Our experiments on a subset of the data reveal the profound impact of these redundant elements on the CLIP scores. A logical approach is to re-evaluate the CLIP scores after eliminating these influences. Experimentally, our text-based CLIP filter outperforms the top-ranked method on the "small scale" of…
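The idea of re-scoring captions after removing numbers can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expression used to mask numbers, the function names `mask_numbers` and `filter_by_masked_score`, and the choice to keep the original caption for retained pairs are all assumptions made for illustration. The scoring function is passed in as a parameter so that any image-text similarity model (for example a CLIP model) could be plugged in.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical masking rule (an assumption, not taken from the paper):
# remove every run of digits, including runs joined by common separators
# such as "12/05/2019" or "3.14". It is deliberately crude and will also
# strip the digit from tokens like "3D".
_NUMBER = re.compile(r"\d+(?:[.,:/-]\d+)*")


def mask_numbers(caption: str) -> str:
    """Remove digit runs from a caption and collapse leftover whitespace."""
    return " ".join(_NUMBER.sub(" ", caption).split())


def filter_by_masked_score(
    samples: List[Tuple[str, str]],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Keep (image, caption) pairs whose number-masked caption scores >= threshold.

    `score_fn(image, text)` stands in for an image-text similarity such as a
    CLIP score. Captions that are empty after masking are dropped. Retained
    pairs keep their original caption (an assumption of this sketch).
    """
    kept = []
    for image, caption in samples:
        masked = mask_numbers(caption)
        if masked and score_fn(image, masked) >= threshold:
            kept.append((image, caption))
    return kept
```

With a real model, `score_fn` would typically return the cosine similarity between the image embedding and the embedding of the masked caption, and `threshold` would be tuned on the target benchmark.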
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
