Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Divyansh Pareek; Sewoong Oh; Simon S. Du

arXiv:2512.14230·cs.LG·December 17, 2025

TL;DR

This paper analyzes how data filtering improves multimodal contrastive learning, providing theoretical bounds that explain the empirical success of teacher-based filtering in enhancing model performance.

Contribution

The paper gives a theoretical characterization of the benefit of data filtering in multimodal contrastive learning, deriving error bounds with and without filtering under a standard bimodal data generation model.

Findings

01

Teacher-based filtering provably reduces the error of contrastive learning relative to training on unfiltered data.

02

The bounds depend on $η$, the fraction of pairs with correctly matched modalities; without filtering, the error is upper and lower bounded by $\frac{1}{η n}$.

03

Filtering provides significant benefits, especially when data quality is low.

Abstract

The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $η \in (0, 1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: (i) the error without filtering is upper and lower bounded by $\frac{1}{η n}$, and (ii) the error with teacher-based filtering is…
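As a rough illustration of the teacher-based filtering idea described in the abstract, the following sketch scores each pair with a teacher and keeps only pairs above a threshold. Everything here is an assumption made for illustration, not the paper's construction: the synthetic generator `make_pairs`, the identity-matrix teacher, the cosine-similarity score, the threshold of 0.5, and all function names are hypothetical.

```python
import numpy as np


def make_pairs(n, eta, dim, noise=0.1, seed=0):
    """Toy bimodal data (an assumed model, not the paper's).

    A fraction of roughly `eta` of the pairs share one latent vector;
    the remaining pairs have independent latents (mismatched modalities).
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, dim))        # latent behind modality x
    z_other = rng.standard_normal((n, dim))  # unrelated latent for bad pairs
    is_clean = rng.random(n) < eta           # True where the pair is matched
    x = z + noise * rng.standard_normal((n, dim))
    y_latent = np.where(is_clean[:, None], z, z_other)
    y = y_latent + noise * rng.standard_normal((n, dim))
    return x, y, is_clean


def teacher_scores(x, y, teacher_x, teacher_y):
    """Quality score per pair: cosine similarity of teacher embeddings."""
    u = x @ teacher_x.T
    v = y @ teacher_y.T
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return np.sum(u * v, axis=1)


def filter_pairs(x, y, scores, threshold):
    """Keep only pairs whose teacher score is at least `threshold`."""
    keep = scores >= threshold
    return x[keep], y[keep], keep


# Hypothetical setup: 2000 pairs, about 30% correctly matched.
x, y, is_clean = make_pairs(n=2000, eta=0.3, dim=16)
teacher = np.eye(16)  # stand-in for a pre-trained linear teacher
scores = teacher_scores(x, y, teacher, teacher)
x_kept, y_kept, keep = filter_pairs(x, y, scores, threshold=0.5)

clean_before = is_clean.mean()       # fraction of matched pairs in raw data
clean_after = is_clean[keep].mean()  # fraction of matched pairs after filtering
print(f"matched fraction: {clean_before:.2f} -> {clean_after:.2f}")
```

In this toy setting the filtered subset contains a much higher share of correctly matched pairs than the raw data, which is the intuition behind filtering; the paper's actual guarantees concern the error of the learned representation, which this sketch does not measure.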

Videos

Understanding the Gain from Data Filtering in Multimodal Contrastive Learning · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks · Face Recognition and Analysis