Understanding the Gain from Data Filtering in Multimodal Contrastive Learning
Divyansh Pareek, Sewoong Oh, Simon S. Du

TL;DR
This paper analyzes how data filtering improves multimodal contrastive learning, providing theoretical bounds that explain the empirical success of teacher-based filtering in enhancing model performance.
Contribution
It offers a theoretical characterization of the benefits of data filtering in multimodal contrastive learning, quantifying error bounds under a standard data generation model.
Findings
Filtering reduces the error bounds in contrastive learning.
Theoretical bounds depend on the fraction of correctly matched data.
Filtering provides significant benefits, especially when data quality is low.
Abstract
The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting as the fraction of data with correctly matched modalities among paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: the error without filtering is upper and lower bounded by , and the error with teacher-based filtering is…
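As a rough illustration of the teacher-based filtering idea described in the abstract, the sketch below scores each pair by the cosine similarity of its two modality embeddings and keeps the highest-scoring fraction. It is only a sketch under stated assumptions, not the paper's procedure: the function name `teacher_filter`, the `keep_fraction` parameter, and the synthetic data are all hypothetical, and the "teacher" here is simply the identity map on toy features rather than a pre-trained model.

```python
import numpy as np


def teacher_filter(x_emb, y_emb, keep_fraction=0.5):
    """Keep the pairs whose (assumed) teacher embeddings agree the most.

    x_emb, y_emb: arrays of shape (n, d), one row per paired sample.
    Returns the sorted indices of the kept pairs and all quality scores.
    """
    # Normalize rows so the row-wise dot product is a cosine similarity.
    x = x_emb / np.linalg.norm(x_emb, axis=1, keepdims=True)
    y = y_emb / np.linalg.norm(y_emb, axis=1, keepdims=True)
    scores = np.sum(x * y, axis=1)

    # Keep the top keep_fraction of pairs by score.
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    keep_idx = np.argsort(scores)[::-1][:n_keep]
    return np.sort(keep_idx), scores


# Toy data (an assumption for illustration): 70% of pairs share a latent
# vector, the remaining 30% have an unrelated second modality.
rng = np.random.default_rng(0)
n, d = 200, 16
n_clean = 140
z = rng.normal(size=(n, d))
x_emb = z + 0.1 * rng.normal(size=(n, d))
y_emb = z + 0.1 * rng.normal(size=(n, d))
y_emb[n_clean:] = rng.normal(size=(n - n_clean, d))  # mismatched pairs

is_clean = np.arange(n) < n_clean
kept, scores = teacher_filter(x_emb, y_emb, keep_fraction=0.5)

clean_before = is_clean.mean()
clean_after = is_clean[kept].mean()
print(f"clean fraction before filtering: {clean_before:.2f}")
print(f"clean fraction after filtering:  {clean_after:.2f}")
```

In this toy setting the kept subset contains a larger share of correctly matched pairs than the raw data, which is the quantity the abstract's bounds are stated in terms of. A fixed `keep_fraction` is one design choice; thresholding on the score directly is an equally plausible alternative.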
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks · Face recognition and analysis
