Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, Yi Wen, Vignesh Ramanathan, Dhruv Mahajan

TL;DR
This paper enhances vision-language contrastive pre-training by filtering noisy data, leveraging unimodal representations, and emphasizing hard negatives, leading to significant improvements across numerous zero-shot and few-shot tasks.
Contribution
It introduces the Complexity, Action, and Text-spotting (CAT) filtering strategy, Concept Distillation, and an importance-sampling method for hard negatives, advancing the state of the art in contrastive vision-language models.
Findings
Improved performance on 20 out of 29 zero-shot tasks.
Significant gains in few-shot linear probing accuracy.
Effective reduction of dataset noise without increasing training complexity.
Abstract
Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve the following three aspects of the contrastive pre-training pipeline: dataset noise, model initialization and the training objective. First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size, while achieving improved performance across zero-shot vision-language tasks. Next, we propose an approach titled Concept Distillation to leverage strong unimodal representations for contrastive training that does not increase training complexity while outperforming prior work. Finally, we modify the traditional contrastive alignment objective, and propose an importance-sampling approach to up-sample the importance of hard-negatives…
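The idea of up-weighting hard negatives in a contrastive objective can be illustrated with a small NumPy sketch. This is not the paper's exact formulation, which is truncated in the abstract above. It assumes an image-to-text InfoNCE loss in which each negative is reweighted in proportion to exp(beta * similarity / tau), with the weights normalised to average 1 over the negatives. The function name and the parameters tau and beta are hypothetical, chosen only for illustration.

```python
import numpy as np


def hard_negative_contrastive_loss(img, txt, tau=0.1, beta=0.5):
    """Image-to-text contrastive loss with similarity-weighted negatives.

    img, txt: (n, d) arrays of L2-normalised embeddings; row i of each
              forms a matched pair, all other rows act as negatives (n >= 2).
    tau:      temperature (assumed value, not taken from the paper).
    beta:     hypothetical hardness knob; beta = 0 recovers plain InfoNCE.
    """
    n = img.shape[0]
    logits = img @ txt.T / tau              # (n, n) scaled similarities
    pos = np.diag(logits)                   # logits of the matched pairs
    neg_mask = ~np.eye(n, dtype=bool)       # True for every negative pair

    # Importance weights: softmax of beta * logits over the negatives only,
    # rescaled so the weights in each row average to 1.
    scaled = np.where(neg_mask, beta * logits, -np.inf)
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    w = np.exp(scaled)                      # diagonal entries become 0
    w = (n - 1) * w / w.sum(axis=1, keepdims=True)

    # Denominator: positive term plus weighted negatives, shifted by the
    # row maximum for numerical stability.
    m = logits.max(axis=1)
    exp_logits = np.exp(logits - m[:, None])
    denom = np.exp(pos - m) + (w * exp_logits).sum(axis=1)

    return float(np.mean(np.log(denom) - (pos - m)))
```

With beta = 0 every negative receives weight 1 and the function reduces to the standard InfoNCE loss. Larger beta values shift weight toward negatives that are most similar to the anchor, so they contribute more to the denominator and to the gradient.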
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
Methods: Contrastive Learning
