Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift
Yihao Xue, Siddharth Joshi, Dang Nguyen, Baharan Mirzasoleiman

TL;DR
This paper analyzes why multimodal contrastive learning methods like CLIP are robust to distribution shifts, revealing mechanisms such as intra-class contrasting and inter-class feature sharing that enhance generalization.
Contribution
It uncovers the mechanisms behind the robustness of multimodal contrastive learning (MMCL), provides theoretical insight into the role of rich captions, and validates the findings with synthetic and real-world experiments.
Findings
Intra-class contrasting learns high-variance features.
Inter-class feature sharing improves generalization.
Rich captions enhance robustness to distribution shifts.
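To make "intra-class contrasting" concrete, here is a minimal sketch under our reading of the term. MMCL pairs each image only with its own caption and never sees class labels, so every other example in the batch acts as a negative, including examples that happen to share the same class. The helper `intra_class_negative_mask` and the toy `labels` are hypothetical and used only for illustration. The labels mark which negatives fall inside the same class; the contrastive loss itself never uses them.

```python
import numpy as np

def intra_class_negative_mask(labels):
    """Boolean matrix that is True where pair (i, j) is a negative in the
    image-text contrastive loss but i and j share the same (hidden) class label."""
    labels = np.asarray(labels)
    same_class = labels[:, None] == labels[None, :]
    # In MMCL every non-matching image-caption pair is treated as a negative.
    is_negative = ~np.eye(len(labels), dtype=bool)
    return same_class & is_negative

# Hypothetical batch of five examples drawn from three classes.
labels = [0, 0, 1, 2, 1]
mask = intra_class_negative_mask(labels)
print(mask.sum())  # → 4: pairs (0, 1), (1, 0), (2, 4), (4, 2)
```

A supervised classifier would pull these same-class examples toward a shared class representation. The contrastive objective instead pushes them apart, which rewards features that vary within a class.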
Abstract
Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP, have achieved remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite the empirical success, the mechanism behind learning such generalizable representations is still not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL's robustness: intra-class contrasting, which allows the model to learn features with a high variance, and inter-class feature sharing, where annotated details in one class help learning other classes better. Both mechanisms prevent spurious features that are over-represented in the training data from overshadowing the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the…
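For readers unfamiliar with the setup the abstract refers to, the following is a minimal NumPy sketch of a CLIP-style symmetric contrastive (InfoNCE) objective and of zero-shot classification by nearest class-prompt embedding. It is not the authors' code. The function names, the temperature of 0.07, and the assumption that embeddings are already produced by some image and text encoders are illustrative choices.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th image is matched to the i-th caption, and all
    other captions in the batch (and, in the other direction, images) are negatives."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (n, n) scaled cosine-similarity matrix
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def zero_shot_predict(img_emb, class_prompt_emb):
    """Assign each image to the class whose prompt embedding is most cosine-similar."""
    sims = l2_normalize(img_emb) @ l2_normalize(class_prompt_emb).T
    return sims.argmax(axis=1)

# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
print(clip_style_loss(img, img))  # perfectly aligned pairs give a loss below log(4)

class_prompts = np.eye(3)  # hypothetical embeddings of prompts like "a photo of a {class}"
images = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 1.0], [0.1, 1.0, 0.3]])
print(zero_shot_predict(images, class_prompts))  # → [0 2 1]
```

Because zero-shot prediction relies only on similarity to text prompts, whatever features the contrastive loss emphasizes directly determine accuracy under distribution shift. The abstract's argument is about which features the objective lets dominate.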
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
