Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits Bleeker, Mariya Hendriksen, Andrew Yates, Maarten de Rijke

TL;DR
This paper shows that contrastive vision-language models can learn shortcuts instead of representations that capture all the information in the captions. It introduces synthetic shortcuts to measure the problem and evaluates methods for reducing shortcut learning, which help only partially.
Contribution
The paper introduces a synthetic shortcut framework for evaluating and reducing shortcut learning in contrastive vision-language models, revealing limitations of current training methods.
Findings
Contrastive models often learn shortcuts rather than full representations.
Synthetic shortcuts can be injected to evaluate shortcut learning.
Proposed methods partially reduce shortcut reliance.
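
As a rough illustration of what injecting a synthetic shortcut into an image-text pair could look like, the sketch below writes the same identifier into both the image and its caption. The function name `inject_shortcut`, the bit-pattern encoding in the top pixel row, and the `<idN>` caption token are all hypothetical choices made for this example; this summary does not describe how the authors encode their shortcuts.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of 0-255 pixel values


def inject_shortcut(
    image: Image, caption: str, pair_id: int, n_bits: int = 8
) -> Tuple[Image, str]:
    """Return copies of image and caption that carry the same identifier.

    Hypothetical encoding: the identifier is written as black/white pixels
    in the first row of the image and appended to the caption as a token.
    """
    if not 0 <= pair_id < 2 ** n_bits:
        raise ValueError("pair_id does not fit in n_bits")
    if not image or len(image[0]) < n_bits:
        raise ValueError("image is too narrow to hold the identifier")

    # Most significant bit first.
    bits = [(pair_id >> i) & 1 for i in reversed(range(n_bits))]

    marked = [row[:] for row in image]  # copy so the input is not mutated
    for col, bit in enumerate(bits):
        marked[0][col] = 255 * bit

    return marked, f"{caption} <id{pair_id}>"


image = [[128] * 8 for _ in range(4)]
marked_image, marked_caption = inject_shortcut(
    image, "a dog runs on the beach", pair_id=5
)
print(marked_image[0])
print(marked_caption)
```

A model trained on such pairs could match image and caption by reading the identifier alone, so retrieval quality on data without the identifier indicates how much the model relied on it.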
Abstract
Vision-language models (VLMs) mainly rely on contrastive training to learn general-purpose representations of images and captions. We focus on the situation when one image is associated with several captions, each caption containing both information shared among all captions and unique information per caption about the scene depicted in the image. In such cases, it is unclear whether contrastive losses are sufficient for learning task-optimal representations that contain all the information provided by the captions or whether the contrastive learning setup encourages the learning of a simple shortcut that minimizes contrastive loss. We introduce synthetic shortcuts for vision-language: a training and evaluation framework where we inject synthetic shortcuts into image-text data. We show that contrastive VLMs trained from scratch or fine-tuned with data containing these synthetic…
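
To make concrete why a shortcut can minimize a contrastive loss, here is a minimal sketch using the symmetric InfoNCE objective, which is a common choice for contrastive VLM training. The abstract does not state which loss the authors use, so treating InfoNCE as representative is an assumption, and the function `info_nce`, the temperature value, and the toy embeddings are illustrative only.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(
    images: List[Vector], captions: List[Vector], temperature: float = 0.1
) -> float:
    """Symmetric InfoNCE over a batch where images[k] matches captions[k]."""
    n = len(images)
    sim = [[dot(i, c) / temperature for c in captions] for i in images]

    def cross_entropy(rows: List[List[float]]) -> float:
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[k]  # negative log-probability of the match
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    # Average the image-to-caption and caption-to-image directions.
    return 0.5 * (cross_entropy(sim) + cross_entropy(transposed))


# Embeddings that encode only a per-pair identifier and no scene content.
shortcut_only = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Embeddings that are identical for every item, for comparison.
collapsed = [[1.0, 0.0, 0.0] for _ in range(3)]

loss_shortcut = info_nce(shortcut_only, shortcut_only)
loss_collapsed = info_nce(collapsed, collapsed)
print(loss_shortcut, loss_collapsed)
```

In this toy batch the identifier-only embeddings drive the loss close to zero even though they contain nothing about the scene, while the collapsed embeddings give a loss of log(3). The loss value alone therefore cannot distinguish a representation that captures the captions' content from one that only captures a shortcut.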
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Focus · Contrastive Learning
