Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models
Simon Schrodi, David T. Hoffmann, Max Argus, Volker Fischer, Thomas Brox

TL;DR
This paper investigates the modality gap and object bias in contrastive vision-language models like CLIP, revealing that an information imbalance between images and captions drives both phenomena and impacts model performance.
Contribution
It introduces a measure of object bias, analyzes the effects of the modality gap, and uncovers the role of information imbalance in these phenomena within contrastive VLMs.
Findings
Closing the modality gap can improve performance.
Only a few embedding dimensions drive the modality gap.
Object bias does not worsen attribute recognition performance.
Abstract
Contrastive vision-language models (VLMs), like CLIP, have gained popularity for their versatile applicability to various downstream tasks. Despite their successes in some tasks, like zero-shot object recognition, they perform surprisingly poorly on other tasks, like attribute recognition. Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and to a bias towards objects over other factors, such as attributes. In this analysis paper, we investigate both phenomena thoroughly. We evaluated off-the-shelf VLMs, and while the gap's influence on performance is typically overshadowed by other factors, we find indications that closing the gap indeed leads to improvements. Moreover, we find that, contrary to intuition, only a few embedding dimensions drive the gap and that the embedding spaces are differently organized.…
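To make the notion of a modality gap concrete, the sketch below computes one common proxy for it: the Euclidean distance between the mean image embedding and the mean text embedding after L2 normalization. It also reports how much each embedding dimension contributes to that squared distance. This is an illustrative assumption, not necessarily the measure used by the authors. The function names (`modality_gap`, `gap_share_per_dimension`) are hypothetical, and the embeddings are synthetic stand-ins rather than outputs of a real VLM.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def centroid_difference(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Difference between the mean normalized image and text embeddings."""
    return l2_normalize(image_emb).mean(axis=0) - l2_normalize(text_emb).mean(axis=0)


def modality_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Assumed proxy for the gap: Euclidean distance between the two centroids."""
    return float(np.linalg.norm(centroid_difference(image_emb, text_emb)))


def gap_share_per_dimension(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Fraction of the squared centroid distance contributed by each dimension."""
    squared = centroid_difference(image_emb, text_emb) ** 2
    return squared / squared.sum()


# Synthetic stand-in data (assumption): both modalities are random noise,
# except that dimension 0 is shifted in opposite directions per modality.
rng = np.random.default_rng(0)
n_samples, n_dims = 1000, 64
image_emb = rng.normal(size=(n_samples, n_dims))
text_emb = rng.normal(size=(n_samples, n_dims))
image_emb[:, 0] += 5.0
text_emb[:, 0] -= 5.0

gap = modality_gap(image_emb, text_emb)
share = gap_share_per_dimension(image_emb, text_emb)
top_dim = int(np.argmax(share))

print(f"centroid distance: {gap:.3f}")
print(f"dimension with largest share of the gap: {top_dim}")
```

In this toy setup a single shifted dimension accounts for nearly all of the centroid distance, which mirrors the abstract's observation that only a few dimensions drive the gap. With real models, `image_emb` and `text_emb` would be replaced by embeddings of paired images and captions.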
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Contrastive Language-Image Pre-training
