Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models

Simon Schrodi; David T. Hoffmann; Max Argus; Volker Fischer; Thomas Brox

arXiv:2404.07983·cs.CV·April 17, 2025·1 cites

TL;DR

This paper investigates the modality gap and object bias in contrastive vision-language models like CLIP, revealing that an information imbalance between images and captions drives both phenomena and impacts model performance.

Contribution

It introduces a measure of object bias, analyzes the effects of the modality gap, and uncovers the role of information imbalance in these phenomena within contrastive VLMs.

Findings

01 Closing the modality gap can improve performance.

02 Few embedding dimensions primarily drive the modality gap.

03 Object bias does not worsen attribute recognition performance.

Abstract

Contrastive vision-language models (VLMs), like CLIP, have gained popularity for their versatile applicability to various downstream tasks. Despite their successes in some tasks, like zero-shot object recognition, they perform surprisingly poorly on other tasks, like attribute recognition. Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and to a bias towards objects over other factors, such as attributes. In this analysis paper, we investigate both phenomena thoroughly. We evaluated off-the-shelf VLMs and, while the gap's influence on performance is typically overshadowed by other factors, we find indications that closing the gap indeed leads to improvements. Moreover, we find that, contrary to intuition, only a few embedding dimensions drive the gap and that the embedding spaces are organized differently.…
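To make the abstract's notion of a "separation of image and text in the shared representation space" concrete, here is a minimal sketch of one common way to quantify a modality gap: the Euclidean distance between the centroids of L2-normalized image and text embeddings. This centroid-distance measure, the helper names `modality_gap` and `top_gap_dimensions`, and the synthetic data are assumptions made for illustration; they are not taken from the paper or its repository, and the paper's own measures may differ. The second helper ranks dimensions by their share of the squared centroid difference, which is one simple way to check whether a few dimensions dominate the gap.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def modality_gap(image_emb, text_emb):
    """Distance between modality centroids (an assumed, commonly used measure).

    Returns the gap and the per-dimension centroid difference.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    diff = img.mean(axis=0) - txt.mean(axis=0)
    return float(np.linalg.norm(diff)), diff


def top_gap_dimensions(diff, k=5):
    """Indices of the k dimensions with the largest share of the squared gap."""
    contrib = diff ** 2 / np.sum(diff ** 2)
    order = np.argsort(contrib)[::-1][:k]
    return order, contrib[order]


# Synthetic stand-ins for real embeddings (hypothetical data, not CLIP outputs).
rng = np.random.default_rng(0)
d = 64
image_emb = rng.normal(size=(500, d))
text_emb = rng.normal(size=(500, d))
# Plant a gap in two dimensions only, to mimic a gap driven by few dimensions.
image_emb[:, 3] += 4.0
text_emb[:, 10] -= 4.0

gap, diff = modality_gap(image_emb, text_emb)
dims, share = top_gap_dimensions(diff, k=2)
print(f"gap = {gap:.3f}")
print("top dimensions:", sorted(dims.tolist()), "share of squared gap:", share.sum())
```

With real models, `image_emb` and `text_emb` would be the encoder outputs for paired images and captions. On the synthetic data above, the two planted dimensions account for nearly all of the squared centroid difference.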

Code & Models

Repositories

lmb-freiburg/two-effects-one-trigger (PyTorch, official)

Videos

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Contrastive Language-Image Pre-training