Is the Modality Gap a Bug or a Feature? A Robustness Perspective
Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss

TL;DR
This paper investigates the modality gap in multi-modal models such as CLIP. It shows that, under certain conditions, the gap is tied to robustness rather than to clean accuracy, and that simple post-processing which reduces the gap improves robustness.
Contribution
The paper shows that, under certain conditions, the modality gap is monotonically related to robustness, and it proposes a post-processing method that improves robustness without sacrificing clean accuracy.
Findings
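Under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings.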
Under the same conditions, reducing the modality gap leaves clean accuracy unchanged but makes the model more robust to perturbations.
Simple post-processing can improve robustness without a loss in accuracy.
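The findings above can be illustrated with a small toy sketch. It is not the authors' procedure: this summary does not specify the paper's post-processing, so the sketch assumes the gap is estimated as the difference between the modality centroids and is closed by shifting each modality half-way along it and renormalizing. The data, the constant GAP, and all function names are hypothetical, and the gap direction is constructed to be orthogonal to the shared content components. Plain Python is used so the example runs without dependencies.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def predictions(images, texts):
    # Index of the most similar text for each image (retrieval by dot product).
    return [max(range(len(texts)), key=lambda j: dot(img, texts[j])) for img in images]

def margins(images, texts):
    # Similarity to the matching text minus the best similarity to any other text.
    result = []
    for i, img in enumerate(images):
        sims = [dot(img, t) for t in texts]
        result.append(sims[i] - max(s for j, s in enumerate(sims) if j != i))
    return result

# Hypothetical toy data: shared content lives in the first two coordinates,
# and the third coordinate is a gap direction orthogonal to all content.
content = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [-1.0, -1.0])]
GAP = 0.6                          # assumed size of the gap component
SCALE = math.sqrt(1.0 - GAP ** 2)  # keeps every embedding at unit norm
images = [[SCALE * z[0], SCALE * z[1], GAP] for z in content]
texts = [[SCALE * z[0], SCALE * z[1], -GAP] for z in content]

# Assumption: estimate the gap as the difference between the modality centroids.
gap = [a - b for a, b in zip(centroid(images), centroid(texts))]

# Assumed post-processing: move each modality half-way along the gap, then renormalize.
closed_images = [normalize([x - g / 2 for x, g in zip(v, gap)]) for v in images]
closed_texts = [normalize([x + g / 2 for x, g in zip(v, gap)]) for v in texts]

print("gap vector:", gap)
print("predictions before:", predictions(images, texts))
print("predictions after: ", predictions(closed_images, closed_texts))
print("margins before:", margins(images, texts))
print("margins after: ", margins(closed_images, closed_texts))
```

In this toy setting the retrieved text for each image is the same before and after the shift, so clean accuracy is unchanged, while every margin between the matching text and its closest competitor grows. A larger margin means a perturbation of fixed size is less likely to flip the prediction, which is one informal reading of the robustness claim; whether real models satisfy the orthogonality assumption is exactly what the paper's conditions address.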
Abstract
Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a…
