I0T: Embedding Standardization Method Towards Zero Modality Gap
Na Min An, Eunki Kim, James Thorne, Hyunjung Shim

TL;DR
This paper introduces I0T, a standardization method that significantly reduces the modality gap in CLIP-based models, improving zero-shot image-text tasks without retraining the entire model.
Contribution
The paper proposes two novel methods, I0T_post and I0T_async, to address the modality gap in CLIP embeddings, enhancing alignment while preserving original model representations.
Findings
I0T methods effectively reduce the modality gap.
I0T_post can serve as an automatic evaluation metric.
The approach preserves original embedding representations.
Abstract
Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the modality gap, which arises when image and text embeddings are projected onto disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to the modality-specific characteristics that each image and text encoder independently possesses, and we propose two methods to address the modality gap: (1) a post-hoc embedding standardization method, I0T_post, that reduces the modality gap to approximately zero, and (2) a trainable method, I0T_async, that alleviates the modality gap by adding two normalization layers to each encoder. Our I0T framework can significantly reduce the modality…
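To make the post-hoc idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that "standardization" means subtracting each modality's own mean embedding and then rescaling every vector to unit length, and it measures the modality gap as the distance between the image and text centroids. The abstract does not spell out either choice, so both are assumptions, and the synthetic embeddings and helper names (l2_normalize, standardize, centroid_gap) are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    # Scale every row to unit Euclidean length.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def standardize(emb):
    # Assumed standardization: remove the modality-specific mean vector,
    # then project the centered embeddings back onto the unit sphere.
    centered = emb - emb.mean(axis=0, keepdims=True)
    return l2_normalize(centered)

def centroid_gap(img, txt):
    # Assumed gap measure: Euclidean distance between modality centroids.
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Synthetic stand-ins for encoder outputs: shared semantic content plus a
# large constant offset per modality, which creates an artificial gap.
rng = np.random.default_rng(0)
n, dim = 256, 64
shared = rng.normal(size=(n, dim))
img_offset = 3.0 * rng.normal(size=dim)
txt_offset = 3.0 * rng.normal(size=dim)
img = l2_normalize(shared + 0.1 * rng.normal(size=(n, dim)) + img_offset)
txt = l2_normalize(shared + 0.1 * rng.normal(size=(n, dim)) + txt_offset)

gap_before = centroid_gap(img, txt)
img_std, txt_std = standardize(img), standardize(txt)
gap_after = centroid_gap(img_std, txt_std)

print(f"centroid gap before: {gap_before:.4f}")
print(f"centroid gap after:  {gap_after:.4f}")
```

Because this step only touches the output embeddings, it can be applied to a frozen model, which is consistent with the summary's claim that no retraining of the entire model is needed.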
Taxonomy
Topics: Power Systems and Technologies
Methods: Contrastive Language-Image Pre-training
