I0T: Embedding Standardization Method Towards Zero Modality Gap

Na Min An; Eunki Kim; James Thorne; Hyunjung Shim

arXiv:2412.14384 · cs.LG · December 20, 2024

TL;DR

This paper introduces I0T, a standardization method that significantly reduces the modality gap in CLIP-based models, improving zero-shot image-text tasks without retraining the entire model.

Contribution

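The paper proposes two methods to address the modality gap in CLIP embeddings: I0T_post, a post-hoc embedding standardization, and I0T_async, a trainable variant that adds normalization layers to each encoder. Both aim to improve image-text alignment while preserving the original model representations.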

Findings

01. I0T methods effectively reduce the modality gap.

02. I0T_post can serve as an automatic evaluation metric.

03. The approach preserves original embedding representations.

Abstract

Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the modality gap, which arises when the image and text embeddings are projected to disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to the modality-specific characteristic that each image/text encoder independently possesses and propose two methods to address the modality gap: (1) a post-hoc embedding standardization method, $I0T_{post}$, that reduces the modality gap to approximately zero, and (2) a trainable method, $I0T_{async}$, that alleviates the modality gap problem by adding two normalization layers for each encoder. Our I0T framework can significantly reduce the modality…
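To make the idea of post-hoc standardization concrete, the following is a minimal sketch on toy data, not the authors' implementation. It assumes that the modality gap is measured as the Euclidean distance between the centroids of the image and text embeddings, and that standardization means subtracting each modality's mean vector and re-normalizing to unit length; the abstract does not spell out either detail. The helper names (`make_toy_embeddings`, `standardize`, `modality_gap`) and the synthetic offsets are hypothetical stand-ins for real encoder outputs.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_embs, text_embs):
    """Assumed gap measure: distance between the two modality centroids."""
    ci, ct = centroid(image_embs), centroid(text_embs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, ct)))


def standardize(embs):
    """Assumed standardization: subtract the modality mean, then re-normalize."""
    mu = centroid(embs)
    centered = [[x - m for x, m in zip(v, mu)] for v in embs]
    return [l2_normalize(v) for v in centered]


def make_toy_embeddings(n, offset, rng):
    """Hypothetical encoder output: Gaussian noise plus a modality-specific offset."""
    return [l2_normalize([rng.gauss(0.0, 1.0) + o for o in offset]) for _ in range(n)]


rng = random.Random(0)
dim = 16
# Each toy "encoder" pushes its embeddings along a different axis.
image_offset = [3.0] + [0.0] * (dim - 1)
text_offset = [0.0, 3.0] + [0.0] * (dim - 2)
image_embs = make_toy_embeddings(200, image_offset, rng)
text_embs = make_toy_embeddings(200, text_offset, rng)

gap_before = modality_gap(image_embs, text_embs)
gap_after = modality_gap(standardize(image_embs), standardize(text_embs))
print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.3f}")
```

In this toy setting the per-modality offset plays the role of the encoder-specific characteristic described above. Removing each modality's mean pulls both centroids toward the origin, so the measured gap shrinks without touching any encoder weights.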

Code & Models

Repositories

xfactlab/i0t (PyTorch, official)

Videos

I0T: Embedding Standardization Method Towards Zero Modality Gap

Taxonomy

Topics: Power Systems and Technologies

Methods: Contrastive Language-Image Pre-training