Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching

Xiang Ma; Xuemei Li; Lexin Fang; Caiming Zhang

arXiv:2410.16853 · cs.CV · October 23, 2024

TL;DR

This paper introduces DIAS, a method for image-text matching that bridges the modality gap in two ways: it aligns the information represented in corresponding embedding dimensions across modalities, and it applies sparse spatial constraints to unmatched pairs.

Contribution

The paper proposes a new approach called DIAS that aligns cross-modal embeddings dimension-wise and incorporates sparse spatial constraints to enhance image-text matching.

Findings

01. Achieves 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO.

02. Effectively bridges the modality gap with dimension alignment and spatial constraints.

03. Outperforms existing contrastive learning models in image-text matching.
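The findings report gains in rSum without defining it. In image-text retrieval, rSum is conventionally the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image), giving a maximum of 600. The sketch below illustrates that convention only; it is not taken from the paper. The function names `recall_at_k` and `rsum` are hypothetical, and the code assumes a simplified setup with exactly one matching caption per image, placed at the same index (the benchmarks themselves pair each image with several captions).

```python
import numpy as np


def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Percentage of queries (rows) whose ground-truth item ranks in the top k.

    Assumption: sim is square and the match for row i is column i.
    """
    ground_truth = np.diag(sim)[:, None]
    # Rank of the match = number of candidates scoring strictly higher.
    ranks = (sim > ground_truth).sum(axis=1)
    return 100.0 * float(np.mean(ranks < k))


def rsum(sim: np.ndarray, ks=(1, 5, 10)) -> float:
    """Sum of Recall@k over both retrieval directions."""
    image_to_text = sum(recall_at_k(sim, k) for k in ks)
    text_to_image = sum(recall_at_k(sim.T, k) for k in ks)
    return image_to_text + text_to_image


# Toy similarity matrix: every match is outscored by exactly one distractor,
# so Recall@1 is 0 and Recall@5 and Recall@10 are 100 in both directions.
toy_sim = np.eye(4) + 2.0 * np.roll(np.eye(4), 1, axis=1)
toy_rsum = rsum(toy_sim)
```

Because rSum aggregates six recall values, an improvement in it can come from any combination of cutoffs and directions, so it is usually read alongside the individual Recall@k numbers.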

Abstract

Many contrastive-learning-based models have achieved advanced performance in image-text matching tasks. The key to these models lies in analyzing the correlation between image-text pairs, which involves cross-modal interaction of embeddings in corresponding dimensions. However, the embeddings of different modalities come from different models or modules, and there is a significant modality gap. Directly interacting such embeddings lacks a sound rationale and may capture inaccurate correlations. Therefore, we propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in corresponding dimensions to ensure that the correlation calculation is based on interactions of similar information. (2) Spatial constraints on inter- and intra-modality unmatched pairs are introduced to ensure the…
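To make the phrase "interaction of embeddings in corresponding dimensions" concrete, the following minimal sketch shows how a standard cosine-similarity score decomposes into per-dimension products, so that dimension d of an image embedding only ever meets dimension d of a text embedding. This illustrates the premise the abstract starts from, not the DIAS method itself. The helper names, the random embeddings, and the sizes (4 pairs, 8 dimensions) are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def dimension_wise_interaction(img: np.ndarray, txt: np.ndarray) -> np.ndarray:
    """Per-dimension products img[i, d] * txt[j, d], shape (n_img, n_txt, dim)."""
    return img[:, None, :] * txt[None, :, :]


# Hypothetical stand-ins for the outputs of separate image and text encoders.
rng = np.random.default_rng(0)
img_emb = l2_normalize(rng.normal(size=(4, 8)))
txt_emb = l2_normalize(rng.normal(size=(4, 8)))

contrib = dimension_wise_interaction(img_emb, txt_emb)
# Summing the per-dimension contributions recovers the cosine similarity matrix.
sim = contrib.sum(axis=-1)
```

If dimension d encodes different information in the two encoders, the corresponding term in this sum mixes unrelated signals, which is the concern the abstract raises about interacting embeddings directly.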

Taxonomy

Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques

Methods: ALIGN · Contrastive Learning