Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching
Xiang Ma, Xuemei Li, Lexin Fang, Caiming Zhang

TL;DR
This paper introduces DIAS, an image-text matching method that bridges the modality gap by aligning the information carried in corresponding embedding dimensions across modalities and by applying sparse spatial constraints. It reports 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO.
Contribution
The paper proposes a new approach called DIAS that aligns cross-modal embeddings dimension-wise and incorporates sparse spatial constraints to enhance image-text matching.
Findings
Achieves 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO.
Effectively bridges the modality gap with dimension alignment and spatial constraints.
Outperforms existing contrastive learning models in image-text matching.
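For readers unfamiliar with the metric, rSum is conventionally the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image), so its maximum is 600. The sketch below shows that computation on a toy similarity matrix. The function names and toy data are illustrative assumptions, and the paper's exact evaluation protocol is not given on this page.

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries with at least one correct item in the top k.

    sim[i][j] is the similarity of query i to candidate j;
    gt[i] is the set of correct candidate indices for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidates from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(j in gt[i] for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


def rsum(sim_i2t, gt_i2t, sim_t2i, gt_t2i, ks=(1, 5, 10)):
    """Sum of Recall@K over both retrieval directions."""
    total = 0.0
    for k in ks:
        total += recall_at_k(sim_i2t, gt_i2t, k)
        total += recall_at_k(sim_t2i, gt_t2i, k)
    return total


# Toy data (hypothetical): 2 images, 2 captions, caption i matches image i.
sim_i2t = [[0.9, 0.1],
           [0.8, 0.2]]
# Text-to-image scores are the transpose of the image-to-text scores.
sim_t2i = [list(col) for col in zip(*sim_i2t)]
gt = [{0}, {1}]

score = rsum(sim_i2t, gt, sim_t2i, gt)
print(score)
```

In this toy case image 1 ranks the wrong caption first, so image-to-text Recall@1 is 50 while every other term is 100.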
Abstract
Many contrastive-learning-based models have achieved advanced performance in image-text matching tasks. The key to these models lies in analyzing the correlation between image-text pairs, which involves cross-modal interaction of embeddings in corresponding dimensions. However, the embeddings of different modalities come from different models or modules, and there is a significant modality gap. Letting such embeddings interact directly is hard to justify and may capture inaccurate correlations. Therefore, we propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in corresponding dimensions to ensure the correlation calculation is based on interactions of similar information. (2) Spatial constraints on inter- and intra-modality unmatched pairs are introduced to ensure the…
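The abstract's point about "interaction of embeddings in corresponding dimensions" can be made concrete: cosine similarity is a sum of per-dimension products, so dimension d of the image embedding only ever meets dimension d of the text embedding. The sketch below is an illustration of that motivation only, not the DIAS method. The vectors are made up, and modeling misalignment as a permutation of dimensions is a simplifying assumption.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def per_dimension_terms(img, txt):
    """Products of the normalized embeddings, one term per dimension."""
    a, b = l2_normalize(img), l2_normalize(txt)
    return [x * y for x, y in zip(a, b)]


def cosine(img, txt):
    """Cosine similarity as the sum of the per-dimension terms."""
    return sum(per_dimension_terms(img, txt))


# Hypothetical embeddings: the two text vectors hold the same values,
# but the second stores them in different dimensions.
image_emb = [3.0, 0.0, 4.0]
text_aligned = [3.0, 0.0, 4.0]
text_shuffled = [0.0, 4.0, 3.0]

print(per_dimension_terms(image_emb, text_aligned))
print(cosine(image_emb, text_aligned))
print(cosine(image_emb, text_shuffled))
```

The shuffled text vector scores lower even though it contains the same values, because the similarity pairs each dimension only with its counterpart.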
Taxonomy
Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
Methods: ALIGN · Contrastive Learning
