Learning Relation Alignment for Calibrated Cross-modal Retrieval
Shuhuai Ren, Junyang Lin, Guangxiang Zhao, Rui Men, An Yang, Jingren, Zhou, Xu Sun, Hongxia Yang

TL;DR
This paper introduces a relation alignment method for cross-modal retrieval that improves semantic consistency between image and text by calibrating intra-modal relations, leading to better retrieval performance.
Contribution
It proposes a novel metric, ISD, and a regularized training method, IAIS, to align linguistic and visual relations, enhancing cross-modal retrieval models.
Findings
Significant performance improvements on Flickr30k and MS COCO datasets.
The ISD metric effectively measures relation consistency.
The IAIS method enhances model interpretability and accuracy.
Abstract
Despite the achievements of large-scale multimodal pre-training approaches, cross-modal retrieval, e.g., image-text retrieval, remains a challenging task. To bridge the semantic gap between the two modalities, previous studies mainly focus on word-region alignment at the object level, lacking the matching between the linguistic relation among the words and the visual relation among the regions. The neglect of such relation consistency impairs the contextualized representation of image-text pairs and hinders the model performance and the interpretability. In this paper, we first propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations. In response, we present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
