Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching

Xuri Ge; Fuhai Chen; Songpei Xu; Fuxiang Tao; Jie Wang; Joemon M. Jose

arXiv:2406.18579·cs.CV·June 28, 2024

TL;DR

This paper introduces Hire, a hybrid-modal interaction framework with relational enhancements for image-text matching, leveraging explicit and implicit relationship modeling to improve contextual understanding and achieve state-of-the-art results.

Contribution

The paper proposes a novel hybrid-modal interaction method with multiple relational enhancements, combining explicit spatial-semantic graph reasoning and implicit relationship modeling for improved image-text matching.

Findings

01. Achieves new state-of-the-art results on the MS-COCO and Flickr30K datasets.

02. Demonstrates the effectiveness of explicit and implicit relationship modeling in image-text matching.

03. Improves contextual object representation through graph-based reasoning and cross-modal attention.

Abstract

Image-text matching (ITM) is a fundamental problem in computer vision. The key issue lies in jointly learning the visual and textual representation to estimate their similarity accurately. Most existing methods focus on feature enhancement within modality or feature interaction across modalities, which, however, neglects the contextual information of the object representation based on the inter-object relationships that match the corresponding sentences with rich contextual semantics. In this paper, we propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching, which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modelling. In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation…
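To make the idea of estimating image-text similarity through cross-modal interaction concrete, here is a minimal sketch of a generic word-to-region attention score. It is not the Hire architecture: the function names, the temperature value, and the toy 3-dimensional features are all assumptions chosen for illustration, and the graph-based reasoning the abstract mentions is not modelled at all.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(word, regions, temperature=0.1):
    """Weighted sum of region vectors, weighted by word-region cosine similarity."""
    weights = softmax([cosine(word, r) for r in regions], temperature)
    dim = len(regions[0])
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]

def image_text_similarity(regions, words, temperature=0.1):
    """Average cosine between each word and its attended visual context."""
    scores = [cosine(w, attend(w, regions, temperature)) for w in words]
    return sum(scores) / len(scores)

# Hypothetical toy features: two image regions and two candidate captions.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
matching_words = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
mismatched_words = [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]

match_score = image_text_similarity(regions, matching_words)
mismatch_score = image_text_similarity(regions, mismatched_words)
```

In this toy setup the caption whose word vectors align with the region vectors receives the higher score. A real system would learn the region and word features and typically train them with a ranking loss.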

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Video Analysis and Summarization