Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching
Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Jie Wang, Joemon M. Jose

TL;DR
This paper introduces Hire, a hybrid-modal interaction framework with relational enhancements for image-text matching, leveraging explicit and implicit relationship modeling to improve contextual understanding and achieve state-of-the-art results.
Contribution
The paper proposes a novel hybrid-modal interaction method with multiple relational enhancements, combining explicit spatial-semantic graph reasoning and implicit relationship modeling for improved image-text matching.
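To make the idea of graph-based reasoning over objects concrete, here is a minimal sketch of one round of attention-weighted message passing between object nodes. It is an illustration of the general technique only, not the Hire architecture: the function name `graph_reasoning_step`, the `edge_bias` matrix (standing in for whatever spatial or semantic relation scores a model might supply), and the residual update are all assumptions made for this example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of finite floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def graph_reasoning_step(nodes, edge_bias):
    """One round of message passing over object nodes (hypothetical helper).

    nodes:     list of n feature vectors, one per detected object.
    edge_bias: n x n matrix of finite relation scores (assumed input,
               e.g. derived from spatial layout); larger means a stronger edge.
    Returns a new list of n vectors, each the original node plus a
    weighted sum of all node features (a residual update).
    """
    n = len(nodes)
    dim = len(nodes[0])
    updated = []
    for i in range(n):
        # Edge weight = scaled feature affinity plus the relation bias.
        logits = [
            dot(nodes[i], nodes[j]) / math.sqrt(dim) + edge_bias[i][j]
            for j in range(n)
        ]
        weights = softmax(logits)
        # Aggregate neighbour features into a context vector.
        context = [
            sum(weights[j] * nodes[j][d] for j in range(n)) for d in range(dim)
        ]
        updated.append([nodes[i][d] + context[d] for d in range(dim)])
    return updated


# Two toy object features with no relation bias.
objects = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [[0.0, 0.0], [0.0, 0.0]]
enhanced = graph_reasoning_step(objects, no_bias)
```

Each output vector mixes in features from related objects, which is the sense in which graph reasoning yields a contextual object representation.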
Findings
Achieves new state-of-the-art results on MS-COCO and Flickr30K datasets.
Demonstrates the effectiveness of explicit and implicit relationship modeling in image-text matching.
Improves contextual object representation through graph-based reasoning and cross-modal attention.
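As a rough illustration of cross-modal attention for image-text matching, the sketch below scores an image against a sentence by letting each word attend over region features and then averaging the word-level cosine similarities. This is a generic attention-based matching scheme written for this summary, not the scoring function used in Hire; `cross_modal_similarity`, the `temperature` parameter, and the toy vectors are assumptions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of finite floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_similarity(regions, words, temperature=0.1):
    """Image-sentence score via word-to-region attention (hypothetical helper).

    regions: list of region feature vectors from the image.
    words:   list of word feature vectors from the sentence.
    temperature: assumed softmax sharpness; smaller values focus attention.
    """
    scores = []
    for w in words:
        # Attention weights of this word over all image regions.
        weights = softmax([cosine(w, r) / temperature for r in regions])
        # Attended visual vector for this word.
        attended = [
            sum(wt * r[d] for wt, r in zip(weights, regions))
            for d in range(len(w))
        ]
        scores.append(cosine(w, attended))
    # Average word-level relevance gives the sentence-level score.
    return sum(scores) / len(scores)


regions = [[1.0, 0.0], [0.0, 1.0]]
match_score = cross_modal_similarity(regions, [[1.0, 0.0]])
mismatch_score = cross_modal_similarity(regions, [[-1.0, 0.0]])
```

A word that aligns with some region scores close to 1, while a word with no aligned region scores near or below 0, so matching pairs rank above mismatched ones.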
Abstract
Image-text matching (ITM) is a fundamental problem in computer vision. The key issue lies in jointly learning the visual and textual representation to estimate their similarity accurately. Most existing methods focus on feature enhancement within modality or feature interaction across modalities, which, however, neglects the contextual information of the object representation based on the inter-object relationships that match the corresponding sentences with rich contextual semantics. In this paper, we propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching, which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modelling. In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation…
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Video Analysis and Summarization
