Visual Semantic Reasoning for Image-Text Matching
Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu

TL;DR
This paper introduces a reasoning model that enhances visual representations with semantic concepts for improved image-text matching, achieving state-of-the-art results on MS-COCO and Flickr30K datasets.
Contribution
It presents a novel graph convolutional network-based reasoning approach combined with gating and memory mechanisms for semantic scene understanding.
Findings
Achieves new state-of-the-art performance on MS-COCO and Flickr30K datasets.
Outperforms previous methods by significant margins in image and caption retrieval.
Demonstrates the effectiveness of semantic reasoning in visual-text matching tasks.
Abstract
Image-text matching has been a hot research topic bridging the vision and language areas. It remains challenging because the current representation of image usually lacks global semantic concepts as in its corresponding text caption. To address this issue, we propose a simple and interpretable reasoning model to generate visual representation that captures key objects and semantic concepts of a scene. Specifically, we first build up connections between image regions and perform reasoning with Graph Convolutional Networks to generate features with semantic relationships. Then, we propose to use the gate and memory mechanism to perform global semantic reasoning on these relationship-enhanced features, select the discriminative information and gradually generate the representation for the whole scene. Experiments validate that our method achieves a new state-of-the-art for the image-text…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
MethodsGraph Convolutional Networks
