Visual Semantic Reasoning for Image-Text Matching

Kunpeng Li; Yulun Zhang; Kai Li; Yuanyuan Li; Yun Fu

arXiv:1909.02701 · cs.CV · September 9, 2019 · 21 cites

TL;DR

This paper introduces a reasoning model that enhances visual representations with semantic concepts for improved image-text matching, achieving state-of-the-art results on MS-COCO and Flickr30K datasets.

Contribution

It presents a novel graph convolutional network-based reasoning approach combined with gating and memory mechanisms for semantic scene understanding.

Findings

01. Achieves new state-of-the-art performance on MS-COCO and Flickr30K datasets.

02. Outperforms previous methods by significant margins in image and caption retrieval.

03. Demonstrates the effectiveness of semantic reasoning in visual-text matching tasks.

Abstract

Image-text matching has been a hot research topic bridging the vision and language areas. It remains challenging because current image representations usually lack the global semantic concepts found in the corresponding text caption. To address this issue, we propose a simple and interpretable reasoning model that generates a visual representation capturing the key objects and semantic concepts of a scene. Specifically, we first build connections between image regions and perform reasoning with Graph Convolutional Networks to generate features with semantic relationships. Then, we propose to use a gate and memory mechanism to perform global semantic reasoning on these relationship-enhanced features, select the discriminative information, and gradually generate the representation for the whole scene. Experiments validate that our method achieves a new state-of-the-art for the image-text…
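The two stages described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the pairwise affinity form, the softmax normalisation, the residual connection, the choice of a GRU-style cell for the "gate and memory mechanism", and all function and parameter names (region_graph_reasoning, gru_global_reasoning, W_phi, W_theta, W_g, W_r) are assumptions made for illustration. Weights are random here, whereas a real model would learn them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def region_graph_reasoning(V, W_phi, W_theta, W_g, W_r):
    """One graph-convolution step over a fully connected region graph.

    V: (n, d) region features. Returns (n, d) relationship-enhanced features.
    Assumption: edges are pairwise affinities between embedded regions.
    """
    # Affinity between every pair of regions in two embedding spaces.
    R = (V @ W_phi) @ (V @ W_theta).T                  # (n, n)
    # Assumption: row-wise softmax so each region's edge weights sum to 1.
    R = np.exp(R - R.max(axis=1, keepdims=True))
    R = R / R.sum(axis=1, keepdims=True)
    # Aggregate neighbour features, transform, and add a residual.
    return (R @ V @ W_g) @ W_r + V

def gru_global_reasoning(V_star, params):
    """Read region features one by one into a gated memory state.

    Assumption: a GRU-style cell stands in for the gate and memory
    mechanism; the final state serves as the whole-image representation.
    """
    Wz, Uz, Wr, Ur, Wh, Uh = params
    m = np.zeros(Uz.shape[0])
    for v in V_star:
        z = sigmoid(v @ Wz + m @ Uz)                   # update gate
        r = sigmoid(v @ Wr + m @ Ur)                   # reset gate
        m_new = np.tanh(v @ Wh + (r * m) @ Uh)         # candidate memory
        m = (1.0 - z) * m + z * m_new                  # gated memory update
    return m

def cosine_similarity(a, b):
    """Matching score between an image vector and a caption vector."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k, h = 36, 16, 8, 16                         # hypothetical sizes
    V = rng.normal(size=(n, d))
    gcn = [0.1 * rng.normal(size=s) for s in [(d, k), (d, k), (d, d), (d, d)]]
    gru = [0.1 * rng.normal(size=s) for s in [(d, h), (h, h)] * 3]
    image_vec = gru_global_reasoning(region_graph_reasoning(V, *gcn), gru)
    caption_vec = rng.normal(size=h)                   # stand-in text embedding
    print(cosine_similarity(image_vec, caption_vec))
```

In this sketch the caption vector is a random placeholder. In an image-text matching system it would come from a text encoder trained jointly with the image branch, so that matching pairs score higher than mismatched ones.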

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Graph Convolutional Networks