Exploring Visual Relationship for Image Captioning

Ting Yao; Yingwei Pan; Yehao Li; Tao Mei

arXiv:1809.07041 · cs.CV · September 20, 2018 · 49 citations

TL;DR

This paper proposes a novel GCN-LSTM architecture that models object relationships via graphs to improve image captioning, achieving state-of-the-art results on the COCO dataset.

Contribution

It introduces a new graph-based object relationship modeling approach integrated into an attention-based encoder-decoder framework for image captioning.

Findings

01. GCN-LSTM outperforms previous methods on the COCO dataset.

02. The CIDEr-D score improves from 120.1% to 128.7% on the COCO test set.

03. Graph-based modeling of object relationships improves caption quality.

Abstract

It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) architecture that novelly integrates both semantic and spatial object relationships into image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representations of each region proposed on objects are then refined by leveraging graph structure through GCN. With the learnt region-level features, our GCN-LSTM…
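The graph step described in the abstract (build a graph over detected regions, then refine region features with a GCN) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the IoU-based edge rule, the `threshold` value, the undirected unlabeled edges, and the mean-aggregation layer are all simplifying assumptions, and the names `iou`, `spatial_adjacency`, and `gcn_layer` are hypothetical. The semantic graph and the LSTM decoder are omitted.

```python
import numpy as np


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_adjacency(boxes, threshold=0.1):
    """Assumed edge rule: connect two regions when their IoU exceeds threshold.

    Self-loops are always included so each region keeps its own feature.
    """
    n = len(boxes)
    adj = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            if iou(boxes[i], boxes[j]) > threshold:
                adj[i, j] = adj[j, i] = 1.0
    return adj


def gcn_layer(features, adj, weight):
    """One simplified graph convolution: mean over neighbors, linear map, ReLU."""
    degree = adj.sum(axis=1, keepdims=True)  # >= 1 because of self-loops
    aggregated = (adj @ features) / degree
    return np.maximum(aggregated @ weight, 0.0)


if __name__ == "__main__":
    # Three hypothetical detections: two overlapping boxes and one far away.
    boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 110, 110)]
    rng = np.random.default_rng(0)
    features = rng.normal(size=(3, 8))   # one 8-d feature per region
    weight = rng.normal(size=(8, 4))     # would be learned in a real model
    adj = spatial_adjacency(boxes)
    refined = gcn_layer(features, adj, weight)
    print(adj)
    print(refined.shape)
```

In this toy setup the two overlapping regions exchange information while the isolated region only sees itself. A model following the abstract would feed such refined region features to an attention-based LSTM decoder.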

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Graph Convolutional Network