Exploring Visual Relationship for Image Captioning
Ting Yao, Yingwei Pan, Yehao Li, Tao Mei

TL;DR
This paper proposes a novel GCN-LSTM architecture that models object relationships via graphs to improve image captioning, achieving state-of-the-art results on the COCO dataset.
Contribution
It introduces a new graph-based object relationship modeling approach integrated into an attention-based encoder-decoder framework for image captioning.
Findings
GCN-LSTM outperforms previous methods on the COCO dataset
CIDEr-D score improves from 120.1% to 128.7% on the COCO testing set
Graph-based relationships enhance caption quality
Abstract
It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) architecture that novelly integrates both semantic and spatial object relationships into image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representations of each region proposed on objects are then refined by leveraging graph structure through GCN. With the learnt region-level features, our GCN-LSTM…
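As a rough illustration of the refinement step the abstract describes (region features updated through a graph built over detected objects), the sketch below applies one simplified graph-convolution layer in NumPy. It is a generic mean-aggregation GCN layer, not the paper's exact formulation: the function name `gcn_layer`, the toy adjacency matrix, the feature size, and the random weights are all assumptions made for illustration, and the sketch ignores edge direction and relationship labels.

```python
import numpy as np


def gcn_layer(features, adjacency, weight):
    """One simplified graph-convolution step over region features.

    features:  (num_regions, dim) array, one row per detected region
    adjacency: (num_regions, num_regions) array, 1.0 where two regions are linked
    weight:    (dim, out_dim) projection matrix (hypothetical, randomly set below)
    """
    n = adjacency.shape[0]
    # Add self-loops so every region keeps its own information.
    a_hat = adjacency + np.eye(n)
    # Row-normalize: each region averages over itself and its neighbours.
    a_norm = a_hat / a_hat.sum(axis=1, keepdims=True)
    # Aggregate neighbour features, project them, then apply ReLU.
    return np.maximum(a_norm @ features @ weight, 0.0)


# Toy example: 4 regions, 8-dimensional features (sizes are assumptions).
rng = np.random.default_rng(0)
num_regions, dim = 4, 8
features = rng.normal(size=(num_regions, dim))

# Hypothetical relationship graph: regions 0-1 and 1-2 are linked,
# region 3 has no relationships.
adjacency = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
    ],
    dtype=float,
)

weight = rng.normal(size=(dim, dim)) * 0.1
refined = gcn_layer(features, adjacency, weight)
```

Each row of `refined` is a relation-aware version of the corresponding region feature. In an attention-based encoder-decoder, features like these would be what the LSTM decoder attends over when generating the caption.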
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Graph Convolutional Network
