Better Understanding Hierarchical Visual Relationship for Image Caption

Zheng-cong Fei

arXiv:1912.01881·cs.CV·December 5, 2019

TL;DR

This paper introduces a hierarchical visual relationship model that combines a CNN with a GCN to improve image captioning by capturing multi-level semantic and spatial relationships, outperforming previous models on COCO.

Contribution

It proposes a novel CNN+GCN architecture that encodes hierarchical visual relationships for enhanced image captioning within an encoder-decoder framework.

Findings

01. Outperforms state-of-the-art models on the COCO dataset

02. Effectively captures hierarchical semantic and spatial relationships

03. Improves captioning accuracy across multiple metrics

Abstract

The Convolutional Neural Network (CNN) has been the dominant image feature extractor in computer vision for years. However, it fails to capture the relationships between images/objects and their hierarchical interactions, which can be helpful for representing and describing an image. In this paper, we propose a new design for image captioning under a general encoder-decoder framework. It takes into account the hierarchical interactions between different abstraction levels of visual information in the images and their bounding boxes. Specifically, we present a novel CNN plus Graph Convolutional Network (GCN) architecture that integrates both semantic and spatial visual relationships into the image encoder. The representations of regions in an image and the connections between images are refined by leveraging graph structure through the GCN. With the learned multi-level features, our model capitalizes on…
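
To make the idea of refining region representations over a graph more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation, which this page does not give. The adjacency rule (connecting two regions when their bounding boxes overlap above an IoU threshold), the mean-aggregation update, and the names `iou`, `spatial_adjacency`, and `gcn_layer` are all assumptions chosen for illustration.

```python
# Illustrative sketch only. The IoU-based adjacency, the threshold value, and
# every function name here are assumptions, not the paper's actual method.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_adjacency(boxes, threshold=0.1):
    """Connect regions whose boxes overlap; every node gets a self-loop."""
    n = len(boxes)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or iou(boxes[i], boxes[j]) > threshold:
                adj[i][j] = 1.0
    return adj


def gcn_layer(features, adj, weight):
    """One graph-convolution step: ReLU(D^-1 A X W), with row-normalized A."""
    n, d_in, d_out = len(features), len(features[0]), len(weight[0])
    out = []
    for i in range(n):
        degree = sum(adj[i])  # at least 1 because of the self-loop
        # Average the features of node i and its neighbours.
        agg = [sum(adj[i][j] * features[j][k] for j in range(n)) / degree
               for k in range(d_in)]
        # Project with the weight matrix, then apply ReLU.
        proj = [sum(agg[k] * weight[k][m] for k in range(d_in))
                for m in range(d_out)]
        out.append([max(0.0, v) for v in proj])
    return out


# Toy usage: two overlapping regions and one isolated region.
boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (50, 50, 60, 60)]
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # stand-ins for CNN features
identity = [[1.0, 0.0], [0.0, 1.0]]              # placeholder for learned W
refined = gcn_layer(features, spatial_adjacency(boxes), identity)
```

In this toy setup the two overlapping regions exchange information and end up with the same averaged feature, while the isolated region keeps its own. A real captioning encoder would learn the weight matrix and would likely use richer, possibly directed or labeled, edges for semantic relationships.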

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Graph Convolutional Network