Better Understanding Hierarchical Visual Relationship for Image Caption
Zheng-cong Fei

TL;DR
This paper introduces a hierarchical visual relationship model combining CNN and GCN to improve image captioning by capturing multi-level semantic and spatial relationships, outperforming previous models on COCO.
Contribution
It proposes a novel CNN+GCN architecture that encodes hierarchical visual relationships for enhanced image captioning within an encoder-decoder framework.
Findings
Outperforms state-of-the-art models on the COCO dataset
Effectively captures hierarchical semantic and spatial relationships
Improves captioning accuracy across multiple metrics
Abstract
The Convolutional Neural Network (CNN) has been the dominant image feature extractor in computer vision for years. However, it fails to capture the relationships between images/objects and their hierarchical interactions, which can be helpful for representing and describing an image. In this paper, we propose a new design for image captioning under a general encoder-decoder framework. It takes into account the hierarchical interactions between different abstraction levels of visual information in the images and their bounding boxes. Specifically, we present a CNN plus Graph Convolutional Network (GCN) architecture that integrates, in a novel way, both semantic and spatial visual relationships into the image encoder. The representations of regions in an image and the connections between images are refined by leveraging graph structure through the GCN. With the learned multi-level features, our model capitalizes on…
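The abstract's idea of refining region representations through a graph can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the IoU-based adjacency, the 0.1 threshold, the symmetric normalization, and the names `box_iou`, `gcn_layer`, `region_feats`, and `W` are all assumptions chosen for illustration, and random vectors stand in for CNN region features.

```python
import numpy as np


def box_iou(boxes):
    """Pairwise IoU for boxes given as rows of (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / union


def gcn_layer(features, adjacency, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)


rng = np.random.default_rng(0)

# Three hypothetical region boxes: the first two overlap, the third is isolated.
boxes = np.array(
    [[0, 0, 10, 10], [5, 5, 15, 15], [20, 20, 30, 30]], dtype=float
)
region_feats = rng.normal(size=(3, 8))  # stand-in for CNN region features
W = rng.normal(size=(8, 4))             # stand-in for learned GCN weights

# Assumed spatial graph: connect regions whose boxes overlap enough.
adj = (box_iou(boxes) > 0.1).astype(float)
np.fill_diagonal(adj, 0.0)

refined = gcn_layer(region_feats, adj, W)
print(refined.shape)
```

In this sketch the two overlapping regions exchange information, while the isolated region is only transformed by `W`. A semantic graph, for example one built from predicted object relationships, could be substituted for `adj` without changing `gcn_layer`.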
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Graph Convolutional Network
