Exploring Explicit and Implicit Visual Relationships for Image Captioning
Zeliang Song, Xiaofei Zhou

TL;DR
This paper improves image captioning by modeling two kinds of visual relationships: explicit ones, captured by a semantic graph processed with Gated GCNs, and implicit global interactions, captured by Region BERT. The combination improves captioning performance on the COCO dataset.
Contribution
It introduces a novel approach combining explicit semantic graphs and implicit transformer-based interactions to better understand visual relationships in image captioning.
Findings
Significant performance improvements on the COCO benchmark.
Effective use of Gated GCN for local relationship aggregation.
Utilization of Region BERT for global contextual understanding.
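To make "local relationship aggregation" more concrete, here is a minimal sketch of gated neighbor aggregation over a small object graph. It is not the authors' implementation: the function name `gated_aggregate`, the single gate vector `gate_weights`, and the choice to score each edge from the concatenated destination and source features are all assumptions made for illustration. It uses only the Python standard library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_aggregate(features, edges, gate_weights, gate_bias=0.0):
    """One round of gated neighbor aggregation (illustrative sketch).

    features:     list of region feature vectors, one per detected object
    edges:        list of (src, dst) pairs; dst receives a message from src
    gate_weights: hypothetical vector of length 2 * dim that scores the
                  concatenation [features[dst] ; features[src]]
    """
    dim = len(features[0])
    # Start from each node's own feature (acts as a self-connection).
    updated = [list(f) for f in features]
    for src, dst in edges:
        # Scalar gate in (0, 1): how much of this neighbor to let through.
        pair = features[dst] + features[src]  # list concatenation
        gate = sigmoid(dot(gate_weights, pair) + gate_bias)
        for k in range(dim):
            updated[dst][k] += gate * features[src][k]
    return updated


# Three regions; regions 0 and 1 both send messages to region 2.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relations = [(0, 2), (1, 2)]
result = gated_aggregate(regions, relations, [0.0, 0.0, 0.0, 0.0])
```

With all-zero gate weights and zero bias, every gate equals 0.5, so each neighbor contributes half of its feature. In a trained model the gate parameters would be learned, letting the network suppress uninformative neighbors.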
Abstract
Image captioning, which aims to automatically generate textual sentences for an image, is one of the most challenging tasks in AI. Recent methods for image captioning follow the encoder-decoder framework, which transforms the sequence of salient regions in an image into natural language descriptions. However, these models usually lack a comprehensive understanding of the contextual interactions reflected in the various visual relationships between objects. In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information. Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers…
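The implicit branch lets every detected region attend to every other region, in the style of a transformer encoder. The sketch below shows only the core of that idea: single-head scaled dot-product self-attention over region features. It omits the learned query, key, and value projections, multiple heads, and stacked layers, and the name `region_self_attention` is hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_self_attention(regions):
    """Scaled dot-product self-attention over region features (sketch).

    regions: list of feature vectors, one per detected object.
    Returns one context vector per region, each a weighted average of
    all region features.
    """
    dim = len(regions[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in regions:
        # Similarity of this region to every region, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in regions
        ]
        weights = softmax(scores)
        # Weighted average of all region features.
        context = [
            sum(w * value[d] for w, value in zip(weights, regions))
            for d in range(dim)
        ]
        outputs.append(context)
    return outputs


contexts = region_self_attention([[0.0, 0.0], [2.0, 2.0]])
```

Unlike graph-based aggregation, which only passes information along the edges of the semantic graph, this kind of attention connects all region pairs, which is what makes the interactions global.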
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Graph Convolutional Networks
