Exploring Explicit and Implicit Visual Relationships for Image Captioning

Zeliang Song; Xiaofei Zhou

arXiv:2105.02391·cs.CV·May 7, 2021·1 cites

TL;DR

This paper enhances image captioning by integrating explicit semantic graphs and implicit global interactions using Gated GCNs and Region BERT, leading to improved captioning performance on the COCO dataset.

Contribution

It introduces a novel approach combining explicit semantic graphs and implicit transformer-based interactions to better understand visual relationships in image captioning.

Findings

01 Significant performance improvements on the COCO benchmark.

02 Effective use of Gated GCN for local relationship aggregation.

03 Use of Region BERT for global contextual understanding.
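The gated local aggregation named in the findings can be pictured with a small sketch. This page does not give the paper's exact gate formulation, so the code below is only a generic illustration under stated assumptions: a scalar sigmoid gate per edge, no learned weight matrices, and no nonlinearity after aggregation. The names `gated_aggregate`, `gate_weights`, and `gate_bias` are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x):
    """Logistic function mapping a score to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gated_aggregate(features, edges, gate_weights, gate_bias=0.0):
    """One gated aggregation step over a directed graph of regions.

    features:     list of region feature vectors (one per detected object)
    edges:        list of (src, dst) pairs; dst receives a message from src
    gate_weights: hypothetical vector used to score each neighbour
    gate_bias:    hypothetical scalar bias added to the gate score
    """
    dim = len(features[0])
    # Start from each node's own feature (an assumed self-connection).
    out = [list(f) for f in features]
    for src, dst in edges:
        # The gate decides how much of this neighbour's feature is let through.
        gate = sigmoid(dot(gate_weights, features[src]) + gate_bias)
        for k in range(dim):
            out[dst][k] += gate * features[src][k]
    return out


# Toy graph: regions 1 and 2 both send a message to region 0.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(1, 0), (2, 0)]
updated = gated_aggregate(regions, edges, gate_weights=[0.0, 0.0])
```

With all-zero gate weights every gate equals 0.5, so region 0 receives half of each neighbour's feature, while regions without incoming edges stay unchanged. A trained model would presumably learn gates that favour informative neighbours; that part is not shown here.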

Abstract

Image captioning is one of the most challenging tasks in AI, which aims to automatically generate textual sentences for an image. Recent methods for image captioning follow an encoder-decoder framework that transforms the sequence of salient regions in an image into natural language descriptions. However, these models usually lack a comprehensive understanding of the contextual interactions reflected in the various visual relationships between objects. In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information. Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers…
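The implicit branch described at the end of the abstract relies on Transformer-style interactions among all detected regions. As a rough illustration of that core mechanism only, the sketch below applies single-head scaled dot-product self-attention to a few region vectors. It is an assumption-laden simplification: it omits the learned query, key, and value projections, multiple heads, stacked layers, and any region-specific details of the paper's model, none of which are specified on this page. The function name `self_attention` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Single-head scaled dot-product self-attention without learned projections.

    Each region attends to every region (itself included), so the output for
    a region is a weighted average of all region features in the image.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    out = []
    for query in features:
        # Similarity of this region to every region, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in features]
        weights = softmax(scores)
        # Weighted average of all region features.
        out.append([sum(w * v[j] for w, v in zip(weights, features)) for j in range(dim)])
    return out


regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(regions)
```

Unlike the graph-based aggregation, no edges are specified here: every pair of regions interacts, and the attention weights determine how strongly.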

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Graph Convolutional Networks