Dual Attention on Pyramid Feature Maps for Image Captioning

Litao Yu; Jian Zhang; Qiang Wu

arXiv:2011.01385 · cs.CV · June 11, 2021

TL;DR

This paper introduces a dual attention mechanism on pyramid image features to enhance image captioning by better localizing regions and re-calibrating feature importance, leading to improved descriptive sentence generation.

Contribution

It presents a novel dual attention approach that leverages pyramid features and contextual information to improve image captioning performance.

Findings

01 Achieved impressive results on the Flickr8K, Flickr30K, and MS COCO datasets.

02 Enhanced caption quality, producing more descriptive and smooth sentences.

03 Proposed modular attention methods that can be integrated into various captioning models.

Abstract

Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences. Specifically, with the full consideration of the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize the visually indicative and semantically consistent regions in images. On the other hand, the contextual information can help re-calibrate the importance of feature components by learning the channel-wise dependencies, to improve the discriminative power of visual features for better content description. We conducted comprehensive experiments on three well-known datasets: Flickr8K, Flickr30K and MS COCO, which achieved impressive…
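
To make the two mechanisms in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' formulation: the dot-product scoring, the sigmoid gate, the gate-then-attend ordering, the averaging over pyramid levels, and every name (`dual_attention`, `W_s`, `W_c`, and so on) are assumptions chosen for illustration. Each pyramid level is a list of region vectors with C channels, and `h` stands in for the hidden state of the RNN controller.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def channel_gate(h, W_c):
    """Assumed channel re-calibration: one sigmoid gate per channel,
    computed from the hidden state h."""
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(W_c, h)]


def spatial_attend(regions, h, W_s):
    """Assumed spatial attention: project h into feature space, score each
    region by dot product, and return the softmax-weighted sum."""
    query = matvec(W_s, h)
    scores = [sum(q * v for q, v in zip(query, r)) for r in regions]
    weights = softmax(scores)
    channels = len(regions[0])
    context = [sum(w * r[c] for w, r in zip(weights, regions))
               for c in range(channels)]
    return context, weights


def dual_attention(pyramid, h, W_s, W_c):
    """Gate the channels, attend over regions at each pyramid level, then
    average the per-level context vectors (the averaging is an assumption)."""
    gate = channel_gate(h, W_c)
    contexts = []
    for regions in pyramid:
        recalibrated = [[g * x for g, x in zip(gate, r)] for r in regions]
        context, _ = spatial_attend(recalibrated, h, W_s)
        contexts.append(context)
    return [sum(ctx[c] for ctx in contexts) / len(contexts)
            for c in range(len(gate))]


# Toy input: a 2x2 level (four regions) and a 1x1 level, with C = 2 channels.
pyramid = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    [[0.5, 0.5]],
]
h = [1.0, -1.0, 0.5]                       # hidden state, D = 3
W_s = [[0.2, 0.0, 0.1], [0.0, 0.3, 0.0]]   # hypothetical C x D projection
W_c = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]   # hypothetical C x D gate weights
context = dual_attention(pyramid, h, W_s, W_c)
```

In a real captioning model the projections would be learned and the resulting context vector would feed the decoder at each time step; here the weights are fixed toy values so that the data flow is easy to follow.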

Taxonomy

Methods: Convolution