Distributed Attention for Grounded Image Captioning

Nenglun Chen; Xingjia Pan; Runnan Chen; Lei Yang; Zhiwen Lin; Yuqiang Ren; Haolei Yuan; Xiaowei Guo; Feiyue Huang; Wenping Wang

arXiv:2108.01056 · cs.CV · August 24, 2021

TL;DR

This paper introduces a distributed attention mechanism for weakly supervised grounded image captioning, improving the coverage of object regions in generated captions by aggregating information from multiple regions.

Contribution

It proposes a novel distributed attention approach to address partial grounding issues, enhancing the accuracy of region-word alignment in weakly supervised captioning.

Findings

01 Outperforms state-of-the-art methods in grounded image captioning

02 Improves coverage of object regions in generated descriptions

03 Enhances attention accuracy for visually grounded words

Abstract

We study the problem of weakly supervised grounded image captioning. That is, given an image, the goal is to automatically generate a sentence describing the content of the image, with each noun grounded to the corresponding region in the image. This task is challenging because there are no explicit fine-grained region-word alignments to serve as supervision. Previous weakly supervised methods mainly explore various regularization schemes to improve attention accuracy. However, their performance is still far from that of fully supervised methods. One main issue that has been ignored is that the attention used to generate visually groundable words may focus only on the most discriminative parts and cannot cover the whole object. To this end, we propose a simple yet effective method to alleviate this issue, which we term the partial grounding problem. Specifically, we design a distributed…
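The partial grounding problem described above can be illustrated with a small toy example. The sketch below is not the paper's method (the abstract is cut off before the mechanism is described); it only shows, under assumed toy data, how grounding a word to the single most-attended region can cover a fraction of an object, while taking the union of several attended regions covers more of it. The boxes, attention scores, and the 0.05 threshold are all hypothetical values chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def union_box(boxes):
    """Smallest box (x1, y1, x2, y2) enclosing all given boxes."""
    x1s, y1s, x2s, y2s = zip(*boxes)
    return (min(x1s), min(y1s), max(x2s), max(y2s))

def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical ground-truth box for one object.
object_box = (0.0, 0.0, 10.0, 10.0)

# Hypothetical region proposals: three parts of the object, one background box.
regions = [
    (0.0, 0.0, 5.0, 5.0),      # most discriminative part
    (5.0, 0.0, 10.0, 5.0),     # another part of the object
    (0.0, 5.0, 10.0, 10.0),    # lower half of the object
    (20.0, 20.0, 30.0, 30.0),  # unrelated background
]

# Hypothetical attention logits for one generated noun.
attn = softmax([4.0, 2.0, 2.0, -2.0])

# Grounding with only the top-attended region: covers one part of the object.
top_region = regions[attn.index(max(attn))]
single_iou = iou(top_region, object_box)

# Grounding with the union of all regions above an assumed attention threshold.
THRESHOLD = 0.05  # illustrative value, not taken from the paper
selected = [r for r, a in zip(regions, attn) if a >= THRESHOLD]
merged_iou = iou(union_box(selected), object_box)

print(f"top-1 region IoU: {single_iou:.2f}")
print(f"merged regions IoU: {merged_iou:.2f}")
```

In this toy setup the top-attended region overlaps the object with an IoU of 0.25, whereas the union of the three attended part regions matches the object box exactly. Real region proposals are noisier, so the gap would be less clean in practice.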
