Distributed Attention for Grounded Image Captioning
Nenglun Chen, Xingjia Pan, Runnan Chen, Lei Yang, Zhiwen Lin, Yuqiang Ren, Haolei Yuan, Xiaowei Guo, Feiyue Huang, Wenping Wang

TL;DR
This paper introduces a distributed attention mechanism for weakly supervised grounded image captioning, improving the coverage of object regions in generated captions by aggregating information from multiple regions.
Contribution
It proposes a distributed attention approach to address the partial grounding problem, in which attention covers only the most discriminative part of an object, improving the accuracy of region-word alignment in weakly supervised captioning.
Findings
Outperforms state-of-the-art methods in grounded image captioning
Improves coverage of object regions in generated descriptions
Enhances attention accuracy for visually grounded words
Abstract
We study the problem of weakly supervised grounded image captioning. That is, given an image, the goal is to automatically generate a sentence describing the context of the image with each noun word grounded to the corresponding region in the image. This task is challenging due to the lack of explicit fine-grained region-word alignments as supervision. Previous weakly supervised methods mainly explore various kinds of regularization schemes to improve attention accuracy. However, their performance is still far from that of the fully supervised ones. One main issue that has been ignored is that the attention for generating visually groundable words may focus only on the most discriminative parts and cannot cover the whole object. To this end, we propose a simple yet effective method to alleviate this issue, which we term the partial grounding problem. Specifically, we design a distributed…
