Attention Correctness in Neural Image Captioning
Chenxi Liu, Junhua Mao, Fei Sha, Alan Yuille

TL;DR
This paper evaluates and enhances the correctness of attention mechanisms in neural image captioning by introducing a quantitative metric and supervised training methods, leading to improved attention and caption quality.
Contribution
It proposes a new metric for attention correctness and introduces supervised training approaches to improve attention in image captioning models.
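As a rough illustration of what a quantitative attention-correctness score could look like, the sketch below measures the share of attention mass that falls inside a human-annotated region. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `attention_correctness`, the 2-D array layout, and the choice to normalize by total attention are all hypothetical.

```python
import numpy as np

def attention_correctness(attention_map, region_mask):
    """Fraction of attention mass inside the annotated region (hypothetical helper).

    attention_map: 2-D array of non-negative attention weights for one word.
    region_mask:   2-D boolean array, True where the human annotation marks
                   the entity that the word refers to.
    """
    attention = np.asarray(attention_map, dtype=float)
    mask = np.asarray(region_mask, dtype=bool)
    if attention.shape != mask.shape:
        raise ValueError("attention_map and region_mask must have the same shape")
    total = attention.sum()
    if total <= 0:
        raise ValueError("attention_map must contain positive mass")
    # Normalize so the score lies in [0, 1] even if the map was not normalized.
    return float(attention[mask].sum() / total)

# Toy 2x2 example: the annotated region is the right-hand column.
toy_attention = [[0.1, 0.2],
                 [0.3, 0.4]]
toy_mask = [[False, True],
            [False, True]]
score = attention_correctness(toy_attention, toy_mask)
```

Under these assumptions a score near 1 means the attention is concentrated on the annotated region, and a score near the region's area fraction means the attention is no better than uniform.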
Findings
Supervised attention training improves attention correctness.
Enhanced attention leads to better caption quality.
Quantitative evaluation correlates attention correctness with caption performance.
Abstract
Attention mechanisms have recently been introduced in deep learning for various tasks in natural language processing and computer vision. But despite their popularity, the "correctness" of the implicitly-learned attention maps has only been assessed qualitatively by visualization of several examples. In this paper we focus on evaluating and improving the correctness of attention in neural image captioning models. Specifically, we propose a quantitative evaluation metric for the consistency between the generated attention maps and human annotations, using recently released datasets with alignment between regions in images and entities in captions. We then propose novel models with different levels of explicit supervision for learning attention maps during training. The supervision can be strong when alignment between regions and caption entities is available, or weak when only object…
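One way to picture strong supervision is as an extra loss term that pulls the predicted attention toward the annotated region. The sketch below is an assumption-laden illustration, not the paper's formulation: it uses a cross-entropy between a uniform target over the annotated cells and the predicted attention, and the names `attention_supervision_loss`, `total_loss`, and `weight` are hypothetical.

```python
import numpy as np

def attention_supervision_loss(predicted_attention, region_mask, eps=1e-12):
    """Cross-entropy between a region-derived target and predicted attention.

    Assumption: the target distribution is uniform over annotated cells.
    Returns 0.0 when a word has no annotated region, so such words add no
    attention penalty.
    """
    pred = np.asarray(predicted_attention, dtype=float).ravel()
    mask = np.asarray(region_mask, dtype=float).ravel()
    if pred.shape != mask.shape:
        raise ValueError("predicted_attention and region_mask must match in size")
    if mask.sum() == 0:
        return 0.0
    target = mask / mask.sum()   # uniform target over the annotated cells
    pred = pred / pred.sum()     # make sure the prediction is a distribution
    return float(-(target * np.log(pred + eps)).sum())

def total_loss(caption_loss, attention_loss, weight=1.0):
    """Combine the captioning loss with the weighted attention term (hypothetical)."""
    return caption_loss + weight * attention_loss

# Toy example: uniform attention over four cells, two of which are annotated.
uniform_attention = [0.25, 0.25, 0.25, 0.25]
annotated = [1, 1, 0, 0]
loss = attention_supervision_loss(uniform_attention, annotated)
```

The loss shrinks as more attention mass moves onto the annotated cells, which is the behavior a supervised attention term is meant to encourage. The `weight` parameter trades off caption likelihood against attention agreement and would need tuning in practice.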
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
