Neural Attention for Image Captioning: Review of Outstanding Methods
Zanyar Zohourianshahzadi, Jugal K. Kalita

TL;DR
This paper reviews various attention mechanisms in deep learning models for image captioning, highlighting that multi-head and bottom-up attention variants currently achieve the best results.
Contribution
It provides a focused review of attention mechanisms in image captioning models, analyzing their effectiveness and identifying the most successful types.
Findings
Variants of multi-head attention combined with bottom-up attention achieve the best results.
Soft, bottom-up, and multi-head attention are the most commonly used mechanisms.
Abstract
Image captioning is the task of automatically generating sentences that describe an input image in the best way possible. The most successful techniques for automatically generating image captions have recently used attentive deep learning models. There are variations in the way deep learning models with attention are designed. In this survey, we provide a review of literature related to attentive deep learning models for image captioning. Instead of offering a comprehensive review of all prior work on deep image captioning models, we explain various types of attention mechanisms used for the task of image captioning in deep learning models. The most successful deep learning models used for image captioning follow the encoder-decoder architecture, although there are differences in the way these models employ attention mechanisms. Via analysis on performance results from different…
Taxonomy
Methods: Softmax · Linear Layer
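As a rough illustration of how the two listed methods fit together, the sketch below implements one step of soft (additive) attention over image region features using only linear projections and a softmax. It is a minimal sketch, not the formulation of any specific model in the survey: the function names, the tiny dimensions, and the random weights are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(W, x):
    """Bias-free linear layer: W is a list of rows, returns W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def soft_attention(regions, hidden, W_r, W_h, v):
    """One step of additive soft attention (hypothetical helper).

    regions: k region feature vectors (the encoder output)
    hidden:  current decoder hidden state
    W_r, W_h, v: illustrative projection weights (assumed, untrained)
    """
    h_proj = linear(W_h, hidden)
    scores = []
    for r in regions:
        r_proj = linear(W_r, r)
        act = [math.tanh(a + b) for a, b in zip(r_proj, h_proj)]
        scores.append(sum(vi * ai for vi, ai in zip(v, act)))
    # Softmax turns scores into weights that are positive and sum to 1.
    weights = softmax(scores)
    # The context vector is the weighted average of the region features.
    dim = len(regions[0])
    context = [sum(w * r[j] for w, r in zip(weights, regions)) for j in range(dim)]
    return context, weights


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy setup: 4 regions of dimension 3, hidden size 2, attention size 5.
rng = random.Random(0)
k, d_r, d_h, d_a = 4, 3, 2, 5
regions = random_matrix(k, d_r, rng)
hidden = [rng.uniform(-1.0, 1.0) for _ in range(d_h)]
W_r = random_matrix(d_a, d_r, rng)
W_h = random_matrix(d_a, d_h, rng)
v = [rng.uniform(-1.0, 1.0) for _ in range(d_a)]

context, weights = soft_attention(regions, hidden, W_r, W_h, v)
print("attention weights:", [round(w, 3) for w in weights])
print("context vector:", [round(c, 3) for c in context])
```

In a captioning decoder, a step like this would typically run once per generated word, with the context vector fed into the next decoder update. Bottom-up attention would change where `regions` come from (object detector proposals rather than a uniform grid), and multi-head attention would run several such weightings in parallel with scaled dot-product scores.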
