Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio

TL;DR
This paper introduces an attention-based neural model for image captioning that learns to focus on salient image regions while generating each word, reporting state-of-the-art results on three benchmark datasets: Flickr8k, Flickr30k, and MS COCO.
Contribution
It presents an attention mechanism for image captioning that can be trained either deterministically with standard backpropagation or stochastically by maximizing a variational lower bound, with visualizations showing that the model learns to focus on salient objects.
Findings
Achieved state-of-the-art performance on Flickr8k, Flickr30k, and MS COCO datasets.
Model automatically learns to fix its gaze on salient objects during caption generation.
Attention visualization confirms the model's focus aligns with relevant image regions.
Abstract
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
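The deterministic variant described in the abstract can be illustrated with a minimal NumPy sketch of one soft-attention step. This is an illustration under stated assumptions, not the authors' implementation: the function name `soft_attention`, the parameter names `W_f`, `W_h`, and `v`, the additive scoring form, and the array sizes are all hypothetical choices made here. The idea shown is that the model scores each image location against the current decoder state, normalizes the scores with a softmax, and takes the weighted average of the location features as the context for the next word. Because every step is differentiable, the whole model can be trained with standard backpropagation.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """One soft-attention step (illustrative sketch, names are hypothetical).

    features: (L, D) array, one D-dimensional vector per image location.
    hidden:   (H,) array, the decoder state before emitting the next word.
    W_f, W_h, v: assumed scoring parameters with shapes (D, A), (H, A), (A,).
    """
    # Score every location against the decoder state (additive scoring).
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # shape (L,)
    # Softmax over locations, shifted by the max for numerical stability.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted average of the location features.
    context = weights @ features  # shape (D,)
    return context, weights

# Illustrative sizes only: 196 locations, 512-dimensional features.
rng = np.random.default_rng(0)
L, D, H, A = 196, 512, 256, 128
features = rng.standard_normal((L, D))
hidden = rng.standard_normal(H)
W_f = rng.standard_normal((D, A)) * 0.01
W_h = rng.standard_normal((H, A)) * 0.01
v = rng.standard_normal(A)

context, weights = soft_attention(features, hidden, W_f, W_h, v)
```

Reshaping `weights` to the spatial grid of the feature map (for example 14 × 14 when `L` is 196) gives the kind of attention map the abstract refers to when it mentions visualizing where the model fixes its gaze. The stochastic variant instead samples a single location from `weights`, which is why it is trained through a variational lower bound rather than plain backpropagation.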
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
