Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu; Jimmy Ba; Ryan Kiros; Kyunghyun Cho; Aaron Courville; Ruslan Salakhutdinov; Richard Zemel; Yoshua Bengio

arXiv:1502.03044 · cs.LG · April 20, 2016 · 7.5k cites

TL;DR

This paper introduces an attention-based neural model for image captioning that learns to focus on salient image regions while generating each word, and reports state-of-the-art results on three benchmark datasets: Flickr8k, Flickr30k, and MS COCO.

Contribution

It presents an attention mechanism for image captioning that can be trained either deterministically with standard backpropagation or stochastically by maximizing a variational lower bound, and uses visualizations to show where the model looks as it generates each word.

Findings

01 The model achieves state-of-the-art performance on the Flickr8k, Flickr30k, and MS COCO datasets.

02 The model automatically learns to fix its gaze on salient objects while generating the corresponding words of the caption.

03 Visualizations of the attention show that the model's focus aligns with the relevant image regions.

Abstract

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
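
The difference between the two training regimes named in the abstract can be illustrated at the level of a single attention step. The following is a minimal NumPy sketch, not the authors' implementation: the additive scoring function, the parameter names (`W_a`, `W_h`, `v`), and all array sizes are assumptions made for illustration. The deterministic variant takes a weighted average of region features, which is differentiable and so trainable by backpropagation; the stochastic variant samples one region from the same weights, which is the case where a variational lower bound is maximized instead.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_weights(features, hidden, W_a, W_h, v):
    # features: (L, D) one feature vector per image region
    # hidden:   (H,)   current decoder state
    # Assumed additive scoring function (hypothetical parameterization).
    scores = np.tanh(features @ W_a + hidden @ W_h) @ v  # shape (L,)
    return softmax(scores)

def soft_context(features, alpha):
    # Deterministic variant: expected feature vector under alpha.
    return alpha @ features  # shape (D,)

def hard_context(features, alpha, rng):
    # Stochastic variant: sample a single region index from alpha.
    idx = int(rng.choice(len(alpha), p=alpha))
    return features[idx], idx

# Illustrative sizes and random parameters (assumptions, not from the paper).
rng = np.random.default_rng(0)
L, D, H, K = 6, 4, 5, 3
features = rng.standard_normal((L, D))
hidden = rng.standard_normal(H)
W_a = rng.standard_normal((D, K))
W_h = rng.standard_normal((H, K))
v = rng.standard_normal(K)

alpha = attention_weights(features, hidden, W_a, W_h, v)
z_soft = soft_context(features, alpha)
z_hard, idx = hard_context(features, alpha, rng)
```

In a captioning model, the context vector (`z_soft` or `z_hard`) would be fed to the decoder together with the previous word to predict the next one, and the weights `alpha`, reshaped to the image grid, are what the attention visualizations display.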

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
