Text-guided Attention Model for Image Captioning

Jonghwan Mun; Minsu Cho; Bohyung Han

arXiv:1612.03557·cs.CV·December 13, 2016·5 cites

TL;DR

This paper introduces a text-guided attention model for image captioning that leverages associated captions to improve visual attention and generate more detailed descriptions, achieving state-of-the-art results on MS-COCO.

Contribution

The paper proposes a novel exemplar-based learning approach that uses associated captions to guide visual attention in image captioning models.

Findings

01 Achieves state-of-the-art performance on the MS-COCO Captioning benchmark.

02 Enables detailed scene descriptions by distinguishing small or confusable objects.

03 Demonstrates the effectiveness of text-guided attention in generating natural language descriptions.

Abstract

Visual attention plays an important role in understanding images and has proven effective in generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer visual attention in the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns to drive visual attention using associated captions. For this model, we propose an exemplar-based learning approach that retrieves captions associated with each image from the training data and uses them to learn attention on visual features. Our attention model makes it possible to describe scenes in detail by effectively distinguishing small or confusable objects. We validate our model on the MS-COCO Captioning benchmark and achieve state-of-the-art performance on standard metrics.
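The two steps the abstract describes, retrieving guidance captions and using them to weight visual features, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: retrieval is done by cosine similarity between image feature vectors, the attention score is a plain dot product between each region feature and a caption-derived guidance vector, and the names `retrieve_guidance_captions` and `text_guided_attention` are hypothetical. The paper's model learns its scoring function; this sketch only shows the general shape of the idea.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_guidance_captions(query_feature, training_set, k=1):
    """Hypothetical exemplar retrieval: return the captions of the k
    training images most similar to the query (cosine similarity is an
    assumption, not the paper's stated criterion)."""
    ranked = sorted(
        training_set,
        key=lambda item: cosine(query_feature, item[0]),
        reverse=True,
    )
    return [caption for _, caption in ranked[:k]]


def text_guided_attention(region_features, guidance_vector):
    """Weight image regions by their dot-product score against a
    caption-derived guidance vector (assumed scoring function), then
    return the weights and the attention-weighted context vector."""
    scores = [dot(f, guidance_vector) for f in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context
```

In use, a caption returned by `retrieve_guidance_captions` would first be encoded into a vector of the same dimension as the region features (the encoder is omitted here), and that vector would be passed as `guidance_vector` so that regions matching the caption receive larger weights.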

Code & Models

Repositories

vikramnitin9/nnfl
tf

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning