Text-guided Attention Model for Image Captioning
Jonghwan Mun, Minsu Cho, Bohyung Han

TL;DR
This paper introduces a text-guided attention model for image captioning that leverages associated captions to improve visual attention and generate more detailed descriptions, achieving state-of-the-art results on MS-COCO.
Contribution
The paper proposes a novel exemplar-based learning approach that uses associated captions to guide visual attention in image captioning models.
Findings
Achieves state-of-the-art performance on the MS-COCO Captioning benchmark in standard metrics.
Enables detailed scene descriptions by distinguishing small or confusable objects.
Demonstrates effectiveness of text-guided attention in generating natural language descriptions.
Abstract
Visual attention plays an important role in understanding images and has proven effective in generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer visual attention in the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns to drive visual attention using associated captions. For this model, we propose an exemplar-based learning approach that retrieves captions associated with each image from the training data and uses them to learn attention on visual features. Our attention model makes it possible to describe scenes in detail by effectively distinguishing small or confusable objects. We validate our model on the MS-COCO Captioning benchmark and achieve state-of-the-art performance in standard metrics.
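The two steps named in the abstract, retrieving associated captions and using them to weight visual features, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the function names, the cosine-similarity retrieval, the dot-product scoring, and the premise that captions and image regions are already embedded as plain vectors of the same dimension. The abstract does not specify the encoders or the scoring function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_guidance(query_feat, train_feats, train_caption_vecs, k=1):
    """Exemplar step (assumed form): return the caption vectors of the
    k training images whose global features are closest to the query."""
    ranked = sorted(
        range(len(train_feats)),
        key=lambda i: cosine(query_feat, train_feats[i]),
        reverse=True,
    )
    return [train_caption_vecs[i] for i in ranked[:k]]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_guided_attention(region_feats, guidance_vec):
    """Attention step (assumed form): score each image region by its dot
    product with the guidance caption vector, normalise the scores with
    softmax, and return the weights plus the weighted sum of regions."""
    scores = [
        sum(r * g for r, g in zip(region, guidance_vec))
        for region in region_feats
    ]
    weights = softmax(scores)
    dim = len(region_feats[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, context
```

As a usage example, calling `retrieve_guidance` on a query image feature returns the caption vectors of its nearest training exemplars, and passing one of those vectors to `text_guided_attention` yields weights that favour the regions most aligned with the caption. A captioning decoder would then consume the resulting context vector; that decoder is outside the scope of this sketch.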
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
