Watch What You Just Said: Image Captioning with Text-Conditional Attention

Luowei Zhou; Chenliang Xu; Parker Koch; Jason J. Corso

arXiv:1606.04621 · cs.CV · November 28, 2016 · 20 cites

TL;DR

This paper introduces text-conditional attention, an attention mechanism for image captioning that uses previously generated text to decide which image features to focus on. The authors report that it outperforms state-of-the-art methods on MS-COCO.

Contribution

The paper proposes an attention mechanism that conditions on textual context and is trained end to end, jointly learning the image embedding, text embedding, text-conditional attention, and language model in one network.

Findings

01. Outperforms state-of-the-art methods on MS-COCO

02. Improves caption quality in quantitative metrics

03. Receives higher human evaluation scores

Abstract

Attention mechanisms have attracted considerable interest in image captioning due to their powerful performance. However, existing methods use only visual content as attention, and whether textual context can improve attention in image captioning remains unsolved. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our proposed method allows joint learning of the image embedding, text embedding, text-conditional attention and language model with one network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The experimental results…
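
To illustrate the idea described in the abstract, here is a minimal sketch of how previously generated text could reweight image features. It is not the authors' formulation: averaging the word embeddings, the projection matrix `W`, the softmax, the toy dimensions, and all function and variable names are assumptions made for this example.

```python
import math
import random


def text_conditional_attention(image_feat, prev_word_ids, embed, W):
    """Reweight an image feature vector using previously generated words.

    Hypothetical sketch, not the paper's exact model.
    image_feat:    list of D floats (e.g. a CNN feature vector)
    prev_word_ids: list of word indices generated so far
    embed:         V x E word-embedding table (assumed)
    W:             D x E projection from text space to feature space (assumed)
    """
    emb_dim = len(embed[0])

    # Summarize the text generated so far as the mean word embedding.
    if prev_word_ids:
        text = [
            sum(embed[w][j] for w in prev_word_ids) / len(prev_word_ids)
            for j in range(emb_dim)
        ]
    else:
        text = [0.0] * emb_dim  # no words yet: no textual preference

    # One score per image-feature dimension, conditioned on the text.
    scores = [
        sum(W[i][j] * text[j] for j in range(emb_dim))
        for i in range(len(image_feat))
    ]

    # Numerically stable softmax turns the scores into attention weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Element-wise reweighting of the image feature.
    attended = [w * x for w, x in zip(weights, image_feat)]
    return weights, attended


# Toy usage with random parameters (all sizes are arbitrary).
random.seed(0)
V, E, D = 5, 3, 4
embed = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(V)]
W = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(D)]
image_feat = [0.5, 1.0, -0.2, 2.0]

weights, attended = text_conditional_attention(image_feat, [1, 3], embed, W)
```

In this sketch the attention weights change as more words are generated, so the same image feature is emphasized differently at each decoding step. In an end-to-end model such as the one the abstract describes, the embedding table and projection would be learned jointly with the language model rather than sampled at random.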

Code & Models

Repositories

LuoweiZhou/e2e-gLSTM-sc (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization