Learning a Recurrent Visual Representation for Image Caption Generation

Xinlei Chen; C. Lawrence Zitnick

arXiv:1411.5654 · cs.CV · November 21, 2014 · 181 cites

TL;DR

This paper introduces a recurrent neural network with a visual memory that maps between images and sentences in both directions. The same model can generate novel captions for an image and reconstruct visual features from a description, and it achieves state-of-the-art results on caption generation.

Contribution

It presents a novel recurrent visual memory model that improves image captioning and retrieval by learning long-term visual concepts, outperforming previous embedding-based methods.

Findings

01 State-of-the-art results for generating novel image captions.

02 Human judges preferred the automatically generated captions over human-written captions 19.8% of the time.

03 Competitive performance on image and sentence retrieval tasks.

Abstract

In this paper we explore the bi-directional mapping between images and their sentence-based descriptions. We propose learning this mapping using a recurrent neural network. Unlike previous approaches that map both sentences and images to a common embedding, we enable the generation of novel sentences given an image. Using the same model, we can also reconstruct the visual features associated with an image given its visual description. We use a novel recurrent visual memory that automatically learns to remember long-term visual concepts to aid in both sentence generation and visual feature reconstruction. We evaluate our approach on several tasks. These include sentence generation, sentence retrieval and image retrieval. State-of-the-art results are shown for the task of generating novel image descriptions. When compared to human generated captions, our automatically generated captions…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization