Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

Junhua Mao; Wei Xu; Yi Yang; Jiang Wang; Zhiheng Huang; Alan Yuille

arXiv:1412.6632 · cs.CV · June 12, 2015 · ICLR · 651 citations

TL;DR

This paper introduces a multimodal RNN model that combines deep neural networks for images and sentences to generate accurate image captions and improve retrieval tasks, outperforming existing methods on multiple benchmarks.

Contribution

The paper presents a novel multimodal RNN architecture that integrates deep convolutional and recurrent networks for improved image captioning and retrieval.

Findings

01 Outperforms prior state-of-the-art methods on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K, MS COCO)

02 Achieves significant improvements on image retrieval and sentence retrieval tasks

03 Models the probability of each caption word given the preceding words and the image

Abstract

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieve significant performance improvement over the state-of-the-art methods which directly optimize the ranking…
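
The generation procedure described in the abstract (predict a distribution over the next word from the previous words and the image, then sample from it) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the layer sizes, the ReLU recurrence, the tanh fusion, and the random untrained weights are all assumptions chosen for illustration, and the image feature is a made-up vector standing in for the output of a convolutional network.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_toy_model(vocab_size, embed_dim, hidden_dim, image_dim, multi_dim, seed=0):
    """Hypothetical parameter set for the sketch (all names are assumptions)."""
    rng = random.Random(seed)
    return {
        "embed": init(vocab_size, embed_dim, rng),   # word embedding table
        "W_in": init(hidden_dim, embed_dim, rng),    # embedding -> recurrent
        "W_rec": init(hidden_dim, hidden_dim, rng),  # recurrent -> recurrent
        "V_w": init(multi_dim, embed_dim, rng),      # embedding -> multimodal
        "V_r": init(multi_dim, hidden_dim, rng),     # recurrent -> multimodal
        "V_i": init(multi_dim, image_dim, rng),      # image -> multimodal
        "W_out": init(vocab_size, multi_dim, rng),   # multimodal -> vocabulary
    }


def step(model, word_id, hidden, image_feat):
    """One time step: return P(next word | previous words, image) and new state."""
    w = model["embed"][word_id]
    pre = [a + b for a, b in zip(matvec(model["W_in"], w),
                                 matvec(model["W_rec"], hidden))]
    hidden = [max(0.0, x) for x in pre]  # assumed ReLU recurrence
    # Multimodal layer: fuse word, recurrent, and image information.
    fused = [math.tanh(a + b + c) for a, b, c in zip(
        matvec(model["V_w"], w),
        matvec(model["V_r"], hidden),
        matvec(model["V_i"], image_feat))]
    return softmax(matvec(model["W_out"], fused)), hidden


def sample_caption(model, image_feat, start_id, end_id, max_len=10, seed=0):
    """Sample word ids one at a time until the end token or max_len."""
    rng = random.Random(seed)
    hidden = [0.0] * len(model["W_rec"])
    word, caption = start_id, []
    for _ in range(max_len):
        probs, hidden = step(model, word, hidden, image_feat)
        word = rng.choices(range(len(probs)), weights=probs)[0]
        if word == end_id:
            break
        caption.append(word)
    return caption
```

With untrained weights the sampled word ids are meaningless; the sketch only shows how the image feature enters every step through the multimodal layer and how a caption is drawn word by word from the resulting distribution.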

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning