Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille

TL;DR
This paper introduces a multimodal RNN model that combines deep neural networks for images and sentences to generate accurate image captions and improve retrieval tasks, outperforming existing methods on multiple benchmarks.
Contribution
The paper presents a novel multimodal RNN architecture that integrates deep convolutional and recurrent networks for improved image captioning and retrieval.
Findings
Outperforms state-of-the-art on four benchmark datasets
Achieves significant improvements in image and sentence retrieval tasks
Effectively models the joint probability of images and captions
Abstract
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
