Explain Images with Multimodal Recurrent Neural Networks

Junhua Mao; Wei Xu; Yi Yang; Jiang Wang; Alan L. Yuille

arXiv:1410.1090 · cs.CV · October 7, 2014 · 370 citations

Open Access

TL;DR

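This paper introduces a multimodal Recurrent Neural Network (m-RNN) that generates sentence descriptions of images. A deep convolutional network for images and a deep recurrent network for sentences interact in a multimodal layer, and the model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K).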

Contribution

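The m-RNN architecture joins image modeling and sentence modeling in a single multimodal layer. It outperforms the state-of-the-art generative method at sentence generation and can also be applied to image and sentence retrieval, where it reports significant performance improvements.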

Findings

01 Outperforms the state-of-the-art generative method at generating image descriptions

02 Achieves significant performance improvements on image retrieval and sentence retrieval

03 Validated on three benchmark datasets: IAPR TC-12, Flickr 8K, and Flickr 30K

Abstract

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the…
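The generation procedure the abstract describes (predict a distribution over the next word from the previous words and the image, then sample from it) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vocabulary, the layer sizes, the random untrained weights, the choice of activations, and the names `ToyMRNN`, `step`, and `sample_caption` are all assumptions made for illustration, and the image feature vector is a hand-written stand-in for the output of a convolutional network.

```python
import math
import random

# Toy vocabulary and layer sizes: illustrative assumptions, not values from the paper.
VOCAB = ["<start>", "a", "dog", "runs", "on", "grass", "<end>"]
EMBED, HIDDEN, IMG, MULTI = 4, 5, 3, 6


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def add(*vecs):
    return [sum(xs) for xs in zip(*vecs)]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyMRNN:
    """Hypothetical toy model: a recurrent language part and an image part
    that meet in one multimodal layer."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.embed = rand_matrix(len(VOCAB), EMBED, rng)  # word embeddings
        self.w_in = rand_matrix(HIDDEN, EMBED, rng)       # embedding -> recurrent
        self.w_rec = rand_matrix(HIDDEN, HIDDEN, rng)     # recurrent -> recurrent
        self.v_word = rand_matrix(MULTI, EMBED, rng)      # embedding -> multimodal
        self.v_rec = rand_matrix(MULTI, HIDDEN, rng)      # recurrent -> multimodal
        self.v_img = rand_matrix(MULTI, IMG, rng)         # image -> multimodal
        self.w_out = rand_matrix(len(VOCAB), MULTI, rng)  # multimodal -> word scores

    def step(self, word_id, hidden, image_feat):
        """Return P(next word | previous words, image) and the new recurrent state."""
        word = self.embed[word_id]
        pre = add(matvec(self.w_in, word), matvec(self.w_rec, hidden))
        hidden = [max(0.0, x) for x in pre]  # rectified recurrent state (assumed)
        multi = [math.tanh(x) for x in add(
            matvec(self.v_word, word),
            matvec(self.v_rec, hidden),
            matvec(self.v_img, image_feat),
        )]
        return softmax(matvec(self.w_out, multi)), hidden

    def sample_caption(self, image_feat, rng, max_len=10):
        """Sample words one at a time until <end> or max_len words."""
        hidden = [0.0] * HIDDEN
        word_id = VOCAB.index("<start>")
        words = []
        for _ in range(max_len):
            probs, hidden = self.step(word_id, hidden, image_feat)
            word_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
            if VOCAB[word_id] == "<end>":
                break
            words.append(VOCAB[word_id])
        return words


model = ToyMRNN(seed=0)
image_feat = [0.2, -0.1, 0.7]  # stand-in for convolutional image features
print(" ".join(model.sample_caption(image_feat, random.Random(1))))
```

Because the weights are random, the printed words carry no meaning; the sketch only shows how the image features and the sentence history both feed the layer that produces the next-word distribution.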

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Anomaly Detection Techniques and Applications · Digital Media Forensic Detection