Explain Images with Multimodal Recurrent Neural Networks
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Alan L. Yuille

TL;DR
This paper introduces a multimodal RNN (m-RNN) that generates descriptive sentences for images by combining a deep convolutional network for images with a deep recurrent network for sentences. The model is validated on three benchmark datasets: IAPR TC-12, Flickr 8K, and Flickr 30K.
Contribution
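The m-RNN architecture joins an image sub-network and a sentence sub-network in a shared multimodal layer. It outperforms the state-of-the-art generative method at sentence generation and can also be applied to image and sentence retrieval, where it reports significant improvements.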
Findings
Outperforms the state-of-the-art generative method at image captioning
Achieves significant improvements in both image retrieval and sentence retrieval
Validated on three benchmark datasets: IAPR TC-12, Flickr 8K, and Flickr 30K
Abstract
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the…
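The generation procedure described in the abstract (a word distribution conditioned on the previous words and the image, then sampling from it) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the layer wiring, activation functions, dimensions, random untrained weights, and every name here (ToyMRNN, step, sample, the <start>/<end> tokens) are assumptions made for illustration, and the image feature vector stands in for the output of a real convolutional network.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyMRNN:
    """Hypothetical, untrained stand-in for an m-RNN-style model."""

    def __init__(self, vocab, img_dim, hid=8, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.hid = hid
        n = len(vocab)
        self.E = rand_matrix(n, hid, rng)          # word embeddings, one row per word
        self.U_r = rand_matrix(hid, hid, rng)      # recurrent weights
        self.V_w = rand_matrix(hid, hid, rng)      # word -> multimodal layer
        self.V_r = rand_matrix(hid, hid, rng)      # recurrent state -> multimodal layer
        self.V_i = rand_matrix(hid, img_dim, rng)  # image feature -> multimodal layer
        self.W_out = rand_matrix(n, hid, rng)      # multimodal layer -> vocabulary scores

    def step(self, word_id, r_prev, img_feat):
        """Return P(next word | previous words, image) and the new recurrent state."""
        w = self.E[word_id]
        # Sentence sub-network: update the recurrent state (assumed ReLU).
        r = [max(0.0, a + b) for a, b in zip(matvec(self.U_r, r_prev), w)]
        # Multimodal layer: fuse word, recurrent state, and image (assumed tanh).
        fused = zip(matvec(self.V_w, w), matvec(self.V_r, r),
                    matvec(self.V_i, img_feat))
        m = [math.tanh(a + b + c) for a, b, c in fused]
        return softmax(matvec(self.W_out, m)), r

    def sample(self, img_feat, max_len=10, seed=0):
        """Generate a sentence by sampling one word at a time."""
        rng = random.Random(seed)
        r = [0.0] * self.hid
        word = self.vocab.index("<start>")
        sentence = []
        for _ in range(max_len):
            probs, r = self.step(word, r, img_feat)
            word = rng.choices(range(len(self.vocab)), weights=probs)[0]
            if self.vocab[word] == "<end>":
                break
            sentence.append(self.vocab[word])
        return sentence


vocab = ["<start>", "<end>", "a", "dog", "runs", "on", "grass"]
model = ToyMRNN(vocab, img_dim=4)
image_feature = [0.1, 0.2, 0.3, 0.4]  # placeholder for a CNN feature vector
print(model.sample(image_feature))
```

Because the weights are random, the printed words are meaningless (the start token may even be sampled); the sketch only shows the data flow, in which the image feature enters every step through the multimodal layer while the recurrent state carries the sentence history.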
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Anomaly Detection Techniques and Applications · Digital Media Forensic Detection
