Multimodal Convolutional Neural Networks for Matching Image and Sentence
Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li

TL;DR
This paper introduces multimodal convolutional neural networks (m-CNNs) that effectively match images with sentences by jointly learning representations and relations, achieving state-of-the-art results on benchmark datasets.
Contribution
The paper presents an end-to-end m-CNN framework that captures inter-modal relations and semantic composition for image-sentence matching, improving on prior methods.
Findings
Achieves state-of-the-art performance on Flickr30K and COCO datasets.
Effectively models semantic composition and inter-modal relations.
Outperforms existing approaches in bidirectional image and sentence retrieval.
Abstract
In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content, and one matching CNN learning the joint representation of image and sentence. The matching CNN composes words into different semantic fragments and learns the inter-modal relations between the image and the composed fragments at different levels, thereby fully exploiting the matching relations between image and sentence. Experimental results on benchmark databases of bidirectional image and sentence retrieval demonstrate that the proposed m-CNNs can effectively capture the information necessary for image and sentence matching.…
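The idea of composing words into fragments and scoring them against an image can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single composition level (one windowed convolution over word vectors), max pooling over positions, and a small scoring network on the concatenated image and sentence vectors, whereas the abstract describes matching at several levels. All names (`compose_fragments`, `match_score`, `init_params`), the dimensions, and the random weights are hypothetical, and the image vector simply stands in for the output of an image CNN.

```python
import random


def relu(x):
    return [max(0.0, v) for v in x]


def linear(weights, bias, x):
    """Dense layer: `weights` is a list of rows, each as long as `x`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def compose_fragments(word_vecs, weights, bias, window=3):
    """1-D convolution over words: each window of consecutive word
    vectors is flattened and mapped to one fragment vector."""
    if len(word_vecs) < window:
        raise ValueError("sentence is shorter than the convolution window")
    fragments = []
    for start in range(len(word_vecs) - window + 1):
        flat = [v for vec in word_vecs[start:start + window] for v in vec]
        fragments.append(relu(linear(weights, bias, flat)))
    return fragments


def max_pool(fragments):
    """Element-wise max over all fragment positions."""
    return [max(column) for column in zip(*fragments)]


def match_score(image_vec, word_vecs, params):
    """Hypothetical matching score for one image-sentence pair."""
    fragments = compose_fragments(
        word_vecs, params["conv_w"], params["conv_b"], params["window"])
    sentence_vec = max_pool(fragments)
    joint = list(image_vec) + sentence_vec  # joint image-sentence vector
    hidden = relu(linear(params["hid_w"], params["hid_b"], joint))
    return linear(params["out_w"], params["out_b"], hidden)[0]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(word_dim, image_dim, frag_dim, hidden_dim, window=3, seed=0):
    """Random, untrained weights; the sizes are illustrative assumptions."""
    rng = random.Random(seed)
    return {
        "window": window,
        "conv_w": random_matrix(frag_dim, window * word_dim, rng),
        "conv_b": [0.0] * frag_dim,
        "hid_w": random_matrix(hidden_dim, image_dim + frag_dim, rng),
        "hid_b": [0.0] * hidden_dim,
        "out_w": random_matrix(1, hidden_dim, rng),
        "out_b": [0.0],
    }


if __name__ == "__main__":
    rng = random.Random(1)
    params = init_params(word_dim=4, image_dim=5, frag_dim=6, hidden_dim=8)
    image_vec = [rng.random() for _ in range(5)]               # stand-in image feature
    sentence = [[rng.random() for _ in range(4)] for _ in range(7)]  # 7 word vectors
    print(match_score(image_vec, sentence, params))
```

In a trained system the score would be learned so that matching pairs rank above mismatched ones; here the weights are random, so the printed value only shows that the pieces fit together.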
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
