Joint Learning of Distributed Representations for Images and Texts
Xiaodong He, Rupesh Srivastava, Jianfeng Gao, Li Deng

TL;DR
This technical report details the deep multimodal similarity model (DMSM), which learns joint representations of images and texts by maximizing the semantic similarity between images and their captions in the Microsoft COCO dataset, with the aim of improving image-caption matching.
Contribution
It gives additional detail on a deep model, first proposed by Fang et al. (2015), for learning joint image-text representations that attempt to capture combinations of visual concepts and cues for multimodal similarity tasks.
Findings
Effective joint representations learned for images and texts.
Improved semantic similarity matching performance.
Trained on the public Microsoft COCO dataset, a large set of images with corresponding captions.
Abstract
This technical report provides additional details of the deep multimodal similarity model (DMSM) proposed by Fang et al. (2015, arXiv:1411.4952). The model is trained by maximizing the global semantic similarity between images and their natural-language captions, using the public Microsoft COCO database, which consists of a large set of images and their corresponding captions. The learned representations attempt to capture the combination of various visual concepts and cues.
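The abstract does not spell out the training objective, so the following is only a minimal sketch of what "maximizing semantic similarity" between an image and its caption could look like. It assumes cosine similarity between the two representation vectors and a softmax over one matching caption and several non-matching ones. The function names, the smoothing factor `gamma`, and the toy vectors are hypothetical and are not taken from the report.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def caption_posterior(image_vec, caption_vecs, gamma=10.0):
    """Softmax over scaled cosine similarities (gamma is an assumed smoothing factor)."""
    scores = [gamma * cosine(image_vec, c) for c in caption_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_loss(image_vec, caption_vecs, positive_index=0, gamma=10.0):
    """Negative log-probability of the matching caption; lower means more similar."""
    probs = caption_posterior(image_vec, caption_vecs, gamma)
    return -math.log(probs[positive_index])

# Toy vectors standing in for learned image and caption representations.
image = [1.0, 0.0]
captions = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]  # index 0 is the matching caption
probs = caption_posterior(image, captions)
loss = similarity_loss(image, captions)
```

Under these assumptions, minimizing `similarity_loss` over many image-caption pairs pulls each image's vector toward its own caption and away from the sampled non-matching captions.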
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Image Retrieval and Classification Techniques
