Joint Learning of Distributed Representations for Images and Texts

Xiaodong He; Rupesh Srivastava; Jianfeng Gao; Li Deng

arXiv:1504.03083 · cs.CV · April 29, 2015

TL;DR

This technical report details the deep multimodal similarity model (DMSM), which learns joint representations of images and texts by maximizing the semantic similarity between images and their captions, trained on the Microsoft COCO dataset.

Contribution

It provides additional details of the DMSM, originally proposed in Fang et al. (2015), a deep model for learning joint image-text representations that attempt to capture combinations of visual concepts and cues.

Findings

01 Learns joint representations for images and their captions.

02 Trains by maximizing global semantic similarity between images and captions.

03 Uses the large-scale Microsoft COCO dataset.

Abstract

This technical report provides extra details of the deep multimodal similarity model (DMSM) which was proposed in (Fang et al. 2015, arXiv:1411.4952). The model is trained via maximizing global semantic similarity between images and their captions in natural language using the public Microsoft COCO database, which consists of a large set of images and their corresponding captions. The learned representations attempt to capture the combination of various visual concepts and cues.
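The abstract does not spell out the training objective. As a rough illustration only, the sketch below assumes a common formulation for similarity models of this kind: cosine similarity between an image vector and caption vectors, a softmax over candidate captions, and a negative log-likelihood loss on the matching caption. The function names, the smoothing factor `gamma`, and the toy vectors are hypothetical and are not taken from the report, whose exact objective may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def caption_posterior(image_vec, caption_vecs, gamma=10.0):
    """Softmax over scaled cosine similarities (gamma is an assumed value)."""
    scores = [gamma * cosine(image_vec, c) for c in caption_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matching_loss(image_vec, caption_vecs, positive_index=0, gamma=10.0):
    """Negative log-probability of the caption that truly matches the image."""
    probs = caption_posterior(image_vec, caption_vecs, gamma)
    return -math.log(probs[positive_index])

# Toy vectors standing in for learned image and caption representations.
image = [0.9, 0.1, 0.0]
captions = [
    [1.0, 0.0, 0.0],  # assumed matching caption
    [0.0, 1.0, 0.0],  # assumed non-matching caption
    [0.0, 0.0, 1.0],  # assumed non-matching caption
]

probs = caption_posterior(image, captions)
loss = matching_loss(image, captions, positive_index=0)
```

Under these assumptions, minimizing the loss pushes the image vector toward its own caption and away from the other candidates, which is one way to read "maximizing global semantic similarity". In a real model the vectors would come from trainable image and text networks rather than fixed lists.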

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Image Retrieval and Classification Techniques