Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Ryan Kiros; Ruslan Salakhutdinov; Richard S. Zemel

arXiv:1411.2539·cs.LG·November 11, 2014·1.3k cites

TL;DR

This paper presents a unified encoder-decoder model that learns a joint image-text embedding space and generates image descriptions from it, matching state-of-the-art results on Flickr8K and Flickr30K without relying on object detections.

Contribution

It introduces a novel encoder-decoder framework that combines joint embedding learning with a neural language model, unifying image-text retrieval and caption generation.

Findings

01

Matches state-of-the-art performance on Flickr8K and Flickr30K using an LSTM sentence encoder, without object detections.

02

Demonstrates that the embedding space captures multimodal regularities like vector arithmetic.

03

Sets new best results when image features come from the 19-layer Oxford convolutional network.
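The vector-arithmetic finding (02) can be made concrete with a small sketch. It assumes images and words live in one shared space, so that a query such as "image of a blue car" minus the word "blue" plus the word "red" should land near images of red cars. The 3-d vectors, file names, and helper functions below are hypothetical toy values chosen for illustration; they are not taken from the paper or its released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy_query(image_vec, minus_word, plus_word):
    """Build the query vector: image - minus_word + plus_word."""
    return [i - m + p for i, m, p in zip(image_vec, minus_word, plus_word)]

def nearest(query, candidates):
    """Return the name of the candidate most similar to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

# Hypothetical hand-made embeddings; axes are [car-ness, blue-ness, red-ness].
words = {"blue": [0.0, 1.0, 0.0], "red": [0.0, 0.0, 1.0]}
images = {
    "blue_car.jpg": [1.0, 0.9, 0.0],
    "red_car.jpg": [1.0, 0.0, 0.9],
    "blue_sky.jpg": [0.0, 1.0, 0.0],
}

query = analogy_query(images["blue_car.jpg"], words["blue"], words["red"])
# Exclude the query image itself from the retrieval pool.
pool = {name: vec for name, vec in images.items() if name != "blue_car.jpg"}
result = nearest(query, pool)
print(result)
```

With real learned embeddings the same three steps apply (compose the query, exclude the source image, retrieve by cosine similarity), but the quality of the retrieved neighbours depends entirely on the trained model.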

Abstract

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence to its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network.…
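The abstract says the encoder lets one rank images and sentences but does not spell out the training objective. A common choice for joint embedding spaces of this kind is a max-margin pairwise ranking loss over cosine similarities, sketched below as an assumption rather than as the authors' exact formulation. The function names, the margin value of 0.2, and the toy 2-d embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_ranking_loss(image_embs, sent_embs, margin=0.2):
    """Sum of hinge terms over mismatched pairs, in both directions.

    image_embs[i] and sent_embs[i] are assumed to be a matching pair;
    every other index k acts as a contrastive (non-matching) example.
    The margin of 0.2 is an illustrative default, not a value from the paper.
    """
    loss = 0.0
    n = len(image_embs)
    for i in range(n):
        positive = cosine(image_embs[i], sent_embs[i])
        for k in range(n):
            if k == i:
                continue
            # Image i should score higher with its own sentence than with sentence k.
            loss += max(0.0, margin - positive + cosine(image_embs[i], sent_embs[k]))
            # Sentence i should score higher with its own image than with image k.
            loss += max(0.0, margin - positive + cosine(image_embs[k], sent_embs[i]))
    return loss

def rank_sentences(image_emb, sent_embs):
    """Return sentence indices sorted from most to least similar to the image."""
    return sorted(range(len(sent_embs)), key=lambda j: -cosine(image_emb, sent_embs[j]))

# Toy embeddings: two images and two sentences in a shared 2-d space.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_sents = [[1.0, 0.0], [0.0, 1.0]]   # each sentence matches its image
swapped_sents = [[0.0, 1.0], [1.0, 0.0]]   # each sentence matches the wrong image

loss_aligned = pairwise_ranking_loss(images, aligned_sents)
loss_swapped = pairwise_ranking_loss(images, swapped_sents)
ranking = rank_sentences(images[0], aligned_sents)
print(loss_aligned, loss_swapped, ranking)
```

The loss is zero once every matching pair beats every mismatched pair by at least the margin, which is why the aligned toy case incurs no penalty while the swapped case does. The decoder side of the pipeline (the structure-content neural language model) is not covered by this sketch.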

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory