Image Pivoting for Learning Multilingual Multimodal Representations

Spandana Gella; Rico Sennrich; Frank Keller; Mirella Lapata

arXiv:1707.07601·cs.CL·July 25, 2017

TL;DR

This paper introduces a model that learns a shared representation for images and their descriptions in two languages by treating the image as a pivot between the languages, so the descriptions need not be parallel. The shared space is used to match images and sentences across languages.

Contribution

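The paper proposes a pivot-based approach to multilingual multimodal learning, together with a new pairwise ranking loss that handles both symmetric and asymmetric similarity between images and text. It reports state-of-the-art results on German-English image-description ranking and on English semantic textual similarity.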

Findings

01

State-of-the-art performance on German-English image-description ranking

02

Multilingual image-sentence matching learned from descriptions that need not be parallel

03

State-of-the-art performance on semantic textual similarity of English image descriptions

Abstract

In this paper we propose a model to learn multimodal multilingual representations for matching images and sentences in different languages, with the aim of advancing multilingual versions of image search and image understanding. Our model learns a common representation for images and their descriptions in two different languages (which need not be parallel) by considering the image as a pivot between two languages. We introduce a new pairwise ranking loss function which can handle both symmetric and asymmetric similarity between the two modalities. We evaluate our models on image-description ranking for German and English, and on semantic textual similarity of image descriptions in English. In both cases we achieve state-of-the-art performance.
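
The abstract does not give the loss in closed form, so the following is only a minimal sketch of how an image-pivoted pairwise ranking loss could look, not the authors' implementation. It assumes a standard max-margin formulation over a batch of matching image-caption pairs, cosine as the symmetric similarity, and an order-violation penalty as one possible asymmetric similarity. The function names (`cosine`, `order_violation`, `ranking_loss`, `pivot_loss`), the margin value, and the use of plain Python lists as embeddings are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Symmetric similarity: cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def order_violation(u, v):
    """Asymmetric similarity (assumed form): -||max(0, v - u)||^2.

    The score is 0 when u dominates v in every coordinate and negative
    otherwise, so swapping the arguments generally changes the value.
    """
    return -sum(max(0.0, b - a) ** 2 for a, b in zip(u, v))


def ranking_loss(images, captions, sim, margin=0.2):
    """Max-margin ranking loss; images[k] is assumed to match captions[k].

    Every other caption in the batch is a contrastive caption for image k,
    and every other image is a contrastive image for caption k.
    """
    loss = 0.0
    n = len(images)
    for k in range(n):
        pos = sim(captions[k], images[k])
        for j in range(n):
            if j == k:
                continue
            loss += max(0.0, margin - pos + sim(captions[j], images[k]))
            loss += max(0.0, margin - pos + sim(captions[k], images[j]))
    return loss


def pivot_loss(en_pairs, de_pairs, sim, margin=0.2):
    """Sum of per-language losses; the image space is the only shared link.

    Each argument is an (images, captions) tuple. The two batches may cover
    different images, which is why the descriptions need not be parallel.
    """
    en_images, en_captions = en_pairs
    de_images, de_captions = de_pairs
    return (ranking_loss(en_images, en_captions, sim, margin)
            + ranking_loss(de_images, de_captions, sim, margin))
```

In this sketch the two languages never interact directly: each caption encoder is pulled toward the same image embeddings, which is what makes the image act as the pivot. Passing `sim=cosine` or `sim=order_violation` switches between the symmetric and asymmetric cases without changing the loss itself.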
