Learning Modality-Invariant Representations for Speech and Images

Kenneth Leidal, David Harwath, and James Glass

arXiv:1712.03897 · cs.LG · December 12, 2017


TL;DR

This paper introduces an unsupervised method for learning a shared semantic space for speech and images. It borrows a regularization term from variational autoencoders (VAEs) to filter modality-specific information out of the embeddings while preserving semantic information.

Contribution

The paper proposes a VAE-based technique that maps each input to a diagonal Gaussian over semantic embeddings and aligns the embeddings of co-occurring speech and image inputs, yielding modality-invariant representations.

Findings

01 Effective mapping of speech and image inputs to a shared semantic space

02 Regularization filters modality-specific information while preserving semantics

03 Potential applicability to cross-modality retrieval and transfer learning

Abstract

In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs. Specifically, we focus on the task of learning a semantic vector space for both spoken and handwritten digits using the TIDIGITS and MNIST datasets. Current techniques encode image and audio/textual inputs directly to semantic embeddings. In contrast, our technique maps an input to the mean and log variance vectors of a diagonal Gaussian, from which sample semantic embeddings are drawn. In addition to encouraging semantic similarity between co-occurring inputs, our loss function includes a regularization term borrowed from variational autoencoders (VAEs), which drives the posterior distributions over embeddings to be unit Gaussian. We can use this regularization term to filter out modality information while preserving semantic information. We speculate this technique may be…
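The two ingredients named in the abstract, drawing an embedding from a diagonal Gaussian given its mean and log variance, and a VAE-style term that pulls that Gaussian toward a unit Gaussian, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes NumPy, the function names `sample_embedding` and `kl_to_unit_gaussian` are hypothetical, and the regularizer is written as the standard closed-form KL divergence used in VAEs. The semantic-similarity term of the loss is left out because the abstract does not give its form.

```python
import numpy as np


def sample_embedding(mean, log_var, rng):
    """Draw one embedding per row from N(mean, diag(exp(log_var))).

    Uses the reparameterization z = mean + sigma * eps, eps ~ N(0, I),
    which is the usual VAE sampling step (assumed here, not confirmed
    by the abstract).
    """
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_to_unit_gaussian(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over the last axis.

    Closed form per dimension: 0.5 * (sigma^2 + mean^2 - 1 - log sigma^2).
    """
    return 0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var, axis=-1)


# Hypothetical encoder outputs for a batch of 3 inputs and 4-dim embeddings.
rng = np.random.default_rng(0)
mean = rng.standard_normal((3, 4))
log_var = np.zeros((3, 4))

z = sample_embedding(mean, log_var, rng)    # shape (3, 4)
penalty = kl_to_unit_gaussian(mean, log_var)  # shape (3,), one value per input
```

The penalty is zero only when the posterior is exactly the unit Gaussian, so weighting it more heavily leaves less room in the embedding for information that is not needed to satisfy the similarity objective.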
