Learning Modality-Invariant Representations for Speech and Images
Kenneth Leidal, David Harwath, and James Glass

TL;DR
This paper introduces an unsupervised method for learning a shared semantic space for speech and images, using a regularization term borrowed from variational autoencoders to filter out modality-specific information while preserving semantic information.
Contribution
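It proposes a VAE-based technique for aligning semantic embeddings of co-occurring sensory inputs: rather than encoding an input directly to an embedding, it maps the input to the mean and log-variance vectors of a diagonal Gaussian from which embeddings are sampled, enabling modality-invariant representations for speech and images.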
Findings
Effective mapping of speech and image inputs to a shared semantic space
Regularization filters modality-specific information while preserving semantics
Potential applicability to cross-modality retrieval and transfer learning
Abstract
In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs. Specifically, we focus on the task of learning a semantic vector space for both spoken and handwritten digits using the TIDIGITS and MNIST datasets. Current techniques encode image and audio/textual inputs directly to semantic embeddings. In contrast, our technique maps an input to the mean and log variance vectors of a diagonal Gaussian from which sample semantic embeddings are drawn. In addition to encouraging semantic similarity between co-occurring inputs, our loss function includes a regularization term borrowed from variational autoencoders (VAEs) which drives the posterior distributions over embeddings to be unit Gaussian. We can use this regularization term to filter out modality information while preserving semantic information. We speculate this technique may be…
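As a rough illustration of the loss structure the abstract describes, the sketch below samples an embedding from a diagonal Gaussian given its mean and log-variance vectors, and computes the standard closed-form KL divergence that pulls such a Gaussian toward a unit Gaussian. The abstract does not specify the similarity term, so the squared Euclidean distance used here is an assumption, as are the weight `beta` and every function name; this is not the authors' implementation.

```python
import numpy as np


def sample_embedding(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_to_unit_gaussian(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var, axis=-1)


def similarity_penalty(z_a, z_b):
    """Assumed similarity term: squared Euclidean distance between samples."""
    return np.sum((z_a - z_b) ** 2, axis=-1)


def paired_loss(mean_a, log_var_a, mean_b, log_var_b, rng, beta=1.0):
    """Similarity term plus a beta-weighted KL regularizer for both modalities.

    `beta` is a hypothetical weight, not a value taken from the paper.
    """
    z_a = sample_embedding(mean_a, log_var_a, rng)
    z_b = sample_embedding(mean_b, log_var_b, rng)
    kl = (kl_to_unit_gaussian(mean_a, log_var_a)
          + kl_to_unit_gaussian(mean_b, log_var_b))
    return float(np.mean(similarity_penalty(z_a, z_b) + beta * kl))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for encoder outputs on a batch of 4 co-occurring pairs.
    mean_speech = rng.normal(size=(4, 8))
    mean_image = rng.normal(size=(4, 8))
    log_var = np.zeros((4, 8))
    print(paired_loss(mean_speech, log_var, mean_image, log_var, rng))
```

Raising `beta` in this sketch pushes both posteriors harder toward the unit Gaussian, which is the mechanism the abstract credits with filtering out modality information; how that trade-off is tuned in the paper is not stated in the excerpt above.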
