Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images

Leanne Nortje; Herman Kamper

arXiv:2008.06258·cs.CL·August 17, 2020

TL;DR

This paper compares unsupervised and transfer learning methods for one-shot speech-image matching and finds that transfer learning outperforms the unsupervised approaches on a dataset of paired isolated spoken and visual digits.

Contribution

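The paper systematically compares two ways of learning representations for multimodal one-shot matching: transfer learning, where supervised models are trained on labelled background data that contains none of the one-shot classes, and unsupervised learning on unlabelled in-domain data.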

Findings

01 Transfer learning outperforms unsupervised models in one-shot matching.

02 Unsupervised autoencoder-like models are less effective than supervised classifiers.

03 Combining unsupervised and transfer learning does not significantly improve performance.

Abstract

We consider the task of multimodal one-shot speech-image matching. An agent is shown a picture along with a spoken word describing the object in the picture, e.g. cookie, broccoli and ice-cream. After observing one paired speech-image example per class, it is shown a new set of unseen pictures, and asked to pick the "ice-cream". Previous work attempted to tackle this problem using transfer learning: supervised models are trained on labelled background data not containing any of the one-shot classes. Here we compare transfer learning to unsupervised models trained on unlabelled in-domain data. On a dataset of paired isolated spoken and visual digits, we specifically compare unsupervised autoencoder-like models to supervised classifier and Siamese neural networks. In both unimodal and multimodal few-shot matching experiments, we find that transfer learning outperforms unsupervised…
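
The matching task described in the abstract can be sketched as a two-step nearest-neighbour comparison in embedding space. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the choice of cosine distance, and the toy embeddings are all hypothetical. In practice the embeddings would come from whichever speech and image models (transfer-learned or unsupervised) are being compared.

```python
import numpy as np


def cosine_distances(query, candidates):
    """Cosine distance between one vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return 1.0 - candidates @ query


def one_shot_match(speech_query, support_speech, support_images, matching_images):
    """Return the index of the matching-set image chosen for a spoken query.

    support_speech[i] and support_images[i] are the embeddings of the single
    paired speech-image example observed for class i (hypothetical layout).
    """
    # Step 1 (speech to speech): find the support-set word closest to the query.
    nearest_class = int(np.argmin(cosine_distances(speech_query, support_speech)))
    # Step 2 (image to image): compare that word's paired image to the unseen images.
    paired_image = support_images[nearest_class]
    return int(np.argmin(cosine_distances(paired_image, matching_images)))


# Toy embeddings, invented purely for illustration.
support_speech = np.eye(3)                                   # one spoken word per class
support_images = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
speech_query = np.array([0.1, 0.9, 0.2])                     # closest to class 1
matching_images = np.array([[1.0, 0.1], [-1.0, 0.2], [0.1, 1.0]])

chosen = one_shot_match(speech_query, support_speech, support_images, matching_images)
```

Because no cross-modal distance is computed directly, the quality of each unimodal embedding space determines the result, which is why the comparison between transfer-learned and unsupervised representations matters for this task.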

Code & Models

Repositories

LeanneNortje/multimodal_speech-image_matching
tf · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech Recognition and Synthesis · Human Pose and Action Recognition