Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images
Leanne Nortje, Herman Kamper

TL;DR
This paper compares unsupervised and transfer learning methods for one-shot speech-image matching, finding that transfer learning generally outperforms unsupervised approaches on a dataset of paired spoken and visual digits.
Contribution
It provides a systematic comparison between unsupervised and transfer learning for multimodal one-shot matching, highlighting the advantages of transfer learning.
Findings
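Transfer learning outperforms unsupervised models in both unimodal and multimodal few-shot matching.
Unsupervised autoencoder-like models, trained on unlabelled in-domain data, are less effective than supervised classifier and Siamese networks trained on labelled background data.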
Combining unsupervised and transfer learning does not significantly improve performance.
Abstract
We consider the task of multimodal one-shot speech-image matching. An agent is shown a picture along with a spoken word describing the object in the picture, e.g. cookie, broccoli and ice-cream. After observing one paired speech-image example per class, it is shown a new set of unseen pictures, and asked to pick the "ice-cream". Previous work attempted to tackle this problem using transfer learning: supervised models are trained on labelled background data not containing any of the one-shot classes. Here we compare transfer learning to unsupervised models trained on unlabelled in-domain data. On a dataset of paired isolated spoken and visual digits, we specifically compare unsupervised autoencoder-like models to supervised classifier and Siamese neural networks. In both unimodal and multimodal few-shot matching experiments, we find that transfer learning outperforms unsupervised…
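The matching task described in the abstract can be sketched as a two-step nearest-neighbour lookup. This is only an illustrative sketch, not the authors' implementation: it assumes that speech and images have already been mapped to fixed-dimensional embeddings (for instance by one of the autoencoder-like, classifier, or Siamese models being compared), that cosine distance is used for comparison, and that the query is routed through the support set. The function names and toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query, candidates):
    """Index of the candidate closest to the query."""
    return min(range(len(candidates)),
               key=lambda i: cosine_distance(query, candidates[i]))


def one_shot_match(speech_query, support_set, matching_images):
    """Pick the image in matching_images that best matches a spoken query.

    support_set: list of (speech_embedding, image_embedding) pairs,
    one pair per one-shot class (assumed format).
    """
    # Step 1: find the support speech example closest to the spoken query.
    support_speech = [speech for speech, _ in support_set]
    k = nearest(speech_query, support_speech)
    # Step 2: use the image paired with it to search the unseen images.
    _, paired_image = support_set[k]
    return nearest(paired_image, matching_images)


# Toy embeddings (made up for illustration only).
support_set = [
    ([1.0, 0.0], [0.9, 0.1, 0.0]),  # class "cookie"
    ([0.0, 1.0], [0.0, 0.2, 0.9]),  # class "ice-cream"
]
matching_images = [
    [0.1, 0.1, 1.0],  # unseen picture of ice-cream
    [1.0, 0.0, 0.1],  # unseen picture of a cookie
]
speech_query = [0.1, 0.9]  # spoken "ice-cream"
picked = one_shot_match(speech_query, support_set, matching_images)
print(picked)
```

Under these assumptions, the quality of the match depends entirely on how well the embeddings separate unseen classes, which is where the choice between unsupervised and transfer-learned models matters.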
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech Recognition and Synthesis · Human Pose and Action Recognition
