Direct multimodal few-shot learning of speech and images

Leanne Nortje; Herman Kamper

arXiv:2012.05680 · cs.CL · July 30, 2021

TL;DR

This paper introduces direct multimodal few-shot learning models that learn a shared embedding space for speech and images, outperforming previous indirect methods by combining unsupervised and transfer learning.

Contribution

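The paper presents two direct models, a multimodal triplet network (MTriplet) and a multimodal correspondence autoencoder (MCAE), that learn a single shared embedding space for speech and images. Because inputs from the two modalities are compared directly, these models avoid the errors introduced by the two-step indirect approach and improve few-shot multimodal recognition.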

Findings

01. Direct models outperform indirect models in speech-image matching.

02. MTriplet achieves the highest five-shot accuracy.

03. The combination of unsupervised and transfer learning contributes to the improvements of the direct models.

Abstract

We propose direct multimodal few-shot models that learn a shared embedding space of spoken words and images from only a few paired examples. Imagine an agent is shown an image along with a spoken word describing the object in the picture, e.g. pen, book and eraser. After observing a few paired examples of each class, the model is asked to identify the "book" in a set of unseen pictures. Previous work used a two-step indirect approach relying on learned unimodal representations: speech-speech and image-image comparisons are performed across the support set of given speech-image pairs. We propose two direct models which instead learn a single multimodal space where inputs from different modalities are directly comparable: a multimodal triplet network (MTriplet) and a multimodal correspondence autoencoder (MCAE). To train these direct models, we mine speech-image pairs: the support set is…
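
The difference between the two-step indirect approach and direct matching described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`indirect_match`, `direct_match`, `nearest`), the use of cosine distance, and the toy 2-D embeddings are all assumptions made for illustration, and the embeddings are taken as given rather than produced by a trained MTriplet or MCAE model.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, candidates):
    """Return the index of the candidate closest to the query."""
    return min(range(len(candidates)),
               key=lambda i: cosine_distance(query, candidates[i]))


def indirect_match(speech_query, support_set, matching_images):
    """Two-step matching over unimodal embeddings (hypothetical helper).

    support_set is a list of (speech_embedding, image_embedding) pairs.
    """
    # Step 1: speech-speech comparison against the support set.
    support_speech = [speech for speech, _ in support_set]
    k = nearest(speech_query, support_speech)
    # Step 2: image-image comparison using the paired support image.
    _, paired_image = support_set[k]
    return nearest(paired_image, matching_images)


def direct_match(speech_query, matching_images):
    """One-step matching, assuming both inputs live in one shared space."""
    return nearest(speech_query, matching_images)


# Toy data: two classes, 2-D embeddings (illustrative values only).
support = [([1.0, 0.0], [0.0, 1.0]),   # class A: (speech, image)
           ([0.0, 1.0], [1.0, 0.0])]   # class B: (speech, image)
unseen_images = [[1.0, 0.2], [0.1, 0.9]]

indirect_result = indirect_match([0.9, 0.1], support, unseen_images)
direct_result = direct_match([0.2, 0.8], unseen_images)
```

In the indirect sketch, a mistake in step 1 selects the wrong support image, and step 2 cannot recover from it. The direct sketch involves a single comparison, which is the property the abstract attributes to the shared multimodal space.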

Code & Models

Repositories

LeanneNortje/direct_multimodal_few-shot_learning
tf · Official
