Large-scale representation learning from visually grounded untranscribed speech

Gabriel Ilharco; Yuan Zhang; Jason Baldridge

arXiv:1909.08782·cs.CV·September 20, 2019

TL;DR

This paper introduces a scalable approach for learning joint representations of images and spoken audio captions, improving image-caption retrieval performance and highlighting the limitations of automatic evaluation metrics.

Contribution

The paper presents a scalable method for automatically generating diverse audio for image captioning datasets, and a dual encoder model trained with a masked margin softmax loss to align the latent representations of audio and images.
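To make the loss concrete, here is a minimal sketch of one plausible reading of a masked margin softmax loss over an in-batch similarity matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name, the default `margin` value, the way the mask is applied, and the averaging are all hypothetical choices, and the paper's exact formulation may differ. It is written in plain Python for readability; a real model would use a tensor library.

```python
import math


def masked_margin_softmax_loss(sim, margin=0.1, mask=None):
    """Bidirectional softmax loss over in-batch pairs (illustrative sketch).

    sim[i][j] is the similarity between image i and audio caption j;
    the diagonal holds the annotated (positive) pairs.
    mask[i][j] == True removes pair (i, j) from the negatives, for example
    when caption j is known to describe image i as well (assumed rule).
    The margin value is an arbitrary placeholder, not taken from the paper.
    """
    n = len(sim)
    if mask is None:
        mask = [[False] * n for _ in range(n)]

    def one_direction(s, m):
        total = 0.0
        for i in range(n):
            positive = s[i][i] - margin  # margin makes the positive harder
            logits = [positive]
            for j in range(n):
                if j != i and not m[i][j]:
                    logits.append(s[i][j])  # unmasked in-batch negatives
            peak = max(logits)  # log-sum-exp with max subtraction for stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += log_z - positive  # negative log-softmax of the positive
        return total / n

    sim_t = [list(row) for row in zip(*sim)]
    mask_t = [list(row) for row in zip(*mask)]
    # Sum the image-to-audio and audio-to-image directions.
    return one_direction(sim, mask) + one_direction(sim_t, mask_t)
```

Unlike a triplet loss, which compares the positive pair against one sampled negative at a time, this form scores the positive against every unmasked negative in the batch at once.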

Findings

1. Achieved state-of-the-art results on the Flickr8k Audio Captions Corpus, improving recall in the top 10 from 29.6% to 49.5%.

2. Showed that the masked margin softmax loss outperforms the standard triplet loss for dual encoder models.

3. Found, using human ratings of retrieval outputs, that automatic evaluation substantially underestimates the quality of retrieval results.

Abstract

Systems that can associate images with their spoken audio captions are an important step towards visually grounded language learning. We describe a scalable method to automatically generate diverse audio for image captioning datasets. This supports pretraining deep networks for encoding both audio and images, which we do via a dual encoder that learns to align latent representations from both modalities. We show that a masked margin softmax loss for such models is superior to the standard triplet loss. We fine-tune these models on the Flickr8k Audio Captions Corpus and obtain state-of-the-art results, improving recall in the top 10 from 29.6% to 49.5%. We also obtain human ratings on retrieval outputs to better assess the impact of incidentally matching image-caption pairs that were not associated in the data, finding that automatic evaluation substantially underestimates the quality…
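The "recall in the top 10" figure quoted in the abstract is a recall@K metric. The sketch below shows how such a metric is commonly computed from a query-by-candidate similarity matrix; the function name and the convention that query i's annotated match is candidate i are assumptions made for illustration, not details taken from the paper.

```python
def recall_at_k(sim, k=10):
    """Fraction of queries whose annotated match appears in the top k.

    sim[i][j] is the similarity between query i (e.g. a spoken caption)
    and candidate j (e.g. an image). Query i's annotated match is assumed
    to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy example with three queries and three candidates.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(toy_sim, k=1))
print(recall_at_k(toy_sim, k=2))
```

Because only the annotated pair counts as a hit, a retrieved image that happens to fit the caption equally well is scored as a miss. This is the kind of incidental match that the human ratings mentioned above are meant to capture.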

Taxonomy

Methods: Softmax