Semantic speech retrieval with a visually grounded model of untranscribed speech

Herman Kamper; Gregory Shakhnarovich; Karen Livescu

arXiv:1710.01949·cs.CL·November 2, 2018

TL;DR

This paper presents a visually grounded speech model, trained on images paired with spoken captions and no transcriptions, that maps speech to semantic keywords. On semantic speech retrieval it reaches a precision of almost 60% on its top ten retrievals, and it outperforms supervised models at retrieving non-verbatim semantic matches.

Contribution

The study introduces a newly collected data set of human semantic relevance judgements and an associated task, semantic speech retrieval, and shows that a model trained on paired images and speech can learn semantic representations without transcriptions.

Findings

01

Achieves nearly 60% precision in top ten semantic retrievals

02

Outperforms supervised models in retrieving non-verbatim semantic matches

03

Provides extensive analysis of learned speech representations

Abstract

There is growing interest in models that can learn from unlabelled speech paired with visual context. This setting is relevant for low-resource speech processing, robotics, and human language acquisition research. Here we study how a visually grounded speech model, trained on images of scenes paired with spoken captions, captures aspects of semantics. We use an external image tagger to generate soft text labels from images, which serve as targets for a neural model that maps untranscribed speech to (semantic) keyword labels. We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic speech retrieval, where the goal is to search for spoken utterances that are semantically relevant to a given text query. Without seeing any text, the model trained on parallel speech and images achieves a precision of almost 60% on its top ten semantic…
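The two quantitative ideas in the abstract, training against soft labels from an image tagger and scoring the top ten retrievals, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the choice of a summed binary cross-entropy as the training loss, and the toy data are not taken from the paper.

```python
import math


def soft_label_bce(predicted, soft_targets, eps=1e-12):
    """Summed binary cross-entropy between per-keyword speech-model outputs
    and soft targets in [0, 1] (for example, image-tagger probabilities).

    Assumption: the abstract only says the tagger outputs serve as targets;
    this particular loss is one plausible choice, not the confirmed one.
    """
    loss = 0.0
    for p, t in zip(predicted, soft_targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss


def precision_at_k(scores, relevant, k=10):
    """Precision of the k highest-scoring utterances for one text query.

    scores:   dict mapping utterance id -> model score for the query keyword
    relevant: set of utterance ids judged semantically relevant to the query
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for utt in ranked if utt in relevant)
    return hits / k


# Toy usage with hypothetical utterance ids and scores.
scores = {"utt_a": 0.9, "utt_b": 0.8, "utt_c": 0.1, "utt_d": 0.7}
relevant = {"utt_a", "utt_d"}
p_at_2 = precision_at_k(scores, relevant, k=2)
```

In this toy case the two top-ranked utterances are `utt_a` and `utt_b`, of which only `utt_a` is relevant, so `p_at_2` is 0.5. The reported figure of almost 60% corresponds to the same measure with k = 10, computed against the human relevance judgements.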
