TL;DR
This paper presents a visually grounded speech model, trained on images paired with spoken captions, that captures semantic content well enough to support semantic speech retrieval without transcriptions. It outperforms supervised models at retrieving non-verbatim semantic matches.
Contribution
The study introduces a new data set of human semantic relevance judgements and an associated task, semantic speech retrieval. It shows that a model trained only on paired images and speech can learn semantic representations without transcriptions.
Findings
Achieves a precision of almost 60% on its top ten semantic retrievals, without seeing any text
Outperforms supervised models in retrieving non-verbatim semantic matches
Provides extensive analysis of learned speech representations
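The "top ten" figure above is a precision-at-ten style measure. As a rough illustration of how such a number can be computed for a single query, here is a minimal sketch. The function name `precision_at_k`, the scores, and the relevance set are all hypothetical and are not taken from the paper; the paper's exact evaluation protocol may differ.

```python
def precision_at_k(scores, relevant, k=10):
    """Fraction of the k highest-scoring utterances that are relevant.

    scores:   one model score per utterance for a single text query
    relevant: set of utterance indices judged semantically relevant
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank utterance indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if i in relevant)
    # Standard convention: divide by k, even if fewer than k items exist.
    return hits / k


# Hypothetical scores for one keyword query over 12 utterances.
scores = [0.91, 0.15, 0.78, 0.66, 0.05, 0.83, 0.42, 0.59, 0.31, 0.72, 0.88, 0.10]
# Hypothetical human relevance judgements (indices of relevant utterances).
relevant = {0, 2, 5, 7, 9, 10}

p10 = precision_at_k(scores, relevant, k=10)
print(f"P@10 = {p10:.2f}")
```

In practice the value would be averaged over many queries; this sketch covers one query only.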
Abstract
There is growing interest in models that can learn from unlabelled speech paired with visual context. This setting is relevant for low-resource speech processing, robotics, and human language acquisition research. Here we study how a visually grounded speech model, trained on images of scenes paired with spoken captions, captures aspects of semantics. We use an external image tagger to generate soft text labels from images, which serve as targets for a neural model that maps untranscribed speech to (semantic) keyword labels. We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic speech retrieval, where the goal is to search for spoken utterances that are semantically relevant to a given text query. Without seeing any text, the model trained on parallel speech and images achieves a precision of almost 60% on its top ten semantic…
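The abstract says the image tagger's soft text labels serve as training targets for the speech model, but it does not state the loss. One plausible choice, assumed here purely for illustration, is a per-keyword binary cross-entropy summed over the vocabulary. The function name `soft_label_bce`, the vocabulary, and all probabilities below are hypothetical.

```python
import math


def soft_label_bce(predicted, soft_targets, eps=1e-7):
    """Summed binary cross-entropy between predicted keyword probabilities
    and soft targets in [0, 1] (assumed loss, not confirmed by the source)."""
    total = 0.0
    for p, t in zip(predicted, soft_targets):
        # Clamp predictions to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


# Hypothetical keyword vocabulary and tagger output for one image.
vocab = ["dog", "beach", "ball"]
tagger_probs = [0.9, 0.2, 0.6]

# Hypothetical speech-model outputs for the paired spoken caption.
good = [0.9, 0.2, 0.6]  # matches the soft targets
bad = [0.1, 0.8, 0.4]   # disagrees with the soft targets

loss_good = soft_label_bce(good, tagger_probs)
loss_bad = soft_label_bce(bad, tagger_probs)
print(f"matching: {loss_good:.3f}, mismatched: {loss_bad:.3f}")
```

With soft targets the loss never reaches zero, but it is smallest when the speech model's output matches the tagger's probabilities, so training pushes the two to agree without any transcriptions.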
