Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval
Ramon Sanabria, Austin Waters, Jason Baldridge

TL;DR
This paper studies speech-based image retrieval, comparing encoder architectures and training methods, and shows that well-designed speech models can match or outperform cascaded ASR-based approaches, especially on challenging speech such as spontaneous or accented speech.
Contribution
It provides a comprehensive analysis of encoder architectures and training strategies, showing that speech-based retrieval can rival or surpass ASR-based methods in difficult scenarios.
Findings
The best configuration raises recall-at-one over the previous state of the art from 21.8% to 33.2% on Flickr Audio and from 27.6% to 53.4% on Places Audio.
Speech models can outperform cascaded ASR-to-text systems on spontaneous and accented speech.
Experiments on three datasets (Flickr Audio, Places Audio, and Localized Narratives) cover different types of speech.
Abstract
Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over state of the art, e.g., pushing recall-at-one from 21.8% to 33.2% for Flickr Audio and 27.6% to 53.4% for Places Audio. We also show our best speech-based models can match or exceed cascaded ASR-to-text…
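The recall-at-one figures quoted above measure how often the correct image is ranked first for a spoken query. The following is a minimal sketch of how such a metric is commonly computed, not the authors' evaluation code. It assumes a precomputed similarity matrix and a one-to-one pairing in which the correct image for query i is image i; the function name `recall_at_k` and the toy scores are hypothetical. Datasets with several spoken captions per image would need an explicit query-to-image mapping instead.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose matching image ranks in the top k.

    similarity[i][j] is the score between spoken query i and image j.
    Assumption (for illustration only): the correct image for query i is image i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Rank image indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three spoken queries against three images (made-up numbers).
scores = [
    [0.9, 0.1, 0.3],  # query 0: correct image 0 ranks first
    [0.2, 0.4, 0.8],  # query 1: correct image 1 ranks second
    [0.1, 0.2, 0.7],  # query 2: correct image 2 ranks first
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

In this toy case two of the three queries rank their image first, so recall-at-one is 2/3, while recall-at-two is 1.0 because the remaining query finds its image in second place.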
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Domain Adaptation and Few-Shot Learning
