I see what you hear: a vision-inspired method to localize words
Mohammad Samragh, Arnav Kundu, Ting-Yao Hu, Minsik Cho, Aman Chadha, Ashish Shrivastava, Oncel Tuzel, Devang Naik

TL;DR
This paper introduces a novel vision-inspired approach to localize words in speech data by treating audio as a 1D image, leveraging object detection techniques for efficient and accurate word detection.
Contribution
It presents a lightweight model for word localization in speech based on bounding box regression. Compared to existing work, the model is 94% smaller and achieves a 6.5% higher F1 score.
Findings
Reduces model size by 94%
Improves F1 score by 6.5%
Successfully localizes 1000 words in LibriSpeech
Abstract
This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been thoroughly studied in the contemporary literature for visual data. Noting that an audio signal can be interpreted as a 1-dimensional image, object localization techniques can be fundamentally useful for word localization. Building upon this idea, we propose a lightweight solution for word detection and localization. We use bounding box regression for word localization, which enables our model to detect the occurrence, offset, and duration of keywords in a given audio stream. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to existing work, our method reduces model size by 94% and improves the F1 score by 6.5%.
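To make the "audio as a 1-dimensional image" idea concrete, here is a minimal sketch of how a word detection reduces to a 1D bounding box (offset plus duration) and how overlapping detections could be compared and filtered. This is not the authors' implementation: the `WordBox` type, the `iou_1d` and `nms_1d` helpers, and the 0.5 overlap threshold are hypothetical names and values borrowed from common object-detection practice, and the abstract does not say whether the model uses non-maximum suppression at all.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """A 1D 'bounding box' for one detected word (hypothetical structure)."""
    word: str
    start: float     # offset of the word in the audio, in seconds
    duration: float  # length of the word, in seconds
    score: float     # detection confidence

    @property
    def end(self) -> float:
        return self.start + self.duration


def iou_1d(a: WordBox, b: WordBox) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = a.duration + b.duration - inter
    return inter / union if union > 0 else 0.0


def nms_1d(boxes, iou_threshold=0.5):
    """Greedy non-maximum suppression per word (an assumed post-processing step).

    Keeps the highest-scoring box and drops any lower-scoring box of the
    same word that overlaps a kept box by at least iou_threshold.
    """
    kept = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(box.word != k.word or iou_1d(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return sorted(kept, key=lambda b: b.start)


# Two overlapping candidates for "hello" and one for "world".
candidates = [
    WordBox("hello", start=1.0, duration=0.5, score=0.9),
    WordBox("hello", start=1.1, duration=0.5, score=0.6),
    WordBox("world", start=1.6, duration=0.4, score=0.8),
]
detections = nms_1d(candidates)
```

In this sketch the second "hello" candidate overlaps the first with an IoU of about 0.67, so it is suppressed, leaving one box per spoken word. The same interval-overlap measure is also a natural way to decide whether a predicted box matches a ground-truth word when scoring localization, though the exact matching rule behind the reported F1 score is not given here.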
Taxonomy
Topics: Speech and Audio Processing · Hand Gesture Recognition Systems · Subtitles and Audiovisual Media
