I see what you hear: a vision-inspired method to localize words

Mohammad Samragh; Arnav Kundu; Ting-Yao Hu; Minsik Cho; Aman Chadha; Ashish Shrivastava; Oncel Tuzel; Devang Naik

arXiv:2210.13567·cs.CV·October 26, 2022

Open Access

TL;DR

This paper introduces a novel vision-inspired approach to localize words in speech data by treating audio as a 1D image, leveraging object detection techniques for efficient and accurate word detection.

Contribution

It presents a lightweight model for word localization in speech based on bounding box regression. Compared to existing work, it reduces model size by 94% and improves the F1 score by 6.5%.

Findings

01 Reduces model size by 94%

02 Improves F1 score by 6.5%

03 Successfully localizes 1000 words in LibriSpeech

Abstract

This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been thoroughly studied in the contemporary literature for visual data. Noting that an audio signal can be interpreted as a 1-dimensional image, object localization techniques can be fundamentally useful for word localization. Building upon this idea, we propose a lightweight solution for word detection and localization. We use bounding box regression for word localization, which enables our model to detect the occurrence, offset, and duration of keywords in a given audio stream. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to existing work, our method reduces model size by 94% and improves the F1 score by 6.5%.
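To make the "audio as a 1-dimensional image" idea concrete, the sketch below shows how two standard object detection tools, intersection over union (IoU) and greedy non-maximum suppression, reduce to one dimension when a bounding box is just a time interval. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe the model or its post-processing, and the names `to_interval`, `iou_1d`, and `nms_1d`, the 0.5 threshold, and the example values are all hypothetical.

```python
from typing import List, Tuple

# A 1D "bounding box": (start_sec, end_sec, confidence score).
# Hypothetical representation, chosen for illustration only.
Box = Tuple[float, float, float]


def to_interval(offset: float, duration: float, score: float) -> Box:
    """Convert a predicted (offset, duration) pair into a time interval."""
    return (offset, offset + duration, score)


def iou_1d(a: Box, b: Box) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def nms_1d(boxes: List[Box], iou_threshold: float = 0.5) -> List[Box]:
    """Greedy non-maximum suppression over candidates for a single word.

    Keeps the highest-scoring interval, drops candidates that overlap a
    kept interval by at least iou_threshold, and returns the survivors
    sorted by start time.
    """
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[2], reverse=True):
        if all(iou_1d(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return sorted(kept, key=lambda b: b[0])


# Made-up candidate detections for one keyword.
candidates = [
    to_interval(0.0, 1.0, 0.9),
    to_interval(0.1, 1.0, 0.8),  # near-duplicate of the first candidate
    to_interval(2.0, 0.5, 0.7),  # a separate, later occurrence
]
detections = nms_1d(candidates)
```

With these made-up candidates, the second interval overlaps the first heavily and is suppressed, so `detections` holds two occurrences of the word, each with a start time, an end time, and a score.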

Taxonomy

Topics: Speech and Audio Processing · Hand Gesture Recognition Systems · Subtitles and Audiovisual Media