Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting

Hyeon-Kyeong Shin; Hyewon Han; Doyeon Kim; Soo-Whan Chung; Hong-Goo Kang

arXiv:2206.15400 · eess.AS · July 4, 2022

TL;DR

This paper introduces an end-to-end cross-modal keyword spotting method that compares speech and text sequences directly, improving robustness and enabling user-defined keywords without prior speech enrollment.

Contribution

It presents a novel audio-text agreement approach with an attention-based model, a new dataset, and training strategies for open-vocabulary keyword spotting.

Findings

1. Achieves competitive accuracy on multiple benchmarks.
2. Improves robustness in noisy environments.
3. Introduces the LibriPhrase dataset for training.

Abstract

In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches requiring speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attention-based cross-modal matching approach that is trained in an end-to-end manner with monotonic matching loss and keyword classification loss. We also utilize a de-noising loss for the acoustic embedding network to improve robustness in noisy environments. Additionally, we introduce the LibriPhrase dataset, a new short-phrase dataset based on LibriSpeech for efficiently training keyword spotting models. Our proposed method achieves competitive results on various evaluation sets compared to other…
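
To make the idea of comparing an audio query against an enrolled text keyword concrete, here is a minimal sketch of attention-based cross-modal matching. It is an illustration under stated assumptions, not the paper's architecture: the embeddings are hand-written toy vectors, there are no learned encoders or projections, and the function names (`cross_modal_attention`, `agreement_score`) are hypothetical. Each text token attends over the audio frames, and the score is the mean cosine similarity between each token and its attended audio summary.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_modal_attention(text_emb, audio_emb):
    # Scaled dot-product attention: one row of weights over the
    # audio frames for each text token.
    scale = math.sqrt(len(audio_emb[0]))
    return [softmax([dot(t, a) / scale for a in audio_emb]) for t in text_emb]

def agreement_score(text_emb, audio_emb):
    # Hypothetical agreement measure (an assumption, not from the paper):
    # mean cosine similarity between each text token and the
    # attention-weighted sum of audio frames.
    attn = cross_modal_attention(text_emb, audio_emb)
    sims = []
    for t, weights in zip(text_emb, attn):
        summary = [sum(w * a[k] for w, a in zip(weights, audio_emb))
                   for k in range(len(t))]
        norm = math.sqrt(dot(t, t)) * math.sqrt(dot(summary, summary)) + 1e-8
        sims.append(dot(t, summary) / norm)
    return sum(sims) / len(sims)

# Toy data: two text tokens, three audio frames, 2-dimensional embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
audio_match = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]     # contains both tokens
audio_mismatch = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]  # second token absent

print(agreement_score(text, audio_match))
print(agreement_score(text, audio_mismatch))
```

Plain attention of this kind ignores the order of the frames, so on its own it cannot tell a keyword from a reordering of the same sounds. The abstract's monotonic matching loss presumably addresses ordering, but its exact form is not given here and is not modelled in this sketch.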

Code & Models

Repositories

gusrud1103/libriphrase
Official

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing