Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

Kumari Nishu; Minsik Cho; Paul Dixon; Devang Naik

arXiv:2308.06472 · cs.SD · August 15, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a novel homogeneous audio-text embedding architecture for flexible keyword spotting, improving accuracy and efficiency by using an audio-compliant text encoder and confusable keyword augmentation.

Contribution

The work proposes a new architecture that employs a homogeneous audio-compliant text encoder and confusable keyword generation for improved flexible keyword spotting.

Findings

01 Outperforms the state of the art on the LibriPhrase hard dataset

02 Increases AUC (area under the curve) from 84.21% to 92.7%

03 Reduces EER (equal error rate) from 23.36% to 14.4%
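
For readers unfamiliar with the two metrics in these findings, here is a minimal sketch of how AUC and EER can be computed from verifier scores. The function names and the toy scores are illustrative assumptions, not the paper's evaluation code. The EER here is approximated by sweeping thresholds over the observed scores only.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via the pairwise rank statistic:
    the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def eer(pos_scores, neg_scores):
    """Approximate equal error rate: sweep thresholds over the observed
    scores and return the mean of the false-accept and false-reject rates
    at the threshold where the two are closest."""
    best_gap, best_rate = None, None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        # A sample is accepted when its score is >= the threshold.
        far = sum(n >= t for n in neg_scores) / len(neg_scores)
        frr = sum(p < t for p in pos_scores) / len(pos_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


# Toy scores (hypothetical): higher means "audio matches the keyword text".
matching = [0.9, 0.8, 0.7, 0.3]
non_matching = [0.6, 0.4, 0.2, 0.1]
print(auc(matching, non_matching))  # 0.875
print(eer(matching, non_matching))  # 0.25
```

A higher AUC and a lower EER both indicate better separation between matching and non-matching audio-text pairs, which is the direction of the reported changes.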

Abstract

Spotting user-defined/flexible keywords represented in text frequently uses an expensive text encoder for joint analysis with an audio encoder in an embedding space, which can suffer from heterogeneous modality representation (i.e., large mismatch) and increased complexity. In this work, we propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder which inherently has homogeneous representation with audio embedding, and it is also much smaller than a compatible text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors, extracted from the paired audio encoder on rich speech datasets. We further augment our method with confusable keyword generation to develop an audio-text embedding verifier with strong discriminative power.…
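
The text-encoder pipeline described in the abstract (text to phonemes, then phonemes to vectors taken from the audio encoder) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `TOY_G2P` is a hypothetical hand-written lexicon standing in for a trained G2P model, and averaging frame embeddings per phoneme is only one plausible way to obtain "representative" phoneme vectors.

```python
# Hypothetical toy lexicon; the paper uses a trained G2P model instead.
TOY_G2P = {
    "hey": ["HH", "EY"],
    "siri": ["S", "IH", "R", "IY"],
}


def build_phoneme_table(labeled_frames):
    """Build one representative vector per phoneme.

    labeled_frames: iterable of (phoneme, vector) pairs, where each vector
    is assumed to be an audio-encoder output aligned to that phoneme.
    Assumption: the representative vector is the per-phoneme mean.
    """
    sums, counts = {}, {}
    for phoneme, vec in labeled_frames:
        if phoneme not in sums:
            sums[phoneme] = [0.0] * len(vec)
            counts[phoneme] = 0
        sums[phoneme] = [s + v for s, v in zip(sums[phoneme], vec)]
        counts[phoneme] += 1
    return {ph: [s / counts[ph] for s in sums[ph]] for ph in sums}


def encode_text(text, phoneme_table):
    """Convert text to phonemes, then look up each phoneme's vector.

    The result lives in the same space as the audio embeddings because the
    table was built from audio-encoder outputs. No separate learned text
    encoder is involved, only a lookup.
    """
    phonemes = [p for word in text.lower().split() for p in TOY_G2P[word]]
    return [phoneme_table[p] for p in phonemes]


# Toy 2-dimensional "audio embeddings" (hypothetical values).
frames = [("HH", [1.0, 0.0]), ("HH", [3.0, 0.0]), ("EY", [0.0, 2.0])]
table = build_phoneme_table(frames)
print(encode_text("hey", table))  # [[2.0, 0.0], [0.0, 2.0]]
```

Because the text side reduces to a G2P step plus a table lookup, the encoder stays small, which is consistent with the abstract's claim about size relative to a conventional text encoder.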

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Advanced Text Analysis Techniques