Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding
Kumari Nishu, Minsik Cho, Paul Dixon, Devang Naik

TL;DR
This paper introduces a novel homogeneous audio-text embedding architecture for flexible keyword spotting, improving accuracy and efficiency by using an audio-compliant text encoder and confusable keyword augmentation.
Contribution
The work proposes a new architecture that employs a homogeneous audio-compliant text encoder and confusable keyword generation for improved flexible keyword spotting.
Findings
Outperforms the state of the art on the Libriphrase hard dataset
Increases area under the ROC curve (AUC) from 84.21% to 92.7%
Reduces equal error rate (EER) from 23.36% to 14.4%
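For readers unfamiliar with the two metrics above, here is a minimal sketch of how AUC and EER are commonly computed from verifier scores. It uses the standard textbook definitions, not the paper's evaluation code. The function names and toy scores are invented for illustration.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (Mann-Whitney formulation of ROC AUC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def equal_error_rate(pos_scores, neg_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    At each threshold t, a trial is accepted when score >= t.
    FAR = fraction of negatives accepted, FRR = fraction of positives rejected.
    The EER is taken where |FAR - FRR| is smallest, as their average.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        far = sum(s >= t for s in neg_scores) / len(neg_scores)
        frr = sum(s < t for s in pos_scores) / len(pos_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Invented toy scores: higher means "audio matches the keyword text".
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
toy_auc = auc(positives, negatives)
toy_eer = equal_error_rate(positives, negatives)
```

A higher AUC and a lower EER both indicate better separation between matching and non-matching audio-text pairs, which is the direction of the reported changes.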
Abstract
Spotting user-defined/flexible keywords represented in text frequently uses an expensive text encoder for joint analysis with an audio encoder in an embedding space, which can suffer from heterogeneous modality representation (i.e., large mismatch) and increased complexity. In this work, we propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder which inherently has homogeneous representation with audio embedding, and it is also much smaller than a compatible text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors, extracted from the paired audio encoder on rich speech datasets. We further augment our method with confusable keyword generation to develop an audio-text embedding verifier with strong discriminative power.…
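The text-encoder pipeline described in the abstract (text to phonemes, then phonemes to representative vectors taken from the audio encoder) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation. The lexicon `TOY_G2P`, the 2-dimensional frame embeddings, the use of a per-phoneme mean as the "representative" vector, and the single-phoneme substitution rule for confusable keywords are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical toy lexicon standing in for a real G2P model.
TOY_G2P = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
}


def representative_phoneme_vectors(labeled_frames):
    """Average audio-encoder embeddings per phoneme label.

    labeled_frames: iterable of (phoneme, embedding) pairs, as might be
    obtained from a phoneme-aligned speech corpus (assumed available).
    Averaging is an assumption; the source only says the vectors are
    extracted from the paired audio encoder.
    """
    buckets = defaultdict(list)
    for phoneme, vec in labeled_frames:
        buckets[phoneme].append(vec)
    return {
        p: [fmean(dim) for dim in zip(*vecs)]
        for p, vecs in buckets.items()
    }


def encode_text(keyword, g2p, phoneme_vectors):
    """Text -> phonemes -> sequence of representative phoneme vectors."""
    return [phoneme_vectors[p] for p in g2p[keyword]]


def confusable_variants(phonemes, inventory):
    """Build hard negatives by substituting one phoneme at a time."""
    variants = []
    for i, original in enumerate(phonemes):
        for candidate in inventory:
            if candidate != original:
                variants.append(phonemes[:i] + [candidate] + phonemes[i + 1:])
    return variants


# Toy "audio encoder" outputs, invented for illustration.
frames = [
    ("K", [1.0, 0.0]), ("K", [0.8, 0.2]),
    ("AE", [0.0, 1.0]), ("AE", [0.2, 0.8]),
    ("T", [0.5, 0.5]),
    ("P", [0.6, 0.4]),
]
phoneme_vectors = representative_phoneme_vectors(frames)
cat_embedding = encode_text("cat", TOY_G2P, phoneme_vectors)
negatives = confusable_variants(TOY_G2P["cat"], sorted(phoneme_vectors))
```

Because the text embedding is assembled from vectors that already live in the audio encoder's output space, no separate learned text encoder is needed in this sketch, which is the intuition behind the "homogeneous" and "much smaller" claims. The confusable variants (for example, the phoneme sequence of "cap" for the keyword "cat") would serve as hard negatives when training a verifier.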
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Advanced Text Analysis Techniques
