Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting
Youkyum Kim, Jaemin Jung, Jihwan Park, Byeong-Yeol Kim, Joon Son Chung

TL;DR
This paper introduces ParallelKWS, a novel framework for user-defined keyword spotting that leverages parallel self- and cross-attention mechanisms to align audio and text modalities, achieving state-of-the-art results without extra data.
Contribution
The paper presents a new parallel attention-based architecture with a phoneme duration alignment loss for improved audio-text keyword spotting.
Findings
Achieves state-of-the-art performance on several benchmark datasets.
Effective in both seen and unseen domains.
Uses no extra data beyond the dataset used in previous studies.
Abstract
This paper proposes a novel user-defined keyword spotting framework that accurately detects audio keywords based on text enrollment. Since audio data possesses additional acoustic information compared to text, there are discrepancies between these two modalities. To address this challenge, we present ParallelKWS, which utilises self- and cross-attention in a parallel architecture to effectively capture information both within and across the two modalities. We further propose a phoneme duration-based alignment loss that enforces the sequential correspondence between audio and text features. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art performance on several benchmark datasets in both seen and unseen domains, without incorporating extra data beyond the dataset used in previous studies.
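The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the layer design or the exact loss, so everything below is an assumption made for illustration: single-head attention without learned projections, summation as the way to merge the self- and cross-attention branches, a hard frame-to-phoneme target built from per-phoneme frame counts, and a cross-entropy form for the alignment loss. All function names (`parallel_attention`, `duration_target`, `alignment_loss`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention. Returns (output, attention weights)."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ value, weights

def parallel_attention(audio, text):
    """Hypothetical parallel block: self- and cross-attention read the same
    inputs side by side (not stacked), and their outputs are summed.
    The summation is an assumption, not taken from the paper."""
    audio_self, _ = attention(audio, audio, audio)
    text_self, _ = attention(text, text, text)
    audio_cross, audio_to_text = attention(audio, text, text)
    text_cross, _ = attention(text, audio, audio)
    return audio_self + audio_cross, text_self + text_cross, audio_to_text

def duration_target(durations):
    """Frame-by-phoneme 0/1 matrix: phoneme i owns durations[i] consecutive
    audio frames, which encodes a monotonic (sequential) alignment."""
    durations = np.asarray(durations)
    phoneme_of_frame = np.repeat(np.arange(len(durations)), durations)
    return np.eye(len(durations))[phoneme_of_frame]

def alignment_loss(weights, target, eps=1e-8):
    """Assumed loss form: cross-entropy between the audio-to-text attention
    weights and the duration-based target, averaged over frames."""
    return float(-(target * np.log(weights + eps)).sum(axis=-1).mean())

# Toy example: 3 phonemes lasting 3, 2 and 4 frames, feature size 8.
rng = np.random.default_rng(0)
durations = [3, 2, 4]
audio = rng.normal(size=(sum(durations), 8))
text = rng.normal(size=(len(durations), 8))

audio_out, text_out, audio_to_text = parallel_attention(audio, text)
target = duration_target(durations)
loss = alignment_loss(audio_to_text, target)
```

In this sketch the loss shrinks as each audio frame concentrates its cross-attention on the phoneme that the durations assign to it, which is one plausible way to read "enforces the sequential correspondence between audio and text features".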
Taxonomy
Topics: Advanced Text Analysis Techniques · Music and Audio Processing
