Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting

Youkyum Kim; Jaemin Jung; Jihwan Park; Byeong-Yeol Kim; Joon Son Chung

arXiv:2408.03593·eess.AS·October 23, 2024·IEEE Signal Process. Lett.

Open Access

TL;DR

This paper introduces ParallelKWS, a novel framework for user-defined keyword spotting that leverages parallel self- and cross-attention mechanisms to align audio and text modalities, achieving state-of-the-art results without extra data.

Contribution

The paper presents a new parallel attention-based architecture with a phoneme duration alignment loss for improved audio-text keyword spotting.

Findings

01 Achieves state-of-the-art performance on benchmark datasets.

02 Effective in both seen and unseen domains.

03 Does not require additional data beyond existing datasets.

Abstract

This paper proposes a novel user-defined keyword spotting framework that accurately detects audio keywords based on text enrollment. Since audio data possesses additional acoustic information compared to text, there are discrepancies between these two modalities. To address this challenge, we present ParallelKWS, which utilises self- and cross-attention in a parallel architecture to effectively capture information both within and across the two modalities. We further propose a phoneme duration-based alignment loss that enforces the sequential correspondence between audio and text features. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art performance on several benchmark datasets in both seen and unseen domains, without incorporating extra data beyond the dataset used in previous studies.
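
To make the idea of running self- and cross-attention in parallel more concrete, here is a minimal sketch in plain Python. It is one possible reading of the abstract, not the authors' implementation: the function names (`attention`, `parallel_attention`), the single-head attention without learned projections, and the choice to fuse the two branches by concatenation are all assumptions made for illustration. The abstract does not specify how the branches are combined.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    Assumption: no learned projection matrices, to keep the sketch short.
    """
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def parallel_attention(audio, text):
    """Hypothetical parallel block: both branches read the same audio input.

    The self branch attends within the audio sequence; the cross branch
    lets audio frames attend to the text (e.g. phoneme) embeddings.
    Fusing by concatenation is an illustrative assumption.
    """
    self_out = attention(audio, audio, audio)   # within-modality
    cross_out = attention(audio, text, text)    # across modalities
    # List "+" concatenates the two feature vectors for each audio frame.
    return [s + c for s, c in zip(self_out, cross_out)]

# Toy inputs: 3 audio frames and 2 text tokens, all of dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5], [1.0, 0.0]]
fused = parallel_attention(audio, text)
print(len(fused), len(fused[0]))
```

In this sketch each audio frame ends up with a feature vector twice the input width: the first half summarises the audio context and the second half summarises the enrolled text as seen from that frame. The contrast with a sequential design is that neither branch consumes the other's output.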

Taxonomy

Topics: Advanced Text Analysis Techniques · Music and Audio Processing