Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment

Li Kewei; Zhou Hengshun; Shen Kai; Dai Yusheng; Du Jun

arXiv:2412.20805·eess.AS·December 31, 2024

Open Access

TL;DR

This paper introduces Phoneme-Level Contrastive Learning (PLCL), a novel method that improves user-defined keyword spotting by enhancing phoneme-level feature alignment, reducing false alarms, and supporting flexible enrollment modes.

Contribution

The paper proposes a new phoneme-level contrastive learning approach that enhances robustness and flexibility in user-defined keyword spotting systems, outperforming existing methods.

Findings

01. Achieves state-of-the-art performance on the LibriPhrase dataset.

02. Improves disambiguation of confusable words through phoneme-level alignment.

03. Supports multiple enrollment modalities within a unified framework.

Abstract

User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods commonly suffer from high false alarm rates with confusable words and are limited to either audio-only or text-only enrollment. Therefore, in this paper, we first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it is generalizable to jointly optimize both audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct…
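The abstract describes phoneme-level contrastive learning only at a high level, so the following is a minimal sketch of the general idea rather than the authors' loss. It assumes an InfoNCE-style objective over cosine similarities: each phoneme-level query embedding is pulled toward source (enrollment) embeddings carrying the same phoneme label and pushed away from the rest. The function name `phoneme_contrastive_loss`, the `temperature` value, and the toy three-dimensional embeddings are all hypothetical and do not come from the paper.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def phoneme_contrastive_loss(query, source, query_ids, source_ids,
                             temperature=0.1):
    """InfoNCE-style loss over phoneme-level embeddings (illustrative only).

    query / source: lists of embedding vectors, one per phoneme.
    query_ids / source_ids: phoneme labels aligned with those vectors.
    Source vectors sharing a query's label are positives; all others
    act as negatives. This is an assumed formulation, not the paper's.
    """
    losses = []
    for q_vec, q_id in zip(query, query_ids):
        logits = [cosine(q_vec, s_vec) / temperature for s_vec in source]
        positives = [l for l, s_id in zip(logits, source_ids) if s_id == q_id]
        if not positives:
            continue  # this phoneme has no positive in the enrollment
        losses.append(log_sum_exp(logits) - log_sum_exp(positives))
    if not losses:
        raise ValueError("no query phoneme has a positive in the source")
    return sum(losses) / len(losses)


# Toy example: three phonemes with nearly matching query/source embeddings.
source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
ids = ["K", "AE", "T"]

aligned = phoneme_contrastive_loss(query, source, ids, ids)
# Permuting the source labels places each positive on a dissimilar vector.
mismatched = phoneme_contrastive_loss(query, source, ids, ["AE", "T", "K"])
print(aligned < mismatched)
```

Because the source embeddings can come from either a text encoder or an audio encoder, the same function shape would cover audio-text and audio-audio matching in this sketch; how the paper actually combines the two is not stated in the excerpt above.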

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Advanced Text Analysis Techniques

Methods: Contrastive Learning