Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment
Li Kewei, Zhou Hengshun, Shen Kai, Dai Yusheng, Du Jun

TL;DR
This paper introduces Phoneme-Level Contrastive Learning (PLCL), a novel method that improves user-defined keyword spotting by enhancing phoneme-level feature alignment, reducing false alarms, and supporting flexible enrollment modes.
Contribution
The paper proposes a phoneme-level contrastive learning approach that improves robustness to confusable words and supports flexible enrollment in user-defined keyword spotting, reporting state-of-the-art results on the LibriPhrase dataset.
Findings
Achieves state-of-the-art performance on the LibriPhrase dataset.
Enhances disambiguation of confusable words through phoneme-level alignment.
Supports multiple enrollment modalities within a unified framework.
Abstract
User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods suffer from high false alarm rates on confusable words and are limited to either audio-only or text-only enrollment. Therefore, in this paper, we first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it generalizes to jointly optimizing both audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct…
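To make the idea of phoneme-level positive and negative comparisons concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss over per-phoneme embeddings. It is not the authors' PLCL objective: the function name `phoneme_info_nce`, the temperature value, and the assumption that audio features have already been pooled to one vector per phoneme are all illustrative choices, and the phoneme memory bank mentioned in the abstract is not modeled.

```python
import numpy as np

def phoneme_info_nce(audio_feats, text_feats, temperature=0.1):
    """Generic InfoNCE loss over phoneme-level embeddings (illustrative only).

    audio_feats, text_feats: arrays of shape (P, D). Row i of each array is
    assumed to describe the same phoneme (the positive pair); every other
    row acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # (P, P) similarity matrix; entry (i, j) compares audio phoneme i
    # with text phoneme j.
    logits = a @ t.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_probs)))

# Hypothetical data: 6 phonemes with 16-dimensional embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 16))
text_aligned = audio + 0.01 * rng.normal(size=(6, 16))  # matching phonemes
text_shuffled = np.roll(text_aligned, shift=1, axis=0)  # mismatched phonemes

loss_aligned = phoneme_info_nce(audio, text_aligned)
loss_shuffled = phoneme_info_nce(audio, text_shuffled)
print(loss_aligned, loss_shuffled)
```

In this sketch the loss is small when each audio phoneme embedding is closest to its own text phoneme embedding and grows when the pairing is broken. The same function could be applied to audio-audio pairs by passing two sets of audio-derived phoneme embeddings, which is one way to read the abstract's claim about covering several enrollment modes.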
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Advanced Text Analysis Techniques
Methods: Contrastive Learning
