Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation

Zhiqi Ai; Han Cheng; Shiyi Mu; Xinnuo Li; Yongjin Zhou; Shugong Xu

arXiv:2605.22120 · eess.AS · May 22, 2026

TL;DR

DMA-KWS is a novel user-defined keyword spotting framework that combines dual-stage matching, multi-modal enrollment, and continual adaptation to improve accuracy, robustness, and efficiency for personalized voice interaction.

Contribution

The paper introduces a framework that integrates dual-stage matching, multi-modal enrollment, and lightweight continual adaptation to improve user-defined keyword spotting.

Findings

01. Achieves 97.85% AUC and 6.13% EER on the LibriPhrase Hard subset

02. Multi-modal enrollment outperforms text-only enrollment in speaker-dependent settings

03. Uses only 187k parameters for fine-tuning, making it suitable for on-device deployment
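
For readers unfamiliar with the two metrics in the first finding, the sketch below shows one common way to compute AUC and EER from detection scores. It is not the paper's evaluation script: the function names (`far_frr_curve`, `auc`, `eer`) and the toy scores are assumptions made for illustration, and the EER is found by linear interpolation between adjacent thresholds, which is one of several conventions.

```python
def far_frr_curve(scores, labels):
    """Sweep thresholds from high to low; return (FAR, FRR) at each one.

    labels: 1 for a true keyword trial, 0 for a non-keyword trial.
    A trial is accepted when its score is >= the threshold.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative trial")
    curve = [(0.0, 1.0)]  # threshold above every score: nothing accepted
    for t in sorted(set(scores), reverse=True):
        acc_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        acc_neg = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        curve.append((acc_neg / n_neg, 1.0 - acc_pos / n_pos))
    return curve


def auc(curve):
    """Trapezoidal area under the ROC curve (TPR = 1 - FRR versus FAR)."""
    area = 0.0
    for (far0, frr0), (far1, frr1) in zip(curve, curve[1:]):
        area += (far1 - far0) * ((1.0 - frr0) + (1.0 - frr1)) / 2.0
    return area


def eer(curve):
    """Equal error rate: the point where FAR equals FRR (interpolated)."""
    for (far0, frr0), (far1, frr1) in zip(curve, curve[1:]):
        d0, d1 = far0 - frr0, far1 - frr1
        if d1 >= 0:  # FAR has caught up with FRR on this step
            w = -d0 / (d1 - d0)
            return far0 + w * (far1 - far0)
    raise RuntimeError("curve never crosses FAR == FRR")


# Toy data, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
curve = far_frr_curve(scores, labels)
print(f"AUC = {auc(curve):.4f}, EER = {eer(curve):.4f}")
```

A higher AUC and a lower EER both indicate better separation between keyword and non-keyword trials.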

Abstract

User-defined keyword spotting (KWS) is crucial for personalized voice interaction, yet existing methods face several challenges: (1) insufficient discriminability among confusable words, (2) performance inconsistency across speakers with varying pronunciations, and (3) high data cost to ensure reliable wake-word performance. In this paper, we introduce DMA-KWS, an efficient and robust framework for user-defined keyword spotting. First, it adopts a dual-stage matching pipeline: CTC decoding with streaming phoneme search to locate candidate segments, followed by QbyT with a phoneme matcher for fine-grained verification, enabling it to better distinguish confusable words. Next, multi-modal enrollment fuses user-specific speech with text embeddings to further improve accuracy for registered users. Finally, a parameter-efficient continual adaptation mechanism performs lightweight updates…
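
The dual-stage idea in the abstract (a coarse phoneme search that proposes candidate segments, then a stricter verifier) can be illustrated with a toy sketch. Everything below is an assumption for illustration, not the authors' implementation: frames are dictionaries of phoneme posteriors, stage 1 is greedy CTC decoding plus a substitution-tolerant window search, and stage 2 replaces the paper's learned QbyT phoneme matcher with a simple stand-in that takes the minimum, over keyword phonemes, of the mean posterior inside each aligned span. The names `greedy_ctc`, `stage1_candidates`, `stage2_score`, and `detect` are hypothetical.

```python
BLANK = "_"  # assumed CTC blank symbol


def greedy_ctc(frames):
    """Greedy CTC decode: argmax per frame, merge repeats, drop blanks.

    Returns a list of (phoneme, first_frame, last_frame).
    """
    units, prev = [], None
    for t, post in enumerate(frames):
        lab = max(post, key=post.get)
        if lab == prev:
            if lab != BLANK:  # extend the span of the current phoneme
                p, start, _ = units[-1]
                units[-1] = (p, start, t)
        elif lab != BLANK:
            units.append((lab, t, t))
        prev = lab
    return units


def stage1_candidates(units, keyword, max_mismatch=1):
    """Coarse search: windows within max_mismatch substitutions of keyword."""
    k = len(keyword)
    cands = []
    for i in range(len(units) - k + 1):
        window = units[i:i + k]
        mismatches = sum(1 for (p, _, _), q in zip(window, keyword) if p != q)
        if mismatches <= max_mismatch:
            cands.append(window)
    return cands


def stage2_score(frames, window, keyword):
    """Stand-in verifier: weakest per-phoneme mean posterior in the window."""
    per_phoneme = []
    for (_, start, end), q in zip(window, keyword):
        span = frames[start:end + 1]
        per_phoneme.append(sum(f.get(q, 0.0) for f in span) / len(span))
    return min(per_phoneme)


def detect(frames, keyword, threshold=0.6, max_mismatch=1):
    """Accept if any stage-1 candidate passes the stage-2 threshold."""
    units = greedy_ctc(frames)
    return any(stage2_score(frames, w, keyword) >= threshold
               for w in stage1_candidates(units, keyword, max_mismatch))


# Toy posteriors for an utterance of "k ae t" (values invented).
frames = [
    {"k": 0.9, BLANK: 0.1},
    {"k": 0.8, BLANK: 0.2},
    {BLANK: 0.9, "k": 0.1},
    {"ae": 0.9, BLANK: 0.1},
    {"t": 0.85, BLANK: 0.15},
]
print(detect(frames, ["k", "ae", "t"]))  # enrolled keyword
print(detect(frames, ["k", "ae", "p"]))  # confusable keyword
```

In this toy setup the confusable keyword survives the tolerant first stage but fails verification, which is the division of labor the abstract describes: a cheap, permissive search followed by a fine-grained check.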
