ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody

Jianan Pan; Yuanming Zhang; Kejie Huang

arXiv:2603.18024 · eess.AS · March 20, 2026

Open Access

TL;DR

ProKWS introduces a personalized keyword spotting framework that combines phoneme-level matching with prosody modeling, improving adaptability and robustness across different speakers and acoustic conditions.

Contribution

It presents a dual-stream encoder that jointly learns phonemes and prosody, enabling personalized and robust keyword detection.

Findings

01 Achieves performance comparable to state-of-the-art models

02 Demonstrates strong robustness to tone and intent variations

03 Enhances personalization in keyword spotting systems

Abstract

Current keyword spotting systems primarily use phoneme-level matching to distinguish confusable words but ignore user-specific pronunciation traits such as prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework that integrates fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder in which one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, enhancing adaptability across acoustic environments. Experiments show that ProKWS delivers highly competitive performance, comparable to state-of-the-art models on standard benchmarks, and demonstrates strong robustness for personalized keywords with tone and intent variations.
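The abstract does not specify the contrastive objective or the internals of the fusion module, so the following is only a minimal sketch under stated assumptions: an InfoNCE-style loss stands in for the phoneme stream's contrastive learning, and a per-dimension sigmoid gate stands in for the "dynamic" fusion of the two streams. The names `info_nce`, `gated_fusion`, `gate_weights`, `gate_bias`, and `temperature` are hypothetical and are not taken from the paper.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, 1e-12)  # guard against zero vectors


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's objective).

    Low when the anchor is closer to the positive than to every negative.
    """
    logits = [_cosine(anchor, positive) / temperature]
    logits += [_cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(phoneme, prosody, gate_weights, gate_bias):
    """Blend two D-dim embeddings with a per-dimension gate (assumed design).

    gate_weights: D rows of length 2*D, applied to concat(phoneme, prosody).
    A gate near 1 favors the phoneme stream, near 0 the prosody stream.
    """
    dim = len(phoneme)
    if len(prosody) != dim or len(gate_weights) != dim or len(gate_bias) != dim:
        raise ValueError("embeddings and gate parameters must share dimension D")
    concat = list(phoneme) + list(prosody)
    fused = []
    for d in range(dim):
        logit = sum(w * x for w, x in zip(gate_weights[d], concat)) + gate_bias[d]
        g = _sigmoid(logit)
        fused.append(g * phoneme[d] + (1.0 - g) * prosody[d])
    return fused


if __name__ == "__main__":
    # Toy 2-dim embeddings; a zero-initialized gate weights both streams equally.
    phoneme_emb = [1.0, 0.0]
    prosody_emb = [0.0, 1.0]
    zero_gate = [[0.0] * 4, [0.0] * 4]
    print(gated_fusion(phoneme_emb, prosody_emb, zero_gate, [0.0, 0.0]))
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]]))
```

In a trained system the gate parameters would be learned jointly with both encoders, which is one plausible reading of "dynamically combines"; the actual ProKWS module may differ, for example by using attention instead of a gate.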

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling