ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody
Jianan Pan, Yuanming Zhang, Kejie Huang

TL;DR
ProKWS introduces a personalized keyword spotting framework that combines phoneme-level matching with prosody modeling, improving adaptability and robustness across different speakers and acoustic conditions.
Contribution
It presents a dual-stream encoder that jointly learns phonemes and prosody, enabling personalized and robust keyword detection.
Findings
Achieves performance comparable to state-of-the-art models on standard benchmarks
Demonstrates strong robustness to tone and intent variations
Enhances personalization in keyword spotting systems
Abstract
Current keyword spotting systems primarily use phoneme-level matching to distinguish confusable words but ignore user-specific pronunciation traits such as prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework integrating fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder in which one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, enhancing adaptability across acoustic environments. Experiments show that ProKWS performs comparably to state-of-the-art models on standard benchmarks and is robust for personalized keywords with tone and intent variations.
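The abstract does not say how the collaborative fusion module combines the two streams. One common way to combine two embeddings dynamically is a learned sigmoid gate, and the sketch below illustrates that idea only; it is not the authors' implementation. The function name `gated_fusion` and the parameters `gate_weights` and `gate_bias` are hypothetical, and the embeddings are plain Python lists so the example has no dependencies.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(phoneme, prosody, gate_weights, gate_bias):
    """Fuse two equal-length embeddings with a per-dimension sigmoid gate.

    Assumption (not from the paper): the gate for dimension i is a linear
    score over the concatenated inputs, passed through a sigmoid.
    gate_weights[i] is a list of length 2 * d; gate_bias[i] is a float.
    """
    if len(phoneme) != len(prosody):
        raise ValueError("phoneme and prosody embeddings must have equal length")

    concat = list(phoneme) + list(prosody)
    fused = []
    for i, (p, r) in enumerate(zip(phoneme, prosody)):
        score = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(score)
        # g near 1 favors the phoneme stream, g near 0 favors the prosody stream.
        fused.append(g * p + (1.0 - g) * r)
    return fused


if __name__ == "__main__":
    phoneme_vec = [1.0, 0.0]
    prosody_vec = [0.0, 1.0]
    zero_weights = [[0.0] * 4, [0.0] * 4]
    # With zero weights and zero bias every gate is 0.5, so the result is the mean.
    print(gated_fusion(phoneme_vec, prosody_vec, zero_weights, [0.0, 0.0]))
```

With all gate parameters at zero the two streams are averaged. In a trained model the gate would depend on the input, which is one way the weighting could shift between phonemic and prosodic evidence as conditions change.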
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling
