Personalizing Keyword Spotting with Speaker Information
Beltrán Labrador, Pai Zhu, Guanlong Zhao, Angelo Scorza Scarpati, Quan Wang, Alicia Lozano-Diez, Alex Park, Ignacio López Moreno

TL;DR
This paper introduces a method to improve keyword spotting accuracy across diverse speakers by integrating speaker information via FiLM, with minimal additional computational cost.
Contribution
We propose a novel FiLM-based approach that incorporates speaker information into keyword spotting, enhancing performance for underrepresented groups with minimal parameter increase.
Findings
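Substantial improvement in keyword detection accuracy across diverse speakers, particularly underrepresented speaker groups.
Effective integration of speaker information from both the input audio and pre-enrolled user audio.
Roughly a 1% increase in parameter count, with minimal impact on latency and computational cost.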
Abstract
Keyword spotting systems often struggle to generalize to a diverse population with various accents and age groups. To address this challenge, we propose a novel approach that integrates speaker information into keyword spotting using Feature-wise Linear Modulation (FiLM), a recent method for learning from multiple sources of information. We explore both Text-Dependent and Text-Independent speaker recognition systems to extract speaker information, and we experiment on extracting this information from both the input audio and pre-enrolled user audio. We evaluate our systems on a diverse dataset and achieve a substantial improvement in keyword detection accuracy, particularly among underrepresented speaker groups. Moreover, our proposed approach only requires a small 1% increase in the number of parameters, with a minimum impact on latency and computational cost, which makes it a…
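To make the FiLM idea in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`linear`, `film`, `identity_film_params`), the single affine layer that produces the scale and shift, and the choice of which features get modulated are all hypothetical. The only part taken from the FiLM formulation itself is the per-channel operation `gamma * x + beta`, where `gamma` and `beta` are computed from the conditioning input (here, a speaker embedding).

```python
def linear(weights, bias, vec):
    """Affine map: `weights` holds one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, speaker_embedding, gamma_w, gamma_b, beta_w, beta_b):
    """Apply FiLM to a sequence of frames.

    features: list of frames, each a list of C channel activations
        (assumed to be intermediate keyword-spotting features).
    speaker_embedding: list of D floats (assumed output of a speaker encoder).
    gamma_w, beta_w: C x D weight matrices; gamma_b, beta_b: length-C biases.
    """
    gamma = linear(gamma_w, gamma_b, speaker_embedding)  # per-channel scale
    beta = linear(beta_w, beta_b, speaker_embedding)     # per-channel shift
    # The same scale and shift are applied to every frame.
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)]
            for frame in features]


def identity_film_params(num_channels, embedding_dim):
    """Parameters for which FiLM is a no-op (gamma = 1, beta = 0)."""
    gamma_w = [[0.0] * embedding_dim for _ in range(num_channels)]
    beta_w = [[0.0] * embedding_dim for _ in range(num_channels)]
    return gamma_w, [1.0] * num_channels, beta_w, [0.0] * num_channels


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0]]   # 2 frames, 2 channels
    embedding = [1.0, -1.0]             # toy 2-dimensional speaker vector
    params = identity_film_params(num_channels=2, embedding_dim=2)
    print(film(frames, embedding, *params))
```

In this sketch the conditioning path adds 2 x C x (D + 1) parameters for C channels and a D-dimensional embedding, which is small when the modulated layer is narrow. Initializing to the identity setting is one common way to start from the behavior of an unconditioned model; whether the paper does this is not stated in the abstract.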
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
