Closing the Gap between Single-User and Multi-User VoiceFilter-Lite
Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw

TL;DR
This paper improves multi-user VoiceFilter-Lite by using dual learning rates and FiLM (feature-wise linear modulation) conditioning, closing the performance gap with single-user models and supporting multiple enrolled users on smart devices.
Contribution
Introduces a novel training and conditioning approach that improves multi-user VoiceFilter-Lite performance and scalability over previous methods.
Findings
Closed the performance gap between multi-user and single-user models.
Significantly improved multi-speaker evaluation results.
Supported any number of users with the new model.
Abstract
VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a…
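To make the two ideas named above more concrete, here is a minimal NumPy sketch of attention-based speaker selection followed by FiLM conditioning. It is an illustration under stated assumptions, not the authors' implementation: the dot-product scoring, the single linear layers that produce the scale and shift, and every function name, parameter name, and dimension are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def attend_speakers(query, embeddings):
    """Combine enrolled speaker embeddings with softmax attention weights.

    Assumption: dot-product scoring against a query vector; the paper's
    actual scoring function is not given in this summary.
    """
    scores = embeddings @ query                  # (num_speakers,)
    scores = scores - scores.max()               # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ embeddings, weights         # (emb_dim,), (num_speakers,)

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift each feature channel based on a conditioning vector."""
    gamma = cond @ w_gamma + b_gamma             # per-channel scale, (feat_dim,)
    beta = cond @ w_beta + b_beta                # per-channel shift, (feat_dim,)
    return gamma * features + beta               # broadcasts over frames

# Hypothetical sizes: 3 enrolled users, 4-dim embeddings, 5 frames of 6 features.
rng = np.random.default_rng(0)
emb_dim, feat_dim, num_frames = 4, 6, 5
enrolled = rng.normal(size=(3, emb_dim))
query = rng.normal(size=emb_dim)
features = rng.normal(size=(num_frames, feat_dim))

cond, weights = attend_speakers(query, enrolled)
w_gamma = rng.normal(size=(emb_dim, feat_dim))
w_beta = rng.normal(size=(emb_dim, feat_dim))
modulated = film(features, cond, w_gamma, np.ones(feat_dim), w_beta, np.zeros(feat_dim))
```

Because the attended embedding is a weighted average over however many rows `enrolled` has, the same conditioning path works for any number of enrolled users, which is the property the Findings describe.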
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: VoiceFilter-Lite
