Multi-user VoiceFilter-Lite via Attentive Speaker Embedding
Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw

TL;DR
This paper introduces a multi-user VoiceFilter-Lite that uses attentive speaker embeddings to support multiple enrolled users simultaneously, improving speech separation and recognition in noisy, overlapping speech scenarios.
Contribution
It presents a novel attention-based method for multi-user speaker embedding integration in VoiceFilter-Lite, enabling support for arbitrary numbers of users in a single pass.
Findings
Significant reduction in speech recognition and speaker verification errors on overlapping speech, with up to four enrolled users.
Maintains performance in non-overlapping, noisy environments.
Applicable to other speaker-conditioned models like VAD and personalized ASR.
Abstract
In this paper, we propose a solution to allow speaker-conditioned speech models, such as VoiceFilter-Lite, to support an arbitrary number of enrolled users in a single pass. This is achieved by using an attention mechanism on multiple speaker embeddings to compute a single attentive embedding, which is then used as a side input to the model. We implemented multi-user VoiceFilter-Lite and evaluated it for three tasks: (1) a streaming automatic speech recognition (ASR) task; (2) a text-independent speaker verification task; and (3) a personalized keyphrase detection task, where ASR has to detect keyphrases from multiple enrolled users in a noisy environment. Our experiments show that, with up to four enrolled users, multi-user VoiceFilter-Lite is able to significantly reduce speech recognition and speaker verification errors when there is overlapping speech, without affecting performance…
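The core idea in the abstract, attending over several enrolled speaker embeddings to produce one embedding that is fed to the model as a side input, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the scaled dot-product scoring, the per-frame query projection `w_query`, the function name `attentive_embedding`, and all dimensions are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def attentive_embedding(frame_features, speaker_embeddings, w_query):
    """Combine N enrolled speaker embeddings into one embedding per frame.

    frame_features:     (T, F) acoustic features for T frames
    speaker_embeddings: (N, D) one embedding per enrolled user
    w_query:            (F, D) hypothetical projection from features to queries

    Assumption: scaled dot-product attention; the paper may score differently.
    """
    queries = frame_features @ w_query                      # (T, D)
    scale = np.sqrt(speaker_embeddings.shape[1])
    scores = queries @ speaker_embeddings.T / scale         # (T, N)
    weights = softmax(scores, axis=-1)                      # (T, N), rows sum to 1
    attended = weights @ speaker_embeddings                 # (T, D)
    return attended, weights


# Toy example with made-up sizes: 5 frames, 8 features, 3 enrolled users.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))
embeddings = rng.normal(size=(3, 4))
w_query = rng.normal(size=(8, 4))

attended, weights = attentive_embedding(frames, embeddings, w_query)
```

Because the output has the same dimension as a single speaker embedding, a model conditioned on one embedding can consume it unchanged regardless of how many users are enrolled. With a single enrolled user the attention weights are all 1, so the sketch reduces to the ordinary single-user side input.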
Taxonomy
Methods: VoiceFilter-Lite
