VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin, Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno

TL;DR
This paper introduces VoiceFilter, a neural network-based system that isolates a target speaker's voice from multi-speaker audio using a reference signal, improving speech recognition accuracy.
Contribution
It proposes a novel speaker-conditioned spectrogram masking approach combining speaker embeddings and neural networks for targeted voice separation.
Findings
Reduces speech recognition WER on multi-speaker signals
Maintains minimal WER degradation on single-speaker signals
Demonstrates effectiveness of speaker-conditioned masking
Abstract
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
