VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Quan Wang; Hannah Muckenhirn; Kevin Wilson; Prashant Sridhar; Zelin Wu; John Hershey; Rif A. Saurous; Ron J. Weiss; Ye Jia; Ignacio Lopez Moreno

arXiv:1810.04826·eess.AS·June 20, 2019·49 cites

TL;DR

This paper introduces VoiceFilter, a neural network-based system that isolates a target speaker's voice from multi-speaker audio using a reference signal, improving speech recognition accuracy.

Contribution

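It proposes a novel speaker-conditioned spectrogram masking approach for targeted voice separation: a speaker recognition network produces an embedding of the target speaker, and a masking network takes that embedding together with the noisy spectrogram and produces a mask.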

Findings

01 Reduces speech recognition WER on multi-speaker signals

02 Maintains minimal WER degradation on single-speaker signals

03 Demonstrates effectiveness of speaker-conditioned masking

Abstract

In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
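The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the single linear layer with a sigmoid stands in for the masking network, the random unit vector stands in for the output of the speaker recognition network, and the function name, shapes, and weights are all hypothetical. Applying the mask by element-wise multiplication with the noisy magnitude spectrogram is the usual reading of "spectrogram masking" and is assumed here.

```python
import numpy as np

def apply_speaker_conditioned_mask(noisy_mag, speaker_emb, weights, bias):
    """Toy stand-in for a speaker-conditioned masking network (hypothetical).

    noisy_mag:   (T, F) non-negative magnitude spectrogram of the mixture
    speaker_emb: (D,)   embedding of the target speaker
    weights:     (F + D, F) weights of a single illustrative linear layer
    bias:        (F,)   bias of that layer
    """
    num_frames = noisy_mag.shape[0]
    # Repeat the embedding for every time frame so each frame is
    # conditioned on the target speaker's identity.
    tiled_emb = np.tile(speaker_emb, (num_frames, 1))           # (T, D)
    features = np.concatenate([noisy_mag, tiled_emb], axis=1)   # (T, F + D)
    logits = features @ weights + bias                          # (T, F)
    # Sigmoid keeps every mask value strictly between 0 and 1.
    mask = 1.0 / (1.0 + np.exp(-logits))
    # Assumed masking step: element-wise product with the noisy input.
    return mask * noisy_mag, mask

# Random placeholder data; real inputs would come from an STFT of the
# mixture and from a trained speaker recognition network.
rng = np.random.default_rng(0)
T, F, D = 50, 257, 64
noisy_mag = np.abs(rng.standard_normal((T, F)))
speaker_emb = rng.standard_normal(D)
speaker_emb /= np.linalg.norm(speaker_emb)
weights = 0.01 * rng.standard_normal((F + D, F))
bias = np.zeros(F)

enhanced, mask = apply_speaker_conditioned_mask(noisy_mag, speaker_emb, weights, bias)
```

In a trained system the weights would be learned so that the mask suppresses time-frequency bins dominated by other speakers; with the random weights above, the example only demonstrates shapes and the conditioning mechanism.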

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing