Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss

Chenglin Xu; Wei Rao; Eng Siong Chng; Haizhou Li

arXiv:1903.09952 · eess.AS · March 26, 2019 · 1 citation

TL;DR

This paper introduces a loss function that combines magnitude and temporal spectrum approximation to train speaker extraction networks, and reports relative gains in SDR and PESQ over the SpeakerBeam-FE (SBF) baseline.

Contribution

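It proposes a magnitude and temporal spectrum approximation loss for estimating a phase-sensitive mask of the target speaker, and a concatenation framework that replaces the context adaptive deep neural network of SBF for encoding the speaker embedding into the mask estimation network.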
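
As a rough illustration of the concatenation idea, the sketch below tiles a fixed speaker embedding across time and appends it to every frame of the mixture features before they would enter a mask estimation network. The function name `concat_speaker_embedding`, the array shapes, and the use of NumPy are assumptions made for illustration; the abstract does not say at which layer the concatenation happens, and this is not the authors' implementation.

```python
import numpy as np

def concat_speaker_embedding(frames, embedding):
    """Append one speaker embedding to every time frame.

    frames:    (T, F) array of per-frame mixture features (hypothetical shape).
    embedding: (D,) array, a single vector describing the target speaker.
    Returns a (T, F + D) array.
    """
    frames = np.asarray(frames, dtype=np.float64)
    embedding = np.asarray(embedding, dtype=np.float64)
    # Repeat the same embedding for each of the T frames.
    tiled = np.tile(embedding, (frames.shape[0], 1))
    # Concatenate along the feature axis.
    return np.concatenate([frames, tiled], axis=1)
```

Because the same vector is attached to every frame, the downstream layers can condition each time step on the target speaker without a separate adaptation network.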

Findings

01

Achieves a 70.4% relative SDR improvement over the SBF baseline

02

Improves PESQ by 17.7% relative to the SBF baseline

03

Significantly enhances separation for both same-gender and different-gender mixtures

Abstract

The SpeakerBeam-FE (SBF) method was proposed for speaker extraction. It attempts to overcome the problem of an unknown number of speakers in an audio recording during source separation. However, the mask approximation loss of SBF is sub-optimal: it neither computes the signal reconstruction error directly nor considers the speech context. To address these problems, this paper proposes a magnitude and temporal spectrum approximation loss to estimate a phase-sensitive mask for the target speaker, conditioned on the speaker's characteristics. Moreover, this paper explores a concatenation framework, instead of the context adaptive deep neural network used in the SBF method, to encode a speaker embedding into the mask estimation network. Experimental results under the open evaluation condition show that the proposed method achieves 70.4% and 17.7% relative improvement over the SBF baseline on signal-to-distortion ratio (SDR) and…
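
The abstract does not give the loss formula, so the following is only a minimal sketch of what a magnitude plus temporal spectrum approximation loss could look like. It assumes the magnitude term compares the masked mixture magnitude with a phase-sensitive target (the clean magnitude scaled by the cosine of the phase difference), and that the temporal term compares first- and second-order differences of those spectra along time. The function names, the simple frame-difference `delta`, and the weights `w_delta` and `w_accel` are all hypothetical choices for illustration.

```python
import numpy as np

def delta(x):
    """First-order difference along time (axis 0); the first frame is set to zero.
    This simple difference is an assumption, not necessarily the paper's definition."""
    d = np.zeros_like(x)
    d[1:] = x[1:] - x[:-1]
    return d

def phase_sensitive_target(target_mag, target_phase, mix_phase):
    """Clean magnitude projected onto the mixture phase."""
    return target_mag * np.cos(mix_phase - target_phase)

def mtsa_loss(mask, mix_mag, target_mag, target_phase, mix_phase,
              w_delta=1.0, w_accel=1.0):
    """Sketch of a magnitude + temporal spectrum approximation loss.

    All spectral inputs are (T, F) arrays. w_delta and w_accel are
    hypothetical weights on the temporal (dynamic) terms.
    """
    est = mask * mix_mag  # estimated target magnitude
    ref = phase_sensitive_target(target_mag, target_phase, mix_phase)

    def mse(a, b):
        return float(np.mean((a - b) ** 2))

    d_est, d_ref = delta(est), delta(ref)
    static_term = mse(est, ref)                  # magnitude approximation
    delta_term = mse(d_est, d_ref)               # first-order temporal term
    accel_term = mse(delta(d_est), delta(d_ref)) # second-order temporal term
    return static_term + w_delta * delta_term + w_accel * accel_term
```

Unlike a loss defined on the mask itself, every term here is computed on the reconstructed spectrum, and the difference terms penalize errors in how the spectrum evolves across neighbouring frames.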

Code & Models

Repositories

xuchenglin28/speaker_extraction (TensorFlow, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis