Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss
Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

TL;DR
This paper introduces a loss function that combines magnitude and temporal spectrum approximation to train speaker extraction neural networks, reporting 70.4% relative SDR improvement over the SpeakerBeam-FE (SBF) baseline.
Contribution
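It estimates a phase-sensitive mask for the target speaker using a new magnitude and temporal spectrum approximation loss, and it replaces the context adaptive deep neural network of SBF with a concatenation framework that encodes the speaker embedding into the mask estimation network.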
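As a rough illustration of the concatenation idea, the sketch below tiles a fixed speaker embedding across time and appends it to every frame of the mixture features. The function name `concat_speaker_embedding`, the array shapes, and the choice to concatenate at the feature level are assumptions made for illustration; the summary does not state at which layer the paper performs the concatenation.

```python
import numpy as np

def concat_speaker_embedding(features, embedding):
    """Append the same speaker embedding to every time frame.

    features:  array of shape (T, D), one feature vector per frame (assumed layout)
    embedding: array of shape (E,), a fixed-length speaker embedding (assumed)
    returns:   array of shape (T, D + E)
    """
    num_frames = features.shape[0]
    # Repeat the embedding once per frame so it can be joined frame by frame.
    tiled = np.tile(embedding, (num_frames, 1))
    return np.concatenate([features, tiled], axis=1)

# Hypothetical sizes: 5 frames, 4 spectral features, 3-dimensional embedding.
frames = np.zeros((5, 4))
speaker = np.array([0.1, 0.2, 0.3])
joined = concat_speaker_embedding(frames, speaker)
```

The joined array would then be fed to the mask estimation layers, so that every frame is processed with the target speaker's characteristics available.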
Findings
Achieves 70.4% relative SDR improvement over the SBF baseline
Improves PESQ by 17.7% relative to the SBF baseline
Significantly enhances separation for both same-gender and different-gender mixtures
Abstract
The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It attempts to overcome the problem of an unknown number of speakers in an audio recording during source separation. The mask approximation loss of SBF is sub-optimal because it neither computes the signal reconstruction error directly nor considers the speech context. To address these problems, this paper proposes a magnitude and temporal spectrum approximation loss to estimate a phase-sensitive mask for the target speaker with the speaker characteristics. Moreover, this paper explores a concatenation framework, instead of the context adaptive deep neural network used in the SBF method, to encode a speaker embedding into the mask estimation network. Experimental results under the open evaluation condition show that the proposed method achieves 70.4% and 17.7% relative improvement over the SBF baseline on signal-to-distortion ratio (SDR) and…
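The loss described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the magnitude term is a mean squared error between the masked mixture magnitude and the phase-sensitive target `|S| cos(theta_y - theta_s)`, and that the temporal term compares first- and second-order frame differences of the same two quantities. The function name `mtsa_loss`, the use of simple frame differences, and the weights `w_delta` and `w_accel` are hypothetical.

```python
import numpy as np

def mtsa_loss(mask, mix_mag, mix_phase, tgt_mag, tgt_phase,
              w_delta=1.0, w_accel=1.0):
    """Sketch of a magnitude plus temporal spectrum approximation loss.

    All inputs are (T, F) arrays: T time frames, F frequency bins (assumed layout).
    w_delta and w_accel are hypothetical weights for the temporal terms.
    """
    # Estimated target magnitude: mask applied to the mixture magnitude.
    est = mask * mix_mag
    # Phase-sensitive reference: target magnitude scaled by the cosine of
    # the phase difference between mixture and target.
    ref = tgt_mag * np.cos(mix_phase - tgt_phase)

    def mse(a, b):
        return np.mean((a - b) ** 2)

    # Magnitude term: direct reconstruction error on the spectrum.
    static = mse(est, ref)
    # Temporal terms: errors on first- and second-order differences along
    # time (an assumed stand-in for delta and acceleration features).
    delta = mse(np.diff(est, n=1, axis=0), np.diff(ref, n=1, axis=0))
    accel = mse(np.diff(est, n=2, axis=0), np.diff(ref, n=2, axis=0))
    return static + w_delta * delta + w_accel * accel

# Toy data with hypothetical sizes: 20 frames, 8 frequency bins.
rng = np.random.default_rng(0)
T, F = 20, 8
mix_mag = rng.uniform(0.5, 1.5, size=(T, F))
tgt_mag = rng.uniform(0.1, 1.0, size=(T, F))
mix_phase = rng.uniform(-np.pi, np.pi, size=(T, F))
tgt_phase = rng.uniform(-np.pi, np.pi, size=(T, F))

# A mask that reproduces the phase-sensitive reference exactly drives the loss to zero.
ideal_mask = tgt_mag * np.cos(mix_phase - tgt_phase) / mix_mag
loss_ideal = mtsa_loss(ideal_mask, mix_mag, mix_phase, tgt_mag, tgt_phase)
loss_zero = mtsa_loss(np.zeros((T, F)), mix_mag, mix_phase, tgt_mag, tgt_phase)
```

Unlike a mask approximation loss, which compares the estimated mask with an ideal mask, every term here is computed on the reconstructed spectrum, and the difference terms penalize errors in how that spectrum evolves across neighbouring frames.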
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
