Target Speech Extraction Based on Blind Source Separation and X-vector-based Speaker Selection Trained with Data Augmentation
Zhaoyi Gu, Lele Liao, Kai Chen, Jing Lu

TL;DR
This paper proposes a sequential target speech extraction method combining blind source separation and x-vector speaker recognition, enhanced by data augmentation, to improve generalization and extraction accuracy in varied acoustic environments.
Contribution
The paper cascades BSS methods (ILRMA and MVAE) with an x-vector SR module trained with data augmentation, which selects the target speaker's signal from the separated outputs.
Findings
MVAE generalizes better to unseen speakers with augmented training.
The cascaded approach improves extraction accuracy in real-room environments.
Data augmentation enhances speaker recognition performance.
Abstract
Extracting the desired speech from a mixture is a meaningful and challenging task. End-to-end DNN-based methods, though attractive, face the problem of generalization. In this paper, we explore a sequential approach to target speech extraction that combines blind source separation (BSS) with an x-vector-based speaker recognition (SR) module. Two promising BSS methods based on the source independence assumption, independent low-rank matrix analysis (ILRMA) and the multi-channel variational autoencoder (MVAE), are used and compared. ILRMA employs nonnegative matrix factorization (NMF) to capture the spectral structures of source signals, while MVAE exploits the strong modeling power of deep neural networks (DNNs). However, previous investigations of MVAE have been limited to training with very few speakers, and the speech signals of the test speakers are usually included in the training data. We extend the training of MVAE…
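The selection stage of the sequential approach described above can be illustrated with a minimal sketch. It assumes that each separated BSS output has already been mapped to a fixed-length speaker embedding (standing in for an x-vector) and that an enrollment embedding of the target speaker is available. The function names `cosine_similarity` and `select_target`, the toy vectors, and the use of cosine scoring are all hypothetical choices made for illustration; the abstract does not state which scoring backend the paper uses.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_target(separated_embeddings, enrollment_embedding):
    """Pick the separated output whose embedding best matches the enrollment.

    separated_embeddings: one embedding per BSS output (hypothetical input;
        a real system would compute these with an x-vector extractor).
    enrollment_embedding: embedding of the target speaker's reference speech.
    Returns the index of the best-matching output and all scores.
    """
    scores = [
        cosine_similarity(e, enrollment_embedding) for e in separated_embeddings
    ]
    return int(np.argmax(scores)), scores


# Toy 3-dimensional vectors standing in for real speaker embeddings.
enrollment = np.array([1.0, 0.0, 0.5])
candidates = [
    np.array([0.0, 1.0, 0.0]),  # embedding of BSS output 0 (interferer)
    np.array([0.9, 0.1, 0.4]),  # embedding of BSS output 1 (close to target)
]

index, scores = select_target(candidates, enrollment)
print("selected BSS output:", index)
```

Because the BSS stage is blind, the order of its outputs is arbitrary, so some form of speaker-based selection like this is needed to decide which output is the target.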
Taxonomy
Topics: Blind Source Separation Techniques · Speech and Audio Processing · Speech Recognition and Synthesis
