Location-based training for multi-channel talker-independent speaker separation
Hassan Taherian, Ke Tan, and DeLiang Wang

TL;DR
This paper introduces location-based training (LBT), which uses spatial information from microphone arrays to resolve permutation ambiguity in multi-channel speaker separation. On two- and three-speaker mixtures, azimuth-based LBT consistently outperforms permutation-invariant training (PIT).
Contribution
The study proposes a location-based training approach that assigns speakers to outputs according to their spatial locations (azimuth angle or source distance). This avoids the search over output permutations that PIT requires and improves separation performance over PIT.
Findings
LBT outperforms PIT in separating two- and three-speaker mixtures.
Azimuth-based training consistently outperforms PIT; distance-based training improves separation further when speaker azimuths are close.
Dynamic selection of training type further improves separation results.
Abstract
Permutation-invariant training (PIT) is a dominant approach for addressing the permutation ambiguity problem in talker-independent speaker separation. Leveraging spatial information afforded by microphone arrays, we propose a new training approach to resolving permutation ambiguities for multi-channel speaker separation. The proposed approach, named location-based training (LBT), assigns speakers on the basis of their spatial locations. This training strategy is easy to apply, and organizes speakers according to their positions in physical space. Specifically, this study investigates azimuth angles and source distances for location-based training. Evaluation results on separating two- and three-speaker mixtures show that azimuth-based training consistently outperforms PIT, and distance-based training further improves the separation performance when speaker azimuths are close.…
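As a rough illustration of the two training criteria the abstract contrasts, the sketch below computes a PIT-style loss (minimum over all speaker-to-output assignments) and an LBT-style loss (one fixed assignment obtained by sorting speakers on a location cue). Everything here is an assumption made for illustration: the function names `pit_loss` and `lbt_loss` are hypothetical, mean squared error stands in for whatever loss the paper uses, and ascending azimuth order is an arbitrary convention. It is not the authors' implementation.

```python
import itertools
import numpy as np


def mse(a, b):
    """Mean squared error between two signals (stand-in loss, an assumption)."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))


def pit_loss(estimates, targets):
    """PIT-style loss: evaluate every assignment of targets to outputs, keep the best."""
    n = len(targets)
    return float(min(
        np.mean([mse(estimates[i], targets[perm[i]]) for i in range(n)])
        for perm in itertools.permutations(range(n))
    ))


def lbt_loss(estimates, targets, locations):
    """LBT-style loss: sort targets by a location cue (e.g. azimuth or distance)
    and pair them with the outputs in that fixed order, with no permutation search."""
    order = np.argsort(locations)  # ascending order is an illustrative convention
    return float(np.mean([mse(estimates[i], targets[j]) for i, j in enumerate(order)]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.standard_normal((3, 16000))   # three hypothetical speakers
    azimuths = np.array([120.0, 15.0, 60.0])    # hypothetical azimuths in degrees
    # Outputs that already follow ascending azimuth order.
    estimates = targets[np.argsort(azimuths)]
    print("PIT loss:", pit_loss(estimates, targets))
    print("LBT loss:", lbt_loss(estimates, targets, azimuths))
```

In this toy setup the PIT-style loss evaluates N! assignments for N speakers, while the LBT-style loss evaluates one. When the outputs already follow the location ordering, both losses are zero; when they do not, only the LBT-style loss penalizes the mismatch, which is what ties each output to a position in space.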
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
