Separation Guided Speaker Diarization in Realistic Mismatched Conditions
Shu-Tong Niu, Jun Du, Lei Sun, Chin-Hui Lee

TL;DR
This paper introduces a separation guided speaker diarization approach that combines speech separation and clustering to better handle overlapping speech in realistic conditions, significantly reducing diarization errors.
Contribution
It proposes a novel SGSD framework that effectively integrates speech separation with clustering, addressing the limitations of conventional methods in overlapping speech scenarios under mismatched conditions.
Findings
SGSD significantly reduces diarization error rates by over 20%.
Separation-based processing improves handling of overlapping speech in realistic data.
The approach outperforms state-of-the-art clustering-based diarization systems.
Abstract
We propose a separation guided speaker diarization (SGSD) approach that fully utilizes the complementarity of speech separation and speaker clustering. Since the conventional clustering-based speaker diarization (CSD) approach cannot handle overlapping speech segments well, we investigate, in this study, separation-based speaker diarization (SSD), which inherently has the potential to handle the speaker overlap regions. Our preliminary analysis shows that the state-of-the-art Conv-TasNet based speech separation, which works quite well on the simulation data, is unstable in realistic conversational speech due to the highly mismatched speaking styles in simulated training speech and read speech. In doing so, separation-based processing can assist CSD in handling the overlapping speech segments under the realistic mismatched conditions. Specifically, several strategies are designed to select…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD
