End-to-End Speaker Diarization as Post-Processing
Shota Horiguchi, Paola Garcia, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

TL;DR
This paper proposes a hybrid speaker diarization approach combining clustering and end-to-end methods, significantly improving performance on multiple datasets by effectively handling overlapping speech.
Contribution
It introduces a novel iterative post-processing technique that enhances clustering-based diarization with a two-speaker end-to-end model, addressing limitations in overlapping speech detection.
Findings
Improved diarization accuracy across CALLHOME, AMI, and DIHARD II datasets.
Effective handling of overlapping speech through iterative refinement.
Consistent performance gains over state-of-the-art methods.
Abstract
This paper investigates the utilization of an end-to-end diarization model as post-processing of conventional clustering-based diarization. Clustering-based diarization methods partition frames into clusters of the number of speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker. On the other hand, some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification. Although some methods can treat a flexible number of speakers, they do not perform well when the number of speakers is large. To compensate for each other's weakness, we propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method. We iteratively select two speakers from the results and update the results of the two speakers to improve the overlapped…
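The iterative step described in the abstract (select two speakers from the clustering results, then update those two speakers' results with a two-speaker end-to-end model) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `refine_pairwise`, the callable `two_speaker_model`, the choice to visit every speaker pair once, the frame-selection rule, and the agreement-based permutation rule are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def refine_pairwise(labels, features, two_speaker_model):
    """Refine clustering-based diarization results one speaker pair at a time.

    labels: (T, S) 0/1 array from a clustering-based system (one speaker per frame).
    features: (T, D) array of frame-level features.
    two_speaker_model: hypothetical callable mapping (n, D) features to an
        (n, 2) 0/1 array of per-frame activities for two speakers.
    """
    refined = np.asarray(labels, dtype=int).copy()
    num_speakers = refined.shape[1]
    for i, j in combinations(range(num_speakers), 2):
        # Assumption: only frames where either selected speaker is active are re-estimated.
        mask = (refined[:, i] | refined[:, j]).astype(bool)
        if not mask.any():
            continue
        pred = np.asarray(two_speaker_model(features[mask]), dtype=int)
        current = refined[mask][:, [i, j]]
        # The model's output order is arbitrary; keep the order that agrees
        # better with the current results (assumed permutation rule).
        if (pred[:, ::-1] == current).sum() > (pred == current).sum():
            pred = pred[:, ::-1]
        # Multi-label output: both columns may be 1 on the same frame (overlap).
        refined[mask, i] = pred[:, 0]
        refined[mask, j] = pred[:, 1]
    return refined
```

Because the two-speaker model emits one activity per speaker per frame, a frame can end up assigned to both selected speakers, which is something the clustering output alone cannot represent.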
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
