The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap
Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

TL;DR
This paper describes the Hitachi-JHU system for the DIHARD III challenge, combining five subsystems through DOVER-Lap to achieve competitive diarization error rates and secure second place in all tasks.
Contribution
The paper introduces an ensemble of five diarization subsystems (two x-vector-based, two end-to-end neural diarization-based, and one hybrid), each refined to be competitive and complementary, and combines their outputs with DOVER-Lap.
Findings
Achieved diarization error rates of 11.58% (full) and 14.09% (core) in Track 1.
Achieved diarization error rates of 16.94% (full) and 20.01% (core) in Track 2.
Secured second place in all challenge tasks.
Abstract
This paper provides a detailed description of the Hitachi-JHU system submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble result of five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refined each subsystem so that all five are competitive and complementary. After DOVER-Lap-based system combination, the system achieved diarization error rates of 11.58% and 14.09% on Track 1 full and core, and 16.94% and 20.01% on Track 2 full and core, respectively. With these results, we won second place in all tasks of the challenge.
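For readers unfamiliar with the metric, diarization error rate is commonly defined as the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The following is a minimal frame-level sketch of that definition, not the challenge's scoring tool. It assumes the hypothesis speaker labels have already been mapped to the reference labels (official scoring finds an optimal mapping and works on timed segments), and the function name `frame_der` and the toy data are hypothetical.

```python
def frame_der(ref, hyp):
    """Frame-level diarization error rate.

    ref, hyp: lists of sets of speaker labels, one set per frame.
    Assumption: hypothesis labels are already mapped to reference labels.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")

    miss = false_alarm = confusion = total = 0
    for r, h in zip(ref, hyp):
        n_ref, n_hyp = len(r), len(h)
        n_correct = len(r & h)  # speakers present in both reference and hypothesis
        miss += max(0, n_ref - n_hyp)          # reference speakers left uncovered
        false_alarm += max(0, n_hyp - n_ref)   # extra hypothesized speakers
        confusion += min(n_ref, n_hyp) - n_correct  # wrong speaker assigned
        total += n_ref                         # total reference speaker-frames

    if total == 0:
        raise ValueError("reference contains no speech")
    return (miss + false_alarm + confusion) / total


# Toy example: five frames, with overlap in the third frame.
ref = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hyp = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
print(f"DER = {frame_der(ref, hyp):.2%}")
```

In the toy example there is one confused frame, one missed speaker in the overlapped frame, and one false alarm in the silent frame, against five reference speaker-frames. Counting overlapped frames per speaker, as done here, is what makes overlap handling matter for the reported numbers.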
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
