Multi-target Extractor and Detector for Unknown-number Speaker Diarization
Chin-Yi Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang

TL;DR
This paper introduces a neural network architecture that extracts speaker representations and detects their presence in multi-speaker conversations, improving diarization accuracy across varying speaker counts.
Contribution
The proposed model combines speaker representation (z-vector) extraction and frame-level presence detection in a unified framework that handles an unknown number of speakers.
Findings
Outperforms most previously proposed methods on the CALLHOME corpus
Achieves 6.4% to 30.9% relative diarization error rate reduction
Effective in scenarios with 2 to 7 simultaneous speakers
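A relative diarization error rate (DER) reduction is the share of a baseline's error that the new system removes. The sketch below shows only the arithmetic. The function name and the DER values are hypothetical and chosen for illustration; they are not results from the paper.

```python
def relative_der_reduction(baseline_der, new_der):
    """Return the relative DER reduction in percent.

    Both arguments are DER values in percent. The numbers used
    below are made up for illustration, not taken from the paper.
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - new_der) / baseline_der


# Hypothetical example: a baseline at 20.0% DER and a new system at 18.72% DER
reduction = relative_der_reduction(20.0, 18.72)
print(f"{reduction:.1f}% relative reduction")
```

A relative figure says nothing about the absolute error by itself, so it is best read alongside the baseline DER it was computed against.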
Abstract
Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging case with simultaneous speakers ranging from 2 to 7 show that our model achieves 6.4% to 30.9%…
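To make "detects the presence of each speaker on a frame-by-frame basis regardless of the number of speakers" concrete, here is a minimal sketch of frame-wise, multi-label presence detection. It is not the paper's architecture: the residual-network contextualizer is replaced by a plain dot product followed by a sigmoid, and the function name, embeddings, and threshold are all assumptions made for illustration.

```python
import math


def detect_presence(frame_embs, speaker_vecs, threshold=0.5):
    """Return a T x S matrix of 0/1 speaker-activity decisions.

    frame_embs:   T frame embeddings, each a list of D floats
    speaker_vecs: S speaker representations, each a list of D floats
    Scoring is a dot product plus sigmoid, a stand-in for a learned
    detector (an assumption, not the method described in the paper).
    """
    activity = []
    for frame in frame_embs:
        row = []
        for spk in speaker_vecs:
            score = sum(f * s for f, s in zip(frame, spk))
            prob = 1.0 / (1.0 + math.exp(-score))
            row.append(1 if prob > threshold else 0)
        activity.append(row)
    return activity


# Toy data: 3 frames, 2 speakers, 2-dimensional embeddings (all hypothetical)
frames = [[2.0, -2.0], [2.0, 2.0], [-2.0, -2.0]]
speakers = [[1.0, 0.0], [0.0, 1.0]]
activity = detect_presence(frames, speakers)
print(activity)
```

Each speaker gets an independent decision per frame, so a frame can be marked for several speakers (overlap) or for none (silence), and the same function accepts any number of speaker vectors.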
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
