Multi-Scale Speaker Diarization With Neural Affinity Score Fusion
Tae Jin Park, Manoj Kumar, Shrikanth Narayanan

TL;DR
This paper introduces a neural affinity score fusion method that combines multi-scale speaker segments to improve diarization accuracy, addressing the challenge of unreliable short speech segment representations.
Contribution
The paper presents a neural affinity score fusion model that learns a set of weights for combining affinity scores from multiple temporal scales of segments, balancing temporal resolution against speaker representation quality for diarization.
Findings
Achieves state-of-the-art diarization performance on the CALLHOME dataset.
Effectively balances temporal resolution and speaker representation quality.
Demonstrates improved accuracy over existing methods.
Abstract
Identifying the identity of the speaker of short segments in human dialogue has been considered one of the most challenging problems in speech signal processing. Speaker representations of short speech segments tend to be unreliable, resulting in poor fidelity of speaker representations in tasks requiring speaker recognition. In this paper, we propose an unconventional method that tackles the trade-off between temporal resolution and the quality of the speaker representations. To find a set of weights that balance the scores from multiple temporal scales of segments, a neural affinity score fusion model is presented. Using the CALLHOME dataset, we show that our proposed multi-scale segmentation and integration approach can achieve a state-of-the-art diarization performance.
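The idea of balancing scores from multiple temporal scales can be illustrated with a minimal sketch. This is not the authors' implementation: the weights here are fixed by hand rather than predicted by a neural model, the embeddings are random placeholders, and the window lengths, shifts, and the nearest-center mapping between scales are assumptions chosen for illustration. All function names are hypothetical.

```python
import numpy as np

def cosine_affinity(embeddings):
    """Pairwise cosine similarity between row vectors."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def segment_centers(duration, window, shift):
    """Center times of uniform segments of length `window`, hopped by `shift`."""
    starts = np.arange(0.0, max(duration - window, 0.0) + 1e-9, shift)
    return starts + window / 2.0

def fuse_multiscale(embeddings_per_scale, centers_per_scale, weights):
    """Weighted sum of per-scale affinity matrices on the finest-scale grid.

    The first entry of each list is treated as the base (finest) scale.
    Each base segment is paired with the segment of every other scale whose
    center is closest in time (an assumption made for this sketch).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1
    base_centers = centers_per_scale[0]
    n = len(base_centers)
    fused = np.zeros((n, n))
    for emb, centers, wk in zip(embeddings_per_scale, centers_per_scale, w):
        # Index of the nearest segment at this scale for each base segment.
        idx = np.abs(base_centers[:, None] - centers[None, :]).argmin(axis=1)
        aff = cosine_affinity(emb)
        fused += wk * aff[np.ix_(idx, idx)]
    return fused

# Illustrative setup: three scales as (window, shift) in seconds.
rng = np.random.default_rng(0)
duration = 10.0
scales = [(0.5, 0.25), (1.0, 0.5), (1.5, 0.75)]
centers = [segment_centers(duration, win, hop) for win, hop in scales]
# Random stand-ins for speaker embeddings (one row per segment).
embeddings = [rng.normal(size=(len(c), 16)) for c in centers]
# Hand-picked weights; in the paper a neural model finds the weights.
fused = fuse_multiscale(embeddings, centers, weights=[0.2, 0.3, 0.5])
```

The fused matrix keeps the temporal resolution of the shortest segments while borrowing the more reliable similarity estimates of the longer ones, and it could then be passed to any clustering step that accepts an affinity matrix.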
