In search of strong embedding extractors for speaker diarisation

Jee-weon Jung; Hee-Soo Heo; Bong-Jin Lee; Jaesung Huh; Andrew Brown; Youngki Kwon; Shinji Watanabe; Joon Son Chung

arXiv:2210.14682·cs.SD·October 27, 2022

TL;DR

This paper investigates the effectiveness of speaker embedding extractors for diarisation, highlighting challenges in evaluation and handling overlapped speech, and proposes data augmentation techniques to improve performance in realistic scenarios.

Contribution

The paper introduces evaluation protocols that better reflect diarisation conditions and proposes two data augmentation methods to enhance embedding extractors for overlapped speech and speaker changes.

Findings

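01. The proposed augmentation techniques improve diarisation performance.

02. The proposed evaluation protocols correlate better with diarisation performance than widely adopted speaker verification protocols do.

03. State-of-the-art embedding extractors also benefit from the proposed methods.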

Abstract

Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. Second, embedding extractors have not seen utterances in which multiple speakers exist. These inputs are inevitably present in speaker diarisation because of overlapped speech and speaker changes; they degrade the performance. To mitigate the first problem, we generate speaker verification evaluation protocols that mimic the diarisation…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Attentive Walk-Aggregating Graph Neural Network