Speaker Diarization using Two-pass Leave-One-Out Gaussian PLDA Clustering of DNN Embeddings

Kiran Karra; Alan McCree

arXiv:2104.02469 · eess.AS · June 16, 2021

TL;DR

This paper introduces a two-pass speaker diarization system based on Leave-One-Out Gaussian PLDA (LGP) clustering of DNN embeddings, reaching an error rate below 4% on the Callhome corpus without task-dependent parameter tuning.

Contribution

It presents a new two-pass system whose second pass clusters at finer time resolution, building on an embedding training method in which the DNN directly optimizes LGP scoring, to improve diarization accuracy.

Findings

01  Achieved the first published error rate below 4% on the Callhome corpus without task-dependent parameter tuning.

02  Showed significant progress towards a robust single diarization solution.

03  A second clustering pass at finer time resolution significantly improves overall performance.

Abstract

Many modern systems for speaker diarization, such as the recently developed VBx approach, rely on clustering of DNN speaker embeddings followed by resegmentation. Two problems with this approach are that the DNN is not directly optimized for this task, and that the parameters need significant retuning for different applications. We have recently presented progress in this direction with a Leave-One-Out Gaussian PLDA (LGP) clustering algorithm and an approach to training the DNN such that embeddings directly optimize performance of this scoring method. This paper presents a new two-pass version of this system, where the second pass uses finer time resolution to significantly improve overall performance. For the Callhome corpus, we achieve the first published error rate below 4% without any task-dependent parameter tuning. We also show significant progress towards a robust single solution for…
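The abstract does not spell out the LGP algorithm, so the following is only a rough illustration of what "leave-one-out" scoring of embeddings against clusters can look like. It is not the authors' method: it assumes a simplified isotropic Gaussian model (speaker means drawn from N(0, between_var·I), embeddings from N(mean, within_var·I)), and the function names and the parameters `within_var`, `between_var`, and `iterations` are hypothetical. Each embedding is scored against every cluster with its own contribution removed from that cluster's statistics, and is then reassigned to the best-scoring cluster.

```python
import numpy as np


def leave_one_out_scores(embeddings, labels, num_clusters,
                         within_var=1.0, between_var=1.0):
    """Log-likelihood of each embedding under each cluster's predictive
    Gaussian, with the embedding itself excluded from its current cluster.

    Simplified isotropic stand-in for PLDA (an assumption of this sketch).
    """
    n, dim = embeddings.shape
    onehot = np.eye(num_clusters)[labels]   # (n, k) hard assignments
    counts = onehot.sum(axis=0)             # (k,) cluster sizes
    sums = onehot.T @ embeddings            # (k, dim) per-cluster sums
    scores = np.empty((n, num_clusters))
    for i in range(n):
        # Leave-one-out: remove embedding i from its own cluster's statistics.
        c = counts - onehot[i]
        s = sums - np.outer(onehot[i], embeddings[i])
        # Posterior over the cluster mean given the remaining c observations.
        post_var = 1.0 / (1.0 / between_var + c / within_var)   # (k,)
        post_mean = (post_var / within_var)[:, None] * s        # (k, dim)
        # Predictive distribution for a new embedding from that cluster.
        pred_var = post_var + within_var
        diff = embeddings[i] - post_mean
        scores[i] = -0.5 * (dim * np.log(2.0 * np.pi * pred_var)
                            + (diff ** 2).sum(axis=1) / pred_var)
    return scores


def recluster(embeddings, labels, num_clusters, iterations=10):
    """Reassign embeddings to their best leave-one-out cluster until stable."""
    labels = np.asarray(labels)
    for _ in range(iterations):
        scores = leave_one_out_scores(embeddings, labels, num_clusters)
        new_labels = scores.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Under the same assumptions, a two-pass variant would run `recluster` once on embeddings extracted from longer segments and then again on embeddings extracted at a finer time resolution, initialized from the first-pass labels; how the paper actually realizes its second pass is not described in the text above.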

Code & Models

Repositories

hltcoe/VBx (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing