Robust Speaker Clustering using Mixtures of von Mises-Fisher Distributions for Naturalistic Audio Streams
Harishchandra Dubey, Abhijeet Sangwan, John H. L. Hansen

TL;DR
This paper introduces a robust speaker clustering method using mixtures of von Mises-Fisher distributions, significantly improving diarization accuracy in naturalistic multi-speaker audio streams.
Contribution
The study proposes a novel speaker clustering approach based on von Mises-Fisher mixture models, tailored for high-dimensional normalized i-Vectors in naturalistic settings.
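For reference, the standard von Mises-Fisher density on the unit sphere and the corresponding mixture can be written as below. The notation (x for a unit-length i-Vector of dimension d, mean direction \mu, concentration \kappa, mixing weights \pi_k, K components) is assumed here for illustration and is not taken from the paper itself.

```latex
f(x \mid \mu, \kappa) = C_d(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)},
\qquad \|x\| = \|\mu\| = 1,\ \kappa \ge 0

p(x) = \sum_{k=1}^{K} \pi_k\, f(x \mid \mu_k, \kappa_k),
\qquad \sum_{k=1}^{K} \pi_k = 1
```

Here I_{\nu} denotes the modified Bessel function of the first kind of order \nu. Because the density depends on x only through the inner product \mu^{\top} x, it is closely related to cosine similarity on length-normalized vectors.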
Findings
Achieved up to 44.48% relative improvement on the PLTL corpus.
Achieved up to 53.68% relative improvement on the AMI corpus.
Outperformed baseline K-means clustering with cosine distance.
Abstract
Speaker diarization (i.e., determining who spoke and when) for multi-speaker naturalistic interactions such as Peer-Led Team Learning (PLTL) sessions is a challenging task. In this study, we propose robust speaker clustering based on a mixture of multivariate von Mises-Fisher distributions. Our diarization pipeline has two stages: (i) ground-truth segmentation; (ii) proposed speaker clustering. The ground-truth speech activity information is used for extracting i-Vectors from each speech segment. We post-process the i-Vectors with principal component analysis for dimension reduction, followed by length normalization. Normalized i-Vectors are high-dimensional unit vectors possessing discriminative directional characteristics. We model the normalized i-Vectors with a mixture model consisting of multivariate von Mises-Fisher distributions. K-means clustering with cosine distance is chosen as…
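The post-processing and clustering steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the farthest-point initialization, and the use of a single fixed concentration `kappa` shared by all components are all choices made here for brevity. With a shared fixed `kappa` the von Mises-Fisher normalizing constant cancels in the posterior, so no Bessel functions are needed. A full mixture would also estimate a concentration per component.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project rows of X onto the top principal directions (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def length_normalize(X, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


def init_means(X, n_clusters):
    """Greedy farthest-point initialization (an assumption, chosen so the
    sketch is deterministic): start from the first row, then repeatedly add
    the row least similar to the means picked so far."""
    idx = [0]
    for _ in range(1, n_clusters):
        sim = (X @ X[idx].T).max(axis=1)
        idx.append(int(sim.argmin()))
    return X[idx].copy()


def fit_movmf_fixed_kappa(X, n_clusters, kappa=20.0, n_iter=50):
    """Soft EM for a vMF mixture with one fixed, shared concentration.

    X must contain unit-norm rows. Returns hard labels and mean directions.
    """
    n = X.shape[0]
    mu = init_means(X, n_clusters)
    log_pi = np.full(n_clusters, -np.log(n_clusters))
    for _ in range(n_iter):
        # E-step: posterior is proportional to pi_k * exp(kappa * mu_k^T x).
        logits = kappa * (X @ mu.T) + log_pi
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and renormalized mean directions.
        weights = resp.sum(axis=0)
        log_pi = np.log(np.maximum(weights, 1e-12) / n)
        mu = length_normalize(resp.T @ X)
    return resp.argmax(axis=1), mu
```

A typical call would be `labels, mu = fit_movmf_fixed_kappa(length_normalize(pca_reduce(ivectors, 50)), n_speakers)`, where `ivectors`, the target dimension `50`, and `n_speakers` are hypothetical placeholders. Replacing the soft responsibilities with hard assignments would give spherical K-means, which is close in spirit to the cosine-distance K-means baseline mentioned in the abstract.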
