TL;DR
This paper introduces a self-supervised hierarchical clustering method that jointly learns speaker representations and clusters, reporting a 29% relative improvement over AHC with cosine similarity and a 10% relative improvement over a state-of-the-art PLDA-based system.
Contribution
It presents a novel joint clustering and representation learning algorithm in which the clustering solution itself supplies the self-supervision used to train the speaker embeddings.
Findings
29% relative improvement over AHC with cosine similarity
10% relative improvement over state-of-the-art PLDA-based system
Effective integration of clustering with representation learning
Abstract
The state-of-the-art speaker diarization systems use agglomerative hierarchical clustering (AHC) which performs the clustering of previously learned neural embeddings. While the clustering approach attempts to identify speaker clusters, the AHC algorithm does not involve any further learning. In this paper, we propose a novel algorithm for hierarchical clustering which combines the speaker clustering along with a representation learning framework. The proposed approach is based on principles of self-supervised learning where the self-supervision is derived from the clustering algorithm. The representation learning network is trained with a regularized triplet loss using the clustering solution at the current step while the clustering algorithm uses the deep embeddings from the representation learning step. By combining the self-supervision based representation learning along with the…
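The alternation described in the abstract (cluster the current embeddings, then use the cluster assignments as pseudo-labels for a triplet loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the average-linkage AHC, the random triplet sampling, the cosine-distance hinge loss, the margin of 0.2, and the synthetic two-speaker data are all assumptions made for illustration, and the regularization term of the paper's loss is omitted because the abstract does not specify it. Only a single round is shown, with no network update.

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

def ahc_cosine(x, n_clusters):
    """Naive average-linkage AHC on cosine similarity; returns one label per row."""
    sim = cosine_similarity_matrix(x)
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > n_clusters:
        best, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average similarity between all cross-cluster pairs.
                score = sim[np.ix_(clusters[i], clusters[j])].mean()
                if score > best:
                    best, best_pair = score, (i, j)
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = np.empty(len(x), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

def sample_triplets(labels, rng):
    """For each anchor, draw a positive from its cluster and a negative from another."""
    idx = np.arange(len(labels))
    triplets = []
    for a, lab in enumerate(labels):
        pos = idx[(labels == lab) & (idx != a)]
        neg = idx[labels != lab]
        if len(pos) and len(neg):
            triplets.append((a, rng.choice(pos), rng.choice(neg)))
    return np.array(triplets, dtype=int).reshape(-1, 3)

def triplet_loss(emb, triplets, margin=0.2):
    """Mean hinge triplet loss on cosine distance (1 - cosine similarity)."""
    sim = cosine_similarity_matrix(emb)
    a, p, n = triplets.T
    d_pos = 1.0 - sim[a, p]
    d_neg = 1.0 - sim[a, n]
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())

# Hypothetical data: 20 segment embeddings from two synthetic "speakers".
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
true_labels = np.repeat([0, 1], 10)
embeddings = centers[true_labels] + 0.05 * rng.standard_normal((20, 3))

# One round: the clustering step yields pseudo-labels, which define the triplets.
pseudo_labels = ahc_cosine(embeddings, n_clusters=2)
triplets = sample_triplets(pseudo_labels, rng)
loss = triplet_loss(embeddings, triplets)
```

In a full system, this loss would presumably be backpropagated through the embedding network, and the updated embeddings would feed the next clustering step.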
