TL;DR
This paper introduces an iterative self-supervised clustering method for speaker diarization that alternates deep representation learning with path integral clustering (PIC), improving diarization error rate on the CALLHOME and AMI benchmarks.
Contribution
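The paper presents an iterative self-supervised clustering algorithm in which the cluster assignments from PIC serve as training targets for the speaker representation model, and the learned embeddings are then clustered again. The two stages are optimized in alternation rather than in isolation.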
Findings
13% improvement in diarization error rate (DER) on CALLHOME
59% improvement in DER on AMI
Outperforms recent diarization approaches
Abstract
Automatic speaker diarization techniques typically involve a two-stage processing approach where audio segments of fixed duration are converted to vector representations in the first stage. This is followed by an unsupervised clustering of the representations in the second stage. In most of the prior approaches, these two stages are performed in an isolated manner with independent optimization steps. In this paper, we propose a representation learning and clustering algorithm that can be iteratively performed for improved speaker diarization. The representation learning is based on principles of self-supervised learning while the clustering algorithm is a graph structural method based on path integral clustering (PIC). The representation learning step uses the cluster targets from PIC and the clustering step is performed on embeddings learned from the self-supervised deep model. This…
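The alternation described in the abstract (cluster the embeddings, use the cluster labels as targets to update the representation, then cluster again) can be illustrated with a toy sketch. This is not the paper's method: average-linkage agglomerative clustering on cosine affinity stands in for PIC, and pulling each embedding toward its cluster mean stands in for training the self-supervised deep model. All function names, the `step` parameter, and the number of rounds are hypothetical choices made for illustration, and the sketch assumes the number of speakers is known and no embedding is the zero vector.

```python
import numpy as np


def cosine_affinity(x):
    """Pairwise cosine similarity between row vectors (assumes no zero rows)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T


def agglomerative_labels(x, n_clusters):
    """Average-linkage agglomerative clustering (a simple stand-in for PIC)."""
    aff = cosine_affinity(x)
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > n_clusters:
        best, pair = -np.inf, None
        # Find the pair of clusters with the highest mean cross-affinity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = aff[np.ix_(clusters[a], clusters[b])].mean()
                if score > best:
                    best, pair = score, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    labels = np.empty(len(x), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


def refine_embeddings(x, labels, step=0.5):
    """Stand-in for representation learning: move each embedding
    part of the way toward the mean of its assigned cluster."""
    out = x.copy()
    for k in np.unique(labels):
        mask = labels == k
        out[mask] = (1 - step) * x[mask] + step * x[mask].mean(axis=0)
    return out


def iterative_clustering(x, n_clusters, n_rounds=3):
    """Alternate clustering and embedding refinement for n_rounds."""
    emb = np.asarray(x, dtype=float).copy()
    for _ in range(n_rounds):
        labels = agglomerative_labels(emb, n_clusters)  # clustering step
        emb = refine_embeddings(emb, labels)            # representation step
    # Final clustering on the refined embeddings.
    return agglomerative_labels(emb, n_clusters), emb
```

Calling `iterative_clustering(embeddings, n_clusters=2)` on an array of per-segment embeddings returns one cluster label per segment together with the refined embeddings. In a real system the refinement step would be replaced by training a neural network on the cluster targets.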
