TL;DR
This paper introduces a self-supervised metric learning approach for speaker diarization that jointly learns the speaker embeddings and the similarity metric used for graph-based clustering. It reports relative diarization error rate (DER) reductions of 60% on AMI and 7% on DIHARD.
Contribution
It presents a joint learning framework that combines self-supervised metric learning with graph clustering for speaker diarization, using pseudo labels from a previous clustering iteration as the training signal.
Findings
Achieved a 60% relative DER reduction on the AMI dataset.
Achieved a 7% relative DER reduction on the DIHARD dataset.
Demonstrated the effectiveness of self-supervised metric learning for speaker diarization.
Abstract
In this paper, we propose a novel algorithm for speaker diarization using metric learning for graph-based clustering. Graph clustering algorithms use an adjacency matrix consisting of similarity scores. These scores are computed between speaker embeddings extracted from pairs of audio segments within the given recording. We propose an approach that jointly learns the speaker embeddings and the similarity metric using principles of self-supervised learning. The metric learning network implements a neural model of probabilistic linear discriminant analysis (PLDA). The self-supervision is derived from the pseudo labels obtained from a previous iteration of clustering. The entire model of representation learning and metric learning is trained with a binary cross entropy loss. By combining the self-supervision based metric learning along with the graph-based clustering…
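The training signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: cosine similarity stands in for the neural PLDA scoring, the embeddings are made-up toy vectors, and the function names and the `scale` parameter are hypothetical. It shows how an adjacency matrix of pairwise scores and the pseudo labels from a previous clustering pass combine into a binary cross entropy loss.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def adjacency_matrix(embeddings):
    """Pairwise similarity scores between all segment embeddings.
    Assumption: cosine scoring replaces the paper's neural PLDA model."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]

def pairwise_targets(pseudo_labels):
    """Target is 1.0 when two segments share a pseudo label, else 0.0."""
    return [[1.0 if a == b else 0.0 for b in pseudo_labels]
            for a in pseudo_labels]

def bce_loss(scores, targets, scale=5.0, eps=1e-7):
    """Mean binary cross entropy over all off-diagonal segment pairs.
    Scores become probabilities through a sigmoid; `scale` is an
    assumed temperature, not a value taken from the paper."""
    n = len(scores)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a segment paired with itself carries no information
            p = 1.0 / (1.0 + math.exp(-scale * scores[i][j]))
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            t = targets[i][j]
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count

# Toy data: four segment embeddings, two per (hypothetical) speaker.
embeddings = [[1.0, 0.1], [0.9, 0.2], [-0.1, 1.0], [0.0, 0.9]]
pseudo_labels = [0, 0, 1, 1]  # as if produced by a previous clustering pass

A = adjacency_matrix(embeddings)
T = pairwise_targets(pseudo_labels)
loss_good = bce_loss(A, T)
loss_bad = bce_loss(A, pairwise_targets([0, 1, 0, 1]))
```

In this toy setup, pseudo labels that agree with the embedding geometry give a lower loss than mismatched ones. In a trainable version, gradients of this loss would update both the embedding extractor and the scoring model, and the updated scores would feed the next round of clustering.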
