Self-supervised Speaker Diarization
Yehoshua Dissen, Felix Kreuk, Joseph Keshet

TL;DR
This paper introduces an unsupervised deep learning approach for speaker diarization that generates high-quality speaker representations without annotated data, outperforming other unsupervised methods and nearing supervised model performance.
Contribution
The study presents a fully unsupervised neural speaker embedding model and a method for estimating the model's secondary hyperparameters without annotations.
Findings
Outperforms unsupervised baselines on CallHome when two speakers are present.
Nearly matches the performance of supervised models.
Effective in generating speaker representations without labeled data.
Abstract
Over the last few years, deep learning has grown in popularity for speaker verification, identification, and diarization. Inarguably, a significant part of this success is due to the demonstrated effectiveness of their speaker representations. These, however, are heavily dependent on large amounts of annotated data and can be sensitive to new domains. This study proposes an entirely unsupervised deep-learning model for speaker diarization. Specifically, the study focuses on generating high-quality neural speaker representations without any annotated data, as well as on estimating secondary hyperparameters of the model without annotations. The speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker. The trained encoder model is then used to self-generate pseudo-labels to subsequently train…
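The pairing idea in the abstract (adjacent segments assumed to come from the same speaker) can be illustrated with a minimal sketch. This is not the authors' code: the function name `adjacent_segment_pairs`, the fixed non-overlapping segmentation, and the `segment_len` parameter are assumptions made for illustration only, and the encoder and its training loss are omitted.

```python
from typing import List, Sequence, Tuple


def adjacent_segment_pairs(
    samples: Sequence[float],
    segment_len: int,
) -> List[Tuple[Sequence[float], Sequence[float]]]:
    """Pair each fixed-length segment with the segment that follows it.

    Hypothetical helper: each returned pair is treated as a positive
    (same-speaker) example, on the assumption that the speaker rarely
    changes between neighbouring segments.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    # Drop the trailing partial segment, if any.
    n_segments = len(samples) // segment_len
    segments = [
        samples[i * segment_len:(i + 1) * segment_len]
        for i in range(n_segments)
    ]
    # zip over (s0, s1), (s1, s2), ...; empty when fewer than two segments.
    return list(zip(segments[:-1], segments[1:]))


# Toy "recording" of 10 samples split into segments of 3 samples each.
pairs = adjacent_segment_pairs(list(range(10)), segment_len=3)
```

Such pairs would serve as the positive examples for a self-supervised objective; how negatives are chosen and which loss is used is not stated in the excerpt above, so it is left out here.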
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
