Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
Pawel Cyrta, Tomasz Trzciński, Wojciech Stokowiec

TL;DR
This paper introduces a deep recurrent convolutional neural network for speaker diarization that learns speaker embeddings directly from magnitude spectrograms, reducing diarization error rate by over 30% with respect to the baseline.
Contribution
The paper presents a recurrent convolutional neural network that extracts speaker embeddings directly from magnitude spectrograms rather than from hand-crafted spectral features, improving diarization accuracy over existing approaches.
Findings
Reduces diarization error rate by over 30% compared to baseline.
Outperforms competing methods on the new dataset and on three other benchmark datasets.
Releases a new public dataset of over 6 hours of fully annotated broadcast material for speaker diarization research.
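The headline metric above, diarization error rate (DER), is conventionally the sum of missed speech, false alarm, and speaker confusion time divided by total reference speech time. The sketch below shows that arithmetic and a relative reduction against a baseline. The function names and all numbers are hypothetical illustrations, not values from the paper, and reading the 30% figure as a relative reduction is an assumption.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_reference):
    """Standard DER: error time divided by total reference speech time.

    All arguments are durations in the same unit (e.g. seconds).
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference


def relative_reduction(baseline_der, new_der):
    """Fraction by which new_der improves on baseline_der."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - new_der) / baseline_der


# Hypothetical durations in seconds, chosen only to illustrate the arithmetic.
baseline = diarization_error_rate(30.0, 20.0, 50.0, 1000.0)
proposed = diarization_error_rate(20.0, 15.0, 30.0, 1000.0)
reduction = relative_reduction(baseline, proposed)
print(f"baseline={baseline:.3f} proposed={proposed:.3f} reduction={reduction:.0%}")
```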
Abstract
In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to traditional approaches that build their speaker embeddings from hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly to magnitude spectrograms. To compare our approach with the state of the art, we collect and publicly release an additional dataset of over 6 hours of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that our proposed method significantly outperforms the competitors and reduces diarization error rate by a large margin of over 30% with respect to the baseline.
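To make the network input concrete, here is a minimal sketch of computing a magnitude spectrogram with NumPy. The frame length, hop size, Hann window, and 16 kHz sample rate are assumptions for illustration; the abstract does not state the authors' settings.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=400, hop=160):
    """Return |STFT| with shape (num_frames, frame_len // 2 + 1).

    frame_len and hop are in samples; the defaults are assumed values
    (25 ms frames, 10 ms hop at 16 kHz), not taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    num_frames = 1 + (signal.size - frame_len) // hop
    # Slice the signal into overlapping frames, one per row.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(num_frames)]
    )
    # Window each frame, take the one-sided FFT, and keep the magnitude only.
    return np.abs(np.fft.rfft(frames * window, axis=1))


# One second of a 1 kHz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = magnitude_spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
print(spec.shape, peak_bin * sample_rate / 400)
```

A model of the kind described would take such a time-frequency array (or fixed-length slices of it) as input, in place of hand-crafted spectral features.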
