Combination of Deep Speaker Embeddings for Diarisation
Guangzhi Sun, Chao Zhang, Phil Woodland

TL;DR
This paper introduces the c-vector, a speaker embedding that combines multiple sets of d-vectors from different neural network systems, and reports lower speaker error rates than d-vectors on the AMI and NIST RT05 meeting datasets.
Contribution
The paper proposes the c-vector approach, implemented with three combination structures (2D self-attentive, gated additive, and bilinear pooling), together with a neural-based single-pass diarisation pipeline, improving robustness and performance over d-vectors.
Findings
Achieved up to 29% relative reduction in speaker error rate (SER) on the AMI eval set.
Demonstrated a 15% relative SER reduction on the NIST RT05 dataset.
Incorporating VoxCeleb data into the training set improved robustness.
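The figures above are relative, not absolute, reductions. As a minimal sketch of the arithmetic, the helper below computes a relative reduction from two error rates. The function name and the example numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_reduction(baseline_ser: float, new_ser: float) -> float:
    """Return the relative reduction of new_ser with respect to baseline_ser.

    Both arguments are error rates in the same unit (e.g. percent).
    The result is a fraction: 0.25 means a 25% relative reduction.
    """
    if baseline_ser <= 0:
        raise ValueError("baseline_ser must be positive")
    return (baseline_ser - new_ser) / baseline_ser


# Hypothetical error rates (in percent), NOT taken from the paper.
baseline = 20.0   # assumed d-vector system SER
combined = 14.2   # assumed c-vector system SER
print(f"relative SER reduction: {relative_reduction(baseline, combined):.1%}")
```

An absolute drop of 5.8 points from a 20.0% baseline corresponds to a 29% relative reduction, which is how a headline number of this kind would typically be read.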
Abstract
Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method by combining multiple sets of complementary d-vectors derived from systems with different NN components. Three structures are used to implement the c-vectors, namely 2D self-attentive, gated additive, and bilinear pooling structures, relying on attention mechanisms, a gating mechanism, and a low-rank bilinear pooling mechanism respectively. Furthermore, a neural-based single-pass speaker diarisation pipeline is also proposed in this paper, which uses NNs to achieve voice activity detection, speaker change point detection, and speaker embedding extraction. Experiments and…
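To make the idea of combining complementary d-vectors concrete, here is a minimal sketch of a generic gated additive combination of two embeddings. It assumes a per-dimension sigmoid gate computed from the concatenated inputs, which is a common gating form and not necessarily the exact structure used in the paper. The function names, the weight layout, and the toy dimensions are all illustrative assumptions.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_additive(d1, d2, weights, bias):
    """Combine two equal-length embeddings with a per-dimension gate.

    Assumed form (illustrative, not the paper's exact equations):
        g = sigmoid(W [d1; d2] + b)
        c = g * d1 + (1 - g) * d2
    weights has one row per output dimension, each of length 2 * len(d1).
    """
    if len(d1) != len(d2):
        raise ValueError("embeddings must have the same dimension")
    concat = list(d1) + list(d2)
    combined = []
    for i in range(len(d1)):
        pre_activation = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(pre_activation)
        combined.append(gate * d1[i] + (1.0 - gate) * d2[i])
    return combined


# Toy usage with random 4-dimensional "d-vectors" and untrained gate weights.
rng = random.Random(0)
dim = 4
d_vec_a = [rng.uniform(-1, 1) for _ in range(dim)]
d_vec_b = [rng.uniform(-1, 1) for _ in range(dim)]
W = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
b = [0.0] * dim
c_vec = gated_additive(d_vec_a, d_vec_b, W, b)
print(c_vec)
```

Because the gate lies strictly between 0 and 1, each output dimension is an interpolation between the two inputs. In a trained system the gate parameters would be learned jointly with the speaker classifier rather than sampled at random as they are here.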
