Discriminative Neural Clustering for Speaker Diarisation
Qiujia Li, Florian L. Kreyssig, Chao Zhang, and Philip C. Woodland

TL;DR
This paper introduces Discriminative Neural Clustering (DNC), a supervised sequence-to-sequence approach to speaker diarisation built on the Transformer architecture, which reduces the speaker error rate on the AMI dataset relative to spectral clustering.
Contribution
The paper presents a novel supervised neural clustering method, together with data augmentation techniques for training it on limited data, that outperforms traditional spectral clustering on speaker diarisation.
Findings
DNC reduces speaker error rate by 29.4% relative to spectral clustering.
Data augmentation schemes improve training effectiveness on limited data.
Transformer-based DNC is effective for speaker diarisation tasks.
Abstract
In this paper, we propose Discriminative Neural Clustering (DNC) that formulates data clustering with a maximum number of clusters as a supervised sequence-to-sequence learning problem. Compared to traditional unsupervised clustering algorithms, DNC learns clustering patterns from training data without requiring an explicit definition of a similarity measure. An implementation of DNC based on the Transformer architecture is shown to be effective on a speaker diarisation task using the challenging AMI dataset. Since AMI contains only 147 complete meetings as individual input sequences, data scarcity is a significant issue for training a Transformer model for DNC. Accordingly, this paper proposes three data augmentation schemes: sub-sequence randomisation, input vector randomisation, and Diaconis augmentation, which generates new data samples by rotating the entire input sequence of…
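The abstract is cut off before it says exactly what is rotated, so the following is only a minimal sketch of the general idea behind a rotation-style augmentation, not the authors' implementation. It assumes the input sequence is a list of fixed-dimension vectors with one cluster label per vector, and it applies a single random orthogonal transform (a rotation, possibly combined with a reflection) to every vector in the sequence. The function names `random_orthogonal_matrix`, `rotate_sequence`, and `cosine` are hypothetical, and the toy data is made up for illustration.

```python
import math
import random


def random_orthogonal_matrix(dim, rng):
    """Build a random orthogonal matrix via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along the rows already accepted.
        for r in rows:
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate_sequence(sequence, matrix):
    """Apply the same matrix to every vector in the sequence."""
    return [
        [sum(m * x for m, x in zip(row, vec)) for row in matrix]
        for vec in sequence
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy data (assumed shapes): 5 vectors of dimension 4, with cluster labels.
rng = random.Random(0)
sequence = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
labels = [0, 0, 1, 0, 1]

rotation = random_orthogonal_matrix(4, rng)
augmented = rotate_sequence(sequence, rotation)  # labels are reused unchanged
```

Because the same orthogonal transform is applied to the whole sequence, pairwise cosine similarities and vector norms are preserved, so the original cluster labels remain valid for the augmented sequence while the model sees the vectors in a different region of the input space.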
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
