End-to-end Neural Diarization: From Transformer to Conformer
Yi Chieh Liu, Eunjung Han, Chul Lee, Andreas Stolcke

TL;DR
This paper introduces a Conformer-based end-to-end neural diarization (EEND) system that improves speaker diarization accuracy by combining convolutional and Transformer architectures. It also uses data augmentation, and it trains on a mix of simulated and real data to address domain mismatch.
Contribution
The paper demonstrates that the Conformer architecture improves neural diarization performance, and it proposes mitigating the domain mismatch between simulated and real conversations by training on a mix of both.
Findings
Conformer-based EEND outperforms Transformer-based models, with a 24% error reduction.
Data augmentation and convolutional subsampling improve EEND performance.
Mixing simulated and real data reduces domain mismatch, enhancing diarization accuracy.
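The last finding, training on a mix of simulated and real data, can be illustrated with a minimal sketch. The summary does not state the paper's actual mixing recipe, so everything here is an assumption for illustration: the function name `mix_training_examples`, the `real_fraction` parameter, and the per-example sampling scheme are all hypothetical.

```python
import random


def mix_training_examples(simulated, real, real_fraction, num_examples, seed=0):
    """Build a training list from two pools of examples.

    Each item is drawn from `real` with probability `real_fraction`,
    otherwise from `simulated`. This is a hypothetical scheme, not the
    paper's documented procedure.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")
    if not simulated or not real:
        raise ValueError("both pools must be non-empty")

    rng = random.Random(seed)  # seeded for reproducible mixes
    mixed = []
    for _ in range(num_examples):
        # Pick the source pool first, then one example from that pool.
        pool = real if rng.random() < real_fraction else simulated
        mixed.append(rng.choice(pool))
    return mixed


# Usage: roughly 30% of the drawn examples come from the real pool.
batch = mix_training_examples(["sim_1", "sim_2"], ["real_1"], 0.3, 10)
```

Sampling per example rather than concatenating the two pools keeps the real-data share controllable even when the simulated pool is much larger.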
Abstract
We propose a new end-to-end neural diarization (EEND) system that is based on Conformer, a recently proposed neural architecture that combines convolutional mappings and Transformer to model both local and global dependencies in speech. We first show that data augmentation and convolutional subsampling layers enhance the original self-attentive EEND in the Transformer-based EEND, and then Conformer gives an additional gain over the Transformer-based EEND. However, we notice that the Conformer-based EEND does not generalize as well from simulated to real conversation data as the Transformer-based model. This leads us to quantify the mismatch between simulated data and real speaker behavior in terms of temporal statistics reflecting turn-taking between speakers, and investigate its correlation with diarization error. By mixing simulated and real data in EEND training, we mitigate the…
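The abstract mentions temporal statistics that reflect turn-taking between speakers, but this excerpt does not list which statistics the authors use. As a rough sketch only, the following computes a few plausible ones (silence, single-speaker, and overlap ratios, plus a count of speaker changes) from frame-level activity labels. The function name `turn_taking_stats` and the choice of statistics are assumptions, not the paper's definitions.

```python
def turn_taking_stats(labels):
    """Summarize turn-taking from per-frame speaker activity.

    `labels` is a list of tuples of 0/1 flags, one flag per speaker,
    e.g. (1, 0) means only the first speaker is active in that frame.
    The statistics chosen here are illustrative assumptions.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("labels must be non-empty")

    silence = sum(1 for frame in labels if sum(frame) == 0)
    overlap = sum(1 for frame in labels if sum(frame) >= 2)
    single = n - silence - overlap

    # Count how often the sole active speaker differs from the previous
    # sole active speaker; silence and overlap frames are skipped.
    changes = 0
    previous = None
    for frame in labels:
        if sum(frame) == 1:
            speaker = frame.index(1)
            if previous is not None and speaker != previous:
                changes += 1
            previous = speaker

    return {
        "silence_ratio": silence / n,
        "single_ratio": single / n,
        "overlap_ratio": overlap / n,
        "speaker_changes": changes,
    }


# Usage on a short two-speaker example of 8 frames.
frames = [(0, 0), (1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (1, 0), (0, 0)]
stats = turn_taking_stats(frames)
```

Computing the same summary on simulated mixtures and on real conversations gives one simple way to compare the two distributions.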
Taxonomy
Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · End-to-End Neural Diarization · Byte Pair Encoding · Attention Is All You Need · Adam · Label Smoothing · Residual Connection
