Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer
Zhengyang Chen, Bing Han, Shuai Wang, Yanmin Qian

TL;DR
This paper introduces an attention-based encoder-decoder neural diarization system with an Enhancer module, improving generalization to scenarios with an unseen number of speakers and achieving state-of-the-art results on multiple benchmarks.
Contribution
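The paper proposes AED-EEND, an attention-based encoder-decoder model for end-to-end neural diarization. It uses teacher forcing during training to address the speaker permutation problem, iterative decoding at evaluation time to output one speaker's results at a time, and an Enhancer module that refines frame-level speaker embeddings so the model can handle an unseen number of speakers.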
Findings
Achieved state-of-the-art DER on CALLHOME, DIHARD II, and AMI datasets.
Demonstrated effectiveness of the enhancer module and iterative decoding approach.
Showed that training with more realistic simulated data improves model consistency.
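For reference, DER in the findings above is the diarization error rate. The page does not spell out the scoring setup used in the paper (for example, collar or overlap handling), so the formula below is only the conventional definition, with all durations measured over the evaluated audio:

\[
\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}{T_{\text{total reference speech}}}
\]

Lower is better, and the value can exceed 1 when false alarms are frequent.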
Abstract
Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). In our training process, we introduce a teacher-forcing strategy to address the speaker permutation problem, leading to faster model convergence. For evaluation, we propose an iterative decoding method that outputs diarization results for each speaker sequentially. Additionally, we propose an Enhancer module to enhance the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore…
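The iterative decoding idea described in the abstract (one speaker's diarization output per step) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the `decode_step` callable, the `max_speakers` cap, the 0.5 threshold, and the rule of stopping when a step yields no active frame are all hypothetical, since the truncated abstract does not specify them.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given the activity tracks decoded so far, return
# per-frame speech probabilities for the next speaker.
DecodeStep = Callable[[List[List[int]]], Sequence[float]]


def iterative_decode(
    decode_step: DecodeStep,
    max_speakers: int = 8,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Decode speakers one at a time until a step finds no active frame."""
    tracks: List[List[int]] = []
    for _ in range(max_speakers):
        probs = decode_step(tracks)
        # Binarize the per-frame probabilities for this speaker.
        active = [1 if p > threshold else 0 for p in probs]
        # Assumed stopping rule: an all-silent track means no speaker is left.
        if not any(active):
            break
        tracks.append(active)
    return tracks


def toy_decode_step(tracks: List[List[int]]) -> Sequence[float]:
    """Stand-in for a trained decoder: two canned speakers, then silence."""
    canned = [[0.9, 0.8, 0.1, 0.1], [0.1, 0.7, 0.9, 0.2]]
    if len(tracks) < len(canned):
        return canned[len(tracks)]
    return [0.0] * 4


if __name__ == "__main__":
    for speaker, track in enumerate(iterative_decode(toy_decode_step)):
        print(f"speaker {speaker}: {track}")
```

Because the loop decides when to stop at inference time, the number of output speakers is not fixed in advance. In a real system `decode_step` would wrap the trained decoder rather than return canned values.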
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
