Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

Zhengyang Chen; Bing Han; Shuai Wang; Yanmin Qian

arXiv:2309.06672 · cs.SD · September 14, 2023 · 1 citation

TL;DR

This paper introduces an attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND) with an Enhancer module, improving generalization to unseen numbers of speakers and achieving state-of-the-art results on multiple benchmarks.

Contribution

The paper proposes the AED-EEND model, which uses teacher forcing during training to address the speaker permutation problem, iterative decoding at evaluation time, and an Enhancer module to handle an unseen number of speakers.

Findings

01

Achieved state-of-the-art DER on CALLHOME, DIHARD II, and AMI datasets.

02

Demonstrated effectiveness of the enhancer module and iterative decoding approach.

03

Showed that training with more realistic simulated data improves model consistency.
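
The findings above are reported in terms of DER (diarization error rate). As a reminder of the standard definition, here is a minimal sketch: DER is the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The function name and the example durations are illustrative assumptions, not values from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER: (missed + false alarm + confusion) / total reference speech.

    All arguments are durations in the same unit (e.g. seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Illustrative durations only (hypothetical, not taken from the paper).
der = diarization_error_rate(missed=3.0, false_alarm=2.0, confusion=5.0,
                             total_speech=100.0)
print(f"DER = {der:.1%}")
```

Lower is better. Because false alarms are counted against reference speech time, DER can exceed 100%.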

Abstract

Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). In our training process, we introduce a teacher-forcing strategy to address the speaker permutation problem, leading to faster model convergence. For evaluation, we propose an iterative decoding method that outputs diarization results for each speaker sequentially. Additionally, we propose an Enhancer module to enhance the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore…
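
The iterative decoding idea in the abstract, producing one speaker's diarization result at a time, can be sketched roughly as a loop. This is a toy illustration under stated assumptions, not the authors' implementation: `iterative_decode`, `toy_decode_one`, the `max_speakers` cap, the 0.5 threshold, and the "stop when no frame is active" rule are all hypothetical, and the trained decoder is replaced by a lookup over hand-written labels.

```python
from typing import Callable, List, Sequence, Set


def iterative_decode(
    frames: Sequence[Set[int]],
    decode_one: Callable[[Sequence[Set[int]], List[List[int]]], List[float]],
    max_speakers: int = 8,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Decode speakers one at a time; stop when a step yields no active frame.

    The stopping rule and the speaker cap are assumptions for this sketch.
    """
    decoded: List[List[int]] = []  # one binary activity track per speaker
    for _ in range(max_speakers):
        # The (hypothetical) decoder sees the frames and all earlier outputs.
        probs = decode_one(frames, decoded)
        activity = [1 if p > threshold else 0 for p in probs]
        if not any(activity):
            break  # assumed signal that no further speaker exists
        decoded.append(activity)
    return decoded


def toy_decode_one(frames, decoded):
    """Stand-in for a trained decoder: reads the answer from toy labels.

    Each frame is the set of speaker ids active in it; the next speaker
    to decode is simply the next unused id.
    """
    target = len(decoded) + 1
    return [1.0 if target in frame else 0.0 for frame in frames]


# Four toy frames: speaker 1 alone, overlap, speaker 2 alone, silence.
frames = [{1}, {1, 2}, {2}, set()]
result = iterative_decode(frames, toy_decode_one)
print(result)  # [[1, 1, 0, 0], [0, 1, 1, 0]]
```

Because each step depends only on the frames and the previously decoded tracks, the loop does not need to know the number of speakers in advance, which is the property the sequential formulation is meant to illustrate.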

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing