Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture
Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Yanyan Yue, Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee

TL;DR
This paper introduces NSD-MS2S, a neural speaker diarization system that combines memory-aware multi-speaker embedding (MA-MSE) with a sequence-to-sequence (Seq2Seq) architecture, improving both diarization accuracy and decoding efficiency.
Contribution
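The paper presents a neural diarization model that integrates memory-aware multi-speaker embedding, a sequence-to-sequence architecture, and a deep interactive module (DIM) inside the MA-MSE module. The authors describe it as the key technique behind their best-performing entry in the main track of the CHiME-7 DASR Challenge.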
Findings
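Achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, a 49% relative improvement over the official baseline system.
Incorporated input feature fusion to reduce memory usage during decoding, and used multi-head attention to capture features at different levels.
Served as the key technique behind the authors' best-performing entry in the main track of the CHiME-7 DASR Challenge.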
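To make the relative-improvement figure concrete, the short sketch below shows how such a number is usually computed and what baseline DER the two reported figures would imply. The baseline value is not stated on this page; the roughly 31% it produces is back-calculated from the rounded 15.9% and 49% figures, so treat it as an estimate rather than a reported result. The function names are illustrative only.

```python
def relative_improvement(baseline_der: float, system_der: float) -> float:
    """Fraction by which the error rate dropped relative to the baseline."""
    return (baseline_der - system_der) / baseline_der


def implied_baseline(system_der: float, rel_improvement: float) -> float:
    """Invert the formula above to recover the baseline error rate."""
    return system_der / (1.0 - rel_improvement)


# Figures taken from the findings: 15.9% macro DER, 49% relative improvement.
# The resulting baseline is an estimate derived from rounded inputs.
implied = implied_baseline(15.9, 0.49)
print(f"implied baseline DER: about {implied:.1f}%")  # about 31.2%
```

Because both inputs are rounded, the true baseline could differ from this estimate by a few tenths of a percentage point.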
Abstract
We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and the sequence-to-sequence (Seq2Seq) architecture, improving both efficiency and performance. We further reduce memory usage during decoding by incorporating input feature fusion, and employ a multi-head attention mechanism to capture features at different levels. NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, a relative improvement of 49% over the official baseline system, and was the key technique behind our best performance in the main track of the CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module (DIM) in the MA-MSE module to better retrieve…
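For readers unfamiliar with the metric, the sketch below illustrates the standard DER definition (missed speech plus false alarm plus speaker confusion, divided by total reference speech time) and how a macro average differs from a pooled one. It assumes the macro DER is the unweighted mean of per-scenario DERs; the scenario names and all durations are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ErrorDurations:
    """Error and reference durations in seconds for one evaluation scenario."""
    missed: float        # reference speech the system did not detect
    false_alarm: float   # non-speech the system labeled as speech
    confusion: float     # speech attributed to the wrong speaker
    total_speech: float  # total reference speech time


def der(e: ErrorDurations) -> float:
    """Standard diarization error rate for a single scenario."""
    return (e.missed + e.false_alarm + e.confusion) / e.total_speech


def macro_der(scenarios: dict[str, ErrorDurations]) -> float:
    """Unweighted mean of per-scenario DERs (assumed meaning of 'macro')."""
    return sum(der(e) for e in scenarios.values()) / len(scenarios)


def pooled_der(scenarios: dict[str, ErrorDurations]) -> float:
    """DER over all scenarios pooled together, weighted by speech duration."""
    errors = sum(e.missed + e.false_alarm + e.confusion for e in scenarios.values())
    return errors / sum(e.total_speech for e in scenarios.values())


# Hypothetical numbers, chosen only to show that the two averages differ.
example = {
    "scenario_a": ErrorDurations(60.0, 30.0, 10.0, 1000.0),  # DER 0.10
    "scenario_b": ErrorDurations(20.0, 10.0, 10.0, 200.0),   # DER 0.20
}
print(f"macro DER:  {macro_der(example):.3f}")
print(f"pooled DER: {pooled_der(example):.3f}")
```

A macro average gives each scenario equal weight regardless of how much audio it contains, so a short but difficult scenario influences the score as much as a long, easy one.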
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Softmax · Linear Layer
