Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture

Gaobin Yang; Maokui He; Shutong Niu; Ruoyu Wang; Yanyan Yue; Shuangqing Qian; Shilong Wu; Jun Du; Chin-Hui Lee

arXiv:2309.09180·eess.AS·December 27, 2023

Open Access 1 Repo

TL;DR

This paper introduces a novel neural speaker diarization system combining memory-aware multi-speaker embeddings with sequence-to-sequence architecture, significantly improving accuracy and efficiency in speaker diarization tasks.

Contribution

The paper presents a new neural diarization model that integrates memory-aware embeddings, sequence-to-sequence architecture, and a deep interactive module, achieving state-of-the-art results on the CHiME-7 dataset.

Findings

01. Achieved a macro DER of 15.9% on the CHiME-7 EVAL set, a 49% relative improvement over the official baseline system.

02. Reduced memory usage during decoding by incorporating input feature fusion.

03. Served as the key technique behind the best-performing entry in the main track of the CHiME-7 DASR Challenge.
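
As a rough worked check of the first finding, the baseline DER implied by the two reported figures can be back-computed. This assumes relative improvement is defined as the reduction in DER divided by the baseline DER, and it takes the rounded values 15.9% and 49% at face value; the resulting baseline figure is derived here and is not stated on this page.

```latex
\[
\text{relative improvement}
  = \frac{\mathrm{DER}_{\text{base}} - \mathrm{DER}_{\text{sys}}}{\mathrm{DER}_{\text{base}}}
\quad\Longrightarrow\quad
\mathrm{DER}_{\text{base}}
  = \frac{\mathrm{DER}_{\text{sys}}}{1 - 0.49}
  = \frac{15.9\%}{0.51}
  \approx 31.2\%
\]
```

Under that assumption, the system removes roughly 15 absolute DER points from a baseline of about 31%.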

Abstract

We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and sequence-to-sequence (Seq2Seq) architecture, leading to improvement in both efficiency and performance. Next, we further decrease the memory occupation of decoding by incorporating input features fusion and then employ a multi-head attention mechanism to capture features at different levels. NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, which signifies a relative improvement of 49% over the official baseline system, and is the key technique for us to achieve the best performance for the main track of CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module (DIM) in MA-MSE module to better retrieve…
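
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how a diarization error rate and its macro average are commonly computed. It assumes the standard definition (missed speech, false alarm, and speaker confusion time divided by total reference speech time) and assumes that "macro" means an unweighted mean over evaluation scenarios. The names `DiarizationErrors`, `der`, and `macro_der` are hypothetical, and the durations in the usage example are invented for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class DiarizationErrors:
    """Error durations in seconds for one scenario (hypothetical container)."""
    missed: float        # reference speech with no hypothesized speaker
    false_alarm: float   # hypothesized speech where the reference has none
    confusion: float     # speech attributed to the wrong speaker
    total_speech: float  # total reference speech time


def der(e: DiarizationErrors) -> float:
    """Standard DER: summed error time over total reference speech time."""
    if e.total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (e.missed + e.false_alarm + e.confusion) / e.total_speech


def macro_der(per_scenario: dict[str, DiarizationErrors]) -> float:
    """Unweighted mean of per-scenario DERs (assumed meaning of 'macro')."""
    if not per_scenario:
        raise ValueError("need at least one scenario")
    rates = [der(e) for e in per_scenario.values()]
    return sum(rates) / len(rates)


# Usage with made-up durations for two hypothetical scenarios.
example = {
    "scenario_a": DiarizationErrors(5.0, 3.0, 2.0, 100.0),
    "scenario_b": DiarizationErrors(20.0, 10.0, 10.0, 200.0),
}
print(f"macro DER = {macro_der(example):.3f}")
```

Macro averaging gives each scenario equal weight regardless of how much audio it contains, so a short but difficult scenario influences the final figure as much as a long, easy one.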

Code & Models

Repositories

liyunlongaaa/nsd-ms2s
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Softmax · Linear Layer