Exploring Speaker Diarization with Mixture of Experts

Gaobin Yang; Maokui He; Shutong Niu; Ruoyu Wang; Hang Chen; Jun Du

arXiv:2506.14750 · cs.SD · June 18, 2025

Open Access

TL;DR

This paper introduces a neural speaker diarization system that combines memory-aware speaker embeddings, a sequence-to-sequence architecture, and a mixture of experts to improve robustness and accuracy in complex acoustic environments.

Contribution

It presents a new neural diarization framework integrating memory-aware embeddings with a mixture of experts, achieving state-of-the-art results on multiple challenging datasets.

Findings

1. Enhanced robustness and generalization in speaker diarization.

2. State-of-the-art performance on the CHiME-6, DiPCo, Mixer 6, and DIHARD-III datasets.

3. Effective mitigation of model bias through SS-MoE.

Abstract

In this paper, we propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence architecture. The system leverages a memory module to enhance speaker embeddings and employs a Seq2Seq framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts in speaker diarization, and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including CHiME-6, DiPCo, Mixer 6 and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed…
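The abstract does not spell out how the SS-MoE module is built, so the following is only a rough sketch of what a "shared and soft" mixture of experts could look like, assuming the commonly used soft-routing scheme (every token contributes to every expert slot through softmax weights) plus one always-active shared expert added to the output. The function name `shared_soft_moe`, the `slot_keys` parameter, and the one-slot-per-expert layout are hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def shared_soft_moe(tokens, slot_keys, experts, shared_expert):
    """Illustrative soft MoE layer with an extra shared expert (assumed design).

    tokens        : list of n frame vectors, each of dimension d
    slot_keys     : one d-dimensional key per expert (one slot per expert)
    experts       : list of callables, vector -> vector of dimension d
    shared_expert : callable applied to every token, vector -> vector
    """
    n, m, d = len(tokens), len(slot_keys), len(tokens[0])

    # logits[i][j]: affinity between token i and the slot of expert j.
    logits = [[sum(t * k for t, k in zip(tok, key)) for key in slot_keys]
              for tok in tokens]

    # Dispatch: for each slot, a softmax over tokens gives the mixing
    # weights used to build that slot's input as a weighted token average.
    slot_outputs = []
    for j in range(m):
        dispatch = softmax([logits[i][j] for i in range(n)])
        slot_input = [sum(dispatch[i] * tokens[i][c] for i in range(n))
                      for c in range(d)]
        slot_outputs.append(experts[j](slot_input))

    # Combine: for each token, a softmax over slots mixes the expert
    # outputs; the shared expert's output is then added on top.
    outputs = []
    for i in range(n):
        combine = softmax(logits[i])
        mixed = [sum(combine[j] * slot_outputs[j][c] for j in range(m))
                 for c in range(d)]
        shared = shared_expert(tokens[i])
        outputs.append([a + b for a, b in zip(mixed, shared)])
    return outputs
```

Because routing is soft, no token is dropped and no expert is left idle, while the shared expert gives every frame a common processing path. Whether the paper's SS-MoE uses these exact weightings cannot be determined from the abstract alone.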

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Long Short-Term Memory · Sequence to Sequence