Multi-Channel Sequence-to-Sequence Neural Diarization: Experimental Results for The MISP 2025 Challenge
Ming Cheng, Fei Su, Cancan Li, Juan Liu, Ming Li

TL;DR
This paper presents a multi-channel extension of the sequence-to-sequence neural diarization system. It ranked first in the speaker diarization task of the MISP 2025 Challenge, with a diarization error rate (DER) of 8.09% on the evaluation set.
Contribution
The paper introduces MC-S2SND, a multi-channel extension of S2SND that refines the initial single-channel S2SND predictions using multi-channel audio.
Findings
Achieved a DER of 8.09% on the evaluation set of the competition database
Ranked first in the MISP 2025 speaker diarization task
Demonstrated effectiveness of multi-channel audio in diarization
Abstract
This paper describes the speaker diarization system developed for the Multimodal Information-Based Speech Processing (MISP) 2025 Challenge. First, we utilize the Sequence-to-Sequence Neural Diarization (S2SND) framework to generate initial predictions using single-channel audio. Then, we extend the original S2SND framework to create a new version, Multi-Channel Sequence-to-Sequence Neural Diarization (MC-S2SND), which refines the initial results using multi-channel audio. The final system achieves a diarization error rate (DER) of 8.09% on the evaluation set of the competition database, ranking first place in the speaker diarization task of the MISP 2025 Challenge.
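For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false alarm, and speaker confusion time divided by the total reference speech time. The following is a minimal frame-level sketch of that definition, not the challenge's official scoring procedure. It assumes that both annotations are already discretized into equal-length frames, that hypothesis speaker labels have already been mapped to reference labels, and that no forgiveness collar is applied; the function name `frame_level_der` and the toy labels are hypothetical.

```python
def frame_level_der(reference, hypothesis):
    """Frame-level diarization error rate.

    Each argument is a list of sets; the i-th set holds the speaker labels
    active in frame i. Assumes hypothesis labels are already mapped to
    reference labels (no optimal speaker assignment is performed here).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    missed = false_alarm = confusion = total_ref = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)                  # speakers found in both
        missed += max(0, n_ref - n_hyp)             # reference speakers with no hypothesis slot
        false_alarm += max(0, n_hyp - n_ref)        # extra hypothesis speakers
        confusion += min(n_ref, n_hyp) - n_correct  # matched slots with the wrong label
        total_ref += n_ref

    if total_ref == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total_ref


# Toy example with five frames and two speakers (illustrative data only).
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
der = frame_level_der(reference, hypothesis)
print(f"DER = {der:.2%}")
```

In this toy case there is one confused frame, one missed speaker in the overlapped frame, and one false alarm in the silent frame, against five reference speaker-frames in total.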
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling
Methods: Sparse Evolutionary Training
