The xmuspeech system for multi-channel multi-party meeting transcription challenge
Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Lin Li, Qingyang Hong, Feng Tong

TL;DR
This paper presents a multi-channel speaker diarization system for meeting transcription that combines spatial information, a novel neural network architecture, and system fusion to improve diarization accuracy in multi-party, overlapped speech scenarios.
Contribution
It introduces DMSNet, a novel multi-channel sequence-to-sequence neural network with attention and Conformer components, enhancing overlapped speech handling and diarization performance.
Findings
Achieved a 10.1% decrease in Detection Error Rate (DetER) compared with an LSTM-based OSD module.
Reduced diarization error rate (DER) from 13.44% to 7.63% using DMSNet-based OSD.
Best fusion system achieved 7.09% DER on the evaluation set.
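The DER figures above can be put in relative terms with a short calculation. The sketch below is illustrative only: it uses the standard definition of diarization error rate (missed speech, false alarm, and speaker confusion time divided by total reference speech time), and the helper names `der` and `relative_reduction` are hypothetical, not taken from the paper. The source does not say whether the 10.1% DetER figure is absolute or relative, so only the DER change is computed here.

```python
def der(missed, false_alarm, confusion, total_speech):
    """Standard diarization error rate: error time over reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# DER values reported in the findings (in percent).
baseline_der = 13.44
dmsnet_osd_der = 7.63

absolute_drop = baseline_der - dmsnet_osd_der
relative_drop = relative_reduction(baseline_der, dmsnet_osd_der)

print(f"absolute reduction: {absolute_drop:.2f} points")
print(f"relative reduction: {relative_drop:.1%}")
```

Under these definitions, the drop from 13.44% to 7.63% is 5.81 percentage points, or roughly 43% relative to the baseline.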
Abstract
This paper describes the system developed by the XMUSPEECH team for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT). For the speaker diarization task, we propose a multi-channel speaker diarization system that obtains spatial information about speakers using Direction of Arrival (DOA) technology. The speaker-spatial embedding is generated from an x-vector and an s-vector derived from Filter-and-Sum Beamforming (FSB), which makes the embedding more robust. Specifically, we propose a novel multi-channel sequence-to-sequence neural network architecture named Discriminative Multi-stream Neural Network (DMSNet), which consists of an Attention Filter-and-Sum Block (AFSB) and a Conformer encoder. We explore DMSNet to address the overlapped speech problem on multi-channel audio. Compared with an LSTM-based OSD module, we achieve a decrease of 10.1% in Detection Error Rate (DetER). By performing DMSNet…
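For readers unfamiliar with filter-and-sum beamforming, the following is a minimal sketch of the classic, fixed-filter version: each microphone channel is passed through its own FIR filter and the results are summed. It is not the paper's AFSB, which the abstract describes as attention-based; the function names and the toy two-channel signal are assumptions made for illustration.

```python
def fir_filter(signal, taps):
    """Causal FIR filter: y[n] = sum_k taps[k] * signal[n - k], same length as input."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def filter_and_sum(channels, filters):
    """Filter each channel with its own FIR taps, then sum across channels."""
    if len(channels) != len(filters):
        raise ValueError("need exactly one filter per channel")
    filtered = [fir_filter(ch, taps) for ch, taps in zip(channels, filters)]
    return [sum(samples) for samples in zip(*filtered)]


# Toy example: the same source reaches microphone 1 one sample later
# than microphone 0.
mic0 = [1.0, 2.0, 3.0, 0.0]
mic1 = [0.0, 1.0, 2.0, 3.0]

# Delay-and-sum is the special case where each filter is a scaled, shifted
# impulse: delay mic0 by one sample so it lines up with mic1, weight both 0.5.
filters = [[0.0, 0.5], [0.5]]

beamformed = filter_and_sum([mic0, mic1], filters)
print(beamformed)
```

With the channels time-aligned, the source adds coherently, while uncorrelated noise on each microphone would be attenuated by the averaging. A learned or attention-weighted variant would replace the fixed taps with values predicted from the input.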
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
