The xmuspeech system for multi-channel multi-party meeting transcription challenge

Jie Wang; Yuji Liu; Binling Wang; Yiming Zhi; Song Li; Shipeng Xia; Jiayang Zhang; Lin Li; Qingyang Hong; Feng Tong

arXiv:2202.05744·eess.AS·February 14, 2022

Open Access

TL;DR

This paper presents a multi-channel speaker diarization system for meeting transcription that combines spatial information, a novel neural network architecture, and system fusion to improve diarization accuracy in multi-party meetings with overlapped speech.

Contribution

It introduces DMSNet, a novel multi-channel sequence-to-sequence neural network with attention and Conformer components, enhancing overlapped speech handling and diarization performance.

Findings

01 Achieved a 10.1% reduction in Detection Error Rate (DetER) compared with an LSTM-based overlapped speech detection (OSD) module.

02 Reduced the diarization error rate (DER) from 13.44% to 7.63% using DMSNet-based OSD.

03 The best fusion system achieved 7.09% DER on the evaluation set.
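
As a reading aid for the figures above, here is a minimal sketch of how a diarization error rate is conventionally computed (missed speech, false alarm, and speaker confusion time divided by total reference speech time) and of the relative change implied by the reported drop from 13.44% to 7.63%. The function names and the example durations are hypothetical and are not taken from the paper; only the two DER percentages come from this page.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Conventional DER in percent; all arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech


def relative_reduction(before, after):
    """Relative reduction in percent when an error rate drops from before to after."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Made-up durations, only to show the formula: 100 s of errors in 1000 s of speech.
example_der = diarization_error_rate(30.0, 20.0, 50.0, 1000.0)

# DER values reported above for the cluster-based system with DMSNet-based OSD.
der_drop = relative_reduction(13.44, 7.63)
print(f"example DER: {example_der:.2f}%, relative DER reduction: {der_drop:.1f}%")
```

The page does not say whether the 10.1% DetER figure is an absolute or a relative reduction, so the sketch does not attempt to reproduce it.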

Abstract

This paper describes the system developed by the XMUSPEECH team for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT). For the speaker diarization task, we propose a multi-channel speaker diarization system that obtains speakers' spatial information using Direction of Arrival (DOA) technology. A speaker-spatial embedding is generated from an x-vector and an s-vector derived from Filter-and-Sum Beamforming (FSB), which makes the embedding more robust. Specifically, we propose a novel multi-channel sequence-to-sequence neural network architecture named Discriminative Multi-stream Neural Network (DMSNet), which consists of an Attention Filter-and-Sum Block (AFSB) and a Conformer encoder. We explore DMSNet to address the overlapped speech problem in multi-channel audio. Compared with an LSTM-based overlapped speech detection (OSD) module, we achieve a decrease of 10.1% in Detection Error Rate (DetER). By performing DMSNet…
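
For readers unfamiliar with filter-and-sum beamforming, the following is a minimal generic sketch: each microphone channel is passed through its own FIR filter and the filtered channels are summed. It is not the authors' AFSB or FSB implementation; the abstract does not say how the filters are obtained, so the filter taps below are hand-picked for a toy two-microphone case, and the names `filter_and_sum`, `mics`, and `taps` are hypothetical.

```python
import numpy as np


def filter_and_sum(channels, filters):
    """Generic filter-and-sum beamformer.

    channels: array of shape (num_channels, num_samples)
    filters:  array of shape (num_channels, num_taps), one FIR filter per channel
    Returns the beamformed signal with num_samples samples.
    """
    channels = np.asarray(channels, dtype=float)
    filters = np.asarray(filters, dtype=float)
    if channels.ndim != 2 or filters.ndim != 2:
        raise ValueError("channels and filters must both be 2-D arrays")
    if channels.shape[0] != filters.shape[0]:
        raise ValueError("need exactly one filter per channel")

    num_samples = channels.shape[1]
    output = np.zeros(num_samples)
    for signal, taps_for_channel in zip(channels, filters):
        # Full convolution, truncated to the input length (causal filtering).
        output += np.convolve(signal, taps_for_channel)[:num_samples]
    return output


# Toy case (assumed, not from the paper): the second microphone hears the
# source one sample later than the first.
source = np.array([1.0, 2.0, 3.0, 4.0, 0.0])
mics = np.stack([source, np.roll(source, 1)])

# Delay the first channel by one sample to align it with the second, then
# average the two. Delay-and-sum is the special case of filter-and-sum in
# which every filter is a scaled pure delay.
taps = np.array([[0.0, 0.5],
                 [0.5, 0.0]])

beamformed = filter_and_sum(mics, taps)
print(beamformed)
```

With both channels aligned before summation, the source adds coherently while uncorrelated noise on the individual channels would average down, which is the usual motivation for beamforming before extracting embeddings.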

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory