The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

Chen Shen; Yi Liu; Wenzhi Fan; Bin Wang; Shixue Wen; Yao Tian; Jun Zhang; Jingsheng Yang; Zejun Ma

arXiv:2202.04261·cs.SD·February 11, 2022

TL;DR

This paper presents the Volcspeech system for the ICASSP 2022 M2MeT challenge: a clustering-based speaker diarization system extended to handle overlapped speech (Track 1) and a Conformer-based multi-speaker speech recognition system with a neural multi-channel front-end (Track 2).

Contribution

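The paper proposes approaches for handling overlapped speech in clustering-based diarization (dereverberation, DOA estimation, multi-channel combination, overlap detection, and a modified DOVER-Lap fusion), a neural front-end for modeling multi-channel audio, and serialized output training for multi-speaker overlapped speech recognition.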

Findings

01 DER of 5.79% on the Eval set and 7.23% on the Test set

02 CER of 19.2% on the Eval set

03 Multi-channel combination and overlap detection reduce the missed speaker error
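
For reference, the two error rates quoted above are conventionally defined as below. This is the standard textbook form, not a statement of the challenge's exact scoring configuration (for example, any collar or overlap settings), and the symbol names are chosen here for illustration only.

```latex
% DER: fraction of reference speech time that is attributed incorrectly.
% T_miss = missed speech, T_fa = false alarm, T_conf = speaker confusion,
% T_ref  = total reference speech time (all symbols are illustrative names).
\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{fa}} + T_{\mathrm{conf}}}{T_{\mathrm{ref}}}

% CER: character-level edit distance normalized by reference length.
% S, D, I = substituted, deleted, inserted characters; N = reference characters.
\mathrm{CER} = \frac{S + D + I}{N}
```

Under these definitions, the missed speaker error mentioned in the abstract corresponds to the $T_{\mathrm{miss}}$ term, which grows when overlapped speech is assigned to only one speaker.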

Abstract

This paper describes our submission to the ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge. For Track 1, we propose several approaches that enable the clustering-based speaker diarization system to handle overlapped speech. Front-end dereverberation and direction-of-arrival (DOA) estimation are used to improve the accuracy of speaker diarization. Multi-channel combination and overlap detection are applied to reduce the missed speaker error. A modified DOVER-Lap is also proposed to fuse the results of different systems. We achieve a final DER of 5.79% on the Eval set and 7.23% on the Test set. For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture. Serialized output training is adopted for multi-speaker overlapped speech recognition. We propose a neural front-end module to model multi-channel audio and train the…
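
As a rough illustration of the serialized output training idea mentioned in the abstract, the sketch below builds a single training target from the transcripts of overlapping speakers by joining them with a speaker-change token. The ordering rule (by utterance start time), the `<sc>` token name, and the `Utterance` type are assumptions made for this example and are not details taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical token name marking a change of speaker in the target sequence.
SPEAKER_CHANGE = "<sc>"


@dataclass
class Utterance:
    """One speaker's transcript within a (possibly overlapped) segment."""
    speaker: str
    start: float  # start time in seconds
    text: str


def serialize_transcripts(utterances, sep=SPEAKER_CHANGE):
    """Join transcripts into one target string, ordered by start time.

    Assumption: utterances are serialized first-in, first-out by start time,
    with `sep` inserted between consecutive speakers' transcripts.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sep} ".join(u.text for u in ordered)


# Two overlapping utterances, listed out of order on purpose.
segment = [
    Utterance(speaker="B", start=1.4, text="sounds good"),
    Utterance(speaker="A", start=0.2, text="let us start the meeting"),
]
target = serialize_transcripts(segment)
print(target)  # let us start the meeting <sc> sounds good
```

With a target of this form, a single-output recognizer can be trained on overlapped audio without a separate output branch per speaker; how the paper tokenizes or orders the transcripts may differ from this sketch.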

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Absolute Position Encodings · Softmax · Byte Pair Encoding · Layer Normalization · Dropout · Label Smoothing