The RoyalFlush System of Speech Recognition for M2MeT Challenge

Shuaishuai Ye, Peiyao Wang, Shunfei Chen, Xinhui Hu, and Xinkang Xu

arXiv:2202.01614 · cs.SD · February 25, 2022

TL;DR

This paper presents the RoyalFlush system for the multi-speaker ASR track of the M2MeT challenge. It combines WPE and beamforming front-end processing, extensive data augmentation, and model fusion, and reports CER reductions of 12.22% on the validation set and 12.11% on the test set.

Contribution

The paper introduces an SOT-based multi-speaker ASR system that combines WPE, beamforming, data augmentation, and model fusion, and reports CER reductions of more than 12% on both the validation and test sets of the M2MeT challenge.

Findings

01. 12.22% CER reduction on the validation set

02. 12.11% CER reduction on the test set

03. Effective combination of front-end processing and model fusion
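
For readers unfamiliar with the metric, the sketch below shows how a character error rate (CER) is commonly computed: the character-level edit distance between reference and hypothesis, divided by the reference length. This is the standard textbook definition, not the challenge's scoring script. This page does not say whether the reported reductions are absolute or relative, so both helpers are included. All function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def absolute_reduction(baseline_cer, system_cer):
    """Difference in CER, in the same units as the inputs."""
    return baseline_cer - system_cer


def relative_reduction(baseline_cer, system_cer):
    """CER difference expressed as a fraction of the baseline CER."""
    return (baseline_cer - system_cer) / baseline_cer
```

With made-up numbers, a baseline CER of 0.30 and a system CER of 0.18 give an absolute reduction of 0.12 but a relative reduction of 0.40, which is why the distinction matters when reading reported gains.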

Abstract

This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge. We adopted a serialized output training (SOT) based multi-speaker ASR system with large-scale simulated data. First, we investigated a set of front-end methods, including multi-channel weighted prediction error (WPE), beamforming, speech separation, and speech enhancement, to process the training, validation, and test sets. Based on the experimental results, we selected only WPE and beamforming as our front-end methods. Second, we put considerable effort into data augmentation for multi-speaker ASR, mainly adding noise and reverberation, overlapped speech simulation, multi-channel speech simulation, speed perturbation, and front-end processing, which brought a large performance improvement. Finally, in order to make full…
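
As a rough illustration of the overlapped speech simulation and SOT labeling mentioned in the abstract, the sketch below sums time-shifted utterances into one mixture and joins their transcripts in order of start time, separated by a speaker-change token. It rests on several assumptions: the page gives no details of the authors' simulation pipeline, the `<sc>` token and first-in-first-out ordering follow the usual description of SOT, and the function name and input layout are hypothetical.

```python
def mix_overlapped(utterances, sc_token="<sc>"):
    """Build one overlapped mixture and its serialized transcript.

    utterances: list of (samples, start_sample, transcript) tuples, where
    samples is a list of floats and start_sample is a non-negative offset.
    """
    if not utterances:
        raise ValueError("need at least one utterance")

    # SOT-style ordering (assumed): earlier-starting speakers come first.
    ordered = sorted(utterances, key=lambda u: u[1])

    # The mixture must be long enough to hold the latest-ending utterance.
    total = max(start + len(samples) for samples, start, _ in ordered)
    mixture = [0.0] * total

    # Overlap-add each utterance at its start offset.
    for samples, start, _ in ordered:
        for k, value in enumerate(samples):
            mixture[start + k] += value

    # Serialize the transcripts with the speaker-change token between them.
    label = f" {sc_token} ".join(text for _, _, text in ordered)
    return mixture, label


mixture, label = mix_overlapped([
    ([1.0, 1.0, 1.0], 2, "B"),       # second speaker starts at sample 2
    ([0.5, 0.5, 0.5, 0.5], 0, "A"),  # first speaker starts at sample 0
])
```

A realistic pipeline would also normalize levels, control the overlap ratio, and apply noise and reverberation before or after mixing. Those steps are omitted here because the page does not describe them.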

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing