A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

Yufeng Yang; Peidong Wang; DeLiang Wang

arXiv:2203.00725 · cs.SD · October 21, 2022 · 6 cites

TL;DR

This paper introduces a Conformer-based acoustic model for robust speech recognition that lowers word error rate while reducing model size and training time on the CHiME-4 dataset.

Contribution

It replaces the recurrent network in WRBN with a Conformer encoder, achieving better performance and efficiency in speech recognition tasks.

Findings

01

Achieves 6.25% WER on CHiME-4, outperforming WRBN by 8.4% relative.

02

Model size is reduced by 18.3%, and training time is cut by 79.6%.

03

Uses convolution-augmented attention for improved acoustic modeling.
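The relative figures above can be unpacked with a little arithmetic. The following is a minimal sketch, not taken from the paper: the helper names `relative_reduction` and `implied_baseline` are hypothetical, and because the reported percentages are rounded, the derived baseline WER and speed-up are approximations only.

```python
def relative_reduction(baseline, proposed):
    """Fraction by which `proposed` is lower than `baseline`."""
    return (baseline - proposed) / baseline


def implied_baseline(proposed, rel_reduction):
    """Baseline value implied by a proposed value and a relative reduction."""
    return proposed / (1.0 - rel_reduction)


# Figures as reported in the findings (rounded in the source).
conformer_wer = 6.25        # percent WER on CHiME-4
wer_rel_reduction = 0.084   # 8.4% relative improvement over WRBN
size_reduction = 0.183      # 18.3% smaller model
time_reduction = 0.796      # 79.6% less total training time

# Approximate WRBN WER implied by the two reported numbers.
wrbn_wer = implied_baseline(conformer_wer, wer_rel_reduction)

# A 79.6% cut in training time corresponds to this speed-up factor.
training_speedup = 1.0 / (1.0 - time_reduction)

# Fraction of the baseline model size that remains.
size_ratio = 1.0 - size_reduction

print(f"Implied WRBN WER: {wrbn_wer:.2f}%")
print(f"Training speed-up: {training_speedup:.1f}x")
print(f"Model size relative to WRBN: {size_ratio:.3f}")
```

Read this way, an 8.4% relative improvement is a drop of roughly 0.57 absolute WER points, and a 79.6% reduction in training time means training runs in about one fifth of the baseline time.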

Abstract

This study addresses robust automatic speech recognition (ASR) by introducing a Conformer-based acoustic model. The proposed model builds on the wide residual bi-directional long short-term memory network (WRBN) with utterance-wise dropout and iterative speaker adaptation, but employs a Conformer encoder instead of the recurrent network. The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling. The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus. Coupled with utterance-wise normalization and speaker adaptation, our model achieves a word error rate of 6.25%, a relative improvement of 8.4% over WRBN. In addition, the proposed Conformer-based model is 18.3% smaller in model size and reduces total training time by 79.6%.
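The abstract does not spell out how utterance-wise normalization is computed. One common reading, assumed here, is per-utterance mean and variance normalization: each feature dimension is normalized using statistics gathered over the frames of a single utterance rather than over a speaker or the whole corpus. The sketch below illustrates that assumption in plain Python; the function name `normalize_utterance` and the `eps` parameter are hypothetical and do not come from the paper.

```python
import math


def normalize_utterance(frames, eps=1e-8):
    """Mean-variance normalize one utterance, independently per feature dimension.

    `frames` is a list of feature vectors (one list of floats per frame).
    This is an assumed interpretation, not the authors' implementation.
    """
    num_frames = len(frames)
    num_dims = len(frames[0])

    # Per-dimension mean over all frames of this utterance.
    means = [sum(f[d] for f in frames) / num_frames for d in range(num_dims)]

    # Per-dimension (population) standard deviation over the same frames.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(num_dims)
    ]

    # eps guards against division by zero for constant dimensions.
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(num_dims)]
        for f in frames
    ]


# Toy utterance: two frames, two feature dimensions on very different scales.
utterance = [[1.0, 10.0], [3.0, 30.0]]
normalized = normalize_utterance(utterance)
```

After normalization both dimensions of the toy utterance have zero mean and approximately unit variance, so differences in scale between recordings no longer reach the acoustic model.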

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Memory Network · Dropout