Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

Keqi Deng; Zehui Yang; Shinji Watanabe; Yosuke Higuchi; Gaofeng Cheng; Pengyuan Zhang

arXiv:2201.10103·eess.AS·January 27, 2022

Open Access

TL;DR

This paper introduces a non-autoregressive speech recognition model that leverages pre-trained acoustic and language models, achieving high accuracy and fast inference, especially for logographic languages like Chinese.

Contribution

The paper proposes a novel NAR CTC/attention model with a modality conversion mechanism and cache-based decoding, improving recognition accuracy over previous NAR systems.

Findings

01. 15.1% relative CER reduction on AISHELL-1

02. Outperforms previous NAR models on AISHELL-1

03. Shows potential for English speech recognition
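
For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of how a character error rate (CER) and a relative reduction between two systems are typically computed. The function names and the example numbers are hypothetical and are not taken from the paper; the sketch assumes CER is the character-level edit distance divided by the reference length.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical illustration only: a baseline at 10.0% CER and a new
# system at 8.0% CER differ by 2.0 points absolute, 20% relative.
print(f"{relative_reduction(10.0, 8.0):.1%}")
```

A relative reduction is always stated against a specific baseline, so the same absolute gain yields a larger relative figure when the baseline error is already low.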

Abstract

While Transformers have achieved promising results in end-to-end (E2E) automatic speech recognition (ASR), their autoregressive (AR) structure becomes a bottleneck for speeding up the decoding process. For real-world deployment, ASR systems are desired to be highly accurate while achieving fast inference. Non-autoregressive (NAR) models have become a popular alternative due to their fast inference speed, but they still fall behind AR systems in recognition accuracy. To fulfill the two demands, in this paper, we propose a NAR CTC/attention model utilizing both pre-trained acoustic and language models: wav2vec2.0 and BERT. To bridge the modality gap between speech and text representations obtained from the pre-trained models, we design a novel modality conversion mechanism, which is more suitable for logographic languages. During inference, we employ a CTC branch to generate a target…
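
The abstract refers to a CTC branch used during inference. As a rough illustration of how frame-level CTC output is usually turned into a token sequence, here is the standard greedy CTC collapse rule (merge consecutive repeats, then drop blanks). This is generic CTC post-processing, not the paper's decoding method; the blank index of 0 and the name `ctc_greedy_collapse` are assumptions made for the example.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed blank index; real systems define their own


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Merge consecutive duplicate frame labels, then remove blank labels."""
    return [token for token, _ in groupby(frame_ids) if token != blank]


# Frame-wise argmax labels (hypothetical). The blank between the two runs
# of 3 keeps them as two separate tokens.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
tokens = ctc_greedy_collapse(frames)
print(tokens, len(tokens))
```

Because every frame is labelled independently, this step needs no left-to-right dependency, which is one reason CTC is a common building block in non-autoregressive systems.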

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Residual Connection · WordPiece · Dropout · Dense Connections · Weight Decay