Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Zhifu Gao; Shiliang Zhang; Ian McLoughlin; Zhijie Yan

arXiv:2206.08317·cs.SD·March 31, 2023·1 cites



TL;DR

Paraformer is a single-step non-autoregressive (NAR) transformer for speech recognition. It reaches accuracy comparable to autoregressive (AR) transformers at more than 10x the inference speed by addressing two weaknesses of single-step NAR models: predicting the number of output tokens and modeling the interdependence between them.

Contribution

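The paper proposes a parallel transformer architecture that pairs a continuous integrate-and-fire based predictor, which estimates the number of output tokens, with enhanced modeling of the interdependence between output tokens, improving single-step non-autoregressive speech recognition.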

Findings

01. Achieves accuracy comparable to autoregressive (AR) transformers on benchmark datasets.

02. Delivers more than a 10x inference speedup over AR decoding.

03. Remains effective on an industrial-scale speech recognition task with 20,000 hours of data.

Abstract

Transformers have recently dominated the ASR field. Although able to yield good performance, they involve an autoregressive (AR) decoder to generate tokens one by one, which is computationally inefficient. To speed up inference, non-autoregressive (NAR) methods, e.g. single-step NAR, were designed to enable parallel generation. However, due to an independence assumption within the output tokens, performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus. There are two challenges to improving single-step NAR: first, to accurately predict the number of output tokens and extract hidden variables; second, to enhance modeling of interdependence between output tokens. To tackle both challenges, we propose a fast and accurate parallel transformer, termed Paraformer. This utilizes a continuous integrate-and-fire based predictor to predict the…
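The abstract mentions a continuous integrate-and-fire based predictor for the number of output tokens. The sketch below illustrates a generic integrate-and-fire accumulation only; it is not the paper's implementation. The function name `cif_fire`, the threshold of 1.0, and the assumption that each per-frame weight lies in [0, 1] are illustrative choices that do not come from this page. Each frame contributes a weight, the weights are accumulated, and a token vector is emitted each time the running sum reaches the threshold, so the number of emitted vectors serves as the predicted token count.

```python
from typing import List


def cif_fire(
    alphas: List[float],
    frames: List[List[float]],
    threshold: float = 1.0,  # assumed firing threshold (illustrative)
) -> List[List[float]]:
    """Accumulate per-frame weights and emit one vector per threshold crossing.

    Assumes every weight in `alphas` lies in [0, threshold], so a single frame
    can complete at most one token.
    """
    dim = len(frames[0]) if frames else 0
    acc = 0.0              # weight accumulated since the last fire
    state = [0.0] * dim    # weighted sum of frames since the last fire
    tokens: List[List[float]] = []

    for alpha, frame in zip(alphas, frames):
        if acc + alpha < threshold:
            # Not enough weight yet: keep integrating this frame.
            acc += alpha
            state = [s + alpha * x for s, x in zip(state, frame)]
        else:
            # Use just enough of this frame's weight to reach the threshold.
            used = threshold - acc
            tokens.append([s + used * x for s, x in zip(state, frame)])
            # The leftover weight starts the next token.
            rest = alpha - used
            acc = rest
            state = [rest * x for x in frame]
    return tokens


# Five frames with one-dimensional features; the weights sum to 2.5.
tokens = cif_fire([0.5, 0.5, 0.25, 0.75, 0.5], [[1.0], [3.0], [2.0], [4.0], [8.0]])
print(len(tokens))  # number of fired tokens
```

Because all tokens are known after one pass over the frames, a decoder can then predict them in parallel rather than one at a time. Any leftover weight below the threshold (0.5 in the example) is simply dropped in this sketch; how a real system handles the tail is outside what this page describes.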



Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings