Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
Zhifu Gao, Shiliang Zhang, Ian McLoughlin, Zhijie Yan

TL;DR
Paraformer is a fast, parallel (non-autoregressive) transformer for speech recognition. It reaches accuracy comparable to autoregressive transformers while running inference more than 10x faster, by addressing two challenges: predicting the number of output tokens and modeling the interdependence between them.
Contribution
It proposes a parallel transformer architecture that pairs a continuous integrate-and-fire (CIF) based token predictor with stronger modeling of context between output tokens, narrowing the accuracy gap between non-autoregressive and autoregressive speech recognition.
Findings
Achieves accuracy comparable to autoregressive (AR) transformers on benchmark datasets.
Delivers more than 10x faster inference than AR decoding.
Remains effective on an industrial-scale 20,000-hour speech recognition task.
Abstract
Transformers have recently dominated the ASR field. Although able to yield good performance, they involve an autoregressive (AR) decoder to generate tokens one by one, which is computationally inefficient. To speed up inference, non-autoregressive (NAR) methods, e.g. single-step NAR, were designed to enable parallel generation. However, due to an independence assumption within the output tokens, performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus. There are two challenges to improving single-step NAR: first, to accurately predict the number of output tokens and extract hidden variables; second, to enhance modeling of interdependence between output tokens. To tackle both challenges, we propose a fast and accurate parallel transformer, termed Paraformer. This utilizes a continuous integrate-and-fire based predictor to predict the…
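The continuous integrate-and-fire (CIF) idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cif`, the list-based representation, the threshold of 1.0, and the assumption that each per-frame weight is below the threshold are all choices made here for illustration. The sketch accumulates per-frame weights over the encoder states and "fires" one token-level vector each time the accumulated weight reaches the threshold, so the number of fired vectors approximates the number of output tokens.

```python
from typing import List


def cif(hidden: List[List[float]], alphas: List[float],
        threshold: float = 1.0) -> List[List[float]]:
    """Simplified CIF sketch (illustrative, not the paper's code).

    hidden: one feature vector per acoustic frame.
    alphas: one non-negative weight per frame, assumed < threshold,
            so at most one token fires per frame.
    Returns one integrated vector per fired token.
    """
    dim = len(hidden[0]) if hidden else 0
    acc_weight = 0.0            # weight accumulated since the last fire
    acc_state = [0.0] * dim     # weighted sum of frames since the last fire
    fired: List[List[float]] = []

    for h, a in zip(hidden, alphas):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating this frame.
            acc_weight += a
            acc_state = [s + a * x for s, x in zip(acc_state, h)]
        else:
            # Use just enough of this frame's weight to reach the threshold.
            used = threshold - acc_weight
            fired.append([s + used * x for s, x in zip(acc_state, h)])
            # The leftover weight starts the next token.
            rest = a - used
            acc_weight = rest
            acc_state = [rest * x for x in h]

    return fired


if __name__ == "__main__":
    frames = [[1.0], [2.0], [4.0]]
    weights = [0.75, 0.5, 0.75]   # sums to 2.0, so two tokens fire
    print(cif(frames, weights))
```

Because all fired vectors come out of a single pass over the frames, a decoder can then process every token position in parallel rather than one by one. Any weight left over below the threshold at the end of the input is simply dropped in this sketch.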
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
