Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models
Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe

TL;DR
This paper introduces a novel streaming non-autoregressive end-to-end speech recognition system that processes audio in small blocks, achieving low latency and faster inference while maintaining accuracy.
Contribution
The paper combines blockwise attention with Mask-CTC so that recognition can run on a stream of audio blocks in real time, and adds an overlapping decoding strategy to reduce insertion and deletion errors at block edges.
Findings
Improved online ASR accuracy in low latency scenarios
Faster inference speed compared to autoregressive models
Effective handling of edge errors with overlapping decoding
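Inference speed in ASR is commonly reported as the real-time factor (RTF), the ratio of processing time to audio duration; values below 1 mean the system runs faster than real time. The snippet below is only a minimal sketch of that ratio. The function name and the sample numbers are illustrative assumptions, not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Illustrative numbers only (not from the paper):
# 2 seconds of compute for 10 seconds of audio.
rtf = real_time_factor(2.0, 10.0)
print(f"RTF = {rtf:.2f}")
```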
Abstract
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing. With recent state-of-the-art attention-based automatic speech recognition (ASR) architectures, NAR models can achieve promising real-time factor (RTF) improvements with only a small degradation in accuracy compared to autoregressive (AR) models. However, inference must wait for the completion of a full speech utterance, which limits their application in low-latency scenarios. To address this issue, we propose a novel end-to-end streaming NAR speech recognition system that combines blockwise-attention with connectionist temporal classification with mask-predict (Mask-CTC) NAR. During inference, the input audio is separated into small blocks and then processed in a blockwise streaming way. To address the insertion and deletion errors at the edge of the output of each block, we apply an…
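To make the blockwise processing concrete, here is a minimal sketch of cutting an input sequence into fixed-size blocks that overlap with their neighbours. It illustrates the general idea of overlapping blocks only. The function name `split_into_blocks`, the parameters `block_size` and `overlap`, and the toy input are assumptions for illustration; the sketch does not reproduce the paper's encoder, its block sizes, or how it merges the per-block outputs.

```python
from typing import List, Sequence


def split_into_blocks(
    samples: Sequence[float], block_size: int, overlap: int
) -> List[List[float]]:
    """Split a sequence into blocks of `block_size` that share `overlap` items.

    Hypothetical helper: consecutive blocks start `block_size - overlap`
    items apart, and the final block may be shorter than `block_size`.
    """
    if block_size <= 0 or not 0 <= overlap < block_size:
        raise ValueError("require block_size > 0 and 0 <= overlap < block_size")

    hop = block_size - overlap  # distance between consecutive block starts
    blocks: List[List[float]] = []
    start = 0
    while start < len(samples):
        blocks.append(list(samples[start:start + block_size]))
        if start + block_size >= len(samples):
            break  # this block already reaches the end of the input
        start += hop
    return blocks


# Toy input standing in for audio frames (illustrative values only).
frames = list(range(10))
for block in split_into_blocks(frames, block_size=4, overlap=2):
    print(block)
```

Because neighbouring blocks share frames, a token that falls on a block boundary is seen in full by at least one block, which is the intuition behind decoding with overlap.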
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
