Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models

Tianzi Wang; Yuya Fujita; Xuankai Chang; Shinji Watanabe

arXiv:2107.09428 · eess.AS · July 21, 2021 · 1 citation

TL;DR

This paper introduces a streaming non-autoregressive end-to-end speech recognition system that processes audio in small blocks instead of waiting for the full utterance, targeting low latency and faster inference than autoregressive models without a large loss in accuracy.

Contribution

It combines blockwise attention and Mask-CTC (connectionist temporal classification with mask-predict) for real-time speech recognition, and adds an overlapping decoding strategy aimed at the insertion and deletion errors that appear at block edges.
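As a rough illustration of the Mask-CTC idea named above, the sketch below collapses a greedy CTC output and replaces low-confidence tokens with a mask symbol, which a mask-predict decoder would then fill in. The function name, the threshold value, the choice of the maximum frame probability as token confidence, and the toy input are all assumptions for illustration, not details taken from the paper or from ESPnet.

```python
from itertools import groupby

BLANK = "<blank>"  # CTC blank symbol (name assumed for this sketch)
MASK = "<mask>"    # placeholder for tokens the decoder should re-predict


def ctc_greedy_with_mask(frames, threshold=0.9):
    """Collapse greedy CTC output and mask low-confidence tokens.

    frames: list of (token, probability) pairs, one per acoustic frame.
    Returns (tokens, masked_tokens).
    """
    tokens, confidences = [], []
    # Merge runs of identical frame labels, then drop blanks (standard CTC collapse).
    for token, run in groupby(frames, key=lambda frame: frame[0]):
        if token == BLANK:
            continue
        tokens.append(token)
        # Assumption: a token's confidence is the highest probability in its run.
        confidences.append(max(prob for _, prob in run))
    masked = [t if c >= threshold else MASK for t, c in zip(tokens, confidences)]
    return tokens, masked


frames = [("h", 0.95), ("h", 0.99), (BLANK, 0.90), ("e", 0.60),
          ("l", 0.97), (BLANK, 0.80), ("l", 0.92), ("o", 0.50)]
tokens, masked = ctc_greedy_with_mask(frames)
print(tokens)
print(masked)
```

In this toy input the tokens "e" and "o" fall below the threshold, so they are the ones a mask-predict decoder would be asked to re-estimate from the surrounding context.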

Findings

01  Improved online ASR accuracy in low latency scenarios
02  Faster inference speed compared to autoregressive models
03  Effective handling of edge errors with overlapping decoding

Abstract

Non-autoregressive (NAR) modeling has attracted growing attention in speech processing. With recent state-of-the-art attention-based automatic speech recognition (ASR) architectures, NAR models can achieve a promising real-time factor (RTF) improvement with only a small degradation in accuracy compared to autoregressive (AR) models. However, inference must wait for the full speech utterance to complete, which limits their application in low-latency scenarios. To address this issue, we propose a novel end-to-end streaming NAR speech recognition system that combines blockwise attention with connectionist temporal classification with mask-predict (Mask-CTC) NAR. During inference, the input audio is split into small blocks and processed in a blockwise streaming fashion. To address the insertion and deletion errors at the edges of each block's output, we apply an…
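The blockwise processing described in the abstract can be pictured with a minimal sketch that splits a sequence of frames into fixed-size blocks whose neighbours overlap. The function name, block size, and hop size are hypothetical values chosen for illustration; the paper's actual block configuration and the way outputs of overlapping blocks are merged are not given in the text above and are not reproduced here.

```python
def split_into_blocks(num_frames, block_size, hop_size):
    """Return (start, end) frame ranges covering num_frames.

    Consecutive blocks overlap by block_size - hop_size frames;
    the last block is truncated at the end of the input.
    """
    if not 0 < hop_size <= block_size:
        raise ValueError("hop_size must be in (0, block_size]")
    blocks = []
    start = 0
    while start < num_frames:
        end = min(start + block_size, num_frames)
        blocks.append((start, end))
        if end == num_frames:
            break  # the input is fully covered
        start += hop_size
    return blocks


# Hypothetical sizes: 4-frame blocks, advancing 2 frames at a time (50% overlap).
for start, end in split_into_blocks(num_frames=10, block_size=4, hop_size=2):
    print(start, end)
```

With an overlap, every frame near a block boundary also appears in the interior of a neighbouring block, which is the property an overlapping decoding strategy can exploit when edge outputs are unreliable.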

Code & Models

Repositories

espnet/espnet (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing