Semi-Autoregressive Streaming ASR With Label Context
Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

TL;DR
This paper introduces a semi-autoregressive streaming ASR model that uses labels emitted in previous blocks as context, together with a novel greedy decoding algorithm, to improve accuracy over existing streaming NAR models, narrow the accuracy gap with streaming AR models, and lower latency.
Contribution
The paper proposes a semi-autoregressive streaming ASR model with label context integration and a new greedy decoding method, reducing the accuracy gap with AR models and lowering latency.
Findings
Outperforms existing streaming NAR models by up to 19% relative.
Reduces the accuracy gap with streaming AR models.
Achieves 2.5x lower latency while maintaining high accuracy.
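To make the "relative" and "2.5x" figures above concrete, here is a minimal sketch of how such numbers are usually computed. It assumes the accuracy metric is an error rate such as word error rate, which this summary does not state. The function names and all input values are hypothetical and are not results from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline` (for example, an error rate)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


def latency_ratio(baseline_latency: float, improved_latency: float) -> float:
    """How many times lower the improved latency is than the baseline latency."""
    if improved_latency <= 0:
        raise ValueError("improved_latency must be positive")
    return baseline_latency / improved_latency


# Hypothetical numbers chosen only to illustrate the arithmetic.
example_error_gain = relative_reduction(baseline=10.0, improved=8.1)      # about 0.19
example_latency_gain = latency_ratio(baseline_latency=500.0, improved_latency=200.0)
```

Under these made-up inputs, an error rate that drops from 10.0 to 8.1 is a relative reduction of about 19%, and a latency that drops from 500 ms to 200 ms is 2.5x lower.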
Abstract
Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy. Since NAR automatic speech recognition (ASR) models must wait for the completion of the entire utterance before processing, some works explore streaming NAR models based on blockwise attention for low-latency applications. However, streaming NAR models significantly lag in accuracy compared to streaming AR and non-streaming NAR models. To address this, we propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context using a Language Model (LM) subnetwork. We also introduce a novel greedy decoding algorithm that addresses insertion and deletion errors near block boundaries while not…
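The control flow that the abstract describes (predict each block's labels in one shot, but condition on labels already emitted by earlier blocks) can be sketched roughly as follows. This is an illustrative toy, not the authors' model or their decoding algorithm: `semi_autoregressive_decode`, `toy_predict_block`, and the duplicate-dropping rule are all hypothetical stand-ins, and the paper's LM subnetwork is replaced here by a plain Python callable.

```python
from typing import Callable, List, Sequence, Tuple

Label = str
Block = Sequence[str]  # hypothetical stand-in for one block of acoustic frames


def semi_autoregressive_decode(
    blocks: Sequence[Block],
    predict_block: Callable[[Block, Tuple[Label, ...]], List[Label]],
) -> List[Label]:
    """Decode block by block, passing previously emitted labels as context."""
    emitted: List[Label] = []
    for block in blocks:
        # All labels for this block come from a single call (no token-by-token
        # loop inside the block), but the call sees every earlier label.
        emitted.extend(predict_block(block, tuple(emitted)))
    return emitted


def toy_predict_block(block: Block, context: Tuple[Label, ...]) -> List[Label]:
    """Toy predictor: treats the block's items as its labels."""
    labels = list(block)
    # Toy rule (an assumption, not from the paper): drop a leading label that
    # repeats the last emitted one, mimicking an insertion at a block boundary.
    if context and labels and labels[0] == context[-1]:
        labels = labels[1:]
    return labels


blocks = [["a", "b"], ["b", "c"], ["d"]]
result = semi_autoregressive_decode(blocks, toy_predict_block)
```

With these toy blocks, the second block's leading "b" is dropped because the context already ends in "b", so `result` is `["a", "b", "c", "d"]`. A predictor that ignored `context` would emit the duplicate, which is the kind of boundary error that label context is meant to help with.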
Taxonomy
Topics: Advanced Chemical Sensor Technologies · Energy Efficient Wireless Sensor Networks
