Towards Effective and Efficient Non-autoregressive decoders for Conformer and LLM-based ASR using Block-based Attention Mask

Tianzi Wang; Xurong Xie; Zengrui Jin; Mengzhe Geng; Jiajun Deng; Zhaoqing Li; Shoukang Hu; Shujie Hu; Guinan Li; Mingyu Cui; Helen Meng; Xunying Liu

arXiv:2511.09084·eess.AS·November 13, 2025

TL;DR

This paper introduces a block-based attention mask decoder for non-autoregressive ASR systems that significantly speeds up decoding while maintaining or improving recognition accuracy across various models and datasets.

Contribution

It proposes a novel NAR decoder with block-based attention that balances efficiency and accuracy, adaptable to Conformer and LLM-based ASR systems.

Findings

1. Achieves up to 2.31x decoding speedup without significant WER increase.

2. Reduces WER by up to 0.62% absolute in real-time scenarios.

3. Effective across multiple datasets and model configurations.

Abstract

Automatic speech recognition (ASR) systems often rely on autoregressive (AR) Transformer decoder architectures, whose sequential nature limits efficient inference parallelization. To address this, non-autoregressive (NAR) approaches aim primarily to achieve significant decoding speedup while maintaining recognition accuracy comparable to AR baselines. This paper proposes a novel NAR block-based attention mask decoder (AMD) that effectively improves decoding efficiency while maintaining ASR accuracy, and that also offers flexibility in balancing the performance-efficiency trade-off on both Conformer and large language model (LLM)-based ASR systems. The proposed AMD performs parallel inference within contiguous blocks of output labels while maintaining monotonic left-to-right prediction between blocks. A one-pass beam search algorithm is designed to dynamically fuse…
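
The block-wise pattern described in the abstract (parallel prediction inside a block, left-to-right order between blocks) can be illustrated with a small attention mask. This is a minimal sketch of one plausible reading, not the paper's implementation: the function name `block_attention_mask`, the fixed `block_size` parameter, and the choice to let positions inside a block see each other are all assumptions made for illustration.

```python
def block_attention_mask(seq_len, block_size):
    """Return a seq_len x seq_len boolean mask (hypothetical helper).

    mask[q][k] is True when query position q may attend to key position k.
    Assumption: a position sees every position in its own block and in all
    earlier blocks, and nothing in later blocks.
    """
    if seq_len < 0 or block_size < 1:
        raise ValueError("seq_len must be >= 0 and block_size must be >= 1")
    # Index of the block that each output position belongs to.
    block_ids = [pos // block_size for pos in range(seq_len)]
    return [
        [block_ids[q] >= block_ids[k] for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    # Six output labels grouped into blocks of two.
    for row in block_attention_mask(6, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, `block_size=1` reduces to the usual causal (lower-triangular) mask of an AR decoder, while a `block_size` at least as large as the sequence length gives a fully unmasked, fully parallel decoder. Intermediate block sizes sit between the two, which is one way to picture the performance-efficiency trade-off the abstract mentions.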

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques