Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization

Genshun Wan; Wenhui Zhang; Jing-Xuan Zhang; Shifu Xiong; Jianqing Gao; Zhongfu Ye

arXiv:2601.22779 · eess.AS · February 2, 2026

Open Access

TL;DR

This paper introduces a streaming speech recognition method that combines a decoder-only large language model with latency-optimized segmentation of the audio stream, reaching 5.1% CER on AISHELL-1 while reducing token delay by 62.5%.

Contribution

The paper presents a streaming ASR framework that pairs a read/write policy network with monotonic chunkwise attention (MoChA) to segment speech embeddings dynamically, enabling low-latency recognition with LLMs.

Findings

01 Achieves 5.1% CER on AISHELL-1

02 Reduces token delay by 62.5%

03 Outperforms recent streaming ASR baselines
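
For readers unfamiliar with the metric in the first finding, character error rate (CER) is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the paper's scoring script; the function names and the example strings are illustrative assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # distance from ref[:i] to empty hypothesis
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of r
                cur[j - 1] + 1,          # insertion of h
                prev[j - 1] + (r != h),  # substitution, free on a match
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one wrong character in a six-character reference.
    print(f"{cer('今天天气很好', '今天天汽很好'):.3f}")
```

Under this definition, a CER of 5.1% corresponds to roughly one character error per twenty reference characters.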

Abstract

Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that integrates a read/write policy network with monotonic chunkwise attention (MoChA) to dynamically segment speech embeddings. These segments are interleaved with label sequences during training, enabling seamless integration with the LLM. During inference, the audio stream is buffered until the MoChA module triggers a read signal, at which point the buffered segment together with the previous token is fed into the LLM for the next token prediction. We also introduce a minimal-latency training objective to guide the policy network toward accurate segmentation boundaries. Furthermore, we adopt a joint training strategy…
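
The inference procedure described in the abstract (buffer audio, wait for the trigger, then pass the buffered segment and the previous token to the LLM) can be sketched as a simple loop. This is an illustrative sketch only, not the authors' implementation: `stream_decode`, `trigger`, `predict_next`, the `<bos>`/`<eos>` markers, and the toy stand-ins are all hypothetical names, and a real system would use the trained MoChA-based policy network and an LLM that keeps its own context.

```python
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]  # one speech embedding frame (assumed representation)


def stream_decode(
    frames: Iterable[Frame],
    trigger: Callable[[List[Frame]], bool],
    predict_next: Callable[[List[Frame], str], str],
    bos: str = "<bos>",
    eos: str = "<eos>",
) -> List[str]:
    """Buffer frames until `trigger` fires, then ask the model for one token."""
    buffer: List[Frame] = []
    tokens: List[str] = [bos]
    for frame in frames:
        buffer.append(frame)
        if not trigger(buffer):
            continue  # keep buffering audio
        # Feed the buffered segment together with the previous token.
        token = predict_next(buffer, tokens[-1])
        buffer = []  # start a new segment
        if token == eos:
            break
        tokens.append(token)
    return tokens[1:]  # drop the start marker


def toy_trigger(buffer: List[Frame]) -> bool:
    """Stand-in for the policy network: fire every three frames."""
    return len(buffer) >= 3


class ToyLLM:
    """Stand-in for the LLM: replays a fixed script and records its inputs."""

    def __init__(self, script: Sequence[str]) -> None:
        self.script = list(script)
        self.calls: List[tuple] = []

    def __call__(self, segment: List[Frame], prev_token: str) -> str:
        self.calls.append((len(segment), prev_token))
        return self.script[len(self.calls) - 1]


if __name__ == "__main__":
    llm = ToyLLM(["你", "好", "<eos>"])
    audio = [[float(i)] for i in range(7)]  # seven dummy one-dimensional frames
    print(stream_decode(audio, toy_trigger, llm))
```

In this toy run the fixed three-frame trigger plays the role that the learned segmentation boundaries play in the paper; the earlier the trigger can fire without hurting accuracy, the lower the token delay.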

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing