NLE: Non-autoregressive LLM-based ASR by Transcript Editing

Avihu Dekel; Samuel Thomas; Takashi Fukada; George Saon

arXiv:2603.08397 · eess.AS · March 10, 2026

Open Access

TL;DR

NLE is a non-autoregressive speech recognition method that refines an initial transcript in parallel using a bidirectional LLM editor. It reports a 27x speedup over an autoregressive baseline in single-utterance scenarios, and NLE++ reaches 5.67% average WER on the Open ASR leaderboard.

Contribution

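The paper presents a non-autoregressive ASR approach that formulates recognition as conditional transcript editing. A bidirectional LLM editor refines an initial hypothesis from a pretrained speech encoder, and is trained with a latent alignment objective and an interleaved padding strategy that enable fully parallel prediction.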

Findings

01

NLE++ achieves 5.67% average WER on the Open ASR leaderboard, with an RTFx (inverse real-time factor) of 1630

02

NLE provides a 27x speedup over the autoregressive (AR) baseline in single-utterance scenarios

03

Fully parallel prediction makes the method suitable for real-time speech recognition
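For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper, and leaderboard evaluations typically apply their own text normalization and averaging across test sets, which this sketch does not reproduce.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative example (not data from the paper): two reference words are missing.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat mat")
print(f"WER = {example_wer:.2%}")
```

Under this definition, a 5.67% WER corresponds to roughly 5 to 6 word errors per 100 reference words.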

Abstract

While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves 27x speedup over the AR baseline, making it suitable for real-time applications.
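The abstract does not spell out the interleaved padding layout, so the following is only a toy sketch of one plausible reading. It assumes that a padding slot is placed around every hypothesis token, that the editor predicts one symbol per slot in parallel, and that the output is collapsed CTC-style (merge adjacent repeats, then drop padding). The names `PAD`, `interleave_with_padding`, and `collapse` are hypothetical, and the "editor predictions" are written by hand rather than produced by a model.

```python
from itertools import groupby

PAD = "<pad>"  # hypothetical padding/blank symbol, not the paper's notation


def interleave_with_padding(hypothesis, pad=PAD):
    """Place one pad slot before, between, and after the hypothesis tokens."""
    slots = [pad]
    for token in hypothesis:
        slots.extend([token, pad])
    return slots


def collapse(predictions, pad=PAD):
    """CTC-style collapse (assumed): merge adjacent repeats, then drop pads."""
    merged = [token for token, _ in groupby(predictions)]
    return [token for token in merged if token != pad]


hypothesis = ["the", "cat", "sat", "mat"]
slots = interleave_with_padding(hypothesis)

# If the editor simply copies its input, the collapsed output is the
# original hypothesis, so an unchanged transcript costs nothing to "edit".
identity_output = collapse(slots)

# Hand-written stand-in for parallel editor predictions:
# substitute "cat" -> "hat" in a token slot, insert "on" in a pad slot.
edited = list(slots)
edited[3] = "hat"
edited[6] = "on"
edited_output = collapse(edited)

print(identity_output)
print(edited_output)
```

In this reading, every slot is predicted independently of the others, which is what allows a single parallel pass, while the pad slots give the editor room for insertions without shifting the positions of tokens it leaves unchanged.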

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders