dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition

Wenjie Tian; Bingshen Mu; Guobin Ma; Xuelong Geng; Zhixian Zhao; Lei Xie

arXiv:2601.17902 · cs.SD · January 27, 2026

Open Access

TL;DR

This paper introduces dLLM-ASR, a novel diffusion LLM-based speech recognition framework that significantly speeds up inference while maintaining high accuracy by using prior-guided, adaptive denoising techniques.

Contribution

It presents a new ASR framework that adapts diffusion LLMs with prior guidance and adaptive denoising, reducing redundancy and inference time.

Findings

01. Achieves comparable accuracy to autoregressive LLM-based ASR systems.

02. Provides a 4.44× inference speedup over traditional methods.

03. Demonstrates effective adaptive computation at the token level.

Abstract

Automatic speech recognition (ASR) systems based on large language models (LLMs) achieve superior performance by leveraging pretrained LLMs as decoders, but their token-by-token generation mechanism leads to inference latency that grows linearly with sequence length. Meanwhile, discrete diffusion large language models (dLLMs) offer a promising alternative, enabling high-quality parallel sequence generation with pretrained decoders. However, directly applying native text-oriented dLLMs to ASR leads to a fundamental mismatch between open-ended text generation and the acoustically conditioned transcription paradigm required by ASR. As a result, it introduces unnecessary difficulty and computational redundancy, such as denoising from pure noise, inflexible generation lengths, and fixed denoising steps. We propose dLLM-ASR, an efficient dLLM-based ASR framework that formulates dLLM's…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders