MDM-ASR: Bridging Accuracy and Efficiency in ASR with Diffusion-Based Non-Autoregressive Decoding

Hao Yen; Pin-Jui Ku; Ante Jukić; Sabato Marco Siniscalchi

arXiv:2602.18952·eess.AS·February 26, 2026

Open Access

TL;DR

This paper introduces MDM-ASR, a non-autoregressive (NAR) speech recognition framework based on Masked Diffusion Models. It keeps parallel decoding while outperforming prior NAR models and remaining competitive with strong autoregressive (AR) baselines.

Contribution

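The paper couples a pre-trained speech encoder with a Transformer diffusion decoder, trains it with Iterative Self-Correction Training so the model sees its own intermediate predictions, and decodes with a position-biased, entropy-bounded, confidence-based sampler.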

Findings

01 Consistent improvements over prior NAR models.

02 Competitive performance with autoregressive baselines.

03 Maintains parallel decoding efficiency.

Abstract

In sequence-to-sequence Transformer ASR, autoregressive (AR) models achieve strong accuracy but suffer from slow decoding, while non-autoregressive (NAR) models enable parallel decoding at the cost of degraded performance. We propose a principled NAR ASR framework based on Masked Diffusion Models to reduce this gap. A pre-trained speech encoder is coupled with a Transformer diffusion decoder conditioned on acoustic features and partially masked transcripts for parallel token prediction. To mitigate the training-inference mismatch, we introduce Iterative Self-Correction Training that exposes the model to its own intermediate predictions. We also design a Position-Biased Entropy-Bounded Confidence-based sampler to further boost results. Experiments across multiple benchmarks demonstrate consistent gains over prior NAR models and competitive performance with strong AR…
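The general idea of decoding from a fully masked transcript by confidence can be sketched as follows. This is a minimal illustration of generic confidence-ranked iterative unmasking, not the paper's Position-Biased Entropy-Bounded sampler: it uses a plain top-k rule with no entropy bound and no positional bias. The names `confidence_unmask_decode`, `toy_predict`, and `MASK`, and the even reveal schedule, are assumptions made for this example; in a real system the predictor would be the diffusion decoder conditioned on acoustic features.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"  # hypothetical mask symbol, chosen for this sketch

# Hypothetical predictor interface: given the partially masked transcript,
# return a (token, confidence) guess for every position, all in parallel.
Predictor = Callable[[Sequence[str]], List[Tuple[str, float]]]


def confidence_unmask_decode(
    predict: Predictor, length: int, num_steps: int
) -> List[str]:
    """Start fully masked and reveal the most confident positions each step."""
    tokens = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        preds = predict(tokens)
        # Assumed schedule: spread the remaining masked positions evenly
        # over the remaining steps, so the last step reveals everything left.
        k = math.ceil(len(masked) / (num_steps - step))
        # Keep the k masked positions with the highest confidence.
        chosen = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in chosen:
            tokens[i] = preds[i][0]
    return tokens


# Toy stand-in for a model: fixed guesses with fixed confidences.
TARGET = ["the", "cat", "sat", "down"]
CONFIDENCE = [0.9, 0.4, 0.7, 0.2]


def toy_predict(tokens: Sequence[str]) -> List[Tuple[str, float]]:
    return list(zip(TARGET, CONFIDENCE))


if __name__ == "__main__":
    print(confidence_unmask_decode(toy_predict, length=4, num_steps=2))
```

With `num_steps=2`, the toy run fills positions 0 and 2 first (the two highest confidences) and the remaining two on the second step. Fewer steps mean more tokens committed per pass, which is the accuracy versus speed trade-off the abstract refers to.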

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques