MDM-ASR: Bridging Accuracy and Efficiency in ASR with Diffusion-Based Non-Autoregressive Decoding
Hao Yen, Pin-Jui Ku, Ante Jukić, Sabato Marco Siniscalchi

TL;DR
This paper introduces MDM-ASR, a diffusion-based non-autoregressive speech recognition framework that balances accuracy and decoding speed, outperforming previous NAR models and rivaling AR models.
Contribution
It presents a masked-diffusion NAR ASR model that is trained with Iterative Self-Correction Training and decoded with a position-biased, entropy-bounded, confidence-based sampler, improving accuracy over prior NAR models while keeping decoding parallel.
Findings
Consistent improvements over prior NAR models.
Competitive performance with autoregressive baselines.
Maintains parallel decoding efficiency.
Abstract
In sequence-to-sequence Transformer ASR, autoregressive (AR) models achieve strong accuracy but suffer from slow decoding, while non-autoregressive (NAR) models enable parallel decoding at the cost of degraded performance. We propose a principled NAR ASR framework based on Masked Diffusion Models to reduce this gap. A pre-trained speech encoder is coupled with a Transformer diffusion decoder conditioned on acoustic features and partially masked transcripts for parallel token prediction. To mitigate the training-inference mismatch, we introduce Iterative Self-Correction Training that exposes the model to its own intermediate predictions. We also design a Position-Biased Entropy-Bounded Confidence-based sampler with positional bias to further boost results. Experiments across multiple benchmarks demonstrate consistent gains over prior NAR models and competitive performance with strong AR…
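The parallel, confidence-driven decoding described in the abstract can be illustrated with a generic mask-predict style loop. This is a minimal sketch and not the paper's sampler: it omits the positional bias and entropy bound, whose exact form the abstract does not give. The names `toy_predict` and `confidence_decode` are hypothetical, the even unmasking schedule is an assumption, and `toy_predict` is a stand-in for the diffusion decoder that simply returns the reference tokens with made-up confidences.

```python
import math

MASK = "<mask>"


def toy_predict(tokens, target):
    """Hypothetical stand-in for the diffusion decoder.

    Returns one (token, confidence) pair per position. A real decoder would
    compute these from acoustic features and the partially masked transcript;
    here the confidence just grows with the number of revealed neighbours.
    """
    out = []
    for i, tok in enumerate(tokens):
        if tok != MASK:
            out.append((tok, 1.0))
            continue
        neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        revealed = sum(t != MASK for t in neighbours)
        out.append((target[i], 0.5 + 0.2 * revealed))
    return out


def confidence_decode(length, predict, num_steps):
    """Start fully masked and reveal the most confident positions each step."""
    tokens = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One call predicts every position at once (the parallel part).
        preds = predict(tokens)
        # Assumed schedule: spread the remaining masks evenly over the
        # remaining steps, so the last step always clears what is left.
        k = math.ceil(len(masked) / (num_steps - step))
        # Keep only the k most confident masked positions; the rest stay
        # masked and are re-predicted in the next step.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    target = "the cat sat on the mat".split()
    hypothesis = confidence_decode(len(target), lambda t: toy_predict(t, target), num_steps=3)
    print(" ".join(hypothesis))
```

In this sketch the number of decoder calls is bounded by `num_steps` rather than by the transcript length, which is the source of the speed advantage over token-by-token autoregressive decoding.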
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
