Diffusion Language Models for Speech Recognition

Davyd Naveriani; Albert Zeyer; Ralf Schlüter; Hermann Ney

arXiv:2604.14001·cs.CL·April 30, 2026

TL;DR

This paper explores the application of diffusion language models in speech recognition, introducing new methods for hypothesis rescoring and joint decoding to improve accuracy.

Contribution

The paper shows how to use masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) to rescore ASR hypotheses, and proposes a new joint-decoding method that combines CTC and USDM.

Findings

01 USDM and MDLM significantly improve recognition accuracy

02 The joint-decoding method effectively combines language and acoustic information

03 All code and recipes are publicly available

Abstract

Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM: at each decoding step, it integrates the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM, generating new candidates that combine strong language knowledge from USDM with acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We…
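To make the rescoring idea concrete, here is a minimal sketch of generic n-best rescoring with a log-linear score combination. It is not taken from the paper: the function name `rescore_nbest`, the `lm_scale` parameter, and all scores are hypothetical, and the language-model log-scores are assumed to come from whatever scoring procedure the diffusion language model provides.

```python
from typing import List, Tuple

# Each hypothesis: (text, acoustic log-score, LM log-score).
# All names and numbers here are illustrative assumptions, not the paper's recipe.
Hypothesis = Tuple[str, float, float]


def rescore_nbest(nbest: List[Hypothesis], lm_scale: float = 0.5) -> List[Tuple[str, float]]:
    """Rank hypotheses by acoustic log-score plus scaled LM log-score."""
    combined = [(text, am + lm_scale * lm) for text, am, lm in nbest]
    # Higher combined log-score is better, so sort in descending order.
    return sorted(combined, key=lambda item: item[1], reverse=True)


# Made-up scores: the acoustic model slightly prefers the second hypothesis,
# while the language model clearly prefers the first.
nbest = [
    ("i scream for ice cream", -4.0, -9.0),
    ("ice cream for ice cream", -3.8, -12.0),
]

best_text, best_score = rescore_nbest(nbest, lm_scale=0.5)[0]
```

With `lm_scale=0.0` the ranking falls back to the acoustic scores alone. In practice such a scale is usually tuned on held-out data.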
