Reproducing and Dissecting Denoising Language Models for Speech Recognition
Dorian Koch (1), Albert Zeyer (1, 2), Nick Rossenbach (1, 2), Ralf Schlüter (1, 2), Hermann Ney (1, 2) ((1) Machine Learning, Human Language Technology, RWTH Aachen University, (2) AppTek)

TL;DR
This study provides a comprehensive empirical analysis of denoising language models for speech recognition, highlighting their advantages, limitations, and key factors influencing performance, along with a new decoding method and a reproducible pipeline.
Contribution
The paper offers the first independent, large-scale, systematic evaluation of DLMs, introduces the DLM-sum decoding method, and releases a complete, reproducible pipeline for future research.
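The reviews further down describe DLM-sum as n-best marginalization combined with a log-linear score combination. The following is a minimal sketch under that reading only, not the paper's exact formulation: the function names, the renormalization of ASR scores over the n-best list, and the two weights are all assumptions made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    if peak == float("-inf"):
        return peak
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def marginal_dlm_log_prob(nbest_asr_log_scores, dlm_log_probs):
    """Return log sum_i p(h_i) * p_DLM(y | h_i) over the n-best list.

    Assumption: p(h_i) is the ASR score renormalized over the n-best list.
    dlm_log_probs[i] is log p_DLM(y | h_i) for one fixed candidate y.
    """
    norm = logsumexp(nbest_asr_log_scores)
    return logsumexp(
        [a - norm + d for a, d in zip(nbest_asr_log_scores, dlm_log_probs)]
    )


def pick_best(candidates, nbest_asr_log_scores, asr_weight=1.0, dlm_weight=1.0):
    """Log-linear combination of an ASR score and the marginalized DLM score.

    candidates maps text -> (asr_log_score, [log p_DLM(text | h_i) per h_i]).
    Both weights are hypothetical tuning parameters.
    """
    best_text, best_score = None, float("-inf")
    for text, (asr_log_score, dlm_log_probs) in candidates.items():
        score = asr_weight * asr_log_score + dlm_weight * marginal_dlm_log_prob(
            nbest_asr_log_scores, dlm_log_probs
        )
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score
```

Working in log space keeps the sum over hypotheses stable when individual probabilities are very small. In practice the weights would be tuned on held-out data.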
Findings
DLMs outperform traditional LMs, but only after a compute threshold is reached.
DLMs scale better with longer training, similar to diffusion models.
Performance improvements depend on vocabulary and hypothesis conditioning.
Abstract
Denoising language models (DLMs) have been proposed as a powerful alternative to traditional language models (LMs) for automatic speech recognition (ASR), motivated by their ability to use bidirectional context and adapt to a specific ASR model's error patterns. However, the complexity of the DLM training pipeline has hindered wider investigation. This paper presents the first independent, large-scale empirical study of DLMs. We build and release a complete, reproducible pipeline to systematically investigate the impact of key design choices. We evaluate dozens of configurations across multiple axes, including various data augmentation techniques (e.g., SpecAugment, dropout, mixup), different text-to-speech systems, and multiple decoding strategies. Our comparative analysis in a common subword vocabulary setting demonstrates that DLMs outperform traditional LMs, but only after a…
Peer Reviews
Decision: Submitted to ICLR 2026
The paper’s main strengths lie in its systematic empirical study and carefully engineered pipeline, with some degree of originality in how existing ideas are combined.
Originality
- Combines denoising LMs with an n-best–style decoding scheme (DLM-sum) and a dense k-probability input, leading to a coherent configuration that has not been extensively studied in prior work.
Quality
- Substantial experimental effort: an end-to-end LibriSpeech-only pipeline (ASR, TTS, LM/DLM, multiple decoding sche

The main weaknesses concern novelty, positioning, and how sharp the empirical story is.
Conceptual novelty may be limited.
- The main ideas (n-best marginalization in DLM-sum, log-linear score combination, using top-k posteriors as input) appear closely related to standard ASR practices such as n-best/lattice rescoring and confusion-network–style combination. At present, the paper does not fully clarify what is conceptually new beyond instantiating these ideas with a DLM, or how this perspecti

- The paper provides empirical evidence through detailed and systematic experiments on DLMs for ASR, offering useful insights for improving current approaches.
- The proposed DLM-sum decoding appears to be a simple yet effective decoding solution.
- The paper is clearly written and well-organized.

- The paper lacks substantial novelty. The proposed DLM-sum decoding method appears to be an incremental extension to current methods.
- All analysis is conducted on a single ASR system, which may limit understanding of how the findings about the DLMs generalize across different ASR architectures.

- The paper provides an impressively large range of experiments with DLMs under controlled conditions, making the findings reliable.
- The provided open-source pipeline should make it possible for others to reproduce the work (I haven't tried the code myself).
- The new DLM-sum technique is a sensible way of taking better advantage of ASR n-best lists.

- Experiments on LibriSpeech alone limit the impact of the contribution. LibriSpeech is extremely clean, dictated speech with very low ASR errors. In order to make a convincing case, the experiments should include additional, more challenging datasets. Other recent studies of LM-based ASR error correction, e.g. Ma et al. 2025 (cited in the paper), include several datasets.
- The length and organization of the paper makes it a tough read. Most of the results, and even some of the basic notat
Taxonomy
Topics: Speech Recognition and Synthesis · Face recognition and analysis · Voice and Speech Disorders
