Reproducing and Dissecting Denoising Language Models for Speech Recognition

Dorian Koch (1), Albert Zeyer (1, 2), Nick Rossenbach (1, 2), Ralf Schlüter (1, 2), Hermann Ney (1, 2) ((1) Machine Learning, Human Language Technology, RWTH Aachen University, (2) AppTek)

arXiv:2512.13576 · cs.NE · December 16, 2025

Open Access · 3 Reviews

TL;DR

This study provides a comprehensive empirical analysis of denoising language models for speech recognition, highlighting their advantages, limitations, and key factors influencing performance, along with a new decoding method and a reproducible pipeline.

Contribution

It offers the first large-scale, systematic evaluation of DLMs, introduces DLM-sum for improved decoding, and releases a complete pipeline for future research.
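
The reviews on this page describe DLM-sum as a form of n-best marginalization with log-linear score combination. The sketch below illustrates that general idea only; it is not taken from the paper or its released pipeline. The function names (`dlm_sum_score`, `rescore`), the `asr_scale` weight, and the exact form of the score are assumptions made for illustration: each candidate transcript is scored by summing, over the ASR n-best hypotheses, the scaled and normalized ASR weight times the DLM probability of the candidate given that hypothesis.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def dlm_sum_score(asr_logprobs, dlm_logprobs, asr_scale=1.0):
    """Score one candidate by marginalizing over the ASR n-best list.

    asr_logprobs[i]: ASR log-score of the i-th n-best hypothesis.
    dlm_logprobs[i]: log p_DLM(candidate | hypothesis i).
    asr_scale: hypothetical log-linear weight on the ASR scores.
    """
    scaled = [asr_scale * s for s in asr_logprobs]
    norm = logsumexp(scaled)  # renormalize ASR weights over the n-best list
    return logsumexp([s - norm + d for s, d in zip(scaled, dlm_logprobs)])


def rescore(candidates, asr_logprobs, dlm_table, asr_scale=1.0):
    """Return the best candidate and all scores.

    dlm_table[c][i] is assumed to hold log p_DLM(c | hypothesis i).
    """
    scores = {
        c: dlm_sum_score(asr_logprobs, dlm_table[c], asr_scale)
        for c in candidates
    }
    return max(scores, key=scores.get), scores


# Toy example with made-up probabilities (not results from the paper).
asr = [math.log(0.6), math.log(0.4)]
table = {
    "a": [math.log(0.5), math.log(0.5)],
    "b": [math.log(0.9), math.log(0.1)],
}
best, scores = rescore(["a", "b"], asr, table)
```

In this toy setup, candidate "b" wins because the more likely ASR hypothesis strongly supports it. How the paper actually weights, normalizes, and searches over candidates may differ.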

Findings

01

DLMs outperform traditional LMs after a compute threshold.

02

DLMs scale better with longer training, similar to diffusion models.

03

Performance improvements depend on vocabulary and hypothesis conditioning.

Abstract

Denoising language models (DLMs) have been proposed as a powerful alternative to traditional language models (LMs) for automatic speech recognition (ASR), motivated by their ability to use bidirectional context and adapt to a specific ASR model's error patterns. However, the complexity of the DLM training pipeline has hindered wider investigation. This paper presents the first independent, large-scale empirical study of DLMs. We build and release a complete, reproducible pipeline to systematically investigate the impact of key design choices. We evaluate dozens of configurations across multiple axes, including various data augmentation techniques (e.g., SpecAugment, dropout, mixup), different text-to-speech systems, and multiple decoding strategies. Our comparative analysis in a common subword vocabulary setting demonstrates that DLMs outperform traditional LMs, but only after a…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper’s main strengths lie in its systematic empirical study and carefully engineered pipeline, with some degree of originality in how existing ideas are combined.

- Originality: Combines denoising LMs with an n-best–style decoding scheme (DLM-sum) and a dense k-probability input, leading to a coherent configuration that has not been extensively studied in prior work.
- Quality: Substantial experimental effort: an end-to-end LibriSpeech-only pipeline (ASR, TTS, LM/DLM, multiple decoding sche

Weaknesses

The main weaknesses concern novelty, positioning, and how sharp the empirical story is.

Conceptual novelty may be limited.

- The main ideas (n-best marginalization in DLM-sum, log-linear score combination, using top-k posteriors as input) appear closely related to standard ASR practices such as n-best/lattice rescoring and confusion-network–style combination. At present, the paper does not fully clarify what is conceptually new beyond instantiating these ideas with a DLM, or how this perspecti

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper provides empirical evidence through detailed and systematic experiments on DLMs for ASR, offering useful insights for improving current approaches.
- The proposed DLM-sum decoding appears to be a simple yet effective decoding solution.
- The paper is clearly written and well-organized.

Weaknesses

- The paper lacks substantial novelty. The proposed DLM-sum decoding method appears to be an incremental extension of current methods.
- All analysis is conducted on a single ASR system, which may limit understanding of how the findings about DLMs generalize across different ASR architectures.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper provides an impressively large range of experiments with DLMs under controlled conditions, making the findings reliable.
- The provided open-source pipeline should make it possible for others to reproduce the work (I haven't tried the code myself).
- The new DLM-sum technique is a sensible way of taking better advantage of ASR n-best lists.

Weaknesses

- Experiments on LibriSpeech alone limit the impact of the contribution. LibriSpeech is extremely clean, dictated speech with very low ASR error rates. In order to make a convincing case, the experiments should include additional, more challenging datasets. Other recent studies of LM-based ASR error correction, e.g. Ma et al. 2025 (cited in the paper), include several datasets.
- The length and organization of the paper make it a tough read. Most of the results, and even some of the basic notat

Taxonomy

Topics: Speech Recognition and Synthesis · Face recognition and analysis · Voice and Speech Disorders