Text-only adaptation in LLM-based ASR through text denoising

Andrés Carofilis; Sergio Burdisso; Esaú Villatoro-Tello; Shashi Kumar; Kadri Hacioglu; Srikanth Madikeri; Pradeep Rangappa; Manjunath K E; Petr Motlicek; Shankar Venkatesan; Andreas Stolcke

arXiv:2601.20900 · cs.SD · March 13, 2026

Open Access

TL;DR

This paper proposes a novel text denoising approach for adapting large language model-based ASR systems to new domains using only text data, improving performance while maintaining cross-modal alignment.

Contribution

Introduces a lightweight text denoising method for domain adaptation in LLM-based ASR that preserves speech-text alignment without extra parameters.

Findings

01 Achieves up to 22.1% relative improvement

02 Outperforms recent state-of-the-art text-only adaptation methods

03 Is effective on two datasets
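To make the "relative improvement" figure concrete, here is a minimal sketch of the arithmetic. It assumes the metric is an error rate such as word error rate (WER), which this summary does not state, and the two error values are hypothetical numbers chosen only so that the result lands at 22.1%; they are not results from the paper.

```python
def relative_improvement(baseline_error, adapted_error):
    """Fractional reduction in an error metric, relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - adapted_error) / baseline_error


# Hypothetical error rates (assumption), used only to illustrate the arithmetic.
baseline_wer = 10.0
adapted_wer = 7.79

print(f"{relative_improvement(baseline_wer, adapted_wer):.1%}")
```

A relative figure depends on the baseline: the same absolute drop of 2.21 points would be a smaller relative improvement against a weaker baseline with a higher starting error.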

Abstract

Adapting large language model (LLM)-based automatic speech recognition (ASR) systems to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on the target domain text often disrupts the critical alignment between the speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that frames this process as a text denoising task. Our approach trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
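The abstract describes training the LLM to recover clean transcripts from noisy inputs, but this page does not say how the noise is produced. The sketch below shows one way such (noisy input, clean target) pairs might be built from target-domain text. The noise model (random word drops, single-character deletions, adjacent-word swaps), the function names `add_noise` and `make_denoising_pairs`, and all probabilities are assumptions for illustration, not the authors' procedure.

```python
import random


def add_noise(words, rng, p_drop=0.1, p_char=0.1, p_swap=0.1):
    """Return a corrupted copy of a word list.

    Hypothetical noise model (assumption): drop a word, delete one
    character from a word, or swap two adjacent words.
    """
    noisy = []
    for word in words:
        r = rng.random()
        if r < p_drop:
            continue  # drop the whole word
        if r < p_drop + p_char and len(word) > 1:
            i = rng.randrange(len(word))
            word = word[:i] + word[i + 1:]  # delete one character
        noisy.append(word)

    # Swap some adjacent word pairs, skipping past each swapped pair.
    i = 0
    while i < len(noisy) - 1:
        if rng.random() < p_swap:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
            i += 2
        else:
            i += 1
    return noisy


def make_denoising_pairs(transcripts, seed=0, **noise_kwargs):
    """Build training pairs: corrupted text as input, clean text as target."""
    rng = random.Random(seed)
    pairs = []
    for clean in transcripts:
        noisy = " ".join(add_noise(clean.split(), rng, **noise_kwargs))
        pairs.append({"input": noisy, "target": clean})
    return pairs


if __name__ == "__main__":
    domain_text = ["please transfer fifty dollars to my savings account"]
    for pair in make_denoising_pairs(domain_text, seed=1):
        print(pair)
```

In a setup like this, only the pair construction changes with the noise model; the fine-tuning step would treat each pair as an ordinary sequence-to-sequence example, which fits the abstract's claim that no architectural changes or additional parameters are needed.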

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques