Text-only adaptation in LLM-based ASR through text denoising
Andrés Carofilis, Sergio Burdisso, Esaú Villatoro-Tello, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke

TL;DR
This paper proposes a novel text denoising approach for adapting large language model-based ASR systems to new domains using only text data, improving performance while maintaining cross-modal alignment.
Contribution
Introduces a lightweight text denoising method for domain adaptation in LLM-based ASR that preserves speech-text alignment without extra parameters.
Findings
Achieves up to 22.1% relative improvement in recognition performance
Outperforms recent state-of-the-art text-only adaptation methods
Shows these gains on two datasets
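To make the "relative improvement" figure concrete, here is a minimal sketch of how such a number is usually computed. It assumes the metric is word error rate (WER), which this summary does not state explicitly. The function name and the WER values are hypothetical and chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Relative error reduction, in percent, of an adapted system over a baseline.

    Assumes a lower-is-better metric such as WER (an assumption, not from the source).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical WERs (in percent) picked so the reduction comes out near 22.1%.
    print(round(relative_improvement(20.0, 15.58), 1))
```

Note that a relative improvement is measured against the baseline error, so the same relative figure can correspond to very different absolute reductions.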
Abstract
Adapting large language model (LLM)-based automatic speech recognition (ASR) systems to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on the target domain text often disrupts the critical alignment between the speech and text modality learned by the projector, degrading performance. We introduce a novel text-only adaptation method that frames this process as a text denoising task. Our approach trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
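The denoising setup described in the abstract can be pictured as building (noisy input, clean target) pairs from target-domain text and fine-tuning the LLM to map one to the other. The sketch below is illustrative only: the abstract does not say how the noise is generated, so the random word deletions and substitutions, the function names `add_noise` and `make_denoising_pairs`, and the probabilities are all assumptions rather than the authors' recipe.

```python
import random


def add_noise(transcript, vocab, rng, p_delete=0.1, p_substitute=0.1):
    """Corrupt a transcript with random word deletions and substitutions.

    This noise model is an assumption for illustration, not the paper's.
    """
    noisy = []
    for word in transcript.split():
        r = rng.random()
        if r < p_delete:
            continue  # drop the word entirely
        if r < p_delete + p_substitute:
            noisy.append(rng.choice(vocab))  # replace with a random vocabulary word
        else:
            noisy.append(word)  # keep the word unchanged
    return " ".join(noisy)


def make_denoising_pairs(transcripts, seed=0, **noise_kwargs):
    """Build {"input": noisy, "target": clean} pairs from text-only domain data."""
    rng = random.Random(seed)
    # Vocabulary for substitutions is drawn from the target-domain text itself.
    vocab = sorted({w for t in transcripts for w in t.split()})
    return [
        {"input": add_noise(t, vocab, rng, **noise_kwargs), "target": t}
        for t in transcripts
    ]


if __name__ == "__main__":
    domain_text = ["turn left at the next junction", "book a table for two"]
    for pair in make_denoising_pairs(domain_text, seed=1, p_delete=0.2):
        print(pair)
```

Pairs like these could then feed a standard supervised fine-tuning loop in which the loss is computed on the clean target. The intuition suggested by the abstract is that noisy text inputs are closer to what the LLM receives from the speech projector than clean text is, which would help preserve cross-modal alignment.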
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
