Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Thibault Bañeras-Roux; Sergio Burdisso; Esaú Villatoro-Tello; Dairazalia Sánchez-Cortés; Shiran Liu; Severin Baroudi; Shashi Kumar; Hasindri Watawana; Manjunath K E; Kadri Hacioglu; Petr Motlicek; Andreas Stolcke

arXiv:2604.06487·cs.CL·April 9, 2026

TL;DR

This paper investigates whether small amounts of speech can mitigate the modality gap that arises when LLM-based ASR systems are adapted to a new domain with text-only data.

Contribution

It compares text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both, and shows that even small amounts of speech improve LLM-based ASR performance.

Findings

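01. Even limited speech consistently improves ASR performance in both in-domain and out-of-domain settings.

02. Mixed batching with only 10% of the target-domain speech (less than 4 hours) achieves word error rates comparable to, or better than, full-data fine-tuning.

03. Small speech samples provide strong modality-alignment signals.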

Abstract

Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain speech (less than 4 hours) achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full…
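
The abstract describes mixed batching only as combining text-only and paired speech-text adaptation, so the following is a minimal sketch of one way such batches could be assembled, not the authors' implementation. The function name `mixed_batches`, the per-batch `speech_fraction` parameter, the choice to reuse the small paired pool by sampling with replacement, and the placeholder file names are all assumptions made for illustration. Note that `speech_fraction` here is the share of each batch, which is a different quantity from the 10% of target-domain speech mentioned in the abstract.

```python
import random
from typing import Dict, Iterator, List, Optional

# One training example: "audio" is None for text-only data.
Example = Dict[str, Optional[str]]


def mixed_batches(
    text_only: List[Example],
    paired: List[Example],
    batch_size: int = 8,
    speech_fraction: float = 0.25,  # hypothetical knob, not from the paper
    seed: int = 0,
) -> Iterator[List[Example]]:
    """Yield batches that mix text-only and paired speech-text examples."""
    if not paired:
        raise ValueError("need at least one paired speech-text example")
    if not 0.0 < speech_fraction < 1.0:
        raise ValueError("speech_fraction must be strictly between 0 and 1")

    n_speech = max(1, round(batch_size * speech_fraction))
    n_text = batch_size - n_speech
    if n_text < 1:
        raise ValueError("batch_size too small to hold both kinds of example")

    rng = random.Random(seed)
    text_pool = list(text_only)
    rng.shuffle(text_pool)

    # Walk through the text-only data once; drop a final incomplete chunk.
    for start in range(0, len(text_pool) - n_text + 1, n_text):
        batch = text_pool[start:start + n_text]
        # Assumption: the paired pool is small, so it is sampled with
        # replacement and reused across batches.
        batch += rng.choices(paired, k=n_speech)
        rng.shuffle(batch)
        yield batch


# Toy data with placeholder transcripts and file names.
text_only = [{"audio": None, "text": f"domain sentence {i}"} for i in range(12)]
paired = [{"audio": f"utt{i}.wav", "text": f"spoken sentence {i}"} for i in range(2)]

batches = list(mixed_batches(text_only, paired, batch_size=4, speech_fraction=0.25))
```

In an actual training loop, a text-only example would bypass the speech encoder and projector while a paired example would pass through them. How the two kinds of input are presented to the LLM is not specified in the text above and is left out of this sketch.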
