Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

Ryo Magoshi; Takashi Maekaku; Yusuke Shinohara

arXiv:2605.14340·cs.SD·May 15, 2026

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara

PDF

TL;DR

This paper introduces a speech-text alignment framework to generate expressive pseudo-audio prompts, significantly improving text-only domain adaptation for LLM-based ASR models.

Contribution

It presents a novel method that explicitly models speech-text alignment to create more effective pseudo-audio prompts for domain adaptation.

Findings

01

Outperforms existing text-only adaptation methods

02

Improves overall error rates in target domains

03

Enhances out-of-vocabulary coverage

Abstract

LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions, or yields inexpressive prompts by leveraging only textual features, ignoring audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridges the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.