TASU: Text-Only Alignment for Speech Understanding

Jing Peng; Yi Yang; Xu Li; Yu Xi; Quanwei Tang; Yangui Fang; Junjie Li; Kai Yu

arXiv:2511.03310 · eess.AS · January 27, 2026

TL;DR

TASU introduces a text-only alignment method for speech understanding that reduces reliance on paired data and improves zero-shot generalization across multiple speech tasks.

Contribution

TASU presents a novel alignment paradigm that uses only unpaired text data, improving domain generalization and zero-shot performance in speech understanding models.

Findings

01. Achieves competitive zero-shot speech recognition performance.

02. Enhances domain generalization through curriculum learning.

03. Outperforms GLM-4-Voice and Step-Audio on MMSU benchmark.

Abstract

Recent advances in Speech Large Language Models (Speech LLMs) have paved the way for unified architectures across diverse speech understanding tasks. However, prevailing alignment paradigms rely heavily on large-scale audio-text paired data and computationally intensive training, yet often exhibit limited generalization to unseen domains or tasks. To address these limitations, we propose TASU (Text-only Alignment for Speech Understanding), a novel alignment paradigm that can leverage only unpaired text data to guide cross-modal alignment. Experiments show that TASU achieves competitive zero-shot speech recognition. Leveraging this property, it can further function as a pre-training stage in curriculum learning, enhancing domain generalization in speech recognition. Ultimately, TASU can extend its zero-shot generalization to a wide range of speech understanding tasks and notably…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research