TASU: Text-Only Alignment for Speech Understanding
Jing Peng, Yi Yang, Xu Li, Yu Xi, Quanwei Tang, Yangui Fang, Junjie Li, Kai Yu

TL;DR
TASU introduces a text-only alignment method for speech understanding that reduces reliance on paired data and improves zero-shot generalization across multiple speech tasks.
Contribution
It presents a novel alignment paradigm leveraging only unpaired text data, enhancing domain generalization and zero-shot performance in speech understanding models.
Findings
Achieves competitive zero-shot speech recognition performance.
Enhances domain generalization through curriculum learning.
Outperforms GLM-4-Voice and Step-Audio on MMSU benchmark.
Abstract
Recent advances in Speech Large Language Models (Speech LLMs) have paved the way for unified architectures across diverse speech understanding tasks. However, prevailing alignment paradigms rely heavily on large-scale audio-text paired data and computationally intensive training, yet often exhibit limited generalization to unseen domains or tasks. To address these limitations, we propose TASU (Text-only Alignment for Speech Understanding), a novel alignment paradigm that can leverage only unpaired text data to guide cross-modal alignment. Experiments show that TASU achieves competitive zero-shot speech recognition. Leveraging this property, it can further function as a pre-training stage in curriculum learning, enhancing domain generalization in speech recognition. Ultimately, TASU can extend its zero-shot generalization to a wide range of speech understanding tasks and notably…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
