TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment
Taesoo Kim, Jong Hwan Ko

TL;DR
TESU-LLM introduces a method to train speech-enabled language models solely with text data by aligning a shared encoder's output with language model embeddings, achieving competitive speech task performance without speech data.
Contribution
The paper proposes a unified encoder alignment approach that enables training speech-capable LLMs using only text data, reducing reliance on speech-text paired datasets and computational resources.
Findings
Achieves strong performance on speech benchmarks using only text data.
Reports results comparable to models trained on large multimodal datasets.
Demonstrates a scalable and efficient approach for speech LLM development.
Abstract
Recent advances in speech-enabled language models have shown promising results in building intelligent voice assistants. However, most existing approaches rely on large-scale paired speech-text data and extensive computational resources, which pose challenges in terms of scalability and accessibility. In this paper, we present TESU-LLM, a novel framework that enables training speech-capable language models using only text data. Our key insight is to leverage a unified encoder that maps semantically equivalent text and speech inputs to a shared latent space. By aligning the encoder output with the embedding space of an LLM via a lightweight projection network, we enable the model to generalize from text-only supervision to speech-based inference. Despite being trained exclusively on text, TESU-LLM achieves strong performance on various speech-related benchmarks, comparable to…
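The alignment idea described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the hand-written vectors standing in for encoder outputs and LLM embeddings, the bias-free linear projection, the squared-error objective, and the helper names `matvec` and `train_projection`. The abstract does not state the actual architecture or loss. The sketch only shows why a projection fitted on text-side encoder outputs can also serve speech-side outputs when the unified encoder places both close together.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def train_projection(enc_outputs, llm_targets, lr=0.1, epochs=500, seed=0):
    """Fit a bias-free linear map W with per-sample gradient descent.

    Hypothetical stand-in for a "lightweight projection network":
    minimizes the squared error between W @ enc_output and the target
    LLM embedding. The real objective in the paper may differ.
    """
    rng = random.Random(seed)
    d_out, d_in = len(llm_targets[0]), len(enc_outputs[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    for _ in range(epochs):
        for x, y in zip(enc_outputs, llm_targets):
            pred = matvec(W, x)
            for i in range(d_out):
                err = pred[i] - y[i]
                for j in range(d_in):
                    # Gradient of 0.5 * err^2 with respect to W[i][j]
                    W[i][j] -= lr * err * x[j]
    return W


# Made-up unified-encoder outputs for three TEXT inputs (training data).
text_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Made-up LLM embedding targets for the same three inputs.
llm_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Training uses text-side encoder outputs only.
W = train_projection(text_enc, llm_emb)

# Assumed encoder output for a SPEECH rendering of the first input:
# close to, but not identical with, its text counterpart.
speech_enc = [0.95, 0.03, 0.02]
speech_proj = matvec(W, speech_enc)
text_proj = matvec(W, text_enc[0])
```

Because the speech-side vector lies near its text-side counterpart, `speech_proj` lands near `text_proj` even though no speech vector was seen during fitting. How closely a real unified encoder satisfies this proximity assumption is an empirical question that the sketch does not address.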
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
