TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment

Taesoo Kim; Jong Hwan Ko

arXiv:2506.06343·cs.CL·June 10, 2025

Open Access

TL;DR

TESU-LLM introduces a method to train speech-enabled language models solely with text data by aligning a shared encoder's output with language model embeddings, achieving competitive speech task performance without speech data.

Contribution

The paper proposes a unified encoder alignment approach that enables training speech-capable LLMs using only text data, reducing reliance on speech-text paired datasets and computational resources.

Findings

01. Achieves strong performance on speech benchmarks despite being trained only on text data.

02. Produces results comparable to models trained on large multimodal datasets.

03. Demonstrates a scalable and efficient approach to speech-LLM development.

Abstract

Recent advances in speech-enabled language models have shown promising results in building intelligent voice assistants. However, most existing approaches rely on large-scale paired speech-text data and extensive computational resources, which pose challenges in terms of scalability and accessibility. In this paper, we present TESU-LLM, a novel framework that enables training speech-capable language models using only text data. Our key insight is to leverage a unified encoder that maps semantically equivalent text and speech inputs to a shared latent space. By aligning the encoder output with the embedding space of an LLM via a lightweight projection network, we enable the model to generalize from text-only supervision to speech-based inference. Despite being trained exclusively on text, TESU-LLM achieves strong performance on various speech-related benchmarks, comparable to…
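The idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. Everything named here is hypothetical: `unified_encoder` and `llm_embedding` are deterministic stand-ins, the dimensions are arbitrary, the projection is a single linear map, and the mean-squared-error objective is an assumption, since the abstract does not state the training loss. The sketch only shows why a projection fitted on text inputs can be reused on speech inputs when the encoder places both in the same latent space.

```python
import random

LATENT_DIM = 4  # assumed size of the unified encoder output
LLM_DIM = 3     # assumed size of the LLM embedding space


def unified_encoder(utterance, modality):
    """Hypothetical stand-in for a unified text/speech encoder.

    A real encoder would map equivalent text and speech to nearby latents.
    This toy version makes them identical by ignoring `modality`.
    """
    rng = random.Random(utterance)
    return [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]


def llm_embedding(utterance):
    """Hypothetical target vector in the LLM embedding space."""
    rng = random.Random("llm:" + utterance)
    return [rng.uniform(-1.0, 1.0) for _ in range(LLM_DIM)]


def project(weights, latent):
    """Linear projection from the encoder latent space to the LLM space."""
    return [sum(w * x for w, x in zip(row, latent)) for row in weights]


def train_projection(texts, epochs=200, lr=0.1):
    """Fit the projection with SGD on squared error, using text inputs only."""
    weights = [[0.0] * LATENT_DIM for _ in range(LLM_DIM)]
    for _ in range(epochs):
        for text in texts:
            latent = unified_encoder(text, "text")
            target = llm_embedding(text)
            pred = project(weights, latent)
            for i in range(LLM_DIM):
                err = pred[i] - target[i]
                for j in range(LATENT_DIM):
                    weights[i][j] -= lr * err * latent[j]
    return weights


if __name__ == "__main__":
    # Training sees text only; inference is then run on a "speech" input.
    weights = train_projection(["turn on the lights", "what time is it"])
    speech_latent = unified_encoder("play some music", "speech")
    print(project(weights, speech_latent))
```

In this toy setup the speech and text latents coincide exactly, so the projected vectors match by construction. With a real encoder the match would only be approximate, and how well text-only training transfers would depend on how closely the two modalities align.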

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques