When Large Language Models Meet Speech: A Survey on Integration Approaches

Zhengdong Yang; Shuichiro Shimizu; Yahan Yu; Chenhui Chu

arXiv:2502.19548 · cs.CL · September 10, 2025

TL;DR

This survey reviews methods for integrating speech with large language models, categorizing approaches and discussing their applications and challenges to guide future research in multimodal AI.

Contribution

It provides a comprehensive categorization of speech-LLM integration methods and analyzes their applications and challenges, offering a structured overview of this emerging field.

Findings

01

Three main integration approaches are identified: text-based, latent-representation-based, and audio-token-based.

02

Applications span speech recognition, synthesis, and understanding tasks.

03

The survey highlights current challenges and future directions in speech-LLM integration.

Abstract

Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably the speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques