When Large Language Models Meet Speech: A Survey on Integration Approaches
Zhengdong Yang, Shuichiro Shimizu, Yahan Yu, Chenhui Chu

TL;DR
This survey reviews methods for integrating speech with large language models, categorizing approaches and discussing their applications and challenges to guide future research in multimodal AI.
Contribution
It provides a comprehensive categorization of speech-LLM integration methods and analyzes their applications and challenges, offering a structured overview of this emerging field.
Findings
Identifies three main integration approaches: text-based, latent-representation-based, and audio-token-based.
Applications span speech recognition, synthesis, and understanding tasks.
Highlights current challenges and future directions in speech-LLM integration.
Abstract
Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
