A Survey on Speech Large Language Models for Understanding

Jing Peng; Yucheng Wang; Bohan Li; Yiwei Guo; Hankun Wang; Yangui Fang; Yu Xi; Haoyu Li; Xu Li; Ke Zhang; Shuai Wang; Kai Yu

arXiv:2410.18908·eess.AS·December 8, 2025·IEEE J. Sel. Top. Signal Process.·3 cites

Open Access

TL;DR

This survey systematically reviews Speech Large Language Models, defining speech understanding, analyzing architectures, training strategies, datasets, and evaluation methods, while highlighting key challenges and future directions for robust, generalizable speech comprehension systems.

Contribution

It provides a formal definition and taxonomy of speech understanding, analyzes current Speech LLM architectures, and discusses challenges and directions for future research.

Findings

01. Identifies instruction sensitivity as a key challenge.

02. Highlights degradation in semantic reasoning in Speech LLMs.

03. Provides a structured framework for analyzing Speech LLM architectures.

Abstract

Speech understanding is essential for interpreting the diverse forms of information embedded in spoken language, including linguistic, paralinguistic, and non-linguistic cues that are vital for effective human-computer interaction. The rapid advancement of large language models (LLMs) has catalyzed the emergence of Speech Large Language Models (Speech LLMs), which marks a transformative shift toward general-purpose speech understanding systems. To further clarify and systematically delineate task objectives, in this paper, we formally define the concept of speech understanding and introduce a structured taxonomy encompassing its informational, functional, and format dimensions. Within this scope of definition, we present a comprehensive review of current Speech LLMs, analyzing their architectures through a three-stage abstraction: Modality Feature Extraction, Modality Information…

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Focus