A Survey on Speech Large Language Models for Understanding
Jing Peng, Yucheng Wang, Bohan Li, Yiwei Guo, Hankun Wang, Yangui Fang, Yu Xi, Haoyu Li, Xu Li, Ke Zhang, Shuai Wang, Kai Yu

TL;DR
This survey systematically reviews Speech Large Language Models (Speech LLMs): it defines speech understanding, analyzes architectures, training strategies, datasets, and evaluation methods, and highlights key challenges and future directions for robust, generalizable speech comprehension systems.
Contribution
It provides a formal definition and taxonomy of speech understanding, analyzes current Speech LLM architectures, and discusses challenges and directions for future research.
Findings
Identifies instruction sensitivity as a key challenge.
Highlights degradation in semantic reasoning in Speech LLMs.
Provides a structured framework for analyzing Speech LLM architectures.
Abstract
Speech understanding is essential for interpreting the diverse forms of information embedded in spoken language, including linguistic, paralinguistic, and non-linguistic cues that are vital for effective human-computer interaction. The rapid advancement of large language models (LLMs) has catalyzed the emergence of Speech Large Language Models (Speech LLMs), which marks a transformative shift toward general-purpose speech understanding systems. To further clarify and systematically delineate task objectives, in this paper, we formally define the concept of speech understanding and introduce a structured taxonomy encompassing its informational, functional, and format dimensions. Within this scope of definition, we present a comprehensive review of current Speech LLMs, analyzing their architectures through a three-stage abstraction: Modality Feature Extraction, Modality Information…
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Focus
