MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Dingdong Wang, Junan Li, Jincenzi Wu, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng

TL;DR
MMSU is a new comprehensive benchmark with 5,000 audio-question-answer triplets across 47 tasks, designed to evaluate and advance multi-faceted spoken language understanding and reasoning in Speech Large Language Models.
Contribution
This paper introduces MMSU, a large-scale benchmark for evaluating speech understanding and reasoning, incorporating diverse linguistic phenomena and providing a standard for future model development.
Findings
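Existing SpeechLLMs show significant room for improvement, particularly in phonological and paralinguistic understanding.
MMSU covers a wide range of linguistic phenomena, including phonetics, prosody, semantics, and paralinguistics.
Evaluation across a variety of SpeechLLMs highlights the need for further model optimization.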
Abstract
Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our…
Peer Reviews
Decision: ICLR 2026 Poster
1. The dataset looks extremely useful and focuses deeply on speech evaluation rather than general audio, which I appreciate. Its size is also reasonably large, allowing robust evaluation.
2. Most of the audio is human-generated, which is closer to real-world scenarios than datasets that primarily use synthetic audio data.
3. The experiments and evaluations are quite thorough, covering a good variety of speech LLMs. The insights are actionable as well, e.g., phonology-based understanding is poor fo…
1. The dataset creation process involves GPT-4o in the loop to augment distractor options. This could potentially create biases in the dataset that might be exploitable by speech LLMs. What percentage of the final distractor options were human-written vs. LLM-generated? Could the authors perform some analysis to check whether this introduces a bias that favors/disfavors LLM-based speech models?
2. The manual review process at the end of the dataset collection is not described completely. For exa…
- It covers 47 distinct speech tasks, which is extensive and comprehensive for SLU.
- It uses real-world recordings and voice actors to produce authentic audio.
- The paper gives a detailed error analysis, highlighting that most models fail in phonological and paralinguistic reasoning, not just semantics, which is an important insight for future SpeechLLM research.
- Overall, a well-written paper.
- MCQ-based tasks might not reliably measure a model's capabilities, as the chance of selecting the right answer by guessing is 25% and the model might hallucinate.
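One way to account for the 25% guessing floor raised above is to report chance-corrected accuracy alongside raw accuracy. The sketch below assumes four answer options per question (implied by the 25% figure, not stated elsewhere in this summary); the function names are illustrative and are not part of any MMSU evaluation code.

```python
import math


def chance_corrected_accuracy(accuracy: float, num_options: int = 4) -> float:
    """Rescale accuracy so that random guessing maps to 0 and perfect maps to 1.

    num_options=4 is an assumption taken from the 25% chance rate in the review.
    """
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)


def guessing_standard_error(num_questions: int, num_options: int = 4) -> float:
    """Standard error of the accuracy of a pure random guesser (binomial model)."""
    if num_questions < 1:
        raise ValueError("num_questions must be positive")
    chance = 1.0 / num_options
    return math.sqrt(chance * (1.0 - chance) / num_questions)


if __name__ == "__main__":
    # Illustrative numbers only; 0.625 is not a reported model score.
    print(chance_corrected_accuracy(0.625))
    # 5,000 is the benchmark size given in the TL;DR.
    print(guessing_standard_error(5000))
```

Under these assumptions, a raw accuracy of 0.625 corresponds to 0.5 after correction, and with 5,000 questions the spread of a random guesser's accuracy around 25% is small, so scores well above that floor are unlikely to come from guessing alone.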
Strengths of the MMSU benchmark include:
* It is the first benchmark to be systematically grounded in linguistic theory. This allows it to test nuanced areas that other benchmarks miss, including phonetics, prosody, semantics, and paralinguistics.
* MMSU is reasonable in size, providing 5,000 audio-question-answer triplets across 47 distinct tasks. It uniquely categorizes these tasks into 24 "perception" abilities (e.g., intonation perception) and 23 "reasoning" abilities (e.g., sarcasm detect…
Key weaknesses identified in the MMSU benchmark from my perspective include:
* Missing Quantitative Reliability Metrics (IAA): The paper does not report standard inter-annotator agreement (IAA) scores (e.g., Cohen’s Kappa) to validate its dataset. While it details a rigorous, multi-stage review process to force consensus, this is a procedural fix, not a quantitative measurement. By omitting the initial agreement score, the paper obscures the potential inherent ambiguity of its 47 tasks and make…
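For reference, the Cohen's Kappa statistic mentioned in the review above compares the observed agreement between two annotators, p_o, with the agreement expected by chance, p_e, as kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch with made-up labels; the function name and the example data are hypothetical and do not come from the MMSU annotation process.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical answer choices from two annotators on four items.
    annotator_1 = ["A", "A", "A", "B"]
    annotator_2 = ["A", "A", "B", "B"]
    print(cohens_kappa(annotator_1, annotator_2))
```

In this toy example the annotators agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, giving a kappa of 0.5. Reporting such a value per task before consensus review would be one way to address the reviewer's concern.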
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Speech and dialogue systems
