SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models
Wanqi Yang, Yanda Li, Yunchao Wei, Meng Fang, Ling Chen

TL;DR
SpeechR is a comprehensive benchmark designed to evaluate reasoning abilities in large audio-language models across factual, procedural, and normative tasks, highlighting the gap between transcription accuracy and reasoning skills.
Contribution
Introduces SpeechR, a benchmark that assesses reasoning over speech in three evaluation formats (multiple-choice, generative, and acoustic-feature), addressing the focus of existing evaluations on surface-level perception.
Findings
High transcription accuracy does not imply strong reasoning skills.
Model performance varies across the three reasoning dimensions: factual retrieval, procedural inference, and normative judgment.
SpeechR enables targeted analysis of reasoning capabilities in spoken language models.
Abstract
Large audio-language models (LALMs) have achieved near-human performance in sentence-level transcription and emotion recognition. However, existing evaluations focus mainly on surface-level perception, leaving the capacity of models for contextual and inference-driven reasoning in speech-based scenarios insufficiently examined. To address this gap, we introduce SpeechR, a unified benchmark for evaluating reasoning over speech in large audio-language models. SpeechR evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. It includes three distinct evaluation formats. The multiple-choice version measures answer selection accuracy. The generative version assesses the coherence and logical consistency of reasoning chains. The acoustic-feature version investigates whether variations in stress and emotion affect reasoning performance.…
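As a rough illustration of the multiple-choice format, the sketch below computes answer selection accuracy separately for each of the three reasoning dimensions. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the `McqItem` record, its field names, the dimension labels, and the case-insensitive comparison of option letters are all hypothetical choices made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels for the three dimensions named in the abstract.
DIMENSIONS = ("factual", "procedural", "normative")


@dataclass(frozen=True)
class McqItem:
    """One multiple-choice item (hypothetical schema for illustration)."""
    dimension: str  # one of DIMENSIONS
    gold: str       # correct option label, e.g. "B"
    predicted: str  # option label selected by the model


def accuracy_by_dimension(items):
    """Return {dimension: accuracy} for every dimension that has items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        if item.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {item.dimension!r}")
        total[item.dimension] += 1
        # Assumption: option labels match up to case and surrounding whitespace.
        if item.predicted.strip().upper() == item.gold.strip().upper():
            correct[item.dimension] += 1
    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d]}


if __name__ == "__main__":
    demo = [
        McqItem("factual", "A", "A"),
        McqItem("factual", "B", "C"),
        McqItem("procedural", "D", "D"),
        McqItem("normative", "C", "A"),
    ]
    for dimension, score in accuracy_by_dimension(demo).items():
        print(f"{dimension}: {score:.2f}")
```

Reporting accuracy per dimension rather than as a single pooled number keeps differences between factual, procedural, and normative items visible.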
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing
