SonicBench: Dissecting the Physical Perception Bottleneck in Large Audio Language Models
Yirong Sun, Yanjun Chen, Xin Qiu, Gang Zhang, Hongyu Chen, Daokuan Wu, Chengming Li, Min Yang, Dawei Zhu, Wei Zhang, Xiaoyu Shen

TL;DR
This paper introduces SonicBench, a benchmark for evaluating large audio language models' perception of physical audio attributes, revealing significant perceptual limitations and the importance of better alignment and decoding strategies.
Contribution
SonicBench provides a systematic, psychophysically grounded evaluation framework for physical audio perception in LALMs, highlighting their perceptual deficiencies and the potential for improved decoding.
Findings
LALMs perform near random on physical attribute tasks
Unlike humans, models show no advantage on comparison tasks over recognition tasks
Frozen encoders capture physical cues with at least 60% accuracy
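One way to read "near random" versus "at least 60%" on a two-choice task is to ask how likely a given score would be under pure guessing. The sketch below is an illustration only, not the paper's analysis: the function name, the 50% chance level, and the trial counts in the usage note are assumptions.

```python
from math import comb


def p_value_above_chance(correct, total, chance=0.5):
    """One-sided exact binomial p-value: P(X >= correct) if the model guesses.

    `chance` is the guessing rate; 0.5 assumes a two-alternative task,
    which is an assumption here and not a detail taken from the paper.
    """
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical counts, chosen only to illustrate the two regimes.
near_random = p_value_above_chance(52, 100)   # 52% accuracy
above_chance = p_value_above_chance(60, 100)  # 60% accuracy
```

With a hypothetical 100 trials, 52 correct is well within what guessing produces, while 60 correct is unlikely under guessing at the conventional 5% level. Actual conclusions depend on the benchmark's real trial counts and answer formats.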
Abstract
Large Audio Language Models (LALMs) excel at semantic and paralinguistic tasks, yet their ability to perceive the fundamental physical attributes of audio such as pitch, loudness, and spatial location remains under-explored. To bridge this gap, we introduce SonicBench, a psychophysically grounded benchmark that systematically evaluates 12 core physical attributes across five perceptual dimensions. Unlike previous datasets, SonicBench uses a controllable generation toolbox to construct stimuli for two complementary paradigms: recognition (absolute judgment) and comparison (relative judgment). This design allows us to probe not only sensory precision but also relational reasoning capabilities, a domain where humans typically exhibit greater proficiency. Our evaluation reveals a substantial deficiency in LALMs' foundational auditory understanding; most models perform near random guessing…
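The two paradigms described in the abstract can be illustrated with a minimal stimulus generator for a single attribute (pitch). This is a sketch under stated assumptions, not the SonicBench toolbox: the function names, sample rate, frequencies, durations, and question wording are all hypothetical.

```python
import math
import random

SAMPLE_RATE = 16_000  # Hz; assumed value, not taken from the paper


def sine_tone(freq_hz, duration_s=0.5, amplitude=0.5, sample_rate=SAMPLE_RATE):
    """Return a pure sine tone as a list of float samples in [-1, 1]."""
    n = int(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def recognition_trial(rng, low_hz=220.0, high_hz=880.0):
    """Absolute judgment: a single tone labeled 'low' or 'high'."""
    label = rng.choice(["low", "high"])
    freq = low_hz if label == "low" else high_hz
    return {
        "audio": sine_tone(freq),
        "question": "Is the pitch of this sound low or high?",
        "answer": label,
    }


def comparison_trial(rng, base_hz=440.0, ratio=1.5, gap_s=0.25):
    """Relative judgment: two tones separated by silence, in random order."""
    higher_first = rng.random() < 0.5
    if higher_first:
        f1, f2 = base_hz * ratio, base_hz
    else:
        f1, f2 = base_hz, base_hz * ratio
    silence = [0.0] * int(gap_s * SAMPLE_RATE)
    return {
        "audio": sine_tone(f1) + silence + sine_tone(f2),
        "question": "Which sound has the higher pitch, the first or the second?",
        "answer": "first" if higher_first else "second",
    }


rng = random.Random(0)  # fixed seed so trials are reproducible
rec = recognition_trial(rng)
cmp_trial = comparison_trial(rng)
```

Because the generator controls every parameter, the correct answer is known by construction, and the difficulty of a comparison trial can be adjusted through `ratio` alone.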
Taxonomy
Topics: Neuroscience and Music Perception · Music and Audio Processing · Emotion and Mood Recognition
