SonicBench: Dissecting the Physical Perception Bottleneck in Large Audio Language Models

Yirong Sun; Yanjun Chen; Xin Qiu; Gang Zhang; Hongyu Chen; Daokuan Wu; Chengming Li; Min Yang; Dawei Zhu; Wei Zhang; Xiaoyu Shen

arXiv:2601.11039·cs.SD·January 19, 2026

TL;DR

This paper introduces SonicBench, a benchmark for evaluating large audio language models' perception of physical audio attributes, revealing significant perceptual limitations and the importance of better alignment and decoding strategies.

Contribution

SonicBench provides a systematic, psychophysically grounded evaluation framework for physical audio perception in LALMs, highlighting their perceptual deficiencies and the potential for improved decoding.

Findings

01 LALMs perform near random on physical attribute tasks

02 Models do not show a human-like advantage on comparison tasks

03 Frozen encoders capture physical cues with at least 60% accuracy

Abstract

Large Audio Language Models (LALMs) excel at semantic and paralinguistic tasks, yet their ability to perceive the fundamental physical attributes of audio such as pitch, loudness, and spatial location remains under-explored. To bridge this gap, we introduce SonicBench, a psychophysically grounded benchmark that systematically evaluates 12 core physical attributes across five perceptual dimensions. Unlike previous datasets, SonicBench uses a controllable generation toolbox to construct stimuli for two complementary paradigms: recognition (absolute judgment) and comparison (relative judgment). This design allows us to probe not only sensory precision but also relational reasoning capabilities, a domain where humans typically exhibit greater proficiency. Our evaluation reveals a substantial deficiency in LALMs' foundational auditory understanding; most models perform near random guessing…
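To make the two paradigms concrete, here is a minimal sketch of how a recognition trial and a comparison trial for pitch could be generated from pure tones. It is not the SonicBench toolbox: the function names, sample rate, frequencies, durations, and question wording are all illustrative assumptions, and only the Python standard library is used.

```python
import math
import random

SAMPLE_RATE = 16_000  # Hz; assumed value, not taken from the paper


def sine_tone(freq_hz, duration_s=0.5, amplitude=0.5, sample_rate=SAMPLE_RATE):
    """Return a pure tone as a list of float samples in [-amplitude, amplitude]."""
    n = int(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def recognition_trial(rng, low_hz=220.0, high_hz=880.0):
    """Absolute judgment: a single tone, labelled 'low' or 'high' (hypothetical labels)."""
    label = rng.choice(["low", "high"])
    freq = low_hz if label == "low" else high_hz
    return {
        "audio": sine_tone(freq),
        "question": "Is the pitch of this tone low or high?",
        "answer": label,
    }


def comparison_trial(rng, base_hz=440.0, ratio=1.5, gap_s=0.25):
    """Relative judgment: two tones separated by silence; which one is higher?"""
    higher_first = rng.random() < 0.5
    if higher_first:
        f1, f2 = base_hz * ratio, base_hz
    else:
        f1, f2 = base_hz, base_hz * ratio
    silence = [0.0] * int(gap_s * SAMPLE_RATE)
    return {
        "audio": sine_tone(f1) + silence + sine_tone(f2),
        "question": "Which tone has the higher pitch, the first or the second?",
        "answer": "first" if higher_first else "second",
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so trials are reproducible
    for trial in (recognition_trial(rng), comparison_trial(rng)):
        print(trial["question"], "->", trial["answer"], f"({len(trial['audio'])} samples)")
```

Randomizing which tone comes first in the comparison trial keeps a model from scoring above chance by always answering with the same position. Under these assumptions, both trial types are two-way choices, so random guessing would yield 50% accuracy.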

Code & Models

Datasets

YirongSun/SonicBench
dataset · 4.6k downloads

Taxonomy

Topics: Neuroscience and Music Perception · Music and Audio Processing · Emotion and Mood Recognition