STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, Jiaqi Wang

TL;DR
STAR-Bench introduces a comprehensive audio benchmark that evaluates deep spatio-temporal reasoning in sound, revealing significant gaps in current models' perceptual and reasoning abilities compared to humans.
Contribution
The paper formalizes audio 4D intelligence and creates STAR-Bench, a novel benchmark combining perceptual and holistic reasoning tasks with high-quality data curation methods.
Findings
Models show large performance drops on STAR-Bench tasks.
Open-source models lag behind humans in perception and reasoning.
Current models are bottlenecked by fine-grained perceptual understanding.
Abstract
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data,…
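To make the segment-reordering task in the Holistic Spatio-Temporal Reasoning setting more concrete, here is a minimal sketch of how such an item could be built and scored. It is an illustration only, not the authors' pipeline: the function names `make_reordering_item` and `is_correct`, the equal-length split, and the exact-match scoring rule are all assumptions, and a plain list of numbers stands in for audio samples.

```python
import random


def make_reordering_item(samples, num_segments, seed=0):
    """Build a toy reordering item (hypothetical helper, not from the paper).

    Splits `samples` into `num_segments` equal-length segments, shuffles them,
    and returns the shuffled segments with the answer that restores the order.
    """
    if num_segments < 2 or len(samples) < num_segments:
        raise ValueError("need at least 2 segments and enough samples")

    seg_len = len(samples) // num_segments
    # Any trailing remainder is dropped so all segments have equal length.
    segments = [samples[i * seg_len:(i + 1) * seg_len]
                for i in range(num_segments)]

    rng = random.Random(seed)
    order = list(range(num_segments))
    # Reshuffle until the presented order differs from the original order.
    while order == sorted(order):
        rng.shuffle(order)

    shuffled = [segments[i] for i in order]
    # answer[k] is the position in `shuffled` of the k-th original segment.
    answer = [order.index(k) for k in range(num_segments)]
    return shuffled, answer


def is_correct(predicted, answer):
    """Exact-match scoring: the whole predicted ordering must be right."""
    return list(predicted) == list(answer)


if __name__ == "__main__":
    audio = list(range(12))  # stand-in for an audio waveform
    shuffled, answer = make_reordering_item(audio, num_segments=4, seed=1)
    restored = [x for pos in answer for x in shuffled[pos]]
    print(restored == audio)
```

Exact-match scoring is the strictest choice here: an ordering with one swapped pair counts as wrong, so a model cannot earn credit from partial temporal understanding.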
Peer Reviews
Decision·ICLR 2026 Poster
A clear way of generating spatial audio and temporal reasoning benchmarks is described and implemented. The results show a clear lack of perception in current LALMs on tasks requiring these skills. This is a valuable and distinct addition to the plethora of benchmarks coming out.
The audio scenes were generated binaurally, but this restricts the range of applications of the benchmark to perhaps humanoid listeners. Also, the HRTF used was a KEMAR one, which is somewhat limited, and the scene rendering was also a bit limited. Nonetheless, a good beginning. The models are of course set up to fail in this testing. The spatial audio tasks are relatively hopeless, and much analysis of the models was not possible. To an extent, I would have liked more detailed analysis of th
- Introduces and benchmarks multi-audio segment reordering and stereo spatial reasoning, which have been ignored by previous benchmarks
- Proper coverage of non-spatial attributes (Loudness, Pitch, Duration) and spatial attributes (Azimuth, Elevation, Distance)
- Detailed error analysis of why models don't do well on the proposed benchmark
- Thorough reporting of AA and ACR metrics to test the model's reliability
- Really like the finding "a fundamental inability to effectively compare, ground,
- Missing details of AI-Assisted Automated Filtering. What exactly is being filtered using Gemini 2.5 Pro?
- Spatial data is synthetic and does not represent real-world use cases.
- Fig. 8 is not readable.
- Not sure if the baselines support spatial audio. Analysis on those models would not provide useful information.
- Support for spatial audio in these models can lead to huge jumps in model performance on the benchmark, which calls the difficulty of the benchmark into question.
- Clear problem framing ("audio 4D intelligence") and a structured task design that disaggregates perception, temporal reasoning, and spatial reasoning.
- Useful diagnostic split: absolute vs. relative perceptual tests; temporal re-ordering; spatial subtasks (single-source localization, multi-source relations, dynamic trajectories).
- Strong curation pipeline with AI filtering, human annotation, and expert validation; explicit use of public datasets + simulated audio for coverage and control.
-
- Everything is framed as multiple-choice with string-match grading. There are no open-ended QAs, even though they are closer to real-world use.
- A large portion of the audio comes from widely used corpora (FSD50K, Clotho, STARSS23); many models were likely pretrained on them. The authors argue the task formulation is novel (re-ordering, spatial relations), but clip-level memorization of events/timbres is still possible.
- The paper doesn't quantify inter-rater agreement, item rejection rates, or ambiguity sources.
- In table 1
