SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases
Laya Iyer, Angelina Wang, Sanmi Koyejo

TL;DR
This paper introduces SCENEBench, a benchmark suite that evaluates large audio language models on audio understanding beyond speech recognition. It covers four real-world task categories and exposes gaps in current models.
Contribution
The paper presents SCENEBench, a benchmark suite for assessing broad audio comprehension in Large Audio Language Models (LALMs) across four categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition.
Findings
Model performance varies significantly across tasks.
Some models perform below random chance on certain tasks.
Benchmark results highlight specific areas for model improvement.
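To make the "below random chance" finding concrete, here is a minimal sketch of how such a check could be run. It assumes a task is scored as k-way multiple choice, so the chance accuracy is 1/k. That scoring format, the function names, and the significance threshold are illustrative assumptions, not details taken from the paper.

```python
from math import comb


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a k-way choice task."""
    return 1.0 / num_options


def binomial_cdf(correct: int, total: int, p: float) -> float:
    """P(X <= correct) for X ~ Binomial(total, p)."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct + 1)
    )


def is_below_chance(correct: int, total: int, num_options: int,
                    alpha: float = 0.05) -> bool:
    """Flag a result as below chance only if the shortfall is unlikely
    to come from random guessing (one-sided binomial test).

    `alpha` is a hypothetical threshold chosen for illustration.
    """
    p = chance_accuracy(num_options)
    observed = correct / total
    return observed < p and binomial_cdf(correct, total, p) < alpha


# Hypothetical counts, not results from the paper.
print(is_below_chance(correct=10, total=100, num_options=4))
print(is_below_chance(correct=24, total=100, num_options=4))
```

The one-sided test matters because an accuracy slightly under 1/k is expected from noise alone. Only a statistically significant shortfall suggests a systematic bias, such as a model consistently preferring a wrong option.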
Abstract
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
