SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

Laya Iyer; Angelina Wang; Sanmi Koyejo

arXiv:2603.09853 · cs.SD · March 11, 2026

TL;DR

This paper introduces SCENEBench, a comprehensive benchmark for evaluating large audio language models across diverse real-world audio understanding tasks, highlighting current model gaps and guiding future improvements.

Contribution

The paper presents SCENEBench, a benchmark suite for assessing broad audio comprehension in Large Audio Language Models (LALMs). It covers four categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition.

Findings

01. Model performance varies significantly across tasks.

02. Some models perform below random chance on certain tasks.

03. Benchmark results highlight specific areas for model improvement.
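
The claim that some models score below random chance only has meaning relative to a per-task chance baseline. The sketch below shows one way such a comparison could be computed, assuming each item is scored as a multiple-choice question with a known number of options. That scoring format, the function name `accuracy_vs_chance`, and the task labels are illustrative assumptions; the paper's actual evaluation protocol is not shown in this summary.

```python
from collections import defaultdict


def accuracy_vs_chance(records):
    """Compare per-task accuracy with the uniform-guessing baseline.

    records: iterable of (task, num_choices, is_correct) tuples.
    Assumes multiple-choice items, so chance for one item is 1 / num_choices.
    """
    # Per task: [number correct, number of items, summed per-item chance]
    totals = defaultdict(lambda: [0, 0, 0.0])
    for task, num_choices, is_correct in records:
        entry = totals[task]
        entry[0] += int(is_correct)
        entry[1] += 1
        entry[2] += 1.0 / num_choices

    report = {}
    for task, (correct, count, chance_sum) in totals.items():
        accuracy = correct / count
        chance = chance_sum / count  # mean chance level across the task's items
        report[task] = {
            "accuracy": accuracy,
            "chance": chance,
            "below_chance": accuracy < chance,
        }
    return report


# Toy records invented for illustration; these are not results from the paper.
toy_records = (
    [("noise_localization", 4, True)]
    + [("noise_localization", 4, False)] * 4
    + [("background_sound", 2, True), ("background_sound", 2, False)]
)

for task, stats in accuracy_vs_chance(toy_records).items():
    print(task, stats)
```

On a small evaluation set, an accuracy slightly under the baseline can be noise, so a comparison like this is usually paired with a confidence interval before concluding that a model is truly below chance.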

Abstract

Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis