RSA-Bench: Benchmarking Audio Large Models in Real-World Acoustic Scenarios
Yibo Zhang, Liang Lin, Kaiwen Luo, Shilinlu Yan, Jin Wang, Yaoqi Guo, Yitian Chen, Yalan Qin, Zhenhong Zhou, Kun Wang, Li Sun

TL;DR
RSA-Bench introduces a comprehensive auditory scene simulation benchmark to evaluate Audio Large Models' robustness in complex, real-world acoustic environments across multiple tasks.
Contribution
It presents a novel evaluation framework that uses natural environmental soundscapes to stress-test ALMs, revealing their strengths and vulnerabilities in realistic scenarios.
Findings
Models are resilient in low-level recognition but struggle with high-order reasoning under stress.
Vocal-like interference disrupts model performance significantly more than mechanical noise.
Standard speech denoising can worsen model performance due to semantic distortions.
Abstract
While Audio Large Models (ALMs) have achieved remarkable proficiency, their robustness remains brittle in real-world deployment. Existing evaluations largely rely on synthetic Gaussian noise or simplistic single-source interference, failing to capture the intricate, multi-layered acoustic dynamics, or "Acoustic Ecology", that characterize authentic physical environments. To bridge this ecological gap, we introduce RSA-Bench, a comprehensive robustness benchmark designed to stress-test ALMs through high-fidelity auditory scene simulations. Unlike traditional methods, we construct evaluation samples by naturally superimposing diverse environmental soundscapes (spanning Pasture, Extreme Weather, Classroom, and Outdoors) onto clean speech signals across a spectrum of interference intensities. By evaluating models on six core tasks…
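The abstract describes superimposing environmental soundscapes onto clean speech at varying interference intensities. The sketch below shows one common way to do that kind of mixing. It assumes intensity is expressed as a signal-to-noise ratio in decibels, which the text above does not confirm. The names `mix_at_snr` and `mean_power` are hypothetical helpers written for illustration and are not taken from the RSA-Bench code. Signals are plain lists of float samples at a shared sample rate.

```python
import math


def mean_power(signal):
    """Average power (mean of squared samples) of a non-empty signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Overlay `noise` on `speech` so the mixture has the requested SNR in dB.

    Hypothetical helper: the paper's actual mixing procedure may differ.
    The noise clip is looped to cover the full length of the speech.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Loop the noise clip so it spans the whole utterance.
    looped = [noise[i % len(noise)] for i in range(len(speech))]

    noise_power = mean_power(looped)
    if noise_power == 0:
        raise ValueError("noise clip is silent; cannot scale to a target SNR")

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)

    return [s + gain * n for s, n in zip(speech, looped)]


# Example: a 220 Hz tone standing in for speech, mixed at three intensities.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(500)]
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 0)}
```

Lower SNR values correspond to stronger interference, so sweeping `snr_db` downward would give one possible reading of the "spectrum of interference intensities" mentioned above.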
