A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models
Iwona Christop (1), Mateusz Czyżnikiewicz (2), Paweł Skórzewski (1), Łukasz Bondaruk (2), Jakub Kubiak (2), Marcin Lewandowski (2), Marek Kubis (1) ((1) Adam Mickiewicz University, (2) Samsung R&D Institute Poland)

TL;DR
This paper introduces ART, a new benchmark designed to evaluate multimodal large language models' reasoning capabilities across diverse audio tasks, addressing the gap in existing benchmarks that focus only on isolated audio tasks.
Contribution
The paper presents ART, a benchmark that tests the reasoning skills of multimodal models on problems combining several audio tasks, a capability that prior evaluations did not cover.
Findings
ART enables assessment of reasoning over multiple audio tasks
Multimodal models' reasoning abilities can be systematically evaluated with ART
The benchmark highlights strengths and weaknesses in current models' audio reasoning
Abstract
Existing benchmarks for testing the audio modality of multimodal large language models concentrate on individual audio tasks, such as speaker diarization or gender identification, in isolation. They cannot verify whether a multimodal model can answer questions that require reasoning skills to combine audio tasks of different categories. To address this issue, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over audio signals.
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Emotion and Mood Recognition
