A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models

Iwona Christop (1); Mateusz Czyżnikiewicz (2); Paweł Skórzewski (1); Łukasz Bondaruk (2); Jakub Kubiak (2); Marcin Lewandowski (2); Marek Kubis (1) ((1) Adam Mickiewicz University; (2) Samsung R&D Institute Poland)

arXiv:2601.19673·cs.SD·January 28, 2026

TL;DR

This paper introduces ART, a new benchmark designed to evaluate multimodal large language models' reasoning capabilities across diverse audio tasks, addressing the gap in existing benchmarks that focus only on isolated audio tasks.

Contribution

The paper presents a novel benchmark, ART, specifically aimed at testing the reasoning skills of multimodal models over combined audio tasks, which was lacking in prior evaluations.

Findings

01 ART enables assessment of reasoning over multiple audio tasks

02 Multimodal models' reasoning abilities can be systematically evaluated with ART

03 The benchmark highlights strengths and weaknesses in current models' audio reasoning

Abstract

Existing benchmarks for testing the audio modality of multimodal large language models evaluate individual audio tasks, such as speaker diarization or gender identification, in isolation. They cannot verify whether a multimodal model can answer questions that require reasoning to combine audio tasks of different categories. To address this issue, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over audio signals.

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Emotion and Mood Recognition