AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning

Siminfar Samakoush Galougah; Rishie Raj; Sanjoy Chowdhury; Sayan Nag; Ramani Duraiswami

arXiv:2508.07470 · cs.CV · August 22, 2025

TL;DR

AURA is a new benchmark and metric for evaluating the reasoning process of audio-visual models, emphasizing logical coherence and factual grounding beyond mere answer accuracy.

Contribution

We introduce AURA, a comprehensive AV reasoning benchmark, together with a novel metric, AuraScore, that evaluates reasoning fidelity and identifies reasoning gaps in current models.

Findings

01. High-accuracy models often lack reasoning fidelity.

02. Models show significant gaps in factual consistency and logical inference.

03. AURA reveals the need for improved reasoning capabilities in AV models.

Abstract

Current audio-visual (AV) benchmarks focus on final answer accuracy and overlook the underlying reasoning process. This makes it difficult to distinguish genuine comprehension from correct answers reached through flawed reasoning or hallucinations. To address this, we introduce AURA (Audio-visual Understanding and Reasoning Assessment), a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA's questions span six challenging cognitive domains (causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling) and are explicitly designed to be unanswerable from a single modality. This forces models to construct a valid logical path grounded in both audio and video, setting AURA apart from AV datasets that allow uni-modal…

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech and Audio Processing