JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

Jianghan Chao; Jianzhang Gao; Wenhui Tan; Yuchong Sun; Ruihua Song; Liyun Ru

arXiv:2512.12772 · cs.MM · May 15, 2026

TL;DR

JointAVBench is a benchmark for evaluating joint audio-visual reasoning in Omni-Large Language Models (Omni-LLMs). It covers multi-modal dependency, diverse audio information types, and varying scene spans, and it shows that even the best evaluated models reach only 65.3% accuracy.

Contribution

The paper introduces JointAVBench, a benchmark for joint audio-visual reasoning that spans five cognitive dimensions, four audio information types, and three scene spans, together with an automated pipeline for question synthesis.

Findings

01. Even the best Omni-LLMs reach only 65.3% accuracy, indicating significant room for improvement.

02. Omni-LLMs outperform uni-modal baselines but struggle with cross-scene reasoning.

03. The benchmark reveals current models' limitations in multi-modal understanding.
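
A per-scene-span accuracy breakdown of the kind these findings rely on can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the `Result` record, its field names, and both helper functions are hypothetical, and exact-match scoring of answers is an assumption. Only the three scene-span labels are taken from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# Scene spans named in the abstract; everything else here is illustrative.
SCENE_SPANS = ("single-scene", "cross-scene", "full-scene")


@dataclass(frozen=True)
class Result:
    """One scored question (hypothetical record layout)."""
    scene_span: str  # one of SCENE_SPANS
    predicted: str   # the model's chosen answer
    answer: str      # the reference answer


def accuracy_by_span(results):
    """Return {scene_span: accuracy} using exact-match scoring (an assumption)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        if r.scene_span not in SCENE_SPANS:
            raise ValueError(f"unknown scene span: {r.scene_span!r}")
        total[r.scene_span] += 1
        correct[r.scene_span] += int(r.predicted == r.answer)
    return {span: correct[span] / total[span] for span in total}


def overall_accuracy(results):
    """Return the fraction of all questions answered correctly."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.predicted == r.answer for r in results) / len(results)
```

Reporting the per-span numbers next to the overall figure is what makes a gap such as weaker cross-scene performance visible; a single aggregate accuracy would hide it.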

Abstract

Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given…

Videos

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation · SlidesLive