AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

Zixuan Chen; Depeng Wang; Hao Lin; Li Luo; Ke Xu; Ya Guo; Huijia Zhu; Tanfeng Sun; Xinghao Jiang

arXiv:2604.13593·cs.MM·April 16, 2026

TL;DR

AVID introduces a large-scale benchmark for evaluating audio-visual inconsistency understanding in videos, highlighting current model limitations and providing a comprehensive dataset for advancing trustworthy omni-modal AI.

Contribution

The paper presents AVID, a benchmark comprising a scalable construction pipeline, a large set of annotated videos, and evaluation protocols for cross-modal inconsistency detection and reasoning.

Findings

01 State-of-the-art models show significant limitations in temporal grounding and reasoning.

02 Fine-tuned AVID-Qwen outperforms base models in segment reasoning and temporal grounding.

03 AVID serves as an effective testbed for improving trustworthy omni-modal AI systems.

Abstract

We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. While omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, they struggle to perceive cross-modal conflicts, a fundamental human capability that is critical for trustworthy AI. Existing benchmarks predominantly focus on aligned events or deepfake detection, leaving a significant gap in evaluating inconsistency perception in long-form video contexts. AVID addresses this with: (1) a scalable construction pipeline comprising temporal segmentation that classifies video content into Active Speaker, Voiceover, and Scenic categories; an agent-driven strategy planner that selects semantically appropriate inconsistency categories; and five specialized injectors for diverse audio-visual conflict injection; (2) 11.2K long videos (avg.…
