AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
Zixuan Chen, Depeng Wang, Hao Lin, Li Luo, Ke Xu, Ya Guo, Huijia Zhu, Tanfeng Sun, Xinghao Jiang

TL;DR
AVID introduces a large-scale benchmark for evaluating audio-visual inconsistency understanding in videos, highlighting current model limitations and providing a comprehensive dataset for advancing trustworthy omni-modal AI.
Contribution
The paper presents AVID, a novel benchmark with a scalable construction pipeline, extensive annotated videos, and evaluation protocols for cross-modal inconsistency detection and reasoning.
Findings
State-of-the-art models show significant limitations in temporal grounding and reasoning.
Fine-tuned AVID-Qwen outperforms base models in segment reasoning and temporal grounding.
AVID serves as an effective testbed for improving trustworthy omni-modal AI systems.
Abstract
We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. While omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, they struggle to perceive cross-modal conflicts, a fundamental human capability that is critical for trustworthy AI. Existing benchmarks predominantly focus on aligned events or deepfake detection, leaving a significant gap in evaluating inconsistency perception in long-form video contexts. AVID addresses this with: (1) a scalable construction pipeline comprising temporal segmentation that classifies video content into Active Speaker, Voiceover, and Scenic categories; an agent-driven strategy planner that selects semantically appropriate inconsistency categories; and five specialized injectors for diverse audio-visual conflict injection; (2) 11.2K long videos (avg.…
