VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Fuxiao Liu, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber

TL;DR
This paper introduces VideoHallu, a synthetic dataset designed to evaluate and improve the ability of vision-language models to understand physics and common sense in videos, revealing their limitations and enhancing their reasoning through fine-tuning.
Contribution
The paper presents VideoHallu, a novel synthetic dataset with physics and commonsense violations, and demonstrates how fine-tuning on it improves models' reasoning without sacrificing benchmark performance.
Findings
Leading VLMs often miss violations in VideoHallu, indicating gaps in visual reasoning.
Fine-tuning on VideoHallu enhances models' detection of physical and logical violations.
Models maintain performance on standard benchmarks after fine-tuning.
Abstract
Vision-Language Models (VLMs) have achieved strong results in video understanding, yet a key question remains: do they truly comprehend visual content or only learn shallow correlations between vision and language? Real visual understanding, especially of physics and common sense, is essential for AI systems that interact with the physical world. Current evaluations mostly use real-world videos similar to training data, so high benchmark scores may not reflect real reasoning ability. To address this, we propose negative-control tests using videos that depict physically impossible or logically inconsistent events. We introduce VideoHallu, a synthetic dataset of physics- and commonsense-violating scenes generated with Veo2, Sora, and Kling. It includes expert-annotated question-answer pairs across four categories of violations. Tests of leading VLMs (Qwen-2.5-VL, Video-R1, VideoChat-R1)…
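The negative-control evaluation described in the abstract can be illustrated with a small scoring sketch: given question-answer pairs grouped by violation category, compute a model's accuracy per category. Everything here is an assumption made for illustration. The `QAItem` fields, the category labels `"physics"` and `"commonsense"`, the `predict` callable, and exact-match scoring are hypothetical and are not taken from the paper, which may use a different schema and scoring method.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class QAItem(NamedTuple):
    """One annotated question about a synthetic video (hypothetical schema)."""
    video_id: str
    category: str  # violation category label; names here are placeholders
    question: str
    answer: str    # reference answer from the annotators


def accuracy_by_category(
    items: Iterable[QAItem],
    predict: Callable[[str, str], str],
) -> Dict[str, float]:
    """Return the fraction of correctly answered questions per category.

    `predict(video_id, question)` stands in for a VLM call. Scoring is a
    case-insensitive exact match, which is an assumption of this sketch.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        total[item.category] += 1
        prediction = predict(item.video_id, item.question)
        if prediction.strip().lower() == item.answer.strip().lower():
            correct[item.category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy data: questions ask whether the depicted event is physically plausible.
items = [
    QAItem("v1", "physics", "Does the ball fall as gravity predicts?", "no"),
    QAItem("v2", "physics", "Does the object change shape mid-air?", "yes"),
    QAItem("v3", "commonsense", "Is the depicted sequence plausible?", "no"),
]


def always_yes(video_id: str, question: str) -> str:
    """A trivial baseline that ignores the video entirely."""
    return "yes"


scores = accuracy_by_category(items, always_yes)
print(scores)
```

A baseline that ignores the video, like `always_yes` above, gives a floor against which a model's per-category scores can be compared. A model that scores near that floor on impossible events may be relying on language priors instead of the visual content.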
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need
