VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Zongxia Li; Xiyang Wu; Guangyao Shi; Yubin Qin; Hongyang Du; Fuxiao Liu; Tianyi Zhou; Dinesh Manocha; Jordan Lee Boyd-Graber

arXiv:2505.01481 · cs.CV · October 28, 2025

TL;DR

This paper introduces VideoHallu, a synthetic dataset designed to evaluate and improve the ability of vision-language models to understand physics and common sense in videos, revealing their limitations and enhancing their reasoning through fine-tuning.

Contribution

The paper presents VideoHallu, a novel synthetic dataset with physics and commonsense violations, and demonstrates how fine-tuning on it improves models' reasoning without sacrificing benchmark performance.

Findings

01. Leading VLMs often miss violations in VideoHallu, indicating gaps in visual reasoning.

02. Fine-tuning on VideoHallu enhances models' detection of physical and logical violations.

03. Models maintain performance on standard benchmarks after fine-tuning.

Abstract

Vision-Language Models (VLMs) have achieved strong results in video understanding, yet a key question remains: do they truly comprehend visual content or only learn shallow correlations between vision and language? Real visual understanding, especially of physics and common sense, is essential for AI systems that interact with the physical world. Current evaluations mostly use real-world videos similar to training data, so high benchmark scores may not reflect real reasoning ability. To address this, we propose negative-control tests using videos that depict physically impossible or logically inconsistent events. We introduce VideoHallu, a synthetic dataset of physics- and commonsense-violating scenes generated with Veo2, Sora, and Kling. It includes expert-annotated question-answer pairs across four categories of violations. Tests of leading VLMs (Qwen-2.5-VL, Video-R1, VideoChat-R1)…
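As a rough illustration of how a negative-control evaluation over categorized question-answer pairs could be scored, the sketch below computes per-category accuracy. It is a minimal example under stated assumptions, not the paper's evaluation code: the record fields (`category`, `answer`, `prediction`), the placeholder category labels, and the use of case-insensitive exact match as the scoring rule are all hypothetical.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return accuracy per violation category.

    `records` is an iterable of dicts with the hypothetical keys
    'category', 'answer' (reference), and 'prediction' (model output).
    Scoring is case-insensitive exact match, which is an assumption
    made for this sketch only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        reference = record["answer"].strip().lower()
        predicted = record["prediction"].strip().lower()
        if predicted == reference:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Made-up records; the category names are placeholders, not the dataset's.
records = [
    {"category": "category_a", "answer": "no", "prediction": "No"},
    {"category": "category_a", "answer": "no", "prediction": "yes"},
    {"category": "category_b", "answer": "the ball floats upward",
     "prediction": " The ball floats upward "},
    {"category": "category_b", "answer": "no", "prediction": "no"},
]

scores = per_category_accuracy(records)
for category, accuracy in sorted(scores.items()):
    print(f"{category}: {accuracy:.2f}")
```

Reporting accuracy per category rather than as a single number makes it easier to see which kinds of violation a model tends to miss. Free-form answers would likely need a softer scoring rule than exact match.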

Code & Models

Repositories

zli12321/videohallu
PyTorch · Official

Models

IntelligenceLab/RewardPreferenceBert
model · 57 downloads · 3 likes

Datasets

IntelligenceLab/VideoHallu
dataset · 3.1k downloads

Videos

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need