Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Zhe Huang; Hao Wen; Aiming Hao; Bingze Song; Meiqi Wu; Jiahong Wu; Xiangxiang Chu; Sheng Lu; Haoqian Wang

arXiv:2512.24271·cs.CV·January 1, 2026

Open Access

TL;DR

This paper introduces DualityForge, a diffusion-based video editing framework that generates counterfactual videos and QA pairs to reduce hallucinations in Multimodal Large Language Models, improving their robustness and accuracy.

Contribution

The paper presents a novel counterfactual data synthesis method and a specialized training regime to significantly decrease hallucinations in MLLMs during video understanding tasks.

Findings

01. 24.0% reduction in hallucinations on counterfactual videos

02. Improved performance on hallucination and general benchmarks

03. Effective contrastive training with synthesized data

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning