Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang

TL;DR
This paper introduces DualityForge, a diffusion-based video editing framework that generates counterfactual videos and QA pairs to reduce hallucinations in Multimodal Large Language Models, improving their robustness and accuracy.
Contribution
The paper presents a novel counterfactual data synthesis method and a specialized training regime to significantly decrease hallucinations in MLLMs during video understanding tasks.
Findings
24.0% reduction in hallucinations on counterfactual videos
Improved performance on hallucination and general benchmarks
Effective contrastive training with synthesized data
Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning
