When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
Cui Yakun, Xingqun Qi, TianTian Geng, Yuyao Zhang, Sirui Han, Yike Guo

TL;DR
This paper introduces a new benchmark and a mitigation framework for addressing hallucinations in vision-language models caused by conflicting on-screen text, improving multimodal video understanding accuracy.
Contribution
It presents VisualTextTrap, a large-scale benchmark for evaluating text-induced hallucinations, and VTHM-MoE, a novel disentanglement framework to mitigate this issue.
Findings
VTHM-MoE outperforms existing methods on the VisualTextTrap benchmark.
The benchmark contains 6,057 samples with detailed annotations and hallucination intensity levels.
The proposed framework effectively reduces hallucinations while maintaining performance on clean videos.
Abstract
Recent advances in Vision-Language Models (VLMs) have substantially enhanced their performance across multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, we identify a critical yet overlooked issue: when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing the semantics of the overlaid text over the actual visual content. We define this phenomenon as Text Overlay-Induced Hallucination (TOIH). In this work, we propose VisualTextTrap, the first comprehensive benchmark for this phenomenon, comprising large-scale human-validated samples and specifically designed evaluation metrics. In particular, we construct VisualTextTrap from widely used public datasets using a scalable hybrid pipeline of VLM-assisted text generation and rigorous manual verification. The benchmark features 6,057 samples annotated across…
