Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding

Pinxue Guo; Chongruo Wu; Xinyu Zhou; Lingyi Hong; Zhaoyu Chen; Jinglun Li; Kaixun Jiang; Sen-ching Samson Cheung; Wei Zhang; Wenqiang Zhang

arXiv:2511.12140 · cs.CL · November 18, 2025

TL;DR

This paper introduces VBackChecker, a novel reference-free framework for detecting hallucinations in multimodal large language models by verifying visual-text consistency, and establishes a new benchmark R^2-HalBench with high-quality annotations.

Contribution

The paper presents VBackChecker, a new interpretability-driven, reference-free hallucination detection method for MLLMs, and introduces R^2-HalBench, a comprehensive benchmark with rich-context annotations.

Findings

01

VBackChecker outperforms prior methods in hallucination detection accuracy.

02

VBackChecker achieves over a 10% improvement in pixel-level grounding tasks.

03

R^2-HalBench provides diverse, high-quality annotations for real-world scenarios.

Abstract

Multimodal Large Language Models (MLLMs) have unlocked powerful cross-modal capabilities, but still significantly suffer from hallucinations. As such, accurate detection of hallucinations in MLLMs is imperative for ensuring their reliability in practical applications. To this end, guided by the principle of "Seeing is Believing", we introduce VBackChecker, a novel reference-free hallucination detection framework that verifies the consistency of MLLM-generated responses with visual inputs, by leveraging a pixel-level Grounding LLM equipped with reasoning and referring segmentation capabilities. This reference-free framework not only effectively handles rich-context scenarios, but also offers interpretability. To facilitate this, an innovative pipeline is accordingly designed for generating instruction-tuning data (R-Instruct), featuring rich-context descriptions, grounding masks, and hard…
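The backward-grounding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Grounder` signature, the `check_claims` helper, the `min_area` threshold, and the toy grounder below are all hypothetical names introduced here, and the decision rule (a claim counts as hallucinated when the grounder returns no mask or an empty one) is an assumption about how a "seeing is believing" check might be wired up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

# Binary mask: 1 marks a pixel that belongs to the grounded region.
Mask = List[List[int]]

# Hypothetical grounder interface (an assumption, not the paper's API):
# given an image and a textual claim, return a mask, or None if the
# claim cannot be grounded in the image.
Grounder = Callable[[object, str], Optional[Mask]]


@dataclass
class ClaimVerdict:
    claim: str
    hallucinated: bool
    mask_area: int  # number of pixels in the returned mask (0 if none)


def check_claims(
    image: object,
    claims: Sequence[str],
    grounder: Grounder,
    min_area: int = 1,  # assumed threshold below which a claim is rejected
) -> List[ClaimVerdict]:
    """Flag each claim as hallucinated when it cannot be grounded."""
    verdicts = []
    for claim in claims:
        mask = grounder(image, claim)
        area = sum(sum(row) for row in mask) if mask is not None else 0
        verdicts.append(ClaimVerdict(claim, area < min_area, area))
    return verdicts


def toy_grounder(image: Dict[str, Mask], claim: str) -> Optional[Mask]:
    """Stand-in grounder: the 'image' is a dict of object name -> mask."""
    for name, mask in image.items():
        if name in claim:
            return mask
    return None


toy_image = {"dog": [[0, 1], [1, 1]]}
verdicts = check_claims(
    toy_image, ["a dog on the grass", "a red frisbee"], toy_grounder
)
```

In this toy setup the returned mask doubles as the explanation for each verdict, which is the sense in which such a check is interpretable; a real system would replace `toy_grounder` with a learned pixel-level grounding model.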

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Emotion and Mood Recognition