CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

Kesheng Chen; Yamin Hu; Qi Zhou; Zhenqian Zhu; Wenjian Luo

arXiv:2603.27982 · cs.CV · April 2, 2026


TL;DR

CDH-Bench is a benchmark that evaluates whether vision-language models ignore visual evidence in favor of commonsense, exposing their tendency to hallucinate from prior knowledge.

Contribution

The paper introduces CDH-Bench, a novel benchmark with explicit visual-commonsense conflicts, to systematically assess models' reliance on visual evidence versus commonsense.

Findings

01. Models remain vulnerable to prior-driven normalization under conflicts.

02. Even strong models often override visual evidence with commonsense.

03. CDH-Bench enables controlled diagnostics of visual fidelity in VLMs.

Abstract

Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon commonsense-driven hallucination (CDH). To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence–commonsense conflicts. CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. We evaluate frontier VLMs under binary Question Answering (QA) and multiple-choice QA, and report metrics including Counterfactual Accuracy (CF-Acc),…
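The abstract names Counterfactual Accuracy (CF-Acc) but is cut off before defining it. As a rough illustration only, the sketch below assumes CF-Acc is the fraction of conflict items where the model's answer matches what the image shows. The `Item` record, its field names, and the `commonsense_rate` quantity are hypothetical and introduced here for illustration; they are not taken from the paper or its dataset.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Item:
    """One visual-evidence vs. commonsense conflict item (hypothetical schema)."""
    category: str            # e.g. "counting", "relational", "attribute"
    visual_answer: str       # answer supported by the image
    commonsense_answer: str  # answer a commonsense prior would suggest
    model_answer: str        # what the model actually output


def _norm(text: str) -> str:
    # Compare answers case-insensitively and ignore surrounding whitespace.
    return text.strip().lower()


def score(items: Iterable[Item]) -> Dict[str, float]:
    """Return assumed CF-Acc plus an illustrative commonsense-override rate."""
    items = list(items)
    if not items:
        raise ValueError("need at least one item to score")
    n = len(items)
    # Assumption: CF-Acc counts answers that follow the visual evidence.
    faithful = sum(_norm(i.model_answer) == _norm(i.visual_answer) for i in items)
    # Illustrative only: answers that fall back to the commonsense alternative.
    overridden = sum(
        _norm(i.model_answer) == _norm(i.commonsense_answer) for i in items
    )
    return {"cf_acc": faithful / n, "commonsense_rate": overridden / n}
```

Tracking the two counts separately distinguishes a model that answers with the commonsense alternative from one that is simply wrong in some unrelated way, which is the distinction the abstract's definition of CDH turns on.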


Code & Models

Datasets

cks19999/CDH-Bench
dataset · 51 dl
