Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models

Bei Yan; Jie Zhang; Zheng Yuan; Shiguang Shan; Xilin Chen

arXiv:2406.17115 · cs.CV · February 26, 2026 · 2 citations

TL;DR

This paper critically evaluates existing hallucination benchmarks for large vision-language models, introduces a new quality measurement framework, and proposes a high-quality benchmark to improve reliability and validity in hallucination assessment.

Contribution

It introduces HQM, a framework for assessing hallucination benchmark quality, and HQH, a new high-quality benchmark, addressing evaluation inconsistencies and exposing issues in current methods.

Findings

01 Some existing benchmarks produce inconsistent evaluation results across repeated tests.

02 Some existing benchmarks fail to align with human evaluation.

03 The proposed HQH benchmark demonstrates superior reliability and validity.
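The two quality dimensions named above, consistency across repeated tests and agreement with human judgment, can be illustrated with a small sketch. It assumes each is summarized as a Pearson correlation over per-model scores. The specific indicators HQM uses are not given on this page, so the choice of Pearson correlation, the function name `pearson`, and all numbers below are hypothetical and not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical hallucination rates for five models (illustrative only).
run_a = [0.10, 0.20, 0.30, 0.40, 0.50]  # benchmark, first run
run_b = [0.12, 0.18, 0.33, 0.41, 0.48]  # same benchmark, repeated run
human = [0.15, 0.22, 0.28, 0.45, 0.52]  # human-annotated rates

# Reliability (assumed form): do repeated runs score the models alike?
test_retest = pearson(run_a, run_b)

# Validity (assumed form): do benchmark scores track human evaluation?
criterion_validity = pearson(run_a, human)

print(f"test-retest reliability: {test_retest:.3f}")
print(f"criterion validity:      {criterion_validity:.3f}")
```

Under this reading, a benchmark whose repeated runs correlate weakly would be called unreliable, and one whose scores correlate weakly with human ratings would be called invalid.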

Abstract

Despite the outstanding performance in multimodal tasks, Large Vision-Language Models (LVLMs) have been plagued by the issue of hallucination, i.e., generating content that is inconsistent with the corresponding visual inputs. While previous works have proposed various benchmarks to evaluate this issue, the quality of these evaluations remains unverified. We observe that some of these benchmarks may produce inconsistent evaluation results across repeated tests or fail to align with human evaluation. To address this, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages specific indicators to assess both reliability and validity. Our empirical analysis using HQM reveals and pinpoints potential evaluation issues in existing benchmarks, exposing a critical gap in current hallucination evaluation. To bridge this gap, we propose HQH, a High-Quality…

Code & Models

Repositories

hqhbench/hqhbench (Official)

Taxonomy

Topics: Brain Tumor Detection and Classification · Cell Image Analysis Techniques · Epilepsy research and treatment