HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan; Fuxiao Liu; Xiyang Wu; Ruiqi Xian; Zongxia Li; Xiaoyu Liu; Xijun Wang; Lichang Chen; Furong Huang; Yaser Yacoob; Dinesh Manocha; Tianyi Zhou

arXiv:2310.14566 · cs.CV · March 26, 2024 · 5 cites

TL;DR

HallusionBench is a new benchmark for evaluating visual-language models' reasoning, revealing significant hallucination and illusion issues, and providing insights for future improvements in model robustness.

Contribution

The paper introduces HallusionBench, a comprehensive diagnostic suite with a novel question structure to analyze hallucination and illusion in large vision-language models.

Findings

01

GPT-4V achieves 31.42% question-pair accuracy on HallusionBench, the best result among the 15 models evaluated

02

All other evaluated models score below 16% accuracy

03

Identifies key failure modes, including language hallucination and visual illusion

Abstract

We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve…
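The abstract reports a "question-pair accuracy" but this excerpt does not define it. The sketch below shows one plausible reading, under the assumption that a pair (a control group of related questions) counts as solved only when every question in it is answered correctly. The record layout, the function names, and the toy data are all hypothetical and are not taken from the benchmark's own code.

```python
from collections import defaultdict


def per_question_accuracy(records):
    """Fraction of individual questions answered correctly.

    `records` is a list of (pair_id, is_correct) tuples; this layout is an
    assumption made for illustration only.
    """
    if not records:
        return 0.0
    return sum(1 for _, ok in records if ok) / len(records)


def question_pair_accuracy(records):
    """Fraction of pairs in which every question was answered correctly.

    Assumed definition: a pair is solved only if all of its questions are
    correct, so a model that guesses the same answer everywhere is penalized.
    """
    groups = defaultdict(list)
    for pair_id, ok in records:
        groups[pair_id].append(bool(ok))
    if not groups:
        return 0.0
    solved = sum(1 for answers in groups.values() if all(answers))
    return solved / len(groups)


# Toy data (invented): four pairs of two questions each.
toy_records = [
    ("pair-1", True), ("pair-1", True),    # both correct: pair solved
    ("pair-2", True), ("pair-2", False),   # one wrong: pair not solved
    ("pair-3", False), ("pair-3", False),  # both wrong: pair not solved
    ("pair-4", True), ("pair-4", True),    # both correct: pair solved
]

print(per_question_accuracy(toy_records))   # 5 of 8 questions correct
print(question_pair_accuracy(toy_records))  # 2 of 4 pairs solved
```

Under this reading, pair-level accuracy can never exceed per-question accuracy, which is one way a control-group structure exposes inconsistent answers that a per-question score would hide.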

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling