HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
Zhecan Wang, Garrett Bingham, Adams Yu, Quoc Le, Thang Luong, Golnaz Ghiasi

TL;DR
HaloQuest is a new multimodal hallucination dataset that uses synthetic and real images to evaluate and improve vision-language models, revealing current models' struggles and proposing new evaluation methods.
Contribution
Introduces HaloQuest, a large-scale dataset with synthetic images for benchmarking and fine-tuning VLMs to reduce hallucination in multimodal reasoning.
Findings
Current VLMs achieve below 36% accuracy on HaloQuest.
Fine-tuning on HaloQuest reduces hallucination without harming standard reasoning.
Generated images correlate highly (r=0.97) with real images in benchmarking.
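To make the r=0.97 figure concrete, here is a minimal sketch of how a correlation between model scores on real and generated images could be computed. It assumes the statistic is a Pearson correlation over per-model accuracies, which the summary does not state. The function name `pearson_r` and all accuracy values are hypothetical placeholders, not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model accuracies, one entry per model.
# These are illustrative numbers only, NOT values reported for HaloQuest.
real_acc = [0.10, 0.20, 0.30, 0.40]
synthetic_acc = [0.12, 0.19, 0.33, 0.41]

r = pearson_r(real_acc, synthetic_acc)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would mean that models ranking well on the generated-image subset also rank well on the real-image subset, which is the sense in which generated images can stand in for real ones in benchmarking.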
Abstract
Hallucination has been a major problem for large language models and remains a critical challenge when it comes to multimodality in which vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, apart from real ones, to enable dataset creation at scale. With over 7.7K examples spanning across a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our…
Taxonomy
Topics: Data Visualization and Analytics
