Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?
Gregor Geigle, Radu Timofte, Goran Glavaš

TL;DR
This paper systematically evaluates whether object grounding reduces hallucination in large vision-language models, finding that grounding objectives have minimal impact on hallucination in open captioning tasks.
Contribution
It provides the first comprehensive analysis of grounding's effect on hallucination using realistic open-ended evaluation protocols.
Findings
In open captioning, grounding objectives have little to no effect on hallucination.
Previous claims that grounding reduces hallucination rest on flawed evaluation protocols: they rely on MSCOCO, which has been extensively used in LVLM training, and they measure hallucination via question answering.
Abstract
Large vision-language models (LVLMs) have recently dramatically pushed the state of the art in image captioning and many image understanding tasks (e.g., visual question answering). LVLMs, however, often hallucinate and produce captions that mention concepts that cannot be found in the image. These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption. Recent work suggests that addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination. Although intuitive, this claim is not empirically justified as the reduction effects have been established, we argue, with flawed evaluation protocols that (i) rely on data (i.e., MSCOCO) that has been extensively used in LVLM training and (ii) measure hallucination via question answering…
Taxonomy
Topics: Psychosomatic Disorders and Their Treatments
Methods: ALIGN
