Object Hallucination in Image Captioning
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, Kate Saenko

TL;DR
This paper introduces a new metric for evaluating image captioning models based on image relevance, revealing that high scores on standard metrics do not necessarily correlate with fewer object hallucinations, which are often driven by language priors.
Contribution
The work proposes a novel image relevance metric for better evaluation of captioning models and analyzes the causes of object hallucination across different architectures and learning objectives.
Findings
Models with top standard metric scores may still hallucinate objects.
Hallucination is often driven by language priors rather than by image misclassification.
The new relevance metric better captures true image content in captions.
Abstract
Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which…
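The excerpt above does not give the metric's formula, so the following is only a minimal sketch of how an object hallucination rate could be computed against veridical visual labels. It assumes that each caption has already been reduced to a list of mentioned object names and that each image comes with a ground-truth set of object labels. The function name `hallucination_rates`, the two rates it returns, and the sample data are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Set, Tuple

# Each sample pairs the objects mentioned in a generated caption with the
# ground-truth object labels of the image (hypothetical data layout).
Sample = Tuple[Sequence[str], Set[str]]


def hallucination_rates(samples: List[Sample]) -> Tuple[float, float]:
    """Return (per-mention rate, per-caption rate) of hallucinated objects.

    Assumed definitions for this sketch:
      per-mention rate = hallucinated object mentions / all object mentions
      per-caption rate = captions with >= 1 hallucinated object / all captions
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for mentioned, present in samples:
        # A mention counts as hallucinated if the object is not in the image labels.
        missing = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(missing)
        if missing:
            captions_with_hallucination += 1

    # Guard against empty inputs to avoid division by zero.
    per_mention = hallucinated_mentions / total_mentions if total_mentions else 0.0
    per_caption = captions_with_hallucination / len(samples) if samples else 0.0
    return per_mention, per_caption


# Made-up example data, for illustration only.
samples: List[Sample] = [
    (["dog", "frisbee"], {"dog", "frisbee", "person"}),  # no hallucination
    (["cat", "couch", "remote"], {"cat", "couch"}),      # "remote" is hallucinated
    (["bench"], {"person"}),                             # "bench" is hallucinated
]
instance_rate, sentence_rate = hallucination_rates(samples)
```

In this toy data, two of the six object mentions are absent from the image labels, and two of the three captions contain at least one such mention. Unlike a similarity score against reference captions, a rate of this kind depends only on what is actually in the image, which is the distinction the abstract draws.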
