Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?
Mingqian Feng, Yunlong Tang, Zeliang Zhang, Chenliang Xu

TL;DR
This paper introduces a new decoding method and evaluation metrics for LVLM-based image captioning, showing that more detailed descriptions do not necessarily lead to increased hallucinations, contrary to prior beliefs.
Contribution
The study proposes Differentiated Beam Decoding (DBD), a decoding strategy, together with three CLIP-based metrics (CLIP-Precision, CLIP-Recall, and CLIP-F1) that give a more reliable evaluation of hallucination levels in detailed image captions.
Findings
DBD reduces object hallucinations in detailed captions.
The proposed CLIP-based metrics correlate better with actual caption accuracy than existing metrics.
Extensive experiments validate the effectiveness of the proposed approach.
Abstract
Large Vision-Language Models (LVLMs) excel in integrating visual and linguistic contexts to produce detailed content, facilitating applications such as image captioning. However, using LVLMs to generate descriptions often faces the challenge of object hallucination (OH), where the output text misrepresents actual objects in the input image. While previous studies attribute the occurrence of OH to the inclusion of more details, our study finds technical flaws in existing metrics, leading to unreliable evaluations of models and conclusions about OH. This has sparked a debate on the question: Do more details always introduce more hallucinations in LVLM-based image captioning? In this paper, we address this debate by proposing a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation metrics: CLIP-Precision, CLIP-Recall, and CLIP-F1. DBD…
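The abstract names CLIP-Precision, CLIP-Recall, and CLIP-F1 but is cut off before defining them. As a rough illustration only, the sketch below shows one generic way to compute embedding-based precision, recall, and F1 by soft matching caption units against image units. It is not the paper's definition: the function names `cosine` and `soft_prf`, the max-similarity matching rule, and the use of plain lists as stand-ins for CLIP embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def soft_prf(caption_vecs, image_vecs):
    """Hypothetical soft precision/recall/F1 over embedding sets.

    caption_vecs: embeddings of units extracted from the caption (assumed input).
    image_vecs:   embeddings of units extracted from the image (assumed input).
    Matching rule (an assumption, not the paper's): each unit is scored by its
    best cosine similarity to any unit on the other side.
    """
    if not caption_vecs or not image_vecs:
        return 0.0, 0.0, 0.0
    sims = [[cosine(c, i) for i in image_vecs] for c in caption_vecs]
    # Precision: how well each caption unit is supported by the image.
    precision = sum(max(row) for row in sims) / len(sims)
    # Recall: how well each image unit is covered by the caption.
    recall = sum(
        max(sims[r][c] for r in range(len(sims))) for c in range(len(image_vecs))
    ) / len(image_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy usage: two caption units, one of which has no support in the image.
p, r, f = soft_prf([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

In this toy case the unsupported caption unit lowers precision while recall stays high, which is the kind of separation between hallucinated content and coverage that a precision/recall pair is meant to expose.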
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Image Processing Techniques and Applications
Methods: Sparse Evolutionary Training
