TL;DR
This paper introduces SAVER, a style-aware method that reduces hallucinations in vision-language models when processing stylized images, improving their reliability in critical applications.
Contribution
The paper presents a novel style-aware early revision mechanism that uses visual attention feedback to mitigate hallucinations in large vision-language models (LVLMs), particularly on stylized images.
Findings
Stylized images cause more hallucinations than photographic images.
SAVER significantly reduces hallucinations across multiple models and datasets.
The method improves the reliability of LVLMs in real-world scenarios.
Abstract
Large Vision-Language Models (LVLMs) have recently achieved significant breakthroughs in understanding complex visual-textual contexts. However, hallucination issues still limit their real-world applicability. Although previous mitigation methods effectively reduce hallucinations in photographic images, they largely overlook the potential risks posed by stylized images, which play crucial roles in critical scenarios such as game scene understanding, art education, and medical analysis. In this work, we first construct a dataset comprising photographic images and their corresponding stylized versions with carefully annotated caption labels. We then conduct head-to-head comparisons on both discriminative and generative tasks by benchmarking 13 advanced LVLMs on the collected datasets. Our findings reveal that stylized images tend to induce significantly more hallucinations than their…