Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens
Meng Shen, Minghao Wu, Deepu Rajan

TL;DR
This paper addresses object hallucination in LVLMs by analyzing token influence and proposing training adjustments and data filtering to reduce hallucinations without extra inference costs.
Contribution
It introduces a novel approach to mitigate hallucinations by emphasizing image-negative tokens and filtering hallucination-prone data during training.
Findings
Reduces object hallucination in LVLMs without affecting response length.
Is effective across multiple LVLM variants.
Introduces no additional inference cost.
Abstract
Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination.…
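As a rough illustration of the idea in the abstract, the sketch below labels each generated token by how much the image changes its log-probability, then computes a token-weighted negative log-likelihood. The abstract does not state the exact criterion or weighting scheme, so everything here is an assumption made for illustration: the comparison of log-probabilities with and without the image, the `threshold` value, the function names `categorize_tokens` and `weighted_nll`, and the example weights. It is not the authors' implementation.

```python
def categorize_tokens(logp_with_image, logp_without_image, threshold=0.1):
    """Label tokens by the change in log-probability when the image is present.

    Assumed criterion (not taken from the paper): a token is "positive" if the
    image raises its log-probability by more than `threshold`, "negative" if
    the image lowers it by more than `threshold`, and "invariant" otherwise.
    """
    labels = []
    for lp_img, lp_txt in zip(logp_with_image, logp_without_image):
        delta = lp_img - lp_txt
        if delta > threshold:
            labels.append("positive")
        elif delta < -threshold:
            labels.append("negative")
        else:
            labels.append("invariant")
    return labels


def weighted_nll(logp_with_image, labels, weights):
    """Weighted average of per-token negative log-likelihoods.

    `weights` maps each label to a hypothetical training weight; the result is
    normalized by the sum of the weights that were applied.
    """
    total = sum(weights[label] * -lp for lp, label in zip(logp_with_image, labels))
    norm = sum(weights[label] for label in labels)
    return total / norm
```

With weights such as `{"positive": 1.0, "invariant": 1.0, "negative": 2.0}`, tokens labeled negative contribute more to the loss, which is one plausible reading of "emphasizing image-negative tokens". Because the adjustment happens only in the training loss, a scheme of this kind adds no work at inference time.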
