Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
Zhenxin Qin, Qiang Li, Qingzhuo Wang, Ruiyang Qin, Zhihua Wei, Wen Shen

TL;DR
This paper introduces a relation-aware visual enhancement framework to reduce action-relation hallucinations in LVLMs by focusing attention on key action-relevant image regions, improving task accuracy.
Contribution
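It proposes the Action-Relation Sensitivity (ARS) score to identify the attention heads most sensitive to action-relation changes, and the Relation-aware Visual Enhancement (RVE) method to strengthen the model's attention on action-relevant image regions, targeting relation hallucinations that are more complex than object hallucinations.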
Findings
Significantly reduces action-relation hallucinations in LVLMs.
Improves generalization to spatial and object hallucinations.
Achieves superior performance with negligible inference cost.
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions…
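The abstract's pipeline (score attention heads for sensitivity, use the sensitive heads to localize a region, then boost attention on that region) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract is truncated before ARS is defined, so the total-variation distance used here is only a stand-in for a sensitivity score. The function names (`head_sensitivity`, `locate_region`, `enhance_attention`), the parameter `alpha`, and the random attention maps are all hypothetical.

```python
import numpy as np

def head_sensitivity(attn_a, attn_b):
    """Stand-in sensitivity score (NOT the paper's ARS definition).

    attn_a, attn_b: arrays of shape (heads, image_tokens); each row is one
    head's attention distribution over image tokens for two inputs that
    differ only in the action relation.
    Returns the total-variation distance per head, shape (heads,).
    """
    return 0.5 * np.abs(attn_a - attn_b).sum(axis=1)

def locate_region(attn, head_scores, top_heads=2, top_tokens=3):
    """Average the most sensitive heads and mark their top image tokens."""
    heads = np.argsort(head_scores)[-top_heads:]
    saliency = attn[heads].mean(axis=0)
    mask = np.zeros(attn.shape[1], dtype=bool)
    mask[np.argsort(saliency)[-top_tokens:]] = True
    return mask

def enhance_attention(attn, region_mask, alpha=2.0):
    """Scale attention on masked tokens by alpha, then renormalize each row."""
    boosted = attn * np.where(region_mask, alpha, 1.0)
    return boosted / boosted.sum(axis=-1, keepdims=True)

# Toy data: 4 heads attending over 8 image tokens (assumed shapes).
rng = np.random.default_rng(0)
clean = rng.dirichlet(np.ones(8), size=4)
perturbed = clean.copy()
perturbed[1] = rng.dirichlet(np.ones(8))  # only head 1 reacts to the change

scores = head_sensitivity(clean, perturbed)
mask = locate_region(clean, scores, top_heads=1, top_tokens=3)
enhanced = enhance_attention(clean, mask, alpha=2.0)
```

In this toy setup only head 1 changes between the two inputs, so it receives the highest score and its top three tokens define the region. Multiplying by `alpha > 1` and renormalizing shifts probability mass toward that region while keeping each row a valid distribution.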
