TL;DR
This paper introduces Mixture of Decoding (MoD), an adaptive decoding strategy for large vision-language models (LVLMs) that reduces hallucinations by checking whether the model's attention on image tokens is correct and then choosing between a complementary and a contrastive decoding strategy accordingly.
Contribution
The paper presents a novel adaptive decoding method, MoD, that dynamically adjusts decoding strategies based on attention correctness to mitigate hallucinations in LVLMs.
Findings
MoD significantly reduces hallucinations in LVLMs.
MoD outperforms existing decoding methods on multiple benchmarks.
The approach effectively distinguishes correct from incorrect attention during decoding.
Abstract
Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, and uses this consistency to judge whether the attention is correct. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information.…
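The decision rule described in the abstract can be illustrated with a minimal sketch for a single decoding step. Everything specific here is an assumption made for illustration and is not taken from the paper: the consistency measure (Jensen-Shannon divergence), the `threshold` and `alpha` parameters, the function name `mixture_of_decoding_step`, and the exact way the two sets of logits are combined. The contrastive branch uses a form commonly seen in contrastive decoding; the complementary branch simply adds the attended-token logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def mixture_of_decoding_step(logits_full, logits_attended,
                             threshold=0.1, alpha=1.0):
    """Pick a decoding strategy for one step (hypothetical sketch).

    logits_full:     next-token logits given the original image tokens.
    logits_attended: next-token logits given only the attended image tokens.
    threshold, alpha: illustrative hyperparameters, not values from the paper.
    """
    p_full = softmax(logits_full)
    p_attended = softmax(logits_attended)

    # Assumption: low divergence between the two outputs means "consistent".
    consistent = js_divergence(p_full, p_attended) < threshold

    if consistent:
        # Complementary: reinforce what the attended tokens support.
        combined = [a + alpha * b
                    for a, b in zip(logits_full, logits_attended)]
    else:
        # Contrastive: push away from what the attended tokens suggest.
        combined = [(1 + alpha) * a - alpha * b
                    for a, b in zip(logits_full, logits_attended)]

    return softmax(combined), consistent


# Consistent case: both inputs favor token 0, so the complementary branch runs.
probs_c, flag_c = mixture_of_decoding_step([2.0, 0.5, 0.1], [2.1, 0.4, 0.1])

# Inconsistent case: attended tokens favor token 1, so the contrastive branch runs.
probs_i, flag_i = mixture_of_decoding_step([2.0, 1.8, 0.1], [0.0, 4.0, 0.0])
```

In a real LVLM the two logit vectors would come from two forward passes, one over all image tokens and one over the subset the model attends to most; how that subset is selected is not specified in the text above and is left out of the sketch.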
Taxonomy
Methods: Softmax · Attention Is All You Need
