Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Shu-Tao Xia

TL;DR
This paper introduces a novel decoding strategy for LVLMs that reduces hallucinations by adaptively maximizing the mutual dependency between generated responses and input images, leading to more accurate and relevant outputs.
Contribution
It proposes a Conditional Pointwise Mutual Information (C-PMI) calibrated decoding method that jointly models the contributions of visual and textual tokens, formulating hallucination mitigation as a bi-level optimization problem.
Findings
Significantly reduces hallucinations in LVLMs.
Maintains decoding efficiency while improving relevance.
Effective across various benchmark datasets.
Abstract
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification…
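To make the C-PMI idea concrete, here is a minimal sketch of how a conditional pointwise mutual information term could calibrate next-token scores. It uses the standard definition, C-PMI(y; v | x) = log p(y | v, x) - log p(y | x), where v is the image, x is the text context, and y is a candidate token. The function names, the additive weighting `lam`, and the toy logits are assumptions for illustration only. The sketch covers text-token scoring and does not reproduce the paper's bi-level optimization or its treatment of visual tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cpmi_scores(logits_with_image, logits_text_only):
    """Per-token C-PMI: log p(y | v, x) - log p(y | x).

    Positive values mark tokens that the image makes more likely than
    the language prior alone; negative values mark prior-driven tokens.
    """
    lp_cond = log_softmax(logits_with_image)
    lp_prior = log_softmax(logits_text_only)
    return [a - b for a, b in zip(lp_cond, lp_prior)]


def calibrated_next_token(logits_with_image, logits_text_only, lam=1.0):
    """Greedy pick of argmax over log p(y | v, x) + lam * C-PMI.

    `lam` is a hypothetical weight (an assumption of this sketch);
    lam = 0 recovers ordinary greedy decoding.
    """
    lp_cond = log_softmax(logits_with_image)
    pmi = cpmi_scores(logits_with_image, logits_text_only)
    scores = [lp + lam * s for lp, s in zip(lp_cond, pmi)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens. Token 0 is favored by the language
# prior; token 1 is the one the image actually supports.
with_image = [2.0, 1.9, 0.0]
text_only = [3.0, 0.5, 0.0]

plain_choice = calibrated_next_token(with_image, text_only, lam=0.0)
calibrated_choice = calibrated_next_token(with_image, text_only, lam=1.0)
```

In this toy case, plain greedy decoding selects the prior-favored token 0, while the calibrated score selects token 1, whose probability rises most when the image is present. In practice the two logit vectors would come from two forward passes of the same model, one with and one without the visual input.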
Taxonomy
Topics: Psychiatry, Mental Health, Neuroscience · EEG and Brain-Computer Interfaces · Topological and Geometric Data Analysis
