Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models
Jihoon Lee, Min Song

TL;DR
This paper introduces RVCD, a training-free decoding method that reduces object hallucinations in large vision-language models by contrasting, at the logit level, the model's output against outputs conditioned on retrieved positive and negative images.
Contribution
RVCD is the first approach to explicitly incorporate retrieval of positive and negative images at the logit level to mitigate object hallucinations without additional training.
Findings
Substantial reduction in object hallucinations compared to existing decoding-based methods
Effective use of retrieval-based contrastive decoding at the logit level
Improved alignment between generated text and the objects actually present in the image
Abstract
Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.
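The abstract does not state the exact combination rule, so the following is only a minimal sketch of what logit-level contrastive decoding with negative and positive reference images could look like. The function names, the weights `alpha` and `beta`, the averaging over reference images, and the toy three-token vocabulary are all assumptions made for illustration, not the authors' formulation.

```python
from typing import List, Sequence


def mean_logits(rows: Sequence[Sequence[float]], size: int) -> List[float]:
    """Average several logit vectors; an empty set contributes zeros."""
    if not rows:
        return [0.0] * size
    return [sum(row[i] for row in rows) / len(rows) for i in range(size)]


def contrastive_logits(
    original: Sequence[float],
    negatives: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    alpha: float = 1.0,  # hypothetical weight on the negative-image term
    beta: float = 0.1,   # hypothetical weight on the positive-image term
) -> List[float]:
    """Assumed rule: (1 + alpha) * original - alpha * mean(neg) + beta * mean(pos)."""
    size = len(original)
    neg = mean_logits(negatives, size)
    pos = mean_logits(positives, size)
    return [
        (1.0 + alpha) * o - alpha * n + beta * p
        for o, n, p in zip(original, neg, pos)
    ]


def greedy_token(logits: Sequence[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary: index 0 = "cat", 1 = "dog", 2 = "table".
# The input image shows a cat, but the raw logits slightly favor "dog".
original = [2.0, 2.2, 1.0]
# Logits when the model is conditioned on a retrieved single-concept "dog" image.
negative_refs = [[0.5, 3.0, 0.2]]
# Logits when the model is conditioned on a retrieved single-concept "cat" image.
positive_refs = [[3.0, 0.3, 0.1]]

adjusted = contrastive_logits(original, negative_refs, positive_refs)
before = greedy_token(original)
after = greedy_token(adjusted)
```

In this toy case the raw logits pick "dog" while the adjusted logits pick "cat", because the token that the negative reference image strongly supports is pushed down. How reference images are retrieved and how the terms are weighted in RVCD itself is described in the paper, not here.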
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis
