Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models

Jihoon Lee; Min Song

arXiv:2505.20569·cs.CV·May 30, 2025

TL;DR

This paper introduces RVCD (Retrieval Visual Contrastive Decoding), a training-free decoding method that reduces object hallucinations in large vision-language models by contrasting the model's logits with those obtained from retrieved negative and positive images.

Contribution

RVCD is the first approach to explicitly incorporate retrieval of positive and negative images at the logit level to mitigate object hallucinations without additional training.

Findings

01 Substantial reduction in object hallucinations compared with existing decoding-based methods

02 Effective use of retrieval-based contrastive decoding at the logit level

03 Improved alignment between generated text and true object concepts

Abstract

Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.
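The logit-level idea in the abstract can be illustrated with a minimal sketch. The abstract does not give RVCD's exact formula, so the combination rule below (amplify the original logits, subtract the mean logits from negative images, add the mean logits from positive images) and the weights `alpha` and `beta` are assumptions chosen for illustration, as are the function names and the toy three-token vocabulary. In practice each logit vector would come from running the vision-language model on the same prompt with a different image.

```python
from typing import List, Sequence


def mean_logits(rows: Sequence[Sequence[float]], size: int) -> List[float]:
    """Element-wise mean of several logit vectors; zeros if none are given."""
    if not rows:
        return [0.0] * size
    return [sum(col) / len(rows) for col in zip(*rows)]


def contrastive_logits(
    original: Sequence[float],
    negatives: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    alpha: float = 1.0,  # hypothetical weight on the negative-image term
    beta: float = 0.5,   # hypothetical weight on the positive-image term
) -> List[float]:
    """Assumed rule: (1 + alpha) * original - alpha * mean(neg) + beta * mean(pos)."""
    neg = mean_logits(negatives, len(original))
    pos = mean_logits(positives, len(original))
    return [
        (1 + alpha) * o - alpha * n + beta * p
        for o, n, p in zip(original, neg, pos)
    ]


def greedy_token(logits: Sequence[float]) -> int:
    """Index of the highest logit (greedy decoding for one step)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy next-token logits over a three-word vocabulary (made-up numbers).
vocab = ["dog", "cat", "frisbee"]
original = [2.0, 2.2, 0.5]     # the model slightly prefers the hallucinated "cat"
negatives = [[0.1, 3.0, 0.0]]  # logits given a single-concept "cat" image
positives = [[3.0, 0.2, 0.1]]  # logits given a single-concept "dog" image

adjusted = contrastive_logits(original, negatives, positives)
print(vocab[greedy_token(original)], "->", vocab[greedy_token(adjusted)])
```

With these toy numbers the greedy choice moves from "cat" to "dog": the negative image raises the "cat" logit on its own, so subtracting it penalizes that token, while the positive image reinforces "dog".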

Code & Models

Repositories

jihoonlee9898/rvcd
jax · Official

Videos

Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis