TL;DR
R-CoV is a post-hoc, region-aware verification method that reduces object hallucinations in large vision-language models: the model itself localizes and describes the image region for each object it mentions, then checks its claims against those regions, with no external detector involved.
Contribution
The paper introduces R-CoV, a training-free, region-aware chain-of-verification approach that effectively mitigates object hallucinations in LVLMs.
Findings
R-CoV is reported to reduce object hallucinations across multiple benchmarks.
The method can be integrated into various LVLMs without retraining.
It does not rely on external detection models, simplifying deployment.
Abstract
Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and…
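The five steps named in the abstract can be sketched as a simple prompt chain. This is a minimal illustration, not the authors' implementation: the `LVLM` callable interface, the prompt wording, the `x1,y1,x2,y2` box format, the `EntityCheck` record, and the rule that an entity without a parseable box counts as unverified are all assumptions made for this example. The sixth step is cut off in the abstract above, so the sketch stops after verification and simply returns the per-entity results.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interface (an assumption, not from the paper):
# (prompt, image, optional crop box) -> text reply from the LVLM.
Box = Tuple[float, float, float, float]
LVLM = Callable[[str, object, Optional[Box]], str]


@dataclass
class EntityCheck:
    entity: str
    box: Optional[Box]
    description: str
    verified: bool


def parse_box(text: str) -> Optional[Box]:
    """Parse 'x1,y1,x2,y2' (optionally bracketed); None if malformed."""
    try:
        parts = [float(p) for p in text.strip().strip("[]()").split(",")]
    except ValueError:
        return None
    return tuple(parts) if len(parts) == 4 else None


def r_cov_sketch(model: LVLM, image: object, question: str) -> Tuple[str, List[EntityCheck]]:
    # Step 1: initial response generation.
    response = model(question, image, None)

    # Step 2: entity extraction from the model's own response.
    raw = model(f"List the objects mentioned, comma-separated: {response}", image, None)
    entities = [e.strip() for e in raw.split(",") if e.strip()]

    checks: List[EntityCheck] = []
    for entity in entities:
        # Step 3: coordinate generation (the model proposes a region itself).
        reply = model(f"Give the bounding box of the {entity} as x1,y1,x2,y2.", image, None)
        box = parse_box(reply)

        description, answer = "", "no"  # assumption: no box means unverified
        if box is not None:
            # Step 4: region description, restricted to the proposed box.
            description = model("Describe this region.", image, box)
            # Step 5: verification execution against that description.
            answer = model(
                f"Region description: {description}\n"
                f"Is there a {entity}? Answer yes or no.",
                image,
                box,
            )
        verified = answer.strip().lower().startswith("yes")
        checks.append(EntityCheck(entity, box, description, verified))

    # The final step is truncated in the abstract, so it is not implemented here.
    return response, checks
```

Because every step calls the same model, the sketch needs no external detector and no retraining; swapping in a different LVLM only means supplying a different `model` callable.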
