TL;DR
SECOND introduces a novel decoding approach for vision-language models that reduces object hallucination by selectively and contrastively integrating multi-scale visual information, aligning more closely with human perception.
Contribution
It presents a new method, SECOND, that leverages multi-scale visual information with an object-centric approach to mitigate hallucinations in VLMs.
Findings
SECOND significantly reduces perceptual hallucinations.
It outperforms existing methods on visual understanding benchmarks.
Prioritizing and contrasting across scales enhances VLM performance.
Abstract
Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By contrasting this visual information iteratively, SECOND significantly reduces perceptual hallucinations and outperforms existing methods across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in VLMs, showing that prioritizing and contrasting…
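The abstract does not give the decoding rule, so the following is only a minimal sketch of the general idea of contrasting next-token predictions obtained at two visual scales. It assumes the common contrastive-decoding form from prior work (amplify the fine-scale log-probabilities, subtract the coarse-scale ones, and mask implausible tokens); it is not the authors' implementation, it omits the selective and iterative parts of SECOND, and the names `contrastive_scores`, `alpha`, and `beta` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(logits_fine, logits_coarse, alpha=1.0, beta=0.1):
    """Contrast fine-scale against coarse-scale next-token logits.

    Assumed (not from the paper): score = (1 + alpha) * log p_fine
    - alpha * log p_coarse, restricted to tokens whose fine-scale
    probability is at least beta times the top fine-scale probability.
    """
    lp_fine = log_softmax(logits_fine)
    lp_coarse = log_softmax(logits_coarse)
    cutoff = max(lp_fine) + math.log(beta)  # plausibility threshold
    scores = []
    for f, c in zip(lp_fine, lp_coarse):
        if f < cutoff:
            scores.append(float("-inf"))  # implausible token, never chosen
        else:
            scores.append((1 + alpha) * f - alpha * c)
    return scores


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy 3-token vocabulary. The coarse view strongly prefers token 1
# (standing in for a hallucinated object); the fine view is nearly tied.
fine = [1.9, 2.0, 0.0]
coarse = [1.0, 3.0, 0.0]
greedy_choice = argmax(fine)
contrast_choice = argmax(contrastive_scores(fine, coarse))
```

In this toy case, greedy decoding on the fine-scale logits alone picks token 1, while the contrasted scores pick token 0, because the preference for token 1 is driven mostly by the coarse view. The `beta` cutoff keeps the subtraction from promoting tokens that the fine-scale view already considers very unlikely.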