VORD: Visual Ordinal Calibration for Mitigating Object Hallucinations in Large Vision-Language Models
Dexter Neo, Tsuhan Chen

TL;DR
VORD is a calibration method for large vision-language models (LVLMs) that reduces object hallucinations by leveraging ordinal relationships between modified image pairs, improving calibration without extensive training.
Contribution
The paper introduces VORD, a simple calibration approach that mitigates hallucinations in LVLMs by using ordinal relationships between modified image pairs. It comes in two forms: a training-free variant and a trainable objective function.
Findings
VORD improves calibration accuracy in LVLMs.
VORD reduces object hallucinations across benchmarks.
The method is effective with minimal or no training.
Abstract
Large Vision-Language Models (LVLMs) have advanced remarkably alongside the recent surge of large language models. Despite these advances, LVLMs tend to generate plausible yet inaccurate or inconsistent information based on the provided source content. This phenomenon, known as "hallucination", can have serious downstream implications when LVLMs are deployed. To address this, we present VORD, a simple and effective method that alleviates hallucinations by calibrating token predictions based on ordinal relationships between modified image pairs. VORD is presented in two forms: (1) a minimalist training-free variant which eliminates implausible tokens from modified image pairs, and (2) a trainable objective function that penalizes unlikely tokens. Our experiments demonstrate that VORD delivers better calibration and effectively mitigates object…
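The abstract names the two forms of VORD but does not spell out the exact rule, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the ordinal relationship is "a token should be at least as probable given the original image as given the modified image", optionally by a margin. The function names `ordinal_filter` and `ordinal_penalty`, the `margin` parameter, and the hinge form of the penalty are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ordinal_filter(logits_original, logits_modified, margin=0.0):
    """Training-free sketch (assumed rule, not taken from the paper).

    Keeps a token only if its probability under the original image exceeds
    its probability under the modified image by at least `margin`, then
    renormalizes the surviving probabilities.
    """
    p_orig = softmax(logits_original)
    p_mod = softmax(logits_modified)
    keep = [po - pm >= margin for po, pm in zip(p_orig, p_mod)]
    if not any(keep):
        # Nothing satisfies the ordering: fall back to the unfiltered distribution.
        return p_orig
    masked = [po if k else 0.0 for po, k in zip(p_orig, keep)]
    total = sum(masked)
    return [v / total for v in masked]

def ordinal_penalty(logits_original, logits_modified, margin=0.0):
    """Trainable sketch (assumed hinge form, not taken from the paper).

    Sums, over the vocabulary, how much each token violates the assumed
    ordering p_original >= p_modified + margin.
    """
    p_orig = softmax(logits_original)
    p_mod = softmax(logits_modified)
    return sum(max(0.0, pm - po + margin) for po, pm in zip(p_orig, p_mod))

# Toy vocabulary of three tokens; the modified image favors the last token.
filtered = ordinal_filter([2.0, 1.0, 0.0], [0.0, 0.0, 2.0])
penalty = ordinal_penalty([2.0, 1.0, 0.0], [0.0, 0.0, 2.0])
```

In this toy case the third token becomes more likely under the modified image than under the original, so the filter zeroes it out and the penalty is positive. In a real LVLM the two logit vectors would come from two forward passes at each decoding step, one per image in the pair.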
Taxonomy
Topics: Functional Brain Connectivity Studies · Epilepsy research and treatment · Cell Image Analysis Techniques
