CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
Yassine Ouali, Adrian Bulat, Brais Martinez, and Georgios Tzimiropoulos

TL;DR
This paper introduces CLIP-DPO, a novel method leveraging contrastively pre-trained vision-language models like CLIP to reduce hallucinations in LVLMs without external data or APIs, improving robustness and grounding.
Contribution
The paper presents CLIP-DPO, a preference optimization approach that uses CLIP image-text similarities to build preference pairs for DPO-based fine-tuning of LVLMs, reducing hallucinations without additional training data, paid-for APIs, or other external LVLMs.
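For readers unfamiliar with DPO, the sketch below shows the standard per-example DPO loss that this kind of preference optimization builds on. It is a generic illustration, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for this example, and the log-probabilities are plain floats rather than model outputs.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are sequence log-probabilities: the first two under the model
    being trained, the last two under the frozen reference model.
    `beta` is an assumed illustrative value, not one taken from the paper.
    """
    # How much more the trained model prefers the chosen answer than the
    # reference model does, scaled by beta.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the trained model matches the reference, the loss equals log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen answer's likelihood relative to the rejected one lowers it.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In CLIP-DPO, the chosen and rejected responses would come from the CLIP-based ranking described in the abstract; here they are abstracted away as log-probabilities.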
Findings
Significant hallucination reduction in MobileVLM-v2 and LLaVA-1.5 models.
Improved zero-shot classification performance.
Preservation of original benchmark performance.
Abstract
Despite recent successes, LVLMs or Large Vision Language Models are prone to hallucinating details like objects and their properties or relations, limiting their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities, and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We…
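The ranking-and-filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the CLIP image-text similarities are assumed to be precomputed and passed in as floats, and the `Candidate` class, the `build_preference_pair` function, and the `min_gap` threshold are hypothetical stand-ins for the rule-based filtering, whose actual rules the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str          # one generated prediction for an image
    clip_score: float  # assumed precomputed CLIP image-text similarity


def build_preference_pair(candidates, min_gap=0.05):
    """Return a {"chosen", "rejected"} pair, or None if the pool is unusable.

    `min_gap` is a hypothetical filter: pairs whose scores are too close are
    dropped because the ranking gives no clear preference signal.
    """
    if len(candidates) < 2:
        return None
    # Rank predictions from most to least similar to the image.
    ranked = sorted(candidates, key=lambda c: c.clip_score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best.clip_score - worst.clip_score < min_gap:
        return None
    return {"chosen": best.text, "rejected": worst.text}


pool = [
    Candidate("a dog on a sofa", 0.31),
    Candidate("a cat on a sofa", 0.18),
    Candidate("a dog on a couch with a ball", 0.27),
]
pair = build_preference_pair(pool)
```

Each resulting pair would then serve as one positive/negative example for DPO-based training.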
Taxonomy
Topics: Epilepsy research and treatment
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
