CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Yassine Ouali, Adrian Bulat, Brais Martinez, and Georgios Tzimiropoulos

arXiv:2408.10433·cs.CV·August 21, 2024

Open Access

TL;DR

This paper introduces CLIP-DPO, a preference optimization method that uses contrastively pre-trained vision-language models such as CLIP to reduce hallucinations in LVLMs. It needs no paid-for APIs, no additional training data, and no other external LVLMs, and it improves robustness and grounding.

Contribution

The paper presents CLIP-DPO, a preference optimization approach that ranks LVLM generations by their CLIP image-text similarity to build positive and negative pairs for DPO-based fine-tuning. This reduces hallucinations without additional training data or other external LVLMs.
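As a rough illustration of what "DPO-based" training optimizes, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, the `beta` default, and the use of plain floats for sequence log-probabilities are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    beta=0.1 is an illustrative default, not a value reported by the paper.
    """
    # Implicit reward of each response: beta times the log-ratio between the
    # policy being trained and the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# When the policy raises the chosen response and lowers the rejected one, the loss drops.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

In a CLIP-DPO-style setup, the chosen and rejected responses would be the pairs selected via CLIP similarity rather than pairs labeled by humans or a paid API.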

Findings

01

Significant hallucination reduction in MobileVLM-v2 and LLaVA-1.5 models.

02

Improved zero-shot classification performance.

03

Preservation of original benchmark performance.

Abstract

Despite recent successes, Large Vision Language Models (LVLMs) are prone to hallucinating details such as objects and their properties or relations, which limits their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities, and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We…
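The rank-then-filter step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Candidate` class, the `build_preference_pair` helper, and the two filter rules (`min_words`, `min_margin`) are hypothetical, and the CLIP similarity scores are assumed to be precomputed with a CLIP-style model rather than calculated here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    clip_score: float  # image-text similarity, assumed precomputed with a CLIP-style model


def build_preference_pair(candidates, min_margin=0.05, min_words=3):
    """Return (chosen, rejected) texts for one image, or None if no pair survives.

    min_margin and min_words are hypothetical filter rules, not the paper's.
    """
    # Rule 1 (assumed): drop near-empty generations.
    kept = [c for c in candidates if len(c.text.split()) >= min_words]
    if len(kept) < 2:
        return None
    # Rank by CLIP similarity, highest first.
    ranked = sorted(kept, key=lambda c: c.clip_score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    # Rule 2 (assumed): require a clear score gap so the preference is not noise.
    if best.clip_score - worst.clip_score < min_margin:
        return None
    return best.text, worst.text


# Made-up generations and scores for a single image.
candidates = [
    Candidate("A dog sitting on a red couch", 0.31),
    Candidate("A cat sleeping on a red couch next to a lamp", 0.22),
    Candidate("A dog on a couch", 0.29),
    Candidate("Dog", 0.30),
]
pair = build_preference_pair(candidates)
print(pair)
```

The one-word caption is filtered out despite its high score, and the remaining highest- and lowest-scoring captions form the positive and negative sides of the pair. Requiring a score margin is one simple way to avoid training on pairs whose ranking is within CLIP's noise.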

Taxonomy

Topics

Epilepsy research and treatment

Methods

Sparse Evolutionary Training · Contrastive Language-Image Pre-training