Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Yiyang Zhou; Chenhang Cui; Rafael Rafailov; Chelsea Finn; Huaxiu Yao

arXiv:2402.11411 · cs.LG · February 20, 2024 · 2 cites


TL;DR

This paper introduces POVID, a preference fine-tuning method that uses AI-generated feedback data to align vision-language models, significantly reducing hallucinations and improving performance on benchmarks.

Contribution

The work presents an automated, scalable approach to align vision-language models using preference tuning with AI-generated feedback, without human data or expert involvement.

Findings

1. Reduces hallucinations in VLLMs
2. Improves performance on standard benchmarks
3. Outperforms prior alignment methods

Abstract

Instruction-following Vision Large Language Models (VLLMs) have recently achieved significant progress on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components are trained separately, the learned representations need to be aligned through joint training on additional image-language pairs. This procedure is not perfect and can cause the model to hallucinate, that is, to provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations. In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning. Specifically, we propose POVID to generate feedback data with AI models. We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.…
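
To make the idea of preference tuning on (preferred, dispreferred) response pairs concrete, here is a minimal sketch of one common preference objective, the DPO-style pairwise loss. The excerpt above does not say which objective POVID optimizes, so the choice of DPO is an assumption, as are the function names, the `beta` value, and the scalar log-probability inputs, all of which are hypothetical and for illustration only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # log sigmoid(x) = -softplus(-x) = -(max(-x, 0) + log1p(exp(-|x|)))
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def dpo_loss(
    policy_logp_preferred: float,
    policy_logp_dispreferred: float,
    ref_logp_preferred: float,
    ref_logp_dispreferred: float,
    beta: float = 0.1,  # assumed strength of the implicit KL penalty
) -> float:
    """DPO-style loss for a single (preferred, dispreferred) pair.

    Each argument is the summed log-probability of a full response given
    the image and prompt, under the trainable policy or a frozen reference
    model. Assumption: the excerpt only says "preference tuning"; DPO is
    used here as a representative objective.
    """
    # How much more the policy favors each response than the reference does.
    preferred_gain = policy_logp_preferred - ref_logp_preferred
    dispreferred_gain = policy_logp_dispreferred - ref_logp_dispreferred
    # The loss shrinks as the policy raises the preferred response
    # relative to the dispreferred one.
    return -log_sigmoid(beta * (preferred_gain - dispreferred_gain))


if __name__ == "__main__":
    # Hypothetical log-probabilities, not taken from the paper.
    print(dpo_loss(-4.0, -9.0, -5.0, -5.0))
```

When the policy and the reference agree on both responses, the loss equals log 2. It falls below that value once the policy assigns relatively more probability to the preferred response, which here would be the ground-truth instruction, than to the generated dispreferred one.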

Code & Models

Repositories

yiyangzhou/povid (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling