Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

TL;DR
This paper introduces POVID, a preference fine-tuning method that uses AI-generated feedback data to align vision-language models, significantly reducing hallucinations and improving performance on benchmarks.
Contribution
The work presents an automated, scalable approach to align vision-language models using preference tuning with AI-generated feedback, without human data or expert involvement.
Findings
Reduces hallucinations in VLLMs
Improves performance on standard benchmarks
Outperforms prior alignment methods
Abstract
Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. This procedure is not perfect and can cause the model to hallucinate, that is, to provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations. In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning. Specifically, we propose POVID to generate feedback data with AI models. We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.…
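To make the preferred/dispreferred setup concrete, here is a minimal sketch of a pairwise preference loss for a single (preferred, dispreferred) response pair. It assumes a DPO-style objective; the abstract as shown says only "preference tuning", so the choice of objective, the function name dpo_loss, its parameters, and the default beta are all illustrative assumptions rather than the paper's stated method.

```python
import math


def dpo_loss(logp_pref_policy, logp_disp_policy,
             logp_pref_ref, logp_disp_ref, beta=0.1):
    """DPO-style loss for one preference pair (illustrative sketch).

    Each argument is a summed log-probability of a full response given the
    image and prompt, under either the model being tuned ("policy") or a
    frozen reference copy ("ref"). All names here are hypothetical.
    """
    # How much more the policy favors each response than the reference does.
    pref_gain = logp_pref_policy - logp_pref_ref
    disp_gain = logp_disp_policy - logp_disp_ref
    margin = beta * (pref_gain - disp_gain)

    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Under this assumption, the loss equals log 2 when the policy matches the reference, and it falls as the policy assigns relatively more probability to the preferred (ground-truth) response than to the dispreferred one.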
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
