Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization

Shuo Xing; Peiran Li; Yuping Wang; Ruizheng Bai; Yueqi Wang; Chan-Wei Hu; Chengxuan Qian; Huaxiu Yao; Zhengzhong Tu

arXiv:2502.13146 · cs.CV · September 23, 2025

TL;DR

Re-Align introduces a retrieval-based alignment framework for vision-language models that reduces hallucinations and improves visual question-answering performance by incorporating visual preferences during fine-tuning.

Contribution

The paper proposes a dual-preference dataset construction and rDPO, an extended preference optimization method, to better align VLMs with both visual and textual preferences.

Findings

01 Re-Align outperforms previous methods in reducing hallucinations.

02 It significantly improves VQA task performance.

03 Re-Align is robust across various VLM architectures.

Abstract

The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image…
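For readers unfamiliar with the DPO objective the abstract refers to, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not Re-Align's rDPO, whose exact form is not given on this page. The function names `dpo_loss` and `softplus`, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; beta scales the implicit reward
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities of each response under the
    trainable policy and under a frozen reference model.
    """
    # Implicit rewards are log-ratios of policy to reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

The loss falls below log 2 when the policy favors the chosen response more strongly than the reference does, and rises above it in the opposite case. In a VLM setting the log-probabilities would additionally be conditioned on the image.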

Code & Models

Repositories

taco-group/re-align (PyTorch, official)

Videos

Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques