Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection

Zihui Zhao; Zechang Li

arXiv:2512.13240·cs.AI·December 16, 2025

Open Access

TL;DR

Reflective Preference Optimization (RPO) improves on-policy alignment of large models by using hint-guided reflection to generate stronger preference signals, leading to faster, more stable training and reduced hallucinations.

Contribution

The paper introduces RPO, a novel framework that incorporates external hints into preference optimization, enhancing contrastiveness and sample efficiency in model alignment.

Findings

1. RPO achieves better alignment with fewer samples and iterations.

2. RPO substantially reduces hallucination rates.

3. RPO delivers state-of-the-art results on multimodal benchmarks.

Abstract

Direct Preference Optimization (DPO) has emerged as a lightweight and effective alternative to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) for aligning large language and vision-language models. However, the standard DPO formulation, in which both the chosen and rejected responses are generated by the same policy, suffers from a weak learning signal because the two responses often share similar errors and exhibit small Kullback-Leibler (KL) divergence. This leads to slow and unstable convergence. To address this limitation, we introduce Reflective Preference Optimization (RPO), a new framework that incorporates hint-guided reflection into the DPO paradigm. RPO uses external models to identify hallucination sources and generate concise reflective hints, enabling the construction of on-policy preference pairs with stronger…
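
For readers unfamiliar with the objective the abstract builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the RPO method itself and is not taken from the paper's code. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be sequence-level log-probabilities of each response under the policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are sequence-level log-probabilities (assumed inputs);
    beta scales the implicit reward and its default here is illustrative.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities, chosen only to exercise the function.
    print(dpo_loss(-12.0, -12.5, -12.2, -12.4))
    print(dpo_loss(-8.0, -20.0, -12.2, -12.4))
```

The loss depends only on the difference between the two implicit rewards, so the objective is driven entirely by how the policy separates the chosen response from the rejected one relative to the reference. This is the quantity the abstract refers to when it describes pairs drawn from the same policy as yielding a weak signal.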

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Machine Learning and Data Classification