ADPO: Anchored Direct Preference Optimization
Wang Zixian

TL;DR
ADPO is a policy alignment method derived from KL-regularized reinforcement learning. It uses anchored logits to improve response quality and robustness in alignment tasks.
Contribution
The paper proposes ADPO, which explicitly parameterizes the optimal policy structure with anchored logits. This unifies supervised fine-tuning, reinforcement learning, and ranking-based objectives, and addresses limitations of prior methods such as probability smearing.
Findings
Achieves state-of-the-art performance on reasoning tasks.
Outperforms previous methods by 30.9% on Qwen3-1.7B.
Demonstrates superior robustness under distribution shift.
Abstract
We present Anchored Direct Preference Optimization (ADPO), a policy alignment method derived from first principles of KL-regularized reinforcement learning. Unlike standard approaches that treat the reference policy merely as a regularizer, we show that the optimal policy in reinforcement learning from human feedback inherently operates in a differential coordinate system, optimizing relative advantage in the form of log ratios rather than absolute probabilities. ADPO explicitly parameterizes this optimal structure through anchored logits, effectively decoupling response quality from prior popularity and creating an implicit trust region through curvature scaling. We show that this formulation unifies supervised fine-tuning, reinforcement learning, and ranking-based objectives under a single geometric perspective. Theoretically, ADPO resolves the probability smearing problem of…
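The log-ratio claim in the abstract can be illustrated with the standard closed-form solution of KL-regularized reward maximization over a finite response set, pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). The sketch below is not the ADPO loss, which the truncated abstract does not specify. The function names, the three-response toy distribution, the rewards, and beta are all hypothetical values chosen for illustration.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref)
    over a finite set of responses (standard result, not ADPO itself)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function
    return [w / z for w in weights]

def log_ratios(policy, ref_probs):
    """Log ratio log(pi(y) / pi_ref(y)) for each response."""
    return [math.log(p) - math.log(q) for p, q in zip(policy, ref_probs)]

# Hypothetical toy setup: the highest-reward response has the lowest prior.
ref_probs = [0.70, 0.25, 0.05]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star = kl_regularized_optimum(ref_probs, rewards, beta)
ratios = log_ratios(pi_star, ref_probs)

# The partition function cancels in differences of log ratios, so
# beta * (ratio difference) equals the reward difference, up to
# floating-point error, regardless of the prior probabilities.
recovered_gap = beta * (ratios[2] - ratios[0])

print("optimal policy:", pi_star)
print("recovered reward gap:", recovered_gap)
print("true reward gap:", rewards[2] - rewards[0])
```

In this toy case the recovered gap depends only on the rewards, not on how likely each response was under the reference policy, which is one way to read the abstract's statement about decoupling response quality from prior popularity.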
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms · Advanced Bandit Algorithms Research
