ADPO: Anchored Direct Preference Optimization

Wang Zixian

arXiv:2510.18913·cs.LG·January 13, 2026

Open Access · 1 model

TL;DR

ADPO is a policy alignment method derived from KL-regularized reinforcement learning. It parameterizes the policy through anchored logits (log ratios taken against a reference policy), with the stated aim of improving response quality and robustness in alignment tasks.

Contribution

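The paper proposes ADPO, which explicitly parameterizes the structure of the optimal policy using anchored logits. According to the abstract, this formulation unifies supervised fine-tuning, reinforcement learning, and ranking-based objectives under a single geometric perspective, and it targets limitations of prior methods such as probability smearing.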

Findings

01. Achieves state-of-the-art performance on reasoning tasks.

02. Outperforms previous methods by 30.9% on Qwen3-1.7B.

03. Demonstrates superior robustness under distribution shift.

Abstract

We present Anchored Direct Preference Optimization (ADPO), a policy alignment method derived from first principles of KL-regularized reinforcement learning. Unlike standard approaches that treat the reference policy merely as a regularizer, we show that the optimal policy in reinforcement learning from human feedback inherently operates in a differential coordinate system, optimizing relative advantage in the form of log ratios rather than absolute probabilities. ADPO explicitly parameterizes this optimal structure through anchored logits, effectively decoupling response quality from prior popularity and creating an implicit trust region through curvature scaling. We show that this formulation unifies supervised fine-tuning, reinforcement learning, and ranking-based objectives under a single geometric perspective. Theoretically, ADPO resolves the probability smearing problem of…
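The anchoring idea in the abstract can be illustrated with a small sketch. It is not the paper's exact objective: the function names, the temperature `tau`, and the listwise cross-entropy form are assumptions made for illustration. The sketch only relies on the well-known fact that the KL-regularized optimum satisfies log(π*/π_ref) = r/β − log Z, so a candidate's score is naturally expressed as a log ratio against the reference policy rather than as an absolute log-probability.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def anchored_logits(policy_logps, ref_logps, tau=1.0):
    """Hypothetical helper: per-candidate log ratio against the reference.

    policy_logps[i] and ref_logps[i] are sequence log-probabilities of
    candidate i under the trained policy and the frozen reference policy.
    tau is an assumed temperature, not a value taken from the paper.
    """
    return [(p - r) / tau for p, r in zip(policy_logps, ref_logps)]

def anchored_listwise_loss(policy_logps, ref_logps, target_probs, tau=1.0):
    """Cross-entropy between a target distribution over candidates and the
    softmax of the anchored logits (an assumed, illustrative loss form)."""
    log_q = log_softmax(anchored_logits(policy_logps, ref_logps, tau))
    return -sum(t * lq for t, lq in zip(target_probs, log_q))

# Three candidates with very different prior popularity under the reference.
ref = [-2.0, -5.0, -9.0]
target = [1.0, 0.0, 0.0]  # prefer the first candidate

# Policy identical to the reference: all anchored logits are zero.
loss_at_ref = anchored_listwise_loss(ref, ref, target)

# Policy that raises the preferred candidate by one nat relative to the reference.
loss_shifted = anchored_listwise_loss([-1.0, -5.0, -9.0], ref, target)
```

When the policy equals the reference, every anchored logit is zero and the candidate distribution is uniform regardless of how likely each response was under the reference. This is one way to read the abstract's claim that anchoring decouples response quality from prior popularity.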

Code & Models

Models

🤗 wzx111/Qwen3-1.7B-Open-R1-ADPO (model, 1 download)

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms · Advanced Bandit Algorithms Research