Autoregressive Direct Preference Optimization
Masanari Oi, Mahiro Ukai, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

TL;DR
This paper introduces Autoregressive DPO (ADPO), a novel preference optimization method for large language models that explicitly incorporates autoregressive assumptions, improving theoretical understanding and potentially enhancing alignment with human preferences.
Contribution
The paper reformulates DPO to explicitly include autoregressive assumptions, deriving a new variant called ADPO and analyzing the impact of token and feedback lengths on preference optimization.
Findings
ADPO shifts the summation outside the log-sigmoid, simplifying the loss.
Theoretical analysis distinguishes token length and feedback length, impacting algorithm design.
First explicit analysis of length measures in preference optimization for LLMs.
Abstract
Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the objective function. Motivated by this limitation, we revisit the theoretical foundations of DPO and propose a novel formulation that explicitly introduces the autoregressive assumption prior to applying the BT model. By reformulating and extending DPO, we derive a novel variant, termed Autoregressive DPO (ADPO), that explicitly integrates autoregressive modeling into the preference optimization framework. Without violating the theoretical foundations, the derived loss takes an elegant form: it shifts the summation operation in the DPO objective…
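To make the "summation outside the log-sigmoid" idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract above is truncated, so the exact ADPO loss is not reproduced here. The function names (`dpo_loss`, `summed_outside_loss`), the per-token granularity, the index-wise pairing of chosen and rejected positions, and the zero-padding of the shorter response are all assumptions made for illustration. The inputs are per-token log-ratios, i.e. log pi_theta(y_i | x, y_<i) - log pi_ref(y_i | x, y_<i).

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(chosen, rejected, beta=0.1):
    """Response-level DPO loss for one preference pair.

    The per-token log-ratios are summed first (the autoregressive
    factorization of the response log-probability), and the log-sigmoid
    is applied once to the resulting response-level margin.
    """
    margin = beta * (sum(chosen) - sum(rejected))
    return -log_sigmoid(margin)


def summed_outside_loss(chosen, rejected, beta=0.1):
    """Illustrative loss with the summation moved outside the log-sigmoid.

    ASSUMPTION (not taken from the paper): positions are paired by index
    and the shorter response is padded with zero log-ratios. The
    log-sigmoid is applied per position and the results are summed.
    """
    n = max(len(chosen), len(rejected))
    c = list(chosen) + [0.0] * (n - len(chosen))
    r = list(rejected) + [0.0] * (n - len(rejected))
    return -sum(log_sigmoid(beta * (ci - ri)) for ci, ri in zip(c, r))


if __name__ == "__main__":
    # Hypothetical per-token log-ratios for a chosen and a rejected response.
    chosen = [0.4, -0.1, 0.3]
    rejected = [-0.2, 0.1]
    print("response-level DPO loss:", dpo_loss(chosen, rejected))
    print("summed-outside loss:   ", summed_outside_loss(chosen, rejected))
```

Under these assumptions the two losses coincide when each response has a single position and differ otherwise, because the non-linearity is applied to each position's margin rather than to the total.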
Peer Reviews
Decision: Submitted to ICLR 2026
1. Theoretical soundness: The paper effectively explains how a different perspective on the DPO formulation can lead to a token-wise interpretation, which also allows a new loss with controllable granularity. 2. The paper is well-organized in terms of logical flow and visual presentation.
1. I think the biggest issue of this paper is that it does not cite or compare with a very similar analysis from the authors of the original DPO paper [1]. While there are a few differences in formulation and notation, [1] also presents a token-level perspective of the DPO formulation and shows that token-level DPO can parameterize any dense reward function (Section 4.2 of [1]). 2. While the paper presents several results where ADPO outperforms DPO, it seems it is still lacking a reas
Novel objective - The paper derives the objective with the autoregressive assumption built in. The authors demonstrate that the resulting objective still achieves the optimal policy and that DPO is a special case of the generalized ADPO objectives. They also provide a thorough analysis of the resulting objective and present it clearly. Experimental results - Experimental results demonstrate consistent improvement across multiple tasks and benchmarks a
Impact - While the benefit of ADPO is clear, the paper does not explain why an autoregressive model paired with a non-autoregressive output distribution is considered a mismatch, or why this is expected to be an issue. There are different ways of chunking the outputs, as seen with the variants of ADPO, so it is unclear why treating the output as a single object should lead to issues. Providing further reasoning here would strengthen the paper and make t
1. The paper observes that standard DPO applies the BT model at the response level and introduces the autoregressive structure only after deriving the objective, which is a significant theoretical mismatch in the existing formalization. ADPO rectifies this issue by introducing an energy definition over the prefix closure and explicitly assuming an autoregressive reference model at that point. This theory is much closer to the reality of how an LLM actually generates text. 2. It is clearly shown tha
1. About the implicit reward function $r^*(x, y_{\le i})$: In standard DPO, the implicit reward has an interpretation tied to the complete response. The optimization in ADPO is tied to the delivered localized rewards. What could these rewards be? Do they have a variance? How do they ‘credit’ prefixes for being ‘good’ or ‘bad’ in the response? 2. About the effect of moving the summation outside the log-sigmoid function, which changes the non-linearity: we know that DPO computes the total advantage, t
Taxonomy
Topics: Constraint Satisfaction and Optimization · Advanced Multi-Objective Optimization Algorithms · Multi-Criteria Decision Making
