Demystifying the unreasonable effectiveness of online alignment methods
Enoch Hyunwook Kang

TL;DR
This paper explains why greedy online alignment methods such as online RLHF and online DPO perform so well in practice: under a decision-focused (temperature-zero) regret criterion, they achieve constant \(O(1)\) cumulative regret.
Contribution
It analyzes greedy alignment methods under a decision-centric (temperature-zero) regret measure, which evaluates only the top-ranked response and so isolates the cost of response selection. This gives a sharper theoretical account of why these methods are efficient.
Findings
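Greedy online alignment methods achieve constant \(O(1)\) cumulative regret under the temperature-zero criterion.
The mismatch between empirical success and existing \(O(\log T)\) bounds stems from the KL-regularized regret measure, which conflates the statistical cost of learning with the randomization of the softened training policy.
The analysis helps explain why online RLHF and online DPO are efficient in practice.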
Abstract
Iterative alignment methods based on purely greedy updates are remarkably effective in practice, yet existing theoretical guarantees of \(O(\log T)\) KL-regularized regret can seem pessimistic relative to their empirical performance. In this paper, we argue that this mismatch arises from the regret criterion itself: KL-regularized regret conflates the statistical cost of learning with the exploratory randomization induced by the softened training policy. To separate these effects, we study the traditional temperature-zero regret criterion, which evaluates only the top-ranked response at inference time. Under this decision-centric notion of performance, we prove that standard greedy online alignment methods, including online RLHF and online DPO, achieve constant \((O(1))\) cumulative regret. By isolating the cost of identifying the best response from the stochasticity induced by…
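To make the temperature-zero criterion concrete, here is a minimal Python sketch of how cumulative regret could be tallied when only the top-ranked response is evaluated. It is an illustration under stated assumptions, not the paper's formal definition: the function name `temperature_zero_regret`, the finite list of candidate responses, and the toy reward values are all hypothetical.

```python
from typing import Sequence


def temperature_zero_regret(
    true_rewards: Sequence[Sequence[float]],
    estimated_rewards: Sequence[Sequence[float]],
) -> float:
    """Sum, over rounds, the reward gap between the best response and the
    response ranked first by the learner's current estimate.

    Hypothetical setup: each round has a finite list of candidate responses,
    true_rewards[t][i] is the true reward of candidate i in round t, and
    estimated_rewards[t][i] is the learner's estimate in that round.
    """
    total = 0.0
    for true_r, est_r in zip(true_rewards, estimated_rewards):
        # Greedy (temperature-zero) choice: the top-ranked candidate.
        chosen = max(range(len(est_r)), key=lambda i: est_r[i])
        total += max(true_r) - true_r[chosen]
    return total


# Toy example: three rounds over the same three candidates.
true_rewards = [[0.0, 1.0, 0.5]] * 3
estimated_rewards = [
    [0.6, 0.4, 0.1],  # round 1: wrong candidate ranked first
    [0.3, 0.8, 0.2],  # round 2: best candidate ranked first
    [0.1, 0.7, 0.6],  # round 3: estimates still inexact, ranking correct
]
total_regret = temperature_zero_regret(true_rewards, estimated_rewards)
```

In this toy run, only the first round contributes regret. Once the top-ranked candidate is the best one, later rounds add nothing even though the estimates remain imprecise, which is the sense in which this criterion measures only the cost of identifying the best response.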
