APO: Alpha-Divergence Preference Optimization
Wang Zixian

TL;DR
APO introduces a flexible framework that interpolates between mode-covering and mode-seeking behaviors in alignment training, improving stability and performance by adaptively balancing divergence types during optimization.
Contribution
It proposes Alpha-Divergence Preference Optimization (APO), a novel anchored method that smoothly interpolates between forward and reverse KL divergences using Csiszar alpha-divergence, with a practical schedule for stable training.
Findings
Achieves competitive performance with Qwen3-1.7B on math-level3.
Maintains training stability comparable to baselines.
Effectively balances exploration and exploitation during training.
Abstract
Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL divergence KL(q || pi_theta), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online reinforcement learning from human feedback behaves closer to reverse KL divergence KL(pi_theta || q), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods, such as ADPO, show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce Alpha-Divergence Preference Optimization (APO), an anchored framework that uses Csiszar alpha-divergence to continuously interpolate between forward and reverse KL behavior within the same anchored geometry. We derive unified gradient…
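The interpolation described in the abstract can be illustrated on small discrete distributions. The sketch below is not the paper's implementation: it assumes one common parametrization of the alpha-divergence, D_alpha(p || q) = (1 - sum_i p_i^alpha q_i^(1 - alpha)) / (alpha (1 - alpha)), which may differ from the convention used in APO, and it ignores the anchored coordinates entirely. The names `kl`, `alpha_divergence`, `q_target`, and `pi_theta` are hypothetical and chosen only for illustration. Under this parametrization, alpha near 1 recovers the forward KL(q || pi_theta) and alpha near 0 recovers the reverse KL(pi_theta || q).

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


def alpha_divergence(p, q, alpha):
    """Alpha-divergence D_alpha(p || q) in one common parametrization.

    Assumed form (not taken from the paper):
        D_alpha(p || q) = (1 - sum_i p_i^alpha * q_i^(1 - alpha)) / (alpha * (1 - alpha))
    The limits alpha -> 1 and alpha -> 0 are handled explicitly.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if abs(alpha - 1.0) < 1e-8:
        return kl(p, q)  # limit alpha -> 1: KL(p || q)
    if abs(alpha) < 1e-8:
        return kl(q, p)  # limit alpha -> 0: KL(q || p)
    s = np.sum(p ** alpha * q ** (1.0 - alpha))
    return float((1.0 - s) / (alpha * (1.0 - alpha)))


# Toy target distribution q and policy pi_theta over three outcomes.
q_target = np.array([0.7, 0.2, 0.1])
pi_theta = np.array([0.4, 0.4, 0.2])

print("forward KL(q || pi_theta):", kl(q_target, pi_theta))
print("reverse KL(pi_theta || q):", kl(pi_theta, q_target))
for alpha in (0.999, 0.5, 0.001):
    value = alpha_divergence(q_target, pi_theta, alpha)
    print(f"alpha={alpha}: D_alpha(q || pi_theta) = {value}")
```

Sweeping `alpha` from 1 toward 0 moves the objective from the mode-covering end to the mode-seeking end, which is the kind of schedule the summary refers to; the specific schedule used by APO is not reproduced here.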
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms
