DDO-RM: Distribution-Level Policy Improvement after Reward Learning
Tiantian Zhang, Jierui Zuo, Michael Chen, Wenping Wang

TL;DR
DDO-RM is a finite-candidate decision-optimization method that improves a policy by converting reward scores into an explicit target distribution. In preliminary experiments on Pythia-410M, it outperforms DPO on pair accuracy and mean margin.
Contribution
The paper presents DDO-RM, a novel finite-candidate decision-optimization approach that links reward learning with mirror-descent policy improvement.
Findings
DDO-RM outperforms DPO in pair accuracy (0.56 vs. 0.52 for DPO)
DDO-RM improves mean margin over DPO (0.53 vs. 0.13 for DPO)
Framework connects reward learning with mirror-descent policy updates
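
The summary does not define pair accuracy or mean margin. Under one common reading in preference learning, assumed here and not confirmed by the source, each preference pair gets a scalar score for the chosen and the rejected response; pair accuracy is the fraction of pairs where the chosen score is higher, and mean margin is the average score gap. The function name `pair_metrics` and the toy scores are hypothetical, chosen only for illustration.

```python
import numpy as np

def pair_metrics(chosen_scores, rejected_scores):
    """Return (pair_accuracy, mean_margin) for paired scalar scores.

    Assumed definitions (not taken from the paper):
      margin        = chosen score - rejected score
      pair accuracy = fraction of pairs with a positive margin
      mean margin   = average margin over all pairs
    """
    chosen = np.asarray(chosen_scores, dtype=float)
    rejected = np.asarray(rejected_scores, dtype=float)
    margins = chosen - rejected
    pair_accuracy = float(np.mean(margins > 0))
    mean_margin = float(np.mean(margins))
    return pair_accuracy, mean_margin

# Toy scores for four preference pairs (made-up numbers).
acc, margin = pair_metrics([2.0, 1.0, 0.5, 0.0], [1.0, 1.5, 0.0, 0.5])
print(acc, margin)
```

Under these definitions the two metrics can move independently: a method can raise the mean margin substantially while changing the ranking of only a few pairs, which would be consistent with a small accuracy gain next to a larger margin gain.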
Abstract
Recent theory suggests that reward-model-first methods can be more sample-efficient than direct policy fitting when the reward function is statistically simpler than the induced policy. We propose DDO-RM, a finite-candidate decision-optimization method that converts reward scores into an explicit target distribution. Unlike PPO-based RLHF or DPO, DDO-RM performs a KL-regularized mirror-descent update to project the policy toward a reward-improved distribution over a candidate set. Preliminary experiments on Pythia-410M show that DDO-RM outperforms DPO in pair accuracy (0.56 vs. 0.52) and mean margin (0.53 vs. 0.13). Our framework provides a principled connection between reward learning and mirror-descent policy improvement.
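
To make the "reward-improved distribution over a candidate set" concrete, here is a minimal sketch of a generic KL-regularized mirror-descent step on a finite candidate set. It relies on the standard closed form: maximizing E_q[r] - beta * KL(q || pi) over distributions q gives q(y) proportional to pi(y) * exp(r(y) / beta). Whether DDO-RM uses exactly this target, this temperature `beta`, or this cross-entropy projection loss is an assumption; the function names and toy numbers are hypothetical.

```python
import numpy as np

def reward_improved_target(policy_logprobs, rewards, beta=1.0):
    """Closed-form KL-regularized target over a finite candidate set.

    q(y) is proportional to pi(y) * exp(r(y) / beta), computed in log
    space for numerical stability. `beta` (an assumed hyperparameter)
    controls how far q may move away from the current policy pi.
    """
    logits = np.asarray(policy_logprobs, dtype=float) \
        + np.asarray(rewards, dtype=float) / beta
    logits = logits - logits.max()  # avoid overflow in exp
    q = np.exp(logits)
    return q / q.sum()

def projection_loss(target, new_policy_logprobs):
    """Cross-entropy of the policy against the target over the candidates.

    Minimizing this over policy parameters moves the policy toward the
    target; it differs from KL(target || policy) only by a constant.
    """
    target = np.asarray(target, dtype=float)
    return float(-np.sum(target * np.asarray(new_policy_logprobs, dtype=float)))

# Four candidates, uniform current policy, made-up reward scores.
pi_logprobs = np.log(np.full(4, 0.25))
rewards = np.array([0.0, 1.0, 2.0, 3.0])

target = reward_improved_target(pi_logprobs, rewards, beta=1.0)
near_pi = reward_improved_target(pi_logprobs, rewards, beta=1e6)

loss_before = projection_loss(target, pi_logprobs)
loss_at_target = projection_loss(target, np.log(target))
print(target, loss_before, loss_at_target)
```

With a uniform current policy the target reduces to a softmax of the rewards divided by `beta`. A very large `beta` leaves the target almost equal to the current policy, while a small `beta` concentrates mass on the highest-reward candidate.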
