DDO-RM: Distribution-Level Policy Improvement after Reward Learning

Tiantian Zhang; Jierui Zuo; Michael Chen; Wenping Wang

arXiv:2604.11119 · stat.ML · May 1, 2026

TL;DR

DDO-RM is a finite-candidate decision-optimization method that converts reward-model scores into an explicit target distribution and moves the policy toward it. In preliminary experiments on Pythia-410M, it outperforms DPO on pair accuracy (0.52 to 0.56) and mean margin (0.13 to 0.53).

Contribution

The paper presents DDO-RM, a novel finite-candidate decision-optimization approach that links reward learning with mirror-descent policy improvement.
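For intuition, a KL-regularized mirror-descent step over a finite candidate set has a standard closed form. The notation here (candidate set $\mathcal{C}$, current policy $\pi_t$ restricted to $\mathcal{C}$, reward score $r$, temperature $\beta > 0$) is assumed for illustration and is not taken from the paper, whose exact objective may differ.

```latex
\[
q^\star
\;=\;
\arg\max_{q \in \Delta(\mathcal{C})}
\;\sum_{y \in \mathcal{C}} q(y)\, r(y)
\;-\; \beta\, \mathrm{KL}\!\left(q \,\|\, \pi_t\right),
\qquad
q^\star(y)
\;=\;
\frac{\pi_t(y)\, \exp\!\big(r(y)/\beta\big)}
     {\sum_{y' \in \mathcal{C}} \pi_t(y')\, \exp\!\big(r(y')/\beta\big)} .
\]
```

A small $\beta$ concentrates $q^\star$ on the highest-reward candidates, while a large $\beta$ keeps it close to $\pi_t$.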

Findings

01 DDO-RM outperforms DPO in pair accuracy (0.52 to 0.56) in preliminary experiments on Pythia-410M

02 DDO-RM also improves mean margin over DPO (0.13 to 0.53)

03 The framework connects reward learning with mirror-descent policy updates
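The page does not define pair accuracy or mean margin. The sketch below uses a common convention from preference-optimization evaluation, which is an assumption here: each preference pair gets a score for the chosen and the rejected response (for example a policy log-probability or an implicit reward), pair accuracy is the fraction of pairs where the chosen score is strictly higher, and mean margin is the average score difference. The function name `pair_metrics` and the sample scores are hypothetical.

```python
def pair_metrics(chosen_scores, rejected_scores):
    """Return (pair_accuracy, mean_margin) for aligned lists of scores.

    Assumed convention (not confirmed by the source):
      margin_i      = chosen_i - rejected_i
      pair_accuracy = fraction of pairs with margin_i > 0
      mean_margin   = average of margin_i
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("need at least one preference pair")

    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    accuracy = sum(1 for m in margins if m > 0) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin


# Made-up scores for four preference pairs (illustrative only).
chosen = [1.2, 0.4, -0.3, 0.9]
rejected = [0.7, 0.6, -0.8, 0.1]
acc, margin = pair_metrics(chosen, rejected)
print(f"pair accuracy = {acc:.2f}, mean margin = {margin:.2f}")
```

Under this convention the two metrics can move independently: a method can raise the mean margin substantially while changing pair accuracy only slightly, since accuracy depends only on the sign of each margin.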

Abstract

Recent theory suggests that reward-model-first methods can be more sample-efficient than direct policy fitting when the reward function is statistically simpler than the induced policy. We propose DDO-RM, a finite-candidate decision-optimization method that converts reward scores into an explicit target distribution. Unlike PPO-based RLHF or DPO, DDO-RM performs a KL-regularized mirror-descent update to project the policy toward a reward-improved distribution over a candidate set. Preliminary experiments on Pythia-410M show that DDO-RM outperforms DPO in pair accuracy (0.52 to 0.56) and mean margin (0.13 to 0.53). Our framework provides a principled connection between reward learning and mirror-descent policy improvement.
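As a rough illustration of turning reward scores into an explicit target distribution over a candidate set, the following minimal sketch applies the standard exponentiated-reward form of a KL-regularized update, `target(y) ∝ policy(y) * exp(reward(y) / beta)`. This is an assumption about the general shape of such a method, not the authors' implementation: the function names, the temperature `beta`, and the toy numbers are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reward_improved_target(policy_probs, rewards, beta=1.0):
    """Target distribution proportional to policy * exp(reward / beta).

    policy_probs: strictly positive policy probabilities over the candidates
                  (renormalized over the candidate set by the softmax).
    rewards:      reward-model scores, one per candidate.
    beta:         assumed KL-regularization strength; larger values keep the
                  target closer to the current policy.
    """
    logits = [math.log(p) + r / beta for p, r in zip(policy_probs, rewards)]
    return softmax(logits)


def kl_divergence(q, p):
    """KL(q || p) for two distributions on the same finite support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


# Toy example: three candidate responses for one prompt.
policy = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
target = reward_improved_target(policy, rewards, beta=1.0)

# Expected reward goes up under the target, at a KL cost that beta controls.
expected_before = sum(p * r for p, r in zip(policy, rewards))
expected_after = sum(q * r for q, r in zip(target, rewards))
print("target:", [round(q, 3) for q in target])
print("expected reward:", round(expected_before, 3), "->", round(expected_after, 3))
print("KL(target || policy):", round(kl_divergence(target, policy), 3))
```

In a training loop, the policy would then be fitted toward `target` on each candidate set, for example by minimizing a cross-entropy or KL loss against it. The source does not specify that loss, so it is left out of the sketch.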
