ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

Nirmal Patel, Fei Wang, and Inderjit S. Dhillon

arXiv:2605.12667·cs.LG·May 18, 2026

TL;DR

The paper introduces ODRPO, a novel framework that decomposes discrete rewards into ordinal indicators to improve the robustness and efficiency of policy optimization in noisy reward environments for large language models.

Contribution

ODRPO is a new method that isolates evaluation noise through ordinal decomposition, enhancing robustness with negligible training-time overhead.
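To make "ordinal decomposition" concrete, here is a minimal sketch of one standard way to turn a discrete rating into ordinal indicators. It is an illustration under stated assumptions, not the paper's algorithm: the threshold encoding `1[reward >= k]` and the helper names `ordinal_indicators` and `reconstruct` are hypothetical choices made for this example.

```python
def ordinal_indicators(reward: int, num_levels: int) -> list[int]:
    """Encode a rating in 1..num_levels as threshold indicators.

    Assumed encoding (not taken from the paper): one binary indicator
    1[reward >= k] for each threshold k = 2, ..., num_levels.
    """
    if not 1 <= reward <= num_levels:
        raise ValueError("reward must lie in 1..num_levels")
    return [int(reward >= k) for k in range(2, num_levels + 1)]


def reconstruct(indicators: list[int]) -> int:
    """Recover the rating: it equals 1 plus the number of thresholds passed."""
    return 1 + sum(indicators)


# A rating of 7 and a noisy re-rating of 8 on a 1-10 rubric.
clean = ordinal_indicators(7, 10)
noisy = ordinal_indicators(8, 10)

# Thresholds at which the two encodings disagree.
flipped = [k for k, (a, b) in enumerate(zip(clean, noisy), start=2) if a != b]

print(clean, reconstruct(clean))
print("thresholds affected by the noise:", flipped)
```

Under this encoding, a one-step rating error changes exactly one indicator and leaves the others untouched, which is one way to see how a decomposition could confine noise to a single component.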

Findings

1. ODRPO outperforms baselines, with improvements of up to 14.8% on FACTS-grounding-v2.

2. ODRPO achieves these gains with negligible training-time overhead.

3. Theoretical analysis confirms ODRPO's optimization stability.

Abstract

The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction following. These domains often rely on LLM-based auto-raters to provide granular, multi-tier discrete rewards (e.g., 1-10 rubrics) that are inherently stochastic due to prompt sensitivity and sampling randomness. We empirically verify that this auto-rater stochasticity can propagate into and corrupt standard advantage estimators such as GRPO and MaxRL, as noisy reward samples can skew normalization statistics and degrade the global learning signal. Empirically, sampling more rewards and taking a majority vote may reduce the noise and improve performance, but this approach is computationally expensive. To address this bottleneck, we introduce Ordinal Decomposition for…
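The claim that a noisy reward sample can skew normalization statistics can be illustrated with a GRPO-style group-normalized advantage. The sketch below rests on assumptions: it uses the common form (reward minus group mean, divided by group standard deviation), with a population standard deviation and a small `eps` chosen here for simplicity, and the reward values are made up for illustration rather than taken from the paper.

```python
import statistics


def group_normalized_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: (r - group mean) / (group std + eps).

    Assumptions for this sketch: population standard deviation and a
    small eps to avoid division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical ratings for four sampled responses to one prompt.
clean = [6, 7, 7, 8]
# Same group, but the auto-rater mis-scores the last response as 1.
noisy = [6, 7, 7, 1]

adv_clean = group_normalized_advantages(clean)
adv_noisy = group_normalized_advantages(noisy)

for i, (a, b) in enumerate(zip(adv_clean, adv_noisy)):
    print(f"response {i}: clean advantage {a:+.3f}, noisy advantage {b:+.3f}")
```

Although only one rating was corrupted, the group mean and standard deviation both shift, so every advantage in the group changes. In this example the lowest-rated of the uncorrupted responses moves from a negative advantage to a positive one.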
