ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
Nirmal Patel, Fei Wang, and Inderjit S. Dhillon

TL;DR
The paper introduces ODRPO, a novel framework that decomposes discrete rewards into ordinal indicators to improve the robustness and efficiency of policy optimization in noisy reward environments for large language models.
Contribution
ODRPO isolates evaluation noise through ordinal decomposition, improving robustness with negligible additional training-time overhead.
Findings
ODRPO outperforms baselines by up to 14.8% on FACTS-grounding-v2.
ODRPO achieves these gains with negligible training-time overhead.
Theoretical analysis confirms its optimization stability.
Abstract
The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction following. These domains often rely on LLM-based auto-raters to provide granular, multi-tier discrete rewards (e.g., 1-10 rubrics) that are inherently stochastic due to prompt sensitivity and sampling randomness. We empirically verify that this auto-rater stochasticity can propagate into and corrupt standard advantage estimators like GRPO and MaxRL, as noisy reward samples can skew normalization statistics and degrade the global learning signal. Empirically, sampling more rewards and taking a majority vote may reduce the noise and improve performance, but this approach is computationally expensive. To address this bottleneck, we introduce Ordinal Decomposition for…
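To make the two ideas in the abstract concrete, here is a minimal sketch, not the paper's implementation. It assumes a GRPO-style estimator that standardizes rewards within a group of sampled responses, and it assumes "ordinal indicators" means the binary thresholds 1[r >= k] for a reward r on a 1..K rubric. The function names, the example scores, and the choice of thresholds are all hypothetical. How ODRPO actually combines the indicators is not stated in the truncated abstract and is not shown here.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumed GRPO-style form; eps only guards against a zero std.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def ordinal_indicators(rewards, num_levels):
    """Decompose rewards in {1..num_levels} into indicators 1[r >= k].

    Returns an array of shape (n, num_levels - 1), one column per
    threshold k = 2..num_levels (an illustrative convention).
    """
    r = np.asarray(rewards, dtype=int)
    thresholds = np.arange(2, num_levels + 1)
    return (r[:, None] >= thresholds[None, :]).astype(int)


# Hypothetical group of four responses scored on a 1-10 rubric.
clean = [6, 7, 6, 7]
noisy = [6, 7, 6, 1]  # same group, but the last score is corrupted

print(grpo_advantages(clean))  # score-6 responses get negative advantages
print(grpo_advantages(noisy))  # the outlier shifts mean/std; score-6 turns positive

# Each row has (r - 1) leading ones, so the row sum plus 1 recovers r.
indicators = ordinal_indicators(noisy, num_levels=10)
print(indicators.sum(axis=1) + 1)
```

In this toy group a single corrupted score changes the sign of the advantage assigned to the unaffected score-6 responses, which is the kind of skew in normalization statistics the abstract describes. The indicator matrix is a lossless re-encoding of the same rewards, one binary column per rubric threshold.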
