GOPO: Policy Optimization using Ranked Rewards
Kyuseong Choi, Dwaipayan Saha, Woojeong Kim, Anish Agarwal, Raaz Dwivedi

TL;DR
GOPO is a policy optimization method that uses only the ranking of rewards and discards their magnitudes. In reinforcement learning settings with non-verifiable rewards, it is reported to yield higher training and validation reward trajectories than GRPO.
Contribution
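The paper proposes a rank-based reward transformation for policy optimization. It targets a mismatch in RLHF: reward models are trained to capture relative preferences, while existing policy optimization techniques rely on absolute reward magnitudes.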
Findings
Consistently higher training and validation reward trajectories than GRPO
Better LLM evaluation scores
Faster convergence to high-quality policies
Abstract
Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Our rank-based transformation of rewards provides several gains, compared to Group Relative Policy Optimization (GRPO), in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved…
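The contrast the abstract draws between magnitude-based and rank-based advantages can be illustrated with a minimal sketch. The abstract does not state the exact transformation GOPO uses, so `rank_advantages` below is a hypothetical stand-in (ranks mapped to evenly spaced values in [-1, 1], ties sharing an average rank), not the authors' formula. `grpo_advantages` is the usual group standardization, (r - mean) / std, and both function names are illustrative.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards):
    """Group-standardized advantages, (r - mean) / std, which depend on reward magnitudes."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rewards equal: no learning signal within the group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def rank_advantages(rewards):
    """Hypothetical rank-based transform (an assumption, not the paper's formula).

    Each reward is replaced by its rank within the group, rescaled to [-1, 1].
    Tied rewards share the average of their ranks.
    """
    n = len(rewards)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied rewards starting at sorted position i.
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [2 * r / (n - 1) - 1 for r in ranks]


if __name__ == "__main__":
    group_a = [0.1, 5.0, 0.2, 0.3]    # one moderately large reward
    group_b = [0.1, 500.0, 0.2, 0.3]  # same ordering, very different magnitude
    print(grpo_advantages(group_a))
    print(grpo_advantages(group_b))
    print(rank_advantages(group_a))
    print(rank_advantages(group_b))
```

In this sketch the two groups have the same ordering, so the rank-based advantages are identical, while the standardized advantages change with the size of the outlier. That is the sense in which a rank-based signal ignores reward magnitude; how GOPO actually scales or weights the ranks is left to the paper.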
Taxonomy
Topics: Recommender Systems and Techniques · Emotion and Mood Recognition · Reinforcement Learning in Robotics
