GOPO: Policy Optimization using Ranked Rewards

Kyuseong Choi; Dwaipayan Saha; Woojeong Kim; Anish Agarwal; Raaz Dwivedi

arXiv:2602.03876·cs.LG·February 5, 2026



TL;DR

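GOPO (Group Ordinal Policy Optimization) optimizes a policy using only the ranking of rewards, not their magnitudes. In reinforcement learning tasks with non-verifiable rewards, it reports higher reward trajectories, better LLM evaluation scores, and faster convergence than Group Relative Policy Optimization (GRPO).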

Contribution

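The paper proposes a rank-based reward transformation for policy optimization. It targets a mismatch in RLHF: reward models are trained to capture relative preferences, while existing policy optimization techniques rely on absolute reward magnitudes.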

Findings

01 Higher reward trajectories during training
02 Better LLM evaluation scores
03 Faster convergence to high-quality policies

Abstract

Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable, such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Our rank-based transformation of rewards provides several gains over Group Relative Policy Optimization (GRPO) in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved…
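To make the contrast between magnitude-based and rank-based signals concrete, here is a minimal sketch. The abstract excerpt does not give GOPO's exact transformation, so `rank_advantages` and its mapping of ranks to evenly spaced values in [-1, 1] are illustrative assumptions, not the authors' formula. `grpo_advantages` uses the usual group normalization (reward minus group mean, divided by group standard deviation); the population standard deviation and the `eps` constant are also choices made here.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Depends on reward magnitudes, so one outlier reshapes the whole group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def rank_advantages(rewards):
    """Hypothetical ordinal transform: keep only the within-group rank.

    Ranks are mapped to evenly spaced values in [-1, 1]; tied rewards
    share the average of their ranks. The spacing is illustrative only.
    """
    n = len(rewards)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Find the run of tied rewards starting at sorted position `pos`.
        end = pos
        while end + 1 < n and rewards[order[end + 1]] == rewards[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return [2.0 * r / (n - 1) - 1.0 for r in ranks]


# Two groups with the same ordering but very different magnitudes.
group_a = [0.1, 0.2, 0.3, 5.0]
group_b = [0.1, 0.2, 0.3, 0.4]

if __name__ == "__main__":
    print("GRPO a:", grpo_advantages(group_a))
    print("GRPO b:", grpo_advantages(group_b))
    print("rank a:", rank_advantages(group_a))
    print("rank b:", rank_advantages(group_b))
```

Both groups order their four responses identically, so the rank-based signal is the same for each, while the group-normalized signal changes with the size of the top reward.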


Taxonomy

Topics: Recommender Systems and Techniques · Emotion and Mood Recognition · Reinforcement Learning in Robotics