Reward Learning through Ranking Mean Squared Error
Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor

TL;DR
This paper introduces R4, a reward learning method that uses a ranking mean squared error (rMSE) loss to infer reward functions from human ratings, reducing the amount of feedback needed and improving policy performance on reinforcement learning tasks.
Contribution
R4 is the first method to formalize a ranking MSE loss for reward learning from ratings, with provable guarantees and superior empirical performance on benchmark tasks.
Findings
R4 outperforms existing methods on robotic locomotion benchmarks.
It requires less human feedback to achieve comparable or better results.
The approach offers formal guarantees (completeness and minimality) for the solution set.
Abstract
Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., "bad," "neutral," "good"). At each training step, we sample a set of trajectories,…
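The core idea in the abstract, treating ratings as ordinal targets and penalizing the squared gap between predicted and target ranks, can be sketched on a single sampled batch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`soft_rank`, `midranks`, `ranking_mse`), the temperature `tau`, the use of average ranks for tied ratings, and the pairwise-sigmoid relaxation are all hypothetical choices made here. The reviews indicate that the paper relies on a differentiable ranking operator based on permutahedron projections, which this sketch does not implement.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_rank(scores, tau=0.1):
    """Smooth approximation of ascending ranks (1 = lowest score).

    ASSUMPTION: pairwise-sigmoid relaxation used as a simple stand-in
    for a differentiable ranking operator; it is not the operator
    used in the paper.
    """
    ranks = []
    for i, s_i in enumerate(scores):
        r = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                # Adds ~1 when s_i is clearly above s_j, ~0 when clearly below.
                r += _sigmoid((s_i - s_j) / tau)
        ranks.append(r)
    return ranks


def midranks(ratings):
    """Target ranks from discrete ratings; tied ratings share their average rank.

    ASSUMPTION: this tie handling is chosen for illustration only.
    """
    out = []
    for r in ratings:
        below = sum(1 for x in ratings if x < r)
        ties = sum(1 for x in ratings if x == r)
        out.append(below + (ties + 1) / 2.0)
    return out


def ranking_mse(pred_returns, ratings, tau=0.1):
    """Mean squared error between soft ranks of predicted returns and target ranks."""
    soft = soft_rank(pred_returns, tau)
    target = midranks(ratings)
    return sum((s - t) ** 2 for s, t in zip(soft, target)) / len(soft)


# Hypothetical batch: ratings 0 = "bad", 1 = "neutral", 2 = "good".
ratings = [0, 1, 2, 2]
consistent_returns = [-1.0, 0.0, 1.0, 1.0]   # ordering agrees with the ratings
reversed_returns = [1.0, 0.0, -1.0, -1.0]    # ordering contradicts the ratings

loss_consistent = ranking_mse(consistent_returns, ratings)
loss_reversed = ranking_mse(reversed_returns, ratings)
print(loss_consistent, loss_reversed)
```

Predicted returns whose ordering agrees with the ratings yield a loss near zero, while a reversed ordering is penalized heavily. Only the ordering matters here, not the absolute scale of the returns. To train a reward model, the same computation would be written in an automatic-differentiation framework so that gradients flow from the loss back to the predicted returns.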
Peer Reviews
Decision: Submitted to ICLR 2026
1. The proposed ranking mean squared error (rMSE) loss is conceptually simple yet technically elegant, effectively leveraging ordinal ratings through differentiable ranking.
2. The theoretical analysis provides formal guarantees of completeness and minimality, strengthening the method’s conceptual foundation.
3. The experimental results encompass both offline and online feedback settings across multiple continuous-control benchmarks.
4. The ablation studies are carefully designed, demonstrati
1. The paper provides limited conceptual and empirical discussion on how rating-based feedback compares to preference-based feedback. While the authors claim that ratings are more informative and cognitively efficient, this assumption is not rigorously analyzed.
2. The theoretical results rely on several strong assumptions, such as deterministic reward realizability, perfectly consistent rating bins, and nearly exact differentiable ranking, which may not hold in realistic human feedback scenari
- Simple, practical objective. Treating ratings as ordered targets and aligning ranks, instead of absolute scores, gives an intuitive loss that removes the design burden of rating bin boundaries required by RbRL.
- Leverages modern differentiable ranking. Building on fast, differentiable ranking/sorting (permutahedron projections) is sensible and computationally efficient compared to older proxies.
- Broad empirical envs on standard control suites. The paper reproduces common baselines (PEBBLE,
- The “minimality and completeness” results crucially assume (i) deterministic reward realizability, (ii) perfect ordinal binning by the teacher, and (iii) exact differentiable ranking that outputs true ranks. In practice, (iii) is not guaranteed—soft‑rank operators are precise for soft projections but do not generally equal hard ranks except in limiting or special cases. The paper cites [1] for “exact computation,” but that exactness refers to the soft operator and its gradients, not equality t
1. The idea of using ranking information from ratings is clear and simple.
2. The method is easy to implement within standard reward learning pipelines.
3. The experiments show some improvement over the chosen baselines on locomotion tasks under simulated settings.
1. The ordinal feedback is not a new idea for preference-based RL. The paper does not convincingly show that rMSE gives qualitatively different behavior than straightforward regression-to-targets, ordinal regression, or a calibrated cross-entropy approach.
2. There is a lack of related work [1,2]. [1] is an important one that theoretically analyzes why ordinal feedback can work better than pair-wise ones. Meanwhile, some work has explored using tied preference to improve PbRL [2], which should
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Motor Control and Adaptation
