Learning Ordinal Probabilistic Reward from Preferences

Longze Chen; Lu Wang; Renke Shan; Ze Gong; Run Luo; Jiaming Li; Jing Luo; Qiyao Wang; Min Yang

arXiv:2602.12660·cs.CL·March 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a probabilistic reward modeling approach that treats reward as a distribution, improving alignment of language models with human preferences through a novel ordinal discretization and data-efficient training strategy.

Contribution

The paper proposes the Probabilistic Reward Model (PRM) paradigm and its practical ordinal realization (OPRM), along with Region Flooding Tuning (RgFT), a training method intended to make scores better reflect response quality while improving data efficiency.

Findings

01. Improves reward model accuracy by 2.9% to 7.4% over prior methods.

02. Captures both relative rankings and absolute quality.

03. Demonstrates strong performance and data efficiency.

Abstract

Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient…
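The abstract describes OPRM as learning, for each response, a probability distribution over a finite set of ordinal ratings. As a rough illustration only (this is not the authors' training objective, which is cut off above), the sketch below shows how two such distributions could give both a relative comparison and an absolute score. It assumes the two ratings are independent; the function names and example probabilities are hypothetical.

```python
from typing import Sequence, Tuple


def compare_ordinal(p_a: Sequence[float], p_b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (P(a > b), P(a == b), P(a < b)) for two rating distributions.

    p_a[k] and p_b[k] are the probabilities of rating k + 1.
    Assumption: the two ratings are treated as independent random variables.
    """
    if len(p_a) != len(p_b):
        raise ValueError("both distributions must cover the same ratings")
    win = tie = lose = 0.0
    for i, mass_a in enumerate(p_a):
        for j, mass_b in enumerate(p_b):
            joint = mass_a * mass_b
            if i > j:
                win += joint
            elif i == j:
                tie += joint
            else:
                lose += joint
    return win, tie, lose


def expected_rating(p: Sequence[float]) -> float:
    """Collapse a rating distribution into one scalar (its mean rating)."""
    return sum((k + 1) * mass for k, mass in enumerate(p))


# Hypothetical distributions over the ratings 1, 2, 3.
chosen = [0.05, 0.15, 0.80]
rejected = [0.60, 0.30, 0.10]
print(compare_ordinal(chosen, rejected))
print(expected_rating(chosen), expected_rating(rejected))
```

The pairwise probabilities give the relative view and the mean rating gives an absolute one. Using the mean is only one possible way to reduce the distribution to a scalar.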

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Rating the response by an integer is not novel, but the paper proposes a good way of learning the integer's distribution by contrasting the overall score distributions of chosen and rejected responses. Empirical results show the efficiency.
2. RgFT training controls the shift of probability mass more precisely by using additional quality information ("good/normal/bad"), which makes the training more efficient.
3. Reusing the LM head probabilities over numeric tokens is easy to implement
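The reviewer's third point, reusing the LM head's probabilities over numeric tokens, can be pictured as a softmax restricted to the rating tokens. The sketch below is a minimal illustration under stated assumptions: the token set '1' to '9' is taken from Reviewer 03's description, the `logits` dictionary is a hypothetical stand-in for the LM head output at the rating position, and renormalizing over the subset is an assumed choice that the paper may handle differently.

```python
import math
from typing import Dict, List, Sequence

# Assumed rating vocabulary: the single-character tokens '1' through '9'.
RATING_TOKENS = tuple(str(k) for k in range(1, 10))


def rating_distribution(
    logits: Dict[str, float],
    rating_tokens: Sequence[str] = RATING_TOKENS,
) -> List[float]:
    """Softmax over the rating-token logits only, ignoring all other tokens."""
    selected = [logits[token] for token in rating_tokens]
    peak = max(selected)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in selected]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits: nine rating tokens plus one unrelated token.
example_logits = {str(k): 0.1 * k for k in range(1, 10)}
example_logits["the"] = 5.0  # dropped by the restriction
print(rating_distribution(example_logits))
```

Because the distribution is read from existing vocabulary logits, no extra output head is needed, which matches the "easy to implement" remark.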

Weaknesses

1. Calibration not measured. A primary motivation is "calibrated, interpretable distributions", yet no calibration metrics (e.g., ECE) are reported. This is a big question, as the authors claim that they are modeling the whole distribution. Accuracy isn't enough.
2. I'm not very satisfied with the presentation of this paper. Two major concerns (but not limited to them): (1) Important models/algorithms such as RgFT are not formally defined, which is confusing. Also, gradient u
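Reviewer 01 names ECE as an example of the missing calibration metric. For reference, here is a minimal sketch of the standard equal-width binned expected calibration error. How a confidence would be derived from a reward model (for example, the predicted probability that the chosen response beats the rejected one) is an assumption here, and the function name and inputs are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    conf_sums = [0.0] * n_bins
    hit_counts = [0] * n_bins
    counts = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        conf_sums[index] += conf
        hit_counts[index] += int(hit)
        counts[index] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, hits, count in zip(conf_sums, hit_counts, counts):
        if count == 0:
            continue
        accuracy = hits / count
        mean_conf = conf_sum / count
        ece += (count / total) * abs(accuracy - mean_conf)
    return ece
```

A value near zero means stated confidence matches observed accuracy within each bin. The result depends on the number of bins, so it is usually reported together with `n_bins`.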

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. Presents a novel training paradigm that requires minimal architectural changes to existing neural reward models.
2. The proposed method may improve calibration and robustness compared to traditional Bradley-Terry reward models.
3. Extensive experimental evaluation across diverse datasets demonstrates practical utility.

Weaknesses

1. Although framed generally, the approach appears tied to Bradley–Terry-style preference data and is primarily evaluated in pairwise preference setups. It is unclear how well the method applies to verifiable rewards (e.g., math/code with execution-based correctness) or to process-based rewards.
2. The choice of the number of ordinal bins, boundary placement, and mapping from discrete classes back to scalar rewards (for policy optimization) may significantly affect performance. Sensitivity anal

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. **Novel and Parameter-Efficient Design:** The core idea of OPRM (repurposing the LM head's vocabulary probabilities for numeric tokens as a reward distribution) is a clever and elegant approach that avoids adding any new parameters, unlike traditional scalar-based reward models that require a separate value head.
2. **Probabilistic Formulation:** Moving from a deterministic scalar to a full probability distribution is a well-motivated step. It inherently allows for modeling uncertainty and,

Weaknesses

1. **Unsupported Core Assumption:** The entire method hinges on the assumption that an LLM's pre-trained probability distribution over the tokens '1'-'9' has an inherent ordinal correspondence to text quality. This is a very strong and unevaluated claim. It is highly plausible that the model has a strong prior based on token frequency in pre-training data (e.g., '1' may be far more common than '9'), which would be unrelated to quality. The paper provides no analysis of this token prior or any e

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Explainable Artificial Intelligence (XAI)