Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback
Amirhossein Afsharrad, Ruida Zhou, Luca Viano, Sanjay Lall, Mohammad Ghavamzadeh

TL;DR
This paper introduces a mathematically grounded framework for reward modeling using ordinal Likert scale preferences, enabling more effective use of nuanced human feedback in aligning language models.
Contribution
It formulates reward modeling as an ordinal regression problem and derives loss functions that learn thresholds directly from data, improving over heuristic methods.
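For orientation, the standard cumulative-logit (ordered-logit) model from the ordinal regression literature can be written as below. This is a sketch under assumptions: the symbols (reward difference \Delta, Likert label z, thresholds \theta_k, logistic function \sigma) and the sign convention are chosen here for illustration and may differ from the paper's own notation.

```latex
\begin{align*}
\Delta &= r(x, y_1) - r(x, y_2), \\
P(z \le k \mid \Delta) &= \sigma(\theta_k - \Delta),
  \qquad \theta_1 < \theta_2 < \dots < \theta_{K-1}, \\
P(z = k \mid \Delta) &= \sigma(\theta_k - \Delta) - \sigma(\theta_{k-1} - \Delta),
  \qquad \theta_0 = -\infty,\ \theta_K = +\infty .
\end{align*}
```

Under this convention, setting K = 2 and \theta_1 = 0 gives P(z = 2 \mid \Delta) = \sigma(\Delta), which is the Bradley-Terry preference probability.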
Findings
Achieves competitive or superior performance on multiple benchmarks.
Effectively captures the ordinal structure of human preferences.
Provides a principled framework for integrating Likert scale feedback.
Abstract
Reward modeling is crucial for aligning large language models with human preferences, yet current approaches lack a principled mathematical framework for leveraging ordinal preference data. When human annotators provide graded preferences on a Likert scale (e.g., significantly better, better, slightly better, negligibly better), existing methods typically apply ad-hoc heuristics, such as margin terms or scaling factors, to loss functions derived from binary preference models like Bradley-Terry. These approaches lack an underlying mathematical model for how ordinal preference data is generated. We present a theoretically grounded framework that formulates reward modeling with Likert scale preferences as a discrete ordinal regression problem. We derive two loss functions from this formulation: a negative log-likelihood loss and an all-threshold loss, both of which learn threshold…
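The two losses named in the abstract can be illustrated with a minimal sketch based on the textbook ordered-logit model and the all-threshold construction from the ordinal regression literature. It is not the authors' implementation. The function names, the 0-based labels, the logistic surrogate in the all-threshold loss, and the convention that a larger reward difference `delta` maps to a higher label are all assumptions made for this example. Thresholds are passed in as fixed numbers here, whereas the paper learns them jointly with the reward model.

```python
import math


def sigmoid(z):
    """Logistic function, written to stay finite for z = +/- infinity."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ordinal_nll(delta, label, thresholds):
    """Negative log-likelihood of `label` under an ordered-logit model.

    delta      -- reward difference r(x, y1) - r(x, y2) (hypothetical name)
    label      -- ordinal category in 0..K-1
    thresholds -- K-1 strictly increasing cut points
    """
    cuts = [-math.inf, *thresholds, math.inf]
    # P(label) = sigma(upper cut - delta) - sigma(lower cut - delta)
    prob = sigmoid(cuts[label + 1] - delta) - sigmoid(cuts[label] - delta)
    return -math.log(prob)


def all_threshold_loss(delta, label, thresholds):
    """Sum of logistic penalties, one per threshold.

    Thresholds at index >= label should lie above delta;
    thresholds at index < label should lie below it.
    """
    total = 0.0
    for k, theta in enumerate(thresholds):
        sign = 1.0 if k >= label else -1.0
        total += softplus(-sign * (theta - delta))
    return total


if __name__ == "__main__":
    cuts = [-1.0, 0.0, 1.0]  # illustrative values for a 4-level scale
    for y in range(4):
        print(y, ordinal_nll(0.3, y, cuts), all_threshold_loss(0.3, y, cuts))
```

With a single threshold fixed at 0, `ordinal_nll(delta, 1, [0.0])` equals `softplus(-delta)`, the usual Bradley-Terry loss, so the binary case is recovered as a special case of this sketch.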
Peer Reviews
Decision: ICLR 2026 Poster
1. Modeling ordinal feedback as discrete ordinal regression is natural. It lets the authors build on the existing literature and makes more sense than purely heuristic choices. 2. The theory is insightful. Theorem 3.1 justifies the choice of regularizer. Theorem 3.2 characterizes one sufficient condition for symmetry. 3. The paper is clearly written and well organized. 4. Empirical experiments show gains on several datasets/models. The NLL‑Symmetric variant is of
1. Joint learning of threshold and reward is still challenging. Scale identifiability remains under‑addressed. Regularizing thresholds cures the unbounded loss, but the joint scaling of reward head and thresholds can still be weakly identifiable. Anchoring strategies (e.g., fixing one threshold gap, adding mild L2 on reward head, or a temperature/variance parameter) and a short calibration section would improve interpretability. 2. Positioning requires more contrast. The claim of being the firs
originality: The paper introduces a new theoretical framework that applies **ordinal regression** to the reward difference for reward model training. It introduces two main variants of this objective, an ordered-logit NLL and an all-threshold (margin) loss, and also provides a practical recipe for training stably with them. quality & clarity: The theoretical analyses and proofs are quite thorough, and the experiments comprehensively cover different model backbones and benchmarks
1. Overall weakness: It would help if the authors explained the method's motivation further. Judging from the experimental results (e.g. Table 2), the proposed method does not beat the other cited methods on many of the combinations tested. If so, does this indicate that learning the ordinal relationship yields little performance gain? 2. Overall weakness: The paper introduces 3 different objective designs. Although in the t
1) It's very intuitive to incorporate a measure of 'strength' of preference into the reward modeling. The authors do a good job of creating a mathematical formulation to support this, backed by empirical results. 2) The experiments are very well done with care taken for reproducibility and robustness of takeaways. 3) The formulation reduces back down to something very much like DPO when you set two levels of preference, which has a nice 'backward compatibility' to existing methods.
1) The method assumes a consensus among annotators about different levels of ratings. Human preference datasets are already noisy, and this creates an additional source of noise (calibration among annotators), so while we do better at benchmarks, it is unclear if these reward models successfully capture the true distribution of user preferences well. 2) I think some fine-grained error analysis and qualitative examples would really help provide some context for some of the results. For example,
Taxonomy
Topics: Emotion and Mood Recognition · Mobile Crowdsensing and Crowdsourcing · Explainable Artificial Intelligence (XAI)
