Quantile Regression for Distributional Reward Models in RLHF
Nicolai Dorka

TL;DR
This paper introduces Quantile Reward Models (QRMs) for reinforcement learning from human feedback, which learn a distribution over rewards to better capture human preferences and improve policy safety.
Contribution
The paper presents a novel distributional reward modeling approach using quantile regression, enhancing preference representation and downstream RL applications.
Findings
QRM outperforms traditional point-estimate reward models on RewardBench.
Distributional estimates enable risk-aware RL.
QRM reduces negative responses in LLM policies.
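To make the risk-aware finding concrete, here is a minimal sketch of how a set of predicted reward quantiles could be turned into a risk-neutral and a risk-aware score. The function names, the `alpha` tail cutoff, the `weight` blend, and the example numbers are all illustrative assumptions; the summary does not state which risk measure the paper uses.

```python
import numpy as np

def risk_neutral_reward(quantiles):
    """Approximate the expected reward as the mean of equally spaced quantiles."""
    return float(np.mean(np.asarray(quantiles, dtype=float)))

def lower_tail_mean(quantiles, taus, alpha=0.2):
    """Average of the quantiles at levels <= alpha (a CVaR-style tail summary).

    The choice of tail measure is an assumption made for illustration.
    """
    q = np.asarray(quantiles, dtype=float)
    t = np.asarray(taus, dtype=float)
    mask = t <= alpha
    if not mask.any():
        raise ValueError("no quantile level at or below alpha")
    return float(q[mask].mean())

def risk_aware_reward(quantiles, taus, alpha=0.2, weight=0.5):
    """Blend the expected reward with the lower-tail mean (hypothetical weighting)."""
    neutral = risk_neutral_reward(quantiles)
    tail = lower_tail_mean(quantiles, taus, alpha)
    return (1.0 - weight) * neutral + weight * tail

# Hypothetical quantile estimates for two candidate responses.
taus = [0.1, 0.3, 0.5, 0.7, 0.9]
safe = [0.40, 0.45, 0.50, 0.55, 0.60]   # narrow distribution
risky = [-1.0, 0.2, 0.8, 1.2, 1.5]      # higher mean, heavy lower tail

print(risk_neutral_reward(safe), risk_neutral_reward(risky))
print(risk_aware_reward(safe, taus), risk_aware_reward(risky, taus))
```

In this toy case the risk-neutral score ranks the risky response higher, while the risk-aware score ranks the safe response higher because its lower tail is much less negative. A point-estimate reward model cannot express this distinction.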
Abstract
Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences. This distributional approach can better capture the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results…
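The quantile regression mentioned in the abstract is usually trained with the pinball (quantile) loss. The sketch below shows that standard loss in NumPy; it is not taken from the paper, and the array shapes, function name, and example values are assumptions for illustration.

```python
import numpy as np

def pinball_loss(pred_quantiles, target, taus):
    """Mean pinball loss of predicted quantiles against scalar targets.

    pred_quantiles: shape (batch, n_quantiles), one prediction per level
    target:         shape (batch,), observed reward labels
    taus:           shape (n_quantiles,), quantile levels in (0, 1)
    """
    pred = np.asarray(pred_quantiles, dtype=float)
    y = np.asarray(target, dtype=float)[:, None]    # broadcast over quantile levels
    tau = np.asarray(taus, dtype=float)[None, :]    # broadcast over the batch
    diff = y - pred
    # Under-prediction (diff > 0) costs tau * diff;
    # over-prediction (diff < 0) costs (1 - tau) * |diff|.
    loss = np.maximum(tau * diff, (tau - 1.0) * diff)
    return float(loss.mean())

# For the 0.9 quantile, predicting too low is penalized more than predicting too high.
under = pinball_loss([[0.0]], [1.0], [0.9])
over = pinball_loss([[2.0]], [1.0], [0.9])
print(under, over)
```

Minimizing this loss at several levels at once yields a set of quantile estimates that together describe the reward distribution, including distributions with more than one mode.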
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The paper is well-written and easy to follow. 2. This paper tackles a significant challenge by offering a nuanced representation of preferences, which is critical for aligning large language models (LLMs) with human values.
The results have a few issues which make evaluating the contribution difficult: 1. The results presented in Figure 3 show minimal differences between the risk-neutral and risk-aware policies, making it challenging to assert the superiority of one approach over the other. It would be beneficial if the authors could provide additional metrics or a deeper statistical analysis to highlight the distinctions between the two policies more clearly. 2. The paper presents scores in Table 1 but does not of
This research tackles the critical issue of managing disagreement in human feedback—a significant and underexplored direction. The introduction of a quantile approach to model disagreement through distributions, rather than relying solely on scalar values, is a noteworthy innovation. The proposed methodology is compelling in its ability to learn a robust and fair reward distribution by integrating a quantile regression layer with a gating network. QRMs also contribute to the development of a ri
While the methodology is intriguing, several improvements are needed to elevate the paper to a publishable level. 1) Insufficient Performance Analysis: The analysis of QRM’s performance is limited to the primary results in Table 1. Several additional experiments could help substantiate QRM’s effectiveness. For example, what specific training details pertain to the gating network? Does using a selective set of attributes (helpfulness, harmlessness, truthfulness, and complexity) affect performanc
1. Modeling preference rewards as distributions is natural and important, but current SOTA RMs ignore this property of human preferences. 2. Quantile regression seems to model reward distributions effectively. 3. The motivation of this paper is well stated.
1. Severe lack of empirical study of the proposed method. 2. No ablation study to show the efficacy of each proposed module. 3. Experimental settings are poorly stated, e.g., what model is used to evaluate rewards in Fig. 3? How are the rewards evaluated? Are they training rewards or evaluation rewards?
The topic is of interest. Turning a scalar reward into a distributional reward is a current trend.
1. Neither replacing the scalar reward with a distributional reward nor the gating mechanism is a brand-new idea [1,2]. 2. No ablation study is conducted, and the QRM ranking alone is not sufficient to validate the effectiveness of the method, since the training set can vary. 3. The experiments on the end-to-end RL process are insufficient and uninformative. A proper baseline RL experiment with a standard RM trained on the same data is missing. [1] Wu Y, Sun Z,
Taxonomy
Topics: Statistical Methods and Inference
