Learning Guarantee of Reward Modeling Using Deep Neural Networks
Yuanhang Luo, Yeheng Ge, Ruijian Han, Guohao Shen

TL;DR
This paper provides a theoretical analysis of reward modeling with deep neural networks, establishing regret bounds that depend on network architecture and emphasizing the importance of clear human beliefs for efficient learning.
Contribution
It introduces a non-asymptotic regret bound for deep reward estimators and a margin-type condition that sharpens this bound and helps explain the empirical success of reinforcement learning from human feedback.
Findings
Regret bounds depend explicitly on network architecture.
A margin condition sharpens the regret bound and highlights the importance of clear human beliefs.
High-quality pairwise data enhances learning efficiency.
Abstract
In this work, we study the learning theory of reward modeling with pairwise comparison data using deep neural networks. We establish a novel non-asymptotic regret bound for deep reward estimators in a non-parametric setting, which depends explicitly on the network architecture. Furthermore, to underscore the critical importance of clear human beliefs, we introduce a margin-type condition that assumes the conditional winning probability of the optimal action in pairwise comparisons is significantly distanced from 1/2. This condition enables a sharper regret bound, which substantiates the empirical efficiency of Reinforcement Learning from Human Feedback and highlights clear human beliefs in its success. Notably, this improvement stems from high-quality pairwise comparison data implied by the margin-type condition, is independent of the specific estimators used, and thus applies to…
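As a rough illustration of the quantities the abstract refers to, the sketch below computes a pairwise-comparison likelihood, the share of comparisons whose winning probability is close to 1/2, and the regret of acting greedily on an estimated reward. It is not the paper's estimator: the Bradley-Terry (logistic) link, the function names, and the toy reward values are all assumptions made for illustration, and the paper's setting is described as more general.

```python
import numpy as np

def bt_win_prob(r_a, r_b):
    """Probability that action a beats action b under an assumed Bradley-Terry link."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(r_a) - np.asarray(r_b))))

def pairwise_nll(r_a, r_b, a_wins):
    """Average negative log-likelihood of observed outcomes (a_wins is 1 if a won, else 0)."""
    p = np.clip(bt_win_prob(r_a, r_b), 1e-12, 1 - 1e-12)
    a_wins = np.asarray(a_wins, dtype=float)
    return float(-np.mean(a_wins * np.log(p) + (1 - a_wins) * np.log(1 - p)))

def fraction_near_half(win_probs, t):
    """Share of comparisons whose winning probability lies within t of 1/2.

    A margin-type condition asks this share to be small: most comparisons are clear-cut.
    """
    return float(np.mean(np.abs(np.asarray(win_probs) - 0.5) <= t))

def regret(true_rewards, est_rewards):
    """Best true reward minus the true reward of the action chosen greedily by the estimate."""
    true_rewards = np.asarray(true_rewards)
    return float(np.max(true_rewards) - true_rewards[np.argmax(est_rewards)])

# Hypothetical toy values: three actions, with an estimate that ranks the wrong action first.
true_r = np.array([0.0, 1.0, 3.0])
est_r = np.array([0.2, 1.5, 1.4])
toy_regret = regret(true_r, est_r)
```

In this toy case the estimate picks the second action, so the regret is the gap between the best true reward and that action's true reward. Comparisons with winning probability near 1/2 carry little information about which action is better, which is the intuition behind asking for a margin.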
Peer Reviews
Decision·Submitted to ICLR 2025
The paper is clearly written and easy to follow. That includes: 1. clear illustrations of the relationship between regret, the functional error of the reward model, and the maximum-likelihood solution; 2. clear statements of the assumptions; 3. clear proofs. The paper provides a good characterization of the reward signal distribution that goes beyond BT models.
1. The assumptions in the paper may be restrictive, namely the realizability of the reward model (the existence of an underlying model) and the data coverage assumption (the second-smallest eigenvalue of the data coverage Laplacian). These do not match current practice, where the signal is sparse and there may be no true reward model. 2. Many proofs in the paper seem standard in the literature, especially the generalization analysis of Hölder-smooth neural networks. That reduces the
1. The paper studies the theory of reward modeling, which is a very important question for LLMs. 2. The paper provides a regret bound for neural network structures that are used in practice. 3. The paper introduces the margin condition, which quantifies the confidence of human preferences without relying on the underlying reward model. The paper then obtains a sharper regret bound given the margin condition.
1. Although the theory looks solid, the paper offers few surprises or new insights. 2. Arguably, the most important contribution of this paper is the margin condition, which quantifies the confidence of human preferences. However, an empirical verification of this assumption on real-world datasets is missing.
1. Theoretical results are “fine-grained,” as they consider specific neural network structures rather than relying on generalized assumptions about network properties. 2. Moreover, the paper addresses both stochastic and approximation error bounds, and provides guidance on achieving an optimal balance between these by designing the width and depth of DNNs. This is a nice attempt to bridge the theory to real-world model design.
1. The claimed extension (lines 378-381) from DNNs to state-of-the-art architectures (such as BERT or GPT) is not fully convincing. While functionally similar, these architectures differ significantly in pipelines, loss design, training methods, and especially in their use of attention mechanisms and transformer layers, which are not addressed in your analysis. Consequently, I believe the gap between these theoretical results and the practical guidance needed for fine-tuning in RLHF remains signi
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Recommender Systems and Techniques
