Towards a Theoretical Understanding to the Generalization of RLHF

Zhaochun Li (1,2); Mingyang Yi (3); Yue Wang (2); Shisheng Cui (1); Yong Liu (3) ((1) Beijing Institute of Technology; (2) Zhongguancun Academy; (3) Renmin University of China)

arXiv:2601.16403 · cs.LG · January 26, 2026

TL;DR

This paper develops a theoretical framework to understand the generalization capabilities of Reinforcement Learning from Human Feedback (RLHF) in training large language models, focusing on linear reward models and end-to-end learning.

Contribution

It introduces a new generalization theory for RLHF based on algorithmic stability, applicable to gradient-based algorithms, and provides bounds under a feature coverage condition.

Findings

01

Empirical optima of policy models have a generalization bound of order O(n^{-1/2})

02

Results extend to parameters obtained by gradient-based learning algorithms such as gradient ascent (GA) and stochastic gradient ascent (SGA)

03

Provides theoretical support for observed generalization in LLMs after RLHF
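
To make the rate in finding 01 concrete, here is a minimal sketch of what an O(n^{-1/2}) bound implies for sample size. The constant `c` and the function names `rate_bound` and `samples_needed` are hypothetical and used only for illustration; the paper's actual constants and conditions are not reproduced here.

```python
import math


def rate_bound(n: int, c: float = 1.0) -> float:
    """Value of an illustrative bound c / sqrt(n) for n samples.

    `c` is a hypothetical constant standing in for every problem-dependent
    factor hidden by the O(.) notation.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return c / math.sqrt(n)


def samples_needed(eps: float, c: float = 1.0) -> int:
    """Smallest n for which c / sqrt(n) <= eps, i.e. n >= (c / eps)^2."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return math.ceil((c / eps) ** 2)


if __name__ == "__main__":
    # Quadrupling the number of samples halves the illustrative bound.
    for n in (100, 400, 1600):
        print(n, rate_bound(n))
    # Halving the target gap requires four times as many samples.
    print(samples_needed(0.5), samples_needed(0.25))
```

Under this rate, quadrupling the number of preference samples halves the bound, and halving the target gap requires four times as many samples.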

Abstract

Reinforcement Learning from Human Feedback (RLHF) and its variants have emerged as the dominant approaches for aligning Large Language Models (LLMs) with human intent. While empirically effective, the theoretical generalization properties of these methods in high-dimensional settings remain to be explored. To this end, we build a generalization theory for RLHF of LLMs under the linear reward model, through the framework of algorithmic stability. In contrast to existing works built upon the consistency of maximum likelihood estimation of the reward model, our analysis is presented under an end-to-end learning framework, which is consistent with practice. Concretely, we prove that under a key feature coverage condition, the empirical optima of the policy model have a generalization bound of order $O(n^{-1/2})$. Moreover, the results can be extrapolated to parameters…

Taxonomy

Topics: Machine Learning and Algorithms · Topic Modeling · Natural Language Processing Techniques