Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

Kenton Tang; Yuzhu Chen; Fengxiang He

arXiv:2602.21765·cs.LG·February 26, 2026

Open Access

TL;DR

This paper develops a theoretical framework for how reinforcement learning from human feedback (RLHF) generalises under reward shift and clipped KL regularisation, yielding generalisation bounds and practical guidance for training.

Contribution

The paper introduces the first generalisation bounds for RLHF that explicitly incorporate the reward shift error and the KL clipping error, with implications for choosing the clipping threshold and allocating the training budget.

Findings

01 The generalisation error decomposes into a sampling error from prompts and rollouts, a reward shift error, and a KL clipping error.

02 An optimal KL clipping threshold can be derived from the theory.

03 The theory gives guidelines for allocating budget across prompts, rollouts, and preference data.

Abstract

Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, theoretical understanding of its generalisability remains premature, especially when the learned reward could shift, and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) reward shift: reward models are trained on preference data from earlier or mixed behaviour policies while RLHF optimises the current policy on its own rollouts; and (2) clipped KL regularisation: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, resulting in an error to RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error from prompts and rollouts, a reward shift error, and a KL…
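To make the clipped KL regulariser described in the abstract concrete, the sketch below estimates the KL term from sampled log-probability ratios and clips each ratio before averaging. It is an illustration under stated assumptions, not the paper's formulation: the symmetric per-sample clipping rule, the function names `clipped_kl_estimate` and `regularised_objective`, and the parameters `clip` and `beta` are all hypothetical choices made for this example.

```python
import math


def clipped_kl_estimate(logp_policy, logp_ref, clip):
    """Monte Carlo KL estimate from sampled log-probability ratios.

    Each ratio log pi(y|x) - log pi_ref(y|x) is clipped to [-clip, clip]
    before averaging. Symmetric per-sample clipping is an assumption of
    this sketch; the paper may clip differently.
    """
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need two non-empty sequences of equal length")
    ratios = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    clipped = [max(-clip, min(clip, r)) for r in ratios]
    return sum(clipped) / len(clipped)


def regularised_objective(rewards, logp_policy, logp_ref, beta, clip):
    """Empirical KL-regularised objective: mean reward minus beta times the clipped KL estimate."""
    mean_reward = sum(rewards) / len(rewards)
    return mean_reward - beta * clipped_kl_estimate(logp_policy, logp_ref, clip)


# Toy rollouts (made-up numbers): the third sample has a large log-ratio.
logp_policy = [-1.0, -2.0, -0.5]
logp_ref = [-1.5, -2.0, -4.5]
rewards = [1.0, 2.0, 3.0]

kl_unclipped = clipped_kl_estimate(logp_policy, logp_ref, clip=math.inf)
kl_clipped = clipped_kl_estimate(logp_policy, logp_ref, clip=1.0)
objective = regularised_objective(rewards, logp_policy, logp_ref, beta=0.1, clip=1.0)
```

In this toy case the outlying ratio of 4.0 is clipped to 1.0, so the estimate drops from 1.5 to 0.5. The gap between the two values is an instance of the kind of clipping error the abstract refers to: clipping stabilises the estimate at the cost of a bias in the regulariser.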

Taxonomy

Topics: Reinforcement Learning in Robotics · Speech and dialogue systems · Domain Adaptation and Few-Shot Learning