Accelerating RLHF Training with Reward Variance Increase
Zonglin Yang, Zhexuan Gu, Houduo Qi, Yancheng Yuan

TL;DR
This paper introduces a reward adjustment method that increases reward variance to accelerate RLHF training, integrating it into GRPO to improve efficiency in training large language models aligned with human preferences.
Contribution
The paper proposes a novel reward adjustment model with an efficient algorithm to increase reward variance, enhancing the GRPO algorithm for faster RLHF training.
Findings
GRPOVI (GRPO with reward variance increase) significantly improves RLHF training efficiency
The reward adjustment method preserves relative preferences
The algorithm explicitly characterizes extreme points of the feasible set
Abstract
Reinforcement learning from human feedback (RLHF) is an essential technique for ensuring that large language models (LLMs) are aligned with human values and preferences during the post-training phase. As an effective RLHF approach, group relative policy optimization (GRPO) has demonstrated success in many LLM-based applications. However, efficient GRPO-based RLHF training remains a challenge. Recent studies reveal that a higher reward variance of the initial policy model leads to faster RLHF training. Inspired by this finding, we propose a practical reward adjustment model to accelerate RLHF training by provably increasing the reward variance and preserving the relative preferences and reward expectation. Our reward adjustment method inherently poses a nonconvex optimization problem, which is NP-hard to solve in general. To overcome the computational challenges, we design a novel $O(n…
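The three properties named in the abstract (higher reward variance, unchanged reward expectation, unchanged relative preferences) can be illustrated with a toy transform. The sketch below is not the paper's reward adjustment model or its algorithm, and it ignores any constraints that model imposes; it only scales each reward's deviation from the policy-weighted mean by a factor greater than 1. The function names, the `scale` parameter, and the sample rewards and probabilities are all hypothetical.

```python
def weighted_mean_var(rewards, probs):
    """Mean and variance of rewards under the response probabilities `probs`."""
    mean = sum(p * r for p, r in zip(probs, rewards))
    var = sum(p * (r - mean) ** 2 for p, r in zip(probs, rewards))
    return mean, var


def stretch_rewards(rewards, probs, scale=1.5):
    """Toy adjustment (an assumption, not the paper's method).

    Scales every deviation from the weighted mean by `scale`. For scale > 1
    this keeps the weighted mean, keeps the ordering of rewards, and
    multiplies the weighted variance by scale**2.
    """
    mean, _ = weighted_mean_var(rewards, probs)
    return [mean + scale * (r - mean) for r in rewards]


def ranking(values):
    """Indices of `values` sorted from lowest to highest."""
    return sorted(range(len(values)), key=values.__getitem__)


# Hypothetical rewards for four sampled responses and their policy probabilities.
rewards = [0.2, 0.5, 0.9, 0.4]
probs = [0.1, 0.4, 0.3, 0.2]

adjusted = stretch_rewards(rewards, probs, scale=1.5)
mean_before, var_before = weighted_mean_var(rewards, probs)
mean_after, var_after = weighted_mean_var(adjusted, probs)
```

Comparing `mean_before` with `mean_after`, `var_before` with `var_after`, and `ranking(rewards)` with `ranking(adjusted)` shows the expectation and the preference order unchanged while the variance grows. Unbounded scaling of this kind would be trivial to solve, so the difficulty the abstract describes presumably comes from restrictions this sketch leaves out.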
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
