Accelerating RLHF Training with Reward Variance Increase

Zonglin Yang; Zhexuan Gu; Houduo Qi; Yancheng Yuan

arXiv:2505.23247 · cs.LG · June 18, 2025

TL;DR

This paper introduces a reward adjustment method that increases reward variance to accelerate RLHF training, integrating it into GRPO to improve efficiency in training large language models aligned with human preferences.

Contribution

The paper proposes a novel reward adjustment model with an efficient algorithm to increase reward variance, enhancing the GRPO algorithm for faster RLHF training.

Findings

1. GRPOVI (GRPO with reward variance increase) significantly improves RLHF training efficiency.

2. The reward adjustment method preserves relative preferences.

3. The algorithm explicitly characterizes the extreme points of the feasible set.
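To make the stated properties concrete, here is a minimal sketch of a reward adjustment that raises the variance within a group of rewards while keeping their ranking and mean unchanged (the abstract also lists reward expectation as preserved). It is not the paper's algorithm: it swaps in a plain affine stretch around the group mean, assumes uniform weights over the group, and ignores any bounds on reward values that a practical method would need to respect. The names `stretch_rewards`, `ranking`, and `alpha` are hypothetical.

```python
from statistics import fmean, pvariance


def stretch_rewards(rewards, alpha=2.0):
    """Illustrative only (not the paper's method): scale each reward's
    deviation from the group mean by alpha >= 1."""
    if alpha < 1.0:
        raise ValueError("alpha must be >= 1, otherwise the variance shrinks")
    mean = fmean(rewards)
    # An affine map with positive slope keeps the order of the rewards,
    # and centering on the mean keeps the (uniformly weighted) mean fixed.
    return [mean + alpha * (r - mean) for r in rewards]


def ranking(values):
    """Indices of the values sorted from lowest to highest."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical rewards for one group of sampled responses.
rewards = [0.2, 0.5, 0.4, 0.9]
adjusted = stretch_rewards(rewards, alpha=2.0)

print("mean:    ", fmean(rewards), "->", fmean(adjusted))
print("variance:", pvariance(rewards), "->", pvariance(adjusted))
print("ranking preserved:", ranking(rewards) == ranking(adjusted))
```

With this stretch the variance grows by a factor of `alpha ** 2`, but the adjusted rewards can leave the original range (here one value drops below zero), which hints at why a constrained formulation is the harder and more interesting problem.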

Abstract

Reinforcement learning from human feedback (RLHF) is an essential technique for ensuring that large language models (LLMs) are aligned with human values and preferences during the post-training phase. As an effective RLHF approach, group relative policy optimization (GRPO) has demonstrated success in many LLM-based applications. However, efficient GRPO-based RLHF training remains a challenge. Recent studies reveal that a higher reward variance of the initial policy model leads to faster RLHF training. Inspired by this finding, we propose a practical reward adjustment model to accelerate RLHF training by provably increasing the reward variance and preserving the relative preferences and reward expectation. Our reward adjustment method inherently poses a nonconvex optimization problem, which is NP-hard to solve in general. To overcome the computational challenges, we design a novel $O(n…

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques