Policy Filtration for RLHF to Mitigate Noise in Reward Models

Chuheng Zhang; Wei Shen; Li Zhao; Xuyun Zhang; Xiaolong Xu; Wanchun Dou; Jiang Bian

arXiv:2409.06957·cs.LG·June 10, 2025

TL;DR

This paper introduces Policy Filtration for Proximal Policy Optimization (PF-PPO), a method that filters out samples whose rewards may be unreliable during RLHF, improving the signal-to-noise ratio of policy learning and achieving state-of-the-art results in code generation and math reasoning tasks.

Contribution

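The paper proposes a policy filtration strategy for RLHF training, chosen using the coefficient of determination (R²) between rewards and actual scores on the filtered samples, and reports significantly improved performance on complex reasoning benchmarks.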

Findings

01 PF-PPO achieves state-of-the-art results on HumanEval, MBPP, and LeetCode.

02 Filtering unreliable rewards improves policy learning in code and math tasks.

03 Extensive experiments validate the effectiveness of PF-PPO across multiple benchmarks.

Abstract

While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One major challenge of RLHF is the inaccuracy of the intermediate reward model, especially in tasks that require complex reasoning for the reward model to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter out the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtering strategy, we use the coefficient of determination (R²) between the rewards and actual scores on filtered samples as the metric to help us…
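To make the selection criterion in the abstract concrete, the following is a minimal sketch of how R² between reward-model rewards and actual scores could be computed on a filtered subset of samples. It is an illustration under stated assumptions, not the authors' implementation: the function names (`r_squared`, `keep_top_fraction`, `filtered_r_squared`) are hypothetical, the "keep the highest-reward fraction" filter is only one possible strategy and is not taken from the paper, and R² is computed here as the squared Pearson correlation (the R² of a least-squares line fit of scores on rewards).

```python
from typing import Callable, List, Sequence


def r_squared(rewards: Sequence[float], scores: Sequence[float]) -> float:
    """R^2 of a least-squares line fit of scores on rewards.

    For a one-variable linear fit this equals the squared Pearson correlation.
    Assumption: this is the notion of R^2 intended; the source only names the metric.
    """
    n = len(rewards)
    if n != len(scores) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_r = sum(rewards) / n
    mean_s = sum(scores) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(rewards, scores))
    var_r = sum((r - mean_r) ** 2 for r in rewards)
    var_s = sum((s - mean_s) ** 2 for s in scores)
    if var_r == 0 or var_s == 0:
        return 0.0  # a constant series carries no linear signal
    return (cov * cov) / (var_r * var_s)


def keep_top_fraction(rewards: Sequence[float], fraction: float) -> List[int]:
    """Hypothetical filter: indices of the highest-reward samples."""
    k = max(2, int(len(rewards) * fraction))
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    return sorted(order[:k])


def filtered_r_squared(
    rewards: Sequence[float],
    scores: Sequence[float],
    keep: Callable[[Sequence[float]], List[int]],
) -> float:
    """Apply a filter to the samples, then measure R^2 on what remains."""
    idx = keep(rewards)
    return r_squared([rewards[i] for i in idx], [scores[i] for i in idx])


# Toy data (made up for illustration): low rewards are noisy, high rewards are reliable.
toy_rewards = [0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 0.9, 1.0]
toy_scores = [1.0, 0.0, 1.0, 0.0, 0.7, 0.8, 0.9, 1.0]

r2_all = r_squared(toy_rewards, toy_scores)
r2_top = filtered_r_squared(
    toy_rewards, toy_scores, lambda r: keep_top_fraction(r, 0.5)
)
```

On this toy data the filtered subset has a higher R² than the full set, which is the kind of comparison one could use to rank candidate filtering strategies before running policy optimization.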

Code & Models

Repositories

swtheing/pf-ppo-rlhf (PyTorch, official)

Taxonomy

Topics: Advancements in Photolithography Techniques · Real-time simulation and control systems · VLSI and Analog Circuit Testing