Policy Filtration for RLHF to Mitigate Noise in Reward Models
Chuheng Zhang, Wei Shen, Li Zhao, Xuyun Zhang, Xiaolong Xu, Wanchun Dou, Jiang Bian

TL;DR
This paper introduces Policy Filtration for Proximal Policy Optimization (PF-PPO), a method that filters unreliable reward samples in RLHF to improve policy learning, achieving state-of-the-art results in code generation and math reasoning tasks.
Contribution
The paper proposes a reward filtering strategy for RLHF training, chosen using the coefficient of determination (R²) between rewards and actual scores on the filtered samples, and reports significantly improved performance on complex reasoning benchmarks.
Findings
PF-PPO achieves state-of-the-art results on HumanEval, MBPP, and LeetCode.
Filtering unreliable rewards improves policy learning in code and math tasks.
Extensive experiments validate the effectiveness of PF-PPO across multiple benchmarks.
Abstract
While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One major challenge of RLHF is the inaccuracy of the intermediate reward model, especially in tasks that require complex reasoning for the reward model to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter out the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtering strategy, we use the coefficient of determination (R²) between the rewards and actual scores on filtered samples as the metric to help us…
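The selection criterion described in the abstract can be illustrated with a small sketch: apply a candidate filter to a batch of sampled responses, then measure how well the reward-model rewards explain the actual scores on the samples that remain. Everything below is an assumption made for illustration rather than the paper's implementation: the function names (`r_squared`, `keep_all`, `keep_extremes`), the two candidate filters, the synthetic data, and the choice to compute R² from a least-squares line fit of actual scores on rewards.

```python
import numpy as np


def r_squared(rewards, scores):
    """R^2 of a least-squares line that predicts actual scores from rewards."""
    rewards = np.asarray(rewards, dtype=float)
    scores = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(rewards, scores, 1)
    residual = scores - (slope * rewards + intercept)
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((scores - scores.mean()) ** 2)  # must be non-zero
    return 1.0 - ss_res / ss_tot


def keep_all(rewards):
    """Hypothetical baseline filter: keep every sample."""
    return np.ones(len(rewards), dtype=bool)


def keep_extremes(rewards, frac=0.25):
    """Hypothetical filter: keep only the lowest and highest `frac` of rewards."""
    rewards = np.asarray(rewards, dtype=float)
    lo, hi = np.quantile(rewards, [frac, 1.0 - frac])
    return (rewards <= lo) | (rewards >= hi)


# Synthetic data (an assumption): the reward tracks the actual score well
# at the extremes and is noisier for mid-range rewards.
rng = np.random.default_rng(0)
rewards = rng.uniform(-1.0, 1.0, size=2000)
noise_scale = 1.0 - np.abs(rewards)
scores = rewards + noise_scale * rng.normal(size=rewards.size)

for name, make_mask in [("keep_all", keep_all), ("keep_extremes", keep_extremes)]:
    mask = make_mask(rewards)
    r2 = r_squared(rewards[mask], scores[mask])
    print(f"{name}: kept {mask.sum()} samples, R^2 = {r2:.3f}")
```

Under this reading, a filter whose retained samples give a higher R² is one on which the reward model is a more reliable proxy for the actual score, so it would be preferred when deciding which samples to use for policy learning.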
