Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, Jun Xiao

TL;DR
This paper introduces Cooper, a reinforcement learning framework that jointly optimizes policy and reward models to improve robustness and performance in large language models, effectively reducing reward hacking.
Contribution
It proposes a novel co-optimization approach for policy and reward models, along with hybrid annotation and reference-based reward paradigms, enhancing RL training for LLMs.
Findings
Cooper reduces reward hacking in RL for LLMs.
VerifyRM achieves higher accuracy on VerifyBench.
Cooper improves end-to-end RL performance, e.g., 0.54% accuracy gain.
Abstract
Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper(Co-optimizing Policy Model and Reward Model), a RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
