On Designing Effective RL Reward at Training Time for LLM Reasoning

Jiaxuan Gao; Shusheng Xu; Wenjie Ye; Weilin Liu; Chuyi He; Wei Fu; Zhiyu Mei; Guangju Wang; Yi Wu

arXiv:2410.15115·cs.LG·November 28, 2024

TL;DR

This paper investigates the use of reward models during RL training of LLMs for reasoning tasks, revealing challenges like reward hacking and proposing techniques to refine rewards for improved training outcomes.

Contribution

It introduces reward refinement methods, Clipping and Delta, to prevent reward hacking and enhance RL training effectiveness for LLM reasoning capabilities.
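The two refinements can be sketched in a few lines. This page does not state the formulas, so the definitions below are assumptions drawn from the reviewers' descriptions: Clip caps each step's process reward at a threshold, and Delta rewards a step with its own score minus the next step's score. The function names, the `threshold` and `alpha` parameters, and the way the success reward is added on the final step are all hypothetical and are not taken from the paper's code.

```python
from typing import List


def clip_rewards(process_rewards: List[float], threshold: float) -> List[float]:
    """Assumed Clip: cap each step reward at `threshold`, shifted so the cap is 0."""
    return [min(r - threshold, 0.0) for r in process_rewards]


def delta_rewards(process_rewards: List[float]) -> List[float]:
    """Assumed Delta: each step gets its score minus the next step's score.

    The last step keeps its own score, so the rewards telescope and the
    return from any step onward equals that step's process reward.
    """
    n = len(process_rewards)
    return [
        process_rewards[k] - process_rewards[k + 1] if k < n - 1 else process_rewards[k]
        for k in range(n)
    ]


def clip_then_delta(process_rewards: List[float], threshold: float) -> List[float]:
    """Apply the assumed Clip first, then the assumed Delta."""
    return delta_rewards(clip_rewards(process_rewards, threshold))


def combine_with_success(step_rewards: List[float], success: bool,
                         alpha: float = 1.0) -> List[float]:
    """Hypothetical combination: scale step rewards by `alpha` and add a
    sparse success reward of 1.0 on the final step when the answer is correct."""
    combined = [alpha * r for r in step_rewards]
    if combined:
        combined[-1] += 1.0 if success else 0.0
    return combined


if __name__ == "__main__":
    prm_scores = [0.75, 0.25, 0.5]  # made-up per-step PRM scores
    refined = clip_then_delta(prm_scores, threshold=0.5)
    print(refined)
    print(combine_with_success(refined, success=True))
```

Under these assumed definitions, the summed Delta rewards equal the first step's score no matter how many steps the solution contains, so padding a solution with extra high-scoring steps cannot inflate the total.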

Findings

01

Refined reward functions improve LLM performance on reasoning benchmarks.

02

Reward models can cause reward hacking, leading to worse training outcomes.

03

Careful reward design enables RL training to enhance LLM reasoning without extra supervision.

Abstract

Reward models have become increasingly critical for improving the reasoning capability of LLMs. Existing research has shown that a well-trained reward model can substantially improve model performance at inference time via search. However, the potential of reward models during RL training remains largely under-explored. It is currently unclear whether these reward models can provide additional training signals to enhance the reasoning capabilities of LLMs in RL training that uses sparse success rewards, which verify the correctness of solutions. In this work, we evaluate popular reward models for RL training, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM), and train a collection of LLMs for math problems using RL by combining these learned rewards with success rewards. Surprisingly, even though these learned reward models have…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

I think the paper has important messages for the LLM reasoning community: 1) The message that PRMs are hackable is important and valuable. The paper also digs into what actually goes wrong when using them: an LLM trained with these PRMs leans towards steps that are correct but do not move us closer to a solution. I think this contribution is also important. 2) The paper also shows that limiting these rewards is not obvious. It is only

Weaknesses

I think the paper focuses a lot on the boost it gets from mixing clip and delta. However, I have some concerns about whether clip and delta are a generalizable approach. First, the delta mechanism seems unmotivated. There is nothing wrong with being unmotivated if it works super well, but I think the gains are modest. The delta mechanism rewards action `a_t` if the reward of action `a_{t+1}` is less, which is a very strong change to the RL environment. I understand that some interesting properti

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The proposed idea is new, interesting, and well-motivated.
- The paper is easy to read and follow.
- The addressed problem is of significance.

Weaknesses

- More details on the experimental setup could be provided for reproducibility, including the reward thresholds for the Clip and Delta mechanisms and hyperparameters for PPO.
- Smaller LLMs may offer a larger scope of improvement, and so the proposed methods may seem to have been successful. However, to confirm the advantage, experiments on larger LLMs may be necessary; e.g., the paper reports GPT-4o-2024-08-06’s performance to be 92.9 on GSM8K, which is higher than almost all other models and

Reviewer 03 · Rating 8 · Confidence 2

Strengths

- The paper is well-written, and the problem is nicely motivated.
- The empirical results are enough to assess the potential of the reward model, with some techniques for mitigating reward hacking during RL training to enhance LLM reasoning.
- I appreciate the case study shown in Fig. 2 and the others added in the appendix.

Weaknesses

## Major Comments:

- We saw in the empirical results that the improvement in the small models was higher than in the larger ones. I disagree with the authors about the importance of evaluating the study or the proposed techniques on larger models.

## Minor Comments:

- Typo in Line 057: "on the reward models, it remains **un**clear whether the reward models can provide additional training".

Taxonomy

Topics: Software Engineering Research

Methods: Sparse Evolutionary Training