The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Ang Lv; Ruobing Xie; Xingwu Sun; Zhanhui Kang; Rui Yan

arXiv:2505.22653 · cs.CL · May 29, 2025

TL;DR

This paper studies how robust large language models are to reward noise during reinforcement learning for reasoning. It finds that models still train well under substantial reward noise, and that rewarding reasoning patterns, without checking answer correctness, can itself improve performance.

Contribution

It demonstrates that LLMs are robust to reward noise in reasoning tasks and introduces the use of reasoning pattern rewards to improve performance without strict correctness verification.

Findings

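01. With 40% of the reward function's outputs flipped, a Qwen-2.5-7B model still reaches 72% accuracy on math tasks, against 75% with noiseless rewards.

02. Reasoning pattern rewards (RPR) improve performance without verifying answer correctness.

03. Combining RPR with noisy rewards improves performance on open-ended reasoning tasks.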
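
A reasoning pattern reward of the kind described in the findings can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name, the `per_hit` and `cap` parameters, and the choice to cap the reward are all assumptions. The phrase list contains only the one example quoted in the abstract.

```python
# Hypothetical phrase list: "first, I need" is the only phrase quoted in the abstract.
KEY_PHRASES = ("first, I need",)


def reasoning_pattern_reward(response, phrases=KEY_PHRASES, per_hit=1.0, cap=1.0):
    """Reward the appearance of key reasoning phrases in a response.

    The answer itself is never checked. `per_hit` and `cap` are
    illustrative assumptions, not values from the paper.
    """
    text = response.lower()
    # Count case-insensitive occurrences of every phrase.
    hits = sum(text.count(phrase.lower()) for phrase in phrases)
    # Cap the total so repeating a phrase cannot inflate the reward without bound.
    return min(cap, per_hit * hits)


if __name__ == "__main__":
    print(reasoning_pattern_reward("First, I need to factor the polynomial."))
    print(reasoning_pattern_reward("The answer is 4."))
```

The reward depends only on the wording of the response, so a response with a wrong final answer can still be rewarded.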

Abstract

Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need…
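
The reward-flipping setup mentioned in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: it assumes a binary reward of 1.0 for a correct answer and 0.0 otherwise, and the names `flip_reward` and `flip_prob` are hypothetical.

```python
import random


def flip_reward(correct, flip_prob, rng):
    """Return a binary correctness reward, flipped with probability `flip_prob`.

    Assumption: the clean reward is 1.0 for a correct answer and 0.0 otherwise.
    """
    reward = 1.0 if correct else 0.0
    # With probability `flip_prob`, invert the reward to simulate a noisy verifier.
    if rng.random() < flip_prob:
        reward = 1.0 - reward
    return reward


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    rewards = [flip_reward(True, 0.4, rng) for _ in range(10_000)]
    # Fraction of correct answers that were wrongly given reward 0.0.
    print(sum(r == 0.0 for r in rewards) / len(rewards))
```

With `flip_prob=0.4`, roughly 40% of rewards disagree with the true correctness label, which corresponds to the noise level quoted in the abstract.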

Code & Models

Models

🤗 AngLv/NoisyRewards-in-RL-RM-acc-65 (model)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications