Evaluating Robustness of Reward Models for Mathematical Reasoning
Sunghwan Kim, Dongjin Kang, Taeyoon Kwon, Hyungjoo Chae, Jungsoo Won, Dongha Lee, Jinyoung Yeo

TL;DR
This paper introduces RewardMATH, a new benchmark for evaluating the robustness of reward models in mathematical reasoning, addressing limitations of previous benchmarks and improving reliability in assessing reward model performance.
Contribution
We propose a novel evaluation design and construct RewardMATH, a benchmark that better captures reward model robustness in math reasoning tasks, correlating well with policy optimization outcomes.
Findings
RewardMATH scores strongly correlate with optimized policy results.
Existing benchmarks show almost no correlation with policy performance.
Our evaluation method effectively estimates reward overoptimization.
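As a rough illustration of what "correlate" means in the findings above, the sketch below computes the squared Pearson correlation between a benchmark's scores for several reward models and the accuracy of policies optimized with those models. The function name `r_squared` and the numbers in the usage example are hypothetical placeholders, not the paper's code or results.

```python
def r_squared(benchmark_scores, policy_scores):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(benchmark_scores) != len(policy_scores) or len(benchmark_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(benchmark_scores)
    mean_x = sum(benchmark_scores) / n
    mean_y = sum(policy_scores) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(benchmark_scores, policy_scores))
    var_x = sum((x - mean_x) ** 2 for x in benchmark_scores)
    var_y = sum((y - mean_y) ** 2 for y in policy_scores)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov * cov / (var_x * var_y)


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT taken from the paper.
    toy_benchmark = [10.0, 20.0, 30.0, 40.0]
    toy_policy = [41.0, 43.5, 44.0, 47.0]
    print(f"r^2 = {r_squared(toy_benchmark, toy_policy):.3f}")
```

A value near 1 means that ranking reward models by the benchmark predicts how well the optimized policies perform, while a value near 0 means the benchmark says little about downstream performance.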
Abstract
Reward models are key in reinforcement learning from human feedback (RLHF) systems, aligning model behavior with human preferences. In the math domain in particular, many studies have used reward models to align policies and improve reasoning capabilities. As the importance of reward models has grown, RewardBench was proposed to better understand their behavior. However, we find that the math subset of RewardBench uses different representations for chosen and rejected completions and relies on a single comparison, which may lead to unreliable results because it only sees an isolated case. It therefore fails to accurately reflect the robustness of reward models, leading to a misunderstanding of their performance and potentially resulting in reward hacking. In this work, we introduce a new design for reliable evaluation of reward models, and to…
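To make the concern about relying on a single comparison concrete, here is a minimal sketch that contrasts a pairwise accuracy with a stricter one-to-many accuracy. The scoring rule (a problem counts as solved only if the correct solution outscores every rejected solution), the function names, and the scores are assumptions made for illustration, not the paper's exact metric or data.

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_score, rejected_score) pairs where the chosen score is higher."""
    return sum(chosen > rejected for chosen, rejected in pairs) / len(pairs)


def one_to_many_accuracy(problems):
    """Fraction of problems where the correct solution outscores ALL rejected solutions.

    problems: list of (correct_score, [rejected_score, ...]) tuples.
    This rule is an assumed, illustrative definition of a one-to-many comparison.
    """
    return sum(
        all(correct > rejected for rejected in rejected_scores)
        for correct, rejected_scores in problems
    ) / len(problems)


if __name__ == "__main__":
    # Hypothetical reward scores for two problems; not real model outputs.
    problems = [
        (0.9, [0.1, 0.2]),  # correct solution beats both rejected solutions
        (0.5, [0.4, 0.6]),  # correct solution loses to the second rejected solution
    ]
    # A single comparison per problem looks only at the first rejected solution.
    single_pairs = [(correct, rejected[0]) for correct, rejected in problems]
    print(pairwise_accuracy(single_pairs))   # 1.0
    print(one_to_many_accuracy(problems))    # 0.5
```

In this toy case the single comparison reports a perfect score while the one-to-many view exposes a failure, which is the kind of isolated-case optimism the abstract warns about.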
Peer Reviews
Decision: Submitted to ICLR 2025
* The topic of how to improve LLM reasoning capabilities has recently gained a lot of attention. This paper focuses on building good benchmarks for evaluating these efforts, which could be very impactful if done correctly.
* The authors identify flaws in existing benchmarks and make a good effort to fix them.
* The paper has good results; Figure 4 in particular is very cool, showing that RewardMATH has a stronger correlation with downstream tasks.
See questions I have below
- The paper provides clear and sufficient empirical evidence that the RewardMATH benchmark is more reliable than the math subset of RewardBench [1]. The results are also clear: for LLM policies using best-of-N (BoN) sampling, reward model scores on RewardBench show little to no correlation with performance gains on math benchmarks (r-square = 0-0.1), while RewardMATH scores show a much stronger correlation (r-square = 0.6-0.8) in Figure 3.
- The authors have evaluated diverse reward models on Rewa
- The work would be more interesting if the authors showed that reward model benchmarks in other domains (such as coding, text summarisation, or safety) can be improved by the framework proposed here (by adopting multiple responses and using diverse LLMs to generate outputs). Any initial or limited experiments would be helpful.
- The lack of PPO (or DPO) usage for policy fine-tuning in the experiments seems like a major weakness. The main contribution of this paper is using policy fine-tuning me
1. **Thoroughness**: The paper presents detailed implementations, including training hyperparameters and experimental protocols. This ensures that other researchers can accurately reproduce the experiments and validate the findings.
2. **Relevance**: This work addresses a critical gap in the field by focusing on reward model evaluation, a crucial area of research that has significant implications for the development of more reliable AI systems.
3. **Motivation**: The paper presents a compelling
1. **Clarity**: The paper is generally well written; however, it has some clarity issues, especially in Section 5, which is hard to follow. Clarification questions are asked in the question section, marked with [Clarification]. The authors should address those questions.
2. **Benchmark Biases**: The paper has several biases, raising concerns about the claimed robustness and reliability. Examples and comments below:
> Line 206: Hence, we first convert the human-annotated solutions from MATH500 int
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Multi-Criteria Decision Making · Statistical and Computational Modeling
Methods: ALIGN
