Evaluating Robustness of Reward Models for Mathematical Reasoning

Sunghwan Kim; Dongjin Kang; Taeyoon Kwon; Hyungjoo Chae; Jungsoo Won; Dongha Lee; Jinyoung Yeo

arXiv:2410.01729·cs.LG·October 3, 2024

Open Access 3 Reviews

TL;DR

This paper introduces RewardMATH, a new benchmark for evaluating the robustness of reward models in mathematical reasoning, addressing limitations of previous benchmarks and improving reliability in assessing reward model performance.

Contribution

We propose a novel evaluation design and construct RewardMATH, a benchmark that better captures reward model robustness in math reasoning tasks, correlating well with policy optimization outcomes.

Findings

01

RewardMATH scores strongly correlate with optimized policy results.

02

Existing benchmarks show almost no correlation with policy performance.

03

Our evaluation method effectively estimates reward overoptimization.
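Findings 01 and 02 are claims about correlation between a benchmark score and downstream policy performance, and Reviewer 02 quotes them as r-square values. The following is a minimal sketch of how such a number could be computed, assuming one benchmark score and one downstream accuracy gain per reward model. The function names are hypothetical and this is not the authors' code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def r_squared(benchmark_scores, policy_gains):
    """Squared Pearson correlation.

    benchmark_scores: one score per reward model (assumed input).
    policy_gains: downstream gain of the policy optimized with that
        reward model, in the same order (assumed input).
    """
    return pearson_r(benchmark_scores, policy_gains) ** 2
```

An r-square near 0 means that ranking reward models by the benchmark says little about which one yields the better optimized policy, while a value near 1 means the benchmark ranking is a good proxy.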

Abstract

Reward models are key in reinforcement learning from human feedback (RLHF) systems, aligning model behavior with human preferences. Particularly in the math domain, there have been plenty of studies using reward models to align policies for improving reasoning capabilities. Recently, as the importance of reward models has been emphasized, RewardBench was proposed to understand their behavior. However, we find that the math subset of RewardBench has different representations between chosen and rejected completions, and relies on a single comparison, which may lead to unreliable results as it only sees an isolated case. Therefore, it fails to accurately present the robustness of reward models, leading to a misunderstanding of their performance and potentially resulting in reward hacking. In this work, we introduce a new design for reliable evaluation of reward models, and to…
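The abstract's concern about relying on "a single comparison" can be illustrated with a small sketch. The abstract is truncated before it describes the new design, so the one-to-many rule below (the chosen solution must outscore every rejected solution) is an assumption, suggested by Reviewer 02's mention of "adopting multiple responses". The function names and the data layout are hypothetical.

```python
def pairwise_accuracy(examples):
    """Fraction of problems where the chosen solution outscores
    a single rejected solution (here, the first one listed)."""
    hits = sum(1 for chosen, rejected in examples if chosen > rejected[0])
    return hits / len(examples)


def one_to_many_accuracy(examples):
    """Fraction of problems where the chosen solution outscores
    every rejected solution (assumed one-to-many criterion)."""
    hits = sum(1 for chosen, rejected in examples if chosen > max(rejected))
    return hits / len(examples)


# Toy reward scores, made up for illustration only (not from the paper).
# Each entry: (score of the correct solution, scores of incorrect solutions).
toy_examples = [
    (0.9, [0.1, 0.95]),
    (0.8, [0.2, 0.3]),
    (0.4, [0.5, 0.1]),
]
```

On these toy scores the single-comparison metric credits the first problem even though a second incorrect solution outscores the correct one, which is the kind of isolated-case reading the abstract warns about.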

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* The topic of how to improve LLM reasoning capabilities has recently gained a lot of attention. This paper focuses on having good benchmarks for evaluating these efforts, and this could be very impactful if done correctly.
* The authors identify flaws of existing benchmarks and make good efforts to fix them.
* The paper has good results; specifically, Figure 4 is very cool, showing that RewardMATH has a stronger correlation with downstream tasks.

Weaknesses

See questions I have below

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The paper provides clear and sufficient empirical evidence that their RewardMATH benchmark is more reliable than the math subset of RewardBench [1]. The empirical results are also clear: for an LLM policy using best-of-N (BoN) sampling, reward model scores on RewardBench show little to no correlation with the performance increase on math benchmarks (r-square = 0-0.1), while RewardMATH shows a much stronger correlation (r-square = 0.6-0.8) in Figure 3.
- The authors have evaluated diverse reward models on Rewa

Weaknesses

- The work would be more interesting if the authors showed that reward model benchmarks in other domains (such as coding, text summarisation, or maybe safety) can be improved by the framework proposed here (by adopting multiple responses and using diverse LLMs to generate outputs). Any initial or limited experiments would be helpful.
- The lack of PPO (or DPO) usage for policy fine-tuning in experiments seems like a major weakness. The main contribution of this paper is using policy fine-tuning me

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. **Thoroughness**: The paper presents detailed implementations, including training hyperparameters and experimental protocols. This ensures that other researchers can accurately reproduce the experiments and validate the findings.
2. **Relevance**: This work addresses a critical gap in the field by focusing on reward model evaluation, a crucial area of research that has significant implications for the development of more reliable AI systems.
3. **Motivation**: The paper presents a compelling

Weaknesses

1. **Clarity**: The paper is generally well written; however, it has some clarity issues, especially in Section 5, which is hard to follow. Clarification questions are asked in the question section, marked with [Clarification]. The authors should address those questions.
2. **Benchmark Biases**: The paper has several biases, raising concerns about the claimed robustness and reliability. Examples and comments below:

   > Line 206: Hence, we first convert the human-annotated solutions from MATH500 int

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Multi-Criteria Decision Making · Statistical and Computational Modeling

Methods: ALIGN