What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning

Yiran Ma; Zui Chen; Tianqiao Liu; Mi Tian; Zhuo Liu; Zitao Liu; Weiqi Luo

arXiv:2412.15904·cs.AI·March 11, 2025

TL;DR

This paper investigates what step-level reward models (SRMs) actually reward in mathematical reasoning. It finds that natural language descriptions of the thought process matter less than logical coherence in mathematical language, a result that can guide more efficient SRM development.

Contribution

The paper shows, counterintuitively, that natural language descriptions of the thought process contribute little to SRM performance, and that logical coherence in mathematical language is what SRMs primarily assess.

Findings

01. Removing natural language descriptions has minimal impact on SRM performance.

02. SRMs excel at assessing logical coherence in mathematical language.

03. SRMs struggle with natural language understanding.

Abstract

Step-level reward models (SRMs) can significantly enhance mathematical reasoning performance through process supervision or step-level preference alignment based on reinforcement learning. The performance of SRMs is pivotal, as they serve as critical guidelines, ensuring that each step in the reasoning process is aligned with desired outcomes. Recently, AlphaZero-like methods, where Monte Carlo Tree Search (MCTS) is employed for automatic step-level preference annotation, have proven particularly effective. However, the precise mechanisms behind the success of SRMs remain largely unexplored. To address this gap, this study delves into the counterintuitive aspects of SRMs, particularly focusing on MCTS-based approaches. Our findings reveal that the removal of natural language descriptions of thought processes has minimal impact on the efficacy of SRMs. Furthermore, we demonstrate that…
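To make the idea of step-level preference annotation concrete, here is a minimal sketch of how sibling candidate steps with value estimates could be turned into preference pairs for training an SRM. It is an illustration under stated assumptions, not the paper's implementation: the `Candidate` class, the `preference_pairs` function, the `margin` threshold, and the example values are all hypothetical, and the values merely stand in for whatever estimate an MCTS procedure would produce.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    step: str     # text of one candidate reasoning step
    value: float  # hypothetical value estimate, e.g. an MCTS Q-value


def preference_pairs(candidates, margin=0.1):
    """Return (preferred, dispreferred) step pairs whose value gap is at least `margin`."""
    pairs = []
    for a, b in combinations(candidates, 2):
        if abs(a.value - b.value) < margin:
            continue  # values too close: skip the ambiguous pair
        winner, loser = (a, b) if a.value > b.value else (b, a)
        pairs.append((winner.step, loser.step))
    return pairs


# Hypothetical sibling steps expanded from the same partial solution.
siblings = [
    Candidate("2x = 10, so x = 5", 0.9),
    Candidate("2x = 10, so x = 20", 0.1),
    Candidate("2x = 10, so x = 10 / 2", 0.85),
]

pairs = preference_pairs(siblings, margin=0.1)
for preferred, dispreferred in pairs:
    print(f"{preferred!r} > {dispreferred!r}")
```

The margin filter is a design choice made for this sketch: pairs whose estimated values are nearly equal carry little preference signal, so they are skipped rather than labeled arbitrarily.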

Taxonomy

Topics: Diverse Scientific and Economic Studies