What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning
Yiran Ma, Zui Chen, Tianqiao Liu, Mi Tian, Zhuo Liu, Zitao Liu, Weiqi Luo

TL;DR
This paper investigates what step-level reward models (SRMs) actually reward in mathematical reasoning. It finds that natural language descriptions of the thought process matter far less than logical coherence in mathematical language, a result that can guide more efficient SRM development.
Contribution
It uncovers the counterintuitive finding that removing natural language descriptions has minimal impact on SRM performance, and it identifies logical coherence in mathematical language as what SRMs assess well.
Findings
Removing natural language descriptions has minimal impact on SRM performance.
SRMs excel at assessing logical coherence in mathematical language.
SRMs struggle with natural language understanding.
Abstract
Step-level reward models (SRMs) can significantly enhance mathematical reasoning performance through process supervision or step-level preference alignment based on reinforcement learning. The performance of SRMs is pivotal, as they serve as critical guidelines, ensuring that each step in the reasoning process is aligned with desired outcomes. Recently, AlphaZero-like methods, where Monte Carlo Tree Search (MCTS) is employed for automatic step-level preference annotation, have proven particularly effective. However, the precise mechanisms behind the success of SRMs remain largely unexplored. To address this gap, this study delves into the counterintuitive aspects of SRMs, particularly focusing on MCTS-based approaches. Our findings reveal that the removal of natural language descriptions of thought processes has minimal impact on the efficacy of SRMs. Furthermore, we demonstrate that…
Taxonomy
Topics: Diverse Scientific and Economic Studies
