Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs
Tristan Cinquin, Geoff Pleiss, Agustinus Kristiadi

TL;DR
This paper investigates whether PRM-guided tree search can improve mathematical reasoning in LLMs. It finds no statistically significant improvement over chain-of-thought prompting with Best-of-N selection, an outcome attributed to unreliable reward models and the complexity of reasoning.
Contribution
The study introduces an adaptive PRM-guided tree search algorithm and systematically evaluates its effectiveness across diverse mathematical problems, revealing its limitations.
Findings
PRM-guided tree search shows no statistically significant improvement over Best-of-N (BoN) selection, despite higher costs.
Monte Carlo tree search and beam search outperform other PRM-guided tree search methods.
PRMs poorly estimate state values and degrade with reasoning depth.
Abstract
While chain-of-thought prompting with Best-of-N (BoN) selection has become popular for mathematical reasoning in large language models (LLMs), its linear structure fails to capture the branching and exploratory nature of complex problem-solving. In this work, we propose an adaptive algorithm to maximize process reward model (PRM) scores over the intractable action space, and investigate whether PRM-guided tree search can improve mathematical reasoning by exploring multiple partial solution paths. Across diverse mathematical problems using Qwen2.5-Math-7B-Instruct with its associated PRM as a case study, we find that: (1) PRM-guided tree search shows no statistically significant improvements over BoN despite higher costs, (2) Monte Carlo tree search and beam search outperform other PRM-guided tree search methods, (3) PRMs poorly approximate state values and their reliability…
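To make the contrast between the two families of methods concrete, here is a minimal sketch of Best-of-N selection next to a PRM-guided beam search. It is not the paper's adaptive algorithm or its evaluation code. The callables `propose` (a stand-in for sampling next reasoning steps from the LLM) and `prm_score` (a stand-in for the process reward model) are hypothetical interfaces assumed for illustration, as are the parameter names `n`, `width`, `expand`, and `depth`.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions for illustration, not from the paper):
#   propose(prefix, k) -> k candidate next steps given a partial solution
#   prm_score(steps)   -> scalar score for a partial or complete solution
Steps = List[str]
Propose = Callable[[Steps, int], List[str]]
Score = Callable[[Steps], float]


def best_of_n(propose: Propose, prm_score: Score, n: int, depth: int) -> Steps:
    """Build n chains independently, then keep the highest-scoring one.

    Each chain is linear: one step is drawn at a time and never revisited.
    """
    chains: List[Steps] = []
    for _ in range(n):
        chain: Steps = []
        for _ in range(depth):
            chain.append(propose(chain, 1)[0])
        chains.append(chain)
    # The scorer is consulted only once per finished chain.
    return max(chains, key=prm_score)


def beam_search(
    propose: Propose, prm_score: Score, width: int, expand: int, depth: int
) -> Steps:
    """Keep the `width` best partial solutions at every step.

    Each surviving prefix is expanded with `expand` candidate steps, and the
    scorer ranks partial solutions, so its accuracy on prefixes matters.
    """
    beams: List[Steps] = [[]]
    for _ in range(depth):
        candidates = [b + [s] for b in beams for s in propose(b, expand)]
        candidates.sort(key=prm_score, reverse=True)
        beams = candidates[:width]
    return beams[0]
```

In this sketch, Best-of-N scores only completed chains, while beam search prunes on scores of partial solutions at every level. Any error in how the scorer values intermediate states therefore feeds directly into which branches survive, which is the kind of dependence the findings above call into question.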
Taxonomy
Topics: Machine Learning in Materials Science · Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling
