From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling
Zhengyu Chen, Yudong Wang, Teng Xiao, Ruochen Zhou, Xuesheng Yang, Wei Wang, Zhifang Sui, Jingang Wang

TL;DR
This paper explores how Process Reward Models (PRMs) can be scaled and trained for better reasoning in large language models, analyzing their efficiency, generalization, and test-time strategies across diverse datasets.
Contribution
The paper analyzes the training, scalability, and generalization of PRMs, and offers insights into test-time scaling and cross-domain robustness.
Findings
PRM performance shows diminishing returns as PRM scale increases.
Diverse training data improves PRM accuracy and efficiency.
Monte Carlo Tree Search is most effective for test-time scaling.
Abstract
Recent advancements in improving the reasoning capabilities of Large Language Models have underscored the efficacy of Process Reward Models (PRMs) in addressing intermediate errors through structured feedback mechanisms. This study analyzes PRMs from multiple perspectives, including training methodologies, scalability, and generalization capabilities. We investigate the interplay between pre-training and reward model training FLOPs to assess their influence on PRM efficiency and accuracy in complex reasoning tasks. Our analysis reveals a pattern of diminishing returns in performance with increasing PRM scale, highlighting the importance of balancing model size and computational cost. Furthermore, the diversity of training datasets significantly impacts PRM performance, emphasizing the importance of diverse data to enhance both accuracy and efficiency. We further examine test-time…
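To make the idea of step-level feedback concrete, the following is a minimal sketch of how a PRM could be used to rerank candidate solutions at test time. It is an illustration under stated assumptions, not the authors' method: the names `score_solution`, `select_best`, and `toy_prm` are hypothetical, the min-aggregation rule is one common choice among several, and `toy_prm` is a stand-in heuristic rather than a trained reward model.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a PRM takes the question and the steps produced so far,
# and returns a score in [0, 1] for the most recent step.
StepScorer = Callable[[str, Sequence[str]], float]


def score_solution(question: str, steps: Sequence[str], prm: StepScorer) -> float:
    """Score a full solution as the minimum of its per-step scores.

    Min-aggregation is an assumption made for this sketch; it treats a
    reasoning chain as only as reliable as its weakest step.
    """
    if not steps:
        return 0.0
    return min(prm(question, steps[: i + 1]) for i in range(len(steps)))


def select_best(question: str, candidates: List[Sequence[str]], prm: StepScorer) -> int:
    """Return the index of the candidate solution with the highest aggregated score."""
    return max(
        range(len(candidates)),
        key=lambda i: score_solution(question, candidates[i], prm),
    )


def toy_prm(question: str, steps_so_far: Sequence[str]) -> float:
    """Stand-in scorer (not a trained model): flags steps containing '???'."""
    return 0.1 if "???" in steps_so_far[-1] else 0.9


candidates = [
    ["expand the product", "??? guess the answer", "state the result"],
    ["expand the product", "collect like terms"],
]
best_index = select_best("Simplify (x + 1)(x - 1).", candidates, toy_prm)
```

In this sketch a single low-scoring intermediate step is enough to demote an otherwise plausible candidate, which is the kind of intermediate-error handling the abstract attributes to PRMs. Search-based strategies such as Monte Carlo Tree Search would call the scorer during generation rather than only after it.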
Taxonomy
Topics: Software Engineering Research · Evolutionary Algorithms and Applications · AI-based Problem Solving and Planning
