GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
Zhouhao Sun, Xuan Zhang, Xiao Ding, Bibo Cai, Li Du, Kai Xiong, Xinran Dai, Fei Zhang, Weidi Tang, Zhiyuan Kan, Yang Zhao, Bing Qin, Ting Liu

TL;DR
GR-Ben is a comprehensive benchmark designed to evaluate process reward models across diverse reasoning domains, revealing current limitations in error detection capabilities of PRMs and LLMs beyond mathematical reasoning.
Contribution
Introduces GR-Ben, a process-level benchmark for assessing PRMs in the science and logic domains, addressing the lack of evaluation across diverse reasoning scenarios beyond mathematics.
Findings
PRMs and LLMs perform worse in non-mathematical reasoning domains.
PRMs struggle more with knowledge-based errors, whereas LLMs struggle more with computational errors.
The experiments evaluate 22 models, both PRMs and LLMs, on a benchmark spanning two reasoning domains and nine subdomains.
Abstract
Currently, process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs must be able to detect process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning, thereby failing to comprehensively evaluate the error detection ability of PRMs across diverse reasoning scenarios. To address this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRMs' performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond…
