GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

Zhouhao Sun; Xuan Zhang; Xiao Ding; Bibo Cai; Li Du; Kai Xiong; Xinran Dai; Fei Zhang; weidi tang; Zhiyuan Kan; Yang Zhao; Bing Qin; Ting Liu

arXiv:2605.01203·cs.AI·May 8, 2026

TL;DR

GR-Ben is a process-level benchmark for evaluating process reward models (PRMs) on science and logic reasoning. It reveals limitations in the error detection capabilities of PRMs and LLMs beyond mathematical reasoning.

Contribution

Introduces GR-Ben, a process-level benchmark for assessing PRMs in two reasoning domains (science and logic), addressing the lack of evaluations beyond mathematical reasoning.

Findings

01

PRMs and LLMs both perform worse in non-mathematical reasoning domains.

02

PRMs struggle more with knowledge-based errors, whereas LLMs struggle more with computational errors.

03

The experiments cover 22 models, including both PRMs and LLMs, on a benchmark spanning two reasoning domains and nine subdomains.

Abstract

Currently, process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs are required to possess capabilities for detecting process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning, thereby failing to comprehensively evaluate the error detection ability of PRMs across diverse reasoning scenarios. To mitigate this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRMs' performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond…
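
To make "process-level error detection" concrete, here is a minimal sketch of one way a PRM's step scores could be turned into an error prediction and compared against annotations. The text above does not state GR-Ben's scoring protocol, so everything here is an assumption for illustration: the function names `first_flagged_step` and `detection_accuracy`, the 0.5 threshold, the exact-match metric, and the toy scores are all hypothetical.

```python
from typing import List, Optional, Sequence


def first_flagged_step(step_scores: Sequence[float],
                       threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below `threshold`.

    Returns None when no step is flagged, i.e. the solution is judged
    error-free. The threshold value is an illustrative assumption.
    """
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


def detection_accuracy(predictions: Sequence[Optional[int]],
                       labels: Sequence[Optional[int]]) -> float:
    """Fraction of solutions whose predicted first-error index matches the label.

    A label of None means the annotated solution contains no erroneous step.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical per-step scores for three solutions (not data from the paper).
scores: List[List[float]] = [
    [0.9, 0.8, 0.3, 0.7],  # first low score at step 2
    [0.95, 0.9, 0.85],     # no step falls below the threshold
    [0.4, 0.9],            # first low score at step 0
]
# Hypothetical annotated first-error indices (None = no error).
gold: List[Optional[int]] = [2, None, 1]

preds = [first_flagged_step(s) for s in scores]
print(preds)                            # [2, None, 0]
print(detection_accuracy(preds, gold))  # two of three solutions match
```

Exact match on the first erroneous step is a strict criterion: the third solution counts as a miss even though the model did flag an error, only at the wrong step. A benchmark could equally report per-step or per-error-type metrics.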
