Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
Raffaele Pisano, Roberto Navigli

TL;DR
This paper presents a scalable method for generating large, precise step-level reward datasets for language models using planning problems in PDDL, improving reasoning across domains.
Contribution
The paper introduces a scalable approach for building large PRM datasets from planning problems expressed in PDDL, and uses the resulting data to train PRMs that better evaluate language model reasoning.
Findings
Augmenting widely used PRM training datasets with PDDL-derived data improves reasoning performance across domains.
The method generates a corpus of approximately one million reasoning steps across various PDDL domains.
Planning problems serve as an effective resource for scalable PRM dataset creation.
Abstract
Process Reward Models (PRMs) have emerged as a powerful tool for providing step-level feedback when evaluating the reasoning of Large Language Models (LLMs), which frequently produce chains of thought (CoTs) containing errors even when the final answer is correct. However, existing PRM datasets remain expensive to construct, prone to annotation errors, and predominantly limited to the mathematical domain. This work introduces a novel and scalable approach to PRM dataset generation based on planning logical problems expressed in the Planning Domain Definition Language (PDDL). Using this method, we generate a corpus of approximately one million reasoning steps across various PDDL domains and use it to train PRMs. Experimental results show that augmenting widely-used PRM training datasets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical…
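The excerpt above does not describe the labeling pipeline itself, so the following is only a minimal sketch of why planning problems lend themselves to precise step-level labels: each action in a candidate plan can be checked mechanically against the current state. It assumes a simplified STRIPS-style encoding (preconditions, add effects, delete effects) rather than real PDDL parsing. The `Action` class, the `label_steps` function, the toy block-stacking facts, and the convention that every step after the first invalid one is labeled 0 are all hypothetical choices made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action: facts required, added, and removed."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def label_steps(initial_state, plan, actions):
    """Return (step, label) pairs: 1 if the step is applicable, else 0.

    Assumption for this sketch: once a step fails, the state is no longer
    well defined, so all later steps are also labeled 0.
    """
    state = set(initial_state)
    failed = False
    labels = []
    for name in plan:
        action = actions.get(name)
        ok = (
            not failed
            and action is not None
            and action.preconditions <= state  # all preconditions hold
        )
        if ok:
            # Apply the action: remove deleted facts, then add new ones.
            state = (state - action.del_effects) | action.add_effects
        else:
            failed = True
        labels.append((name, 1 if ok else 0))
    return labels


# Toy two-block domain (illustrative facts, not taken from the paper).
ACTIONS = {
    "pickup-a": Action(
        name="pickup-a",
        preconditions=frozenset({"clear-a", "ontable-a", "handempty"}),
        add_effects=frozenset({"holding-a"}),
        del_effects=frozenset({"clear-a", "ontable-a", "handempty"}),
    ),
    "stack-a-b": Action(
        name="stack-a-b",
        preconditions=frozenset({"holding-a", "clear-b"}),
        add_effects=frozenset({"on-a-b", "clear-a", "handempty"}),
        del_effects=frozenset({"holding-a", "clear-b"}),
    ),
}
INITIAL = {"clear-a", "ontable-a", "clear-b", "ontable-b", "handempty"}

valid_labels = label_steps(INITIAL, ["pickup-a", "stack-a-b"], ACTIONS)
invalid_labels = label_steps(INITIAL, ["stack-a-b", "pickup-a"], ACTIONS)
```

In this sketch the labels come from simulating the plan rather than from human annotation, which is the property that makes such data cheap to scale; a real pipeline would additionally need a PDDL parser and a way to render steps as natural-language reasoning.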
