Controllable and Verifiable Process Data Synthesis for Process Reward Models
Yinghui Chi, Lucien Wang

TL;DR
This paper introduces a controllable, verifiable framework for synthesizing process supervision data for process reward models (PRMs). It injects errors into correct symbolic reasoning trajectories and verifies them, and the resulting data is used for PRM training and evaluation on logical reasoning.
Contribution
A framework for generating process supervision data in which the location and type of each injected error are controlled and each error is verified, for use in training and evaluating reasoning models.
Findings
Synthesized data improves Best-of-8 reranking on logical reasoning benchmarks.
The method transfers effectively to mathematical reasoning tasks.
First-error localization remains more challenging than overall step classification.
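The gap between the two evaluation views in the last finding can be illustrated with a minimal sketch. The metric definitions below are assumptions made for illustration (per-step agreement for step classification, exact match on the index of the first incorrect step for first-error localization); the paper's exact definitions are not given on this page, and the function names are hypothetical.

```python
def first_error(labels):
    """Index of the first step labeled incorrect (False), or None if every step is correct."""
    return next((i for i, ok in enumerate(labels) if not ok), None)


def step_accuracy(gold, pred):
    """Fraction of steps whose predicted label agrees with the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def first_error_match(gold, pred):
    """True only if the predicted first error lands on exactly the gold first-error step."""
    return first_error(gold) == first_error(pred)


# Hypothetical five-step trajectory: the first error is at step index 2,
# but the predictor flags it one step late.
gold = [True, True, False, False, False]
pred = [True, True, True, False, False]

accuracy = step_accuracy(gold, pred)       # 4 of 5 step labels agree
localized = first_error_match(gold, pred)  # first-error indices differ (2 vs 3)
```

Under these assumed definitions, a single late flag costs only one step in the classification score but counts as a complete miss for localization, which is one plausible reason the localization task is harder.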
Abstract
Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and…
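The four stages named in the abstract (construct, inject, recompute, verify) can be sketched on a toy symbolic domain. This is a minimal illustration under stated assumptions, not the authors' implementation: the arithmetic step format, the operator-swap error, and every function name below are hypothetical stand-ins for the paper's symbolic templates and error types.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Hypothetical toy program: each step is (target, op, left, right);
# operands are earlier variable names or integer literals.
PROGRAM = [
    ("a", "+", 2, 3),
    ("b", "*", "a", 4),
    ("c", "-", "b", "a"),
]


def value(env, x):
    """Resolve an operand: look up variable names, pass literals through."""
    return env[x] if isinstance(x, str) else x


def apply_step(step, env):
    """Compute the value a step yields under the current variable bindings."""
    _, op, left, right = step
    return OPS[op](value(env, left), value(env, right))


def build_chain(program, k=None, wrong_op=None):
    """Run the program; if k is given, swap the operator at step k (the injected
    error) and let all later steps be recomputed from the corrupted state."""
    env, trace = {}, []
    for i, step in enumerate(program):
        target, op, left, right = step
        if i == k:
            step = (target, wrong_op, left, right)
        env[target] = apply_step(step, env)
        trace.append((target, env[target]))
    return trace


def is_prefix_invalid(program, trace, k):
    """True if the value claimed at step k differs from what its prefix entails."""
    env = dict(trace[:k])
    return trace[k][1] != apply_step(program[k], env)


def is_consistent_after(program, trace, k):
    """True if every step after k follows from the (possibly corrupted) earlier values."""
    return all(
        trace[i][1] == apply_step(program[i], dict(trace[:i]))
        for i in range(k + 1, len(program))
    )


correct = build_chain(PROGRAM)
corrupted = build_chain(PROGRAM, k=1, wrong_op="+")
```

In this sketch the correct chain is a=5, b=20, c=15, while the corrupted chain keeps a=5, sets b=9 at the injected step, and recomputes c=4 from the corrupted b. The verification step matters because an operator swap can accidentally reproduce the correct value (for example, 2 + 2 and 2 * 2); such a candidate would fail `is_prefix_invalid` and would be discarded in a pipeline of this kind.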
