SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning
Salman Rahman, Sruthi Gorantla, Arpit Gupta, Swastik Roy, Nanyun Peng, Yang Liu

TL;DR
SPARK introduces a three-stage framework that leverages process reward models with step-level feedback, enabling effective reference-free reinforcement learning, especially in domains lacking verifiable ground truth.
Contribution
The paper presents a novel three-stage process for training process reward models without ground-truth references, improving reinforcement learning in mathematical reasoning tasks.
Findings
Achieved 67.5 F1 on ProcessBench, surpassing reference-guided training.
Reached 47.4% accuracy on mathematical benchmarks, outperforming ground-truth-based RLVR.
Enabled reference-free RL training that exceeds ground-truth-based training.
Abstract
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps…
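The step-level aggregation described in the abstract can be illustrated with a minimal sketch. It assumes each independent verification is a list of per-step correct/incorrect judgments and that the judgments are combined by strict majority vote. The excerpt above does not state the exact aggregation rule, so the voting rule, the tie handling, and the function names `aggregate_step_verdicts` and `first_error_step` are hypothetical and chosen only for illustration.

```python
from typing import List


def aggregate_step_verdicts(verifications: List[List[bool]]) -> List[bool]:
    """Combine independent verifications by strict majority vote per step.

    verifications[k][i] is True when verification k judged step i correct.
    Assumption: ties count as "incorrect" (a conservative, illustrative choice).
    """
    if not verifications:
        return []
    n_steps = len(verifications[0])
    if any(len(v) != n_steps for v in verifications):
        raise ValueError("all verifications must cover the same number of steps")
    consensus = []
    for i in range(n_steps):
        votes_correct = sum(1 for v in verifications if v[i])
        # A step is accepted only when more than half of the verifications agree.
        consensus.append(2 * votes_correct > len(verifications))
    return consensus


def first_error_step(consensus: List[bool]) -> int:
    """Return the index of the first step judged incorrect, or -1 if none.

    The -1 convention is an assumption made for this sketch.
    """
    for i, ok in enumerate(consensus):
        if not ok:
            return i
    return -1


# Three independent verifications of a three-step solution.
votes = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
]
consensus = aggregate_step_verdicts(votes)
print(consensus, first_error_step(consensus))
```

Labels produced this way could then serve as synthetic step-level targets for fine-tuning a process reward model, which is the role the abstract assigns to the verification outputs.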
Peer Reviews
Decision · Submitted to ICLR 2026
- The method is conceptually simple and the paper is easy to follow.
- The comparison against GPT-4o and Reference-Guided verification on ProcessBench suggests that the methodology is promising, since it relies neither on ground truth nor on a frontier model.
- The main concern is the lack of evidence for assessing the statistical significance of the results. The paper does not mention how many experimental seeds were used (I assume a single one), and none of its results report confidence intervals. Prior literature has established that RL training is extremely stochastic [1, 2], which is also observed in math reasoning benchmarking [3], so it is unclear whether the reported takeaways are meaningful or just observation noise. This
1. The design of SPARK is intuitive to me.
2. Instead of relying on a static, expensive ground-truth dataset, SPARK uses a dynamic generator-verifier framework. It leverages inference-time scaling techniques (like self-consistency and meta-critique) to aggregate multiple verification attempts, effectively bootstrapping a high-quality, step-level training dataset from the model's own reasoning capabilities.
3. When used in RL training, SPARK's generative PRM enables the policy model to achieve
1. The method is motivated by the need to apply RL to subjective domains without ground truth (e.g., creative writing, ethical reasoning). However, all experiments are conducted exclusively in mathematical reasoning, a domain where objective ground truth does exist. This creates a mismatch between the problem the method claims to solve and the domain in which it is actually validated.
2. The paper provides a systematic analysis of reward hacking patterns. However, the identified patterns (e.g.,
The primary strength of this work is its novel and effective framework for training process reward models (PRMs) without access to ground-truth references. The reliance on expensive, step-level human annotations or gold solutions is a major bottleneck for scaling process-based feedback, and this paper offers a viable, reference-free alternative. The core idea of using inference-time scaling methods (like self-consistency and meta-critique) to generate high-quality synthetic verification data is
There is a mismatch between the paper's core motivation and its experimental validation. The method is motivated as a solution for domains where ground truth is "unavailable," "subjective," or "lacks clear verification criteria," such as creative writing or complex planning. However, all experiments are conducted exclusively on mathematical reasoning, a domain defined by objective, verifiable ground truth. The computational cost of the Stage 1 synthetic data generation pipeline appears to be e
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Topic Modeling
