Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective
Zeyu Jia, Alexander Rakhlin, Tengyang Xie

TL;DR
This paper shows that, under standard data coverage assumptions, reinforcement learning with outcome supervision is no more statistically difficult than with process supervision, up to polynomial factors in the horizon. This challenges the belief that process supervision is inherently more effective for complex reasoning tasks.
Contribution
The paper proves that outcome supervision is not statistically harder than process supervision. The proof rests on a new Change of Trajectory Measure Lemma, which connects trajectory-level and step-level distributions and thereby links the two forms of supervision.
Findings
Outcome supervision is statistically comparable to process supervision under standard data coverage assumptions, up to polynomial factors in the horizon.
The Change of Trajectory Measure Lemma bridges trajectory and step-level distributions.
Performance gaps between the two approaches are likely due to algorithms, not statistical difficulty.
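To make the two feedback types concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's construction: the `Trajectory` class, the helper names, and the modeling of the outcome signal as the sum of per-step rewards are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """Hypothetical container: a sequence of reasoning steps with per-step rewards."""
    steps: List[str]
    step_rewards: List[float]


def process_feedback(traj: Trajectory) -> List[float]:
    # Process supervision: the learner observes one reward per step.
    return list(traj.step_rewards)


def outcome_feedback(traj: Trajectory) -> float:
    # Outcome supervision (assumed here to be the total reward):
    # the learner observes a single number for the whole trajectory.
    return sum(traj.step_rewards)


# Two trajectories that differ step by step but receive the same outcome signal.
a = Trajectory(["s1", "s2", "s3"], [1.0, 0.0, 1.0])
b = Trajectory(["s1", "s2", "s3"], [0.0, 1.0, 1.0])

same_outcome = outcome_feedback(a) == outcome_feedback(b)
same_process = process_feedback(a) == process_feedback(b)
```

In this toy setup the outcome signal cannot tell `a` and `b` apart, while the step-level signal can. That is the intuition behind the conventional view that outcome supervision is harder, which the paper's main theorem argues does not translate into a statistical gap under standard coverage assumptions.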
Abstract
As large language models have evolved, it has become crucial to distinguish between process supervision and outcome supervision -- two key reinforcement learning approaches to complex reasoning tasks. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data. In this paper, we take steps towards resolving this debate. Our main theorem shows that, under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision, up to polynomial factors in horizon. At the core of this…
Taxonomy
Topics: Business Process Modeling and Analysis
