Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

Fei Ding; Yongkang Zhang; Runhao Liu; Yuhao Liao; Zijian Zeng; Sibo Wang; Huiming Yang

arXiv:2605.05226·cs.LG·May 8, 2026

TL;DR

This paper introduces a novel reinforcement learning paradigm for reasoning that internalizes outcome supervision into process supervision, enabling finer-grained learning signals without external process annotations.

Contribution

The paper proposes a supervision-internalization method that lets models extract process-level signals from outcome supervision, improving reasoning policy optimization.

Findings

01. Enables models to identify and correct failed reasoning trajectories.
02. Achieves finer-grained policy updates without external process supervision.
03. Introduces a new training paradigm for internal process supervision.

Abstract

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level…
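To make the credit-assignment problem named in the abstract concrete, here is a minimal sketch of the outcome-only baseline it criticizes. It is not the paper's supervision-internalization method, which the truncated abstract does not describe. The function name `broadcast_outcome_advantages`, the batch-mean baseline, and the toy rewards are all assumptions chosen for illustration.

```python
from typing import List


def broadcast_outcome_advantages(
    rewards: List[float], step_counts: List[int]
) -> List[List[float]]:
    """Assign each step of a trajectory the same outcome-level advantage.

    Hypothetical helper: the advantage is the trajectory's final reward
    minus the batch-mean reward (an assumed baseline), copied to every step.
    """
    if not rewards or len(rewards) != len(step_counts):
        raise ValueError("need one non-empty reward list matching step_counts")
    baseline = sum(rewards) / len(rewards)
    return [[r - baseline] * n for r, n in zip(rewards, step_counts)]


# Toy batch: four sampled reasoning trajectories, each scored only at the end.
rewards = [1.0, 0.0, 0.0, 1.0]   # 1.0 = correct final answer (assumed scoring)
step_counts = [3, 2, 4, 1]       # number of reasoning steps per trajectory
advantages = broadcast_outcome_advantages(rewards, step_counts)

for reward, per_step in zip(rewards, advantages):
    print(reward, per_step)
```

Every step inside a trajectory receives an identical value, so a correct early step in a failed trajectory is penalized exactly as much as the step that caused the failure. That uniformity is the coarse signal that process-level supervision aims to refine.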
