The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs

Zibo Zhao (1); Yuanting Zha (2); Haipeng Zhang (2); Xingcheng Xu (3) ((1) Arizona State University; (2) ShanghaiTech University; (3) Shanghai Artificial Intelligence Laboratory)

arXiv:2601.01580 · cs.LG · April 13, 2026

TL;DR

This paper introduces the Two-Stage Decision-Sampling Hypothesis to explain how RL training enables self-reflection in large language models, emphasizing the roles of sampling and decision components in generating solutions and self-correction.

Contribution

It formalizes the Gradient Attribution Property and decomposes the policy into sampling and decision stages, providing a mechanistic explanation for RL's gains over supervised fine-tuning.

Findings

01

RL improves decision-making ($π_{d}$) more than sampling ($π_{sample}$) in arithmetic reasoning.

02

Surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties show Unbalanced Gradient Attribution.

03

Length-weighting creates asymmetric regularization constraining sampling, explaining RL's effectiveness.

Abstract

Self-reflection capabilities emerge in Large Language Models after RL post-training, with multi-turn RL achieving substantial gains over SFT counterparts. Yet the mechanism of how a unified optimization objective gives rise to functionally distinct capabilities of generating solutions and evaluating when to revise them remains opaque. To address this question, we introduce the Gradient Attribution Property to characterize how reward gradients distribute across policy components, formalized through the Two-Stage Decision-Sampling (DS) Hypothesis, which decomposes the policy into sampling ($π_{sample}$) for generation and decision ($π_{d}$) for verification. We prove that surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties exhibit Unbalanced Gradient Attribution, with length-weighting creating asymmetric regularization that constrains $π_{sample}$ …
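
To make the decomposition concrete, here is a minimal sketch in notation assumed purely for illustration: the symbols $x$, $y$, $c$, and $θ$ are not taken from the paper, whose exact formalization may differ. Suppose a response to a prompt $x$ consists of a sampled candidate solution $y$ followed by a decision $c$ on whether to keep or revise it. A two-stage factorization, and the resulting additive split of the log-likelihood gradient, could then be written as:

```latex
% Assumed notation (hypothetical, not from the paper):
%   x = prompt, y = candidate solution, c = keep/revise decision, \theta = shared parameters
\pi_\theta(y, c \mid x) = \pi_{sample}(y \mid x)\,\pi_{d}(c \mid x, y)

% Taking the log turns the product into a sum, so the gradient splits into two terms:
\nabla_\theta \log \pi_\theta(y, c \mid x)
  = \nabla_\theta \log \pi_{sample}(y \mid x)
  + \nabla_\theta \log \pi_{d}(c \mid x, y)
```

Under this reading, comparing how much gradient signal each of the two terms receives under a given training objective is one way to think about the attribution question the abstract raises.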
