Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Yuval Ran-Milo; Yotam Alexander; Shahar Mendel; Nadav Cohen

arXiv:2601.15158·cs.LG·February 3, 2026

TL;DR

This paper shows that outcome-based reinforcement learning can lead Transformers to develop step-by-step reasoning, but only when the training distribution contains enough simple examples. The claim is supported by both theoretical analysis and experiments.

Contribution

The paper provides a theoretical analysis of how sparse, outcome-based rewards guide policy gradient to teach Transformers systematic reasoning, and it highlights the role of the training data distribution.

Findings

01 Policy gradient drives Transformers to learn iterative reasoning algorithms.

02 Simple examples in training data are critical for generalizable reasoning.

03 Training with too few simple examples hampers reasoning ability.

Abstract

Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning…
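The setting described in the abstract can be illustrated with a toy sketch. This summary does not specify the paper's exact task, so everything below is an assumption made for illustration: the graph is a simple chain stored as a hypothetical `successor` mapping, `traverse` stands in for the vertex-by-vertex algorithm, and `outcome_reward` stands in for a reward that checks only the final answer. No Transformer or policy gradient training is involved.

```python
from typing import Dict, List


def traverse(successor: Dict[int, int], start: int) -> List[int]:
    """Follow outgoing edges one vertex at a time until none remains.

    The returned path plays the role of the intermediate reasoning steps.
    """
    path = [start]
    seen = {start}  # guard against cycles in malformed inputs
    while path[-1] in successor:
        nxt = successor[path[-1]]
        if nxt in seen:
            raise ValueError("cycle detected; expected a chain")
        seen.add(nxt)
        path.append(nxt)
    return path


def outcome_reward(path: List[int], target: int) -> float:
    """Sparse reward: depends only on the final vertex, not on the steps."""
    return 1.0 if path and path[-1] == target else 0.0


# Hypothetical chain graph: 3 -> 1 -> 4 -> 2
successor = {3: 1, 1: 4, 4: 2}

simple_path = traverse(successor, 4)  # one traversal step
hard_path = traverse(successor, 3)    # three traversal steps

print(simple_path, outcome_reward(simple_path, target=2))
print(hard_path, outcome_reward(hard_path, target=2))
```

In this toy version, an instance's difficulty is the number of traversal steps (`len(path) - 1`), which is one way to read the abstract's notion of "simple examples". The reward gives no credit for a partially correct path, which is what makes the supervision outcome-based.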

Taxonomy

Topics: Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications