Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen

TL;DR
This paper shows that outcome-based reinforcement learning can lead Transformers to develop step-by-step reasoning, but only when the training data places enough weight on simple examples (instances requiring fewer reasoning steps). The claim is supported by both theory and experiments.
Contribution
It provides a theoretical analysis of how sparse, outcome-only rewards guide Transformers to learn systematic reasoning, and it highlights the importance of the training data distribution.
Findings
Policy gradient drives Transformers to learn iterative reasoning algorithms.
Simple examples in training data are critical for generalizable reasoning.
Training data with too few simple examples hampers the emergence of reasoning.
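For orientation, the standard policy gradient (REINFORCE) identity with an outcome-only reward can be written as below. This is the textbook form, not necessarily the exact objective analyzed in the paper; the symbols (input x, generated sequence y of length T, policy \pi_\theta, data distribution \mathcal{D}, binary reward r) are notation assumed here for illustration.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ r(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right) \right],
\qquad
r(x, y) \in \{0, 1\} \text{ depends only on the final answer in } y .
```

Because r is zero for every sequence that ends in a wrong answer, intermediate steps receive credit only through sequences that happen to finish correctly, which is one way to see why examples that are easy to finish correctly could matter.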
Abstract
Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning…
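To make the setup concrete, here is a minimal sketch of a graph traversal instance with an outcome-only reward. It is an illustration under stated assumptions, not the paper's construction: the path-graph instance, the function names (`make_chain`, `iterative_traverse`, `outcome_reward`), and the reward convention are all hypothetical, since the summary above does not specify the exact task.

```python
import random


def make_chain(num_vertices, rng):
    """Build a hypothetical path graph as a successor map over shuffled labels.

    Assumption: a single directed path; the paper's graph family may differ.
    """
    order = list(range(num_vertices))
    rng.shuffle(order)
    successor = {order[i]: order[i + 1] for i in range(num_vertices - 1)}
    return successor, order[0], order[-1]


def iterative_traverse(successor, start):
    """Emit one vertex per step until reaching a vertex with no outgoing edge.

    This mirrors the 'vertex-by-vertex' iterative solution described in the
    abstract: each emitted vertex acts as an intermediate reasoning step.
    """
    trace = [start]
    while trace[-1] in successor:
        trace.append(successor[trace[-1]])
    return trace


def outcome_reward(trace, terminal):
    """Sparse reward: 1.0 only if the last emitted vertex is the terminal.

    Intermediate vertices are never checked, only the final answer.
    """
    return 1.0 if trace and trace[-1] == terminal else 0.0


rng = random.Random(0)
successor, start, terminal = make_chain(6, rng)
trace = iterative_traverse(successor, start)
reward = outcome_reward(trace, terminal)
```

In this sketch, a shorter chain corresponds to a "simple example" in the sense used above: fewer emitted vertices are needed before the final answer is reached, so a random policy is more likely to stumble on a rewarded sequence.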
Taxonomy
Topics: Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
