On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
Guanghao Ye, Khiem Duc Pham, Xinzhi Zhang, Sivakanth Gopi, Baolin Peng, Beibin Li, Janardhan Kulkarni, Huseyin A. Inan

TL;DR
This paper introduces RLSP, a post-training framework that enhances reasoning in large language models by combining supervised fine-tuning, exploration rewards, and outcome verification, leading to emergent complex reasoning behaviors.
Contribution
The paper proposes RLSP, a scalable post-training method that decouples exploration and correctness signals to improve reasoning in LLMs, demonstrating emergent behaviors and reasoning improvements.
Findings
RLSP boosts reasoning performance by 23% on MATH-500.
RLSP improves AIME 2024 math problem accuracy by 10%.
Models trained with RLSP exhibit emergent behaviors like backtracking and idea exploration.
Abstract
Recent AI advancements, such as OpenAI's new models, are transforming LLMs into LRMs (Large Reasoning Models) that perform reasoning during inference, taking extra time and compute for higher-quality outputs. We aim to uncover the algorithmic framework for training LRMs. Methods like self-consistency, PRM, and AlphaZero suggest reasoning as guided search. We ask: what is the simplest, most scalable way to enable search in LLMs? We propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: (1) supervised fine-tuning with human or synthetic demonstrations of the reasoning process, (2) using an exploration reward signal to encourage diverse and efficient reasoning behaviors, and (3) RL training with an outcome verifier to ensure correctness while preventing reward hacking. Our key innovation is to decouple exploration and correctness…
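The decoupling of exploration and correctness described in the abstract can be illustrated with a small reward function. This is a minimal sketch, not the authors' implementation: the abstract is truncated before it says how the two signals are combined, so the additive form, the length-based exploration bonus, and the names `RewardConfig`, `exploration_bonus`, `combined_reward`, `exploration_weight`, and `max_tokens` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RewardConfig:
    # Hypothetical values, chosen only for illustration.
    exploration_weight: float = 0.2  # weight of the exploration term
    max_tokens: int = 4096           # length at which the bonus saturates


def exploration_bonus(num_tokens: int, cfg: RewardConfig) -> float:
    """Assumed exploration signal: a length-based bonus clipped to [0, 1]."""
    clipped = min(max(num_tokens, 0), cfg.max_tokens)
    return clipped / cfg.max_tokens


def combined_reward(num_tokens: int, is_correct: bool, cfg: RewardConfig) -> float:
    """Outcome term from a verifier plus a separately weighted exploration term."""
    outcome = 1.0 if is_correct else 0.0
    return outcome + cfg.exploration_weight * exploration_bonus(num_tokens, cfg)


cfg = RewardConfig()
short_correct = combined_reward(512, True, cfg)    # correct, little exploration
long_incorrect = combined_reward(4096, False, cfg)  # wrong, maximal exploration
```

Under these assumptions, keeping `exploration_weight` below 1 means a correct answer always scores higher than an incorrect one regardless of length, which is one simple way the outcome term could limit reward hacking of the exploration term.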
Taxonomy
Topics: Corporate Insolvency and Governance · Legal Education and Practice Innovations
Methods: Entropy Regularization · Proximal Policy Optimization · AlphaZero
