On the Emergence of Thinking in LLMs I: Searching for the Right Intuition

Guanghao Ye; Khiem Duc Pham; Xinzhi Zhang; Sivakanth Gopi; Baolin Peng; Beibin Li; Janardhan Kulkarni; Huseyin A. Inan

arXiv:2502.06773 · cs.AI · February 11, 2025

TL;DR

This paper introduces RLSP, a post-training framework that enhances reasoning in large language models by combining supervised fine-tuning, exploration rewards, and outcome verification, leading to emergent complex reasoning behaviors.

Contribution

The paper proposes RLSP, a scalable post-training method that decouples exploration and correctness signals to improve reasoning in LLMs, demonstrating emergent behaviors and reasoning improvements.

Findings

1. RLSP boosts reasoning performance by 23% on MATH-500.

2. RLSP improves AIME 2024 math problem accuracy by 10%.

3. Models trained with RLSP exhibit emergent behaviors like backtracking and idea exploration.

Abstract

Recent AI advancements, such as OpenAI's new models, are transforming LLMs into LRMs (Large Reasoning Models) that perform reasoning during inference, taking extra time and compute for higher-quality outputs. We aim to uncover the algorithmic framework for training LRMs. Methods like self-consistency, PRM, and AlphaZero suggest reasoning as guided search. We ask: what is the simplest, most scalable way to enable search in LLMs? We propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: (1) supervised fine-tuning with human or synthetic demonstrations of the reasoning process, (2) using an exploration reward signal to encourage diverse and efficient reasoning behaviors, and (3) RL training with an outcome verifier to ensure correctness while preventing reward hacking. Our key innovation is to decouple exploration and correctness…
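The decoupling of exploration and correctness described in the abstract can be illustrated with a small reward-combination sketch. This is not the authors' implementation: the length-based exploration proxy, the exact-match verifier, the mixing weight `alpha`, and the token budget `max_len` are all hypothetical choices made only to show how two separate signals might be blended into one scalar reward for RL training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Hypothetical values; the abstract does not specify any of these.
    alpha: float = 0.8   # weight on the outcome (correctness) signal
    max_len: int = 1024  # token budget used to normalise the length proxy


def outcome_reward(answer: str, reference: str) -> float:
    """Stand-in outcome verifier: 1.0 on exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def exploration_reward(num_reasoning_tokens: int, max_len: int) -> float:
    """Assumed exploration proxy: normalised reasoning length in [0, 1]."""
    clipped = max(0, min(num_reasoning_tokens, max_len))
    return clipped / max_len


def combined_reward(
    answer: str,
    reference: str,
    num_reasoning_tokens: int,
    cfg: RewardConfig = RewardConfig(),
) -> float:
    """Blend the two decoupled signals into a single scalar reward."""
    r_out = outcome_reward(answer, reference)
    r_exp = exploration_reward(num_reasoning_tokens, cfg.max_len)
    return cfg.alpha * r_out + (1.0 - cfg.alpha) * r_exp


if __name__ == "__main__":
    # A correct answer with moderate reasoning vs. a wrong but long one.
    print(combined_reward("42", "42", 512))
    print(combined_reward("41", "42", 1024))
```

With the outcome weight above 0.5, a correct answer always outscores an incorrect one regardless of length, which is one simple way to keep the exploration term from being exploited on its own.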

Taxonomy

Topics: Corporate Insolvency and Governance · Legal Education and Practice Innovations

Methods: Entropy Regularization · Proximal Policy Optimization · AlphaZero