Demystifying Long Chain-of-Thought Reasoning in LLMs
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue

TL;DR
This paper investigates the mechanisms behind long chain-of-thought reasoning in large language models, emphasizing the roles of supervised fine-tuning and reinforcement learning in developing and stabilizing these reasoning capabilities.
Contribution
The paper systematically analyzes how long CoT reasoning emerges in LLMs and offers practical guidance on training strategies, reward shaping, and scaling compute to improve reasoning.
Findings
Supervised fine-tuning improves training efficiency but is not essential.
Reasoning capabilities tend to emerge with increased training compute, but their emergence is not guaranteed, which makes reward shaping crucial.
Scaling verifiable reward signals, especially with noisy web data, enhances out-of-distribution reasoning.
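To make the terms "verifiable reward" and "reward shaping" in the findings above concrete, here is a minimal sketch in Python. It is not the reward function used in the paper. The function names, the exact-match check, the linear length term, and all parameter values (`max_length`, `exceed_penalty`, `length_weight`) are hypothetical choices made only for illustration.

```python
def verifiable_reward(predicted: str, reference: str) -> float:
    """Rule-based check: 1.0 if the normalized answers match, else 0.0.

    Exact string match is an assumption; a real verifier might parse
    numbers or compare symbolic expressions instead.
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def shaped_reward(
    predicted: str,
    reference: str,
    cot_length: int,
    max_length: int = 4096,       # hypothetical token budget
    exceed_penalty: float = -1.0,  # hypothetical penalty for hitting the budget
    length_weight: float = 0.5,    # hypothetical strength of the length term
) -> float:
    """Combine a correctness signal with a simple length-dependent term."""
    # Responses that exhaust the budget get a fixed penalty, which
    # discourages unbounded growth of the chain of thought.
    if cot_length >= max_length:
        return exceed_penalty

    frac = cot_length / max_length
    if verifiable_reward(predicted, reference) == 1.0:
        # Correct answers: shorter chains earn slightly more.
        return 1.0 - length_weight * frac
    # Wrong answers: longer chains are penalized slightly less, so the
    # model is not pushed to give up early on hard problems.
    return -1.0 + length_weight * frac
```

Under these assumed settings, a correct answer reached in 2048 tokens scores 0.75, a wrong answer of the same length scores -0.75, and any response that reaches the 4096-token budget scores -1.0 regardless of correctness.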
Abstract
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT…
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Multi-Agent Systems and Negotiation
Methods: Balanced Selection · Shrink and Fine-Tune
