Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study

Yongyu Mu; Jiali Zeng; Bei Li; Xinyan Guan; Fandong Meng; Jie Zhou; Tong Xiao; Jingbo Zhu

arXiv:2506.04913 · cs.LG · November 11, 2025

TL;DR

This paper empirically analyzes the training dynamics of long-chain-of-thought reasoning models, examining the roles of positive and negative samples, data inefficiency in group relative policy optimization, and the factors that affect model stability and performance.

Contribution

It provides a systematic analysis of positive and negative samples in RL training, proposes strategies to improve data efficiency, and investigates causes of performance instability in reasoning models.

Findings

01. Negative samples enhance generalization and robustness.

02. Training on negative samples alone can achieve strong reasoning performance.

03. Strategies like relative length rewards improve data efficiency.

Abstract

Despite recent progress in training long-chain-of-thought reasoning models via scaling reinforcement learning (RL), its underlying training dynamics remain poorly understood, and several counterintuitive behaviors persist. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in scaling RL, revealing that positive samples mainly facilitate precise fitting to the training data, whereas negative samples significantly enhance generalization and robustness. Interestingly, while positive samples are essential for convergence in the zero-RL setting, training on negative samples alone suffices to attain strong reasoning performance and even better generalization in cold-start scenarios. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address…
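The zero-advantage observation in point (2) can be illustrated with a small sketch. It assumes the commonly used group-relative normalization, in which each rollout's reward is centered on its group's mean and divided by the group's standard deviation; the paper's exact formulation may differ. The function names, the `eps` constant, the use of the population standard deviation, and the reward values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on its group's mean and scale by the group's std.

    Assumption: population std plus a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def zero_advantage_fraction(groups):
    """Fraction of samples that sit in a group whose rewards are all identical.

    Every sample in such a group gets an advantage of exactly zero, so it
    contributes no policy-gradient signal.
    """
    total = sum(len(g) for g in groups)
    wasted = sum(len(g) for g in groups if len(set(g)) == 1)
    return wasted / total


# Hypothetical binary correctness rewards: three prompts, four rollouts each.
groups = [
    [1.0, 1.0, 1.0, 1.0],  # every rollout correct
    [0.0, 0.0, 0.0, 0.0],  # every rollout wrong
    [1.0, 0.0, 0.0, 1.0],  # mixed outcomes
]

for g in groups:
    print(g, "->", [round(a, 3) for a in group_relative_advantages(g)])
print("zero-advantage fraction:", zero_advantage_fraction(groups))
```

In this toy setup only the mixed group produces nonzero advantages. Groups where every rollout succeeds or every rollout fails are normalized to zero, which is one way a large share of sampled data can end up carrying no learning signal.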

Code & Models

Repositories

takagi97/dissect-long-reason-models (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning