When More is Less: Understanding Chain-of-Thought Length in LLMs

Yuyang Wu; Yifei Wang; Ziyu Ye; Tianqi Du; Stefanie Jegelka; Yisen Wang

arXiv:2502.07266·cs.AI·May 28, 2025·3 cites

TL;DR

This paper reveals that longer Chain-of-Thought reasoning in LLMs does not always improve performance, showing an inverted U-shaped relationship, and introduces a theoretical model to optimize CoT length based on task difficulty and model capability.

Contribution

It provides a theoretical framework explaining CoT length effects, uncovers the simplicity bias in models, and offers practical methods for adaptive CoT calibration to enhance reasoning accuracy.

Findings

01. Performance follows an inverted U-shaped curve with CoT length.

02. Optimal CoT length increases with task difficulty but decreases with model capability.

03. Training with optimal CoT length improves reasoning accuracy.
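
The first two findings can be illustrated with a toy calculation. The sketch below is not the paper's theoretical model; it is a minimal illustration under assumed dynamics. It assumes a task of size `difficulty` is split evenly across `num_steps` steps, that each step fails with a fixed probability `step_noise` regardless of its content, and that each step also fails more often as its per-step load grows relative to `capability`. All function names, parameter names, and the functional form are hypothetical.

```python
import math


def accuracy(num_steps, difficulty, capability, step_noise=0.05):
    """Toy probability that every step of a chain succeeds (assumed model)."""
    # Work handled by a single step, relative to what the model can absorb.
    load = difficulty / (num_steps * capability)
    # A step succeeds if it avoids the fixed per-step noise AND the
    # load-dependent error, which grows with the square of the load.
    per_step = (1.0 - step_noise) * math.exp(-load ** 2)
    # The chain is correct only if all steps are correct.
    return per_step ** num_steps


def optimal_length(difficulty, capability, max_steps=100, step_noise=0.05):
    """Step count in 1..max_steps that maximizes the toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: accuracy(n, difficulty, capability, step_noise),
    )


if __name__ == "__main__":
    for difficulty, capability in [(4, 1), (8, 1), (4, 2)]:
        n_star = optimal_length(difficulty, capability)
        print(difficulty, capability, n_star)
```

In this toy setting the log-accuracy is -aN - T²/(C²N) with a = -ln(1 - `step_noise`), T = `difficulty`, C = `capability`, and N = `num_steps`. Too few steps overload each step, while too many steps accumulate per-step noise, so accuracy rises and then falls. The continuous optimum is N* = T / (C·√a), which grows with difficulty and shrinks with capability, matching the direction of the reported scaling behavior.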

Abstract

Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length, where performance initially improves but eventually decreases as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 2·Confidence 4

Strengths

1. Overall, the paper is pretty well written and I was able to follow the main points. 2. The insight that step-wise computation also increases for difficult instances is an interesting one that, as far as I know, has not been covered extensively in previous papers.

Weaknesses

The paper is mainly insight-driven, but as far as I can tell many of the insights presented in the paper have already been uncovered by previous work that was not cited: 1. Insights related to long CoTs not being better are covered by Jiang et al., which was not cited. 2. The idea of error accumulation in CoTs being responsible for long chains of thought being less successful was published in Schaeffer et al. 2023. 3. Adaptive length-filtered voting was examined by Fu et al., which was cited in

Reviewer 02·Rating 8·Confidence 4

Strengths

1. The paper is well-written, clearly organized, and easy to follow. The authors articulate their core argument effectively. 2. The core finding is "an optimal Chain-of-Thought (CoT) length exists". This conclusion is convincingly demonstrated. The results of most experiments are very clear. 3. This study not only analyzes the results, but also provides guidance for the reasoning process of practical models, which has also shown positive effects in experiments.

Weaknesses

The controlled experiments described in Section 3 may be problematic because factors other than CoT length were altered. For instance, the long and short CoT solutions differ not only in total length but also in how they approach problem-solving: short CoTs take fewer but longer steps, whereas long CoTs take more but shorter steps. Since the paper measures CoT length by the number of steps and controls this variable in the experiments, variations in step length and the more complex operations us

Reviewer 03·Rating 8·Confidence 4

Strengths

* This paper is well-written. * As mentioned, the paper provides both synthetic and real-world tasks where the findings are observed. * The authors meaningfully point out that reasoning traces typically become longer (most prominently demonstrated by the DeepSeek-R1 paper), yet they show evidence that this is in fact not always true (e.g. on Leetcode-2k). * The authors provide one important application of their work, which is to propose a novel majority voting mechanism weighted by reasonin

Weaknesses

My main concern is that I do not think the experimental setup adequately discusses how important reflection/backtracking is to the reasoning process, which I think the authors rightfully point out is present in "real-world CoTs" (Section 2, Appendix A.3). To frame it another way, the main question I am asking is: given that real-world models do self-correct, **how does self-correction/backtracking play a role in influencing the total length of the CoT**? * To my understanding, none of the synthe

Taxonomy

Topics: Scientific Computing and Data Management

Methods: ALIGN