When More is Less: Understanding Chain-of-Thought Length in LLMs
Yuyang Wu, Yifei Wang, Ziyu Ye, Tianqi Du, Stefanie Jegelka, Yisen Wang

TL;DR
This paper reveals that longer Chain-of-Thought reasoning in LLMs does not always improve performance, showing an inverted U-shaped relationship, and introduces a theoretical model to optimize CoT length based on task difficulty and model capability.
Contribution
It provides a theoretical framework explaining CoT length effects, uncovers the simplicity bias in models, and offers practical methods for adaptive CoT calibration to enhance reasoning accuracy.
Findings
Performance follows an inverted U-shaped curve with CoT length.
Optimal CoT length increases with task difficulty but decreases with model capability.
Training with optimal CoT length improves reasoning accuracy.
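The first two findings can be illustrated with a toy calculation. The sketch below is not the paper's theoretical model; it is a minimal stand-in under stated assumptions: a task of total `difficulty` is split evenly over `num_steps` steps, each step carries a fixed chance of a slip (`step_noise`), and the chance of getting a step's sub-task right falls off with the square of the load placed on that step, scaled by `capability`. All function names, parameters, and functional forms here are hypothetical choices made for illustration.

```python
import math


def accuracy(num_steps, difficulty, capability, step_noise=0.05):
    """Toy final-answer accuracy for a chain of `num_steps` reasoning steps.

    Assumptions (illustrative only, not taken from the paper):
      * the task is split evenly, so each step carries difficulty / num_steps;
      * sub-task success decays as exp(-load^2 / capability);
      * every step also has an independent slip probability `step_noise`;
      * the final answer is correct only if every step is correct.
    """
    per_step_load = difficulty / num_steps
    p_subtask = math.exp(-(per_step_load ** 2) / capability)
    p_step = (1.0 - step_noise) * p_subtask
    return p_step ** num_steps


def optimal_length(difficulty, capability, step_noise=0.05, max_steps=200):
    """Return the step count in 1..max_steps that maximizes toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: accuracy(n, difficulty, capability, step_noise),
    )


if __name__ == "__main__":
    # Accuracy rises, peaks, then falls as the chain gets longer.
    for n in (1, 2, 5, 10, 20, 50, 100):
        print(n, round(accuracy(n, difficulty=10, capability=20), 4))

    # The peak moves right for harder tasks and left for more capable models.
    print(optimal_length(difficulty=10, capability=20))
    print(optimal_length(difficulty=20, capability=20))
    print(optimal_length(difficulty=10, capability=80))
```

Under these assumptions the log-accuracy is `N * log(1 - step_noise) - difficulty**2 / (N * capability)`. The first term penalizes long chains through accumulated slips and the second penalizes short chains through overloaded steps, so the maximum sits at an intermediate length that grows with `difficulty` and shrinks as `capability` grows.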
Abstract
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length, where performance initially improves but eventually decreases as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate…
Peer Reviews
Decision: ICLR 2026 Poster
1. Overall the paper is pretty well written and I was able to follow the main points. 2. The insight regarding step-wise computation also increasing for difficult instances is an interesting one that, as far as I know, was not covered extensively in previous papers.
The paper is mainly insight-driven, but as far as I can tell many of the insights presented in the paper have already been uncovered by previous work that was not cited: 1. Insights related to long CoTs not being better are covered by Jiang et al., which was not cited. 2. The idea of error accumulation in CoTs being responsible for long chains of thought being less successful was published in Schaeffer et al. 2023. 3. Adaptive length-filtered voting was examined by Fu et al., which was cited in
1. The paper is well-written, clearly organized, and easy to follow. The authors articulate their core argument effectively. 2. The core finding is "an optimal Chain-of-Thought (CoT) length exists". This conclusion is convincingly demonstrated. The results of most experiments are very clear. 3. This study not only analyzes the results, but also provides guidance for the reasoning process of practical models, which has also shown positive effects in experiments.
The controlled experiments described in Section 3 may be problematic because factors other than CoT length were altered. For instance, the long and short CoT solutions differ not only in total length but also in how they approach problem-solving: short CoTs take fewer but longer steps, whereas long CoTs take more but shorter steps. Since the paper measures CoT length by the number of steps and controls this variable in the experiments, variations in step length and the more complex operations us
* This paper is well-written. * As mentioned, the paper provides both synthetic and real-world tasks where the findings are observed. * The authors meaningfully point out that reasoning traces typically become longer (most prominently demonstrated by the DeepSeek-R1 paper), yet they show evidence that in fact this is not always true (e.g. on Leetcode-2k). * The authors provide one important application of their work, which is to propose a novel majority voting mechanism weighted by reasonin
My main concern is that I do not think the experimental setup adequately discusses how important reflection/backtracking is to the reasoning process, which I think the authors rightfully point out is present in "real-world CoTs" (Section 2, Appendix A.3). To frame it another way, the main question I am asking is: given that real-world models do self-correct, **how does self-correction/backtracking play a role in influencing the total length of the CoT**? * To my understanding, none of the synthe
Taxonomy
Topics: Scientific Computing and Data Management
Methods: ALIGN
