Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

Michael Hassid; Gabriel Synnaeve; Yossi Adi; Roy Schwartz

arXiv:2505.17813·cs.CL·February 4, 2026

3 Reviews

TL;DR

This paper shows that shorter reasoning chains in large language models often lead to better accuracy and efficiency, challenging the assumption that longer chains improve reasoning performance.

Contribution

The paper introduces short-m@k, a novel inference method that uses shorter, parallel reasoning chains with majority voting, improving efficiency and accuracy over traditional longer chains.

Findings

01

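For a given question, shorter reasoning chains are more likely to yield the correct answer than longer ones: up to 34.5% more accurate than the longest chain sampled for that question.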

02

Short-m@k reduces computation and inference time while maintaining or improving accuracy.

03

Training on shorter reasoning chains enhances model performance.

Abstract

Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k…
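The selection rule described in the abstract can be illustrated with a small offline simulation. This is a minimal sketch, not the authors' implementation: it assumes the k chains have already been sampled, that all k generations advance at the same speed (so the first m to finish are the m with the fewest thinking tokens), and that vote ties go to the answer whose chain finished earliest. The `Chain` type, the function names, and the token-accounting rule are hypothetical and introduced only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Chain(NamedTuple):
    """One sampled generation (hypothetical container for this sketch)."""
    thinking_tokens: int  # length of the thinking chain
    answer: str           # final answer extracted from the generation


def short_m_at_k(chains: list[Chain], m: int) -> str:
    """Majority vote over the first m of k chains to finish.

    Assumption: equal decoding speed, so finish order == length order.
    Assumption: ties are broken in favor of the earliest-finishing chain.
    """
    if not 1 <= m <= len(chains):
        raise ValueError("m must satisfy 1 <= m <= k")
    finished = sorted(chains, key=lambda c: c.thinking_tokens)[:m]
    votes = Counter(c.answer for c in finished)
    top = max(votes.values())
    # Walk in finish order so the earliest chain wins a tie.
    for chain in finished:
        if votes[chain.answer] == top:
            return chain.answer
    raise AssertionError("unreachable: finished is non-empty")


def thinking_tokens_spent(chains: list[Chain], m: int) -> int:
    """Tokens generated before halting (assumed accounting rule).

    Every chain runs until the m-th chain finishes, then all stop, so each
    chain contributes min(its length, length of the m-th shortest chain).
    """
    cutoff = sorted(c.thinking_tokens for c in chains)[m - 1]
    return sum(min(c.thinking_tokens, cutoff) for c in chains)


# Made-up sample with k = 5; the numbers are illustrative, not from the paper.
sample = [
    Chain(900, "42"),
    Chain(300, "17"),
    Chain(450, "42"),
    Chain(1200, "8"),
    Chain(500, "42"),
]

print(short_m_at_k(sample, m=1))           # answer of the single shortest chain
print(short_m_at_k(sample, m=3))           # vote among the three shortest chains
print(thinking_tokens_spent(sample, m=3))  # tokens generated before halting
```

In this toy sample, m = 1 returns whatever the shortest chain answered, while m = 3 lets two agreeing chains outvote it. Either setting halts before the 1200-token chain completes, which is where the compute and wall-time savings would come from under the stated assumptions.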

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The experiments are based on the most challenging task sets, such as AIME 2025 and HMMT, which represent the frontier of LLM reasoning performance.
- The proposed methods seem easy to implement, and the experiments easy to replicate on other models.
- The topic is a core issue faced by most reasoning LLMs.

Weaknesses

- In general, the findings of this paper are largely empirical and lack theoretical insight, or interpretation from case-by-case analysis, into *why* shorter reasoning trajectories are more beneficial than longer ones. Questions such as *how much longer* the thinking process should be for harder tasks also remain open.
- The S1 data seems very important for validating the results, but its nature and how it was constructed do not seem to be introduced at all. I understood that this is from other f

Reviewer 02 · Rating 8 · Confidence 4

Strengths

* The study focuses on an important topic in reasoning LLMs by challenging the assumption that longer chains enhance reasoning.
* The proposed short-m@k framework introduces an elegant parallel decoding mechanism. This approach is well-motivated by the authors’ empirical findings and leads to measurable compute and time savings.
* The authors validate across four major reasoning models and multiple benchmarks, combining performance, compute, and wall-time analyses. Additional fine-tuning experiments support th

Weaknesses

* The paper primarily provides empirical evidence without formal analysis of why shorter chains outperform longer ones. Also see questions.
* The proposed method assumes access to batch inference resources; its effectiveness under memory-constrained or latency-constrained conditions remains unclear. Evaluation in sequential or streaming inference settings could provide further robustness evidence.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The proposed short-m@k inference scheme sounds like a pragmatic knob to trade inference time for accuracy, and the compute/time reduction angle is appealing.
- The empirical evaluation is broad: multiple reasoning LLMs, multiple math datasets, compute/time/accuracy slicing.

Weaknesses

**W1. Prior work draws almost the opposite conclusion.** For example, [1] explicitly encourages longer trajectories and then uses self-consistency voting over longer chains. This paper reports the opposite monotonic trend. The authors do not directly reconcile this contradiction. Clarifying this contradiction is necessary before readers can interpret the result as a generally valid principle.

**W2. The novelty is unclear.** The related work section itself (line ~141 *More relevant to our work…
