AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models
Feng Luo, Yu-Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, Xia Hu

TL;DR
AutoL2S introduces a distillation framework enabling large language models to adaptively perform long or short reasoning, significantly reducing inference costs while maintaining high accuracy on complex tasks.
Contribution
AutoL2S is the first method to learn a switching token for instance-wise long-short reasoning selection, improving efficiency without sacrificing accuracy.
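The instance-wise selection described above can be illustrated with a minimal sketch. It assumes the switching token is the `<EASY>` token mentioned in the reviews and that it appears as the first generated token. The names `choose_mode`, `answer`, `toy_gate`, and the `<think>` placeholder are hypothetical and not taken from the paper. In a real model, one autoregressive decoder would emit the token and continue generating; the gate and the two reasoners are separated here only for readability.

```python
from typing import Callable

# Token name taken from the reviews; its position as the first output token is an assumption.
EASY_TOKEN = "<EASY>"


def choose_mode(first_token: str) -> str:
    """Map the first generated token to a reasoning mode."""
    return "short" if first_token == EASY_TOKEN else "long"


def answer(
    question: str,
    predict_first_token: Callable[[str], str],
    short_reasoner: Callable[[str], str],
    long_reasoner: Callable[[str], str],
) -> str:
    """Route a question to short or long reasoning based on the switching token."""
    mode = choose_mode(predict_first_token(question))
    reasoner = short_reasoner if mode == "short" else long_reasoner
    return reasoner(question)


def toy_gate(question: str) -> str:
    """Hypothetical stand-in for the learned gate: short questions count as easy."""
    # "<think>" is an arbitrary placeholder for any non-<EASY> first token.
    return EASY_TOKEN if len(question.split()) <= 5 else "<think>"


if __name__ == "__main__":
    short = lambda q: "short CoT for: " + q
    long_ = lambda q: "long CoT for: " + q
    print(answer("2 + 2 = ?", toy_gate, short, long_))
    print(answer("Prove that the sum of the first n odd numbers equals n squared",
                 toy_gate, short, long_))
```

The word-count gate is only a placeholder; in the described method the decision is learned from verified long-short CoT pairs rather than hand-coded.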
Findings
Reduces reasoning length by up to 71%
Maintains high reasoning accuracy with minimal loss
Improves inference efficiency and reduces costs
Abstract
Reasoning-capable large language models (LLMs) achieve strong performance on complex tasks but often exhibit overthinking after distillation, generating unnecessarily long chain-of-thought (CoT) reasoning even for simple inputs and incurring high inference cost. However, naively shortening reasoning length can degrade reasoning accuracy, as concise reasoning may be insufficient for certain inputs and lacks explicit supervision. We propose Auto Long-Short Reasoning (AutoL2S), a distillation framework that empowers non-reasoning LLMs to think thoroughly but only when necessary. AutoL2S first learns a lightweight switching token with verified long-short CoTs to enable instance-wise long-short reasoning selection. Then it leverages long-short reasoning rollouts induced by a switching token in a GRPO-style loss to improve reasoning efficiency while maintaining accuracy. Experiments…
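To make the "GRPO-style loss" concrete, here is a minimal sketch of the group-relative advantage computation that GRPO-type objectives use, combined with a length-aware reward. The reward shape (correctness minus a length penalty weighted by `alpha`) and all function names are illustrative assumptions, not the paper's actual objective; the abstract does not specify the reward.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def length_aware_reward(correct: bool, num_tokens: int,
                        max_tokens: int, alpha: float = 0.5) -> float:
    """Hypothetical reward: 0 if wrong, otherwise 1 minus a length penalty."""
    if not correct:
        return 0.0
    return 1.0 - alpha * min(num_tokens, max_tokens) / max_tokens


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def advantages_for_rollouts(rollouts: List[Tuple[bool, int]],
                            max_tokens: int) -> List[float]:
    """rollouts: (is_correct, num_tokens) pairs sampled for a single question."""
    rewards = [length_aware_reward(c, n, max_tokens) for c, n in rollouts]
    return group_relative_advantages(rewards)


if __name__ == "__main__":
    # A short correct rollout, a long correct rollout, and a long incorrect one.
    group = [(True, 200), (True, 1000), (False, 1000)]
    print(advantages_for_rollouts(group, max_tokens=1000))
```

Under these assumptions the short correct rollout receives the largest advantage, the long correct one a smaller advantage, and the incorrect one a negative advantage, which is the kind of signal that would push a policy toward shorter reasoning without rewarding wrong answers.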
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Originality lies in a clean recipe that pairs long and short CoT supervision with an explicit gating signal and a length-aware objective, giving a practical, model-agnostic way to control reasoning length. Quality shows as strong accuracy at substantially reduced tokens across several benchmarks, with clear training and inference procedures. Significance is high for latency and cost reduction in real deployments. Main weakness: novelty and contribution boundaries are not crisply isolated, with m
While the paper presents a neat engineering recipe, the conceptual novelty feels limited: the core ideas of mixing long and short CoT traces, learning an “easy” gate, and adding a length-aware fine-tuning step closely echo prior work on CoT compression and length control. The theoretical results are largely tautological reformulations of standard information-theoretic inequalities and risk trade-offs, and they do not yield actionable guidance for thresholds, divergence estimators, or training
1. AutoL2S can automatically decide between short and long reasoning based on the difficulty of the problem. 2. AutoL2S effectively reduces the reasoning length while maintaining only a small loss in accuracy. 3. The experimental results of AutoL2S appear to be effective and convincing.
1. According to Equation (4), for <EASY> questions, both long and short reasoning paths are learned together during SFT training. Wouldn’t this potentially confuse the model? Why not train only on the short reasoning paths directly? 2. For such automatic decision-making tasks, recent studies suggest that RL often outperforms SFT, as SFT tends to memorize specific formats rather than learn adaptive reasoning strategies. Have the authors considered exploring RL-based methods, such as GRPO, to lear
1. Instead of complex dataset rebuilding, this paper introduces a much easier way of constructing the datasets. The paper validates the method across multiple model families and diverse reasoning benchmarks, and provides extensive ablation studies and mechanistic analyses to show why it works. 2. Its reduction of over 50% in reasoning length, while maintaining or even improving accuracy, provides a direct, high-impact solution for building faster, cheaper, and more scalable reasoning appli
1. The paper's evaluation is narrowly focused on maths and physics benchmarks. This makes it unclear whether the AutoL2S framework's ability to distinguish "easy" from "hard" questions will generalize to other complex reasoning domains, such as commonsense or coding. **OOD experiments are required to show the performance really does not "drop".** 2. There is no guidance on when this method works well. Later, if readers have their own datasets and domains, under what conditions will this met
1. The paper addresses a problem of high practical importance. The "overthinking" phenomenon is a well-recognized bottleneck for the deployment of CoT-enabled LLMs, making research into reasoning efficiency both timely and valuable. 2. The central mechanism, using a learned special token (<EASY>) as an explicit gate for dynamically controlling reasoning length, is intuitive and interpretable. It presents a potentially simpler alternative to complex reinforcement learning reward-shaping or tr
1. The paper's primary weakness is its failure to properly situate its contribution within the existing literature. The central problem of "overthinking" and the proposed solution, an SFT framework to enable self-regulating reasoning length, are not novel. The paper omits citation of its most direct competitors, like Self-Braking Tuning (SBT), making it impossible to assess its incremental contribution. Both AutoL2S and SBT are SFT-based frameworks. Where AutoL2S uses an <EASY> token, SBT emplo
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
