DART: Difficulty-Adaptive Reasoning Truncation for Efficient Large Language Models

Ruofan Zhang; Bin Xia; Zhen Cheng; Cairen Jian; Minglun Yang; Ngai Wong; Yuan Cheng

arXiv:2511.01170·cs.AI·December 17, 2025


Open Access 3 Reviews

TL;DR

DART introduces a supervised framework that adaptively truncates reasoning in large language models based on problem difficulty, significantly improving efficiency while maintaining or enhancing accuracy across mathematical benchmarks.

Contribution

It presents a novel difficulty-adaptive reasoning truncation method that learns when to stop thinking, improving efficiency without sacrificing accuracy in LLM reasoning tasks.

Findings

01 · Achieves 81.2% reasoning truncation on the GSM8K dataset.

02 · Provides 5.33× computational acceleration.

03 · Maintains or improves reasoning accuracy.
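
The first two findings can be related by simple arithmetic. The sketch below is only a consistency check, assuming that decoding cost scales linearly with the number of generated tokens; this page does not say that the 5.33× figure was measured on GSM8K or that it follows from token counts alone. The function name `speedup_from_truncation` is hypothetical.

```python
# Back-of-the-envelope check (assumption: cost is proportional to tokens generated).

def speedup_from_truncation(truncation_rate: float) -> float:
    """Ideal speedup when a fraction `truncation_rate` of reasoning tokens is removed."""
    if not 0.0 <= truncation_rate < 1.0:
        raise ValueError("truncation_rate must be in [0, 1)")
    # Keeping (1 - r) of the tokens means doing (1 - r) of the work.
    return 1.0 / (1.0 - truncation_rate)

ideal = speedup_from_truncation(0.812)
print(f"{ideal:.2f}x")  # prints "5.32x"
```

Under that assumption, an 81.2% truncation rate corresponds to roughly a 5.3× ideal speedup, which is close to the reported 5.33×.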

Abstract

Adaptive reasoning is essential for aligning the computational effort of large language models (LLMs) with the intrinsic difficulty of problems. Current chain-of-thought methods boost reasoning ability but indiscriminately generate long explanations, leading to evident inefficiency. However, existing reinforcement learning approaches to adaptive thinking remain unstable and heavily reward-dependent. Here we propose DART, a supervised Difficulty-Adaptive Reasoning Truncation framework that adjusts thinking length according to problem difficulty. By distilling concise reasoning patterns from stronger models, interpolating them into a continuum of reasoning styles, and curating optimal training data that balances correctness and compactness, DART learns when to "stop thinking". Across multiple mathematical benchmarks, experimental results…
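
The interpolation and curation steps named in the abstract can be illustrated with a minimal sketch. It assumes that "interpolating" means a linear blend of model weights (as Reviewer 02 describes it) and that "balancing correctness and compactness" means keeping the shortest correct response per problem. Neither detail is confirmed by the text available here, and every name below (`interpolate_weights`, `Candidate`, `pick_shortest_correct`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Weights = Dict[str, List[float]]  # toy stand-in for a model state dict


def interpolate_weights(long_w: Weights, short_w: Weights, alpha: float) -> Weights:
    """Linear blend: alpha=0 gives the long-CoT model, alpha=1 the short-CoT model."""
    if long_w.keys() != short_w.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [(1.0 - alpha) * a + alpha * b
               for a, b in zip(long_w[name], short_w[name])]
        for name in long_w
    }


@dataclass
class Candidate:
    alpha: float      # fusion coefficient of the model that produced this response
    reasoning: str    # generated chain of thought
    answer: str       # final answer extracted from the response


def pick_shortest_correct(
    candidates: List[Candidate],
    gold_answer: str,
    length_fn: Callable[[str], int] = len,  # assumption: characters as a length proxy
) -> Optional[Candidate]:
    """Keep the most compact response that is still correct, or None if all fail."""
    correct = [c for c in candidates if c.answer.strip() == gold_answer.strip()]
    if not correct:
        return None
    return min(correct, key=lambda c: length_fn(c.reasoning))


# Toy usage: three points on a hypothetical model spectrum answer one problem.
samples = [
    Candidate(0.0, "Let x be ... (long derivation) ... so x = 4", "4"),
    Candidate(0.5, "2x = 8, so x = 4", "4"),
    Candidate(1.0, "x = 3", "3"),  # shortest, but wrong
]
best = pick_shortest_correct(samples, gold_answer="4")
print(best.alpha)  # prints 0.5
```

In this toy example the easiest-to-shorten problem keeps the mid-spectrum response; a harder problem, where only the long-reasoning model is correct, would keep the long one, which is how such curated data could encode difficulty-dependent length.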

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper focuses on an important problem—the inefficiency of CoT. The proposed framework is presented as modular and is conceptually sound.

Weaknesses

- Clarity issues. Some parts of the methodology are not clearly explained. For example, it is unclear how the distillation teacher model shortens long reasoning chains and how this process affects the quality of the reasoning paths. Additionally, the paper does not discuss how the quality of the generated reasoning chains is controlled or verified.
- Limited novelty. The proposed method essentially involves collecting question-answer pairs with varying reasoning lengths and using this dataset to

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper addresses an important problem: improving reasoning efficiency for large language models.
- The four-step framework (distilling short CoTs, interpolation, creating a model spectrum, curating training data, adaptive training) is clearly structured and easy to follow.
- The experiments cover several standard mathematical reasoning benchmarks and include analyses of certain hyperparameters, such as fusion coefficients and sampling density.

Weaknesses

- Limited novelty. The idea of adaptive, difficulty-aware reasoning is not new, and prior work such as CoT-Valve has already explored similar strategies for interpolating model weights and curating adaptive data based on correctness.
- The method appears less effective on DeepSeek-R1-Distill-Qwen-7B. On benchmarks such as GSM8K, MATH-500, and Olympiad, the generated token length is reduced, but the accuracy also drops.
- The short-CoT data generated from DeepSeek-R1-Distill-Qwen-7B is importan

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The writing is straightforward and easy to understand. The paper proposes an angle for training efficient LRMs based on different difficulty levels. The experiments show that the method yields some improvements on different models with reduced generation length.

Weaknesses

It is not very clear what advantage there is in using extrapolation to generate responses of different lengths for different difficulty levels. I understand that extrapolation could help control the generation length, which can then be used to select the data included in the final training. It is not clear how this extrapolation-based data generation method compares with using a prompt-based method to generate responses of different lengths. Lack of experi


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)