Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty
Zewei Yu, Lirong Gao, Yuke Zhu, Bo Zheng, Junbo Zhao, Sheng Guo, Haobo Wang

TL;DR
This paper introduces ARLCP, a reinforcement learning framework that reduces unnecessary reflection in large reasoning models, leading to more concise reasoning paths, lower token usage, and improved accuracy on mathematical benchmarks.
Contribution
ARLCP is a novel method that adaptively penalizes reflection and length, optimizing reasoning efficiency and accuracy in large reasoning models.
Findings
Reduces average response length by 53.1% in 1.5B models.
Improves accuracy by 5.8% in 1.5B models.
Achieves 35.0% length reduction and 2.7% accuracy gain in 7B models.
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains-of-thought that, driven by substantial reflections such as repetitive self-questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that…
Peer Reviews
Decision·ICLR 2026 Poster
- The work directly addresses "over-reflection" in Large Reasoning Models (LRMs), a key issue that leads to high token consumption, computational overhead, and latency without improving accuracy.
- The core idea of an adaptive reflection penalty is a key innovation.
- The thorough analysis and ablation show that the method improves both efficiency and accuracy.
- The core "Adaptive Reflection Penalty" relies entirely on a manually curated list of "reflection-trigger keywords" like "wait", "hmm", "alternatively", etc. This relies on a specific model response style and is not generalizable to other distributions or domains.
- The method introduces several new and important hyperparameters: the complexity thresholds ($n_1=40, n_2=80$) and the penalty weights ($\lambda_1=0.05, \lambda_2=0.1, \lambda_3=0.15, \alpha=0.2$). The paper presents these as fixed v
**Strengths**:
1. The paper is well-written and clearly structured, making it easy to follow.
2. It identifies a practical issue, over-reflection in large reasoning models, and provides a targeted solution.
3. The proposed ARLCP method is simple yet effective, adaptively balancing reflection and reasoning length.
4. The experiments on multiple benchmarks demonstrate clear improvements in both accuracy and efficiency.
Typo, line 448: analysis -> analyze
**Weaknesses**:
1. The method appears potentially sensitive to the hyperparameters n₁ and n₂. It would be helpful if the authors could analyze how these values affect stability and performance.
2. The paper lacks clear guidance or heuristics for selecting n₁ and n₂, which seem crucial to the method’s effectiveness.
3. The work assumes that longer reasoning tends to produce incorrect answers, yet the analysis mainly shows correlation rather than causation. It
1. The idea of estimating problem difficulty through the frequency of reflection keywords is novel and interesting.
2. The method is well-motivated, conceptually sound, and shows empirical effectiveness.
3. The writing is clear and the presentation is good.
1. The method may not generalize well to other model families. It assumes that reflection behaviors are expressed through specific keywords, which is a strong and model-dependent assumption. It would strengthen the paper if the authors extended the analysis of reflection-token counts versus problem difficulty to other reasoning model families (GPT-OSS, Qwen 3 Thinking, etc.).
2. All experiments are conducted on a single model family (DeepSeek-distilled Qwen), limiting the generality of the findi
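The reviews above describe a reflection count built from trigger keywords, two complexity thresholds ($n_1=40, n_2=80$), and three penalty weights ($\lambda_1=0.05, \lambda_2=0.1, \lambda_3=0.15$). The following is a minimal Python sketch of how such a count and threshold lookup could work, not the authors' implementation. The function names `count_reflections` and `penalty_weight` are hypothetical, and the keyword tuple holds only the three examples quoted in the review. The assignment of each weight to a bucket is an assumption made for illustration. The role of $\alpha$ is not stated in this excerpt, so it is left out.

```python
import re

# Only the three example keywords quoted in the review; the paper's full list is not shown here.
REFLECTION_KEYWORDS = ("wait", "hmm", "alternatively")

# Threshold and weight values as quoted in the review.
N1, N2 = 40, 80
LAMBDA_1, LAMBDA_2, LAMBDA_3 = 0.05, 0.1, 0.15


def count_reflections(text, keywords=REFLECTION_KEYWORDS):
    """Count whole-word, case-insensitive occurrences of the trigger keywords."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def penalty_weight(reflection_count):
    """Pick a penalty weight from the reflection count.

    ASSUMPTION: which weight belongs to which bucket is illustrative only;
    this excerpt does not say how the paper pairs them.
    """
    if reflection_count < N1:
        return LAMBDA_1
    if reflection_count < N2:
        return LAMBDA_2
    return LAMBDA_3


if __name__ == "__main__":
    sample = "Wait, let me check. Hmm, alternatively I could factor it. Wait."
    n = count_reflections(sample)
    print(n, penalty_weight(n))
```

Because the count depends on a fixed keyword list, a model that signals reflection with different phrasing would be undercounted, which is the generalization concern the reviewers raise.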
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
