Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He

TL;DR
This paper introduces LASER and LASER-D, novel length-based reward shaping methods for reinforcement learning in large reasoning models, significantly improving reasoning efficiency and performance by adaptively penalizing redundant reasoning steps.
Contribution
It proposes a unified framework and a dynamic, difficulty-aware reward shaping method that enhances reasoning efficiency and performance in large models.
Findings
LASER outperforms previous methods in balancing performance and efficiency.
LASER-D achieves a +6.1 improvement on AIME2024 and reduces token usage by 63%.
The approach produces more concise reasoning with less redundancy.
Abstract
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of…
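The step-function reward described in the abstract can be illustrated with a minimal sketch. The abstract states only that the reward is a step function controlled by a target length. Everything else here is an assumption made for illustration, not the authors' implementation: the function name `step_length_reward`, the `bonus` value, the additive combination with a 0/1 correctness reward, and the choice to grant the bonus only to correct responses.

```python
def step_length_reward(
    is_correct: bool,
    num_tokens: int,
    target_length: int,
    bonus: float = 0.5,  # hypothetical bonus size, not taken from the paper
) -> float:
    """Correctness reward plus a step-function length bonus (illustrative)."""
    # Base reward: 1 for a correct final answer, 0 otherwise (assumed).
    correctness = 1.0 if is_correct else 0.0
    # Step function: the bonus is all-or-nothing, granted when the response
    # fits within the target length. Gating on correctness is an assumption.
    within_target = num_tokens <= target_length
    length_bonus = bonus if (is_correct and within_target) else 0.0
    return correctness + length_bonus


# Usage: three rollouts scored against a hypothetical 4096-token target.
rollouts = [(True, 3000), (True, 6000), (False, 2000)]
rewards = [step_length_reward(ok, n, target_length=4096) for ok, n in rollouts]
```

Under these assumptions, a correct response within the target earns the full bonus, a correct but overlong response keeps only its correctness reward, and an incorrect response earns nothing regardless of length. The reward therefore changes in a single jump at the target length rather than growing gradually with each extra token.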
Peer Reviews
Decision·ICLR 2026 Poster
- The paper is well motivated and shows a good understanding of existing methods in LLM efficient reasoning. - The experimental study seems comprehensive (applying the method to 5 LLMs and evaluating on both in-domain and out-of-domain tasks). - The paper is rigorous in terms of including important details like hyper-parameter choices and abundant ablation studies in the main text & appendix.
1. My main concern lies in the numerous *manual design choices* (often appearing as hyperparameters) despite the paper’s claim that most modules are *automatically adaptive*. For instance, the criterion for defining $L_A$ is set as the smallest value ensuring $ECR_d \geq 1$, where “1” represents the threshold for “at least one complete and correct response.” Why fix this threshold as a constant rather than make it depend on the rollout size $K$? Similarly, the lower bound length $L_T$ appears to
1. The paper presents empirical results demonstrating a strong trade-off between efficiency and performance. The proposed methods consistently reduce token usage by a good margin across multiple models and benchmarks while often improving accuracy. 2. The introduction of a unified framework for RL-based CoT compression is a good conceptual contribution. It provides a clear and structured approach to understanding and comparing different reward shaping strategies.
1. Framing and Claims: The framing of performance improvements and baseline comparisons could be strengthened. The paper claims adaptive methods like AutoThink and Thinkless remain "verbose on hard ones (AIME)." However, the data in Table 4 shows these methods do reduce token usage on AIME compared to the original model, albeit less than LASER-D. Furthermore, AutoThink achieves higher accuracy on AIME than several LASER and LASER-D variants, which complicates the claim. 2. Deepseek-r1 distilled
1. The paper conducts extensive evaluation experiments that expose the deficiencies of current reward shaping methods for efficient LRM training. 2. The design of the method, especially how it monitors difficulty dynamics, and the logic of the paper are reasonable. 3. The experiment section includes many baselines, different length settings, and different model sizes.
1. While the method gains some novelty from its length penalty findings, it reads as an improved version of the previous method. 2. The experimental tasks are limited to mathematical reasoning. 3. The model family is limited to 2 versions of DeepSeek models. 4. While the tables in the experiment section show the effectiveness of this method, it is still somewhat hard to compare methods because other methods are sometimes better at token usage.
Taxonomy
Topics: Autonomous Vehicle Technology and Safety
