TL;DR
LAPO introduces a reinforcement learning framework that enables reasoning models to adaptively control their reasoning depth, reducing token usage and improving accuracy on mathematical benchmarks by internalizing length-awareness.
Contribution
LAPO transforms reasoning length control into an intrinsic model capability using a two-stage reinforcement learning process, unlike prior external or post-hoc methods.
Findings
Reduces token usage by up to 40.9%.
Improves reasoning accuracy by 2.3%.
Models develop emergent resource allocation abilities.
Abstract
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time…
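As a rough illustration of the first stage described above (collecting the length distribution of successful solutions), the following minimal sketch summarizes the token lengths of correct rollouts for a single problem. It is not the authors' implementation. The function name `summarize_successful_lengths`, the nearest-rank percentile rule, and the returned dictionary are assumptions made for illustration; the [30, 70] percentile range and the use of a median are taken from the reviewer comments on this page.

```python
import statistics


def summarize_successful_lengths(lengths, correct, lo_pct=30, hi_pct=70):
    """Summarize token lengths of correct rollouts for one problem.

    lengths: token counts, one per sampled solution.
    correct: booleans, True where the matching solution was correct.
    Returns None when no sampled solution was correct.
    """
    ok = sorted(n for n, c in zip(lengths, correct) if c)
    if not ok:
        return None

    def nearest_rank(p):
        # Nearest-rank percentile using integer arithmetic (an assumed rule).
        rank = -(-p * len(ok) // 100)  # ceiling of p * len / 100
        return ok[max(0, rank - 1)]

    return {
        "low": nearest_rank(lo_pct),
        "median": statistics.median(ok),
        "high": nearest_rank(hi_pct),
    }


stats = summarize_successful_lengths(
    [100, 200, 300, 400, 500, 9999],
    [True, True, True, True, True, False],
)
print(stats)
```

The incorrect 9999-token rollout is ignored, so the summary reflects only solutions that reached the right answer. How the paper turns such statistics into in-context guidance is not shown in this excerpt.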
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Overall, the proposed method is well-motivated and makes sensible design choices. The experimental results are comprehensive and strong, both in terms of Pass@1 and token count.
- The paper is generally well-written with a clear structure.
- For LAPO-D, the choice of [30, 70] percentile is a bit arbitrary, and I'm not fully convinced that it's necessary. Have you done ablations to determine whether percentile filtering is necessary?
- The source of the performance gains is unclear to me. Examining the results, it appears that most of the gains over prior work are attributed to LAPO-D, and even the Acc-Only ablation is quite strong. Do you know why your baseline is so strong compared to prior works? Is there a difference in any key
- The use of a prompted length within a RL loop so that the model also learns to generate the length and adhere to it is a nice extension of methods that manually prompted the model with a target length.
- The proposed method improves upon the base model in terms of token reduction.
- The paper proposes a two-stage method, but it is not clear why a two-stage method is needed. For example, it is unclear why optimizing a reward function that has a length penalty in a single stage would be worse than the proposed method. Additional experiments such as this are needed to better justify the two-stage design.
- The main experiments focus on justifying the method by comparing the new method with prior methods. However, the prior methods have several potential compounding factor
1. The paper addresses the critical and timely problem of computational inefficiency in chain-of-thought reasoning, which is a significant barrier to the practical deployment of large reasoning models.
2. The core concept of a two-stage "Discover-Internalize" process is novel (to the best of my knowledge) and well-motivated. The idea of transforming length control from an external constraint into an intrinsic model capability seems a promising research direction.
3. The experimental results demo
1. **Marginal Gains Over Simpler Baselines Undermined by Methodological Complexity:** The performance improvement of LAPO-I over the simpler single-stage "Acc-Only" RL baseline is marginal (e.g., from 63.9% to 64.8% average accuracy on the DeepScaleR model). This small gain is achieved via a complex two-stage framework with numerous design choices and new hyperparameters (e.g., the percentile range [P30, P70], reward weights \alpha and \beta, and the gaussian standard deviation \sigma). It seems
+ The primary strength is the methodological novelty. The core idea of a two-stage discover-then-internalize process is elegant.
+ The Discovery stage's use of a statistical median of successful solutions is a clever and simple heuristic, contrasting with more complex difficulty-prediction models.
+ The Internalization stage's mechanism, which combines an in-context, self-declarative statement with an explicit RL adherence reward, is a novel and interesting approach to instilling a policy for
**Major**
+ **Discrepancy in Baseline Performance and Lack of Statistical Variance:** The paper's reported baseline performance for DeepScaleR-1.5B-Preview shows a notable discrepancy with its original source (e.g., 35.5% Pass@1 on AIME2024 in this paper, vs. 43.1% reported in [1]). This raises questions about the experimental setup and the validity of the baseline reproductions. Furthermore, for high-variance tasks like long CoT reasoning, reporting statistical variance is crucial. The paper
