LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

Xingyu Wu; Yuchen Yan; Shangke Lyu; Linjuan Wu; Yiwen Qiu; Yongliang Shen; Weiming Lu; Jian Shao; Jun Xiao; Yueting Zhuang

arXiv:2507.15758·cs.AI·August 15, 2025

4 Reviews

TL;DR

LAPO introduces a reinforcement learning framework that enables reasoning models to adaptively control their reasoning depth, reducing token usage and improving accuracy on mathematical benchmarks by internalizing length-awareness.

Contribution

LAPO transforms reasoning length control into an intrinsic model capability using a two-stage reinforcement learning process, unlike prior external or post-hoc methods.

Findings

01. Reduces token usage by up to 40.9%.

02. Improves reasoning accuracy by 2.3%.

03. Models develop emergent resource allocation abilities.

Abstract

Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time…
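The first stage described above, collecting the distribution of successful solution lengths, can be illustrated with a small sketch. This is not the authors' implementation: the function names `percentile` and `length_stats`, the rollout format, and the sample data are all assumptions made for illustration. The [30, 70] percentile range and the median are taken from the reviewers' descriptions further down this page.

```python
from statistics import median


def percentile(sorted_vals, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def length_stats(rollouts, low_q=30, high_q=70):
    """Summarize token lengths of correct rollouts for one problem.

    rollouts: iterable of (token_length, is_correct) pairs (hypothetical format).
    Returns None when no rollout is correct, since there is nothing to summarize.
    """
    lengths = sorted(n for n, ok in rollouts if ok)
    if not lengths:
        return None
    return {
        "low": percentile(lengths, low_q),    # lower bound of the target range
        "high": percentile(lengths, high_q),  # upper bound of the target range
        "median": median(lengths),            # single-number length target
    }


# Hypothetical rollouts for a single problem: (token_length, is_correct).
example = [(100, True), (200, True), (300, True),
           (400, True), (500, True), (5000, False)]
stats = length_stats(example)
print(stats)
```

For the sample data, the incorrect 5000-token rollout is ignored, so the target range is roughly 220 to 380 tokens with a median of 300. How these statistics are then turned into rewards and in-context guidance is not specified in the text available here.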

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 6·Confidence 3

Strengths

- Overall, the proposed method is well-motivated and makes sensible design choices. The experimental results are comprehensive and strong, both in terms of Pass@1 and token count.
- The paper is generally well-written with a clear structure.

Weaknesses

- For LAPO-D, the choice of [30, 70] percentile is a bit arbitrary, and I'm not fully convinced that it's necessary. Have you done ablations to determine whether percentile filtering is necessary?
- The source of the performance gains is unclear to me. Examining the results, it appears that most of the gains over prior work are attributed to LAPO-D, and even the Acc-Only ablation is quite strong. Do you know why your baseline is so strong compared to prior works? Is there a difference in any key

Reviewer 02·Rating 2·Confidence 4

Strengths

- The use of a prompted length within an RL loop, so that the model also learns to generate the length and adhere to it, is a nice extension of methods that manually prompted the model with a target length.
- The proposed method improves upon the base model in terms of token reduction.

Weaknesses

- The paper proposes a two-stage method, but it is not clear why a two-stage method is needed. For example, it is unclear why optimizing a reward function that has a length penalty in a single stage would be worse than the proposed method. Additional experiments such as this are needed to better justify the two-stage design.
- The main experiments focus on justifying the method by comparing the new method with prior methods. However, the prior methods have several potential confounding factor

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The paper addresses the critical and timely problem of computational inefficiency in chain-of-thought reasoning, which is a significant barrier to the practical deployment of large reasoning models.
2. The core concept of a two-stage "Discover-Internalize" process is novel (to the best of my knowledge) and well-motivated. The idea of transforming length control from an external constraint into an intrinsic model capability seems a promising research direction.
3. The experimental results demo

Weaknesses

1. **Marginal Gains Over Simpler Baselines Undermined by Methodological Complexity:** The performance improvement of LAPO-I over the simpler single-stage "Acc-Only" RL baseline is marginal (e.g., from 63.9% to 64.8% average accuracy on the DeepScaleR model). This small gain is achieved via a complex two-stage framework with numerous design choices and new hyperparameters (e.g., the percentile range [P30, P70], reward weights \alpha and \beta, and the gaussian standard deviation \sigma). It seems

Reviewer 04·Rating 2·Confidence 4

Strengths

+ The primary strength is the methodological novelty. The core idea of a two-stage discover-then-internalize process is elegant.
+ The Discovery stage's use of a statistical median of successful solutions is a clever and simple heuristic, contrasting with more complex difficulty-prediction models.
+ The Internalization stage's mechanism, which combines an in-context, self-declarative statement with an explicit RL adherence reward, is a novel and interesting approach to instilling a policy for

Weaknesses

**Major**

+ **Discrepancy in Baseline Performance and Lack of Statistical Variance:** The paper's reported baseline performance for DeepScaleR-1.5B-Preview shows a notable discrepancy with its original source (e.g., 35.5% Pass@1 on AIME2024 in this paper, vs. 43.1% reported in [1]). This raises questions about the experimental setup and the validity of the baseline reproductions. Furthermore, for high-variance tasks like long CoT reasoning, reporting statistical variance is crucial. The paper
