LACONIC: Length-Aware Constrained Reinforcement Learning for LLM
Chang Liu, Yiran Zhao, Lawrence Liu, Yaoqi Ye, Csaba Szepesvári, Lin F. Yang

TL;DR
LACONIC is a reinforcement learning approach that effectively enforces length constraints on large language models, reducing output length by over 50% while maintaining or improving task performance across various benchmarks.
Contribution
It introduces a length-aware RL method with adaptive cost scaling that ensures robust length control without sacrificing task accuracy.
Findings
Reduces output length by over 50% on mathematical reasoning tasks.
Maintains out-of-domain performance with 44% fewer tokens.
Integrates seamlessly into existing RL-tuning pipelines.
Abstract
Reinforcement learning (RL) has enhanced the capabilities of large language models (LLMs) through reward-driven training. Nevertheless, this process can introduce excessively long responses, inflating inference latency and computational overhead. Prior length-control approaches typically rely on fixed heuristic reward shaping, which can misalign with the task objective and require brittle tuning. In this work, we propose LACONIC, a reinforcement learning method that enforces a target token budget during training. Specifically, we update policy models using an augmented objective that combines the task reward with a length-based cost. To balance brevity and task performance, the cost scale is adaptively adjusted throughout training. This yields robust length control while preserving task reward. We provide a theoretical guarantee that supports the method. Across mathematical reasoning…
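To make the training signal described in the abstract concrete, here is a minimal sketch of a task reward combined with a length-based cost whose scale is adjusted during training. It is an illustration under stated assumptions, not the paper's implementation: the class `LengthDual`, its fields (`budget`, `step_size`, `lam`, `lam_max`), the normalized cost `(length - budget) / budget`, and the projected dual-ascent update are all hypothetical stand-ins, since the paper's actual cost expression and update rule are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LengthDual:
    """Hypothetical helper that tracks a nonnegative cost scale (a dual variable)."""

    budget: int               # target token budget (assumed name)
    step_size: float = 0.01   # dual step size (assumed hyperparameter)
    lam: float = 0.0          # current cost scale, starts with no length penalty
    lam_max: float = 10.0     # assumed upper clip that keeps the penalty bounded

    def cost(self, length: int) -> float:
        # Assumed cost: relative deviation from the budget.
        # Positive when the response is over budget, negative when under.
        return (length - self.budget) / self.budget

    def shaped_rewards(self, rewards: List[float], lengths: List[int]) -> List[float]:
        # Augmented objective: task reward minus the scaled length cost.
        return [r - self.lam * self.cost(n) for r, n in zip(rewards, lengths)]

    def update(self, lengths: List[int]) -> float:
        # Projected dual ascent (assumed rule): raise the scale when the batch
        # is over budget on average, lower it otherwise, then clip to [0, lam_max].
        mean_cost = sum(self.cost(n) for n in lengths) / len(lengths)
        self.lam = min(self.lam_max, max(0.0, self.lam + self.step_size * mean_cost))
        return self.lam
```

In a loop of this shape, `shaped_rewards` would feed whatever advantage computation the policy update already uses, and `update` would be called once per batch with the sampled response lengths. Because the scale falls back toward zero once responses fit the budget, the length penalty fades when it is no longer needed, which is the intuition behind adapting the cost scale rather than fixing it.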
Peer Reviews
Decision · Submitted to ICLR 2026
- The proposed idea is notably simple and highly effective. Various problems in RL (for example, safety) involve multiple constraints. This paper effectively borrows from these approaches by formulating the length restriction as a constrained optimization problem.
- The proposed approach results in a minor modification to the standard training process for LLMs on reasoning tasks. Empirically, it achieves marginally better or similar performance compared to its baselines, while using fewer output
- The primary weakness of the paper lies in the update rule for the Lagrange multiplier proposed in Eq. 6. It is unclear how that update rule is obtained, and it does not appear to follow the standard procedure of constrained optimization, where partial derivatives of the Lagrange function are set to zero and the system of equations is solved simultaneously. Furthermore, the cost expression in Eq. 3 is unbounded, and in certain cases, it might outweigh the reward term, resulting in undesirable u
- The paper is well written and presented.
- The method is well motivated: rising inference-time cost, driven largely by growing response lengths, is an important problem, and the paper tackles it with an appropriate solution.
- The proposed LACONIC method, which introduces a primal-dual optimization strategy, is technically novel.
- All the conducted experiments are on small-scale (1.5B) models. Previous work such as Sober Reasoning (https://arxiv.org/abs/2504.07086) has shown that RL on small-scale models might not be reliable. Further experiments on larger and more diverse models (7B or larger) are required to ensure that the results are conclusive and transfer to real-world scales.
- The numbers in Table 1 are categorically lower than the baseline numbers reported in the Sober Reasoning paper. Since this previous
- The proposed solution is interesting. The method elegantly reinterprets length control as a constrained optimization problem rather than heuristic reward shaping.
- The paper is easy to follow.
- The major concern is that the experiments are only conducted on 1.5B-scale models. Also, the results of Qwen2.5-Math-1.5B-Instruct are not very strong compared with GRPO.
- While the primal-dual approach is conceptually sound, the paper lacks formal convergence analysis.
- Baselines are mainly GRPO and L1-based methods. It would strengthen the paper to include comparisons with more recent efficiency-oriented RL approaches (e.g., GFPO [1]).
[1] Sample More to Think Less: Group Filtered Policy
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
