TL;DR
This paper introduces LCPO, a method that shortens the outputs of large reasoning models by over 50% while maintaining reasoning performance, using limited data and training.
Contribution
The paper proposes Length Controlled Preference Optimization (LCPO), a preference optimization approach that reduces the output length of Large Reasoning Models (LRMs) with limited tuning.
Findings
LCPO reduces output length by over 50% across multiple benchmarks.
LCPO maintains reasoning performance despite shorter outputs.
The paper analyzes the convergence characteristics of several preference optimization objectives under a unified Bradley-Terry loss based framework.
Abstract
Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current solutions often compromise reasoning quality or require extensive resources. In this paper, we investigate how to reduce the generation length of LRMs with limited tuning. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence characteristics of various preference optimization objectives under a unified Bradley-Terry loss based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss.…
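For readers unfamiliar with the Bradley-Terry framing mentioned in the abstract, the following is a minimal sketch of the generic Bradley-Terry preference loss that objectives in this family build on. It is not the LCPO objective, whose exact form is not given in the text above. The function names, the `beta` parameter, and the example log-probabilities are all illustrative assumptions; the implicit reward shown is the well-known DPO form, used here only as one example of an objective in this family.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) = sigmoid(r_c - r_r),
    so the loss is -log(sigmoid(r_c - r_r)) = log(1 + exp(-(r_c - r_r))).
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def dpo_implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Shown as one example of an implicit reward; other objectives define it differently.
    """
    return beta * (logp_policy - logp_ref)


# Hypothetical pair: a short correct trajectory (chosen) and a long one (rejected).
# The log-probabilities below are made-up numbers for illustration only.
r_short = dpo_implicit_reward(logp_policy=-40.0, logp_ref=-45.0)
r_long = dpo_implicit_reward(logp_policy=-210.0, logp_ref=-200.0)
loss = bradley_terry_loss(r_short, r_long)
```

In a length-reduction setting, the preferred response in each pair would typically be the shorter trajectory, so minimizing a loss of this shape pushes the model toward shorter outputs. How LCPO balances the implicit reward against the NLL term is described in the paper itself and is not reproduced here.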