Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

Bin Hong; Jiayu Liu; Kai Zhang; Jianwen Sun; Mengdi Zhang; Zhenya Huang

arXiv:2508.10164 · cs.AI · April 16, 2026

TL;DR

This paper introduces LCPO, a method that shortens the outputs of large reasoning models by over 50% without sacrificing reasoning quality, using only limited data and training.

Contribution

The paper proposes Length Controlled Preference Optimization (LCPO), a preference optimization approach that reduces the output length of Large Reasoning Models (LRMs) with limited tuning.

Findings

01

LCPO reduces output length by over 50% across multiple benchmarks.

02

LCPO maintains reasoning performance despite shorter outputs.

03

The paper analyzes the convergence characteristics of various preference optimization objectives under a unified Bradley-Terry loss-based framework.

Abstract

Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current solutions often compromise reasoning quality or require extensive resources. In this paper, we investigate how to reduce the generation length of LRMs with limited tuning. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence characteristics of various preference optimization objectives under a unified Bradley-Terry loss-based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss.…
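
For readers unfamiliar with the Bradley-Terry framing mentioned in the abstract, the following is a minimal Python sketch of the generic Bradley-Terry preference loss, -log sigmoid(r_chosen - r_rejected), which many preference optimization objectives share. It is not the LCPO objective, which the truncated abstract does not spell out. The function names, the `beta` value, and the DPO-style implicit reward (a scaled log-probability ratio against a reference model) are illustrative assumptions, not taken from the paper.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Generic Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    Computed as softplus(-margin) in a numerically stable way.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def dpo_style_implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """One well-known implicit reward (DPO-style), used here only as an example.

    Assumption: sequence log-probabilities under the policy and a frozen
    reference model are already available; beta is a hypothetical setting.
    """
    return beta * (logp_policy - logp_ref)


# Hypothetical numbers: a short (preferred) and a long (rejected) response.
r_short = dpo_style_implicit_reward(logp_policy=-40.0, logp_ref=-45.0)
r_long = dpo_style_implicit_reward(logp_policy=-120.0, logp_ref=-118.0)
loss = bradley_terry_loss(r_short, r_long)
print(f"preference loss: {loss:.4f}")
```

The loss shrinks as the implicit reward of the preferred (here, shorter) response grows relative to the rejected one. How LCPO defines and balances its implicit reward is described in the paper itself.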

Videos

Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization · SlidesLive