TL;DR
This paper introduces LC-R1, a post-training method that compresses the reasoning chains of large reasoning models by removing invalid thinking steps, shortening sequences by about 50% with minimal accuracy loss.
Contribution
The paper proposes LC-R1, a post-training approach guided by two fine-grained principles, Brevity and Sufficiency, that shortens reasoning chains in large reasoning models (LRMs) through a new reward design, improving efficiency without substantial accuracy loss.
Findings
Achieves ~50% reduction in reasoning sequence length.
Maintains ~98% of original accuracy after compression.
Demonstrates robustness across multiple benchmarks.
Abstract
Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as "invalid thinking" -- models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive…
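To make the reward design above more concrete, here is a minimal sketch of how a length reward and a compress reward could feed GRPO's group-relative advantage. Only the overall structure (GRPO, a Length Reward for conciseness, a Compress Reward targeting the thinking that continues after the correct answer is found) comes from the abstract. The `Rollout` type, the substring-based answer detection, the exact reward formulas, gating the length reward on correctness, and the additive combination with a 0/1 correctness reward are all hypothetical illustrations, not LC-R1's actual definitions.

```python
import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    steps: list[str]  # reasoning steps of one sampled completion (final answer excluded)
    answer: str       # final answer the model commits to


def first_answer_step(steps: list[str], gold: str) -> Optional[int]:
    """Index of the first step mentioning the gold answer (hypothetical heuristic)."""
    for i, step in enumerate(steps):
        if gold in step:
            return i
    return None


def compress_reward(r: Rollout, gold: str) -> float:
    """Hypothetical: 1 minus the fraction of steps spent after the answer first appears."""
    idx = first_answer_step(r.steps, gold)
    if idx is None or r.answer != gold:
        return 0.0
    invalid = len(r.steps) - (idx + 1)  # "invalid thinking": re-checking after the answer
    return 1.0 - invalid / len(r.steps)


def length_reward(r: Rollout, group: list[Rollout], gold: str) -> float:
    """Hypothetical: correct rollouts shorter than the group's longest earn more."""
    if r.answer != gold:
        return 0.0
    longest = max(len(g.steps) for g in group)
    if longest == 0:
        return 0.0
    return 1.0 - len(r.steps) / longest


def grpo_advantages(group: list[Rollout], gold: str) -> list[float]:
    """Group-relative advantages: (reward - group mean) / group std."""
    rewards = [
        float(r.answer == gold) + length_reward(r, group, gold) + compress_reward(r, gold)
        for r in group
    ]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(x - mean) / (std + 1e-6) for x in rewards]


# Usage: one prompt ("2+2?"), three sampled rollouts.
group = [
    Rollout(["2+2=4"], "4"),                                        # concise, correct
    Rollout(["2+2=4", "check: 2+2 is 4", "recheck: 4"], "4"),       # correct, re-checks
    Rollout(["2+2=5"], "5"),                                        # wrong
]
advantages = grpo_advantages(group, "4")
```

Under these assumptions, the concise correct rollout receives the largest advantage, the correct rollout that keeps re-checking lands at the group mean, and the wrong rollout is pushed below it. Normalizing rewards within the sampled group is what lets GRPO dispense with a learned value function.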
