Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning

Xintong Li; Sha Li; Rongmei Lin; Hongye Jin; Linwei Li; Hejie Cui; Sarah Zhang; Chia-Yuan Chang; Kewei Cheng; Besnik Fetahu; Priyanka Nigam; Jingbo Shang; Bing Yin

arXiv:2603.00296·cs.CL·March 3, 2026

Open Access

TL;DR

This paper introduces SWAP (Step-wise Adaptive Penalization), a fine-grained reinforcement learning framework that shortens the chains-of-thought of large reasoning models by penalizing less important steps, yielding shorter and more accurate reasoning.

Contribution

SWAP is a novel step-level length penalty method that dynamically allocates penalties based on step importance, improving reasoning efficiency and accuracy.

Findings

01  Reduces reasoning length by 64.3% on average.

02  Improves accuracy by 5.7% relative to the base model.

03  Demonstrates effective length reduction without sacrificing correctness.

Abstract

Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the…
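
The idea of allocating a length penalty across steps according to their contribution can be illustrated with a small sketch. This is not the paper's implementation: the abstract is cut off before it names what the log-probability improvement is measured toward, so the code simply assumes a list of log-probabilities of some target after each reasoning prefix. The function names `step_importance` and `stepwise_penalties`, the clipping of negative gains, and the `length * (1 - importance)` weighting are all hypothetical choices made for illustration.

```python
from typing import List


def step_importance(target_logps: List[float]) -> List[float]:
    """Normalized importance per step (hypothetical definition).

    target_logps[k] is ASSUMED to be the log-probability of some target
    after the first k reasoning steps, for k = 0..n. A step's raw gain is
    the improvement it brings; negative gains are clipped to zero.
    """
    gains = [max(0.0, after - before)
             for before, after in zip(target_logps, target_logps[1:])]
    if not gains:
        return []
    total = sum(gains)
    if total == 0.0:
        # No step helped: treat all steps as equally (un)important.
        return [1.0 / len(gains)] * len(gains)
    return [g / total for g in gains]


def stepwise_penalties(step_lengths: List[int],
                       importance: List[float],
                       budget: float) -> List[float]:
    """Split a total length-penalty budget across steps.

    Each step's share is proportional to length * (1 - importance), so long,
    low-importance steps absorb most of the penalty (an illustrative rule).
    """
    weights = [n * (1.0 - w) for n, w in zip(step_lengths, importance)]
    total = sum(weights)
    if total == 0.0:
        return [0.0] * len(weights)
    return [budget * w / total for w in weights]


# Toy trajectory with three steps; the middle step adds nothing.
logps = [-4.0, -3.0, -3.0, -1.0]
importance = step_importance(logps)
penalties = stepwise_penalties([30, 60, 30], importance, budget=0.9)
```

In this toy case the long middle step, which brings no log-probability gain, receives the largest share of the penalty, while a trajectory-level penalty would have charged all 120 tokens at the same rate.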

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling