Shorter Thoughts, Same Answers: Difficulty-Scaled Segment-Wise RL for CoT Compression
Ye Tian, Aijun Liu

TL;DR
This paper introduces DSS-GRPO, a reinforcement learning method that compresses the reasoning traces of language models. It scales compression pressure by difficulty and restricts compression updates to the think segment, so token usage drops while the user-facing answer keeps its quality.
Contribution
The paper proposes DSS-GRPO, a difficulty-scaled, segment-wise RL approach that computes separate group-relative advantages for the think and answer segments, shortening reasoning traces without compromising answer accuracy.
Findings
Effective reduction in reasoning trace length
Maintains answer quality despite compression
Outperforms naive RL approaches in experiments
Abstract
Chain-of-thought (CoT) improves reasoning reliability but increases token cost, motivating post-training compression of explicit reasoning traces. However, the shortest sufficient reasoning is not universal: it depends on difficulty, model capacity, and training state, making fixed length targets brittle. In practice, naive RL-based compression can also undesirably shorten the user-facing answer, because a single completion-level learning signal leaks across the think/answer boundary. We propose Difficulty-Scaled Segment-Wise GRPO (DSS-GRPO), which decomposes returns into think and answer components, computes group-relative advantages per segment, and routes them with hard token masks so compression updates act only on think while answer alignment acts only on answer. DSS-GRPO uses prompt-wise within-group shaping and difficulty-aware scaling to encourage concise reasoning without…
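The segment-wise routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy rewards, and the choice of standardizing rewards by the group mean and standard deviation are assumptions made for illustration. The difficulty-aware scaling and within-group shaping are left out because the excerpt above does not specify them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled completions.

    Assumption: GRPO-style normalization (subtract group mean, divide by
    group standard deviation). eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_token_advantages(think_rewards, answer_rewards, think_masks):
    """Route per-segment advantages to tokens with hard masks.

    think_masks[i][t] is 1 if token t of completion i belongs to the think
    segment and 0 if it belongs to the answer segment. Think tokens receive
    only the think advantage; answer tokens receive only the answer advantage.
    """
    adv_think = group_relative_advantages(think_rewards)
    adv_answer = group_relative_advantages(answer_rewards)
    per_token = []
    for a_think, a_answer, mask in zip(adv_think, adv_answer, think_masks):
        per_token.append([a_think if m else a_answer for m in mask])
    return per_token


# Toy group of four completions for a single prompt (all values hypothetical).
think_rewards = [-12.0, -5.0, -8.0, -3.0]  # e.g. a penalty on think length
answer_rewards = [1.0, 1.0, 0.0, 1.0]      # e.g. an answer-quality score
think_masks = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0],
]

token_adv = segment_token_advantages(think_rewards, answer_rewards, think_masks)
```

Because each token sees only its own segment's advantage, a completion with a long think segment and a good answer is penalized on its think tokens without that penalty reaching its answer tokens, which is the leakage the abstract attributes to a single completion-level signal.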
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
