DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
Tian Liang, Wenxiang Jiao, Zhiwei He, Jiahao Xu, Haitao Mi, Dong Yu

TL;DR
DeepCompress introduces a dual reward strategy that adaptively adjusts reasoning chain lengths in large reasoning models, improving both accuracy and efficiency by classifying problems as simple or hard in real time.
Contribution
It proposes a novel adaptive length reward mechanism that dynamically balances reasoning depth based on problem difficulty, improving both accuracy and token efficiency.
Findings
Outperforms baseline methods on mathematical benchmarks.
Achieves higher accuracy with reduced token usage.
Effectively balances exploration and efficiency in reasoning chains.
Abstract
Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like "overthinking" simple problems and "underthinking" complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as "Simple" or "Hard" in real-time based on the model's evolving capability. It encourages shorter, more efficient…
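The "Simple"/"Hard" split can be pictured with a small sketch. It assumes the rule described in the peer reviews: a problem counts as Simple when its group pass ratio (Pg) is above the batch pass ratio (Pb), and Hard otherwise. The function names, the choice of Pb as the fraction of correct responses across the whole batch, and the treatment of ties as Hard are illustrative assumptions, not details confirmed by the paper.

```python
def pass_ratio(flags):
    """Fraction of sampled responses that were judged correct."""
    return sum(flags) / len(flags)


def classify_batch(groups):
    """Label each problem relative to the current batch.

    groups: dict mapping a problem id to a list of bool correctness flags,
            one flag per sampled response (hypothetical input format).
    Returns (labels, batch_ratio).
    """
    # Pg: pass ratio of each problem's own group of samples.
    group_ratios = {pid: pass_ratio(flags) for pid, flags in groups.items()}

    # Pb: pass ratio over every response in the batch (assumption).
    all_flags = [flag for flags in groups.values() for flag in flags]
    batch_ratio = pass_ratio(all_flags)

    # Assumption: ties (Pg == Pb) are treated as Hard.
    labels = {
        pid: "Simple" if pg > batch_ratio else "Hard"
        for pid, pg in group_ratios.items()
    }
    return labels, batch_ratio


if __name__ == "__main__":
    # A batch made up only of easy problems: Pb = 7/8 = 0.875.
    easy_batch = {
        "x": [True, True, True, True],
        "y": [True, True, True, False],
    }
    labels, pb = classify_batch(easy_batch)
    print(labels, pb)  # "y" is labeled Hard despite a 75% pass ratio
```

Because the label is relative to the batch, a problem the model solves three times out of four is still labeled Hard when the rest of the batch is easier, which is the kind of skew one of the reviews raises.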
Peer Reviews
Decision: ICLR 2026 Poster
- This paper introduces a method that rewards shorter reasoning for simple problems and encourages longer, exploratory reasoning for difficult ones. This challenges the standard approach of always favoring conciseness.
- This paper demonstrates that DeepCompress models achieve state-of-the-art results, outperforming strong baseline models across multiple challenging mathematical reasoning benchmarks, especially on difficult problems like AIME.
- The research provides a thorough analysis showin
- The core mechanism hinges on classifying a problem as "Simple" or "Hard" based on whether its group pass ratio (Pg) is above or below the batch pass ratio (Pb). This is a noisy and relative metric. A problem isn't inherently "Hard"; it's just harder than the batch average. This could lead to suboptimal rewards, especially in batches with skewed difficulty distributions.
- The paper proposes conditioning the length reward on the correctness of a solution to prevent reward hacking. However, th
+ The observation that Pass@1 decreases while Pass@32 increases with length (Figure 1) is genuinely insightful.
+ The experiments cover seven different benchmarks and are holistic enough for math problems.
+ The paper presents a coherent and compelling line of research: clear motivation, principled method design, and in-depth analysis. The writing clearly connects empirical findings to design choices to performance outcomes, making the contribution well-motiva
1. The authors observe that Pass@1 diminishes as length increases, while Pass@32 generally increases. The paper attributes this to "longer responses contain a wider coverage of potentially correct solutions." However, this interpretation requires deeper scrutiny:
   - **Is this truly about difficulty?** Or is it simply a statistical artifact? When you have 32 samples, longer responses might just have more opportunities to "stumble upon" correct solutions through increased exploration, not necessar
- The problem setup (balancing reasoning efficiency and accuracy) is interesting and timely.
- The adaptive length reward (shorter for easy, longer for hard) is conceptually reasonable and aligns with observed overthinking/underthinking phenomena in LLMs.
- The paper provides extensive experimental evaluation on multiple math reasoning benchmarks.
- Figures and tables (e.g., Table 1, Fig. 4) show measurable improvements over baseline RL models like DeepMath-Zero.
1. Missing ablation and causal analysis: Although the authors attribute improvements to the dual reward, there are no clear ablations isolating which component (dual reward vs. model-aware difficulty) contributes most to the gain. The provided variants (length bonus / penalty) are simplistic and insufficient to explain why the accuracy improves. There is also no sensitivity study on α, β, or λ, the key hyperparameters controlling reward magnitude.
2. It seems the results contradict the motivation on
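Several of the review comments above turn on the gap between Pass@1 and Pass@32. As a reference point, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled responses and c the number that are correct. Whether the paper computes its metrics exactly this way is an assumption; the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    This is the probability that at least one of k responses drawn
    without replacement from the n samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k includes a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Two correct responses out of 32 samples.
    print(pass_at_k(32, 2, 1))   # 0.0625
    print(pass_at_k(32, 2, 32))  # 1.0
```

With only 2 correct responses among 32 samples, pass@1 is low while pass@32 is already 1.0, so the two metrics can move in opposite directions without any change in per-sample quality. That is the statistical concern one reviewer raises about interpreting Figure 1.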
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
