Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning
Yinan Xia, Haotian Zhang, Huiming Wang

TL;DR
This paper introduces DDPO, a reinforcement learning method that adjusts answer length according to task difficulty. It improves accuracy and efficiency in large reasoning models by curbing overthinking on simple tasks and overconfidence on complex ones.
Contribution
It proposes a novel difficulty-differentiated policy optimization algorithm that adjusts output length for simple and complex tasks separately, with theoretical guidance and empirical validation.
Findings
Reduces answer length by 12% on average
Improves accuracy by 1.85% across benchmarks
Enhances the trade-off between answer quality and length
Abstract
Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length…
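The abstract is truncated before it states the actual objective, so the following is only a rough sketch of the general idea of treating simple and complex tasks differently. It is not the authors' algorithm. Everything here is an assumption made for illustration: the function name `shaped_rewards`, the use of per-prompt sample accuracy as a difficulty proxy, the `easy_threshold` cutoff, and the `alpha` weight on the length term.

```python
from statistics import mean


def shaped_rewards(correct, lengths, easy_threshold=0.5, alpha=0.2):
    """Hypothetical length-shaped rewards for one prompt's sampled answers.

    correct: list of bools, one per sampled answer.
    lengths: list of token counts, in the same order.
    Assumption: the fraction of correct samples stands in for task difficulty.
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in size")

    accuracy = mean(1.0 if c else 0.0 for c in correct)
    is_easy = accuracy >= easy_threshold  # assumed difficulty split

    # Length relative to the group mean, scaled by the group's length range.
    mu = mean(lengths)
    spread = (max(lengths) - min(lengths)) or 1
    relative = [(n - mu) / spread for n in lengths]

    rewards = []
    for is_correct, rel in zip(correct, relative):
        base = 1.0 if is_correct else 0.0
        if is_easy:
            # Simple task: favor shorter answers, but only among correct ones,
            # so the length term never rewards a wrong answer.
            bonus = -alpha * rel if is_correct else 0.0
        else:
            # Complex task: favor longer answers to encourage more exploration.
            bonus = alpha * rel
        rewards.append(base + bonus)
    return rewards


easy = shaped_rewards([True, True, True, False], [100, 200, 300, 200])
hard = shaped_rewards([False, False, False, True], [100, 200, 300, 200])
```

In this sketch, `easy` ranks the shortest correct answer highest, while `hard` gives longer attempts a small bonus even when they are wrong. A real implementation would feed such rewards into a policy-gradient update, which is omitted here.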
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling
