Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

Yinan Xia; Haotian Zhang; Huiming Wang

arXiv:2603.18533·cs.LG·March 23, 2026

TL;DR

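This paper introduces DDPO (Difficulty-Differentiated Policy Optimization), a reinforcement learning method that tunes answer length to task difficulty: it shortens outputs on simple tasks to curb overthinking and widens exploration on complex tasks to counter overconfidence, improving both accuracy and efficiency in large reasoning models.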

Contribution

It proposes a novel difficulty-differentiated policy optimization algorithm that adjusts output length for simple and complex tasks separately, with theoretical guidance and empirical validation.
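
As a rough illustration of what "adjusting output length for simple and complex tasks separately" could look like in a group-based RL setup, here is a minimal sketch. It is not the paper's algorithm: the pass-rate difficulty rule, the `difficulty_threshold` and `length_weight` parameters, and the `Rollout` type are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    correct: bool  # whether the sampled answer was judged correct
    length: int    # number of generated tokens


def shaped_rewards(group: List[Rollout],
                   difficulty_threshold: float = 0.5,
                   length_weight: float = 0.2) -> List[float]:
    """Return one shaped reward per rollout sampled for a single prompt.

    Hypothetical rule (an assumption, not taken from the paper):
    - pass rate >= threshold: the prompt is treated as "simple", and
      correct answers lose up to `length_weight` reward in proportion
      to their length relative to the longest rollout in the group.
    - pass rate <  threshold: the prompt is treated as "complex", and
      no length penalty is applied, so long exploratory answers are
      not discouraged.
    """
    if not group:
        return []
    pass_rate = sum(r.correct for r in group) / len(group)
    max_len = max(r.length for r in group) or 1  # avoid division by zero
    rewards = []
    for r in group:
        reward = 1.0 if r.correct else 0.0
        if pass_rate >= difficulty_threshold and r.correct:
            reward -= length_weight * (r.length / max_len)
        rewards.append(reward)
    return rewards
```

Under these assumptions, a prompt the model usually solves rewards its shorter correct answers more than its longer ones, while a prompt it usually fails leaves correct answers at full reward regardless of length.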

Findings

01. Reduces answer length by 12% on average

02. Improves accuracy by 1.85% across benchmarks

03. Enhances the trade-off between answer quality and length

Abstract

Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length…

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling