D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use
Bowen Xu, Shaoyu Wu, Hao Jiang, Kai Liu, Xin Chen, Lulu Hu, Bin Yang

TL;DR
D-CORE introduces a two-stage training framework that enhances large reasoning models' ability to decompose tasks and reason reflectively, significantly improving their complex tool use capabilities across benchmarks.
Contribution
The paper presents a novel two-stage training method combining self-distillation and reinforcement learning to improve task decomposition and reasoning in large models.
Findings
D-CORE-8B reaches 77.7% accuracy on BFCLv3.
D-CORE-14B sets a new state of the art at 79.3%.
The method improves tool-use performance across diverse benchmarks and model scales.
Abstract
Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework, D-CORE (Decomposing tasks and Composing Reasoning processes), that first incentivizes the LRMs' task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning (RL) to restore the LRMs' reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate the superiority of our method: D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B…
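The abstract names "diversity-aware" RL as the second stage, and the reviews describe it as entropy-based, but this excerpt does not state the objective. As a rough illustration only, the sketch below shows one generic way an entropy term can shape a group-normalized advantage so that higher-entropy rollouts are slightly favored. It is not the paper's formulation: the function names, the additive form, and the coefficient `alpha` are all assumptions made for this example.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def entropy_shaped_advantages(
    rewards: List[float],
    mean_entropies: List[float],
    alpha: float = 0.1,  # hypothetical weight on the entropy term
) -> List[float]:
    """Add an entropy bonus to each rollout's group-normalized advantage.

    `mean_entropies[i]` is assumed to be the average per-token entropy of
    rollout i. The additive form is an illustrative choice, not the paper's.
    """
    base = group_normalized_advantages(rewards)
    return [a + alpha * h for a, h in zip(base, mean_entropies)]
```

With equal rewards, a rollout with higher mean entropy receives a larger shaped advantage, which is the intuition behind rewarding exploratory, reflective reasoning. How D-CORE actually defines and weights this term is given in the paper itself.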
Peer Reviews
Decision: Submitted to ICLR 2026
Clear Problem Definition and Motivation: The paper compellingly identifies and analyzes the “Lazy Reasoning” phenomenon in LRMs, providing behavioral and quantitative evidence of how excessive reflection impedes performance in multi-turn tool-use scenarios. This focus on reasoning efficiency rather than sheer model size is a valuable direction for scaling reasoning systems. Strong Empirical Validation: Results are robust and consistent across multiple model sizes and benchmarks. The reported +3
Limited Novelty of Core Mechanisms: The core technical ideas, self-distillation using ground-truth-guided decomposition and entropy-based reinforcement learning, are adaptations of existing concepts rather than fundamental innovations. Stage 1 primarily represents a structured data-generation and fine-tuning pipeline, while Stage 2 employs a standard entropy-based exploration adjustment. The methodology feels more engineering-driven than theoretically novel. Dependence on Ground Truth for Decom
- The problem studied in this paper is interesting. The Lazy Reasoning phenomenon is indeed commonly observed when using large language models, making the motivation of this work clear and compelling. - The paper proposes a novel and well-structured framework. The framework itself, along with the specific techniques employed in each component (e.g., self-distillation, the generation of self-distillation instances, and diversity-aware reinforcement learning), is thoughtfully designed and empiric
- The overall writing could be improved. Some parts of the paper are difficult to follow at first reading. For example, the description related to Figure 3(c) is somewhat confusing, as the detailed definitions and procedures are only provided in the appendix. It would improve readability if part of this information were moved to the main text to make the presentation more self-contained. - In Algorithm 1, some referenced functions are not rigorously defined (e.g., $\text{Verify}(\hat{Y}, Y^*)$). While
This paper presents a well-motivated and empirically grounded study on addressing the “Lazy Reasoning” phenomenon in large reasoning models (LRMs), offering a novel and systematic framework called D-CORE that combines self-distillation with diversity-aware reinforcement learning. The problem is clearly identified through quantitative and qualitative evidence, showing how excessive reflection hinders multi-turn tool use reasoning. The proposed two-stage method is conceptually coherent—self-distil
The entropy-based advantage formulation is only intuitively justified, without a formal connection to reasoning diversity or convergence guarantees. Several implementation details—such as the construction of the 40K self-distillation dataset, the specific prompt templates, and hyperparameter settings—are deferred to the appendix or omitted entirely, making the training pipeline hard to replicate. The evaluation scope is relatively narrow, relying only on BFCLv3 and τ-Bench, both focused on struc
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
