TL;DR
This paper introduces Multi-Turn Decomposition (MinD), a structured approach to improve reasoning efficiency in Large Reasoning Models by enabling explicit, turn-wise interactions and iterative refinement, significantly reducing token usage and latency.
Contribution
MinD provides a novel structured multi-turn reasoning framework that enhances efficiency and control in Large Reasoning Models, with effective training via supervised fine-tuning and reinforcement learning.
Findings
Achieves up to 70% reduction in token usage and time to first token.
Maintains competitive reasoning performance on benchmarks.
Enables explicit control over the iterative reasoning process.
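The turn-wise control described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's implementation: `Turn`, `multi_turn_answer`, and `scripted_model` are hypothetical names, the model is replaced by a scripted stub, and the "stop when two consecutive turns agree" rule is an assumption added here for illustration. The only idea taken from the summary is that each turn holds one thinking unit with its own candidate answer, so the number of turns can be capped explicitly.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    thinking: str  # one thinking unit
    answer: str    # the candidate answer produced by that unit


# Hypothetical interface: given the query and the earlier turns, return one new turn.
TurnFn = Callable[[str, List[Turn]], Turn]


def multi_turn_answer(
    query: str,
    generate_turn: TurnFn,
    max_turns: int = 3,
    stop_on_agreement: bool = True,
) -> Tuple[str, List[Turn]]:
    """Run up to max_turns turns and return the last candidate answer.

    stop_on_agreement is an illustrative assumption (not from the paper):
    stop early once two consecutive turns give the same answer.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    turns: List[Turn] = []
    for _ in range(max_turns):
        turns.append(generate_turn(query, turns))
        agreed = len(turns) >= 2 and turns[-1].answer == turns[-2].answer
        if stop_on_agreement and agreed:
            break
    return turns[-1].answer, turns


def scripted_model(query: str, history: List[Turn]) -> Turn:
    """Stand-in for a real model: replays a fixed script, one turn per call."""
    script = [
        Turn("try a direct computation", "41"),
        Turn("re-check the arithmetic", "42"),
        Turn("verify by substitution", "42"),
        Turn("explore another route", "42"),
    ]
    return script[len(history)]


if __name__ == "__main__":
    answer, turns = multi_turn_answer("toy query", scripted_model, max_turns=4)
    print(answer, len(turns))
```

Capping `max_turns` at 1 returns the first candidate answer immediately, which is the kind of explicit trade-off between latency and refinement that an unstructured CoT does not expose.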
Abstract
Large Reasoning Models (LRMs) are criticized for the excessively lengthy Chain-of-Thought (CoT) to derive the final answer, suffering from high first-token and overall latency. Typically, the CoT of LRMs mixes multiple thinking units; each unit attempts to produce a candidate answer to the original query. Hence, a natural idea to improve efficiency is to reduce the unit number. Yet, the fact that the thinking units in vanilla CoT cannot be explicitly managed renders doing so challenging. This paper introduces Multi-Turn Decomposition (MinD) to decode conventional CoT into a sequence of explicit, structured, and turn-wise interactions to bridge the gap. In MinD, the model provides a multi-turn response to the query, where each turn embraces a thinking unit and yields a corresponding answer. The subsequent turns can reflect, verify, revise, or explore alternative approaches to both the…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper conducts an exploration of reasoning unit redundancy, which better demonstrates the motivation rather than just directly claiming LRM tokens are redundant.
2. A novel method design that utilizes GRPO's implicit ability for shorter turns rather than incorporating a length penalty directly into the reward function.
1. Regarding the reasoning units, it is necessary to measure the quality of the obtained unit splits and whether they clearly reflect the LRM's reasoning process.
2. The authors explored too few LRMs with only two models distilled from DeepSeek-R1. This makes it difficult to prove whether the unit segmentation is overly dependent on DeepSeek-R1's text style, and it also needs to be verified if the multi-turn training method is applicable to other LRMs.
3. More baselines could be considered, su
- The research problem under study, i.e., ‘overthinking’ in reasoning generation to arrive at the final answer, is very prominent in LRMs, resulting in frequent wrong generation and latency overload compared to non-LRMs. This work introduces a competitive methodology for addressing this issue.
- The proposed methodology reformulates the CoT mechanism in LRMs into multi-turn decomposition, where each turn provides independent reasoning with a candidate final answer, allowing early exit and reduc
- **Generalization of empirical results beyond the choice of LRM**: All experiments are conducted on a single reasoning model (DeepSeek-R1-Distill-Qwen) at two model sizes (1.5B and 7B). Though the empirical results regarding reduced token utilization and latency while maintaining model accuracy are clearly demonstrated for the Qwen model, the absence of other reasoning-based models (e.g., DeepSeek-R1-Distill-Llama-8B) limits its generalization and reliabi
The authors propose a concise method to compress the reasoning process of Large Reasoning Models, which effectively reduces the number of tokens consumed during inference. Additionally, the paper is clearly written, allowing readers to easily follow the content and grasp the key ideas.
1. Insufficient reproducibility: The paper fails to provide detailed experimental settings for both the proposed method and baselines, such as specific hyperparameter configurations and whether repeated experiments were performed. This lack of key information hinders other researchers from replicating the study’s results and verifying its conclusions.
2. Limited diversity of baselines: Most baselines in the experiments fall into the category of methods that control early stopping via token budge
