TL;DR
This paper introduces DRQA, a method that uses reinforcement learning to enable large language models to allocate reasoning resources adaptively, reducing overthinking and improving efficiency without sacrificing accuracy.
Contribution
DRQA transfers resource competition benefits from batch processing to single-question inference, enabling adaptive reasoning depth based on question difficulty.
Findings
DRQA significantly reduces token usage on reasoning benchmarks.
DRQA maintains or improves answer accuracy while decreasing overthinking.
The method enhances efficiency in deploying large language models for reasoning tasks.
Abstract
Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource-efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single-question inference. Specifically, DRQA leverages batch-generated preference data and…
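The batch-mode setting the abstract describes, in which several questions share one prompt and therefore one output budget, can be illustrated with a minimal sketch. The prompt wording and the function names `build_single_prompt` and `build_batch_prompt` are assumptions made for illustration; the paper's actual template is not given here.

```python
from typing import List


def build_single_prompt(question: str) -> str:
    """Single-question prompt: the model spends its whole budget on one item."""
    return f"Answer the following question.\n\nQ1: {question}"


def build_batch_prompt(questions: List[str]) -> str:
    """Batch prompt: all questions share one context and one output budget.

    The template text is hypothetical, not taken from the paper.
    """
    numbered = "\n".join(
        f"Q{i}: {q}" for i, q in enumerate(questions, start=1)
    )
    return (
        "Answer each of the following questions. "
        "Label each answer with its question number.\n\n" + numbered
    )


# Illustrative questions of differing difficulty (made up for this sketch).
questions = [
    "What is 2 + 3?",
    "How many primes are there below 20?",
    "Prove that the square root of 2 is irrational.",
]

single_prompt = build_single_prompt(questions[0])
batch_prompt = build_batch_prompt(questions)
print(batch_prompt)
```

Comparing the reasoning length a model produces per question under `single_prompt` versus `batch_prompt` is one way to observe the compression effect the abstract reports.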
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* Good writing.
* Good baseline coverage on efficient reasoning methods.
* Good coverage on reasoning tasks.
* The idea of adopting prompt batching to compress reasoning traces seems interesting.
* Lack of novelty and insufficient connection to existing literature. The idea of putting multiple questions into the same prompt has already been well explored in existing literature [1, 2, 3] and likely more, but none are cited or discussed. Since DRQA is largely a combination of this prompt batching idea and GRPO, the paper oversells its novelty. Its relation to existing methods deserves much better coverage.
* While the updated version does cite some of these works, they are still not pr
- The paper identifies and formalizes a previously underexplored phenomenon in reasoning LLMs that batch inference implicitly encourages concise reasoning through token competition.
- The proposed Dynamic Reasoning Quota Allocation (DRQA) effectively bridges the batch and single-question reasoning paradigms using reinforcement learning with preference data.
- DRQA is evaluated across diverse and comprehensive reasoning benchmarks, multiple model sizes, and ablation settings.
- While empirically demonstrated, the notion of “resource competition pressure” lacks a clear theoretical or mechanistic explanation. The current argument remains descriptive, leaving ambiguity about whether the effect stems from model inductive bias, context compression, or decoding heuristics.
- Reward formulation is indirect and loosely tied to reasoning efficiency. DRQA trains a classifier to label reasoning chains as A/B/C (correct-concise vs. correct-verbose vs. incorrect) and uses GRPO t
- This paper targets "overthinking," a significant and practical issue in RLLMs where models are computationally inefficient.
- The observation of "resource competition pressure" is an interesting and novel motivation for the work.
- The method consistently achieves a "most favorable trade-off": while other methods can produce even shorter outputs, they often "suffer from severe accuracy degradation," a problem DRQA avoids.
- The paper's core motivation is the "resource competition pressure" observed in batch inference, but this mechanism is not directly applied to the method. The authors' method just aims to mimic the results of this finding, not the process itself.
- There is a lack of qualitative analysis of the "overthinking" phenomenon and how batch processing solves this by, for example, removing specific types of redundancy.
- The methodological contribution is somewhat thin; the paper argues, but does not d
1. The proposed method shows strong empirical effectiveness.
2. The method is well-motivated and clearly presented.
1. It's unclear how the proposed objective (classifying CoTs as accurate/concise) translates into better generation quality. Predicting labels may not directly improve the model's ability to produce accurate, succinct solutions.
2. The data source used for generating concise CoTs is questionable. Batch inference is only one of many ways to elicit shorter CoTs and is not reliably shorter than single inference; using it as the primary mechanism to obtain concise traces seems brittle compared with
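The reviews describe a training signal built from A/B/C labels (correct-concise, correct-verbose, incorrect) and optimized with GRPO. A rough sketch of the group-relative advantage step in GRPO follows. The 0/1 label-match reward, the sampled labels, and the names `label_match_reward` and `group_relative_advantages` are assumptions for illustration; the excerpts above do not specify the paper's reward values.

```python
from statistics import mean, pstdev
from typing import List

# Label meanings as described in the reviews.
LABELS = {"A": "correct-concise", "B": "correct-verbose", "C": "incorrect"}


def label_match_reward(predicted: str, reference: str) -> float:
    """Hypothetical reward: 1.0 if the sampled label matches the reference."""
    return 1.0 if predicted == reference else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward within its sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: four sampled judgments for the same reasoning chain (made up).
sampled_labels = ["A", "B", "A", "C"]
reference_label = "A"

rewards = [label_match_reward(p, reference_label) for p in sampled_labels]
advantages = group_relative_advantages(rewards)

for label, adv in zip(sampled_labels, advantages):
    print(f"{label} ({LABELS[label]}): advantage {adv:+.3f}")
```

Samples that agree with the reference label receive positive advantages and the rest negative ones. When every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.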
