AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints
Yirong Zeng, Xiao Ding, Yufei Liu, Yuxian Wang, Qunyao Du, Yutai Hou, Wu Ning, Haonan Song, Duyu Tang, Dandan Tu, Bing Qin, Ting Liu

TL;DR
This paper introduces AutoTool, a novel RL training paradigm that combines supervised fine-tuning and entropy-based optimization to enable AI models to automatically scale reasoning length for efficient tool use, improving accuracy and reducing computational costs.
Contribution
It proposes a new training approach with entropy-based objectives for automatic reasoning-length scaling in RL, addressing two inefficiencies in current methods: insufficient reasoning length on complex problems and overthinking on simple ones.
Findings
Achieves a 9.8% accuracy improvement on benchmarks.
Reduces computational overhead by approximately 81%.
Enables models to automatically adjust reasoning length for complex and simple problems.
Abstract
Tool use represents a critical capability for AI agents, with recent advances focusing on leveraging reinforcement learning (RL) to scale up the explicit reasoning process to achieve better performance. However, current RL-based scaling approaches face two key challenges for tool use: (a) direct RL training often struggles to scale up thinking length sufficiently to solve complex problems, and (b) scaled-up models tend to overthink simpler problems, resulting in substantial token inefficiency. To address these challenges, we propose a novel training paradigm that first employs warm-up supervised fine-tuning to help models distinguish between simple and complex problems, followed by RL that enables models to automatically determine appropriate reasoning trajectories. Furthermore, to tackle the issue of automatic thinking-length scaling, we discover that entropy-based optimization…
Peer Reviews
Decision: ICLR 2026 Poster
- The decoupled entropy constraint mechanism is novel. By applying different entropy constraint strengths to short and long reasoning paths, it effectively solves the reasoning collapse problem prevalent in traditional RL training.
- The paper is well-structured, with logical coherence from problem analysis and method design to experimental validation. The figures are well-designed, particularly the training dynamics visualization, which effectively demonstrates the auto-scaling effects.
- The e
- Entropy constraint hyperparameter selection: Although an adaptive mechanism is proposed, the initial choices for H_l and B_s lack sufficient theoretical justification or ablation analysis.
- While sample filtering based on reward variance is mentioned, the specific threshold settings and selection criteria are not described in detail.
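The reward-variance filtering mentioned in the review above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the threshold value, and the toy rewards are hypothetical, and the sketch does not claim which side of the split the paper keeps, since the review notes that the actual thresholds and selection criteria are not described.

```python
from statistics import pvariance


def partition_by_reward_variance(rollout_rewards, threshold):
    """Split prompt ids by the variance of their rollout rewards.

    rollout_rewards: dict mapping a prompt id to the list of scalar rewards
    obtained from several sampled rollouts of that prompt.
    threshold: hypothetical cutoff; the paper's value is not given.
    Returns (low_variance_ids, high_variance_ids).
    """
    low, high = [], []
    for prompt_id, rewards in rollout_rewards.items():
        variance = pvariance(rewards)  # population variance over the rollouts
        if variance <= threshold:
            low.append(prompt_id)
        else:
            high.append(prompt_id)
    return low, high


# Toy rewards (made up): four rollouts per prompt, binary correctness reward.
rewards = {
    "p1": [1.0, 1.0, 1.0, 1.0],  # always solved, variance 0
    "p2": [0.0, 1.0, 0.0, 1.0],  # solved half the time, variance 0.25
    "p3": [0.0, 0.0, 0.0, 1.0],  # rarely solved, variance 0.1875
}
low_ids, high_ids = partition_by_reward_variance(rewards, threshold=0.2)
```

In group-based methods such as GRPO, a prompt whose rollouts all receive the same reward yields zero group-normalized advantage, which is one reason a variance statistic is a natural filtering signal.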
* This paper proposes a new loss function designed to apply an entropy penalty that is conditional on the trajectory length within the RL process.
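A length-conditional entropy term of the kind this review describes might look like the sketch below. It is not the authors' loss: the length cutoff, the two coefficients, and the function names are hypothetical, and the paper may distinguish short and long reasoning paths by other means than a token-count cutoff.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(token_dists):
    """Average per-token entropy over a trajectory."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def entropy_term(token_dists, length_cutoff, coef_short, coef_long):
    """Entropy term whose coefficient depends on trajectory length.

    Trajectories with at most `length_cutoff` tokens use `coef_short`;
    longer ones use `coef_long`. All three parameters are assumptions.
    """
    is_short = len(token_dists) <= length_cutoff
    coef = coef_short if is_short else coef_long
    return coef * mean_entropy(token_dists)


# Toy trajectories: every token distribution is uniform over two tokens.
uniform = [0.5, 0.5]
short_term = entropy_term([uniform] * 2, length_cutoff=3,
                          coef_short=0.0, coef_long=0.01)
long_term = entropy_term([uniform] * 4, length_cutoff=3,
                         coef_short=0.0, coef_long=0.01)
```

With these toy settings the short trajectory receives no entropy term, while the long one receives a term proportional to its mean token entropy, so the two regimes are regularized with different strengths.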
* The paper suffers from a critically confusing narrative and fundamental conceptual ambiguity. The core concept of TTS introduced in line 40 is unsupported in the literature and largely diverges from the definitions provided in the very papers cited. This narrative choice, which mixes TTS terminology with RL problem, is highly confusing and obscures the paper's actual contribution.
* For the RL component, the idea of activating or deactivating thinking via special tokens is a well-established a
1. Clear motivation & mechanism. The paper links tool-use failures under RL to entropy collapse and proposes a targeted fix (decoupled, adaptive entropy) implemented with minimal changes to GRPO.
2. Solid empirical suite. Evaluations span BFCL (non-live/live/multi-turn), API-Bank, and ACEBench, with consistent gains over SFT/distillation/RLVR-like baselines; ablations remove each component (data refine, decouple, adapt-coef) and show clear drops.
3. Efficiency analysis. The paper reports token-c
1. Data construction bias and potential leakage. The curated training set mixes downsampled public data with RL-refined, low-variance samples, which likely skews toward cleaner/easier cases; possible overlaps with evaluation sets are not audited.
2. Hyperparameter sensitivity. In multi-turn settings, performance appears sensitive to target entropies and the choice of β; broader sweeps or learned schedules would strengthen the claims.
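The sensitivity to target entropies noted above is easier to picture with a generic adaptive-coefficient update. The sketch below uses a simple rule in the spirit of automatic temperature tuning in soft actor-critic: raise the coefficient when observed entropy falls below a target, lower it otherwise. It is an assumption for illustration, not the paper's update rule, and the parameter names are hypothetical and not mapped to the paper's β, H_l, or B_s.

```python
def update_entropy_coef(coef, observed_entropy, target_entropy, step_size,
                        coef_min=0.0, coef_max=1.0):
    """One step of a target-tracking entropy coefficient update.

    If the policy's observed entropy is below the target, the coefficient
    grows (more pressure to explore); if above, it shrinks. The result is
    clamped to [coef_min, coef_max].
    """
    coef = coef + step_size * (target_entropy - observed_entropy)
    return min(max(coef, coef_min), coef_max)


# Entropy below target: coefficient increases.
raised = update_entropy_coef(0.01, observed_entropy=0.2,
                             target_entropy=0.5, step_size=0.1)
# Entropy above target: coefficient decreases and is clamped at the minimum.
lowered = update_entropy_coef(0.01, observed_entropy=0.9,
                              target_entropy=0.5, step_size=0.1)
```

Under a rule like this, the target entropy sets the equilibrium the coefficient chases, which is consistent with the reviewer's observation that results can shift noticeably when targets change.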
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
