AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning
Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang

TL;DR
AutoTool enables large language models to dynamically select and integrate external tools during reasoning, improving adaptability and performance across diverse tasks by using a novel dataset and a dual-phase optimization pipeline.
Contribution
The paper introduces AutoTool, a framework for dynamic tool selection in LLM agents, with a new dataset and optimization methods that enhance reasoning and generalization.
Findings
AutoTool outperforms existing methods on multiple benchmarks.
It achieves average gains of 6.4% in math & science reasoning.
AutoTool generalizes to unseen tools during inference.
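To make the unseen-tool finding concrete, here is a minimal sketch of how a selector could admit a new tool at inference time without retraining. It assumes the embedding-anchored selection mentioned in the reviews works like nearest-neighbor matching between a query embedding and tool embeddings. The paper's actual mechanism is not described on this page, so the tool names, the 3-d vectors, and the helpers `cosine` and `select_tool` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_tool(query_vec, tool_vecs):
    """Return the name of the tool whose embedding is closest to the query."""
    return max(tool_vecs, key=lambda name: cosine(query_vec, tool_vecs[name]))


# Toy 3-d embeddings, hand-written for illustration only (hypothetical tools).
tools = {
    "calculator": [1.0, 0.0, 0.0],
    "code_interpreter": [0.0, 1.0, 0.0],
}
query = [0.1, 0.2, 0.9]

# Selection over the original inventory.
before = select_tool(query, tools)

# A new tool is registered at inference time by adding its embedding;
# the selector itself is unchanged.
tools["image_captioner"] = [0.0, 0.1, 1.0]
after = select_tool(query, tools)
```

In this toy setup the choice moves from `code_interpreter` to `image_captioner` once the new embedding is registered, which is the behavior the finding describes. Whether AutoTool achieves it this way is an assumption.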
Abstract
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we…
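As a rough illustration of phase (ii), the sketch below computes a Plackett-Luce ranking loss with a KL penalty toward a reference policy. The abstract is truncated and does not give the exact objective, so the form used here is an assumption: negative Plackett-Luce log-likelihood plus `beta` times the KL divergence between softmax distributions over tool scores. The function names and the `beta` parameter are hypothetical.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def pl_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (best tool first) under Plackett-Luce.

    At each position the chosen tool competes, via a softmax over scores,
    against the tools that have not been placed yet.
    """
    total = 0.0
    for k in range(len(ranking)):
        remaining = [scores[i] for i in ranking[k:]]
        total += scores[ranking[k]] - logsumexp(remaining)
    return total


def softmax(xs):
    z = logsumexp(xs)
    return [math.exp(x - z) for x in xs]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularized_pl_loss(scores, ref_scores, ranking, beta=0.1):
    """Assumed objective: PL negative log-likelihood + beta * KL to reference."""
    nll = -pl_log_likelihood(scores, ranking)
    kl = kl_divergence(softmax(scores), softmax(ref_scores))
    return nll + beta * kl
```

With equal scores over three tools, every ordering has probability 1/3! = 1/6, so `pl_log_likelihood([0, 0, 0], [2, 0, 1])` equals `-log(6)`. The KL term is zero whenever the scores match the reference scores.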
Peer Reviews
Decision · Submitted to ICLR 2026
The paper addresses a genuine limitation in existing work—most approaches assume fixed toolsets, whereas real-world scenarios require dynamic tool selection from evolving inventories. The dual-phase optimization pipeline is well-designed, with Phase I establishing stable reasoning patterns and Phase II specifically targeting tool-selection refinement through PL ranking.
While the combination is effective, the individual components (SFT, GRPO, Plackett-Luce ranking) are well-established techniques. The main contribution appears to be applying PL ranking to tool selection, which is somewhat incremental. The paper would benefit from discussing recent work on tool retrieval and generation. The notation is also inconsistent: the paper switches between τ and T for trajectories and trajectory sets.
- Comprehensive empirical results, spanning a diverse set of evaluation datasets.
- Results are compared against relevant baselines such as stronger reasoning models, existing tool-integration methods, and traditional fine-tuning.
- Strong results: the proposed AutoTool framework achieves consistent gains on the diverse datasets compared to multiple approaches.
- I couldn't find results on generalization to unseen tools during inference. The key claim for the embedding-anchored selection method is that it can dynamically adapt to new tools provided during inference, but none of the experimental results seem to highlight this.
- Not sure I follow why the analysis of AutoTool with an oracle tool-assignment agent is needed. Ideally, the oracle numbers should be present in Table 1 to directly compare other methods on how
- AutoTool innovatively integrates embedding-anchored tool selection and KL-regularized PL ranking into the learning of LLM agents, which contributes to decent originality.
- The presentation of AutoTool dual-phase learning scheme is theoretically well-motivated and mathematically well-grounded.
- AutoTool’s proposed challenge of dynamic tool selection under evolving tool environments is crucial for robust and scalable LLM agentic framework development.
- The experimental analysis falls short of justifying AutoTool’s effectiveness at dynamic tool selection under evolving tool environments, i.e., whether AutoTool selects tools better when generalizing to unseen toolsets. This is, however, the most significant challenge the paper raises. Evaluation on a new or held-out set of tools and tasks that are unseen during training would help justify this point.
- It is unclear how the evolving toolse
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
