Learning Evolving Tools for Large Language Models
Guoxin Chen, Zhong Zhang, Xin Cong, Fangda Guo, Yesai Wu, Yankai Lin, Wenzheng Feng, Yasheng Wang

TL;DR
This paper introduces ToolEVO, a framework that enhances large language models' ability to adaptively learn and update tool usage in dynamic environments, addressing the challenge of tool obsolescence.
Contribution
We propose ToolEVO, a novel method using Monte Carlo Tree Search for adaptive tool learning in evolving environments, along with a new benchmark ToolQA-D for evaluation.
Findings
ToolEVO improves LLM adaptability to changing tools.
The approach demonstrates stable and effective tool updating.
Experimental results validate the importance of adaptability in tool learning.
Abstract
Tool learning enables large language models (LLMs) to interact with external tools and APIs, greatly expanding the application scope of LLMs. However, due to the dynamic nature of external environments, these tools and APIs may become outdated over time, preventing LLMs from correctly invoking tools. Existing research primarily focuses on static environments and overlooks this issue, limiting the adaptability of LLMs in real-world applications. In this paper, we propose ToolEVO, a novel framework designed to enhance the adaptive and reflective capabilities of LLMs against tool variability. By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments, allowing for autonomous self-reflection and self-updating of tool usage based on environmental feedback. Additionally, we introduce ToolQA-D, a benchmark specifically…
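As a rough illustration of the idea of exploring tool calls and learning from environmental feedback, the following is a minimal sketch and not the authors' implementation. It uses a one-level search with the UCT selection rule commonly used in Monte Carlo Tree Search. Everything named here is a hypothetical stand-in: the `DEPLOYED_TOOLS` table, the `call_tool` helper, the tool names, and the binary reward are assumptions made for the example.

```python
import math

# Hypothetical environment: a tool the model remembers as "get_weather"
# has been renamed, so the old name no longer works.
DEPLOYED_TOOLS = {"fetch_weather": lambda city: f"sunny in {city}"}


def call_tool(name, arg):
    """Return (reward, feedback); unknown names produce an error message."""
    if name in DEPLOYED_TOOLS:
        return 1.0, DEPLOYED_TOOLS[name](arg)
    return 0.0, f"Error: unknown tool '{name}'. Available: {sorted(DEPLOYED_TOOLS)}"


class Node:
    """One candidate action with its visit count and accumulated reward."""

    def __init__(self, action):
        self.action = action
        self.visits = 0
        self.value_sum = 0.0

    def q(self):
        # Mean reward observed so far (0.0 before the first visit).
        return self.value_sum / self.visits if self.visits else 0.0


def uct_score(child, parent_visits, c=1.4):
    # Unvisited children are always tried first.
    if child.visits == 0:
        return math.inf
    return child.q() + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(candidates, arg, iterations=50):
    """Try candidate tool names, guided by UCT, and return the most-visited node."""
    children = [Node(a) for a in candidates]
    total_visits = 0
    for _ in range(iterations):
        node = max(children, key=lambda ch: uct_score(ch, total_visits))
        # A fuller system would pass the feedback text back to the model
        # for reflection; this sketch only uses the scalar reward.
        reward, _feedback = call_tool(node.action, arg)
        node.visits += 1
        node.value_sum += reward
        total_visits += 1
    return max(children, key=lambda ch: ch.visits)
```

Calling `search(["get_weather", "weather_lookup", "fetch_weather"], "Paris")` concentrates visits on whichever name the environment currently accepts. A real tree search would also expand multi-step action sequences rather than a single level of choices.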
Peer Reviews
Decision: ICLR 2025 Poster
1. Considers the adaptation of LLMs to changing environments, which is a real problem for existing LLMs.
1. The method is computationally inefficient and unrealistic in a real-world setting. 2. A simple method of accessing the APIs through a proxy can do the same job in a much more efficient way. 3. There has been a lot of work on using design patterns to deal with evolving environments. The authors seem to be unaware of that literature. As a result, they have designed a cumbersome method that is not likely to work in practice. 4. In addition, the LLM itself might be modified. Thus the adaptat
1. This paper addresses an important research problem in the community, i.e., adapting LLMs to an evolving external environment. 2. The proposed approach based on MCTS is grounded in a rich literature, and the self-reflection and tool-update mechanisms are novel contributions. 3. ToolEVO exhibits strong empirical results on the benchmark, outperforming a suite of proprietary, open-source, and fine-tuned LLMs. 4. The released benchmark could be useful for future research in this direction.
1. The setting is rather contrived. The authors curate three predefined sets of APIs, but the APIs don't change *within an evaluation episode*. So performing well on the test set doesn't necessarily mean the method adapts to a *constantly* evolving environment. In fact, the proposed approach wouldn't apply to a constantly evolving environment, as the Q value in the MCTS is not adaptive. 2. The evaluation is potentially unfair. For an LLM to adapt to a new set of APIs, it needs to receive feedback
- The paper tackles an important and interesting problem of adapting LLMs to be able to invoke changing APIs correctly. This is a practical and important use case for LLMs. Progress here is likely to be of interest to the community. - The paper considers a number of strong LLM baselines consisting of SOTA closed and open-access LLMs and performs a large set of computationally intensive experiments. While it's not clear if ToolEVO outperforms these, it's still useful to see relative performance.
- Overall, I found the problem setup and proposed methods a little challenging to understand in detail. The implementation details of MCTS and its various enhancements (cached rollout, inference, computational costs) are not clearly described. More on this below. - I didn't find the problem setup convincing. Restricting the LLM to only use P_C in the static prompt (Line 296) seems very limiting. Why can't the LLM's prompt be continuously optimized using recent API data in a separate process? T
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
