RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
Xiao Liu, Da Yin, Zirui Wu, Yansong Feng

TL;DR
RefTool is a reference-guided framework that enables large language models to automatically create and organize external tools from reference materials, significantly improving reasoning accuracy in knowledge-intensive tasks.
Contribution
It introduces a novel reference-guided approach for automatic tool creation and hierarchical organization, enhancing LLM reasoning beyond internal knowledge.
Findings
Outperforms existing methods by 12.3% on average accuracy
Effective in scientific and low-resource language tasks
Grounded tools are accurate and faithful
Abstract
Large Language Models (LLMs) can enhance their reasoning capabilities by using external tools. However, many tasks lack predefined tools. Prior works have explored instructing LLMs to generate tools on their own, but such approaches depend heavily on internal knowledge and struggle when tasks fall outside the model's knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages external materials, such as textbooks and knowledge snippets. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and…
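The two modules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Tool`, `validate`, `build_toolbox`, and `select_tool` are hypothetical, the toolbox is simplified to a two-level category-to-tool mapping, and hand-written functions stand in for the tool code and selection decisions that an LLM would produce in RefTool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Tool:
    """A candidate tool plus the illustrative examples used to check it (hypothetical structure)."""
    name: str
    description: str
    func: Callable[..., float]
    # (arguments, expected output) pairs, assumed to come from the reference text
    examples: List[Tuple[tuple, float]] = field(default_factory=list)


def validate(tool: Tool, tol: float = 1e-6) -> bool:
    """Accept a tool only if it runs and reproduces every illustrative example."""
    for args, expected in tool.examples:
        try:
            if abs(tool.func(*args) - expected) > tol:
                return False
        except Exception:
            return False
    return True


def build_toolbox(candidates: Dict[str, List[Tool]]) -> Dict[str, Dict[str, Tool]]:
    """Tool creation: keep validated tools, grouped by category (e.g. a textbook chapter)."""
    toolbox: Dict[str, Dict[str, Tool]] = {}
    for category, tools in candidates.items():
        kept = {t.name: t for t in tools if validate(t)}
        if kept:
            toolbox[category] = kept
    return toolbox


def select_tool(toolbox: Dict[str, Dict[str, Tool]], category: str, name: str) -> Tool:
    """Tool utilization: walk the hierarchy, category first, then tool.
    In RefTool an LLM makes both choices; here they are passed in explicitly."""
    return toolbox[category][name]


# Stand-ins for LLM-generated tools: one correct, one with a missing factor of 1/2.
kinetic_energy = Tool(
    "kinetic_energy", "E = 0.5 * m * v^2",
    lambda m, v: 0.5 * m * v ** 2,
    examples=[((2.0, 3.0), 9.0)],
)
buggy_kinetic_energy = Tool(
    "kinetic_energy_buggy", "E = m * v^2 (incorrect)",
    lambda m, v: m * v ** 2,
    examples=[((2.0, 3.0), 9.0)],
)

toolbox = build_toolbox({"mechanics": [kinetic_energy, buggy_kinetic_energy]})
tool = select_tool(toolbox, "mechanics", "kinetic_energy")
result = tool.func(4.0, 2.0)
print(sorted(toolbox["mechanics"]), result)
```

In this sketch the buggy candidate is dropped because it fails its illustrative example, so only the validated tool remains reachable through the category lookup.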
Peer Reviews
Decision: ICLR 2026 Poster
1. Enabling LLMs to automatically create executable tools from unstructured text references is more efficient than traditional text retrieval. This structured generation approach (including pseudocode and code templates) is more effective than simple text retrieval. By transforming textbooks into a toolbox, REFTOOL empowers models to solve problems previously beyond their capabilities.
2. REFTOOL achieves significant performance improvements on knowledge-intensive benchmarks like causal reasonin
1. The method currently relies on GPT-4 (or similar large models) for tool generation and verification. While understandable (tool synthesis is difficult), this means the initial setup can be expensive and dependent on proprietary models. Without GPT-4 or similar high-performance models, the quality of the generated tools may decline. This introduces an unfairness in experimental baseline comparisons. The article does not deeply explore using open-source LLMs for tool creation; it would be worth
- The paper introduces a new perspective on tool creation, shifting from internally generated tools (as in Creator, TroVE, etc.) to reference-guided tool generation grounded in external textual materials. This is elegant and underexplored.
- The approach generalizes to multiple datasets within the domain and also to areas beyond scientific reasoning.
- The experiments are extensive, spanning multiple scientific domains and extending to non-scientific domains. The evaluation is robust and covers t
- The validation of generated tools is based on available examples or on the generation of appropriate examples. It is unclear how one would ensure the correctness of the generated examples and their solutions.
- The choice of GPT-4o for tool creation and the number of tools and categories are unclear. What is the basis for these choices?
- While causality, physics, and chemistry benchmarks are informative, they are narrow in scope and format (mostly numerical or formula-based). The framework’s generality
1. Principled approach to grounding tool generation in external, authoritative sources. It is a well-motivated and logical solution for specialized domains.
2. The introduction of a hierarchical toolbox structure that mirrors the organization of the reference material is a strong design choice.
3. Demonstrated generalization.
1. The framework's performance depends on the quality and structure of the reference material. It assumes a well-organized, comprehensive textbook with content that can be easily split into discrete functions. How does the framework cope with noisy, theoretical, or poorly structured references?
2. The core components are incremental and miss the core baseline, such as fine-tuning the base LLM directly on textbook content. It is unclear whether the explicit, complex, and potentially brittle tool-
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Model-Driven Software Engineering Techniques
