TL;DR
This paper introduces Tool-DE, a benchmark and framework that expands tool documentation into structured fields. Together with two specialized models trained on the resulting large-scale data, the expanded documentation significantly improves tool retrieval performance.
Contribution
We propose a scalable document expansion pipeline and two dedicated models, Tool-Embed and Tool-Rank, to improve tool retrieval using enriched documentation and large datasets.
Findings
Document expansion substantially improves retrieval performance.
Tool-Embed and Tool-Rank achieve new state-of-the-art results.
Analysis reveals the importance of individual documentation fields.
Abstract
Large Language Models (LLMs) have recently demonstrated strong capabilities in tool use, yet progress in tool retrieval remains hindered by incomplete and heterogeneous tool documentation. To address this challenge, we introduce Tool-DE, a new benchmark and framework that systematically enriches tool documentation with structured fields to enable more effective tool retrieval, together with two dedicated models, Tool-Embed and Tool-Rank. We design a scalable document expansion pipeline that leverages both open- and closed-source LLMs to generate, validate, and refine enriched tool profiles at low cost, producing large-scale corpora with 50k instances for embedding-based retrievers and 200k for rerankers. On top of this data, we develop two models specifically tailored for tool retrieval: Tool-Embed, a dense retriever, and Tool-Rank, an LLM-based reranker. Extensive experiments on…
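To make the idea of structured expansion concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical `ToolProfile` with made-up field names (`when_to_use`, `limitations`, `tags`), since the abstract only says "structured fields", and it swaps in a bag-of-words cosine similarity as a stand-in for a trained dense retriever such as Tool-Embed. The toy tools and query are invented for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToolProfile:
    # Hypothetical field names; the paper's actual schema may differ.
    name: str
    description: str
    when_to_use: str = ""
    limitations: str = ""
    tags: List[str] = field(default_factory=list)

    def raw_text(self) -> str:
        # The original, unexpanded documentation.
        return f"{self.name} {self.description}"

    def expanded_text(self) -> str:
        # Original documentation plus the enriched fields.
        parts = [self.name, self.description, self.when_to_use,
                 self.limitations, " ".join(self.tags)]
        return " ".join(parts)


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, tools: List[ToolProfile], k: int = 1,
             expanded: bool = True) -> List[ToolProfile]:
    # Bag-of-words stand-in for an embedding model: score every tool, keep top k.
    q = Counter(tokenize(query))
    scored = []
    for tool in tools:
        text = tool.expanded_text() if expanded else tool.raw_text()
        scored.append((cosine(q, Counter(tokenize(text))), tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:k]]


# Toy corpus, invented for illustration only.
tools = [
    ToolProfile(
        name="fx_rates",
        description="Returns exchange rates.",
        when_to_use="Use to convert currency amounts between dollars, "
                    "euros, and other currencies.",
        tags=["finance", "currency"],
    ),
    ToolProfile(
        name="unit_convert",
        description="Converts between measurement units.",
        when_to_use="Use for length, mass, or temperature conversions.",
        tags=["math"],
    ),
    ToolProfile(name="weather", description="Returns a forecast for a city."),
]

query = "convert currency amounts between dollars and euros"
top_raw = retrieve(query, tools, k=1, expanded=False)[0].name
top_expanded = retrieve(query, tools, k=1, expanded=True)[0].name
print(top_raw, top_expanded)  # unit_convert fx_rates
```

In this toy setup the terse original description of `fx_rates` shares no words with the query, so the unexpanded corpus ranks `unit_convert` first; once the hypothetical `when_to_use` field is included, `fx_rates` ranks first. A learned embedding model would behave differently from word overlap, so this only illustrates why underspecified documentation can hurt retrieval, not how large the effect is.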
Peer Reviews
Decision·ICLR 2026 Poster
- The paper effectively addresses a limitation of the current tool learning paradigm, in which existing tool documents are underspecified for effective tool retrieval. It does so through LLM-based tool document expansion, which shows strong empirical results both with and without dedicated trained retrievers and rerankers.
- It is an end-to-end framework in which the authors create a tool document dataset and train the retriever and reranker.
- The paper includes extensive ablation studies on impact of each field of expan
- While the framework shows strong performance, the idea of revising or augmenting tool documents has been explored in previous work [1,2].
- To make the data generation more scalable, one might consider replacing human verification with a strong LLM.
[1] Huang et al. Planning and Editing What You Retrieve for Enhanced Tool Learning. NAACL 2024.
[2] Chen et al. EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction. ACL 2025.
1. The benchmark is comprehensive. TOOL-DE is built over 35 datasets with a carefully validated expansion process, combining open and closed models (Qwen3, LLaMA-3.1, GPT-4o) and human checks.
2. Experiments show solid improvements. Both retriever and reranker consistently outperform strong baselines, demonstrating that simple, well-structured enrichment can yield significant improvements.
3. The paper is well-written and easy to follow.
1. The manuscript lacks a clear and comprehensive description of the dataset. While some details are provided in the appendix, a dedicated section in the main text describing the dataset composition (e.g., number and types of tools, instances per tool, data sources, and preprocessing steps) would significantly improve clarity and reproducibility.
2. The training and testing splits are insufficiently explained. The paper shows the proposed pipeline works well when trained and tested on t
1. The studied problem is interesting and valuable. The quality of API documents often varies widely, so standardizing them is expected to be valuable.
2. The paper is well written and easy to follow.
1. The proposed pipeline for augmenting API documentation involves human annotation, which makes it hard to scale up.
2. According to Table 1, the improvement from augmenting documentation without training is limited, especially for the Qwen3-Embedding series. In some cases performance even drops after augmenting the documentation, which raises doubts about how solid the motivation of this work is.
3. Following the second point, it would be more valuable for applications if direct augmentation
