ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models
Bowen Fang, Wen Ye, Yunyue Su, Jinghao Zhang, Qiang Liu, Yesheng Liu, Xin Sun, Shu Wu, Jiabing Yang, Baole Wei, Liang Wang

TL;DR
ToolWeaver introduces a hierarchical, semantic-aware tool encoding framework for large language models, significantly improving scalability and collaborative understanding of tools in large tool libraries.
Contribution
It proposes a novel hierarchical tool encoding and tokenization method that enhances scalability and enables learning of collaborative tool relationships.
Findings
Outperforms state-of-the-art methods on large tool libraries
Enables logarithmic vocabulary expansion with more tools
Improves semantic understanding and generalization in tool use
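The vocabulary claim can be illustrated with simple counting. The sketch below is not taken from the paper: it assumes each tool is written as a sequence of codes drawn from per-level codebooks of a fixed size, and the names `codebook_size`, `levels_needed`, and `hierarchical_new_tokens` are hypothetical. Under that assumption a flat scheme adds one token per tool, while a hierarchical scheme adds `codebook_size` tokens per level and needs only about log base `codebook_size` of the tool count in levels.

```python
def levels_needed(num_tools: int, codebook_size: int) -> int:
    """Smallest number of code positions L with codebook_size**L >= num_tools."""
    if codebook_size < 2:
        raise ValueError("codebook_size must be at least 2")
    levels = 1
    capacity = codebook_size
    while capacity < num_tools:
        capacity *= codebook_size
        levels += 1
    return levels


def flat_new_tokens(num_tools: int) -> int:
    """One-token-per-tool: the vocabulary grows linearly with the library."""
    return num_tools


def hierarchical_new_tokens(num_tools: int, codebook_size: int) -> int:
    """Assumed hierarchical scheme: one separate codebook per level."""
    return codebook_size * levels_needed(num_tools, codebook_size)
```

With an illustrative `codebook_size` of 10, a library of 1,000 tools needs 30 new tokens and a library of 1,000,000 tools needs 60, whereas the flat scheme needs 1,000 and 1,000,000 respectively.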
Abstract
Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn and generate tool identifiers. However, the common practice of mapping each tool to a unique new token introduces substantial limitations: it creates a scalability and generalization crisis, as the vocabulary size explodes and each tool is assigned a semantically isolated token. This approach also creates a semantic bottleneck that hinders the learning of collaborative tool relationships, as the model must infer them from sparse co-occurrences of monolithic tool IDs within a vast library. To…
Peer Reviews
Decision: ICLR 2026 Poster
1. Well-motivated problem setting. The paper clearly articulates the limitations of the “one-tool-one-token” paradigm in terms of vocabulary explosion and lack of expressivity for tool co-usage patterns, especially in multi-tool scenarios. This is a realistic pain point for large API/tool ecosystems. 2. Coherent end-to-end design. The framework is not just a better embedding scheme but a full pipeline: collaboration-aware RQ-VAE for code learning, a uniform mapping step to resolve code collisions
1. Evaluation scope is narrow. All core experiments are conducted on the ToolBench / StableToolBench family. While this is a large benchmark, the tool distributions and usage patterns are specific. It remains unclear how well ToolWeaver transfers to very different tool ecosystems (e.g., enterprise APIs, programmatic or mathematical tools), or to highly dynamic tool catalogs. 2. Construction and impact of the collaboration signal are under-analyzed. The tool–tool similarity matrix is central to th
1. The hierarchical quantization of tool semantics is novel and well-motivated. It offers a theoretically scalable alternative for tool representation. 2. Introducing inter-tool similarity via a Laplacian regularization term is an interesting way to infuse functional relationships into discrete token learning. 3. The experimental results show that ToolWeaver outperforms existing baselines significantly.
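For readers unfamiliar with graph Laplacian regularization, the following is a minimal sketch of the generic technique, not the paper's exact loss. It assumes a symmetric tool-tool similarity matrix `similarity` and one embedding vector per tool; both function names are hypothetical. The penalty is small when tools with high similarity have nearby embeddings, and it equals the trace form tr(Z^T L Z) with L = D - S.

```python
def laplacian_penalty(embeddings, similarity):
    """0.5 * sum_ij S_ij * ||z_i - z_j||^2 for a symmetric similarity matrix S."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            total += similarity[i][j] * sq_dist
    return 0.5 * total


def laplacian_trace_form(embeddings, similarity):
    """Same quantity computed as tr(Z^T L Z), where L = D - S and D is the degree matrix."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        degree = sum(similarity[i])
        for j in range(n):
            l_ij = (degree if i == j else 0.0) - similarity[i][j]
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            total += l_ij * dot
    return total
```

In this toy setting, linking two tools whose embeddings are one unit apart gives a penalty of 1.0, while linking two tools that are far apart gives a much larger value, which is the pressure that pulls related tools toward similar codes.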
1. The paper states that generative methods suffer from a semantic bottleneck for complex reasoning and that the “one-token-per-tool” paradigm faces critical scalability and generalization challenges. However, prior work, such as ToolGen, has already implemented semantic and hierarchical indexing for tools, addressing these challenges. Thus, while these are valid drawbacks of the basic one-token paradigm, presenting them as new challenges motivating ToolWeaver seems overstated. 2. The paper is
1. Scalability. Demonstrates logarithmic vocabulary growth with better or comparable accuracy vs. state-of-the-art while degrading general NLP ability far less than one-token approaches. This is highly relevant as tool libraries scale, and it is a new contribution built on established trie-like data structures. 2. Refactors tool IDs as structured codes learned with collaborative semantics, rather than flat special tokens; this addresses vocabulary blow-up and enables shared structure across related tools. 3. The
1. The per-batch optimal-transport uniformity reduces collisions, but its stability, compute overhead, and behavior under streaming addition of new tools (without retraining codebooks) are unclear. How do codes evolve when tools are added or removed? 2. The evaluation is not fully convincing when GPT-4o-mini is the judge. ToolBench is the primary testbed; API-Bank is mentioned in related work but not evaluated. Given SoWR uses GPT-4o-mini as the judge (which can bias toward itself), the adjustment is noted
Taxonomy
Topics: Machine Learning in Materials Science · Digital Humanities and Scholarship · Software Engineering Research
