Enhancing Tool Calling in LLMs with the International Tool Calling Dataset
Zuoyu Zhang, Yancheng Zhu

TL;DR
This paper introduces the International Tool Calling (ITC) benchmark, a large-scale, multilingual dataset with real APIs to evaluate and improve large language models' ability to call tools across diverse languages and regions.
Contribution
It presents the ITC benchmark, addressing limitations of prior datasets by including real APIs, multilingual tasks, and diverse geographic scenarios, and demonstrates its effectiveness in improving LLM performance.
Findings
Fine-tuning on ITC improves non-English query handling.
Significant performance gaps exist between open- and closed-source LLMs.
ITC enhances cross-lingual generalization and robustness.
Abstract
Tool calling allows large language models (LLMs) to interact with external systems like APIs, enabling applications in customer support, data analysis, and dynamic content generation. While recent benchmarks have advanced tool-use research, they suffer from key limitations, including reliance on simulated or restricted APIs, limited reproducibility, and a lack of cultural and geographic diversity. To address these gaps, we introduce International Tool Calling (ITC), a large-scale, multilingual benchmark designed for realistic, globally distributed tool-calling scenarios. ITC includes 3,571 real APIs and 17,540 tool calling tasks across 20 categories and 40 countries. Experiments reveal substantial performance gaps between open- and closed-source LLMs, while fine-tuning on ITC yields significant improvements, particularly for non-English queries, enhancing cross-lingual generalization,…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1) The proposed ITC dataset is valuable for the community, considering its size, multilingual coverage (29 languages), and inclusion of both global and region-specific real publicly accessible APIs. 2) The paper benchmarks a wide range of models using multiple metrics (tool selection, invocation, format, and language matching), and provides error analysis and ablation studies.
1) The benchmark results show GPT-4o achieving the best performance, while o1-mini and o3-mini perform worse (even below GPT-4o-mini). This raises concerns about the reliability and fairness of the evaluation, especially since the answers in the dataset are generated by GPT-4o, potentially introducing bias in favor of this model. The paper does not systematically analyze or discuss this bias, nor does it provide sufficient ablation or sanity checks (e.g., cross-model answer generation, human eva
The strengths of the paper: - The dataset is open-sourced and benefits community research in related areas.
The weaknesses of the paper are listed below, - Introduction Section: - The authors first present ToolLLM as advancing real-world tool invocation, then claim “others use proprietary or inaccessible APIs, as in ToolLLM.” ToolLLM’s core point is using (mostly) publicly documented APIs (often via hubs/marketplaces with keys). Calling it “proprietary/inaccessible” is misleading and internally inconsistent with the prior praise. - The “US-centric” generalization + Yahoo-Weather exam
1. The dataset offers broad multilingual coverage, addressing a major gap since most existing tool-calling datasets are limited to English. 2. It is built on real, publicly accessible APIs rather than simulated or proprietary ones, making the data more authentic, reproducible, and closer to real-world applications.
Weaknesses: 1. The models evaluated are somewhat dated — including newer ones like GPT-5 or DeepSeek-V3.1 could make the results more representative. 2. The benchmark scores are already very high — for example, Watt-Tool and GPT-4o have achieved excellent results(>80%), which makes it difficult to observe meaningful performance differences across models. In addition, DeepSeek-V3’s Invocation F1 appears even higher than Watt-Tool’s, which might suggest a labeling or reporting inconsistency. 3.
The paper makes a meaningful contribution by expanding the toolset beyond English and including multiple languages and geographies. The dataset also includes a sizeable training set, which model builders can leverage to expand their model capabilities to languages beyond English.
**1. Non-standard function specification in the system prompt:** Looking at the submitted data, I found that the system prompt contains the function definition (see example below) instead of a separate tools key. This prevents the tokenizer from applying the appropriate chat template, which in turn negatively affects the model's performance. I suspect that due to this formatting, some of the models have lower-than-expected performance numbers, thus making the evaluation numbers unreliable in m
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · AI in Service Interactions
