Advancing and Benchmarking Personalized Tool Invocation for LLMs

Xu Huang; Yuefeng Huang; Weiwen Liu; Xingshan Zeng; Yasheng Wang; Ruiming Tang; Hong Xie; Defu Lian

arXiv:2505.04072·cs.CL·May 8, 2025

TL;DR

This paper introduces personalized tool invocation for LLMs, proposing a new framework and benchmark to improve user-specific tool selection and parameter inference, with fine-tuned models demonstrating effectiveness.

Contribution

It presents the concept of personalized tool invocation, a data synthesis framework (PTool), and the first benchmark (PTBench) for evaluating this capability in LLMs.

Findings

01

Effective fine-tuning of open-source models on PTool data.

02

PTBench provides a standardized evaluation for personalized tool invocation.

03

Models show improved performance on personalized tool tasks.

Abstract

Tool invocation is a crucial mechanism for extending the capabilities of Large Language Models (LLMs) and has recently garnered significant attention. It enables LLMs to solve complex problems through tool calls while accessing up-to-date world knowledge. However, existing work primarily focuses on the fundamental ability of LLMs to invoke tools for problem-solving, without considering personalized constraints in tool invocation. In this work, we introduce the concept of Personalized Tool Invocation and define two key tasks: Tool Preference and Profile-dependent Query. Tool Preference addresses user preferences when selecting among functionally similar tools, while Profile-dependent Query considers cases where a user query lacks certain tool parameters, requiring the model to infer them from the user profile. To tackle these challenges, we propose PTool, a data synthesis framework…
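The two tasks named in the abstract can be illustrated with a small rule-based sketch. This is not the paper's method (there, an LLM is expected to make these decisions); it only shows the inputs and outputs each task involves. All names here (`Tool`, `UserProfile`, `select_tool`, `fill_parameters`, the platforms `FoodA` and `FoodB`, and the profile fields) are hypothetical and do not come from PTool or PTBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    platform: str            # hypothetical platform that provides the tool
    required_params: tuple   # parameter names the tool call needs


@dataclass
class UserProfile:
    attributes: dict         # hypothetical basic profile fields
    platform_usage: dict     # hypothetical past-invocation counts per platform


def select_tool(candidates, profile):
    """Tool Preference: among functionally similar tools, pick the one
    on the platform this user has used most often (assumed heuristic)."""
    return max(candidates, key=lambda t: profile.platform_usage.get(t.platform, 0))


def fill_parameters(tool, query_params, profile):
    """Profile-dependent Query: take parameters stated in the query first,
    then fall back to the user profile; report anything still unresolved."""
    filled, missing = {}, []
    for param in tool.required_params:
        if param in query_params:
            filled[param] = query_params[param]
        elif param in profile.attributes:
            filled[param] = profile.attributes[param]
        else:
            missing.append(param)
    return filled, missing


profile = UserProfile(
    attributes={"delivery_address": "12 Elm Street"},
    platform_usage={"FoodA": 2, "FoodB": 9},
)
candidates = [
    Tool("order_food", "FoodA", ("dish", "delivery_address")),
    Tool("order_food", "FoodB", ("dish", "delivery_address")),
]

# The query "Order a pizza" states the dish but not the delivery address.
chosen = select_tool(candidates, profile)
call_args, unresolved = fill_parameters(chosen, {"dish": "pizza"}, profile)
print(chosen.platform, call_args, unresolved)
```

In this toy case the user's history favors `FoodB`, and the missing `delivery_address` is recovered from the profile rather than from the query, which mirrors the two constraints the abstract describes.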

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Clearly distinguishes the two key challenges of tool preference and parameter inference, thereby filling a gap in the existing tool learning literature regarding personalization. 2. Proposes a systematic data synthesis framework (PTool) that covers tool generation, user profile construction, behavior simulation, and query-solution generation. The pipeline is clear and extensible. 3. Constructs the first benchmark for personalized tool invocation (PTBench), comprising high-quality, human-anno

Weaknesses

1. While PTBench contains 1,199 test samples, the training data is primarily synthetic and does not incorporate real user behavior data, which might affect the model's generalization in real-world scenarios. 2. Evaluation metrics rely heavily on automated measures, lacking subjective assessment of the model's reasoning process or user satisfaction, making it difficult to fully reflect the "human-like" quality of personalized invocations. 3. The description of the user profile construction process

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. This paper proposes a paradigm for personalized tool invocation, which is an important research topic. 2. A complete pipeline for data synthesis with rule-based + LLM verification and human-in-the-loop curation yields a usable benchmark. 3. Experiments with both API and OSS models show the effectiveness of the proposed PTool method.

Weaknesses

1. All data are synthetic (seeded by GPT-4-turbo) with a limited manual verification scale (1,199 test items), leaving open how results transfer to real-world scenarios. 2. While many baselines are reported, there is no head-to-head comparison against other personalization methods or retrieval-augmented profile resolvers, and no BFCL evaluation. 3. The code and experimental data are not open-sourced.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

Personalized tool use is a pivotal challenge for practical LLM deployment. This paper frames the problem via tool personalization and parameter personalization, proposes a data-generation pipeline accordingly, and substantiates its design with targeted ablation studies.

Weaknesses

1. The claim to be the first benchmark for evaluating personalized tool invocation is likely overstated. Contemporary works [1] [2] already target personalized tool use. 2. The scale and diversity of the dataset may be insufficient. With 80 users and 3 platforms per scenario, it’s unclear whether user profile coverage is broad enough and whether platform differences are strongly identifiable across scenarios. 3. The benchmark hides profiles at query time but expects profile-based completion. Add adversarial pai

Code & Models

Repositories

hyfshadow/ptbench (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification