TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks
Hanwen Xu, Xuyao Huang, Yuzhe Liu, Kai Yu, Zhijie Deng

TL;DR
This paper introduces TPS-Bench, a benchmark for evaluating large language models' ability to plan and schedule tools for complex, multi-step tasks, highlighting differences in model strategies and potential improvements through reinforcement learning.
Contribution
The paper presents TPS-Bench, a new benchmark with 200 tasks to assess LLMs' tool planning and scheduling, and provides empirical insights into model performance and reinforcement learning enhancements.
Findings
Most models can plan tools reasonably well.
Models differ significantly in scheduling strategies.
Reinforcement learning can improve efficiency and performance.
Abstract
Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding. Yet, it remains underexplored whether LLM agents can tackle compounding real-world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling. TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a tool repository containing hundreds of model context protocol (MCP) tools. In particular, each task is composed of multiple subtasks, such as web search, map navigation, calendar checking, etc., and each subtask can…
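The scheduling idea in the abstract, ordering subtasks so that independent ones need not wait on each other, can be illustrated with a minimal Python sketch. This is not the paper's method or its evaluation harness. The dependency graph, the `summarize` subtask, and the function name `schedule_in_waves` are hypothetical and chosen only for illustration; the other subtask names echo the examples the abstract mentions. The sketch groups subtasks into "waves" whose members have no unmet dependencies and could therefore run in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def schedule_in_waves(dependencies):
    """Group subtasks into waves.

    `dependencies` maps each subtask to the set of subtasks it must wait for.
    Every subtask in a wave has all of its dependencies satisfied by earlier
    waves, so the members of one wave could be executed concurrently.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


# Hypothetical compounding task (assumed for illustration, not from the paper).
deps = {
    "web_search": set(),
    "calendar_check": set(),
    "map_navigation": {"web_search"},
    "summarize": {"map_navigation", "calendar_check"},
}

waves = schedule_in_waves(deps)
for index, wave in enumerate(waves, start=1):
    print(f"wave {index}: {wave}")
```

Under these assumed dependencies, the four subtasks fit into three waves instead of four strictly sequential steps, because `web_search` and `calendar_check` do not depend on each other.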
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Topic Modeling
