TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks

Hanwen Xu; Xuyao Huang; Yuzhe Liu; Kai Yu; Zhijie Deng

arXiv:2511.01527 · cs.AI · November 4, 2025

Open Access

TL;DR

This paper introduces TPS-Bench, a benchmark for evaluating large language models' ability to plan and schedule tools for complex, multi-step tasks, highlighting differences in model strategies and potential improvements through reinforcement learning.

Contribution

The paper presents TPS-Bench, a new benchmark with 200 tasks to assess LLMs' tool planning and scheduling, and provides empirical insights into model performance and reinforcement learning enhancements.

Findings

1. Most models can plan tools reasonably well.

2. Models differ significantly in scheduling strategies.

3. Reinforcement learning can improve efficiency and performance.

Abstract

Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding. Yet, it remains underexplored whether LLM agents can tackle compounding real-world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling. TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a tool repository containing hundreds of model context protocol (MCP) tools. In particular, each task is composed of multiple subtasks, such as web search, map navigation, calendar checking, etc., and each subtask can…
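To make the scheduling side of the problem concrete, here is a minimal sketch of one way an agent could order tool calls. It is not taken from the paper. It assumes that subtasks form a dependency graph and that subtasks with no pending prerequisites may run concurrently. The subtask names and the `schedule_in_waves` helper are hypothetical, chosen only to mirror the examples in the abstract.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def schedule_in_waves(dependencies):
    """Group subtasks into waves of mutually independent work.

    `dependencies` maps each subtask name to the set of subtasks that
    must finish before it can start. Every subtask in a wave has all of
    its prerequisites in earlier waves, so a wave could run in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for deterministic output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock successors
    return waves


# Hypothetical compound task (illustrative names, not from the benchmark).
subtasks = {
    "web_search": set(),
    "calendar_check": set(),
    "map_navigation": {"web_search"},
    "summarize": {"map_navigation", "calendar_check"},
}

waves = schedule_in_waves(subtasks)
for step, wave in enumerate(waves, start=1):
    print(step, wave)
```

Under these assumptions the four subtasks fit into three waves, whereas a purely sequential schedule would take four steps. The gap widens as more subtasks become independent of one another.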

Taxonomy

Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Topic Modeling