ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents
Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

TL;DR
ShortcutsBench is a comprehensive benchmark designed to evaluate API-based agents' ability to handle complex, real-world tasks, revealing significant limitations in current models' reasoning and task execution capabilities.
Contribution
Introduces a large-scale, real-world benchmark for API-based agents, including diverse APIs and detailed evaluation protocols, to assess their reasoning and task-solving abilities.
Findings
Existing benchmarks struggle with advanced reasoning tasks.
Current API-based agents face significant challenges in complex query handling.
Evaluation reveals limitations in both open-source and closed-source LLMs.
Abstract
Recent advancements in integrating large language models (LLMs) with application programming interfaces (APIs) have gained significant interest in both academia and industry. Recent work demonstrates that these API-based agents exhibit relatively strong autonomy and planning capabilities. However, their ability to handle multi-dimensional difficulty levels, diverse task types, and real-world demands remains unknown. In this paper, we introduce ShortcutsBench, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving real-world complex tasks. ShortcutsBench includes a wealth of real APIs from Apple Inc., refined user queries, human-annotated high-quality action sequences, detailed parameter filling values, and parameters requesting necessary input from the system or user. We revealed how existing benchmarks/datasets struggle to…
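The components listed in the abstract (a user query, an annotated action sequence, parameter fill values, and parameters that must be requested from the system or user) can be pictured as a simple record. The following is a minimal sketch under stated assumptions, not the benchmark's actual schema or scoring: the class names, field names, API names, and the position-wise accuracy function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One API call in an annotated action sequence (hypothetical layout)."""
    api_name: str
    # Parameter name -> literal fill value.
    parameters: dict[str, str] = field(default_factory=dict)
    # Parameters whose value must be requested from the system or the user.
    ask_parameters: list[str] = field(default_factory=list)


@dataclass
class Task:
    """A user query paired with its reference action sequence (hypothetical)."""
    query: str
    actions: list[Action]


def api_selection_accuracy(predicted: list[str], task: Task) -> float:
    """Fraction of reference steps whose API name is matched at the same position.

    This is an illustrative metric, not necessarily the one used in the paper.
    """
    reference = [a.api_name for a in task.actions]
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Made-up example task; the API names are placeholders, not real Shortcuts APIs.
task = Task(
    query="Text my current location to a contact",
    actions=[
        Action("get_current_location"),
        Action("send_message", {"body": "<output of step 1>"}, ["recipient"]),
    ],
)

# The agent picks the first API correctly and the second one incorrectly.
print(api_selection_accuracy(["get_current_location", "send_email"], task))
```

Keeping `ask_parameters` separate from `parameters` mirrors the distinction the abstract draws between values that are filled in directly and values that the agent has to request.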
Peer Reviews
Decision: ICLR 2025 Poster
This paper makes a notable contribution by creating a comprehensive benchmark for API-based agents, utilizing data extracted from Shortcuts. Compared to other API-based benchmarks, it offers several benefits, including a focus on agents' ability to request necessary input from either the assistant or the user, and tasks of varying difficulty. It covers a range of tasks, from simple ones to those involving complex APIs, queries, and action sequences. Additionally, the paper ensures quality by invo
1. Although the paper emphasizes that the benchmark includes high-quality human-annotated action sequences from shortcut developers and queries derived from real user demands, it only mentions that the shortcut developers are our annotators. Further details in this area would be beneficial.
2. In Section 3.2, the paper describes using GPT-4o to simulate user queries. However, it would be helpful to include the steps taken to verify the correctness and ensure the diversity of these user queries.
3.
- The authors introduce ShortcutsBench, a more holistic benchmark that contains real APIs and well-designed queries and actions. This could contribute to better evaluation of current agents' API-calling capabilities in solving real-world tasks.
- The authors provide example instances from ShortcutsBench in the appendix, which helps in understanding the types of tasks in this benchmark.
- The authors provide a detailed analysis based on the evaluation results of several API-based agents.
- Section 3 is not well elaborated; readers would benefit from a clearer description of this process. For example, for (2), the authors say 'after duplicating based on icloud link, ....', but it is not clear what is duplicated or why this step helps. It would be good if the authors could refine the description of their methodology.
- The authors cite each work too many times in the paper; for example, one research paper is cited five times in a single paragraph in Section 2. Referencing previous w
- Interesting approach: mining APIs and existing action sequences from Shortcut apps makes a lot of sense, and it is a resource previous work has not tapped into.
- Comprehensive evaluation: the authors evaluated a wide range of LMs, covering both open-source and closed-source models.
1. Limited evaluation: The paper primarily assesses the model’s ability to choose correct actions based on ground-truth sequences but doesn’t evaluate its end-to-end task success rate (as done in, for example, AppWorld [1]). Experiments linking these aspects are missing.
2. No human validation: Given the synthetic nature of the benchmark, it’s uncertain whether all tasks are truly solvable or what the benchmark’s upper bound is. Including human performance as a reference would add clarity.
[1]
Taxonomy
Topics: Business Process Modeling and Analysis · Multi-Agent Systems and Negotiation · Service-Oriented Architecture and Web Services
Methods: Attention Is All You Need · Linear Layer · Adam · Dropout · Dense Connections · Weight Decay · Multi-Head Attention · Residual Connection
