Live API-Bench: 2500+ Live APIs for Testing Multi-Step Tool Calling
Benjamin Elder, Anupama Murthi, Jungkoo Kang, Ankita Rajaram Naik, Kiran Kate, Kinjal Basu, Danish Contractor

TL;DR
Live API Bench is a large, realistic benchmark of over 2,500 APIs derived from NL2SQL datasets, enabling systematic evaluation of LLMs' multi-step tool-calling capabilities across diverse domains.
Contribution
We introduce Live API Bench, a comprehensive, reproducible benchmark that transforms NL2SQL datasets into interactive API environments for evaluating LLM tool use.
Findings
LLMs achieve low task completion rates on the benchmark, ranging from 7% to 47%.
Interactive agent settings raise completion rates only modestly, to 50%.
These results highlight significant room for improving LLM tool-calling abilities.
Abstract
Large language models (LLMs) increasingly rely on external tools and APIs to execute complex tasks specified in natural language. Evaluating such tool-calling capabilities in realistic enterprise settings is challenging: APIs are often proprietary, heterogeneous, and difficult to share, limiting reproducible benchmarks. To address this, we introduce Live API Bench, a comprehensive benchmark constructed by transforming NL2SQL datasets into interactive API environments. Our pipeline converts SQL queries from BIRD-SQL into executable API sequences across three formulations (SLOT, SEL, and REST) covering minimal general-purpose operations, domain-specific multi-step tasks, and function-oriented RESTful interactions, respectively. The benchmark spans 11 databases with over 2,500 invocable tools, paired with human-authored queries, ground-truth API sequences, and verified final answers. Live…
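To make the SQL-to-API idea concrete, the sketch below expresses one SQL query as a sequence of general-purpose tool calls and checks that both routes give the same final answer. It is an illustration only, not the benchmark's pipeline: the toy `employees` table, the tool names `filter_rows` and `aggregate`, and the call format are all hypothetical and do not come from the paper or from BIRD-SQL.

```python
import operator
import sqlite3

# Hypothetical toy data; not taken from BIRD-SQL.
ROWS = [
    {"name": "Ada", "dept": "eng", "salary": 120},
    {"name": "Bob", "dept": "eng", "salary": 100},
    {"name": "Cy", "dept": "ops", "salary": 90},
]

# Comparison operators a filter tool might accept (assumed set).
OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}


def filter_rows(rows, column, op, value):
    """Hypothetical tool: keep rows where `column <op> value` holds."""
    return [row for row in rows if OPS[op](row[column], value)]


def aggregate(rows, column, func):
    """Hypothetical tool: reduce one column to a single number."""
    values = [row[column] for row in rows]
    if func == "avg":
        return sum(values) / len(values)
    if func == "sum":
        return sum(values)
    if func == "count":
        return len(values)
    raise ValueError(f"unsupported aggregate: {func}")


TOOLS = {"filter_rows": filter_rows, "aggregate": aggregate}


def run_sequence(rows, calls):
    """Execute tool calls in order, feeding each output to the next call."""
    data = rows
    for call in calls:
        data = TOOLS[call["tool"]](data, **call["args"])
    return data


def run_sql(rows, query):
    """Compute the reference answer by running the SQL query in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (:name, :dept, :salary)", rows)
    result = conn.execute(query).fetchone()[0]
    conn.close()
    return result


SQL = "SELECT AVG(salary) FROM employees WHERE dept = 'eng'"

# The same query written as a two-step tool-call sequence.
CALLS = [
    {"tool": "filter_rows", "args": {"column": "dept", "op": "==", "value": "eng"}},
    {"tool": "aggregate", "args": {"column": "salary", "func": "avg"}},
]

if __name__ == "__main__":
    print(run_sequence(ROWS, CALLS), run_sql(ROWS, SQL))
```

A model under evaluation would be expected to produce something like `CALLS` from a natural-language question; comparing its final answer against the SQL result is one plausible way to score task completion in such a setup.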
Taxonomy
Topics: Service-Oriented Architecture and Web Services
Methods: Sparse Evolutionary Training
