Live API-Bench: 2500+ Live APIs for Testing Multi-Step Tool Calling

Benjamin Elder; Anupama Murthi; Jungkoo Kang; Ankita Rajaram Naik; Kiran Kate; Kinjal Basu; Danish Contractor

arXiv:2506.11266·cs.SE·January 27, 2026

TL;DR

Live API Bench is a large, realistic benchmark of over 2,500 APIs derived from NL2SQL datasets, enabling systematic evaluation of LLMs' multi-step tool-calling capabilities across diverse domains.

Contribution

We introduce Live API Bench, a comprehensive, reproducible benchmark that transforms NL2SQL datasets into interactive API environments for evaluating LLM tool use.

Findings

01 LLMs achieve low task completion rates (7-47%) on the benchmark.

02 Interactive agent settings improve performance only modestly, to 50%.

03 The results highlight significant room for improving LLM tool-calling abilities.

Abstract

Large language models (LLMs) increasingly rely on external tools and APIs to execute complex tasks specified in natural language. Evaluating such tool-calling capabilities in realistic enterprise settings is challenging: APIs are often proprietary, heterogeneous, and difficult to share, which limits reproducible benchmarks. To address this, we introduce Live API Bench, a comprehensive benchmark constructed by transforming NL2SQL datasets into interactive API environments. Our pipeline converts SQL queries from BIRD-SQL into executable API sequences across three formulations (SLOT, SEL, and REST), covering minimal general-purpose operations, domain-specific multi-step tasks, and function-oriented RESTful interactions, respectively. The benchmark spans 11 databases with over 2,500 invocable tools, paired with human-authored queries, ground-truth API sequences, and verified final answers. Live…
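The idea of turning a SQL query into an executable API sequence can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy `students` table, the tool names (`get_rows`, `filter_rows`, `select_column`), and the call format are all hypothetical, chosen only to show how a query could be decomposed into general-purpose tool calls whose final answer is checked against the SQL result.

```python
import operator
import sqlite3

# Hypothetical toy database; the real benchmark uses BIRD-SQL databases.
def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (name TEXT, age INTEGER, city TEXT)")
    conn.executemany(
        "INSERT INTO students VALUES (?, ?, ?)",
        [("Ada", 22, "Paris"), ("Bo", 19, "Rome"), ("Cy", 25, "Paris")],
    )
    return conn

# Hypothetical general-purpose tools (illustrative names, not from the paper).
def get_rows(conn, table):
    # Table name is interpolated directly; acceptable only for this toy sketch.
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

def filter_rows(rows, column, op, value):
    ops = {">": operator.gt, "<": operator.lt, "==": operator.eq}
    return [r for r in rows if ops[op](r[column], value)]

def select_column(rows, column):
    return [r[column] for r in rows]

ROW_TOOLS = {"filter_rows": filter_rows, "select_column": select_column}

def run_sequence(conn, calls):
    """Execute tool calls in order, feeding each output into the next call."""
    state = None
    for call in calls:
        if call["name"] == "get_rows":
            state = get_rows(conn, **call["arguments"])
        else:
            state = ROW_TOOLS[call["name"]](state, **call["arguments"])
    return state

# A SQL query and a hand-written tool-call sequence meant to be equivalent.
SQL = "SELECT name FROM students WHERE age > 20"
CALLS = [
    {"name": "get_rows", "arguments": {"table": "students"}},
    {"name": "filter_rows", "arguments": {"column": "age", "op": ">", "value": 20}},
    {"name": "select_column", "arguments": {"column": "name"}},
]

def task_completed(conn, sql, calls):
    """Compare the sequence's final answer with the SQL result, ignoring order."""
    expected = sorted(row[0] for row in conn.execute(sql))
    return sorted(run_sequence(conn, calls)) == expected
```

In this sketch, a task counts as completed when the final answer of the call sequence matches the answer produced by the original SQL, which mirrors the abstract's pairing of API sequences with verified final answers under the stated assumptions.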

Taxonomy

Topics: Service-Oriented Architecture and Web Services

Methods: Sparse Evolutionary Training