Evaluating LLMs on Sequential API Call Through Automated Test Generation

Yuheng Huang; Jiayang Song; Da Song; Zhenlan Ji; Wenhan Wang; Shuai Wang; Lei Ma

arXiv:2507.09481 · cs.SE · December 3, 2025

TL;DR

This paper introduces StateGen, an automated framework for generating and evaluating complex sequential API call tasks for LLMs, addressing limitations of existing benchmarks by providing diverse, verifiable, and realistic test cases.

Contribution

The paper presents StateGen, a framework for automatically generating realistic, verifiable API-interaction tests for LLMs, and StateEval, a benchmark of 120 verified test cases, filling a gap in current evaluation methods.

Findings

1. StateGen effectively generates challenging API-oriented tasks.

2. StateEval provides a diverse benchmark with 120 verified test cases.

3. Results highlight current LLM limitations in API integration.

Abstract

By integrating tools from external APIs, Large Language Models (LLMs) have expanded their promising capabilities in a diverse spectrum of complex real-world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions that occur between sequential API calls, which are common in real-world applications. To fill the gap, in this paper, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state-machine-based API constraint solving and validation, energy-based sampling, and control-flow injection to…
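To make the idea of state-machine-based validation of sequential API calls concrete, here is a minimal sketch. It is not StateGen's implementation: the file-handle API, the `TRANSITIONS` table, and the `validate` function are hypothetical names chosen for illustration. The sketch only shows how a call sequence can be checked against state constraints instead of being compared to a reference string.

```python
# Hypothetical API with state constraints (an assumption for illustration,
# not taken from the paper): each call is legal only in certain states.
# Maps (current_state, call_name) -> next_state.
TRANSITIONS = {
    ("closed", "open"): "opened",
    ("opened", "read"): "opened",
    ("opened", "write"): "opened",
    ("opened", "close"): "closed",
}


def validate(calls, start="closed"):
    """Walk the call sequence through the state machine.

    Returns (ok, index_of_first_illegal_call_or_None, state_reached).
    """
    state = start
    for i, call in enumerate(calls):
        next_state = TRANSITIONS.get((state, call))
        if next_state is None:
            # The call is not permitted in the current state.
            return False, i, state
        state = next_state
    return True, None, state


if __name__ == "__main__":
    print(validate(["open", "write", "close"]))
    print(validate(["open", "close", "read"]))
```

Under this toy model, `["open", "read", "write", "close"]` and `["open", "write", "read", "close"]` are both accepted even though they differ as strings, which is the kind of case where string matching against a single reference answer would misjudge a valid solution.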

Code & Models

Datasets

yuhenghuang/StateEval (dataset, 9 downloads)

Taxonomy

Topics: Software Testing and Debugging Techniques · Software System Performance and Reliability · Digital Rights Management and Security