RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements

Leon Kogler, Stefan Hangler, Maximilian Ehrhart, Benedikt Dornauer, Roland Wuersching, and Peter Schrammel

arXiv:2604.25862 · cs.SE · April 29, 2026

TL;DR

RESTestBench is a benchmark designed to evaluate the effectiveness of LLM-generated REST API test cases based on natural language requirements, addressing limitations of traditional metrics by introducing requirement-based mutation testing.

Contribution

The paper introduces RESTestBench, a benchmark of three REST services paired with manually verified NL requirements, together with a requirements-based mutation testing metric for evaluating generated tests, and uses it to assess LLM-based test generation approaches.

Findings

01. Test effectiveness drops with faulty or mutated code, especially for vague requirements.

02. Refinement-based generation benefits less when requirement detail is high.

03. RESTestBench enables controlled, reproducible evaluation of requirement-based test generation.

Abstract

Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple…
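To make the idea of a requirements-based mutation score concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not give the metric's exact definition, so the scoring rule (fraction of requirement-violating mutants on which the test fails, counted only if the test passes on the original service), the toy requirement, and every name below (`original_service`, `mutant_*`, `generated_test`, `requirement_mutation_score`) are assumptions chosen for illustration. Services are modelled as plain functions rather than real HTTP endpoints.

```python
from typing import Callable, Dict, List

# Illustrative stand-ins: a "service" maps a request dict to a response dict,
# and a "test" returns True when the service passes the test.
Service = Callable[[Dict], Dict]
Test = Callable[[Service], bool]


def original_service(request: Dict) -> Dict:
    # Hypothetical requirement: "Creating an item with a negative price
    # is rejected with status 400; otherwise it is created with 201."
    if request["price"] < 0:
        return {"status": 400}
    return {"status": 201}


def mutant_accepts_negative(request: Dict) -> Dict:
    # Mutant: validation removed, negative prices are accepted.
    return {"status": 201}


def mutant_wrong_status(request: Dict) -> Dict:
    # Mutant: rejection uses the wrong status code.
    if request["price"] < 0:
        return {"status": 500}
    return {"status": 201}


def mutant_boundary(request: Dict) -> Dict:
    # Mutant: boundary shifted, a price of 0 is now also rejected.
    if request["price"] <= 0:
        return {"status": 400}
    return {"status": 201}


def generated_test(service: Service) -> bool:
    # A test as an LLM might derive it from the requirement text.
    return service({"price": -1})["status"] == 400


def requirement_mutation_score(
    test: Test, original: Service, mutants: List[Service]
) -> float:
    """Fraction of requirement-specific mutants that the test detects.

    Assumption: a test that fails on the original service is treated as
    invalid and scores 0.0.
    """
    if not mutants:
        raise ValueError("at least one mutant is required")
    if not test(original):
        return 0.0
    killed = sum(1 for mutant in mutants if not test(mutant))
    return killed / len(mutants)


mutants = [mutant_accepts_negative, mutant_wrong_status, mutant_boundary]
score = requirement_mutation_score(generated_test, original_service, mutants)
print(f"killed {round(score * len(mutants))} of {len(mutants)} mutants")
```

In this toy setup the test detects the first two mutants but not the boundary mutant, because it never sends a price of 0. That is the kind of gap a per-requirement mutation score is meant to expose and that code coverage alone would not.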
