Assessing REST API Test Generation Strategies with Log Coverage
Nana Reinikainen, Mika Mäntylä, Yuqing Wang

TL;DR
This paper evaluates different REST API test generation strategies using novel log coverage metrics, revealing their complementarity and effectiveness in uncovering diverse runtime behaviors.
Contribution
It introduces three log coverage metrics for black-box testing and empirically compares evolutionary, LLM-based, and human tests on a microservice system.
Findings
Claude Opus 4.6 uncovers 28.4% more log templates than human tests.
Combining human and Claude tests increases total log coverage by 78.4%.
GPT-5.2-Codex uncovers 38.6% fewer unique log templates than human tests but complements other strategies.
Abstract
Assessing the effectiveness of REST API tests in black-box settings can be challenging because source code coverage metrics are inaccessible and the tech stack is polyglot. We propose three metrics that capture average, minimum, and maximum log coverage to handle the diverse test generation results and runtime behaviors over multiple runs. Using log coverage, we empirically evaluate three REST API test generation strategies on the Light-OAuth2 authorization microservice system: evolutionary computing (EvoMaster v5.0.2), LLMs (Claude Opus 4.6 and GPT-5.2-Codex), and human-written Locust load tests. On average, Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests, whereas EvoMaster and GPT-5.2-Codex find 26.1% and 38.6% fewer, respectively. Next, we analyze combined log coverage to assess complementarity between strategies. Combining human-written tests with…
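The abstract does not spell out how the three metrics are computed, so the following is only a minimal sketch under stated assumptions: each test run is represented as a set of unique log template IDs, per-run coverage is taken to be the size of that set, and combined coverage is taken to be the union of templates across strategies. The function names (`coverage_summary`, `combined_coverage`, `relative_change`) and the template IDs are hypothetical and not taken from the paper.

```python
from statistics import mean


def coverage_summary(runs):
    """Summarize log coverage over repeated runs of one strategy.

    `runs` is a list of sets; each set holds the unique log template IDs
    observed in one run (an assumed representation, not the paper's).
    """
    counts = [len(templates) for templates in runs]
    return {"avg": mean(counts), "min": min(counts), "max": max(counts)}


def combined_coverage(*strategies):
    """Union of all templates seen in any run of any given strategy."""
    combined = set()
    for runs in strategies:
        for templates in runs:
            combined |= templates
    return combined


def relative_change(value, baseline):
    """Percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


# Illustrative data only; these template IDs are made up.
human_runs = [{"t1", "t2"}, {"t1", "t2", "t3"}]
llm_runs = [{"t2", "t4"}, {"t2", "t4", "t5"}]

human = coverage_summary(human_runs)
llm = coverage_summary(llm_runs)
both = combined_coverage(human_runs, llm_runs)

print(human, llm, sorted(both))
print(relative_change(len(both), human["max"]))
```

Reporting the minimum and maximum alongside the average is one way to show how much a nondeterministic generator varies between runs, and the union makes visible which templates one strategy reaches that another does not.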
