LLM-Based Robustness Testing of Microservice Applications: An Empirical Study
Hrushitha Goud Tigulla, Marco Vieira

TL;DR
This study evaluates how different prompt strategies and models affect the diversity and effectiveness of LLM-generated robustness tests for microservice APIs, revealing that prompt design significantly impacts failure coverage.
Contribution
The study introduces and empirically compares prompt strategies, including GuidedFewShot, showing how they influence failure detection and why domain context matters in LLM-based testing.
Findings
Prompt strategy influences failure diversity more than model size.
GuidedFewShot achieves high failure mode coverage with low similarity.
Taxonomy rules alone are insufficient without concrete examples.
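The two quantities named in these findings, failure mode coverage and similarity between runs, can be illustrated with a minimal sketch. The summary does not say which similarity metric the paper uses, so Jaccard similarity over sets of triggered failure modes is an assumption here, and the function names and failure-mode labels (`F1`, `F2`, ...) are hypothetical.

```python
from itertools import combinations


def coverage(detected, known):
    """Fraction of the known failure modes that a run triggered."""
    known = set(known)
    return len(set(detected) & known) / len(known)


def jaccard(a, b):
    """Jaccard similarity of two failure sets (assumed metric, not the paper's)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(runs):
    """Average Jaccard similarity over all pairs of runs; 0.0 if fewer than two runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical failure sets from three runs against a system with four known modes.
known_modes = {"F1", "F2", "F3", "F4"}
runs = [{"F1", "F2"}, {"F2", "F3"}, {"F4"}]

union_coverage = coverage(set().union(*runs), known_modes)
similarity = mean_pairwise_similarity(runs)
```

Under this reading, a strategy with high coverage and low similarity is one whose runs jointly trigger most known failure modes while each run triggers a different subset.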
Abstract
Malformed, missing, or boundary-value inputs in microservice APIs can cascade across dependent services, threatening reliability. Robustness testing systematically exercises such inputs to expose server-side failures, but generating diverse, effective tests remains challenging. Large Language Models can generate such tests from API specifications; however, it is unknown whether different models and prompt strategies produce diverse failure sets or converge on the same failures. We report a controlled experiment applying 7 prompt strategies to 3 open-source LLMs (14B-70B parameters) targeting 2 architecturally distinct microservice systems: one monolingual Java system (6 services, 9 failure modes) and one polyglot system (27 services, 14 failure modes), yielding 38 valid runs and 663 generated tests. We find that prompt strategy explains more variation in diversity than model size: a Structured prompt…
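To make the input classes mentioned in the abstract concrete, the sketch below derives missing, null, wrong-type, and boundary-value variants from one valid request payload. It is a hand-written illustration of what a robustness test varies, not the LLM-based generation the paper studies; the field names, boundary values, and the `robustness_variants` helper are all hypothetical.

```python
# Illustrative boundary values per Python type (an assumption, not from the paper).
BOUNDARY_VALUES = {
    int: [0, -1, 2**31 - 1, -(2**31)],
    str: ["", "a" * 10_000],
}


def robustness_variants(valid_payload):
    """Yield (label, payload) pairs, each perturbing exactly one field."""
    for field, value in valid_payload.items():
        # Missing input: drop the field entirely.
        yield f"missing:{field}", {
            k: v for k, v in valid_payload.items() if k != field
        }
        # Null input: keep the field but send no value.
        yield f"null:{field}", {**valid_payload, field: None}
        # Wrong type: swap numbers for a string and everything else for a number.
        wrong = "not-a-number" if isinstance(value, (int, float)) else 12345
        yield f"wrong_type:{field}", {**valid_payload, field: wrong}
        # Boundary values for the field's original type, if any are listed.
        for i, boundary in enumerate(BOUNDARY_VALUES.get(type(value), [])):
            yield f"boundary:{field}:{i}", {**valid_payload, field: boundary}


# Hypothetical valid request body for an order endpoint.
payload = {"orderId": 7, "customerName": "Ada"}
variants = dict(robustness_variants(payload))
```

Each variant would be sent to the target endpoint, with a server-side error response (for example an HTTP 5xx status) treated as a candidate robustness failure.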
