Practical Limits of Autonomous Test Repair: A Multi-Agent Case Study with LLM-Driven Discovery and Self-Correction
Hyukjoo Lee

TL;DR
This study evaluates a multi-agent autonomous testing system driven by large language models. The system proves capable of feature discovery and self-repair, but reliable enterprise testing still requires constraints and human oversight.
Contribution
The paper presents an industrial case study demonstrating autonomous UI test repair and feature discovery using LLMs, emphasizing the importance of constraints for operational reliability.
Findings
The system discovered over 100 features across 10 UI screens.
It achieved a 70% repair convergence rate, with an average of 3.4 iterations.
Unrestricted autonomy led to unstable outcomes, so constraints and human oversight were required.
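As a rough illustration of how figures like the convergence rate and average iteration count above could be derived from per-run execution reports, here is a minimal Python sketch. The `RepairReport` record, its fields, and the `summarize` helper are hypothetical and do not come from the paper. Whether the paper averages iterations over converged runs only or over all runs is not stated in this summary; the sketch assumes converged runs only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RepairReport:
    """Hypothetical record of one autonomous repair attempt (not the paper's schema)."""
    test_id: str
    iterations: int   # repair iterations used in this run
    converged: bool   # True if the test passed before the iteration budget ran out


def summarize(reports):
    """Return (convergence_rate, mean_iterations_over_converged_runs).

    Assumption: the iteration average is taken over converged runs only.
    """
    if not reports:
        raise ValueError("no reports to summarize")
    converged = [r for r in reports if r.converged]
    rate = len(converged) / len(reports)
    # With no converged runs the average is undefined, so report NaN.
    avg = mean(r.iterations for r in converged) if converged else float("nan")
    return rate, avg


# Made-up sample data, for illustration only.
sample = [
    RepairReport("login_form", 2, True),
    RepairReport("search_filter", 3, True),
    RepairReport("export_dialog", 4, True),
    RepairReport("bulk_edit", 5, False),
]
rate, avg_iterations = summarize(sample)
```

Keeping the convergence flag separate from the iteration count makes it explicit that a run which exhausts its budget is counted as a failure rather than being folded into the average.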
Abstract
Maintaining reliable UI test suites in large-scale enterprise applications is a persistent and costly challenge. We present an industrial case study of a multi-agent autonomous testing system evaluated using anonymized execution data from a production-like enterprise UI testing prototype. The application features several hundred dynamic UI elements per screen. Built on a large language model with LangGraph orchestration, Playwright execution, and a RAG knowledge base, the system evolves from human-directed testing toward high-autonomy feature discovery and test execution: given no explicit test targets, it discovers over 100 testable features across 10 UI screens, dynamically expands coverage by an additional 15 to 30 features through runtime DOM analysis, and iteratively repairs failing tests without human intervention. We analyzed 300 consecutive autonomous execution reports…
