WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Jun Du, Wenchong Zeng, Han Li, Kun Gai

TL;DR
WebTestBench is a comprehensive benchmark designed to evaluate and improve end-to-end automated web testing by assessing large language models' ability to generate checklists and detect defects in diverse web applications.
Contribution
The paper introduces WebTestBench, a new benchmark for evaluating automated web testing, and proposes WebTester, a baseline framework to address current challenges in the field.
Findings
LLMs show limited test completeness and limited defect-detection capability.
Current automated testing methods face reliability issues in open-ended environments.
A significant gap exists between current capabilities and industrial requirements.
Abstract
The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement: automatically verifying whether web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process…
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software System Performance and Reliability
