WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

Fanheng Kong; Jingyuan Zhang; Yang Yue; Chenxi Sun; Yang Tian; Shi Feng; Xiaocui Yang; Daling Wang; Yu Tian; Jun Du; Wenchong Zeng; Han Li; Kun Gai

arXiv:2603.25226·cs.SE·March 27, 2026

TL;DR

WebTestBench is a comprehensive benchmark designed to evaluate and improve end-to-end automated web testing by assessing large language models' ability to generate checklists and detect defects in diverse web applications.

Contribution

The paper introduces WebTestBench, a new benchmark for evaluating automated web testing, and proposes WebTester, a baseline framework to address current challenges in the field.

Findings

01 LLMs show limited test completeness and defect-detection capabilities.

02 Current automated testing methods face reliability issues in open-ended environments.

03 A significant gap exists between current capabilities and industrial requirements.

Abstract

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process…

Code & Models

Datasets

friedrichor/WebTestBench
dataset · 129 downloads

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software System Performance and Reliability