WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

Peng Yuan; Yuyang Yin; Yuxuan Cai; Zheng Wei

arXiv:2604.10988·cs.AI·April 14, 2026

TL;DR

WebForge introduces an automated framework for creating realistic, reproducible, and scalable browser agent benchmarks with multi-dimensional capability profiling.

Contribution

WebForge presents the first fully automated pipeline that overcomes the realism-reproducibility-scalability trilemma in browser agent benchmarking.

Findings

01. WebForge-Bench includes 934 tasks across 7 domains and 3 difficulty levels.
02. Difficulty stratification effectively differentiates model capabilities.
03. Multi-dimensional evaluation reveals capability biases invisible to aggregate metrics.

Abstract

Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty…
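To make the idea of capability profiling "beyond single aggregate scores" concrete, here is a minimal sketch in Python. It is not the WebForge evaluation code. The records are toy values invented for illustration, the agent names are hypothetical, and tagging each task with a single dimension is a simplifying assumption. The dimension names are the three the abstract lists.

```python
from collections import defaultdict

# Toy records invented for illustration only (not results from the paper).
# Each record is (agent, dimension, passed); one dimension per task is a
# simplifying assumption made for this sketch.
RECORDS = [
    ("agent_a", "navigation_depth", True),
    ("agent_a", "navigation_depth", True),
    ("agent_a", "visual_complexity", False),
    ("agent_a", "visual_complexity", False),
    ("agent_a", "reasoning_difficulty", True),
    ("agent_a", "reasoning_difficulty", True),
    ("agent_b", "navigation_depth", False),
    ("agent_b", "navigation_depth", True),
    ("agent_b", "visual_complexity", True),
    ("agent_b", "visual_complexity", True),
    ("agent_b", "reasoning_difficulty", True),
    ("agent_b", "reasoning_difficulty", False),
]


def aggregate_rate(records, agent):
    """Overall success rate of one agent across all of its tasks."""
    outcomes = [passed for a, _, passed in records if a == agent]
    return sum(outcomes) / len(outcomes)


def per_dimension_rates(records, agent):
    """Success rate of one agent, broken down by difficulty dimension."""
    buckets = defaultdict(list)
    for a, dim, passed in records:
        if a == agent:
            buckets[dim].append(passed)
    return {dim: sum(v) / len(v) for dim, v in buckets.items()}


if __name__ == "__main__":
    for agent in ("agent_a", "agent_b"):
        print(agent, aggregate_rate(RECORDS, agent), per_dimension_rates(RECORDS, agent))
```

In this toy data both agents pass four of six tasks, so their aggregate scores are identical, yet the per-dimension breakdown separates them: the first fails every visually complex task while the second does not. A single aggregate number would hide this kind of gap.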

Code & Models

Repositories

yuandaxia2001/WebForge
github

Datasets

yuandaxia/WebForge
dataset· 4.4k dl
