WebCanvas: Benchmarking Web Agents in Online Environments
Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu

TL;DR
WebCanvas introduces a dynamic online benchmarking framework for web agents, featuring a new evaluation metric, a comprehensive dataset, and open-source tools to assess and improve agent performance in evolving web environments.
Contribution
The paper presents WebCanvas, a novel online evaluation framework with a new metric, a refined dataset, and tools for community-driven assessment of web agents in real-time settings.
Findings
Best agent achieves 23.1% success rate
Dataset contains 542 tasks with 2439 states
Framework enables realistic online evaluation
Abstract
For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) a novel evaluation metric that reliably captures critical intermediate actions or states necessary for task completion while disregarding noise caused by insignificant events or changed web elements; (2) a benchmark dataset called Mind2Web-Live, a refined version of the original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable…
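The key-node idea in component (1) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the `KeyNode` fields, the `score_trajectory` function, the in-order matching rule, and the rule that a task succeeds only when every key node is reached are all hypothetical. Only exact and substring ("include") matching are shown; a semantic matcher would need a language model and is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class KeyNode:
    """One annotated intermediate state (hypothetical schema)."""
    target: str      # observed field to check, e.g. "url" or "element_value"
    match: str       # "exact" or "include"
    reference: str   # annotated reference value


# Match functions keyed by name; "include" is a plain substring test.
MATCHERS: Dict[str, Callable[[str, str], bool]] = {
    "exact": lambda observed, ref: observed == ref,
    "include": lambda observed, ref: ref in observed,
}


def score_trajectory(
    key_nodes: List[KeyNode], trajectory: List[Dict[str, str]]
) -> Tuple[int, float, bool]:
    """Return (key nodes reached, completion rate, task success).

    Assumption: key nodes must be reached in their annotated order, and
    steps that match no key node are simply ignored as noise.
    """
    hit = 0
    for step in trajectory:
        # A single step may satisfy several consecutive key nodes.
        while hit < len(key_nodes):
            node = key_nodes[hit]
            observed = step.get(node.target, "")
            if not MATCHERS[node.match](observed, node.reference):
                break
            hit += 1
    rate = hit / len(key_nodes) if key_nodes else 0.0
    success = bool(key_nodes) and hit == len(key_nodes)
    return hit, rate, success
```

Because only the key nodes are checked, an agent that takes extra clicks or a different route between them is not penalized, which is the property the abstract describes as disregarding noise from insignificant events.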
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The motivation and the problem are very relevant
The technical quality of the work is a concern. The work relates to evaluation methodology, and the main contribution is the proposed benchmark based on key nodes. I expect an analysis of how the proposed metric for web agents correlates with goal metrics such as outcome-based success rate. We can annotate, for a number of agents, outcome results for a representative number of tasks, and compare the correlation between “key nodes-based success rate” and outcome-based success rate ag
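The comparison this review asks for could be run along the following lines. This is only a sketch under stated assumptions: the function name `agreement_and_phi` is hypothetical, each task is assumed to carry a binary key-node verdict and a binary human-annotated outcome verdict, and the phi coefficient (Pearson correlation for two binary variables) is one possible choice of statistic, not one named in the paper or the review.

```python
from math import sqrt
from typing import Optional, Sequence, Tuple


def agreement_and_phi(
    key_node_success: Sequence[int], outcome_success: Sequence[int]
) -> Tuple[float, Optional[float]]:
    """Compare two binary per-task verdicts (1 = success, 0 = failure).

    Returns (raw agreement rate, phi coefficient). Phi is None when either
    verdict is constant across tasks, because it is undefined in that case.
    """
    if len(key_node_success) != len(outcome_success) or not key_node_success:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(key_node_success, outcome_success))
    n11 = sum(1 for k, o in pairs if k == 1 and o == 1)  # both say success
    n10 = sum(1 for k, o in pairs if k == 1 and o == 0)  # key nodes only
    n01 = sum(1 for k, o in pairs if k == 0 and o == 1)  # outcome only
    n00 = sum(1 for k, o in pairs if k == 0 and o == 0)  # both say failure

    agreement = (n11 + n00) / len(pairs)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    phi = (n11 * n00 - n10 * n01) / denom if denom else None
    return agreement, phi
```

The `n10` cell is the one most relevant to the reviewer's concern: it counts tasks where every key node was reached but the annotated outcome was still a failure.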
- Introduces an innovative evaluation framework, WebCanvas, for web agents. By focusing on “key nodes”, this framework provides a more reliable and accurate assessment compared to traditional methods that only consider the final task success rate. - Constructs an online and dynamic benchmark, Mind2Web-Live, that is an enhanced version of the original Mind2Web static dataset. - The authors have developed a community-driven platform where users can report issues with the dataset, and regular updates are
- When the data size was reduced from Mind2Web's original 2000+ tasks to 500+, the authors did not analyze how many different domains Mind2Web-Live covers and whether there are enough tasks for each domain. - The dataset has a scalability problem because updating the data requires people to maintain it. As the dataset grows, maintenance costs will increase.
1. The paper contributes a significant new benchmark for web mining, which is expected to provide substantial value to the research community. 2. The benchmark incorporates several valuable features, including intermediate state evaluations, a user-friendly interface with plugin support, and access to live datasets. 3. The writing is clear and well-structured, with numerous case studies that aid in understanding the framework and its applications.
1. The experimental evaluation is limited to a comparison with Mind2Web. It would be beneficial to include comparisons with additional benchmarks, evaluating a wider range of models to yield deeper insights. 2. The paper lacks a detailed breakdown of the sample categories. Providing statistical information on the task categories would help demonstrate the scope and coverage of the benchmark. 3. The benchmark currently offers a relatively small set of tasks. Expanding the sample size in
+ This paper is mostly well-written and easy to follow. + The paper is technically sound with most claims supported sufficiently by experimental results. + The proposed evaluation metrics and datasets seem novel.
- The problem formulation is incomplete in Section 2. The authors should bring some contents in Section E.1 back to the main paper. Additionally, the final objective function is missing in Section 2 as well. - It is a bit odd that “include match” and “semantic match” share the same evaluation targets for step score. Not sure if it is better to introduce additional aspects to distinguish them. - Some parts of the presentation could be improved, e.g., in Line 136, the notation of action history
Taxonomy
Topics: Peer-to-Peer Network Technologies · Mobile Agent-Based Network Management · Multi-Agent Systems and Negotiation
