WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

TL;DR
WebArena provides a realistic, web-based environment for developing and benchmarking autonomous agents performing complex, real-world internet tasks, revealing current AI limitations and guiding future improvements.
Contribution
The paper introduces WebArena, a highly realistic web environment with benchmark tasks for evaluating autonomous agents' ability to perform internet-based tasks.
Findings
GPT-4-based agents achieve a task success rate of only 14.41%
Humans achieve a 78.24% success rate, far outperforming the agents
WebArena exposes challenges in current AI capabilities for web tasks
Abstract
With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task…
Peer Reviews
Decision: ICLR 2024 poster
**Originality**: I am truly delighted to have the opportunity to review this work. I had the privilege of reading this manuscript a few months ago, and its significance resonated with me. The issues addressed in this paper are both critical and captivating. The work notably bridges a substantial gap, laying a pivotal foundation for future industrial applications of web agents. Over the last six months, I've come across numerous works on agent benchmarks. However, this particular study stands out
While the work presented is undeniably valuable, from an academic perspective, I believe there are several weaknesses, primarily related to experimental evaluations and the choice of baselines. Here are the specific areas of concern: 1. **Lack of Evaluation with the Latest Intelligent Agents**: The paper seems to miss out on evaluating some of the latest intelligent agents, especially those grounded in modern reasoning and planning methods. Works like the "Tree of Thought" and the new "Reflec
The authors propose an independent platform implementing a large variety of realistic end-user tasks on the Web. The framework provides realistic, challenging tasks for Web agents. The quality of the benchmark is sufficiently high. To this end, a good choice of task variety was made, which is backed up by a user study. This is very nice to see, as the design decisions taken then probably match user needs. The paper includes a preliminary evaluation of agents based on
The advantage over related work is not completely clear. The related work section states functional correctness as an advantage over AndroidEnv, but no further explanation is given. It might hint at the difference between the evaluation metrics used, but it would be interesting/important to clarify this. Also, it mentions the lack of diverse or complex task availability, but new tasks can be defined within the framework. The agent evaluation is performed with standard GPT variants only, not pointing to stronge
1. This paper proposes a highly-realistic and complicated web environment compared with the previous simplified environment; 2. The proposed environment includes four common and real domains; 3. The paper is well written and easy to follow.
1. The major weakness of this paper is the lack of technical novelty. Though contributions of simulated environments/datasets/resources are welcome and very important to the research community, such papers may not match the general style of ICLR papers. 2. For evaluation, the proposed framework uses GPT4 to evaluate the answer or the execution paths, which potentially has two issues: 1. GPT4 is a commercial tool, which may limit the potential use of this environment; 2. GPT4 is not guarantee
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Focus
