Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The Wild
Deepak Akkil, Mowafak Allaham, Amal Raj, Tamer Abuelsaad, Ravi Kokku

TL;DR
This paper introduces Emergence WebVoyager, a standardized, transparent evaluation framework for web AI agents that improves reproducibility and reveals significant performance variability across tasks.
Contribution
The paper presents Emergence WebVoyager, an enhanced version of the WebVoyager benchmark with standardized evaluation guidelines that achieves high inter-annotator agreement and enables more rigorous comparison of web agents.
Findings
Emergence WebVoyager achieves 95.9% inter-annotator agreement.
Evaluation of OpenAI Operator shows a success rate of 68.6%.
Operator's measured success rate is substantially lower than previously reported, highlighting how much results vary with evaluation methodology.
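The two headline numbers above are both simple proportions over per-task judgments. As a minimal sketch of the arithmetic, assuming inter-annotator agreement is computed as plain percent agreement between two annotators (the summary does not say which statistic is used), and using hypothetical labels and function names that do not come from the paper:

```python
from typing import Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Share of tasks (in percent) on which two annotators gave the same label.

    Assumption: simple percent agreement, not a chance-corrected statistic.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same set of tasks")
    if not labels_a:
        raise ValueError("at least one labeled task is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


def success_rate(labels: Sequence[str]) -> float:
    """Share of tasks (in percent) judged successful."""
    if not labels:
        raise ValueError("at least one labeled task is required")
    return 100.0 * sum(label == "success" for label in labels) / len(labels)


# Hypothetical toy judgments for four tasks (not data from the paper).
annotator_1 = ["success", "failure", "success", "success"]
annotator_2 = ["success", "failure", "failure", "success"]

agreement = percent_agreement(annotator_1, annotator_2)  # 3 of 4 labels match
rate = success_rate(annotator_1)                         # 3 of 4 tasks succeed
```

Percent agreement is the easiest statistic to read, but it does not correct for agreement expected by chance; a chance-corrected measure such as Cohen's kappa would be a reasonable alternative if that is what the benchmark reports.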
Abstract
Reliable evaluation of AI agents operating in complex, real-world environments requires methodologies that are robust, transparent, and contextually aligned with the tasks agents are intended to perform. This study identifies persistent shortcomings in existing AI agent evaluation practices that are particularly acute in web agent evaluation, as exemplified by our audit of WebVoyager, including task-framing ambiguity and operational variability that hinder meaningful and reproducible performance comparisons. To address these challenges, we introduce Emergence WebVoyager, an enhanced version of the WebVoyager benchmark that standardizes evaluation methodology through clear guidelines for task instantiation, failure handling, annotation, and reporting. Emergence WebVoyager achieves an inter-annotator agreement of 95.9%, indicating improved clarity and reliability in both task formulation…
