WebGames: Challenging General-Purpose Web-Browsing AI Agents
George Thomas, Alex J. Chan, Jikun Kang, Wenqi Wu, Filippos Christianos, Fraser Greenlee, Andy Toulis, Marvin Purtorab

TL;DR
WebGames is a new benchmark suite with over 50 challenges designed to evaluate and compare the web-browsing capabilities of AI agents, revealing significant performance gaps compared to humans.
Contribution
Introduces WebGames, a comprehensive, reproducible benchmark for assessing general-purpose web-browsing AI agents across diverse tasks and interactions.
Findings
The best AI system achieves only a 43.1% success rate, versus 95.7% for humans.
WebGames provides a standardized, reproducible platform for evaluating web-browsing AI.
Current AI models show fundamental limitations in handling common web interactions.
Abstract
We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance. Results reveal a substantial capability gap, with the best AI system achieving only 43.1% success rate compared to human performance of 95.7%, highlighting…
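As a rough illustration of how a success rate like the ones quoted above could be computed, the sketch below scores each task by comparing a submitted string against a known answer. The TaskResult type, the string-equality check, and the sample data are assumptions made for illustration only; they are not the benchmark's documented verification mechanism, and the sample outcomes are not results from the paper. Only the 43.1% and 95.7% figures come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one agent attempt at one challenge."""
    task_id: str
    submitted: str  # string the agent submitted at the end of the task
    expected: str   # ground-truth string for that task


def is_success(result: TaskResult) -> bool:
    # Assumed check: exact match after trimming surrounding whitespace.
    return result.submitted.strip() == result.expected.strip()


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks solved, from 0.0 to 100.0."""
    if not results:
        raise ValueError("at least one result is required")
    solved = sum(is_success(r) for r in results)
    return 100.0 * solved / len(results)


# Made-up outcomes for illustration only (not data from the paper).
results = [
    TaskResult("task-a", "alpha", "alpha"),
    TaskResult("task-b", "beta ", "beta"),
    TaskResult("task-c", "", "gamma"),
    TaskResult("task-d", "delta?", "delta"),
]
rate = success_rate(results)

# Gap between the two figures reported in the abstract, in percentage points.
reported_gap = round(95.7 - 43.1, 1)
```

With the two figures from the abstract, the human-to-AI gap works out to 52.6 percentage points.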
Taxonomy
Topics: Spam and Phishing Detection · Advanced Malware Detection Techniques · Web Data Mining and Analysis
