EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
Zefang Liu, Yinzhu Quan

TL;DR
EconWebArena is a benchmark for testing autonomous agents on complex, multimodal economic tasks in realistic web environments, with an emphasis on fidelity to authoritative data sources.
Contribution
The paper introduces a benchmark of 360 curated tasks drawn from 82 authoritative websites, focused on economic reasoning and web interaction. Candidate tasks are generated by prompting multiple LLMs and then refined through human curation.
Findings
State-of-the-art multimodal LLMs show significant performance gaps.
Visual grounding and web navigation remain challenging for current models.
Ablation studies highlight the importance of reasoning and interaction design.
Abstract
We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as…
