EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Zefang Liu; Yinzhu Quan

arXiv:2506.08136·cs.CL·May 12, 2026

TL;DR

EconWebArena is a comprehensive benchmark for testing autonomous agents on complex economic tasks within realistic web environments, emphasizing multimodal reasoning and real-world data fidelity.

Contribution

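The paper introduces a benchmark of 360 tasks drawn from 82 authoritative websites, focused on economic reasoning and web interaction. Candidate tasks are generated by prompting multiple LLMs and are then curated by humans for clarity, feasibility, and source reliability.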

Findings

01. State-of-the-art multimodal LLMs show significant performance gaps.

02. Visual grounding and web navigation remain challenging for current models.

03. Ablation studies highlight the importance of reasoning and interaction design.

Abstract

We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as…
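The construction process described in the abstract (LLM-generated candidates followed by human curation for clarity, feasibility, and source reliability) can be illustrated with a minimal sketch. Everything below is hypothetical: the record fields, the `curate` function, the model names, the example URLs, and the duplicate-removal step are assumptions made for illustration and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateTask:
    """Hypothetical record for one LLM-proposed task (field names are assumptions)."""
    question: str
    source_url: str
    proposed_by: str          # which LLM generated the candidate
    is_clear: bool            # human reviewer judgment
    is_feasible: bool         # human reviewer judgment
    source_is_reliable: bool  # human reviewer judgment


def curate(candidates):
    """Keep candidates that pass all three human checks.

    A question that has already been kept is not kept a second time
    (this de-duplication step is an assumption, not stated in the abstract).
    """
    seen = set()
    kept = []
    for task in candidates:
        key = task.question.strip().lower()
        if key in seen:
            continue
        if task.is_clear and task.is_feasible and task.source_is_reliable:
            seen.add(key)
            kept.append(task)
    return kept


# Placeholder candidates; the questions and URLs are invented for the example.
pool = [
    CandidateTask("What was the unemployment rate in a given month?",
                  "https://example.org/labor", "model-a", True, True, True),
    CandidateTask("what was the unemployment rate in a given month?",
                  "https://example.org/labor", "model-b", True, True, True),
    CandidateTask("Find some GDP number.",
                  "https://example.org/gdp", "model-a", False, True, True),
    CandidateTask("What was the policy rate on a given date?",
                  "https://example.org/rates", "model-b", True, True, True),
]

curated = curate(pool)
for task in curated:
    print(task.proposed_by, "|", task.question)
```

In this sketch the second candidate is dropped as a repeat of the first, and the third is dropped because a reviewer marked it unclear, so two tasks survive curation.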

Code & Models

Datasets

EconWebArena/EconWebArena
dataset · 132 downloads
