WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks
Zihao Sun, Ling Chen

TL;DR
WebArXiv is a static, reproducible benchmark for evaluating multimodal web agents on arXiv tasks. It addresses the instability of existing evaluations, and the paper also proposes a dynamic reflection mechanism to improve agent performance.
Contribution
The paper presents WebArXiv, a static benchmark that makes the evaluation of web agents reproducible and reliable, and introduces a dynamic reflection method that improves agent decision-making.
Findings
WebArXiv provides consistent, reproducible evaluation of web agents.
Agents show varied performance on WebArXiv, highlighting the benchmark's discriminative power.
The dynamic reflection mechanism improves agent decision accuracy.
Abstract
Recent progress in large language models (LLMs) has enabled the development of autonomous web agents capable of navigating and interacting with real websites. However, evaluating such agents remains challenging due to the instability and inconsistency of existing benchmarks, which often rely on dynamic content or oversimplified simulations. In this work, we introduce WebArXiv, a static and time-invariant benchmark comprising 275 web-based tasks grounded in the arXiv platform. WebArXiv ensures reproducible and reliable evaluation by anchoring tasks in fixed web snapshots with deterministic ground truths and standardized action trajectories. Through behavioral analysis, we identify a common failure mode, Rigid History Reflection, where agents over-rely on fixed interaction histories. To address this, we propose a lightweight dynamic reflection mechanism that allows agents to selectively…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Web Data Mining and Analysis
