WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

Zihao Sun; Ling Chen

arXiv:2507.00938 · cs.IR · August 14, 2025

TL;DR

WebArXiv is a static, reproducible benchmark of 275 arXiv-based tasks for evaluating multimodal web agents. It addresses the instability of existing evaluations, and the paper also proposes a dynamic reflection mechanism to improve agent performance.

Contribution

The paper presents WebArXiv, a novel static benchmark for web agents, and introduces a dynamic reflection method to enhance agent decision-making and evaluation reliability.

Findings

01. WebArXiv provides consistent, reproducible evaluation of web agents.

02. Agents show varied performance on WebArXiv, highlighting the benchmark's discriminative power.

03. The dynamic reflection mechanism improves agent decision accuracy.

Abstract

Recent progress in large language models (LLMs) has enabled the development of autonomous web agents capable of navigating and interacting with real websites. However, evaluating such agents remains challenging due to the instability and inconsistency of existing benchmarks, which often rely on dynamic content or oversimplified simulations. In this work, we introduce WebArXiv, a static and time-invariant benchmark comprising 275 web-based tasks grounded in the arXiv platform. WebArXiv ensures reproducible and reliable evaluation by anchoring tasks in fixed web snapshots with deterministic ground truths and standardized action trajectories. Through behavioral analysis, we identify a common failure mode, Rigid History Reflection, where agents over-rely on fixed interaction histories. To address this, we propose a lightweight dynamic reflection mechanism that allows agents to selectively…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Web Data Mining and Analysis