WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp

TL;DR
WebArbiter introduces a principle-guided, reasoning-based reward model for web agents that generates structured justifications, improving task success and robustness over existing methods.
Contribution
The paper presents WebArbiter, a WebPRM that casts reward modeling as text generation, producing both the reasoning and the verdict as text. It also contributes a two-stage training pipeline and a new evaluation benchmark, WebPRMBench.
Findings
WebArbiter-7B outperforms GPT-5 by 9.1 points on WebPRMBench.
WebArbiter surpasses prior WebPRMs by up to 6.4 points in trajectory search.
The model demonstrates improved robustness and interpretability in complex web tasks.
Abstract
Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of web Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing…
