TL;DR
This paper introduces the Universal Verifier, a robust system for verifying web task trajectories that aligns well with human judgment, reduces false positives, and improves reliability over previous baselines.
Contribution
The paper presents a set of design principles for building effective verifiers and introduces the Universal Verifier system, validated on a new benchmark with open-source code.
Findings
Universal Verifier agrees with humans as often as humans agree with each other.
False positive rates are reduced to near zero, compared with baselines.
An auto-research agent reaches 70% of expert quality in 5% of the time.
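The first two findings rest on two measurable quantities: agreement between raters and the false positive rate. The summary does not say which agreement statistic the paper uses, so the sketch below assumes Cohen's kappa as one common choice. The function names and toy labels are illustrative and not taken from the paper.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences (assumed metric)."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def false_positive_rate(predicted, truth):
    """Share of truly failed trajectories that the verifier marked as successes."""
    on_negatives = [p for p, t in zip(predicted, truth) if not t]
    if not on_negatives:
        return 0.0
    return sum(on_negatives) / len(on_negatives)


# Toy labels (1 = task succeeded, 0 = task failed); not data from the paper.
human_a = [1, 1, 0, 0]
human_b = [1, 0, 0, 0]
verifier = [1, 1, 0, 0]

human_human = cohen_kappa(human_a, human_b)
verifier_human = cohen_kappa(verifier, human_b)
fpr = false_positive_rate(verifier, human_b)
```

Under this reading, "agrees with humans as often as humans agree with each other" would mean that verifier_human is comparable to human_human on the same set of trajectories.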
Abstract
Verifying the success of computer use agent (CUA) trajectories is a critical challenge: without reliable verification, neither evaluation nor training signal can be trusted. In this paper, we present lessons learned from building a best-in-class verifier for web tasks we call the Universal Verifier. We design the Universal Verifier around four key principles: 1) constructing rubrics with meaningful, non-overlapping criteria to reduce noise; 2) separating process and outcome rewards that yield complementary signals, capturing cases where an agent follows the right steps but gets blocked or succeeds through an unexpected path; 3) distinguishing between controllable and uncontrollable failures scored via a cascading-error-free strategy for finer-grained failure understanding; and 4) a divide-and-conquer context management scheme that attends to all screenshots in a trajectory, improving…
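As a rough illustration of principles 2 to 4, the sketch below keeps the process score and the outcome verdict as separate fields, excludes failures outside the agent's control from the process score, and splits screenshots into chunks so that none is skipped. Every name here (Criterion, Verdict, process_score, chunk) and the scoring rule itself are assumptions made for illustration; the abstract does not specify the paper's actual scoring or context management scheme.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Criterion:
    """One rubric item (hypothetical structure)."""
    name: str
    passed: bool
    controllable: bool = True  # False if a failure was outside the agent's control


@dataclass
class Verdict:
    """Process and outcome are reported separately rather than merged."""
    process_score: float
    outcome_success: bool


def process_score(criteria: Sequence[Criterion]) -> float:
    """Fraction of criteria passed, ignoring uncontrollable failures (assumed rule)."""
    counted = [c for c in criteria if c.passed or c.controllable]
    if not counted:
        return 0.0
    return sum(c.passed for c in counted) / len(counted)


def chunk(items: Sequence, size: int) -> List[list]:
    """Split a trajectory's screenshots into consecutive chunks covering all of them."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


# Toy trajectory: the agent followed the right steps but was blocked at checkout.
criteria = [
    Criterion("opened the store page", True),
    Criterion("added the item to the cart", True),
    Criterion("completed checkout", False, controllable=False),
]
verdict = Verdict(process_score=process_score(criteria), outcome_success=False)
screenshot_chunks = chunk(range(5), 2)  # each chunk could be judged on its own, then merged
```

Keeping the two fields apart lets this toy case read as "good process, failed outcome", which matches the abstract's example of an agent that follows the right steps but gets blocked.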
