Autonomous Evaluation and Refinement of Digital Agents
Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr

TL;DR
This paper introduces domain-general automatic evaluators that enhance digital agents' performance in web navigation and device control by providing accurate, cost-effective evaluation and refinement methods validated across multiple benchmarks.
Contribution
It presents novel, domain-general evaluation models that improve agent performance through fine-tuning and inference guidance without extra supervision.
Findings
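Evaluators reached between 74.4% and 92.9% agreement with oracle evaluation metrics across several benchmarks
Improved state-of-the-art performance on WebArena by 29% without additional supervision
Achieved around 75% relative improvement in device control settings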
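The two kinds of numbers reported above, agreement with an oracle and relative improvement, can be made concrete with a short sketch. The function names `agreement_rate` and `relative_improvement` are hypothetical, and the sample values are made up for illustration; they are not results from the paper. The sketch assumes each episode receives a binary success verdict.

```python
from typing import Sequence


def agreement_rate(evaluator: Sequence[bool], oracle: Sequence[bool]) -> float:
    """Fraction of episodes where the evaluator's verdict matches the oracle's."""
    if len(evaluator) != len(oracle):
        raise ValueError("verdict lists must have the same length")
    if not evaluator:
        raise ValueError("need at least one episode")
    matches = sum(e == o for e, o in zip(evaluator, oracle))
    return matches / len(evaluator)


def relative_improvement(baseline: float, improved: float) -> float:
    """Improvement expressed as a fraction of the baseline success rate."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Illustrative (made-up) verdicts for four episodes; not data from the paper.
evaluator_verdicts = [True, False, True, True]
oracle_verdicts = [True, False, False, True]
agreement = agreement_rate(evaluator_verdicts, oracle_verdicts)

# Illustrative (made-up) success rates before and after refinement.
gain = relative_improvement(baseline=0.5, improved=0.75)

print(f"agreement: {agreement:.1%}, relative improvement: {gain:.1%}")
```

Note that a relative improvement is measured against the baseline rate, so it is not the same as the absolute gain in percentage points.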
Abstract
We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve around 75% relative improvement in device control settings.
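The abstract does not spell out how inference-time guidance works. One plausible form, shown here purely as an assumption, is a retry loop: the agent attempts the task, the evaluator judges the trajectory, and a failure verdict is fed back as a hint for the next attempt. The names `run_with_evaluator`, `toy_agent`, and `toy_evaluator` are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]
# Hypothetical interfaces: an agent maps (task, feedback) to a list of actions,
# and an evaluator maps (task, trajectory) to a success verdict.
Agent = Callable[[str, str], Trajectory]
Evaluator = Callable[[str, Trajectory], bool]


def run_with_evaluator(
    agent: Agent, evaluator: Evaluator, task: str, max_attempts: int = 3
) -> Tuple[Trajectory, bool, int]:
    """Retry the task until the evaluator accepts a trajectory or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback = ""
    trajectory: Trajectory = []
    for attempt in range(1, max_attempts + 1):
        trajectory = agent(task, feedback)
        if evaluator(task, trajectory):
            return trajectory, True, attempt
        # No oracle label is used; the only signal is the evaluator's verdict.
        feedback = f"Attempt {attempt} was judged unsuccessful; try a different approach."
    return trajectory, False, max_attempts


def toy_agent(task: str, feedback: str) -> Trajectory:
    """Toy stand-in: forgets to submit unless it has received feedback."""
    actions = ["click:search_box", "type:query"]
    if feedback:
        actions.append("click:submit")
    return actions


def toy_evaluator(task: str, trajectory: Trajectory) -> bool:
    """Toy stand-in: counts a trajectory as successful if it ends with a submit."""
    return bool(trajectory) and trajectory[-1] == "click:submit"


final_trajectory, succeeded, attempts_used = run_with_evaluator(
    toy_agent, toy_evaluator, task="search for a product"
)
print(succeeded, attempts_used, final_trajectory)
```

In a real system the evaluator would be a learned model judging screenshots or page states, so its errors would propagate into the retry decision; the agreement figures above indicate how often that verdict can be trusted.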
Taxonomy
Topics: Recommender Systems and Techniques · Multi-Agent Systems and Negotiation · Mobile Crowdsensing and Crowdsourcing
