Autonomous Evaluation and Refinement of Digital Agents

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr

arXiv:2404.06474 · cs.AI · October 8, 2024 · 3 cites

TL;DR

This paper introduces domain-general automatic evaluators that enhance digital agents' performance in web navigation and device control by providing accurate, cost-effective evaluation and refinement methods validated across multiple benchmarks.

Contribution

It presents novel, domain-general evaluation models that improve agent performance through fine-tuning and inference guidance without extra supervision.

Findings

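01. Evaluators reached between 74.4% and 92.9% agreement with oracle evaluation metrics across several benchmarks for digital agents.

02. Without additional supervision, evaluator-guided refinement improved state-of-the-art performance on WebArena by 29%.

03. In device control settings, the same approach achieved around 75% relative improvement.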
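To make the two kinds of numbers in these findings concrete, here is a minimal Python sketch of how an agreement rate and a relative improvement are typically computed. The function names and all example values are hypothetical and chosen for illustration; they are not taken from the paper or its repository.

```python
from typing import Sequence


def agreement_rate(evaluator_labels: Sequence[bool], oracle_labels: Sequence[bool]) -> float:
    """Fraction of trajectories where the evaluator's verdict matches the oracle's."""
    if len(evaluator_labels) != len(oracle_labels):
        raise ValueError("label sequences must have the same length")
    if not oracle_labels:
        raise ValueError("at least one label pair is required")
    matches = sum(e == o for e, o in zip(evaluator_labels, oracle_labels))
    return matches / len(oracle_labels)


def relative_improvement(baseline: float, improved: float) -> float:
    """Gain expressed as a fraction of the baseline, not as an absolute difference."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Illustrative values only (not results from the paper).
evaluator_verdicts = [True, True, False, True]
oracle_verdicts = [True, False, False, True]
print(agreement_rate(evaluator_verdicts, oracle_verdicts))  # 0.75

# A success rate moving from 20% to 35% is a 15-point absolute gain
# but a 75% relative gain.
print(relative_improvement(20.0, 35.0))  # 0.75
```

The distinction matters when reading the findings: a relative improvement is measured against the baseline's own success rate, so it can be much larger than the absolute change in percentage points.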

Abstract

We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve around 75% relative improvement in device control settings.
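The abstract names two ways an evaluator can improve an agent: inference-time guidance and fine-tuning. The sketch below shows one plausible shape for each, assuming an evaluator that returns a success verdict for a whole trajectory. Every name here (`Trajectory`, `run_with_evaluator`, `filter_for_finetuning`, the toy agent and evaluator) is hypothetical and for illustration only; this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Trajectory:
    actions: List[str]
    final_state: str


# Hypothetical interfaces: an agent maps (task, optional feedback) to a trajectory,
# and an evaluator maps (task, trajectory) to a success verdict.
Agent = Callable[[str, Optional[str]], Trajectory]
Evaluator = Callable[[str, Trajectory], bool]


def run_with_evaluator(
    task: str, agent: Agent, evaluator: Evaluator, max_attempts: int = 3
) -> Tuple[Trajectory, bool, int]:
    """Inference-time guidance: retry with feedback until the evaluator accepts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        trajectory = agent(task, feedback)
        if evaluator(task, trajectory):
            return trajectory, True, attempt
        feedback = f"Attempt {attempt} was judged unsuccessful; try a different approach."
    return trajectory, False, max_attempts


def filter_for_finetuning(
    rollouts: List[Tuple[str, Trajectory]], evaluator: Evaluator
) -> List[Tuple[str, Trajectory]]:
    """Fine-tuning data selection: keep only rollouts the evaluator judges successful."""
    return [(task, traj) for task, traj in rollouts if evaluator(task, traj)]


# Toy stand-ins so the sketch runs end to end.
def toy_agent(task: str, feedback: Optional[str]) -> Trajectory:
    # Succeeds only after receiving feedback once.
    state = "order placed" if feedback else "cart empty"
    return Trajectory(actions=["click", "type"], final_state=state)


def toy_evaluator(task: str, trajectory: Trajectory) -> bool:
    return trajectory.final_state == "order placed"


result, succeeded, attempts = run_with_evaluator("place an order", toy_agent, toy_evaluator)
print(succeeded, attempts)  # True 2
```

In both uses the evaluator stands in for a ground-truth success signal, which is why no additional supervision is needed; the quality of the refinement then depends on how closely the evaluator agrees with the oracle.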

Code & Models

Repositories

berkeley-nlp/agent-eval-refine (official)

Taxonomy

Topics: Recommender Systems and Techniques · Multi-Agent Systems and Negotiation · Mobile Crowdsensing and Crowdsourcing