Large Language Models Can Self-Improve At Web Agent Tasks

Ajay Patel; Markus Hofmarcher; Claudiu Leoveanu-Condrei; Marius-Constantin Dinu; Chris Callison-Burch; Sepp Hochreiter

arXiv:2405.20309·cs.LG·October 3, 2024

TL;DR

This paper demonstrates that large language models can significantly improve their web navigation and task completion abilities through self-generated fine-tuning, achieving notable performance gains on the WebArena benchmark.

Contribution

It introduces a self-improvement method for LLM-based web agents, reports a 31% improvement in task completion rate on the WebArena benchmark, and proposes new evaluation metrics for assessing agent trajectories in more detail.

Findings

01

31% improvement in task completion rate

02

Effective self-fine-tuning on synthetic data

03

Novel metrics for evaluating agent trajectories

Abstract

Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

- Well written.

Weaknesses

- The innovation of the work is very limited.
- The experiments are insufficient and not comprehensive. The paper should include an analysis across different datasets and careful reasoning about what is happening, for example complete ablation studies against other possibilities.
- Lack of comparison with baselines. There are many works on self-improvement and self-correction; these should be considered as baselines.
- This work needs a proper and further study and analysi

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. **Capabilities through Self-Improvement:** The paper demonstrates how LLMs can extend their capabilities through self-improvement techniques, particularly in the context of complex, long-horizon web agent tasks. This ability to acquire new capabilities while largely retaining existing ones is notable.
2. **Appropriate Eval Metrics:** The introduction of novel metrics and scores to evaluate the quality of trajectories adds depth to the evaluation process. These metrics provide a nuan

Weaknesses

Good to address:

1. **Hyperparameter Selection:** The paper lacks a clear justification for the choice of hyperparameters in the synthetic data generation process and in generating new objectives, e.g. the use of 4 or 2 few-shot samples, the temperature value, and perhaps just one sentence on why the 0.7 cosine similarity value used is better. Any missing details may lead readers to question the replicability and robustness of the results. A more thorough analysis or rationale for these choices would strengthen the p
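To make the cosine similarity hyperparameter mentioned in this review concrete, here is a minimal sketch of how such a cutoff could be used to filter generated objectives. It is an illustration under stated assumptions, not the paper's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the function names are hypothetical, and the choice to discard candidates at or above the 0.7 value (rather than below it) is assumed.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. ASSUMPTION: stands in for a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_novel_objectives(candidates, existing, threshold=0.7):
    """Keep candidates whose similarity to every seen objective is below threshold.

    ASSUMPTION: 0.7 is treated as an upper bound on allowed similarity.
    """
    seen = [embed(o) for o in existing]
    kept = []
    for cand in candidates:
        vec = embed(cand)
        if all(cosine_similarity(vec, s) < threshold for s in seen):
            kept.append(cand)
            seen.append(vec)  # later candidates are also compared against this one
    return kept


if __name__ == "__main__":
    existing = ["find the cheapest laptop in the store"]
    candidates = [
        "find the cheapest laptop in the shop",
        "post a comment on the top forum thread",
        "post a comment on the top forum thread today",
    ]
    print(filter_novel_objectives(candidates, existing))
```

A lower cutoff keeps fewer, more diverse objectives, while a higher one admits near-duplicates, which is the kind of trade-off the reviewer asks the authors to justify.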

Reviewer 03 · Rating 3 · Confidence 5

Strengths

- The paper shows that you can use in-domain trajectories as few-shot examples to generate novel out-of-domain trajectories, which will be useful in producing synthetic data for fine-tuning agents.
- The paper proposes a new metric $\text{VERTEX}_{\text{DTW}}$ which measures the similarity of an agent’s trajectory w.r.t. a reference (in this paper, the reference is a GPT-4 trajectory). This can be useful for providing a soft metric that has fewer false negatives than the WebArena functional correc
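The "DTW" subscript in the metric's name suggests dynamic time warping, an alignment technique for sequences of different lengths. The sketch below shows plain DTW over two action sequences with a 0/1 mismatch cost. It is not the paper's $\text{VERTEX}_{\text{DTW}}$ score: the function names, the action labels, and the local cost are all assumptions chosen for illustration.

```python
def dtw_distance(seq_a, seq_b, cost):
    """Classic dynamic time warping distance between two sequences.

    dp[i][j] holds the cheapest alignment cost of seq_a[:i] with seq_b[:j].
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            # Extend the best of: advance in seq_a, advance in seq_b, or both.
            dp[i][j] = step + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]


def mismatch(a, b):
    """ASSUMPTION: 0/1 local cost; a real metric could compare embeddings instead."""
    return 0.0 if a == b else 1.0


if __name__ == "__main__":
    # Hypothetical action labels, not taken from WebArena.
    reference = ["click", "type", "click", "stop"]
    agent_close = ["click", "type", "type", "click", "stop"]
    agent_far = ["click", "scroll", "stop"]
    print(dtw_distance(reference, agent_close, mismatch))
    print(dtw_distance(reference, agent_far, mismatch))
```

Because DTW lets one step align with several, a trajectory that repeats an action is not penalized the way an exact-match comparison would penalize it, which is why a DTW-based score can act as a softer signal than a binary success check.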

Weaknesses

My primary concern with this paper is that its evaluation setup is not fair. The models are essentially *training on the test set*, and doing a pass@2 on the evaluation set (these results align with the ablation experiments in the paper, where training on Mixture C actually makes the agent worse). Consider the filtered subset of in-domain examples (mixture A). It has a very high accuracy rate (0.919), which is expected as tasks that errored out or the model failed to solve are filtered out. In t

Code & Models

Repositories

AjayP13/webdreamer · Official

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Natural Language Processing Techniques · Topic Modeling

Methods: Balanced Selection