Large Language Models Can Self-Improve At Web Agent Tasks
Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter

TL;DR
This paper demonstrates that large language models can significantly improve their web navigation and task completion abilities through self-generated fine-tuning, achieving notable performance gains on the WebArena benchmark.
Contribution
It introduces a self-improvement method for LLM-based web agents, showing a 31% performance boost and developing new evaluation metrics for detailed assessment.
Findings
31% improvement in task completion rate
Effective self-fine-tuning on synthetic data
Novel metrics for evaluating agent trajectories
Abstract
Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three…
Peer Reviews
Decision · Submitted to ICLR 2025
- well written.
- The innovation of the work is very limited!
- Lack of enough experiments! The experiments are not comprehensive! It should include a thorough analysis across different datasets and careful reasoning about what is going on there, for example complete ablation studies versus other possibilities.
- Lack of comparison with baselines! There are many works on self-improvement and self-correction; these should be considered as baselines.
- This work needs a proper and further study and analysi
### Strengths:
1. **Capabilities through Self-Improvement:** The paper demonstrates how LLMs can extend their capabilities through self-improvement techniques, particularly in the context of complex, long-horizon web agent tasks. This ability to acquire new capabilities while largely retaining existing ones is notable.
2. **Appropriate Eval Metrics:** The introduction of novel metrics and scores to evaluate the quality of trajectories adds depth to the evaluation process. These metrics provide a nuan
Good to address:
1. **Hyperparameter Selection:** The paper lacks a clear justification for the choice of hyperparameters in the synthetic data generation process and in generating new objectives, for example the use of 4 or 2 few-shot samples, the temperature value, and perhaps just one sentence on why a cosine similarity of 0.7 was chosen. Any missing details may lead readers to question the replicability and robustness of the results. A more thorough analysis or rationale for these choices would strengthen the p
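To make the reviewer's point about the 0.7 cosine similarity value concrete, here is a minimal sketch of how such a threshold might be used to drop near-duplicate generated objectives. The excerpt does not say how the paper applies the value, so the greedy deduplication scheme, the function names `cosine_similarity` and `filter_similar`, and the toy embedding vectors are all assumptions for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_similar(embeddings: List[Sequence[float]], threshold: float = 0.7) -> List[int]:
    """Greedily keep the indices of embeddings whose similarity to every
    already-kept embedding is below `threshold` (hypothetical scheme)."""
    kept: List[int] = []
    for idx, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Toy vectors standing in for objective embeddings (assumed, not from the paper).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
kept_indices = filter_similar(toy_embeddings, threshold=0.7)
```

With a filter of this shape, raising the threshold keeps more near-duplicates and lowering it prunes more aggressively, which is why a stated rationale for the chosen value would help replication.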
- The paper shows that you can use in-domain trajectories as few-shot examples to generate novel out-of-domain trajectories, which will be useful in producing synthetic data for fine-tuning agents.
- The paper proposes a new metric $\text{VERTEX}_{\text{DTW}}$ which measures the similarity of an agent’s trajectory w.r.t. a reference (in this paper, the reference is a GPT-4 trajectory). This can be useful for providing a soft metric that has fewer false negatives than the WebArena functional correc
My primary concern with this paper is that its evaluation setup is not fair. The models are essentially *training on the test set*, and doing a pass@2 on the evaluation set (these results align with the ablation experiments in the paper, where training on Mixture C actually makes the agent worse). Consider the filtered subset of in-domain examples (mixture A). It has a very high accuracy rate (0.919), which is expected as tasks that errored out or the model failed to solve are filtered out. In t
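For readers unfamiliar with the "DTW" in the $\text{VERTEX}_{\text{DTW}}$ metric mentioned in this review, the following is a minimal sketch of classic dynamic time warping between an agent trajectory and a reference trajectory. It is not the paper's metric: the excerpt does not give its definition, so the per-step cost `step_mismatch` (0 for identical steps, 1 otherwise) and the example action strings are assumptions chosen only to show how warping tolerates repeated or extra steps.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def dtw_distance(a: Sequence[T], b: Sequence[T], dist: Callable[[T, T], float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping cost between two sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal alignment cost of a[:i] with b[:j]
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def step_mismatch(x: str, y: str) -> float:
    """Hypothetical per-step cost: 0 if the two steps are identical, else 1."""
    return 0.0 if x == y else 1.0


# Hypothetical action strings, not taken from WebArena.
reference = ["goto /orders", "click #search", "type 'shoes'"]
agent_repeat = ["goto /orders", "click #search", "click #search", "type 'shoes'"]
agent_wrong = ["goto /orders", "click #cart", "type 'shoes'"]

d_repeat = dtw_distance(agent_repeat, reference, step_mismatch)
d_wrong = dtw_distance(agent_wrong, reference, step_mismatch)
```

In this toy setup a repeated click aligns to the same reference step at no cost, while a wrong click incurs a cost of one. That is the sense in which a DTW-based score can act as a softer signal than a binary functional-correctness check.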
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Natural Language Processing Techniques · Topic Modeling
Methods: Balanced Selection
