How to Train Your LLM Web Agent: A Statistical Diagnosis

Dheeraj Vattikonda; Santhoshi Ravichandran; Emiliano Penaloza; Hadi Nekoei; Megh Thakkar; Thibault Le Sellier de Chezelles; Nicolas Gontier; Miguel Muñoz-Mármol; Sahar Omidi Shayegan; Stefania Raimondo; Xue Liu; Alexandre Drouin; Laurent Charlin; Alexandre Piché; Alexandre Lacoste; Massimo Caccia

arXiv:2507.04103 · cs.AI · February 16, 2026



TL;DR

This paper presents a statistically grounded study of how to allocate compute when post-training LLM web agents, combining supervised fine-tuning (SFT) with on-policy reinforcement learning (RL) to improve performance while reducing compute cost.

Contribution

The paper introduces a hyperparameter sampling and bootstrapping approach for selecting training strategies, achieving better performance with less compute than traditional methods.
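As a rough illustration of how sampling and bootstrapping can be combined to judge a hyperparameter, the sketch below resamples a pool of finished runs with replacement, keeps the best score per hyperparameter value in each resample, and reports how often each value comes out on top. This is a minimal sketch under stated assumptions, not the paper's procedure: the function name `bootstrap_hparam_effect`, the `runs` record layout, the `lr` key, and the scores are all hypothetical and made up for the example.

```python
import random
from collections import defaultdict
from statistics import mean


def bootstrap_hparam_effect(runs, hparam, n_boot=1000, seed=0):
    """Bootstrap the effect of one hyperparameter on the best achievable score.

    runs:   list of dicts, each holding hyperparameter settings and a "score"
            (hypothetical record layout, assumed for this sketch).
    hparam: key of the hyperparameter to analyse.
    Returns {value: (mean_best_score, win_rate)}.
    """
    rng = random.Random(seed)
    best_scores = defaultdict(list)  # per value: best score in each resample
    wins = defaultdict(int)          # per value: number of resamples it won
    values = sorted({r[hparam] for r in runs})

    for _ in range(n_boot):
        # Resample the pool of runs with replacement.
        sample = rng.choices(runs, k=len(runs))
        # Best score per hyperparameter value, mimicking "tune everything else".
        best = {}
        for r in sample:
            v = r[hparam]
            best[v] = max(best.get(v, float("-inf")), r["score"])
        for v, s in best.items():
            best_scores[v].append(s)
        wins[max(best, key=best.get)] += 1

    return {
        v: (mean(best_scores[v]) if best_scores[v] else float("nan"),
            wins[v] / n_boot)
        for v in values
    }


# Synthetic, made-up scores purely for illustration.
runs = [
    {"lr": 1e-5, "score": 0.1}, {"lr": 1e-5, "score": 0.2}, {"lr": 1e-5, "score": 0.3},
    {"lr": 1e-4, "score": 0.6}, {"lr": 1e-4, "score": 0.7}, {"lr": 1e-4, "score": 0.8},
]
effect = bootstrap_hparam_effect(runs, "lr")
for value, (mean_best, win_rate) in effect.items():
    print(f"lr={value:g}: mean best score={mean_best:.3f}, win rate={win_rate:.3f}")
```

The win rate gives a sense of how reliably one setting beats another given the runs that happened to be sampled, which is cheaper than sweeping every combination exhaustively.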

Findings

01. Combining SFT with on-policy RL outperforms either approach alone.

02. The proposed method reduces compute by 45% while maintaining peak performance.

03. The combined approach closes the gap with closed-source models.

Abstract

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use…

