Go-Browse: Training Web Agents with Structured Exploration

Apurva Gandhi; Graham Neubig

arXiv:2506.03533 · cs.CL · March 4, 2026


TL;DR

Go-Browse introduces a structured exploration method for web agents, enabling efficient data collection and improving task success rates on web navigation benchmarks with a smaller language model.

Contribution

The paper presents a novel graph search-based exploration technique for web agents and demonstrates its effectiveness on the WebArena benchmark.

Findings

01

Collected 10K successful task-solving trajectories and 40K interaction steps across 100 URLs on WebArena.

02

Fine-tuned a 7B-parameter model on this dataset, reaching a 21.7% success rate on the WebArena benchmark.

03

Outperformed GPT-4o mini by 2.4% on WebArena, as well as other sub-10B models in web navigation tasks.

Abstract

One of the fundamental problems in digital agents is their lack of understanding of their environment. For instance, a web browsing agent may get lost in unfamiliar websites, uncertain what pages must be visited to achieve its goals. To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments. Go-Browse achieves efficient exploration by framing data collection as a graph search, enabling reuse of information across exploration episodes. We instantiate our method on the WebArena benchmark, collecting a dataset of 10K successful task-solving trajectories and 40K interaction steps across 100 URLs. Fine-tuning a 7B parameter language model on this dataset achieves a success rate of 21.7% on the WebArena benchmark, beating GPT-4o mini by 2.4% and exceeding current…
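The graph-search framing in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy site, `explore`, `get_links`, and `collect_episodes` are hypothetical names introduced here, and plain breadth-first search stands in for whatever search order the paper actually uses. The point it shows is the reuse of information across episodes: pages discovered while exploring one URL are added to a shared frontier, so later episodes start from them instead of rediscovering them.

```python
from collections import deque

# Hypothetical toy site: each URL maps to the URLs reachable from it.
TOY_SITE = {
    "/": ["/shop", "/forum"],
    "/shop": ["/shop/item-1", "/"],
    "/forum": ["/forum/post-1", "/shop"],
    "/shop/item-1": [],
    "/forum/post-1": ["/"],
}


def explore(start_url, get_links, collect_episodes, max_pages=100):
    """Breadth-first exploration of a URL graph with a shared frontier.

    get_links(url) returns the URLs reachable from a page, and
    collect_episodes(url) returns whatever data is gathered there.
    Both are assumed callables, not part of any real library.
    """
    frontier = deque([start_url])   # discovered but not yet explored
    discovered = {start_url}        # everything seen so far
    dataset = {}                    # url -> data collected at that page
    while frontier and len(dataset) < max_pages:
        url = frontier.popleft()
        dataset[url] = collect_episodes(url)
        for nxt in get_links(url):
            # Each page enters the frontier once, so no episode
            # spends effort rediscovering it.
            if nxt not in discovered:
                discovered.add(nxt)
                frontier.append(nxt)
    return dataset


def toy_collect(url):
    # Stand-in for running task-proposal and task-solving episodes on a page.
    return [f"trajectory starting at {url}"]


visited = explore("/", lambda u: TOY_SITE.get(u, []), toy_collect)
print(list(visited))
```

In this sketch `max_pages` plays the role of an exploration budget; the abstract reports data collected across 100 URLs, but how the authors bound or order exploration is not stated in this summary.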

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The proposed method is well motivated, and the adaptation of Go-Explore to exploring from a frontier of discovered webpages is intuitive and novel.
- The proposed method shows clear effectiveness over a comparable state-of-the-art method.
- The experiments and analyses are thorough, and all details as well as prompts are provided, enhancing reproducibility.
- The paper is well-written and easy to follow.

Weaknesses

- The method leverages claude-3.7-sonnet for trajectory gathering, and it is unclear whether this may be a significant advantage of the proposed approach over NnetNav.
- I'm not sure I understand the purpose of the experiments on Online Mind2Web, as the results seem to be evaluating WebArena-trained models on Online Mind2Web. However, my understanding is that the proposed method is more effective at exploring a given environment such as the websites in WebArena, while Online Mind2Web consists of

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. Treating websites as a URL graph with a maintained frontier reduces redundant exploration across episodes and helps reach deeper states that matter for task completion. The outer loop/frontier mechanism is well-motivated and is validated by broader site coverage and deeper success trajectories.
2. The dataset of ~10K trajectories is a valuable resource to the community to train web agents.

Weaknesses

1. The evaluation results on Online-M2W are weak. While the authors say it is due to a different domain, it does not help sell their synthetic trajectory generation approach. The primary purpose of synthetic data generation for web agents is to improve their performance on real-world websites in the wild.
2. The proposed pipeline may not work as well on real-world websites, as these are dynamic and the graph can change during the course of exploration.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Motivations: The work and proposed approach are reasonably motivated. As described by the authors, exploration in web environments is one of the challenges for web navigation data collection. Especially, the shortcomings of interaction-first and instruction-first approaches mentioned by the authors could be a bottleneck for scalable trajectory data collection on the web.
2. Presentation: The manuscript provides comprehensive information. It contains figures and pseudo-codes that make it ea

Weaknesses

1. Scalability of the proposed exploration approach: The authors use WebArena as the testbed for their exploration algorithm. However, the WebArena environment consists of concept/mockup websites and is limited in multiple aspects compared to real-world websites. Regardless of the number of unique web pages it provides, the structure of its websites and thus the possible patterns of navigation may not be diverse enough to test the scalability of the proposed approach. This is especially importa

Taxonomy

Topics: Web Data Mining and Analysis · Multimodal Machine Learning Applications · Topic Modeling