InSTA: Towards Internet-Scale Training For Agents

Brandon Trabucco; Gunnar Sigurdsson; Robinson Piramuthu; Ruslan Salakhutdinov

arXiv:2502.06776·cs.LG·May 23, 2025

TL;DR

This paper presents a scalable pipeline for training web navigation agents using large language models to annotate, execute, and filter tasks across the internet, reducing reliance on human data and achieving competitive performance.

Contribution

The authors introduce a novel internet-scale training pipeline that leverages LLMs for annotation, execution, and filtering, enabling efficient training of web agents without human supervision.

Findings

1. The top agent achieves a success rate of 56.9%.

2. LLM-based filtering achieves 97% accuracy in identifying harmful content.

3. Agents trained with this pipeline outperform larger models in web navigation tasks.

Abstract

The predominant approach for training web navigation agents is to gather human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data is an inefficient resource. We develop a pipeline to facilitate internet-scale training for agents without laborious human annotations. In the first stage, an LLM annotates 150k sites with agentic tasks. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM filters trajectories by judging their success. Language models are powerful data curation tools, identifying harmful content with an accuracy of 97%, judging successful trajectories with an accuracy of 82.6%, and producing effective data. We train agents based on Qwen 3 1.7B that are competitive with frontier LLMs as web agents, while being smaller and faster. Our top agent reaches a success rate of…
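The three stages in the abstract (annotate, execute, filter) can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `Trajectory` type, the `run_pipeline` function, the three callables, and the `threshold` value are all hypothetical names and assumptions introduced here, and the toy stand-ins replace what would be LLM calls and a real browser.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """Hypothetical record of one agent attempt at one task."""
    site: str
    task: str
    steps: List[str] = field(default_factory=list)


def run_pipeline(
    sites: List[str],
    propose_task: Callable[[str], Optional[str]],
    run_agent: Callable[[str, str], Trajectory],
    judge_success: Callable[[Trajectory], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[Trajectory]:
    """Annotate sites with tasks, execute them, and keep judged successes."""
    kept: List[Trajectory] = []
    for site in sites:
        # Stage 1: an annotator proposes a task; None means the site was
        # rejected (for example, flagged as harmful).
        task = propose_task(site)
        if task is None:
            continue
        # Stage 2: an agent attempts the task and returns a trajectory.
        trajectory = run_agent(site, task)
        # Stage 3: a judge scores the trajectory; low scores are dropped.
        if judge_success(trajectory) >= threshold:
            kept.append(trajectory)
    return kept


# Toy stand-ins for the three model calls (purely illustrative).
def toy_propose(site: str) -> Optional[str]:
    return None if "unsafe" in site else f"Find the contact page on {site}"


def toy_agent(site: str, task: str) -> Trajectory:
    return Trajectory(site, task, steps=[f"goto {site}", "click contact"])


def toy_judge(trajectory: Trajectory) -> float:
    return 0.9 if trajectory.site.endswith(".org") else 0.2


data = run_pipeline(
    ["example.org", "unsafe.example", "example.com"],
    toy_propose,
    toy_agent,
    toy_judge,
)
```

Passing the three stages in as callables keeps the loop independent of any particular model or browser backend, so each stage can be swapped or tested separately.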

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The pipeline's execution at the scale of 150,000 live websites is a non-trivial engineering and research achievement, far surpassing the ~200 sites in many existing benchmarks. The release of the insta-150k-v2 task dataset and the larger reasoning dataset is a valuable contribution to the community.

2. The authors prudently integrate safety considerations from the start, rather than treating it as an afterthought. The LLM-based safety filter is shown to be highly effective (up to 97% accuracy

Weaknesses

1. The feedback loop is proposed as a key design in Section 4.1. However, the paper explicitly states, "For this paper, we employ one loop of task generation". This is a major limitation, which indicates that the full promise of an iterative system where tasks get incrementally harder is not actually realized or evaluated.

2. The paper claims to generate "challenging" tasks. However, the analysis in Appendix F (Figure 14) shows the most solved tasks are simple information retrieval (e.g., "conta

Reviewer 02 · Rating 10 · Confidence 3

Strengths

1. This is a great paper, with really impressive results, showing that frontier agent performance can be achieved with just a few hundred dollars. I expect it will be of broad interest to agent researchers.

2. The task generation setup is clever: for each website, the proposed method first generates a simple task which gives an agent incentive to explore the website, and then generates more complex tasks conditioned on this trajectory.

3. The paper conducts two useful human validations: first

Weaknesses

I think the abstract/intro could have been a bit more clear that the headline result requires both the policy and judge models to be 235B models, but given that the final performance of the distilled 1.7B model surpasses the performance of the 235B model, I don’t consider this to be a major weakness.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The trajectory dataset is a valuable resource.

2. The performance improvement with scale is a positive indicator of dataset quality.

Weaknesses

1. Missing external baselines for Mind2Web, WebLINX, and WebVoyager. Also, it is not clear how the set of 500 diverse test tasks for Mind2Web was chosen.

2. Missing references: Explorer [1], a similar pipeline for web agent trajectory synthesis.

[1] Pahuja, Vardaan, et al. "Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents." ACL 2025.

Taxonomy

Topics: Multi-Agent Systems and Negotiation