WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu

TL;DR
WebVoyager is a multimodal web agent, powered by a large multimodal model, that interacts with real websites to complete user tasks. The work also establishes a new benchmark and an automatic evaluation protocol for real-world web tasks.
Contribution
Introduces WebVoyager, a web agent based on a large multimodal model and capable of end-to-end interaction with real websites, together with a new benchmark and an automatic evaluation protocol.
Findings
WebVoyager achieves 59.1% task success rate on the new benchmark.
The automatic evaluation protocol has 85.3% agreement with human judgment.
WebVoyager significantly outperforms GPT-4 (All Tools) and text-only setups.
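The two headline numbers above are both simple proportions. As a minimal sketch, assuming "task success rate" means the fraction of tasks judged successful and "agreement" means the fraction of tasks on which the automatic and human verdicts match, they could be computed as follows. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks judged successful."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def agreement_rate(auto: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of tasks where the automatic and human verdicts match."""
    if len(auto) != len(human):
        raise ValueError("label sequences must have the same length")
    if not auto:
        raise ValueError("label sequences must be non-empty")
    matches = sum(a == h for a, h in zip(auto, human))
    return matches / len(auto)


# Toy verdicts, made up for illustration only (not data from the paper).
human_labels = [True, False, True, True, False]
auto_labels = [True, False, False, True, False]

print(success_rate(human_labels))                 # 0.6
print(agreement_rate(auto_labels, human_labels))  # 0.8
```

Raw agreement does not correct for matches that occur by chance, so a chance-corrected statistic may be worth reporting alongside it.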
Abstract
The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on…
Code & Models
- 🤗 ByteDance-Seed/UI-TARS-1.5-7B · model · 142k dl · ♡ 533
- 🤗 Mungert/UI-TARS-1.5-7B-GGUF · model · 1.7k dl · ♡ 13
- 🤗 Hcompany/Holo1-3B · model · 330 dl · ♡ 82
- 🤗 Hcompany/Holo1-7B · model · 519 dl · ♡ 224
- 🤗 Mungert/Holo1-7B-GGUF · model · 314 dl
- 🤗 Mungert/Holo1-3B-GGUF · model · 244 dl · ♡ 1
- 🤗 what2up/UI-TARS-1.5-7B · model · 17 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Residual Connection · Dropout · Byte Pair Encoding · Adam · Label Smoothing · Linear Layer · Multi-Head Attention · Softmax · Dense Connections
