Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur

TL;DR
This paper introduces WebGUM, a multimodal web navigation agent trained offline with instruction-finetuned foundation models, achieving state-of-the-art results on multiple benchmarks and demonstrating strong generalization and transfer capabilities.
Contribution
The paper presents WebGUM, a novel offline-trained, instruction-finetuned multimodal web navigation model that outperforms prior methods and large language models on key benchmarks.
Findings
WebGUM improves on the previous best offline methods by more than 45.8% on MiniWoB.
WebGUM surpasses online state-of-the-art and human performance on MiniWoB.
On WebShop, WebGUM achieves better results than PaLM-540B.
Abstract
The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works…
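The observation and action interface described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' code: the `Observation` and `Action` classes, the `parse_action` helper, and the `click id=...` / `type id=... text=...` string format are all hypothetical names and conventions chosen here to show how an agent that reads a screenshot plus HTML might emit text that is decoded into click and type actions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical container for what the agent sees at one step."""
    instruction: str   # natural-language task description
    html: str          # HTML of the current page
    screenshot: bytes  # raw image bytes of the page screenshot


@dataclass
class Action:
    """Hypothetical decoded web navigation action."""
    kind: str                   # "click" or "type"
    element_id: str             # reference to the target element
    text: Optional[str] = None  # text to enter, only for "type"


# Assumed text format: "click id=<ref>" or "type id=<ref> text=<string>".
_ACTION_RE = re.compile(r"^(click|type)\s+id=(\S+)(?:\s+text=(.*))?$")


def parse_action(output: str) -> Action:
    """Decode a model output string into an Action, or raise ValueError."""
    match = _ACTION_RE.match(output.strip())
    if match is None:
        raise ValueError(f"unrecognized action: {output!r}")
    kind, element_id, text = match.groups()
    if kind == "type" and text is None:
        raise ValueError("a type action needs a text argument")
    if kind == "click" and text is not None:
        raise ValueError("a click action takes no text argument")
    return Action(kind=kind, element_id=element_id, text=text)
```

For example, `parse_action("type id=3 text=hello world")` yields a type action on element `3` with the text `hello world`, while a string that matches neither pattern raises `ValueError`, so malformed model outputs are rejected instead of being executed.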
Peer Reviews
Decision: ICLR 2024 poster
1. The performance of their model is good. It outperforms previous methods under different settings. 2. Writing is clear. 3. They performed detailed analysis such as dataset and model size scaling.
1. The technical contribution is very limited. The takeaway is to do supervised training on a large-scale model-generated dataset. It feels like knowledge distillation of a combination of model outputs (as described in 4.3, they used various LLMs to generate such data). 2. The dataset creation process and quality are not clear.
Novelty: The proposed WebGUM agent exhibits a novel combination of HTML and image modalities to tackle the challenges in web navigation. Performance: The empirical results are compelling, with the model showing substantial improvements on the MiniWoB and WebShop benchmarks. Resource Contribution: The authors have collected and made available a significant corpus of high-quality demonstrations, which is 38 times larger than previous datasets. This contribution is likely to be valuable for the b
- Generalization: It’s not clear how well the proposed method generalizes to a broader range of web navigation tasks outside the tested benchmarks, especially to HTML pages that are longer than the context length. More discussion or evaluation of generalizability could strengthen the paper. "Because the context length is insufficient for raw HTML, we preprocess context HTML by extracting a snippet that includes the answers in advance."
- Figures 1 and 5 are blurry.
1. This paper is well written and easy to follow. 2. The proposed method achieved amazing performance on MiniWoB.
1. The proposed method is neither impressive nor novel. Given that MLLMs can already complete various difficult tasks such as visual reasoning or science QA [1,3], achieving the highest score on MiniWoB does not seem very surprising. 2. Since there is a collection of [open-sourced MLLMs](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models), which also combine a ViT-like vision encoder and an LLM, similar to the proposed WebGUM. The reviewer does not feel like t
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer
