AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations

Gaurav Verma; Rachneet Kaur; Nishan Srishankar; Zhen Zeng; Tucker Balch; Manuela Veloso

arXiv:2411.13451 · cs.AI · November 21, 2024

TL;DR

AdaptAgent introduces a framework enabling multimodal web agents to adapt to new websites and domains using only a few human demonstrations, significantly improving task success rates beyond traditional large-scale training methods.

Contribution

The paper presents AdaptAgent, a novel approach for few-shot adaptation of multimodal web agents using human demonstrations, enhancing generalization to unseen websites and domains.

Findings

01. Task success rate increased by up to 7.21% with adaptation.

02. Multimodal demonstrations outperform text-only demonstrations.

03. Number of few-shot examples influences success rate.

Abstract

State-of-the-art multimodal web agents, powered by Multimodal Large Language Models (MLLMs), can autonomously execute many web tasks by processing user instructions and interacting with graphical user interfaces (GUIs). Current strategies for building web agents rely on (i) the generalizability of underlying MLLMs and their steerability via prompting, and (ii) large-scale fine-tuning of MLLMs on web-related tasks. However, web agents still struggle to automate tasks on unseen websites and domains, limiting their applicability to enterprise-specific and proprietary platforms. Beyond generalization from large-scale pre-training and fine-tuning, we propose building agents for few-shot adaptability using human demonstrations. We introduce the AdaptAgent framework that enables both proprietary and open-weights multimodal web agents to adapt to new websites and domains using few human…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications