Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web

Hiroki Furuta; Yutaka Matsuo; Aleksandra Faust; Izzeddin Gur

arXiv:2311.18751·cs.LG·January 3, 2025·2 cites

Open Access · 1 Repo

TL;DR

This paper introduces CompWoB, a benchmark of 50 compositional web automation tasks for evaluating language model agents. It reveals large performance drops when base tasks are combined, and it proposes a new model, HTML-T5++, that surpasses human-level performance on MiniWoB.

Contribution

The paper presents CompWoB, a challenging compositional web automation benchmark, and introduces HTML-T5++, a model that outperforms humans on MiniWoB and shows improved generalization.

Findings

01

Prompted LMAs' average success rate drops from 94.0% on base tasks to 24.9% on compositional tasks.

02

Transferred LMAs (finetuned only on base tasks) drop from 85.4% to 54.8%, a smaller generalization gap.

03

HTML-T5++ achieves 95.2% on MiniWoB and 61.5% zero-shot on CompWoB.
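
As a rough check on the figures above, the generalization gap for each agent family can be computed directly from the reported success rates. The sketch below is illustrative arithmetic only, not code from the paper; the function name `generalization_gap` and the dictionary layout are assumptions introduced here.

```python
# Success rates (%) as reported in the findings: (base tasks, compositional tasks).
reported = {
    "prompted": (94.0, 24.9),
    "transferred": (85.4, 54.8),
}


def generalization_gap(base, compositional):
    """Return (absolute drop in percentage points, relative drop as a fraction of base)."""
    absolute = base - compositional
    relative = absolute / base
    return absolute, relative


gaps = {name: generalization_gap(*rates) for name, rates in reported.items()}

for name, (absolute, relative) in gaps.items():
    print(f"{name}: {absolute:.1f} points absolute, {relative:.1%} relative")
# prompted: 69.1 points absolute, 73.5% relative
# transferred: 30.6 points absolute, 35.8% relative
```

On these numbers, prompted agents lose about 69 points while transferred agents lose about 31, which is what "less generalization gap" refers to.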

Abstract

Language model agents (LMAs) have recently emerged as a promising paradigm for multi-step decision-making tasks, often outperforming humans and other reinforcement learning agents. Despite the promise, their performance on real-world applications that often involve combinations of tasks is still underexplored. In this work, we introduce a new benchmark, called CompWoB -- 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on…

Code & Models

Repositories

google-research/google-research
tf · Official

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Multimodal Machine Learning Applications

Methods: Balanced Selection