Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web
Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, Izzeddin Gur

TL;DR
This paper introduces CompWoB, a benchmark for evaluating language model agents on compositional web tasks. It reveals large performance gaps between base and compositional tasks and proposes a new model, HTML-T5++, that surpasses human-level performance on MiniWoB.
Contribution
The paper presents CompWoB, a challenging compositional web automation benchmark, and introduces HTML-T5++, a model that outperforms humans on MiniWoB and shows improved generalization.
Findings
Prompted LMAs (gpt-3.5-turbo or gpt-4) drop from a 94.0% average success rate on base tasks to 24.9% on compositional tasks.
Transferred LMAs (finetuned only on base tasks) drop from 85.4% to 54.8%.
HTML-T5++ achieves 95.2% on MiniWoB and 61.5% zero-shot on CompWoB.
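To make the size of these gaps easier to compare, the following minimal sketch computes the absolute drop (in percentage points) and the relative drop for each agent family. The success rates are the ones quoted above; the names `RESULTS` and `generalization_gap` are illustrative only and do not come from the paper, and treating the MiniWoB score of HTML-T5++ as its base-task score is an assumption made for this comparison.

```python
# Success rates (%) quoted in the findings: (base tasks, compositional tasks).
# For HTML-T5++ the base figure is its MiniWoB score (assumed comparable here).
RESULTS = {
    "prompted LMAs": (94.0, 24.9),
    "transferred LMAs": (85.4, 54.8),
    "HTML-T5++": (95.2, 61.5),
}


def generalization_gap(base: float, comp: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in % of the base rate)."""
    absolute = round(base - comp, 1)
    relative = round(100.0 * (base - comp) / base, 1)
    return absolute, relative


if __name__ == "__main__":
    for name, (base, comp) in RESULTS.items():
        absolute, relative = generalization_gap(base, comp)
        print(f"{name}: -{absolute} points ({relative}% relative drop)")
```

Under these numbers, prompted LMAs lose roughly three quarters of their base success rate, while the finetuned models lose roughly a third.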
Abstract
Language model agents (LMAs) recently emerged as a promising paradigm on multi-step decision making tasks, often outperforming humans and other reinforcement learning agents. Despite this promise, their performance on real-world applications that often involve combinations of tasks is still underexplored. In this work, we introduce a new benchmark, called CompWoB -- 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Multimodal Machine Learning Applications
Methods: Balanced Selection
