WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents
Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, Christian Bizer

TL;DR
WebMall is an offline benchmark simulating multiple e-shops with heterogeneous data, designed to evaluate complex web agents on challenging comparison shopping tasks involving retrieval and checkout processes.
Contribution
The paper introduces WebMall, the first multi-shop offline benchmark for evaluating web agents on complex e-commerce tasks with heterogeneous product data.
Findings
The best agents achieved below 65% success on key tasks.
WebMall exposes the difficulty of multi-shop comparison shopping.
Validation with diverse agents demonstrates how challenging the benchmark is.
Abstract
LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the user's needs. Benchmarks for evaluating web agents require agents to perform tasks either online using the live Web or offline using simulated environments, the latter allowing for the exact reproduction of the experimental setup. While DeepShop and ShoppingComp provide online benchmarks that require agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, and Mind2Web cover only comparatively simple e-commerce tasks performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex retrieval tasks.…
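To make the comparison-shopping task from the abstract concrete, the following is a minimal sketch of its final step: given offers already collected from several shops, pick the cheapest one that meets the user's requirements. It is not taken from the WebMall code. The `Offer` type, the shop names, the product titles, and the simple keyword matching are all hypothetical and chosen for illustration only; a real agent would first have to navigate each shop and reconcile product descriptions that differ between shops.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    """Hypothetical record for one product listing found in one shop."""
    shop: str
    title: str
    price: float


def cheapest_matching_offer(
    offers: Iterable[Offer], required_terms: Iterable[str]
) -> Optional[Offer]:
    """Return the lowest-priced offer whose title contains every required term.

    Matching is case-insensitive substring matching, a deliberate
    simplification. Returns None when no offer matches.
    """
    terms = [term.lower() for term in required_terms]
    matches = [
        offer for offer in offers
        if all(term in offer.title.lower() for term in terms)
    ]
    return min(matches, key=lambda offer: offer.price, default=None)


# Illustrative data: the same product is described differently in each shop.
offers = [
    Offer("shop_a", "Acme USB-C Cable 2m", 9.99),
    Offer("shop_b", "ACME usb-c cable, 2m, black", 8.49),
    Offer("shop_c", "Acme HDMI Cable 2m", 5.00),
]

best = cheapest_matching_offer(offers, ["acme", "usb-c", "2m"])
```

Here `best` is the `shop_b` offer: the `shop_c` listing is cheaper but is a different product, which is the kind of mismatch a comparison-shopping agent has to rule out before checking out.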
