WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

Ralph Peeters; Aaron Steiner; Luca Schwarz; Julian Yuya Caspary; Christian Bizer

arXiv:2508.13024·cs.CL·May 1, 2026

TL;DR

WebMall is an offline benchmark that simulates multiple e-shops with heterogeneous product data. It is designed to evaluate web agents on challenging comparison-shopping tasks that involve product retrieval and checkout.

Contribution

It introduces WebMall, the first multi-shop offline benchmark for evaluating web agents on complex e-commerce tasks with heterogeneous product data.

Findings

1. The best agents achieved success rates below 65% on key tasks.

2. WebMall exposes the difficulty of multi-shop comparison shopping.

3. Validation with diverse agents demonstrates how challenging the benchmark is.

Abstract

LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and then ordering the cheapest products that meet the user's needs. Benchmarks for evaluating web agents require agents to perform tasks either online, on the live Web, or offline, in simulated environments; the latter allow the experimental setup to be reproduced exactly. While DeepShop and ShoppingComp provide online benchmarks that require agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, and Mind2Web cover only comparatively simple e-commerce tasks performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex retrieval tasks.…