A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

Xianren Zhang; Shreyas Prasad; Di Wang; Qiuhai Zeng; Suhang Wang; Wenbo Yan; Mat Hans

arXiv:2508.15832·cs.CL·April 22, 2026


TL;DR

This paper introduces Amazon-Bench, a comprehensive benchmark for evaluating web agents in e-commerce that focuses on functionality coverage and safety. Its evaluation reveals current agents' limitations in complex tasks and risk management.

Contribution

The paper presents Amazon-Bench, a new benchmark with a data generation pipeline and evaluation framework for assessing functionality and safety of web agents in e-commerce.

Findings

01. Current agents struggle with complex, multi-step queries.

02. Existing agents pose safety risks, such as incorrect purchases or unintended account changes.

03. Evaluation reveals the need for more robust and reliable web agents.

Abstract

Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., "Find an Apple Watch"), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user's account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark…
