ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Jiangyuan Wang; Kejun Xiao; Qi Sun; Huaipeng Zhao; Tao Luo; Jian Dong Zhang; Xiaoyi Zeng

arXiv:2508.04266·cs.CL·December 11, 2025

TL;DR

ShoppingBench is an end-to-end shopping benchmark built around complex, real-world user intents and a large-scale simulated shopping environment. Evaluating language agents on it reveals significant challenges, and the authors propose training methods to improve agent performance.

Contribution

We present ShoppingBench, a scalable, intent-grounded shopping benchmark with a large simulated environment and novel training strategies for improving LLM-based shopping agents.

Findings

01 State-of-the-art agents achieve a success rate under 50% on the benchmark tasks.

02 Our training methods enable smaller agents to perform competitively with GPT-4.1.

03 ShoppingBench highlights the complexity of real-world shopping tasks for LLMs.

Abstract

Existing e-commerce benchmarks focus primarily on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding sellers that offer multiple products. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our…
