TL;DR
ShoppingBench is an end-to-end shopping benchmark that pairs complex, grounded user intents with a large-scale simulated shopping environment. Evaluating language agents on it reveals significant challenges, and the authors propose training methods to improve agent performance.
Contribution
We present ShoppingBench, a scalable, intent-grounded shopping benchmark with a large simulated environment and novel training strategies for improving LLM-based shopping agents.
Findings
Even state-of-the-art language agents such as GPT-4.1 achieve absolute success rates under 50% on the benchmark tasks.
Our training methods enable smaller agents to perform competitively with GPT-4.1.
ShoppingBench highlights the complexity of real-world shopping tasks for LLMs.
Abstract
Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding sellers that offer multiple products. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our…
